WARPSPEED COMPUTATION OF OPTIMAL TRANSPORT, GRAPH DISTANCES, AND EMBEDDING ALIGNMENT

Abstract

Optimal transport (OT) is a cornerstone of many machine learning tasks. The current best practice for computing OT is via entropy regularization and Sinkhorn iterations. This algorithm runs in quadratic time and requires calculating the full pairwise cost matrix, which is prohibitively expensive for large sets of objects. To alleviate this limitation we propose to instead use a sparse approximation of the cost matrix based on locality sensitive hashing (LSH). Moreover, we fuse this sparse approximation with the Nyström method, resulting in the locally corrected Nyström method (LCN). These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even in complex, high-dimensional spaces. We thoroughly demonstrate these advantages via a theoretical analysis and by evaluating multiple approximations both directly and as a component of two real-world models. Using approximate Sinkhorn for unsupervised word embedding alignment enables us to train the model full-batch in a fraction of the time while improving upon the original on average by 3.1 percentage points without any model changes. For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn and outcompetes previous models by 48 %. LCN-Sinkhorn enables GTN to achieve this while still scaling log-linearly in the number of nodes.

1. INTRODUCTION

Measuring the distance between two distributions or sets of objects is a central problem in machine learning. One common method of solving this is optimal transport (OT). OT is concerned with finding the transport plan for moving a source distribution (e.g. a pile of earth) to a sink distribution (e.g. a construction pit) with the cheapest cost w.r.t. some pointwise cost function (e.g. the Euclidean distance). The advantages of this method have been shown numerous times, e.g. in generative modelling (Arjovsky et al., 2017; Bousquet et al., 2017; Genevay et al., 2018), loss functions (Frogner et al., 2015), set matching (Wang et al., 2019), and domain adaptation (Courty et al., 2017). Motivated by this, many different methods for accelerating OT have been proposed in recent years (Indyk & Thaper, 2003; Papadakis et al., 2014; Backurs et al., 2020). However, most of these approaches are specialized methods that do not generalize to modern deep learning models, which rely on dynamically changing high-dimensional embeddings. In this work we aim to make OT computation for point sets more scalable by proposing two fast and accurate approximations of entropy-regularized optimal transport: sparse Sinkhorn and LCN-Sinkhorn, the latter relying on our newly proposed locally corrected Nyström (LCN) method. Sparse Sinkhorn uses a sparse cost matrix to leverage the fact that in entropy-regularized OT (also known as the Sinkhorn distance) (Cuturi, 2013) often only each point's nearest neighbors influence the result. LCN-Sinkhorn extends this approach by leveraging LCN, a general similarity matrix approximation that fuses local (sparse) and global (low-rank) approximations, allowing us to simultaneously capture both kinds of behavior. LCN-Sinkhorn thus fuses sparse Sinkhorn and Nyström-Sinkhorn (Altschuler et al., 2019). Both sparse Sinkhorn and LCN-Sinkhorn run in log-linear time.
We theoretically analyze these approximations and show that sparse corrections can lead to significant improvements over the Nyström approximation. We furthermore validate these approximations by showing that they are able to reproduce both the Sinkhorn distance and transport plan significantly better than previous methods across a wide range of regularization parameters and computational budgets (as e.g. demonstrated in Fig. 1). We then show the impact of these improvements by employing Sinkhorn approximations end-to-end in two high-impact machine learning tasks. First, we incorporate them into Wasserstein Procrustes for word embedding alignment (Grave et al., 2019). LCN-Sinkhorn improves upon the original method's accuracy by 3.1 percentage points using a third of the training time, without any further model changes. Second, we develop the graph transport network (GTN), which combines graph neural networks (GNNs) with optimal transport, and further improve it via learnable unbalanced OT and multi-head OT. GTN with LCN-Sinkhorn is the first model that both overcomes the bottleneck of using a single embedding per graph and scales log-linearly in the number of nodes. In summary, our paper's main contributions are:

• Locally corrected Nyström (LCN), a flexible, log-linear time approximation for similarity matrices, leveraging both local (sparse) and global (low-rank) approximations.

• Entropy-regularized optimal transport (a.k.a. Sinkhorn distance) with log-linear runtime via sparse Sinkhorn and LCN-Sinkhorn. These are the first log-linear approximations that are stable enough to substitute full entropy-regularized OT in models that leverage high-dimensional spaces.

• The graph transport network (GTN), which combines a graph neural network (GNN) with multi-head unbalanced LCN-Sinkhorn. GTN both sets the state of the art on graph distance regression and still scales log-linearly in the number of nodes.

2. SPARSE SINKHORN

Entropy-regularized optimal transport. In this work we focus on optimal transport between two discrete sets of points. We furthermore add entropy regularization, which enables fast computation and often performs better than regular OT (Cuturi, 2013). Formally, given two categorical distributions modelled via the vectors $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^m$ supported on two sets of points $X_p = \{x_{p1}, \dots, x_{pn}\}$ and $X_q = \{x_{q1}, \dots, x_{qm}\}$ in $\mathbb{R}^d$, and the cost function $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (e.g. the squared $L_2$ distance) giving rise to the cost matrix $C_{ij} = c(x_{pi}, x_{qj})$, we aim to find the Sinkhorn distance $d_c^\lambda$ and the associated optimal transport plan $\bar{P}$ (Cuturi, 2013):

$$d_c^\lambda = \min_{P} \langle P, C \rangle_F - \lambda H(P), \quad \text{s.t. } P \mathbf{1}_m = p, \; P^T \mathbf{1}_n = q, \tag{1}$$

with the Frobenius inner product $\langle \cdot, \cdot \rangle_F$ and the entropy $H(P) = -\sum_{i=1}^n \sum_{j=1}^m P_{ij} \log P_{ij}$. Note that $d_c^\lambda$ includes the entropy and can thus be negative, while Cuturi (2013) originally used $d_{\text{Cuturi},c}^{1/\lambda} = \langle \bar{P}, C \rangle_F$. This optimization problem can be solved by finding the vectors $s$ and $t$ that normalize the columns and rows of the matrix $\bar{P} = \operatorname{diag}(s) K \operatorname{diag}(t)$, with the similarity matrix $K_{ij} = e^{-C_{ij}/\lambda}$, so that $\bar{P} \mathbf{1}_m = p$ and $\bar{P}^T \mathbf{1}_n = q$. This is usually achieved via the Sinkhorn algorithm, which initializes the normalization vectors as $s^{(1)} = \mathbf{1}_n$ and $t^{(1)} = \mathbf{1}_m$ and then updates them alternatingly via

$$s^{(i)} = p \oslash (K t^{(i-1)}), \qquad t^{(i)} = q \oslash (K^T s^{(i)}) \tag{2}$$

until convergence, where $\oslash$ denotes elementwise division.
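Concretely, the updates in Eq. (2) amount to a handful of dense linear-algebra operations. The following is a minimal NumPy sketch of the Sinkhorn iterations (an illustration under our own naming and with a fixed iteration count, not the paper's implementation):

```python
import numpy as np

def sinkhorn(C, p, q, lam=0.05, n_iters=500):
    """Entropy-regularized OT via Sinkhorn iterations (illustrative sketch).

    C: (n, m) cost matrix; p: (n,) and q: (m,) marginals.
    Returns the transport plan P = diag(s) K diag(t).
    """
    K = np.exp(-C / lam)          # similarity matrix K_ij = exp(-C_ij / lambda)
    s = np.ones(len(p))           # s^(1) = 1_n
    t = np.ones(len(q))           # t^(1) = 1_m
    for _ in range(n_iters):      # a real implementation would test convergence
        s = p / (K @ t)           # row scaling
        t = q / (K.T @ s)         # column scaling
    return s[:, None] * K * t[None, :]
```

After convergence the row and column sums of the returned plan match $p$ and $q$. For very small $\lambda$ the iterations should be carried out in log space for numerical stability, which this sketch omits.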
In this work we focus on cross-polytope LSH (Andoni et al., 2015) and k-means LSH (Paulevé et al., 2010), depending on the cost function (see App. H). Sparse Sinkhorn with LSH scales log-linearly with the number of points, i.e. $O(n \log n)$ for $n \approx m$ (see App. A and App. K for details). Unfortunately, LSH can fail, e.g. when the cost between pairs is very similar (see App. B). However, we can alleviate these limitations by fusing $K^{\text{sp}}$ with the Nyström approximation.
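As an illustration of the sparse variant, the sketch below builds $K^{\text{sp}}$ and runs the same iterations with sparse matrix-vector products. For brevity it selects "near" pairs via exact k-nearest neighbors rather than LSH (the paper's LSH schemes would fill a comparable sparsity pattern in log-linear time), and it symmetrizes the pattern so that no row or column of $K^{\text{sp}}$ is empty; all names are our own:

```python
import numpy as np
import scipy.sparse as sp

def sparse_sinkhorn(Xp, Xq, lam=0.05, k=5, n_iters=500):
    """Sinkhorn with a sparse similarity matrix (illustrative sketch)."""
    n, m = len(Xp), len(Xq)
    D = np.linalg.norm(Xp[:, None, :] - Xq[None, :, :], axis=-1)
    # "Near" pattern: union of row-wise and column-wise k-nearest neighbors,
    # so every point keeps at least k entries. (LSH would provide such a
    # pattern without computing the full distance matrix D.)
    near = np.zeros((n, m), dtype=bool)
    near[np.repeat(np.arange(n), k), np.argsort(D, axis=1)[:, :k].ravel()] = True
    near[np.argsort(D, axis=0)[:k, :].ravel(), np.tile(np.arange(m), k)] = True
    r_idx, c_idx = np.nonzero(near)
    # K^sp: exact entries on the pattern, zero (i.e. cost infinity) elsewhere.
    K = sp.csr_matrix((np.exp(-D[r_idx, c_idx] / lam), (r_idx, c_idx)), shape=(n, m))
    p, q = np.full(n, 1 / n), np.full(m, 1 / m)
    s, t = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        s = p / (K @ t)            # sparse matvec: O(nnz) instead of O(nm)
        t = q / (K.T @ s)
    return sp.diags(s) @ K @ sp.diags(t)
```

Note that the iterations only converge to both marginals if the sparse support admits a feasible transport plan, which is one of the failure modes that the Nyström part of LCN-Sinkhorn later repairs.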

3. LOCALLY CORRECTED NYSTRÖM AND LCN-SINKHORN

Nyström method. The Nyström method is a popular way of approximating similarity matrices that provides performance guarantees for many important tasks (Williams & Seeger, 2001; Musco & Musco, 2017). It approximates a positive semi-definite (PSD) similarity matrix $K$ via the low-rank decomposition $K_{\text{Nys}} = U A^{-1} V$. Since the optimal decomposition via SVD is too expensive to compute, Nyström instead chooses a set of $l$ landmarks $L = \{x_{l1}, \dots, x_{ll}\}$ and obtains the matrices via $U_{ij} = k(x_{pi}, x_{lj})$, $A_{ij} = k(x_{li}, x_{lj})$, and $V_{ij} = k(x_{li}, x_{qj})$, where $k(x_1, x_2)$ is an arbitrary PSD kernel, e.g. $k(x_1, x_2) = e^{-c(x_1, x_2)/\lambda}$ for Sinkhorn. Common methods of choosing landmarks from $X_p \cup X_q$ are uniform and ridge leverage score (RLS) sampling. We instead focus on k-means Nyström (Zhang et al., 2008) and sampling via k-means++, which we found to be significantly faster than recursive RLS sampling and to perform better than both uniform and RLS sampling (see App. H).

Sparse vs. Nyström. Exponential kernels like the one used for $K$ (e.g. the Gaussian kernel) typically have an infinite-dimensional reproducing kernel Hilbert space. The resulting Gram matrix $K$ thus always has full rank. A low-rank approximation like the Nyström method can therefore only account for its global structure and not the local structure around each point $x$. As such, it is ill-suited for any moderately low entropy regularization parameter, where the transport matrix $\bar{P}$ resembles a permutation matrix. Sparse Sinkhorn, on the other hand, cannot account for global structure and instead approximates all non-selected distances as infinity. It will hence fail if more than a handful of neighbors are required per point. These approximations are thus opposites of each other, and as such not competing but rather complementary approaches.

Locally corrected Nyström.
Since we know that the entries in our sparse approximation are exact, fusing this matrix with the Nyström method is rather straightforward. For all non-zero values in the sparse approximation $K^{\text{sp}}$ we first calculate the corresponding Nyström approximations, obtaining the sparse matrix $K^{\text{sp}}_{\text{Nys}}$. To obtain the locally corrected Nyström (LCN) approximation we remove these entries from $K_{\text{Nys}}$ and replace them with their exact values, i.e.

$$K_{\text{LCN}} = K_{\text{Nys}} + K^{\text{sp}}_\Delta = K_{\text{Nys}} - K^{\text{sp}}_{\text{Nys}} + K^{\text{sp}}.$$

LCN-Sinkhorn. To obtain the approximate transport plan $\bar{P}_{\text{LCN}}$ we run the Sinkhorn algorithm with $K_{\text{LCN}}$ instead of $K$. However, we never fully instantiate $K_{\text{LCN}}$. Instead, we only save the decomposition and directly use its parts in Eq. (2) via

$$K_{\text{LCN}} t = U (A^{-1} (V t)) + K^{\text{sp}}_\Delta t,$$

similarly to Altschuler et al. (2019). As a result we obtain the decomposition of the transport plan $\bar{P}_{\text{LCN}} = \bar{P}_{\text{Nys}} + \bar{P}^{\text{sp}}_\Delta = \bar{P}_U \bar{P}_W + \bar{P}^{\text{sp}} - \bar{P}^{\text{sp}}_{\text{Nys}}$, with $\bar{P}_U = \operatorname{diag}(s) U$ and $\bar{P}_W = A^{-1} V \operatorname{diag}(t)$, and the approximate distance (using Lemma A from Altschuler et al. (2019))

$$d^\lambda_{\text{LCN},c} = \lambda \left( \log s^T \bar{P}_U \bar{P}_W \mathbf{1}_m + \mathbf{1}_n^T \bar{P}_U \bar{P}_W \log t + \log s^T \bar{P}^{\text{sp}}_\Delta \mathbf{1}_m + \mathbf{1}_n^T \bar{P}^{\text{sp}}_\Delta \log t \right). \tag{5}$$

This approximation scales log-linearly with dataset size (see App. A and App. K for details). It allows us to smoothly move from Nyström-Sinkhorn to sparse Sinkhorn by varying the number of neighbors and landmarks. We can thus freely choose the optimal "operating point" based on the underlying problem and regularization parameter. We discuss the limitations of LCN-Sinkhorn in App. B.
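To make the construction concrete, the sketch below builds the Nyström factors for the kernel $k(x, y) = e^{-\|x - y\|_2/\lambda}$, the sparse correction $K^{\text{sp}}_\Delta = K^{\text{sp}} - K^{\text{sp}}_{\text{Nys}}$ on a given "near" pattern, and the matrix-vector product with $K_{\text{LCN}}$ without ever instantiating it. This is an illustration under our own naming; for clarity it computes dense intermediate matrices that a real implementation would restrict to the sparse pattern:

```python
import numpy as np
import scipy.sparse as sp

def kernel(X, Y, lam):
    """k(x, y) = exp(-||x - y||_2 / lam)."""
    return np.exp(-np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) / lam)

def lcn_factors(Xp, Xq, landmarks, near, lam):
    """Factors of K_LCN = U A^{-1} V + K_sp_delta for a boolean 'near' pattern."""
    U = kernel(Xp, landmarks, lam)
    A = kernel(landmarks, landmarks, lam)
    V = kernel(landmarks, Xq, lam)
    # Correction: exact minus Nyström values, kept only on the sparse pattern.
    # (Computed densely here for clarity; in practice only the pattern entries.)
    K_sp_delta = sp.csr_matrix(
        np.where(near, kernel(Xp, Xq, lam) - U @ np.linalg.solve(A, V), 0.0))
    return U, A, V, K_sp_delta

def lcn_matvec(U, A, V, K_sp_delta, t):
    """K_LCN @ t via the low-rank and sparse parts only."""
    return U @ np.linalg.solve(A, V @ t) + K_sp_delta @ t
```

On the "near" pattern the entries of $K_{\text{LCN}}$ are exact by construction; everywhere else they equal the Nyström estimate.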

4. THEORETICAL ANALYSIS

Approximation error. The main question we aim to answer in our theoretical analysis is what improvements to expect from adding sparse corrections to Nyström Sinkhorn. To do so, we first analyse approximations of $K$ in a uniform and a clustered data model. In these we use Nyström and LSH schemes that largely resemble k-means, as used in most of our experiments. Relevant proofs and notes for this section can be found in App. C to G.

Theorem 1. Let $X_p$ and $X_q$ have $n$ samples that are uniformly distributed in a $d$-dimensional closed, locally Euclidean manifold with unit volume. Let furthermore $C_{ij} = \|x_{pi} - x_{qj}\|_2$ and $K_{ij} = e^{-C_{ij}/\lambda}$. Let the $l$ landmarks $L$ be arranged optimally and regularly so that the expected $L_2$ distance to the closest landmark is minimized. Denote $R = \frac{1}{2} \min_{x, y \in L, x \neq y} \|x - y\|_2$. Assume that the sparse correction $K^{\text{sp}}_{ij} = K_{ij}$ if and only if $x_{qj}$ is one of the $k - 1$ nearest neighbors of $x_{pi}$, and that the distance $\delta_k$ to $x_{pi}$'s $k$-nearest neighbor satisfies $\delta_k \ll R$. Then the expected maximum error in row $i$ of the LCN approximation $K_{\text{LCN}}$ is

$$\mathbb{E}[\|K_{i,:} - K_{\text{LCN},i,:}\|_\infty] = \mathbb{E}[e^{-\delta_k/\lambda}] - \mathbb{E}[K_{\text{LCN},i,j}], \tag{6}$$

with $j$ denoting the index of $x_{pi}$'s $k$-nearest neighbor. Using the upper incomplete Gamma function $\Gamma(\cdot, \cdot)$ we can furthermore bound the second term by

$$e^{-\sqrt{d} R/\lambda} \leq \mathbb{E}[K_{\text{LCN},i,j}] \leq \frac{2 d \left( \Gamma(d) - \Gamma(d, 2R/\lambda) \right)}{(2R/\lambda)^d} (1 + e^{-2R/\lambda}) + O(e^{-2\sqrt{3} R/\lambda}).$$

The error in Eq. (6) is dominated by the first term since $\delta_k \ll R$. Note that $R$ only decreases slowly with the number of landmarks, since $R \geq \left( \frac{(d/2)!}{l} \right)^{1/d} \frac{1}{2\sqrt{\pi}}$ (Cohn, 2017). Moving from pure Nyström to LCN by correcting the nearest neighbors' entries thus provides significant benefits, even for uniform data. For example, by just correcting the first neighbor we obtain a 68 % improvement in the first term ($d = 32$, $\lambda = 0.05$, $n = 1000$). This is even more pronounced in clustered data.

Theorem 2. Let $X_p, X_q \subseteq \mathbb{R}^d$ be distributed inside the same $c$ clusters with cluster centers $x_c$.
Let $r$ be the maximum $L_2$ distance of a point to its cluster center and $D$ the minimum distance between two points from different clusters, with $r \ll D$. Let each LSH bucket used for the sparse approximation $K^{\text{sp}}$ cover at least one cluster. Let $K_{\text{Nys}}$ use $1 \leq l \leq d$ and $K_{\text{LCN}}$ use $l = 1$ optimally distributed landmarks per cluster. Then the maximum errors are

$$\max \|K - K_{\text{Nys}}\|_\infty = 1 - \max_{\Delta \in [0, r]} \frac{l\, e^{-2\sqrt{r^2 + \frac{l-1}{2l}\Delta^2}/\lambda}}{1 + (l - 1) e^{-\Delta/\lambda}} - O(e^{-D/\lambda}), \tag{8}$$

$$\max \|K - K^{\text{sp}}\|_\infty = e^{-D/\lambda},$$

$$\max \|K - K_{\text{LCN}}\|_\infty = e^{-D/\lambda} \left( 1 - e^{-2r/\lambda} (2 - e^{-2r/\lambda}) + O(e^{-D/\lambda}) \right).$$

Since we can lower bound Eq. (8) by $1 - l e^{-2r/\lambda} - O(e^{-D/\lambda})$, we can conclude that the error in $K_{\text{Nys}}$ is close to 1 for any reasonably large $r/\lambda$ (which is the maximum error possible). The errors in $K^{\text{sp}}$ and $K_{\text{LCN}}$, on the other hand, are vanishingly small, since $r \ll D$. Moreover, these maximum approximation error improvements directly translate to improvements in the Sinkhorn approximation. We can show this by slightly adapting the error bounds for an approximate Sinkhorn transport plan and distance due to Altschuler et al. (2019).

Theorem 3 (Altschuler et al. (2019)). Let $X_p, X_q \subseteq \mathbb{R}^d$ have $n$ samples. Denote $\rho$ as the maximum distance between two samples. Let $\tilde{K}$ be an approximation of the similarity matrix $K$ with $K_{ij} = e^{-\|x_{pi} - x_{qj}\|_2/\lambda}$ and $\|K - \tilde{K}\|_\infty \leq \frac{\varepsilon'}{2} e^{-\rho/\lambda}$, where $\varepsilon' = \min\left(1, \frac{\varepsilon}{50(\rho + \lambda \log \frac{\lambda n}{\varepsilon})}\right)$. When performing the Sinkhorn algorithm until $\|\tilde{P} \mathbf{1}_N - p\|_1 + \|\tilde{P}^T \mathbf{1}_N - q\|_1 \leq \varepsilon'/2$, the resulting approximate transport plan $\tilde{P}$ and distance $\tilde{d}^\lambda_c$ are bounded by

$$|d^\lambda_c - \tilde{d}^\lambda_c| \leq \varepsilon, \qquad D_{\text{KL}}(\bar{P} \,\|\, \tilde{P}) \leq \varepsilon/\lambda.$$

Convergence rate. We next show that approximate Sinkhorn converges as fast as regular Sinkhorn, by slightly adapting the convergence bound by Dvurechensky et al. (2018) to account for sparsity.

Theorem 4 (Dvurechensky et al. (2018)). Given the matrix $\tilde{K} \in \mathbb{R}^{n \times n}$ and the marginals $p$ and $q$, the Sinkhorn algorithm gives a transport plan satisfying $\|\tilde{P} \mathbf{1}_N - p\|_1 + \|\tilde{P}^T \mathbf{1}_N - q\|_1 \leq \varepsilon$ in a number of iterations

$$k \leq 2 + \frac{-4 \ln\left( \min_{i,j} \{\tilde{K}_{ij} \mid \tilde{K}_{ij} > 0\} \cdot \min_{i,j} \{p_i, q_j\} \right)}{\varepsilon}.$$
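The gap predicted by Theorem 2 is easy to observe numerically. The following sketch (our own toy setup: two tight clusters with $r \ll D$, one landmark per cluster, and within-cluster pairs as the sparse pattern) compares the maximum errors of the three approximations:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.05
centers = np.array([[0.0, 0.0], [3.0, 0.0]])          # D ~ 3, r ~ 0.3
Xp = np.concatenate([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])
Xq = np.concatenate([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

D = np.linalg.norm(Xp[:, None] - Xq[None, :], axis=-1)
K = np.exp(-D / lam)

# Nyström with one landmark per cluster (the cluster centers).
U = np.exp(-np.linalg.norm(Xp[:, None] - centers[None], axis=-1) / lam)
V = np.exp(-np.linalg.norm(centers[:, None] - Xq[None], axis=-1) / lam)
A = np.exp(-np.linalg.norm(centers[:, None] - centers[None], axis=-1) / lam)
K_nys = U @ np.linalg.solve(A, V)

near = D < 1.0                       # within-cluster pairs only
K_sp = np.where(near, K, 0.0)        # sparse: exact near, zero far
K_lcn = np.where(near, K, K_nys)     # LCN: exact near, Nyström far

err = lambda M: np.abs(K - M).max()
```

Consistent with the theorem, the Nyström error is large (close to the maximum possible), while the sparse and LCN errors are vanishingly small.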
Backpropagation. Efficient gradient computation is almost as important for modern deep learning models as the algorithm itself. These models usually aim at learning the embeddings in $X_p$ and $X_q$ and therefore need gradients w.r.t. the cost matrix $C$. We can estimate these either via automatic differentiation of the unrolled Sinkhorn iterations or via the analytic solution that assumes exact convergence. Depending on the problem at hand, either the automatic or the analytic estimator will lead to faster overall convergence (Ablin et al., 2020). LCN-Sinkhorn works flawlessly with automatic backpropagation, since it only relies on basic linear algebra (except for choosing Nyström landmarks and LSH neighbors, for which we use a simple straight-through estimator (Bengio et al., 2013)). To enable fast analytic backpropagation we provide analytic gradients in Proposition 1. Note that both backpropagation methods have runtime linear in the numbers of points $n$ and $m$.

Proposition 1. The derivatives of the distances $d^\lambda_c$ and $d^\lambda_{\text{LCN},c}$ (Eqs. (1) and (5)) and the optimal transport plan $\bar{P} \in \mathbb{R}^{n \times m}$ w.r.t. the (decomposed) cost matrix $C \in \mathbb{R}^{n \times m}$ in entropy-regularized OT and LCN-Sinkhorn are

$$\frac{\partial d^\lambda_c}{\partial C} = \bar{P}, \qquad \frac{\partial \bar{P}_{ij}}{\partial C_{kl}} = -\frac{1}{\lambda} \bar{P}_{ij} \delta_{ik} \delta_{jl},$$

$$\frac{\partial d^\lambda_{\text{LCN},c}}{\partial U} = -\lambda s (W t)^T, \quad \frac{\partial d^\lambda_{\text{LCN},c}}{\partial W} = -\lambda (s^T U)^T t^T, \quad \frac{\partial d^\lambda_{\text{LCN},c}}{\partial \log K^{\text{sp}}} = -\lambda \bar{P}^{\text{sp}}, \quad \frac{\partial d^\lambda_{\text{LCN},c}}{\partial \log K^{\text{sp}}_{\text{Nys}}} = \lambda \bar{P}^{\text{sp}}_{\text{Nys}},$$

$$\frac{\partial \bar{P}_{U,ij}}{\partial U_{kl}} = \delta_{ik} \delta_{jl} s_i, \quad \frac{\partial \bar{P}_{W,ij}}{\partial U_{kl}} = \bar{P}^\dagger_{U,ik} s_k \bar{P}_{W,lj}, \quad \frac{\partial \bar{P}_{U,ij}}{\partial W_{kl}} = \bar{P}_{U,ik} t_l \bar{P}^\dagger_{W,lj}, \quad \frac{\partial \bar{P}_{W,ij}}{\partial W_{kl}} = \delta_{ik} \delta_{jl} t_j,$$

$$\frac{\partial \bar{P}^{\text{sp}}_{ij}}{\partial \log K^{\text{sp}}_{kl}} = \bar{P}^{\text{sp}}_{ij} \delta_{ik} \delta_{jl}, \qquad \frac{\partial \bar{P}^{\text{sp}}_{\text{Nys},ij}}{\partial \log K^{\text{sp}}_{\text{Nys},kl}} = \bar{P}^{\text{sp}}_{\text{Nys},ij} \delta_{ik} \delta_{jl},$$

with $\delta_{ij}$ denoting the Kronecker delta, $W = A^{-1} V$, and $\dagger$ the Moore-Penrose pseudoinverse. Using these decompositions we can backpropagate through LCN-Sinkhorn in time $O((n + m) l^2 + l^3)$.
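The first identity in Proposition 1, $\partial d^\lambda_c / \partial C = \bar{P}$, can be verified with a finite-difference check (our own toy sketch; at convergence the envelope theorem makes the indirect dependence through $s$ and $t$ vanish):

```python
import numpy as np

def sinkhorn_distance(C, p, q, lam, n_iters=2000):
    """Returns (d_c^lambda, plan) for entropy-regularized OT, Eq. (1)."""
    K = np.exp(-C / lam)
    s, t = np.ones(len(p)), np.ones(len(q))
    for _ in range(n_iters):
        s = p / (K @ t)
        t = q / (K.T @ s)
    P = s[:, None] * K * t[None, :]
    # d = <P, C>_F - lam * H(P), with H(P) = -sum_ij P_ij log P_ij
    return (P * C).sum() + lam * (P * np.log(P)).sum(), P
```

Perturbing a single cost entry and comparing the central finite difference of the distance against the corresponding entry of the converged plan reproduces the identity to numerical precision.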

5. GRAPH TRANSPORT NETWORK

Graph distance learning. The ability to predict similarities or distances between graph-structured objects is useful across a wide range of applications. It can e.g. be used to predict the reaction rate between molecules (Houston et al., 2019), search for similar images (Johnson et al., 2015), similar molecules for drug discovery (Birchall et al., 2006), or similar code for vulnerability detection (Li et al., 2019). We propose the graph transport network (GTN) to evaluate approximate Sinkhorn and advance the state of the art on this task.

Graph transport network. GTN first uses a Siamese graph neural network (GNN) to embed two graphs independently as sets of node embeddings. These embeddings are then matched using enhanced entropy-regularized optimal transport. Given an undirected graph $G = (V, E)$ with node set $V$ and edge set $E$, node attributes $x_i \in \mathbb{R}^{H_x}$ and (optional) edge attributes $e_{i,j} \in \mathbb{R}^{H_e}$, with $i, j \in V$, we update the node embeddings in each GNN layer via

$$h^{(l)}_{\text{self},i} = \sigma(W^{(l)}_{\text{node}} h^{(l-1)}_i + b^{(l)}), \qquad h^{(l)}_i = h^{(l)}_{\text{self},i} + \sum_{j \in \mathcal{N}_i} \eta^{(l)}_{i,j}\, h^{(l)}_{\text{self},j} W_{\text{edge}} e_{i,j},$$

with $\mathcal{N}_i$ denoting the neighborhood of node $i$, $h^{(0)}_i = x_i$, $h^{(l)}_i \in \mathbb{R}^{H_N}$ for $l \geq 1$, the bilinear layer $W_{\text{edge}} \in \mathbb{R}^{H_N \times H_N \times H_e}$, and the degree normalization $\eta^{(1)}_{i,j} = 1$ and $\eta^{(l)}_{i,j} = 1/\sqrt{\deg_i \deg_j}$ for $l > 1$. This choice of $\eta_{i,j}$ allows our model to handle highly skewed degree distributions while still being able to represent node degrees. We found the choice of non-linearity $\sigma$ not to be critical and chose a LeakyReLU. We do not use the bilinear layer $W_{\text{edge}} e_{i,j}$ if there are no edge attributes. We aggregate each layer's node embeddings to obtain the final embedding of node $i$:

$$h^{\text{final}}_i = [h^{(1)}_{\text{self},i} \,\|\, h^{(1)}_i \,\|\, h^{(2)}_i \,\|\, \dots \,\|\, h^{(L)}_i].$$

Having obtained the embedding sets $H^{\text{final}}_1$ and $H^{\text{final}}_2$ of both graphs, we use the $L_2$ distance as a cost function and then calculate the Sinkhorn distance, which is symmetric and permutation invariant w.r.t. the sets $H^{\text{final}}_1$ and $H^{\text{final}}_2$.
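A single GTN layer without edge attributes can be sketched as follows (an illustration with our own naming; the bilinear edge term and per-layer weight indexing are omitted for brevity):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def gtn_layer(H, adj, W, b, first_layer=False):
    """One message-passing layer: h_self = sigma(W h + b),
    h_i = h_self_i + sum_j eta_ij h_self_j (no edge attributes).

    H: (N, F_in) node embeddings; adj: (N, N) symmetric 0/1 adjacency;
    W: (F_in, F_out); b: (F_out,). Returns (h_self, h)."""
    H_self = leaky_relu(H @ W + b)
    if first_layer:
        norm = adj.astype(float)                   # eta^(1)_ij = 1
    else:
        deg = adj.sum(1).clip(min=1)
        norm = adj / np.sqrt(np.outer(deg, deg))   # eta^(l)_ij = 1 / sqrt(deg_i deg_j)
    return H_self, H_self + norm @ H_self
```

Stacking $L$ such layers and concatenating $[h^{(1)}_{\text{self}} \| h^{(1)} \| \dots \| h^{(L)}]$ per node then yields the embedding set that enters the Sinkhorn matching.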
We obtain the embeddings used for matching by passing each $h^{\text{final}}_i$ through an MLP, and obtain the final prediction via $d = d^\lambda_c w_{\text{out}} + b_{\text{out}}$, with learnable $w_{\text{out}}$ and $b_{\text{out}}$. All weights in GTN are trained end-to-end via backpropagation. For small graphs we use the full Sinkhorn distance and scale to large graphs by leveraging LCN-Sinkhorn. GTN is more expressive than models that aggregate node embeddings to a single fixed-size embedding for the entire graph, but still scales log-linearly in the number of nodes, as opposed to previous approaches that scale quadratically. Note that GTN inherently performs graph matching and can therefore also be applied to this task.

Learnable unbalanced OT. Since GTN regularly encounters graphs with disagreeing numbers of nodes, it needs to be able to handle cases where $\|p\|_1 \neq \|q\|_1$ or where not all nodes in one graph have a corresponding node in the other, and thus $\bar{P} \mathbf{1}_m < p$ or $\bar{P}^T \mathbf{1}_n < q$. Unbalanced OT allows us to handle both of these cases (Peyré & Cuturi, 2019). Previous methods did so by swapping these requirements with a uniform divergence loss term on $p$ and $q$ (Frogner et al., 2015; Chizat et al., 2018). However, these approaches uniformly penalize deviations from balanced OT and therefore cannot adapt to only ignore parts of the distribution. We propose to alleviate this limitation by swapping the cost matrix $C$ with the bipartite matching (BP) matrix (Riesen & Bunke, 2009)

$$C_{\text{BP}} = \begin{pmatrix} C & C^{(p,\varepsilon)} \\ C^{(\varepsilon,q)} & C^{(\varepsilon,\varepsilon)} \end{pmatrix}, \quad C^{(p,\varepsilon)}_{ij} = \begin{cases} c_{i,\varepsilon} & i = j, \\ \infty & i \neq j, \end{cases} \quad C^{(\varepsilon,q)}_{ij} = \begin{cases} c_{\varepsilon,j} & i = j, \\ \infty & i \neq j, \end{cases} \quad C^{(\varepsilon,\varepsilon)}_{ij} = 0,$$

and adaptively computing the costs $c_{i,\varepsilon}$, $c_{\varepsilon,j}$, and $c_{\varepsilon,\varepsilon}$ based on the input sets $X_p$ and $X_q$. Using the BP matrix adds minor computational overhead, since we only need to save the diagonals $c_{p,\varepsilon}$ and $c_{\varepsilon,q}$ of $C^{(p,\varepsilon)}$ and $C^{(\varepsilon,q)}$. We can then include the additional parts of $C_{\text{BP}}$ in the Sinkhorn algorithm (Eq. (2)) via

$$K_{\text{BP}} t = \begin{pmatrix} K \hat{t} + e^{-c_{p,\varepsilon}/\lambda} \odot \check{t} \\ e^{-c_{\varepsilon,q}/\lambda} \odot \hat{t} + \mathbf{1}_m (\mathbf{1}_n^T \check{t}) \end{pmatrix}, \qquad K_{\text{BP}}^T s = \begin{pmatrix} K^T \hat{s} + e^{-c_{\varepsilon,q}/\lambda} \odot \check{s} \\ e^{-c_{p,\varepsilon}/\lambda} \odot \hat{s} + \mathbf{1}_n (\mathbf{1}_m^T \check{s}) \end{pmatrix},$$

where $\hat{t}$ denotes the upper part (length $m$) and $\check{t}$ the lower part (length $n$) of the vector $t$ (analogously $\hat{s}$ of length $n$ and $\check{s}$ of length $m$), and the exponentials are applied elementwise. To calculate $d^\lambda_c$ we can decompose the transport plan $P_{\text{BP}}$ in the same way as $C_{\text{BP}}$, with a single scalar for $P^{(\varepsilon,\varepsilon)}$. For GTN we obtain the deletion cost via $c_{i,\varepsilon} = \|\alpha \odot x_{pi}\|_2$, with a learnable vector $\alpha \in \mathbb{R}^d$.

Multi-head OT. Inspired by attention models (Vaswani et al., 2017), we further improve GTN by using multiple OT heads. Using $K$ heads means that we calculate OT in parallel for $K$ separate sets of embeddings representing the same pair of objects, obtaining a vector of distances $\boldsymbol{d}^\lambda_c \in \mathbb{R}^K$. The separate embedding sets are computed with a set of linear layers, $h^{(k)}_i = W^{(k)} h^{\text{final}}_i$ for head $k$, and we obtain the final prediction via $d = \text{MLP}(\boldsymbol{d}^\lambda_c)$. Note that both learnable unbalanced OT and multi-head OT might be of independent interest.
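The block structure makes the extra Sinkhorn cost negligible: the product $K_{\text{BP}} t$ only needs the original $K$, the two diagonals, and one sum. A minimal sketch (our own naming; `k_p_eps` and `k_eps_q` hold the kernelized diagonals $e^{-c_{p,\varepsilon}/\lambda}$ and $e^{-c_{\varepsilon,q}/\lambda}$):

```python
import numpy as np

def bp_matvec(K, k_p_eps, k_eps_q, t):
    """K_BP @ t without building the (n+m) x (m+n) BP matrix.

    K: (n, m); k_p_eps: (n,) and k_eps_q: (m,) kernelized deletion/insertion
    diagonals; t: (m + n,) with upper part of length m, lower of length n.
    The eps-eps block has cost 0, i.e. kernel value 1 everywhere."""
    n, m = K.shape
    t_up, t_lo = t[:m], t[m:]
    top = K @ t_up + k_p_eps * t_lo                    # rows of X_p
    bottom = k_eps_q * t_up + np.full(m, t_lo.sum())   # epsilon rows
    return np.concatenate([top, bottom])
```

The product $K_{\text{BP}}^T s$ is analogous, with the roles of the two diagonals swapped.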

6. RELATED WORK

Log-linear optimal transport. For an overview of optimal transport and its foundations see Peyré & Cuturi (2019). On low-dimensional grids and surfaces OT can be solved using dynamical OT (Papadakis et al., 2014; Solomon et al., 2014), convolutions (Solomon et al., 2015), or embedding/hashing schemes (Indyk & Thaper, 2003; Andoni et al., 2008). In higher dimensions we can use tree-based algorithms (Backurs et al., 2020) or hashing schemes (Charikar, 2002), which are however limited to a previously fixed set of points $X_p, X_q$, on which only the distributions $p$ and $q$ change. For sets that change dynamically (e.g. during training) one common method of achieving log-linear runtime is a multiscale approximation of entropy-regularized OT (Schmitzer, 2019; Gerber & Maggioni, 2017). Tenetov et al. (2018) recently proposed using a low-rank approximation of the Sinkhorn similarity matrix obtained via a semidiscrete approximation of the Euclidean distance. Altschuler et al. (2019) improved upon this approach by using the Nyström method for the approximation. These approaches still struggle with high-dimensional real-world problems, as we will show in Sec. 7.

Sliced Wasserstein distance. Another approach to reduce the computational complexity of optimal transport (without entropy regularization) are sliced Wasserstein distances (Rabin et al., 2011). However, they require the $L_2$ distance as a cost function and are either unstable in convergence or prohibitively expensive for high-dimensional problems ($O(n d^3)$) (Meng et al., 2019).

Fast Sinkhorn.
Another line of work has been pursuing accelerating entropy-regularized OT without changing its computational complexity w.r.t. the number of points. Original Sinkhorn requires $O(1/\varepsilon^2)$ iterations (Dvurechensky et al., 2018), and Jambulapati et al. (2019) recently proposed an algorithm that reduces this to $O(1/\varepsilon)$. Alaya et al. (2019) proposed to reduce the size of the Sinkhorn problem by screening out negligible components, which allows for approximation guarantees. Genevay et al. (2016) proposed using a stochastic optimization scheme instead of Sinkhorn iterations. Essid & Solomon (2018) and Blondel et al. (2018) proposed alternative regularizations to obtain OT problems with similar runtimes as the Sinkhorn algorithm. This work is largely orthogonal to ours.

Embedding alignment. For an overview of cross-lingual word embedding models see Ruder et al. (2019). Unsupervised word embedding alignment was proposed by Conneau et al. (2018), with subsequent advances by Alvarez-Melis & Jaakkola (2018), Grave et al. (2019), and Joulin et al. (2018).

Graph matching and distance learning. Most recent approaches for graph matching and graph distance learning either rely on a single fixed-dimensional graph embedding (Bai et al., 2019; Li et al., 2019), or only use attention or some other strongly simplified variant of optimal transport (Bai et al., 2019; Riba et al., 2018; Li et al., 2019). Others break permutation invariance and are thus ill-suited for this task (Ktena et al., 2017; Bai et al., 2018). So far only approaches using a single graph embedding allow faster than quadratic scaling in the number of nodes. Compared to the Sinkhorn-based image model concurrently proposed by Wang et al. (2019), GTN uses no CNN or cross-graph attention, but an enhanced GNN and embedding aggregation scheme. OT has recently been proposed for graph kernels (Maretic et al., 2019; Vayer et al., 2019), which can (to a limited extent) be used for graph matching, but not for distance learning.

7. EXPERIMENTS

Approximating Sinkhorn. We start by directly investigating different Sinkhorn approximations. To do so we compute entropy-regularized OT on pairs of 10 000 word embeddings from Conneau et al. (2018), which we preprocess with Wasserstein Procrustes alignment in order to obtain both close and distant neighbors. We let every method use the same total number of 40 neighbors and landmarks (LCN uses 20 each) and set $\lambda = 0.05$ (as in Grave et al. (2019)). We measure transport plan approximation quality by (a) calculating the Pearson correlation coefficient (PCC) between all entries in the approximated plan and the true $\bar{P}$ and (b) comparing the sets of 0.1 % largest entries in the approximated and true $\bar{P}$ using the Jaccard similarity (intersection over union, IoU). In all figures the error bars denote the standard deviation across 5 runs, which is often too small to be visible. Table 1 shows that sparse Sinkhorn, LCN-Sinkhorn, and factored OT (Forrow et al., 2019) obtain distances that are significantly closer to the true $d^\lambda_c$ than Multiscale OT and Nyström-Sinkhorn. Furthermore, the transport plans computed by sparse Sinkhorn and LCN-Sinkhorn show both a PCC and an IoU that are around twice as high as Multiscale OT, while Nyström-Sinkhorn and factored OT exhibit almost no correlation. LCN-Sinkhorn performs especially well in this regard. This is also evident in Fig. 1, which shows how the $10^4 \times 10^4$ approximated OT plan entries compare to the true Sinkhorn values. Fig. 2 shows that sparse Sinkhorn offers the best trade-off between runtime and OT plan quality. Factored OT exhibits a runtime 2 to 10 times longer than the competition due to its iterative refinement scheme. LCN-Sinkhorn performs best for use cases with constrained memory (few neighbors/landmarks), as shown in Fig. 3. The number of neighbors and landmarks directly determines memory usage and is linearly proportional to the runtime (see App. K). Fig. 9 shows that sparse Sinkhorn performs best for low regularizations, where LCN-Sinkhorn fails due to the Nyström part going out of bounds. Nyström-Sinkhorn performs best at high values, and LCN-Sinkhorn always performs better than both (as long as it can be calculated). Interestingly, all approximations except factored OT seem to fail at high $\lambda$. We defer the analogous discussion of the distance approximation to App. L. All approximations scale linearly both in the number of neighbors/landmarks and in dataset size, as shown in App. K. Overall, we see that sparse Sinkhorn and LCN-Sinkhorn yield significant improvements over previous approximations. However, do these improvements also translate to better performance on downstream tasks?

Embedding alignment. Embedding alignment is the task of finding the orthogonal matrix $R \in \mathbb{R}^{d \times d}$ that best aligns the vectors from two different embedding spaces, which is e.g. useful for unsupervised word translation. We use the experimental setup established by Conneau et al. (2018) by migrating Grave et al. (2019)'s implementation to PyTorch. The only change we make is using the full set of 20 000 word embeddings and training for 300 steps, while reducing the learning rate by half every 100 steps. We do not change any other hyperparameters and do not use unbalanced OT. After training we match pairs via cross-domain similarity local scaling (CSLS) (Conneau et al., 2018). We use 10 Sinkhorn iterations, 40 neighbors for sparse Sinkhorn, and 20 neighbors and landmarks for LCN-Sinkhorn (for details see App. H). We allow both multiscale OT and Nyström-Sinkhorn to use as many landmarks and neighbors as can fit into GPU memory and finetune both methods. Table 2 shows that using full Sinkhorn yields a significant improvement in accuracy on this task compared to the original approach of performing Sinkhorn on randomly sampled subsets of embeddings (Grave et al., 2019).
LCN-Sinkhorn even outperforms the full version in most cases, which is likely due to regularization effects from the approximation. It also runs 4.6x faster than full Sinkhorn and 3.1x faster than the original scheme. Sparse Sinkhorn runs 1.8x faster than LCN-Sinkhorn but cannot match its accuracy. LCN-Sinkhorn still outcompetes the original method after refining the embeddings with iterative local CSLS (Conneau et al., 2018) . Both multiscale OT and Nyström Sinkhorn fail at this task, despite their larger computational budget. This shows that the improvements achieved by sparse Sinkhorn and LCN-Sinkhorn have an even larger impact in practice. Graph distance regression. The graph edit distance (GED) is useful for various tasks, such as image retrieval (Xiao et al., 2008) or fingerprint matching (Neuhaus & Bunke, 2004) , but its computation is NP-complete (Bunke & Shearer, 1998) . Therefore, to use it on larger graphs we need to learn an approximation. We use the Linux dataset by Bai et al. (2019) and generate 2 new datasets by computing the exact GED using the method by Lerouge et al. (2017) on small graphs (≤ 30 nodes) from the AIDS dataset (Riesen & Bunke, 2008 ) and a set of preferential attachment graphs. We compare GTN to 3 state-of-the-art baselines: SiameseMPNN (Riba et al., 2018), SimGNN (Bai et al., 2019) , and the Graph Matching Network (GMN) (Li et al., 2019) . We tune the hyperparameters of all baselines and GTN on the validation set via a grid search. For more details see App. H to J. We first test both GTN and the proposed OT enhancements. Table 3 shows that GTN improves upon competing models by 20 % with a single head and by 48 % with 8 OT heads. These improvements break down when using regular balanced OT, showing the importance of learnable unbalanced OT. Having established GTN as a state-of-the-art model we next ask whether we can sustain its performance when using approximate OT. 
To test this we additionally generate a set of larger graphs with around 200 nodes and use the pyramid matching (PM) kernel (Nikolentzos et al., 2017) as the prediction target, since these graphs are too large to compute the GED. See App. J for hyperparameter details. Table 4 shows that both sparse Sinkhorn and the multiscale method using 4 (expected) neighbors fail at this task, demonstrating that the low-rank approximation in LCN has a crucial stabilizing effect during training. Nyström Sinkhorn with 4 landmarks performs surprisingly well on the AIDS30 dataset, suggesting an overall low-rank structure, with Nyström acting as regularization. However, it does not perform as well on the other two datasets. Using LCN-Sinkhorn with 2 neighbors and landmarks works well on all three datasets, with an RMSE increased by only 10 % compared to full GTN. App. K furthermore shows that GTN with LCN-Sinkhorn indeed scales linearly in the number of nodes across multiple orders of magnitude. This model thus allows us to perform graph matching and distance learning on graphs that are considered large even for simple node-level tasks (20 000 nodes).

8. CONCLUSION

Locality sensitive hashing (LSH) and the novel locally corrected Nyström (LCN) method enable fast and accurate approximations of entropy-regularized OT with log-linear runtime: sparse Sinkhorn and LCN-Sinkhorn. The graph transport network (GTN) is one example of such a model, which can be substantially improved with learnable unbalanced OT and multi-head OT. It sets the new state of the art for graph distance learning while still scaling log-linearly with graph size. These contributions enable new applications and models that are both faster and more accurate, since they can sidestep workarounds such as pooling.

A COMPLEXITY ANALYSIS

Sparse Sinkhorn. A common way of achieving a high $p_1$ and low $p_2$ in LSH is via the AND-OR construction. In this scheme we calculate $B \cdot r$ hash functions, divided into $B$ sets (hash bands) of $r$ hash functions each. A pair of points is considered as neighbors if any hash band matches completely. Calculating the hash buckets for all points, with $b$ hash buckets per function, scales as $O((n+m)dBbr)$ for the hash functions we consider. As expected, for the tasks and hash functions we investigated we obtain approximately $m/b^r$ and $n/b^r$ neighbors, with $b^r$ hash buckets per band. Using this we can fix the number of neighbors to a small constant $\beta$.

LCN-Sinkhorn. Both choosing landmarks via k-means++ sampling and via k-means with a fixed number of iterations have the same runtime complexity of $O((n+m)ld)$. Precomputing $W$ can be done in time $O(nl^2 + l^3)$. The low-rank part of updating the vectors $s$ and $t$ can be computed in $O(nl + l^2 + lm)$, with $l$ chosen constant, i.e. independently of $n$ and $m$. Since sparse Sinkhorn with LSH has a log-linear runtime, we again obtain log-linear overall runtime for LCN-Sinkhorn.
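The AND-OR banding scheme described above can be sketched in a few lines. This is an illustrative sketch only: the function name, parameters, and the random-hyperplane bit hash below are stand-ins we chose, not the cross-polytope or k-means hash functions used in the paper.

```python
import numpy as np

def and_or_neighbors(points, B=4, r=2, b=8, seed=0):
    """AND-OR construction: two points are candidate neighbors if ALL r
    hashes agree in ANY of the B bands (AND within a band, OR across bands)."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    k = int(np.log2(b))  # bits per hash, giving b = 2^k buckets
    # Random-hyperplane bit hashes as an illustrative stand-in for the
    # hash functions used in the paper.
    planes = rng.standard_normal((B * r, d, k))
    bits = (np.einsum('nd,hdk->hnk', points, planes) > 0).astype(int)
    codes = bits @ (1 << np.arange(k))          # one bucket id per hash
    bands = codes.reshape(B, r, n)
    pairs = set()
    for band in bands:                          # OR over bands
        buckets = {}
        for i in range(n):                      # AND: all r hashes must match
            buckets.setdefault(tuple(band[:, i]), []).append(i)
        for idx in buckets.values():
            for a in range(len(idx)):
                for c in range(a + 1, len(idx)):
                    pairs.add((idx[a], idx[c]))
    return pairs
```

Grouping by the full tuple of a band's r bucket ids implements the AND; taking the union of pairs over bands implements the OR.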

B LIMITATIONS

Sparse Sinkhorn. Using a sparse approximation for K works well in the common case when the regularization parameter λ is low and the cost function varies enough between data pairs, such that the transport plan P resembles a sparse matrix. However, it can fail if the cost between pairs is very similar or the regularization is very high, if the dataset contains many hubs, i.e. points with a large number of neighbors, or if the distributions p or q are spread very unevenly. Furthermore, sparse Sinkhorn can be too unstable to train a model from scratch, since randomly initialized embeddings often have no close neighbors (see Sec. 7). LCN-Sinkhorn largely alleviates these limitations.

LCN-Sinkhorn. Since we cannot calculate the full cost matrix, LCN-Sinkhorn cannot provide accuracy guarantees in general. Highly concentrated distributions p and q might have adverse effects on LCN-Sinkhorn. However, we can compensate for these by sampling landmarks or neighbors proportional to each point's probability mass. We therefore do not expect LCN-Sinkhorn to break down in this scenario. If the regularization parameter is low or the cost function varies greatly, we sometimes observed stability issues (over- and underflows) with the Nyström approximation because of the inverse $A^{-1}$, which cannot be calculated in log-space. Due to its linearity, the Nyström method furthermore sometimes approximates similarities as negative values, which leads to a failure if the result of the matrix product in Eq. (2) becomes negative. In these extreme cases we also observed catastrophic cancellation caused by the correction $K^{\Delta}_{\text{sp}}$. Since such cases essentially mean that optimal transport will be very local, we recommend using sparse Sinkhorn in these scenarios. This again demonstrates the complementarity of the sparse approximation and Nyström: in cases where one fails we can often resort to the other.

C PROOF OF THEOREM 1

We first prove a lemma that will be useful later on.

Lemma A. Let $\tilde{K}$ be the Nyström approximation of the similarity matrix $K_{ij} = e^{-\|x_i - x_j\|_2/\lambda}$. Let $x_i$ and $x_j$ be data points with equal $L_2$ distance $r_i$ and $r_j$ to all $l$ landmarks, which have the same distance $\Delta$ to each other. Then
$$\tilde{K}_{ij} = \frac{l e^{-(r_i + r_j)/\lambda}}{1 + (l-1) e^{-\Delta/\lambda}}. \quad (21)$$

Proof. The inter-landmark distance matrix is $A = e^{-\Delta/\lambda} \mathbf{1}_{l \times l} + (1 - e^{-\Delta/\lambda}) I_l$, where $\mathbf{1}_{l \times l}$ denotes the constant 1 matrix. Using the identity
$$(b \mathbf{1}_{n \times n} + (a - b) I_n)^{-1} = \frac{-b}{(a-b)(a+(n-1)b)} \mathbf{1}_{n \times n} + \frac{1}{a-b} I_n \quad (23)$$
we compute
$$\begin{aligned}
\tilde{K}_{ij} &= U_{i,:} A^{-1} V_{:,j} \\
&= \begin{pmatrix} e^{-r_i/\lambda} & \cdots & e^{-r_i/\lambda} \end{pmatrix} \left( \frac{-e^{-\Delta/\lambda}}{(1 - e^{-\Delta/\lambda})(1 + (l-1)e^{-\Delta/\lambda})} \mathbf{1}_{l \times l} + \frac{1}{1 - e^{-\Delta/\lambda}} I_l \right) \begin{pmatrix} e^{-r_j/\lambda} \\ \vdots \\ e^{-r_j/\lambda} \end{pmatrix} \\
&= \frac{e^{-(r_i+r_j)/\lambda}}{1 - e^{-\Delta/\lambda}} \left( \frac{-l^2 e^{-\Delta/\lambda}}{1 + (l-1)e^{-\Delta/\lambda}} + l \right) = \frac{e^{-(r_i+r_j)/\lambda}}{1 - e^{-\Delta/\lambda}} \cdot \frac{l - l e^{-\Delta/\lambda}}{1 + (l-1)e^{-\Delta/\lambda}} = \frac{l e^{-(r_i+r_j)/\lambda}}{1 + (l-1)e^{-\Delta/\lambda}}. \quad (24)
\end{aligned}$$

Now consider the error $\|K_{i,:} - K_{\text{LCN},i,:}\|_\infty$. The $k-1$ nearest neighbors are covered by the sparse correction and therefore the next nearest neighbor has distance $\delta_k$. The expected distance from the closest landmark is greater than the expected distance inside the surrounding $d$-ball of radius $R$, i.e. $E[r] \geq E_{V(R)}[r] = \frac{d}{d+1} R$. Because furthermore $\delta_k \ll R$, the error is dominated by the first term and the maximum error in row $i$ is given by the $k$-nearest neighbor of $i$, denoted by $j$. Thus
$$E[\|K_{i,:} - K_{\text{LCN},i,:}\|_\infty] = E[K_{i,j} - K_{\text{LCN},i,j}] = E[K_{i,j}] - E[K_{\text{LCN},i,j}] = E[e^{-\delta_k/\lambda}] - E[K_{\text{LCN},i,j}].$$
Note that we can lower bound the first term using Jensen's inequality. However, we were unable to find a reasonably tight upper bound, and the resulting integral (ignoring exponentially small boundary effects, see Percus & Martin (1998))
$$E[e^{-\delta_k/\lambda}] = \frac{n!}{(n-k)!(k-1)!} \int_0^{((d/2)!)^{1/d}/\sqrt{\pi}} e^{-r/\lambda} \, V(r)^{k-1} (1 - V(r))^{n-k} \, \frac{dV(r)}{dr} \, dr,$$
with the volume of the $d$-ball $V(r) = \frac{\pi^{d/2} r^d}{(d/2)!}$, does not have an analytical solution. We thus have to resort to calculating this expectation numerically.
We lower bound the second term by (1) ignoring every landmark except the closest one, since additional landmarks can only increase the estimate $K_{\text{LCN},i,j}$. We then (2) upper bound the $L_2$ distance to the closest landmark $r$ by $\sqrt{d}R/2$, since this would be the furthest distance to the closest point in a $d$-dimensional grid. Any optimal arrangement minimizing $E[\min_{y \in L} \|x - y\|_2 \mid x \in X_p]$ would be at least as good as a grid and thus have furthest distances as small or smaller than those in a grid. Thus,
$$E[K_{\text{LCN},i,j}] \overset{(1)}{\geq} e^{-2r/\lambda} \overset{(2)}{\geq} e^{-\sqrt{d}R/\lambda}.$$
We upper bound this expectation by considering that any point outside the inscribed sphere of the space closest to a landmark (which has radius $R$) would be further away from the landmarks and thus have a lower value $e^{-d/\lambda}$. We can therefore reduce the space over which the expectation is taken to the ball with radius $R$, i.e. $E[K_{\text{LCN},i,j}] \leq E_{V(R)}[K_{\text{LCN},i,j}]$. Next we (1) ignore the contributions of all landmarks except for the closest 2, since a third landmark must be further away from the data point than $\sqrt{3}R$, adding an error of $O(e^{-2\sqrt{3}R/\lambda})$. We then (2) lower bound the distances of both points to both landmarks by the closest distance to a landmark $r = \min\{\|x_i - x_{l_1}\|_2, \|x_i - x_{l_2}\|_2, \|x_j - x_{l_1}\|_2, \|x_j - x_{l_2}\|_2\}$ and use Lemma A to obtain
$$E_{V(R)}[K_{\text{LCN},i,j}] \overset{(1)}{=} E_{V(R)}[K_{\text{LCN, 2 landmarks},i,j}] + O(e^{-2\sqrt{3}R/\lambda}) \overset{(2)}{\leq} E_{V(R)}\left[\frac{2 e^{-2r/\lambda}}{1 + e^{-2R/\lambda}}\right] + O(e^{-2\sqrt{3}R/\lambda}) = \frac{2 E_{V(R)}[e^{-2r/\lambda}]}{1 + e^{-2R/\lambda}} + O(e^{-2\sqrt{3}R/\lambda}).$$
Assuming Euclideanness in $V(R)$ we obtain
$$E_{V(R)}[e^{-2r/\lambda}] = \frac{1}{V(R)} \int_0^R e^{-2r/\lambda} \frac{dV(r)}{dr} \, dr = \frac{d}{R^d} \int_0^R e^{-2r/\lambda} r^{d-1} \, dr = \frac{d}{(2R/\lambda)^d} \left( \Gamma(d) - \Gamma(d, 2R/\lambda) \right).$$
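As a quick numeric sanity check on Lemma A, the closed form of Eq. (21) can be compared against a direct Nyström computation $U_{i,:} A^{-1} V_{:,j}$ in the symmetric setting the lemma assumes. The function names below are ours; this is an illustrative check, not part of the proof.

```python
import numpy as np

def lemma_a_closed_form(ri, rj, delta, lam, l):
    # Closed form from Lemma A, Eq. (21).
    return l * np.exp(-(ri + rj) / lam) / (1 + (l - 1) * np.exp(-delta / lam))

def lemma_a_nystrom(ri, rj, delta, lam, l):
    # Direct Nystrom estimate U_{i,:} A^{-1} V_{:,j}: all l landmarks are
    # mutually at distance delta, the two points at distances ri and rj.
    A = np.exp(-delta / lam) * np.ones((l, l)) + (1 - np.exp(-delta / lam)) * np.eye(l)
    u = np.full(l, np.exp(-ri / lam))
    v = np.full(l, np.exp(-rj / lam))
    return u @ np.linalg.solve(A, v)
```

The two agree to machine precision for any valid choice of $r_i$, $r_j$, $\Delta$, $\lambda$, and $l$.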

D PROOF OF THEOREM 2

Note that this theorem does not use probabilistic arguments but rather geometrically analyzes the maximum possible error. $K_{\text{sp}}$ is correct for all pairs inside a cluster and 0 otherwise. We therefore obtain the maximum error by considering the closest possible pair between clusters. By definition, this pair has distance $D$ and thus
$$\max \|K - K_{\text{sp}}\|_\infty = e^{-D/\lambda}. \quad (32)$$
LCN is also correct for all pairs inside a cluster, so we again consider the closest possible pair $x_i, x_j$ between clusters. We furthermore only consider the landmarks of the two concerned clusters, adding an error of $O(e^{-D/\lambda})$. Hence,
$$\begin{aligned}
K_{\text{LCN, 2 landmarks},ij} &= \begin{pmatrix} e^{-r/\lambda} & e^{-(r+D)/\lambda} \end{pmatrix} \begin{pmatrix} 1 & e^{-(2r+D)/\lambda} \\ e^{-(2r+D)/\lambda} & 1 \end{pmatrix}^{-1} \begin{pmatrix} e^{-(r+D)/\lambda} \\ e^{-r/\lambda} \end{pmatrix} \\
&= \frac{1}{1 - e^{-(4r+2D)/\lambda}} \begin{pmatrix} e^{-r/\lambda} & e^{-(r+D)/\lambda} \end{pmatrix} \begin{pmatrix} 1 & -e^{-(2r+D)/\lambda} \\ -e^{-(2r+D)/\lambda} & 1 \end{pmatrix} \begin{pmatrix} e^{-(r+D)/\lambda} \\ e^{-r/\lambda} \end{pmatrix} \\
&= \frac{1}{1 - e^{-(4r+2D)/\lambda}} \begin{pmatrix} e^{-r/\lambda} & e^{-(r+D)/\lambda} \end{pmatrix} \begin{pmatrix} e^{-(r+D)/\lambda} - e^{-(3r+D)/\lambda} \\ e^{-r/\lambda} - e^{-(3r+2D)/\lambda} \end{pmatrix} \\
&= \frac{1}{1 - e^{-(4r+2D)/\lambda}} \left( e^{-(2r+D)/\lambda} - e^{-(4r+D)/\lambda} + e^{-(2r+D)/\lambda} - e^{-(4r+3D)/\lambda} \right) \\
&= \frac{e^{-(2r+D)/\lambda}}{1 - e^{-(4r+2D)/\lambda}} \left( 2 - e^{-2r/\lambda} - e^{-(2r+2D)/\lambda} \right) \\
&= e^{-D/\lambda} e^{-2r/\lambda} (2 - e^{-2r/\lambda}) - O(e^{-2D/\lambda}) \quad (33)
\end{aligned}$$
and thus
$$\max \|K - K_{\text{LCN}}\|_\infty = e^{-D/\lambda} \left( 1 - e^{-2r/\lambda}(2 - e^{-2r/\lambda}) + O(e^{-D/\lambda}) \right). \quad (34)$$
For pure Nyström we need to consider the distances inside a cluster. In the worst case two points overlap, i.e. $K_{ij} = 1$, and lie at the boundary of the cluster. Since $r \ll D$ we again only consider the landmarks in the concerned cluster, adding an error of $O(e^{-D/\lambda})$. Because of symmetry we can optimize the worst-case distance from all landmarks by putting them on an $(l-1)$-simplex centered on the cluster center. Since there are at most $d$ landmarks in each cluster, there is always one direction in which the worst-case points are $r$ away from all landmarks. The circumradius of an $(l-1)$-simplex with side length $\Delta$ is $\sqrt{\frac{l-1}{2l}} \Delta$. Thus, the maximum distance to all landmarks is $\sqrt{r^2 + \frac{l-1}{2l} \Delta^2}$.
Using Lemma A we therefore obtain the Nyström approximation
$$K_{\text{Nys},ij} = \frac{l e^{-2\sqrt{r^2 + \frac{l-1}{2l}\Delta^2}/\lambda}}{1 + (l-1)e^{-\Delta/\lambda}} + O(e^{-D/\lambda}).$$

E NOTE ON THEOREM 3

Lemmas C-F and thus Theorem 1 by Altschuler et al. (2019) are also valid for $Q$ outside the simplex so long as $\|Q\|_1 = n$ and it only has non-negative entries. Any $P$ returned by Sinkhorn fulfills these conditions. Therefore the rounding procedure given by their Algorithm 4 is not necessary for this result. Furthermore, to be more consistent with Theorems 1 and 2, we use the $L_2$ distance instead of $L_2^2$ in this theorem, which only changes the dependence on $\rho$.

F NOTES ON THEOREM 4

To adapt Theorem 1 by Dvurechensky et al. (2018) to sparse matrices (i.e. matrices with some $K_{ij} = 0$) we need to redefine $\nu := \min_{i,j}\{K_{ij} \mid K_{ij} > 0\}$, i.e. take the minimum only w.r.t. the non-zero elements in their Lemma 1.

G PROOF OF PROPOSITION 1

Theorem A (Danskin's theorem). Consider a continuous function $\varphi : \mathbb{R}^k \times Z \to \mathbb{R}$, with the compact set $Z \subset \mathbb{R}^j$. If $\varphi(x, z)$ is convex in $x$ for every $z \in Z$ and $\varphi(x, z)$ has a unique maximizer $\bar{z}$, the derivative of $f(x) = \max_{z \in Z} \varphi(x, z)$ is given by the derivative at the maximizer, i.e.
$$\frac{\partial f}{\partial x} = \frac{\partial \varphi(x, \bar{z})}{\partial x}.$$
We start by deriving the derivatives of the distances. To show that the Sinkhorn distance fulfills the conditions for Danskin's theorem we first identify $x = C$, $z = P$, and $\varphi(C, P) = -\langle P, C \rangle_F + \lambda H(P)$. We next observe that the restrictions $P \mathbf{1}_m = p$ and $P^T \mathbf{1}_n = q$ define a compact, convex set for $P$. Furthermore, $\varphi$ is a continuous function and linear in $C$, i.e. both convex and concave for any finite $P$. Finally, $\varphi(C, P)$ is concave in $P$ since $\langle P, C \rangle_F$ is linear and $\lambda H(P)$ is concave. Therefore the maximizer $\tilde{P}$ is unique and Danskin's theorem applies to the Sinkhorn distance. Using
$$\frac{\partial C_{\text{Nys},ij}}{\partial U_{kl}} = \frac{\partial}{\partial U_{kl}} \left( -\lambda \log \sum_a U_{ia} W_{aj} \right) = \frac{-\lambda \delta_{ik} W_{lj}}{\sum_a U_{ia} W_{aj}} = \frac{-\lambda \delta_{ik} W_{lj}}{K_{\text{Nys},ij}},$$
$$\frac{\partial C_{\text{Nys},ij}}{\partial W_{kl}} = \frac{\partial}{\partial W_{kl}} \left( -\lambda \log \sum_a U_{ia} W_{aj} \right) = \frac{-\lambda \delta_{jl} U_{ik}}{\sum_a U_{ia} W_{aj}} = \frac{-\lambda \delta_{jl} U_{ik}}{K_{\text{Nys},ij}},$$
$$\frac{\tilde{P}_{\text{Nys},ij}}{K_{\text{Nys},ij}} = \frac{\sum_b \tilde{P}_{U,ib} \tilde{P}_{W,bj}}{\sum_a U_{ia} W_{aj}} = \frac{\bar{s}_i \bar{t}_j \sum_b U_{ib} W_{bj}}{\sum_a U_{ia} W_{aj}} = \bar{s}_i \bar{t}_j,$$
and the chain rule we can calculate the derivative w.r.t. the cost matrix as
$$\frac{\partial d^\lambda_c}{\partial C} = -\frac{\partial}{\partial C} \left( -\langle \tilde{P}, C \rangle_F + \lambda H(\tilde{P}) \right) = \tilde{P}, \quad (42)$$
$$\frac{\partial d^\lambda_{\text{LCN},c}}{\partial U_{kl}} = \sum_{i,j} \frac{\partial C_{\text{Nys},ij}}{\partial U_{kl}} \frac{\partial d^\lambda_{\text{LCN},c}}{\partial C_{\text{Nys},ij}} = -\lambda \sum_{i,j} \delta_{ik} W_{lj} \frac{\tilde{P}_{\text{Nys},ij}}{K_{\text{Nys},ij}} = -\lambda \sum_{i,j} \delta_{ik} W_{lj} \bar{s}_i \bar{t}_j = -\lambda \bar{s}_k \sum_j W_{lj} \bar{t}_j = -\lambda \left( \bar{s} (W \bar{t})^T \right)_{kl},$$
$$\frac{\partial d^\lambda_{\text{LCN},c}}{\partial W_{kl}} = \sum_{i,j} \frac{\partial C_{\text{Nys},ij}}{\partial W_{kl}} \frac{\partial d^\lambda_{\text{LCN},c}}{\partial C_{\text{Nys},ij}} = -\lambda \sum_{i,j} \delta_{jl} U_{ik} \frac{\tilde{P}_{\text{Nys},ij}}{K_{\text{Nys},ij}} = -\lambda \sum_{i,j} \delta_{jl} U_{ik} \bar{s}_i \bar{t}_j = -\lambda \sum_i \bar{s}_i U_{ik} \bar{t}_l = -\lambda \left( (\bar{s}^T U)^T \bar{t}^T \right)_{kl}.$$
By the implicit function theorem, the derivative of the implicit function $g$ is given by
$$\frac{\partial g_i}{\partial x_j}(x) = -\left[ J_{f(x,y),y}(x, g(x))^{-1} \right]_{i,:} \left[ \frac{\partial}{\partial x} f(x, g(x)) \right]_{:,j}.$$
Next we derive the transport plan derivatives. To apply the implicit function theorem we identify $x = C$ and $y = \tilde{P}$ as flattened matrices. We will index these flat matrices via index pairs to simplify interpretation.
We furthermore identify $f(C, \tilde{P}) = \frac{\partial}{\partial \tilde{P}} \left( \langle \tilde{P}, C \rangle_F - \lambda H(\tilde{P}) \right) = C + \lambda (\log \tilde{P} + 1)$. The minimizer $\tilde{P}$ cannot lie on the boundary of $\tilde{P}$'s valid region since $\lim_{p \to 0} \frac{\partial}{\partial p} p \log p = \lim_{p \to 0} (\log p + 1) = -\infty$ and therefore $\tilde{P}_{ij} > 0$. Hence, $f(C, \tilde{P}) = 0$ with $\tilde{P}(C) = \arg\min_P \langle P, C \rangle_F - \lambda H(P)$ and we find that $g(x) = \tilde{P}(C)$. We furthermore obtain
$$J_{f_{ij}(C,\tilde{P}), \tilde{P}_{ab}}(C, \tilde{P}(C)) = \frac{\partial}{\partial \tilde{P}_{ab}} \left( C_{ij} + \lambda (\log \tilde{P}_{ij} + 1) \right) = \delta_{ia} \delta_{jb} \, \lambda / \tilde{P}_{ij},$$
$$\frac{\partial}{\partial C_{kl}} f_{ab}(C, \tilde{P}(C)) = \frac{\partial}{\partial C_{kl}} \left( C_{ab} + \lambda (\log \tilde{P}_{ab} + 1) \right) = \delta_{ak} \delta_{bl}.$$
$J_{f_{ij}(C,\tilde{P}), \tilde{P}_{ab}}(C, \tilde{P}(C))$ is hence a diagonal matrix and invertible since $\lambda / \tilde{P}_{ij} > 0$. We can thus use the implicit function theorem and obtain
$$\frac{\partial \tilde{P}_{ij}}{\partial C_{kl}} = -\sum_{a,b} \left[ J_{f_{ij}(C,\tilde{P}), \tilde{P}_{ab}}(C, \tilde{P}(C))^{-1} \right] \frac{\partial}{\partial C_{kl}} f_{ab}(C, \tilde{P}(C)) = -\sum_{a,b} \delta_{ia} \delta_{jb} \frac{\tilde{P}_{ij}}{\lambda} \delta_{ak} \delta_{bl} = -\frac{1}{\lambda} \tilde{P}_{ij} \delta_{ik} \delta_{jl}.$$
To extend this result to LCN-OT we use
$$\frac{\partial \tilde{P}_{U,ij}}{\partial \tilde{P}_{\text{Nys},ab}} = \frac{\partial}{\partial \tilde{P}_{\text{Nys},ab}} \sum_k \tilde{P}_{\text{Nys},ik} P^\dagger_{W,kj} = \delta_{ia} P^\dagger_{W,bj}, \qquad \bar{s}_i U_{ik} \bar{t}_l P^\dagger_{W,lj} = \tilde{P}_{U,ik} \bar{t}_l P^\dagger_{W,lj}.$$
We can calculate the pseudoinverses $P^\dagger_U = (\tilde{P}_U^T \tilde{P}_U)^{-1} \tilde{P}_U^T$ and $P^\dagger_W = \tilde{P}_W^T (\tilde{P}_W \tilde{P}_W^T)^{-1}$ in time $O((n+m)l^2 + l^3)$ since $\tilde{P}_U \in \mathbb{R}^{n \times l}$ and $\tilde{P}_W \in \mathbb{R}^{l \times m}$. We do not fully instantiate the matrices required for backpropagation but instead save their decompositions, similar to the transport plan $\tilde{P}_{\text{Nys}} = \tilde{P}_U \tilde{P}_W$. We can then compute backpropagation in time $O((n+m)l^2)$ by applying the sums over $i$ and $j$ in the right order. We thus obtain $O((n+m)l^2 + l^3)$ overall runtime for backpropagation.
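The key identity $\partial \tilde{P}_{ij} / \partial C_{kl} = -\tilde{P}_{ij} \delta_{ik} \delta_{jl} / \lambda$ can be checked by finite differences on the unconstrained entropic minimizer $\tilde{P}(C) = e^{-C/\lambda - 1}$ (obtained by solving $f(C, \tilde{P}) = 0$ above). This is an illustrative numeric sketch, not the paper's implementation:

```python
import numpy as np

def plan(C, lam):
    # Unconstrained entropic minimizer of <P, C>_F - lam * H(P),
    # i.e. the solution of C + lam * (log P + 1) = 0.
    return np.exp(-C / lam - 1.0)

lam, eps = 0.8, 1e-6
C = np.array([[0.3, 1.1], [0.7, 0.2]])
P = plan(C, lam)
dC = np.zeros_like(C)
dC[0, 1] = eps                      # perturb a single cost entry C_01
fd = (plan(C + dC, lam) - P) / eps  # finite-difference derivative of P
analytic = np.zeros_like(C)
analytic[0, 1] = -P[0, 1] / lam     # -P_ij * delta_ik * delta_jl / lam
```

Only the perturbed entry changes, and its derivative matches $-\tilde{P}_{01}/\lambda$ up to finite-difference error.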

H CHOOSING LSH NEIGHBORS AND NYSTRÖM LANDMARKS

We focus on two LSH methods for obtaining near neighbors. Cross-polytope LSH (Andoni et al., 2015) uses a random projection matrix $R \in \mathbb{R}^{d \times b/2}$ with the number of hash buckets $b$, and then decides on the hash bucket via $h(x) = \arg\max([x^T R \mathbin{\Vert} -x^T R])$, where $\mathbin{\Vert}$ denotes concatenation. K-means LSH computes k-means and uses the clusters as hash buckets. We further improve the sampling probabilities of cross-polytope LSH via the AND-OR construction. In this scheme we calculate $B \cdot r$ hash functions, divided into $B$ sets (hash bands) of $r$ hash functions (Paulevé et al., 2010; Nistér & Stewénius, 2006). Since the graph transport network (GTN) uses the $L_2$ distance between embeddings as a cost function, we use (hierarchical) k-means LSH and k-means Nyström in both sparse OT and LCN-OT. For embedding alignment we use cross-polytope LSH for sparse OT, since similarities are measured via the dot product. For LCN-OT we found that k-means LSH works better with Nyström using k-means++ sampling than cross-polytope LSH does. This is most likely due to a better alignment between LSH samples and Nyström. To use k-means with dot product similarity we convert the cosine similarity to a distance via $d_{\cos}(x_p, x_q) = 1 - \frac{x_p^T x_q}{\|x_p\|_2 \|x_q\|_2}$ (Berg et al., 1984). Note that this is actually based on cosine similarity, not the dot product. Due to the balanced nature of OT we found this more sensible than maximum inner product search (MIPS). For both experiments we also experimented with uniform and recursive RLS sampling but found that the above mentioned methods work better.
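The cross-polytope hash described above can be sketched directly from its definition; the function name and parameters below are illustrative, and a production version would additionally apply the AND-OR banding:

```python
import numpy as np

def cross_polytope_hash(X, b, seed=0):
    """Hash each row x to argmax([x^T R, -x^T R]) with R in R^{d x b/2},
    yielding one of b buckets."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], b // 2))
    proj = X @ R
    # Concatenating proj and -proj covers all 2 * (b/2) = b polytope vertices.
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)
```

By construction, a point and its negation land in "opposite" buckets, $h(-x) = (h(x) + b/2) \bmod b$, while nearby points tend to share a bucket.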

I IMPLEMENTATIONAL DETAILS

Our implementation runs in batches on a GPU via PyTorch (Paszke et al., 2019) and PyTorch Scatter (Fey & Lenssen, 2019). To avoid over- and underflows we use log-stabilization throughout, i.e. we save all values in log-space and compute all matrix-vector products and additions via the log-sum-exp trick $\log \sum_i e^{x_i} = \max_j x_j + \log \sum_i e^{x_i - \max_j x_j}$. Since the matrix $A$ is small, we compute its inverse using double precision to improve stability. Surprisingly, we did not observe any benefit from using the Cholesky decomposition or from not calculating $A^{-1}$ and instead solving the equation $B = AX$ for $X$. We furthermore precompute $W = A^{-1} V$ to avoid unnecessary operations. We use 3 layers and an embedding size $H_N = 32$ for GTN. The MLPs use a single hidden layer, biases, and LeakyReLU non-linearities. The single-head MLP uses an output size of $H_{N,\text{match}} = H_N$ and a hidden embedding size of $4 H_N$, i.e. the same as the concatenated node embedding, and the multi-head MLP uses a hidden embedding size of $H_N$. To stabilize initial training we scale the node embeddings by $\frac{d}{n \sqrt{H_{N,\text{match}}}}$ directly before calculating OT. $d$ denotes the average graph distance in the training set, $n$ the average number of nodes per graph, and $H_{N,\text{match}}$ the matching embedding size, i.e. 32 for single-head and 128 for multi-head OT.
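The log-stabilized matrix-vector product at the heart of the Sinkhorn updates can be sketched as follows (a minimal numpy stand-in for the batched PyTorch version; the function name is ours):

```python
import numpy as np

def log_matvec(logK, logt):
    """Row-wise log(K @ t) computed entirely in log-space via log-sum-exp."""
    x = logK + logt[None, :]          # log of the elementwise products K_ij * t_j
    m = x.max(axis=1, keepdims=True)  # row-wise max for stabilization
    return (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True))).ravel()
```

Subtracting the row-wise maximum before exponentiating keeps every exponent at most 0, so the computation stays finite even when all entries of $\log K$ are far below the underflow threshold of $e^x$.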

J GRAPH DATASET GENERATION AND EXPERIMENTAL DETAILS

The dataset statistics are summarized in Table 5. Each dataset contains the distances between all graph pairs in each split, i.e. 10 296 and 1128 distances for preferential attachment. The AIDS dataset was generated by randomly sampling graphs with at most 30 nodes from the original AIDS dataset (Riesen & Bunke, 2008). Since not all node types are present in the training set and our choice of GED is permutation-invariant w.r.t. types, we permuted the node types so that there are no previously unseen types in the validation and test sets. For the preferential attachment datasets we first generated 12, 4, and 4 undirected "seed" graphs (for train, val, and test) via the initial attractiveness model with randomly chosen parameters: 1 to 5 initial nodes, initial attractiveness of 0 to 4, and between n/2 and 3n/2 total nodes, where n is the average number of nodes (20, 200, 2000, and 20 000). We then label every node (and edge) in these graphs uniformly at random. To obtain the remaining graphs we edit the "seed" graphs between n/40 and n/20 times by randomly adding, type-editing, or removing nodes and edges. Editing nodes and edges is 4x and adding/deleting edges 3x as likely as adding/deleting nodes. Most of these numbers were chosen arbitrarily, aiming to achieve a somewhat reasonable dataset and process. We found that the process of first generating seed graphs and subsequently editing them is crucial for obtaining meaningfully structured data to learn from. For the GED we choose an edit cost of 1 for changing a node or edge type and 2 for adding or deleting a node or an edge. We represent node and edge types as one-hot vectors. We train all models with the Adam optimizer and mean squared error (MSE) loss for up to 300 epochs, reducing the learning rate by a factor of 10 every 100 steps; the exceptions are SiamMPNN, which uses SGD, and GTN on Linux, which we train for up to 1000 epochs while reducing the learning rate by a factor of 2 every 100 steps.
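The initial attractiveness model used for the seed graphs can be sketched as follows. This is an illustrative sketch under our reading of the model (new node attaches to an existing node with probability proportional to degree plus a constant attractiveness); function and parameter names are ours, not the paper's generation code:

```python
import random

def initial_attractiveness_graph(n_total, n_init=3, attractiveness=2, seed=0):
    """Grow a graph where each new node attaches to an existing node chosen
    with probability proportional to (degree + initial attractiveness)."""
    rng = random.Random(seed)
    edges = [(i, j) for i in range(n_init) for j in range(i)]  # initial clique
    degree = [max(n_init - 1, 1)] * n_init
    for v in range(n_init, n_total):
        weights = [deg + attractiveness for deg in degree]
        u = rng.choices(range(v), weights=weights, k=1)[0]
        edges.append((u, v))
        degree[u] += 1
        degree.append(1)
    return edges
```

Setting the attractiveness to 0 recovers plain preferential attachment, while larger values make the degree distribution more uniform.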
We use the parameters from the best epoch based on the validation set. We choose hyperparameters for all models using multiple steps of grid search on the validation set; see Tables 6 to 8 for the final values. We use the originally published result of SimGNN on Linux and thus don't provide its hyperparameters. GTN uses 500 Sinkhorn iterations. We obtain the final entropy regularization parameter from $\lambda_{\text{base}}$ via $\lambda = \lambda_{\text{base}} \frac{d}{n} \frac{1}{\log n}$, where $d$ denotes the average graph distance and $n$ the average number of nodes per graph in the training set. The factor $d/n$ serves to estimate the embedding distance scale and $1/\log n$ counteracts the entropy scaling with $n \log n$. Note that the entropy regularization parameter was small, but always far from 0, which shows that entropy regularization actually has a positive effect. The relative runtime difference between sparse and low-rank approximations decreases as we increase the number of neighbors/landmarks. This gap could either be due to details in low-level CUDA implementations and hardware or to the fact that low-rank approximations require 2x as many multiplications for the same number of neighbors/landmarks. In either case, both Table 9 and Fig. 5 show that the runtimes of all approximations scale linearly both in the dataset size and in the number of neighbors and landmarks, while full Sinkhorn scales quadratically. We furthermore investigate whether GTN with approximate Sinkhorn indeed scales log-linearly with the graph size by generating preferential attachment graphs with 200, 2000, and 20 000 nodes (±50 %). We use the pyramid matching (PM) kernel (Nikolentzos et al., 2017) as prediction target. Fig. 6 shows that the runtime of LCN-Sinkhorn scales almost linearly (dashed line) and regular full Sinkhorn quadratically (dash-dotted line) with the number of nodes, despite both achieving similar accuracy and LCN using slightly more neighbors and landmarks on larger graphs to sustain good accuracy. Full Sinkhorn went out of memory for the largest graphs.

L DISTANCE APPROXIMATION

Figs.
7 and 8 show that for the chosen λ = 0.05 sparse Sinkhorn offers the best trade-off between computational budget and distance approximation, with LCN-Sinkhorn and multiscale OT coming in second. Factored OT is again multiple times slower than the other methods and thus not included in Fig. 7. Note that $d^\lambda_c$ can be negative due to the entropy offset. This picture changes as we increase the regularization. For higher regularizations LCN-Sinkhorn is the most precise at constant computational budget (number of neighbors/landmarks), as shown in Fig. 9. Note that the crossover points in this figure roughly coincide with those in Fig. 4. Keep in mind that in most cases the OT plan is more important than the raw distance approximation, since it determines the training gradient, and tasks like embedding alignment don't use the distance at all. This becomes evident in the fact that sparse Sinkhorn achieves a better distance approximation than LCN-Sinkhorn but performs worse in both downstream tasks investigated in Sec. 7.



Figure 1: The proposed methods (sparse and LCN-Sinkhorn) show a clear correlation with the full Sinkhorn transport plan, as opposed to previous methods. Entries of approximations (y-axis) and full Sinkhorn (x-axis) for pre-aligned word embeddings (EN-DE). Color denotes sample density.

Figure 2: Tradeoff between OT plan approximation (via PCC) and runtime. Sparse Sinkhorn offers the best tradeoff, with LCN-Sinkhorn trailing closely behind. The arrow indicates factored OT results far outside the range.

Figure 4: OT plan approximation quality for varying entropy regularization λ. Sparse Sinkhorn performs best for low λ, LCN-Sinkhorn for moderate λ, and factored OT for very high λ.

In expectation we obtain a constant number of neighbors $\beta$ when choosing $b^r = \min(n, m)/\beta$. We thus obtain a sparse cost matrix $C_{\text{sp}}$ with $O(\max(n, m)\beta)$ non-infinite values and can calculate $s$ and $t$ in linear time $O(N_{\text{sink}} \max(n, m)\beta)$, where
$$N_{\text{sink}} \leq 2 - \frac{4 \ln\left( \min_{i,j}\{\tilde{K}_{ij} \mid \tilde{K}_{ij} > 0\} \cdot \min_{i,j}\{p_i, q_j\} \right)}{\varepsilon}$$
(see Theorem 4) denotes the number of Sinkhorn iterations. Calculating the hash buckets with $r = \frac{\log \min(n,m) - \log \beta}{\log b}$ takes $O((n+m)dBb(\log \min(n, m) - \log \beta)/\log b)$. Since $B$, $b$, and $\beta$ are small, we obtain roughly log-linear scaling with the number of points overall, i.e. $O(n \log n)$ for $n \approx m$.
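The linear-time Sinkhorn updates above only touch the kernel through matrix-vector products. The minimal sketch below makes this explicit (names are ours; the dense test kernel is only for illustration, whereas in practice K would be a scipy.sparse matrix built from the LSH neighbor pairs, making each iteration cost O(nnz(K))):

```python
import numpy as np

def sparse_sinkhorn(K, p, q, n_iter=100):
    """Sinkhorn iterations that access K only via matvecs, so K may be a
    scipy.sparse matrix with zeros where the cost is treated as infinite.
    Assumes every row and column of K has at least one non-zero entry."""
    s = np.ones_like(p)
    for _ in range(n_iter):
        t = q / (K.T @ s)
        s = p / (K @ t)
    return s, t  # transport plan: diag(s) @ K @ diag(t)
```

After the final update of s the row marginals are satisfied exactly, and the column marginals converge with the number of iterations.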

Theorem B (Implicit function theorem). Let $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be a continuously differentiable function with $f(a, b) = 0$. If its Jacobian matrix $J_{f_i(x,y),y_j} = \frac{\partial f_i}{\partial y_j}(a, b)$ is invertible, then there exists an open set $U \subset \mathbb{R}^n$ with $a \in U$ on which there exists a unique continuously differentiable function $g : U \to \mathbb{R}^m$ with $g(a) = b$ and $\forall x \in U: f(x, g(x)) = 0$. Moreover,
$$\frac{\partial g_i}{\partial x_j}(x) = -\left[ J_{f(x,y),y}(x, g(x))^{-1} \right]_{i,:} \left[ \frac{\partial}{\partial x} f(x, g(x)) \right]_{:,j}.$$

Figure 7: Sinkhorn distance approximation at different runtimes (varied via the number of neighbors/landmarks). The dashed line denotes the true Sinkhorn distance. Sparse Sinkhorn consistently performs best. The arrow indicates factored OT results far outside the depicted range.

Figure 8: Sinkhorn distance approximation for varying computational budget. The dashed line denotes the true Sinkhorn distance. Sparse Sinkhorn mostly performs best, with LCN-Sinkhorn coming in second. Factored OT performs well with very few landmarks.

Figure 9: Sinkhorn distance approximation for varying entropy regularization λ at constant computational budget. Sparse Sinkhorn performs best for low λ, LCN-Sinkhorn for moderate and high λ, and factored OT for high λ.

Mean and standard deviation (w.r.t. last digits, in parentheses)  of relative Sinkhorn distance error, IoU of top 0.1 % and correlation coefficient (PCC) of OT plan entries across 5 runs. Sparse Sinkhorn and LCN-Sinkhorn consistently achieve the best approximation in all 3 measures.

Accuracy and standard deviation (w.r.t. last digits, in parentheses) across 5 runs for unsupervised word embedding alignment with Wasserstein Procrustes. LCN-Sinkhorn improves upon the original by 3.1 pp. before and 2.0 pp. after iterative CSLS refinement. *Migrated and re-run on GPU via PyTorch

RMSE for GED regression across 3 runs and the targets' standard deviation σ. GTN outperforms previous models by 48 %.

RMSE for graph distance regression across 3 runs. Using LCN-Sinkhorn with GTN increases the error by only 10 % and allows log-linear scaling.

Graph dataset statistics.

Hyperparameters for the Linux dataset.

Hyperparameters for the preferential attachment GED dataset.


Table 9: Runtimes (ms) of Sinkhorn approximations for EN-DE embeddings at different dataset sizes. Full Sinkhorn scales quadratically, while all approximations scale at most linearly with the size. Sparse approximations are 2-4x faster than low-rank approximations, and factored OT is multiple times slower due to its iterative refinement scheme. Note that similarity matrix computation time (K) primarily depends on the LSH/Nyström method, not the OT approximation. For LCN-OT we use roughly 10 neighbors for LSH (20 k-means clusters) and 10 k-means landmarks for Nyström on pref. att. 200. We double these numbers for pure Nyström Sinkhorn, sparse OT, and multiscale OT. For pref. att. 2k we use around 15 neighbors (10 • 20 hierarchical clusters) and 15 landmarks, and for pref. att. 20k we use roughly 30 neighbors (10 • 10 • 10 hierarchical clusters) and 20 landmarks. The number of neighbors for the 20k dataset is higher and strongly varies per iteration due to the unbalanced nature of hierarchical k-means. This increase in neighbors and landmarks and PyTorch's missing support for ragged tensors largely explains LCN-OT's deviation from perfectly linear runtime scaling. We perform all runtime measurements on a compute node using one Nvidia GeForce GTX 1080 Ti, two Intel Xeon E5-2630 v4, and 256 GB RAM.

K RUNTIMES

Table 9 compares the runtime of the full Sinkhorn distance with different approximation methods using 40 neighbors/landmarks. We separate the computation of the approximate K from the optimal transport computation (Sinkhorn iterations), since the former primarily depends on the LSH and Nyström methods we choose. We observe a 2-4x speed difference between sparse (multiscale OT and sparse Sinkhorn) and low-rank approximations (Nyström Sinkhorn and LCN-Sinkhorn), while factored OT is multiple times slower due to its iterative refinement scheme. In Fig. 5 we observe that this runtime gap stays constant independent of the number of neighbors/landmarks, i.e. the relative difference decreases as the number of neighbors/landmarks grows.

