WARPSPEED COMPUTATION OF OPTIMAL TRANSPORT, GRAPH DISTANCES, AND EMBEDDING ALIGNMENT

Abstract

Optimal transport (OT) is a cornerstone of many machine learning tasks. The current best practice for computing OT is via entropy regularization and Sinkhorn iterations. This algorithm runs in quadratic time and requires calculating the full pairwise cost matrix, which is prohibitively expensive for large sets of objects. To alleviate this limitation, we propose to instead use a sparse approximation of the cost matrix based on locality-sensitive hashing (LSH). Moreover, we fuse this sparse approximation with the Nyström method, resulting in the locally corrected Nyström method (LCN). These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even in complex, high-dimensional spaces. We thoroughly demonstrate these advantages via a theoretical analysis and by evaluating multiple approximations both directly and as a component of two real-world models. Using approximate Sinkhorn for unsupervised word embedding alignment enables us to train the model full-batch in a fraction of the time while improving upon the original by 3.1 percentage points on average, without any model changes. For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn and outcompetes previous models by 48%. LCN-Sinkhorn enables GTN to achieve this while still scaling log-linearly in the number of nodes.

1. INTRODUCTION

Measuring the distance between two distributions or sets of objects is a central problem in machine learning. One common method of solving this is optimal transport (OT). OT is concerned with the problem of finding the transport plan for moving a source distribution (e.g. a pile of earth) to a sink distribution (e.g. a construction pit) with the cheapest cost w.r.t. some pointwise cost function (e.g. the Euclidean distance). The advantages of this method have been shown numerous times, e.g. in generative modelling (Arjovsky et al., 2017; Bousquet et al., 2017; Genevay et al., 2018), loss functions (Frogner et al., 2015), set matching (Wang et al., 2019), or domain adaptation (Courty et al., 2017). Motivated by this, many different methods for accelerating OT have been proposed in recent years (Indyk & Thaper, 2003; Papadakis et al., 2014; Backurs et al., 2020). However, most of these approaches are specialized methods that do not generalize to modern deep learning models, which rely on dynamically changing high-dimensional embeddings. In this work we aim to make OT computation for point sets more scalable by proposing two fast and accurate approximations of entropy-regularized optimal transport: sparse Sinkhorn and LCN-Sinkhorn, the latter relying on our newly proposed locally corrected Nyström (LCN) method. Sparse Sinkhorn uses a sparse cost matrix to leverage the fact that in entropy-regularized OT (also known as the Sinkhorn distance) (Cuturi, 2013) often only each point's nearest neighbors influence the result. LCN-Sinkhorn extends this approach by leveraging LCN, a general similarity matrix approximation that fuses local (sparse) and global (low-rank) approximations, allowing us to simultaneously capture both kinds of behavior. LCN-Sinkhorn thus fuses sparse Sinkhorn and Nyström-Sinkhorn (Altschuler et al., 2019). Both sparse Sinkhorn and LCN-Sinkhorn run in log-linear time.
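The nearest-neighbor sparsity underlying sparse Sinkhorn can be illustrated with a minimal random-hyperplane LSH sketch in NumPy. This is our own simplified stand-in, not the paper's implementation: the hash family, function names, and parameters here are illustrative, and no sparse-matrix machinery is shown. Points hashed to the same bucket form the support of the sparse cost matrix; all other entries are treated as having infinite cost (zero similarity).

```python
import numpy as np
from collections import defaultdict

def lsh_candidate_pairs(Xp, Xq, n_planes=8, seed=0):
    """Bucket points via random-hyperplane LSH and return the index pairs
    that share a bucket. Only these pairs receive an entry in the sparse
    cost matrix; all other entries are treated as infinite cost."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(Xp.shape[1], n_planes))
    # Hash code = sign pattern of the projections onto the random hyperplanes.
    hp = Xp @ planes > 0
    hq = Xq @ planes > 0
    buckets = defaultdict(lambda: ([], []))
    for i, h in enumerate(map(tuple, hp)):
        buckets[h][0].append(i)
    for j, h in enumerate(map(tuple, hq)):
        buckets[h][1].append(j)
    # Cartesian product within each bucket gives the sparse support.
    return [(i, j) for rows, cols in buckets.values()
            for i in rows for j in cols]
```

Since nearby points are likely to fall on the same side of each random hyperplane, they tend to share a bucket, so the sparse support concentrates on each point's near neighbors while its size stays far below n·m.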
We theoretically analyze these approximations and show that sparse corrections can lead to significant improvements over the Nyström approximation. We furthermore validate these approximations by showing that they are able to reproduce both the Sinkhorn distance and transport plan significantly better than previous methods across a wide range of regularization parameters and computational budgets (as e.g. demonstrated in Fig. 1). We then show the impact of these improvements by employing Sinkhorn approximations end-to-end in two high-impact machine learning tasks. First, we incorporate them into Wasserstein Procrustes for word embedding alignment (Grave et al., 2019). LCN-Sinkhorn improves upon the original method's accuracy by 3.1 percentage points using a third of the training time, without any further model changes. Second, we develop the graph transport network (GTN), which combines graph neural networks (GNNs) with optimal transport, and further improve it via learnable unbalanced OT and multi-head OT. GTN with LCN-Sinkhorn is the first model that both overcomes the bottleneck of using a single embedding per graph and scales log-linearly in the number of nodes. In summary, our paper's main contributions are:

• Locally corrected Nyström (LCN), a flexible, log-linear time approximation for similarity matrices, leveraging both local (sparse) and global (low-rank) approximations.

• Entropy-regularized optimal transport (a.k.a. Sinkhorn distance) with log-linear runtime via sparse Sinkhorn and LCN-Sinkhorn. These are the first log-linear approximations that are stable enough to substitute full entropy-regularized OT in models that leverage high-dimensional spaces.

• The graph transport network (GTN), which combines a graph neural network (GNN) with multi-head unbalanced LCN-Sinkhorn. GTN both sets the state of the art on graph distance regression and still scales log-linearly in the number of nodes.
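To make the LCN idea concrete, the following is a minimal NumPy sketch of our own, not the paper's implementation: a Nyström low-rank approximation of a Gaussian similarity matrix is locally corrected by replacing the entries on each source point's nearest neighbors with their exact values. Uniform landmark subsampling and exact k-NN (standing in for LSH) are simplifying assumptions, and the result is kept dense here for clarity, whereas a real implementation would store the correction sparsely.

```python
import numpy as np

def gaussian_kernel(A, B, lam):
    # K_ij = exp(-||a_i - b_j||^2 / lam)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / lam)

def lcn_approximation(Xp, Xq, lam, n_landmarks=10, k=5, seed=0):
    """Locally corrected Nyström: a low-rank Nyström approximation whose
    entries on each source point's k nearest neighbors are replaced by
    the exact kernel values (a sparse, local correction)."""
    rng = np.random.default_rng(seed)
    # Global part: Nyström approximation K ≈ U W^+ V with random landmarks.
    L = Xp[rng.choice(len(Xp), n_landmarks, replace=False)]
    U = gaussian_kernel(Xp, L, lam)
    W = gaussian_kernel(L, L, lam)
    V = gaussian_kernel(L, Xq, lam)
    K_lcn = U @ np.linalg.pinv(W) @ V
    # Local part: overwrite the Nyström estimate with the exact kernel
    # values on each source point's k nearest neighbors (exact k-NN
    # stands in for LSH here; kept dense for simplicity).
    d2 = ((Xp[:, None, :] - Xq[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(Xp)), k)
    cols = nn.ravel()
    K_lcn[rows, cols] = np.exp(-d2[rows, cols] / lam)
    return K_lcn
```

The Nyström term captures the global, smoothly varying part of the similarity matrix, while the sparse correction restores the large entries between close point pairs that a low-rank factorization cannot represent.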

2. SPARSE SINKHORN

Entropy-regularized optimal transport. In this work we focus on optimal transport between two discrete sets of points. We furthermore add entropy regularization, which enables fast computation and often performs better than regular OT (Cuturi, 2013). Formally, given two categorical distributions modelled via the vectors $p \in \mathbb{R}^n$ and $q \in \mathbb{R}^m$, supported on two sets of points $X_p = \{x_{p_1}, \dots, x_{p_n}\}$ and $X_q = \{x_{q_1}, \dots, x_{q_m}\}$ in $\mathbb{R}^d$, and the cost function $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (e.g. the squared $L_2$ distance) giving rise to the cost matrix $C_{ij} = c(x_{p_i}, x_{q_j})$, we aim to find the Sinkhorn distance $d_c^\lambda$ and the associated optimal transport plan $\bar{P}$ (Cuturi, 2013):

    $d_c^\lambda = \min_P \langle P, C \rangle_F - \lambda H(P), \quad \text{s.t.} \quad P \mathbf{1}_m = p, \; P^T \mathbf{1}_n = q,$   (1)

with the Frobenius inner product $\langle \cdot, \cdot \rangle_F$ and the entropy $H(P) = -\sum_{ij} P_{ij} \log P_{ij}$. Note that $d_c^\lambda$ includes the entropy and can thus be negative, while Cuturi (2013) originally used $d_c^{1/\lambda} = \langle \bar{P}, C \rangle_F$. This optimization problem can be solved by finding the vectors $\bar{s}$ and $\bar{t}$ that normalize the columns and rows of the matrix $\bar{P} = \operatorname{diag}(\bar{s}) K \operatorname{diag}(\bar{t})$ with the similarity matrix $K_{ij} = e^{-C_{ij}/\lambda}$, so that $\bar{P} \mathbf{1}_m = p$ and $\bar{P}^T \mathbf{1}_n = q$. This is usually achieved via the Sinkhorn algorithm, which initializes the normalization vectors as $s^{(1)} = \mathbf{1}_n$ and $t^{(1)} = \mathbf{1}_m$ and then updates them alternatingly via

    $s^{(i)} = p \oslash (K t^{(i-1)}), \qquad t^{(i)} = q \oslash (K^T s^{(i)}),$   (2)

until convergence, where $\oslash$ denotes elementwise division.

Sparse Sinkhorn. The Sinkhorn algorithm is faster than non-regularized EMD algorithms, which run in $O(n^2 m \log n \log(n \max(C)))$ (Tarjan, 1997). However, its computational cost is still quadratic

Figure 1: The proposed methods (sparse and LCN-Sinkhorn) show a clear correlation with the full Sinkhorn transport plan, as opposed to previous methods. Entries of approximations (y-axis) and full Sinkhorn (x-axis) for pre-aligned word embeddings (EN-DE). Color denotes sample density.
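The Sinkhorn updates of Eq. (2) can be sketched in a few lines of NumPy. This is a dense, full-matrix illustration for reference only; the function name and fixed iteration count are our own choices, and a practical implementation would add a convergence check and log-domain stabilization.

```python
import numpy as np

def sinkhorn(p, q, C, lam, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations.

    p: source marginal (n,), q: sink marginal (m,),
    C: cost matrix (n, m), lam: entropy regularization strength.
    Returns the transport plan P = diag(s) K diag(t)."""
    K = np.exp(-C / lam)              # similarity matrix K_ij = exp(-C_ij / lam)
    s, t = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iters):
        s = p / (K @ t)               # enforce row marginals:    P 1_m = p
        t = q / (K.T @ s)             # enforce column marginals: P^T 1_n = q
    return s[:, None] * K * t[None, :]
```

Each iteration costs one multiplication with K and one with its transpose, which is why replacing the dense K by a sparse or low-rank approximation directly yields a faster algorithm.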

