A HIGHER PRECISION ALGORITHM FOR COMPUTING THE 1-WASSERSTEIN DISTANCE

Abstract

We consider the problem of computing the 1-Wasserstein distance W(µ, ν) between two d-dimensional discrete distributions µ and ν whose supports lie within the unit hypercube. Several algorithms estimate W(µ, ν) within an additive error of ε. When W(µ, ν) is small, however, the additive error ε dominates, leading to noisy results. Consider any additive-approximation algorithm with execution time T(n, ε). We propose an algorithm that runs in O(T(n, ε/d) log_{√d/ε} n) time and boosts the accuracy of estimating W(µ, ν) from an additive error of ε to an expected additive error of min{ε, (d log_{√d/ε} n)·W(µ, ν)}. For the special case where every point in the support of µ and ν has a mass of 1/n (also called the Euclidean Bipartite Matching problem), we describe an algorithm that boosts the accuracy of any additive-approximation algorithm from ε to an expected additive error of min{ε, (d log log n)·W(µ, ν)} in O(T(n, ε/d) log log n) time.

1. INTRODUCTION

Given two discrete probability distributions µ and ν whose supports A and B, respectively, lie inside the d-dimensional unit hypercube [0, 1]^d with max{|A|, |B|} = n, the 1-Wasserstein distance W(µ, ν) (also called the Earth Mover's distance) between them is the minimum cost required to transport mass from ν to µ under the Euclidean metric. The special case where |A| = |B| = n and the mass at each point of A ∪ B is 1/n is called the Euclidean Bipartite Matching (EBM) problem. In machine-learning applications, one can improve a model µ by using its Earth Mover's distance from a distribution ν built on real data. Consequently, it has been extensively used in generative models (Deshpande et al.).



Computing the 1-Wasserstein distance between µ and ν can be modeled as a linear program and solved in O(n^3 log n) time (Edmonds & Karp (1972); Orlin (1988)), which is computationally expensive. There has been substantial effort on designing ε-additive-approximation algorithms that estimate W(µ, ν) within an additive error of ε in n^2 poly(d, log n, 1/ε) time (Cuturi (2013); Lin et al. (2019); Lahn et al. (2019)). When W(µ, ν) is significantly smaller than ε, however, the cost produced by such algorithms is unreliable, as it is dominated by the error parameter ε. To obtain higher accuracy in this case, for α > 0, one can compute an α-relative approximation of the 1-Wasserstein distance, i.e., a cost w that satisfies W(µ, ν) ≤ w ≤ αW(µ, ν). There has been considerable effort on designing relative-approximation algorithms; however, many such methods suffer from the curse of dimensionality, i.e., their execution time grows exponentially in d. Furthermore, they rely on fairly involved data structures that have good asymptotic execution times but are slow in practice and difficult to implement, making them impractical (Agarwal & Sharathkumar (2014); Fox & Lu (2020); Agarwal et al. (2022a;b)). The only exception is a classical greedy algorithm, based on a d-dimensional quadtree, that returns an O(d log n)-relative approximation of the 1-Wasserstein distance in O(nd) time. It has been used in various machine-learning and computer-vision applications (Gupta et al. (2010); Backurs et al. (2020)). In the case of the Euclidean Bipartite Matching, Agarwal & Varadarajan (2004) and Indyk (2007) generalized the algorithm in the hierarchical framework to achieve a relative approximation ratio of O(d^2 log(1/ε)) in Õ(n^{1+ε}) time. In this paper, we design an algorithm that combines any additive-approximation algorithm with the hierarchical quadtree-based framework. As a result, our algorithm achieves better guarantees for both additive and relative approximations. To our knowledge, this is the first result that combines the power of additive and relative approximation techniques, leading to improvements in both settings.

1.1. PROBLEM DEFINITION

We are given two discrete distributions µ and ν. Let A and B be the points in the support of µ and ν, respectively. For the distribution µ (resp. ν), suppose each point a ∈ A (resp. b ∈ B) has a probability µ_a (resp. ν_b) associated with it, where ∑_{a∈A} µ_a = ∑_{b∈B} ν_b = 1. Let G(A, B) denote the complete bipartite graph where, for any pair of points a ∈ A and b ∈ B, there is an edge from a to b of cost ∥a − b∥, i.e., the Euclidean distance between a and b. For each point a ∈ A (resp. b ∈ B), we assign a weight η(a) = −µ_a (resp. η(b) = ν_b). We refer to any point v ∈ A ∪ B with a negative (resp. positive) weight as a demand point (resp. supply point) with a demand (resp. supply) of |η(v)|. Given any subset of points V ⊆ A ∪ B, the weight η(V) is simply the sum of the weights of its points, i.e., η(V) = ∑_{v∈V} η(v). For any edge (a, b) ∈ A × B, the cost of transporting a supply of β from b to a is β∥a − b∥. Our goal is to transport all supplies from supply points to demand points at the minimum cost.

More formally, a transport plan is a function σ : A × B → R_{≥0} that assigns a non-negative value to each edge of G(A, B), indicating the quantity of supplies transported along the edge. The transport plan σ is such that the total supply transported into (resp. out of) any demand (resp. supply) point a ∈ A (resp. b ∈ B) equals −η(a) (resp. η(b)). The cost of the transport plan σ, denoted by w(σ), is given by ∑_{(a,b)∈A×B} σ(a, b)∥a − b∥. The goal of this problem is to find a minimum-cost transport plan.

If two points a ∈ A and b ∈ B are co-located (i.e., they share the same coordinates) and η(b) = −η(a), then, by the metric property of Euclidean distances, we can match the supplies to the demands at zero cost and remove the points from the input. Otherwise, if η(b) ≠ −η(a), we replace the two points with a single point of weight η(a) + η(b). By definition, if the weight of the newly created point is negative (resp. positive), we consider it a demand (resp. supply) point. In our presentation, we always consider A and B to be the point sets obtained after replacing all co-located points. Observe that, after removing the co-located points, the total supply U = η(B) may be less than 1. However, it is easy to see that η(B) = −η(A), i.e., the problem instance defined on A ∪ B is balanced. We say that a transport plan σ is an ε-close transport plan if w(σ) ≤ W(µ, ν) + εU.

In many applications, the distributions µ and ν are continuous or large (possibly unknown) discrete distributions. In such cases, it might be computationally expensive or even impossible to compute W(µ, ν). Instead, one can draw two sets A and B of n samples each from µ and ν, respectively. Each point a ∈ A (resp. b ∈ B) is assigned a weight of η(a) = −1/n (resp. η(b) = 1/n). One can approximate the 1-Wasserstein distance between µ and ν by simply solving the 1-Wasserstein problem defined on G(A, B). This special case, where every point has the same demand or supply, is called the Euclidean Bipartite Matching (EBM) problem. A matching M is a set of vertex-disjoint edges in G(A, B) and has cost (1/n) ∑_{(a,b)∈M} ∥a − b∥. For the EBM problem, the optimal transport plan is simply a minimum-cost matching of cardinality n.

For any point set P in Euclidean space, let C_max(P) := max_{(a,b)∈P×P} ∥a − b∥ denote the distance of its farthest pair and C_min(P) := min_{(a,b)∈P×P, a≠b} ∥a − b∥ denote the distance of its closest pair. The spread of the point set, denoted by ∆(P), is the ratio ∆(P) = C_max(P)/C_min(P). When P is obvious from the context, we simply use C_min, C_max, and ∆ to denote the distance of its closest and farthest pair and its spread.
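As a concrete illustration of the EBM objective above, the following sketch evaluates the matching cost by brute force over all n! pairings. The function name and the use of `itertools.permutations` are our own choices for illustration, not part of the algorithms in this paper; the approach is only feasible for tiny n.

```python
import itertools
import math

def ebm_cost(A, B):
    """Exact EBM cost: each point carries mass 1/n, so the optimal
    transport plan is a minimum-cost perfect matching.  Brute force
    over all n! matchings; for illustration only."""
    n = len(A)
    best = min(
        sum(math.dist(a, b) for a, b in zip(A, perm))
        for perm in itertools.permutations(B)
    )
    return best / n

# Two co-located pairs match at zero cost; a crossing pairing is never optimal.
print(ebm_cost([(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (0.0, 0.0)]))  # → 0.0
```

Efficient solvers (e.g., the Hungarian method) replace the brute force in practice; the brute force only pins down the objective being approximated.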
For higher dimensions, a quadtree-based greedy algorithm provides an expected O(d log n)-approximation in O(nd) time. This algorithm constructs a randomly-shifted recursive partition of space by splitting each cell into 2^d cells of half the side-length. For every cell of the quadtree, the algorithm moves excess supply inside one child to meet excess demand inside another child. Agarwal & Varadarajan (2004) combined a different hierarchical partition with an exact solver to get an expected O(d^2 log(1/ε))-approximation in Õ(n^{1+ε}) time. Indyk (2007) combined the hierarchical greedy framework with importance sampling to estimate the cost of the Euclidean bipartite matching within a constant factor of the optimal. Additionally, there are Õ(nd)-time algorithms that approximate the matching cost within a factor of O(log^2 n) (Andoni et al. (2008); Chen et al. (2022)). There are also approximation algorithms that run in Õ(n^2) time; however, these algorithms rely on several black-box reductions and, at present, there are no usable implementations of them (Agarwal & Sharathkumar (2014); Sherman (2017)). The lack of fast exact and relative approximations that are also implementable has motivated machine-learning researchers to design additive-approximation algorithms, which we discuss next.

Additive Approximations: Cuturi (2013) introduced a regularized version of the optimal transport problem that produces an ε-close transport plan and can be solved using the Sinkhorn method. For input points within the unit hypercube, such an algorithm produces an ε-close transport plan in Õ(n^2 d/ε^2) time (Lin et al. (2019)). One can also adapt graph-theoretic approaches, including the algorithm by Gabow & Tarjan (1989), to obtain an ε-close solution in O(n^2 √d/ε + nd/ε^2) time for points within the unit hypercube (Lahn et al. (2019)).
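The randomly-shifted quadtree greedy just described can be sketched as follows for the EBM case. The cell keys, the number of levels, and the per-cell pairing order are simplifications of ours, and the routine returns the cost of the greedy matching rather than a transport plan.

```python
import math
import random
from collections import defaultdict

def quadtree_greedy_cost(A, B, seed=0):
    """Sketch of the quadtree greedy for EBM: sweep grids from the finest
    to the coarsest; in each cell, greedily match points of A and B that
    share the cell and pass the leftovers up to the next (coarser) level."""
    d, n = len(A[0]), len(A)
    rng = random.Random(seed)
    shift = [rng.random() for _ in range(d)]       # random shift of all grids
    levels = max(1, math.ceil(math.log2(n)) + 1)
    rem_a, rem_b, total = list(A), list(B), 0.0
    for i in range(levels, -1, -1):                # side length 2^{1-i}, up to 2
        side = 2.0 ** (1 - i)
        cells = defaultdict(lambda: ([], []))
        for p in rem_a:
            cells[tuple(int((x + s) // side) for x, s in zip(p, shift))][0].append(p)
        for p in rem_b:
            cells[tuple(int((x + s) // side) for x, s in zip(p, shift))][1].append(p)
        rem_a, rem_b = [], []
        for As, Bs in cells.values():
            k = min(len(As), len(Bs))
            total += sum(math.dist(a, b) for a, b in zip(As[:k], Bs[:k]))
            rem_a += As[k:]
            rem_b += Bs[k:]
    return total / n
```

The sketch charges the actual Euclidean distance of each greedily matched pair, so identical point sets cost zero; the expected O(d log n) approximation guarantee is over the random shift.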
Some of the additive approximation methods, including the Sinkhorn method, are highly parallelizable. For instance, the algorithm by Jambulapati et al. (2019) has a parallel depth of Õ(1/ε); see also Altschuler et al. (2017; 2019) ; Blanchet et al. (2018) ; Quanrud (2018) ; Dvurechensky et al. (2018) ; Guo et al. (2020) .

1.3. OUR RESULTS

Let T(n, ε) be the time taken by an ε-additive-approximation algorithm on an input of n points in the unit hypercube. In Theorems 1.1 and 1.2, we present new algorithms that improve the accuracy of any additive-approximation algorithm for the 1-Wasserstein problem and the EBM problem, respectively.

Theorem 1.1. Given two discrete distributions µ and ν whose supports lie in the d-dimensional unit hypercube and have spread n^{O(1)}, and a parameter ε > 0, a transport plan with an expected additive error of min{ε, (d log_{√d/ε} n)·W(µ, ν)} can be computed in O(T(n, ε/d) log_{√d/ε} n) time; here, W(µ, ν) is the 1-Wasserstein distance between µ and ν.

Theorem 1.2. Given two sets of n points A and B in the d-dimensional unit hypercube and a parameter ε > 0, a matching whose expected cost is within an additive error of min{ε, (d log log n)·w*} of the optimal matching cost w* can be computed in O(T(n, ε/d) log log n) time with high probability.

Typical additive-approximation algorithms run in T(n, ε) = Õ(n^2) time and compute an ε-close transport plan for any arbitrary cost function; i.e., they make no assumption about the distances between points. The inputs to our algorithm, on the other hand, are point sets in Euclidean space. Therefore, one can use an approximate dynamic nearest-neighbor data structure to improve the execution time of such additive-approximation algorithms; in particular, this applies to the algorithm by Lahn et al. (2019). In contrast, all previous methods that achieve a sub-logarithmic approximation require Ω(n^{5/4}) time (Agarwal & Sharathkumar (2014)). We also note that all of our algorithms extend to computing transport plans under any ℓ_p-metric in a straightforward way. For simplicity, we restrict our presentation to the Euclidean metric.

Overview of the algorithm: Our algorithm uses the hierarchical greedy paradigm. In our presentation, we refer to a hypercube as a cell. For any cell □, let V_□ = (A ∪ B) ∩ □.
Unlike a quadtree-based greedy algorithm, which splits each cell □ into 2^d cells, we split it into min{|V_□|, (4√d/ε)^d} cells. Thus, the height of the resulting tree T reduces from O(log n) to O(log_{√d/ε} n). For any cell □ of T and any child □′ of □, we move any excess supply or demand inside □′ to its center. Let A_□ (resp. B_□) be the set consisting of the center points of all children of □ with excess demand (resp. supply). For any child □′ of □ with excess demand (resp. supply), we assign a weight of η(V_□′) to its center point in A_□ (resp. B_□). Using an additive-approximation algorithm, we compute an (ε/d)-close transport cost between A_□ and B_□ in T(|V_□|, ε/d) time. We report the sum of the transport costs computed at all cells of T as an approximate 1-Wasserstein distance. This simple algorithm improves the quality of the solutions produced by both additive- and relative-approximation algorithms.

From the perspective of relative-approximation algorithms, Agarwal & Varadarajan (2004) as well as Indyk (2007) utilized a similar hierarchical framework to design approximation algorithms. However, unlike our algorithm, they used an exact solver at each cell that takes Ω(|A_□ ∪ B_□|^3) time. As a result, to obtain near-quadratic execution time, they divided every cell into O(|V_□|^{2/3}) children, i.e., (1/ε)^d ≤ n^{2/3}, or d = O(log_{1/ε} n). This also forces the height of the tree to be O(d log_{1/ε} n), leading to an O(d^2 log_{1/ε} n)-factor approximation. We replace the Ω(n^3)-time exact solver in their algorithm with a T(|V_□|, ε) = Õ(|V_□|^2)-time additive-approximation algorithm. Therefore, each instance (regardless of the number of non-empty children) can be solved in Õ(|V_□|^2) time. As a result, we improve the approximation factor from O(d^2 log_{1/ε} n) to O(d log_{√d/ε} n) and also remove all restrictions on the dimension: our algorithm works for any dimension.
Technical Challenge: Using an additive-approximation algorithm at each cell introduces an error that may be difficult to bound. We overcome this challenge with the following observation. In Section 2.2, we show that, for any point set with spread ∆, an additive-approximation algorithm can be used to compute a 2-relative approximation in T(n, 1/∆) time. Our algorithm guarantees that the spread of the point set at each cell □ is O(d/ε); as a result, we get a 2-relative approximation by using an additive-approximation algorithm in O(T(n, ε/d)) time.

Improvements for EBM: For the EBM problem, we obtain the improved approximation ratio stated in Theorem 1.2 as follows. Instead of dividing each cell into a fixed number min{n, (4√d/ε)^d} of children, we divide it into min{n, n^{d/2^i}} children at each level i; here, the level of any cell in the tree is the length of the path from the root to the cell. By doing so, we reduce the height of the tree to O(log log n). To analyze the running time, we show that the number of remaining unmatched points over all level-i cells is Õ(n^{1−1/2^i}). Since there are only sub-linearly many points remaining, we can afford a larger spread of O((d/ε)·n^{1/2^i}), and the resulting execution time of the additive approximation remains O(T(n, ε/d)) per level.

Open Question: Although our algorithms work in any dimension, the relative approximation factor grows linearly in the dimension. Recently, Chen et al. (2022) removed the dependence on the dimension d from the approximation factor of the quadtree greedy algorithm by using a data-dependent weight assignment. It is an open question whether their approach can be adapted to our framework, leading to a similar improvement in the approximation factor of our algorithms.
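The per-cell instance construction from the overview can be sketched as follows (the names `cell_instance`, `k`, and the zero tolerance are ours): the cell is split into k^d children, weights are aggregated per child, and each non-neutral child contributes its excess at the child's center.

```python
from collections import defaultdict

def cell_instance(weighted_points, origin, side, k):
    """Aggregate the weights of the points inside each of the k^d children
    of the cell [origin, origin + side]^d and place each child's excess
    demand (negative) or supply (positive) at the child's center."""
    d = len(origin)
    child = side / k
    agg = defaultdict(float)
    for p, w in weighted_points:
        idx = tuple(min(k - 1, int((p[j] - origin[j]) / child)) for j in range(d))
        agg[idx] += w
    instance = []
    for idx, w in agg.items():
        if abs(w) > 1e-12:                       # skip neutral children
            center = tuple(origin[j] + (idx[j] + 0.5) * child for j in range(d))
            instance.append((center, w))
    return instance
```

Because the input cell is balanced, the output instance is balanced as well; the role of k is played by ⌈4√d/ε⌉ in the algorithm of Section 3.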

2. PRELIMINARIES

We begin by introducing a few notations. For a cell □, we denote its side-length by ℓ_□ and its center by c_□. For a parameter ℓ, let G(□, ℓ) denote a grid that partitions □ into smaller cells with side-length ℓ. Recall that V_□ denotes the subset of A ∪ B that lies inside □. We say that □ is non-empty if V_□ is non-empty. Recall that η(V_□) = ∑_{v∈V_□} η(v). We define the weight of □ to be η(V_□) and denote it by η(□). We call a cell □ a deficit cell if η(□) < 0, a surplus cell if η(□) > 0, and a neutral cell if η(□) = 0. In this section, we provide a simple transformation of the input for achieving an additive approximation of the 1-Wasserstein distance. Furthermore, we show how an additive-approximation algorithm for the 1-Wasserstein problem can be used to obtain a (1 + ε)-relative-approximation algorithm that runs in T(n, ε/∆) time, where ∆ is the spread of the input points.

2.1. ADDITIVE APPROXIMATION IN EUCLIDEAN SPACE

For any ε > 0, given an additive-approximation algorithm that runs in T(n, ε) time, we present an algorithm to compute an ε-close transport cost for distributions inside the d-dimensional unit hypercube □* in O(min{T(n, ε/2), n + T((2√d/ε)^d, ε/2)}) time. The algorithm works in two steps. In the first step, it constructs a grid G := G(□*, ε/(2√d)) on the unit hypercube □* and computes a transport plan σ_1, as follows. For each non-empty neutral cell of G, σ_1 arbitrarily transports all supplies to demands within the cell. Similarly, for any deficit (resp. surplus) cell, σ_1 arbitrarily transports supplies from (resp. to) all supply (resp. demand) points inside the cell to (resp. from) some arbitrary demand (resp. supply) points within the cell. In the second step, the algorithm constructs a set of demand points A and a set of supply points B as follows: for any deficit cell □ (resp. surplus cell □′), the point c_□ (resp. c_□′) is added to A (resp. B) with a weight of η(□) (resp. η(□′)). Note that A ∪ B is a balanced instance of the 1-Wasserstein problem. The algorithm computes an (ε/2)-close transport plan σ_2 on the instance A ∪ B in T(|A| + |B|, ε/2) time (see Figure 1). The algorithm returns w(σ_1) + w(σ_2) as an ε-close transport cost on A ∪ B. We discuss the accuracy of the algorithm in Appendix A. The following lemma follows from the fact that |A| + |B| is bounded by min{2n, (2√d/ε)^d}.

Lemma 2.1. Given two point sets A and B in the d-dimensional unit hypercube and a parameter ε > 0, an ε-close transport cost can be computed in O(min{T(n, ε/2), n + T((2√d/ε)^d, ε/2)}) time.

Instead of transporting supplies inside cells arbitrarily, our algorithm in Section 3 recursively applies the same algorithm within each cell and obtains higher accuracy.
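The first step of this two-step scheme can be sketched as follows for the EBM case (the function name, returning the reduced instance instead of solving it, and the within-cell pairing order are our simplifications); the reduced instance would then be handed to the black-box additive solver.

```python
import math
from collections import defaultdict

def grid_round(A, B, eps):
    """Step one of the two-step scheme for unit-mass points: pair off
    supply and demand inside each grid cell of side eps/(2*sqrt(d)) to get
    w(sigma_1), and move each cell's excess mass to the cell center."""
    d, n = len(A[0]), len(A)
    side = eps / (2.0 * math.sqrt(d))
    cells = defaultdict(lambda: ([], []))
    for a in A:
        cells[tuple(int(x / side) for x in a)][0].append(a)
    for b in B:
        cells[tuple(int(x / side) for x in b)][1].append(b)
    w1, reduced = 0.0, []
    for idx, (As, Bs) in cells.items():
        k = min(len(As), len(Bs))
        w1 += sum(math.dist(a, b) for a, b in zip(As[:k], Bs[:k])) / n
        excess = (len(Bs) - len(As)) / n          # > 0: surplus, < 0: deficit
        if excess:
            center = tuple((i + 0.5) * side for i in idx)
            reduced.append((center, excess))
    return w1, reduced
```

Each cell has diameter ε/2, so the within-cell transport and the snapping to centers each contribute at most (ε/2)·U of additive error, matching the accounting above.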

2.2. RELATIVE APPROXIMATION FOR LOW SPREAD POINT SETS

In this section, we show that an ε-additive-approximation algorithm can be used to obtain a (1 + ε)-relative-approximation algorithm for the 1-Wasserstein problem that runs in T(n, ε/∆) time; here, ∆ is the spread of the input points.
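The reduction in this section can be sketched as follows (`relative_from_additive` and the signature of the black-box `additive_cost(A, B, eps)` are our own framing): after rescaling so that C_max = 1, every unit of mass in a co-location-free instance must travel at least C_min = 1/∆, so W ≥ U/∆, and an additive error of (ε/∆)·U is therefore at most ε·W.

```python
import math

def relative_from_additive(A, B, eps, additive_cost):
    """Turn a black-box additive solver into a relative approximation
    (assumes co-located pairs were already cancelled): rescale so the
    farthest pair is at distance 1 and call the solver with eps/spread."""
    pts = A + B
    c_max = max(math.dist(p, q) for p in pts for q in pts)
    c_min = min(math.dist(p, q) for p in pts for q in pts if p != q)
    scale = 1.0 / c_max
    As = [tuple(x * scale for x in p) for p in A]
    Bs = [tuple(x * scale for x in p) for p in B]
    spread = c_max / c_min
    # Additive error (eps/spread)*U <= eps*W, since W >= U*C_min after rescaling.
    return additive_cost(As, Bs, eps / spread) / scale
```

The quadratic-time pairwise scan for C_min and C_max is for clarity only; any closest/farthest-pair routine can replace it.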

3. AN O(d log_{√d/ε} n)-APPROXIMATION ALGORITHM FOR THE 1-WASSERSTEIN PROBLEM

In this section, we present our algorithm satisfying the bounds of Theorem 1.1. We begin by defining a hierarchical partitioning and a tree T associated with it. Each node in T corresponds to a non-empty cell of our hierarchical partition, and we do not distinguish between the two. We partition each cell of side-length ℓ into ⌈4√d/ε⌉^d cells of side-length at most (ε/(4√d))ℓ. We construct our hierarchical partition in a randomly-shifted fashion, as follows.

Hierarchical Partitioning: First, we pick a point ξ uniformly at random from the unit hypercube [0, 1]^d and set □* = [−1, 1]^d + ξ. Note that □* is a hypercube of side-length 2 containing all points of A ∪ B. We designate □* as the root of T. Let κ = ⌈4√d/ε⌉. For any cell □, if only one point of A ∪ B lies inside □, we designate □ as a leaf cell of T. For any non-leaf cell □, we construct its children by partitioning □, using a grid G_□ = G(□, ℓ_□/κ), into κ^d cells, and create a child node for each non-empty cell of this grid. We denote the set of children of a non-leaf cell □ by C[□]. Assuming the spread of A ∪ B is n^{O(1)}, the height of T, denoted by h, is O(log_{√d/ε} n). Similar to quadtrees, our hierarchical partitioning can be seen as a sequence of grids ⟨G_0, G_1, ..., G_h⟩, where G_0 is the root and each grid G_i refines the cells of grid G_{i−1}. For each grid G_i, we denote its cell-side-length by ℓ_i. Next, we describe our algorithm.

Computing an approximate 1-Wasserstein distance: For each cell □ of T, we create a balanced instance I_□ and, using the algorithm from Lemma 2.2, we compute a 2-approximate transport plan σ_□ on I_□. Our algorithm then reports the total cost of the transport plans computed at all cells of T, i.e., w := ∑_{□∈T} w(σ_□), as an approximate 1-Wasserstein distance.
Retrieving an approximate transport plan: We retrieve an approximate transport plan σ on the point set A ∪ B by processing the grids ⟨G_h, ..., G_0⟩ in decreasing order of level. First, each non-empty cell □ of G_h is a leaf cell and V_□ contains only one point; we map this point to c_□. For some i < h, assume (inductively) that after processing the non-empty cells of the grid G_{i+1}, the following conditions (i)-(iii) hold for the current transport plan σ within any cell □ of G_{i+1}: (i) if □ is a neutral cell, then σ transports every supply to some demand point inside □; (ii) if □ is a deficit (resp. surplus) cell, then σ transports all supplies (resp. demands) inside □ to (resp. from) some demand (resp. supply) points within □; and (iii) if □ is a deficit (resp. surplus) cell, the excess demand (resp. supply) is mapped to c_□. Given this, we show how to process any non-empty cell □ of G_i so that (i)-(iii) hold for □. Recall that σ_□ is the transport plan computed by our algorithm on I_□. By condition (iii), the excess supplies or demands at any child □′ of □ are mapped to c_□′. Therefore, for any pair of children □_1, □_2 ∈ C[□], where □_1 is a surplus cell and □_2 is a deficit cell, the transport plan σ transports σ_□(c_□1, c_□2) supplies from c_□1 to c_□2. In addition, for any child □_1 (resp. □_2) of □ that is a surplus (resp. deficit) cell, if σ_□(c_□1, c_□) > 0 (resp. σ_□(c_□2, c_□) > 0), then we map the supplies (resp. demands) from c_□1 (resp. c_□2) to c_□. It is easy to confirm that, after processing □, (i)-(iii) hold for □. By the triangle inequality, w(σ) is upper-bounded by the total cost of the transport plans computed for all cells of T; i.e., w(σ) ≤ ∑_{□∈T} w(σ_□).

Efficiency: For any i, let C_i denote the set of non-empty cells of T at level i. For each cell □ ∈ C_i, let n_□ be the number of points in I_□.
Since the spread of the points in I_□ is O(d/ε), executing the algorithm from Lemma 2.2 on I_□ takes O(T(n_□, ε/d)) time (the same bound holds for the root cell). Since I_□ contains at most one point for each non-empty child of □, n_□ ≤ min{|V_□|, (4√d/ε)^d}. Therefore, ∑_{□∈C_i} n_□ ≤ ∑_{□∈C_i} |V_□| = n. Since T(n, ε) = Ω(n), the running time of our algorithm on cells at level i is O(∑_{□∈C_i} T(n_□, ε/d)) = O(T(n, ε/d)). Summing over all levels, the running time of our algorithm is O(T(n, ε/d) log_{√d/ε} n). When the dimension is a small constant, we obtain an improved running time as follows. For each level i of T, there are at most n non-empty cells, and the instance created at each cell has size at most (4√d/ε)^d. Therefore, O(∑_{□∈C_i} T(n_□, ε/d)) = O(n·T((4√d/ε)^d, ε/d)) = n(d/ε)^{O(d)}, and the overall running time improves to n(d/ε)^{O(d)} log_{√d/ε} n.

Quality of Approximation: In this part, we analyze the approximate 1-Wasserstein distance computed by our algorithm. First, we show that the reported cost is ε-close. For the root cell □*, our algorithm computes an (ε/d)-close transport plan σ_□*. The remaining demands and supplies are recursively transported within the children of □*, each of which has diameter ε/2. Therefore, similar to Section 2.1, we can argue that our algorithm reports an ε-close 1-Wasserstein distance. Next, we show that the reported cost is an O(d log_{√d/ε} n)-approximation of the 1-Wasserstein distance. For each level i < h, we show that the expected cost of the transport plans computed for all cells of level i is E[∑_{□∈C_i} w(σ_□)] = O(d)·w(σ*). Here, σ* is an optimal transport plan on A ∪ B, C_i denotes the set of non-empty cells of T at level i, and the expectation is over the choice of the random shift of the hierarchical partitioning. We bound E[∑_{□∈C_i} w(σ_□)] in two steps, as follows.

In the first step, we assign a budget to every edge (a, b) with σ*(a, b) > 0 and show that the total budget assigned to all such edges is, in expectation, O(d)·w(σ*). In the second step, we redistribute this budget to the cells of level i in a way that the budget received by any cell □ is at least w(σ_□)/2. Summing over all O(log_{√d/ε} n) levels of T, the expected value of the total cost computed at all levels i < h is O(d log_{√d/ε} n)·w(σ*). Additionally, we show that the expected cost of mapping all points to the centers of the cells of level h is O(d log_{√d/ε} n)·w(σ*).

5. EXPERIMENTS

In this section, we conduct experiments showing that our algorithms from Sections 3 and 4 improve the accuracy of additive-approximation algorithms. We test an implementation of our algorithm, written in Python, on discrete probability distributions derived from real-world and synthetic data sets. All tests are executed on a computer with a 2.50 GHz Intel Core i7 processor and 8 GB of RAM, using a single computation thread.

Datasets: We test our algorithms on two sets of n samples taken from synthetic and real-world data sets. For each dataset, we use our algorithms to compute the minimum-cost matching between these samples and present results averaged over 10 executions. To generate the synthetic data, we sample from a uniform distribution inside a 2-dimensional unit square placed on a random plane in 15-dimensional space (15D). For a real-world dataset, we use the Adult Census Data (UCI repository), which is a point cloud in R^6 with continuous features for 35,000 individuals, divided into two categories by income (Dua & Graff (2017)). See Appendix E for the results of our experiments on additional datasets.

Results: In our first experiment, we compare our algorithm with existing additive-approximation schemes, namely the Sinkhorn method (Cuturi (2013)) and the LMR algorithm (Lahn et al. (2019)). We use the Sinkhorn (resp. LMR) algorithm as a black box within our algorithm from Section 3 (the 1-Wasserstein algorithm) and compare its execution time with the standard Sinkhorn (resp. LMR) implementation. We set the parameters of the Sinkhorn and LMR algorithms so that the error they produce matches the error produced by our 1-Wasserstein algorithm. As shown in Figure 2 (a)-(d), our algorithm runs significantly faster than both the Sinkhorn and LMR algorithms while producing solutions of similar quality.
In our second experiment, we compare the accuracy and the computation time of our algorithms (the 1-Wasserstein algorithm and our EBM algorithm from Section 4) with the additive-approximation algorithm from Section 2.1 (the Geometric-Additive algorithm). For this experiment, our algorithms use the Sinkhorn algorithm as a black box. The results are shown in Figure 2 (e)-(h). We observe that our algorithms achieve better accuracy (see Figure 2 (a) and (c)), especially on real-world data sets. As expected, the execution time of our algorithms increases slightly (see Figure 2 (d)). Furthermore, we observe that, as the sample size increases, the costs returned by our algorithms converge to the optimal cost, whereas the Geometric-Additive approach does not. Finally, we highlight the scalability of our algorithms. For an input of 3 million points drawn from a 2-dimensional uniform distribution on the unit square, our 1-Wasserstein algorithm runs in 593 seconds and computes an approximate transport cost of 0.0093. Furthermore, for an input of 1.5 million points from the 15D dataset, our 1-Wasserstein algorithm computes an approximate cost of 0.0304 in 608 seconds.



Footnotes: Õ(·) hides poly(d, log n, 1/ε) factors in the execution time. The execution time of the Sinkhorn algorithm, as well as of other additive approximations, depends on the diameter C of the point set; in the case of the d-dimensional unit hypercube, C = √d. The appendix is provided in the supplemental material. Owing to the higher complexity of LSH in arbitrary ℓ_p-metrics, the approximation factors in Theorems 1.3 and 1.4 are slightly higher for arbitrary ℓ_p-metrics.

- Creating an instance at each cell: For each cell □ of T, we create a balanced instance I_□ of the 1-Wasserstein problem as follows. If □ is a leaf cell, then it contains a single point u of A ∪ B, which we add to I_□; in addition, we add the center point c_□ with a weight −η(□) to I_□. Otherwise, □ is a non-leaf cell. For any child □′ ∈ C[□], if □′ is a deficit or surplus cell, we add c_□′ with a weight η(□′) to I_□. The weight η(□′) represents the excess demand or supply of the points in V_□′. Furthermore, we add the center point c_□ with a weight −η(□) to I_□. The instance I_□ is balanced and has spread O(d/ε).

- Estimating the 1-Wasserstein distance: For the root cell □*, we compute an (ε/d)-close transport plan σ_□* on I_□*. Furthermore, for each cell □ at any level i > 0, we compute a 2-approximate transport plan σ_□ on I_□ using the algorithm from Lemma 2.2.



1.2. RELATED WORK

Relative Approximations: In fixed-dimensional settings, i.e., d = O(1), there is extensive work on the design of near-linear-time Monte-Carlo (1 + ε)-relative-approximation algorithms for the 1-Wasserstein and EBM problems. The execution times of these algorithms are Ω(n(dε^{−1} log n)^d) (Khesin et al. (2019); Raghvendra & Agarwal (2020); Fox & Lu (2020); Agarwal et al. (2022a)). A recent algorithm presented by Agarwal et al. (2022b) improved the dependence on d slightly and achieved an execution time of Ω(n(dε^{−1} log log n)^d). Nonetheless, the exponential dependence on d makes these algorithms unsuitable for higher dimensions.

The algorithm by Lahn et al. (2019) runs in O(1/ε) phases, where each phase executes one iteration of Gabow-Tarjan's algorithm. As shown by Agarwal & Sharathkumar (2014), one can use a (1/√ε)-approximate dynamic nearest-neighbor data structure with a query/update time of O(n^ε + d log n) (Andoni et al. (2014)) to execute each iteration of Gabow-Tarjan's algorithm in Õ(n^{1+ε}) time. Combining this with the algorithms from Theorems 1.1 and 1.2, we obtain the following relative-approximation algorithms; the details are provided in Appendix D.

Theorem 1.3. Let µ and ν be two discrete distributions whose supports lie in the d-dimensional unit hypercube and have polynomial spread, and let ε > 0 be a parameter. An O(d/ε^{3/2})-approximate transport plan under the Euclidean metric can be computed in O(d^2 n^{1+ε}/ε) time.

Theorem 1.4. Given two sets of n points A and B in the d-dimensional unit hypercube and a parameter ε > 0, an O((d/√ε) log(d/ε))-approximate matching can be computed in O(d n^{1+ε} log(d/ε)) time with high probability.

In contrast to our results, Agarwal & Varadarajan (2004) compute an O(d^2 log(1/ε))-approximate matching in the same time. Therefore, our algorithm computes a more accurate matching for d > 1/√ε. For instance, consider the case where d = √(log n). For any arbitrarily small constant ε > 0, the algorithm of Theorem 1.4 runs in Õ(n^{1+ε}) time and returns an O(√(log n))-approximation.

Figure 1: (a) The algorithm transports supplies (red disks) to demands (blue circles) within each cell and creates an instance by moving any excess supplies or demands to the center of the corresponding cells, (b) An ε/2-close transport plan is computed on the new problem instance.

Figure 2: (a) and (b) Comparison with the Sinkhorn algorithm, (c) and (d) Comparison with the LMR algorithm, and (e)-(h) Comparison with the Geometric-Additive algorithm

ACKNOWLEDGEMENT

We would like to acknowledge Advanced Research Computing (ARC) at Virginia Tech, which provided the computational resources used to run the experiments. Research presented in this paper was funded by NSF grants CCF-1909171, CCF-2223871, IIS-1814493, CCF-2007556, and CCF-2223870. We thank the anonymous reviewers for their useful feedback.

4. AN O(d log log n)-APPROXIMATION ALGORITHM FOR THE EBM PROBLEM

Suppose A and B are two sets of n points inside the d-dimensional unit hypercube, where each point a ∈ A (resp. b ∈ B) has a weight η(a) = -1/n (resp. η(b) = 1/n). In this section, we present an approximation algorithm for the EBM problem satisfying the bounds claimed in Theorem 1.2. Note that by invoking Lemma 2.1 on A ∪ B, one can compute an ε-close transport cost on A ∪ B. To boost the accuracy of the algorithm of Lemma 2.1, we present an O(d log log n)-approximation algorithm for the EBM problem. To satisfy the bounds claimed in Theorem 1.2, one can then report the minimum of the costs computed by the two algorithms.
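The "report the minimum" step can be made concrete with a small sketch. The function below is a hypothetical illustration (it assumes both estimates sit exactly at their upper bounds); it shows why taking the minimum of an ε-additive estimate and a c-relative estimate yields an additive error of min{ε, (c - 1)W}, mirroring the min{ε, (d log log n)W(µ, ν)} bound of Theorem 1.2:

```python
def worst_case_error(W, eps, c):
    """Additive error of reporting the minimum of an eps-additive estimate
    and a c-relative estimate, assuming both meet their upper bounds."""
    additive_estimate = W + eps   # guarantee: within an additive eps of W
    relative_estimate = c * W     # guarantee: within a factor c of W
    return min(additive_estimate, relative_estimate) - W

# When W is tiny, the relative guarantee wins; when W is large, the additive one does.
print(worst_case_error(W=0.001, eps=0.1, c=5))  # ≈ (c - 1) * W = 0.004
print(worst_case_error(W=1.0, eps=0.1, c=5))    # ≈ eps = 0.1
```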

Input transformation:

We transform the input points such that (1) all coordinates are positive integers bounded by n^O(1), (2) an optimal matching on the transformed points is a (1 + ε)-approximate matching with respect to the original points, and (3) the cost of the optimal matching is O(d^(3/2) n log n/ε). Similar transformations have been applied in several papers in the literature (Agarwal et al. (2017); Lahn & Raghvendra (2021)). We describe this transformation in Appendix C.1. As before, we can match and remove any co-located points a ∈ A and b ∈ B. Assuming the algorithm of Lemma 2.2 runs within the time bound stated there for some k ≥ 1, we describe our EBM algorithm. Our algorithm is easily adaptable to use any algorithm with a running time of the same form, where k ≥ 1 and t is a fixed constant.

Overview of the algorithm: Similar to Section 3, our EBM algorithm constructs a hierarchical partitioning and the associated tree T, and executes the algorithm from Lemma 2.2 on the instance created for each cell of T. In contrast to Section 3, the tree T constructed by our EBM algorithm has height O(log log n), resulting in an improved approximation factor. The hierarchical partitioning of this section differs from the one in Section 3 in two ways. First, we partition the root cell into a grid G_1 with a cell-side-length of Θ(d^(5/2) n log n/ε^2) at the first level. The grid G_1 may result in a high branching factor at the root; however, we show that, with probability at least 1 - ε/√d, no edge of an optimal matching crosses G_1. Therefore, with that probability, all cells of G_1 are neutral cells and the problem instance for the root is an empty instance; i.e., the branching factor of the root does not impact the running time of our algorithm. Second, for any cell □ of level i, instead of splitting □ into a fixed number of child cells, we use a grid with a cell-side-length of δn^(1/2^i) (δ is defined below). Although this results in a spread of Õ(n^(1/2^i)) for the problem instance I_□, we show that the expected number of remaining unmatched points over all cells of level i is Õ(n^(1-1/2^i)).
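A minimal sketch of such a transformation is given below. The scale factor is an assumption chosen for illustration, not the paper's exact constant (the precise transformation and its analysis are in Appendix C.1); the idea is that rounding to the nearest integer after scaling perturbs each point by so little that any matching cost changes by at most an ε-fraction:

```python
import math

def transform(points, n, eps):
    """Hypothetical sketch: map points in [0, 1]^d to positive integer
    coordinates bounded by a polynomial in n (for fixed d and eps)."""
    d = len(points[0])
    # With this scale, rounding moves each point by at most sqrt(d)/(2*scale),
    # so the total perturbation over n matched pairs is an eps-fraction of the cost.
    scale = math.ceil(n * math.sqrt(d) / eps)
    # +1 shifts rounded coordinates so that all of them are strictly positive.
    return [tuple(int(round(c * scale)) + 1 for c in p) for p in points], scale

pts, scale = transform([(0.25, 0.75), (0.5, 0.5)], n=2, eps=0.5)
print(pts)  # all coordinates are positive integers bounded by scale + 1
```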
Therefore, the total execution time of our algorithm remains T(n, ε/d) per level. These modifications result in a tree T of height O(log log n). We describe the details below.

Hierarchical Partitioning: Define δ := 5d^2 ε^(-1) log n. Similar to Section 3, we define a cell □* as a randomly-shifted hypercube that contains all points of A ∪ B and has a side-length of 2 max{C_max, ℓ_1/ε}, where ℓ_1 = (√d/ε)δn. We designate □* as the root of T (□* is at level 0 of T). Define a grid G_1 := G(□*, ℓ_1). We add each non-empty cell of G_1 to the tree as a child of □*. We construct the hierarchical partitioning in a recursive fashion as follows. For any non-root cell □ of T, if □ contains only one point of A ∪ B, then we designate □ as a leaf cell. Otherwise, let □ be a cell of level i. Define the grid G_□ = G(□, δn^(1/2^i)) and add the non-empty cells of G_□ to T as the children of □. For simplicity of presentation, we assume n^(1/2^i) is an integer. For any cell □, denote the set of children of □ in T by C[□]. The height of T, denoted by h, is O(log log n).

Similar to Section 3, our hierarchical partitioning is also a sequence of grids ⟨G_0, G_1, . . . , G_h⟩, where G_0 is the root cell □*, G_1 has a cell-side-length of ℓ_1 = (√d/ε)δn, and, for each 2 ≤ i ≤ h, G_i has a cell-side-length of δn^(1/2^(i-1)).

Computing an approximate matching cost: To estimate the matching cost, similar to Section 3, our algorithm creates an instance of the 1-Wasserstein problem for each cell of the tree T. Using the algorithm from Lemma 2.2, our algorithm computes a 2-approximate transport plan for the instance created for each cell and returns the total cost of such transport plans as an approximate matching cost. This completes the description of the algorithm. We describe the details of retrieving a matching in Appendix C.2, the quality of approximation in Appendix C.3, and the efficiency of our algorithm in Appendix C.4.
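The doubly-exponential shrinkage of the cell-side-lengths is what caps the height of T at O(log log n). The sketch below (a hypothetical helper; the stopping rule, recursing until cells hold O(1) points, is an assumption) lists the grid side-lengths with the constants defined above and shows the number of levels growing like log log n:

```python
import math

def grid_side_lengths(n, d, eps):
    """Side-lengths of the grids G_1, G_2, ..., G_h used by the
    hierarchical partitioning (a sketch, not the paper's exact procedure)."""
    delta = 5 * d**2 * math.log(n) / eps        # delta := 5 d^2 eps^-1 log n
    sides = [(math.sqrt(d) / eps) * delta * n]  # ell_1, the level-1 side-length
    i = 1
    while n ** (1 / 2**i) > 2:                  # stop once the spread is O(1)
        sides.append(delta * n ** (1 / 2**i))   # G_{i+1} has side delta * n^(1/2^i)
        i += 1
    return sides

# The number of levels grows like log log n:
print(len(grid_side_lengths(2**16, d=2, eps=0.5)))   # -> 4  (= log2 log2 n)
print(len(grid_side_lengths(2**256, d=2, eps=0.5)))  # -> 8
```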

