Efficient estimates of optimal transport via low-dimensional embeddings

Abstract

Optimal transport (OT) distances have been widely used in recent machine learning work as ways to compare probability distributions. However, they are costly to compute when the data live in high dimension. Recent work aims specifically at reducing this cost by computing OT using low-rank projections of the data (seen as discrete measures) (Paty & Cuturi, 2019). We extend this approach and show that one can approximate OT distances using more general families of maps, provided they are 1-Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are carried out after mapping the data to a lower-dimensional space, our method scales well with the original data dimension. We demonstrate the idea with neural networks. We use Sinkhorn Divergences (SD) to approximate OT distances, as they are differentiable and allow for gradient-based optimisation. We illustrate on synthetic data how our technique preserves accuracy while keeping computational costs largely insensitive to the data dimension.

1. Introduction

Optimal Transport (OT) metrics (Kantorovich, 1960), or Wasserstein distances, have emerged successfully in the field of machine learning, as outlined in the review by Peyré et al. (2017). They provide machinery to lift distances on X to distances over probability distributions in P(X). They have found multiple applications in machine learning: domain adaptation (Courty et al., 2017), density estimation (Bassetti et al., 2006) and generative networks (Genevay et al., 2017; Patrini et al., 2018). However, computing OT between distributions supported in a high-dimensional space is prohibitively expensive, and may not even be practically possible, as the sample complexity can grow exponentially with the dimension, as shown by Dudley (1969). Conversely, Weed et al. (2019) showed a theoretical improvement when the support of the distributions lies in a low-dimensional space. Furthermore, picking the ground metric one should use is not obvious for high-dimensional data. One of the earlier ideas, from Santambrogio (2015), showed that OT projections onto a 1-D space may suffice to extract geometric information from high-dimensional data. This prompted Kolouri et al. (2018) to use this method to build generative models, namely the Sliced Wasserstein Autoencoder. Following a similar approach, Paty & Cuturi (2019) and Muzellec & Cuturi (2019) project the measures onto a linear subspace E of low dimension k that maximizes the transport cost, and show how this can be used in applications of color transfer and domain adaptation. This can be seen as an extension of earlier work by Cuturi & Doucet (2014), in which the cost function is parameterized. A turning point for the machine learning community was the seminal paper by Cuturi (2013), which introduced the idea of entropic regularization of OT distances and the Sinkhorn algorithm.
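The entropic regularization of Cuturi (2013) replaces exact OT with a fixed-point scheme that needs only matrix-vector products. Below is a minimal NumPy sketch of the Sinkhorn iterations on a toy problem; function and variable names are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Entropic-regularized OT (Cuturi, 2013): alternate scalings of the
    Gibbs kernel K = exp(-C / eps) until the marginals match a and b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)    # scale columns to match marginal b
        u = a / (K @ v)      # scale rows to match marginal a
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return np.sum(P * C)              # transport cost under P

# toy example: two small point clouds in R^2, uniform weights
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(5, 2)) + 1.0
C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared-Euclidean cost
a = b = np.ones(5) / 5
cost = sinkhorn(a, b, C)
```

In practice log-domain stabilization is preferred for small eps; this plain version is enough to show the structure of the algorithm.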
Since then, regularized OT has been successfully used as a loss function to construct generative models such as GANs (Genevay et al., 2017) or RBMs (Montavon et al., 2015), and to compute barycenters (Cuturi & Doucet, 2014; Claici et al., 2018). More recently, the new class of Sinkhorn Divergences was shown by Feydy et al. (2018) to have good geometric properties and to interpolate between Maximum Mean Discrepancies (MMD) and OT. Building on this previous work, we introduce a general framework for approximating high-dimensional OT using low-dimensional projections f, by finding the subspace with the worst OT cost, i.e. the one maximizing the ground cost on the low-dimensional space. By taking a general family of parameterizable maps f_φ that are 1-Lipschitz, we show that our method generates a pseudo-metric and is computationally efficient and robust. We start the paper in §2 with background on optimal transport and pseudo-metrics. In §3 we define the theoretical framework for approximating OT distances and show how both linear (Paty & Cuturi, 2019) and non-linear projections can be seen as special instances of our framework. In §4 we present an efficient algorithm for computing OT distances using Sinkhorn Divergences and maps f_φ that are 1-Lipschitz under the L2 norm. We conclude in §5 with experiments illustrating the efficiency and robustness of our method.
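To make the maximize-over-projections idea concrete, here is a deliberately crude stand-in for the framework: random 1-D linear projections (unit vectors are 1-Lipschitz under the L2 norm) scored with the exact sort-based 1-D Wasserstein distance, taking the largest value found. The paper instead trains neural maps f_φ by gradient ascent on Sinkhorn Divergences; all names below are illustrative:

```python
import numpy as np

def w2_1d(u, v):
    """Exact squared 2-Wasserstein between two uniform 1-D samples of
    equal size: sort both and pair order statistics."""
    return np.mean((np.sort(u) - np.sort(v)) ** 2)

def max_projection_w2(x, y, n_candidates=500, seed=0):
    """Lower-bound the high-dimensional OT cost by the worst (largest)
    cost over a family of 1-Lipschitz linear projections to the line."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_candidates):
        theta = rng.normal(size=x.shape[1])
        theta /= np.linalg.norm(theta)   # unit norm => non-expansive map
        best = max(best, w2_1d(x @ theta, y @ theta))
    return best

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 30))
y = rng.normal(size=(100, 30)) + 0.5     # shifted cloud in R^30
est = max_projection_w2(x, y)
```

Each projected cost is a lower bound on the true OT cost because 1-Lipschitz maps can only shrink distances, so maximizing over the family tightens the bound; random search here plays the role of the gradient-based optimisation used in the paper.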

2. Preliminaries

We start with a brief reminder of the basic notions needed for the rest of the paper. Let X be a set equipped with a map d_X : X × X → R≥0 taking non-negative real values. The pair (X, d_X) is said to be a metric space, and d_X a metric on X, if it satisfies the usual properties:

• d_X(x, y) = 0 if and only if x = y
• d_X(x, y) = d_X(y, x)
• d_X(x, z) ≤ d_X(x, y) + d_X(y, z)

If d_X verifies the above except for the "only if" condition, it is called a pseudo-metric, and (X, d_X) is said to be a pseudo-metric space. For a pseudo-metric, it may be that d_X(x, y) = 0 while x ≠ y. Given two pseudo-metrics d_X and d′_X, we write d_X ≤ d′_X if d_X(x, y) ≤ d′_X(x, y) for all x, y. It is easy to see that: 1) "≤" is a partial order on pseudo-metrics over X; 2) "≤" induces a complete lattice structure on the set of pseudo-metrics over X, where 3) suprema are computed pointwise (but not infima). Consider X, Y two metric spaces equipped with respective metrics d_X, d_Y. A map f from X to Y is said to be α-Lipschitz continuous if d_Y(f(x), f(x′)) ≤ α d_X(x, x′). A 1-Lipschitz map is also called non-expansive. Given a map f from X to Y, one defines the pullback of d_Y along f as:

f*(d_Y)(x, x′) = d_Y(f(x), f(x′))    (1)

It is easily seen that: 1) f*(d_Y) is a pseudo-metric on X; 2) f*(d_Y) is a metric iff f is injective; 3) f*(d_Y) ≤ d_X iff f is non-expansive; 4) f*(d_Y) is the least pseudo-metric on X such that f is non-expansive from (X, f*(d_Y)) to (Y, d_Y). Hereafter, we assume that all metric spaces considered are complete and separable, i.e. have a dense countable subset.
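The pullback construction of equation (1) and its first two properties are easy to check on a toy example: a non-injective projection from R^2 to R yields a pullback that is only a pseudo-metric, since distinct points can land at distance 0. Helper names here are our own:

```python
import numpy as np

def pullback(f, d):
    """Pullback of metric d along f, as in equation (1):
    f*(d)(x, x') = d(f(x), f(x'))."""
    return lambda x, xp: d(f(x), f(xp))

d2 = lambda a, b: float(np.linalg.norm(a - b))  # Euclidean metric

# non-injective projection R^2 -> R: keep only the first coordinate
proj = lambda p: p[:1]
dp = pullback(proj, d2)

x, xp = np.array([1.0, 0.0]), np.array([1.0, 5.0])
# distinct points at pullback distance 0: a pseudo-metric, not a metric
assert dp(x, xp) == 0.0
# the projection is non-expansive, so the pullback sits below d2
assert dp(x, xp) <= d2(x, xp)
```

Replacing `proj` with any injective map would make `dp` a genuine metric, matching property 2) above.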

