Efficient estimates of optimal transport via low-dimensional embeddings

Abstract

Optimal transport (OT) distances have been widely used in recent work in machine learning as ways to compare probability distributions. However, they are costly to compute when the data lives in high dimension. Recent work aims specifically at reducing this cost by computing OT using low-rank projections of the data (seen as discrete measures) (Paty & Cuturi, 2019). We extend this approach and show that one can approximate OT distances by using more general families of maps, provided they are 1-Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are done after mapping data to a lower-dimensional space, our method scales well with the original data dimension. We demonstrate the idea with neural networks. We use Sinkhorn Divergences (SD) to approximate OT distances, as they are differentiable and allow for gradient-based optimisation. We illustrate on synthetic data how our technique preserves accuracy while its computational cost displays low sensitivity to the data dimension.

1. Introduction

Optimal Transport metrics (Kantorovich, 1960), or Wasserstein distances, have emerged successfully in the field of machine learning, as outlined in the review by Peyré et al. (2017). They provide machinery to lift distances on a space X to distances over probability distributions in P(X), and have found multiple applications in machine learning: domain adaptation (Courty et al., 2017), density estimation (Bassetti et al., 2006) and generative networks (Genevay et al., 2017; Patrini et al., 2018). However, computing OT between distributions supported in a high-dimensional space is prohibitively expensive, and may not even be practically feasible, as the sample complexity can grow exponentially, as shown by Dudley (1969). Conversely, Weed et al. (2019) showed a theoretical improvement when the support of the distributions lies in a low-dimensional space. Furthermore, the choice of ground metric is not obvious when working with high-dimensional data. One of the earlier ideas, from Santambrogio (2015), showed that OT projections onto a one-dimensional space may be sufficient to extract geometric information from high-dimensional data. This prompted Kolouri et al. (2018) to use this method to build generative models, namely the Sliced Wasserstein Autoencoder. Following a similar approach, Paty & Cuturi (2019) and Muzellec & Cuturi (2019) project the measures onto a linear subspace E of low dimension k that maximizes the transport cost, and show how this can be used in applications such as color transfer and domain adaptation. This can be seen as an extension of earlier work by Cuturi & Doucet (2014), in which the cost function is parameterized. One of the fundamental innovations that made OT appealing to the machine learning com-
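As a minimal illustration of the projection idea discussed above (a sketch, not the paper's implementation), the following NumPy snippet computes an entropically regularized OT cost between two empirical measures after mapping them through a random orthonormal linear projection, which is 1-Lipschitz. All function names, dimensions, and parameter values here are illustrative assumptions:

```python
import numpy as np

def sinkhorn_cost(x, y, reg=1.0, n_iters=200):
    """Entropic OT cost between uniform empirical measures on x and y.

    Plain Sinkhorn iterations with squared Euclidean ground cost;
    reg and n_iters are illustrative choices, not tuned values.
    """
    n, m = x.shape[0], y.shape[0]
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise costs
    K = np.exp(-C / reg)                                # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m               # uniform weights
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                            # Sinkhorn scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]                     # transport plan
    return (T * C).sum()

rng = np.random.default_rng(0)
d, k = 512, 2                       # ambient and projected dimensions (assumed)
x = rng.normal(size=(100, d))
y = rng.normal(size=(120, d)) + 1.0
# Random orthonormal columns give a 1-Lipschitz linear map R^d -> R^k
P, _ = np.linalg.qr(rng.normal(size=(d, k)))
# The Sinkhorn loop runs entirely in R^k, so its cost does not grow with d
cost_low = sinkhorn_cost(x @ P, y @ P)
```

The method described in the abstract would go further: rather than a single fixed projection, it maximizes such a cost over a family of 1-Lipschitz maps (e.g. constrained neural networks), using the differentiability of Sinkhorn Divergences to drive gradient-based optimisation.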

