Efficient estimates of optimal transport via low-dimensional embeddings

Abstract

Optimal transport distances (OT) have been widely used in recent work in Machine Learning as ways to compare probability distributions. These are costly to compute when the data lives in high dimension. Recent work aims specifically at reducing this cost by computing OT using low-rank projections of the data (seen as discrete measures) (Paty & Cuturi, 2019). We extend this approach and show that one can approximate OT distances by using more general families of maps provided they are 1-Lipschitz. The best estimate is obtained by maximising OT over the given family. As OT calculations are done after mapping data to a lower dimensional space, our method scales well with the original data dimension. We demonstrate the idea with neural networks. We use Sinkhorn Divergences (SD) to approximate OT distances as they are differentiable and allow for gradient-based optimisation. We illustrate on synthetic data how our technique preserves accuracy and displays a low sensitivity of computational costs to the data dimension.

1. Introduction

Optimal Transport metrics (Kantorovich, 1960), or Wasserstein distances, have emerged successfully in the field of machine learning, as outlined in the review by Peyré et al. (2017). They provide machinery to lift distances on X to distances over probability distributions in P(X). They have found multiple applications in machine learning: domain adaptation (Courty et al., 2017), density estimation (Bassetti et al., 2006) and generative networks (Genevay et al., 2017; Patrini et al., 2018). However, computing OT between distributions supported in a high-dimensional space is prohibitively expensive, and may not even be practically possible, as the sample complexity can grow exponentially (Dudley, 1969). Conversely, Weed et al. (2019) showed a theoretical improvement when the support of the distributions lies in a low-dimensional space. Furthermore, the choice of ground metric is not obvious for high-dimensional data. One of the earlier ideas, from Santambrogio (2015), showed that OT projections onto a 1-D space may be sufficient to extract geometric information from high-dimensional data. This prompted Kolouri et al. (2018) to use this method to build generative models, namely the Sliced Wasserstein Autoencoder. Following a similar approach, Paty & Cuturi (2019) and Muzellec & Cuturi (2019) project the measures into a linear subspace E of low dimension k that maximizes the transport cost, and show how this can be used in applications of color transfer and domain adaptation. This can be seen as an extension of earlier work by Cuturi & Doucet (2014), in which the cost function is parameterized. One of the fundamental innovations that made OT appealing to the machine learning community was the seminal paper by Cuturi (2013), which introduced entropic regularization of OT distances and the Sinkhorn algorithm.
Since then, regularized OT has been successfully used as a loss function to construct generative models such as GANs (Genevay et al., 2017) or RBMs (Montavon et al., 2015), and to compute barycenters (Cuturi & Doucet, 2014; Claici et al., 2018). More recently, the new class of Sinkhorn Divergences was shown by Feydy et al. (2018) to have good geometric properties and to interpolate between Maximum Mean Discrepancies (MMD) and OT. Building on this previous work, we introduce a general framework for approximating high-dimensional OT using low-dimensional projections f, by finding the subspace with the worst OT cost, i.e. the one maximizing the ground cost in the low-dimensional space. By taking a general family of parameterizable maps f_φ that are 1-Lipschitz, we show that our method generates a pseudo-metric and is computationally efficient and robust. We start the paper in §2 with background on optimal transport and pseudo-metrics. In §3 we define the theoretical framework for approximating OT distances and show how both linear (Paty & Cuturi, 2019) and non-linear projections can be seen as special instances of our framework. In §4 we present an efficient algorithm for computing OT distances using Sinkhorn Divergences and maps f_φ that are 1-Lipschitz under the L2 norm. We conclude in §5 with experiments illustrating the efficiency and robustness of our method.

2. Preliminaries

We start with a brief reminder of the basic notions needed for the rest of the paper. Let X be a set equipped with a map d_X : X × X → R≥0 with non-negative real values. The pair (X, d_X) is said to be a metric space, and d_X a metric on X, if it satisfies the usual properties:

• d_X(x, y) = 0 if and only if x = y
• d_X(x, y) = d_X(y, x)
• d_X(x, z) ≤ d_X(x, y) + d_X(y, z)

If d_X verifies the above except for the 'only if' condition, it is called a pseudo-metric, and (X, d_X) is said to be a pseudo-metric space. For a pseudo-metric, it may be that d_X(x, y) = 0 while x ≠ y. We write d'_X ≤ d_X if d'_X(x, y) ≤ d_X(x, y) for all x, y. It is easy to see that: 1) "≤" is a partial order on pseudo-metrics over X; 2) "≤" induces a complete lattice structure on the set of pseudo-metrics over X, where 3) suprema are computed pointwise (but not infima).

Consider two metric spaces X, Y equipped with respective metrics d_X, d_Y. A map f from X to Y is said to be α-Lipschitz continuous if d_Y(f(x), f(x')) ≤ α d_X(x, x'). A 1-Lipschitz map is also called non-expansive. Given a map f from X to Y, one defines the pullback of d_Y along f as:

f*(d_Y)(x, x') = d_Y(f(x), f(x'))    (1)

It is easily seen that: 1) f*(d_Y) is a pseudo-metric on X; 2) f*(d_Y) is a metric iff f is injective; 3) f*(d_Y) ≤ d_X iff f is non-expansive; 4) f*(d_Y) is the least pseudo-metric on X such that f is non-expansive from (X, f*(d_Y)) to (Y, d_Y).

Hereafter, we assume that all metric spaces considered are complete and separable, i.e. have a dense countable subset. Let (X, d_X) be a (complete separable) metric space. Let Σ_X be the σ-algebra generated by the open sets of X (aka the Borelian subsets). We write P(X) for the set of probability distributions on (X, Σ_X). Given a measurable map f : X → Y and µ ∈ P(X), one defines the push-forward of µ along f as:

f#(µ)(B) = µ(f^{-1}(B))    (2)

for B ∈ Σ_Y.
It is easily seen that f#(µ) is a probability measure on (Y, Σ_Y). Given µ in P(X) and ν in P(Y), a coupling of µ and ν is a probability measure γ over X × Y such that for all A in Σ_X and B in Σ_Y, γ(A × Y) = µ(A) and γ(X × B) = ν(B). Equivalently, µ = π_0#(γ) and ν = π_1#(γ), for π_0, π_1 the respective projections. We write Γ(µ, ν) for the set of couplings of µ and ν.

There are several ways to lift a given metric structure d_X to one on P(X). We will be specifically interested in metrics on P(X) derived from optimal transport problems. The p-Wasserstein metric with p ∈ [1, ∞) is defined by:

W_p(d_X)(µ, ν)^p = inf_{γ ∈ Γ(µ,ν)} ∫_{X×X} d_X^p dγ    (3)

Villani (2008) establishes that if d_X is a (pseudo-)metric, so is W_p(d_X). The natural 'Dirac' embedding of X into P(X) is isometric (there is only one coupling). The idea behind the definition is that d_X^p is used as a measure of the cost of transporting units of mass in X, while a coupling γ specifies how to transport the µ distribution to the ν one. One can therefore compute the mean transportation cost under γ, and pick the optimal γ. Hence the name optimal transport.

In most of the paper, we are concerned with the case X = R^d for some large d, with the metric structure d_X given by the Euclidean norm, and we wish to compute the W_2 metric between distributions with finite support. Since OT metrics are costly to compute in high dimension, to estimate them efficiently and mitigate the impact of dimension, we will push the data along a well-chosen family of maps f with a low-dimensional co-domain Y, also equipped with the Euclidean metric. The reduction maps may be linear or not. They have to be non-expansive to guarantee that the associated pullback metrics are always below the Euclidean one, and therefore we obtain a lower estimate of W_2(d_X).

3. Approximate OT with General Projections -GPW

With the ingredients from the above section in place, we can now construct a general framework for approximating Wasserstein-like metrics by low-dimensional mappings of X. We write simply W instead of W_p, as the value of p plays no role in the development. Pick two metric spaces (X, d_X), (Y, d_Y), and a family S = (f_φ : X → Y; φ ∈ S) of mappings from X to Y. Define a map from P(X) × P(X) to the non-negative reals as follows:

d_S(µ, ν) = sup_φ W(d_Y)(f_φ#(µ), f_φ#(ν))    (4)

Equivalently and more concisely, d_S can be defined as:

d_S(µ, ν) = sup_φ W(f_φ*(d_Y))(µ, ν)    (5)

It is easily seen that:
1. the two definitions are equivalent;
2. d_S is a pseudo-metric on P(X);
3. d_S is a metric (not just a pseudo one) if the family (f_φ) jointly separates points in X; and
4. if the f_φ's are non-expansive from (X, d_X) to (Y, d_Y), then d_S ≤ W(d_X).

The second point follows readily from the second definition. Each f_φ*(d_Y) is a pseudo-metric on X obtained by pulling back d_Y (see the preceding section); hence so is W(f_φ*(d_Y)) on P(X), and therefore d_S, being the supremum of this family (in the lattice of pseudo-metrics over P(X)), is itself a pseudo-metric. The first definition is important because it allows one to perform the OT computation in the target space, where it is cheaper. Thus we have derived from S a pseudo-metric d_S on the space of probability measures P(X). We assume from now on that the mappings in S are non-expansive. By point 4 above, we know that d_S is bounded above by W(d_X). We call d_S the generalized projected Wasserstein metric (GPW) associated to S. In good cases, it is both cheaper to compute and a good estimate.
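To make the construction concrete, the following sketch (our illustration, not code from the paper) instantiates the family S with the simplest admissible choice: unit-norm linear projections onto R, which are non-expansive under the Euclidean metric. Maximizing the 1-D W_2 over this family yields a lower bound on W_2(d_X), as point 4 guarantees. The helper `w2_1d` uses the exact sorting formula for equal-size uniform empirical measures on the line.

```python
import numpy as np

def w2_1d(x, y):
    # Exact W2 between two equal-size uniform empirical measures on R:
    # sort both samples and match them in increasing order.
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

def gpw_lower_bound(X, Y, n_dirs=200, seed=0):
    # GPW estimate d_S with S = {x -> <theta, x> : ||theta|| = 1}.
    # Each map is 1-Lipschitz, so the supremum is a lower bound on W2.
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)  # unit norm => non-expansive
        best = max(best, w2_1d(X @ theta, Y @ theta))
    return best
```

With richer families (e.g. the neural networks of §3.2) the supremum is searched by gradient ascent rather than by enumeration, but the lower-bound property is the same.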

3.1. SRW as an instance of GPW

In Paty & Cuturi (2019), the authors propose to estimate W_2 metrics by projecting the ambient Euclidean X into k-dimensional linear Euclidean subspaces. Specifically, their derived metric on P(X), written S_k, can be defined as (Paty & Cuturi, 2019, Th. 1, Eq. 4):

S_k^2(µ, ν) = sup_Ω W_2^2(d_Y)(Ω^{1/2}#(µ), Ω^{1/2}#(ν))    (6)

where: 1) d_Y is the Euclidean metric on Y; 2) Ω ranges over the positive semi-definite matrices of trace k (and therefore admitting a well-defined square root) with associated semi-metric smaller than d_X. We recognize a particular case of our framework where the family of mappings is given by the linear maps Ω^{1/2} : R^d = X → Y = R^k under the constraints above. In particular, all mappings used are linear. The authors complement the general properties of the approach with a specific explicit bound on the error, showing that S_k^2 ≤ W_2^2(d_X) ≤ (d/k) S_k^2. In the general case, no such upper bound is available, and one has only the lower one.
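One admissible member of this family (our own illustration, not taken from Paty & Cuturi's implementation) is the orthogonal projection onto a random k-dimensional subspace: it is positive semi-definite, has trace k, and satisfies Ω ⪯ I, so the induced semi-metric is below d_X and x → Ω^{1/2}x is non-expansive. Since a projection is idempotent, here Ω^{1/2} = Ω.

```python
import numpy as np

def random_srw_projection(d, k, seed=0):
    # An admissible Omega from the SRW constraint set: the orthogonal
    # projection onto a random k-dimensional subspace of R^d.
    # It is PSD, has trace k, and its eigenvalues lie in {0, 1} <= 1.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))  # d x k orthonormal basis
    return Q @ Q.T                                 # idempotent: Omega^{1/2} = Omega
```

SRW then maximizes the projected W_2^2 over all such Ω (and, more generally, over non-projection matrices satisfying the same constraints), which is what makes it a special, linear case of GPW.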

3.2. Non-linear embeddings for approximating Wasserstein distances

Using the same Euclidean metric spaces X = R^d, Y = R^k, we observe that our framework does not restrict us to linear mappings. One could use a family of mappings given by a neural network (f_φ : X → Y; φ ∈ S), where φ ranges over network weights. However, not every φ is admissible. Indeed, by point 4) in the list of properties of d_S, we need the f_φ's to be non-expansive. Ideally, we would pick S to be the set of all weights such that f_φ is non-expansive. There are two problems one needs to solve in order to reduce this idea to actual tractable computations. First, one needs an efficient gradient-based search for the weights φ which maximize W(d_Y)(f_φ#(µ), f_φ#(ν)) (see equation 4). Second, as a gradient update may take the current f_φ out of the set of non-expansive maps, one needs to project back efficiently into this set. Both problems already have solutions which we are going to re-use. For the first, we will use the Sinkhorn Divergence (SD) (Genevay et al., 2017). Recent work by Feydy et al. (2018) shows that SD, which one can think of as a regularized version of W, is a sound choice as a loss function in machine learning. It can approximate W closely and without bias (Genevay et al., 2017), has better sample complexity (Genevay et al., 2019) as well as quadratic computation time, and, most importantly, it is fully differentiable. For the second problem, one can 'Lipschify' the linear layers of the network by dividing them by their (operator) norm after each update. We will use linear layers with Euclidean metrics, and this will require estimating the spectral radius of each layer. The same could be done with linear layers using a mixture of L1, L2 and L∞ metrics. In fact, computing the L1 → L1 operator norm of a linear layer is an exact operation, as opposed to the spectral norm for the L2 → L2 case, which we approximate using the power method.
Note that the power method can only approximate the L2 norm, and the gradient ascent methods used in the maximization phase are stochastic, making our approximation susceptible to more sources of variability. However, the method is extremely efficient, since it requires the computation of optimal transport distances only in the low-dimensional space. We can see this as a trade-off between exactness and efficiency.
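The contrast between the two operator norms can be sketched as follows (a minimal NumPy illustration of ours, not the paper's implementation): the L1 → L1 norm of a weight matrix has a closed form (the maximum absolute column sum), while the L2 → L2 norm (largest singular value) is approximated iteratively by the power method.

```python
import numpy as np

def l1_operator_norm(W):
    # Exact L1 -> L1 operator norm: maximum absolute column sum.
    return np.abs(W).sum(axis=0).max()

def spectral_norm(W, n_iters=50, seed=0):
    # Approximate L2 -> L2 operator norm (largest singular value of W)
    # via the power method applied to W^T W.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    for _ in range(n_iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)
```

The power method converges geometrically at a rate governed by the gap between the two largest singular values, which is why a handful of iterations per projection step usually suffices in practice.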

4. Computational details

In this section we propose Algorithm 1 for stochastically estimating d_S between two measures with finite support, where the class of mappings S is as defined above. Note that this algorithm can further be used during the training of a discriminator as part of a generative network with an optimal transport objective, similar to Genevay et al. (2017). The Sinkhorn Divergence alternative for d_S uses Sinkhorn divergences as a proxy for OT (compare with equation 4):

SD_{φ,ε}(µ, ν) = W_ε(d_Y)(f_φ#(µ), f_φ#(ν)) − (1/2) W_ε(d_Y)(f_φ#(µ), f_φ#(µ)) − (1/2) W_ε(d_Y)(f_φ#(ν), f_φ#(ν))    (7)

where W_ε is the well-known Sinkhorn regularized OT problem (Cuturi, 2013). The non-parameterized version of the divergence has been shown by Feydy et al. (2018) to be an unbiased estimator of W(µ, ν), converging to the true OT distance as ε → 0. Their paper also constructs an effective numerical scheme for computing the gradients of the Sinkhorn divergence on GPU, without having to back-propagate through the Sinkhorn iterations, by using auto-differentiation and the detach methods available in PyTorch (Paszke et al., 2019). Moreover, Schmitzer (2019) devised an ε-scaling scheme to trade off guaranteed convergence against speed, which gives us further control over how fast the algorithm is. It is important to note that the minimization computation happens in the low-dimensional space, differently from the approach in Paty & Cuturi (2019), which makes our algorithm scale better with dimension, as seen in §5. Feydy et al. (2018) established that the gradient of (7) with respect to the input measures µ, ν is given by the optimal dual potentials. Since we push the measures through a differentiable function f_φ, we can perform the maximization step via a stochastic gradient ascent method such as SGD or ADAM (Kingma & Ba, 2014). Finally, after each iteration, we project f_φ back into the space of 1-Lipschitz functions.
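For reference, here is a minimal, CPU-only sketch of the debiased divergence (our simplified illustration: log-domain Sinkhorn with uniform weights and a fixed ε, reporting the primal transport cost ⟨P, C⟩; the paper instead relies on the GPU scheme of Feydy et al. (2018) with ε-scaling):

```python
import numpy as np

def _logsumexp(A, axis):
    # Numerically stable log-sum-exp along an axis.
    M = A.max(axis=axis, keepdims=True)
    return (M + np.log(np.exp(A - M).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_cost(X, Y, eps=0.05, n_iters=300):
    # Entropy-regularized OT cost between uniform empirical measures on the
    # rows of X and Y, squared Euclidean ground cost, log-domain iterations.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    n, m = C.shape
    log_a, log_b = np.full(n, -np.log(n)), np.full(m, -np.log(m))
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iters):  # alternate dual potential updates
        f = -eps * _logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :])
    return (P * C).sum()

def sinkhorn_divergence(X, Y, eps=0.05):
    # Debiased divergence: subtract half of each self-transport term,
    # so that the divergence of a measure with itself is exactly zero.
    return (sinkhorn_cost(X, Y, eps)
            - 0.5 * sinkhorn_cost(X, X, eps)
            - 0.5 * sinkhorn_cost(Y, Y, eps))
```

The debiasing terms are what make the divergence vanish at µ = ν, which is essential when it serves as a loss.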
For the L2 → L2 case, the Lipschitz constant of a fully connected layer is given by the spectral norm of its weights, which can be approximated in a few iterations of the power method. Since non-linear activation functions such as ReLU are 1-Lipschitz, in order to project back into the constraint set we normalize each layer's weights by the spectral norm, i.e. for layer i we set φ_i := φ_i / ||φ_i||. Previous work by Neyshabur et al. (2017), Yoshida & Miyato (2017) and Miyato et al. (2018) showed that with smaller-magnitude weights, the model generalizes better and improves the quality of generated samples when used in the discriminator of a GAN. We note that if we let f_φ be a 1-layer fully connected network with no activation, the optimization we perform is very similar to that of Paty & Cuturi (2019). The space of 1-Lipschitz functions we optimize over is larger and our method is stochastic, but we recover very similar results at convergence. Moreover, our method applies to situations where the data lives on a non-linear manifold that an f_φ such as a neural network is able to model. Comparing the numerical properties of the Subspace Robust Wasserstein distances in (6) with our Generalized Projected Wasserstein distances is the focus of the next section.
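The projection step can be sketched as follows (our illustration; we use NumPy's exact spectral norm in place of the power-method estimate the algorithm would use in practice). Each layer's weights are divided by their spectral norm, and since ReLU is itself 1-Lipschitz, the rescaled composition is non-expansive end to end.

```python
import numpy as np

def project_1_lipschitz(weights):
    # Projection step phi_i := phi_i / ||phi_i||: rescale each layer's
    # weight matrix by its spectral norm so every layer is non-expansive.
    return [W / np.linalg.norm(W, 2) for W in weights]

def forward(weights, x):
    # A 2-layer ReLU network f_phi : R^d -> R^k, as used in the experiments.
    return weights[1] @ np.maximum(weights[0] @ x, 0.0)
```

After projection, |f_φ(x) − f_φ(y)| ≤ |x − y| holds for any pair of inputs, which is exactly the constraint d_S needs to remain a lower bound.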

Algorithm 1 Ground metric parameterization through φ

Input: measures µ = Σ_{i=1}^n a_i δ_{x_i} and ν = Σ_{j=1}^n b_j δ_{y_j}; f_φ : R^d → R^k a 2-layer network with dimensions (d, 20, k), 1-Lipschitz; ADAM optimizer; power method iterations λ; SD_{φ,ε} unbiased Sinkhorn Divergence.
Output: f_φ, SD_{φ,ε}
Initialize: lr, ε, λ, f_φ ~ N(0, 10), Objective ← SD(blur = ε, p = 2, debias = True)
for t = 1, . . . , maxiter do
  L ← −SD_{φ,ε}(f_φ#µ, f_φ#ν)   (push forward through f_φ and evaluate SD in the low-dimensional space)
  grad_φ ← Autodiff(L)   (maximization step with autodiff)
  φ ← φ + ADAM(grad_φ)   (gradient step with ADAM and learning-rate scheduler)
  φ ← Proj_{1-Lip}^λ(φ)   (projection into the space of 1-Lipschitz functions)
end for

5. Experiments

We consider experiments similar to those presented in Forrow et al. (2019) and Paty & Cuturi (2019), and show the mean estimate of SD²_{φ,k}(µ, ν) for different values of k, as well as robustness to noise. We also show how close the distance generated by the linear projector of Paty & Cuturi (2019) is to ours, and highlight the trade-off in computation time as the number of dimensions increases. In order to illustrate our method, we construct two empirical distributions μ̂, ν̂ by taking samples from two independent measures µ = N(0, Σ_1) and ν = N(0, Σ_2) living in a 10-dimensional space. Similarly to Paty & Cuturi (2019), we construct the covariance matrices Σ_1, Σ_2 to be of rank 5, i.e. the support of the distributions is given by a 5-dimensional linear subspace. Throughout our experiments we fix f_φ to be a 2-layer neural network with a hidden layer of 16 units, ReLU activations, and output of dimension k. We initialize the weights from N(0, 10) and use a standard ADAM optimizer with a decaying cyclic learning rate (Smith, 2017) bounded by [0.1, 1.0]. Decreasing and increasing the learning rate via a scheduler helps us avoid local optima. The batch size for the algorithm is set to n = 500, which is the same number of samples that make up the two measures. As for the remaining parameters, we set the regularization strength small enough, ε = 0.001, and the scaling to ε-scaling = 0.95, so that we can accurately estimate the true optimal transport distance without spending too much computational time in the Sinkhorn iterates.
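The synthetic data can be generated along the following lines (a sketch under our own assumptions about the construction; the paper does not spell out how the rank-5 covariances are built):

```python
import numpy as np

def low_rank_gaussian(n, d=10, rank=5, seed=0):
    # Samples from N(0, Sigma) with Sigma = A A^T of rank `rank`, so the
    # support of the empirical measure lies in a rank-dimensional
    # linear subspace of R^d.
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d, rank))
    X = rng.normal(size=(n, rank)) @ A.T  # x = A z with z ~ N(0, I_rank)
    return X, A @ A.T
```

Two independent draws of (A, X) with different seeds give the two measures µ and ν used throughout this section.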

5.1. OT estimation on 10-D Gaussian data using SD_{φ,k}

This leaves us with three variables of interest during the computation of SD_{φ,k}: k, d and λ (latent dimension, input dimension, and power method iterations). The number of power method iterations plays an important role during the projection step: with too few iterations, there is a chance of breaking the Lipschitz constraint, while running the method for too long is computationally expensive. In Figure 1 we use λ = 5 power iterations and show the values of SD²_{φ,k} after running Algorithm 1 for 500 iterations. We compare them to the true OT distance for various values of k and observe that, even with a small number of power iterations, the estimate approaches the true value as k increases. Furthermore, we see that for k = 5 and k = 7 the algorithm converges after 200 steps. Using 20 power iterations, we show how the approximation behaves in the presence of noise as a function of the latent dimension k. We add Gaussian noise of the form N(0, I) to μ̂, ν̂ and show in Figure 2 the comparison between the noiseless and noisy settings for both the SRW distances defined in (6) and GPW in (4). We observe that SD²_{φ,k} behaves similarly to S²_k in the presence of noise.

5.2. Computation time

In Figure 8 of Paty & Cuturi (2019), the authors note that their method, when using Sinkhorn iterates, is quadratic in dimension because of the eigen-decomposition of the displacement matrix. Fundamentally differently, we always optimize in the embedded space, making the computation of the Sinkhorn iterates linear in dimension. Note that there is extra computation involved in pushing the measures through the neural network and back-propagating, as well as in the projection step, which depends on the power method. To run this experiment we set λ = 5 and generate μ̂, ν̂ by varying the dimension d while keeping the rank of Σ_1, Σ_2 equal to 5. The latent dimension is fixed to k = 5. In Figure 3 we plot the normalized distances of the two approaches as a function of dimension and see that the gap grows with increasing dimension, but remains stable. In Figure 4 we plot the log of the relative computation time, taking d = 10 as the benchmark in both cases. We see that the time to compute SD²_φ is linear in dimension and significantly lower than that of S²_k as the number of dimensions increases. This can be traced back to Algorithm 1 here and Algorithm 2 of Paty & Cuturi (2019), where at each iteration step the computation of OT distances in the data space is prohibitively expensive.

6. Conclusion

In this paper we presented a new framework for approximating optimal transport distances using a wide family of 1-Lipschitz embedding functions. We showed how linear projectors can be considered a special case of such functions, and proceeded to define neural networks as another class of embeddings. We showed how existing tools can be used to build an efficient algorithm that is robust and scales well with the dimension of the data. Future work includes showing that the approximation remains valid for datasets where the support of the distributions lies on a low-dimensional non-linear manifold, where we hypothesize that linear projectors would fail. Other directions include experimenting with different operator norms, such as L1 or L∞, for the linear layers, and the approximation of W_1. An extension of the projection step in Algorithm 1 to convolutional layers would allow us to experiment with real datasets such as CIFAR-10 and learn a discriminator in an adversarial way with SD_{φ,k} as a loss function. This could be used to show that the data naturally clusters in the embedding space.



Figure 1: Mean estimate of SD²_φ(µ, ν) for different values of the latent dimension k. The horizontal line is constant and shows the true W_2(µ, ν). The shaded area shows the standard deviation over 20 runs.

Figure 2: Mean normalized distances with and without noise for SD²_φ(µ, ν) and S²_k(µ, ν) as a function of the latent dimension k. The shaded area shows the standard deviation over 20 runs.

Figure 3: Comparison between normalized SD²_φ(µ, ν) and normalized S²_k(µ, ν) as a function of dimension. The shaded area shows the standard deviation over 20 runs.

