OUTLIER-ROBUST OPTIMAL TRANSPORT

Abstract

Optimal transport (OT) provides a way of measuring distances between distributions that depends on the geometry of the sample space. In light of recent advances in solving the OT problem, OT distances are widely used as loss functions in minimum distance estimation. Despite its prevalence and advantages, however, OT is extremely sensitive to outliers: a single adversarially picked outlier can increase the OT distance arbitrarily. To address this issue, we propose an outlier-robust OT formulation. Our formulation is convex but, at first glance, challenging to scale. We therefore derive an equivalent formulation based on cost truncation that is easy to incorporate into modern stochastic algorithms for regularized OT. We demonstrate our model on mean estimation under the Huber contamination model in simulation and on outlier detection on real data.

1. INTRODUCTION

Optimal transport is a fundamental problem in applied mathematics. In its original form (Monge, 1781), the problem entails finding the minimum-cost way to transport mass from a prescribed probability distribution $\mu$ on $\mathcal{X}$ to another prescribed distribution $\nu$ on $\mathcal{X}$. Kantorovich (1942) relaxed Monge's formulation of the optimal transport problem to obtain the Kantorovich formulation:

$$\mathrm{OT}(\mu, \nu) \triangleq \min_{\Pi \in \mathcal{F}(\mu, \nu)} \mathbb{E}_{(X_1, X_2) \sim \Pi}\big[c(X_1, X_2)\big], \tag{1.1}$$

where $\mathcal{F}(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$ (probability distributions on $\mathcal{X} \times \mathcal{X}$ whose marginals are $\mu$ and $\nu$) and $c$ is a cost function, for which we typically assume $c(x, y) \ge 0$ and $c(x, x) = 0$. Compared to other notions of distance between probability distributions, optimal transport uniquely depends on the geometry of the sample space.

Recent advancements in optimization for optimal transport (Cuturi, 2013; Solomon et al., 2015; Genevay et al., 2016; Seguy et al., 2018) enabled its broad adoption in machine learning applications where the geometry of the data is important. See (Peyré & Cuturi, 2018) for a survey. Optimal transport has found applications in natural language processing (Kusner et al., 2015; Huang et al., 2016; Alvarez-Melis & Jaakkola, 2018; Yurochkin et al., 2019), generative modeling (Arjovsky et al., 2017), clustering (Ho et al., 2017), domain adaptation (Courty et al., 2014; 2017), large-scale Bayesian modeling (Srivastava et al., 2018), and many other domains.

Many applications use OT as a loss in an optimization problem of the form

$$\hat{\theta} \in \arg\min_{\theta \in \Theta} \mathrm{OT}(\mu_n, \nu_\theta), \tag{1.2}$$

where $\{\nu_\theta\}_{\theta \in \Theta}$ is a collection of parametric models and $\mu_n$ is the empirical distribution of the samples. Such estimators are called minimum Kantorovich estimators (MKE) (Bassetti et al., 2006). They are popular alternatives to likelihood-based estimators, especially in generative modeling.
For example, when $\mathrm{OT}(\cdot, \cdot)$ is the Wasserstein-1 distance and $\nu_\theta$ is a generator parameterized by a neural network with weights $\theta$, equation 1.2 corresponds to the Wasserstein GAN (Arjovsky et al., 2017).

One drawback of optimal transport is its sensitivity to outliers. Because all the mass in $\mu$ must be transported to $\nu$, a small fraction of outliers can have an outsized impact on the optimal transport problem. For statistics and machine learning applications in which the data is corrupted or noisy, this is a major issue. For example, the poor performance of Wasserstein GANs in the presence of outliers was noted in recent works on outlier-robust generative learning with $f$-divergence GANs (Chao et al., 2018; Wu et al., 2020). The problem of outlier-robustness in MKE has not been studied, with the exception of two concurrent works (Staerman et al., 2020; Balaji et al., 2020).

In this paper, we propose a modification of OT to address its sensitivity to outliers. Our formulation can be used as a loss in equation 1.2 so that it is robust to a small fraction of outliers in the data. To keep things simple, we consider the $\varepsilon$-contamination model (Huber & Ronchetti, 2009). Let $\nu_{\theta_0}$ be a member of a parametric model $\{\nu_\theta : \theta \in \Theta\}$ and let

$$\mu = (1 - \varepsilon)\,\nu_{\theta_0} + \varepsilon\,\tilde{\nu},$$

where $\mu$ is the data-generating distribution, $\varepsilon > 0$ is the fraction of outliers, and $\tilde{\nu}$ is the distribution of the outliers. Although the fraction of outliers is capped at $\varepsilon$, the value of the outliers is arbitrary, so the outliers may have an arbitrarily large impact on the optimal transport problem. Our goal is to modify the optimal transport problem so that it is more robust to outliers. We have in mind the downstream application of learning $\theta_0$ from (samples from) $\mu$ in the $\varepsilon$-contamination model.

Our main contributions are as follows:
1. We propose a robust OT formulation that is suitable for statistical estimation in the $\varepsilon$-contamination model using MKE.
2. We show that our formulation is equivalent to the original OT problem with a clipped transport cost. This connection enables us to leverage the voluminous literature on computational optimal transport to develop efficient algorithms for MKE that are robust to outliers.
3. Our formulation enables a new application of optimal transport: outlier detection in data.

2.1. ROBUST OT FOR MKE

To promote outlier-robustness in MKE, we need to allow the corresponding OT problem to ignore the outliers in the data distribution $\mu$. The $\varepsilon$-contamination model imposes a cap on the fraction of outliers, so it is not hard to see that $\|\mu - \nu_{\theta_0}\|_{TV} \le \varepsilon$, where $\|\cdot\|_{TV}$ is the total-variation norm defined as $\|\mu\|_{TV} = \frac{1}{2}\int |\mu(dx)|$. This suggests we solve a TV-constrained/regularized version of equation 1.2. The constrained version

$$\min_{\theta \in \Theta,\ \tilde{\mu}} \mathrm{OT}(\tilde{\mu}, \nu_\theta) \quad \text{subject to} \quad \|\mu - \tilde{\mu}\|_{TV} \le \varepsilon$$

suffers from identification issues. In particular, it cannot distinguish between "clean" distributions within TV distance $\varepsilon$ of $\nu_{\theta_0}$. This makes it unsuitable as a loss function for statistical estimation, because it cannot lead to a consistent estimator. However, its regularized counterpart

$$\min_{\theta \in \Theta,\ s} \mathrm{OT}(\mu + s, \nu_\theta) + \lambda \|s\|_{TV}, \tag{2.1}$$

where $\lambda > 0$ is a regularization parameter, does not suffer from this issue. In the rest of this paper, we work with the TV-regularized formulation equation 2.1.

The main idea of our formulation is to allow for modifications of $\mu$, while penalizing their magnitude and ensuring that the modified $\mu$ is still a probability measure. Below we formulate this intuition as an optimization problem titled ROBOT (ROBust Optimal Transport):

Formulation 1:

$$\mathrm{ROBOT}(\mu, \nu) = \left\{\begin{array}{ll} \min\limits_{\Pi \in \mathcal{F}_+(\mathbb{R}^d \times \mathbb{R}^d),\ s \in \mathcal{F}(\mathbb{R}^d)} & \displaystyle\iint C(x, y)\,\Pi(dx, dy) + \lambda \|s\|_{TV} \\[6pt] \text{subject to} & \displaystyle\int_{B \times \mathbb{R}^d} \Pi(dx, dy) = \int_B \big(\mu(dx) + s(dx)\big) \ge 0 \quad \forall\, B \in \mathcal{B}(\mathbb{R}^d) \text{ (Borel } \sigma\text{-algebra)} \\[6pt] & \displaystyle\int_{\mathbb{R}^d \times C} \Pi(dx, dy) = \int_C \nu(dy) \quad \forall\, C \in \mathcal{B}(\mathbb{R}^d) \\[6pt] & \displaystyle\int_{\mathbb{R}^d} s(dx) = 0. \end{array}\right. \tag{2.2}$$

Here $\mathcal{F}(\mathbb{R}^d)$ denotes the set of all signed measures with finite total variation on $\mathbb{R}^d$, and $\mathcal{F}_+(\mathbb{R}^d \times \mathbb{R}^d)$ is the set of all (nonnegative) measures with finite total variation on $\mathbb{R}^d \times \mathbb{R}^d$. The first and the last constraints ensure that $\mu + s$ is a valid probability measure, while $\lambda \|s\|_{TV}$ penalizes the amount of modification of $\mu$. It is worth noting that we can identify the exact locations of outliers in $\mu$ by inspecting $\mu + s$: if $\mu(x) + s(x) = 0$, then $x$ was eliminated and is an outlier.
ROBOT, unlike classical OT, guarantees that adversarially picked outliers cannot increase the distance arbitrarily. Let $\tilde{\mu} = (1 - \varepsilon)\mu + \varepsilon\mu_c$, i.e., $\tilde{\mu}$ is $\mu$ contaminated with outliers from $\mu_c$, and let $\nu$ be an arbitrary measure (in MKE, $\tilde{\mu}$ is the contaminated data and $\nu$ is the model we learn). An adversary can arbitrarily increase $\mathrm{OT}(\tilde{\mu}, \nu)$ by manipulating the outlier distribution $\mu_c$. For ROBOT we have the following bound:

Theorem 2.1. Let $\tilde{\mu} = (1 - \varepsilon)\mu + \varepsilon\mu_c$ for some $\varepsilon \in [0, 1)$. Then

$$\mathrm{ROBOT}(\tilde{\mu}, \nu) \le \big(\mathrm{OT}(\mu, \nu) + \lambda\varepsilon\|\mu - \mu_c\|_{TV}\big) \wedge \lambda\|\tilde{\mu} - \nu\|_{TV} \wedge \mathrm{OT}(\tilde{\mu}, \nu). \tag{2.3}$$

This bound has two key takeaways. First, since the TV distance between any two distributions is bounded by 1, an adversary cannot increase $\mathrm{ROBOT}(\tilde{\mu}, \nu)$ arbitrarily. Second, in the absence of outliers, ROBOT is bounded by classical OT. See Appendix C for the proof.

Related work

We note a connection between equation 2.2 and unbalanced OT (UOT) (Chizat, 2017; Chizat et al., 2018). UOT is typically formulated by replacing the TV norm with $\mathrm{KL}(\mu + s \,|\, \mu)$ and adding an analogous term for $\nu$. Chizat et al. (2018) studied entropy-regularized UOT with various divergences penalizing marginal violations. Optimization problems similar to equation 2.2 have also been considered outside of the ML literature (Piccoli & Rossi, 2014; Liero et al., 2018). We are unaware of prior applications of UOT to outlier-robustness, but it was studied in the concurrent work of Balaji et al. (2020). Another relevant variation of OT is partial OT (Figalli, 2010; Caffarelli & McCann, 2010). It may also be considered for outlier-robustness, but it has the drawback of forcing mass destruction rather than adjusting marginals to ignore outliers when they are present. A concurrent work by Staerman et al. (2020) took a different path: they replaced the expectation in the Wasserstein-1 dual with a median-of-means to promote robustness. It is unclear what the corresponding primal is, making it hard to interpret as an optimal transport problem.

A major challenge with the aforementioned methods, including our Formulation 1, is the difficulty of the optimization problem. This is especially the case for MKEs, where a transport problem has to be solved in every iteration to obtain the gradient of the model parameters. Chizat et al. (2018) proposed a Sinkhorn-like algorithm for entropy-regularized UOT, but it is not amenable to stochastic optimization. Balaji et al. (2020) proposed a stochastic optimization algorithm based on the UOT dual, but it requires two additional neural networks (a total of four including the dual potentials) to parameterize the modified marginal distributions (i.e., $\mu + s$ and the analogous one for $\nu$). Optimizing with a median-of-means in the objective function as in (Staerman et al., 2020) is also challenging.
The key contribution of our work is a formulation equivalent to equation 2.2, which is easily compatible with the large body of classical OT optimization techniques (Cuturi, 2013; Solomon et al., 2015; Genevay et al., 2016; Seguy et al., 2018).

More efficient equivalent formulation

At first glance, there are two issues with equation 2.2: it appears asymmetric and it is unclear if it can be optimized efficiently. Below we present an equivalent formulation that is free of these issues:

Formulation 2:

$$\mathrm{ROBOT}(\mu, \nu) = \left\{\begin{array}{ll} \min\limits_{\Pi \in \mathcal{F}_+(\mathbb{R}^d \times \mathbb{R}^d)} & \displaystyle\iint C_\lambda(x, y)\,\Pi(dx, dy) \\[6pt] \text{subject to} & \displaystyle\int_{B \times \mathbb{R}^d} \Pi(dx, dy) = \int_B \mu(dx) \quad \forall\, B \in \mathcal{B}(\mathbb{R}^d) \\[6pt] & \displaystyle\int_{\mathbb{R}^d \times C} \Pi(dx, dy) = \int_C \nu(dy) \quad \forall\, C \in \mathcal{B}(\mathbb{R}^d), \end{array}\right. \tag{2.4}$$

where $C_\lambda$ is the truncated cost function defined as $C_\lambda(x, y) = C(x, y) \wedge 2\lambda$. Looking at equation 2.4, it is not apparent that it adds robustness to MKE, but it is symmetric, easy to combine with entropic regularization by simply truncating the cost, and benefits from stochastic optimization algorithms (Genevay et al., 2016; Seguy et al., 2018). This formulation also has a distant relation to the idea of loss truncation for achieving robustness (Shen & Sanghavi, 2019). Pele & Werman (2009) considered the Earth Mover's Distance (discrete OT) with truncated cost to achieve computational improvements; they also mentioned its potential to promote robustness against outlier noise but did not explore this direction.

In Section 3, we establish the equivalence between the two ROBOT formulations, equation 2.2 and equation 2.4. This equivalence allows us to obtain an efficient algorithm based on equation 2.4 for robust MKE. We also provide a simple procedure for computing the optimal $s$ in equation 2.2 from the solution of equation 2.4, enabling a new OT application: outlier detection. We verify the effectiveness of robust MKE and outlier detection in our experiments in Section 4.
Before presenting the equivalence proof, we formulate the discrete analogs of the two ROBOT formulations for their practical value.

2.2. DISCRETE ROBOT FORMULATIONS

In practice we typically encounter samples from the distributions, rather than the distributions themselves. Sampling is also built into stochastic optimization. In this subsection, we present the discrete versions of the ROBOT formulations. The key detail is that, in equation 2.2, $\mu$, $\nu$ and $s$ are all supported on $\mathbb{R}^d$, while in the discrete case the empirical measures $\mu_n \in \Delta^{n-1}$ and $\nu_m \in \Delta^{m-1}$ are supported on a set of points ($\Delta^r$ is the unit probability simplex in $\mathbb{R}^r$). As a result, to formulate a discrete version of equation 2.2, we need to augment $\mu_n$ and $\nu_m$ with each other's supports. To be precise, let $\mathrm{supp}(\mu_n) = \{X_1, \dots, X_n\}$ and $\mathrm{supp}(\nu_m) = \{Y_1, \dots, Y_m\}$. Define $\mathcal{C} = \{Z_1, Z_2, \dots, Z_{m+n}\} = \{X_1, \dots, X_n, Y_1, \dots, Y_m\}$. Then the discrete analog of equation 2.2 is

Formulation 1 (discrete):

$$\mathrm{ROBOT}(\mu_n, \nu_m) = \left\{\begin{array}{ll} \min\limits_{\Pi \in \mathbb{R}^{(m+n) \times (m+n)},\ \mathbf{s} \in \mathbb{R}^{m+n}} & \langle C_{aug}, \Pi \rangle + \lambda\big[\|s_1\|_1 + \|t_1\|_1\big] \\[6pt] \text{subject to} & \Pi \mathbf{1}_{m+n} = \begin{bmatrix} \mu_n + s_1 \\ t_1 \end{bmatrix}, \quad \Pi^\top \mathbf{1}_{m+n} = \begin{bmatrix} 0 \\ \nu_m \end{bmatrix}, \\[6pt] & \Pi \succeq 0, \quad \mathbf{1}_{m+n}^\top \mathbf{s} = 0, \end{array}\right. \tag{2.5}$$

where $C_{aug} \in \mathbb{R}^{(m+n) \times (m+n)}$ is the augmented cost matrix with entries $C_{aug,i,j} = c(Z_i, Z_j)$ ($c$ is the ground cost, e.g., squared Euclidean distance), $\mathbf{s} = (s_1^\top, t_1^\top)^\top$ with $s_1 \in \mathbb{R}^n$ and $t_1 \in \mathbb{R}^m$, and $\mathbf{1}_r$ is the vector of all ones in $\mathbb{R}^r$. The TV norm is replaced with its discrete analog, the $L_1$ norm. As with its continuous counterpart, this optimization problem is harder than typical OT because of the additional constraint, the additional optimization variable $\mathbf{s}$, and the increased cost matrix size.

The discrete analog of equation 2.4 is straightforward:

Formulation 2 (discrete):

$$\mathrm{ROBOT}(\mu_n, \nu_m) = \min_{\Pi \in \mathbb{R}^{n \times m}} \langle C_\lambda, \Pi \rangle \quad \text{subject to} \quad \Pi \mathbf{1}_m = \mu_n, \quad \Pi^\top \mathbf{1}_n = \nu_m, \quad \Pi \succeq 0, \tag{2.6}$$

where $C_{\lambda,i,j} = c(X_i, Y_j) \wedge 2\lambda$. As in the continuous case, it is easy to adapt modern (regularized) OT solvers to this problem without any computational overhead, and the formulations in equation 2.5 and equation 2.6 are equivalent. It is also possible to recover $\mathbf{s}$ of equation 2.5 from the solution of equation 2.6 to perform outlier detection.
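As a rough illustration of Formulation 2 (discrete), the truncated-cost problem in equation 2.6 can be posed as a small linear program. The sketch below is not the stochastic solver used in our experiments; it assumes SciPy's `linprog` is available and a squared Euclidean ground cost, and the function name `robot_truncated` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def robot_truncated(x, y, mu, nu, lam):
    """Solve discrete OT with the cost truncated at 2*lam (equation 2.6)."""
    n, m = len(mu), len(nu)
    # Squared Euclidean ground cost between every X_i and Y_j (an assumption).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost_trunc = np.minimum(cost, 2.0 * lam)
    # The plan is flattened row-major: entry (i, j) sits at index i * m + j.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))  # Pi 1_m = mu
    col_sums = np.kron(np.ones((1, n)), np.eye(m))  # Pi^T 1_n = nu
    res = linprog(
        cost_trunc.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([mu, nu]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(n, m), cost


# Toy data: the third point of x is an outlier far away from y.
x = np.array([[0.0], [1.0], [100.0]])
y = np.array([[0.0], [1.0], [2.0]])
mu = np.full(3, 1.0 / 3.0)
nu = np.full(3, 1.0 / 3.0)

robot_value, plan, cost = robot_truncated(x, y, mu, nu, lam=2.0)
ot_value, _, _ = robot_truncated(x, y, mu, nu, lam=np.inf)  # no truncation
```

On this toy input the outlier dominates the untruncated value, whereas the truncated value cannot exceed 2λ because the total transported mass is one.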
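Two-sided formulation

So far we have assumed that one of the input distributions does not have outliers, which is the setting of MKE, where the clean distribution corresponds to the model we learn. In some applications, both distributions may be corrupted. To address this case, we provide an equivalent two-sided formulation, analogous to UOT with the TV norm:

Formulation 3 (two-sided):

$$\mathrm{ROBOT}(\mu_n, \nu_m) = \left\{\begin{array}{ll} \min\limits_{\Pi \in \mathbb{R}^{(m+n) \times (m+n)},\ \mathbf{s}_1 \in \mathbb{R}^{m+n},\ \mathbf{s}_2 \in \mathbb{R}^{m+n}} & \langle C_{aug}, \Pi \rangle + \lambda\big[\|s_1\|_1 + \|t_1\|_1 + \|s_2\|_1 + \|t_2\|_1\big] \\[6pt] \text{subject to} & \Pi \mathbf{1}_{m+n} = \begin{bmatrix} \mu_n + s_1 \\ t_1 \end{bmatrix}, \quad \Pi^\top \mathbf{1}_{m+n} = \begin{bmatrix} s_2 \\ \nu_m + t_2 \end{bmatrix}, \\[6pt] & \Pi \succeq 0, \quad \mathbf{1}_{m+n}^\top \mathbf{s}_1 = 0, \quad \mathbf{1}_{m+n}^\top \mathbf{s}_2 = 0, \end{array}\right. \tag{2.7}$$

where $\mathbf{s}_1 = (s_1^\top, t_1^\top)^\top$ and $\mathbf{s}_2 = (s_2^\top, t_2^\top)^\top$.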

3. EQUIVALENCE OF THE ROBOT FORMULATIONS

In this section we present our main theorem, which establishes the equivalence between the two formulations of robust optimal transport:

Theorem 3.1. For any two measures $\mu$ and $\nu$, $\mathrm{ROBOT}(\mu, \nu)$ has the same value under both formulations, i.e., Formulation 1 is equivalent to Formulation 2 in both the continuous and the discrete case. Moreover, we can recover an optimal coupling of one formulation from the other.

Below we sketch the proof of this theorem and highlight some important techniques used in the proof. We focus on the discrete case as it is more intuitive and has concrete practical implications in our experiments. A complete proof can be found in Appendix A. See also Appendix A.2 for the proof of equivalence between Formulations 1, 2 and 3 in the discrete case.

3.1. PROOF SKETCH

In the remainder of this section we consider the discrete case, i.e., equation 2.5 for Formulation 1 (F1) and equation 2.6 for Formulation 2 (F2). Suppose $\Pi_2^*$ is an optimal solution of F2. We construct a feasible solution $(\Pi_1^*, \mathbf{s}_1^*)$, with $\mathbf{s}_1^* = (s_1^*, t_1^*)$, of F1 based on $\Pi_2^*$ with the same objective value as F2, and claim that $(\Pi_1^*, \mathbf{s}_1^*)$ is an optimal solution. We prove the claim by contradiction: if $(\Pi_1^*, \mathbf{s}_1^*)$ is not optimal, then there exists another pair $(\tilde{\Pi}_1, \tilde{\mathbf{s}}_1)$ which is optimal for F1 with a strictly smaller objective value. We then construct another feasible solution $\Pi_{2,new}^*$ of Formulation 2 which has the same objective value as $(\tilde{\Pi}_1, \tilde{\mathbf{s}}_1)$ has for F1. This implies $\Pi_{2,new}^*$ has a strictly smaller objective value for F2 than $\Pi_2^*$, which is a contradiction.

The two main pillars of this proof are (1) to construct a feasible solution of F1 starting from a feasible solution of F2 and (2) to show that the solution constructed is indeed optimal for F1. Hence step (1) gives a recipe to construct an optimal solution of F1 starting from an optimal solution of F2. We elaborate on the first point in the next subsection, which has practical implications for outlier detection. The other point is more technical; interested readers may go through the proof in Appendix A.1.

Algorithm 1 Generating optimal solution of F1 from F2
1: Start with $\Pi_2^* \in \mathbb{R}^{n \times m}$, an optimal solution of Formulation 2.
2: Create an augmented matrix $\Pi \in \mathbb{R}^{(m+n) \times (m+n)}$ of zeros. Divide $\Pi$ into four blocks, $\Pi = \begin{bmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{bmatrix}$, where $\Pi_{11}$ is $n \times n$, $\Pi_{12}$ is $n \times m$, $\Pi_{21}$ is $m \times n$ and $\Pi_{22}$ is $m \times m$.
3: Set $\Pi_{12} \leftarrow \Pi_2^*$ and collect all the indices $I = \{(i, j) : C_{i,j} > 2\lambda\}$.
4: Set $\Pi_{12}(i, j) \leftarrow 0$ for $(i, j) \in I$.
5: Set $\Pi_{22}(j, j) \leftarrow \sum_{i=1}^{n} \Pi_2^*(i, j)\,\mathbb{1}_{(i,j) \in I}$ for all $1 \le j \le m$ and set $\Pi_1^* \leftarrow \Pi$.
6: Set $s_1^*(i) = -\sum_{j=1}^{m} \Pi_2^*(i, j)\,\mathbb{1}_{(i,j) \in I}$ for all $1 \le i \le n$.
7: Set $t_1^*(j) = \Pi_{22}(j, j)$ for all $1 \le j \le m$.
8: return $\Pi_1^*, s_1^*, t_1^*$.

3.2. GOING FROM FORMULATION 2 TO FORMULATION 1

Let $\Pi_2^*$ (respectively $\Pi_1^*$) be an optimal solution of F2 (respectively F1). Recall that $\Pi_1^*$ has dimension $(m + n) \times (m + n)$. From the column sum constraint in F1, we need to take the first $n$ columns of $\Pi_1^*$ to be exactly 0, whereas the last $m$ columns must sum up to $\nu_m$. For any matrix $A$, we denote by $A[(a : b) \times (c : d)]$ the submatrix consisting of rows from $a$ to $b$ and columns from $c$ to $d$. Our main idea is to put a modified version of $\Pi_2^*$ in $\Pi_1^*[(1 : n) \times (n + 1 : m + n)]$ and make $\Pi_1^*[(n + 1 : m + n) \times (n + 1 : m + n)]$ diagonal.

First we describe how to modify $\Pi_2^*$. Observe that, if $C_{i,j} > 2\lambda$ for some $(i, j)$, we expect $X_i \in \mathrm{supp}(\mu_n)$ to be an outlier resulting in high transportation cost, which is why we truncate the cost in F2. Therefore, to get an optimal solution of F1, we set the corresponding value of the optimal plan to 0 and dump the mass into the corresponding slack variable $t_1^*$ on the diagonal of the bottom-right submatrix. This changes the row sum, which is taken care of by $s_1^*$. But, as we are not moving this mass outside the corresponding column, the column sum of $\Pi_1^*[(1 : (m + n)) \times ((n + 1) : (m + n))]$ remains the same as the column sum of $\Pi_2^*$, which is $\nu_m$. We summarize this procedure in Algorithm 1.

Figure 1: Constructing optimal solution of Formulation 1 from optimal solution of Formulation 2.

Example. In Figure 1, we provide an example to visualize the construction. On the left, we have $\Pi_2^*$, an optimal solution of Formulation 2. The blue triangles denote the positions where the corresponding cost value is $\le 2\lambda$, and the light-green squares denote the positions where the corresponding value of the cost matrix is $> 2\lambda$. To construct an optimal solution $\Pi_1^*$ of Formulation 1 from this $\Pi_2^*$, we first create an augmented matrix of size $6 \times 6$. We keep all the entries of the left $6 \times 3$ submatrix as 0 (in this picture blank elements indicate 0). In the right submatrix, we put $\Pi_2^*$ into the top-right block, but remove the masses from the light-green squares, i.e., where the cost value is $> 2\lambda$, and put them in the diagonal entries of the bottom-right block as shown in Figure 1. This mass contributes to the slack variables $s_1$ and $t_1$, and this augmented matrix along with $s_1$, $t_1$ gives us an optimal solution of Formulation 1.
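The construction of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than reference code: the function name `formulation1_from_formulation2` and the toy plan are hypothetical, and `cost` denotes the untruncated $n \times m$ block $c(X_i, Y_j)$.

```python
import numpy as np


def formulation1_from_formulation2(plan2, cost, lam):
    """Build (Pi_1, s_1, t_1) of Formulation 1 from a Formulation 2 plan."""
    n, m = plan2.shape
    # Mass sitting on the index set I = {(i, j): C_ij > 2 * lam}.
    removed = np.where(cost > 2.0 * lam, plan2, 0.0)
    plan1 = np.zeros((n + m, n + m))
    plan1[:n, n:] = plan2 - removed        # top-right block with I zeroed out
    t = removed.sum(axis=0)                # mass dumped per column
    plan1[n:, n:] = np.diag(t)             # diagonal bottom-right block
    s = -removed.sum(axis=1)               # mass removed from each row
    return plan1, s, t


# Toy example: three source points, the last one far from every target.
mu = np.full(3, 1.0 / 3.0)
nu = np.full(3, 1.0 / 3.0)
cost = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 1.0],
                 [1e4, 9801.0, 9604.0]])
plan2 = np.diag(mu)                        # a feasible plan for equation 2.6
plan1, s, t = formulation1_from_formulation2(plan2, cost, lam=2.0)
```

The row sums of `plan1` are then $(\mu_n + s_1, t_1)$ and its column sums are $(0, \nu_m)$, matching the constraints of equation 2.5.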

3.3. OUTLIER DETECTION WITH ROBOT

Our construction algorithm has practical consequences for outlier detection. Suppose we have two datasets, a clean dataset $\nu_m$ (i.e., it has no outliers) and an outlier-contaminated dataset $\mu_n$. We can detect the outliers in $\mu_n$ without directly solving the costly Formulation 1 by following Algorithm 2. In this algorithm, $\lambda$ is a regularization parameter that can be chosen via cross-validation or heuristically (see Section 4.2 for an example). In Section 4.2, we use this algorithm to perform outlier detection on image data.

Algorithm 2 Outlier detection in contaminated data
1: Start with $\mu_n$ (contaminated data) and $\nu_m$ (clean data).
2: Solve Formulation 2 and obtain $\Pi_2^*$ using a suitable value of $\lambda$.
3: Use Algorithm 1 to obtain $\Pi_1^*, s_1^*, t_1^*$ from $\Pi_2^*$.
4: Find $I$, the set of all the indices where $\mu_n + s_1^* = 0$.
5: Return $I$ as the indices of outliers in $\mu_n$.

Table 1: Robust mean estimation with GANs using different distribution divergences. True mean is $\eta_0 = \mathbf{0}_5$; sample size $n = 1000$; contamination proportion $\varepsilon = 0.2$. We report results over 30 experiment restarts.

[…] and the goal is to estimate $\eta_0$. Prior work has advocated for using $f$-divergence GANs (Chao et al., 2018; Wu et al., 2020) for this problem and pointed out inefficiencies of Wasserstein GAN in the presence of outliers. We show that our robust OT formulation allows us to estimate the uncontaminated mean $\eta_0$ comparably to or better than a variety of $f$-divergence GANs. We also use this simulated setup to study sensitivity to the regularization hyperparameter $\lambda$.

In our second experiment, we present a new application of optimal transport enabled by ROBOT. Suppose we have collected a curated dataset $\nu_m$ (i.e., we know that it has no outliers). Such data collection is expensive, and we want to benefit from it to automate subsequent data collection. Let $\mu_n$ be a second dataset collected "in the wild," i.e., it may or may not have outliers. We demonstrate how ROBOT can be used to identify outliers in $\mu_n$ using the curated dataset $\nu_m$.
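Steps 3 to 5 of Algorithm 2 reduce to a few array operations once a Formulation 2 plan is available. The sketch below is illustrative only: `detect_outliers` and the tolerance `tol` are hypothetical, the tolerance being an assumption added because numerical solvers rarely return exact zeros.

```python
import numpy as np


def detect_outliers(plan2, cost, mu, lam, tol=1e-9):
    """Return indices i with mu(i) + s_1(i) = 0 (up to a tolerance)."""
    # s_1(i) is minus the mass that row i ships at a cost above 2 * lam.
    s = -np.where(cost > 2.0 * lam, plan2, 0.0).sum(axis=1)
    return np.flatnonzero(np.abs(mu + s) <= tol)


# Toy example: every entry of the third cost row exceeds 2 * lam = 4.
mu = np.full(3, 1.0 / 3.0)
cost = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 1.0],
                 [1e4, 9801.0, 9604.0]])
plan2 = np.diag(mu)  # stands in for a solver's Formulation 2 plan
outliers = detect_outliers(plan2, cost, mu, lam=2.0)
```

A point is flagged only when all of its mass travels along truncated entries, so points that ship part of their mass cheaply are kept as clean.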

4.1. ROBUST MEAN ESTIMATION

Following Wu et al. (2020), we consider a simple generator of the form $g_\theta(x) = x + \theta$, $x \sim N(0, I_d)$, where $d$ is the data dimension. The basic idea of robust mean estimation with GANs is to minimize various distributional divergences between samples from $g_\theta$ and observed data simulated from $(1 - \varepsilon)N(\eta_0, I_d) + \varepsilon N(\eta_1, I_d)$. The goal is to estimate $\eta_0$ with $\theta$. To efficiently implement ROBOT GAN, we use a standard min-max optimization approach: solve the inner max (ROBOT) and use gradient descent for the outer min parameter. To solve ROBOT, it is straightforward to adapt any of the prior stochastic regularized OT solvers: the only modification is the truncation of the cost entries as in equation 2.6. We use the stochastic algorithm for semi-discrete regularized OT from (Genevay et al., 2016, Algorithm 2). We summarize ROBOT GAN in Algorithm 3. Lines 5 to 10 perform the inner optimization, where we solve the entropy-regularized OT dual with truncated cost, and Lines 11 to 12 perform the gradient update of $\theta$.

Algorithm 3 ROBOT GAN
1: Input: robustness regularization $\lambda$, entropic regularization $\alpha$, data distribution $\mu_n \in \Delta^{n-1}$, $\mathrm{supp}(\mu_n) = X = [X_1, \dots, X_n]$, step sizes $\tau$ and $\gamma$
2: Initialize: $\theta = \theta_{init}$, set number of iterations $M$ and $L$, $i = 0$, $v = \tilde{v} = 0$.
3: for $j = 1, \dots, M$ do
4:   Generate $z \sim N(0, I_d)$ and set $z = z + \theta$.
5:   Set the cost vector $c \in \mathbb{R}^n$ as $c(k) = c(X_k, z) \wedge 2\lambda$ for $k = 1, \dots, n$.
6:   for $i = 1, \dots, L$ do (solve entropy-regularized OT dual)
7:     Set $h \leftarrow (\tilde{v} - c)/\alpha$ and do the normalized exponential transformation $u \leftarrow e^h / \langle \mathbf{1}, e^h \rangle$.
8:     Calculate the gradient $\nabla\tilde{v} \leftarrow \mu_n - u$.
9:     Update $\tilde{v} \leftarrow \tilde{v} + \gamma\nabla\tilde{v}$ and $v \leftarrow \frac{1}{j+i}\tilde{v} + \frac{j+i-1}{j+i}v$.
10:   end for
11:   […] Set $\Pi(k) = 0$ for $k$ such that $C(X_k, z) > 2\lambda$, $k = 1, \dots, n$.
12:   Calculate the gradient with respect to $\theta$ as $\nabla\theta = 2\big[z \sum_k \Pi(k) - X\Pi\big]$.
13:   Update $\theta \leftarrow \theta - \tau\nabla\theta$.
14: Output: $\theta$

For the $f$-divergence GANs (Nowozin et al., 2016) we use the code of Wu et al. (2020) for GANs with Jensen-Shannon (JS) loss, squared Hellinger (SH) loss and Reverse Kullback-Leibler (RKL) loss. For the exact expressions of these divergences see Table 1 of Wu et al. (2020). We report the estimation error, measured by the Euclidean distance between the true uncontaminated mean $\eta_0$ and the estimated mean $\theta$, for various contamination distributions in Table 1. ROBOT GAN performs well across all considered contamination distributions. As the difference between the true mean $\eta_0$ and the contamination mean $\eta_1$ increases, the estimation error of all methods tends to increase. However, when it becomes easier to distinguish outliers from clean samples, i.e., $\eta_1 = 2 \cdot \mathbf{1}_5$, the performance of ROBOT noticeably improves.

We also compared to the Sinkhorn-based UOT algorithm (Chizat et al., 2018) available in the Python Optimal Transport (POT) library (Flamary & Courty, 2017); to obtain a UOT GAN, we modified steps 5-11 of Algorithm 3 for computing $\Pi$. Unsurprisingly, ROBOT and UOT perform similarly: recall the equivalence to Formulation 3, which is similar to UOT with the TV norm. The key insight of our work is the equivalence to classical OT with truncated cost, which greatly simplifies optimization and allows the use of existing stochastic OT algorithms. In this experiment, the sample size $n = 1000$ is sufficiently small for the Sinkhorn-based UOT POT implementation to be effective, but it breaks in the experiment we present in Section 4.2. We also tried the code of Balaji et al. (2020) based on CVXPY (Diamond & Boyd, 2016), but it is too slow even for the $n = 1000$ sample size.

In the previous experiment, we set $\lambda = 0.5$.
Now we demonstrate empirically that there is a broad range of $\lambda$ values performing well. In Figure 2a, we study sensitivity to $\lambda$ under various contamination proportions $\varepsilon$, holding $\eta_0 = \mathbf{1}_5$ and $\eta_1 = 5 \cdot \mathbf{1}_5$ fixed. Horizontal lines correspond to $\lambda = \infty$, i.e., vanilla OT. The key observations are: there is a wide range of $\lambda$ that is effective at all contamination proportions, and ROBOT is always at least as good as vanilla OT (even when there is no contamination, $\varepsilon = 0$). In Figure 2b, we present a similar study varying the mean of the contamination distribution and holding $\varepsilon = 0.2$ fixed. We see that as the contamination distribution gets closer to the true distribution, it becomes harder to pick a good $\lambda$, but the performance is always at least as good as that of vanilla OT (horizontal lines).
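For concreteness, the inner loop of Algorithm 3 (steps 5 to 9) for a single generated sample can be sketched as follows. This is a hedged illustration, not the code used in our experiments: the function name `inner_dual_steps` and the toy inputs are hypothetical, the ground cost is assumed to be squared Euclidean, `X` stores one data point per row, and subtracting `h.max()` is a standard numerical stabilization that leaves `u` unchanged.

```python
import numpy as np


def inner_dual_steps(X, z, mu, v_tilde, v, lam, alpha, gamma, L, j):
    """Run L averaged dual-ascent steps with the cost truncated at 2 * lam."""
    # Step 5: truncated cost between every data point X_k and the sample z.
    c = np.minimum(((X - z) ** 2).sum(axis=1), 2.0 * lam)
    for i in range(1, L + 1):
        # Step 7: normalized exponential (softmax) of (v_tilde - c) / alpha.
        h = (v_tilde - c) / alpha
        u = np.exp(h - h.max())
        u /= u.sum()
        # Step 8: gradient of the dual objective with respect to v_tilde.
        grad = mu - u
        # Step 9: ascent step followed by the running average v.
        v_tilde = v_tilde + gamma * grad
        w = 1.0 / (j + i)
        v = w * v_tilde + (1.0 - w) * v
    return v_tilde, v, u


# Toy inputs: three data points in the plane, one far from the sample z.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
z = np.array([0.1, 0.0])
mu = np.full(3, 1.0 / 3.0)
v_tilde, v, u = inner_dual_steps(
    X, z, mu, np.zeros(3), np.zeros(3),
    lam=0.5, alpha=0.1, gamma=0.1, L=20, j=1,
)
```

Because the truncation happens once in step 5, the remaining steps are those of the untruncated solver, which is why no extra computational overhead is incurred.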

4.2. OUTLIER DETECTION FOR DATA COLLECTION

Our robust OT formulation equation 2.5 enables outlier identification. Let $\nu_m$ be a clean dataset and $\mu_n$ a dataset potentially contaminated with outliers. Recall that ROBOT allows modification of one of the input distributions to eliminate potential outliers. We can identify outliers in $\mu_n$ as follows: if $\mu_n(i) + s_1^*(i) = 0$, then $X_i$, the $i$th point in $\mu_n$, is an outlier. Instead of directly solving equation 2.5, which may be inefficient, we use our equivalence results and solve the easier optimization problem equation 2.6, followed by recovering $\mathbf{s}$ to find outliers via Algorithm 2.

Let $\nu_m$ be a clean dataset consisting of 10k MNIST digits and $\mu_n$ be a dataset collected "in the wild" consisting of (different) 8k MNIST digits and 2k Fashion MNIST images. We compute $\mathrm{ROBOT}(\mu_n, \nu_m)$ to identify outlier Fashion MNIST images in $\mu_n$. For each point in $\mu_n$ we obtain a prediction, outlier or clean, which allows us to evaluate accuracy. ROBOT outlier detection is 90% accurate in this experiment.

We also comment on $\lambda$ selection: since we know that $\nu_m$ is clean, we can subsample two datasets from it, compute vanilla OT to obtain the transportation plan $\Pi$, and set $\lambda$ to be half the maximum distance between matched elements, i.e., $2\lambda = \max_{i,j}\{C_{ij} : \Pi_{ij} > 0\}$, where $C$ is the cost matrix for the two subsampled datasets. This procedure essentially estimates the maximum distance between matched clean samples.

We also present a random sample of outliers identified by our method in Figure 3. All of the sampled outliers are Fashion MNIST images, although the 90% accuracy suggests that some of the outliers were not identified. Decreasing $\lambda$ can help to find more outliers, but may result in some clean samples being mistaken for outliers. We conclude that ROBOT can be used to assist in data collection once an initial set of clean data has been acquired.
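The λ selection heuristic described above can be sketched as follows. The sketch makes assumptions beyond the text: the two clean subsamples have equal size and uniform weights, so that a vanilla OT plan can be taken to be a permutation and computed with SciPy's `linear_sum_assignment`; the ground cost is squared Euclidean; and the name `select_lambda` and the toy subsamples are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def select_lambda(clean_a, clean_b):
    """Half of the largest cost between matched clean samples."""
    cost = ((clean_a[:, None, :] - clean_b[None, :, :]) ** 2).sum(axis=-1)
    # With equal sizes and uniform weights, an optimal plan is a matching.
    rows, cols = linear_sum_assignment(cost)
    return 0.5 * cost[rows, cols].max()


# Two toy "clean" subsamples on the real line.
clean_a = np.array([[0.0], [1.0], [2.0], [3.0]])
clean_b = np.array([[0.5], [1.5], [2.5], [3.5]])
lam = select_lambda(clean_a, clean_b)
```

In practice one might average the result over several pairs of subsamples, since a single pair gives a noisy estimate of the largest clean matching cost.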
As we mentioned previously, the Sinkhorn-based UOT POT implementation is too expensive for this experiment due to the larger sample size, yielding memory errors on a personal laptop with 16GB RAM. For comparison, we also consider a heuristic distance-based approach for identifying outliers. We estimate the diameter $\tau$ of the clean dataset $\nu_m$ by taking the 99th percentile of the pairwise distance matrix of samples in $\nu_m$. If outliers and clean data have disjoint support, we can adopt a simple heuristic: for each sample in the potentially contaminated $\mu_n$, compute its average distance to the clean samples in $\nu_m$ and declare the sample an outlier if this average distance is greater than the diameter $\tau$ of the clean data. The accuracy of this procedure is 85.4%, inferior to the ROBOT accuracy of 90%. The disjoint support assumption justifying the distance-based heuristic might be too strong in practice. ROBOT continues to be effective even when the supports of the clean and outlier distributions are not easily separable.
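The distance-based baseline can be sketched as below. The details are assumptions where the text is silent: distances are Euclidean, the percentile is taken over distinct pairs only (excluding the zero diagonal), and `distance_heuristic` and the toy data are hypothetical.

```python
import numpy as np


def distance_heuristic(clean, wild, q=99.0):
    """Flag samples whose mean distance to the clean set exceeds tau."""
    d_clean = np.linalg.norm(clean[:, None, :] - clean[None, :, :], axis=-1)
    # Diameter estimate: q-th percentile over distinct pairs of clean samples.
    upper = np.triu_indices(len(clean), k=1)
    tau = np.percentile(d_clean[upper], q)
    d_cross = np.linalg.norm(wild[:, None, :] - clean[None, :, :], axis=-1)
    return d_cross.mean(axis=1) > tau


# Toy data: clean points on a 5 x 5 grid, one central and one far-away sample.
clean = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
wild = np.array([[2.0, 2.0], [50.0, 50.0]])
flags = distance_heuristic(clean, wild)
```

Unlike ROBOT, this rule compares each sample to the whole clean set at once, which is why it relies on the supports being well separated.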

5. SUMMARY AND DISCUSSION

We proposed and studied ROBOT, a robust formulation of optimal transport. We showed that although the problem is seemingly asymmetric and challenging to optimize, there is an equivalent formulation based on cost truncation that is symmetric and compatible with modern stochastic optimization methods for OT. ROBOT closely resembles unbalanced optimal transport (UOT). In our formulation, we added a TV regularizer to the vanilla optimal transport problem. This is motivated by the $\varepsilon$-contamination model. In UOT, the TV regularizer is typically replaced with a KL divergence. Other choices of the regularizer may lead to new properties and applications. Studying equivalent, simpler formulations of UOT with different divergences may be a fruitful direction for future work.

From the practical perspective, in our experiments we observed no degradation of ROBOT GAN in comparison to OT GAN, even when there were no outliers. It is possible that replacing OT with ROBOT may be beneficial for various machine learning applications of OT. Data encountered in practice may not be explicitly contaminated with outliers, but it often has errors and other deficiencies, suggesting that a "no-harm" robustness is desirable.

A PROOF OF THEOREM 3.1

A.1 PROOF OF DISCRETE VERSION

Proof. Define a matrix $\Pi$ as:

$$\Pi(i,j) = \begin{cases} 0, & \text{if } C(i,j) > 2\lambda, \\ \Pi_2^*(i,j), & \text{otherwise.} \end{cases}$$

Also define $s_1^* \in \mathbb{R}^n$ and $t_1^* \in \mathbb{R}^m$ as:

$$s_1^*(i) = -\sum_{j=1}^{m} \Pi_2^*(i,j)\,\mathbb{1}_{C(i,j)>2\lambda}, \qquad t_1^*(j) = \sum_{i=1}^{n} \Pi_2^*(i,j)\,\mathbb{1}_{C(i,j)>2\lambda}.$$

Up to sign, these vectors are the row sums and the column sums of the entries of the optimal transport plan of Formulation 2 at which the cost exceeds $2\lambda$. Note that the entries of the optimal transport plan at coordinates where the cost is greater than $2\lambda$ contribute to the objective value only through their sum; hence any rearrangement of these transition probabilities with the same sum gives the same objective value. Based on this $\Pi$, we construct a feasible solution of Formulation 1 following Algorithm 1:

$$\Pi_1^* = \begin{bmatrix} 0 & \Pi \\ 0 & \mathrm{diag}(t_1^*) \end{bmatrix}.$$

The row sums of $\Pi_1^*$ are

$$\Pi_1^* \mathbf{1} = \begin{bmatrix} \mu_n + s_1^* \\ t_1^* \end{bmatrix},$$

and it is immediate from the construction that the column sums of $\Pi_1^*$ are $\nu_m$. Also, since

$$-\sum_{i=1}^{n} s_1^*(i) = \sum_{j=1}^{m} t_1^*(j) = \sum_{(i,j):\, C_{i,j} > 2\lambda} \Pi_2^*(i,j)$$

and $s_1^* \preceq 0$, $t_1^* \succeq 0$, we have $\mathbf{1}^\top(\mu_n + s_1^*) + \mathbf{1}^\top t_1^* = \mathbf{1}^\top p = 1$. Therefore $(\Pi_1^*, s_1^*, t_1^*)$ is a feasible solution of Formulation 1.

Now suppose this is not an optimal solution. Pick an optimal solution $(\tilde\Pi, \tilde s, \tilde t)$ of Formulation 1, so that:

$$\langle C_{\mathrm{aug}}, \tilde\Pi \rangle + \lambda\left[\|\tilde s\|_1 + \|\tilde t\|_1\right] < \langle C_{\mathrm{aug}}, \Pi_1^* \rangle + \lambda\left[\|s_1^*\|_1 + \|t_1^*\|_1\right].$$

The following two lemmas provide some structural properties of any optimal solution of Formulation 1.

Lemma A.1. Suppose $(\Pi_1^*, s_1^*, t_1^*)$ is an optimal solution of Formulation 1. Divide $\Pi_1^*$ into four parts corresponding to the augmentation in Algorithm 1:

$$\Pi_1^* = \begin{bmatrix} \Pi^*_{1,11} & \Pi^*_{1,12} \\ \Pi^*_{1,21} & \Pi^*_{1,22} \end{bmatrix}.$$

Then we have $\Pi^*_{1,11} = \Pi^*_{1,21} = 0$ and $\Pi^*_{1,22}$ is a diagonal matrix.

Lemma A.2. If $(\Pi_1^*, s_1^*, t_1^*)$ is an optimal solution of Formulation 1, then:
1. If $C_{i,j} > 2\lambda$ then $\Pi_1^*(i,j) = 0$.
2. If $C_{i,j} < 2\lambda$ for some $i$ and for all $1 \le j \le m$, then $s_1^*(i) = 0$.
3. If $C_{i,j} < 2\lambda$ for some $j$ and for all $1 \le i \le n$, then $t_1^*(j) = 0$.
4. If $C_{i,j} < 2\lambda$ then $s_1^*(i)\,t_1^*(j) = 0$.

We provide the proofs in the next subsection. By Lemma A.1 we can assume without loss of generality:

$$\tilde\Pi = \begin{bmatrix} 0 & \tilde\Pi_{12} \\ 0 & \mathrm{diag}(\tilde t) \end{bmatrix}.$$

Based on $(\tilde\Pi, \tilde s, \tilde t)$ we now create a feasible solution of Formulation 2, namely $\Pi^*_{2,\mathrm{new}}$, as follows. Define the sets of indices $\{i_1, \dots, i_k\}$ and $\{j_1, \dots, j_l\}$ by $\tilde s_{i_1}, \tilde s_{i_2}, \dots, \tilde s_{i_k} > 0$ and $\tilde t_{j_1}, \tilde t_{j_2}, \dots, \tilde t_{j_l} > 0$. Then by part (4) of Lemma A.2 we have $C_{i_\alpha, j_\beta} > 2\lambda$ for $\alpha \in \{1, \dots, k\}$ and $\beta \in \{1, \dots, l\}$. Also, by part (1) of Lemma A.2, the value of the transport plan at these coordinates is 0. Now distribute the mass of the slack variables over these coordinates so that the marginals of the new transport plan become exactly $\mu_n$ and $\nu_m$. This new transport plan is our $\Pi^*_{2,\mathrm{new}}$. Recall that $\|\tilde s\|_1 = \|\tilde t\|_1$. Hence the regularizer value decreases by $2\lambda\|\tilde s\|_1$ and the cost value increases by exactly $2\lambda\|\tilde s\|_1$, as we are truncating the cost. Hence we have:

$$\langle C_\lambda, \Pi^*_{2,\mathrm{new}} \rangle = \langle C_{\mathrm{aug}}, \tilde\Pi \rangle + \lambda\left[\|\tilde s\|_1 + \|\tilde t\|_1\right] < \langle C_{\mathrm{aug}}, \Pi_1^* \rangle + \lambda\left[\|s_1^*\|_1 + \|t_1^*\|_1\right] = \langle C_\lambda, \Pi_2^* \rangle,$$

which is a contradiction, as $\Pi_2^*$ is the optimal solution of Formulation 2. This completes the proof for the discrete part.
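The construction at the start of this proof can be checked numerically on a toy instance. The sketch below is illustrative only: the function names and the toy numbers are assumptions introduced here, the truncated cost is assumed to be $C_\lambda = \min(C, 2\lambda)$ entrywise, and the plan is an arbitrary coupling rather than a solved optimum (the identity being checked holds for any coupling).

```python
def split_plan(plan, cost, lam):
    """Split a coupling into the kept part Pi and the slack vectors s, t."""
    n, m = len(plan), len(plan[0])
    big = [[cost[i][j] > 2 * lam for j in range(m)] for i in range(n)]
    # Pi: zero out every entry whose cost exceeds 2 * lam.
    kept = [[0.0 if big[i][j] else plan[i][j] for j in range(m)] for i in range(n)]
    # s: minus the removed mass per row; t: the removed mass per column.
    s = [-sum(plan[i][j] for j in range(m) if big[i][j]) for i in range(n)]
    t = [sum(plan[i][j] for i in range(n) if big[i][j]) for j in range(m)]
    return kept, s, t


def inner(a, b):
    """Frobenius inner product of two equally shaped nested lists."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Toy instance (hypothetical numbers): 2 source points, 3 target points.
lam = 2.0
cost = [[0.0, 1.0, 9.0], [1.0, 0.0, 8.0]]
plan = [[0.25, 0.0, 0.25], [0.0, 0.25, 0.25]]
mu = [sum(row) for row in plan]
nu = [sum(col) for col in zip(*plan)]

kept, s, t = split_plan(plan, cost, lam)
truncated = [[min(c, 2 * lam) for c in row] for row in cost]

# Truncated-cost objective versus cost on the kept part plus the L1 penalty.
lhs = inner(truncated, plan)
rhs = inner(cost, kept) + lam * (sum(abs(x) for x in s) + sum(abs(x) for x in t))

# Marginal checks: rows of Pi sum to mu + s, columns of Pi plus t give nu.
row_ok = all(abs(sum(kept[i]) - (mu[i] + s[i])) < 1e-12 for i in range(len(mu)))
col_ok = all(abs(sum(col) + t[j] - nu[j]) < 1e-12 for j, col in enumerate(zip(*kept)))
print(lhs, rhs, row_ok, col_ok)
```

The two objective values agree because every unit of mass sitting on a cell with cost above $2\lambda$ is charged $2\lambda$ under the truncated cost, and the same unit contributes $\lambda$ to each of $\|s\|_1$ and $\|t\|_1$ after the split.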

A.2 PROOF OF EQUIVALENCE FOR TWO SIDED FORMULATION

Here we prove that our two-sided formulation, i.e., Formulation 3 (equation 2.7), is equivalent to Formulation 1 (equation 2.5) in the discrete case. To that end, we introduce an auxiliary formulation and show that both Formulation 1 and Formulation 3 are equivalent to it.

Formulation 4:

$$W_{R,L,4}(p,q) = \left\{ \begin{aligned} &\min_{\Pi \in \mathbb{R}^{m \times n},\; s_1 \in \mathbb{R}^m,\; s_2 \in \mathbb{R}^n} \; \langle C, \Pi \rangle + \lambda\left[\|s_1\|_1 + \|s_2\|_1\right] \\ &\text{subject to} \quad \Pi \mathbf{1}_n = p + s_1, \quad \Pi^\top \mathbf{1}_m = q + s_2, \quad \Pi \succeq 0. \end{aligned} \right. \tag{A.1}$$

First we show that Formulation 1 and Formulation 4 are equivalent in the sense that they have the same optimal objective value.

Theorem A.3. Suppose $C$ is a cost function such that $C(x,x) = 0$. Then Formulation 1 and Formulation 4 have the same optimal objective value.

Proof. We show that, given optimal variables of one formulation, we can obtain optimal variables of the other formulation with the same objective value. Before going into details we need the following lemma (Lemma A.4), whose proof is provided in Appendix B.

Now we prove that the optimal values of Formulation 1 and Formulation 4 are the same. Let $(\Pi_1^*, s^*_{1,1}, t^*_{1,1})$ be an optimal solution of Formulation 1. We claim that $(\Pi^*_{1,12}, s^*_{1,1}, t^*_{1,1})$ is also an optimal solution of Formulation 4. Clearly it is a feasible solution of Formulation 4. Suppose it is not optimal, i.e., there exists another optimal solution $(\tilde\Pi_4, \tilde s_{4,1}, \tilde s_{4,2})$ such that:

$$\langle C, \tilde\Pi_4 \rangle + \lambda\left(\|\tilde s_{4,1}\|_1 + \|\tilde s_{4,2}\|_1\right) < \langle C, \Pi^*_{1,12} \rangle + \lambda\left(\|s^*_{1,1}\|_1 + \|t^*_{1,1}\|_1\right).$$

Based on $(\tilde\Pi_4, \tilde s_{4,1}, \tilde s_{4,2})$ we construct a feasible solution of Formulation 1 as follows:

$$\tilde\Pi_1 = \begin{bmatrix} 0 & \tilde\Pi_4 \\ 0 & -\mathrm{diag}(\tilde s_{4,2}) \end{bmatrix}.$$

Note that we proved in Lemma A.4 that $\tilde s_{4,2} \preceq 0$; hence $\tilde\Pi_1 \succeq 0$. As the column sums of $\tilde\Pi_4$ are $q + \tilde s_{4,2}$, the column sums of $\tilde\Pi_1$ are $[0^\top \; q^\top]^\top$ and the row sums are $[(p + \tilde s_{4,1})^\top \; -\tilde s_{4,2}^\top]^\top$. Hence we take $\tilde s_{1,1} = \tilde s_{4,1}$ and $\tilde s_{1,2} = -\tilde s_{4,2}$. Then it follows:

$$\langle C_{\mathrm{aug}}, \tilde\Pi_1 \rangle + \lambda\left[\|\tilde s_{1,1}\|_1 + \|\tilde s_{1,2}\|_1\right] = \langle C, \tilde\Pi_4 \rangle + \lambda\left[\|\tilde s_{4,1}\|_1 + \|\tilde s_{4,2}\|_1\right] < \langle C, \Pi^*_{1,12} \rangle + \lambda\left[\|s^*_{1,1}\|_1 + \|t^*_{1,1}\|_1\right] = \langle C_{\mathrm{aug}}, \Pi_1^* \rangle + \lambda\left[\|s^*_{1,1}\|_1 + \|t^*_{1,1}\|_1\right].$$

This is a contradiction, as we assumed $(\Pi_1^*, s^*_{1,1}, t^*_{1,1})$ is an optimal solution of Formulation 1. Therefore $(\Pi^*_{1,12}, s^*_{1,1}, t^*_{1,1})$ is also an optimal solution of Formulation 4, which further implies that Formulation 1 and Formulation 4 have the same optimal value. This completes the proof of the theorem.

Theorem A.5. The optimal objective values of Formulation 3 and Formulation 4 are the same.

Proof. As in the proof of Theorem A.3, we first prove a couple of lemmas.

Lemma A.6. Any optimal transport plan $\Pi_3^*$ of Formulation 3 has the following structure: if we write $\Pi_3^* = \Pi^*_{3,}$

0.

Proof. The line of argument is the same as in the proof of Lemma A.4.

Next we establish equivalence. Suppose $(\Pi_3^*, s^*_{3,1}, t^*_{3,1}, s^*_{3,2}, t^*_{3,2})$ is an optimal solution of Formulation 3. We claim that $(\Pi^*_{3,12},\; s^*_{3,1} - s^*_{3,2},\; t^*_{3,1} - t^*_{3,2})$ forms an optimal solution of Formulation 4. The objective value is then also the same, since the sign conditions on $s^*_{3,1}$ and $s^*_{3,2}$ (Lemma A.7) imply $\|s^*_{3,1} - s^*_{3,2}\|_1 = \|s^*_{3,1}\|_1 + \|s^*_{3,2}\|_1$, and similarly those on $t^*_{3,1}$ and $t^*_{3,2}$ imply $\|t^*_{3,1} - t^*_{3,2}\|_1 = \|t^*_{3,1}\|_1 + \|t^*_{3,2}\|_1$. Feasibility is immediate. For optimality, we again argue by contradiction. Suppose they are not optimal, and let $(\tilde\Pi_4, \tilde s_{4,1}, \tilde s_{4,2})$ be an optimal triplet of Formulation 4. Construct another feasible solution of Formulation 3 as follows: set $\tilde s_{3,2} = \tilde t_{3,2} = 0$, $\tilde s_{3,1} = \tilde s_{4,1}$ and $\tilde t_{3,1} = \tilde s_{4,2}$. Set the matrix as:

$$\tilde\Pi_3 = \begin{bmatrix} 0 & \tilde\Pi_4 \\ 0 & -\mathrm{diag}(\tilde s_{4,2}) \end{bmatrix}.$$

Then it follows that $(\tilde\Pi_3, \tilde s_{3,1}, \tilde s_{3,2}, \tilde t_{3,1}, \tilde t_{3,2})$ is a feasible solution of Formulation 3. Finally we have:

$$\begin{aligned} \langle C_{\mathrm{aug}}, \tilde\Pi_3 \rangle + \lambda\left[\|\tilde s_{3,1}\|_1 + \|\tilde s_{3,2}\|_1 + \|\tilde t_{3,1}\|_1 + \|\tilde t_{3,2}\|_1\right] &= \langle C_{\mathrm{aug}}, \tilde\Pi_3 \rangle + \lambda\left[\|\tilde s_{4,1}\|_1 + \|\tilde s_{4,2}\|_1\right] \\ &= \langle C, \tilde\Pi_4 \rangle + \lambda\left[\|\tilde s_{4,1}\|_1 + \|\tilde s_{4,2}\|_1\right] \\ &< \langle C, \Pi^*_{3,12} \rangle + \lambda\left[\|s^*_{3,1} - s^*_{3,2}\|_1 + \|t^*_{3,1} - t^*_{3,2}\|_1\right] \\ &= \langle C_{\mathrm{aug}}, \Pi_3^* \rangle + \lambda\left[\|s^*_{3,1}\|_1 + \|s^*_{3,2}\|_1 + \|t^*_{3,1}\|_1 + \|t^*_{3,2}\|_1\right]. \end{aligned}$$

This contradicts the optimality of $(\Pi_3^*, s^*_{3,1}, s^*_{3,2}, t^*_{3,1}, t^*_{3,2})$. This completes the proof.

Lemma A.8. Assume that $\mu, \nu$ are such that $\int \|x\|\, d\mu,\ \int \|x\|\, d\nu < \infty$. Moreover, assume that $C(x,y)$ in equation 2.2 is the $\ell_1$ norm, i.e., $C(x,y) = \|x - y\|$. Then there exists $s$, with $\mu + s$ a probability measure, such that

$$W_1(\mu + s, \nu) + \lambda \|s\|_{TV} = \mathrm{ROBOT}(\mu, \nu), \tag{A.3}$$

where $W_1$ is the Wasserstein-1 distance with the cost function $C(\cdot,\cdot)$ as above.

Proof. Let $\mu_n, \nu_m$ be the empirical measures relative to $\mu, \nu$ respectively. Since $\mu_n, \nu_m$ are discrete, there exists $s_n$ satisfying $W_1(\mu_n + s_n, \nu_m) + \lambda\|s_n\|_{TV} = \mathrm{ROBOT}(\mu_n, \nu_m)$. We provide the proof in the following steps.

Let $T_n$ be the optimal transport map from $\mu_n$ to $\nu_n$.
Then, for every $i \le n$, there exists a unique $j \le n$ such that $T_n(X_i) = Y_j$. Define $\tau_n : \{1, \dots, n\} \to \{1, \dots, n\}$ such that $\tau_n(i) = j$ if $T_n(X_i) = Y_j$. Then $\mu_n + s_n = \sum_i \delta_{Z_i}/n$, where $Z_i = X_i$ or $Y_{\tau_n(i)}$ and $\delta_x$ is the Dirac delta mass at $x$. Let $Z \sim \mu_n + s_n$. Then

$$P_\omega(Z \notin K_\epsilon \mid \mu_n + s_n) \le \sum_i \mathbb{1}_{(X_i \notin K_\epsilon)}/n + \sum_i \mathbb{1}_{(Y_i \notin K_\epsilon)}/n. \tag{A.4}$$

Therefore, $E\big(P_\omega(Z \notin K_\epsilon \mid \mu_n + s_n)\big) \le \epsilon/2$. Moreover, $\mathrm{Var}\big(P_\omega(Z \notin K_\epsilon \mid \mu_n + s_n)\big) = o(n^{-1}) \to 0$. Therefore, $\lim_{n \to \infty} P_{\mu^n \times \nu^n}\big(P_\omega(Z \notin K_\epsilon \mid \mu_n + s_n) \le \epsilon\big) = 1$. Hence $\mu_n + s_n$ is almost surely tight and thus, by Prokhorov's Theorem, also relatively compact.

Step 2: Therefore, for almost every $\omega$, there exists a subsequence $\{n_k\}_{k \ge 1}$ such that $\mu_{n_k} + s_{n_k}$ converges weakly to a limit (dependent on $\omega$) $\mu \oplus s$, which is a probability measure. Moreover, $\int \|x\|\, d(\mu_{n_k} + s_{n_k}) < \infty$ almost surely. By the Bolzano-Weierstrass Theorem, there exists a further subsequence $\{n_{k_l}\}_l$ such that $\int \|x\|\, d(\mu_{n_{k_l}} + s_{n_{k_l}}) \to \int \|x\|\, d(\mu \oplus s)$ almost surely. For convenience, and without loss of generality, we replace the sub-subsequence $\{n_{k_l}\}_l$ with $\{n_k\}_{k \ge 1}$ henceforth. Thus, by Theorem 6.9 of (Villani, 2009), $W_1(\mu_{n_k} + s_{n_k}, \mu \oplus s) \to 0$ almost surely. Moreover, $W_1(\mu_{n_k}, \mu) \to 0$ almost surely. Therefore $\|s_{n_k}\|_{TV} \to \|\mu \oplus s - \mu\|_{TV}$ almost surely.

Step 3: Consider an arbitrary $S = S^+ - S^-$, such that $S^+$ and $S^-$ are positive measures on $\mathbb{R}^d$ and $\mu + S$ is a probability measure. Let $\|S\|_{TV} = \gamma$. Then $\|S^-\|_{TV} = \|S^+\|_{TV} = \gamma/2$. Consider $X_1, \dots, X_n \sim (\mu - S^-)/(1 - \gamma)$, $Y_1, \dots, Y_n \sim S^-/\gamma$, $Z_1, \dots, Z_n \sim S^+/\gamma$. Then for any bounded continuous function $f$,

$$\lim_{n \to \infty} \sum_i f(X_i)/n = \int f(x)\, \frac{(\mu - S^-)}{(1 - \gamma)}(dx), \qquad \lim_{n \to \infty} \sum_i f(Z_i)/n = \int f(x)\, \frac{S^+}{\gamma}(dx). \tag{A.5}$$

Therefore, the distribution given by $(\mu + S)_n(A) = \frac{1-\gamma}{n} \sum_i \mathbb{1}_{X_i \in A} + \frac{\gamma}{n} \sum_i \mathbb{1}_{Z_i \in A}$ satisfies $(\mu + S)_n \to \mu + S$ weakly, and therefore, from (Villani, 2009), $\lim_{n \to \infty} W_1((\mu + S)_n, \nu_n) = W_1(\mu + S, \nu)$.
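Moreover, $\|S_n\|_{TV} = \|S\|_{TV}$, where $S_n$ satisfies $S_n(A) = \frac{\gamma}{n}\left[\sum_i \mathbb{1}_{Z_i \in A} - \sum_i \mathbb{1}_{Y_i \in A}\right]$. But

$$W_1(\mu_{n_k} + s_{n_k}, \nu_{n_k}) + \lambda \|s_{n_k}\|_{TV} \le W_1((\mu + S)_{n_k}, \nu_{n_k}) + \lambda \|S_{n_k}\|_{TV}.$$

Therefore, taking limits,

$$W_1(\mu \oplus s, \nu) + \lambda \|\mu \oplus s - \mu\|_{TV} \le W_1(\mu + S, \nu) + \lambda \|S\|_{TV},$$

$^{*}_{1}\,\mathbf{1} = Q$. To prove that $\Pi^*_{1,22}$ is diagonal, we use the fact that every diagonal entry of the cost matrix is 0. Suppose $\Pi^*_{1,22}$ is not diagonal. Then define a matrix $\hat\Pi$ as follows: set $\hat\Pi_{11} = \hat\Pi_{21} = 0$, $\hat\Pi_{12} = \Pi^*_{1,12}$ and

$$\hat\Pi_{22}(i,j) = \begin{cases} \sum_{k=1}^{m} \Pi^*_{1,22}(k,i), & \text{if } j = i, \\ 0, & \text{if } j \ne i. \end{cases}$$

Also define $\hat s = s_1^*$ and $\hat t$ as $\hat t(i) = \hat\Pi_{22}(i,i)$. Then clearly $(\hat\Pi, \hat s, \hat t)$ is a feasible solution of Formulation 1. Note that

$$\|\hat t\|_1 = \mathbf{1}^\top \hat\Pi_{22} \mathbf{1} = \mathbf{1}^\top \Pi^*_{1,22} \mathbf{1} = \|t_1^*\|_1,$$

and by our construction $\langle C_{\mathrm{aug}}, \hat\Pi \rangle < \langle C_{\mathrm{aug}}, \Pi_1^* \rangle$. Hence $(\hat\Pi, \hat s, \hat t)$ reduces the value of the objective function of Formulation 1, which is a contradiction. This completes the proof.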
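The diagonalization step in this proof can be illustrated with a small numerical sketch. The helper names and the toy block below are assumptions made for illustration, not part of the original algorithm: each column of a square block is collapsed onto its diagonal entry, which keeps the column sums (and hence the total mass) unchanged while moving all mass to zero-cost cells.

```python
def collapse_to_diagonal(block):
    """Move the mass of every column of a square block onto its diagonal entry."""
    size = len(block)
    col_sums = [sum(block[k][i] for k in range(size)) for i in range(size)]
    return [[col_sums[i] if i == j else 0.0 for j in range(size)] for i in range(size)]


def inner(a, b):
    """Frobenius inner product of two equally shaped nested lists."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Toy block (hypothetical numbers); the cost has a zero diagonal, as in the proof.
cost_block = [[0.0, 2.0], [3.0, 0.0]]
block = [[0.1, 0.2], [0.0, 0.3]]

diag_block = collapse_to_diagonal(block)
old_cost = inner(cost_block, block)
new_cost = inner(cost_block, diag_block)
old_cols = [sum(col) for col in zip(*block)]
new_cols = [sum(col) for col in zip(*diag_block)]
print(old_cost, new_cost, old_cols, new_cols)
```

In this toy case the inequality is strict because the off-diagonal cell carrying mass has positive cost; if every off-diagonal cost were zero, the collapse would leave the cost unchanged.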

B.2 PROOF OF LEMMA A.2

Proof.
1. Suppose $\Pi_1^*(i,j) > 0$. Then move this mass to $s_1^*(j)$ and set $\Pi_1^*(i,j)$ to 0. In this way $\langle C_{\mathrm{aug}}, \Pi_1^* \rangle$ decreases by more than $2\lambda\,\Pi_1^*(i,j)$ and the regularizer value increases by at most $2\lambda\,\Pi_1^*(i,j)$, resulting in an overall reduction in the objective value, which is a contradiction.
2. Suppose each entry of the $i$-th row of $C$ is $< 2\lambda$. If $s_1^*(i) > 0$, we can distribute this mass over the $i$-th row such that $s_1^*(i) = a_1 + a_2 + \dots + a_m$, with the condition that $t_1^*(j) \ge a_j$. Now we reduce $t_1^*$ as $t_1^*(j) \leftarrow t_1^*(j) - a_j$. Hence the value $\langle C_{\mathrm{aug}}, \Pi_1^* \rangle$ increases by less than $2\lambda\, s_1^*(i)$, but the value of the regularizer decreases by $2\lambda\, s_1^*(i)$, resulting in an overall decrease in the value of the objective function.
3. Same as the proof of part (2), with rows and columns interchanged in the argument.
4. Suppose not. Then choose $\epsilon < s_1^*(i) \wedge t_1^*(j)$ and add $\epsilon$ to $\Pi_1^*(i,j)$. The cost value $\langle C_{\mathrm{aug}}, \Pi_1^* \rangle$ increases by less than $2\lambda\epsilon$, but the regularizer value decreases by $2\lambda\epsilon$, resulting in an overall decrease in the objective function.
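As a worked instance of part 1, the bookkeeping can be written in one line. This is only a sketch: $\delta$ and $\Delta$ are notation introduced here, with $\delta = \Pi_1^*(i,j) > 0$ the mass on a cell with $C_{i,j} > 2\lambda$ and $\Delta$ the net change in the objective after moving that mass to the slack variables.

```latex
\Delta \;\le\; \underbrace{-\,C_{i,j}\,\delta}_{\text{transport cost removed}}
       \;+\; \underbrace{2\lambda\,\delta}_{\text{regularizer added}}
       \;=\; -\,(C_{i,j} - 2\lambda)\,\delta \;<\; 0 .
```

Parts 2 and 4 follow the same accounting with the signs reversed: mass $\epsilon$ is moved from the slack variables onto a cell with $C_{i,j} < 2\lambda$, so the cost grows by $C_{i,j}\,\epsilon < 2\lambda\epsilon$ while the regularizer shrinks by $2\lambda\epsilon$.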

B.3 PROOF OF LEMMA A.4

Proof. For notational simplicity, we drop the subscript 4, as we only deal with the solution of Formulation 4 and there is no ambiguity. We prove the lemma by contradiction. Suppose $s^*_{1,i} > 0$. Then we show that one can construct another solution $(\tilde\Pi, \tilde s_1, \tilde s_2)$ of Formulation 4 with a lower objective value. To construct this new solution, set:

$$\tilde s_{1,j} = \begin{cases} s^*_{1,j}, & \text{if } j \ne i, \\ 0, & \text{if } j = i. \end{cases}$$

To change the optimal transport plan, we only change the $i$-th row of $\Pi^*$. We subtract $a_1, a_2, \dots, a_n \ge 0$ from the entries of the $i$-th row of $\Pi^*$ in such a way that none of the elements become negative. Hence the column sums change, i.e., the value of $\tilde s_2$ is: $\tilde s_{2,j} = s^*_{2,j} - a_j$ for all $1 \le j \le n$.

Now clearly from our construction:

$$\langle C, \tilde\Pi \rangle \le \langle C, \Pi^* \rangle.$$

For the regularization part, note that, as we only reduced the $i$-th element of $s_1^*$, we have $\|\tilde s_1\|_1 = \|s_1^*\|_1 - s^*_{1,i}$. By the triangle inequality, $\|\tilde s_2\|_1 \le \|s_2^*\|_1 + \|a\|_1 = \|s_2^*\|_1 + s^*_{1,i}$ by construction of the $a_i$'s, as $a_i \ge 0$ and $\sum_i a_i = s^*_{1,i}$. Hence we have:

$$\|\tilde s_1\|_1 + \|\tilde s_2\|_1 \le \|s_1^*\|_1 - s^*_{1,i} + \|s_2^*\|_1 + s^*_{1,i} = \|s_1^*\|_1 + \|s_2^*\|_1.$$

Hence the value corresponding to the regularizer does not increase either. This completes the proof.

and hence by construction: So the objective value is overall reduced. This contradicts the optimality of $\Pi_3^*$, which completes the proof. $\|\tilde s_2\|_1 = \mathbf{1}^\top \Pi^*_{3,22} \mathbf{1} = \|s_2^*\|_1 - \mathbf{1}^\top \Pi^*_{3,21} \mathbf{1}$. $\|\tilde s_3\|_1 = \mathbf{1}^\top \Pi^*_{3,11} \mathbf{1} + \mathbf{1}^\top \Pi^*_{3,21} \mathbf{1} = s^*$



Figure 2: Empirical study of regularization hyperparameter λ sensitivity

transformation of $v$ as in Step 7, i.e., set $h \leftarrow \frac{v - c}{\alpha}$ and set $\Pi \leftarrow \frac{e^{h}}{\langle \mathbf{1}, e^{h} \rangle}$. 11:

Figure 3: Random sample of outliers detected by ROBOT from a dataset of MNIST digits contaminated with Fashion MNIST images.

and thus the proof holds with $s = \mu \oplus s - \mu$.

B PROOF OF ADDITIONAL LEMMAS

B.1 PROOF OF LEMMA A.1

Proof. The fact that $\Pi^*_{1,11} = \Pi^*_{1,21} = 0$ follows from the fact that $\Pi_1^* \succeq 0$ and $\Pi$

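PROOF OF LEMMA A.6

Proof. We prove this lemma by contradiction. Suppose $\Pi_3^*$ does not have the structure mentioned in the statement of the lemma. Construct another transport plan $\tilde\Pi_3$ for Formulation 3 as follows: keep $\tilde\Pi_{3,12} = \Pi^*_{3,12}$ and set $\tilde\Pi_{3,21} = 0$. Construct the other parts as:

$$\tilde\Pi_{3,11}(i,j) = \begin{cases} \sum_{k=1}^{m} \Pi^*_{3,11}(i,k) + \sum_{k=1}^{n} \Pi^*_{3,21}(k,i), & \text{if } i = j, \\ 0, & \text{if } i \ne j, \end{cases}$$

and

$$\tilde\Pi_{3,22}(i,j) = \begin{cases} \sum_{k=1}^{n} \Pi^*_{3,22}(k,i), & \text{if } i = j, \\ 0, & \text{if } i \ne j. \end{cases}$$

It is immediate from the construction that:

$$\langle C_{\mathrm{aug}}, \tilde\Pi_3 \rangle \le \langle C_{\mathrm{aug}}, \Pi_3^* \rangle.$$

As for the regularization term: note that by our construction $\tilde s_4$ is the same as $s_4^*$, since the column sums of $\tilde\Pi_{3,22}$ are the same as those of $\Pi^*_{3,22}$. For the other three: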

And also by our construction, $\tilde s_1 = s_1^* + c$, where $c = (\Pi^*_{3,21})^\top \mathbf{1}$. As a consequence we have $\|c\|_1 = \mathbf{1}^\top \Pi^*_{3,21} \mathbf{1}$. Then it follows:

Step 1: Almost surely $\mu \times \nu$, there exists a subsequence $\{n_k\}_{k \ge 1}$ such that $\{\mu_n + s_n\}_n$ and $\{\nu_n\}_n$ are relatively compact. $\mu$ and $\nu$ are probability measures on $\mathbb{R}^d$ and are therefore tight. Let $K_\epsilon$ be such that $P_\mu(X \notin K_\epsilon),\ P_\nu(Y \notin K_\epsilon) \le \epsilon/4$. Consider the empirical distributions $\nu_n = \sum_i \delta_{Y_i}/n$ and $\mu_n = \sum_i \delta_{X_i}/n$ of $\nu, \mu$ respectively. Here, $X_i \sim \mu$ and $Y_i \sim \nu$. Fix an $\omega$. Then $\{X_1, \dots, X_n, Y_1, \dots, Y_n\}$ is fixed. By the construction for the discrete case, $s_n$ has support in $\{X_1, \dots, X_n, Y_1, \dots, Y_n\}$.

A.3 PROOF OF CONTINUOUS VERSION

Proof. In this proof we denote by $F_1$ the optimization problem of equation 2.2 and by $F_2$ the optimization problem of equation 2.4. Assume that $\mu_n$ and $\nu_m$ denote the respective empirical measures relative to $\mu, \nu$. From Villani (2009), we know that $\mu_n, \nu_n$ converge weakly to $\mu$ and $\nu$ respectively. Therefore, $\mathrm{ROBOT}_2(\mu_n, \mu) \to 0$, and similarly for $\nu_n$ and $\nu$. Thus, by the triangle inequality,

where $s^+$ and $s^-$ are positive measures on

Therefore, the distribution given by

$P + s$, and therefore from (Villani, 2009),

Here $\delta_x$ is the Dirac mass at $x$. Moreover, $s_n = s$, where $s_n$ satisfies

Such an $s_n$ exists by the proof of the discrete part, because $\mu_n, \nu_n$ are discrete measures. Then, similar to Step 1 in the proof of Lemma A.8, there exists a probability measure $\mu \oplus s$ and a subsequence $\{n_k\}_{k \ge 1}$ such that $\mu_{n_k} + s_{n_k}$ almost surely converges weakly to $\mu \oplus s$.

Moreover, similar to Step 2 of Lemma

Therefore, $\mathrm{ROBOT}_2(\mu, \nu) = \limsup_{n \to \infty} \mathrm{ROBOT}(\mu_n, \nu_n) \ge \mathrm{ROBOT}(\mu, \nu)$. Thus the equality holds.

C PROOF OF THEOREM 2.1

Proof. The proof is immediate from Formulation 1. Recall that Formulation 1 can be restructured as:

where the infimum is taken over all measures dominated by some common measure $\sigma$ (with respect to which $\mu, \mu_c, \nu$ are dominated). Hence, $\mathrm{ROBOT}(\tilde\mu, \nu) \le \mathrm{OT}(P, \nu) + \lambda \|P - \tilde\mu\|_{TV}$ for any particular choice of $P$. Taking $P = \mu$ we get $\mathrm{ROBOT}(\tilde\mu, \nu) \le \mathrm{OT}(\mu, \nu) + \lambda \|\mu - \tilde\mu\|_{TV} = \mathrm{OT}(\mu, \nu) + \lambda\epsilon \|\mu - \mu_c\|_{TV}$. Taking $P = \nu$ we get $\mathrm{ROBOT}(\tilde\mu, \nu) \le \lambda \|\nu - \tilde\mu\|_{TV}$, and finally taking $P = \tilde\mu$ we get $\mathrm{ROBOT}(\tilde\mu, \nu) \le \mathrm{OT}(\tilde\mu, \nu)$. This completes the proof.
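The three choices of $P$ in this proof can be collected into a single bound. The display below is a summary sketch in the notation of the proof; it uses only that $\mathrm{OT}(\nu, \nu) = 0$ and $\|\tilde\mu - \tilde\mu\|_{TV} = 0$, which hold because $c(x,x) = 0$ and the TV norm of the zero measure vanishes.

```latex
\mathrm{ROBOT}(\tilde\mu, \nu) \;\le\; \min\Big\{
  \underbrace{\mathrm{OT}(\mu, \nu) + \lambda \|\mu - \tilde\mu\|_{TV}}_{P = \mu},\;
  \underbrace{\lambda \|\nu - \tilde\mu\|_{TV}}_{P = \nu},\;
  \underbrace{\mathrm{OT}(\tilde\mu, \nu)}_{P = \tilde\mu}
\Big\}.
```

Read this way, the first term controls the effect of contamination, the second caps the value by a multiple of $\lambda$ regardless of how far the outliers lie, and the third shows that ROBOT never exceeds the vanilla OT cost.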

