OUTLIER-ROBUST OPTIMAL TRANSPORT

Abstract

Optimal transport (OT) provides a way of measuring distances between distributions that depends on the geometry of the sample space. In light of recent advances in solving the OT problem, OT distances are widely used as loss functions in minimum distance estimation. Despite its prevalence and advantages, however, OT is extremely sensitive to outliers: a single adversarially-picked outlier can increase the OT distance arbitrarily. To address this issue, we propose an outlier-robust OT formulation. Our formulation is convex but, at first glance, challenging to scale. We proceed by deriving an equivalent formulation based on cost truncation that is easy to incorporate into modern stochastic algorithms for regularized OT. We demonstrate our model on mean estimation under the Huber contamination model in simulation, as well as on outlier detection with real data.

1. INTRODUCTION

Optimal transport is a fundamental problem in applied mathematics. In its original form (Monge, 1781), the problem entails finding the minimum-cost way to transport mass from a prescribed probability distribution µ on X to another prescribed distribution ν on X. Kantorovich (1942) relaxed Monge's formulation of the optimal transport problem to obtain the Kantorovich formulation:

    OT(µ, ν) := min_{Π ∈ F(µ,ν)} E_{(X_1,X_2)∼Π}[c(X_1, X_2)],     (1.1)

where F(µ, ν) is the set of couplings between µ and ν (probability distributions on X × X whose marginals are µ and ν) and c is a cost function, typically assumed to satisfy c(x, y) ≥ 0 and c(x, x) = 0. Compared to other notions of distance between probability distributions, optimal transport uniquely depends on the geometry of the sample space. Recent advances in optimization for optimal transport (Cuturi, 2013; Solomon et al., 2015; Genevay et al., 2016; Seguy et al., 2018) have made OT distances practical to use as loss functions. Many applications use OT as a loss in an optimization problem of the form

    θ̂ ∈ argmin_{θ ∈ Θ} OT(µ_n, ν_θ),     (1.2)

where {ν_θ}_{θ ∈ Θ} is a collection of parametric models and µ_n is the empirical distribution of the samples. Such estimators are called minimum Kantorovich estimators (MKE) (Bassetti et al., 2006). They are popular alternatives to likelihood-based estimators, especially in generative modeling. For example, when OT(•, •) is the Wasserstein-1 distance and ν_θ is a generator parameterized by a neural network with weights θ, equation 1.2 corresponds to the Wasserstein GAN (Arjovsky et al., 2017).

One drawback of optimal transport is its sensitivity to outliers. Because all the mass in µ must be transported to ν, a small fraction of outliers can have an outsized impact on the optimal transport problem. For statistics and machine learning applications in which the data is corrupted or noisy, this is a major issue.
For example, the poor performance of Wasserstein GANs in the presence of outliers was noted in recent work on outlier-robust generative learning with f-divergence GANs (Chao et al., 2018; Wu et al., 2020). The problem of outlier robustness in MKE has not been studied, with the exception of two concurrent works (Staerman et al., 2020; Balaji et al., 2020).
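The sensitivity to a single outlier is easy to reproduce numerically. The sketch below (an illustration, not the method proposed in this paper) solves the discrete Kantorovich problem of equation 1.1 as a linear program with SciPy and shows how moving one sample far away inflates the OT cost; the function and variable names are ours, chosen for the example.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(a, b, C):
    """Solve the discrete Kantorovich LP: min_P <C, P>
    subject to P 1 = a, P^T 1 = b, P >= 0, using scipy's linprog."""
    n, m = C.shape
    # Row-marginal constraints: sum_j P[i, j] = a[i]
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-marginal constraints: sum_i P[i, j] = b[j]
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(C.ravel(),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# Two uniform empirical distributions on the line, squared-distance cost.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])
a = np.full(4, 0.25)
b = np.full(4, 0.25)

clean = kantorovich(a, b, (x[:, None] - y[None, :]) ** 2)  # identical supports: cost 0

# Replace a single sample of x with an adversarial outlier.
x_out = x.copy()
x_out[3] = 1000.0
dirty = kantorovich(a, b, (x_out[:, None] - y[None, :]) ** 2)
```

Because every marginal constraint must be met, the 0.25 unit of mass at the outlier still has to be transported across the whole line, so `dirty` grows without bound as the outlier moves away, even though only one of four samples was corrupted.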

