INFOOT: INFORMATION MAXIMIZING OPTIMAL TRANSPORT

Abstract

Optimal transport aligns samples across distributions by minimizing the transportation cost between them, e.g., geometric distances. Yet, it ignores coherence structure in the data such as clusters, does not handle outliers well, and cannot integrate new data points. To address these drawbacks, we propose InfoOT, an information-theoretic extension of optimal transport that maximizes the mutual information between domains while minimizing geometric distances. The resulting objective can still be formulated as a (generalized) optimal transport problem and can be solved efficiently by projected gradient descent. This formulation yields a new projection method that is robust to outliers and generalizes to unseen samples. Empirically, InfoOT improves the quality of alignments across benchmarks in domain adaptation, cross-domain retrieval, and single-cell alignment.

1. INTRODUCTION

Optimal Transport (OT) provides a general framework with a strong theoretical foundation for comparing probability distributions based on the geometry of their underlying spaces (Villani, 2009). Beyond its fundamental role in mathematics, OT has received increasing attention in machine learning due to its wide range of applications in domain adaptation (Courty et al., 2017; Redko et al., 2019; Xu et al., 2020), generative modeling (Arjovsky et al., 2017; Bousquet et al., 2017), representation learning (Ozair et al., 2019; Chuang et al., 2022), and generalization bounds (Chuang et al., 2021). The development of efficient algorithms (Cuturi, 2013; Peyré et al., 2016) has significantly accelerated the adoption of optimal transport in these applications. Computationally, the discrete formulation of OT seeks a matrix, also called a transportation plan, that minimizes the total geometric transportation cost between two sets of samples drawn from the source and target distributions. The transportation plan implicitly defines (soft) correspondences across these samples, but provides no mechanism to relate newly-drawn data points: aligning them requires solving a new OT problem from scratch. This limits the applicability of OT, e.g., in streaming settings where samples arrive in sequence, or for very large datasets where OT can only be solved on a subset; in both cases, the current solution cannot be reused on future data. To overcome this fundamental constraint, a line of work proposes to directly estimate a mapping, the pushforward from source to target, that minimizes the transportation cost (Perrot et al., 2016; Seguy et al., 2017). Nevertheless, the quality of the resulting mapping depends heavily on the complexity of the mapping function (Galanti et al., 2021). OT can also yield alignments that ignore the intrinsic coherence structure of the data.
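The discrete formulation described above can be made concrete with a minimal sketch. Assuming uniform weights and equal sample sizes, the Kantorovich linear program admits a permutation solution, so it reduces to an assignment problem; the function name and setup below are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def discrete_ot_plan(X, Y):
    """Solve discrete OT between two equal-size, uniformly weighted
    empirical distributions. With uniform weights, a solution of the
    Kantorovich LP is a permutation matrix (an assignment problem)."""
    C = cdist(X, Y, metric="sqeuclidean")   # pairwise transport costs
    rows, cols = linear_sum_assignment(C)   # optimal matching
    n = len(X)
    P = np.zeros((n, n))
    P[rows, cols] = 1.0 / n                 # transportation plan
    return P, C[rows, cols].sum() / n       # plan and total cost

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = X + 0.1  # a translated copy: the identity matching is optimal
P, cost = discrete_ot_plan(X, Y)
```

Note that `P` aligns only the samples it was solved on; a newly drawn source point has no row in `P`, which is exactly the out-of-sample limitation discussed above.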
In particular, by relying exclusively on pairwise geometric distances, two nearby source samples can be mapped to disparate target samples, as in Figure 1, which is undesirable in some settings. For instance, when applying OT for domain adaptation, source samples with the same class should ideally be mapped to similar target samples. To mitigate this, prior work has sought to impose structural priors on the OT objective, e.g., via submodular cost functions (Alvarez-Melis et al., 2018) or a Gromov-Wasserstein regularizer (Vayer et al., 2018a;b). However, these methods still suffer from sensitivity to outliers (Mukherjee et al., 2021) and imbalanced data (Hsu et al., 2015; Tan et al., 2020).

This work presents Information Maximizing Optimal Transport (InfoOT), an information-theoretic extension of the optimal transport problem that generalizes the usual formulation by infusing it with global structure in the form of mutual information. In particular, InfoOT seeks alignments that maximize mutual information, an information-theoretic measure of dependence, between domains. To do so, we treat the pairs selected by the transportation plan as samples drawn from the joint distribution and estimate the mutual information with kernel density estimation based on these paired samples (Moon et al., 1995). Interestingly, this results in an OT problem where the cost is the log ratio between the estimated joint and marginal densities, f_XY(x, y) / (f_X(x) f_Y(y)). Empirically, we show that a cost combining mutual information with geometric distances yields better alignments across different applications. Moreover, akin to Gromov-Wasserstein (Mémoli, 2011), the mutual information estimator relies only on intra-domain distances, which, unlike the standard OT formulation, makes it suitable for aligning distributions whose supports lie in different metric spaces, e.g., supports with different modalities or dimensionality (Alvarez-Melis & Fusi, 2020; Demetci et al., 2020).
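The log-ratio cost above can be sketched as follows. This is a simplified illustration, not the paper's exact estimator: the Gaussian kernels, bandwidth, and normalization are assumptions, and `pmi_cost` is a hypothetical name. Pairs weighted by a plan P act as samples from the joint distribution; smoothing them with intra-domain kernel matrices gives a joint density estimate, from which the cost is the negative log ratio of joint to marginals.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pmi_cost(Kx, Ky, P, eps=1e-12):
    """Pointwise-mutual-information cost under a transport plan P.
    Kx, Ky are intra-domain kernel (Gram) matrices; the plan's pairs
    are treated as joint samples and kernel-smoothed into a density."""
    joint = Kx @ P @ Ky.T                    # unnormalized f_XY(x_i, y_j)
    joint = joint / joint.sum()
    fx = joint.sum(axis=1, keepdims=True)    # marginal over the targets
    fy = joint.sum(axis=0, keepdims=True)    # marginal over the sources
    return -np.log(joint / (fx * fy) + eps)  # -log f_XY / (f_X f_Y)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
Kx = np.exp(-cdist(X, X, "sqeuclidean") / 2)  # Gaussian kernels, h = 1
Ky = np.exp(-cdist(Y, Y, "sqeuclidean") / 2)
P_indep = np.full((6, 6), 1 / 36)   # independent coupling
C = pmi_cost(Kx, Ky, P_indep)       # no dependence, so the cost vanishes
```

A sanity check on this construction: under the independent coupling the estimated joint factorizes, the log ratio is zero everywhere, and only plans that induce dependence between domains produce a non-trivial cost.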
By estimating a joint density, InfoOT naturally yields a novel method for out-of-sample transportation, obtained by taking an expectation over the estimated density conditioned on a source sample, which we refer to as the conditional projection. Typically, samples are mapped via a barycentric projection (Ferradans et al., 2014; Flamary et al., 2016), which maps each source sample to the weighted average of target samples, where the weights are determined by the transportation plan. The barycentric projection inherits the disadvantages of standard OT: sensitivity to outliers and failure to generalize to new samples. In contrast, our proposed conditional projection is robust to outliers and to cross-domain class-imbalanced data (Figures 1 and 4), since it averages over samples with importance weights given by, again, the ratio between the estimated joint and marginal densities. Furthermore, this projection is well-defined even for unseen samples, which widens the applicability of OT in streaming or large-scale settings where solving OT for the complete dataset is prohibitive. In short, this work makes the following contributions:
• We propose InfoOT, an information-theoretic extension of optimal transport that regularizes alignments by maximizing mutual information;
• We develop the conditional projection, a new projection method for OT that is robust to outliers and class imbalance in the data, and generalizes to new samples;
• We evaluate our approach via experiments in domain adaptation, cross-domain retrieval, and single-cell alignment.
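The two projections discussed above can be contrasted in a short sketch. The barycentric projection follows its standard definition; the conditional projection below only mirrors the described idea (kernel-smoothing the plan's pairs into a joint density, then averaging targets under the conditional), so the Gaussian kernel, the bandwidth `h`, and both function names are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np
from scipy.spatial.distance import cdist

def barycentric_projection(P, Y):
    """Map each source point to the P-weighted average of target points."""
    return (P @ Y) / P.sum(axis=1, keepdims=True)

def conditional_projection(x_new, X, Y, P, h=0.5):
    """Sketch of a conditional projection: average the target points
    under a kernel-smoothed joint density built from the plan P.
    Unlike the barycentric projection, x_new need not be a row of X,
    so the map extends to unseen samples."""
    kx = np.exp(-np.sum((X - x_new) ** 2, axis=1) / (2 * h ** 2))
    Ky = np.exp(-cdist(Y, Y, "sqeuclidean") / (2 * h ** 2))
    w = kx @ P @ Ky                  # unnormalized f(x_new, y_j)
    return (w / w.sum()) @ Y         # E[y | x_new] under the estimate

X = np.array([[0., 0.], [5., 0.], [0., 5.], [5., 5.]])
Y = X + 1.0                          # targets are shifted copies
P = np.eye(4) / 4                    # plan matching i-th source to i-th target
```

On this toy plan the two projections agree on the training points, but only `conditional_projection` accepts a point that was absent when the plan was solved.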

2. RELATED WORK

Optimal Transport Optimal transport provides an elegant framework for comparing and aligning distributions. The discrete formulation, also called the Earth Mover's Distance (EMD), finds an optimal coupling between empirical samples by solving a linear program (Bonneel et al., 2011). To speed up the computation, Cuturi (2013) proposes the Sinkhorn distance, an entropy-regularized version of EMD that can be solved more efficiently via the Sinkhorn-Knopp algorithm (Knight, 2008). Compared to EMD, this regularized formulation typically yields denser transportation plans, where a sample can be associated with multiple target points. Various extensions of OT have been proposed to impose stronger priors, e.g., Alvarez-Melis et al. (2018) incorporate additional structure by leveraging a submodular transportation cost, while Flamary et al. (2016) induce class coherence through a group-sparsity regularizer. The Gromov-Wasserstein (GW) distance (Mémoli, 2011) is a variant of OT in which the transportation cost is defined on intra-domain pairwise distances. GW has therefore been adopted to align 'incomparable spaces' (Alvarez-Melis & Jaakkola, 2018; Demetci et al., 2020), since the source and target domains need not lie in the same space. Because the GW objective is no longer a linear program, it is typically optimized using projected gradient descent.
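The Sinkhorn-Knopp iterations mentioned above admit a minimal sketch: alternately rescale the rows and columns of the Gibbs kernel K = exp(-C / reg) until the plan matches both marginals. The regularization strength and iteration count below are illustrative choices.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iter=500):
    """Sinkhorn-Knopp iterations for entropy-regularized OT.
    a, b are the source/target marginals, C the cost matrix."""
    K = np.exp(-C / reg)             # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # enforce column marginals
        u = a / (K @ v)              # enforce row marginals
    return u[:, None] * K * v[None, :]   # diag(u) K diag(v)

rng = np.random.default_rng(0)
C = rng.random((5, 5))               # arbitrary cost matrix
a = b = np.full(5, 1 / 5)            # uniform marginals
P = sinkhorn(a, b, C)                # dense, entropy-regularized plan
```

Unlike the EMD solution, which is a sparse permutation-like plan, every entry of `P` here is strictly positive, illustrating the denser plans the regularization produces.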



Figure 1: Illustration of InfoOT on a 2D point cloud. Compared to classic OT, InfoOT preserves the cluster structure: source points from the same cluster are mapped to the same target cluster. For projection estimation (dashed lines), the proposed conditional projection improves over the barycentric projection, with better outlier robustness and out-of-sample generalization.

