INFOOT: INFORMATION MAXIMIZING OPTIMAL TRANSPORT

Abstract

Optimal transport aligns samples across distributions by minimizing the transportation cost between them, e.g., the geometric distances. Yet, it ignores coherence structure in the data such as clusters, does not handle outliers well, and cannot integrate new data points. To address these drawbacks, we propose InfoOT, an information-theoretic extension of optimal transport that maximizes the mutual information between domains while minimizing geometric distances. The resulting objective can still be formulated as a (generalized) optimal transport problem, and can be efficiently solved by projected gradient descent. This formulation yields a new projection method that is robust to outliers and generalizes to unseen samples. Empirically, InfoOT improves the quality of alignments across benchmarks in domain adaptation, cross-domain retrieval, and single-cell alignment.

1. INTRODUCTION

Optimal Transport (OT) provides a general framework with a strong theoretical foundation to compare probability distributions based on the geometry of their underlying spaces (Villani, 2009) . Besides its fundamental role in mathematics, OT has increasingly received attention in machine learning due to its wide range of applications in domain adaptation (Courty et al., 2017; Redko et al., 2019; Xu et al., 2020) , generative modeling (Arjovsky et al., 2017; Bousquet et al., 2017) , representation learning (Ozair et al., 2019; Chuang et al., 2022) , and generalization bounds (Chuang et al., 2021) . The development of efficient algorithms (Cuturi, 2013; Peyré et al., 2016) has significantly accelerated the adoption of optimal transport in these applications. Computationally, the discrete formulation of OT seeks a matrix, also called transportation plan, that minimizes the total geometric transportation cost between two sets of samples drawn from the source and target distributions. The transportation plan implicitly defines (soft) correspondences across these samples, but provides no mechanism to relate newly-drawn data points. Aligning these requires solving a new OT problem from scratch. This limits the applicability of OT, e.g., to streaming settings where the samples arrive in sequence, or very large datasets where we can only solve OT on a subset. In this case, the current solution cannot be used on future data. To overcome this fundamental constraint, a line of work proposes to directly estimate a mapping, the pushforward from source to target, that minimizes the transportation cost (Perrot et al., 2016; Seguy et al., 2017) . Nevertheless, the resulting mapping is highly dependent on the complexity of the mapping function (Galanti et al., 2021) . OT could also yield alignments that ignore the intrinsic coherence structure of the data. 
In particular, by relying exclusively on pairwise geometric distances, two nearby source samples could be mapped to disparate target samples, as in Figure 1, which is undesirable in some settings. For instance, when applying OT for domain adaptation, source samples with the same class should ideally be mapped to similar target samples. To mitigate this, prior work has sought to impose structural priors on the OT objective, e.g., via submodular cost functions (Alvarez-Melis et al., 2018) or a Gromov-Wasserstein regularizer (Vayer et al., 2018b;a). However, these methods still suffer from sensitivity to outliers (Mukherjee et al., 2021) and imbalanced data (Hsu et al., 2015; Tan et al., 2020).

This work presents Information Maximizing Optimal Transport (InfoOT), an information-theoretic extension of the optimal transport problem that generalizes the usual formulation by infusing it with global structure in the form of mutual information. In particular, InfoOT seeks alignments that maximize mutual information, an information-theoretic measure of dependence, between domains.

(Figure 1 caption: Compared to classic OT, InfoOT preserves the cluster structure, where source points from the same cluster are mapped to the same target cluster. For projection estimation (dashed lines), the new conditional projection improves over the barycentric projection with better outlier robustness and out-of-sample generalization.)

To do so, we treat the pairs selected by the transportation plan as samples drawn from the joint distribution and estimate the mutual information with kernel density estimation based on the paired samples (Moon et al., 1995). Interestingly, this results in an OT problem where the cost is the log ratio between the estimated joint and marginal distributions, $f_{XY}(x, y)/(f_X(x) f_Y(y))$. Empirically, we show that a cost combining mutual information with geometric distances yields better alignments across different applications.
Moreover, akin to Gromov-Wasserstein (Mémoli, 2011), the mutual information estimator only relies on intra-domain distances, which, unlike the standard OT formulation, makes it suitable for aligning distributions whose supports lie in different metric spaces, e.g., supports with different modalities or dimensionality (Alvarez-Melis & Fusi, 2020; Demetci et al., 2020). By estimating a joint density, InfoOT naturally yields a novel method for out-of-sample transportation by taking an expectation over the estimated densities conditioned on the source samples, which we refer to as conditional projection. Typically, samples are mapped via a barycentric projection (Ferradans et al., 2014; Flamary et al., 2016), which corresponds to the weighted average of target samples, where the weights are determined by the transportation plan. The barycentric projection inherits the disadvantages of standard OT: sensitivity to outliers and failure to generalize to new samples. In contrast, our proposed conditional projection is robust to outliers and to cross-domain class-imbalanced data (Figures 1 and 4) by averaging over samples with importance sampling, where the weight is, again, the ratio between the estimated joint and marginal densities. Furthermore, this projection is well-defined even for unseen samples, which widens the applicability of OT in streaming or large-scale settings where solving OT for the complete dataset is prohibitive. In short, this work makes the following contributions:

• We propose InfoOT, an information-theoretic extension of optimal transport that regularizes alignments by maximizing mutual information;
• We develop conditional projection, a new projection method for OT that is robust to outliers and class imbalance in the data, and generalizes to new samples;
• We evaluate our approach via experiments in domain adaptation, cross-domain retrieval, and single-cell alignment.

2. RELATED WORKS

Optimal Transport. Optimal transport provides an elegant framework to compare and align distributions. The discrete formulation, also called the Earth Mover's Distance (EMD), finds an optimal coupling between empirical samples by solving a linear programming problem (Bonneel et al., 2011). To speed up the computation, Cuturi (2013) proposes the Sinkhorn distance, an entropic regularized version of EMD that can be solved more efficiently via the Sinkhorn-Knopp algorithm (Knight, 2008). Compared to EMD, this regularized formulation typically yields denser transportation plans, where samples can be associated with multiple target points. Various extensions of OT have been proposed to impose stronger priors, e.g., Alvarez-Melis et al. (2018) incorporate additional structure by leveraging a submodular transportation cost, while Flamary et al. (2016) induce class coherence through a group-sparsity regularizer. The Gromov-Wasserstein (GW) distance (Mémoli, 2011) is a variant of OT in which the transportation cost is defined upon intra-domain pairwise distances. Therefore, GW has been adopted to align 'incomparable spaces' (Alvarez-Melis & Jaakkola, 2018; Demetci et al., 2020), as the source and target domains do not need to lie in the same space. Since the GW objective is no longer a linear program, it is typically optimized with projected gradient descent (Peyré et al., 2016; Solomon et al., 2016). Fused-GW, which combines the OT and GW objectives, was proposed by Vayer et al. (2018a) to measure graph distances.

Mutual Information and OT. The proposed InfoOT extends the standard OT formulation by maximizing a kernel-density-estimated mutual information. Recent works (Bai et al., 2020; Khan & Zhang, 2022) also explore the connection between OT and information theory. Liu et al. (2021) consider a semi-supervised setting for estimating a variant of mutual information, where the unpaired samples are leveraged to minimize the estimation error. Ozair et al. (2019) replace the KL divergence in mutual information with the Wasserstein distance and develop a loss function for representation learning. In comparison, the objective of InfoOT is to seek alignments that maximize the mutual information while being fully unsupervised, by parameterizing the joint densities with the transportation plan. Another line of work also combines OT with kernel density estimation (Canas & Rosasco, 2012; Mokrov et al., 2021), but focuses on different applications.

3. BACKGROUND ON OT AND KDE

Optimal Transport. Let $\{x_i\}_{i=1}^n \in \mathcal{X}^n$ and $\{y_j\}_{j=1}^m \in \mathcal{Y}^m$ be the empirical samples and $C \in \mathbb{R}^{n \times m}$ be the transportation cost for each pair, e.g., the Euclidean cost $C_{ij} = \|x_i - y_j\|$. Given two sets of weights over samples, $p \in \mathbb{R}^n_+$ and $q \in \mathbb{R}^m_+$ with $\sum_{i=1}^n p_i = \sum_{j=1}^m q_j = 1$, and a cost matrix $C$, Kantorovich's formulation of optimal transport solves

$$\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle, \qquad \Pi(p, q) = \{\Gamma \in \mathbb{R}^{n \times m}_+ \mid \Gamma 1_m = p,\ \Gamma^\top 1_n = q\},$$

where $\Pi(p, q)$ is the set of transportation plans that satisfy the flow constraints. In practice, the Sinkhorn distance (Cuturi, 2013), an entropic regularized version of OT, can be solved more efficiently via the Sinkhorn-Knopp algorithm. In particular, the Sinkhorn distance solves

$$\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle - \epsilon H(\Gamma),$$

where $H(\Gamma) = -\sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ is the entropic regularizer that smooths the transportation plan.

Kernel Density Estimation. Kernel Density Estimation (KDE) is a non-parametric density estimation method based on kernel smoothing (Parzen, 1962; Rosenblatt, 1956). Here, we consider a generalized KDE for metric spaces $(\mathcal{X}, d_X)$ and $(\mathcal{Y}, d_Y)$ (Li et al., 2020; Pelletier, 2005). In particular, given a paired dataset $\{(x_i, y_i)\}_{i=1}^n$ sampled i.i.d. from an unknown joint density $f_{XY}$ and a kernel function $K : \mathbb{R} \to \mathbb{R}$, KDE estimates the marginals and the joint density as

$$\hat{f}_X(x) = \frac{1}{n} \sum_i K_{h_1}(d_X(x, x_i)), \qquad \hat{f}_{XY}(x, y) = \frac{1}{n} \sum_i K_{h_1}(d_X(x, x_i)) K_{h_2}(d_Y(y, y_i)), \qquad (1)$$

where $K_h(t) = K(t/h)/Z_h$ and the normalizing constant $Z_h$ makes equation 1 integrate to one. The bandwidth parameter $h$ controls the smoothness of the estimated densities. Figure 2 illustrates an example of KDE on 1D data. In this work, we do not need to estimate the normalizing constant, as only the ratio between the joint and marginal densities, $\hat{f}_{XY}(x, y)/(\hat{f}_X(x)\hat{f}_Y(y))$, is considered while estimating the mutual information.
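As a concrete illustration of the entropic formulation above, here is a minimal NumPy sketch of the Sinkhorn-Knopp iterations. This is not the implementation used in the paper (which relies on the POT library), and the toy samples are our own:

```python
import numpy as np

def sinkhorn(p, q, C, eps=0.5, n_iters=500):
    """Entropic OT, min <Gamma, C> - eps * H(Gamma), via Sinkhorn-Knopp."""
    K = np.exp(-C / eps)                # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)               # scale columns toward marginal q
        u = p / (K @ v)                 # scale rows toward marginal p
    return u[:, None] * K * v[None, :]  # transportation plan Gamma

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))                              # source samples
y = rng.normal(size=(7, 2))                              # target samples
C = np.linalg.norm(x[:, None] - y[None, :], axis=-1)     # cost C_ij = ||x_i - y_j||
p = np.full(5, 1 / 5)                                    # uniform source weights
q = np.full(7, 1 / 7)                                    # uniform target weights
G = sinkhorn(p, q, C)
```

After convergence, the plan satisfies the flow constraints: its rows sum to p and its columns to q.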
For all the presented experiments, we adopt the Gaussian kernel,

$$K_h(d_X(x, x')) = \frac{1}{Z_h} \exp\left(-\frac{d_X(x, x')^2}{2 h^2 \sigma^2}\right),$$

so that the estimates in equation 1 become $\hat{f}_X(x) = \frac{1}{n} \sum_i K_{h_1}(d_X(x, x_i))$ and $\hat{f}_{XY}(x, y) = \frac{1}{n} \sum_i K_{h_1}(d_X(x, x_i)) K_{h_2}(d_Y(y, y_i))$, where $\sigma^2$ controls the variance. The Gaussian kernel has been successfully adopted for KDE beyond the Euclidean space (Li et al., 2020; Said et al., 2017), and we found it to work well in our experiments. For simplicity, we also set $h_1 = h_2 = h$ for all the experiments.
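As a sketch of how these densities are used, the snippet below estimates the joint-to-marginal ratio $\hat f_{XY}/(\hat f_X \hat f_Y)$ for 1-D paired samples with an unnormalized Gaussian kernel ($Z_h$ cancels in the ratio). The data and bandwidth are illustrative assumptions:

```python
import numpy as np

def gaussian_k(d, h, sigma=1.0):
    # Unnormalized Gaussian kernel K_h; Z_h cancels in the density ratio.
    return np.exp(-d**2 / (2 * h**2 * sigma**2))

def density_ratio(xq, yq, xs, ys, h=0.3):
    """KDE estimate of f_XY(xq, yq) / (f_X(xq) f_Y(yq)) from paired
    1-D samples (xs_i, ys_i), following equation 1 with h1 = h2 = h."""
    kx = gaussian_k(np.abs(xq - xs), h)   # K_h(d_X(xq, x_i))
    ky = gaussian_k(np.abs(yq - ys), h)   # K_h(d_Y(yq, y_i))
    joint = np.mean(kx * ky)              # f_XY up to constants
    return joint / (np.mean(kx) * np.mean(ky))

rng = np.random.default_rng(0)
xs = rng.normal(size=500)
ys = xs + 0.1 * rng.normal(size=500)      # strongly dependent pairs
# The ratio exceeds 1 where the joint puts more mass than independence would.
on_diag = density_ratio(0.0, 0.0, xs, ys)
off_diag = density_ratio(1.5, -1.5, xs, ys)
```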

4. INFORMATION MAXIMIZING OT

Optimal transport captures the geometry of the underlying space through the ground metric in its objective. Additional information that is not directly captured in this metric, such as coherence structure, will therefore be ignored when solving the problem. This is undesirable in applications where this additional structure matters, for instance in domain adaptation, where class coherence is crucial. As a concrete example, the cluster structure of the dataset in Figure 1 is ignored by classic OT. Intuitively, the reason for this issue is that classic OT is too local: the transportation cost considers each sample separately, without respecting coherence across close-by samples. Next, we show that mutual information estimated with KDE can introduce global structure into OT maps.

4.1. MEASURING GLOBAL STRUCTURE WITH MUTUAL INFORMATION

Formally, mutual information measures the statistical dependence of two random variables $X, Y$:

$$I(X; Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} f_{XY}(x, y) \log \frac{f_{XY}(x, y)}{f_X(x) f_Y(y)} \, dx \, dy, \qquad (2)$$

where $f_{XY}$ is the joint density and $f_X$, $f_Y$ are the marginal probability density functions. For paired datasets, various mutual information estimators have been defined (Belghazi et al., 2018; Moon et al., 1995; Poole et al., 2019). In contrast, we are interested in the inverse problem: given unpaired samples $\{x_i\}_{i=1}^n$ and $\{y_j\}_{j=1}^m$, can we find alignments that maximize the mutual information?

Discrete vs. Continuous. An immediate idea is to treat the discrete transportation plan $\Gamma$ as the joint distribution between $X$ and $Y$, and write the mutual information as $\sum_{i,j} \Gamma_{ij} \log(nm \Gamma_{ij}) = \log(nm) - H(\Gamma)$. In this case, maximizing mutual information would be equivalent to minimizing the entropic regularizer $H(\Gamma)$ introduced by Cuturi (2013). For a finite set of samples, however, this mutual information estimator is trivially maximized by any one-to-one mapping, since every permutation plan attains the same minimal entropy $H(\Gamma)$. Figure 3(a) illustrates two one-to-one mappings $\Gamma_a$ and $\Gamma_b$ between points sampled from multi-mode Gaussian distributions, where $\Gamma_a$ preserves the cluster structure and $\Gamma_b$ is simply a random permutation. Both maximize the mutual information estimate above, yet $\Gamma_a$ is a better alignment with higher coherence. In short, directly using the transportation plan estimated from finite samples as the joint distribution to estimate mutual information between continuous random variables is problematic. In contrast, joint distributions estimated with KDE tend to be smoother, such as the one induced by $\Gamma_a$ in Figure 3(b). This suggests that KDE may lead to a better objective for the alignment problem.
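The degeneracy of the discrete estimator can be checked numerically: any two one-to-one plans, cluster-preserving or not, attain exactly the same value $\log(nm) - H(\Gamma)$. A small sketch with toy plans of our own:

```python
import numpy as np

def discrete_mi(G):
    # sum_ij G_ij log(n*m*G_ij): the plan-as-joint-distribution estimator,
    # with the convention 0 log 0 = 0.
    nz = G > 0
    return float((G[nz] * np.log(G.size * G[nz])).sum())

n = 6
G_identity = np.eye(n) / n                                    # structure-preserving match
G_shuffled = np.eye(n)[np.random.default_rng(0).permutation(n)] / n  # random permutation
# Both one-to-one plans score identically: log(n*n) - H(G) = log n,
# so the estimator cannot distinguish coherent from incoherent matchings.
mi_a, mi_b = discrete_mi(G_identity), discrete_mi(G_shuffled)
```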

4.2. INFOOT: MAXIMIZING MUTUAL INFORMATION WITH KDE

Instead of directly interpreting the OT plan as the joint distribution for the mutual information, we use it to inform the definition of a different one. In particular, we treat $\Gamma_{ij}$ as the weight of the pair $(x_i, y_j)$ within the empirical samples drawn from the unknown joint distribution with density $f_{XY}$. Intuitively, $\Gamma_{ij}$ defines which empirical samples we obtain by sampling from the joint distribution. Given a transportation plan $\Gamma$, the kernelized joint density in equation 1 can be rewritten as

$$\hat{f}_\Gamma(x, y) = \sum_i \sum_j \Gamma_{ij} K_h(d_X(x, x_i)) K_h(d_Y(y, y_j)).$$

The $1/n$ factor is dropped as the plan $\Gamma$ is already normalized ($\sum_{i,j} \Gamma_{ij} = 1$). Specifically, we replace the prespecified paired samples in equation 1 with the ones selected by the plan $\Gamma$.

Definition 1 (Kernelized Mutual Information). The KDE-estimated mutual information reads

$$\hat{I}_\Gamma(X, Y) = \sum_{i,j} \Gamma_{ij} \log \frac{\hat{f}_\Gamma(x_i, y_j)}{\hat{f}(x_i)\hat{f}(y_j)} = \sum_{i,j} \Gamma_{ij} \log \frac{nm \cdot \sum_{k,l} \Gamma_{kl} K_h(d_X(x_i, x_k)) K_h(d_Y(y_j, y_l))}{\sum_k K_h(d_X(x_i, x_k)) \cdot \sum_l K_h(d_Y(y_j, y_l))}.$$

The estimation involves two steps: (1) approximating the joint distribution with KDE, and (2) estimating the integral in equation 2 with paired empirical samples $(x_i, y_j)$ weighted by $\Gamma_{ij}$. The normalizing constant $Z_h$ in equation 1 cancels out when calculating the ratio between the joint and marginal probability densities. To maximize the empirical mutual information $\hat{I}_\Gamma(X, Y)$, the plan has to map close-by points $i, k$ to close-by points $j, l$. Maximizing this information can be interpreted as an optimal transport problem:

$$\text{(InfoOT)} \qquad \max_{\Gamma \in \Pi(p,q)} \hat{I}_\Gamma(X, Y) = \min_{\Gamma \in \Pi(p,q)} \sum_{i,j} \Gamma_{ij} \cdot \log \frac{\hat{f}(x_i)\hat{f}(y_j)}{\hat{f}_\Gamma(x_i, y_j)}. \qquad (3)$$

Instead of pairwise (Euclidean) distances, the transportation cost is now the log ratio between the estimated marginal and joint densities. The following lemma illustrates the asymptotic relation between the kernel-estimated mutual information and the entropic regularizer.

Lemma 2.
When $h \to 0$ and $K(\cdot)$ is the Gaussian kernel, we have $\hat{I}_\Gamma(X, Y) \to -H(\Gamma) + \log(nm)$.

When the bandwidth $h$ goes to zero, the estimated density is a sum of delta functions centered at the samples, and the estimated mutual information degenerates to the standard entropic regularizer (Cuturi, 2013). Note that the formulation of InfoOT does not require the supports of $X$ and $Y$ to be comparable. Similar to Gromov-Wasserstein (Mémoli, 2011), InfoOT only relies on intra-domain distances, which makes it an appropriate objective for aligning distributions whose supports do not lie in the same metric space, e.g., supports with different modalities or dimensionalities, as section 6.4 shows.

Fused InfoOT: Incorporating the Geometry. When the geometry between domains is informative, the mutual information can act as a regularizer that refines the alignment. With a weighting parameter $\lambda$, we define Fused InfoOT as

$$\text{(F-InfoOT)} \qquad \min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle - \lambda \hat{I}_\Gamma(X, Y) = \min_{\Gamma \in \Pi(p,q)} \sum_{i,j} \Gamma_{ij} \cdot \left( C_{ij} + \lambda \cdot \log \frac{\hat{f}(x_i)\hat{f}(y_j)}{\hat{f}_\Gamma(x_i, y_j)} \right). \qquad (4)$$

The transportation cost becomes the weighted sum of the pairwise distances $C$ and the log ratio of the joint and marginal densities. As Figure 1 illustrates, the mutual information regularizer excludes alignments that destroy the cluster structure while minimizing the pairwise distances. Practically, we found F-InfoOT suitable for general OT applications such as unsupervised domain adaptation (Flamary et al., 2016) and color transfer (Ferradans et al., 2014), where the geometry between source and target is informative.
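The kernelized mutual information of Definition 1 is straightforward to compute with matrix products. The sketch below, on hypothetical two-cluster 1-D data, also shows that, unlike the discrete estimator, it prefers a cluster-preserving plan over one that splits clusters:

```python
import numpy as np

def kernel_mat(Z, h=0.5):
    # Gaussian kernel matrix on pairwise Euclidean distances
    # (Z_h omitted; it cancels in the joint/marginal ratio).
    D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    return np.exp(-D**2 / (2 * h**2))

def kernelized_mi(G, Kx, Ky):
    """I_Gamma(X, Y) of Definition 1 via matrix products."""
    n, m = G.shape
    joint = Kx @ G @ Ky.T                              # sum_kl G_kl K K
    marg = Kx.sum(1)[:, None] * Ky.sum(1)[None, :]     # sum_k K * sum_l K
    return float((G * np.log(n * m * joint / marg)).sum())

# Two 1-D clusters in each domain.
X = np.array([[0.0], [0.5], [10.0], [10.5]])
Y = np.array([[0.0], [0.5], [10.0], [10.5]])
Kx, Ky = kernel_mat(X), kernel_mat(Y)
G_coherent = np.eye(4) / 4             # maps clusters to clusters
G_split = np.eye(4)[[0, 2, 1, 3]] / 4  # one-to-one, but splits both clusters
```

Both plans are one-to-one, yet only the cluster-preserving one scores highly.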

4.3. NUMERICAL OPTIMIZATION

As the transportation cost depends on $\Gamma$, the objective is no longer linear in $\Gamma$ and cannot be solved with linear programming. Instead, we adopt the projected gradient descent of Peyré et al. (2016). In particular, Benamou et al. (2015) show that the projection can be done by simply solving a Sinkhorn distance (Cuturi, 2013) if the non-linear objective is augmented with the entropic regularizer $H(\Gamma)$. For instance, we can augment F-InfoOT as follows:

$$\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle - \lambda \hat{I}_\Gamma(X, Y) - \epsilon H(\Gamma).$$

In this case, the update of projected gradient descent reads

$$\Gamma_{t+1} \leftarrow \arg\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C - \lambda \nabla_\Gamma \hat{I}_{\Gamma_t}(X, Y) \rangle - \epsilon H(\Gamma). \qquad (5)$$

The update is performed by solving a Sinkhorn distance (Cuturi, 2013), where the cost function is the gradient of the F-InfoOT objective. We provide a detailed derivation of (5) in Appendix A.2.

Matrix Computation. Practically, the optimization can be efficiently computed with matrix multiplications. The gradient with respect to the transportation plan $\Gamma$ is

$$\frac{\partial \hat{I}_\Gamma(X, Y)}{\partial \Gamma_{ij}} = \log \frac{\hat{f}_\Gamma(x_i, y_j)}{\hat{f}(x_i)\hat{f}(y_j)} + \sum_{k,l} \frac{\Gamma_{kl} K_h(d_X(x_i, x_k)) K_h(d_Y(y_j, y_l))}{\hat{f}_\Gamma(x_k, y_l)}.$$

Let $K_X$ and $K_Y$ be the kernel matrices with $(K_X)_{ij} = K_h(d_X(x_i, x_j))$ and $(K_Y)_{ij} = K_h(d_Y(y_i, y_j))$. The gradient has the matrix form

$$\nabla_\Gamma \hat{I}_\Gamma(X, Y) = \log\left( K_X \Gamma K_Y^\top \oslash M_X M_Y^\top \right) + K_X \left( \Gamma \oslash K_X \Gamma K_Y^\top \right) K_Y^\top,$$

where $(M_X)_i = \hat{f}(x_i)$ and $(M_Y)_j = \hat{f}(y_j)$ are the marginal density vectors and $\oslash$ denotes elementwise division. The gradient can be computed with matrix multiplications in $O(n^2 m + n m^2)$.
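The projected gradient iteration (5) can be sketched as follows. The data, bandwidth, λ, and ϵ below are illustrative choices, not the paper's settings, and the normalizing constants are dropped since they cancel in the density ratios:

```python
import numpy as np

def sinkhorn(p, q, C, eps, iters=300):
    K = np.exp(-(C - C.min()) / eps)   # shift costs for numerical stability
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def kernel_mat(Z, h=1.0):
    D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    return np.exp(-D**2 / (2 * h**2))

def f_infoot(X, Y, lam=0.5, eps=0.5, h=1.0, steps=20):
    """Projected gradient descent for F-InfoOT (equation 5), a sketch."""
    n, m = len(X), len(Y)
    p, q = np.full(n, 1 / n), np.full(m, 1 / m)
    C = np.linalg.norm(X[:, None] - Y[None, :], axis=-1)
    Kx, Ky = kernel_mat(X, h), kernel_mat(Y, h)
    marg = Kx.sum(1)[:, None] * Ky.sum(1)[None, :]     # M_X M_Y^T
    G = np.outer(p, q)                                 # independent-coupling init
    for _ in range(steps):
        F = Kx @ G @ Ky.T                              # f_Gamma at sample pairs
        grad = np.log(n * m * F / marg) + Kx @ (G / F) @ Ky.T
        G = sinkhorn(p, q, C - lam * grad, eps)        # KL projection step
    return G

X = np.array([[0.0], [0.5], [4.0], [4.5]])             # two source clusters
Y = np.array([[0.2], [0.6], [4.1], [4.6]])             # two target clusters
G = f_infoot(X, Y)
```

On this toy problem the returned plan keeps almost all mass within matching clusters.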

5. CONDITIONAL PROJECTION WITH INFOOT

Many applications of optimal transport involve mapping source points to a target domain. For instance, when applying OT for domain adaptation, the classifiers are trained on projected source samples that are mapped to the target domain. When $\mathcal{X} = \mathcal{Y}$, given a transportation plan $\Gamma$, a barycentric projection maps source samples to the target domain by minimizing the weighted cost to target samples (Flamary et al., 2016; Perrot et al., 2016). The mapping is equivalent to the weighted average of the target samples when the cost function is the squared Euclidean distance $c(x, y) = \|x - y\|^2$:

$$x_i \mapsto \arg\min_{y \in \mathcal{Y}} \sum_{j=1}^m \Gamma_{ij} \|y - y_j\|^2 = \frac{1}{\sum_{j=1}^m \Gamma_{ij}} \sum_{j=1}^m \Gamma_{ij} y_j. \qquad (6)$$

Despite its simplicity, the barycentric projection fails when (a) aligning data with outliers, (b) the data is imbalanced, and (c) mapping new samples. For instance, if sample $x_i$ is mostly mapped to an outlier $y_j$, then its projection will be close to $y_j$. Similar problems occur when applying OT for domain adaptation. If the size of the same class differs between domains, false alignments emerge due to the flow constraint of OT, as Figure 4 illustrates, which worsens the subsequent projections. Since the barycentric projection relies on the transportation plan to calculate the weights, any new source sample requires re-computing OT to obtain its transportation plan. This can be computationally prohibitive in large-scale settings. In the next section, we show that the densities estimated via InfoOT can be leveraged to compute the conditional expectation, which leads to a new mapping approach that is both robust and generalizable.
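Equation 6 is a one-liner in code; a minimal sketch with a toy plan of our own:

```python
import numpy as np

def barycentric_projection(G, Y):
    """Map each source point to the Gamma-weighted average of the
    target samples (equation 6)."""
    return (G @ Y) / G.sum(axis=1, keepdims=True)

# With a one-to-one plan, each source point lands exactly on its match.
Y = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
G = np.eye(3) / 3          # one-to-one plan with uniform marginals
proj = barycentric_projection(G, Y)
```

When a row of the plan puts mass on an outlier, the projected point is pulled toward that outlier, which is the failure mode (a) above.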

5.1. CONDITIONAL EXPECTATION VIA KDE

When treating the transportation plan as a probability mass function, the right-hand side of equation 6 resembles the conditional expectation $\mathbb{E}[Y \mid X = x]$. Indeed, the classic definition of the barycentric projection (Ambrosio et al., 2005) is the integral over the conditional distribution. But this again faces the issues discussed in section 4. Instead, equipped with the densities estimated via KDE and InfoOT, the conditional expectation can be better estimated with classical Monte-Carlo importance sampling using samples from the marginal $P_Y$:

$$x \mapsto \mathbb{E}_{P_{Y|X=x}}[y] = \mathbb{E}_{y \sim P_Y}\left[\frac{f_{Y|X=x}(y)}{f_Y(y)}\, y\right] = \mathbb{E}_{y \sim P_Y}\left[\frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\, y\right] \approx \frac{1}{Z} \sum_{j=1}^m \frac{\hat{f}_\Gamma(x, y_j)}{\hat{f}_X(x) \hat{f}_Y(y_j)}\, y_j, \qquad (7)$$

where $Z = \sum_{j=1}^m \hat{f}_\Gamma(x, y_j)/(\hat{f}_X(x) \hat{f}_Y(y_j))$ is the normalizing constant. Compared to the barycentric projection, the importance weight for each $y_j$ is the ratio between the joint and the marginal densities. To distinguish the KDE conditional expectation from the barycentric projection, we refer to the proposed mapping as the conditional projection.

Out-of-sample Mapping. The conditional projection is well-defined for any $x \in \mathcal{X}$, and naturally generalizes to new samples without recomputing the OT. Importantly, the importance weight $\hat{f}_\Gamma(x, y)/(\hat{f}_X(x)\hat{f}_Y(y))$ can be interpreted as a similarity score for the pair $(x, y)$, which is useful for retrieval tasks, as section 6.3 shows. The conditional projection tends to cluster points together with larger bandwidths, which lead to more averaging. We found that using a smaller bandwidth (e.g., $h = 0.1$) for the conditional projection improves the diversity of the projection when the dataset is less noisy, e.g., the data in Figure 1. For noisy or imbalanced datasets, the same bandwidth used for optimizing InfoOT works well.
Note that analogous to Lemma 2, when the bandwidth h → 0, the conditional projection converges to the barycentric projection, making the barycentric projection a special case of the conditional projection (Figure 4 ).
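A minimal sketch of the conditional projection (equation 7), including its out-of-sample use on a point not in the training set. The toy data, plan, and bandwidth are assumptions; constant factors such as $Z_h$ and $\hat f_X(x)$ drop out after normalization:

```python
import numpy as np

def conditional_projection(x, X, Y, G, h=0.5):
    """Map a (possibly unseen) source point x to the target domain via the
    conditional expectation of equation 7: a weighted average of target
    samples with weights proportional to f_Gamma(x, y_j) / (f_X(x) f_Y(y_j))."""
    kx = np.exp(-np.linalg.norm(x - X, axis=1)**2 / (2 * h**2))   # K_h(d_X(x, x_i))
    Dy = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    Ky = np.exp(-Dy**2 / (2 * h**2))                              # K_h(d_Y(y_j, y_l))
    w = (kx @ G @ Ky.T) / Ky.sum(axis=1)   # f_X(x) is constant across j
    w = w / w.sum()                        # normalize by Z
    return w @ Y

X = np.array([[0.0], [0.5], [4.0], [4.5]])   # training source samples
Y = np.array([[0.1], [0.6], [4.1], [4.6]])   # training target samples
G = np.eye(4) / 4                            # a cluster-preserving plan
# A new source point near the first cluster maps near the first target cluster,
# with no re-solve of the OT problem.
proj = conditional_projection(np.array([0.25]), X, Y, G)
```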

6. EXPERIMENTS

We now evaluate InfoOT with experiments in point cloud matching, domain adaptation, cross-domain retrieval, and single-cell alignment. All optimal transport approaches are implemented with or adopted from the POT library (Flamary et al., 2021). Detailed experimental settings and additional experiments can be found in the appendix.

6.1. POINT CLOUD MATCHING

We begin with a 2D toy example, where both source and target samples are drawn from a Gaussian distribution with 2 modes, but the latter is rotated and has two outliers added to it, as Figure 1 shows. We compare the behavior of different variants of OT and mappings on this data. Perhaps not surprisingly, standard OT maps source points from the same cluster to two different target clusters, overlooking the intrinsic structure of the data. In comparison, the alignment of InfoOT retains the cluster structure. On the right-hand side, the barycentric projection wrongly maps two source points to the target outliers, while the conditional projection is not affected by the outliers. Lastly, we demonstrate out-of-sample mapping with the conditional projection, where newly sampled points are correctly mapped to their clusters. Figure 4 depicts a class-imbalanced setting, where the corresponding clusters in the source and target have different numbers of samples. As a result, the barycentric projection wrongly maps samples from the same source cluster to different target clusters. When increasing the bandwidth in the conditional projection, the smoothing effect of KDE gradually corrects the mapping and yields more concentrated projections. In appendix B.1, we further demonstrate that InfoOT improves over the baselines in a color transfer task, where pixels are treated as points in RGB space.

6.2. DOMAIN ADAPTATION

Next, we apply the fused version of InfoOT to two domain adaptation benchmarks: MNIST-USPS and the Office-Caltech dataset (Gong et al., 2012). MNIST (M) and USPS (U) are digit classification datasets, and the Office-Caltech dataset contains 4 domains: Amazon (A), Dslr (D), Webcam (W), and Caltech10 (C), with images labeled as one of 10 classes. For MNIST and USPS, the raw images are directly used to compute the distances, while we adopt decaf6 features (Donahue et al., 2014) extracted from pretrained neural networks for Office-Caltech. Following previous work on OT for domain adaptation (Alvarez-Melis et al., 2018; Flamary et al., 2016; Perrot et al., 2016), the source samples are first mapped to the target, and 1-NN classifiers are then trained on the projected samples with source labels. The barycentric projection is adopted for all the baselines, while F-InfoOT is tested with both barycentric and conditional projection. F-InfoOT with conditional projection (F-InfoOT*) performs significantly better than its barycentric counterpart (F-InfoOT) and the other baselines when the datasets exhibit class imbalance, e.g., Office-Caltech. Following Flamary et al. (2016), we present the results over 10 independent trials. In each trial of Office-Caltech, the target data is divided into a 90%/10% train-test split, where OT and the 1-NN classifiers are computed only on the training set. For MNIST-USPS, only 2000 samples from the source and target training sets are used, while the original test sets are used for evaluation. The strength of the entropy regularizer ϵ is set to 1 for every entropic regularized OT, and the λ of F-InfoOT is set to 100 for all the experiments. The bandwidth for each benchmark is selected from {0.2, 0.3, ..., 0.8} with the circular validation procedure (Bruzzone & Marconcini, 2009; Perrot et al., 2016; Zhong et al., 2010) on M→U and A→D, yielding 0.4 and 0.5, respectively.
We compare F-InfoOT with the Sinkhorn distance (Cuturi, 2013), group-lasso regularized OT (Flamary et al., 2016), fused Gromov-Wasserstein (FGW) (Vayer et al., 2019), and linear-mapping OT (Perrot et al., 2016). For OT variants involving intra-domain distances, such as F-InfoOT and FGW, we adopt the following class-conditional distance for the source: $\|x_i - x_j\| + 5000 \cdot \mathbf{1}[f(x_i) \neq f(x_j)]$, where the second term is a penalty on class mismatch (Alvarez-Melis & Fusi, 2020; Yurochkin et al., 2019) and $f$ is the labeling function. As Table 1 shows, F-InfoOT with barycentric projection outperforms the baselines in both benchmarks, demonstrating that mutual information captures the intrinsic structure of the datasets. In Office-Caltech, many datasets exhibit the class-imbalance problem, which makes F-InfoOT with conditional projection significantly outperform the barycentric projection and the other baselines. Figure 5 visualizes the projected source and target samples with tSNE (Van der Maaten & Hinton, 2008). The barycentric projection tends to produce one-to-one alignments, which suffer from class-imbalanced data. In contrast, the conditional projection yields concentrated projections that preserve the class structure.
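The class-conditional source distance above can be sketched in a few lines; here the labeling function is represented by a label vector, and the sample values are illustrative:

```python
import numpy as np

def class_conditional_dist(X, labels, penalty=5000.0):
    """||x_i - x_j|| plus a large penalty whenever the source labels differ,
    discouraging plans that split a class in the intra-domain geometry."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return D + penalty * (labels[:, None] != labels[None, :])

X = np.array([[0.0], [1.0], [2.0]])
labels = np.array([0, 0, 1])
D = class_conditional_dist(X, labels)
```

Same-class pairs keep their Euclidean distance; different-class pairs are pushed far apart.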

6.3. CROSS-DOMAIN RETRIEVAL

We now consider unsupervised cross-domain image retrieval, where, given a source sample, the algorithms have to determine the top-k most similar target samples. Given fixed source and target samples, this can be formulated as an optimal transport problem, where the transportation plan $\Gamma_{ij}$ gives the similarity score between the candidate source sample $x_i$ and target sample $y_j$. Nevertheless, this formulation fails when new source samples arrive: for standard OT, one has to solve the OT problem again to obtain the alignment for new samples. In contrast, the importance weight $\hat{f}_\Gamma(x, y)/(\hat{f}_X(x)\hat{f}_Y(y))$ defined in the conditional projection (equation 7) naturally provides the similarity score between the candidate $x$ and each target sample $y$. We test F-InfoOT on the Office-Caltech (Gong et al., 2012) and ImageCLEF (Caputo et al., 2014) datasets, where we adopt the same hyperparameters for Office-Caltech from the previous section. In the unsupervised setting, the in-domain transportation cost for the source is the Euclidean distance instead of the class-conditional distance. To compare with standard OT, we adopt a nearest neighbor approach for the baselines: (1) retrieve the nearest source sample given an unseen sample, and (2) use the transportation plan of that nearest source sample to retrieve target samples. Along with a simple nearest neighbor retrieval baseline (L2-NN), the average top-k precision over 10 trials is shown in Table 2. The fused InfoOT significantly outperforms the baselines on Office-Caltech across different choices of k.

6.4. SINGLE CELL ALIGNMENT

Finally, we examine InfoOT for unsupervised alignment between incomparable spaces with the single-cell multi-omics datasets from Demetci et al. (2020). Recent techniques make it possible to obtain different cellular features at single-cell resolution (Buenrostro et al., 2015; Chen et al., 2019; Stoeckius et al., 2017). Nevertheless, different features are typically collected from different sets of cells, and aligning them is crucial for unified data analysis. We evaluate InfoOT on the sc-GEM (Cheow et al., 2016) and SNARE-seq (Chen et al., 2019) datasets provided by Demetci et al. (2020) and follow the same data preprocessing steps, distance calculation, and evaluation setup. Two evaluation metrics are considered: the 'fraction of samples closer than the true match' (FOSCTTM) (Liu et al., 2019) and the label transfer accuracy (Cao et al., 2020). We compare InfoOT with UnionCom (Cao et al., 2020), MMD-MA (Liu et al., 2019), and SCOT (Demetci et al., 2020), where SCOT is an optimal transport baseline based on the Gromov-Wasserstein distance. As before, the bandwidth for each dataset is selected from {0.2, 0.3, ..., 0.8} with the circular validation procedure. As Table 3 shows, InfoOT significantly outperforms the baselines on the sc-GEM dataset, while being comparable on the SNARE-seq dataset, demonstrating the applicability of InfoOT to cross-domain alignment. Figure 6 further visualizes the barycentric projection with InfoOT (cell types are indicated by colors), where cells with the same type are well aligned.

7. CONCLUSION

In this work, we propose InfoOT, an information-theoretic extension of optimal transport. InfoOT produces smoother, coherent alignments by maximizing the mutual information estimated with KDE. InfoOT leads to a new mapping method, conditional projection, that is robust to class imbalance and generalizes to unseen samples. We extensively demonstrate the applicability of InfoOT across benchmarks in different modalities.

A PROOFS

A.1 PROOF OF LEMMA 2

Proof. In the limit $h \to 0$, the Gaussian kernel converges to

$$K_h(t) = \begin{cases} 1/Z_h & \text{if } t = 0, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, the kernel $K_h(d_X(x_i, x_k))$ takes a non-zero value only when $x_i = x_k$, which implies that the kernelized mutual information converges as follows (taking $m = n$):

$$\lim_{h \to 0} \hat{I}_\Gamma(X, Y) = \lim_{h \to 0} \sum_{i,j} \Gamma_{ij} \log \frac{n^2 \cdot \sum_{k,l} \Gamma_{kl} K_h(d_X(x_i, x_k)) K_h(d_Y(y_j, y_l))}{\sum_k K_h(d_X(x_i, x_k)) \cdot \sum_l K_h(d_Y(y_j, y_l))} = \sum_{i,j} \Gamma_{ij} \log \frac{n^2 \cdot \Gamma_{ij}/Z_h^2}{1/Z_h^2} = \sum_{i,j} \Gamma_{ij} \log \Gamma_{ij} + 2 \log n = -H(\Gamma) + 2 \log n.$$
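The lemma can be sanity-checked numerically: with a tiny bandwidth, the kernel matrices are numerically the identity and the kernelized mutual information collapses to $-H(\Gamma) + 2\log n$. The toy plan below is our own:

```python
import numpy as np

def kernel_mat(Z, h):
    D = np.abs(Z[:, None] - Z[None, :])
    return np.exp(-D**2 / (2 * h**2))

def kernelized_mi(G, Kx, Ky):
    n, m = G.shape
    joint = Kx @ G @ Ky.T
    marg = Kx.sum(1)[:, None] * Ky.sum(1)[None, :]
    return float((G * np.log(n * m * joint / marg)).sum())

rng = np.random.default_rng(0)
n = 5
x, y = rng.normal(size=n), rng.normal(size=n)
# A strictly positive plan with uniform marginals.
G = 0.5 * np.eye(n) / n + 0.5 * np.ones((n, n)) / n**2
entropy = -(G * np.log(G)).sum()
# Tiny bandwidth: off-diagonal kernel entries vanish numerically,
# so I_Gamma reduces to -H(G) + 2 log n as Lemma 2 states.
mi = kernelized_mi(G, kernel_mat(x, 1e-3), kernel_mat(y, 1e-3))
```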

A.2 PROJECTED GRADIENT DESCENT

The classic mirror descent iteration reads
$$x_{t+1} \leftarrow \arg\min_x \left\{ \tau \langle \nabla f(x_t), x \rangle + D(x \,\|\, x_t) \right\}.$$
When $D(y\|x)$ is the KL divergence $D_{\mathrm{KL}}(y\|x) = \sum_i y_i \log \frac{y_i}{x_i}$, the update has the closed form
$$(x_{t+1})_i = e^{\log(x_t)_i - \tau \nabla f(x_t)_i} = (x_t)_i \, e^{-\tau \nabla f(x_t)_i}.$$
In our case, before the projection, the update reads
$$\Gamma'_{t+1} = \Gamma_t \odot e^{-\tau \left( C - \lambda \nabla_\Gamma \hat{I}_{\Gamma_t}(X, Y) - \epsilon \nabla H(\Gamma_t) \right)}.$$
Next, we solve the following projection w.r.t. the KL metric:
$$\Gamma_{t+1} = \arg\min_{\Gamma \in \Pi(p,q)} D_{\mathrm{KL}}(\Gamma \,\|\, \Gamma'_{t+1}).$$
As Benamou et al. (2015) show, the KL projection is equivalent to solving an entropic regularized optimal transport problem, usually referred to as the Sinkhorn distance (Cuturi, 2013). Following Peyré et al. (2016), we set the stepsize $\tau = 1/\epsilon$ to simplify the iterations and arrive at the update rule
$$\Gamma_{t+1} \leftarrow \arg\min_{\Gamma \in \Pi(p,q)} \langle \Gamma,\; C - \lambda \nabla_\Gamma \hat{I}_{\Gamma_t}(X, Y) \rangle - \epsilon H(\Gamma).$$
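The iteration above can be sketched as follows. This is a minimal illustration: the gradient of the mutual information term is treated as a black box (in practice it can be derived analytically or obtained via automatic differentiation), and the KL projection is carried out with standard Sinkhorn iterations.

```python
import numpy as np

def sinkhorn(cost, p, q, eps, n_iter=200):
    """Entropic OT: argmin_{Γ∈Π(p,q)} <Γ,C> - eps*H(Γ) via Sinkhorn scaling."""
    K = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def info_ot(cost, p, q, grad_mi, lam=1.0, eps=0.1, n_outer=20):
    """Projected gradient descent for InfoOT (sketch).

    grad_mi(gamma) should return ∇_Γ Î_Γ(X, Y). With stepsize τ = 1/ε,
    each mirror step followed by the KL projection reduces to a single
    entropic OT problem with the shifted cost C - λ∇_Γ Î.
    """
    gamma = np.outer(p, q)  # initialize with the independent coupling
    for _ in range(n_outer):
        shifted = cost - lam * grad_mi(gamma)
        gamma = sinkhorn(shifted, p, q, eps)
    return gamma
```

With `grad_mi` returning zeros, the loop reduces to plain entropic OT, which is a convenient way to test the projection step in isolation.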

B ADDITIONAL EXPERIMENTS

B.1 COLOR TRANSFER

Color transfer aims to transfer the colors of a target image to a source image. Optimal transport achieves this by treating pixels as points in RGB space and mapping the source pixels to the target ones. Here, 500 pixels are sampled from each image to compute the OT plan, and the barycentric projection is then applied to map all source pixels to the target. We compare fused InfoOT with standard OT, the Sinkhorn distance (Cuturi, 2013), and linear mapping estimation (Perrot et al., 2016) and show the results in Figure 7. InfoOT produces sharper results than the baselines while decently recovering the colors of the target image.
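The barycentric projection used here is a one-liner: each source sample is mapped to the average of target samples weighted by its row of the transport plan. The nearest-neighbor extension to unsampled pixels in the comment is a common practical choice, not prescribed by the paper.

```python
import numpy as np

def barycentric_projection(gamma, y):
    """Map each source sample to the Γ-weighted average of target samples.

    gamma: (n, m) transport plan; y: (m, d) target samples (e.g. RGB pixels).
    Rows of gamma are normalized so each source point gets a convex
    combination of targets. Unsampled source pixels can then be mapped
    by borrowing the projection of their nearest sampled source pixel.
    """
    return (gamma / gamma.sum(axis=1, keepdims=True)) @ y
```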

B.2 WORD EMBEDDING ALIGNMENT

Here, we explore applying InfoOT to unsupervised word embedding alignment. We follow the setup of Alvarez-Melis & Jaakkola (2018), where the goal is to recover cross-lingual correspondences between word embeddings in different languages. In this case, pairwise distances between domains might not be meaningful, as the word embedding models are trained separately. Previous works suggest that cross-lingual word vector spaces are approximately isometric, which makes Gromov-Wasserstein (GW) an ideal choice due to its ability to align isometric spaces. Here, we treat GW as the oracle and show that InfoOT can perform comparably to GW (Alvarez-Melis & Jaakkola, 2018) and other baselines such as InvOT (Alvarez-Melis et al., 2019), Adv-NN (Conneau et al., 2017), and supervised PROCRUSTES. We report results on the dataset of Conneau et al. (2017) in Table 4, where both GW and InfoOT are trained with 12000 words and refined with Cross-Domain Similarity Local Scaling (CSLS) (Conneau et al., 2017). The entropy regularizer is 0.0001 for GW and 0.02 for InfoOT. InfoOT performs comparably with the baselines and GW, demonstrating its applicability to recovering cross-lingual correspondences.

B.3 INFOOT WITH DIFFERENT HYPERPARAMETERS

Here, we report the performance of InfoOT with different weights for the entropic regularizer and the mutual information on domain adaptation.

B.4 UNBALANCED OT

We also compare against unbalanced optimal transport (UOT), which relaxes the marginal constraints with KL penalties:
$$\langle \Gamma, C \rangle + \epsilon_m D_{\mathrm{KL}}(\Gamma \mathbf{1} \,\|\, p) + \epsilon_m D_{\mathrm{KL}}(\Gamma^T \mathbf{1} \,\|\, q) - \epsilon H(\Gamma).$$
We show the best results of UOT in Table ....
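For reference, the KL-relaxed objective above can be solved with Sinkhorn-like scaling iterations. The sketch below follows the generic scaling-algorithm form popularized by Chizat et al. (2018); it is an illustration of the UOT formulation, not necessarily the exact solver used for the baseline.

```python
import numpy as np

def unbalanced_sinkhorn(cost, p, q, eps=0.1, eps_m=1.0, n_iter=500):
    """Unbalanced OT with KL marginal penalties.

    Approximately solves
        min_Γ <Γ,C> + eps_m*KL(Γ1 | p) + eps_m*KL(Γᵀ1 | q) - eps*H(Γ)
    via scaling iterations. The exponent eps_m / (eps_m + eps) softens
    the marginal constraints; eps_m → ∞ recovers balanced Sinkhorn.
    """
    K = np.exp(-cost / eps)
    f = eps_m / (eps_m + eps)
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        u = (p / (K @ v)) ** f
        v = (q / (K.T @ u)) ** f
    return u[:, None] * K * v[None, :]
```

For large `eps_m` the plan's marginals are close to (p, q), so the balanced solution is recovered as a special case.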

B.5 EXPERIMENTS BEYOND 1-NN CLASSIFIER

We report the performance of InfoOT and the baselines with general k-NN classifiers and linear SVM classifiers in Table 7. Fused-InfoOT consistently outperforms the baselines beyond 1-NN classifiers on the Office-Caltech domain adaptation benchmark. In addition, compared to the baselines, the performance of InfoOT is more robust to the choice of the number of neighbors k.

B.6 BASELINES BEYOND OPTIMAL TRANSPORT

We compare InfoOT with the following non-OT baselines: Geodesic Flow Kernel (GFK) (Gong et al., 2012), CORrelation ALignment (CORAL) (Sun et al., 2016), Scatter Component Analysis (SCA) (Ghifary et al., 2016), Joint Distribution Adaptation (JDA) (Long et al., 2013), Transfer Joint Matching (TJM) (Long et al., 2014a), Deep Domain Confusion (DDC) (Tzeng et al., 2014), Deep Adaptation Network (DAN) (Long et al., 2014b), and Manifold Embedded Distribution Alignment (MEDA) (Wang et al., 2018). For a fair comparison, we report the performance of Fused-InfoOT computed on the full source and target datasets instead of the 10-fold setting in the main text. As Table 8 shows, InfoOT performs comparably to many baselines without training or finetuning neural networks.

C LIMITATIONS

While we have illustrated successful applications of InfoOT, there are limitations. One could expect InfoOT to perform worse when the geometry of the input spaces provides little information. In particular, for raw inputs such as image datasets, InfoOT would not perform well without pre-extracted features. It is also non-trivial to directly apply InfoOT to very large-scale problems with millions of data points. Computationally efficient extensions such as mini-batch optimal transport (Nguyen et al., 2022) should be considered to scale InfoOT to large datasets.



Robustness against Noisy Data. By the definition in equation 3, the joint density f̂_Γ(x, y) measures the similarity of (x, y) to all other pairs selected by the transportation plan Γ. Even if x is aligned with outliers or wrong clusters, as long as the points near x are mostly aligned with the correct samples, the conditional projection will map x to similar samples, since they are upweighted by the joint density in (7). This makes the mapping much less sensitive to outliers and imbalanced datasets. See Figure 1 and Figure 4 for illustrations.
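The mechanism above can be sketched in a few lines. This is our own illustrative rendering of the conditional projection: the kernel weighting over nearby source samples and the exact normalization are assumptions and may differ in detail from Eq. (7) in the paper.

```python
import numpy as np

def conditional_projection(x_new, x_src, y_tgt, gamma, h=0.5):
    """Map (possibly unseen) source points into the target domain.

    Each new point x is sent to a weighted average of target samples:
    nearby source samples x_i vote with their transported mass Γ_ij,
    smoothed by a Gaussian kernel of bandwidth h. A few misaligned
    outliers are outvoted by their correctly aligned neighbors.
    """
    d = np.linalg.norm(x_new[:, None, :] - x_src[None, :, :], axis=2)
    k = np.exp(-d ** 2 / (2 * h ** 2))      # K_h(d_X(x, x_i))
    w = k @ gamma                            # w[a, j] ∝ Σ_i K_h(x_a, x_i) Γ_ij
    w = w / w.sum(axis=1, keepdims=True)     # normalize to a conditional dist.
    return w @ y_tgt
```

Unlike the barycentric projection, this map is defined for arbitrary inputs, which is what enables out-of-sample generalization.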



Figure 1: Illustration of InfoOT on 2D point cloud. Compared to classic OT, InfoOT preserves the cluster structure, where the source points from the same cluster are mapped to the same target cluster. For projection estimation (dashed lines), the new conditional projection improves over barycentric projection with better outlier robustness and out-of-sample generalization.

Figure 2: Example of KDE. The bandwidth affects the smoothness.

Figure 3: Measuring structure with mutual information. (a) Γ_a and Γ_b are two one-to-one mappings, where Γ_a preserves the cluster structure and Γ_b is a random permutation; (b) the estimated joint density of Γ_a is more concentrated than that of Γ_b, which leads to higher mutual information under KDE.

Figure 4: Projection under imbalanced samples. When cluster sizes mismatch between source and target, barycentric projection wrongly maps samples to the incorrect cluster. In contrast, increasing the bandwidth of the conditional projection gradually improves robustness and yields better projections.

Figure 5: t-SNE visualization of projections. We show the t-SNE visualization of the projected source samples (circles) along with the target samples (triangles) on A→C. Classes are indicated by colors.

Single-Cell Alignment Performance. Similar to GW, InfoOT also performs well in cross-domain alignment.

Figure 6: scGEM alignment with InfoOT. The cell types are indicated by colors.

Figure 7: Color Transfer via Optimal Transport. Fused InfoOT produces sharper results while preserving the target color compared to the baselines.

Optimal Transport for Domain Adaptation.

Optimal Transport for Cross-Domain Retrieval. With conditional projection, InfoOT is capable of aligning unseen samples without any modification.



Unbalanced OT.



InfoOT with different hyperparameters. We test Fused-InfoOT with conditional projection by varying the regularizer weights (λ, ϵ). Note that Table 1 in the main paper shows the results of (λ = 100, ϵ = 1).
InfoOT 80.6±5.7 81.4±5.8 79.7±5.0 76.4±7.1 82.9±7.0
F-InfoOT* 85.5±5.6 85.4±5.5 85.4±5.5 81.7±7.2 81.4±5.3

Results beyond 1-NN. We evaluate the performance with k-NN classifiers and linear classifiers.

Baselines beyond OT.

