FEATURE-ROBUST OPTIMAL TRANSPORT FOR HIGH-DIMENSIONAL DATA

Anonymous

Abstract

Optimal transport is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality. Specifically, we find a transport plan with discriminative features. To this end, we formulate the FROT problem as a min-max optimization problem. We then propose a convex formulation of the FROT problem and solve it using a Frank-Wolfe-based optimization algorithm, in which the subproblem can be solved efficiently using the Sinkhorn algorithm. Because FROT computes the transport plan from selected features, it is robust to noise features. To demonstrate the effectiveness of FROT, we propose using the FROT algorithm for the layer-selection problem in deep neural networks for semantic correspondence. Through synthetic and benchmark experiments, we show that the proposed method can find strong correspondences by identifying important layers, and that the FROT algorithm achieves state-of-the-art performance on real-world semantic correspondence datasets.

1. INTRODUCTION

Optimal transport (OT) is a machine learning problem with several applications in the computer vision and natural language processing communities. The applications include Wasserstein distance estimation (Peyré et al., 2019), domain adaptation (Yan et al., 2018), multitask learning (Janati et al., 2019), barycenter estimation (Cuturi & Doucet, 2014), semantic correspondence (Liu et al., 2020), feature matching (Sarlin et al., 2019), and photo album summarization (Liu et al., 2019). The OT problem has been extensively studied in the computer vision community as the earth mover's distance (EMD) (Rubner et al., 2000). However, the computational cost of EMD is cubic in the sample size and thus highly expensive. Recently, the entropic-regularized EMD problem was proposed; it can be solved using the Sinkhorn algorithm at quadratic cost (Cuturi, 2013). Owing to the development of the Sinkhorn algorithm, researchers have replaced the EMD computation with its regularized counterpart.

However, the optimal transport problem for high-dimensional data has remained unsolved for many years. Recently, a robust variant of OT was proposed for high-dimensional OT problems and used for divergence estimation (Paty & Cuturi, 2019; 2020). In the robust OT framework, the transport plan is computed on a discriminative subspace of the two data matrices $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{d \times m}$, where the subspace is obtained by dimensionality reduction. An advantage of the subspace-robust approach is that it does not require prior information about the subspace. However, computing the subspace can be expensive when the dimensionality of the data is high (e.g., $10^4$), and when prior information is available, a more computationally efficient formulation becomes possible. One of the most common forms of prior information is a feature group. Group features are widely used in feature-selection problems in the biomedical domain and have been extensively studied in Group Lasso (Yuan & Lin, 2006).
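As a concrete reference point for the entropic-regularized problem mentioned above, the Sinkhorn algorithm amounts to a few lines of alternating matrix scaling. The sketch below is our own minimal toy example (not the paper's code); the regularization strength `eta` and the cost normalization are illustrative choices.

```python
import numpy as np

def sinkhorn(a, b, C, eta=0.1, n_iter=1000):
    """Entropic-regularized OT (Cuturi, 2013).

    a: (n,) source weights, b: (m,) target weights, C: (n, m) cost matrix.
    Returns an approximate transport plan P with marginals a and b.
    """
    K = np.exp(-C / eta)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):         # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy example: two small point clouds with uniform weights.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(6, 2))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
C = C / C.max()                                     # normalize for stability
a, b = np.full(5, 1 / 5), np.full(6, 1 / 6)
P = sinkhorn(a, b, C)
```

Each Sinkhorn iteration costs $O(nm)$, which is the quadratic cost referred to in the text; smaller `eta` gives a sharper plan but slower convergence.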
The key idea of Group Lasso is to prespecify the group variables and select sets of group variables using the group norm (i.e., the sum of $\ell_2$ norms). For example, if we use a pretrained neural network as a feature extractor and compute OT using its features, we must carefully select the important layers, where each layer's output is regarded as a grouped input. Therefore, using feature groups as prior information is a natural setup and is important when considering OT for deep neural networks (DNNs).

In this paper, we propose a high-dimensional optimal transport method that utilizes prior information in the form of grouped features. Specifically, we propose the feature-robust optimal transport (FROT) problem, in which we select discriminative sets of group features to estimate a transport plan, instead of determining discriminative subspaces as proposed in (Paty & Cuturi, 2019; 2020). To this end, we formulate the FROT problem as a min-max optimization problem and transform it into a convex optimization problem, which can be accurately solved using the Frank-Wolfe algorithm (Frank & Wolfe, 1956; Jaggi, 2013); the FROT subproblem can be solved efficiently using the Sinkhorn algorithm (Cuturi, 2013). An advantage of FROT is that it yields a transport plan for high-dimensional data through feature selection, and the significance of each feature group is obtained at no additional cost. Therefore, the FROT formulation is well suited to high-dimensional OT problems. Through synthetic experiments, we first demonstrate that the proposed FROT is robust to noise dimensions (see Figure 1). Furthermore, we apply FROT to a semantic correspondence problem (Liu et al., 2020) and show that the proposed algorithm achieves state-of-the-art performance.
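To make the grouped-feature setting concrete: the squared-Euclidean cost matrix decomposes additively over feature groups, so each group contributes its own cost matrix, and a min-max scheme can upweight the discriminative groups. The sketch below uses a hypothetical setup mirroring Figure 1 (two informative dimensions, eight noise dimensions); it is our illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical setup mirroring Figure 1: 2 true dims + 8 noise dims.
rng = np.random.default_rng(0)
n, m = 20, 20
X = np.hstack([rng.normal(2.0, 1.0, size=(n, 2)),    # true features (shifted)
               rng.normal(0.0, 1.0, size=(n, 8))])   # noise features
Y = np.hstack([rng.normal(-2.0, 1.0, size=(m, 2)),
               rng.normal(0.0, 1.0, size=(m, 8))])

groups = [np.arange(0, 2), np.arange(2, 10)]  # feature groups: true, noise

# Per-group squared-Euclidean cost matrices; they sum to the full cost.
C_groups = [((X[:, None, g] - Y[None, :, g]) ** 2).sum(-1) for g in groups]
C_full = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# Under a uniform plan, the informative group dominates the transport cost
# here, so a max over group-wise costs would focus on it.
P = np.full((n, m), 1.0 / (n * m))
group_costs = [float((P * Cg).sum()) for Cg in C_groups]
```

The additive decomposition $C = \sum_l C^{(l)}$ is what makes group-level selection cheap: each candidate group's contribution to the transport cost is a single inner product with the current plan.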

Contributions:

• We propose the feature-robust optimal transport (FROT) problem and derive a simple and efficient Frank-Wolfe-based algorithm. Furthermore, we propose the feature-robust Wasserstein distance (FRWD).

• We apply FROT to a high-dimensional feature-selection problem and show that FROT is consistent with the Wasserstein distance-based feature-selection algorithm while incurring less computational cost than the original algorithm.

• We apply FROT to the layer-selection problem in semantic correspondence and show that the proposed algorithm outperforms existing baseline algorithms.

2. BACKGROUND

In this section, we briefly introduce the OT problem.

Optimal transport (OT):

(a) OT on clean data. (b) OT on noisy data. (c) FROT on noisy data ($\eta = 1$).

Figure 1: Transport plans between two synthetic distributions of 10-dimensional vectors $\boldsymbol{x} = (x^\top, z_x^\top)^\top$ and $\boldsymbol{y} = (y^\top, z_y^\top)^\top$, where the two-dimensional vectors $x \sim N(\mu_x, \Sigma_x)$ and $y \sim N(\mu_y, \Sigma_y)$ are true features, and $z_x \sim N(\mathbf{0}_8, I_8)$ and $z_y \sim N(\mathbf{0}_8, I_8)$ are noise features. (a) OT between the distributions of $x$ and $y$ serves as a reference. (b) OT between the distributions of $\boldsymbol{x}$ and $\boldsymbol{y}$. (c) FROT transport plan between the distributions of $\boldsymbol{x}$ and $\boldsymbol{y}$, where the true features and the noise features are grouped, respectively.

The following are given: independent and identically distributed (i.i.d.) samples $X = \{x_i\}_{i=1}^{n} \in \mathbb{R}^{d \times n}$ from a $d$-dimensional distribution $p$, and i.i.d. samples $Y = \{y_j\}_{j=1}^{m} \in \mathbb{R}^{d \times m}$ from the $d$-dimensional distribution $q$. In the Kantorovich relaxation of OT, admissible couplings are defined by the set of transport plans
$$U(\mu, \nu) = \{\Pi \in \mathbb{R}^{n \times m}_{+} : \Pi \mathbf{1}_m = a, \; \Pi^\top \mathbf{1}_n = b\},$$
where $\Pi \in \mathbb{R}^{n \times m}_{+}$ is called the transport plan, $\mathbf{1}_n$ is the $n$-dimensional vector whose elements are ones, and $a = (a_1, a_2, \ldots, a_n)^\top \in \mathbb{R}^{n}_{+}$ and $b = (b_1, b_2, \ldots, b_m)^\top \in \mathbb{R}^{m}_{+}$ are the weights. The OT problem between two discrete measures $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$ determines the
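The admissible set $U(\mu, \nu)$ is a polytope cut out by the two linear marginal constraints, so a plan minimizing a linear transport cost over it can be found with any off-the-shelf LP solver. The following self-contained sketch (illustrative only, not the paper's code) encodes the constraints for $\mathrm{vec}(\Pi)$ and checks that the solution is a valid coupling.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 4, 3
X, Y = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # cost matrix
a, b = np.full(n, 1 / n), np.full(m, 1 / m)         # uniform weights

# Equality constraints encode Pi 1_m = a and Pi^T 1_n = b for vec(Pi),
# with Pi flattened row-major: entry (i, j) sits at index i*m + j.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row-sum (source marginal) constraints
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column-sum (target marginal) constraints

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
Pi = res.x.reshape(n, m)               # optimal transport plan
```

Solving this LP exactly is what incurs the cubic cost mentioned in the introduction; the Sinkhorn algorithm trades exactness for the quadratic-cost regularized problem.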

