A HYPERGRADIENT APPROACH TO ROBUST REGRESSION WITHOUT CORRESPONDENCE

Abstract

We consider a regression problem, where the correspondence between input and output data is not available. Such shuffled data are commonly observed in many real-world problems. Taking flow cytometry as an example, the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most existing methods are only applicable when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework, ROBOT, for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we formulate regression without correspondence as a continuous optimization problem. Then, by exploiting the interaction between the regression model and the data correspondence, we develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to the inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.

1. INTRODUCTION

Regression analysis has been widely used in various machine learning applications to infer the relationship between an explanatory random variable (i.e., the input) X \in R^d and a response random variable (i.e., the output) Y \in R^o (Stanton, 2001). In the classical setting, regression is used on labeled datasets that contain paired samples \{x_i, y_i\}_{i=1}^n, where x_i, y_i are realizations of X, Y, respectively. Unfortunately, such an input-output correspondence is not always available in some applications. One example is flow cytometry, which is a physical experiment for measuring properties of cells, e.g., affinity to a particular target (Abid & Zou, 2018). Through this process, cells are suspended in a fluid and injected into the flow cytometer, where measurements are taken using the scattering of a laser. However, the instruments are unable to differentiate the cells passing through the laser, so the correspondence between the cell properties (i.e., the measurements) and the cells is unknown. This prevents us from analyzing the relationship between the cells and the measurements using classical regression analysis, due to the missing correspondence. Another example is multi-object tracking, where we need to infer the motion of objects given consecutive frames in a video. This requires us to find the correspondence between the objects in the current frame and those in the next frame. The two examples above can be formulated as a shuffled regression problem. Specifically, we consider a multivariate regression model Y = f(X, Z; w) + \varepsilon, where X \in R^d and Z \in R^e are two input vectors, Y \in R^o is an output vector, f : R^{d+e} \to R^o is the unknown regression model with parameters w, and \varepsilon is the random noise independent of X and Z. When we sample realizations from such a regression model, the correspondence between (X, Y) and Z is not available.
Accordingly, we collect two datasets D_1 = \{x_i, y_i\}_{i=1}^n and D_2 = \{z_j\}_{j=1}^n, and there exists a permutation \pi^* such that (x_i, z_{\pi^*(i)}) corresponds to y_i in the regression model. Our goal is to recover the unknown model parameter w. Existing literature also refers to the shuffled regression problem as unlabeled sensing, homomorphic sensing, and regression with an unknown permutation (Unnikrishnan et al., 2018). Throughout the rest of the paper, we refer to it as Regression WithOut Correspondence (RWOC). A natural choice of objective for RWOC is to minimize the sum of squared residuals with respect to the regression model parameter w, up to the permutation \pi(\cdot), over the training data, i.e.,

min_{w, \pi} L(w, \pi) = \sum_{i=1}^n \| y_i - f(x_i, z_{\pi(i)}; w) \|_2^2.  (1)

Existing works on RWOC mostly focus on theoretical properties of the global optima of equation 1 for estimating w and \pi (Pananjady et al., 2016; 2017b; Abid et al., 2017; Elhami et al., 2017; Hsu et al., 2017; Unnikrishnan et al., 2018; Tsakiris & Peng, 2019). The development of practical algorithms, however, falls far behind in the following three respects:

• Most of the works are only applicable to linear regression models.

• Some of the existing algorithms have very high computational complexity and can only handle a small number of data points in low dimensions (Elhami et al., 2017; Pananjady et al., 2017a; Tsakiris et al., 2018; Peng & Tsakiris, 2020). For example, Abid & Zou (2018) adopt an Expectation Maximization (EM) method in which Metropolis-Hastings sampling is needed, which is not scalable. Other algorithms optimize with respect to w and \pi in an alternating manner, e.g., alternating minimization in Abid et al. (2017). However, as there exists a strong interaction between w and \pi, the optimization landscape of equation 1 is ill-conditioned. Therefore, these algorithms are not effective and often get stuck in local optima.
• Most of the works only consider the case where there exists an exact one-to-one correspondence between D_1 and D_2. In many scenarios, however, these two datasets are not necessarily well aligned. For example, consider D_1 and D_2 collected from two separate databases, where the users overlap but are not identical. As a result, there exists only a partial one-to-one correspondence. A similar situation arises in multi-object tracking: some objects may leave the scene in one frame, and new objects may enter the scene in subsequent frames. Therefore, not all objects in different frames can be perfectly matched. The RWOC problem with partial correspondence is known as robust-RWOC, or rRWOC (Varol & Nejatbakhsh, 2019), and is much less studied in the existing literature.

To address these concerns, we propose a new computational framework, ROBOT (Regression withOut correspondence using Bilevel OptimizaTion). Specifically, we formulate regression without correspondence as a continuous optimization problem. Then, by exploiting the interaction between the regression model and the data correspondence, we develop a hypergradient approach based on differentiable programming techniques (Duchi et al., 2008; Luise et al., 2018). Our hypergradient approach views the data correspondence as an operator of the regression, i.e., for a given w, the optimal correspondence is

\pi(w) = \arg\min_{\pi} L(w, \pi).  (2)

Accordingly, when applying gradient descent to (1), we need to find the gradient with respect to w by differentiating through both the objective function L and the data correspondence \pi(w). For simplicity, we refer to such a gradient as the "hypergradient". Note that due to its discrete nature, \pi(w) is not continuous in w; therefore, such a hypergradient does not exist.
To address this issue, we further propose to construct a smooth approximation of \pi(w) by adding a regularizer to equation 2, and we replace \pi(w) with the proposed smooth surrogate when computing the hypergradient with respect to w. Moreover, we propose an efficient and scalable implementation of the hypergradient computation based on simple first-order algorithms and implicit differentiation, which outperforms conventional automatic differentiation in terms of time and memory cost. ROBOT can also be extended to the robust RWOC problem, where D_1 and D_2 are not necessarily exactly aligned, i.e., some data points in D_1 may not correspond to any data point in D_2. Specifically, we relax the constraints on the permutation \pi(\cdot) (Liero et al., 2018) to automatically match related data points and ignore the unrelated ones. Finally, we conduct thorough numerical experiments to demonstrate the effectiveness of ROBOT. For RWOC (i.e., exact correspondence), we use several synthetic regression datasets and a real gated flow cytometry dataset, and we show that ROBOT outperforms baseline methods by significant margins. For robust RWOC (i.e., inexact correspondence), in addition to synthetic datasets, we consider a vision-based multi-object tracking task, and we show that ROBOT also achieves significant improvement over baseline methods.

Notations. Let \|\cdot\|_2 denote the \ell_2 norm of vectors, and \langle \cdot, \cdot \rangle the inner product of matrices, i.e., \langle A, B \rangle = \sum_{i,j} A_{ij} B_{ij} for matrices A and B. a_{i:j} denotes the entries from index i to index j of vector a. Let 1_n denote the n-dimensional vector of all ones. Denote by d(\cdot)/d(\cdot) the gradient of scalars, and by \nabla_{(\cdot)}(\cdot) the Jacobian of tensors. We denote by [v_1, v_2] the concatenation of two vectors v_1 and v_2. N(\mu, \sigma^2) is the Gaussian distribution with mean \mu and variance \sigma^2.

2. ROBOT: A HYPERGRADIENT APPROACH FOR RWOC

We develop our hypergradient approach for RWOC. Specifically, we first introduce a continuous formulation equivalent to (1), and then propose a smooth bi-level relaxation with an efficient hypergradient descent algorithm.

2.1. EQUIVALENT CONTINUOUS FORMULATION

We propose a continuous optimization problem equivalent to (1). Specifically, we rewrite (1) in the equivalent form

min_w min_{S \in R^{n \times n}} L(w, S) = \langle C(w), S \rangle  subject to  S \in P,  (3)

where P denotes the set of all n \times n permutation matrices, and C(w) \in R^{n \times n} is the loss matrix with C_{ij}(w) = \| y_i - f(x_i, z_j; w) \|_2^2. Note that we can relax S \in P, the discrete feasible set of the inner minimization problem of (3), to a convex set without affecting optimality, as suggested by the next proposition.

Proposition 1. Given any a \in R^n and b \in R^m, define \Pi(a, b) = \{A \in R^{n \times m} : A 1_m = a, A^T 1_n = b, A_{ij} \geq 0\}. The optimal solution to the inner discrete minimization problem of (3) is also an optimal solution to the following continuous optimization problem:

min_{S \in R^{n \times n}} \langle C(w), S \rangle,  s.t.  S \in \Pi(1_n, 1_n).  (4)

This is a direct corollary of the Birkhoff-von Neumann theorem (Birkhoff, 1946; Von Neumann, 1953); please refer to Appendix A for more details. Proposition 1 allows us to replace P in (3) with \Pi(1_n, 1_n), which is also known as the Birkhoff polytope (Ziegler, 2012). Accordingly, we obtain the following continuous formulation:

min_w min_{S \in R^{n \times n}} \langle C(w), S \rangle  subject to  S \in \Pi(1_n, 1_n).  (5)

Remark 1. In general, (3) can be solved by linear programming algorithms after this relaxation (Dantzig, 1998).

2.2. CONVENTIONAL WISDOM: ALTERNATING MINIMIZATION

Conventional wisdom for solving (5) suggests using alternating minimization (AM, Abid et al. (2017)). Specifically, at the k-th iteration, we first update S by solving

S^{(k)} = \arg\min_{S \in \Pi(1_n, 1_n)} L(w^{(k-1)}, S),

and then, given S^{(k)}, we update w using gradient descent or exact minimization, i.e.,

w^{(k)} = w^{(k-1)} - \eta \nabla_w L(w^{(k-1)}, S^{(k)}).

However, AM works poorly for solving (5) in practice. This is because w and S interact strongly throughout the iterations: a slight change to w may lead to a significant change in S.
Therefore, the optimization landscape is ill-conditioned, and AM can easily get stuck at local optima.
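To make this failure mode concrete, here is a minimal AM sketch for the scalar-output linear case (our own illustration, not the paper's implementation; `am_rwoc` and `matched_loss` are hypothetical helper names). With scalar responses, the assignment step reduces to sorting by the rearrangement inequality; in general it would be a linear assignment problem.

```python
import numpy as np

def am_rwoc(y, Z, w0, n_iters=50):
    """Alternating minimization for shuffled linear regression with scalar
    responses (a sketch). Each pi-step optimally re-matches responses to the
    current predictions by sorting; each w-step is a least-squares fit."""
    w = w0.copy()
    for _ in range(n_iters):
        pred = Z @ w
        # pi-step: the position of the k-th smallest prediction receives
        # the k-th smallest response (optimal for squared loss in 1-D)
        y_matched = np.empty_like(y)
        y_matched[np.argsort(pred)] = np.sort(y)
        # w-step: least squares against the re-matched responses
        w, *_ = np.linalg.lstsq(Z, y_matched, rcond=None)
    return w

def matched_loss(y, Z, w):
    """Objective (1) evaluated at the optimal permutation for this w."""
    return float(((np.sort(Z @ w) - np.sort(y)) ** 2).sum())
```

Each iteration is non-increasing in the matched loss, yet the procedure routinely stalls at poor stationary points from a random start, which is exactly the ill-conditioning discussed above.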

2.3. SMOOTH BI-LEVEL RELAXATION

To tackle the aforementioned computational challenge, we propose a hypergradient approach, which can better handle the interaction between w and S. Specifically, we first relax (5) to a smooth bi-level optimization problem, and then solve the relaxed problem using a hypergradient descent algorithm. We rewrite (5) as a smoothed bi-level optimization problem,

min_w F(w) = \langle C(w), S^*_\epsilon(w) \rangle,  subject to  S^*_\epsilon(w) = \arg\min_{S \in \Pi(1_n, 1_n)} \langle C(w), S \rangle + \epsilon H(S),  (6)

where H(S) = \langle S, \log S \rangle is the (negative) entropy of S and \epsilon > 0 is a regularization coefficient. The regularizer \epsilon H(S) in equation 6 alleviates the sensitivity of S^*(w) to w. Note that without such a regularizer, we would solve

S^*(w) = \arg\min_{S \in \Pi(1_n, 1_n)} \langle C(w), S \rangle.  (7)

The resulting S^*(w) can be discontinuous in w. This is because S^*(w) is the optimal solution of a linear optimization problem, and usually lies on a vertex of \Pi(1_n, 1_n). This means that as w changes, S^*(w) either stays the same or jumps to another vertex of \Pi(1_n, 1_n). The jump makes S^*(w) highly sensitive to w. To alleviate this issue, we propose to smooth S^*(w) by adding an entropy regularizer to the lower-level problem. The entropy regularizer forces S^*_\epsilon(w) to stay in the interior of \Pi(1_n, 1_n), and S^*_\epsilon(w) changes smoothly with respect to w, as suggested by the following theorem.

Theorem 1. For any \epsilon > 0, S^*_\epsilon(w) is differentiable if the cost C(w) is differentiable with respect to w. Consequently, the objective F(w) = \langle C(w), S^*_\epsilon(w) \rangle is also differentiable.

The proof is deferred to Appendix C. Note that (6) provides a new perspective for interpreting the relationship between w and S. As can be seen from (6), w and S have different priorities: w is the parameter of the leader problem, which has the higher priority; S is the parameter of the follower problem, which has the lower priority, and can also be viewed as an operator of w, denoted by S^*_\epsilon(w).
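The smoothing effect of the entropic regularizer can be checked numerically on a toy case. The sketch below (our illustration; `entropic_plan` is a hypothetical helper) uses the 2x2 cost family C(t) = [[t, 1], [1, t]]: the unregularized matching jumps from the identity to the swap at t = 1, while for this family every 2x2 doubly stochastic matrix has the form [[s, 1-s], [1-s, s]] and the entropic minimizer satisfies s = sigmoid((1 - t)/eps), which is smooth in t.

```python
import numpy as np

def entropic_plan(C, eps, n_iters=500):
    """Entropy-regularized matching with unit marginals, via Sinkhorn (a sketch)."""
    n = C.shape[0]
    ones = np.ones(n)
    G = np.exp(-C / eps)
    b = np.ones(n)
    for _ in range(n_iters):
        a = ones / (G @ b)      # enforce row sums = 1
        b = ones / (G.T @ a)    # enforce column sums = 1
    return a[:, None] * G * b[None, :]

# S*_eps moves smoothly through the crossover at t = 1
eps = 0.2
for t in [0.8, 0.95, 1.05, 1.2]:
    S = entropic_plan(np.array([[t, 1.0], [1.0, t]]), eps)
    s_closed = 1.0 / (1.0 + np.exp(-(1.0 - t) / eps))  # closed form for this family
    assert abs(S[0, 0] - s_closed) < 1e-6
```

As eps shrinks, the sigmoid sharpens toward the discontinuous vertex-to-vertex jump, recovering the unregularized behavior in the limit.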
Accordingly, when we minimize (6) with respect to w using gradient descent, we should also differentiate through S^*_\epsilon. We refer to such a gradient as the "hypergradient", defined as follows:

\nabla_w F(w) = \frac{\partial F(w)}{\partial C(w)} \frac{\partial C(w)}{\partial w} + \frac{\partial F(w)}{\partial S^*_\epsilon(w)} \frac{\partial S^*_\epsilon(w)}{\partial w} = \nabla_w L(w, S) + \frac{\partial F(w)}{\partial S^*_\epsilon(w)} \frac{\partial S^*_\epsilon(w)}{\partial w}.

We further examine the alternating minimization algorithm from the bi-level optimization perspective: since \nabla_w L(w^{(k-1)}, S^{(k)}) does not differentiate through S^{(k)}, AM essentially uses an inexact gradient. From a game-theoretic perspective, (6) defines a competition between the leader w and the follower S. When using AM, S only reacts to what w has played. In contrast, when using the hypergradient approach, the leader recognizes the follower's strategy and reacts to what the follower is anticipated to play through \frac{\partial F(w)}{\partial S^*_\epsilon(w)} \frac{\partial S^*_\epsilon(w)}{\partial w}. In this way, we can find a better descent direction for w.

Remark 2. We use a simple example of quadratic minimization to illustrate why we expect the bi-level optimization formulation in (6) to enjoy a benign optimization landscape. We consider a quadratic function L(a_1, a_2) = a^T P a + b^T a, where a_1 \in R^{d_1}, a_2 \in R^{d_2}, a = [a_1, a_2], P \in R^{(d_1+d_2) \times (d_1+d_2)}, b \in R^{d_1+d_2}. Let P = \rho 1_{d_1+d_2} 1_{d_1+d_2}^T + (1 - \rho) I_{d_1+d_2}, where I_{d_1+d_2} is the identity matrix and \rho is a constant. We solve the following bi-level optimization problem,

min_{a_1} F(a_1) = L(a_1, a^*_2(a_1))  subject to  a^*_2(a_1) = \arg\min_{a_2} L(a_1, a_2) + \lambda \|a_2\|_2^2,  (9)

where \lambda is a regularization coefficient. The next proposition shows that \nabla^2 F(a_1) enjoys a smaller condition number than \nabla^2_{a_1 a_1} L(a_1, a_2), which corresponds to the problem that AM solves.

Proposition 2. Given F defined in (9), we have

\frac{\lambda_{max}(\nabla^2 F(a_1))}{\lambda_{min}(\nabla^2 F(a_1))} = 1 + \frac{1 - \rho + \lambda}{d_2 \rho - \rho + \lambda + 1} \cdot \frac{d_1 \rho}{1 - \rho}  and  \frac{\lambda_{max}(\nabla^2_{a_1 a_1} L(a_1, a_2))}{\lambda_{min}(\nabla^2_{a_1 a_1} L(a_1, a_2))} = 1 + \frac{d_1 \rho}{1 - \rho}.

The proof is deferred to Appendix B.1.
As suggested by Proposition 2, F(a_1) is much better conditioned than L(a_1, a_2) with respect to a_1 in high dimensional settings.
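Proposition 2 can be verified numerically. The sketch below (our illustration; it uses the convention that the bilevel Hessian is the Schur complement H_ROBOT derived in Appendix B.1, up to a factor of 2 that cancels in the ratio) compares the two condition numbers by direct eigenvalue computation.

```python
import numpy as np

def cond_numbers(d1, d2, rho, lam):
    """Condition numbers of the bilevel and AM Hessian blocks (a sketch)."""
    d = d1 + d2
    P = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)
    P11, P12 = P[:d1, :d1], P[:d1, d1:]
    P21, P22 = P[d1:, :d1], P[d1:, d1:]
    # bilevel: Schur complement of the regularized inner block
    H_bilevel = P11 - P12 @ np.linalg.inv(P22 + lam * np.eye(d2)) @ P21
    # alternating minimization: the raw (1,1) block of P
    H_am = P11
    def cond(H):
        ev = np.linalg.eigvalsh(H)
        return ev[-1] / ev[0]
    return cond(H_bilevel), cond(H_am)
```

For example, with d_1 = 4, d_2 = 6, rho = 0.3, lambda = 0.2, the bilevel condition number is about 1.57 while the AM one is about 2.71, matching the two closed-form expressions in Proposition 2.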

2.4. SOLVING RWOC BY HYPERGRADIENT DESCENT

We present how to solve (6) using our hypergradient approach. Specifically, we compute the hypergradient of F(w) based on the following theorem.

Theorem 2. The gradient of F with respect to w is

\nabla_w F(w) = \frac{1}{\epsilon} \sum_{i,j=1}^{n,n} \Big[ (\epsilon - C_{ij}) S^*_{\epsilon,ij} + \sum_{h,\ell=1}^{n,n} C_{h\ell} S^*_{\epsilon,h\ell} P_{hij} + \sum_{h,\ell=1}^{n,n} C_{h\ell} S^*_{\epsilon,h\ell} Q_{\ell ij} \Big] \nabla_w C_{ij}.  (10)

The definitions of P and Q and the proof are deferred to Appendix C. Theorem 2 suggests that we first solve the lower level problem in (6),

S^*_\epsilon = \arg\min_{S \in \Pi(1_n, 1_n)} \langle C(w), S \rangle + \epsilon H(S),  (11)

and then substitute S^*_\epsilon into (10) to obtain \nabla_w F(w). Note that the optimization problem in (11) can be efficiently solved by a variant of the Sinkhorn algorithm (Cuturi, 2013; Benamou et al., 2015). Specifically, (11) can be formulated as an entropic optimal transport (EOT) problem (Monge, 1781; Kantorovich, 1960), which aims to find the optimal way to transport mass from a categorical distribution with weights \mu = [\mu_1, \ldots, \mu_n]^T to another categorical distribution with weights \nu = [\nu_1, \ldots, \nu_m]^T,

\Gamma^* = \arg\min_{\Gamma \in \Pi(\mu, \nu)} \langle M, \Gamma \rangle + \epsilon H(\Gamma),  (12)

with \Pi(\mu, \nu) = \{\Gamma \in R^{n \times m} : \Gamma 1_m = \mu, \Gamma^T 1_n = \nu, \Gamma_{ij} \geq 0\}, where M \in R^{n \times m} is the cost matrix with M_{ij} the transport cost. When we set the two categorical distributions to the empirical distributions of D_1 and D_2, respectively, i.e., we take M = C(w) and \mu = \nu = 1_n/n, one can verify that (12) is a scaled version of the lower problem of (6), and their optimal solutions satisfy S^*_\epsilon = n \Gamma^*. Therefore, we can apply the Sinkhorn algorithm to solve the EOT problem in equation 12: at the \ell-th iteration,

p^{(\ell+1)} = \mu / (G q^{(\ell)})  and  q^{(\ell+1)} = \nu / (G^T p^{(\ell+1)}),

where q^{(0)} = \frac{1}{n} 1_n, G \in R^{n \times n} with G_{ij} = \exp(-C_{ij}(w)/\epsilon), and the division is entrywise. Let p^* and q^* denote the stationary points. Then we obtain S^*_{\epsilon,ij} = n p^*_i G_{ij} q^*_j.

Remark 3. The Sinkhorn algorithm is iterative and cannot exactly solve (11) within finitely many steps.
As the Sinkhorn algorithm is very efficient and attains linear convergence, the inexact solution it outputs suffices to approximate the gradient \nabla_w F(w) well.
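The Sinkhorn iterations above can be sketched as follows (an illustration; `sinkhorn_lower`, the iteration count, and the value of eps are ours). It solves the lower problem (11) with \mu = \nu = 1_n/n and returns S^*_\epsilon = n \Gamma^*, which is approximately doubly stochastic.

```python
import numpy as np

def sinkhorn_lower(C, eps=0.5, n_iters=1000):
    """Solve the entropic lower problem (11) via Sinkhorn iterations (a sketch)."""
    n = C.shape[0]
    mu = np.full(n, 1.0 / n)        # mu = nu = uniform weights
    G = np.exp(-C / eps)            # Gibbs kernel G_ij = exp(-C_ij / eps)
    q = np.full(n, 1.0 / n)         # q^(0) = 1_n / n
    for _ in range(n_iters):
        p = mu / (G @ q)            # p-update: fix row marginals
        q = mu / (G.T @ p)          # q-update: fix column marginals (nu = mu)
    return n * (p[:, None] * G * q[None, :])   # S*_eps = n * Gamma*
```

Smaller eps makes the returned matrix closer to a hard permutation but slows convergence, which is the trade-off Remark 3 alludes to.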

3. ROBOT FOR ROBUST CORRESPONDENCE

We next propose a robust version of ROBOT to solve rRWOC (Varol & Nejatbakhsh, 2019). Note that in (6), the constraint S \in \Pi(1_n, 1_n) enforces a one-to-one matching between D_1 and D_2. For rRWOC, however, such an exact matching may not exist; for example, we may have n < m, where n = |D_1| and m = |D_2|. Therefore, we need to relax the constraint on S. Motivated by the connection between (6) and (12), we propose to solve the lower problem

(S^*_r(w), \hat\mu^*, \hat\nu^*) = \arg\min_{S \in \Pi(\hat\mu, \hat\nu), \hat\mu, \hat\nu} \langle C(w), S \rangle + \epsilon H(S),  (13)
subject to  \hat\mu^T 1_n = n, \hat\nu^T 1_m = m, \|\hat\mu - 1_n\|_2^2 \leq \rho_1, \|\hat\nu - 1_m\|_2^2 \leq \rho_2,

where S^*_r(w) \in R^{n \times m} denotes an inexact correspondence between D_1 and D_2. As can be seen in (13), we relax the marginal constraint \Pi(1, 1) in (6) to \Pi(\hat\mu, \hat\nu), where \hat\mu and \hat\nu are required not to deviate much from 1. Illustrative examples of the exact and robust alignments are provided in Figure 1. Computationally, (13) can be solved by taking the Sinkhorn iteration and the projected gradient iteration in an alternating manner (see Appendix D for more details). Given S^*_r(w), we solve the upper level optimization in (6) to obtain w^*, i.e., w^* = \arg\min_w \langle C(w), S^*_r(w) \rangle. Similar to the previous section, we use a first-order algorithm to solve this problem, and we derive explicit expressions for the update rules; see Appendix E for details.

4. EXPERIMENTS

We evaluate ROBOT and ROBOT-robust on both synthetic and real-world datasets, including flow cytometry and multi-object tracking. We first present numerical results, and then provide insights in the discussion section. Experiment details and auxiliary results can be found in Appendix G.

4.1. UNLABELED SENSING

Data Generation. We follow the unlabeled sensing setting (Tsakiris & Peng, 2019) and generate n = 1000 data points \{(y_i, z_i)\}_{i=1}^n, where z_i \in R^e. Note that here we take d = 0. We first generate z_i, w \sim N(0_e, I_e) and \varepsilon_i \sim N(0, \rho_{noise}^2). Then we compute y_i = z_i^T w + \varepsilon_i. We randomly permute the order of 50% of the z_i so that we lose the Z-to-Y correspondence. We generate the test set in the same way, only without permutation.
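The data generation above can be sketched as follows (illustrative code; the function and parameter names are ours):

```python
import numpy as np

def make_unlabeled_sensing(n=1000, e=5, noise_var=0.1, shuffle_frac=0.5, seed=0):
    """Generate shuffled linear-regression data (a sketch of the setup above)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, e))
    w = rng.standard_normal(e)
    y = Z @ w + np.sqrt(noise_var) * rng.standard_normal(n)
    # permute the Z-rows of a random 50% subset, breaking the correspondence
    idx = rng.choice(n, size=int(shuffle_frac * n), replace=False)
    Z_shuffled = Z.copy()
    Z_shuffled[idx] = Z[rng.permutation(idx)]
    return Z, Z_shuffled, y, w
```

The clean Z is returned only so one can inspect how much of the correspondence was destroyed; a learner would only see (Z_shuffled, y).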

Baselines and Training.

We initialize AM, EM, and ROBOT using the output of RS with multi-start. We adopt a linear model f(Z; w) = Z^T w. Models are evaluated by the relative error on the test set, i.e., error = \sum_i (\hat y_i - y_i)^2 / \sum_i (y_i - \bar y)^2, where \hat y_i is the predicted label and \bar y is the mean of \{y_i\}. Results. We visualize the results in Figure 2. In all the experiments, ROBOT achieves better results than the baselines. Note that the relative error of all methods except Oracle grows as the dimension and the noise increase. For low dimensional data, e.g., e = 5, our model achieves even better performance than Oracle. We provide more discussion on using RS as initialization in Appendix G.5.

4.2. NONLINEAR REGRESSION

Data Generation. We mimic the scenario where the dataset is collected from different platforms. Specifically, we generate n data points \{(y_i, [x_i, z_i])\}_{i=1}^n, where x_i \in R^d and z_i \in R^e. We first generate x_i \sim N(0_d, I_d), z_i \sim N(0_e, I_e), w \sim N(0_{d+e}, I_{d+e}), and \varepsilon_i \sim N(0, \rho_{noise}^2). Then we compute y_i = f([x_i, z_i]; w) + \varepsilon_i. Next, we randomly permute the order of \{z_i\} so that we lose the data correspondence. Here, D_1 = \{(x_i, y_i)\} and D_2 = \{z_j\} mimic two parts of data collected from two separate platforms. Since we are interested in the response on platform one, we treat all data from platform two, i.e., D_2, as well as 80% of the data in D_1, as the training data. The remaining data from D_1 are the test data. Notice that D_1 and D_2 contain different numbers of data points, i.e., the correspondence is not exactly one-to-one. Baselines and Training. We consider a nonlinear function f(X, Z; w) = \sum_{k=1}^{d} \sin([X, Z]_k w_k). In this case, we consider only two baselines, Oracle and LS, since the other baselines in the previous section are designed for linear models. We evaluate the regression models by the transport cost divided by \sum_i (y_i - \bar y)^2 on the test set. Results. As shown in Figure 3, ROBOT-robust consistently outperforms ROBOT and LS, demonstrating the effectiveness of our robust formulation. Moreover, ROBOT-robust achieves better performance than Oracle when the number of training data points is large or when the noise level is high.

4.3. FLOW CYTOMETRY

In flow cytometry (FC), a sample containing particles is suspended in a fluid and injected into the flow cytometer, but the measuring instruments are unable to preserve the correspondence between the particles and the measurements. Different from FC, gated flow cytometry (GFC) uses "gates" to sort the particles into one of many bins, which provides partial ordering information, since the measurements are provided separately for each bin. In practice, there are usually 3 or 4 bins.

Settings. We adopt the dataset from Knight et al. (2009). Following Abid et al. (2017), the outputs y_i are normalized, and we select the top 20 significant features by a linear regression on the top 1400 items in the dataset. We use 90% of the data as training data and the remaining as test data. For ordinary FC, we randomly shuffle all the labels in the training set. For GFC, the training set is first sorted by the labels and then divided into equal-sized groups, mimicking the sorting-by-gates process. The labels in each group are then randomly shuffled. To simulate gating error, 1% of the data are shuffled across groups. We compare ROBOT with Oracle, LS, Hard EM (a variant of Stochastic EM proposed in Abid & Zou (2018)), Stochastic EM, and AM. We use relative error on the test set as the evaluation metric. Results. As shown in Figure 4, while AM achieves good performance on GFC when the number of groups is 3, it behaves poorly on the FC task. ROBOT, on the other hand, is effective on both tasks.
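The GFC label-shuffling protocol described above can be sketched as follows (our illustration; `gated_shuffle` is a hypothetical helper name, and `np.array_split` only approximates equal-sized groups):

```python
import numpy as np

def gated_shuffle(y, n_groups=3, cross_rate=0.01, seed=0):
    """Simulate gated flow cytometry label shuffling (a sketch of the setup)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(y)                     # sort by label, mimicking the gates
    groups = np.array_split(order, n_groups)  # divide into (near-)equal bins
    y_shuf = y.copy()
    for g in groups:
        y_shuf[g] = y_shuf[rng.permutation(g)]   # shuffle labels within each bin
    # gating error: shuffle a small fraction of labels across bins
    k = max(1, int(cross_rate * len(y)))
    idx = rng.choice(len(y), size=k, replace=False)
    y_shuf[idx] = y_shuf[rng.permutation(idx)]
    return y_shuf
```

Ordinary FC corresponds to `n_groups=1`, i.e., a single fully shuffled bin.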

4.4. MULTI-OBJECT TRACKING

In this section, we extend our method to vision-based Multi-Object Tracking (MOT), a task with broad applications in mobile robotics and autonomous driving, to show the potential of applying RWOC to more real-world tasks. Given a video and the current frame, the goal of MOT is to predict the locations of the objects in the next frame. Specifically, object detectors (Felzenszwalb et al., 2009; Ren et al., 2015) first provide the potential locations of the objects via bounding boxes. Then, MOT aims to assign the bounding boxes to trajectories that describe the path of individual objects over time. Here, we formulate the current frame and the objects' locations in the current frame as D_2 = \{z_j\}, while we treat the next frame and the locations in the next frame as D_1 = \{(x_i, y_i)\}. Existing deep learning based MOT algorithms require large amounts of annotated data, i.e., the ground truth of the correspondence, during training. Different from them, our algorithm does not require the correspondence between D_1 and D_2; all we need is the video. This task is referred to as unsupervised MOT (He et al., 2019). Related Works. To the best of our knowledge, the only method that accomplishes unsupervised end-to-end learning of MOT is He et al. (2019). However, it targets tracking with low densities, e.g., Sprites-MOT, which is different from our focus. Settings. We adopt the MOT17 (Milan et al., 2016) and MOT20 (Dendorfer et al., 2020) datasets. The scene densities of the two datasets are 31.8 and 170.9, respectively, which means the scenes are quite crowded, as illustrated in Figure 5. We adopt the DPM detector (Felzenszwalb et al., 2009) on MOT17 and the Faster-RCNN detector (Ren et al., 2015) on MOT20 to provide the bounding boxes. Inspired by Xu et al.
(2019b), the cost matrix is computed as the average of the normalized Euclidean center-point distance and the Jaccard distance between the bounding boxes,

C_{ij}(w) = \frac{1}{2} \Big( \frac{\| c(f(z_j; w)) - c(y_i) \|_2}{\sqrt{H^2 + W^2}} + J(f(z_j; w), y_i) \Big),

where c(\cdot) is the location of the box center, H and W are the height and width of the video frame, and J(\cdot, \cdot) is the Jaccard distance defined as 1 - IoU (Intersection-over-Union). We utilize the single-object tracking model SiamRPN (Li et al., 2018) as our regression model f. We apply ROBOT-robust with \rho_1 = \rho_2 = 10^{-3}. See Appendix G for more detailed settings. Results. We present the experiment results in Table 1, where the evaluation metrics follow Ristani et al. (2016). In the table, ↑ means higher is better, and ↓ means lower is better. ROBOT denotes the model trained by ROBOT-robust, and w/o ROBOT denotes the pretrained model in Li et al. (2018). The scores improve significantly after training with ROBOT-robust. We also include the scores of the SORT model (Bewley et al., 2016) obtained from the dataset platform. Different from SiamRPN and SiamRPN+ROBOT, SORT is a supervised learning model. As shown, our unsupervised training framework achieves comparable or even better performance.
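The per-pair cost C_{ij} above can be sketched as follows for axis-aligned boxes in (x1, y1, x2, y2) format (an illustrative implementation; in the paper the predicted box comes from SiamRPN, while here `pair_cost` takes two plain boxes):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def pair_cost(pred_box, gt_box, H, W):
    """C_ij: average of normalized center distance and Jaccard distance."""
    c_pred = np.array([(pred_box[0] + pred_box[2]) / 2, (pred_box[1] + pred_box[3]) / 2])
    c_gt = np.array([(gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2])
    center_term = np.linalg.norm(c_pred - c_gt) / np.sqrt(H**2 + W**2)
    jaccard = 1.0 - iou(pred_box, gt_box)
    return 0.5 * (center_term + jaccard)
```

Both terms lie in [0, 1] (the center distance is bounded by the frame diagonal), so the cost is already on a scale suitable for the entropic matching step.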

5. DISCUSSION

Sensitivity to initialization. As stated in Pananjady et al. (2017b), obtaining the global optimum of (1) is in general NP-hard. Some "global" methods use global optimization techniques and have exponential complexity, e.g., Elhami et al. (2017), which is not applicable to large data. The other "local" methods only guarantee convergence to local optima, and the convergence is very sensitive to initialization. Compared with existing "local" methods, our method is computationally efficient and greatly reduces the sensitivity to initialization. To demonstrate this advantage, we run AM and ROBOT with 10 different initial solutions, and then sort the results based on (a) the averaged residual on the training set and (b) the relative prediction error on the test set. We plot the percentiles in Figure 6. Here we use fully shuffled data under the unlabeled sensing setting, and we set n = 1000, e = 5, \rho_{noise}^2 = 0.1, and \epsilon = 10^{-2}. We can see that ROBOT finds "good" solutions (relative prediction error smaller than 1) in 30% of the cases, whereas AM is more sensitive to initialization and cannot find "good" solutions. Implicit differentiation vs. automatic differentiation for solving (11). An alternative approach to approximating the Jacobian is automatic differentiation (AD) through the Sinkhorn iterations used to update S when solving (11). As suggested by Figure 7(a), running the Sinkhorn iterations until convergence (200 iterations) can lead to a better solution. However, to apply AD, we need to store all the intermediate updates of all the Sinkhorn iterations. This requires memory proportional to the number of iterations, which is not necessarily affordable. In contrast, applying our explicit expression for the backward pass is memory-efficient. Moreover, we also observe that AD is much more time-consuming than our method. The timing performance and memory usage are shown in Figure 7. Connection to EM.
Abid & Zou (2018) adopt an Expectation Maximization (EM) method for RWOC, where S is modeled as a latent random variable; in the M-step, one maximizes the expected likelihood of the data over S. This method shares the same spirit as ours: we avoid updating w using one single permutation matrix, as AM does. However, this method depends heavily on a good initialization. Specifically, if we randomly initialize w, the posterior distribution of S in that iteration is close to its prior, which is a uniform distribution. In this case, the follow-up update of w is not informative, so the EM solution quickly converges to an undesired stationary point, as illustrated in Figure 8. Some existing works assume only a small fraction of the correspondence is missing; our method is also applicable to these problems, as long as the additional constraints can be adapted to the implicit differentiation. More applications of RWOC. RWOC problems generally appear for two reasons. First, the measuring instruments are unable to preserve the correspondence. In addition to GFC and MOT, we list a few more examples: SLAM tracking (Thrun, 2007), archaeological measurements (Robinson, 1951), large sensor networks (Keller et al., 2009), pose and correspondence estimation (David et al., 2004), and the genome assembly problem from shotgun reads (Huang & Madan, 1999). Second, the data correspondence is masked for privacy reasons. For example, we may want to build a recommender system for a new platform, borrowing user data from a mature platform.

A CONNECTION BETWEEN OT AND RWOC

Theorem 1. Denote \Pi(a, b) = \{S \in R^{n \times m} : S 1_m = a, S^T 1_n = b, S_{ij} \geq 0\} for any a \in R^n and b \in R^m. Then at least one of the optimal solutions of the following problem lies in P:

min_{S \in R^{n \times n}} \langle C(w), S \rangle,  s.t.  S \in \Pi(1_n, 1_n).  (14)

Proof. Denote an optimal solution of (14) as Z^*. As mentioned earlier, this is a direct corollary of the Birkhoff-von Neumann theorem (Birkhoff, 1946; Von Neumann, 1953). Specifically, the Birkhoff-von Neumann theorem states that the polytope \Pi(1_n, 1_n) is the convex hull of the set of n \times n permutation matrices, and furthermore that the vertices of \Pi(1_n, 1_n) are precisely the permutation matrices. On the other hand, (14) is a linear optimization problem, so at least one optimal solution lies at a vertex, given that the problem is feasible. As a result, at least one Z^* is a permutation matrix.
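A quick numerical sanity check of this argument (our illustration; both helpers are hypothetical): since any doubly stochastic matrix is a convex combination of permutation matrices, its cost \langle C, S \rangle can never beat the best permutation.

```python
import itertools
import numpy as np

def perm_min_cost(C):
    """Brute-force minimum of <C, S> over permutation matrices."""
    n = C.shape[0]
    return min(sum(C[i, p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

def random_doubly_stochastic(n, rng, n_iters=300):
    """A random (approximately) doubly stochastic matrix via row/column scaling."""
    S = rng.random((n, n)) + 1e-3
    for _ in range(n_iters):
        S /= S.sum(axis=1, keepdims=True)
        S /= S.sum(axis=0, keepdims=True)
    return S
```

Sampling many interior points of the Birkhoff polytope and comparing their costs against the brute-force permutation minimum illustrates that the linear objective is minimized at a vertex.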

B TWO PERSPECTIVES OF THE MOTIVATIONS OF BILEVEL OPTIMIZATION B.1 FASTER CONVERGENCE

The bilevel optimization formulation has a better gradient descent iteration complexity than alternating minimization. To see this, consider the quadratic function L(a_1, a_2) = a^T P a + b^T a, where a_1 \in R^{d_1}, a_2 \in R^{d_2}, a = [a_1, a_2] \in R^{d_1+d_2}, P \in R^{(d_1+d_2) \times (d_1+d_2)}, and b \in R^{d_1+d_2}. To further simplify the discussion, we assume

P = \rho 1_{d_1+d_2} 1_{d_1+d_2}^T + (1 - \rho) I_{d_1+d_2},

where I_{d_1+d_2} is the identity matrix. Then we have the following proposition.

Proposition 1. Given F defined in (9), we have

\frac{\lambda_{max}(\nabla^2 F(a_1))}{\lambda_{min}(\nabla^2 F(a_1))} = 1 + \frac{1 - \rho + \lambda}{1 - \rho} \cdot \frac{d_1 \rho}{d_2 \rho - \rho + \lambda + 1}  and  \frac{\lambda_{max}(\nabla^2_{a_1 a_1} L(a_1, a_2))}{\lambda_{min}(\nabla^2_{a_1 a_1} L(a_1, a_2))} = 1 + \frac{d_1 \rho}{1 - \rho}.

Proof. For alternating minimization, the Hessian for a_1 is a submatrix of P, i.e., H_{AM} = \rho 1_{d_1} 1_{d_1}^T + (1 - \rho) I_{d_1}, whose condition number is

C_{AM} = 1 + \frac{d_1 \rho}{1 - \rho}.

We now compute the condition number for ROBOT. Denote

P = [P_{11}, P_{12}; P_{21}, P_{22}],  b = [b_1; b_2],

where P_{11} \in R^{d_1 \times d_1}, P_{12} \in R^{d_1 \times d_2}, P_{21} \in R^{d_2 \times d_1}, P_{22} \in R^{d_2 \times d_2}, and b_1 \in R^{d_1}, b_2 \in R^{d_2}. ROBOT first minimizes over a_2:

a^*_2(a_1) = \arg\min_{a_2} L(a_1, a_2) + \lambda \|a_2\|_2^2 = -(P_{22} + \lambda I_{d_2})^{-1} (P_{21} a_1 + b_2 / 2).

Substituting a^*_2(a_1) into L(a_1, a_2), we obtain that the Hessian for a_1 is

H_{ROBOT} = P_{11} - P_{12} (P_{22} + \lambda I_{d_2})^{-1} P_{21}.

Using the Sherman-Morrison formula, we can explicitly express (P_{22} + \lambda I_{d_2})^{-1} as

(P_{22} + \lambda I_{d_2})^{-1} = \frac{1}{1 - \rho + \lambda} I_{d_2} - \frac{\rho}{(1 - \rho + \lambda)(1 - \rho + \lambda + \rho d_2)} 1_{d_2} 1_{d_2}^T.

Substituting it into H_{ROBOT},

H_{ROBOT} = P_{11} - P_{12} (P_{22} + \lambda I_{d_2})^{-1} P_{21} = (1 - \rho) I_{d_1} + \Big( \rho - \frac{d_2 \rho^2}{d_2 \rho - \rho + \lambda + 1} \Big) 1_{d_1} 1_{d_1}^T.

Therefore, the condition number is

C_{ROBOT} = 1 + \frac{1 - \rho + \lambda}{1 - \rho} \cdot \frac{d_1 \rho}{d_2 \rho - \rho + \lambda + 1}.

Note that C_{AM} increases linearly with respect to d_1, so the optimization problem inevitably becomes ill-conditioned as the dimension increases. In contrast, C_{ROBOT} can stay in the same order of magnitude when d_1 and d_2 increase simultaneously.
Since the iteration complexity of gradient descent is proportional to the condition number (Bottou et al., 2018) , ROBOT needs fewer iterations to converge than AM.
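Proposition 1 can be checked numerically; below is a minimal sketch, assuming the same structured $P$ as above with an arbitrary choice of $d_1$, $d_2$, $\rho$, and lower-level regularization $\lambda$:

```python
import numpy as np

# Numerical check of Proposition 1 for P = rho * 1 1^T + (1 - rho) * I,
# with a regularizer lambda * I on the a_2 block of the lower-level problem.
d1, d2, rho, lam = 5, 7, 0.6, 0.1
d = d1 + d2
P = rho * np.ones((d, d)) + (1 - rho) * np.eye(d)
P11, P12 = P[:d1, :d1], P[:d1, d1:]
P21, P22 = P[d1:, :d1], P[d1:, d1:]

def cond(H):
    ev = np.linalg.eigvalsh(H)     # eigenvalues in ascending order
    return ev[-1] / ev[0]

# Alternating minimization: Hessian in a_1 is the top-left block of P.
C_AM = cond(P11)
assert np.isclose(C_AM, 1 + d1 * rho / (1 - rho))

# Bilevel (ROBOT): Hessian of the reduced objective after minimizing over a_2.
H_robot = P11 - P12 @ np.linalg.inv(P22 + lam * np.eye(d2)) @ P21
C_ROBOT = cond(H_robot)
assert np.isclose(
    C_ROBOT,
    1 + (1 - rho + lam) / (1 - rho) * d1 * rho / (d2 * rho - rho + lam + 1))

print(C_AM, C_ROBOT)
```

With these values, $C_{\rm AM} = 8.5$ while $C_{\rm ROBOT} \approx 1.8$, matching the claim that the bilevel reduction is better conditioned.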

C DIFFERENTIABILITY

Theorem 2. For any $\varepsilon > 0$, $S^*_\varepsilon(w)$ is differentiable, as long as the cost $C(w)$ is differentiable with respect to $w$. As a result, the objective $L_\varepsilon(w) = \langle C(w), S^*_\varepsilon(w)\rangle$ is also differentiable.

Proof. The proof is analogous to Xie et al. (2020). We first prove the differentiability of $S^*_\varepsilon(w)$; this part of the proof mirrors the proof in Luise et al. (2018). By Sinkhorn's scaling theorem (Sinkhorn & Knopp, 1967),
$$S^*_\varepsilon(w) = \mathrm{diag}(e^{\xi^*(w)/\varepsilon})\, e^{-C(w)/\varepsilon}\, \mathrm{diag}(e^{\zeta^*(w)/\varepsilon}).$$
Therefore, since $C_{ij}(w)$ is differentiable, $S^*_\varepsilon$ is differentiable if $(\xi^*(w), \zeta^*(w))$ is differentiable as a function of $w$. Let us set
$$\mathcal{L}(\xi, \zeta; \mu, \nu, C) = \xi^\top\mu + \zeta^\top\nu - \varepsilon \sum_{i,j=1}^{n,m} e^{-(C_{ij}-\xi_i-\zeta_j)/\varepsilon}.$$

Theorem 3. Denote $L_\varepsilon(w) = \langle C(w), S^*_\varepsilon(w)\rangle$. The gradient of $L_\varepsilon$ with respect to $w$ is
$$\nabla_w L_\varepsilon = \frac{1}{\varepsilon}\sum_{i,j=1}^{n,n}\Big[(\varepsilon - C_{ij}) S^*_{\varepsilon,ij} + \sum_{h,\ell=1}^{n,n} C_{h\ell} S^*_{\varepsilon,h\ell}\frac{d\xi^*_h}{dC_{ij}} + \sum_{h,\ell=1}^{n,n} C_{h\ell} S^*_{\varepsilon,h\ell}\frac{d\zeta^*_\ell}{dC_{ij}}\Big]\nabla_w C_{ij},$$
where
$$\begin{bmatrix}\nabla_C \xi^* \\ \nabla_C \zeta^*\end{bmatrix} = \begin{bmatrix}-H^{-1}D \\ 0\end{bmatrix}$$
with $-H^{-1}D \in \mathbb{R}^{(2n-1)\times n\times n}$, $0 \in \mathbb{R}^{1\times n\times n}$,
$$D_{\ell,ij} = \begin{cases}\frac{1}{\varepsilon}\,\delta_{\ell i}\, S^*_{\varepsilon,ij}, & \ell = 1, \cdots, n, \\ \frac{1}{\varepsilon}\,\delta_{(\ell-n) j}\, S^*_{\varepsilon,ij}, & \ell = n+1, \cdots, 2n-1,\end{cases}$$
$$H^{-1} = -\begin{bmatrix}(\mathrm{diag}(\mu))^{-1} + (\mathrm{diag}(\mu))^{-1}\bar S^* K^{-1}\bar S^{*\top}(\mathrm{diag}(\mu))^{-1} & -(\mathrm{diag}(\mu))^{-1}\bar S^* K^{-1} \\ -K^{-1}\bar S^{*\top}(\mathrm{diag}(\mu))^{-1} & K^{-1}\end{bmatrix},$$
and $K = \mathrm{diag}(\bar\nu) - \bar S^{*\top}(\mathrm{diag}(\mu))^{-1}\bar S^*$, $\bar\nu = \nu_{1:n-1}$, $\bar S^* = S^*_{\varepsilon,\,1:n,\,1:n-1}$.

Proof. This result follows directly by combining Sinkhorn's scaling theorem and Theorem 3 in Xie et al. (2020).
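The Sinkhorn factorization used in the proof of Theorem 2 can be illustrated in a few lines of numpy; the $\varepsilon$, problem size, and iteration count below are arbitrary choices for the sketch:

```python
import numpy as np

# Illustration of the factorization in the proof of Theorem 2: the entropic
# OT plan has the form
#   S*_eps = diag(e^{xi*/eps}) e^{-C/eps} diag(e^{zeta*/eps}),
# i.e., a diagonal rescaling of the Gibbs kernel K = e^{-C/eps}.
rng = np.random.default_rng(1)
n, eps = 6, 0.3
C = rng.random((n, n))
mu = nu = np.ones(n) / n

K = np.exp(-C / eps)
a = np.ones(n)
for _ in range(2000):            # Sinkhorn fixed-point iterations
    b = nu / (K.T @ a)
    a = mu / (K @ b)
S = np.diag(a) @ K @ np.diag(b)  # xi* = eps*log(a), zeta* = eps*log(b)

# Both marginal constraints hold at convergence.
print(np.allclose(S.sum(axis=1), mu), np.allclose(S.sum(axis=0), nu))
```

Since `a` and `b` are smooth functions of `C` at the fixed point, the plan `S` inherits differentiability from the cost, which is the content of Theorem 2.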

D ALGORITHM OF THE FORWARD PASS FOR ROBOT-ROBUST

For better numerical stability, in practice we add two more regularization terms:
$$S^*_r(w), \tilde\mu^*, \tilde\nu^* = \arg\min_{S \in \Pi(\tilde\mu, \tilde\nu),\ \tilde\mu, \tilde\nu \in \Delta_n} \langle C(w), S\rangle + \varepsilon H(S) + \varepsilon_1 h(\tilde\mu) + \varepsilon_2 h(\tilde\nu), \quad \text{s.t.}\ \mathcal{F}(\tilde\mu, \mu) \le \rho_1,\ \mathcal{F}(\tilde\nu, \nu) \le \rho_2,$$
where $h(\tilde\mu) = \sum_i \tilde\mu_i \log\tilde\mu_i$ is the entropy function for vectors. This prevents the entries of $\tilde\mu$ and $\tilde\nu$ from shrinking to zero when updated by gradient descent. We remark that since we have the entropy term $H(S)$, the entries of $S$ will not be exactly zero. Furthermore, we have $\tilde\mu = S 1_m$ and $\tilde\nu = S^\top 1_n$. Therefore, theoretically the entries of $\tilde\mu$ and $\tilde\nu$ will not be zero; we add the two extra entropy terms only for numerical reasons. The detailed algorithm is in Algorithm 1. Although the algorithm is not guaranteed to converge to a feasible solution, in practice it usually converges to a good solution (Wang et al., 2015).

Algorithm 1 Solving $S^*_r$ for robust matching
Require: $C \in \mathbb{R}^{m\times n}$, $\mu$, $\nu$, $\varepsilon$, $\varepsilon_1$, $\varepsilon_2$, $\rho_1$, $\rho_2$, $L$, $\eta$
  $G_{ij} = e^{-C_{ij}/\varepsilon}$
  $\tilde\mu = \mu$, $\tilde\nu = \nu$, $b = 1_n$
  for $l = 1, \cdots, L$ do
    $a = \tilde\mu/(Gb)$, $b = \tilde\nu/(G^\top a)$
    $\tilde\mu = \tilde\mu - \eta(\varepsilon \log a + \varepsilon_1 \log\tilde\mu)$, $\tilde\nu = \tilde\nu - \eta(\varepsilon \log b + \varepsilon_2 \log\tilde\nu)$
    $\tilde\mu = \max\{\tilde\mu, 0\}$, $\tilde\nu = \max\{\tilde\nu, 0\}$
    $\tilde\mu = \tilde\mu/(\tilde\mu^\top 1)$, $\tilde\nu = \tilde\nu/(\tilde\nu^\top 1)$
    if $\|\tilde\mu - \mu\|_2^2 > \rho_1$ then $\tilde\mu = \mu + \sqrt{\rho_1}\,\frac{\tilde\mu - \mu}{\|\tilde\mu - \mu\|_2}$ end if
    if $\|\tilde\nu - \nu\|_2^2 > \rho_2$ then $\tilde\nu = \nu + \sqrt{\rho_2}\,\frac{\tilde\nu - \nu}{\|\tilde\nu - \nu\|_2}$ end if
  end for
  $S = \mathrm{diag}(a)\, G\, \mathrm{diag}(b)$
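A rough numpy sketch of Algorithm 1 follows. This is not the authors' reference implementation; the step size, the $\varepsilon$'s, and the exact form of the gradient step on the relaxed marginals are assumptions based on the stationarity conditions in Appendix E:

```python
import numpy as np

def robust_matching_forward(C, mu, nu, eps=0.1, eps1=1e-3, eps2=1e-3,
                            rho1=0.1, rho2=0.1, L=200, eta=0.01):
    """Sketch of Algorithm 1: alternate Sinkhorn scaling updates for (a, b)
    with projected gradient steps on the relaxed marginals (mu_t, nu_t)."""
    n, m = C.shape
    G = np.exp(-C / eps)
    mu_t, nu_t = mu.copy(), nu.copy()
    a, b = np.ones(n), np.ones(m)
    for _ in range(L):
        a = mu_t / (G @ b)
        b = nu_t / (G.T @ a)
        # Gradient steps on the relaxed marginals (cf. p = q = 0 in App. E;
        # note xi = eps*log(a), zeta = eps*log(b)).
        mu_t = mu_t - eta * (eps * np.log(a) + eps1 * np.log(mu_t))
        nu_t = nu_t - eta * (eps * np.log(b) + eps2 * np.log(nu_t))
        # Clip to the simplex, then project into the Euclidean balls
        # F(mu_t, mu) <= rho1 and F(nu_t, nu) <= rho2.
        mu_t = np.maximum(mu_t, 1e-12); nu_t = np.maximum(nu_t, 1e-12)
        mu_t = mu_t / mu_t.sum(); nu_t = nu_t / nu_t.sum()
        if np.sum((mu_t - mu) ** 2) > rho1:
            mu_t = mu + np.sqrt(rho1) * (mu_t - mu) / np.linalg.norm(mu_t - mu)
        if np.sum((nu_t - nu) ** 2) > rho2:
            nu_t = nu + np.sqrt(rho2) * (nu_t - nu) / np.linalg.norm(nu_t - nu)
    S = np.diag(a) @ G @ np.diag(b)
    return S, mu_t, nu_t
```

The returned plan is nonnegative by construction, and the two relaxed marginals stay on the simplex within the prescribed balls around the inputs.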

E ALGORITHM OF THE BACKWARD PASS FOR ROBOT-ROBUST

Since the derivation is tedious, we first summarize its outline, then provide the details.

E.1 SUMMARY

Given $\tilde\mu^*$, $\tilde\nu^*$, $S^*_r(w)$, we compute the Jacobian matrix $dS^*_r(w)/dw$ using implicit differentiation and differentiable programming techniques. Specifically, the Lagrangian function of Problem (16) is
$$\mathcal{L} = \langle C, S\rangle + \varepsilon H(S) + \varepsilon_1 h(\tilde\mu) + \varepsilon_2 h(\tilde\nu) - \xi^\top(S 1_m - \tilde\mu) - \zeta^\top(S^\top 1_n - \tilde\nu) + \lambda_1(\tilde\mu^\top 1_n - 1) + \lambda_2(\tilde\nu^\top 1_m - 1) + \lambda_3(\|\tilde\mu - \mu\|_2^2 - \rho_1) + \lambda_4(\|\tilde\nu - \nu\|_2^2 - \rho_2),$$
where $\xi$ and $\zeta$ are dual variables. The KKT stationarity condition implies that the optimal solution $S^*_r$ can be formulated using the optimal dual variables $\xi^*$ and $\zeta^*$ as
$$S^*_r = \mathrm{diag}(e^{\xi^*/\varepsilon})\, e^{-C/\varepsilon}\, \mathrm{diag}(e^{\zeta^*/\varepsilon}). \quad (17)$$
By the chain rule, we have
$$\frac{dS^*_r}{dw} = \frac{dS^*_r}{dC}\frac{dC}{dw} = \left(\frac{\partial S^*_r}{\partial C} + \frac{\partial S^*_r}{\partial \xi^*}\frac{d\xi^*}{dC} + \frac{\partial S^*_r}{\partial \zeta^*}\frac{d\zeta^*}{dC}\right)\frac{dC}{dw}. \quad (18)$$
Therefore, we can compute $dS^*_r(w)/dw$ once we obtain $\frac{d\xi^*}{dC}$ and $\frac{d\zeta^*}{dC}$. Denote $r^* = [(\xi^*)^\top, (\zeta^*)^\top, (\tilde\mu^*)^\top, (\tilde\nu^*)^\top, \lambda_1^*, \lambda_2^*, \lambda_3^*, \lambda_4^*]^\top$, write the Lagrangian evaluated at the optimal solutions as $\mathcal{L} = \mathcal{L}(\xi^*, \zeta^*, \tilde\mu^*, \tilde\nu^*, \lambda_1^*, \lambda_2^*, \lambda_3^*, \lambda_4^*; C)$, and let $\phi(r^*; C) = \partial\mathcal{L}(r^*; C)/\partial r^*$. The KKT conditions give $\phi(r^*; C) \equiv 0$; differentiating this identity with respect to $C$ and rearranging terms, we obtain
$$\frac{dr^*}{dC} = -\left(\frac{\partial\phi(r^*; C)}{\partial r^*}\right)^{-1}\frac{\partial\phi(r^*; C)}{\partial C}. \quad (19)$$
Combining (17), (18), and (19), we can then obtain $dS^*_r(w)/dw$.
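The implicit-differentiation identity (19) can be illustrated on a one-dimensional toy problem, with the arg min solved by root finding; the toy objective below is hypothetical and only meant to show the mechanics:

```python
import numpy as np
from scipy.optimize import brentq

# Toy illustration of identity (19): for r*(c) = argmin_r g(r, c),
# stationarity phi(r*, c) = dg/dr = 0 gives
#   dr*/dc = -(d phi / d r)^{-1} (d phi / d c).
# Here g(r, c) = (r - c)^2 + r^4, so phi(r, c) = 2(r - c) + 4 r^3,
# which is strictly increasing in r (unique root).
def solve(c):
    return brentq(lambda r: 2 * (r - c) + 4 * r ** 3, -10.0, 10.0)

c = 1.3
r = solve(c)
implicit = 2.0 / (2.0 + 12.0 * r ** 2)            # -(dphi/dr)^{-1} dphi/dc
fd = (solve(c + 1e-6) - solve(c - 1e-6)) / 2e-6   # central finite difference

print(implicit, fd)
```

The two derivatives agree to several digits: no differentiation through the solver iterations is needed, only the stationarity condition at the solution, which is exactly how the backward pass of ROBOT-robust avoids unrolling.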

E.2 DETAILS

Now we provide the detailed derivation for computing $dS^*_r/dw$. Since $S^*_r$ is the optimal solution of an optimization problem, we can invoke the implicit function theorem to derive a closed-form expression for the gradient. Specifically, we adopt $\mathcal{F}(\tilde\mu, \mu) = \sum_i (\tilde\mu_i - \mu_i)^2$, and rewrite the optimization problem as
$$\min_{\tilde\mu, \tilde\nu, S}\ \langle C, S\rangle + \varepsilon\sum_{ij} S_{ij}(\log S_{ij} - 1) + \varepsilon_1\sum_i \tilde\mu_i(\log\tilde\mu_i - 1) + \varepsilon_2\sum_j \tilde\nu_j(\log\tilde\nu_j - 1),$$
$$\text{s.t.}\ \sum_j S_{ij} = \tilde\mu_i,\ \sum_i S_{ij} = \tilde\nu_j,\ \sum_i \tilde\mu_i = 1,\ \sum_j \tilde\nu_j = 1,\ \sum_i(\tilde\mu_i - \mu_i)^2 \le \rho_1,\ \sum_j(\tilde\nu_j - \nu_j)^2 \le \rho_2.$$
The Lagrangian of the above problem is
$$\mathcal{L}(C, S, \tilde\mu, \tilde\nu, \xi, \zeta, \lambda_1, \lambda_2, \lambda_3, \lambda_4) = \langle C, S\rangle + \varepsilon\sum_{ij} S_{ij}(\log S_{ij}-1) + \varepsilon_1\sum_i \tilde\mu_i(\log\tilde\mu_i-1) + \varepsilon_2\sum_j \tilde\nu_j(\log\tilde\nu_j-1) - \xi^\top(S 1_m - \tilde\mu) - \zeta^\top(S^\top 1_n - \tilde\nu) + \lambda_1\Big(\sum_i \tilde\mu_i - 1\Big) + \lambda_2\Big(\sum_j \tilde\nu_j - 1\Big) + \lambda_3\Big(\sum_i(\tilde\mu_i-\mu_i)^2 - \rho_1\Big) + \lambda_4\Big(\sum_j(\tilde\nu_j-\nu_j)^2 - \rho_2\Big).$$
It is easy to see that Slater's condition holds. Denote $\mathcal{L}^* = \mathcal{L}(C, S^*_r, \tilde\mu^*, \tilde\nu^*, \xi^*, \zeta^*, \lambda_1^*, \lambda_2^*, \lambda_3^*, \lambda_4^*)$. Following the KKT conditions,
$$\frac{d\mathcal{L}^*}{dS^*_{r,ij}} = C_{ij} + \varepsilon\log S^*_{r,ij} - \xi^*_i - \zeta^*_j = 0.$$
Denote $F_{ij} = e^{(\xi_i + \zeta_j - C_{ij})/\varepsilon}$, and
$$\phi = \frac{d\mathcal{L}}{d\xi} = \tilde\mu - F 1_m, \qquad \psi = \frac{d\mathcal{L}}{d\zeta} = \tilde\nu - F^\top 1_n,$$
$$p = \frac{d\mathcal{L}}{d\tilde\mu} = \xi + \lambda_1 1_n + 2\lambda_3(\tilde\mu - \mu) + \varepsilon_1\log\tilde\mu, \qquad q = \frac{d\mathcal{L}}{d\tilde\nu} = \zeta + \lambda_2 1_m + 2\lambda_4(\tilde\nu - \nu) + \varepsilon_2\log\tilde\nu,$$
$$\chi_1 = \frac{d\mathcal{L}}{d\lambda_1} = \tilde\mu^\top 1_n - 1, \quad \chi_2 = \frac{d\mathcal{L}}{d\lambda_2} = \tilde\nu^\top 1_m - 1, \quad \chi_3 = \lambda_3(\|\tilde\mu - \mu\|_2^2 - \rho_1), \quad \chi_4 = \lambda_4(\|\tilde\nu - \nu\|_2^2 - \rho_2).$$
Denote $\chi = [\chi_1, \chi_2, \chi_3, \chi_4]$ and $\lambda = [\lambda_1, \lambda_2, \lambda_3, \lambda_4]$. Following the KKT conditions, we have $\phi = 0$, $\psi = 0$, $p = 0$, $q = 0$, $\chi = 0$ at the optimal solutions. Differentiating these identities with respect to $C$, we have
$$\begin{bmatrix}\frac{d\xi^*}{dC} \\ \frac{d\zeta^*}{dC} \\ \frac{d\tilde\mu^*}{dC} \\ \frac{d\tilde\nu^*}{dC} \\ \frac{d\lambda^*}{dC}\end{bmatrix} = -\begin{bmatrix}\frac{\partial\phi}{\partial\xi^*} & \frac{\partial\phi}{\partial\zeta^*} & \frac{\partial\phi}{\partial\tilde\mu^*} & \frac{\partial\phi}{\partial\tilde\nu^*} & \frac{\partial\phi}{\partial\lambda^*} \\ \frac{\partial\psi}{\partial\xi^*} & \frac{\partial\psi}{\partial\zeta^*} & \frac{\partial\psi}{\partial\tilde\mu^*} & \frac{\partial\psi}{\partial\tilde\nu^*} & \frac{\partial\psi}{\partial\lambda^*} \\ \frac{\partial p}{\partial\xi^*} & \frac{\partial p}{\partial\zeta^*} & \frac{\partial p}{\partial\tilde\mu^*} & \frac{\partial p}{\partial\tilde\nu^*} & \frac{\partial p}{\partial\lambda^*} \\ \frac{\partial q}{\partial\xi^*} & \frac{\partial q}{\partial\zeta^*} & \frac{\partial q}{\partial\tilde\mu^*} & \frac{\partial q}{\partial\tilde\nu^*} & \frac{\partial q}{\partial\lambda^*} \\ \frac{\partial\chi}{\partial\xi^*} & \frac{\partial\chi}{\partial\zeta^*} & \frac{\partial\chi}{\partial\tilde\mu^*} & \frac{\partial\chi}{\partial\tilde\nu^*} & \frac{\partial\chi}{\partial\lambda^*}\end{bmatrix}^{-1}\begin{bmatrix}\frac{\partial\phi}{\partial C} \\ \frac{\partial\psi}{\partial C} \\ \frac{\partial p}{\partial C} \\ \frac{\partial q}{\partial C} \\ \frac{\partial\chi}{\partial C}\end{bmatrix}.$$
After some derivation (noting that $p$, $q$, and $\chi$ do not depend on $C$ explicitly, so their partial derivatives with respect to $C$ vanish), the system above becomes
$$\begin{bmatrix}\frac{d\xi^*}{dC} \\ \frac{d\zeta^*}{dC} \\ \frac{d\tilde\mu^*}{dC} \\ \frac{d\tilde\nu^*}{dC} \\ \frac{d\lambda^*}{dC}\end{bmatrix} = -\begin{bmatrix}A & \tilde B \\ \tilde C & D\end{bmatrix}^{-1}\begin{bmatrix}\frac{\partial\phi}{\partial C} \\ \frac{\partial\psi}{\partial C} \\ 0 \\ 0 \\ 0\end{bmatrix},$$
where
$$\frac{\partial\phi_h}{\partial C_{ij}} = \frac{1}{\varepsilon}\delta_{hi} S_{ij}, \quad \forall h = 1, \cdots, n,\ i = 1, \cdots, n,\ j = 1, \cdots, m,$$
$$\frac{\partial\psi_\ell}{\partial C_{ij}} = \frac{1}{\varepsilon}\delta_{\ell j} S_{ij}, \quad \forall \ell = 1, \cdots, m-1,\ i = 1, \cdots, n,\ j = 1, \cdots, m.$$
To efficiently invert the system, we denote
$$A = \begin{bmatrix}-\frac{1}{\varepsilon}\mathrm{diag}(\tilde\mu) & -\frac{1}{\varepsilon}S^*_r & I_n & 0 \\ -\frac{1}{\varepsilon}(S^*_r)^\top & -\frac{1}{\varepsilon}\mathrm{diag}(\tilde\nu) & 0 & I_m \\ I_n & 0 & 2\lambda_3 I_n + \mathrm{diag}(\frac{\varepsilon_1}{\tilde\mu}) & 0 \\ 0 & I_m & 0 & 2\lambda_4 I_m + \mathrm{diag}(\frac{\varepsilon_2}{\tilde\nu})\end{bmatrix},$$
$$B_1 = \begin{bmatrix}1_n & 0 & 2(\tilde\mu-\mu) & 0 \\ 0 & 1_m & 0 & 2(\tilde\nu-\nu)\end{bmatrix}, \quad C_1 = \begin{bmatrix}1_n^\top & 0 \\ 0 & 1_m^\top \\ 2\lambda_3(\tilde\mu-\mu)^\top & 0 \\ 0 & 2\lambda_4(\tilde\nu-\nu)^\top\end{bmatrix},$$
$$D = \begin{bmatrix}0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \|\tilde\mu-\mu\|_2^2 - \rho_1 & 0 \\ 0 & 0 & 0 & \|\tilde\nu-\nu\|_2^2 - \rho_2\end{bmatrix}.$$
We first compute $A^{-1}$ using the rules for inverting a block matrix:
$$A^{-1} = \begin{bmatrix}K & -KL \\ -LK & L + LKL\end{bmatrix} =: \begin{bmatrix}A_1 & A_2 \\ A_3 & A_4\end{bmatrix},$$
where
$$L = \begin{bmatrix}2\lambda_3 I_n + \mathrm{diag}(\frac{\varepsilon_1}{\tilde\mu}) & 0 \\ 0 & 2\lambda_4 I_m + \mathrm{diag}(\frac{\varepsilon_2}{\tilde\nu})\end{bmatrix}^{-1}, \quad K = \left(\frac{1}{\varepsilon}\begin{bmatrix}\mathrm{diag}(\tilde\mu) & S^*_r \\ (S^*_r)^\top & \mathrm{diag}(\tilde\nu)\end{bmatrix} + L\right)^{-1}.$$
Then, using the rules for inverting a block matrix again, we have
$$\begin{bmatrix}\frac{d\xi^*}{dC} \\ \frac{d\zeta^*}{dC}\end{bmatrix} = \left(A_1 + A_2 B_1 (D - C_1 A_4 B_1)^{-1} C_1 A_3\right)\begin{bmatrix}\frac{\partial\phi}{\partial C} \\ \frac{\partial\psi}{\partial C}\end{bmatrix}.$$
Therefore, the bottleneck of the computation is the inversion step in computing $K$. Since $L$ is a diagonal matrix, we can further lower the computational cost by applying the rules for inverting a block matrix once more. The values of $\lambda_3$ and $\lambda_4$ can be estimated from the facts $p = 0$ and $q = 0$. We detail the algorithm in Algorithm 2.
  $L = \mathrm{diag}([(2\lambda_3 1_n + \frac{\varepsilon_1}{\tilde\mu})^{-1}, (2\lambda_4 1_m + \frac{\varepsilon_2}{\tilde\nu})^{-1}])$
  $A_1 = [H_1, H_2; H_3, H_4]$
  $A_2 = -A_1 \cdot L$
  $A_3 = A_2^\top$
  $A_4 = L + L \cdot A_1 \cdot L$
  $E = A_1 + A_2 \cdot B_1 (D - C_1 \cdot A_4 \cdot B_1)^{-1} C_1 \cdot A_3$, where $B_1$, $C_1$, $D$ are defined above
  $[J_1, J_2; J_3, J_4] = E$, where $J_1 \in \mathbb{R}^{n\times n}$, $J_2 \in \mathbb{R}^{n\times m}$, $J_3 \in \mathbb{R}^{m\times n}$, $J_4 \in \mathbb{R}^{m\times m}$
  $[\frac{d\xi^*}{dC}]_{hij} \leftarrow [J_1]_{hi} S_{ij} + [J_2]_{hj} S_{ij}$
  $[\frac{d\zeta^*}{dC}]_{\ell ij} \leftarrow [J_3]_{\ell i} S_{ij}$

For better visualization, we only include this baseline in one experiment. Furthermore, we adopt two new baselines: Sliced-GW (Vayer et al., 2019) and Sinkhorn-GW (Xu et al., 2019a), which can be used to align distributions and point sets.

Results. We visualize the fitting error of the regression models in Figure 11. We can see that ROBOT outperforms all the baselines except Oracle. Moreover, our model can beat the Oracle model when the dimension is low or when the noise is large.
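The block-matrix inversion rule used repeatedly in Algorithm 2 can be sanity-checked numerically; below is a minimal sketch of the Schur-complement identity on arbitrary random blocks (the sizes and matrices are hypothetical):

```python
import numpy as np

# Sanity check of the block-inversion rule: for M = [[A, B], [C, D]] with
# Schur complement S = D - C A^{-1} B, the top-left block of M^{-1} equals
# A^{-1} + A^{-1} B S^{-1} C A^{-1}.
rng = np.random.default_rng(2)
n, m = 4, 3
A = rng.random((n, n)) + n * np.eye(n)   # diagonally dominant => invertible
B = rng.random((n, m))
C = rng.random((m, n))
D = rng.random((m, m)) + m * np.eye(m)

M = np.block([[A, B], [C, D]])
Minv = np.linalg.inv(M)

Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                      # Schur complement of A in M
top_left = Ainv + Ainv @ B @ np.linalg.inv(S) @ C @ Ainv

print(np.allclose(Minv[:n, :n], top_left))  # True
```

This is the same rule that reduces the backward pass to inverting the small matrix $K$ plus cheap diagonal operations.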



This is a common practice in integer programming (Marcus & Ree, 1959).
The bilevel formulation can be viewed as a Stackelberg game.
The idea is inspired by the marginal relaxation of optimal transport, first independently proposed by Kondratyev et al. (2016) and Chizat et al. (2018a), and later developed by Chizat et al. (2018c); Liero et al. (2018). Chizat et al. (2018b) share the same formulation as ours.
Here we measure the deviation using the Euclidean distance; more detailed discussions can be found in Appendix F.
The initial weights of f are obtained from https://github.com/foolwood/DaSiamRPN.
We remark that running one iteration sometimes cannot converge.



Figure 1: Illustrative example of exact (L) and robust (R) alignments. The robust alignment can drop potential outliers and only match data points close to each other.

4 EXPERIMENT

Figure 2: Unlabeled sensing. Results are the mean over 10 runs. SNR $= \|w\|_2^2/\rho^2_{\rm noise}$ is the signal-to-noise ratio.

Figure 3: Nonlinear regression. We use $n = 1000$, $d = 2$, $e = 3$, $\rho^2_{\rm noise} = 0.1$ as defaults.

Figure 4: Relative error of different methods.

Figure 5: One frame in MOT20 with detected bounding boxes in yellow.

Figure 6: Results of different initialization of AM and ROBOT.

Figure 7: Comparisons to AD. (a) Convergence under different numbers of Sinkhorn iterations of AD. (b) Time comparison. (c) Memory comparison.

ROBOT v.s. Automatic Differentiation (AD). Our algorithm computes the Jacobian matrix directly based on the KKT conditions of the lower-level problem (11). An alternative approach to approximate the Jacobian is automatic differentiation through the Sinkhorn iterations used to update S when solving (11). As suggested by Figure 7(a), running the Sinkhorn iterations until convergence (200 Sinkhorn iterations) can lead to a better solution. However, in order to apply AD, we need to store all the intermediate updates of all the Sinkhorn iterations. This requires memory usage proportional to the number of iterations, which is not necessarily affordable. In contrast, applying our explicit expression for the backward pass is memory-efficient. Moreover, we also observe that AD is much more time-consuming than our method. The timing performance and memory usage are shown in Figure 7(b)(c), where we set n = 1000.


Figure 8: Expected correspondence in EM.

Figure 8 illustrates an example of the converged correspondence, where we adopt $n = 30$, $o = e = 1$, $d = 0$. For this reason, we initialize EM with good initial points, either by RS or AM, throughout all experiments.

Related works with additional constraints. There is another line of research that improves computational efficiency by solving variants of RWOC with additional constraints. Specifically, Haghighatshoar & Caire (2017); Rigollet & Weed (2018) assume an isotonic function (note that such an assumption may not hold in practice), and Shi et al. (2018); Slawski & Ben-David (2019); Slawski et al. (2019a;b); Varol & Nejatbakhsh (

and recall that $(\xi^*, \zeta^*) = \arg\max_{\xi, \zeta} \mathcal{L}(\xi, \zeta; \mu, \nu, C)$. The differentiability of $(\xi^*, \zeta^*)$ is proved using the implicit function theorem, and follows from the differentiability and strict concavity of $\mathcal{L}$ in $(\xi, \zeta)$.

and $\phi(r^*; C) = \partial\mathcal{L}(r^*; C)/\partial r^*$. At the optimal dual variable $r^*$, the KKT conditions immediately yield $\phi(r^*; C) \equiv 0$. By the chain rule, we have
$$\frac{d\phi(r^*; C)}{dC} = \frac{\partial\phi(r^*; C)}{\partial C} + \frac{\partial\phi(r^*; C)}{\partial r^*}\frac{dr^*}{dC} = 0.$$

Algorithm 2 Computing the gradient for $w$
Require: $C \in \mathbb{R}^{m\times n}$, $\mu$, $\nu$, $\varepsilon$, $\frac{dC}{dw}$
  Run the forward pass to get $S = S^*_r$, $\tilde\mu$, $\tilde\nu$, $\xi$, $\zeta$
  $x_1 = \sum_{i=1}^{\lfloor n/2\rfloor}(\tilde\mu_i - \mu_i)$, $x_2 = \sum_{i=\lfloor n/2\rfloor}^{n}(\tilde\mu_i - \mu_i)$, $b_1 = -\sum_{i=1}^{\lfloor n/2\rfloor}\xi_i$, $b_2 = -\sum_{i=\lfloor n/2\rfloor}^{n}\xi_i$
  $[\lambda_1, \lambda_3]^\top = [\lfloor n/2\rfloor, x_1; n - \lfloor n/2\rfloor, x_2]^{-1}[b_1, b_2]^\top$
  $x_1 = \sum_{j=1}^{\lfloor m/2\rfloor}(\tilde\nu_j - \nu_j)$, $x_2 = \sum_{j=\lfloor m/2\rfloor}^{m}(\tilde\nu_j - \nu_j)$, $b_1 = -\sum_{j=1}^{\lfloor m/2\rfloor}\zeta_j$, $b_2 = -\sum_{j=\lfloor m/2\rfloor}^{m}\zeta_j$
  $[\lambda_2, \lambda_4]^\top = [\lfloor m/2\rfloor, x_1; m - \lfloor m/2\rfloor, x_2]^{-1}[b_1, b_2]^\top$
  $\tilde\mu = \tilde\mu + (2\lambda_3 1_n + \frac{\varepsilon_1}{\tilde\mu})^{-1}$, $\tilde\nu = \tilde\nu + (2\lambda_4 1_m + \frac{\varepsilon_2}{\tilde\nu})^{-1}$
  $\bar\nu = \tilde\nu[{:}{-}1]$, $\bar S = S[:, {:}{-}1]$
  $K \leftarrow \mathrm{diag}(\bar\nu) - \bar S^\top(\mathrm{diag}(\tilde\mu))^{-1}\bar S$
  $H_1 \leftarrow (\mathrm{diag}(\tilde\mu))^{-1} + (\mathrm{diag}(\tilde\mu))^{-1}\bar S K^{-1}\bar S^\top(\mathrm{diag}(\tilde\mu))^{-1}$
  $H_2 \leftarrow -(\mathrm{diag}(\tilde\mu))^{-1}\bar S K^{-1}$
  $H_3 \leftarrow H_2^\top$
  $H_4 \leftarrow K^{-1}$
  Pad $H_2$ to be $[n, m]$ with value 0; pad $H_3$ to be $[m, n]$ with value 0; pad $H_4$ to be $[m, m]$ with value 0

Figure 11: Linear regression. We use $n = 1000$, $d = 2$, $e = 3$, $\rho^2_{\rm noise} = 0.1$ as defaults.

4. Stochastic EM (Abid & Zou, 2018): a stochastic EM approach to recover the permutation. 5. Robust Regression (RR; Slawski & Ben-David (2019); Slawski et al. (2019a)): a two-stage block coordinate descent approach to discard outliers and fit regression models. 6. Random Sample (RS; Varol & Nejatbakhsh (2019)): a random sample consensus (RANSAC)-style approach.

Experiment results on MOT.

  $\ \ + [J_4]_{\ell j} S_{ij}$
  Pad $\frac{d\zeta^*}{dC}$ to be $[m, n, m]$ with value 0
  $[\frac{dL}{dC}]_{ij} \leftarrow \frac{1}{\varepsilon}\big({-C_{ij} S_{ij}} + \sum_{n,m} C_{nm} S_{nm}[\frac{da^*}{dC}]_{nij} + \sum_{n,m} C_{nm} S_{nm}[\frac{db^*}{dC}]_{mij}\big)$

ACKNOWLEDGEMENT

This work is partially supported by NSF IIS-2008334. Hongteng Xu is supported in part by the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098) and the National Natural Science Foundation of China (No. 61832017). Xiaojing Ye is partially supported by NSF DMS-1925263. Hongyuan Zha is supported in part by a grant from the Shenzhen Institute of Artificial Intelligence and Robotics for Society. We also appreciate the fruitful discussions with Bo Dai and Yan Li.

F DIFFERENT FORMS OF MARGINAL RELAXATION

In this paper we adopt $\mathcal{F}$ to be the Euclidean distance, because this choice provides an OT plan that fits our intuition: the data points with significantly larger transportation cost should not be considered. Figure 9 shows an illustration. Here, the input distributions are the empirical distributions of the scalars on the left and the bottom. Notice that there are three support points in $\mu$ that are far away from the others, i.e., 10.72, 10.89, 10.96. In Figure 9(a), the optimal solution $S^*_r$ automatically ignores them, matching only the rest of the scalars. One alternative choice of $\mathcal{F}$ is the Kullback-Leibler (KL) divergence (Chizat et al., 2018b), whose resulting formulation admits an efficient algorithm for the forward pass and differentiability for the backward pass. We do not adopt it because the OT plan generated by this choice does not fit our intuition: as shown in Figure 9(b), the OT plan tends to ignore the points that are away from the mean, even with very small $\rho_1$ and $\rho_2$. For both figures, we adopt $\varepsilon = 10^{-5}$.

G MORE ON EXPERIMENTS

G.1 UNLABELED SENSING

We now provide more training details for the experiments in Section 4.1. Here, AM and ROBOT are trained with batch size 500 and learning rate $10^{-4}$ for 2,000 iterations. For the Sinkhorn algorithm in ROBOT we set $\varepsilon = 10^{-4}$. We run RS for $2\times 10^5$ iterations with the inlier threshold set to $10^{-2}$. Other hyper-parameter settings for the baselines follow the defaults of their corresponding papers.

G.2 NONLINEAR REGRESSION

For the nonlinear regression experiment in Section 4.2, ROBOT and ROBOT-robust are trained with learning rate $10^{-4}$ for 80 iterations. For $n = 100, 200, 500, 1000, 2000$, we set batch size 10, 30, 50, 100, 300, respectively. We set $\varepsilon = 10^{-4}$ for the Sinkhorn algorithm in ROBOT. For Oracle and LS, we fit an ordinary regression model and ensure convergence, i.e., learning rate $5\times 10^{-2}$ for 100 iterations.

G.3 FLOW CYTOMETRY

We provide more details for the flow cytometry experiment in Section 4.3. In the FC setting, ROBOT is trained with batch size 1260 and learning rate $10^{-4}$ for 80 iterations. In the GFC setting, ROBOT is trained with batch size 1260 and learning rate $6\times 10^{-4}$ for 60 iterations. We set $\varepsilon = 10^{-4}$ for the Sinkhorn algorithm in ROBOT. Other hyper-parameter settings for the baselines follow the defaults of their corresponding papers. EM is initialized by AM.

G.4 MULTI-OBJECT TRACKING

For the MOT experiments in Section 4.4, the models reported on MOT17 (train) and MOT17 (dev) are trained on MOT17 (train), and the models reported on MOT20 (train) and MOT20 (dev) are trained on MOT20 (train). Each model is trained for 1 epoch. We adopt the Adam optimizer with learning rate $10^{-5}$, $\varepsilon = 10^{-4}$, and $\eta = 10^{-3}$. To track the birth and death of the tracks, we adapt the inference code of Xu et al. (2019b).

G.5 COMBINATION WITH RS

As shown in Table 2, although RS cannot perform well by itself, retraining the output of RS using our algorithms increases the performance by a large margin. To show that combining RS and ROBOT can achieve better results than RS alone, we compare the following two cases: i) subsample $2\times 10^5$ times using RS; ii) subsample $10^5$ times using RS, followed by ROBOT for 50 training steps. The results are shown in Table 2. For a larger permutation proportion, RS alone cannot perform as well as the RS+ROBOT combination. Here, we have 10 runs for each proportion. We adopt SNR $= 100$, $d = 5$ for the data, and $\varepsilon = 10^{-4}$, learning rate $10^{-4}$ for ROBOT training.

G.6 THE EFFECT OF ρ 1 AND ρ 2

We visualize $S^*_r$ computed from the robust optimal transport problem in Figure 10. The two input distributions are Unif(0, 2) and Unif(0, 1). We can see that with large enough $\rho_1$ and $\rho_2$, Unif(0, 1) is aligned with the first half of Unif(0, 2).

G.7 COMPARISON OF RESIDUALS IN LINEAR REGRESSION

Settings. We generate $n$ data points $\{(y_i, [x_i, z_i])\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $z_i \in \mathbb{R}^e$. We first generate $x_i \sim N(0_d, I_d)$, $z_i \sim N(0_e, I_e)$, $w \sim N(0_{d+e}, I_{d+e})$, and $\varepsilon_i \sim N(0, \rho^2_{\rm noise})$. Then we compute $y_i = f([x_i, z_i]; w) + \varepsilon_i$. Next, we randomly permute the order of $\{z_i\}$ so that we lose the data correspondence. Here, $D_1 = \{(x_i, y_i)\}$ and $D_2 = \{z_j\}$ mimic two parts of data collected from two separate platforms. We adopt a linear model $f(x; w) = x^\top w$. To evaluate model performance, we use error $= \sum_i (\hat y_i - y_i)^2 / \sum_i (y_i - \bar y)^2$, where $\hat y_i$ is the predicted label and $\bar y$ is the mean of $\{y_i\}$.

Baselines. We use Oracle, LS, and Stochastic-EM as baselines. Notice that without a proper initialization, Stochastic-EM performs well in partially permuted cases, but not in fully shuffled cases.
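The data-generation protocol and error metric above can be sketched in a few lines; the seed and the oracle least-squares fit below are illustrative, not the paper's experiment code:

```python
import numpy as np

# Sketch of the linear-case data generation described above.
rng = np.random.default_rng(3)                 # illustrative seed
n, d, e, rho_noise2 = 1000, 2, 3, 0.1

x = rng.normal(size=(n, d))
z = rng.normal(size=(n, e))
w = rng.normal(size=d + e)
noise = rng.normal(scale=np.sqrt(rho_noise2), size=n)
xz = np.concatenate([x, z], axis=1)
y = xz @ w + noise

# Shuffle z to destroy the correspondence between (x, y) and z.
perm = rng.permutation(n)
z_shuffled = z[perm]

def rel_error(y_hat, y):
    """error = sum_i (y_hat_i - y_i)^2 / sum_i (y_i - y_bar)^2"""
    return np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2)

# Oracle least squares on the un-shuffled data; its error is roughly the
# noise-to-signal ratio rho_noise^2 / Var(y).
w_hat = np.linalg.lstsq(xz, y, rcond=None)[0]
err = rel_error(xz @ w_hat, y)
print(err)
```

Any method that only sees $D_1 = \{(x_i, y_i)\}$ and the shuffled $D_2 = \{z_j\}$ must recover both the correspondence and $w$, which is what the comparisons in Figure 11 evaluate.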

