SINKHORN DISCREPANCY FOR COUNTERFACTUAL GENERALIZATION

Abstract

Estimating individual treatment effects from observational data is highly challenging due to treatment selection bias. Most prevalent approaches mitigate this issue by aligning the distributions of different treatment groups in a representation space. However, two critical problems have been circumvented: (1) mini-batch sampling effects (MSE), where the alignment easily fails at a mini-batch level due to outcome imbalance or outliers; (2) unobserved confounder effects (UCE), where unobserved confounders corrupt the alignment. To tackle these problems, we propose a principled approach named Entire Space CounterFactual Regression (ESCFR), which performs distribution alignment with a generalized Sinkhorn discrepancy within a stochastic optimal transport framework. Based on this framework, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that the proposed ESCFR successfully tackles the treatment selection bias and achieves significantly better performance than state-of-the-art methods.

1. INTRODUCTION

Estimating the individual treatment effect (ITE) with randomized controlled trials is a common practice in causal inference, with wide applications in e-commerce (Betlei et al., 2021), education (Cordero et al., 2018), and health care (Schwab et al., 2020). For example, drug developers conduct clinical A/B tests to evaluate drug effects. Although randomized controlled trials are the gold standard for causal inference (Pearl & Mackenzie, 2018), such experiments are often prohibitively expensive to conduct. Hence, observational data, which can be acquired without intervention, has become a tempting shortcut; for example, drug developers tend to assess drug effects with post-marketing monitoring reports instead of clinical A/B trials. With the growing access to observational data, estimating ITE from observational data has attracted intense research interest.

Estimating ITE with observational data faces two main challenges: (1) missing counterfactuals, i.e., only one factual outcome out of all potential outcomes can be observed; (2) treatment selection bias, i.e., individuals have their own preferences for treatment selection, making units in different treatment groups heterogeneous. To handle missing counterfactuals, meta-learners (Künzel et al., 2019) decompose the ITE estimation task into solvable factual outcome estimation subproblems. However, the treatment selection bias makes it difficult to generalize factual outcome estimators trained within respective treatment groups to the entire population; consequently, the derived ITE estimator is biased. Beginning with counterfactual regression (Shalit et al., 2017) and its remarkable performance, most prevalent methods handle the selection bias by minimizing the distribution discrepancy between groups in the representation space (see Liuyi et al., 2018; Hassanpour & Greiner, 2020; Cheng et al., 2022).
However, two critical issues with these methods have long been neglected, which significantly impedes their handling of the treatment selection bias. The first is the mini-batch sampling effects (MSE). Specifically, current representation-based methods (Shalit et al., 2017; Liuyi et al., 2018) compute the distribution discrepancy within mini-batches instead of over the entire data space, making them vulnerable to bad sampling cases. For example, given two aligned distributions, a single mini-batch outlier in the sampled distribution makes the mini-batch discrepancy large and the training process noisy. The second is the unobserved confounder effects (UCE). Specifically, current approaches directly assume unconfoundedness (Ma et al., 2022), while unobserved confounders widely exist in real scenarios and bias the resulting estimators.

Contributions and outline. In this paper, we propose an effective ITE estimator based on optimal transport, Entire Space CounterFactual Regression (ESCFR), which tackles both the MSE and UCE issues with a generalized Sinkhorn discrepancy. Specifically, after the preliminaries in Section 2, we first reformulate the ITE estimation problem as a stochastic optimal transport problem in Section 3.1. We next showcase the MSE issue faced by existing approaches in Section 3.2 and propose a relaxed mass-preserving regularizer to mitigate it. We further investigate the UCE issue in Section 3.3 and propose a proximal factual outcome regularizer to address it. We finally formulate the architecture and learning objectives of ESCFR in Section 3.4, and report the experimental results in Section 4.

2. PRELIMINARIES

2.1 CAUSAL INFERENCE FROM OBSERVATIONAL DATA

This section formulates the basic definitions and models in observational causal inference. We first formalize the fundamental elements in Definition 2.1, following the general notation convention.

Definition 2.1. Let X be the random variable of covariates, with support X and distribution P(x); let R be the random variable of induced representations, with support R and distribution P(r); let Y be the random variable of outcomes, with support Y and distribution P(y); let T be the random variable of the treatment indicator, with support T = {0, 1} and distribution P(T).

Following the potential outcome framework (Rubin, 1974), an individual with covariates x has two potential outcomes: Y_1(x) if it is treated and Y_0(x) otherwise. The ground-truth individual treatment effect (ITE) is formulated as the expected difference of potential outcomes: τ(x) := E[Y_1 − Y_0 | x], where one of the two outcomes is always unobserved. To address such missing counterfactuals, the ITE estimation task is commonly decomposed into potential outcome estimation subproblems that are solvable with any supervised learning method (Künzel et al., 2019). For example, the T-learner models the factual outcome Y for units in the treated and untreated groups separately; the S-learner regards the treatment indicator T as one of the covariates X and models Y for all units simultaneously. The ITE estimate is then the difference between the estimated outcomes with T set to treated and to untreated.

Definition 2.2. Let ψ : X → R be a mapping from support X to R, i.e., ∀x ∈ X, ∃r = ψ(x) ∈ R. Let ϕ : R × T → Y be a mapping from support R × T to Y, i.e., it maps the representations and treatment indicator to the corresponding factual outcome. For example, Y_1 = ϕ_1(R) and Y_0 = ϕ_0(R), where we abbreviate ϕ(R, T = 1) and ϕ(R, T = 0) to ϕ_1(R) and ϕ_0(R), respectively, for brevity.
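As a concrete illustration of the two meta-learners, the sketch below fits a T-learner and an S-learner on synthetic data whose true effect is a constant 2.0. The closed-form least-squares helper `fit_linear` and all variable names are our own illustrative choices, not part of the paper; treatment is randomized here, so both learners recover the effect.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with a bias column (closed form)."""
    Xb = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.c_[Xq, np.ones(len(Xq))] @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
T = rng.integers(0, 2, size=2000)
# Synthetic potential outcomes: treatment adds a constant effect of 2.0.
Y0 = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=2000)
Y1 = Y0 + 2.0
Y = np.where(T == 1, Y1, Y0)          # only the factual outcome is observed

# T-learner: one outcome model per treatment group.
mu1 = fit_linear(X[T == 1], Y[T == 1])
mu0 = fit_linear(X[T == 0], Y[T == 0])
tau_t = mu1(X) - mu0(X)               # per-unit ITE estimates

# S-learner: a single model with T appended to the covariates.
mu = fit_linear(np.c_[X, T], Y)
tau_s = mu(np.c_[X, np.ones(len(X))]) - mu(np.c_[X, np.zeros(len(X))])

print(tau_t.mean(), tau_s.mean())     # both close to the true effect 2.0
```

Under treatment selection bias, the same estimators would be trained on shifted covariate distributions, which is exactly the failure mode the rest of the paper addresses.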
TARNet (Shalit et al., 2017) obtains better performance by absorbing the advantages of both the T-learner and the S-learner, consisting of a representation mapping ψ and an outcome mapping ϕ as defined in Definition 2.2. For an individual with covariates X, TARNet estimates ITE as:

τ̂_{ψ,ϕ}(X) = Ŷ_1 − Ŷ_0, where Ŷ_1 = ϕ_1(ψ(X)), Ŷ_0 = ϕ_0(ψ(X)),

where ψ is trained over all individuals, while ϕ_1 and ϕ_0 are trained over the treated and untreated groups, respectively, to minimize the factual error. Finally, the performance of ITE estimators is mainly evaluated with the precision in estimation of heterogeneous effect (PEHE):

ϵ_PEHE(ψ, ϕ) := ∫_X (τ̂_{ψ,ϕ}(x) − τ(x))² P(x) dx. (3)

However, as shown in Figure 1(a), the treatment selection bias causes a distribution shift of covariates across groups, which misleads ϕ_1 and ϕ_0 into overfitting their respective groups' properties and generalizing poorly to the entire population. Therefore, the ITE estimates produced by these methods are biased.
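The empirical counterpart of the PEHE in (3) replaces the integral with an average over units. A minimal sketch (the function name is ours; benchmarks often report the square root):

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Empirical counterpart of Eq. (3): mean squared error of ITE estimates.
    Its square root is what benchmark tables usually report."""
    return np.mean((tau_hat - tau_true) ** 2)

tau_true = np.array([2.0, 1.5, 0.0, -1.0])
tau_hat = np.array([1.8, 1.6, 0.3, -0.9])
print(pehe(tau_hat, tau_true), np.sqrt(pehe(tau_hat, tau_true)))
```

Note that computing it requires the ground-truth τ(x), which is only available on (semi-)synthetic benchmarks; hence the model-selection workaround discussed in Section 4.1.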

2.2. DISCRETE OPTIMAL TRANSPORT AND SINKHORN DIVERGENCE

Optimal transport (OT) instantiates distribution discrepancy as the minimum transport cost, which provides a grip for quantifying the treatment selection bias in Figure 1(a). Monge (1781) first formulated OT as finding an optimal mapping between two distributions. However, this formulation cannot guarantee the existence and uniqueness of solutions. Kantorovich (2006) proposed a more applicable formulation in Definition 2.3, which can be seen as a generalization of the Monge problem.

Definition 2.3. For empirical distributions α and β with n and m units, respectively, the Kantorovich problem aims to find a feasible plan π ∈ R^{n×m}_+ which transports α to β at minimum cost:

W(α, β) := min_{π ∈ Π(α,β)} ⟨D, π⟩, Π(α, β) := {π ∈ R^{n×m}_+ : π 1_m = a, π^T 1_n = b}, (4)

where W(α, β) ∈ R is the Wasserstein discrepancy between α and β; D ∈ R^{n×m}_+ is the unit-wise distance between α and β; a and b indicate the mass of units in α and β; and Π is the set of feasible transport plans, which enforces the mass-preserving constraint.

However, exact solutions (Bonneel et al., 2011) to (4) always come with high computational costs. As such, researchers would always add an entropic regularization to the Kantorovich problem:

W_ϵ(α, β) := ⟨D, π_ϵ⟩, π_ϵ := argmin_{π ∈ Π(α,β)} ⟨D, π⟩ − ϵH(π), H(π) := −∑_{i,j} π_{i,j}(log(π_{i,j}) − 1), (5)

making the problem ϵ-convex and solvable with the Sinkhorn algorithm (Cuturi, 2013). The Sinkhorn algorithm consists solely of matrix-vector products, making it well suited to GPU acceleration.
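A minimal NumPy sketch of the Sinkhorn iterations may clarify (5); it is a plain textbook implementation (the cost rescaling and iteration count are illustrative choices of ours), not the paper's code:

```python
import numpy as np

def sinkhorn(a, b, D, eps=0.1, n_iter=500):
    """Entropic OT (Cuturi, 2013): alternately rescale the Gibbs kernel so the
    plan's marginals match a and b; returns the plan and the cost <D, pi>."""
    K = np.exp(-D / eps)                 # Gibbs kernel of the cost matrix
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):              # matrix-vector products only
        u = a / (K @ v)
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]
    return pi, float(np.sum(pi * D))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(6, 2))    # e.g. treated-group representations
y = rng.normal(0.5, 1.0, size=(8, 2))    # e.g. untreated-group representations
D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
D = D / D.max()                          # rescale costs for numerical stability
a, b = np.full(6, 1 / 6), np.full(8, 1 / 8)
pi, cost = sinkhorn(a, b, D)
print(cost)
print(pi.sum(axis=1))                    # close to a: mass preservation holds
```

Because every step is a matrix-vector product, the same loop ports directly to GPU tensor libraries.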

3. PROPOSED METHOD

In this section, we present the proposed Entire Space CounterFactual Regression (ESCFR) approach based on optimal transport to tackle the treatment selection bias. We first illustrate the stochastic optimal transport framework for distribution discrepancy minimization across treatment groups. Based on the framework, we then propose a relaxed mass-preserving regularizer to address the sampling effect, and a proximal factual outcome regularizer to handle the unobserved confounders. We finally summarize the model architecture, learning objectives, and optimization algorithm.

3.1. STOCHASTIC OPTIMAL TRANSPORT FOR COUNTERFACTUAL REGRESSION

Representation-based approaches mitigate the treatment selection bias by computing the distribution discrepancy in the representation space and then minimizing it. We select optimal transport to compute the discrepancy since it has shown compelling advantages over its competitors. Specifically, it accounts for the geometry of the distributions and thus works in cases where ϕ-divergences (e.g., the Kullback-Leibler divergence) fail (Seguy et al., 2018). In addition, the calculated discrepancy can be optimized within the traditional supervised learning framework instead of an adversarial learning framework, and is therefore easier to optimize than adversarial-based methods (Kallus, 2020). Optimal transport calculates the group discrepancy as W(P_ψ^{T=1}(r), P_ψ^{T=0}(r)), where P_ψ^{T=1}(r) and P_ψ^{T=0}(r) are the distributions of representations in the treated and untreated groups, respectively, induced by the mapping r = ψ(x). The discrepancy is differentiable with respect to ψ (Flamary et al., 2021) and can thus be minimized by updating the representation mapping ψ with gradient-based optimizers.

Definition 3.1. Let P̂^{T=1}(x) and P̂^{T=0}(x) be the empirical distributions of covariates at a mini-batch level, which contain n treated units and m untreated units, respectively; let P̂_ψ^{T=1}(r) and P̂_ψ^{T=0}(r) be those of representations induced by the representation mapping r = ψ(x) defined in Definition 2.2.

However, since prevalent neural estimators mainly update parameters with stochastic gradient methods, only a fraction of the units is accessible within each iteration. A shortcut in this context is to calculate the group discrepancy at a stochastic mini-batch level:

Ŵ_ψ := W(P̂_ψ^{T=1}(r), P̂_ψ^{T=0}(r)). (6)

To further investigate the effectiveness of this shortcut, Theorem 3.1 demonstrates that the PEHE can be optimized by iteratively minimizing the factual outcome estimation error and the mini-batch group discrepancy (6). The proof of the theorem can be found in Appendix A.3.
Theorem 3.1. Let ψ and ϕ be the representation mapping and factual outcome mapping, respectively, and Ŵ_ψ the group discrepancy at a mini-batch level. With probability of at least 1 − δ, we have:

ϵ_PEHE(ψ, ϕ) ≤ 2 [ϵ_F^{T=1}(ψ, ϕ) + ϵ_F^{T=0}(ψ, ϕ) + B_ψ Ŵ_ψ − 2σ²_Y + O(1/(δN))], (7)

where ϵ_F^{T=1} and ϵ_F^{T=0} are the expected factual outcome estimation errors (see (30) and (32) in Appendix A).

3.2. RELAXED MASS-PRESERVING REGULARIZER FOR SAMPLING EFFECT

The mini-batch discrepancy Ŵ_ψ is highly dependent on the uncontrollable sampling quality. Therefore, the discrepancy measure should be robust to bad sampling cases; otherwise, the resulting huge variance will impede it from computing and reducing the actual discrepancy. The OT discrepancy in (6) can be easily disturbed by many sampling cases, except for the ideal case in Figure 2(a) where the transport strategy is reasonable and applicable. For example, as shown in Figure 2(b), it falsely matches units with unrelated factual outcomes in mini-batches where the outcomes between groups are imbalanced; as shown in Figure 2(c), it falsely matches mini-batch outliers to normal units, causing a substantial disruption of the transport strategy. In short, the vanilla OT technique in (6) fails to quantify the group discrepancy because it produces erroneous transport strategies in non-ideal mini-batches, and thus misleads the update of the representation mapping ψ. We summarize this phenomenon as the mini-batch sampling effect (MSE) issue. The issue is attributed to the mass-preserving constraint in (4), which requires all units in both groups to match each other regardless of the actual situation. Mini-batch outliers, for instance, are compelled to be transported as in Figure 2, which impedes the transport of normal units and the computation of the actual group discrepancy. A small batch size exacerbates this defect. Definition 3.2.
For empirical distributions α and β with n and m units, respectively, optimal transport with a relaxed mass-preserving constraint seeks the transport strategy π̂ at minimum cost:

W_{ϵ,κ}(α, β) := ⟨D, π̂⟩, π̂ := argmin_{π ∈ R^{n×m}_+} ⟨D, π⟩ − ϵH(π) + κ (D_KL(π 1_m ∥ a) + D_KL(π^T 1_n ∥ b)), (8)

where D ∈ R^{n×m}_+ is the unit-wise distance, and a and b indicate the mass of units in α and β, respectively.

An intuitive approach to mitigate MSE is to relax the marginal constraint and allow for the creation and destruction of a unit's mass. To this end, a relaxed mass-preserving regularizer (RMPR) is devised in Definition 3.2, which replaces the hard marginal constraint in (4) with the soft penalty in (8) when deriving the transport strategy. In this context, the stochastic discrepancy is calculated as Ŵ_ψ^{ϵ,κ} := W_{ϵ,κ}(P̂_ψ^{T=1}(r), P̂_ψ^{T=0}(r)), where the hard mass-preserving constraint is removed to mitigate the MSE issue. Inspired by Fatras et al. (2021), the robustness of RMPR to sampling effects can be further investigated theoretically in Theorem 3.2, where the effect of mini-batch outliers is upper bounded by a constant.

Theorem 3.2. For empirical distributions α and β with n and m units, respectively, adding an outlier a′ to α and denoting the disturbed distribution by α′, we have

W_{0,κ}(α′, β) − W_{0,κ}(α, β) ≤ 2κ (1 − e^{−∑_{b∈β}(a′−b)²/2κ})/n,

which is upper bounded by 2κ/n, where W_{0,κ} is the unbalanced discrepancy as per Definition 3.2.

In addition, compared with alternative relaxations of the marginal constraint (Xu et al., 2020; Chapel et al., 2021), the approach in Definition 3.2 has better metric properties (Séjourné et al., 2019) and can be accelerated via the generalized Sinkhorn algorithm (Chizat et al., 2018). It is differentiable w.r.t. ψ and can thus be minimized via stochastic gradient methods in an end-to-end manner.
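A sketch of the generalized Sinkhorn iteration for (8) illustrates the point. The damping exponent κ/(κ+ϵ) implements the KL relaxation of the marginals (Chizat et al., 2018); the toy outlier setup is ours. Under the hard constraint of (4), the outlier row would be forced to ship its full 1/n of mass; here it retains almost none:

```python
import numpy as np

def sinkhorn_unbalanced(a, b, D, eps=0.1, kappa=1.0, n_iter=500):
    """Generalized Sinkhorn for the KL-relaxed problem of Eq. (8): the damping
    exponent kappa/(kappa+eps) softens the marginal constraints, and
    kappa -> infinity recovers balanced entropic OT."""
    K = np.exp(-D / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = kappa / (kappa + eps)           # damping from the KL relaxation
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.3, size=(6, 1))
x_out = np.vstack([x, [[5.0]]])          # append one mini-batch outlier
y = rng.normal(0.0, 0.3, size=(6, 1))
D = (x_out[:, None, 0] - y[None, :, 0]) ** 2
a, b = np.full(7, 1 / 7), np.full(6, 1 / 6)
pi = sinkhorn_unbalanced(a, b, D)
# The relaxed plan leaves the outlier's row nearly empty while still
# transporting most of the normal units' mass.
print(pi[-1].sum(), pi[:-1].sum())
```

This is the qualitative behavior bounded by Theorem 3.2: the outlier's contribution to the discrepancy saturates instead of dominating the plan.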
Existing representation-based methods fail to eliminate the treatment selection bias in the presence of unobserved confounder effects (UCE). Beginning with CFR (Shalit et al., 2017), the unconfoundedness assumption A.1 (see Appendix A) is often taken to circumvent the UCE issue (Ma et al., 2022). Given two units r_i ∈ P_ψ^{T=1}(r) and r_j ∈ P_ψ^{T=0}(r), for instance, optimal transport in Definition 3.2 calculates the unit-wise distance as D_ij := ∥r_i − r_j∥². If Assumption A.1 holds, this approach mitigates the treatment selection bias, since balancing the confounders across groups in a latent space blocks the backdoor path X → T in Figure 3(a). However, Assumption A.1 is usually violated in practice, as shown in Figure 3(b), which hinders existing methods, including OT, from handling the treatment selection bias since the backdoor path X′ → T is not blocked.

3.3. PROXIMAL FACTUAL OUTCOME REGULARIZER FOR UNOBSERVED CONFOUNDERS

[Figure 3: causal graphs (a) without and (b) with an unobserved confounder X′.]

According to Figure 3(b), given balanced X and identical T, the only variable reflecting the variation of X′ is the outcome Y. As such, inspired by the joint distribution transport technique (Courty et al., 2017a), we propose to calibrate the unit-wise distance D with the potential outcomes as follows:

D_ij^γ = ∥r_i − r_j∥² + γ · [∥y_i^{T=0} − y_j^{T=0}∥² + ∥y_j^{T=1} − y_i^{T=1}∥²], (11)

where γ controls the strength of the regularization. The underlying intuition is that units with similar (both observed and unobserved) confounders should have similar potential outcomes. As such, for a pair of units with similar observed covariates, i.e., ∥r_i − r_j∥² ≈ 0, if their potential outcomes under the same treatment t ∈ {0, 1} differ greatly, i.e., ∥y_i^t − y_j^t∥ ≫ 0, their unobserved confounders should likewise differ significantly. The vanilla OT technique in (6) with D_ij = ∥r_i − r_j∥² would incorrectly match this pair because ∥r_i − r_j∥² ≈ 0, generate a false transport strategy, and consequently misguide the update of the representation mapping ψ. In contrast, OT based on D_ij^γ would not match this pair, as the difference in unobserved confounders is compensated for by that in potential outcomes. Moreover, since y_i^{T=0} and y_j^{T=1} in (11) are unavailable due to the missing counterfactual outcomes, the proposed proximal factual outcome regularizer (PFOR) uses their estimates instead. Specifically, let ŷ_i and ŷ_j be the estimates of y_i^{T=0} and y_j^{T=1}, respectively; PFOR refines (11) as

D_ij^γ = ∥r_i − r_j∥² + γ · [∥ŷ_i − y_j∥² + ∥ŷ_j − y_i∥²], ŷ_i = ϕ_0(r_i), ŷ_j = ϕ_1(r_j). (12)

Additional justifications, assumptions, and limitations of PFOR are discussed in Appendix D.3.
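The calibrated distance in (12) is straightforward to vectorize. The sketch below assumes hypothetical linear outcome heads standing in for ϕ_0 and ϕ_1, with illustrative shapes; only the formula itself comes from the paper:

```python
import numpy as np

def pfor_cost(r_t, r_c, y_t, y_c, phi0, phi1, gamma=0.5):
    """Unit-wise distance D^gamma of Eq. (12): squared representation distance
    plus the proximal factual outcome term, with the two unobservable
    counterfactuals replaced by the model's own estimates."""
    D_rep = ((r_t[:, None, :] - r_c[None, :, :]) ** 2).sum(-1)
    y0_hat = phi0(r_t)   # estimated untreated outcome for each treated unit i
    y1_hat = phi1(r_c)   # estimated treated outcome for each untreated unit j
    D_out = (y0_hat[:, None] - y_c[None, :]) ** 2 \
          + (y1_hat[None, :] - y_t[:, None]) ** 2
    return D_rep + gamma * D_out

# Toy check with hypothetical linear outcome heads.
rng = np.random.default_rng(0)
r_t, r_c = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
y_t, y_c = rng.normal(size=4), rng.normal(size=5)
w = rng.normal(size=3)
D = pfor_cost(r_t, r_c, y_t, y_c, phi0=lambda r: r @ w, phi1=lambda r: r @ w + 1.0)
print(D.shape)
```

Setting gamma to zero recovers the vanilla representation distance, which is how the sensitivity study in Section 4.5 interpolates between the two regimes.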

3.4. ARCHITECTURE OF ENTIRE SPACE COUNTERFACTUAL REGRESSION

The architecture of ESCFR is presented in Figure 1(b), where the covariates X are first mapped to the representations R with ψ(·), and then to the potential outcomes with ϕ(·). The group discrepancy W is calculated with optimal transport equipped with the RMPR in (8) and the PFOR in (12). The learning objective is to minimize the risk of factual outcome estimation and the group discrepancy. Given the mini-batch distributions P̂^{T=1}(x) and P̂^{T=0}(x) in Definition 3.1, the risk of factual outcome estimation, following Shi et al. (2019), can be formulated as

L_F(ψ, ϕ) := E_{x_i ∈ P̂^{T=1}(x)} ∥ϕ_1(ψ(x_i)) − y_i∥² + E_{x_j ∈ P̂^{T=0}(x)} ∥ϕ_0(ψ(x_j)) − y_j∥², (13)

where y_i and y_j are the factual outcomes for the corresponding treatment groups. The discrepancy is

L_D^{ϵ,κ,γ}(ψ) := W_{ϵ,κ}(P̂_ψ^{T=1}(r), P̂_ψ^{T=0}(r)), (14)

which is in general the optimal transport with RMPR in Definition 3.2, except that the unit-wise distance D^γ is calculated with the PFOR in (12). Finally, the overall learning objective of ESCFR is

L_ESCFR^{ϵ,κ,γ,λ} := L_F(ψ, ϕ) + λ · L_D^{ϵ,κ,γ}(ψ), (15)

where λ controls the strength of distribution alignment, ϵ controls the entropic regularization in (5), κ controls the RMPR in (8), and γ controls the PFOR in (12). The learning objective (15) mitigates the selection bias following Theorem 3.1 and handles the MSE and UCE issues. The optimization procedure consists of three steps, as summarized in Algorithm 3. First, compute π̂_{ϵ,κ} by solving the linear programming problem in Definition 3.2 with Algorithm 2, where the unit-wise distance is calculated with D^γ. Second, compute the discrepancy in (14) as ⟨π̂_{ϵ,κ}, D^γ⟩, where D^γ is differentiable w.r.t. ψ, making it feasible to minimize this discrepancy with its gradient. Third, update ψ and ϕ by descending the gradient of the overall objective (15).

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

We optimize all models with the Adam optimizer (Kingma & Ba, 2015). We fine-tune hyperparameters within the ranges in Figure 5, validate performance every two epochs, and save the optimal model for testing. Evaluation protocol. Following Liuyi et al.
(2018), the PEHE in (3) is used as the precision metric for performance evaluation. However, it is unavailable in the model selection phase due to missing counterfactuals. As such, we use the area under the uplift curve (AUUC) (Betlei et al., 2021) to guide model selection; it measures the counterfactual ranking performance of a model and can be computed without counterfactual outcomes. The within-sample and out-of-sample results are reported on the training and test data, respectively, following Shalit et al. (2017).
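AUUC conventions vary across papers. The sketch below implements one simple unnormalized variant purely for illustration (rank units by predicted effect, then accumulate the treated-vs-untreated outcome gap over top-k prefixes); it should not be read as the exact metric used in these experiments:

```python
import numpy as np

def auuc(scores, y, t):
    """One simple AUUC variant: sort by predicted uplift, then average the
    top-k uplift estimates weighted by prefix size. Needs only factual
    outcomes, which is why it can drive model selection."""
    order = np.argsort(-scores)
    y, t = y[order], t[order].astype(bool)
    n = len(y)
    area = 0.0
    for k in range(1, n + 1):
        yt, yc = y[:k][t[:k]], y[:k][~t[:k]]
        if len(yt) > 0 and len(yc) > 0:
            area += (yt.mean() - yc.mean()) * k / n
    return area / n

# Synthetic check: the true effect grows with x, so ranking by x beats random.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(size=n)
t = rng.integers(0, 2, size=n)
y = t * x + rng.normal(scale=0.05, size=n)
a_rank = auuc(x, y, t)                    # scores with perfect ranking
a_rand = auuc(rng.uniform(size=n), y, t)  # uninformative scores
print(a_rank, a_rand)
```

A model whose predicted ITEs order units well scores higher, even though no counterfactual outcome is ever observed.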

4.2. PERFORMANCE COMPARISON

Table 1 reports the performance of ESCFR and its competitors over ten runs. Statistical estimators exhibit competitive performance on the PEHE metric. Owing to their superior capacity to capture non-linearity, neural estimators outperform linear and random forest methods. In particular, TARNet, which absorbs the advantages of the T-learner and the S-learner, achieves the best overall performance among statistical estimators. However, their circumvention of the treatment selection bias leads to inferior performance.

Matching methods, e.g., PSM, exhibit compelling ranking performance, which explains why they are favored in counterfactual ranking practice (Betlei et al., 2021). However, their poor performance on PEHE hinders their application in counterfactual estimation settings, such as advertising systems, that place more emphasis on the accuracy of treatment effect estimation.

Representation-based methods mitigate the treatment selection bias and enhance overall performance. In particular, CFR-WASS reaches an out-of-sample PEHE of 3.207 on ACIC, significantly outperforming most statistical methods. However, the MSE and UCE issues impede these methods from fully resolving the treatment selection bias. The proposed ESCFR achieves significant improvements on most metrics compared with various prevalent baselines. Combined with the comparisons above, we attribute its superiority to the proposed RMPR and PFOR regularizers, which make it robust to MSE and UCE. See Appendix D.5 for additional comparison results.

4.3. ABLATION STUDY

To verify the effectiveness of individual components, an ablation study is conducted on the ACIC benchmark in Table 2. Specifically, ESCFR first augments TARNet with the stochastic optimal transport of Section 3.1, which effectively reduces the out-of-sample PEHE from 3.254 to 3.207.

Most prevalent methods fail to cope with the label imbalance and mini-batch outliers in Figure 2(b-c). Figure 4 shows the transport plans generated with RMPR in the same situations, where RMPR alleviates the MSE issue in both bad cases. RMPR with κ = 10, for instance, avoids the incorrect matching of units with different outcomes; RMPR with κ = 2 is robust to the outlier's interference and correctly matches the remaining units. We attribute this success to the relaxed mass-preserving constraint in Section 3.2. Notably, RMPR does not transport all of a unit's mass: the closer a unit is to the overlapping zone in a batch, the more mass is transferred. That is, RMPR adaptively matches and pulls closer units that are close to the overlapping region while ignoring outliers, which mitigates the bias of causal inference methods in cases where the positivity assumption does not strictly hold. Current approaches mainly achieve this by manually cleaning the data or dynamically weighting the units (Johansson et al., 2020), while RMPR implements it naturally via the soft penalty in (8). We further investigate the performance of RMPR under different batch sizes and κ in Appendix D.2 to verify its effectiveness more thoroughly.

4.5. PARAMETER SENSITIVITY STUDY

We discuss four critical hyperparameters in ESCFR, i.e., λ, ϵ, κ, and γ, which weight the terms of the learning objective and significantly influence the final performance. We first vary λ to investigate the influence of stochastic optimal transport. Increasing λ consistently improves the precision of ITE estimates; for example, the out-of-sample PEHE drops from 3.22 at λ = 0 to approximately 2.85 at λ = 1.5. However, placing too much emphasis on distribution balancing in a multi-task learning framework makes factual outcome estimation harder, leading to sub-optimal ITE estimates. Next, we vary ϵ to investigate the entropic regularizer. A larger ϵ accelerates the computation of the optimal transport discrepancy (Flamary et al., 2021); however, it yields a biased transport plan, evidenced by the out-of-sample PEHE increasing to 2.95 with fluctuations. We further vary γ and κ to showcase the influence of PFOR and RMPR, respectively. PFOR benefits ITE estimation, while assigning too much weight to the proximal outcome distance is detrimental, because the unit-wise distance calculated with representations (covariates) would be neglected. Relaxing the mass-preserving constraint via RMPR can significantly improve the ITE estimates; however, a too-small κ is always detrimental, as we can no longer guarantee that the representations across treatment groups are drawn closer together by the optimal transport.

5. RELATED WORKS

Current works mitigate the treatment selection bias by balancing the distributions of treated and untreated groups, and can be divided into three families according to the balancing technique.

Reweighting-based methods weight individuals with balancing scores to achieve globally balanced distributions, represented by the inverse propensity score (IPS) approach (Rosenbaum & Rubin, 1983a) and its doubly robust variant (Robins et al., 1994). Imai & Ratkovic (2014) and Fong et al. (2018) propose calculating the balancing score by solving an optimization problem. Kuang et al. (2017b) and Kuang et al. (2017a) consider additional non-confounding factors in the covariates. However, these methods are susceptible to non-overlapping units and suffer from a high-variance issue.

Matching-based methods match similar units from different groups to construct locally balanced distributions. Representative methods (Rosenbaum & Rubin, 1983b; Chang & Dy, 2017; Li et al., 2016) mainly differ in their similarity measures. Notably, tree-based methods (Wager & Athey, 2018) can also be considered matching methods with adaptive neighborhood metrics. However, the computational cost hinders the application of these methods in large-scale scenarios.

Representation-based methods, beginning with BNN (Johansson et al., 2016) and CFR (Shalit et al., 2017), mitigate the selection bias by aligning the group distributions in a representation space.

It is necessary to distinguish our work from emerging OT-based causal inference approaches. Dunipace (2021) augments the IPS method with a propensity score estimator based on OT; however, it is limited by the aforementioned high-variance issue. Torous et al. (2021) use the push-forward operator to improve change-in-change models; however, these are designed for multi-phase data, which is not available in our case. Li et al. (2022) has a setup similar to ours, but it focuses on the decomposition of latent variables and is identical to Shalit et al. (2017) in terms of alignment technique.
Our work offers a new take on OT under the CFR framework, alleviating the MSE and UCE issues that had long been neglected by the causal inference community until recently.

6. CONCLUSION

Owing to its effectiveness in mitigating treatment selection bias, representation learning has been the primary approach to estimating individual treatment effects. However, existing methods neglect the mini-batch sampling effects and unobserved confounders, which hinders them from handling the treatment selection bias. We devise a principled approach named ESCFR based on a generalized Sinkhorn discrepancy. Extensive experiments demonstrate that ESCFR largely mitigates the MSE and UCE issues and achieves better performance than baseline models. There are two directions of future work that we intend to pursue. The first attempts to construct the representation mapping with normalizing flows (Chen et al., 2018), which are invertible and thus satisfy the assumption on the representation mapping made by Shalit et al. (2017). The second seeks to extend our methodology to industrial applications, e.g., debiasing recommenders (Wang et al., 2022), which has long been dominated by reweighting methods suffering from the high-variance issue.

A CAUSAL INFERENCE WITH OBSERVATIONAL STUDIES

We introduce the necessary causal inference preliminaries for readers who are unfamiliar with this area.

A.1 PROBLEM FORMULATION

This section formalizes the definitions, assumptions, and useful lemmas in causal inference from observational data. Following the notation in Section 2.1, an individual with covariates x has two potential outcomes: Y_1(x) if it is treated and Y_0(x) otherwise. The ground-truth individual treatment effect (ITE) is the difference between its potential outcomes.

Definition A.1. The individual treatment effect (ITE) for a unit with covariates x is τ(x) := E[Y_1 − Y_0 | x], where we abbreviate Y_1(x) to Y_1 for brevity. The expectation is taken over the potential outcome space Y.

Estimating ITE with observational data is a common practice in causal inference, which has long been confronted with two primary challenges: missing counterfactuals and treatment selection bias. The first step towards handling them is identification, which converts the causal estimand into a statistical estimand under the following standard assumptions.

Assumption A.1 (Unconfoundedness). The potential outcomes are independent of the treatment given the covariates: (Y_1, Y_0) ⫫ T | X.

Assumption A.2 (Consistency). The observed outcome equals the potential outcome corresponding to the received treatment: Y = Y_T.

Assumption A.3 (Positivity). 0 < P(T = 1 | X = x) < 1. That is, all individuals have a chance to be assigned both treatments.

The second step is estimation, which aims to estimate the derived statistical estimand with observational data. Lemma A.1 illustrates how this two-step approach can be used for ITE estimation.

Lemma A.1. The ITE estimand τ(x) can be identified as:

E[Y_1 − Y_0 | X = x] = E[Y_1 | X = x] − E[Y_0 | X = x]
= E[Y_1 | X = x, T = 1] − E[Y_0 | X = x, T = 0]
= E[Y | X = x, T = 1] − E[Y | X = x, T = 0],

where the first reduction stems from the unconfoundedness assumption A.1 and the second from the consistency assumption A.2. The derived estimand is fully composed of statistical estimands, which can only be estimated under the positivity assumption A.3. Otherwise, if the positivity assumption is violated, we have:

E[Y | X = x, T = 1] = ∫ y · P(Y = y | X = x, T = 1) dy = ∫ y · P(Y = y, X = x, T = 1) / (P(T = 1 | X = x) P(X = x)) dy,

which is not computable, as there exists x ∈ X which makes P(T = 1 | X = x) = 0.

A.2 META-LEARNERS

The T-learner models the factual outcomes for treated units X_1 and untreated units X_0 separately, which highlights the treatment indicator's effect; however, it reduces data efficiency and is therefore inapplicable when the dataset is small. The S-learner regards the treatment indicator T as one of the covariates and models Y for all units simultaneously. Künzel et al.
(2019) discuss the advantages and limitations of these two approaches in more detail.

[Figure 6: model architectures of (a) the S-learner, (b) the T-learner, and (c) TARNet.]

Definition A.2. Let ψ : X → R be a mapping from support X to R; that is, ∀x ∈ X, ∃r = ψ(x) ∈ R. Let ϕ : R × T → Y be a mapping from support R × T to Y; that is, it maps the representations and treatment indicator to the corresponding factual outcome. For example, Y_1 = ϕ_1(R) and Y_0 = ϕ_0(R), where we always abbreviate ϕ(R, T = 1) and ϕ(R, T = 0) to ϕ_1(R) and ϕ_0(R), respectively, for brevity.

Assumption A.4. ψ : X → R is differentiable and invertible, with its inverse ψ⁻¹ defined over R.

TARNet (Shalit et al., 2017) in Figure 6(c) obtains better results by absorbing the advantages of both the T-learner and the S-learner; it consists of a representation mapping ψ and an outcome mapping ϕ as defined in Definition A.2. For a unit with covariates X, TARNet estimates ITE as the difference between the predicted outcomes with T set to treated and to untreated:

τ̂_{ψ,ϕ}(X) := Ŷ_1 − Ŷ_0, where Ŷ_1 = ϕ_1(ψ(X)), Ŷ_0 = ϕ_0(ψ(X)),

where ψ is trained over all units, while ϕ_1 and ϕ_0 are trained over the treated and untreated units, respectively, to minimize the factual error ϵ_F(ψ, ϕ) in Definition A.3. Finally, the performance of the ITE estimator is mainly evaluated with the PEHE:

ϵ_PEHE(ψ, ϕ) = ∫_X (τ̂_{ψ,ϕ}(x) − τ(x))² P(x) dx.

Definition A.3. Let L be the loss function that measures the quality of outcome estimation, e.g., the squared loss. The expected loss for units with covariates x and treatment indicator t is:

l_{ψ,ϕ}(x, t) := ∫_Y L(Y_t, ϕ(ψ(x), t)) · P(Y_t | x) dY_t,

where L is realized with the squared loss L(Y_t, ϕ(ψ(x), t)) = (Y_t − ϕ(ψ(x), t))² in our scenario.
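The identification argument of Lemma A.1 and the danger of ignoring confounders can be checked numerically. In the hypothetical simulation below, a single binary confounder drives both treatment selection and the outcome, so the naive group-mean difference is biased while stratifying on x recovers the true effect of 2.0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, size=n)                # one binary confounder
p_t = np.where(x == 1, 0.8, 0.2)              # x drives treatment selection
t = rng.random(n) < p_t
y0 = 1.0 * x + rng.normal(scale=0.1, size=n)  # x also drives the outcome
y1 = y0 + 2.0                                 # the true effect is 2.0 everywhere
y = np.where(t, y1, y0)                       # only factual outcomes observed

naive = y[t].mean() - y[~t].mean()            # biased by treatment selection
adjusted = np.mean([                          # E_x[ E[Y|x,T=1] - E[Y|x,T=0] ]
    y[(x == v) & t].mean() - y[(x == v) & ~t].mean()
    for v in (0, 1)
])
print(naive, adjusted)                        # naive overshoots; adjusted ~ 2.0
```

Averaging the strata with equal weight matches P(x) here because x is uniform; with an unobserved confounder in place of x, no such adjustment would be available, which is precisely the UCE issue of Section 3.3.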
The expected factual outcome estimation errors for the treated, untreated, and all units are:

ϵ_F^{T=1}(ψ, ϕ) := ∫_𝒳 l_{ψ,ϕ}(x, 1) · P^{T=1}(x) dx,
ϵ_F^{T=0}(ψ, ϕ) := ∫_𝒳 l_{ψ,ϕ}(x, 0) · P^{T=0}(x) dx,
ϵ_F(ψ, ϕ) := ∫_{𝒳×𝒯} l_{ψ,ϕ}(x, t) · P(x, t) dx dt.

A.3 REPRESENTATION-BASED METHODS FOR TREATMENT SELECTION BIAS

However, the treatment selection bias shifts the covariate distributions across groups. As such, ϕ_1 and ϕ_0 would overfit their respective group's properties and thus fail to generalize to the entire population. For example, as shown in Figure 1(a), the potential outcome estimator ϕ_1 trained with treated units does not generalize to the untreated units; the resulting ITE estimate would therefore be biased.

Definition A.4. Let P^{T=1}(x) := P(x | T = 1) and P^{T=0}(x) := P(x | T = 0) be the covariate distributions of the treated and untreated groups, respectively. Let P_ψ^{T=1}(r) and P_ψ^{T=0}(r) be the corresponding distributions of representations induced by the representation mapping r = ψ(x) in Definition A.2.

To mitigate the effect of treatment selection bias, representation-based approaches (Johansson et al., 2016; Shalit et al., 2017) minimize the distribution discrepancy between groups in the representation space. In particular, the integral probability metric (IPM) in Definition A.5 is a widely used metric that measures the discrepancy between two distributions. Shalit et al. (2017) propose to optimize the PEHE by minimizing the estimation error of factual outcomes ϵ_F together with the IPM between the learned representations of the treated and untreated groups, and provide theoretical results to back up this approach as per Theorem A.1.

Definition A.5. Consider two distribution functions P^{T=1}(x) and P^{T=0}(x) supported over 𝒳, and let F be a sufficiently large function family. The integral probability metric induced by F is:

IPM_F(P^{T=1}, P^{T=0}) = sup_{f ∈ F} | ∫_𝒳 f(x) (P^{T=1}(x) − P^{T=0}(x)) dx |.

Theorem A.1.
Let ψ and ϕ be the mappings in Definition A.2, F be a predefined, sufficiently large function family of ϕ, and IPM_F be the integral probability metric induced by F. Assume there exists a constant B_ψ > 0 such that for t ∈ {0, 1}, (1/B_ψ) · l_{ψ,ϕ}(x, t) ∈ F. Shalit et al. (2017) demonstrate:

ϵ_PEHE(ψ, ϕ) ≤ 2 (ϵ_F^{T=0}(ψ, ϕ) + ϵ_F^{T=1}(ψ, ϕ) + B_ψ IPM_F(P_ψ^{T=1}, P_ψ^{T=0}) − 2σ_Y²),

where ϵ_F^{T=0} and ϵ_F^{T=1} are the expected factual outcome estimation errors defined above, and σ_Y² is the variance of the outcomes.

A.4 THEORETICAL RESULTS AND EXTENSIONS

In this section, we describe the intuition behind Theorem 3.1 and Theorem 3.2, and provide rigorous proofs to support our claims. Two weaknesses hinder the existing Theorem A.1 by Shalit et al. (2017) from fully supporting representation-based causal inference approaches.

First, the IPM is a discrepancy metric with profound theoretical properties that is nevertheless difficult to compute numerically. To counter this, note that Theorem A.1 holds for any sufficiently large function family, e.g., the 1-Lipschitz function family, for which the IPM is equivalent to the Wasserstein distance as per the Kantorovich-Rubinstein duality (Villani, 2009). As such, the Wasserstein distance offers a computational shortcut to the IPM, as per Lemma A.2.

Lemma A.2. Consider two distribution functions P_1(x) and P_2(x) supported over 𝒳. Let F be the family of 1-Lipschitz functions and W be the Wasserstein distance. Villani (2009) demonstrates:

IPM_F(P_1, P_2) = W(P_1, P_2).

The other weakness is that Theorem A.1 neglects the mini-batch sampling effects (MSE). Specifically, it holds only if the entire populations of the treated and untreated groups are available. However, since representation-based approaches update parameters with stochastic gradient methods, only a subset of the population is accessible within each iteration. As such, it remains questionable whether Theorem A.1 holds at a mini-batch level in practice.

Lemma A.3. Let P(x) be a probability measure supported over 𝒳 ⊆ R^d satisfying the T_1(λ) inequality, and let P̂(x) = (1/N) ∑_{i=1}^N δ_{x_i} be the corresponding empirical measure with N units. Bolley et al. (2007) and Redko et al. (2017) demonstrate that for any d′ > d and λ′ < λ, there exists some constant N_0 such that for any ε > 0 and N ≥ N_0 max(ε^{−(d′+2)}, 1):

P (W(P(x), P̂(x)) > ε) ≤ exp(−(λ′/2) N ε²),   (26)

where d′ and λ′ can be calculated explicitly. Hoeffding's inequality is a powerful statistical tool to quantify such sampling effects, which is proved to be applicable to W by Bolley et al.
(2007). Therefore, it is natural to expand W according to Lemma A.3 so as to extend Theorem A.1 to mini-batch situations and quantify the sampling effects.

Theorem A.2. Let ψ and ϕ be the representation mapping and factual outcome mapping, respectively, and let Ŵ_ψ be the discrepancy across groups at a mini-batch level. With probability at least 1 − δ, we have:

ϵ_PEHE(ψ, ϕ) ≤ 2 [ϵ_F^{T=1}(ψ, ϕ) + ϵ_F^{T=0}(ψ, ϕ) + B_ψ Ŵ_ψ − 2σ_Y² + O(1/(δN))],   (27)

where ϵ_F^{T=1} and ϵ_F^{T=0} are the expected losses of factual outcome estimation over treated and untreated units, respectively; N is the batch size; σ_Y² is the variance of the outcomes; B_ψ is a constant such that (1/B_ψ) · l_{ψ,ϕ}(x, t) belongs to the family of 1-Lipschitz functions; and O(·) is the sampling complexity term.

Proof. According to Theorem A.1 we have:

ϵ_PEHE(ψ, ϕ) ≤ 2 (ϵ_F^{T=0}(ψ, ϕ) + ϵ_F^{T=1}(ψ, ϕ) + B_ψ IPM_F(P_ψ^{T=1}, P_ψ^{T=0}) − 2σ_Y²).   (29)

Assume there exists a constant B_ψ > 0 such that for t ∈ {0, 1}, (1/B_ψ) · l_{ψ,ϕ}(x, t) belongs to the family of 1-Lipschitz functions. According to Lemma A.2, we have:

ϵ_PEHE(ψ, ϕ) ≤ 2 (ϵ_F^{T=0}(ψ, ϕ) + ϵ_F^{T=1}(ψ, ϕ) + B_ψ W(P_ψ^{T=1}, P_ψ^{T=0}) − 2σ_Y²).

Following Definition 3.1, let P̂_ψ^{T=1}(r) and P̂_ψ^{T=0}(r) be the empirical distributions of representations at a mini-batch level, containing N_1 treated units and N_0 untreated units, respectively. By the triangle inequality for W:

W(P_ψ^{T=1}, P_ψ^{T=0}) ≤ W(P_ψ^{T=1}, P̂_ψ^{T=1}) + W(P̂_ψ^{T=1}, P_ψ^{T=0})
                        ≤ W(P_ψ^{T=1}, P̂_ψ^{T=1}) + W(P_ψ^{T=0}, P̂_ψ^{T=0}) + W(P̂_ψ^{T=0}, P̂_ψ^{T=1})
                        := W(P_ψ^{T=1}, P̂_ψ^{T=1}) + W(P_ψ^{T=0}, P̂_ψ^{T=0}) + Ŵ_ψ.   (30)

The concentration inequality in Lemma A.3 further gives the following bounds, which hold with probability at least 1 − δ:

W(P_ψ^{T=1}, P̂_ψ^{T=1}) ≤ √(2 log(1/δ) / (λ′ N_1)),   W(P_ψ^{T=0}, P̂_ψ^{T=0}) ≤ √(2 log(1/δ) / (λ′ N_0)).   (31)

Denote N := N_0 + N_1 as the batch size and θ := N_1/N as the ratio of treated units in the current batch.
Combining (30) and (31) we have:

W(P_ψ^{T=1}, P_ψ^{T=0}) ≤ Ŵ_ψ + √(2 log(1/δ) / (λ′ N_1)) + √(2 log(1/δ) / (λ′ N_0))
                        = Ŵ_ψ + √(2 log(1/δ) / (λ′ N)) (√(1/θ) + √(1/(1−θ)))
                        := Ŵ_ψ + O(1/(δN)),   (32)

where O(·) satisfies:

√(2 log(1/δ)/λ′) (1 + √(1/(N−1))) ≥ O(1/(δN)) ≥ 4 √(log(1/δ) / (λ′ N)),   (33)

and O(1/(δN)) reaches its maximum when θ = 1/N or θ = 1 − 1/N, and its minimum when θ = 0.5. This can be derived by differentiating the function f(θ) = 1/√θ + 1/√(1−θ). Combining (29) and (32) we have:

ϵ_PEHE(ψ, ϕ) ≤ 2 [ϵ_F^{T=1}(ψ, ϕ) + ϵ_F^{T=0}(ψ, ϕ) + B_ψ Ŵ_ψ − 2σ_Y² + O(1/(δN))],

where we absorb B_ψ O(1/(δN)) into O(1/(δN)), and the proof is completed.

Theorem A.2 extends Theorem A.1 and derives the upper bound of the PEHE in stochastic-batch form, which demonstrates that the PEHE can be optimized by iteratively minimizing the factual outcome estimation error and the optimal transport discrepancy at a mini-batch level.

Corollary A.1. The empirical variance of the PEHE bound in (27) largely depends on the batch size and the proportion of treated and untreated units: a large batch size and a balanced proportion produce low empirical variance, and vice versa.

Proof. This follows directly from (27) (batch size) and (33) (treatment proportion).

Theorem A.3. For discrete measures α = ∑_{i=1}^n a_i δ_{x_i} and β = ∑_{j=1}^m b_j δ_{x_j}, adding an outlier δ_{x′} to α and denoting the disturbed distribution by α′, we have:

W_{0,κ}(α′, β) − W_{0,κ}(α, β) ≤ (2κ/(n+1)) (1 − e^{−∑_{j=1}^m (x′ − x_j)²/2κ}),

which is upper bounded by 2κ/(n+1). Here W_{0,κ} is the unbalanced discrepancy as per Definition 3.2.

Proof. This is a direct corollary of Lemma 1 of Fatras et al. (2021), under the assumption that all units, including the outlier δ_{x′}, share the same mass.
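The behavior of the O(·) term can be checked numerically. The sketch below evaluates the factor f(θ) = 1/√θ + 1/√(1−θ) from the proof and confirms that it is minimized by a balanced batch:

```python
import numpy as np

def imbalance_factor(theta):
    # The factor (1/sqrt(theta) + 1/sqrt(1 - theta)) appearing in the O(.) term.
    return 1.0 / np.sqrt(theta) + 1.0 / np.sqrt(1.0 - theta)

thetas = np.linspace(0.05, 0.95, 19)
vals = imbalance_factor(thetas)
# The sampling-complexity factor is smallest for a balanced batch (theta = 0.5)
# and blows up as the treated ratio approaches 0 or 1.
print(round(float(thetas[np.argmin(vals)]), 2))  # -> 0.5
```

This mirrors Corollary A.1: larger batches and a balanced treated ratio shrink the sampling term.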

B DISCRETE OPTIMAL TRANSPORT

This section presents the definitions and algorithms for computing optimal transport between discrete measures. We omit the case of general measures as it is beyond the scope of this work; readers interested in this topic are referred to Peyré & Cuturi (2019) and Cuturi (2013) for details.

B.1 PROBLEM FORMULATION

Optimal transport originates from the formulation of Monge (1781). We provide an equivalent interpretation for discrete measures. Consider n warehouses and m factories, where the i-th warehouse holds a_i units of material and the j-th factory requires b_j units. We seek a mapping from warehouses to factories satisfying: (1) all materials in the warehouses are transported; (2) all requirements of the factories are satisfied; (3) the materials of one warehouse are transported to no more than one factory (the mapping constraint). Every feasible mapping is associated with a global cost, obtained by aggregating the local cost of moving a unit of material from the i-th warehouse to the j-th factory. The objective, finding a feasible mapping that minimizes the transport cost, is formulated in Definition B.1.

Definition B.1. For discrete measures α = ∑_{i=1}^n a_i δ_{x_i} and β = ∑_{j=1}^m b_j δ_{x_j}, the Monge problem seeks a mapping T : {x_i}_{i=1}^n → {x_j}_{j=1}^m that associates each point x_i with a single point x_j and pushes the mass of α to β; that is, ∀j ∈ {1, …, m}, b_j = ∑_{i: T(x_i)=x_j} a_i. This mass-preserving constraint is abbreviated as T_♯α = β. The mapping should also minimize the transport cost c(x, y). To this end, the Monge problem for discrete measures is formulated as:

min_{T: T_♯α=β} ∑_i c(x_i, T(x_i)).

This formulation can further be used to compare two probability measures, where ∑_i a_i = ∑_j b_j = 1. However, Monge's formulation guarantees neither the existence nor the uniqueness of solutions (Peyré & Cuturi, 2019). Kantorovich (2006) relaxed the mapping constraint, allowing the mass at a source point to be split across target points, which yields the linear program:

W(α, β) := min_{π ∈ Π(α,β)} ⟨D, π⟩,   Π(α, β) := {π ∈ R_+^{n×m} : π 1_m = a, π^T 1_n = b},

where W(α, β) ∈ R is the Wasserstein discrepancy between α and β; D ∈ R_+^{n×m} is the unit-wise distance between α and β; a and b indicate the mass of the units in α and β; and Π is the set of feasible transport plans, which ensures that the mass-preserving constraint holds.

Algorithm 1 Sinkhorn Algorithm
Input: discrete measures α = ∑_{i=1}^n a_i δ_{x_i} and β = ∑_{j=1}^m b_j δ_{x_j}; distance matrix D_ij = ∥x_i − x_j∥_2².
Parameter: ϵ: strength of the entropic regularization; ℓ_max: maximum iterations.
Output: π_ϵ: the entropic-regularized optimal transport matrix.
1: K ← exp(−D/ϵ).
2: u ← 1_n, v ← 1_m, ℓ ← 1.
3: while ℓ < ℓ_max do
4:   u ← a/(Kv).
5:   v ← b/(K^T u).
6:   ℓ ← ℓ + 1.
7: π_ϵ ← diag(u) K diag(v).
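Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative implementation under toy settings, not the paper's released code:

```python
import numpy as np

def sinkhorn(a, b, D, eps=1.0, n_iters=5000):
    """Entropic-regularized OT plan between histograms a (n,) and b (m,)
    with cost matrix D (n, m), following Algorithm 1."""
    K = np.exp(-D / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                    # step 4
        v = b / (K.T @ u)                  # step 5
    return u[:, None] * K * v[None, :]     # diag(u) K diag(v)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 2)), rng.normal(size=(5, 2))
D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
a, b = np.full(4, 1 / 4), np.full(5, 1 / 5)
P = sinkhorn(a, b, D)
# After convergence, both mass-preserving marginal constraints hold.
print(np.allclose(P.sum(axis=1), a), np.allclose(P.sum(axis=0), b))
```

Note that only matrix-vector products appear in the loop, which is why the algorithm is well suited to GPU acceleration, as discussed in Section B.2.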

B.2 SINKHORN DISCREPANCY AND ALGORITHM

Exact solutions of the Kantorovich problem incur high computational costs; the interior-point and network-simplex methods, for example, have a complexity of O(n³ log n) (Pele & Werman, 2009). A shortcut is to add an entropic regularizer:

W_ϵ(α, β) := ⟨D, π_ϵ⟩,   π_ϵ := argmin_{π ∈ Π(α,β)} ⟨D, π⟩ − ϵH(π),   H(π) := −∑_{i,j} π_ij (log(π_ij) − 1),   (38)

which makes the problem ϵ-convex and solvable with the Sinkhorn algorithm (Cuturi, 2013) at a lower complexity of O(n²/ϵ²). Besides, the Sinkhorn algorithm consists solely of matrix-vector products, which makes it well suited to GPU acceleration. Specifically, let f ∈ R^n and g ∈ R^m be the Lagrangian multipliers; the Lagrangian of (38) is:

Φ(π, f, g) = ⟨D, π⟩ − ϵH(π) − ⟨f, π 1_m − a⟩ − ⟨g, π^T 1_n − b⟩.   (39)

According to the first-order condition of the constrained optimization problem, we have:

∂Φ(π, f, g)/∂π_ij = D_ij + ϵ log(π_ij) − f_i − g_j = 0,

or equivalently, the optimal transport matrix π_ϵ should satisfy:

π_{ϵ,ij} = exp(f_i/ϵ) · exp(−D_ij/ϵ) · exp(g_j/ϵ).

Let u_i := exp(f_i/ϵ), v_j := exp(g_j/ϵ), and K_ij := exp(−D_ij/ϵ); then π_ϵ = diag(u) K diag(v). The transport matrix should also satisfy the mass-preserving constraints:

diag(u) K diag(v) 1_m = a,   diag(v) K^⊺ diag(u) 1_n = b,

or equivalently, letting ⊙ denote the entry-wise multiplication of vectors:

u ⊙ (Kv) = a   and   v ⊙ (K^T u) = b.   (43)

Equation (43) is known as the matrix scaling problem. An intuitive approach is to solve it iteratively:

u^{(ℓ+1)} = a / (K v^{(ℓ)})   and   v^{(ℓ+1)} = b / (K^T u^{(ℓ+1)}),   (44)

which is the critical step of the Sinkhorn algorithm in Algorithm 1. The optimal transport matrix π_ϵ, treated as a constant matrix, further induces the Sinkhorn discrepancy W_ϵ following (38).
Since D is differentiable with respect to α and β, it is feasible to minimize W_ϵ by adjusting the generation process of α and β, i.e., the representation mapping in Definition A.2, with gradient-based optimizers.

Algorithm 2 Generalized Sinkhorn Algorithm for Unbalanced Optimal Transport
Input: discrete measures α = ∑_{i=1}^n a_i δ_{x_i} and β = ∑_{j=1}^m b_j δ_{x_j}; distance matrix D_ij = ∥x_i − x_j∥_2².
Parameter: ϵ: strength of the entropic regularizer; κ: strength of mass preservation; ℓ_max: maximum iterations.
Output: π_{ϵ,κ}: the entropic-regularized unbalanced optimal transport matrix.
1: K ← exp(−D/ϵ).
2: f ← 0_n, g ← 0_m, ℓ ← 1.
3: while ℓ < ℓ_max do
4:   u ← exp(f/ϵ), v ← exp(g/ϵ).
5:   π ← diag(u) K diag(v).
6:   a′ ← π 1_m, b′ ← π^T 1_n.
7:   if ℓ mod 2 = 0 then
8:     f ← [f/ϵ + log(a) − log(a′)] · ϵκ/(ϵ+κ)
9:   else
10:    g ← [g/ϵ + log(b) − log(b′)] · ϵκ/(ϵ+κ)
11:   ℓ ← ℓ + 1.
12: π_{ϵ,κ} ← diag(u) K diag(v).
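A log-domain NumPy sketch of Algorithm 2 (again illustrative, not the released implementation; here both dual updates run per loop rather than alternating, which is equivalent) demonstrates the mass-destruction behavior that motivates RMPR: a far-away outlier keeps almost none of its mass instead of being forcibly matched:

```python
import numpy as np

def unbalanced_sinkhorn(a, b, D, eps=0.5, kappa=1.0, n_iters=500):
    """Generalized Sinkhorn (Algorithm 2): the hard marginal constraints
    are replaced by KL penalties of strength kappa."""
    f, g = np.zeros_like(a), np.zeros_like(b)
    scale = eps * kappa / (eps + kappa)
    for _ in range(n_iters):
        a_cur = np.exp((f[:, None] + g[None, :] - D) / eps).sum(axis=1)
        f = (f / eps + np.log(a) - np.log(a_cur)) * scale      # step 8
        b_cur = np.exp((f[:, None] + g[None, :] - D) / eps).sum(axis=0)
        g = (g / eps + np.log(b) - np.log(b_cur)) * scale      # step 10
    return np.exp((f[:, None] + g[None, :] - D) / eps)

x = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.]])  # last point is an outlier
y = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
D = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a = b = np.full(4, 1 / 4)
P = unbalanced_sinkhorn(a, b, D)
# The outlier's row keeps almost no mass; regular units keep most of theirs.
print(P.sum(axis=1).round(3))
```

This is exactly Theorem A.3's robustness property in action: the induced discrepancy changes little when a mini-batch outlier is injected.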

B.3 UNBALANCED OPTIMAL TRANSPORT AND GENERALIZED SINKHORN

We reported the mini-batch sampling effect (MSE) issue of W_ϵ in Section 3.2 and attributed it to the mass-preserving constraint in (38). An intuitive way to mitigate MSE is to relax the marginal constraints and allow for the creation and destruction of mass. To this end, RMPR is proposed in Definition B.3, which replaces the hard marginal constraints with soft penalties.

Definition B.3. For empirical distributions α and β with n and m units, respectively, unbalanced optimal transport seeks a transport plan at minimum cost:

W_{ϵ,κ}(α, β) := ⟨D, π⟩,   π := argmin_π ⟨D, π⟩ − ϵH(π) + κ (KL(π 1_m ∥ a) + KL(π^T 1_n ∥ b)),   (45)

where D ∈ R_+^{n×m} is the unit-wise distance, and a and b indicate the mass of the units in α and β.

The unbalanced optimal transport problem in Definition B.3 has a structure similar to (38) and can thus be solved with a generalized Sinkhorn algorithm (Chizat et al., 2018). The derivation starts from the Fenchel-Legendre dual form of (45):

max_{f ∈ R^n, g ∈ R^m} −F*(−f) − G*(−g) − ϵ ∑_{i,j} exp((f_i + g_j − D_ij)/ϵ),
F*(f) = max_{z ∈ R^n} z^⊺ f − κ KL(z ∥ a) = κ ⟨e^{f/κ}, a⟩ − κ a^⊺ 1_n,
G*(g) = max_{z ∈ R^m} z^⊺ g − κ KL(z ∥ b) = κ ⟨e^{g/κ}, b⟩ − κ b^⊺ 1_m,

where F*(·) and G*(·) are the Legendre transformations of the KL divergence. Ignoring the constant terms, we obtain the equivalent optimization problem:

min_{f ∈ R^n, g ∈ R^m} ϵ ∑_{i=1}^n ∑_{j=1}^m exp((f_i + g_j − D_ij)/ϵ) + κ ⟨e^{−f/κ}, a⟩ + κ ⟨e^{−g/κ}, b⟩.   (47)

According to the first-order condition, the gradient at the minimizer of (47) should be zero. As such, fixing g^ℓ, the updated f^{ℓ+1} ought to satisfy:

exp(f_i^{ℓ+1}/ϵ) ∑_{j=1}^m exp((g_j^ℓ − D_ij)/ϵ) = exp(−f_i^{ℓ+1}/κ) a_i.

Multiplying both sides by exp(f_i^ℓ/ϵ) yields:

exp(f_i^{ℓ+1}/ϵ) a′_i = exp(f_i^ℓ/ϵ) exp(−f_i^{ℓ+1}/κ) a_i,   (49)

where a′ := π 1_m with π_ij := exp((f_i^ℓ + g_j^ℓ − D_ij)/ϵ). Similarly, fixing f we have for g^{ℓ+1}:

exp(g_j^{ℓ+1}/ϵ) b′_j = exp(g_j^ℓ/ϵ) exp(−g_j^{ℓ+1}/κ) b_j,   (50)

where b′ := π^T 1_n. Equations (49) and (50) constitute the critical iteration steps of the generalized Sinkhorn algorithm (Chizat et al., 2018), which we formulate in Algorithm 2. The transport matrix π_{ϵ,κ} further induces the generalized Sinkhorn discrepancy W_{ϵ,κ} in Definition B.3. Since D is differentiable with respect to α and β, it is feasible to minimize W_{ϵ,κ} by adjusting the generation process of α and β, i.e., the representation mapping in Definition A.2, with gradient-based optimizers.

Algorithm 3 Calculating the learning objective of ESCFR at a mini-batch level
Input: a mini-batch with n treated units {(x_i, y_i)}_{i=1}^n and m untreated units {(x_j, y_j)}_{j=1}^m.
1: {r_i}_{i=1}^n ← {ψ(x_i)}_{i=1}^n, {r_j}_{j=1}^m ← {ψ(x_j)}_{j=1}^m.
2: {ŷ_i}_{i=1}^n ← {ϕ(r_i, 1)}_{i=1}^n, {ŷ_j}_{j=1}^m ← {ϕ(r_j, 0)}_{j=1}^m.
3: {ỹ_i}_{i=1}^n ← {ϕ(r_i, 0)}_{i=1}^n, {ỹ_j}_{j=1}^m ← {ϕ(r_j, 1)}_{j=1}^m.
4: D_ij^γ ← ∥r_i − r_j∥_2² + γ · ∥y_i − ỹ_j∥_2² + γ · ∥y_j − ỹ_i∥_2².
5: D_stop^γ ← stopgradient(D^γ).
6: π^{ϵ,κ,γ} ← Algorithm 2(α = {r_i}_{i=1}^n, β = {r_j}_{j=1}^m, D = D_stop^γ).
7: L_F(ψ, ϕ) ← (1/n) ∑_{i=1}^n ∥ŷ_i − y_i∥_2² + (1/m) ∑_{j=1}^m ∥ŷ_j − y_j∥_2².
8: L_D^{ϵ,κ,γ}(ψ) ← ⟨D^γ, π^{ϵ,κ,γ}⟩.
9: L_ESCFR^{ϵ,κ,γ,λ} ← L_F(ψ, ϕ) + λ · L_D^{ϵ,κ,γ}(ψ).

Table 3: Running time (mean±std) in seconds of Algorithms 1-2 over 100 runs.

B.4 OPTIMIZATION OF ENTIRE SPACE COUNTERFACTUAL REGRESSION

Algorithm 3 shows how to calculate the learning objective at a mini-batch level. Specifically, we first compute the factual outcome estimates (step 2), the counterfactual outcome estimates (step 3), and the unit-wise distance matrix with PFOR (step 4). We then fix the gradient of the distance matrix (step 5) and compute the transport matrix with Algorithm 2 (step 6). Finally, we compute the factual outcome estimation error (step 7) and the distribution discrepancy (step 8), and aggregate them into the learning objective of ESCFR (step 9). According to Section B.3, the learning objective is differentiable with respect to ψ and ϕ and can thus be optimized end-to-end with stochastic gradient methods.
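A self-contained NumPy sketch of the steps above follows, with hypothetical linear maps standing in for the networks ψ, ϕ_1, and ϕ_0 and an inlined log-domain generalized Sinkhorn solver. It illustrates the computation flow only and is not the paper's released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for psi and the two outcome heads.
W_psi = rng.normal(size=(5, 3))
w_phi1, w_phi0 = rng.normal(size=3), rng.normal(size=3)
psi = lambda x: x @ W_psi
phi = lambda r, t: r @ (w_phi1 if t == 1 else w_phi0)

def unbalanced_sinkhorn(a, b, D, eps=1.0, kappa=1.0, n_iters=300):
    # Algorithm 2 in the log domain (see Appendix B.3).
    f, g = np.zeros_like(a), np.zeros_like(b)
    scale = eps * kappa / (eps + kappa)
    for _ in range(n_iters):
        a_cur = np.exp((f[:, None] + g[None, :] - D) / eps).sum(axis=1)
        f = (f / eps + np.log(a) - np.log(a_cur)) * scale
        b_cur = np.exp((f[:, None] + g[None, :] - D) / eps).sum(axis=0)
        g = (g / eps + np.log(b) - np.log(b_cur)) * scale
    return np.exp((f[:, None] + g[None, :] - D) / eps)

def escfr_objective(x1, y1, x0, y0, gamma=0.5, lam=1.0):
    """Steps 1-9 of Algorithm 3 on one mini-batch (treated x1/y1, untreated x0/y0)."""
    r1, r0 = psi(x1), psi(x0)                              # step 1
    yhat1, yhat0 = phi(r1, 1), phi(r0, 0)                  # step 2: factual
    ytil1, ytil0 = phi(r1, 0), phi(r0, 1)                  # step 3: counterfactual
    D = ((r1[:, None, :] - r0[None, :, :]) ** 2).sum(-1)   # step 4: PFOR-calibrated
    D = D + gamma * (y1[:, None] - ytil0[None, :]) ** 2 \
          + gamma * (y0[None, :] - ytil1[:, None]) ** 2
    a = np.full(len(x1), 1 / len(x1))
    b = np.full(len(x0), 1 / len(x0))
    pi = unbalanced_sinkhorn(a, b, D)                      # steps 5-6 (D held fixed)
    L_F = ((yhat1 - y1) ** 2).mean() + ((yhat0 - y0) ** 2).mean()   # step 7
    L_D = (D * pi).sum()                                   # step 8
    return L_F + lam * L_D                                 # step 9

x1, x0 = 0.3 * rng.normal(size=(6, 5)), 0.3 * rng.normal(size=(8, 5))
y1, y0 = rng.normal(size=6), rng.normal(size=8)
loss = escfr_objective(x1, y1, x0, y0)
print(np.isfinite(loss) and loss > 0)
```

In the actual model, the "stop-gradient" of step 5 means the transport plan is treated as a constant when backpropagating, so gradients flow only through D^γ and the factual loss.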

B.5 COMPLEXITY ANALYSIS

A primary concern is the overall complexity of solving discrete optimal transport problems. Exact algorithms, e.g., the interior-point and network-simplex methods, suffer from a high computational cost of O(n³ log n) (Pele & Werman, 2009). An entropic regularizer is thus introduced in (5), making the problem solvable by the Sinkhorn algorithm (Cuturi, 2013) in Algorithm 1. The complexity was shown to be O(n²/ϵ³) by Altschuler et al. (2017) in terms of the absolute error of the mass-preservation constraints; Dvurechensky et al. (2018) improved it to O(n²/ϵ²), which can be further accelerated with a greedy algorithm (Lin et al., 2019). Several recent explorations (Blanchet et al., 2018; Jambulapati et al., 2019) have attempted to reduce the complexity further to O(n²/ϵ). The entropic regularization trick also applies to the unbalanced optimal transport problem in RMPR, solved by the Sinkhorn-like procedure in Algorithm 2; Pham et al. (2020) proved that the complexity of Algorithm 2 is Õ(n²/ϵ). Table 3 reports the practical running times for commonly used batch settings. In general, the computational cost of optimal transport is not a concern at the mini-batch level. Note that enlarging ϵ speeds up the computation but biases the resulting transport matrix, hindering transport performance, as per Figure 5. In addition, a large relaxation parameter κ makes the computed result closer to that of the Sinkhorn algorithm yet requires significantly more iterations, which is discussed and mitigated by Séjourné et al. (2022).

C REPRODUCTION DETAILS

C.1 DATASETS

We conduct experiments on two semi-synthetic benchmarks to validate our models. For the IHDP benchmark, we report the results over 10 simulation realizations following Liuyi et al. (2018). However, its limited size (747 observations and 25 covariates) makes the results highly volatile. As such, we mainly validate the models on the ACIC benchmark, released by the ACIC-2016 competition. Since the scale of the ACIC benchmark (4802 observations and 58 covariates) is much larger than that of IHDP, we mainly perform ablation studies on it to obtain more reliable results. All datasets are randomly shuffled and partitioned in a 0.7:0.15:0.15 ratio for training, validation, and test, where we maintain the same ratio of treated units in all three splits to avoid numerical unreliability in the validation and test phases. We find that these datasets are overly easy for the model to fit because they are semi-synthetic. To increase the distinguishability of the results, we omit preprocessing strategies such as min-max scaling, which increases the difficulty of the learning task.
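The split scheme described above (0.7:0.15:0.15 with the treated ratio held constant across splits) can be sketched as follows; the helper name and interface are ours, for illustration:

```python
import numpy as np

def stratified_split(t, ratios=(0.7, 0.15, 0.15), seed=0):
    """Partition unit indices into train/valid/test while keeping the
    ratio of treated units identical across the three splits."""
    rng = np.random.default_rng(seed)
    splits = ([], [], [])
    # Shuffle and cut the treated and untreated groups separately.
    for group in (np.where(t == 1)[0], np.where(t == 0)[0]):
        perm = rng.permutation(group)
        n = len(perm)
        cut1 = round(ratios[0] * n)
        cut2 = round((ratios[0] + ratios[1]) * n)
        splits[0].extend(perm[:cut1])
        splits[1].extend(perm[cut1:cut2])
        splits[2].extend(perm[cut2:])
    return [np.array(s) for s in splits]

t = np.array([1] * 40 + [0] * 60)   # a toy cohort: 40% treated
train, valid, test = stratified_split(t)
print(t[train].mean(), t[valid].mean(), t[test].mean())  # -> 0.4 0.4 0.4
```

Stratifying by treatment keeps the mini-batch treated ratio θ stable across splits, which also matters for the variance analysis in Corollary A.1.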

C.2 BASELINES

We compare the proposed method with baselines based on statistical estimators (Künzel et al., 2019; Shalit et al., 2017), matching estimators (Rosenbaum & Rubin, 1983b; Wager & Athey, 2018; Crump et al., 2008), and representation-based estimators (Johansson et al., 2016; Shalit et al., 2017). We implement all baselines ourselves, largely built upon PyTorch for neural network models, scikit-learn for statistical models, and EconML for tree and forest models. These implementations will be open-sourced.

C.3 METRICS

Following existing works (Yao et al., 2019; Liuyi et al., 2018), the Precision in Estimation of Heterogeneous Effect (PEHE) is primarily used for performance evaluation:

ϵ_PEHE(ψ, ϕ) = ∫_𝒳 (τ̂_{ψ,ϕ}(x) − τ(x))² P(x) dx.

However, the PEHE is unavailable during the model selection phase because the counterfactual outcomes are missing. A shortcut adopted by Liuyi et al. (2018) for model selection is the root mean squared error of factual outcome estimates (RMSE_F), which can be evaluated in the absence of counterfactual outcomes. However, owing to the treatment selection bias, RMSE_F is unreliable: it does not consider the precision of counterfactual estimation and thus cannot effectively evaluate the quality of treatment effect estimation. The Area Under the Uplift Curve (AUUC) (Betlei et al., 2021) evaluates a model's ability to rank units by potential treatment benefit. It is more feasible than the PEHE because it can be calculated without counterfactual outcomes, and more reliable than the factual RMSE because it partially reflects a model's counterfactual ranking ability. It has therefore been the primary selection standard for practitioners at Criteo (Betlei et al., 2021), Alibaba (Ke et al., 2021), and Tencent (He et al., 2022), where engineers routinely fine-tune their causal inference models against AUUC and decide accordingly whether to deploy the models to online traffic. As such, we use AUUC as the model selection criterion. Moreover, we report the within-sample results on the training dataset and the out-of-sample results on the test dataset, where the factual outcome is available in the within-sample case following Shalit et al. (2017). We compare the fidelity of RMSE_F and AUUC on the evaluation set against the out-of-sample PEHE on the test set in Figure 7. The first observation is the weak correlation between RMSE_F on the validation set and the out-of-sample PEHE on the test set.
In contrast, the out-of-sample AUUC on the evaluation set better reflects the variation of the PEHE. Another observation is that the within-sample AUUC is not a good criterion for model selection, as a higher within-sample AUUC corresponds to a higher out-of-sample PEHE. This is unsurprising, since better within-sample performance does not necessarily translate into better out-of-sample performance.
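For concreteness, here is a minimal sketch of the two metrics. The PEHE follows the definition above; the AUUC shown is one simple variant among several in the literature (exact definitions vary across the cited works), so treat it as illustrative:

```python
import numpy as np

def pehe(tau_hat, tau):
    """Root of the mean squared error between estimated and true ITEs."""
    return float(np.sqrt(np.mean((tau_hat - tau) ** 2)))

def auuc(score, t, y):
    """A simple uplift-curve area: rank units by predicted uplift `score`,
    then average the treated-vs-control outcome gap over all cutoffs."""
    order = np.argsort(-score, kind="stable")
    t, y = t[order], y[order]
    gaps = []
    for k in range(1, len(y) + 1):
        tk, yk = t[:k], y[:k]
        mean_t = yk[tk == 1].mean() if tk.sum() > 0 else 0.0
        mean_c = yk[tk == 0].mean() if (k - tk.sum()) > 0 else 0.0
        gaps.append(mean_t - mean_c)
    return float(np.mean(gaps))

t = np.array([1, 0, 1, 0, 1, 0])
tau = np.array([3., 3., 2., 2., 1., 1.])   # true per-unit effects
y = t * tau                                 # factual outcomes (zero baseline)
print(pehe(tau, tau), auuc(tau, t, y) > auuc(-tau, t, y))  # -> 0.0 True
```

The comparison on the last line shows why AUUC is usable without counterfactuals: a score that ranks units by their true benefit scores higher than a reversed ranking, using factual outcomes only.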

D ADDITIONAL DISCUSSIONS

D.1 ADDITIONAL DISCUSSION FOR STOCHASTIC OPTIMAL TRANSPORT

According to Theorem 3.1, one critical hyperparameter for CFR-WASS and ESCFR is the batch size, which directly affects the variance of stochastic optimal transport in Section 3.1 and thus the performance of both methods. As such, it is necessary to verify whether ESCFR outperforms CFR under different batch sizes. We conduct extensive experiments and summarize the results in Table 4, with the following observations:

• Increasing the batch size over a wide range improves the performance of CFR-WASS and ESCFR. For example, the PEHE of CFR-WASS decreases from 3.114 at b = 32 to 2.932 at b = 128, and the PEHE of ESCFR exhibits a similar pattern. The performance gain is attributed to the decreased variance in (6), which backs up Theorem 3.1.
• By fine-tuning the batch size, we could easily exceed the performance reported in Table 1. However, we did not fine-tune it, as the PEHE is invisible during our hyperparameter tuning process.
• The performance drop at overly large batch sizes stems from the sub-optimal performance of the backbone (TARNet). Due to the limited number of training samples, e.g., 4.8k × 70% units for ACIC and 0.7k × 70% units for IHDP, a large batch size might prevent the model from escaping saddle points (Jin et al., 2017) and sharp minima (Xie et al., 2020), thus deteriorating the accuracy of factual outcome estimation.

D.2 ADDITIONAL DISCUSSION FOR RMPR

Existing methods (Shalit et al., 2017; Johansson et al., 2016; Liuyi et al., 2018) suffer from the mini-batch sampling effect (MSE) issue, as indicated by the two bad cases in Figure 2. RMPR mitigates the MSE issue by relaxing the mass-preserving constraint, and its performance is affected by two critical hyperparameters: the batch size b and the strength of the mass-preserving constraint κ.
On top of the ablation studies, it is necessary to explore the performance of ESCFR under different settings of b and κ, to investigate (1) how RMPR works; (2) the limitations and bottleneck of RMPR; and (3) the robustness of RMPR to hyperparameter settings. The results are presented in Figure 8, and the observations are summarized as follows:

• The optimal value of κ increases with the batch size. For example, the optimal κ is 1.0 at b = 32 and 5.0 at b = 128. This observation partially verifies the working mechanism of RMPR described in Section 3.2. Specifically, at small batch sizes, where sampling outliers dominate the sampled batches, a small κ effectively relaxes the mass-preserving constraint and avoids the damage caused by mini-batch outliers, thereby improving performance effectively and robustly. At large batch sizes, the noise from sampling outliers is reduced, and it is reasonable to increase κ to match more units and obtain more accurate Wasserstein distance estimates.
• Even at large batch sizes, an oversized κ, e.g., κ ≥ 10, does not perform well. Although the effect of sampling outliers is reduced, patterns such as outcome imbalance are present at all batch sizes, resulting in false matching under a large mass-preserving strength κ, which might be the primary bottleneck of RMPR.
• Hyperparameter tuning is not the reason ESCFR works well, since all ESCFR implementations outperform the strongest baseline CFR-WASS (κ = ∞) at all batch sizes, mostly with statistical significance. This is further supported by the extensive ablation study in Section 4.3 and the parameter study in Section 4.5.

In summary, it is necessary to relax the mass-preserving constraint under all batch-size settings, which strongly verifies the effectiveness of RMPR in Section 3.

D.3 ADDITIONAL DISCUSSION FOR PFOR

Under the unconfoundedness assumption A.1, aligning the distributions of treatment groups in the representation space effectively handles the treatment selection bias.
However, Assumption A.1 is usually violated in practice, which invalidates this approach because the backdoor path from the unobserved confounder X′ to T is not blocked. According to the designed causal graph in Figure 3(b), the factors associated with the outcome Y comprise the observed confounders X, the treatment T, and the unobserved confounders X′. Therefore, given balanced X and identical T, the only variable reflecting the variation of X′ is the outcome Y. As such, inspired by the joint distribution transport technique (see Courty et al., 2017a), PFOR calibrates the unit-wise distance D with the potential outcomes in (12). The underlying regularization is: units with similar (observed and unobserved) confounders should have similar potential outcomes. Equivalently, for a pair of units with similar observed covariates, i.e., ∥r_i − r_j∥_2 ≈ 0, if their potential outcomes under the same treatment t ∈ {0, 1} differ significantly, i.e., ∥y_i^t − y_j^t∥ ≫ 0, their unobserved confounders should also differ significantly. It is therefore reasonable to utilize the difference of outcomes to calibrate the unobserved confounding effect.

Assumption D.1 (Monotonicity). For all observed covariates X = x in the population of interest, let T = t and X′ = x′ be the treatment assignment and unobserved confounders, respectively; then E[Y | X = x, X′ = x′, T = t] is monotonically increasing or decreasing with respect to x′.

Advantages. The advantages of PFOR can be further interpreted as follows.

• From a statistical perspective, PFOR encourages units with similar outcomes to share similar representations. This is a valid prior that inspires many learning algorithms, e.g., K-nearest neighbors and Gaussian processes (see Williams & Rasmussen, 2006). As an effective statistical regularizer, PFOR also works in the absence of unobserved confounders, especially on small datasets.
• From a domain adaptation perspective, the vanilla Sinkhorn aligns the distributions P_ψ^{T=1}(r) and P_ψ^{T=0}(r), where r is the learned representation in Definition A.4. PFOR further aligns the transition probabilities P^{T=1}(Y(T = t) | r) and P^{T=0}(Y(T = t) | r) for t = 0, 1. The discrepancy between the transition probabilities can be attributed to the unobserved confounders, which can be viewed as parameters of the transition probabilities (Courty et al., 2017a). As such, it is feasible to align the unobserved confounders by aligning the transition probabilities.

Toy example. Let the ground truth be Y := √(R_1² + R_2² + X′²), where T is omitted as we consider only one group, and R_1 and R_2 are the representations of observed confounders that have been aligned with the Sinkhorn algorithm. Let the unobserved X′ = 0 for control units and X′ = 1 for treated units, which makes X′ an unobserved confounder, as it is related to Y and differs between groups. As shown in Figure 9(a), given balanced R_1 and R_2, the variation of Y reveals that of X′. It is therefore reasonable to employ Y to calibrate the unit-wise distance D, which ignores X′.

Synthetic labels. PFOR remains effective on semi-synthetic data, where outcomes are synthesized from the covariates and treatment assignments. One source of hidden confounders in such data is the information loss from the raw data space to the representation space, where not all valuable information (e.g., confounders) is extracted and preserved. Besides, the improvement could also come from the statistical regularization that encourages units with similar outcomes to share similar representations, which is an effective prior according to K-nearest-neighbor methods.

Limitations. PFOR fails for confounders that add a constant effect to all units. Specifically, for an unobserved confounder X′ and treatment assignment t ∈ {0, 1}, if E[Y | X, X′ = x_1, T = t] = E[Y | X, X′ = x_2, T = t] for x_1 ≠ x_2, PFOR cannot eliminate the confounding effect of X′; examples can be found in Figure 9(c). Nevertheless, in real scenarios it is rare for different values of X′ to add only a constant effect to the outcome (see Sofer et al., 2016; Zheng et al., 2021; Ogburn & VanderWeele, 2012), making PFOR effective in a wide range of application scenarios. This limitation is formalized as Assumption D.1, where the outcome should be monotonically increasing or decreasing with the unobserved confounders given the observed confounders and treatment assignment, as shown in Figure 9(b).
Notably, it is a commonly used assumption in confounder analysis (Sofer et al., 2016; Zheng et al., 2021). Besides, this assumption is often plausible, at least approximately, conditional on T = t (Zheng et al., 2021). For example, it naturally holds for binary confounders, and it generally holds in applications such as epidemiology (Ogburn & VanderWeele, 2012). Finally, this assumption is imposed only on the hidden confounder X′ following Zheng et al. (2021), which weakens Assumption D.1 further.
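The toy example above can be checked numerically. Under the stated setup (a hypothetical simulation with Y := √(R_1² + R_2² + X′²) and X′ differing across groups), the outcome gap between groups with perfectly balanced representations never vanishes, which is exactly the signal PFOR exploits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Perfectly balanced, aligned representations shared by both groups,
# as if the Sinkhorn alignment had already succeeded.
r = rng.normal(size=(200, 2))
xp_treated, xp_control = 1.0, 0.0   # the unobserved confounder X'

y_treated = np.sqrt(r[:, 0] ** 2 + r[:, 1] ** 2 + xp_treated ** 2)
y_control = np.sqrt(r[:, 0] ** 2 + r[:, 1] ** 2 + xp_control ** 2)

# With representations balanced, the residual outcome gap reveals X',
# so calibrating the unit-wise distance with outcomes surfaces it.
print(bool((y_treated - y_control).min() > 0))  # -> True
```

Conversely, if X′ merely added the same constant to every unit's outcome within a treatment arm, the gap would carry no per-unit information, matching the stated limitation.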

D.4 ADDITIONAL DISCUSSION FOR PERFORMANCE IMPROVEMENT

In practice, ESCFR serves as an effective regularizer on TARNet. To further investigate its regularization effects, the RMSE for estimating factual and counterfactual outcomes on the training and test sets is reported in Table 5. We find that the effectiveness of ESCFR comes from three sources.

D.5 ADDITIONAL COMPARISON

Baselines. The first additional baseline is DRCFR (Hassanpour & Greiner, 2020). It first decomposes the latent space into instrumental variables, confounding variables, and adjustment variables, and then uses the MMD distance to align the distribution of the adjustment variables. The second additional baseline is MIM-DRCFR (Cheng et al., 2022). As a state-of-the-art approach as recent as 2022, it augments DRCFR with three orthogonal constraints and uses the Wasserstein distance to align the distribution of adjustment variables. We highlight that neither method considers how to mitigate the mini-batch sampling effect or the unobserved confounder effect, the two key problems in representation-based causal inference; their contributions are therefore quite different from our study.

Settings. Since the full implementations of both baselines are not released, we reproduce them largely from scratch. Nevertheless, DRCFR released its model structure based on TensorFlow 1.x, which provides an important guideline. Experiments are conducted on the ACIC benchmark. Parameters shared with CFR are set to the values used by CFR and ESCFR for fairness; these shared parameters mainly include the learning rate, weight decay, batch size, and the strength of distribution alignment. Notably, we set the batch size to 64 instead of the 32 used in Table 1, to highlight the advantages of these representation-based baselines. In terms of model structure, the shared-bottom and factor-specific disentangled representation module consists of a shared fully connected layer (60 units) followed by three parallel fully connected layers (30 units per layer).
The outcome prediction modules consist of two fully connected layers (60 units per layer), following TARNet, CFR, and ESCFR. Empirically, this is a strong setup with competitive performance and a number of learnable parameters similar to ESCFR, ensuring a fair comparison.

Results. As shown in Table 6, the additional baselines show more stable performance (less variance) and do outperform CFR-WASS in terms of out-of-sample PEHE. However, ESCFR outperforms the additional baselines significantly on most metrics. The only exception, in-sample AUUC, has been analyzed in Section 4.
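For concreteness, the disentangled representation module described above can be sketched in plain numpy. The layer widths follow the text; the covariate dimension `d_x`, the initialization, and the activation are illustrative placeholders rather than the reproduced configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(d_in, d_out):
    # One fully connected layer (He-style init chosen for illustration).
    return rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out)

relu = lambda z: np.maximum(z, 0.0)

d_x = 25                                   # covariate dimension (placeholder)
w_s, b_s = dense(d_x, 60)                  # shared bottom: 60 units
heads = [dense(60, 30) for _ in range(3)]  # instrument / confounder / adjustment

x = rng.normal(size=(8, d_x))              # a mini-batch of 8 units
h = relu(x @ w_s + b_s)
gamma, delta, upsilon = [relu(h @ w + b) for w, b in heads]
print(gamma.shape, delta.shape, upsilon.shape)
```

Each factor-specific head thus produces a 30-dimensional representation from the shared 60-unit bottleneck, keeping the parameter count close to the ESCFR backbone.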

D.6 ADDITIONAL RESULTS FOR ABLATION STUDY

In this section, we provide additional results on the hyperparameter study, as it reflects how different components of ESCFR affect the performance. Specifically, we report the ranking performance under different settings in Figure 10 and Table 7. The out-of-sample AUUC exhibits variation patterns converse to those of PEHE, which is consistent with our expectations. We add the estimation error of average treatment effect estimation for the entire population (ε_ATE) and the treated population (ε_ATT) in Figure 11, Figure 12 and Table 8, where PFOR significantly reduces both average estimation errors. Finally, we report the root mean squared error of factual outcomes (RMSE_F) and counterfactual outcomes (RMSE_CF). Overall, all components of ESCFR reduce RMSE_CF significantly. One interesting observation is that a drop in RMSE_CF always comes with a drop in RMSE_F. This shows that distribution adjustment does not sacrifice the performance of the factual outcome estimators, but provides prompting information that improves both factual and counterfactual outcome estimators.

This section reviews the fundamental assumptions in causality and illustrates their relationship to ESCFR.

• The positivity assumption A.3 implies that the treated and untreated groups should contain overlapping units. The stochastic optimal transport in Section 3.1 seeks to achieve this in the latent representation space; however, the MSE issue leads to outcome imbalance and sampled outliers at the mini-batch level. In this case, pulling all samples into a common overlapping region leads to incorrect matches and thus misleads the update of the representation mapping ψ. RMPR adaptively matches and aligns the units that are close to the overlapping region while ignoring outliers, which reduces incorrect matches and prevents biased updates of ψ.

• PFOR is closely related to the unconfoundedness assumption A.1.
On the basis of the vanilla Sinkhorn in Section 3.1, it further aligns the conditional distributions P^{T=1}(Y(t) | r) and P^{T=0}(Y(t) | r). That is, for a pair of units from different treatment groups, if they share similar confounders r, their potential outcomes, i.e., Y_0, Y_1, should also be similar. In other words, the potential outcomes are independent of the specific treatment assignment given the confounders r obtained with the assistance of PFOR, i.e., (Y_0, Y_1) ⊥⊥ T | r, which is exactly the unconfoundedness assumption A.1.
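This calibration idea can be sketched concretely. The exact PFOR objective is defined in Section 3.3; the additive form and the weight `gamma` below are illustrative assumptions, showing how a factual-outcome term can augment the squared-Euclidean unit-wise cost:

```python
import numpy as np

def pfor_cost(r_t, r_c, y_t, y_c, gamma=1.0):
    """Unit-wise transport cost between treated and control mini-batches.

    Squared Euclidean distance in representation space, calibrated by a
    squared factual-outcome difference (a sketch of the PFOR idea; the
    paper's exact regularizer may differ)."""
    d_rep = ((r_t[:, None, :] - r_c[None, :, :]) ** 2).sum(-1)
    d_out = (y_t[:, None] - y_c[None, :]) ** 2
    return d_rep + gamma * d_out

rng = np.random.default_rng(0)
r_t, r_c = rng.normal(size=(5, 3)), rng.normal(size=(6, 3))
y_t, y_c = rng.normal(size=5), rng.normal(size=6)
cost = pfor_cost(r_t, r_c, y_t, y_c, gamma=0.5)
print(cost.shape)  # one cost per treated-control pair
```

With `gamma = 0` the plain representation-space cost is recovered; increasing `gamma` penalizes matching units whose factual outcomes disagree, which is how the outcome calibrates a distance that ignores hidden confounders.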



Notation. We use uppercase letters, e.g., X, to denote a random variable, and lowercase letters, e.g., x, to denote an associated specific value. Letters in calligraphic font, e.g., X, represent the support of the corresponding random variable, and P(·) represents the probability distribution of a random variable, e.g., P(X).

Footnotes.
• We calculate the unit-wise distance with the squared Euclidean metric following Courty et al. (2017b).
• Apart from OT, most prevalent methods (Ma et al., 2022; Liuyi et al., 2018) fail to handle the treatment selection bias as they neglect the MSE issue. The ability to formalize the MSE issue through the mass-preserving constraint is an important advantage of OT over other techniques, as it provides a grip for handling it.
• An exception would be the within-sample AUUC, which is reported over training data and is thus easy to overfit. This metric is not critical as the factual outcomes are typically unavailable in the inference phase. We mainly rely on the out-of-sample AUUC instead to evaluate the ranking performance and perform model selection.
• While there is a risk of symbol reuse, we use ε here to denote sampling error, and ϵ to control the strength of entropic regularization in optimal transport.
• The benchmarks can be downloaded from https://www.fredjo.com/ and https://jenniferhill7.wixsite.com/acic-2016/competition, respectively.
• Most of the experiments in Table 1 were performed with a fixed batch size b = 32, which is selected by the factual estimation performance of TARNet.
• We take a 40% confidence interval to plot error bars to highlight trends. Exact values are given in tables.



Figure 1: Overview of handling treatment selection bias with ESCFR. Red (blue) indicates the treated (untreated) group. (a) Treatment selection bias causes the shift between X_1 and X_0, impeding ϕ_1 and ϕ_0 from generalizing beyond the respective group's properties. Scatters and curves indicate the units and fitted outcome mappings, respectively. (b) ESCFR handles this issue by mapping covariates to an overlapped representation space with R = ψ(X), where ϕ_1 and ϕ_0 are mutually generalizable.

Figure 2: Optimal transport plan (top) and its geometric interpretation (bottom) in three cases, where the connection strength depicts the transported mass. Different colors (vertical positions) indicate different treatments (outcomes).

Figure 3: Causal graphs with (a) and w/o (b) the unconfoundedness assumption. The shaded node indicates the hidden confounder X ′ .

Figure 4: Geometric interpretation of the optimal transport plan with RMPR under the outcome-imbalance (top) and outlier (bottom) settings. The dark area indicates the transported mass of a unit, i.e., the marginal of the transport matrix π. The light area indicates the total mass.

Figure 5: Parameter sensitivity study for critical hyper-parameters of ESCFR

, representation-based methods minimize the group discrepancy in the latent space. Liuyi et al. (2018) and Hassanpour & Greiner (2020) further augment CFR with local similarity and non-confounding factors, respectively. Kallus (2020) and Yoon et al. (2018) propose to balance the distributions of representations with adversarial training. Due to their scalability and avoidance of the high-variance issue, representation-based methods have been predominant for handling the treatment selection bias.

Figure 6: Architecture of meta-learner-based ITE estimators, consisting of inputs (green), outputs (white), shared mappings (yellow), and mappings for treated and untreated units (red and blue, respectively).

follow Definition A.3, and P_ψ^{T=1}(r) and P_ψ^{T=0}(r) follow Definition A.4.

relaxed the mapping constraint by allowing transport from one warehouse to many factories and reformulated the Monge problem as a linear programming problem in Definition B.2.

Definition B.2. For discrete measures α = Σ_{i=1}^{n} a_i δ_{x_i} and β = Σ_{j=1}^{m} b_j δ_{x_j}, the Kantorovich problem aims to find a feasible plan π ∈ R_+^{n×m} which transports α to β at minimum cost:

min_{π ∈ Π(a, b)} Σ_{i,j} C_{ij} π_{ij},  Π(a, b) := {π ∈ R_+^{n×m} : π 1_m = a, π^T 1_n = b},

where C_{ij} is the unit-wise cost of transporting x_i to x_j, and the marginal constraints in Π(a, b) encode mass preservation.
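Since the entropic relaxation of this problem underlies the Sinkhorn discrepancy used throughout the paper, a minimal numpy sketch of the resulting fixed-point iteration may help; the regularization strength and iteration budget below are illustrative choices:

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps=1.0, n_iter=5000, tol=1e-12):
    """Approximate the Kantorovich plan between discrete measures a and b
    under cost C via Sinkhorn iterations on the entropy-regularized problem."""
    K = np.exp(-C / eps)                    # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u_prev = u
        u = a / (K @ v)                     # enforce row marginal a
        v = b / (K.T @ u)                   # enforce column marginal b
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]      # transport plan pi

rng = np.random.default_rng(0)
x, y = rng.normal(size=(4, 2)), rng.normal(size=(5, 2))
a, b = np.full(4, 0.25), np.full(5, 0.2)
C = ((x[:, None] - y[None]) ** 2).sum(-1)   # squared Euclidean cost
pi = sinkhorn_plan(a, b, C)
# The marginals of pi recover a and b: the mass-preserving constraint.
print(np.abs(pi.sum(1) - a).max(), np.abs(pi.sum(0) - b).max())
```

The alternating scalings converge to a plan whose marginals match a and b, which is precisely the hard mass-preserving constraint that RMPR later relaxes.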

Algorithm 3: ESCFR.
Input: covariates of treated units {x_i}_{i=1}^{n} and untreated units {x_j}_{j=1}^{m}; factual outcomes {y_i}_{i=1}^{n} and {y_j}_{j=1}^{m}; representation mapping ψ; outcome mapping ϕ.
Parameters: λ: strength of optimal transport; ϵ: strength of the entropic regularizer; κ: strength of RMPR; γ: strength of PFOR; ℓ_max: max iterations.
Output: L^{ϵ,κ,γ,λ}_{ESCFR}: the learning objective of ESCFR.
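Putting the pieces together, the objective above can be sketched in numpy. The unbalanced-Sinkhorn scaling below is one standard way to implement a relaxed mass-preserving constraint of strength κ, and the simple additive combination of terms is an illustrative assumption; the paper's exact RMPR and weighting are given in Sections 3.2-3.4:

```python
import numpy as np

def relaxed_plan(a, b, C, eps=1.0, kappa=1.0, n_iter=2000):
    """Transport plan with soft marginal constraints: the hard constraints of
    the Kantorovich problem are replaced by KL penalties of strength kappa
    (unbalanced Sinkhorn), so outliers can be assigned less mass."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = kappa / (kappa + eps)  # damping exponent induced by the KL relaxation
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]

def escfr_objective(y_pred, y_true, pi, C, lam=1.0):
    # Factual regression loss plus lambda times the transport cost <pi, C>.
    return np.mean((y_pred - y_true) ** 2) + lam * np.sum(pi * C)

rng = np.random.default_rng(0)
r_t, r_c = rng.normal(size=(5, 3)), rng.normal(size=(6, 3))
C = ((r_t[:, None] - r_c[None]) ** 2).sum(-1)
a, b = np.full(5, 0.2), np.full(6, 1 / 6)
pi = relaxed_plan(a, b, C, kappa=0.5)
loss = escfr_objective(rng.normal(size=5), rng.normal(size=5), pi, C)
print(pi.sum(), loss)  # total mass may deviate from 1 under the relaxation
```

As κ → ∞ the soft penalties approach the hard marginal constraints of the balanced plan, while small κ lets the plan shrink the mass placed on mini-batch outliers.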

Algorithm 2: 0.0050±0.0011, 0.0059±0.0008, 0.0060±0.0011, 0.0112±0.0014, 0.0162±0.0016, 0.1039±0.0033

Figure 7: Fidelity of model-selection criterion to out-of-sample PEHE

Figure 8: PEHE of ESCFR and CFR-WASS (κ = ∞) under different batch sizes.

Figure 9: A diagram showing how PFOR works and its limitations. (a) A toy example of PFOR, where R and X′ indicate the balanced representations and an unobserved confounder, respectively; scatters indicate the empirical distribution of units in the treated and control groups; for solid scatters with balanced R, the colored dashed line indicates the ground-truth outcome Y = √(R_1^2 + R_2^2 + X′^2) in each group, and the black dashed line measures the difference in the unobserved X′. (b) Cases that satisfy Assumption D.1, where the outcome Y is monotonic in the unobserved X′ given the observed confounders in R. (c) Cases that violate Assumption D.1, where Y is non-monotonic in X′.

Figure 10: Parameter sensitivity study for critical hyper-parameters of ESCFR (AUUC).

Figure 12: Parameter sensitivity study for critical hyper-parameters of ESCFR (ε ATT ).

Performance (mean±std)  on the PEHE and AUUC metrics. "*" marks the baseline estimators that ESCFR outperforms significantly at p-value < 0.05 over paired samples t-test.

Ablation study (mean±std) on the ACIC benchmark. "*" marks the variants that ESCFR outperforms significantly at p-value < 0.01 over paired samples t-test.

ESCFR mitigates the MSE issue with RMPR in Section 3.2 and the UCE issue with PFOR in Section 3.3, reducing the out-of-sample PEHE to 2.768 and 2.633, respectively. Finally, ESCFR combines RMPR and PFOR in a unified framework in Section 3.4, reducing the out-of-sample PEHE to 2.316.

4.4 ANALYSIS OF RELAXED MASS-PRESERVING REGULARIZER

Out-of-sample PEHE of ESCFR and important baselines with different batch sizes b.

RMSE of ESCFR and its competitors for factual and counterfactual outcome estimation.

1. Better fitting of the training set: ESCFR achieves the minimum RMSE for factual outcomes on the training set. This improvement is attributed to RMPR, which mitigates the mini-batch sampling effect and thus reduces mismatching.
2. Effective statistical regularization: ESCFR reduces the gap between the RMSE for factual outcomes on the training set and that on the test set by 52.1% compared with TARNet. This is mainly attributed to PFOR, which encourages samples with similar outcomes to share similar representations.
3. Effective counterfactual regularization: ESCFR reduces the gap between the RMSE on the test set for factual outcomes and that for counterfactual outcomes by 55.1% relative to TARNet, and the corresponding gap on the training set by 57.3%. This makes ESCFR robust to the treatment selection bias.

Additional comparison (mean±std) on the ACIC benchmark.

CFR) as a component. Therefore, it would be unfair to compare ESCFR with them, as ESCFR is not equipped with the variable-decomposition technology. As a result, these results are not included in Table

Individual treatment effect estimation performance (mean±std) of ESCFR under different settings.

Average treatment estimation error (mean±std) of ESCFR under different settings.

Outcome estimation precision (mean±std) of ESCFR under different settings.

