SINKHORN DISCREPANCY FOR COUNTERFACTUAL GENERALIZATION

Abstract

Estimating individual treatment effects from observational data is challenging because of treatment selection bias. Most prevalent approaches mitigate this bias by aligning the distributions of different treatment groups in the representation space. However, these approaches overlook two critical problems: (1) mini-batch sampling effects (MSE), where the alignment easily fails due to outcome imbalance or outliers at the mini-batch level; (2) unobserved confounder effects (UCE), where unobserved confounders undermine correct alignment. To tackle these problems, we propose a principled approach named Entire Space CounterFactual Regression (ESCFR), based on a generalized Sinkhorn discrepancy for distribution alignment within the stochastic optimal transport framework. Within this framework, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that ESCFR successfully tackles treatment selection bias and achieves significantly better performance than state-of-the-art methods.

1. INTRODUCTION

Estimating the individual treatment effect (ITE) with randomized controlled trials is a common practice in causal inference and has been widely used in e-commerce (Betlei et al., 2021), education (Cordero et al., 2018), and health care (Schwab et al., 2020). For example, drug developers conduct clinical A/B tests to evaluate drug effects. Although randomized controlled trials are the gold standard for causal inference (Pearl & Mackenzie, 2018), such experiments are often prohibitively expensive. Observational data, which can be acquired without intervention, is therefore a tempting shortcut. For example, drug developers tend to assess drug effects with post-marketing monitoring reports instead of clinical A/B trials. With growing access to observational data, estimating ITE from such data has attracted intense research interest. Estimating ITE from observational data poses two main challenges: (1) missing counterfactuals, i.e., only one factual outcome out of all potential outcomes can be observed; (2) treatment selection bias, i.e., individuals have their own preferences for treatment selection, which makes units in different treatment groups heterogeneous. To handle missing counterfactuals, meta-learners (Künzel et al., 2019) decompose the ITE estimation task into solvable factual outcome estimation subproblems. However, treatment selection bias makes it difficult to generalize the factual outcome estimators trained within the respective treatment groups to the entire population; consequently, the derived ITE estimator is biased. Beginning with counterfactual regression (Shalit et al., 2017) and its revolutionary performance, most prevalent methods handle selection bias by minimizing the distribution discrepancy between groups in the representation space (see Liuyi et al., 2018; Hassanpour & Greiner, 2020; Cheng et al., 2022).
However, two critical issues with these methods have long been neglected, which significantly impedes them from handling treatment selection bias. The first is the mini-batch sampling effect (MSE). Specifically, current representation-based methods (Shalit et al., 2017; Liuyi et al., 2018) compute the distribution discrepancy within mini-batches instead of over the entire data space, making them vulnerable to bad sampling cases. For example, given two aligned distributions, if an outlier appears in a sampled mini-batch, the mini-batch discrepancy will be large even though the underlying distributions match, which makes the training process noisy. The second is the unobserved confounder effect (UCE). Specifically, current approaches directly assume unconfoundedness (Ma et al., 2022), whereas unobserved confounders are widespread in real scenarios and bias the resulting estimators.

Contributions and outline. In this paper, we propose an effective ITE estimator based on optimal transport, Entire Space CounterFactual Regression (ESCFR), which tackles both the MSE and UCE issues with a generalized Sinkhorn discrepancy. Specifically, after the preliminaries in Section 2, we first reformulate the ITE estimation problem as a stochastic optimal transport problem in Section 3.1. We next showcase the MSE issue faced by existing approaches in Section 3.2 and propose a relaxed mass-preserving regularizer to mitigate it. We further investigate the UCE issue in Section 3.3 and propose a proximal factual outcome regularizer to solve it. We finally formulate the architecture and learning objectives of ESCFR in Section 3.4, and report the experimental results in Section 4.
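The mini-batch sampling effect described above can be illustrated with a small toy experiment. The sketch below is not the generalized Sinkhorn discrepancy proposed in this paper; it assumes the standard entropic optimal transport problem with uniform weights and a squared-distance cost on 1-D samples, solved with plain Sinkhorn iterations. The function name `sinkhorn_cost`, the batch size, the regularization strength `eps`, and the outlier value are all hypothetical choices made for illustration.

```python
import math
import random


def sinkhorn_cost(xs, ys, eps=1.0, n_iters=200):
    """Transport cost <P, C> of the entropic OT plan between two 1-D samples.

    Assumptions: uniform weights on both samples, squared-distance cost.
    """
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform mass on the first sample
    b = [1.0 / m] * m  # uniform mass on the second sample
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Plan entries are P_ij = u_i * K_ij * v_j; return the inner product <P, C>.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


random.seed(0)
batch_size = 32  # hypothetical mini-batch size

# Two mini-batches drawn from the SAME distribution, i.e. already aligned.
treated = [random.gauss(0.0, 1.0) for _ in range(batch_size)]
control = [random.gauss(0.0, 1.0) for _ in range(batch_size)]
clean = sinkhorn_cost(treated, control)

# Replace a single control unit with an outlier.
control_with_outlier = control[:-1] + [10.0]
noisy = sinkhorn_cost(treated, control_with_outlier)

print(f"clean batch: {clean:.3f}, batch with outlier: {noisy:.3f}")
```

Because every unit of mass must be transported, the single outlier is forced onto distant points and inflates the mini-batch discrepancy, even though the two underlying distributions are identical.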

2.1. CAUSAL INFERENCE FROM OBSERVATIONAL DATA

This section formulates the basic definitions and models in observational causal inference. We first formalize the fundamental elements in Definition 2.1, following the general notation convention (footnote 1).

Definition 2.1. Let $X$ be the random variable of covariates, with support $\mathcal{X}$ and distribution $P(x)$; let $R$ be the random variable of induced representations, with support $\mathcal{R}$ and distribution $P(r)$; let $Y$ be the random variable of outcomes, with support $\mathcal{Y}$ and distribution $P(y)$; let $T$ be the random variable of the treatment indicator, with support $\mathcal{T} = \{0, 1\}$ and distribution $P(T)$.

Following the potential outcome framework (Rubin, 1974), an individual with covariates $x$ has two potential outcomes, namely $Y_1(x)$ if it is treated and $Y_0(x)$ otherwise. The ground-truth individual treatment effect (ITE) is commonly formulated as the expected difference of the potential outcomes, $\tau(x) := \mathbb{E}[Y_1 - Y_0 \mid x]$, where one of the two outcomes is always unobserved. To address such missing counterfactuals, the ITE estimation task is commonly decomposed into potential outcome estimation subproblems that are solvable with any supervised learning method (Künzel et al., 2019). For example, the T-learner models the factual outcomes $Y$ for units in the treated and untreated groups separately; the S-learner regards the treatment indicator $T$ as one of the covariates $X$ and models $Y$ for all units simultaneously. The ITE estimate is then the difference between the estimated outcomes when $T$ is set to treated and to untreated.

Definition 2.2. Let $\psi : \mathcal{X} \to \mathcal{R}$ be a mapping from support $\mathcal{X}$ to $\mathcal{R}$, i.e., $\forall x \in \mathcal{X}$, $\exists r = \psi(x) \in \mathcal{R}$. Let $\phi : \mathcal{R} \times \mathcal{T} \to \mathcal{Y}$ be a mapping from support $\mathcal{R} \times \mathcal{T}$ to $\mathcal{Y}$, i.e., it maps the representations and the treatment indicator to the corresponding factual outcome. For example, $Y_1 = \phi_1(R)$ and $Y_0 = \phi_0(R)$, where we abbreviate $\phi(R, T = 1)$ and $\phi(R, T = 0)$ to $\phi_1(R)$ and $\phi_0(R)$, respectively.
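The T-learner decomposition described in this section can be sketched on synthetic data as follows. This is only an illustration under stated assumptions, not the ESCFR estimator: the data-generating process (linear potential outcomes with true effect x + 1, and a treatment probability that grows with x), the helper names `fit_line` and `tau_hat`, and the use of 1-D ordinary least squares as the supervised learner are all hypothetical choices.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y ~ intercept + slope * x with 1-D x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(0)
data = []
for _ in range(500):
    x = random.uniform(0.0, 1.0)
    # Assumed selection bias: units with larger x are treated more often.
    t = 1 if random.random() < 0.2 + 0.6 * x else 0
    y0, y1 = x, 2.0 * x + 1.0  # assumed potential outcomes, true ITE = x + 1
    # Only the factual outcome is observed, with a little noise.
    y = (y1 if t == 1 else y0) + random.gauss(0.0, 0.1)
    data.append((x, t, y))

# T-learner: fit one outcome model per treatment group.
treated = [(x, y) for x, t, y in data if t == 1]
control = [(x, y) for x, t, y in data if t == 0]
intercept1, slope1 = fit_line(*zip(*treated))
intercept0, slope0 = fit_line(*zip(*control))


def tau_hat(x):
    """Estimated ITE: difference of the two fitted outcome models at x."""
    return (intercept1 + slope1 * x) - (intercept0 + slope0 * x)


print(f"estimated ITE at x=0.5: {tau_hat(0.5):.3f} (true value 1.5)")
```

In this toy setting both outcome models are correctly specified, so each one extrapolates well outside its own group. When the outcome models are flexible and the groups overlap poorly, that extrapolation is where selection bias hurts, which motivates aligning the groups in the representation space.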



Footnote 1. We use uppercase letters, e.g., $X$, to denote a random variable, and lowercase letters, e.g., $x$, to denote an associated specific value. Letters in calligraphic font, e.g., $\mathcal{X}$, represent the support of the corresponding random variable, and $P(\cdot)$ represents the probability distribution of the random variable, e.g., $P(X)$.



Figure 1: Overview of handling treatment selection bias with ESCFR. Red (blue) indicates the treated (untreated) group. (a) Treatment selection bias causes a shift between $X_1$ and $X_0$, preventing $\phi_1$ and $\phi_0$ from generalizing beyond the properties of their respective groups. Scatter points and curves indicate the units and the fitted outcome mappings, respectively. (b) ESCFR handles this issue by mapping covariates to an overlapped representation space with $R = \psi(X)$, where $\phi_1$ and $\phi_0$ are mutually generalizable.

