REPRESENTATION BALANCING WITH DECOMPOSED PATTERNS FOR TREATMENT EFFECT ESTIMATION

Anonymous

Abstract

Estimating treatment effects from observational data is subject to covariate shift caused by selection bias. Recent studies have attempted to mitigate this problem by group distance minimization, that is, balancing the distribution of representations between the treated and controlled groups. The rationale is that representations that are balanced yet still predictive of factual outcomes are expected to generalize to counterfactual inference. Inspired by this, we propose a new approach to better capture the patterns that contribute to representation balancing and outcome prediction. Specifically, we derive a theoretical bound that naturally ties the notion of propensity confusion to representation balancing, and further transform the balancing Patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG). Moreover, we propose to decompose proxy features into Patterns of Pre-balancing and Balancing Representations (PPBR), since balanced representations alone are insufficient for outcome prediction. Extensive experiments on simulation and benchmark data confirm not only that PDIG leads to mutual reinforcement between individual propensity confusion and group distance minimization, but also that PPBR improves outcome prediction, especially counterfactual inference. We believe these findings offer heuristics for further investigation of what affects the generalizability of representation balancing models in counterfactual estimation.

1. INTRODUCTION

With personalized decision-making now ubiquitous, causal inference has sparked a surge of research on causal machine learning in many disciplines, including economics and statistics (Wager & Athey, 2018; Athey & Wager, 2019; Farrell, 2015; Chernozhukov et al., 2018; Huang et al., 2021), healthcare (Qian et al., 2021; Bica et al., 2021a; b), and commercial applications (Guo et al., 2020b; c; Chu et al., 2021). A central problem in causal inference is treatment effect estimation, which is tied to a fundamental hypothetical question: What would be the outcome if one received an alternative treatment? Answering this question requires knowledge of counterfactual outcomes, which cannot be observed directly and can only be inferred from observational data. Selection bias presents a major challenge for estimating counterfactual outcomes (Guo et al., 2020a; Zhang et al., 2020; Yao et al., 2021). This problem is caused by non-random treatment assignment, that is, treatment (e.g., vaccination) is usually determined by covariates (e.g., age) that also affect the outcome (e.g., infection rate) (Huang et al., 2022b). The probability of a person receiving treatment is known as the propensity score, and differences in propensity scores across individuals inherently lead to a covariate shift problem, i.e., the distribution of covariates in the treated units is substantially different from that in the controlled ones. Covariate shift makes it more difficult to infer counterfactual outcomes from observational data (Yao et al., 2018; Hassanpour & Greiner, 2019a). Recently, a line of representation balancing works has sought to alleviate the covariate shift problem by balancing the distribution between the treated group and the controlled group in the representation space (Shalit et al., 2017; Johansson et al., 2022).
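The link between non-random treatment assignment, covariate shift, and biased estimates can be made concrete with a small simulation. The sketch below is only illustrative: the data-generating process, the logistic form of the propensity score, the coefficient 2.0, and the constant true effect of 1.0 are all assumptions chosen for the example and do not come from any cited work.

```python
import math
import random


def simulate(n=20000, true_effect=1.0, seed=0):
    """Draw (x, t, y) triples where x drives both treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                    # covariate, e.g. standardized age
        e_x = 1.0 / (1.0 + math.exp(-2.0 * x))     # assumed propensity score e(x)
        t = 1 if rng.random() < e_x else 0         # non-random treatment assignment
        y0 = x + rng.gauss(0.0, 0.1)               # potential outcome without treatment
        y1 = y0 + true_effect                      # potential outcome with treatment
        rows.append((x, t, y1 if t == 1 else y0))  # only the factual outcome is observed
    return rows


def mean(values):
    return sum(values) / len(values)


rows = simulate()
treated = [(x, y) for x, t, y in rows if t == 1]
control = [(x, y) for x, t, y in rows if t == 0]

# Covariate shift: the covariate mean differs between the two groups.
covariate_gap = mean([x for x, _ in treated]) - mean([x for x, _ in control])
# Naive estimate: difference of observed group means, confounded by x.
naive_effect = mean([y for _, y in treated]) - mean([y for _, y in control])

print(f"covariate gap: {covariate_gap:.2f}")
print(f"naive effect:  {naive_effect:.2f} (true effect is 1.00)")
```

In this toy setting the naive difference of group means overstates the true effect by roughly the covariate gap, because the covariate shifts both the treatment probability and the outcome.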
The rationale behind these works is that counterfactual estimation should rest on accurate factual estimation while minimizing the distributional discrepancy, measured by an Integral Probability Metric (IPM), between the treated and controlled units. However, two issues remain to be resolved. First, Wasserstein distance (Cuturi & Doucet, 2014) is the most widely adopted metric for group distance minimization (Shalit et al., 2017; Huang et al., 2022a; Zhou et al., 2022), whereas H-divergence has received little attention in causal representation learning even though it is an important distance metric in other fields (Ben-David et al., 2006; 2010). Second, forcing models to learn outcome predictors from balanced representations alone may inadvertently weaken the predictive power of the outcome function (Zhang et al., 2020; Assaad et al., 2021; Huang et al., 2022a). We provide intuitive examples to illustrate the second issue in Section A.4. These issues motivate us to explore approaches to (i) improving factual outcome prediction without affecting the learning of balancing patterns or (ii) learning more effective balancing patterns without affecting factual outcome prediction. In this paper, we propose a new method, DIGNet, which learns decomposed patterns to achieve these two goals.
The contributions are threefold: (1) We interpret representation balancing as a concept of propensity confusion and derive corresponding theoretical results based on H-divergence to ensure its rationality; (2) DIGNet transforms the balancing Patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG) to capture patterns beneficial to representation balancing, and we empirically find that the PDIG structure enables individual propensity confusion and group distance minimization to reinforce each other without affecting factual outcome prediction; (3) DIGNet decomposes representative features into Patterns of Pre-balancing and Balancing Representations (PPBR) to preserve patterns contributing to outcome modeling, and we experimentally confirm that the PPBR approach brings improvement to outcome prediction without affecting learning balancing patterns.
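To make the two notions of discrepancy mentioned above more tangible, the following sketch computes both on toy data. It is a simplification under stated assumptions, not the DIGNet implementation: representations are taken to be scalars (hypothetical values), the Wasserstein-1 distance is computed exactly for one-dimensional empirical distributions, and the H-divergence is replaced by the commonly used proxy 2(1 - 2ε), where ε is the error of a classifier that tries to tell treated from controlled units (Ben-David et al., 2006; Ganin et al., 2016). The classifier here is a hypothetical one-dimensional threshold rule evaluated in-sample, and the function names are ours.

```python
def wasserstein_1d(a, b):
    """Exact Wasserstein-1 distance between two 1-D empirical distributions,
    computed as the integral of |F_a(x) - F_b(x)| over x."""
    a, b = sorted(a), sorted(b)
    points = sorted(a + b)
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1                      # i = number of a-samples <= left
        while j < len(b) and b[j] <= left:
            j += 1                      # j = number of b-samples <= left
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total


def proxy_a_distance(treated, control):
    """Proxy 2 * (1 - 2 * eps), with eps the lowest in-sample error of a
    threshold rule separating treated from controlled values."""
    labeled = sorted([(v, 1) for v in treated] + [(v, 0) for v in control])
    n = len(labeled)
    errors = len(control)               # threshold below all points: all predicted treated
    best = min(errors, n - errors)      # n - errors is the error of the flipped rule
    for idx, (value, label) in enumerate(labeled):
        errors += 1 if label == 1 else -1   # this point is now predicted controlled
        if idx + 1 < n and labeled[idx + 1][0] == value:
            continue                    # a threshold cannot split equal values
        best = min(best, errors, n - errors)
    eps = best / n
    return 2.0 * (1.0 - 2.0 * eps)


# Hypothetical scalar representations of treated and controlled units.
control_repr = [0.2, 0.3, 0.6, 0.8]
balanced_treated = [0.1, 0.4, 0.5, 0.9]
shifted_treated = [1.1, 1.4, 1.5, 1.9]

for name, treated_repr in [("balanced", balanced_treated), ("shifted", shifted_treated)]:
    w = wasserstein_1d(treated_repr, control_repr)
    d = proxy_a_distance(treated_repr, control_repr)
    print(f"{name}: Wasserstein-1 = {w:.3f}, proxy A-distance = {d:.3f}")
```

When the classifier cannot distinguish the groups, ε approaches 0.5 and the proxy approaches 0, which is the intuition behind propensity confusion. In practice ε would be estimated on held-out data with a learned classifier, and unequal group sizes would call for a class-balanced error.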

1.1. RELATED WORK

The covariate shift problem has stimulated the line of representation balancing works (Johansson et al., 2016; Shalit et al., 2017; Johansson et al., 2022). These works aim to balance the distributions of representations between treated and controlled groups while keeping the representations predictive of factual outcomes. This idea is closely connected with domain adaptation. In particular, the individual treatment effect (ITE) error bound based on Wasserstein distance is similar to the generalization bound in Ben-David et al. (2010); Long et al. (2014); Shen et al. (2018). In addition to the Wasserstein distance-based model, this paper derives a new ITE error bound based on H-divergence (Ben-David et al., 2006; 2010; Ganin et al., 2016). Note that our theoretical results (Section 3.2) and experimental implementations (Section 4.1) differ greatly from Shalit et al. (2017) because Wasserstein distance and H-divergence are defined differently.

Another recent line of work investigates efficient neural network structures for treatment effect estimation. Kuang et al. (2017); Hassanpour & Greiner (2019b) decompose the original covariates into treatment-specific factors, outcome-specific factors, and confounding factors; X-learner (Künzel et al., 2019) and R-learner (Nie & Wager, 2021) are developed beyond the classic S-learner and T-learner; Curth & van der Schaar (2021) leverage structures for end-to-end learners to counteract the inductive bias towards treatment effect estimation, as motivated by Makar et al. (2020).

The proposed DIGNet model is built on the PDIG structure and the PPBR approach. The PDIG structure is motivated by multi-task learning, where we design a framework incorporating two specific balancing patterns that share the same pre-balancing patterns. The PPBR approach is inspired by Zhang et al. (2020); Assaad et al. (2021); Huang et al. (2022a), where the authors argue that improperly balanced representations can be detrimental predictors for outcome modeling, since such representations can lose the original information that contributes to outcome prediction. Other representation learning methods relevant to treatment effect estimation include Louizos et al. (2017); Yao et al. (2018); Yoon et al. (2018); Shi et al. (2019); Du et al. (2021).

2. PRELIMINARIES

Notations. Suppose there are $N$ i.i.d. random samples $\mathcal{D} = \{(X_i, T_i, Y_i)\}_{i=1}^{N}$ with observed realizations $\{(x_i, t_i, y_i)\}_{i=1}^{N}$, where there are $N_1$ treated units and $N_0$ controlled units. For each unit $i$, $X_i \in \mathcal{X} \subset \mathbb{R}^d$ denotes the $d$-dimensional covariates and $T_i \in \{0, 1\}$ denotes the binary treatment, with $e(x_i) := p(T_i = 1 \mid X_i = x_i)$ defined as the propensity score (Rosenbaum & Rubin, 1983). The potential outcome framework (Rubin, 2005) defines the potential outcomes $Y^1, Y^0 \in \mathcal{Y} \subset \mathbb{R}$

