REPRESENTATION BALANCING WITH DECOMPOSED PATTERNS FOR TREATMENT EFFECT ESTIMATION Anonymous

Abstract

Estimating treatment effects from observational data is subject to a covariate shift problem caused by selection bias. Recent studies have attempted to mitigate this problem by group distance minimization, that is, balancing the distributions of representations between the treated and controlled groups. The rationale is that learning balanced representations while preserving predictive power for factual outcomes is expected to generalize to counterfactual inference. Inspired by this, we propose a new approach to better capture the patterns that contribute to representation balancing and outcome prediction. Specifically, we derive a theoretical bound that naturally ties the notion of propensity confusion to representation balancing, and further transform the balancing Patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG). Moreover, we propose to decompose proxy features into Patterns of Pre-balancing and Balancing Representations (PPBR), since considering only balanced representations in outcome prediction is insufficient. Extensive experiments on simulation and benchmark data confirm not only that PDIG leads to mutual reinforcement between individual propensity confusion and group distance minimization, but also that PPBR improves outcome prediction, especially counterfactual inference. We believe these findings serve as heuristics for further investigation of what affects the generalizability of representation balancing models in counterfactual estimation.

1. INTRODUCTION

In the context of ubiquitous personalized decision-making, causal inference has sparked a surge of research on causal machine learning across many disciplines, including economics and statistics (Wager & Athey, 2018; Athey & Wager, 2019; Farrell, 2015; Chernozhukov et al., 2018; Huang et al., 2021), healthcare (Qian et al., 2021; Bica et al., 2021a;b), and commercial applications (Guo et al., 2020b;c; Chu et al., 2021). The central problem of causal inference is treatment effect estimation, which is tied to a fundamental hypothetical question: what would the outcome be if one received an alternative treatment? Answering this question requires knowledge of counterfactual outcomes, which can only be inferred from observational data, never directly observed. Selection bias presents a major challenge for estimating counterfactual outcomes (Guo et al., 2020a; Zhang et al., 2020; Yao et al., 2021). It arises from non-random treatment assignment: treatment (e.g., vaccination) is usually determined by covariates (e.g., age) that also affect the outcome (e.g., infection rate) (Huang et al., 2022b). The probability of a person receiving treatment is known as the propensity score, and differences in propensity scores across units inherently lead to a covariate shift problem, i.e., the distribution of covariates among the treated units differs substantially from that among the controlled ones. This covariate shift makes it harder to infer counterfactual outcomes from observational data (Yao et al., 2018; Hassanpour & Greiner, 2019a). Recently, a line of representation balancing works has sought to alleviate the covariate shift problem by balancing the distributions of the treated and controlled groups in the representation space (Shalit et al., 2017; Johansson et al., 2022).
The rationale behind these works is that counterfactual estimation should rest on accurate factual estimation while minimizing the distributional discrepancy, measured by an Integral Probability Metric (IPM), between the treated and controlled units. However, two issues remain. First, the Wasserstein distance (Cuturi & Doucet, 2014) is the most widely adopted metric for group distance minimization (Shalit et al., 2017; Huang et al., 2022a; Zhou et al., 2022), whereas the H-divergence has received little attention in causal representation learning, even though it is an important distance metric in other fields (Ben-David et al., 2006; 2010). Second, forcing models to learn outcome predictors from only balanced representations may inadvertently weaken the predictive power of the outcome function (Zhang et al., 2020; Assaad et al., 2021; Huang et al., 2022a). We provide intuitive examples illustrating the second issue in Section A.4. These issues motivate us to explore approaches that (i) improve factual outcome prediction without affecting the learned balancing patterns, or (ii) learn more effective balancing patterns without affecting factual outcome prediction. In this paper, we propose a new method, DIGNet, which learns decomposed patterns to achieve both goals.
The contributions are threefold: (1) We interpret representation balancing as a concept of propensity confusion and derive corresponding theoretical results based on H-divergence to ensure its rationality; (2) DIGNet transforms the balancing Patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG) to capture patterns beneficial to representation balancing, and we empirically find that the PDIG structure enables individual propensity confusion and group distance minimization to reinforce each other without affecting factual outcome prediction; (3) DIGNet decomposes representative features into Patterns of Pre-balancing and Balancing Representations (PPBR) to preserve patterns contributing to outcome modeling, and we experimentally confirm that the PPBR approach brings improvement to outcome prediction without affecting learning balancing patterns.

1.1. RELATED WORK

The presence of the covariate shift problem stimulates the line of representation balancing works (Johansson et al., 2016; Shalit et al., 2017; Johansson et al., 2022). These works aim to balance the distributions of representations between the treated and controlled groups while keeping the representations predictive of factual outcomes. This idea is closely connected with domain adaptation. In particular, the individual treatment effect (ITE) error bound based on the Wasserstein distance is similar to the generalization bounds in Ben-David et al. (2010); Long et al. (2014); Shen et al. (2018). In addition to the Wasserstein distance-based model, this paper derives a new ITE error bound based on the H-divergence (Ben-David et al., 2006; 2010; Ganin et al., 2016). Note that our theoretical results (Section 3.2) and experimental implementations (Section 4.1) differ greatly from Shalit et al. (2017) due to the distinct definitions of the Wasserstein distance and the H-divergence. Another recent line of work investigates efficient neural network structures for treatment effect estimation. Kuang et al. (2017); Hassanpour & Greiner (2019b) decompose the original covariates into treatment-specific, outcome-specific, and confounding factors; X-learner (Künzel et al., 2019) and R-learner (Nie & Wager, 2021) are developed beyond the classic S-learner and T-learner; Curth & van der Schaar (2021) leverage structures for end-to-end learners to counteract the inductive bias in treatment effect estimation, as motivated by Makar et al. (2020). The proposed DIGNet model is built on the PDIG structure and the PPBR approach. The PDIG structure is motivated by multi-task learning: we design a framework incorporating two specific balancing patterns that share the same pre-balancing patterns. The PPBR approach is inspired by Zhang et al. (2020); Assaad et al. (2021); Huang et al. (2022a), who argue that improperly balanced representations can be detrimental predictors for outcome modeling, since such representations can lose original information that contributes to outcome prediction. Other representation learning methods relevant to treatment effect estimation include Louizos et al. (2017); Yao et al. (2018); Yoon et al. (2018); Shi et al. (2019); Du et al. (2021).

2. PRELIMINARIES

Notations. Suppose we have $N$ i.i.d. samples $D = \{(X_i, T_i, Y_i)\}_{i=1}^N$ with observed realizations $\{(x_i, t_i, y_i)\}_{i=1}^N$, among which there are $N_1$ treated units and $N_0$ controlled units. For each unit $i$, $X_i \in \mathcal{X} \subset \mathbb{R}^d$ denotes the $d$-dimensional covariates and $T_i \in \{0, 1\}$ the binary treatment, with $e(x_i) := p(T_i = 1 \mid X_i = x_i)$ defined as the propensity score (Rosenbaum & Rubin, 1983). The potential outcome framework (Rubin, 2005) defines potential outcomes $Y^1, Y^0 \in \mathcal{Y} \subset \mathbb{R}$ for treatments $T = 1$ and $T = 0$, respectively. The observed (factual) outcome is $Y = T \cdot Y^1 + (1 - T) \cdot Y^0$. For $t \in \{0, 1\}$, let $\tau_t(x) := \mathbb{E}[Y^t \mid X = x]$ be a function of $Y^t$ w.r.t. $X$; our goal is to estimate the individual treatment effect (ITE) $\tau(x) := \mathbb{E}[Y^1 - Y^0 \mid X = x] = \tau_1(x) - \tau_0(x)$ and the average treatment effect (ATE) $\tau_{ATE} := \mathbb{E}[Y^1 - Y^0] = \int_{\mathcal{X}} \tau(x) p(x) dx$.
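The ITE/ATE definitions above can be illustrated numerically. The sketch below assumes a hypothetical linear potential-outcome model $\tau_t(x) = \beta_t' x$ (the coefficient values are illustrative only); for this model $\tau(x) = (\beta_1 - \beta_0)' x$, and the ATE is estimated by Monte Carlo averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear potential-outcome model: tau_t(x) = beta_t' x.
d = 5
beta0 = np.full(d, 0.3)
beta1 = np.full(d, 1.3)

X = rng.normal(size=(1000, d))

tau1 = X @ beta1          # tau_1(x) = E[Y^1 | X = x]
tau0 = X @ beta0          # tau_0(x) = E[Y^0 | X = x]
ite = tau1 - tau0         # tau(x) = tau_1(x) - tau_0(x)
ate = ite.mean()          # Monte-Carlo estimate of E[Y^1 - Y^0]

# Here tau(x) = (beta1 - beta0)' x, so the true ATE equals
# (beta1 - beta0)' E[X] = 0 for centered covariates.
```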

2.1. PROBLEM SETUP

In causal representation balancing works, we denote the representation space by $\mathcal{R} \subset \mathbb{R}^d$, and $\Phi : \mathcal{X} \to \mathcal{R}$ is assumed to be a twice-differentiable, one-to-one invertible function with inverse $\Psi : \mathcal{R} \to \mathcal{X}$ such that $\Psi(\Phi(x)) = x$. The densities of the treated and controlled covariates are denoted by $p^{T=1}_x = p^{T=1}(x) := p(x \mid T = 1)$ and $p^{T=0}_x = p^{T=0}(x) := p(x \mid T = 0)$, respectively. Correspondingly, the densities of the treated and controlled covariates in the representation space are $p^{T=1}_\Phi = p^{T=1}_\Phi(r) := p_\Phi(r \mid T = 1)$ and $p^{T=0}_\Phi = p^{T=0}_\Phi(r) := p_\Phi(r \mid T = 0)$, respectively. Our study is based on the potential outcome framework (Rubin, 2005). Assumption 1 states the standard assumptions needed for treatment effects to be identifiable, and Definition 1 collects the terms used in our theoretical analysis.

Assumption 1 (Consistency, Overlap, and Unconfoundedness) Consistency: if the treatment is $t$, the observed outcome equals $Y^t$. Overlap: the propensity score is bounded away from 0 and 1: $0 < e(x) < 1$. Unconfoundedness: $Y^t \perp\!\!\!\perp T \mid X$, $\forall t \in \{0, 1\}$.

Definition 1 Let $h : \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ be a hypothesis defined over the representation space $\mathcal{R}$ such that $h(\Phi(x), t)$ estimates $y^t$, and let $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function (e.g., $L(y, y') = (y - y')^2$). Define the expected loss for $(x, t)$ as $\ell_{h,\Phi}(x, t) = \int_{\mathcal{Y}} L(y^t, h(\Phi(x), t)) p(y^t \mid x) dy^t$. The factual and counterfactual losses, together with their treated and controlled counterparts, are
$\epsilon_F(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t) p(x, t) dx dt$, $\quad \epsilon_{CF}(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t) p(x, 1-t) dx dt$,
$\epsilon^{T=1}_F(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1) p^{T=1}(x) dx$, $\quad \epsilon^{T=0}_F(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0) p^{T=0}(x) dx$,
$\epsilon^{T=1}_{CF}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1) p^{T=0}(x) dx$, $\quad \epsilon^{T=0}_{CF}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0) p^{T=1}(x) dx$.
If we let $f(x, t)$ be $h(\Phi(x), t)$, where $f : \mathcal{X} \times \{0, 1\} \to \mathcal{Y}$ is an outcome prediction function, then the estimated ITE over $f$ is $\hat{\tau}_f(x) := f(x, 1) - f(x, 0)$.
Finally, better treatment effect estimation corresponds to a smaller error in the expected Precision in Estimation of Heterogeneous Effect (PEHE): $\epsilon_{PEHE}(f) = \int_{\mathcal{X}} L(\hat{\tau}_f(x), \tau(x)) p(x) dx$. We also write $\epsilon_{PEHE}(f)$ as $\epsilon_{PEHE}(h, \Phi)$ when $f(x, t) = h(\Phi(x), t)$.
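The empirical counterpart of $\epsilon_{PEHE}$ under the squared loss can be sketched as follows (function names are illustrative, not from the paper):

```python
import numpy as np

def pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Empirical PEHE with squared loss: mean_i (tau_hat_i - tau_i)^2."""
    return float(np.mean((tau_hat - tau_true) ** 2))

def sqrt_pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Root PEHE, the metric commonly reported in experiments."""
    return pehe(tau_hat, tau_true) ** 0.5
```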

3. THEORETICAL RESULTS

In this section, we first prove that $\epsilon_{PEHE}$ is bounded by $\epsilon_F$ and $\epsilon_{CF}$ in Lemma 1. Next, we revisit the upper bound (Theorem 1) underlying the group distance minimization guided method in Section 3.1. Section 3.2 then presents the theoretical results for the proposed individual propensity confusion guided method in Theorem 2. Proofs and additional theoretical results are deferred to the Appendix.

Lemma 1 Let the functions $h$ and $\Phi$ be as defined in Definition 1, and let $L$ be the squared loss. Recall that $\tau_t(x) = \mathbb{E}[Y^t \mid X = x]$. Defining $\sigma^2_y = \min\{\sigma^2_{y^t}(p(x, t)), \sigma^2_{y^t}(p(x, 1-t))\}$ $\forall t \in \{0, 1\}$, where $\sigma^2_{y^t}(p(x, t)) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} (y^t - \tau_t(x))^2 p(y^t \mid x) p(x, t) dy^t dx dt$, we have
$\epsilon_{PEHE}(h, \Phi) \le 2(\epsilon_{CF}(h, \Phi) + \epsilon_F(h, \Phi) - 2\sigma^2_y)$.
Similar results hold whenever $L$ satisfies the triangle inequality, not only for the squared loss; for instance, Lemma 6 in the Appendix gives the result for the absolute loss. This extends the result of Shalit et al. (2017), where $L$ is restricted to the squared loss.

3.1. GNET: GROUP DISTANCE MINIMIZATION GUIDED REPRESENTATION BALANCING

Theorem 1 Let $h$, $\Phi$, $\Psi$, $p^{T=1}_\Phi$, and $p^{T=0}_\Phi$ be as defined before. Let $L$ be the squared loss, $u := Pr(T = 1)$, and let $\mathcal{G}$ be the family of 1-Lipschitz functions. Assuming there exists a constant $B_\Phi \ge 0$ such that $g_{\Phi,h}(r, t) := \frac{1}{B_\Phi} \ell_{h,\Phi}(\Psi(r), t) \in \mathcal{G}$ for each fixed $t \in \{0, 1\}$, we have
$\epsilon_{CF}(h, \Phi) \le (1 - u) \, \epsilon^{T=1}_F(h, \Phi) + u \, \epsilon^{T=0}_F(h, \Phi) + B_\Phi \cdot Wass(p^{T=1}_\Phi, p^{T=0}_\Phi)$,   (2)
$\epsilon_{PEHE}(h, \Phi) \le 2(\epsilon^{T=1}_F(h, \Phi) + \epsilon^{T=0}_F(h, \Phi) + B_\Phi \cdot Wass(p^{T=1}_\Phi, p^{T=0}_\Phi) - 2\sigma^2_y)$.
Theorem 1 is identical to Shalit et al. (2017) when $L$ is the squared loss; analogous results hold whenever $L$ satisfies the triangle inequality (the absolute-loss version is given in the Appendix). We refer to the resulting model as GNet (aka CFR-Wass in Shalit et al. (2017)), since it is based on group distance minimization.
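The Wasserstein term in Theorem 1 has a simple empirical form in one dimension: for two equal-size 1-D samples, the 1-Wasserstein distance is the mean absolute difference of sorted values. The sketch below is an illustrative toy (the paper's experiments use multi-dimensional representations and an approximate Wasserstein solver):

```python
import numpy as np

def wass1d(a: np.ndarray, b: np.ndarray) -> float:
    """Empirical 1-Wasserstein distance between two equal-size 1-D
    samples: mean absolute difference of the sorted values."""
    a, b = np.sort(a), np.sort(b)
    assert a.shape == b.shape
    return float(np.mean(np.abs(a - b)))

# Shifting a sample by a constant c moves it by exactly c in W1.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
```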

3.2. INET: INDIVIDUAL PROPENSITY CONFUSION GUIDED REPRESENTATION BALANCING

The propensity score is recognized as central to treatment effect estimation because it characterizes the probability that a unit receives treatment (Rosenbaum & Rubin, 1983). Therefore, in addition to group distance minimization, the propensity score can naturally be used to assess whether representations are adequately balanced: representation balancing can be intuitively interpreted as propensity confusion, i.e., the representations are considered adequately balanced when it is hard to distinguish whether each unit in the representation space is treated or controlled. In Section 4.1, we demonstrate how minimizing the H-divergence (Definition 3) is empirically related to propensity confusion. Below, we first derive an ITE bound based on the H-divergence in Theorem 2.

Definition 3 Given a pair of distributions $p_1, p_2$ over $\mathcal{S}$ and a binary hypothesis class $\mathcal{H}$, the H-divergence between $p_1$ and $p_2$ is defined as
$d_{\mathcal{H}}(p_1, p_2) := 2 \sup_{\eta \in \mathcal{H}} |Pr_{p_1}[\eta(s) = 1] - Pr_{p_2}[\eta(s) = 1]|$.   (4)

Theorem 2 Let $h$, $\Phi$, $\Psi$, $p^{T=1}_\Phi$, and $p^{T=0}_\Phi$ be as defined before. Let $L$ be the squared loss, $u := Pr(T = 1)$, and let $\mathcal{H}$ be the family of binary functions. Assuming there exists a constant $K \ge 0$ such that $\int_{\mathcal{Y}} L(y, y') dy \le K$ $\forall y' \in \mathcal{Y}$, we have
$\epsilon_{CF}(h, \Phi) \le (1 - u) \, \epsilon^{T=1}_F(h, \Phi) + u \, \epsilon^{T=0}_F(h, \Phi) + \frac{K}{2} d_{\mathcal{H}}(p^{T=1}_\Phi, p^{T=0}_\Phi)$,
$\epsilon_{PEHE}(h, \Phi) \le 2(\epsilon^{T=1}_F(h, \Phi) + \epsilon^{T=0}_F(h, \Phi) + \frac{K}{2} d_{\mathcal{H}}(p^{T=1}_\Phi, p^{T=0}_\Phi) - 2\sigma^2_y)$.
Similar upper bounds follow from the same arguments whenever $L$ satisfies the triangle inequality; the absolute-loss version and the proof of Theorem 2 are given in the Appendix. We refer to the resulting model as INet, since it is based on individual propensity confusion.
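To make the H-divergence concrete, here is a toy empirical estimate for 1-D representations with $\mathcal{H}$ restricted to threshold classifiers $\eta(r) = I[r > c]$ and their complements (an illustrative, deliberately small family; in practice a neural discriminator plays this role):

```python
import numpy as np

def emp_h_divergence(r0: np.ndarray, r1: np.ndarray) -> float:
    """Empirical H-divergence between 1-D representation samples r0
    (controlled) and r1 (treated), with H the threshold classifiers
    eta(r) = I[r > c] and their complements."""
    best = 0.0
    for c in np.concatenate([r0, r1]):
        # |Pr_{r1}[eta = 1] - Pr_{r0}[eta = 1]| for eta(r) = I[r > c];
        # the complement classifier attains the same absolute gap.
        gap = abs(float((r1 > c).mean()) - float((r0 > c).mean()))
        best = max(best, gap)
    return 2.0 * best

rng = np.random.default_rng(0)
r0 = rng.normal(0.0, 1.0, size=500)
# Identical samples give divergence 0; well-separated samples give
# a value near the maximum of 2.
```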

4. METHOD

In this section, we demonstrate how Theorem 2 is associated with propensity confusion and propose DIGNet, a decomposition network built on GNet and INet. Section 4.1 presents the objectives of the standard representation balancing models GNet and INet. Section 4.2 introduces the PDIG and PPBR components of the representation-balancing-with-decomposed-patterns scheme and gives the final objective of the proposed DIGNet model.

4.1. REPRESENTATION BALANCING WITHOUT DECOMPOSED PATTERNS

In representation balancing models, given input tuples $(x, t, y) = \{(x_i, t_i, y_i)\}_{i=1}^N$, the original covariates $x$ are extracted by a representation function $\Phi(\cdot)$, and the representations $\Phi(x)$ are fed into the outcome functions $h_1(\cdot) := h(\cdot, 1)$ and $h_0(\cdot) := h(\cdot, 0)$ that estimate the potential outcomes $y^1$ and $y^0$, respectively. The factual outcome is then predicted by $h_t(\cdot) = t h_1(\cdot) + (1 - t) h_0(\cdot)$, and the corresponding outcome loss is $L_y(x, t, y; \Phi, h_t) = \frac{1}{N} \sum_{i=1}^N L(h_t(\Phi(x_i)), y_i)$. In a model without decomposed patterns, such as GNet or INet, both outcome prediction and representation balancing rely on the same extracted features $\Phi(x)$. Below we introduce the objectives of GNet and INet.

Objective of GNet. GNet learns balancing patterns over $\Phi$ by minimizing the group distance loss $L_G(x, t; \Phi) = Wass(\{\Phi(x_i)\}_{i: t_i = 0}, \{\Phi(x_i)\}_{i: t_i = 1})$. If the original covariates $x$ are extracted by the feature extractor $\Phi_E(\cdot)$, the final objective of GNet is $\min_{\Phi_E, h_t} L_y(x, t, y; \Phi_E, h_t) + \alpha_1 L_G(x, t; \Phi_E)$. For the reader's convenience, the structure of GNet is illustrated in Figure 1(a).

Objective of INet. Next, we detail how Theorem 2 relates to propensity confusion and give the objective of INet. Let $I(a)$ be the indicator function that equals 1 if $a$ is true, and let $\mathcal{H}$ be the family of binary functions as defined in Theorem 2. Representation balancing seeks a small empirical H-divergence $\hat{d}_{\mathcal{H}}(p^{T=1}_\Phi, p^{T=0}_\Phi)$, where
$\hat{d}_{\mathcal{H}}(p^{T=1}_\Phi, p^{T=0}_\Phi) = 2\left(1 - \min_{\eta \in \mathcal{H}} \left[\frac{1}{N_0} \sum_{i: t_i = 0} I[\eta(\Phi(x_i)) = 1] + \frac{1}{N_1} \sum_{i: t_i = 1} I[\eta(\Phi(x_i)) = 0]\right]\right)$.   (9)
The "min" part in equation 9 indicates that the optimal classifier $\eta^* \in \mathcal{H}$ minimizes the classification error between the estimated treatment $\eta^*(\Phi(x_i))$ and the observed treatment $t_i$, i.e., it discriminates whether $\Phi(x_i)$ is controlled ($T = 0$) or treated ($T = 1$).
As a result, $\hat{d}_{\mathcal{H}}(p^{T=1}_\Phi, p^{T=0}_\Phi)$ will be large if $\eta^*$ can easily distinguish whether $\Phi(x_i)$ is treated or controlled, i.e., the optimal classification error is small. In contrast, $\hat{d}_{\mathcal{H}}(p^{T=1}_\Phi, p^{T=0}_\Phi)$ will be small if it is hard for $\eta^*$ to determine whether $\Phi(x_i)$ is treated or controlled, i.e., the optimal classification error is large. Therefore, the prerequisite for a small H-divergence is a map $\Phi$ such that any classifier $\eta \in \mathcal{H}$ is confused about whether $\Phi(x_i)$ is treated or controlled. To achieve this goal, we define a discriminator $\pi(r) : \mathcal{R} \to [0, 1]$ that estimates the propensity score of $r$. The classification error for the $i$-th individual can be empirically approximated by the cross-entropy loss between $\pi(\Phi(x_i))$ and $t_i$:
$L_t(t_i, \pi(\Phi(x_i))) = -[t_i \log \pi(\Phi(x_i)) + (1 - t_i) \log(1 - \pi(\Phi(x_i)))]$.
To minimize the classification error in equation 9, we seek an optimal discriminator $\pi^*$ that maximizes the probability of correctly classifying treatment over the whole population:
$\max_{\pi} L_I(x, t; \Phi, \pi) = \max_{\pi} \left\{-\frac{1}{N_0} \sum_{i: t_i = 0} L_t(t_i, \pi(\Phi(x_i))) - \frac{1}{N_1} \sum_{i: t_i = 1} L_t(t_i, \pi(\Phi(x_i)))\right\}$.   (11)
Given the feature extractor $\Phi_E(\cdot)$, the objective of INet can be formulated as a min-max game:
$\min_{\Phi_E, h_t} \max_{\pi} L_y(x, t, y; \Phi_E, h_t) + \alpha_2 L_I(x, t; \Phi_E, \pi)$.   (12)
As stated in equation 12, INet achieves representation balancing through a min-max formulation. In the maximization, the discriminator $\pi$ is trained to maximize the probability that treatment is correctly classified, which pushes $\pi(\Phi_E(x_i))$ closer to the true propensity score $e(x_i)$. In the minimization, the feature extractor $\Phi_E$ is trained to fool the discriminator $\pi$, confusing it about whether each representation is treated or controlled. Although this min-max game is reminiscent of adversarial domain adaptation (Ganin et al., 2016), the theoretical derivations are completely different due to the significant differences (e.g., definitions, problem settings, and theoretical bounds) between causal inference and domain adaptation.
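The group-averaged cross-entropy objective $L_I$ above can be sketched directly (a minimal numpy illustration; the array names and example probabilities are hypothetical):

```python
import numpy as np

def cross_entropy(t, p):
    """L_t(t, pi) = -[t log pi + (1 - t) log(1 - pi)], elementwise."""
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def L_I(t, p):
    """Group-averaged negative cross-entropy: larger (closer to 0)
    when the discriminator classifies treatment well in both groups."""
    ce = cross_entropy(t, p)
    return float(-ce[t == 0].mean() - ce[t == 1].mean())

t = np.array([0, 0, 1, 1])
confident = np.array([0.05, 0.10, 0.90, 0.95])  # pi close to t
confused = np.full(4, 0.5)                      # pi fooled by Phi_E
# The maximization step drives pi toward the "confident" case; the
# minimization step over Phi_E drives it back toward the "confused" case.
```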

4.2. REPRESENTATION BALANCING WITH DECOMPOSED PATTERNS

PDIG. The preceding sections show that GNet is well established and widely adopted, while INet is meaningful and interpretable. Nevertheless, there is no consensus on a single best approach, as each method has its strengths and weaknesses. We therefore aim to capture more effective balancing patterns by turning the balancing Patterns into Decompositions of Individual propensity confusion and Group distance minimization (PDIG). More specifically, the covariates $x$ are extracted by the feature extractor $\Phi_E(\cdot)$, and $\Phi_E(x)$ is then fed into the balancing networks $\Phi_G(\cdot)$ and $\Phi_I(\cdot)$ for group distance minimization and individual propensity confusion, respectively. The losses for the two separate balancing patterns are $\min_{\Phi_G} L_G(x, t; \Phi_G \circ \Phi_E)$ and $\min_{\Phi_I} \max_{\pi} L_I(x, t; \Phi_I \circ \Phi_E, \pi)$. Here $\circ$ denotes function composition, indicating that $\Phi(\cdot)$ in $L_G(x, t; \Phi)$ and $L_I(x, t; \Phi, \pi)$ is replaced by $\Phi_G(\Phi_E(\cdot))$ and $\Phi_I(\Phi_E(\cdot))$, respectively.

PPBR. Motivated by the discussion in Section 1, we design a framework that captures Patterns of Pre-balancing and Balancing Representations (PPBR). First, the balancing patterns $\Phi_G(\Phi_E(x))$ and $\Phi_I(\Phi_E(x))$ are learned over $\Phi_G$ and $\Phi_I$, while $\Phi_E$ is kept fixed as the pre-balancing patterns. We then concatenate the balancing representations $\Phi_G(\Phi_E(x))$ and $\Phi_I(\Phi_E(x))$ with the pre-balancing representations $\Phi_E(x)$ as attributes for outcome prediction. As a result, the proxy features used for outcome prediction are $\Phi_E(x) \oplus \Phi_G(\Phi_E(x)) \oplus \Phi_I(\Phi_E(x))$, where $\oplus$ denotes concatenation by column; for example, if $a = [1, 2]$ and $b = [3, 4]$, then $a \oplus b = [1, 2, 3, 4]$. Although this paper mainly investigates whether PDIG and PPBR benefit treatment effect estimation, it would also be an interesting direction for future research to examine whether there is any interaction or mutual reinforcement between them.
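The PPBR feature concatenation can be sketched with toy stand-in networks (the weight matrices and layer sizes below are hypothetical, chosen only to show the $\Phi_E \oplus \Phi_G \circ \Phi_E \oplus \Phi_I \circ \Phi_E$ wiring):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy networks standing in for Phi_E, Phi_G, Phi_I.
W_E = rng.normal(size=(10, 8))
W_G = rng.normal(size=(8, 4))
W_I = rng.normal(size=(8, 4))

def relu(z):
    return np.maximum(z, 0.0)

x = rng.normal(size=(32, 10))
h_E = relu(x @ W_E)    # Phi_E(x): pre-balancing representation
h_G = relu(h_E @ W_G)  # Phi_G(Phi_E(x)): group-distance branch
h_I = relu(h_E @ W_I)  # Phi_I(Phi_E(x)): propensity-confusion branch

# PPBR proxy features: Phi_E(x) (+) Phi_G(Phi_E(x)) (+) Phi_I(Phi_E(x))
proxy = np.concatenate([h_E, h_G, h_I], axis=1)
```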
Objective of DIGNet. Combining PDIG and PPBR, we develop a new model architecture, DIGNet, illustrated in Figure 1(e). The objective of DIGNet is separated into four stages:
$\min_{\Phi_G} \alpha_1 L_G(x, t; \Phi_G \circ \Phi_E)$,   (14)
$\max_{\pi} \alpha_2 L_I(x, t; \Phi_I \circ \Phi_E, \pi)$,   (15)
$\min_{\Phi_I} \alpha_2 L_I(x, t; \Phi_I \circ \Phi_E, \pi)$,   (16)
$\min_{\Phi_E, \Phi_I, \Phi_G, h_t} L_y(x, t, y; \Phi_E \oplus (\Phi_I \circ \Phi_E) \oplus (\Phi_G \circ \Phi_E), h_t)$.   (17)
Within each iteration, DIGNet minimizes the group distance via equation 14 and plays an adversarial game to achieve propensity confusion through equations 15 and 16. In equation 17, DIGNet updates both the pre-balancing and balancing patterns $\Phi_E$, $\Phi_I$, $\Phi_G$ along with the outcome function $h_t$ to minimize the outcome prediction loss.
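The four-stage alternation can be made explicit with a purely schematic training skeleton. The losses below are toy quadratic stand-ins (not the paper's actual losses) chosen only to show which parameter block is updated at which stage:

```python
# Schematic stand-in for DIGNet's per-iteration alternation
# (equations 14-17). All losses are hypothetical quadratics; the
# point is the update order and which parameters each stage touches.
params = {"phi_E": 1.0, "phi_G": 2.0, "phi_I": 2.0, "pi": 0.0, "h": 0.0}
lr, a1, a2 = 0.1, 1.0, 1.0

def step(name, grad):
    params[name] -= lr * grad  # one gradient-descent step on one block

for _ in range(300):
    # Stage (14): update Phi_G on the group-distance loss a1 * phi_G^2.
    step("phi_G", 2 * a1 * params["phi_G"])
    # Stage (15): update pi to maximize L_I (descend on -L_I);
    # toy L_I = -(pi - phi_I)^2, so pi chases phi_I.
    step("pi", 2 * a2 * (params["pi"] - params["phi_I"]))
    # Stage (16): update Phi_I to confuse pi; toy surrogate shrinks
    # phi_I toward an uninformative value 0.
    step("phi_I", 2 * a2 * params["phi_I"])
    # Stage (17): update h and Phi_E on the outcome loss,
    # toy L_y = (h - 1)^2 + phi_E^2.
    step("h", 2 * (params["h"] - 1.0))
    step("phi_E", 2 * params["phi_E"])
```

In the real model each `step` would be an optimizer step on a network's parameters, with the other networks held fixed, exactly as the stage structure of equations 14-17 prescribes.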

5. EXPERIMENTS

In non-randomized observational data, the ground truth of treatment effects is inaccessible due to the lack of counterfactuals. We therefore use simulated data and semi-synthetic benchmark data to test our methods against baseline models. In this section, we mainly investigate two questions: Q1. Compared to DGNet and DINet, which lack the PDIG structure, does DIGNet with the PDIG structure achieve better representation balancing? Q2. Do DGNet and DINet, which involve PPBR, improve outcome prediction compared with standard representation balancing models such as GNet and INet?

5.1. EXPERIMENTAL SETTINGS

Simulation data. Previous causal inference works assess model effectiveness by varying the degree of covariate imbalance between the treated and controlled groups (Yao et al., 2018; Yoon et al., 2018; Du et al., 2021). As suggested in Assaad et al. (2021), we draw 1000 observational data points from the following data generating process:
$X_i \sim \mathcal{N}(0, \sigma^2 [\rho \mathbf{1}_p \mathbf{1}_p' + (1 - \rho) I_p])$, $\quad T_i \mid X_i \sim \text{Bernoulli}(1/(1 + \exp(-\gamma X_i)))$, $\quad Y^0_i = \beta_0' X_i + \xi_i$, $\quad Y^1_i = \beta_1' X_i + \xi_i$, $\quad \xi_i \sim \mathcal{N}(0, 1)$.
Here $\mathbf{1}_p$ denotes the $p$-dimensional all-ones vector and $I_p$ the identity matrix of size $p$. We fix $p = 10$, $\rho = 0.3$, $\sigma^2 = 2$, $\beta_0 = [0.3, \ldots, 0.3]'$, $\beta_1 = [1.3, \ldots, 1.3]'$, and vary $\gamma \in \{0.25, 0.5, 0.75, 1, 1.5, 2, 3\}$ to yield different levels of selection bias. As seen in Figure 6 in the Appendix, selection bias becomes more severe as $\gamma$ increases. For each $\gamma$, we repeat the above process to generate 30 datasets, each split into training/validation/test sets with ratio 56%/24%/20%.
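The data generating process above can be sketched directly in numpy. One assumption is made explicit here: the treatment logit $\gamma X_i$ is read as $\gamma \mathbf{1}_p' X_i$ (i.e., $\gamma$ times the covariate sum), which is one plausible scalarization; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho, sigma2, gamma = 1000, 10, 0.3, 2.0, 0.25
beta0 = np.full(p, 0.3)
beta1 = np.full(p, 1.3)

# Equicorrelated covariates: sigma^2 * [rho * 1 1' + (1 - rho) * I]
cov = sigma2 * (rho * np.ones((p, p)) + (1 - rho) * np.eye(p))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Assumed reading of the treatment logit: gamma * sum_j X_ij
e = 1.0 / (1.0 + np.exp(-gamma * X.sum(axis=1)))   # propensity score
T = rng.binomial(1, e)

xi = rng.normal(size=n)
Y0 = X @ beta0 + xi
Y1 = X @ beta1 + xi
Y = T * Y1 + (1 - T) * Y0      # factual outcome
```

Increasing `gamma` makes the propensity scores more extreme, which is exactly how the varying levels of selection bias are produced.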

Semi-synthetic data

The IHDP dataset was introduced by Hill (2011). It consists of 747 samples with 25-dimensional covariates collected from a real-world randomized experiment, and selection bias is created by removing some of the treated samples. The goal is to estimate the effect of special visits (treatment) on cognitive scores (outcome). The potential outcomes are generated using the NPCI package (Dorie, 2021). We use the same 1000 realizations as Shalit et al. (2017), splitting each dataset into training/validation/test sets with ratio 63%/27%/10%.

Models and metrics.

In the simulation experiments, we comprehensively compare INet, GNet, DINet, DGNet, and DIGNet in terms of the mean and standard error of the metrics $\sqrt{\epsilon_{PEHE}}$ and $\sqrt{\epsilon_{CF}}$. To analyze the source of gain, we compare models fairly by ensuring that each model shares the same hyperparameters, e.g., the learning rate, the number of layers and units for $(\Phi_E, \Phi_G, \Phi_I, f_t)$, and $(\alpha_1, \alpha_2)$. Note that we apply the same early stopping rule to all models, as in Shalit et al. (2017). In the IHDP experiment, we use $\sqrt{\epsilon_{PEHE}}$ as well as an additional metric $\epsilon_{ATE} = |\hat{\tau}_{ATE} - \tau_{ATE}|$ to evaluate the performance of various causal models (see Table 3). Further implementation details are given in Section A.5 of the Appendix.
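The additional ATE metric has an immediate empirical form (a one-function sketch; the name is illustrative):

```python
import numpy as np

def eps_ate(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """epsilon_ATE = |hat tau_ATE - tau_ATE|: absolute error of the
    estimated average treatment effect over a sample."""
    return float(abs(tau_hat.mean() - tau_true.mean()))
```

Unlike PEHE, which penalizes per-unit errors, `eps_ate` can be small even when individual-level errors cancel out, which is why both metrics are reported.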

5.2. RESULTS AND ANALYSIS

Varying selection bias. We first compare the models as the degree of covariate imbalance increases; the relevant results are shown in Figure 2. Among the main observations, (1) DIGNet attains the lowest $\sqrt{\epsilon_{PEHE}}$ across all datasets. In conclusion, finding (2) reveals that (i) the PPBR approach improves the predictive power for outcomes, especially counterfactual outcomes; and findings (3) and (4) reveal that (ii) the PDIG structure makes group distance minimization and individual propensity confusion complementary and mutually reinforcing.

Source of gain. To further investigate these findings, we choose the high selection bias setting ($\gamma = 3$) to explore the source of gain for PDIG and PPBR. We report model performances (mean ± std) averaged over 30 training and test sets in Table 1 and plot the metrics of the 30 test-set runs in Figure 3 and Figure 4. (1) Ablation study for PDIG: DIGNet achieves lower errors ($\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$) compared to DGNet and DINet, as seen in Table 1. (2) Ablation study for PPBR: the PPBR approach plays an essential role in outcome prediction, especially counterfactual inference, as seen in Figure 4.

IHDP results. Table 3 reports IHDP results for baselines including SITE (Yao et al., 2018), GANITE (Yoon et al., 2018), Dragonnet (Shi et al., 2019), BNN (Johansson et al., 2016), CFR-Wass (GNet) (Shalit et al., 2017), and DKLITE (Zhang et al., 2020). Moreover, it is noticeable that DIGNet achieves the lowest errors across datasets and metrics, indicating that the proposed method has the most robust performance.

6. CONCLUSION

In this paper, we derive a theoretical ITE bound based on the H-divergence and connect representation balancing with the concept of propensity confusion. Furthermore, we propose the PDIG and PPBR components, on which we build the decomposition network DIGNet for treatment effect estimation. Comprehensive experiments verify that PDIG and PPBR follow different pathways toward the same goal of improving treatment effect estimation. In particular, PDIG helps the model better capture representation balancing patterns without affecting outcome prediction, while PPBR preserves patterns predictive of outcomes to enhance outcome prediction without affecting the balancing patterns. We believe these findings constitute an important step toward understanding the generalizability of representation balancing models in counterfactual estimation. Promising directions for future work include discouraging redundancy in the shared information of balancing patterns in the PDIG structure, improving the efficiency of optimizing DIGNet's objective, and exploring whether there is an interaction between PDIG and PPBR.

A.1 NOTATIONS AND DEFINITIONS

Let $\tau_t(x) := \mathbb{E}[Y^t \mid X = x]$; then $\tau(x) = \tau_1(x) - \tau_0(x)$. Let $f : \mathcal{X} \times \{0, 1\} \to \mathcal{Y}$ be a prediction function.

Definition 2 The individual treatment effect estimate is $\hat{\tau}_f(x) := f(x, 1) - f(x, 0)$.

Definition 3 (Hill, 2011) Let $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. The expected Precision in Estimation of Heterogeneous Effect (PEHE) loss of $f$ is $\epsilon_{PEHE}(f) = \int_{\mathcal{X}} L(\hat{\tau}_f(x), \tau(x)) p(x) dx$.

Definition 4 The covariates' distributions in the treated and controlled groups are denoted by $p^{T=1}(x) := p(x \mid T = 1)$ and $p^{T=0}(x) := p(x \mid T = 0)$, respectively.

Definition 5 In our causal representation balancing approach, we assume the representation function $\Phi : \mathcal{X} \to \mathcal{R}$ is twice-differentiable and invertible, with inverse $\Psi$. Let $h : \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$ be a hypothesis defined over the representation space $\mathcal{R}$ such that $f(x, t) = h(\Phi(x), t)$.

Definition 6 The expected loss for the unit-treatment pair $(x, t)$ is $\ell_{h,\Phi}(x, t) = \int_{\mathcal{Y}} L(y^t, h(\Phi(x), t)) p(y^t \mid x) dy^t$.

Definition 7 The expected factual and counterfactual losses of $h$ and $\Phi$ are, respectively,
$\epsilon_F(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t) p(x, t) dx dt$, $\quad \epsilon_{CF}(h, \Phi) = \int_{\mathcal{X} \times \{0,1\}} \ell_{h,\Phi}(x, t) p(x, 1-t) dx dt$.

Definition 8 The expected treated and controlled losses are
$\epsilon^{T=1}_F(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1) p^{T=1}(x) dx$, $\quad \epsilon^{T=0}_F(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0) p^{T=0}(x) dx$,
$\epsilon^{T=1}_{CF}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1) p^{T=0}(x) dx$, $\quad \epsilon^{T=0}_{CF}(h, \Phi) = \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0) p^{T=1}(x) dx$.

Let $u := Pr(T = 1)$ be the proportion of treated units in the population. We then have:

Lemma 1 $\epsilon_F(h, \Phi) = u \, \epsilon^{T=1}_F(h, \Phi) + (1 - u) \, \epsilon^{T=0}_F(h, \Phi)$, and $\epsilon_{CF}(h, \Phi) = (1 - u) \, \epsilon^{T=1}_{CF}(h, \Phi) + u \, \epsilon^{T=0}_{CF}(h, \Phi)$.

Noting that $p(x, t) = u \cdot p^{T=1}(x) + (1 - u) \cdot p^{T=0}(x)$, the results follow directly from Definitions 7 and 8.

Definition 9 Let $\mathcal{G}$ be a family of functions $g : \mathcal{S} \to \mathbb{R}$. For a pair of distributions $p_1, p_2$ over $\mathcal{S}$, define the Integral Probability Metric $IPM_{\mathcal{G}}(p_1, p_2) = \sup_{g \in \mathcal{G}} |\int_{\mathcal{S}} g(s)(p_1(s) - p_2(s)) ds|$. Taking $\mathcal{G}$ to be the family of 1-Lipschitz functions yields the 1-Wasserstein distance, denoted $Wass(p_1, p_2)$ (Sriperumbudur et al., 2012).

Definition 10 Given a pair of distributions $p_1, p_2$ over $\mathcal{S}$ and a binary hypothesis class $\mathcal{H}$, the H-divergence between $p_1$ and $p_2$ is $d_{\mathcal{H}}(p_1, p_2) = 2 \sup_{\eta \in \mathcal{H}} |Pr_{p_1}[\eta(s) = 1] - Pr_{p_2}[\eta(s) = 1]|$.

Lemma 2 Taking $\mathcal{G}$ in Definition 9 to be the family of binary functions yields half of the H-divergence.

Proof Let $I(\cdot)$ denote an indicator function. Then
$d_{\mathcal{H}}(p_1, p_2) = 2 \sup_{\eta \in \mathcal{H}} \int_{\eta(s) = 1} (p_1(s) - p_2(s)) ds = 2 \sup_{\eta \in \mathcal{H}} \int_{\mathcal{S}} I(\eta(s) = 1)(p_1(s) - p_2(s)) ds = 2 \sup_{\eta \in \mathcal{H}} \int_{\mathcal{S}} \eta(s)(p_1(s) - p_2(s)) ds$,
where the last equality holds because an indicator function is itself a binary function.
□

A.2 BOUNDS FOR THE COUNTERFACTUAL ERROR $\epsilon_{CF}$

We first derive the counterfactual error bound using the Wasserstein distance. The following Lemma 3 and its proof are identical to Lemma 1 of Shalit et al. (2017).

Lemma 3 Let $\Phi : \mathcal{X} \to \mathcal{R}$ be an invertible representation with inverse $\Psi$. Let $p^{T=1}_\Phi(r)$, $p^{T=0}_\Phi(r)$ be as defined before. Let $h : \mathcal{R} \times \{0, 1\} \to \mathcal{Y}$, $u := Pr(T = 1)$, and let $\mathcal{G}$ be the family of 1-Lipschitz functions. Assume there exists a constant $B_\Phi \ge 0$ such that, for $t = 0, 1$, the function $g_{\Phi,h}(r, t) := \frac{1}{B_\Phi} \ell_{h,\Phi}(\Psi(r), t) \in \mathcal{G}$. Then
$\epsilon_{CF}(h, \Phi) \le (1 - u) \, \epsilon^{T=1}_F(h, \Phi) + u \, \epsilon^{T=0}_F(h, \Phi) + B_\Phi \cdot Wass(p^{T=1}_\Phi, p^{T=0}_\Phi)$.

Proof
$\epsilon_{CF}(h, \Phi) - [(1 - u) \, \epsilon^{T=1}_F(h, \Phi) + u \, \epsilon^{T=0}_F(h, \Phi)]$
$= [(1 - u) \, \epsilon^{T=1}_{CF}(h, \Phi) + u \, \epsilon^{T=0}_{CF}(h, \Phi)] - [(1 - u) \, \epsilon^{T=1}_F(h, \Phi) + u \, \epsilon^{T=0}_F(h, \Phi)]$
$= (1 - u)[\epsilon^{T=1}_{CF}(h, \Phi) - \epsilon^{T=1}_F(h, \Phi)] + u[\epsilon^{T=0}_{CF}(h, \Phi) - \epsilon^{T=0}_F(h, \Phi)]$
$= (1 - u) \int_{\mathcal{X}} \ell_{h,\Phi}(x, 1)(p^{T=0}(x) - p^{T=1}(x)) dx + u \int_{\mathcal{X}} \ell_{h,\Phi}(x, 0)(p^{T=1}(x) - p^{T=0}(x)) dx$   (19)
$= (1 - u) \int_{\mathcal{R}} \ell_{h,\Phi}(\Psi(r), 1)(p^{T=0}_\Phi(r) - p^{T=1}_\Phi(r)) dr + u \int_{\mathcal{R}} \ell_{h,\Phi}(\Psi(r), 0)(p^{T=1}_\Phi(r) - p^{T=0}_\Phi(r)) dr$   (20)
$= B_\Phi (1 - u) \int_{\mathcal{R}} \tfrac{1}{B_\Phi} \ell_{h,\Phi}(\Psi(r), 1)(p^{T=0}_\Phi(r) - p^{T=1}_\Phi(r)) dr + B_\Phi u \int_{\mathcal{R}} \tfrac{1}{B_\Phi} \ell_{h,\Phi}(\Psi(r), 0)(p^{T=1}_\Phi(r) - p^{T=0}_\Phi(r)) dr$
$\le B_\Phi (1 - u) \sup_{g \in \mathcal{G}} \left|\int_{\mathcal{R}} g(r)(p^{T=0}_\Phi(r) - p^{T=1}_\Phi(r)) dr\right| + B_\Phi u \sup_{g \in \mathcal{G}} \left|\int_{\mathcal{R}} g(r)(p^{T=1}_\Phi(r) - p^{T=0}_\Phi(r)) dr\right|$   (21)
$= B_\Phi \cdot Wass(p^{T=1}_\Phi, p^{T=0}_\Phi)$.   (22)
Equation (19) follows from Definition 8; equation (20) follows from the change-of-variables formula $p^{T=0}_\Phi(r) = p^{T=0}(\Psi(r)) J_\Psi(r)$, $p^{T=1}_\Phi(r) = p^{T=1}(\Psi(r)) J_\Psi(r)$, where $J_\Psi(r)$ is the absolute value of the determinant of the Jacobian of $\Psi(r)$; inequality (21) follows from the premise that $\frac{1}{B_\Phi} \ell_{h,\Phi}(\Psi(r), t) \in \mathcal{G}$ for $t = 0, 1$; and (22) follows from Definition 9 of an IPM. □

The crucial condition in Lemma 3 is that $g_{\Phi,h}(r, t) := \frac{1}{B_\Phi} \ell_{h,\Phi}(\Psi(r), t) \in \mathcal{G}$.
Bounds for the constant $B_\Phi$ can be given under additional assumptions about the loss function $L$, the Lipschitz constants of $p(y_t \mid x)$ and $h$, and the condition number of the Jacobian of $\Phi$. These assumptions and the specific bounds for $B_\Phi$ can be found in supplement Section A.3 of (Shalit et al., 2017). We now derive the counterfactual error bound for the $\mathcal{H}$-divergence case.

Assumption 1 There exists a constant $K$ such that for all $y_2 \in \mathcal{Y}$, $\int_{\mathcal{Y}} L(y_1, y_2)\, dy_1 \le K$.

Lemma 4 Let $\Phi : \mathcal{X} \to \mathcal{R}$ be an invertible representation with inverse $\Psi$. Let $p_\Phi^{T=1}(r), p_\Phi^{T=0}(r)$ be as defined before. Let $h : \mathcal{R} \times \{0,1\} \to \mathcal{Y}$, $u := \Pr(T=1)$, and let $\mathcal{H}$ be the family of binary functions. Assume the loss function $L$ obeys Assumption 1. Then we have:
$$\epsilon_{CF}(h, \Phi) \le (1-u) \cdot \epsilon_F^{T=1}(h, \Phi) + u \cdot \epsilon_F^{T=0}(h, \Phi) + \frac{K}{2}\, d_{\mathcal{H}}(p_\Phi^{T=1}, p_\Phi^{T=0}).$$

Proof
$$\begin{aligned}
&\epsilon_{CF}(h, \Phi) - \big[(1-u) \cdot \epsilon_F^{T=1}(h, \Phi) + u \cdot \epsilon_F^{T=0}(h, \Phi)\big] \\
&= (1-u) \int_{\mathcal{R}} \ell_{h,\Phi}(\Psi(r), 1)\big(p_\Phi^{T=0}(r) - p_\Phi^{T=1}(r)\big)\, dr + u \int_{\mathcal{R}} \ell_{h,\Phi}(\Psi(r), 0)\big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \quad (23) \\
&\le (1-u) \int_{p_\Phi^{T=0} > p_\Phi^{T=1}} \ell_{h,\Phi}(\Psi(r), 1)\big(p_\Phi^{T=0}(r) - p_\Phi^{T=1}(r)\big)\, dr + u \int_{p_\Phi^{T=1} > p_\Phi^{T=0}} \ell_{h,\Phi}(\Psi(r), 0)\big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \quad (24) \\
&\le (1-u) K \int_{p_\Phi^{T=0} > p_\Phi^{T=1}} \big(p_\Phi^{T=0}(r) - p_\Phi^{T=1}(r)\big)\, dr + u K \int_{p_\Phi^{T=1} > p_\Phi^{T=0}} \big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \quad (25) \\
&= (1-u) K \int_{\mathcal{R}} I\big(p_\Phi^{T=0} > p_\Phi^{T=1}\big)\big(p_\Phi^{T=0}(r) - p_\Phi^{T=1}(r)\big)\, dr + u K \int_{\mathcal{R}} I\big(p_\Phi^{T=1} > p_\Phi^{T=0}\big)\big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \\
&\le (1-u) K \sup_{\eta \in \mathcal{H}} \Big| \int_{\mathcal{R}} \eta(r)\big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \Big| + u K \sup_{\eta \in \mathcal{H}} \Big| \int_{\mathcal{R}} \eta(r)\big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \Big| \quad (26) \\
&= K \sup_{\eta \in \mathcal{H}} \Big| \int_{\mathcal{R}} \eta(r)\big(p_\Phi^{T=1}(r) - p_\Phi^{T=0}(r)\big)\, dr \Big| = \frac{K}{2}\, d_{\mathcal{H}}(p_\Phi^{T=1}, p_\Phi^{T=0})
\end{aligned}$$
Equation (23) follows as in equation (20); inequality (24) is because $\ell_{h,\Phi} \ge 0$ for all $r$ and $t$; inequality (25) is by Assumption 1; and inequality (26) is because indicator functions are binary functions, combined with Lemma 2. □

A.3 BOUNDS FOR $\epsilon_{PEHE}$

We first state two lemmas for $\epsilon_{PEHE}$ with respect to two different loss functions: the squared loss and the absolute loss. In fact, similar lemmas hold for any loss function that satisfies the (relaxed) triangle inequality.
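The $d_{\mathcal{H}}$ term in Lemma 4 is what propensity-confusion models approximate in practice: since the supremum over all binary functions is intractable, one trains a discriminator to separate the two groups and converts its error into a divergence estimate, in the spirit of the proxy A-distance of Ben-David et al. (2006). A minimal numpy sketch with a hand-rolled logistic discriminator; the architecture, learning rate, and sample sizes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))   # clip to avoid overflow

def h_divergence_proxy(r1, r2, steps=500, lr=0.5):
    """Classifier-based proxy for d_H between two representation samples.

    Trains a logistic discriminator to separate r1 (label 1) from r2
    (label 0); if its training error is err, the proxy is 2 * (1 - 2 * err).
    """
    R = np.vstack([r1, r2])
    y = np.concatenate([np.ones(len(r1)), np.zeros(len(r2))])
    X = np.hstack([R, np.ones((len(R), 1))])        # add an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(steps):                          # plain gradient descent
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    err = np.mean((sigmoid(X @ w) > 0.5) != y)
    return 2 * (1 - 2 * err)

rng = np.random.default_rng(0)
balanced = h_divergence_proxy(rng.normal(size=(500, 2)),
                              rng.normal(size=(500, 2)))          # near 0
shifted = h_divergence_proxy(rng.normal(size=(500, 2)),
                             rng.normal(loc=3.0, size=(500, 2)))  # near 2
```

Identical distributions yield a proxy near 0 (the discriminator cannot beat chance), while well-separated groups drive it toward the maximum value of 2.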
For the absolute loss $L(y_1, y_2) = |y_1 - y_2|$, which satisfies the triangle inequality, the upper bound in Lemma 5 replaces the variance term $\sigma_y^2$ with the mean absolute deviation $A_y$.

Definition 12 The mean absolute deviation of $y_t$ with respect to $p(x, t)$ is:
$$A_{y_t}(p(x, t)) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} |y_t - \tau_t(x)|\, p(y_t \mid x)\, p(x, t)\, dy_t\, dx\, dt,$$
and define $A_y = \min\{A_{y_t}(p(x, t)), A_{y_t}(p(x, 1-t))\}$.

Lemma 6 Let the loss function $L$ be the absolute loss, $L(y_1, y_2) = |y_1 - y_2|$. For any function $f : \mathcal{X} \times \{0,1\} \to \mathcal{Y}$ and distribution $p(x, t)$ over $\mathcal{X} \times \{0,1\}$:
$$\epsilon_{PEHE}(h, \Phi) \le \epsilon_{CF}(h, \Phi) + \epsilon_F(h, \Phi) - 2A_y.$$

Proof Recall that $\epsilon_{PEHE}(f) = \epsilon_{PEHE}(h, \Phi)$, $\epsilon_F(f) = \epsilon_F(h, \Phi)$, and $\epsilon_{CF}(f) = \epsilon_{CF}(h, \Phi)$ for $f(x, t) = h(\Phi(x), t)$.
$$\begin{aligned}
\epsilon_{PEHE}(f) &= \int_{\mathcal{X}} \big|(f(x, 1) - f(x, 0)) - (\tau_1(x) - \tau_0(x))\big|\, p(x)\, dx \\
&\le \int_{\mathcal{X}} \big(|f(x, 1) - \tau_1(x)| + |f(x, 0) - \tau_0(x)|\big)\, p(x)\, dx \quad (32) \\
&= \int_{\mathcal{X}} |f(x, 1) - \tau_1(x)|\, p(x, T=1)\, dx + \int_{\mathcal{X}} |f(x, 0) - \tau_0(x)|\, p(x, T=0)\, dx \\
&\quad + \int_{\mathcal{X}} |f(x, 1) - \tau_1(x)|\, p(x, T=0)\, dx + \int_{\mathcal{X}} |f(x, 0) - \tau_0(x)|\, p(x, T=1)\, dx \quad (33) \\
&= \int_{\mathcal{X} \times \{0,1\}} |f(x, t) - \tau_t(x)|\, p(x, t)\, dx\, dt + \int_{\mathcal{X} \times \{0,1\}} |f(x, t) - \tau_t(x)|\, p(x, 1-t)\, dx\, dt.
\end{aligned}$$
Moreover,
$$\epsilon_F(f) = \int_{\mathcal{X} \times \{0,1\}} |f(x, t) - \tau_t(x)|\, p(x, t)\, dx\, dt + A_{y_t}(p(x, t)),$$
and a similar result can be obtained for $\epsilon_{CF}$:
$$\epsilon_{CF}(f) = \int_{\mathcal{X} \times \{0,1\}} |f(x, t) - \tau_t(x)|\, p(x, 1-t)\, dx\, dt + A_{y_t}(p(x, 1-t)).$$
Combining these results and Definition 12, we have
$$\epsilon_{PEHE}(h, \Phi) \le \epsilon_F(f) - A_{y_t}(p(x, t)) + \epsilon_{CF}(f) - A_{y_t}(p(x, 1-t)) \le \epsilon_{CF}(h, \Phi) + \epsilon_F(h, \Phi) - 2A_y.$$

□

We summarize the upper bounds of $\epsilon_{CF}$ and $\epsilon_{PEHE}$ above and give the final bounds for the two distances using the squared and absolute losses, respectively. Example 1. Suppose there is a vaccine to prevent some kind of disease. Let X denote the covariate (age), T = 1 denote the treatment (getting vaccinated), T = 0 denote the control (not getting vaccinated), and Y denote the outcome (probability of getting the disease). Suppose that the vaccine is assigned according to age, and we have found that the older a person is, the higher the probability of getting the disease. The left graph in Figure 5 shows the distribution of the pre-balancing covariate X for the treated and controlled groups, which indicates that vaccines are more likely to be distributed to older people. Technically, the pre-balancing data preserve the outcome-predictive information: if we want to estimate Y using the covariate X, we are confident that people in the treatment/control group (orange/blue) are susceptible/unsusceptible to the disease since they are older/younger. The right panel of Figure 5 shows the distribution of the adjusted covariate X, over which the distributions of the treated and controlled groups are highly balanced. In this case, however, the distribution of X is too balanced, making it hard to distinguish the treatment samples from the control samples. Consequently, if we want to estimate Y using X, we may get confused about which group is susceptible to the disease because the distributions of X are almost identical between the treated and controlled groups. Therefore, considering only balancing patterns can result in a loss of outcome-predictive information.


Example 2. Following the example above, let us give a special but more intuitive instance. Imagine two men who are entirely identical except for age, of whom one is older (treated, T) and the other is younger (control, C). We can thus use the covariate age to distinguish T from C. We also observe that the older man is susceptible to the disease. However, once their ages are mapped to representations that are over-balanced, even identical, such representations become useless for distinguishing who is T and who is C. As a result, it will be difficult to use such representations to estimate who is susceptible to the disease. Therefore, over-balanced representations may lose outcome-predictive information. In summary, on the one hand, involving representation balancing can benefit treatment effect estimation. On the other hand, if $p_\Phi^{T=1}$ and $p_\Phi^{T=0}$ are too balanced, a model may fail to preserve pre-balancing information that is useful for outcome prediction. This dilemma motivates us to incorporate PPBR and PDIG such that PPBR improves outcome prediction without harming representation balancing, and PDIG helps a model achieve more balanced representations without harming outcome prediction.

A.5 ADDITIONAL EXPERIMENTAL DETAILS

Hyperparameters. In simulation studies, we ensure a fair comparison by fixing all hyperparameters across datasets and models. The relevant details are stated in Table 4. In the IHDP studies, to compare with the baseline model CFR-Wass (GNet), we keep the hyperparameters of INet, DGNet, and DINet, as well as the early stopping rule, the same as those used in CFR-Wass (Shalit et al., 2017). Since DIGNet is more complex than the other four models, we tune the hyperparameters of $\Phi_E$, $\Phi_G$, $\Phi_I$, $\alpha_1$, and $\alpha_2$ for DIGNet as Shalit et al. (2017) do. The relevant details are stated in Table 5.

Additional IHDP results. The selection bias of different simulated datasets is illustrated in Figure 6. We compare DIGNet with more baselines in Table 6; columns are in-sample $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ followed by out-of-sample $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$. Note that "–" indicates that either the result is not reproducible or the original paper does not report the relevant values.

SITE (Yao et al., 2018): .69 ± .0 / .22 ± .01; .75 ± .0 / .24 ± .01
GANITE (Yoon et al., 2018): 1.9 ± .4 / .43 ± .05; 2.4 ± .4 / .49 ± .05
BLR (Johansson et al., 2016): 5.8 ± .3 / .72 ± .04; 5.8 ± .3 / .93 ± .05
BNN (Johansson et al., 2016): 2.2 ± .1 / .37 ± .03; 2.1 ± .1 / .42 ± .03
TARNet (Shalit et al., 2017): .88 ± .0 / .26 ± .01; .95 ± .0 / .28 ± .01
CFR-Wass (GNet) (Shalit et al., 2017): .73 ± .0 / .12 ± .01; .81 ± .0 / .15 ± .01
Dragonnet (Shi et al., 2019): 1.3 ± .4 / .14 ± .01; 1.3 ± .5 / .20 ± .05
DKLITE (Zhang et al., 2020): – / –; – / –
(Huang et al., 2022a): .52 ± .0 / .12 ± .01; .57 ± .0 / .13 ± .01
DIGNet (Ours): .42 ± .0 / .11 ± .01; .45 ± .0 / .12 ± .01

We collect baseline methods that focus on treatment effect estimation, especially methods using deep representation learning techniques, from recent machine learning conferences (e.g., ICML, NeurIPS, ICLR, AISTATS, and PRICAI).

Analysis of training time and training stability. We record the time it takes for each model to run through the 100 IHDP datasets, with each model trained for at most 600 epochs. Following Shalit et al. (2017), all models adopt the early stopping rule.
The training process of DIGNet is stable, even steadier than that of GNet and INet. From this perspective, we have not observed any particular difficulty in optimizing DIGNet.

A.6 OBJECTIVES OF DGNET AND DINET

Objective of DGNet. Note that, similar to DIGNet, the pre-balancing patterns are preserved by only updating $\Phi_G$ while fixing $\Phi_E$ in the first step:
$$\min_{\Phi_G} \alpha_1 L_G(x, t; \Phi_G \circ \Phi_E), \qquad \min_{\Phi_E, \Phi_G, h_t} L_y\big(x, t, y; \Phi_E \oplus (\Phi_G \circ \Phi_E), h_t\big).$$

Objective of DIGNet.
$$\min_{\Phi_G} \alpha_1 L_G(x, t; \Phi_G \circ \Phi_E), \qquad \max_{\pi} \alpha_2 L_I(x, t; \Phi_I \circ \Phi_E, \pi), \qquad \min_{\Phi_I} \alpha_2 L_I(x, t; \Phi_I \circ \Phi_E, \pi),$$
$$\min_{\Phi_E, \Phi_I, \Phi_G, h_t} L_y\big(x, t, y; \Phi_E \oplus (\Phi_I \circ \Phi_E) \oplus (\Phi_G \circ \Phi_E), h_t\big).$$
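To make the alternating scheme concrete, the following is a minimal numpy sketch of the individual-propensity-confusion steps, with a single linear map standing in for $\Phi_I \circ \Phi_E$ and a logistic discriminator standing in for $\pi$. The data-generating process, the learning rates, and the omission of the $\Phi_G$ and outcome steps are all illustrative simplifications, not the actual DIGNet implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

# Toy observational data with selection bias: the propensity depends on x_0.
n, d = 2000, 5
X = rng.normal(size=(n, d))
t = rng.binomial(1, sigmoid(2 * X[:, 0]))

W = 0.1 * rng.normal(size=(d, d))   # linear stand-in for Phi_I o Phi_E
w_pi = np.zeros(d)                  # logistic discriminator pi

for _ in range(300):
    # max_pi L_I: pi learns to predict t from the current representation
    # (descending the cross-entropy, i.e., ascending the log-likelihood L_I).
    R = X @ W
    w_pi -= 0.5 * R.T @ (sigmoid(R @ w_pi) - t) / n
    # min_{Phi_I} L_I: the representation moves to confuse the discriminator
    # (ascending the same cross-entropy with respect to W).
    p = sigmoid(X @ W @ w_pi)
    W += 0.2 * X.T @ ((p - t)[:, None] * w_pi[None, :]) / n
    # The third step (min over Phi_E, Phi_I, Phi_G, h_t of the outcome loss
    # L_y on the concatenated representations) is omitted in this sketch.

# Propensity confusion: pi's accuracy should sit below what a discriminator
# can reach on the raw covariates, since W was trained to confuse it.
acc = np.mean((sigmoid(X @ W @ w_pi) > 0.5) == t)
```

The outcome head $h_t$ in the full model would then consume the concatenation of the pre-balancing representation $\Phi_E(x)$ with the balanced ones, which is exactly what PPBR prescribes.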



Figure 1: Illustration of the network architectures of the five models studied in Section 5.

… such that $\pi(\Phi_E(x_i))$ cannot correctly specify the true propensity score $e(x_i)$. Eventually, the representations are balanced because it is difficult for $\pi$ to determine the propensity of $\Phi(x_i)$ being treated or controlled. For the convenience of the reader, we illustrate the structure of INet in Figure 1(b). Note that although INet follows the strategy of approximating the $\mathcal{H}$-divergence in Ganin et al. (2016), the theoretical derivations are completely different due to the significant differences (e.g., definitions, problem settings, and theoretical bounds) between causal inference and domain adaptation.

DGNet and DINet. For further ablation studies, we also propose two models, DGNet and DINet, which can be viewed either as DIGNet without PDIG, or as GNet and INet equipped with PPBR. The structures of DGNet and DINet are shown in Figures 1(c) and 1(d), and their objectives are deferred to Section A.6 in the Appendix.

We report $\sqrt{\epsilon_{PEHE}}$, $\sqrt{\epsilon_{CF}}$, and $\sqrt{\epsilon_F}$ with $L$ defined in Definition 1 being the squared loss, as well as the empirical approximations of $Wass(p_\Phi^{T=1}, p_\Phi^{T=0})$ and $d_{\mathcal{H}}(p_\Phi^{T=1}, p_\Phi^{T=0})$ (denoted by $\widehat{Wass}$ and $\hat{d}_{\mathcal{H}}$, respectively). Note that, as shown in Figure 1, $\widehat{Wass}$ is computed over $\Phi_E$ for GNet but over $\Phi_G$ for DGNet and DIGNet; $\hat{d}_{\mathcal{H}}$ is computed over $\Phi_E$ for INet but over $\Phi_I$ for DINet and DIGNet.

Figure 2: Plots of model performance on the test set for different metrics as γ varies in {0.25, 0.5, 0.75, 1, 1.5, 2, 3}. Each graph shows the average of 30 runs with standard errors shaded.

(1) … performs worse than the other models; (2) DINet and DGNet outperform INet and GNet in terms of $\sqrt{\epsilon_{CF}}$ and $\sqrt{\epsilon_{PEHE}}$; (3) INet, DINet, and DGNet perform similarly to DIGNet on factual outcome estimation ($\sqrt{\epsilon_F}$) but cannot compete with DIGNet on counterfactual estimation ($\sqrt{\epsilon_{CF}}$); (4) DIGNet achieves a smaller $\hat{d}_{\mathcal{H}}$ (or $\widehat{Wass}$) than DINet and INet (or DGNet and GNet), especially when the covariate shift problem is severe (e.g., when γ > 1).

Figure 3: Plots of model performance on the test set for $\sqrt{\epsilon_F}$, $\sqrt{\epsilon_{CF}}$, $\hat{d}_{\mathcal{H}}$, and $\widehat{Wass}$ when γ = 3. Each graph plots the metric for 30 runs; the mean ± std of each metric averaged across the 30 runs is reported at the top.

Figure 4: Plots of model performance on the test set for $\sqrt{\epsilon_F}$, $\sqrt{\epsilon_{CF}}$, $\hat{d}_{\mathcal{H}}$, and $\widehat{Wass}$ when γ = 3. Left graphs compare DGNet with GNet, and right graphs compare DINet with INet. Each graph plots the metric for 30 runs; the mean ± std of each metric averaged across the 30 runs is reported at the top.

Assume $\Phi : \mathcal{X} \to \mathcal{R}$ is a twice-differentiable, one-to-one function, where $\mathcal{R} \subset \mathbb{R}^d$ is the representation space. We can then denote the inverse of $\Phi$ by $\Psi : \mathcal{R} \to \mathcal{X}$ and the induced distribution of $r$ by $p_\Phi$.

Definition 5 The covariate distributions of the treated and controlled groups over $\mathcal{R}$ are denoted by $p_\Phi^{T=1}(r) := p_\Phi(r \mid T = 1)$ and $p_\Phi^{T=0}(r) := p_\Phi(r \mid T = 0)$, respectively.
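As a concrete instance of the induced distribution (and of the change-of-variables step used in the proofs), consider a hypothetical one-dimensional scaling representation:

```latex
% Hypothetical example: Phi(x) = 2x on X = R.
\Phi(x) = 2x \;\Rightarrow\; \Psi(r) = \tfrac{r}{2}, \qquad
J_\Psi(r) = \tfrac{1}{2}, \qquad
p_\Phi^{T=1}(r) = p^{T=1}\!\big(\tfrac{r}{2}\big) \cdot \tfrac{1}{2},
\qquad
p_\Phi^{T=0}(r) = p^{T=0}\!\big(\tfrac{r}{2}\big) \cdot \tfrac{1}{2}.
```

In this toy case a location shift between $p^{T=1}$ and $p^{T=0}$ is stretched by a factor of 2 in representation space, which is why the balancing terms $Wass$ and $d_{\mathcal{H}}$ are measured over the induced distributions $p_\Phi^{T=1}, p_\Phi^{T=0}$ rather than over the raw covariates.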

Figure 5: Example for illustrating the importance of decomposed patterns.

We also record the average early stopping epoch and the actual time over the 100 runs, where (actual time) = (total time) × (average early stopping epoch) / 600. Not surprisingly, GNet took the least amount of time, 3096 seconds, since its objective is the simplest. Interestingly, however, the proposed methods DGNet and DINet are the first two to early stop. As a result, although DGNet and DINet have multiple objectives, they spent less actual training time yet achieved better ITE estimation compared to GNet and INet. Since GNet and INet are exactly DGNet and DINet with PPBR ablated, we find that the PPBR component can help a model achieve better ITE estimates in less time. In addition, DIGNet took the longest to optimize, since it has the most complex objective. To further study the stability of model training, we also plot the metrics $\sqrt{\epsilon_F}$, $\widehat{Wass}$, $\hat{d}_{\mathcal{H}}$, and $\sqrt{\epsilon_{PEHE}}$ for the first 100 epochs of each model on the first IHDP dataset.
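The actual-time formula above is a simple proration; a sketch with hypothetical numbers (the 3096-second figure is GNet's reported total, while the stopping epoch of 300 is assumed here purely for illustration):

```python
# (actual time) = (total time) x (average early-stopping epoch) / budget,
# where total_seconds is the time for the full 600-epoch training budget.
def actual_time(total_seconds, avg_stop_epoch, budget=600):
    return total_seconds * avg_stop_epoch / budget

# Hypothetical illustration: a model whose 600-epoch budget costs 3096 s
# and which stops, on average, at epoch 300 spends half that in practice.
halfway = actual_time(3096, 300)   # 1548.0 seconds
```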

Figure 6: t-SNE visualizations of the covariates as γ varies. Red represents the treated group and blue represents the control group. A larger γ indicates a greater imbalance between the two groups.

From these results, we gain an important insight: the difference in the learned representation-balancing patterns, measured by $\widehat{Wass}$ (or $\hat{d}_{\mathcal{H}}$), between DGNet and GNet (or DINet and INet) is negligible. This means that PPBR has no impact on the representation balancing task. However, PPBR can improve the predictive power of factual outcomes, reducing $\sqrt{\epsilon_F}$.

Training- and test-set $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ when γ = 3. Mean ± standard error of 30 runs.

Training- and test-set $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on IHDP. Mean ± standard error of 100 runs.

Training- and test-set $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on IHDP. Mean ± standard error of 1000 runs.

Inequality (34) is again by the triangle inequality, $|x + y| \le |x| + |y|$; equation (35) is by Definition 12. A similar result can be obtained for $\epsilon_{CF}$.

Table 4: Hyperparameters of different models in the simulation studies.

Table 5: Hyperparameters of different models in the IHDP experiments.

Table 6: Additional comparisons on the IHDP dataset.

Training time records on the 100 IHDP datasets. Columns: Model; Time for 600 epochs; Avg. early stopping epoch; Actual time; $\sqrt{\epsilon_{PEHE}}$ on the test set.

Training loss plots for the first 100 epochs on the first IHDP dataset.

Objective of DINet. Note that, similar to DIGNet, the pre-balancing patterns are preserved by only updating $\Phi_I$ while fixing $\Phi_E$ in the second step:
$$\max_{\pi} \alpha_2 L_I(x, t; \Phi_I \circ \Phi_E, \pi), \qquad \min_{\Phi_I} \alpha_2 L_I(x, t; \Phi_I \circ \Phi_E, \pi), \qquad \min_{\Phi_E, \Phi_I, h_t} L_y\big(x, t, y; \Phi_E \oplus (\Phi_I \circ \Phi_E), h_t\big).$$


Definition 11 The expected variance of $y_t$ with respect to $p(x, t)$ is:
$$\sigma_{y_t}^2(p(x, t)) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} (y_t - \tau_t(x))^2\, p(y_t \mid x)\, p(x, t)\, dy_t\, dx\, dt,$$
and define $\sigma_y^2 = \min\{\sigma_{y_t}^2(p(x, t)), \sigma_{y_t}^2(p(x, 1-t))\}$.

Lemma 5 Let the loss function $L$ be the squared loss, $L(y_1, y_2) = (y_1 - y_2)^2$. For any function $f : \mathcal{X} \times \{0,1\} \to \mathcal{Y}$ and distribution $p(x, t)$ over $\mathcal{X} \times \{0,1\}$:
$$\epsilon_{PEHE}(h, \Phi) \le 2\big(\epsilon_{CF}(h, \Phi) + \epsilon_F(h, \Phi) - 2\sigma_y^2\big).$$

Proof Inequality (28) is by the relaxed triangle inequality, $(x + y)^2 \le 2(x^2 + y^2)$; equation (29) is because $p(x) = p(x, T=0) + p(x, T=1)$. Equation (31) is by Definition 11, and the last term in equation (30) equals zero, since $\tau_t(x) = \int_{\mathcal{Y}} y_t\, p(y_t \mid x)\, dy_t$. A similar result can be obtained for $\epsilon_{CF}$. Combining these results and Definition 11 yields the claim. □

Theorem 1 Let $\Phi : \mathcal{X} \to \mathcal{R}$ be an invertible representation with inverse $\Psi$. Let $p_\Phi^{T=1}(r), p_\Phi^{T=0}(r)$ be as defined before. Let $h : \mathcal{R} \times \{0,1\} \to \mathcal{Y}$, $u := \Pr(T=1)$, and let $G$ be the family of 1-Lipschitz functions. Assume there exists a constant $B_\Phi \ge 0$ such that for $t = 0, 1$, the function $g_{\Phi,h}(r, t) := \frac{1}{B_\Phi}\ell_{h,\Phi}(\Psi(r), t) \in G$. If $L$ is the squared loss, $L(y_1, y_2) = (y_1 - y_2)^2$, then we have:
$$\epsilon_{PEHE}(h, \Phi) \le 2\big(\epsilon_F^{T=1}(h, \Phi) + \epsilon_F^{T=0}(h, \Phi) + B_\Phi \cdot Wass(p_\Phi^{T=1}, p_\Phi^{T=0}) - 2\sigma_y^2\big).$$
If $L$ is the absolute loss, $L(y_1, y_2) = |y_1 - y_2|$, then we have:
$$\epsilon_{PEHE}(h, \Phi) \le \epsilon_F^{T=1}(h, \Phi) + \epsilon_F^{T=0}(h, \Phi) + B_\Phi \cdot Wass(p_\Phi^{T=1}, p_\Phi^{T=0}) - 2A_y.$$

Proof Inequality (36) is by Lemma 5 and inequality (37) is by Lemmas 1 and 3; inequality (38) is by Lemma 6 and inequality (39) is by Lemmas 1 and 3. □

Theorem 2 Let $\Phi$, $\Psi$, $p_\Phi^{T=1}(r)$, $p_\Phi^{T=0}(r)$, $h$, and $u$ be as above, let $\mathcal{H}$ be the family of binary functions, and assume $L$ obeys Assumption 1. If $L$ is the squared loss, $L(y_1, y_2) = (y_1 - y_2)^2$, then we have:
$$\epsilon_{PEHE}(h, \Phi) \le 2\Big(\epsilon_F^{T=1}(h, \Phi) + \epsilon_F^{T=0}(h, \Phi) + \frac{K}{2}\, d_{\mathcal{H}}(p_\Phi^{T=1}, p_\Phi^{T=0}) - 2\sigma_y^2\Big).$$
If $L$ is the absolute loss, $L(y_1, y_2) = |y_1 - y_2|$, then we have:
$$\epsilon_{PEHE}(h, \Phi) \le \epsilon_F^{T=1}(h, \Phi) + \epsilon_F^{T=0}(h, \Phi) + \frac{K}{2}\, d_{\mathcal{H}}(p_\Phi^{T=1}, p_\Phi^{T=0}) - 2A_y.$$

Proof Inequality (40) is by Lemma 5 and inequality (41) is by Lemmas 1 and 4; inequality (42) is by Lemma 6 and inequality (43) is by Lemmas 1 and 4. □

Obviously, when using the Wasserstein distance, various versions of these bounds hold for different loss functions as long as they satisfy the (relaxed) triangle inequality and the assumptions about $\ell_{h,\Phi}$ in Theorem 1. Similarly, when using the $\mathcal{H}$-divergence, analogous bounds hold for loss functions that satisfy Assumption 1 and the (relaxed) triangle inequality. For an empirical sample and a family of representations and hypotheses, we can further upper bound $\epsilon_F^{T=0}$ and $\epsilon_F^{T=1}$ by their respective empirical losses plus a model complexity term using standard arguments (Shalev-Shwartz & Ben-David, 2014). Both the Wasserstein distance and the $\mathcal{H}$-divergence can be consistently estimated from finite samples (Sriperumbudur et al., 2012; Ben-David et al., 2006; 2010).
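The key step behind Lemma 5, the vanishing last term in equation (30), is the usual bias-variance split: because $\tau_t(x)$ is the conditional mean of $y_t$, the cross term $\mathbb{E}\big[(y_t - \tau_t(x))(\tau_t(x) - f(x, t))\big]$ is zero. A Monte Carlo sketch with a hypothetical linear outcome model:

```python
import numpy as np

rng = np.random.default_rng(0)

# With tau(x) = E[y|x], the expected squared loss splits into outcome noise
# (Definition 11) plus the estimation error of f; the cross term vanishes.
n = 200_000
x = rng.normal(size=n)
tau = 2 * x + 1                 # hypothetical true conditional mean of y_t
y = tau + rng.normal(size=n)    # y_t | x ~ N(tau(x), 1)
f = 1.5 * x                     # an arbitrary (mis-specified) hypothesis

lhs = np.mean((y - f) ** 2)                               # expected loss
rhs = np.mean((y - tau) ** 2) + np.mean((tau - f) ** 2)   # noise + error
assert abs(lhs - rhs) < 0.05    # equal up to Monte Carlo error
```

Analytically, the left-hand side here is $\mathbb{E}[(0.5x + 1 + \varepsilon)^2] = 0.25 + 1 + 1 = 2.25$, matching the decomposition term by term.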

A.4 ILLUSTRATIVE EXAMPLES

Examples motivating decomposed patterns. To explain the dilemma between representation balancing and outcome prediction, we give two intuitive examples below to help readers better understand the motivation for, and importance of, involving decomposed patterns in representation balancing models.

