VARIATIONAL AUTO-ENCODER ARCHITECTURES THAT EXCEL AT CAUSAL INFERENCE

Abstract

This paper provides a generative approach to causal inference using data from observational studies. Inspired by the work of Kingma et al. (2014), we propose a sequence of three architectures (namely Series, Parallel, and Hybrid) that each incorporate their M1 and M2 models as building blocks. Each architecture improves over the previous one in terms of estimating causal effects, culminating in the Hybrid model. The Hybrid model is designed to encourage decomposing the underlying factors of any observational dataset; this, in turn, helps to accurately estimate all treatment outcomes. Our empirical results demonstrate the superiority of all three proposed architectures over both state-of-the-art discriminative and generative approaches in the literature.

1. INTRODUCTION

As one of the main tasks in studying causality (Peters et al., 2017; Guo et al., 2018), the goal of Causal Inference is to determine how much the value of a certain variable would change (i.e., the effect) had another variable (i.e., the cause) changed its value. A prominent example is the counterfactual question (Rubin, 1974; Pearl, 2009) "Would this patient have lived longer [and by how much], had she received an alternative treatment?". Such questions are often asked in the context of precision medicine, which attempts to identify which medical procedure t ∈ T will benefit a certain patient x the most, in terms of the treatment outcome y ∈ R (e.g., survival time). A fundamental problem in causal inference is the unobservability of the counterfactual outcomes (Holland, 1986). That is, for each subject i, any real-world dataset can only contain the outcome of the administered treatment (aka the observed outcome: y_i), but not the outcome(s) of the alternative treatment(s) (aka the counterfactual outcome(s)) - i.e., y_i^t for t ∈ T \ {t_i}. In other words, the causal effect is never observed (i.e., it is missing from any training data), so it can neither be used to train predictive models nor to evaluate a proposed model. This makes estimating causal effects a more difficult problem than generalization in the supervised learning paradigm.

In general, we can categorize most machine learning algorithms into two general approaches, which differ in how the input features x and their target values y are modeled (Ng & Jordan, 2002): Discriminative methods focus solely on modeling the conditional distribution p(y|x) with the goal of directly predicting y for each instance x. For prediction tasks, discriminative approaches are often more accurate, since they use the model parameters more efficiently than generative approaches.
Most of the current causal inference methods are discriminative, including the Balancing Neural Network (BNN) (Johansson et al., 2016), the CounterFactual Regression Network (CFR-Net) (Shalit et al., 2017) and its extensions - cf., (Yao et al., 2018; Hassanpour & Greiner, 2019; 2020) - as well as Dragon-Net (Shi et al., 2019). Generative methods, on the other hand, describe the relationship between x and y by their joint probability distribution p(x, y). This, in turn, allows the generative model to answer arbitrary queries, including coping with missing features x using the marginal distribution p(x) or [similar to discriminative models] predicting the unknown target values y via p(y|x). A promising direction forward for causal inference is developing generative models, using either the Generative Adversarial Network (GAN) (Goodfellow et al., 2014) or the Variational Auto-Encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014). This has led to two generative approaches for causal inference: GANs for inference of Individualised Treatment Effects (GANITE) (Yoon et al., 2018) and the Causal Effect VAE (CEVAE) (Louizos et al., 2017). However, neither of the two achieves competitive performance in terms of treatment effect estimation compared to the discriminative approaches.

Although discriminative models have excellent predictive performance, they suffer from two drawbacks: (i) overfitting, and (ii) making highly-confident predictions, even for instances that are "far" from the observed training data. Generative models based on Bayesian inference, on the other hand, can handle both of these drawbacks: issue (i) can be minimized by taking an average over the posterior distribution of model parameters; and issue (ii) can be addressed by explicitly providing model uncertainty via the posterior (Gordon & Hernández-Lobato, 2020).
Although exact inference is often intractable, efficient approximations to the parameter posterior distribution are possible through variational methods. Here, we use the Variational Auto-Encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) for the Bayesian inference component of our causal inference method. Contribution: In this paper, we propose three interrelated Bayesian model architectures (namely Series, Parallel, and Hybrid) that employ the VAE framework to address the task of causal inference for binary treatments. We find that the best performing architecture is the Hybrid model, which is [partially] successful in decomposing the underlying factors of any observational dataset. This is a valuable property, as it means the model can accurately estimate all treatment outcomes. We demonstrate that these models significantly outperform the state-of-the-art in terms of treatment effect estimation on two publicly available benchmarks, as well as on a fully synthetic dataset that allows for detailed performance analyses.

2. RELATED WORK

CFR-Net

Shalit et al. (2017) considered the binary-treatment task and attempted to learn a representation space Φ that reduces selection bias by making Pr( Φ(x) | t = 0 ) and Pr( Φ(x) | t = 1 ) as close to each other as possible, provided that Φ(x) retains enough information that the learned regressors {h_t(Φ(·)) : t ∈ {0, 1}} can generalize well on the observed outcomes. Their objective function includes L( y_i, h_{t_i}(Φ(x_i)) ), the loss of predicting the observed outcome for sample i (described as x_i), weighted by ω_i = t_i / (2u) + (1 - t_i) / (2(1 - u)), where u = Pr( t = 1 ). This effectively sets ω_i = 1 / (2 Pr( t_i )), where Pr( t_i ) is the probability of selecting treatment t_i over the entire population.
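As a concrete illustration, these population-based weights depend only on the treatment assignments; the sketch below (plain Python, with illustrative names, assuming binary treatments encoded as 0/1) computes them:

```python
def population_based_weights(t):
    """CFR-Net style weights: omega_i = t_i/(2u) + (1 - t_i)/(2(1 - u))."""
    u = sum(t) / len(t)  # u = Pr(t = 1), estimated as the treated fraction
    # Equivalent to 1 / (2 * Pr(t_i)): up-weights the minority treatment group.
    return [ti / (2 * u) + (1 - ti) / (2 * (1 - u)) for ti in t]
```

With one treated instance out of four (u = 0.25), the treated sample receives weight 1/(2 · 0.25) = 2 and each control receives 1/(2 · 0.75) ≈ 0.67, so both groups contribute equally in aggregate.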

DR-CFR

Hassanpour & Greiner (2020) argued against the standard implicit assumption that all of the covariates X are confounders (i.e., contribute to both treatment assignment and outcome determination). Instead, they proposed a graphical model similar to that in Figure 1 and designed a discriminative causal inference approach accordingly - built on top of the CFR-Net. Specifically, their model, named Disentangled Representations for CFR (DR-CFR), includes three representation networks, each trained with constraints to ensure that each component corresponds to its respective underlying factor. While the idea behind DR-CFR provides an interesting intuition, it is known that only generative models (and not discriminative ones) can truly identify the underlying data generating mechanism. This paper is a step in this direction.

Dragon-Net

Shi et al. (2019)'s main objective was to estimate the Average Treatment Effect (ATE), which they explain requires a two-stage procedure: (i) fit models that predict the outcomes for both treatments; and (ii) find a downstream estimator of the effect. Their method is based on a classic result from strong ignorability - i.e., Theorem 3 in (Rosenbaum & Rubin, 1983) - which states:

(y^1, y^0) ⊥⊥ t | x  and  Pr( t = 1 | x ) ∈ (0, 1)   =⇒   (y^1, y^0) ⊥⊥ t | b(x)  and  Pr( t = 1 | b(x) ) ∈ (0, 1)

where b(x) is a balancing score - that is, X ⊥⊥ T | b(X) (Rosenbaum & Rubin, 1983). They consider the propensity score as a balancing score and argue that only the parts of X relevant for predicting T are required for the estimation of the causal effect; the authors acknowledge that this hurts the predictive performance for individual outcomes, which yields inaccurate estimates of Individual Treatment Effects (ITEs). This theorem only provides a way to match treated and control instances though - i.e., it helps find potential counterfactuals from the alternative group to calculate ATE. Shi et al. (2019), however, used it to derive minimal representations on which to regress to estimate the outcomes.

GANITE

Yoon et al. (2018) proposed the counterfactual GAN, whose generator G, given {x, t, y^t}, estimates the counterfactual outcomes (ŷ^¬t); and whose discriminator D tries to identify which of {[x, 0, y^0], [x, 1, y^1]} is the factual outcome. It is, however, unclear why this requires that G produce samples that are indistinguishable from the factual outcomes, especially as D can just learn the treatment selection mechanism instead of distinguishing the factual outcomes from counterfactuals. Although this work is among the few generative approaches for causal inference, our empirical results (in Section 4) show that it does not effectively estimate counterfactual outcomes.

CEVAE

Louizos et al. (2017) used a VAE to extract latent confounders from their observed proxies in X. While this is an interesting step in the right direction, empirical results show that it does not always accurately estimate treatment effect (see Section 4).
The authors note that this may be because CEVAE is not able to address the problem of selection bias. Another reason that we believe contributes to CEVAE's sub-optimal performance is its assumed graphical model of the underlying data generating mechanism (depicted in Figure 2). This model assumes that there is only one latent variable Z (confounding T and Y) that generates the entire observational data; however, we know from (Kuang et al., 2017) and (Hassanpour & Greiner, 2020) that there must be more factors (see Figure 1).

M1 and M2 VAEs

In an attempt to enhance conventional representation learning with VAEs - referred to as the M1 model (Kingma & Welling, 2014; Rezende et al., 2014) - in a semi-supervised manner, Kingma et al. (2014) proposed the M2 VAE. While the M1 model helps learn latent representations from the covariate matrix X alone, the M2 model also allows the target information to guide the representation learning process. In our work, the target information includes the treatment bit T as well as the observed outcome Y. This additional information helps learn more expressive representations, which was not possible with the unsupervised M1 model. Appendix A.1 presents a more detailed overview of the M1 and M2 VAEs.

3. METHOD

Following (Hassanpour & Greiner, 2020) and without loss of generality, we assume that the random variable X follows an unknown joint probability distribution Pr( X | Γ, ∆, Υ, Ξ ), where Γ, ∆, Υ, and Ξ are non-overlapping independent factors. Moreover, we assume that treatment T follows Pr( T | Γ, ∆ ) (i.e., Γ and ∆ are the factors responsible for selection bias) and outcome Y^t follows Pr_t( Y^t | ∆, Υ ); see Figure 1. Observe that the factor Γ (resp., Υ) partially determines only T (resp., Y), but not Y (resp., T); and ∆ includes the confounding factors between T and Y. Our goal is to design generative model architectures that encourage learning disentangled representations of these four underlying latent factors (see Figure 1) - in other words, to decompose and separately learn the underlying factors that are responsible for determining T and Y. To achieve this, we propose three architectures (as illustrated in Figures 3(a), 3(b), and 3(c)), each employing the VAE framework (Kingma & Welling, 2014; Rezende et al., 2014) and comprising a decoder (generative model) and an encoder (variational posterior). Specifically, we use the M1 and M2 models from (Kingma et al., 2014) as our building blocks, leading to a Series architecture, a Parallel architecture, and a Hybrid one. Each component is parametrized as a deep neural network.

3.1.1. THE SERIES ARCHITECTURE

The architecture of the Series model is illustrated in Figure 3(a). Louizos et al. (2015) proposed a similar architecture to address fairness in machine learning, but with a binary sensitive variable S playing the role that the treatment T plays here. Decoder and encoder components of the Series model (parametrized by θ_s and φ_s respectively) involve the following distributions:

Priors: p_θs(z_2), p_θs(z_1 | y, z_2)
Likelihood: p_θs(x | z_1, t)
Posteriors: q_φs(z_1 | x, t), q_φs(y | z_1), q_φs(z_2 | y, z_1)

The goal is to maximize the conditional log-likelihood of the observed data (left-hand side of the following inequality) by maximizing the Evidence Lower BOund (ELBO; right-hand side) - i.e.,

Σ_{i=1}^N log p(x_i | t_i, y_i) ≥ Σ_{i=1}^N E_{q_φs(z_1|x,t)}[ log p_θs(x_i | z_1i, t_i) ]    (1)
    - KL( q_φs(z_1 | x, t) || p_θs(z_1 | y, z_2) ) - KL( q_φs(z_2 | y, z_1) || p_θs(z_2) )    (2)

where KL denotes the Kullback-Leibler divergence, p_θs(z_2) is the unit multivariate Gaussian (i.e., N(0, I)), and the other distributions are parameterized as deep neural networks.
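To make the bound concrete, the sketch below evaluates the Series ELBO terms under the common simplifying assumption that all latent distributions are diagonal Gaussians (so the KL terms have a closed form); the function names and the (mean, log-variance) parameterization are illustrative, not the paper's exact implementation:

```python
import math

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over dimensions.
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        vq, vp = math.exp(lq), math.exp(lp)
        kl += 0.5 * (lp - lq + (vq + (mq - mp) ** 2) / vp - 1.0)
    return kl

def series_elbo(rec_log_lik, q_z1, p_z1, q_z2):
    # ELBO = E_q[log p(x|z1,t)] - KL(q(z1|x,t)||p(z1|y,z2)) - KL(q(z2|y,z1)||p(z2))
    # Each q_* / p_* argument is a (mean, log-variance) pair of lists.
    kl1 = kl_diag_gauss(*q_z1, *p_z1)
    d = len(q_z2[0])
    kl2 = kl_diag_gauss(*q_z2, [0.0] * d, [0.0] * d)  # p(z2) = N(0, I)
    return rec_log_lik - kl1 - kl2
```

When the posterior matches its prior, the corresponding KL term vanishes and the ELBO reduces to the reconstruction term alone.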

3.1.2. THE PARALLEL ARCHITECTURE

The Series model is composed of two stacked M2 models. However, Kingma et al. (2014) showed that an M1+M2 stacked architecture learns better representations than an M2 model alone for a downstream prediction task. This motivated us to design a double M1+M2 Parallel model, where one arm uses the outcome to guide the representation learning via Z_1 and the other uses the treatment to guide it via Z_3. This architecture is illustrated in Figure 3(b). We hypothesize that Z_1 would learn ∆ and Υ, and Z_3 would learn Γ (and perhaps partially ∆). Decoder and encoder components of the Parallel model (parametrized by θ_p and φ_p respectively) involve the following distributions:

Priors: p_θp(z_2), p_θp(z_4), p_θp(z_1 | y, z_2), p_θp(z_3 | t, z_4)
Likelihood: p_θp(x | z_1, z_3)
Posteriors: q_φp(z_1 | x, t), q_φp(z_3 | x, y), q_φp(y | z_1), q_φp(t | z_3), q_φp(z_2 | y, z_1), q_φp(z_4 | t, z_3)

Here, the conditional log-likelihood is lower bounded by:

Σ_{i=1}^N log p(x_i | t_i, y_i) ≥ Σ_{i=1}^N E_{q_φp(z_1,z_3|x,t,y)}[ log p_θp(x_i | z_1i, z_3i) ]    (3)
    - KL( q_φp(z_1 | x, t) || p_θp(z_1 | y, z_2) ) - KL( q_φp(z_2 | y, z_1) || p_θp(z_2) )    (4)
    - KL( q_φp(z_3 | x, y) || p_θp(z_3 | t, z_4) ) - KL( q_φp(z_4 | t, z_3) || p_θp(z_4) )    (5)

3.1.3. THE HYBRID ARCHITECTURE

The final architecture, Hybrid, attempts to combine the best capabilities of the previous two. The backbone of the Hybrid model has a Series architecture that separates Γ (factors related to the treatment T; captured by the right module with Z_3 as its head) from ∆ and Υ (factors related to the outcome Y; captured by the left module with Z_7 as its head). The left module itself consists of a Parallel model that attempts to proceed one step further and decompose ∆ from Υ. This is done with the help of a discrepancy penalty (see Section 3.3). Figure 3(c) illustrates our designed architecture for the Hybrid model. Decoder and encoder components of the Hybrid model (parametrized by θ_h and φ_h respectively) involve the following distributions:

Priors: p_θh(z_2), p_θh(z_4), p_θh(z_6), p_θh(z_1 | y, z_2), p_θh(z_3 | t, z_4), p_θh(z_5 | y, z_6), p_θh(z_7 | z_1, z_5)
Likelihood: p_θh(x | z_3, z_7)
Posteriors: q_φh(z_7 | x, t), q_φh(z_1 | z_7), q_φh(z_5 | z_7), q_φh(z_3 | x, y), q_φh(y | z_1, z_5), q_φh(t | z_3), q_φh(z_2 | y, z_1), q_φh(z_6 | y, z_5), q_φh(z_4 | t, z_3)

Here, the conditional log-likelihood is lower bounded by:

Σ_{i=1}^N log p(x_i | t_i, y_i) ≥ Σ_{i=1}^N E_{q_φh(z_3,z_7|x,t,y)}[ log p_θh(x_i | z_3i, z_7i) ]    (6)
    - KL( q_φh(z_1 | z_7) || p_θh(z_1 | y, z_2) ) - KL( q_φh(z_2 | y, z_1) || p_θh(z_2) )    (7)
    - KL( q_φh(z_3 | x, y) || p_θh(z_3 | t, z_4) ) - KL( q_φh(z_4 | t, z_3) || p_θh(z_4) )    (8)
    - KL( q_φh(z_5 | z_7) || p_θh(z_5 | y, z_6) ) - KL( q_φh(z_6 | y, z_5) || p_θh(z_6) )    (9)
    - KL( q_φh(z_7 | x, t) || p_θh(z_7 | z_1, z_5) )    (10)

The first term in each ELBO (i.e., the right-hand side of Equation (1), (3), or (6)) is called the Reconstruction Loss (RecL), and the remaining term(s) (i.e., Equation (2); the sum of Equations (4) and (5); or the sum of Equations (7), (8), (9), and (10)) are referred to as the KL Divergence (KLD). Concisely, the ELBO can be written as RecL - KLD, which is to be maximized.

3.2. DISENTANGLEMENT WITH β-VAE

As mentioned earlier, we want the learned latent variables to be disentangled, to match our assumption of non-overlapping factors Γ, ∆, and Υ. To encourage this, we employ the β-VAE (Higgins et al., 2017), which adds a hyperparameter β as a multiplier of the KLD part of the ELBO. This adjustable hyperparameter facilitates a trade-off that helps balance the latent channel capacity and independence constraints (handled by the KL terms) with the reconstruction accuracy - i.e., including the β hyperparameter grants better control over the level of disentanglement in the learned representations (Burgess et al., 2018). Therefore, the generative objective to be minimized becomes:

L_VAE = -RecL + β · KLD    (11)

Although Higgins et al. (2017) suggest that β be set greater than 1 in most applications, Hoffman et al. (2017) show that a β < 1 weight on the KL term can be interpreted as optimizing the ELBO under an alternative prior, which functions as a regularization term to prevent degeneracy.
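In code, this trade-off is a single coefficient on the KLD term; a minimal sketch with illustrative names:

```python
def beta_vae_loss(rec_log_lik, kld, beta=1.0):
    # L_VAE = -RecL + beta * KLD  (Equation 11, to be minimized).
    # beta > 1 strengthens the independence pressure on the latents;
    # beta < 1 down-weights the KLD, acting as a milder regularizer.
    return -rec_log_lik + beta * kld
```

Setting beta = 0 removes the KLD pressure entirely, the configuration revisited in Appendix A.5.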

3.3. DISCREPANCY

Although all three proposed graphical models suggest that T and Z_1 are statistically independent (see, for example, the collider structure (at X): T → X ← Z_1 in Figure 3(a)), an information leak is quite possible due to the correlation between the outcome y and treatment t in the data. We therefore require an extra regularization term on q_φ(z_1|t) in order to penalize the discrepancy (denoted disc) between the conditional distributions of z_1 given t = 0 versus given t = 1. To achieve this regularization, we calculate disc using an Integral Probability Metric (IPM) (Mansour et al., 2009) - in this work, the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) - that measures the distance between the two above-mentioned distributions:

L_disc = IPM( {z_1i}_{i : t_i = 0} , {z_1i}_{i : t_i = 1} )

3.4. PREDICTIVE LOSS

Note, however, that neither the VAE nor the disc losses contribute to training a predictive model for outcomes. To remedy this, we extend the objective function to include a discriminative term for the regression loss of predicting y (similar to the way Kingma et al. (2014) included a classification loss in their Equation (9)):

L_pred = (1/N) Σ_{i=1}^N ω_i · L( y_i, ŷ_i )

where the predicted outcome ŷ_i is derived as the mean of the q_φ^{t_i}(y_i | z_1i) posterior trained for the respective treatment t_i; L( y_i, ŷ_i ) is the factual loss (i.e., L2 loss for real-valued outcomes and log loss for binary-valued outcomes); and the ω_i are weights that attempt to account for selection bias. We consider two approaches from the literature to derive the weights: (i) the Population-Based (PB) weights as proposed in CFR-Net (Shalit et al., 2017); and (ii) the Context-Aware (CA) weights as proposed by Hassanpour & Greiner (2019). Note that disentangling ∆ from Υ is only beneficial when using the CA weights, since we need just the ∆ factors to derive them (Hassanpour & Greiner, 2020).
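The MMD instantiation of the discrepancy penalty can be sketched as follows (plain Python with a Gaussian kernel; the bandwidth sigma and the biased V-statistic estimator are illustrative choices, not necessarily those used in our implementation):

```python
import math

def rbf(a, b, sigma=1.0):
    # Gaussian (RBF) kernel between two representation vectors.
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(z_t0, z_t1, sigma=1.0):
    # Squared MMD between the z_1 representations of the control (t=0)
    # and treated (t=1) groups; this plays the role of the disc penalty.
    k00 = sum(rbf(a, b, sigma) for a in z_t0 for b in z_t0) / len(z_t0) ** 2
    k11 = sum(rbf(a, b, sigma) for a in z_t1 for b in z_t1) / len(z_t1) ** 2
    k01 = sum(rbf(a, b, sigma) for a in z_t0 for b in z_t1) / (len(z_t0) * len(z_t1))
    return k00 + k11 - 2.0 * k01
```

The penalty vanishes when the two empirical distributions coincide and grows as they drift apart, which is exactly the pressure that discourages z_1 from encoding treatment information.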

3.5. FINAL MODEL(S)

Putting everything together, the overall objective function to be minimized is:

J = L_pred + α · L_disc + γ · L_VAE + λ · Reg    (14)

where Reg penalizes the model complexity. This objective function is motivated by the work of McCallum et al. (2006), which suggested that optimizing a convex combination of discriminative and generative losses can indeed improve predictive performance. Note that for γ = 0, the Series and Parallel models effectively reduce to CFR-Net. However, our empirical results (cf., Section 4) suggest that the generative term in the objective function helps learn representations that embed more relevant information for estimating outcomes than CFR-Net's Φ does. We refer to the family of our proposed methods as VAE-CI (Variational Auto-Encoder for Causal Inference); specifically: {S, P, H}-VAE-CI for the Series, Parallel, and Hybrid architectures respectively.
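The composite objective is a plain weighted sum of the four terms; the sketch below (illustrative names) also makes the γ = 0 special case explicit:

```python
def total_objective(l_pred, l_disc, l_vae, reg, alpha, gamma, lam):
    # J = L_pred + alpha * L_disc + gamma * L_VAE + lambda * Reg  (Equation 14)
    return l_pred + alpha * l_disc + gamma * l_vae + lam * reg

# With gamma = 0 the generative term drops out, leaving a discriminative
# CFR-Net-style objective: weighted prediction loss + discrepancy + regularizer.
```

In practice α, γ, and λ are tuned on the search grid given in Section A.4.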

4. EXPERIMENTS, RESULTS, AND DISCUSSION

4.1 BENCHMARKS

Infant Health and Development Program (IHDP)

The original IHDP randomized controlled trial was designed to evaluate the effect of specialist home visits on future cognitive test scores of premature infants. Hill (2011) induced selection bias by removing a non-random subset of the treated population. The dataset contains 747 instances (608 control and 139 treated) with 25 covariates. We use the same benchmark (with 100 realizations of outcomes) provided and used by Johansson et al. (2016) and Shalit et al. (2017).

Atlantic Causal Inference Conference 2018 (ACIC'18)

ACIC'18 is a collection of binary-treatment datasets released for a data challenge. Following (Shi et al., 2019), we use a subset of the datasets with N ∈ {1, 5, 10} × 10^3 instances (four datasets in each category). The covariate matrix for each dataset involves 177 features and is sub-sampled from a table of medical measurements taken from the Linked Birth and Infant Death Data (LBIDD) (MacDorman & Atkinson, 1998), which contains information on 100,000 subjects.

Fully Synthetic Datasets

We generated a set of synthetic datasets according to the procedure described in (Hassanpour & Greiner, 2020); see Section A.2 for an overview. We considered all the viable datasets in a mesh generated by various sets of variables, of sizes m_Γ, m_∆, m_Υ ∈ {0, 4, 8} and m_Ξ = 1. This creates 24 scenarios that cover all possible relative sizes of the factors Γ, ∆, and Υ (of the 3^3 = 27 combinations, we removed the three with ∆ = Υ = ∅ that generate pure-noise outcomes: (0, 0, 0), (4, 0, 0), and (8, 0, 0)). For each scenario, we synthesized multiple datasets with various initial random seeds in order to allow for statistical significance testing of the performance comparisons between the various methods.

4.2. EVALUATING IDENTIFICATION OF THE UNDERLYING FACTORS

To evaluate the identification performance of the underlying factors, we use a fully synthetic dataset with m_Γ = m_∆ = m_Υ = 8 and m_Ξ = 1. We set x to be one of four dummy vectors V_1..4 and input it to each trained representation network Z_j. Three of these vectors had "1" in the 8 positions associated with Γ, ∆, and Υ respectively, and the remaining 17 positions of each vector were filled with "0". The fourth vector was all "1" except for the last position (the noise), which was "0"; this helps measure the maximum amount of information that is passed to the final layer of each representation network. We let O_{i,j} be the elu output (here, ∈ R^200) of the encoder network Z_j when x = V_i. The average of the 200 values of O_{i,j} (Avg(O_{i,j})) represents the power of the signal produced by the Z_j channel on the input V_i. The values shown in Figure 4's tables are the ratios of Avg(O_{1,j}), Avg(O_{2,j}), and Avg(O_{3,j}) divided by Avg(O_{4,j}) for each of the learned representation networks. A larger ratio indicates that the respective representation network Z_j has allowed more of the input signal V_i to pass through. Unlike the evaluation strategy presented in (Hassanpour & Greiner, 2020), which only looked at the first layer's weights of each representation network, we propagate the values through the entire network and check how much of each factor is exhibited in the final layer of every representation network. Section A.3 includes more details on this procedure.

As expected, Z_3 and Z_4 capture Γ (e.g., the Z_3 ratios for Γ in the {P, H}-VAE-CI tables are largest), and Z_1, Z_2, Z_5, Z_6, and Z_7 capture ∆ and Υ. Note that decomposition of ∆ from Υ has not been achieved by any of the methods except H-VAE-CI, which captures Υ by Z_1 and ∆ by Z_5 (note the ratios are largest for Z_1 and Z_5). This decomposition is vital for deriving context-aware importance sampling weights, because they must be calculated from ∆ only (Hassanpour & Greiner, 2020). Also observe that both {P, H}-VAE-CI are able to separate Γ from ∆. However, DR-CFR, which tried to disentangle all factors, failed not only to disentangle ∆ from Υ, but also Γ from ∆.

That said, the proposed procedure still only crudely evaluates the quality of disentanglement of the underlying factors. We did explore using Mutual Information (Belghazi et al., 2018) for this task (not shown here); however, it appears that it does not work well for high-dimensional data such as ours. All in all, more research is needed to address this task.
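The probing procedure above can be sketched as follows; `encode` stands in for a trained representation network Z_j, and the block sizes follow the m_Γ = m_∆ = m_Υ = 8, m_Ξ = 1 configuration (all names here are illustrative):

```python
def make_probes(m=8, noise=1):
    # V1..V3: ones only in the Gamma / Delta / Upsilon block respectively;
    # V4: ones everywhere except the trailing noise position.
    dim = 3 * m + noise
    probes = []
    for k in range(3):
        v = [0.0] * dim
        v[k * m:(k + 1) * m] = [1.0] * m
        probes.append(v)
    probes.append([1.0] * (dim - noise) + [0.0] * noise)
    return probes

def signal_ratios(encode, m=8, noise=1):
    # Avg(O_{i,j}) / Avg(O_{4,j}) for i = 1..3: how much of each factor's
    # signal the channel lets through, relative to the maximal-signal probe.
    outs = [encode(v) for v in make_probes(m, noise)]
    avgs = [sum(o) / len(o) for o in outs]
    return [a / avgs[3] for a in avgs[:3]]
```

For a channel that passes all inputs equally (e.g., a plain sum), each factor's ratio is simply its share of the total input mass; a trained, specialized channel instead shows a markedly larger ratio for "its" factor.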

4.3. EVALUATING TREATMENT EFFECT ESTIMATION

Evaluation of treatment effect estimation is often done with semi- or fully-synthetic datasets that include both factual and counterfactual outcomes. There are two categories of performance measures:

Individual-based: the "Precision in Estimation of Heterogeneous Effect",

PEHE = (1/N) Σ_{i=1}^N (ê_i - e_i)^2

which uses ê_i = ŷ_i^1 - ŷ_i^0 as the estimated effect and e_i = y_i^1 - y_i^0 as the true effect (Hill, 2011); and

Population-based: the "Bias of the Average Treatment Effect", ε_ATE = | ATE - ÂTE |, where ATE = (1/N) Σ_{i=1}^N y_i^1 - (1/N) Σ_{j=1}^N y_j^0 and ÂTE is calculated analogously from the estimated outcomes.

In this paper, we compare the performance of our proposed methods {S, P, H}-VAE-CI against the following treatment effect estimation methods: CFR-Net (Shalit et al., 2017), DR-CFR (Hassanpour & Greiner, 2020), Dragon-Net (Shi et al., 2019), GANITE (Yoon et al., 2018), and CEVAE (Louizos et al., 2017). The basic search grid for hyperparameters of the CFR-Net based algorithms (including our proposed methods) is available in Section A.4. For the other algorithms, we searched around their default hyperparameter settings. We ran the experiments for the contender methods using their publicly available code-bases; note the following points:

• Since Dragon-Net is designed to estimate ATE only, we do not report its performance on the PEHE measure (which, as expected, was significantly inaccurate).
• The original GANITE code-base was implemented for binary outcomes only. We modified the code (losses, etc.) so that it could also process real-valued outcomes.
• We were surprised that CEVAE diverged when running on the ACIC'18 datasets. To avoid this, we had to run the ACIC'18 experiments on the binary covariates only.

Tables 1, 2, and 3 summarize the mean and standard deviation of the PEHE and ε_ATE measures (lower is better) on the IHDP, ACIC'18, and Synthetic benchmarks respectively. VAE-CI achieves the best performance among the contending methods. These results are statistically significant (in bold; based on Welch's unpaired t-test with α = 0.05) for the IHDP and Synthetic benchmarks.
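Both measures are straightforward to compute once factual and counterfactual outcomes are available; a minimal sketch (note: PEHE is written here in the mean-squared form, without a square root, matching the definition used in this paper):

```python
def pehe(y1, y0, y1_hat, y0_hat):
    # Mean squared error between estimated and true individual effects.
    n = len(y1)
    return sum(((y1_hat[i] - y0_hat[i]) - (y1[i] - y0[i])) ** 2
               for i in range(n)) / n

def ate_bias(y1, y0, y1_hat, y0_hat):
    # Absolute difference between true and estimated average treatment effects.
    n = len(y1)
    ate_true = sum(y1) / n - sum(y0) / n
    ate_est = sum(y1_hat) / n - sum(y0_hat) / n
    return abs(ate_true - ate_est)
```

PEHE penalizes per-instance errors even when they cancel in aggregate, whereas ε_ATE only reflects the population-level bias; a model can score well on one and poorly on the other.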
Although VAE-CI also achieves the best performance on the ACIC'18 benchmark, the results are not statistically significant due to the high standard deviation of the contending methods' performances. Figure 5 visualizes the PEHE measures on the entire set of synthetic datasets with sample size N = 10,000. We observe that both plots corresponding to the H-VAE-CI method (PB as well as CA) are inscribed by the plots of all other methods, showcasing H-VAE-CI's superior performance under every possible selection bias scenario. Note that for scenarios where m_∆ = 0 (i.e., the ones of the form m_Γ_0_m_Υ on the perimeter of the radar chart in Figure 5), the performances of H-VAE-CI (PB) and H-VAE-CI (CA) are almost identical. This is expected, since for these scenarios the learned representation for ∆ would be degenerate, and therefore the context-aware weights reduce to population-based ones. On the other hand, for scenarios where m_∆ ≠ 0, H-VAE-CI (CA) often performs better than H-VAE-CI (PB). This can be attributed to the fact that H-VAE-CI has correctly disentangled ∆ from Υ, which facilitates learning good context-aware weights that better account for selection bias, which in turn results in better causal effect estimation.

We also performed sensitivity analyses for the hyperparameters in terms of PEHE (see Figure 6). We discuss the results in the following:

• For the α hyperparameter (i.e., coefficient of the discrepancy penalty), Figure 6(a) suggests that the DR-CFR and H-VAE-CI methods have the most robust performance across various values of α. This is expected because, unlike CFR and {S, P}-VAE-CI, DR-CFR and H-VAE-CI possess an independent node for representing ∆. This helps them still capture ∆ as α grows, since for them α only affects learning a representation of Υ. Comparing H-VAE-CI (PB) with (CA), we observe that for all α > 0.01, H-VAE-CI (CA) outperforms H-VAE-CI (PB). This is because the discrepancy penalty forces Z_1 to capture only Υ and Z_5 to capture only ∆, which results in deriving better CA weights (which should be learned from ∆; here, from its learned representation Z_5). H-VAE-CI (PB), on the other hand, cannot take advantage of this disentanglement, which explains its sub-optimal performance.

• According to Figure 6(b), various β values do not make much difference for H-VAE-CI (except for β ≥ 1, since such a large value means the learned representations will be close to Gaussian noise). We initially thought using β-VAE might help further disentangle the underlying factors. However, Figure 6(b) suggests that close-to-zero or even zero βs also work effectively. We now think that the H-VAE-CI architecture itself sufficiently decomposes the Γ, ∆, and Υ factors, without needing the help of a KLD penalty. Appendix A.5 includes more evidence and a detailed discussion of why this interpretation should hold.

• For the γ hyperparameter (i.e., coefficient of the generative loss), the sensitivity analysis contrasts {S, P}-VAE-CI with H-VAE-CI, where the latter performs significantly better than the former. We hypothesize that this is because H-VAE-CI already learns expressive representations Z_3 and Z_7, meaning the optimization no longer really requires the L_VAE term to impose that. This is in contrast to Z_1 in S-VAE-CI and Z_1 and Z_3 in P-VAE-CI.

5. FUTURE WORKS AND CONCLUSION

Despite the success of the proposed methods, especially the Hybrid model, in addressing causal inference, no known algorithm can yet learn to perfectly disentangle the factors ∆ and Υ. This goal is important because we know that isolating ∆, and learning context-aware weights from it, does enhance the quality of causal effect estimation - note the superior performance of H-VAE-CI (CA).

A.3. EVALUATING IDENTIFICATION OF THE UNDERLYING FACTORS: DETAILS

Three of the four dummy vectors had ones in the positions associated with Γ, ∆, and Υ respectively, and the remainder of their positions were filled with zeroes. The fourth vector was all ones, so we can measure the maximum amount of information that is passed to the final layer of each representation network. In the next step, each vector V_i is fed to each trained network Z_j, and the output O_i is recorded (see the right side of Figure 8). The average of O_i represents the power of the signal that was communicated from V_i and passed through the Z_j channel. The values reported in the tables illustrated in Figure 4 are the ratios of {the average of O_1, O_2, or O_3} divided by {the average of O_4} for all the learned representation networks.

A.4. HYPERPARAMETERS

For all CFR, DR-CFR, and VAE-CI methods, we trained the neural networks with 3 layers (each consisting of 200 hidden neurons), the non-linear activation function elu, a regularization coefficient of λ = 1E-4, the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1E-3, a batch size of 300, and a maximum of 10,000 iterations. In addition to this basic configuration, we also perform our grid search with an updated number of layers and/or number of neurons in each layer; this ensures that all methods enjoy a similar model complexity. See Table 4 for our hyperparameter search space.

Table 4: Hyperparameters and ranges

Hyperparameter                  Range
Discrepancy coefficient α       0, 1E{-3, -2, -1, 0, 1}
KLD coefficient β               0, 1E{-3, -2, -1, 0, 1, 2}
Generative coefficient γ        0, 1E{-5, -4, -3, -2, -1, 0}

A.5. A DETAILED ANALYSIS OF THE EFFECT OF β

Our initial hypothesis in using β-VAE was that it might help further disentangle the underlying factors, in addition to the other constraints already in place (i.e., the architecture as well as the discrepancy penalty). However, Figure 6(b) suggests that close-to-zero or even zero βs also work effectively. To further explore this hypothesis, we examined the decomposition tables (similar to Figure 4) of H-VAE-CI for extreme configurations with β = 0 and observed that they were all effective at decomposing the underlying factors Γ, ∆, and Υ (similar to the performance reported in the green table in Figure 4). Figure 9 shows several of these tables. Our interpretation of this observation is that the H-VAE-CI architecture already takes care of decomposing the Γ, ∆, and Υ factors, without needing the help of a KLD penalty. This means either of the following is happening: (i) β-VAE is not the best-performing disentangling method, and other disentangling constraints should be used instead - e.g., the works of Chen et al. (2018) and Lopez et al. (2018); or (ii) it is theoretically impossible to achieve disentanglement without some supervision (Locatello et al., 2019), which might not be possible to provide in this task. Exploring these options is out of the scope of this paper and is left to future work.



Footnotes:

1. That is, X ⊥⊥ T | b(X) (Rosenbaum & Rubin, 1983). The authors acknowledge that this would hurt the predictive performance for individual outcomes; as a result, it yields inaccurate estimation of Individual Treatment Effects (ITEs).
2. In this work, we use the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012).
3. This is similar to the way Kingma et al. (2014) included a classification loss in their Equation (9).
4. There are 3^3 = 27 combinations in total; however, we removed the three combinations that generate pure-noise outcomes (i.e., ∆ = Υ = ∅): (0, 0, 0), (4, 0, 0), and (8, 0, 0).
5. In addition to this basic configuration, we also performed our grid search with an updated number of layers and/or number of neurons in each layer. This ensures that all methods enjoy a similar model complexity.
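The MMD mentioned above can be estimated directly from samples of the two treatment groups. Below is a minimal biased estimator with an RBF kernel, in the spirit of Gretton et al. (2012); it is a sketch for illustration, not the paper's exact implementation, and the bandwidth choice is ours.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
same = rng.normal(size=(200, 3))
shifted = rng.normal(loc=2.0, size=(200, 3))
# Identical samples give a zero estimate; shifted distributions a larger one.
assert mmd2(same, same) < 1e-9
assert mmd2(same, shifted) > mmd2(same, same)
```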



Figure 1: Underlying factors of any observational dataset (Hassanpour & Greiner, 2020).

Figure 3: Belief nets of the proposed architectures.

Figure 4: Performance analysis for decomposition of the underlying factors on the synthetic dataset with m Γ,∆,Υ = 8, m Ξ = 1.

(b) various β values (i.e., coefficient of the KL divergence penalty)

Figure 5: Radar graphs of PEHE (on the radii; lower is better) for the entire synthetic benchmark (24 × 3 datasets with N = 10,000; each vertex denotes the respective dataset). Figure is best viewed in color.

Figure 6: Hyperparameters' (x-axis) sensitivity analysis based on PEHE (y-axis) on the synthetic dataset with m Γ,∆,Υ = 8, m Ξ = 1. Legend is the same as Figure 5. Plots are best viewed in color.

Figure 8: The four dummy x-like vectors (left); and the input/output vectors of the representation networks (right).

Figure 9: Decomposition tables for H-VAE-CI with β = 0.

IHDP (100 realizations) benchmark

ACIC'18 (N ≤ 10K) benchmark

PEHE and ATE measures (lower is better), shown as "mean (standard deviation)", on the entire synthetic benchmark (average performance over the 24 × 3 datasets, each with a sample size of 10,000).

Method           PEHE          ATE
H-VAE-CI (PB)    0.20 (0.03)   0.003 (0.002)
H-VAE-CI (CA)    0.18 (0.02)   0.003 (0.002)
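For reference, both measures can be computed from true and estimated potential outcomes as follows (a minimal sketch; the variable names are ours):

```python
import numpy as np

def pehe(y0_true, y1_true, y0_hat, y1_hat):
    """Precision in Estimation of Heterogeneous Effect: RMSE of the ITE estimates."""
    tau_true = y1_true - y0_true
    tau_hat = y1_hat - y0_hat
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

def ate_error(y0_true, y1_true, y0_hat, y1_hat):
    """Absolute error in the estimated Average Treatment Effect."""
    return float(abs(np.mean(y1_hat - y0_hat) - np.mean(y1_true - y0_true)))

y0_t = np.array([0.0, 1.0]); y1_t = np.array([1.0, 3.0])   # true ITEs: 1, 2
y0_h = np.array([0.0, 1.0]); y1_h = np.array([2.0, 3.0])   # estimated ITEs: 2, 2
assert pehe(y0_t, y1_t, y0_h, y1_h) == np.sqrt(0.5)
assert ate_error(y0_t, y1_t, y0_h, y1_h) == 0.5
```

Note that PEHE penalizes per-instance errors that can cancel out in the ATE, which is why a method can score well on ATE while doing poorly on PEHE.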

The results of our ablation study (in Figure 6(b)), however, revealed that the currently used β-VAE does not help with disentanglement of the underlying factors. This shows that we can attribute all the decomposition we obtain to the proposed architectures and objective function. A future direction of this research is to explore the use of better disentangling constraints (e.g., the works of Chen et al. (2018) and Lopez et al. (2018)) to see if they would yield sharper results.

The goal of this paper was to estimate treatment effects (either for individuals or the entire population).

A APPENDIX

A.1 M1 AND M2 VARIATIONAL AUTO-ENCODERS

The M1 VAE is the conventional model used to learn representations of data (Kingma & Welling, 2014; Rezende et al., 2014). These features are learned from the covariate matrix X only. Figure 7(a) illustrates the encoder and decoder of the M1 VAE; note that the graphical model on the left depicts the encoder, and the one on the right depicts the decoder, whose arrows go in the opposite direction.

Proposed by Kingma et al. (2014), the M2 model was an attempt to incorporate the information in the target Y into the representation learning procedure. This results in learning representations that separate the specifications of individual targets from the general properties shared between various targets. In the case of digit generation, this translates into separating the specifications that distinguish each digit from the writing style or lighting condition.

We can stack the M1 and M2 models, as shown in Figure 7(c), to get the best results. This way, we first learn a representation Z1 from the raw covariates, and then find a second representation Z2, now learned from Z1 instead of the raw data.

A.2 SYNTHETIC DATASET

• For each latent factor L ∈ {Γ, ∆, Υ}, form L by drawing N instances (each of size m_L) from N(µ_L, Σ_L). The covariate matrix X is the result of concatenating Γ, ∆, and Υ. Refer to the concatenation of Γ and ∆ as Ψ, and to that of ∆ and Υ as Φ (for later use).
• For the treatment T, sample an (m_Γ + m_∆)-tuple of coefficients θ from N(0, 1)^(m_Γ+m_∆), and define the logging policy π_0(t = 1 | z), where z = Ψ • θ. For each instance x_i, sample the treatment t_i from the Bernoulli distribution with parameter π_0(t = 1 | z_i).
• For the outcomes Y^0 and Y^1, sample (m_∆ + m_Υ)-tuples of coefficients ϑ^0 and ϑ^1 from N(0, 1)^(m_∆+m_Υ), and define the outcomes y^t accordingly, where ε is white noise sampled from N(0, 0.1) and • is the symbol for the element-wise product.
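The generative procedure above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact generator: we assume standard-normal factor distributions, a logistic (sigmoid) form for the logging policy π_0 (the text does not spell out its exact form), and linear outcome functions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m_gamma, m_delta, m_upsilon = 1000, 8, 8, 8

# Draw each latent factor from its own Gaussian (identity covariance assumed here).
Gamma = rng.normal(size=(N, m_gamma))
Delta = rng.normal(size=(N, m_delta))
Upsilon = rng.normal(size=(N, m_upsilon))

X = np.concatenate([Gamma, Delta, Upsilon], axis=1)   # covariate matrix
Psi = np.concatenate([Gamma, Delta], axis=1)          # drives treatment selection
Phi = np.concatenate([Delta, Upsilon], axis=1)        # drives the outcomes

# Treatment: Bernoulli with a logistic logging policy (the sigmoid is our assumption).
theta = rng.normal(size=m_gamma + m_delta)
pi0 = 1.0 / (1.0 + np.exp(-(Psi @ theta)))
T = rng.binomial(1, pi0)

# Outcomes: linear in Phi plus white noise, as a stand-in for the paper's y^t.
vartheta = rng.normal(size=(2, m_delta + m_upsilon))
eps = rng.normal(scale=0.1, size=(N, 2))
Y = Phi @ vartheta.T + eps            # columns: potential outcomes y^0, y^1
y_obs = Y[np.arange(N), T]            # only the factual outcome is observed

assert X.shape == (N, m_gamma + m_delta + m_upsilon)
assert set(np.unique(T)) <= {0, 1}
```

Note that only `y_obs` would appear in a real training set; the full `Y` is available here precisely because the data are synthetic, which is what makes PEHE evaluable on this benchmark.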

A.3 EVALUATING IDENTIFICATION OF THE UNDERLYING FACTORS

Here, we elaborate on the procedure we followed to evaluate how well the underlying factors are identified. We produced four dummy vectors V_i ∈ R^(m_Γ+m_∆+m_Υ+m_Ξ), as depicted on the left side of Figure 8. The first to third vectors had ones (constant) in the positions associated with Γ, ∆, and Υ, respectively, with the remaining positions filled with zeroes. The fourth vector was all ones, so that we can measure the maximum amount of information that is passed to the final layer of each representation network. In the next step, each vector V_i is fed to each trained network Z_j, and the output O_i is recorded (see the right side of Figure 8). The average of O_i represents the strength of the signal that was communicated from V_i through the Z_j channel. The values reported in the tables illustrated in Figure 4 are the ratios of the averages of O_1, O_2, and O_3, each divided by the average of O_4, for all the learned representation networks.
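This probing procedure can be sketched as follows, with a stand-in linear "representation network" in place of a trained Z_j (all names and the toy network are ours, for illustration only):

```python
import numpy as np

m_gamma = m_delta = m_upsilon = 8
m_xi = 1
d = m_gamma + m_delta + m_upsilon + m_xi

# Four dummy x-like probes: ones on one factor's positions (zeros elsewhere),
# plus an all-ones vector to gauge the maximum signal a network can pass.
V = np.zeros((4, d))
V[0, :m_gamma] = 1.0                                          # Gamma positions
V[1, m_gamma:m_gamma + m_delta] = 1.0                         # Delta positions
V[2, m_gamma + m_delta:m_gamma + m_delta + m_upsilon] = 1.0   # Upsilon positions
V[3, :] = 1.0                                                 # all ones

def probe(network, V):
    """Ratio of each factor probe's mean output magnitude to the all-ones probe's."""
    O = np.array([np.abs(network(v)).mean() for v in V])
    return O[:3] / O[3]

# Toy "network" that only passes the Delta positions through.
W = np.zeros((5, d))
W[:, m_gamma:m_gamma + m_delta] = 1.0
delta_only = lambda v: W @ v

ratios = probe(delta_only, V)
assert ratios[1] == 1.0                 # Delta signal fully passed
assert ratios[0] == ratios[2] == 0.0    # Gamma and Upsilon blocked
```

A ratio near 1 for exactly one factor and near 0 for the others is what the green table in Figure 4 reflects for a well-decomposed representation.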

