TARGETED VAE: STRUCTURED INFERENCE AND TARGETED LEARNING FOR CAUSAL PARAMETER ESTIMATION

Abstract

Causal inference with observational data is valuable across a wide range of domains, including the development of medical treatments, advertising and marketing, and policy making. It poses two main challenges: treatment assignment heterogeneity (i.e., differences between the treated and untreated groups) and the absence of counterfactual data (i.e., not knowing what would have happened had a treated individual instead not been treated). We address these two challenges by combining structured inference and targeted learning. To our knowledge, the Targeted Variational AutoEncoder (TVAE) is the first method to incorporate targeted learning into deep latent variable models. Results demonstrate competitive and state-of-the-art performance.

1. INTRODUCTION

The estimation of the causal effects of interventions or treatments on outcomes is of the utmost importance across a range of decision-making processes and scientific endeavours, such as policy making (Kreif & DiazOrdaz, 2019), advertisement (Bottou et al., 2013), the development of medical treatments (Petersen et al., 2017), the evaluation of evidence within legal frameworks (Pearl, 2009; Siegerink et al., 2016) and social science (Vowels, 2020; Hernan, 2018; Grosz et al., 2020). Despite the common preference for Randomized Controlled Trial (RCT) data over observational data, this preference is not always justified. Besides the lower cost and fewer ethical concerns, observational data may provide a number of statistical advantages, including greater statistical power and increased generalizability (Deaton & Cartwright, 2018). However, there are two main challenges when dealing with observational data. Firstly, the group that receives treatment is usually not equivalent to the group that does not (treatment assignment heterogeneity), resulting in selection bias and confounding due to associated covariates. For example, young people may prefer surgery, whereas older people may prefer medication. Secondly, we are unable to directly estimate the causal effect of treatment, because only the factual outcome for a given treatment assignment is available. In other words, we do not observe the counterfactual outcome under a treatment assignment different from the one that was given. Treatment effect inference with observational data is concerned with finding ways to estimate the causal effect by considering the expected differences between factual and counterfactual outcomes. We seek to address the two challenges by proposing a method that incorporates targeted learning techniques into a disentangled variational latent model, trained according to the approximate maximum likelihood paradigm.
Doing so enables us to estimate the expected treatment effects, as well as individual-level treatment effects. Estimating the latter is especially important for treatments that interact with patient attributes, whilst also being crucial for enabling individualized treatment assignment. Thus, we propose the Targeted Variational AutoEncoder (TVAE), undertake an ablation study, and compare our method's performance against current alternatives on two benchmark datasets.

2. BACKGROUND

Problem Formulation: A characterization of the problem of causal inference with no unobserved confounders is depicted in the Directed Acyclic Graphs (DAGs) shown in Figs. 1(a) and 1(b). Fig. 1(a) is characteristic of observational data, where the assignment of treatment is related to the covariates. Fig. 1(b) is characteristic of the ideal RCT, where the treatment is unrelated to the covariates. Here, $x_i \sim p(x) \in \mathbb{R}^m$ represents the $m$-dimensional, pre-treatment covariates for individual $i$ assigned factual treatment $t_i \sim p(t|x)$, resulting in factual outcome $y_{t_i} \sim p(y|x, t)$. Together, these constitute the dataset $\mathcal{D} = \{[y_i, t_i, x_i]\}_{i=1}^{N}$, where $N$ is the sample size. The conditional average treatment effect for an individual with covariates $x_i$ may be estimated as $\tau_i(x_i) = \mathbb{E}[y_i | x_i, do(t = 1)] - \mathbb{E}[y_i | x_i, do(t = 0)]$, where the expectation accounts for the nondeterminism of the outcome (Jesson et al., 2020). Alternatively, by comparing the post-intervention distributions when we intervene on treatment $t$, the Average Treatment Effect (ATE) is $\tau(x) = \mathbb{E}_x[\mathbb{E}[y | x, do(t = 1)] - \mathbb{E}[y | x, do(t = 0)]]$. Here, $do(t)$ indicates the intervention on $t$, setting all instances to a static value, dynamic value, or distribution and therefore removing any dependencies it originally had (Pearl, 2009; van der Laan & Rose, 2018; 2011). This scenario corresponds with the DAG in Fig. 1(b), where treatment $t$ is no longer a function of the covariates $x$. Using an estimator for the conditional mean $Q(t, x) = \mathbb{E}(y | t, x)$, we can calculate the Average Treatment Effect (ATE) and the empirical error for estimation of the ATE (eATE).¹ In order to estimate eATE we assume access to the ground truth treatment effect $\tau$, which is only possible with synthetic or semi-synthetic datasets. The Conditional Average Treatment Effect (CATE) may also be calculated, and the Precision in Estimating Heterogeneous Effect (PEHE) is one way to evaluate a model's efficacy in estimating this quantity.
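As an illustration of how these quantities relate, the following minimal sketch computes a plug-in ATE, eATE, and PEHE from an outcome model Q(t, x). Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data-generating process, the linear least-squares form of Q (TVAE itself uses a deep latent variable model), the names `features` and `q`, and the particular forms of eATE (absolute error of the estimated ATE) and PEHE (root mean squared error of the estimated CATE), which are common choices that may differ in detail from the definitions in the appendix.

```python
import numpy as np

# Hypothetical synthetic data with a known ground-truth effect (not from the paper).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))                      # covariates x_i
p_t = 1.0 / (1.0 + np.exp(-1.5 * x[:, 0]))       # p(t = 1 | x): treatment depends on x
t = rng.binomial(1, p_t).astype(float)           # factual treatment t_i
tau = 1.0 + 0.5 * x[:, 1]                        # true individual-level effect
y = 2.0 * x[:, 0] + tau * t + rng.normal(scale=0.1, size=n)  # factual outcome


def features(x, t):
    """Design matrix [1, x, t, t * x] for a linear outcome model (illustrative)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones_like(t), x, t, t * x])


# Fit Q(t, x) ~ E(y | t, x) by ordinary least squares on the factual data only.
beta, *_ = np.linalg.lstsq(features(x, t), y, rcond=None)


def q(t_value, x):
    """Predicted conditional mean with treatment set to t_value for every row."""
    return features(x, np.full(len(x), t_value)) @ beta


# Plug-in estimates: contrast the predictions under t = 1 and t = 0.
cate_hat = q(1.0, x) - q(0.0, x)                 # estimated CATE per individual
ate_hat = cate_hat.mean()                        # estimated ATE

# Evaluation against the ground truth, available only because data are synthetic.
e_ate = abs(ate_hat - tau.mean())                # assumed form of eATE
pehe = np.sqrt(np.mean((cate_hat - tau) ** 2))   # assumed (root) form of PEHE

# Unadjusted difference in group means, biased here by confounding through x.
naive = y[t == 1].mean() - y[t == 0].mean()

print(f"ATE estimate: {ate_hat:.3f}, eATE: {e_ate:.4f}, PEHE: {pehe:.4f}")
print(f"Unadjusted difference in means: {naive:.3f}")
```

Because the outcome model here happens to match the simulated data-generating process, the plug-in estimates are close to the truth, whereas the unadjusted difference in means is not. With real data the correct form of Q is unknown, which is the motivation for flexible estimators.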
See the appendix for the complete definitions of these terms.

The Naive Approach: The DAG in Fig. 1(a) highlights the problem with taking a naive approach to modeling the joint distribution p(y, t, x). The structural relationship t ← x → y indicates both that the assignment of treatment t is dependent on the covariates x, and that a backdoor path exists through x to y. In addition to our previous assumptions, if we also assume linearity, adjusting for this backdoor path is a simple matter of adjusting for x by including it in a logistic regression. The naive method is an example of the uppermost methods depicted in Fig. 2, and leads to the largest bias. The problem with the approach is (a) that the graph is likely misspecified, such that the true relationships between covariates, as well as the relationships between covariates and the outcome, may be more complex. There is also problem (b), that linearity is not sufficient to 'let the data speak' (van der Laan & Rose, 2011) or to avoid biased parameter estimates. Using powerful nonparametric models (e.g., neural networks) may solve the limitations associated with linearity and interactions to yield a consistent estimator for p(y|x), and such a model is an example of the middlemost methods depicted in Fig. 2. However, this estimator is not targeted to the estimation of the causal effect parameter τ, only to predicting the outcome, and we require a means to reduce residual bias.

Targeted Learning: Targeted Maximum Likelihood Estimation (TMLE) (Schuler & Rose, 2016; van der Laan & Rose, 2011; 2018; van der Laan & Starmans, 2014) falls under the lowermost



¹ For a binary outcome variable y ∈ {0, 1}, E(y|t, x) equals the conditional probability p(y = 1|t, x).



Figure 1: Directed Acyclic Graphs (DAGs) for (a) the problem of estimating the effect of treatment t on outcome y with confounders x. DAG (b) reflects an RCT. DAG (c) illustrates TVAE and is an extension of the DAG by Zhang et al. (2020), where the structure is a priori assumed to factorize into risk $z_y$, instrumental $z_t$, and confounding factors $z_c$. We extend their model with $z_o$ to account for the scenario whereby not all covariates will be related to treatment and/or outcome.

