SELECTING TREATMENT EFFECTS MODELS FOR DOMAIN ADAPTATION USING CAUSAL KNOWLEDGE

Anonymous authors. Paper under double-blind review.

Abstract

Selecting causal inference models for estimating individualized treatment effects (ITE) from observational data presents a unique challenge, since the counterfactual outcomes are never observed. The problem is further complicated in the unsupervised domain adaptation (UDA) setting, where we only have access to labeled samples in the source domain but wish to select a model that achieves good performance on a target domain for which only unlabeled samples are available. Existing techniques for UDA model selection are designed for the predictive setting. These methods examine discriminative density ratios between the input covariates in the source and target domain and do not factor in the model's predictions in the target domain. Because of this, two models with identical performance on the source domain would receive the same risk score from existing methods, while in reality they may have significantly different performance on the test domain. We leverage the invariance of causal structures across domains to introduce a novel model selection metric specifically designed for ITE models under the UDA setting. In particular, we propose selecting models whose predictions of the effects of interventions satisfy known causal structures in the target domain. Experimentally, our method selects ITE models that are more robust to covariate shifts on several synthetic and real healthcare datasets, including on estimating the effect of ventilation in COVID-19 patients from different geographic locations.

Under review as a conference paper at ICLR 2021

Figure 1: Causal graph describing the data generating process.

1. INTRODUCTION

Causal inference models for estimating individualized treatment effects (ITE) are designed to provide actionable intelligence as part of decision support systems and, when deployed in mission-critical domains such as healthcare, require safety and robustness above all (Shalit et al., 2017; Alaa & van der Schaar, 2017). In healthcare, it is often the case that the observational data used to train an ITE model come from a setting where the distribution of patient features differs from the one in the deployment (target) environment, for example, when transferring models across hospitals or countries. Because of this, it is imperative to select ITE models that are robust to these covariate shifts across disparate patient populations.

In this paper, we address the problem of ITE model selection in the unsupervised domain adaptation (UDA) setting, where we have access to the response to treatments for patients in a source domain and we wish to select ITE models that can reliably estimate treatment effects in a target domain containing only unlabeled data, i.e., patient features. UDA has been successfully studied in the predictive setting to transfer knowledge from existing labeled data in the source domain to unlabeled target data (Ganin et al., 2016; Tzeng et al., 2017). In this context, several model selection scores have been proposed to select the predictive models that are most robust to the covariate shifts between domains (Sugiyama et al., 2007; You et al., 2019). These methods approximate the performance of a model on the target domain (target risk) by weighting the performance on the validation set (source risk) with known (or estimated) density ratios. However, ITE model selection for UDA differs significantly from selecting predictive models for UDA (Stuart et al., 2013).
Notably, we can only approximate the estimated counterfactual error (Alaa & van der Schaar, 2019), since we only observe the factual outcome for the received treatment and cannot observe the counterfactual outcomes under other treatment options (Spirtes et al., 2000). Consequently, existing methods for selecting predictive models for UDA that compute a weighted sum of the validation error as a proxy of the target risk (You et al., 2019) are suboptimal for selecting ITE models, as their validation error is in itself only an approximation of the model's ability to estimate counterfactual outcomes on the source domain.

To better approximate the target risk, we propose to leverage the invariance of causal graphs across domains and select ITE models whose predictions of the treatment effects also satisfy known or discovered causal relationships. It is well known that causality is a property of the physical world, and therefore the physical (functional) relationships between variables remain invariant across domains (Schoelkopf et al., 2012; Bareinboim & Pearl, 2016; Rojas-Carulla et al., 2018; Magliacane et al., 2018). As shown in Figure 1, we assume the existence of an underlying causal graph that describes the generating process of the observational data. We represent the selection bias present in the source observational dataset by the arrows from the features $X_1$ and $X_2$ into the treatment $T$. In the target domain, we only have access to the patient features, and we want to estimate the patient outcome $Y$ under different settings of the treatment (intervention). When performing such interventions, the causal structure remains unchanged except for the arrows into the treatment node, which are removed.

Contributions. To the best of our knowledge, we present the first UDA selection method specifically tailored to machine learning models that estimate ITE. Our ITE model selection score uniquely leverages the estimated patient outcomes under different treatment settings in the target domain by incorporating a measurement of how well these outcomes satisfy the causal relationships in the interventional causal graph $G_T$.
This measure, which we refer to as the causal risk, is computed using a log-likelihood function quantifying the fitness of the model's predictions to the underlying causal graph. We provide a theoretical justification for using the causal risk, and we show that our proposed ITE model selection metric for UDA prefers models whose predictions satisfy the conditional independence relationships in $G_T$ and are thus more robust to changes in the distribution of the patient features. We also show experimentally that adding the causal risk to existing state-of-the-art model selection scores for UDA results in selecting ITE models with improved performance on the target domain. We provide illustrative examples of model selection for UDA on several real-world datasets, including ventilator assignment for COVID-19.

2. RELATED WORK

Our work is related to causal inference and domain adaptation. In this section, we describe existing methods for ITE estimation, UDA model selection in the predictive setting, and domain adaptation from a causal perspective.

ITE models. Recently, a large number of machine learning methods for estimating heterogeneous ITE from observational data have been developed, leveraging ideas from representation learning (Johansson et al., 2016; Shalit et al., 2017; Yao et al., 2018), adversarial training (Yoon et al., 2018), causal random forests (Wager & Athey, 2018) and Gaussian processes (Alaa & van der Schaar, 2017; 2018). Nevertheless, no single model achieves the best performance on all types of observational data (Dorie et al., 2019), and even for the same model, different hyperparameter settings or training iterations will yield different performance.

ITE model selection.

Evaluating ITE models' performance is challenging since counterfactual data is unavailable, and consequently the true causal effects cannot be computed. Several heuristics for estimating model performance have been used in practice (Schuler et al., 2018; Van der Laan & Robins, 2003). Factual model selection computes only the error of the ITE model in estimating the factual patient outcomes. Alternatively, inverse propensity of treatment weighted (IPTW) selection uses the estimated propensity score to weight each sample's factual error and thus obtain an unbiased estimate (Van der Laan & Robins, 2003). Alaa & van der Schaar (2019) propose using influence functions to approximate ITE models' error in predicting both factual and counterfactual outcomes. Influence function (IF) based validation currently represents the state-of-the-art method for selecting ITE models. However, existing ITE selection methods are not designed to select models robust to distributional changes in the patient populations, i.e., for domain adaptation.

UDA model selection. UDA is a special case of domain adaptation where we have access to unlabeled samples from the test or target domain. Several methods for selecting predictive models for UDA have been proposed (Pan & Yang, 2010). Here we focus on the ones that can be adapted to the ITE setting. The first unsupervised model selection method was proposed by Long et al. (2018), who used Importance-Weighted Cross-Validation (IWCV) (Sugiyama et al., 2007) to select hyperparameters and models under covariate shift. IWCV requires that the importance weights (or density ratio) be provided or known ahead of time, which is not always feasible in practice. Later, You et al. (2019) proposed Deep Embedded Validation (DEV), which builds on IWCV by using a discriminative neural network to learn the target distribution density ratio, providing an unbiased estimate of the target risk with bounded variance.
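The factual and IPTW validation heuristics described above can be sketched in a few lines. This is a minimal illustration rather than the implementations benchmarked in the paper, and the helper names (`factual_error`, `iptw_error`) are our own:

```python
import numpy as np

def factual_error(y, t, y0_hat, y1_hat):
    """Mean squared error on the observed (factual) outcomes only."""
    y_hat = np.where(t == 1, y1_hat, y0_hat)
    return float(np.mean((y - y_hat) ** 2))

def iptw_error(y, t, y0_hat, y1_hat, propensity):
    """Factual error with each sample weighted by 1 / P(T = t_i | x_i),
    using the estimated propensity score P(T = 1 | x_i)."""
    y_hat = np.where(t == 1, y1_hat, y0_hat)
    p_received = np.where(t == 1, propensity, 1.0 - propensity)
    w = 1.0 / np.clip(p_received, 1e-3, None)  # guard against tiny propensities
    return float(np.mean(w * (y - y_hat) ** 2))
```

Both scores use only factual outcomes from the validation set, which is why they can at best approximate a model's counterfactual error.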
However, these methods do not consider model predictions on the target domain and are agnostic of causal structure.

Causal structure for domain adaptation. Recently, Kyono & van der Schaar (2019) proposed Causal Assurance (CA) as a domain adaptation selection method for predictive models that leverages prior knowledge in the form of a causal graph. Because their work is centered on predictive models, it is suboptimal for ITE models, where the edges into the treatment (or intervention) capture the selection bias of the observational data. Furthermore, their method does not allow for examining the target domain predictions, which is a key novelty of this work. We leverage do-calculus (Pearl, 2009) to manipulate the underlying directed acyclic graph (DAG) into an interventional DAG that more appropriately fits the ITE regime. More recently, researchers have focused on leveraging the causal structure for predictive models by identifying subsets of variables that serve as invariant conditionals (Rojas-Carulla et al., 2018; Magliacane et al., 2018).

3.1. INDIVIDUALIZED TREATMENT EFFECTS AND MODEL SELECTION FOR UDA

Consider a training dataset $D_{src} = \{(x_i^{src}, t_i^{src}, y_i^{src})\}_{i=1}^{N_{src}}$ consisting of $N_{src}$ independent realizations, one for each individual $i$, of the random variables $(X, T, Y)$ drawn from the source joint distribution $p_\mu(X, T, Y)$. Let $p_\mu(X)$ be the marginal distribution of $X$. Assume that we also have access to a test dataset $D_{tgt} = \{x_i^{tgt}\}_{i=1}^{N_{tgt}}$ from the target domain, consisting of $N_{tgt}$ independent realizations of $X$ drawn from the target distribution $p_\pi(X)$, where $p_\mu(X) \neq p_\pi(X)$. Let the random variable $X \in \mathcal{X}$ represent the context (e.g., patient features) and let $T \in \mathcal{T}$ describe the intervention (treatment) assigned to the patient. Without loss of generality, consider the case where the treatment is binary, such that $\mathcal{T} = \{0, 1\}$; note, however, that our model selection method is applicable for any number of treatments. We use the potential outcomes framework (Rubin, 2005) to describe the result of performing an intervention $t \in \mathcal{T}$ as the potential outcome $Y(t) \in \mathcal{Y}$. Let $Y(1)$ represent the potential outcome under treatment and $Y(0)$ the potential outcome under control. Note that for each individual we can only observe one of the potential outcomes, $Y(0)$ or $Y(1)$. We assume that the potential outcomes have a stationary distribution given the context: $p_\mu(Y(t) \mid X) = p_\pi(Y(t) \mid X)$; this represents the covariate shift assumption in domain adaptation (Shimodaira, 2000). Observational data can be used to estimate $E[Y \mid X = x, T = t]$ through regression. Assumption 1 states the causal identification conditions (Rosenbaum & Rubin, 1983) under which the potential outcomes coincide with this conditional expectation: $E[Y(t) \mid X = x] = E[Y \mid X = x, T = t]$.

Assumption 1 (Consistency, Ignorability and Overlap). For any individual (unit) $i$ receiving treatment $t_i$, we observe $Y_i = Y(t_i)$. Moreover, $\{Y(0), Y(1)\}$ and the data generating process $p(X, T, Y)$ satisfy strong ignorability $Y(0), Y(1) \perp\!\!\!\perp T \mid X$ and overlap $\forall x, t : P(T = t \mid X = x) > 0$.
The ignorability assumption, also known as the no hidden confounders (unconfoundedness) assumption, means that we observe all variables $X$ that causally affect the assignment of the intervention and the outcome. Under unconfoundedness, $X$ blocks all backdoor paths between $Y$ and $T$ (Pearl, 2009). Under Assumption 1, the conditional expectation of the potential outcomes can also be written as the interventional distribution obtained by applying the do-operator in the causal framework of Pearl (2009): $E[Y(t) \mid X = x] = E[Y \mid X = x, do(T = t)]$.

An ITE model is a function $f : \mathcal{X} \times \mathcal{T} \to \mathcal{Y}$ such that $f(x, t)$ approximates $E[Y \mid X = x, T = t] = E[Y(t) \mid X = x] = E[Y \mid X = x, do(T = t)]$. The goal is to estimate the ITE, also known as the conditional average treatment effect (CATE):

$\tau(x) = E[Y(1) \mid X = x] - E[Y(0) \mid X = x]$  (1)
$\quad\;\, = E[Y \mid X = x, do(T = 1)] - E[Y \mid X = x, do(T = 0)]$.  (2)

The CATE is essential for individualized decision making as it guides treatment assignment policies. A trained ITE predictor $f(x, t)$ approximates the CATE as $\hat\tau(x) = f(x, 1) - f(x, 0)$. A commonly used metric for assessing ITE models is the precision in estimating heterogeneous effects (PEHE) (Hill, 2011): $\epsilon_{PEHE} = E_{x \sim p(x)}[(\tau(x) - \hat\tau(x))^2]$, which quantifies a model's estimate of the heterogeneous treatment effects for patients in a population.

UDA model selection. Given a set $F = \{f_1, \dots, f_m\}$ of candidate ITE models trained on the source domain $D_{src}$, our aim is to select the model that achieves the lowest target risk, that is, the lowest PEHE on the target domain $D_{tgt}$. Thus, ITE model selection for UDA involves finding:

$f^* = \arg\min_{f \in F} E_{x \sim p_\pi(x)}[(\tau(x) - \hat\tau(x))^2] = \arg\min_{f \in F} E_{x \sim p_\pi(x)}[(\tau(x) - (f(x, 1) - f(x, 0)))^2]$.  (4)

For this purpose, we propose using the invariance of causal graphs across domains to select ITE predictors that are robust to distributional shifts in the marginal distribution of $X$.
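As a concrete illustration, the PEHE metric above can be estimated on a finite sample as follows. This is a hypothetical helper of our own; it assumes the true CATE `tau_true` is known, as is the case in synthetic benchmarks:

```python
import numpy as np

def pehe(tau_true, y0_hat, y1_hat):
    """Precision in estimating heterogeneous effects: the mean squared
    error between the true CATE and the model's estimated CATE."""
    tau_hat = y1_hat - y0_hat  # model's CATE estimate f(x,1) - f(x,0)
    return float(np.mean((tau_true - tau_hat) ** 2))
```

On real observational data `tau_true` is unobservable, which is precisely why the selection scores discussed in this paper are needed.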

3.2. CAUSAL GRAPHS FRAMEWORK

In this work, we use the semantic framework of causal graphs (Pearl, 2009) to reason about causality in the context of model selection. We assume that the unknown data generating process in the source domain can be described by the causal directed acyclic graph (DAG) $G$, which contains the relationships between the variables $V = (X, T, Y)$ consisting of the patient features $X$, treatment $T$, and outcome $Y$. We operate under the Markov and faithfulness conditions (Richardson, 2003; Pearl, 2009), meaning that any conditional independencies in the joint distribution $p_\mu(X, T, Y)$ are indicated by d-separation in $G$ and vice versa. In this framework, an intervention on the treatment variable $T \in V$ is denoted through the do-operation $do(T = t)$ and induces the interventional DAG $G_T$, in which the edges into $T$ are removed. The interventional DAG $G_T$ corresponds to the interventional distribution $p_\mu(X, Y \mid do(T = t))$ (Pearl, 2009). The only node on which we perform interventions in the target domain is the treatment node. Consequently, this node will have the edges into it removed, while the remainder of the DAG is unchanged. We assume that the causal graph is invariant across domains (Schoelkopf et al., 2012; Ghassami et al., 2017; Magliacane et al., 2018), which we formalize for interventions as follows:

Assumption 2 (Causal invariance). Let $V = (X, T, Y)$ be a set of variables consisting of patient features $X$, treatment $T$, and outcome $Y$. Let $\Delta$ be a set of domains, let $p_\delta(X, Y \mid do(T = t))$ be the corresponding interventional distribution over $V$ in domain $\delta \in \Delta$, and let $I(p_\delta(V))$ denote the set of all conditional independence relationships embodied in $p_\delta(V)$. Then, $\forall \delta_i, \delta_j \in \Delta$,

$I(p_{\delta_i}(X, Y \mid do(T = t))) = I(p_{\delta_j}(X, Y \mid do(T = t)))$.  (5)
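The do-operation on the treatment node described above amounts to deleting $T$'s incoming edges while leaving the rest of the graph untouched. A minimal sketch, with the DAG encoded as a parent map (this encoding and the function name `intervene` are our own, not the paper's implementation):

```python
def intervene(parents, t_node):
    """Return the interventional DAG G_T: the same graph with all edges
    into the treatment node removed (graph encoded as {node: parent set})."""
    g_t = {v: set(pa) for v, pa in parents.items()}  # copy, leave G intact
    g_t[t_node] = set()  # do(T = t) cuts the arrows into T
    return g_t

# Figure 1 graph: X1 -> T, X2 -> T, X1 -> Y, X2 -> Y, T -> Y
G = {"X1": set(), "X2": set(), "T": {"X1", "X2"}, "Y": {"X1", "X2", "T"}}
G_T = intervene(G, "T")
```

After the intervention, `G_T` encodes the selection-bias-free graph: `T` has no parents, while all other edges, including those into `Y`, are preserved.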

4. ITE MODEL SELECTION FOR UDA

Let $F = \{f_1, f_2, \dots, f_m\}$ be a set of candidate ITE models trained on the data from the source domain $D_{src}$. Our aim is to select the model $f \in F$ that achieves the lowest PEHE on the target domain $D_{tgt}$, as described in Equation 4. Let $G$ be a causal graph, either known or discovered, that describes the causal relationships between the variables in $X$, the treatment $T$ and the outcome $Y$. Let $G_T$ be the interventional causal graph obtained from $G$ by removing the edges into the treatment variable $T$.

Prior causal knowledge and graph discovery. The invariant graph $G$ can be arrived at in two primary ways. The first is through experimental means, such as randomized trials, which does not scale to a large number of covariates due to financial or ethical impediments. The second is through causal discovery of the DAG structure from observational data (for a listing of current algorithms we refer to Glymour et al. (2019b)), which is more feasible in practice. Under the assumption of no hidden confounding variables, score-based causal discovery algorithms output a completed partially directed acyclic graph (CPDAG) representing the Markov equivalence class (MEC) of graphs, i.e., those graphs that are statistically indistinguishable given the observational data and therefore share the same conditional independencies. Given a CPDAG, it is up to an expert (or further experiments) to orient any undirected edges of the CPDAG to convert it into a DAG (Pearl, 2009). This step is the most error-prone, and we show in our real-data experiments how a subgraph (using only the known edges) can still improve model selection performance.

Improving target risk estimation. For a trained ITE model $f$, let $\hat y(0) = f(x, 0)$ and $\hat y(1) = f(x, 1)$ be the predicted potential outcomes for $x \sim p_\pi(x)$.
We develop a selection method that prefers models whose predictions on the target domain preserve the conditional independence relationships between $X$, $T$ and $Y$ in the interventional DAG $G_T$ with the edges into the treatment variable $T$ removed. We first state a theorem, which we later exploit for model selection.

Theorem 1. Let $p_\mu(X, T, Y)$ be a source distribution with corresponding DAG $G$. If $Y = f(X, T)$, i.e., $f$ is an optimal ITE model, then $I_G(G_T) = I(p_\pi(X, f(X, t) \mid do(T = t)))$, where $p_\pi(X, f(X, t) \mid do(T = t))$ is the interventional distribution for the target domain, and $I_G(G_T)$ and $I(p_\pi(X, f(X, t) \mid do(T = t)))$ return all the conditional independence relationships in $G_T$ and in $p_\pi(X, f(X, t) \mid do(T = t))$, respectively.

For details and the proof of Theorem 1, see Appendix B. Theorem 1 provides an equality relating the predictions of $f$ in the target domain to the interventional DAG $G_T$. We therefore want the set of independence relationships in $G_T$ to equal $I(p_\pi(X, f(X, t) \mid do(T = t)))$. In our case, we do not have access to the true interventional distribution $p_\pi(X, f(X, t) \mid do(T = t))$, but we can approximate it from the dataset obtained by augmenting the unlabeled target dataset $D_{tgt}$ with the model's predictions of the potential outcomes: $\tilde D_{tgt} = \{(x_i^{tgt}, 0, \hat y_i^{tgt}(0)), (x_i^{tgt}, 1, \hat y_i^{tgt}(1))\}_{i=1}^{N_{tgt}}$, where $\hat y_i^{tgt}(t) = f(x_i^{tgt}, t)$ for $x_i^{tgt} \in D_{tgt}$. We propose to strengthen the formalization in Eq. 4 by adding a constraint on preserving the conditional independencies of $G_T$:

$\arg\min_{f \in F} R_T(f)$ subject to $E[NCI(G_T, \tilde D_{tgt})] = 0$,

where $R_T(f)$ is a function that approximates the target risk of a model $f$, and $NCI(G_T, \tilde D_{tgt})$ is the number of conditional independence relationships in the graph $G_T$ that are not satisfied by the test dataset augmented with the model's predictions of the potential outcomes, $\tilde D_{tgt}$.
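A rough sketch of the two ingredients above — building the augmented dataset $\tilde D_{tgt}$ and counting violated conditional independencies — might look as follows. These are our own simplified helpers; in particular, the partial-correlation check is a crude linear stand-in for a proper conditional independence test:

```python
import numpy as np

def augment_target(x_tgt, model):
    """Build tilde-D_tgt: stack the target covariates twice, once per
    treatment arm, paired with the model's predicted potential outcomes."""
    y0, y1 = model(x_tgt, 0), model(x_tgt, 1)
    n = x_tgt.shape[0]
    x = np.vstack([x_tgt, x_tgt])
    t = np.concatenate([np.zeros(n), np.ones(n)])
    y = np.concatenate([y0, y1])
    return x, t, y

def partial_corr(a, b, z):
    """Correlation of a and b after linearly regressing out the columns of z."""
    def resid(v):
        design = np.column_stack([z, np.ones(len(v))])
        beta, *_ = np.linalg.lstsq(design, v, rcond=None)
        return v - design @ beta
    return float(np.corrcoef(resid(a), resid(b))[0, 1])

def nci(data, ci_statements, thresh=0.1):
    """Count conditional independence statements (i indep j given cond)
    violated by the data, via a crude partial-correlation threshold."""
    return sum(
        abs(partial_corr(data[:, i], data[:, j], data[:, cond])) > thresh
        for i, j, cond in ci_statements
    )
```

As the paper notes, reliable conditional independence testing is hard, especially for continuous variables, which motivates the likelihood-based fitness score used instead.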

Figure 2: Computing the ICMS score from predictions on the target data. An ITE model $f$ is trained on $D_{tr}$, where $D_{src} = D_{tr} \cup D_v$; its predictions on the target covariates form $\tilde D_{tgt} = \{(x_i^{tgt}, 0, \hat y_i^{tgt}(0)), (x_i^{tgt}, 1, \hat y_i^{tgt}(1))\}_{i=1}^{N_{tgt}}$, which is scored against the interventional DAG over $X_1$, $X_2$, $T$, $Y$ via the causal risk $c_r(f, \tilde D_{tgt}, G_T) = -LL(G_T \mid \tilde D_{tgt})$ and combined with the validation risk $v_r(f, D_v, D_{tgt})$.
A score that satisfies this constraint is provided by the Lagrangian method:

$\mathcal{L} = R_T(f) + \lambda E[NCI(G_T, \tilde D_{tgt})]$.  (8)

The first term, $R_T(f)$, is equivalent to the expected test PEHE, which at selection time can be approximated by the validation risk (either source or target risk), which we denote $v_r(f, D_v, D_{tgt})$. The second term, $E[NCI(G_T, \tilde D_{tgt})]$, which is derived from Theorem 1, evaluates the number of conditional independence relationships resulting from d-separation in the graph $G_T$ that are not satisfied by the test dataset augmented with the model's predictions of the potential outcomes, $\tilde D_{tgt}$. However, this term may never equal 0, and directly minimizing $NCI(G_T, \tilde D_{tgt})$ involves evaluating conditional independence relationships, which is a hard statistical problem, especially for continuous variables (Shah et al., 2020). Because of this, we approximate it with a causal fitness score that measures the likelihood of a DAG given the augmented dataset $\tilde D_{tgt}$, which we write as $c_r(f, \tilde D_{tgt}, G_T)$. This represents an alternative and equivalent approach, also used by score-based causal discovery methods (Ramsey et al., 2017b; Glymour et al., 2019c).
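With the two risks $v_r$ and $c_r$ computed separately, the resulting selection rule is simply an argmin over their weighted sum. A minimal sketch (the function names and the `(v_r, c_r)` bookkeeping are ours, assuming both risks have already been evaluated per candidate):

```python
def icms_score(v_r, c_r, lam=1.0):
    """Selection score: validation risk plus lambda times the causal risk;
    lower is better."""
    return v_r + lam * c_r

def select_model(candidates, lam=1.0):
    """Return the name of the candidate with the smallest score;
    `candidates` maps a model name to its (v_r, c_r) pair."""
    return min(candidates, key=lambda name: icms_score(*candidates[name], lam))
```

The example below shows how the causal risk term can flip a ranking that validation risk alone would produce:

```python
scores = {"f1": (0.2, 0.9), "f2": (0.3, 0.1)}
# With lam=1.0, f2 wins (0.4 < 1.1); with lam=0.0, f1 wins on v_r alone.
```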
Consider partitioning the source dataset $D_{src} = \{(x_i^{src}, t_i^{src}, y_i^{src})\}_{i=1}^{N_{src}}$ into a training dataset $D_{tr}$ and a validation dataset $D_v$ such that $D_{src} = D_{tr} \cup D_v$. From Eq. 8 we define our ICMS score $r$ as follows:

Definition 1 (ICMS score). Let $f$ be an ITE predictor trained on $D_{tr}$, let $D_{tgt} = \{x_i^{tgt}\}_{i=1}^{N_{tgt}}$ be the test dataset, and let $G_T$ be the interventional causal graph. We define the following selection score:

$r(f, D_v, D_{tgt}, G_T) = v_r(f, D_v, D_{tgt}) + \lambda c_r(f, D_{tgt}, G_T)$,  (9)

where $v_r$ measures the validation risk on the validation set $D_v$ and $c_r$ is a scoring function, which we call the causal risk, that measures the fitness of the interventional causal graph $G_T$ to the dataset $\tilde D_{tgt} = \{(x_i^{tgt}, 0, \hat y_i^{tgt}(0)), (x_i^{tgt}, 1, \hat y_i^{tgt}(1))\}_{i=1}^{N_{tgt}}$, where $\hat y_i^{tgt}(t) = f(x_i^{tgt}, t)$ for $x_i^{tgt} \in D_{tgt}$.

The validation risk $v_r(f, D_v, D_{tgt})$ can either be (1) the source risk, computed using existing model selection scores for ITE (Alaa & van der Schaar, 2019; Van der Laan & Robins, 2003), or (2) an approximation of the target risk using the preexisting methods IWCV or DEV (Sugiyama et al., 2007; You et al., 2019). We describe in the following section how to compute the causal risk $c_r(f, D_{tgt}, G_T)$. $\lambda$ is a tuning factor balancing the causal risk term against the validation risk $v_r$. We set $\lambda = 1$ for our experiments, but ideally $\lambda$ would be proportional to our certainty in the causal graph. We discuss alternative methods for selecting $\lambda$, as well as a $\lambda$ sensitivity analysis, in Appendix F. We provide ICMS pseudocode and a graphical illustration of calculating ICMS in Appendix C. We provide additional practical considerations and experiments regarding computational complexity, a subgraph analysis, causal graph misspecification, ICMS selection on tree-based methods, ICMS selection on causally invariant features, and the noisiness of the fitness score, along with further discussion, in Appendix H.

Assessing causal graph fitness.
The causal risk term $c_r(f, D_{tgt}, G_T)$ in our ICMS score requires assessing the fitness of the dataset $\tilde D_{tgt}$ to the invariant causal knowledge in $G_T$. Noteworthy maximum-likelihood options include the Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz, 1978). Both the BIC and the AIC are penalized versions of the log-likelihood function of a DAG given data, e.g., $LL(G_T \mid \tilde D_{tgt})$. In score-based causal discovery, the DAG that best fits the data maximizes $LL(G_T \mid \tilde D_{tgt})$ subject to model complexity penalties. In this work, we are not searching over candidate causal graphs and only care about maximizing the fit of the DAG to the dataset. Thus, we use the negative log-likelihood of $G_T$ given $\tilde D_{tgt}$, i.e., $-LL(G_T \mid \tilde D_{tgt})$, as our causal risk term $c_r$. The term $-LL(G_T \mid \tilde D_{tgt})$ is smaller when $G_T$ is closer to modeling the probability distribution in $\tilde D_{tgt}$, i.e., when the predicted potential outcomes satisfy the conditional independence relationships in $G_T$. In score-based causal discovery, the BIC is a common score used to discover the completed partially directed acyclic graph (CPDAG), representing all DAGs in the MEC, from observational data. Under the Markov and faithfulness assumptions, every conditional independence in the MEC of $G$ also holds in $D$. The BIC score is defined as:

$BIC(G \mid D) = -LL(G \mid D) + \frac{\log_2 N}{2} \lVert G \rVert$,  (10)

where $N$ is the dataset size and $\lVert G \rVert$ is the dimensionality of $G$. For the causal risk $c_r$ in Eq. 9 we use the BIC score; however, since $N$ and $\lVert G \rVert$ are held constant in our method, $c_r \propto -LL(G \mid D)$.
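For discrete variables, $-LL(G \mid D)$ can be computed from empirical conditional entropies via the standard decomposition $LL(G \mid D) = -N \sum_i H_D(X_i \mid PA_i)$ used by score-based discovery. A minimal sketch under that assumption (the dict-of-rows data layout and function names are our own):

```python
import math
from collections import Counter

def cond_entropy(rows, child, parents):
    """Empirical conditional entropy H(child | parents) in bits; each row
    is a dict mapping node name -> discrete value."""
    n = len(rows)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    marg = Counter(tuple(r[p] for p in parents) for r in rows)
    # H(C|P) = -sum p(pa, c) * log2 p(c | pa)
    return -sum(c / n * math.log2(c / marg[pa]) for (pa, _), c in joint.items())

def neg_log_likelihood(graph, rows):
    """-LL(G | D) = N * sum_i H_D(X_i | PA_i); smaller means the DAG
    fits the data better (the BIC penalty is omitted, as it is constant
    when comparing models on a fixed graph and dataset size)."""
    return len(rows) * sum(
        cond_entropy(rows, v, sorted(pa)) for v, pa in graph.items()
    )
```

The test below checks the expected behavior: a graph containing the true edge X -> Y fits deterministic data Y = X strictly better than the edgeless graph.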
To find $LL(G \mid D)$ we use the following decomposition:

$LL(G \mid D) = -N \sum_{X_i} H_D(X_i \mid PA_i)$,  (11)

where $N$ is the dataset size, $PA_i$ are the parent nodes of $X_i$ in $G$, and $H_D$ is the conditional entropy function, which is given by Darwiche (2009) for discrete variables and by Ross (2014) for continuous or mixed variables.

Limitations of UDA selection methods. In the ideal scenario, we would be able to leverage labeled samples in the target domain to estimate the target risk of a machine learning model. We can express the target risk $R_{tgt}$ in terms of the testing loss as follows:

$R_{tgt} = \frac{1}{N_{tgt}} \sum_{i=1}^{N_{tgt}} \big( (Y_i^{tgt}(1) - Y_i^{tgt}(0)) - (f(x_i^{tgt}, 1) - f(x_i^{tgt}, 0)) \big)^2$.  (12)

However, in general, we do not have access to the treatment responses for patients in the target set and, even if we did, we could only observe the factual outcome. Moreover, existing model selection methods for UDA consider only predictions on the source domain and do not take into account the predictions of the candidate model on the target domain. Specifically, DEV and IWCV calculate a density ratio or importance weight between the source and target domains as follows:

$w_f(x) = \frac{p(d = 1 \mid x)}{p(d = 0 \mid x)} \cdot \frac{N_{src}}{N_{tgt}}$,

where $d$ designates the dataset domain (source is 0, target is 1), and $\frac{p(d=1|x)}{p(d=0|x)}$ can be estimated by a discriminative model trained to distinguish source from target samples (You et al., 2019). Both calculate their score as a function of $\Delta$:

$\Delta = \frac{1}{N_v} \sum_{i=1}^{N_v} w_f(x_i^v)\, l(y_i^v, f(x_i^v, 0), f(x_i^v, 1))$,

where $l(\cdot, \cdot, \cdot)$ is a validation loss, such as influence-function based validation (Alaa & van der Schaar, 2019). Note that the functions $l$ and $w_f$ are defined only in terms of validation samples $x_i^v$ from the source dataset. Such selection scores can be used to compute the validation risk $v_r(f, D_v, D_{tgt})$ part of the ICMS score.
However, our ICMS score also computes the likelihood of the interventional causal graph given the predictions of the model in the target domain as a proxy for the risk in the target domain. By adding the causal risk, we improve the estimation of the target risk. Additionally, we explicitly make use of the estimated potential outcomes on the test set, $f(x^{tgt}, 0)$ and $f(x^{tgt}, 1)$, to calculate our selection score, as shown in Eq. 9. Fig. 2 depicts how we use the predictions on the target data to calculate our ICMS score.
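The density-ratio weighting used by IWCV and DEV, i.e., the $w_f$ and $\Delta$ computations above, can be sketched as follows. This is a simplified illustration in which we assume the discriminator's probabilities $p(d = 1 \mid x)$ have already been estimated:

```python
import numpy as np

def importance_weights(p_target, n_src, n_tgt):
    """w_f(x) = p(d=1|x)/p(d=0|x) * N_src/N_tgt, from a discriminator's
    probability p(d=1|x) that sample x comes from the target domain."""
    p = np.clip(p_target, 1e-6, 1 - 1e-6)  # avoid division by zero
    return p / (1.0 - p) * (n_src / n_tgt)

def weighted_validation_risk(weights, losses):
    """Delta: the importance-weighted mean of per-sample validation losses."""
    return float(np.mean(weights * losses))
```

When the discriminator cannot separate the domains (p = 0.5 everywhere and equal sample sizes), the weights reduce to 1 and the score collapses to the ordinary validation loss.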

5. EXPERIMENTS

We evaluate methods by the test performance in terms of the average PEHE of the top 10% of models in the list returned by the model selection benchmarks. We refer to this as the PEHE-10 test error. We provide additional metrics for our results in Appendix G.1.

Benchmark ITE models. We show how the ICMS score improves model selection for state-of-the-art ITE methods based on neural networks: GANITE (Yoon et al., 2018), CFRNet (Johansson et al., 2018), TARNet (Johansson et al., 2018), SITE (Yao et al., 2018); and on Gaussian processes: CMGP (Alaa & van der Schaar, 2017) and NSGP (Alaa & van der Schaar, 2018). These ITE methods use different techniques for estimating ITE and currently achieve the best performance on standard benchmark observational datasets (Alaa & van der Schaar, 2019). We iterate over each model multiple times and compare against various DAGs and held-out test sets. Having various DAG structures results in varying magnitudes of test error. Therefore, without changing the ranking of the models, we min-max normalize the test error between 0 and 1 for each DAG, such that equal weight is given to each experimental run and a relative comparison across benchmark ITE models can be made.

Benchmark methods. We benchmark our proposed ITE model selection score ICMS against the following UDA selection methods developed for predictive models: IWCV (Long et al., 2018) and DEV (You et al., 2019). To approximate the source risk, i.e., the error of ITE methods in predicting potential outcomes on the source domain (validation set $D_v$), we use the following standard ITE scores: MSE on the factual outcomes, inverse propensity weighted factual error (IPTW) (Van der Laan & Robins, 2003), and influence functions (IF) (Alaa & van der Schaar, 2019). Note that each score (MSE, IPTW, etc.) can be used to estimate the target risk in the UDA selection methods IWCV, DEV, or ICMS.
Specifically, we benchmark our method in conjunction with each combination of ITE model errors {MSE, IPTW, IF} and validation risks {∅, IWCV, DEV}. We include experiments with ∅ to demonstrate using the source risk alone as the validation risk.
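For reference, the PEHE-10 metric and the per-DAG min-max normalization described above can be sketched as follows (function names are ours, not from the paper):

```python
import numpy as np

def pehe(y0_true, y1_true, y0_hat, y1_hat):
    """Squared-error PEHE between true and estimated treatment effects."""
    return float(np.mean(((y1_true - y0_true) - (y1_hat - y0_hat)) ** 2))

def pehe10(ranked_pehes):
    """Average test PEHE of the top 10% of models in selection order."""
    k = max(1, int(np.ceil(0.1 * len(ranked_pehes))))
    return float(np.mean(ranked_pehes[:k]))

def minmax_normalize(errors):
    """Scale errors to [0, 1] per DAG so each run contributes equal weight."""
    errors = np.asarray(errors, dtype=float)
    lo, hi = errors.min(), errors.max()
    return (errors - lo) / (hi - lo) if hi > lo else np.zeros_like(errors)
```

`pehe10` takes the test PEHEs in the order the selection method ranked the models, so a good selector yields a low PEHE-10.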

5.1. SYNTHETIC UDA MODEL SELECTION

Data generation. In this section, we evaluate our method in comparison to related selection methods on synthetic data. For each of the simulations, we generated a random DAG, G, with n vertices and up to n(n - 1)/2 edges (the maximum number of edges in a DAG) between them. We construct our datasets with functional relationships between variables that share a directed edge in G and apply Gaussian noise (mean 0, variance 1) to each. We provide further details and pseudocode in Appendix G. Using the structure of G, we synthesized 2000 samples for our observational source dataset D_src. We randomly split D_src into a training set D_tr and a validation set D_v with 80% and 20% of the samples, respectively. To generate the testing dataset D_tgt, we use G to generate 1000 samples where half of the dataset receives treatment and the other half does not. For D_tgt, we randomly shift the mean of at least one ancestor of Y in G by between 1 and 10, whereas in D_src a mean of 0 is used. It is important to note that the actual outcome or response is never seen when selecting our models. Furthermore, the training dataset D_src is observational and contains selection bias in the treatment node, whereas the synthetic test set D_tgt does not, since it was generated by intervention at the treatment node. Our algorithm has access only to the covariates X in D_tgt.

Improved selection for all ITE models. Table 1 shows results of ICMS on synthetic data over the benchmark ITE models.

Table 1: PEHE-10 performance (with standard error) using ICMS on top of existing UDA methods. ICMS(·) means that · was used as the validation risk v_r in the ICMS; for example, ICMS(DEV(·)) represents DEV(·) selection used as the validation risk v_r in the ICMS. The · indicates the method used to approximate the validation error on the source dataset. Our method (in bold) improves over each selection method over all models and source risk scores (Src.).
Here, we evaluate three different types of selection baseline methods: MSE, IPTW, and IF. We then compare each baseline selection method with UDA methods: IWCV, DEV, and ICMS (proposed). We repeated the experiment over 50 different DAGs with 30 candidate models for each model architecture. Each of the candidate algorithms was trained using their published settings and hyperparameters, as detailed in Appendix E. In Table 1 , we see that our proposed method (ICMS) improves on each baseline selection method by having a lower testing error in terms of PEHE-10 (and inversion count in Appendix G.1) over all treatment models.

5.2. APPLICATION TO THE COVID-19 RESPONSE

ICMS facilitates and improves model transfer across domains with disparate distributions (e.g., across time or geographical location), which we demonstrate in this section for COVID-19. The COVID-19 pandemic challenged healthcare systems worldwide. At the peak of the outbreak, many countries experienced a shortage of life-saving equipment, such as ventilators and ICU beds. Considering data from the UK outbreak, the pandemic hit the urban population before spreading to the rural areas (Figure 3). This implies that, had we reacted in a timely manner, we could have transferred models trained on the urban population to the rural population. However, there is a significant domain shift, as the rural population is older and has more preexisting conditions (Armstrong et al., 2020). Furthermore, at the time of model deployment in rural areas, there may be no labeled samples available. The characteristics of the two populations are summarized in Figure 3. We provide dataset details and patient statistics in Appendix J.

UK (urban) → UK (rural). Using the urban dataset, we performed causal discovery on the relationships between the patient covariates, treatment, and outcome. The discovered graph (Figure 3) agrees well with the literature (Williamson et al., 2020; Niedzwiedz et al., 2020). To be able to evaluate the ITE methods on how well they estimate all counterfactual outcomes, we created a semi-synthetic version of the dataset with outcomes simulated according to the causal graph. Refer to Appendix J for details of the semi-synthetic data simulation. Our training observational dataset consists of the patient features, the ventilator assignment (treatment) for the COVID-19 patients in the urban area, and the synthetic outcome generated based on the causal graph. For each benchmark ITE model, we used 30 different hyperparameter settings and trained the various models to estimate the effect of ventilator use on the patient risk of mortality. We used the same training regime as in the synthetic experiments and the discovered COVID-19 causal DAG (using FGES (Ramsey et al., 2017a)) shown in Figure 3. We evaluated the best ITE model selected by each model selection method in a ventilator assignment task. Using each selected ITE model, we assigned 2000 ventilators to the rural-area patients that would have the highest estimated benefit (individualized treatment effect) from receiving the ventilator. Using the known synthetic outcomes for each patient, we then computed how many patients would have improved outcomes under each selected ITE model for assigning ventilators. Taking selection based on the factual outcome (MSE) on the source dataset as a baseline, in Figure 4 we computed the additional number of patients with improved outcomes obtained by using ICMS on top of existing UDA methods when selecting GANITE models with different hyperparameter settings.
We see that ICMS (in blue) identified the GANITE models that resulted in better patient outcomes in the UK's rural areas without access to labeled data. We include additional experimental results in Appendix J.

Additional experiments. On the TWINS dataset (Almond et al., 2005) (in Appendix I), we show how our method improves UDA model selection even with partial knowledge of the causal graph (i.e., using only a known subgraph for computing the ICMS score). Note also that in the TWINS dataset we have access to real patient outcomes. Moreover, we provide additional UDA model selection results for transferring domains on a prostate cancer dataset and the Infant Health and Development Program (IHDP) dataset (Hill, 2011) in Appendix I.

6. CONCLUSION

We provide a novel ITE model selection method for UDA that uniquely leverages the predictions of candidate models on a target domain by preserving invariant causal relationships. To the best of our knowledge, we have provided the first model selection method for ITE models specifically for UDA. We provide a theoretical justification for using ICMS and have shown on a variety of synthetic, semi-synthetic, and real data that our method can improve on existing state-of-the-art UDA methods.

A WHY USE CAUSAL GRAPHS FOR UDA?

To motivate our method, consider the following hypothetical scenario. Suppose we have X1, X2, T, and Y representing age, respiratory comorbidities, treatment, and COVID-19 mortality, respectively, and the causal graph has structure X1 → X2 → Y ← T. Suppose that each node is a simple linear function of its parents with i.i.d. additive Gaussian noise terms. Now consider two countries A and B, where A has already been hit by COVID-19 and B is just seeing cases increase (and therefore has no observed outcomes yet). B would like to select a machine learning model trained on the patient outcomes from A. However, A and B differ in their distributions of age X1. Consider the regression of Y on X1, X2, and T, i.e., Y = c1 X1 + c2 X2 + c3 T, by two models f1 and f2 that are fit on the source domain and evaluated on the target domain. Suppose that f1 and f2 have the same values for c2 and c3 but differ in c1, where c1 = 0 for f1 and c1 ≠ 0 for f2. We know that Y is a function of only X2 and T. Thus, in the shifted test domain, f1 must have a lower testing error than f2, since the predictions of f2 use X1 (as c1 ≠ 0) and those of f1 do not. Furthermore, the predictions of f1 satisfy the same causal relationships and conditional independencies as Y, such as f1(X1, X2, T) ⊥⊥ X1 | X2. This is not the case for f2, where f2(X1, X2, T) ⊥̸⊥ X1 | X2. Motivated by this, we can use a metric of graphical fitness of the predictions of f_i to the underlying graphical structure to select models in shifted domains when all we have are unlabeled samples. As an added bonus, which we will highlight later, unlike existing UDA selection methods, our method can be used without needing to share data between A and B, which can help overcome the patient-privacy barriers that are ubiquitous in the healthcare setting.
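This toy example can be checked numerically. The sketch below simulates the linear-Gaussian graph above, builds predictions mimicking f1 and f2 (with small additive noise standing in for estimation error, and a continuous treatment for simplicity), and tests the conditional independence via partial correlation; all coefficients and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)                       # age
x2 = 0.8 * x1 + rng.normal(size=n)            # respiratory comorbidities
t = rng.normal(size=n)                        # treatment (continuous here)
y = 1.5 * x2 - 1.0 * t + rng.normal(size=n)   # outcome depends only on X2 and T

# f1 has c1 = 0; f2 leaks a spurious dependence on X1 (c1 != 0).
# The 0.1-scale noise stands in for the estimation error of a fitted model.
f1 = 1.5 * x2 - 1.0 * t + 0.1 * rng.normal(size=n)
f2 = 0.7 * x1 + 1.5 * x2 - 1.0 * t + 0.1 * rng.normal(size=n)

def partial_corr(u, v, controls):
    """Correlation of u and v after regressing out the control variables."""
    Z = np.column_stack([np.ones(len(u))] + controls)
    ru = u - Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
    rv = v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]
    return float(np.corrcoef(ru, rv)[0, 1])

# f1's predictions are conditionally independent of X1 given X2 and T;
# f2's are not, mirroring f1 ⊥⊥ X1 | X2 versus f2 ⊥̸⊥ X1 | X2 above.
print(abs(partial_corr(f1, x1, [x2, t])))   # ≈ 0 (up to sampling noise)
print(abs(partial_corr(f2, x1, [x2, t])))   # clearly bounded away from 0
```

A near-zero partial correlation for f1 and a large one for f2 is exactly the graphical-fitness signal the ICMS score exploits.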

B PROOF OF THEOREM 1

In this section, we present a proof for Theorem 1.

Proof. In the source domain, by the Markov and faithfulness assumptions, the conditional independencies implied by G are exactly those of p_µ(X, T, Y), such that I_G(G) = I(p_µ(X, T, Y)). To estimate the potential outcomes Y(t), we apply the do-operator to obtain the interventional DAG G_T and the interventional distribution p_µ(X, Y | do(T = t)), such that I_G(G_T) = I(p_µ(X, Y | do(T = t))). Since we assume Y = f(X, T), we obtain I_G(G_T) = I(p_µ(X, f(X, t) | do(T = t))). By Assumption 2, the conditional independence relationships in the interventional distribution are the same in any environment, i.e., I(p_µ(X, f(X, t) | do(T = t))) = I(p_π(X, f(X, t) | do(T = t))), so that we obtain I_G(G_T) = I(p_π(X, f(X, t) | do(T = t))).

C ICMS ADDITIONAL DETAILS

To clarify our methodology further, we have provided pseudocode in Algorithms 1 and 2. Algorithm 1 calculates the ICMS score (from Eq. 9) for a given model. The values for c_r and v_r are min-max normalized between 0 and 1 across all models. Algorithm 2 returns a list of models from a set of ITE models F, ranked by ICMS score. It takes optional prior knowledge in the form of a causal graph or known connections. In Figure 5, we provide a graphical illustration of calculating NCI.

Algorithm 1: ICMS(f, D_v, D_tgt, G_T, λ) → r(f, D_v, D_tgt, G_T)
  ŷ_i^tgt(t) ← f(x_i^tgt, t), for x_i^tgt ∈ D_tgt
  D̂_tgt ← {(x_i^tgt, 0, ŷ_i^tgt(0)), (x_i^tgt, 1, ŷ_i^tgt(1))}_{i=1}^{N_tgt}
  c_r ← measure of the fitness of D̂_tgt to the DAG G_T
  v_r ← validation risk of f on D_v
  return c_r + λ v_r

Algorithm 2: ICMS_sel(F, D_tr, D_v, D_tgt, λ, G_π = ∅)
  G_d ← causal discovery on D_tr
  G ← assumed invariant DAG from G_π or G_d
  G_T ← interventional DAG of G (remove edges into T)
  F̄ ← sort F by ICMS(f, D_v, D_tgt, G_T, λ), ascending
  return F̄
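A minimal sketch of the ranking step of Algorithm 2, assuming the raw causal risks c_r and validation risks v_r have already been computed per model (names illustrative):

```python
def icms_rank(models, causal_risks, validation_risks, lam=1.0):
    """Rank models ascending by ICMS = c_r + lambda * v_r (sketch).

    causal_risks[i] and validation_risks[i] are the raw fitness and
    validation risks of models[i]; each list is min-max normalized to
    [0, 1] across models before the two terms are combined.
    """
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    scores = [c + lam * v
              for c, v in zip(norm(causal_risks), norm(validation_risks))]
    # The best (lowest-score) model comes first.
    return [m for _, m in sorted(zip(scores, models), key=lambda p: p[0])]
```

`icms_rank(models, c, v)[0]` then gives the selected model.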

D CAUSAL DISCOVERY ALGORITHM DETAILS

In this section we discuss the causal discovery algorithms used. For real data, where we did not know all of the connections between variables, we discovered the remaining causal connections from the data using the Fast Greedy Equivalence Search (FGES) algorithm by (Ramsey et al., 2017a) on the entire dataset, using the Tetrad software package (Glymour et al., 2019a). FGES assumes that all variables are observed and that there is a linear Gaussian relationship between each node and its parents. Tetrad allows prior knowledge to be specified in terms of required edges that must exist, forbidden edges that will never exist, and temporal restrictions (variables that must precede other variables). Using our prior knowledge, we used the FGES algorithm in Tetrad to discover the causal DAGs for each of the public datasets. Only the directed edges output in the CPDAG by FGES were considered as known edges in the causal graphs. The Tetrad software package automatically handles continuous, discrete, and mixed connections, i.e., edges between discrete and continuous variables. If not using Tetrad for mixed variables, the method from (Ross, 2014) can be used.

E.2 CFR AND TAR

For the implementation of CFR and TAR (Johansson et al., 2018), we used the publicly available code, with hyperparameters set as described in Table 3. Note that for CFR we used Wasserstein regularization, while for TAR the imbalance penalty parameter is set to 0.

Table 3: Hyperparameters used for CFR and TAR.

E.3 SITE

For the implementation of SITE (Yao et al., 2018), we used the publicly available code, with hyperparameters set as described in Table 4.

E.4 CMGP AND NSGP

CMGP (Alaa & van der Schaar, 2017) and NSGP (Alaa & van der Schaar, 2018) are ITE methods based on Gaussian process models, for which we used the publicly available implementation. Note that for these ITE methods, the hyperparameters associated with the Gaussian process are internally optimized.

F LAMBDA

We base our choice of λ to be proportional to our belief in the causal DAG that we use for UDA selection. We may be given prior knowledge in the form of a causal graph G_π; G_π is optional and can be an empty graph. In either case, we can use causal discovery on our observational dataset to discover a DAG G_d. Determining the edges that are truthful (and therefore invariant) in practice comes down to using human/expert knowledge to select the DAG that is most consistent with existing beliefs about the natural world (Pearl, 2009). We refer to the selected truthful DAG as G, and we define λ as follows:

λ = |E(G)| / |E(G_π) ∪ E(G_d)|,

where E(G) represents the set of edges of G and |E(G)| is its cardinality, i.e., the number of edges in G. Intuitively, as the number of edges in our truthful DAG G decreases relative to our prior knowledge and what is discoverable from data, the less belief we have in our truthful causal DAG. In the event that all causal edges are known ahead of time and are discoverable from the data, λ = 1.

Lambda sensitivity. We analyze the sensitivity of our method to the parameter λ in Eq. 9. We used the same experimental set-up as for the synthetic experiments. Figure 6 shows the sensitivity of our method to λ for GANITE using DEV and IF for calculating the validation risk v_r.
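The formula for λ can be sketched directly, with graphs represented as sets of (parent, child) edge tuples (an illustrative helper, not the paper's code):

```python
def compute_lambda(truthful_edges, prior_edges, discovered_edges):
    """lambda = |E(G)| / |E(G_pi) ∪ E(G_d)|; edges are (parent, child) tuples."""
    union = set(prior_edges) | set(discovered_edges)
    # Degenerate case: no prior knowledge and nothing discovered.
    return len(set(truthful_edges)) / len(union) if union else 1.0
```

For example, with a prior edge (X, T), discovered edges {(X, T), (T, Y)}, and only (T, Y) deemed truthful, λ = 1/2.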

G SYNTHETIC DATA GENERATION

Here we describe our synthetic data generating process (DGP). Algorithm 3 generates observational data according to a given invariant DAG G. Algorithm 4 generates interventional (treatment) data according to a given invariant DAG G, where the treatment node is binarized and forced to take the value 0 for half of the samples and 1 for the remainder.

Algorithm 3 Generate Observational Data

Input: a graphical structure G, a mean µ, a standard deviation σ, edge weights w, and a dataset size n.
Output: an observational dataset with n samples, generated according to G.
Function gen_obs_data(G, µ, σ, w, n):
  e ← edges of G
  G_sorted ← topological_sort(G)
  ret ← empty list
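A sketch of this generator in Python, assuming the DAG is given as a dict mapping each node to its parents in topological order (illustrative names; numpy only):

```python
import numpy as np

def gen_obs_data(graph, weights, treatment, n, mu=0.0, sigma=1.0, rng=None):
    """Sketch of Algorithm 3: sample n observational rows from a DAG.

    graph: dict mapping each node to a list of its parents, listed in
    topological order; weights: dict mapping (parent, child) pairs to
    edge weights. The treatment node is passed through a sigmoid and
    binarized, as in the paper's generator.
    """
    rng = rng or np.random.default_rng()
    data = {}
    for node, parents in graph.items():            # topological order
        value = rng.normal(mu, sigma, size=n)      # exogenous Gaussian noise
        for par in parents:                        # add weighted parent values
            value = value + weights[(par, node)] * data[par]
        if node == treatment:                      # sigmoid + binarize
            value = (1.0 / (1.0 + np.exp(-value)) > 0.5).astype(float)
        data[node] = value
    return data
```

For example, `gen_obs_data({"X": [], "T": ["X"], "Y": ["X", "T"]}, {("X", "T"): 1.0, ("X", "Y"): 2.0, ("T", "Y"): 1.0}, "T", 1000)` yields a three-column observational dataset with selection bias into T through X.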

G.1 ADDITIONAL METRICS FOR SYNTHETIC EXPERIMENTS

We use an inversion count over the entire ranked list of models, which provides a measure of list "sortedness". If we normalize this by the maximum number of inversions, n(n - 1)/2, where n is the number of models in the list, then a completely sorted list in ascending order has a value of 0. Similarly, a monotonically descending list has a value of 1. We provide additional synthetic results in terms of inversion count in Table 5.

Table 5: Inversion count using ICMS on top of existing UDA methods. ICMS(·) means that · was used as the validation risk v_r in the ICMS; for example, ICMS(DEV(·)) represents DEV(·) selection used as the validation risk v_r in the ICMS. The · indicates the method used to approximate the validation error on the source dataset. Our method (in bold) improves over each selection method over all models and source risk scores (Src.).

Utilization of subgraphs. In practice, we will likely not know the true underlying causal graph completely. Due to experimental, economic, or ethical limitations, we often cannot determine the orientation of all edges. Additionally, the process of causal discovery is not perfect and will likely result in unoriented, missing, or spurious edges caused by noise and biases in the observational dataset used. In Figure 7, we plot the performance of our ICMS method when selecting GANITE models as we increase the percentage of known edges into the outcome node in the causal subgraph used. We indeed prefer subgraphs that contain information about the parents of the outcome node. We conclude that it is perfectly admissible to use our methodology with a subgraph as input, with the understanding that performance degrades as edges are missing. However, the performance is still better than without using our ICMS score.

Analysis of causal graph correctness. We investigate our method's sensitivity to incorrect causal knowledge.
Here, we maliciously reverse edges or add spurious edges to our causal DAG when calculating ICMS. We use the same synthetic experimental setup, except that we mutilate our oracle DAGs to form incorrect DAGs. We set λ to 1, since we assume the graph is the truth (even though it is incorrect). We use GANITE with DEV and IF as our validation risk metric and show our results in Fig. 8, which plots the ∆PEHE-10 error, i.e., the difference in PEHE-10 error between the erroneous DAG G′_T and the oracle DAG G_T, against the percentage graph difference (between G′_T and G_T). The graphical difference is calculated in terms of the percentage of edges that are mutated or removed. Fig. 8 shows the correlation between the correctness of the causal graph and the relative model selection improvement. This correlation testifies to the validity of ICMS; a counterexample to our method would be incorrect DAGs leading ICMS to select better models (which is not the case).

Noisiness of the fitness score. There is noisiness in the fitness score that we use. The likelihood requires estimating the conditional entropy of each variable given its parents. This step is not perfect, and many permutations of graphical structures can have scores that are very close. We hypothesize that improving the fitness score will likely improve the efficacy of our approach in general.

Application: towards personalized model selection. In some instances, various target domains may be represented by different underlying causal graphs (Shpitser & Sherman, 2018). Consider the following clinical scenario. Suppose that we have two target genetic populations A and B that each have their own unique causal graph. We have a large observational dataset with no genetic information about each patient.
At inference time, assuming that we know which genetic group a patient belongs to (and the corresponding causal graph), we hypothesize that we can select the models that will administer the more appropriate treatment for each genetic population using our proposed ICMS score.

Tree-based methods. Here we provide a brief experiment showing that ICMS also improves over the non-deep-learning approaches of Bayesian additive regression trees (BART) (Chipman et al., 2010) and Causal Forest (Wager & Athey, 2018). Replicating our synthetic experiments, we evaluated BART and Causal Forest using ICMS with DEV, IWCV, and IF as the validation risk. In Table 6, we see that even for tree-based methods our ICMS metric is still able to select the models that generalize best to the test domain.

Model selection on causally invariant features. Here we provide a brief experiment showing that ICMS can be used as a selection method on top of the causal feature selection algorithms of Rojas-Carulla et al. (2018) and Magliacane et al. (2018). It is important to note that model selection is still important for models trained on an invariant set of causal features: these models can still converge to different local minima and have disparate performance on the target domain. Replicating our synthetic experiments, we used Rojas-Carulla et al. (2018) and Magliacane et al. (2018) to select causally invariant features, which we use for training and testing our model. We then selected models using ICMS and compared against our standard benchmarks using GANITE. In Table 7, we see that even for these feature selection methods our ICMS metric is still able to select the models that generalize best to the test domain (in comparison to DEV, IWCV, and IF).
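For reference, the normalized inversion count reported in Table 5 can be computed as follows (an O(n²) sketch; the function name is ours):

```python
def normalized_inversions(ranked_test_errors):
    """Fraction of pairwise inversions in a ranked list of test errors.

    0 means the selection order matches the ascending test-error order
    exactly; 1 means the list is completely reversed.
    """
    xs = list(ranked_test_errors)
    n = len(xs)
    if n < 2:
        return 0.0
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if xs[i] > xs[j]
    )
    return inversions / (n * (n - 1) / 2)
```

A merge-sort variant brings this to O(n log n), but the quadratic form suffices for the list sizes used here (tens of models).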

I ADDITIONAL RESULTS

In this section, we highlight additional experiments performed on real datasets with semi-synthetic outcomes. Since real-world data rarely contains information about the ground-truth causal effects, existing literature uses semi-synthetic datasets, where either the treatment or the outcome is simulated (Shalit et al., 2017). Thus, we evaluate our model selection method on a prostate cancer dataset and the IHDP dataset, where the outcomes are simulated, and on the TWINS dataset (Almond et al., 2005), where the treatments are simulated. Furthermore, we provide UDA selection results on the prostate cancer dataset for factual outcomes as well.

IHDP dataset. The dataset was created by (Hill, 2011) from the Infant Health and Development Program (IHDP) and contains information about the effects of specialist home visits on future cognitive scores. The dataset contains 747 samples (139 treated and 608 control) and 25 covariates about the children and their mothers. We use a set-up similar to the one in (Dorie et al., 2019) to simulate the outcome, while at the same time building the causal graph G. Since we do not have access to any real outcomes for this dataset, we build the DAG in Figure 9, such that a subset of the features affects the simulated outcome. Let x represent the patient covariates and let v be the covariates affecting the outcome in the DAG represented in Figure 9. We build the outcome for the treated patients, f(x, 1), and for the untreated patients, f(x, 0), as follows: f(x, 0) = exp(β(v + 1/2)) + ε and f(x, 1) = βv + η, where β consists of random regression coefficients uniformly sampled from [0.1, 0.2, 0.3, 0.4] and ε ∼ N(0, 1), η ∼ N(0, 1) are noise terms.

TWINS dataset. The TWINS dataset contains information about twin births in the US between 1989 and 1991 (Almond et al., 2005). The treatment t = 1 is defined as being the heavier twin, and the outcome corresponds to the 1-year mortality.
Since the dataset contains information about both twins, we can consider their outcomes as being the potential outcomes for the treatment of being heavier at birth. The dataset consists of 11,400 pairs of twins, and for each pair we have information about 30 variables related to their parents, pregnancy, and birth. We use the same set-up as in (Yoon et al., 2018) to create an observational study by selectively observing one of the twins based on their features (thereby inducing selection bias) as follows: t | x ∼ Bernoulli(sigmoid(w^T x + n)), where w ∼ U((−0.1, 0.1)^{30×1}) and n ∼ N(0, 0.1). Since we have access to the twins' outcomes, we perform causal discovery to find causal relationships between the context features and the outcome. However, because we do not have prior knowledge of the relationships between all 30 variables, we restrict the causal graph used to compute the causal risk to only contain a subset of variables, as illustrated in Figure 10. Table 8 illustrates the results for the TWINS dataset. Note that in this case we use real outcomes, and we also show the applicability of our method when only a subgraph of the true causal graph is known.

Prostate cancer datasets. In this case, we are interested in deploying a machine learning model for prostate cancer but have access to labeled data only in the UK Biobank dataset, which has approximately 10,000 patients. We would like to deploy our models in the United States, where we have access to many samples of patient features but no labeled outcomes. For this target domain, we use the SEER dataset, which has over 100,000 samples. Our objective is to predict the patient mortality given the patient features and the treatment provided. To be able to evaluate the methods on predicting counterfactual outcomes on the target domain (and thus compute the PEHE), we create a semi-synthetic dataset where the outcomes are simulated according to the discovered causal graph.
Thus, we build the semi-synthetic outcomes for the prostate cancer dataset similarly to the IHDP dataset. Let x represent the patient covariates and let v be the covariates affecting the outcome. We build the outcome for the treated patients, f(x, 1), and for the untreated patients, f(x, 0), as follows: f(x, 0) = exp(β(v + 1/2)) + ε and f(x, 1) = βv + η, where β consists of random regression coefficients uniformly sampled from [0.1, 0.2, 0.3, 0.4] and ε ∼ N(0, 0.1), η ∼ N(0, 0.1) are noise terms. For the prostate cancer datasets, we also perform an experiment where we do not use semi-synthetic data (to generate the counterfactual outcomes) but only the factual outcomes of the SEER dataset to evaluate our method. We train 30 models with identical hyperparameters as in our synthetic and semi-synthetic experiments. We repeat this for all of our benchmark ITE methods. Table 9 shows that ICMS improves in terms of test error over all methods and ITE models.

Computational settings. All experiments were performed on an Ubuntu 18.04 system with 12 CPUs and 64 GB of RAM.

J COVID-19 EXPERIMENTAL DETAILS

J.1 DATASET

We obtained de-identified COVID-19 Hospitalization in England Surveillance System (CHESS) data from Public Health England (PHE) for the period from 8th February (data collection start) to 14th April 2020, which contains 7,714 hospital admissions, including 3,092 ICU admissions from 94 NHS trusts across England. The dataset features comprehensive information on patients' general health condition, COVID-19-specific risk factors (e.g., comorbidities), and basic demographic information (age, sex, etc.), and tracks the entire patient treatment journey: hospitalization time, ICU admission, the treatment received (e.g., ventilation), and the outcome by April 20th, 2020 (609 deaths and 384 discharges).
We split the dataset into a source dataset of 2,552 patients from urban areas (mostly the Greater London area) and a target dataset of the remaining 5,162 rural patients.

J.2 ABOUT THE CHESS DATA SET

The COVID-19 Hospitalizations in England Surveillance System (CHESS) is a surveillance scheme for monitoring hospitalized COVID-19 patients. The scheme was created in response to the rapidly evolving COVID-19 outbreak and was developed by Public Health England (PHE). It was designed to monitor and estimate the impact of COVID-19 on the population in a timely fashion, to identify those who are most at risk, and to evaluate the effectiveness of countermeasures. The CHESS data therefore captures information to fulfill the following objectives:

1. To monitor and estimate the impact of COVID-19 infection on the population, including estimating the proportion and rates of COVID-19 cases requiring hospitalisation and/or ICU/HDU admission
2. To describe the epidemiology of COVID-19 infection associated with hospital/ICU admission in terms of age, sex and underlying risk factors, and outcomes
3. To monitor pressures on acute health services
4. To inform transmission dynamic models to forecast healthcare burden and severity estimates

J.3 COVID-19 PATIENT STATISTICS ACROSS GEOGRAPHICAL LOCATIONS

Figure 12 shows the histogram of the age distribution for urban and rural patients. It is clear from the plot that the rural population is older, and therefore at higher risk from COVID-19. Table 10 presents statistics about the prevalence of preexisting medical conditions, the treatments received, and the final outcomes for patients in urban and rural areas. We can see that the rural patients tend to have more preexisting conditions, such as chronic heart disease and hypertension. The higher prevalence of comorbid conditions complicates treatment for this population.

J.4 DATA SIMULATION AND ADDITIONAL RESULTS USING ICMS

In the CHESS dataset, we only observe the factual patient outcomes. However, to be able to evaluate the selected ITE models on how well they estimate the treatment effects, we need access to both the factual and counterfactual outcomes. Thus, we have built a semi-synthetic version of the dataset, with potential outcomes simulated according to the causal graph discovered for the COVID-19 patients in Figure 3.



Footnote URLs (code and data):
- GANITE: https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/70a6f6130f90b7b2693505bb2f9ff78444541983/alg/ganite/
- CFRNet/TARNet: https://github.com/clinicalml/cfrnet
- SITE: https://github.com/Osier-Yi/SITE
- CMGP/NSGP: https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/70a6f6130f90b7b2693505bb2f9ff78444541983/alg/causal_multitask_gaussian_processes_ite/
- IHDP data: part of the Supplementary Files at https://www.tandfonline.com/doi/suppl/10.1198/jcgs.2010.08162?scroll=top
- TWINS data: https://data.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html



Figure 1: Method overview. We propose selecting ITE models whose predictions of the treatment effects on the target domain satisfy the causal relationships in the interventional causal graph G_T.

Figure 2: ICMS is unique in that it calculates a causal risk (green) using predictions on target data. Purple arrows denote pathways unique to ICMS.

Interventional causal model selection. Consider the schematic in Figure 2. We propose an interventional causal model selection (ICMS) score that takes into account the model's risk on the source domain, but also the fitness to the interventional causal graph G_T on the target domain according to Eq. 4. A score that satisfies this is provided by the Lagrangian method:

Figure 3: Left: COVID-19 pandemic hit urban areas before spreading to rural areas. Middle: Feature subset showing there exists a significant covariate shift between urban and rural populations with the urban population younger and with fewer preexisting conditions. Right: Discovered COVID-19 DAG.

Figure 4: Performance of model selection methods in terms of the additional number of patients with improved outcomes compared to selecting models based on the factual error on the source domain.

Figure 5: Schematic demonstrating calculation of N CI.

Figure 6: λ sensitivity analysis.

Figure 8: Performance of ICMS on incorrect graphs using IWCV(DEV(IF)). The ∆PEHE-10 error is the difference in PEHE-10 error between the erroneous graph G′_T and the oracle graph G_T using ICMS, plotted against the percentage of graphical distance (in terms of total edges). G_T is the oracle causal graph and is held static across the x-axis.

Figure 11: Interventional DAG for Prostate dataset.

Figure 12: Age distribution for urban and rural patients. The median age of rural patients is five years older than the urban ones.

This equivalence will enable us to reason about causal graphs and interventions on causal graphs in the context of selecting ITE methods for estimating potential outcomes. Evaluating ITE models. Methods for estimating ITE learn predictors

Algorithm 1 Calculate ICMS. Input: ITE model f; source validation dataset D_v; unlabeled target test set D_tgt = {x_i^tgt}.

v_r ← validation risk of f on D_v. return c_r + λ v_r (from Eq. 9).

Algorithm 2 inputs: source dataset D_src (N_src samples) split into a training set D_tr and a validation set D_v; set of ITE models F trained on D_tr; unlabeled test set D_tgt; optional prior knowledge in the form of a DAG G_π; scale factor λ. Output: a list F̄ of the models in F ranked by ICMS score.

Hyperparameters used for GANITE. s represents the number of input features.

Table 4: Hyperparameters used for SITE.

Body of gen_obs_data (Algorithm 3):
  ret ← empty list
  for node ∈ G do
    ret[node] ← a list of n samples drawn from a Gaussian with mean µ and standard deviation σ
  end for
  for node ∈ G_sorted do
    for par ∈ parents(node) do
      ret[node] += ret[par] · w(par, node), where w(par, node) is the edge weight from par to node
    end for
  end for
  Apply the sigmoid function to the treatment node and binarize.
  return ret

Additional PEHE-10 (with standard error) results for BART and Causal Forest using DEV and IF as validation risk.



Results on the IHDP, prostate cancer, and TWINS datasets. IF validation is used to compute the source risk. We report the PEHE-10 test error (with standard error) of various selection methods on ITE models. Our method (in bold) improves in terms of PEHE-10 over all methods and ITE models.

Results on predicting the outcome of prostate cancer given a treatment, from models trained on the Prostate Cancer UK (PCUK) dataset and tested on the SEER dataset (United States). IF validation is used to compute the source risk. Here we show the factual error (of the top 10% of selected models) in terms of MSE of various selection methods on ITE models. Our method (in bold) improves in terms of test error over all methods and ITE models. The standard error is shown in parentheses.

Algorithm 4 Generate Treatment Data with Perturbation

Input: A graphical structure G, a mean µ, a standard deviation σ, edge weights w, a dataset size n, a list of perturbation nodes p, a perturbation mean µ_p, and a perturbation standard deviation σ_p.
Output: A treatment dataset according to G with n samples and perturbation applied at nodes p.
Function: gen_treat_data(G, µ, σ, w, n, µ_p, σ_p)

H PRACTICAL CONSIDERATIONS

Here we provide a discussion of some practical considerations.

Computational complexity. The computational complexity of ICMS, as shown in Algorithms 1 and 2, scales linearly with the number of models in F. Specifically, the computational complexity is O(N_f × Q(G, D)), where N_f is the number of candidate models in F and Q(G, D) is the computational complexity of calculating the fitness score of dataset D to G. In our case, we use the log-likelihood score, which requires calculating the conditional entropy between each parent node and its child. In the worst case, this has a computational complexity of O(V_G^2), where V_G is the number of vertices (or variables) in G, since a DAG with V_G vertices has an asymptotically quadratic number of edges.

We simulate potential outcomes for the dataset according to the causal graph discovered for the COVID-19 patients in Figure 3. Let x represent the patient covariates and let v be the covariates affecting the outcome in the DAG represented in Figure 3. Let f(x, 1) be the outcome for the patients that have received the ventilator (treatment) and let f(x, 0) be the outcome for the patients that have not received the ventilator. The outcomes are simulated as follows: f(x, 0) = βv + η and f(x, 1) = exp(βv) − 1 + ε, where β consists of random regression coefficients uniformly sampled from {0.1, 0.2, 0.3, 0.4}, and ε ∼ N(0, 0.1), η ∼ N(0, 0.1) are noise terms. We consider that the patient survives if f(x, t) > 0, where t ∈ {0, 1} indicates the treatment received.

Our training observational dataset consists of the patient features x, the ventilator assignment (treatment) t for the COVID-19 patients in the urban area, and the synthetic outcome generated using f(x, t). For evaluation, we use the set-up described in Section 5.2 for assigning ventilators to patients in the rural area based on their estimated treatment effects.
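The outcome simulation described above can be sketched as follows. The function name `simulate_outcomes`, the seeding, and the interpretation of N(0, 0.1) as a standard deviation of 0.1 are our own assumptions, not taken from the paper.

```python
import numpy as np

def simulate_outcomes(v, seed=0):
    """Simulate potential outcomes for the semi-synthetic COVID-19 experiment.

    v: (n, d) array of the covariates affecting the outcome in the DAG.
    Returns binary survival indicators under no ventilator (t=0) and
    ventilator (t=1), following f(x,0) = beta.v + eta and
    f(x,1) = exp(beta.v) - 1 + eps, with eps, eta ~ N(0, 0.1).
    """
    rng = np.random.default_rng(seed)
    n, d = v.shape
    # Regression coefficients uniformly sampled from {0.1, 0.2, 0.3, 0.4}.
    beta = rng.choice([0.1, 0.2, 0.3, 0.4], size=d)
    eta = rng.normal(0.0, 0.1, size=n)
    eps = rng.normal(0.0, 0.1, size=n)
    f0 = v @ beta + eta
    f1 = np.exp(v @ beta) - 1.0 + eps
    # The patient survives if f(x, t) > 0.
    return (f0 > 0).astype(int), (f1 > 0).astype(int)
```

For strongly positive covariates both potential outcomes indicate survival, while for strongly negative covariates neither does, since the noise terms are small relative to βv.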
In Figure 13 , we indicate the additional number of patients with improved outcomes by using ICMS on top of existing UDA methods when selecting ITE models with different settings of the hyperparameters.

