ESTIMATING TREATMENT EFFECTS VIA ORTHOGONAL REGULARIZATION

Abstract

Decision-making often requires accurate estimation of causal effects from observational data. This is challenging as outcomes of alternative decisions are not observed and have to be estimated. Previous methods estimate outcomes based on unconfoundedness but neglect any constraints that unconfoundedness imposes on the outcomes. In this paper, we propose a novel regularization framework in which we formalize unconfoundedness as an orthogonality constraint. We provide theoretical guarantees that this yields an asymptotically normal estimator for the average causal effect. Compared to other estimators, its asymptotic variance is strictly smaller. Based on our regularization framework, we develop deep orthogonal networks for unconfounded treatments (DONUT) which learn outcomes that are orthogonal to the treatment assignment. Using a variety of benchmark datasets for causal inference, we demonstrate that DONUT outperforms the state-of-the-art substantially.

1. INTRODUCTION

Estimating the causal effect of an intervention (i.e., the treatment effect) is integral to individual decision-making in many domains such as marketing (Brodersen et al., 2015; Hatt & Feuerriegel, 2020), economics (Heckman et al., 1997), and epidemiology (Robins et al., 2000). For instance, in order to control an epidemic, it is relevant for public decision-makers to estimate the causal effect of school closures (intervention) on the infection rate (outcome). The causal effect of an intervention can be estimated in two ways: randomized controlled trials (RCTs) and observational studies. RCTs are widely recognized as the gold standard for estimating causal effects, yet conducting RCTs is often infeasible (Robins et al., 2000). For instance, randomly allocating different policy interventions during an epidemic might be unethical and impractical. Unlike RCTs, observational studies use observed data to infer causal effects. For this, covariates must be collected that contain all confounders (i.e., variables that affect both treatment and outcome). This is becoming increasingly common due to the ease of access to rich data. In this paper, we estimate the average causal effect of a treatment from observational data. In order to estimate the causal effect of a treatment, the outcome of an alternative treatment has to be estimated. However, this is challenging, since we do not know what the outcome would have been if another treatment had been applied. Existing methods for estimating treatment effects use the treatment assignment as a feature and train regression models to estimate the outcomes (Funk et al., 2011; Kallus, 2017b). Methods based on nearest neighbors and matching find similar subjects (Ho et al., 2007; Crump et al., 2008; Kallus, 2017a; 2020). Tree- and forest-based methods (Wager & Athey, 2018) estimate the treatment effect at the leaf nodes and train many weak learners to build expressive ensemble models.
Gaussian process-based methods provide uncertainty quantification (Alaa & van der Schaar, 2017; Ray & Szabo, 2019). Weighting-based approaches re-weight the outcomes using weights based on covariate and treatment data (Kallus, 2018). For instance, Fong et al. (2018); Yiu & Su (2018) seek weights such that the treatment assignment is unassociated with the covariates. However, they do not require the treatment assignment to be unassociated with the potential outcomes. Doubly robust methods combine a model for the outcomes and a model for the treatment propensity in a manner that is robust to misspecification (Funk et al., 2011; Benkeser et al., 2017; Chernozhukov et al., 2018). Recently, deep learning has been successful for this task due to its strong predictive performance and ability to learn representations of the data (e.g., Johansson et al., 2016; Louizos et al., 2017; Shalit et al., 2017; Yao et al., 2018; Yoon et al., 2018; Shi et al., 2019). To ensure identifiability of the causal effect, state-of-the-art methods for estimating treatment effects are based on unconfoundedness (i.e., all confounders are measured and thus included in the covariates). Hence, unconfoundedness is assumed for identifiability, yet the implications that unconfoundedness has for the unobserved outcomes are neglected during estimation of the model parameters.

Contribution.foot_0 In this paper, (i) we introduce a regularization framework that exploits unconfoundedness. To this end, we formalize unconfoundedness as an orthogonality constraint. This constraint is used during estimation of the model parameters to ensure that the outcomes are orthogonal to the treatment assignment. We prove sufficient conditions under which this yields an asymptotically normal estimator for the average causal effect. Compared to other estimators, its asymptotic variance is strictly smaller.
(ii) Based on our regularization framework, we develop deep orthogonal networks for unconfounded treatments (DONUT) for estimating average causal effects. DONUT leverages the predictive capabilities of neural networks to learn outcomes that are orthogonal to the treatment assignment. Using a variety of benchmark datasets for causal inference, we demonstrate that DONUT outperforms the state-of-the-art substantially.

2. PROBLEM SETUP

Our objective is to estimate the average treatment effect (ATE) of a binary treatment from observational data. For this, we build upon the Neyman-Rubin potential outcomes framework (Rubin, 2005). Consider a population where every subject i is described by the d-dimensional covariates X_i ∈ R^d. Each subject is assigned a treatment T_i ∈ {0, 1}. The random variable Y_i(1) corresponds to the outcome under treatment, i.e., T_i = 1, whereas Y_i(0) corresponds to the outcome under no treatment, i.e., T_i = 0. These two random variables, Y_i(1), Y_i(0) ∈ R, are known as the potential outcomes. Due to the fundamental problem of causal inference, only one of the potential outcomes is observed, never both. The observed outcome is denoted by Y_i. Our aim is to estimate the average treatment effect

ψ = E[Y(1) - Y(0)].  (1)

The following standard assumptions are sufficient for identifiability of the causal effect (Imbens & Rubin, 2015): consistency (i.e., ∀t ∈ {0, 1}: Y = Y(t) if T = t); positivity (i.e., ∀x ∈ R^d: 0 < P(T = 1 | X = x) < 1); and unconfoundedness. Unconfoundedness assumes that all confounders are measured and, hence, conditioning on them blocks all backdoor paths. This is equivalent to assuming that the potential outcomes Y(1) and Y(0) are independent of the assigned treatment T given the covariates X, i.e., Y(1), Y(0) ⊥⊥ T | X. Based on this, the ATE is equal to

ψ = E[E[Y | X, T = 1] - E[Y | X, T = 0]].  (2)

Our task is to estimate the function f(x, t) = E[Y | X = x, T = t] for all x ∈ R^d and t ∈ {0, 1} based on observational data D = {(X_i, T_i, Y_i)}_{i=1}^n.
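As an illustration of (2), consider a small simulation, a sketch with toy data and a hypothetical linear outcome model (nothing here is from the paper): the naive difference of group means is biased by confounding, whereas plugging the fitted f(x, t) into (2) recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic observational data with a known ATE of 2.0 (illustrative only).
X = rng.normal(size=n)                           # one confounder
p = 1 / (1 + np.exp(-X))                         # true propensity score pi(x)
T = rng.binomial(1, p)                           # confounded treatment assignment
Y = X + 2.0 * T + rng.normal(scale=0.1, size=n)  # outcome with ATE psi = 2.0

# Naive difference of group means is biased because X affects both T and Y.
naive = Y[T == 1].mean() - Y[T == 0].mean()

# Plug-in of f(x, t) = E[Y | X=x, T=t]; here f is linear, so fit Y on (1, X, T).
A = np.column_stack([np.ones(n), X, T])
beta = np.linalg.lstsq(A, Y, rcond=None)[0]
f = lambda x, t: beta[0] + beta[1] * x + beta[2] * t
psi_hat = np.mean(f(X, 1) - f(X, 0))             # ATE via Eq. (2)

print(round(naive, 2), round(psi_hat, 2))
```

The naive estimate overshoots the true effect, while the plug-in estimate concentrates near 2.0.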

3. ORTHOGONAL REGULARIZATION FOR ESTIMATING TREATMENT EFFECTS

The key idea of our regularization framework is to exploit the implications for the outcomes that result from unconfoundedness. For this, we formalize unconfoundedness as an orthogonality constraint, which ensures that the outcomes are orthogonal to the treatment assignment. We later introduce a specific variant of our regularization framework based on neural networks, which yields DONUT.

3.1. UNCONFOUNDEDNESS AS ORTHOGONALITY CONSTRAINT

Under unconfoundedness, the outcomes are independent of the assigned treatment given the covariates, i.e., Y(1), Y(0) ⊥⊥ T | X. Using the inner product ⟨V, W⟩ = (1/n) Σ_{i=1}^n V_i W_i, we formalize the following orthogonality constraint:

⟨Y(t) - f(X, t), T - π(X)⟩ = (1/n) Σ_{i=1}^n (Y_i(t) - f(X_i, t))(T_i - π(X_i)) = 0, for t ∈ {0, 1},  (5)

where f(x, t) = E[Y | X = x, T = t] and π(x) = E[T | X = x]. The function π is the propensity score.foot_1 This is a necessary condition for unconfoundedness: independence implies that the covariance between Y(t) and T given X is zero, and the inner product in (5) is the empirical covariance between Y(t) and T given X. Hence, unconfoundedness requires the (centered) outcomes to be orthogonal to the (centered) treatment assignment with respect to the above inner product. An appropriate method for estimating outcomes based on unconfoundedness should therefore ensure that the orthogonality constraint holds. Although existing methods are based on unconfoundedness, the orthogonality of the outcomes to the treatment assignment is ignored during estimation of the model parameters. As a remedy, we propose a regularization framework that accommodates the orthogonality constraint in the estimation procedure.
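To make the constraint concrete, the following sketch (toy binary data, not from the paper) evaluates the empirical inner product with the true f and π: it is near zero when unconfoundedness holds, and bounded away from zero when a hidden variable U drives both treatment and outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy binary setting (illustrative only) in which pi(x) and f(x, 1) are known
# exactly: X is an observed covariate, U is an unobserved one.
X = rng.binomial(1, 0.5, size=n).astype(float)
U = rng.binomial(1, 0.5, size=n).astype(float)
Y1 = X + U + 2.0              # potential outcome Y(1)
f1 = X + 2.5                  # f(x, 1) = E[Y(1) | X = x], since E[U] = 0.5

def inner(v, w):
    """Inner product <V, W> = (1/n) sum_i V_i W_i from Section 3.1."""
    return float(np.mean(v * w))

# Unconfounded assignment: T depends only on X, so pi(x) = 0.2 + 0.3 x.
T_ok = rng.binomial(1, 0.2 + 0.3 * X)
ortho = inner(Y1 - f1, T_ok - (0.2 + 0.3 * X))    # approximately zero

# Confounded assignment: T also depends on the hidden U, with
# P(T=1 | X, U) = 0.05 + 0.3 X + 0.3 U, so pi(x) = 0.2 + 0.3 x as before.
T_bad = rng.binomial(1, 0.05 + 0.3 * X + 0.3 * U)
broken = inner(Y1 - f1, T_bad - (0.2 + 0.3 * X))  # about 0.3 * Var(U) = 0.075

print(round(ortho, 3), round(broken, 3))
```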

3.2. PROPOSED REGULARIZATION FRAMEWORK

In our regularization framework, the orthogonality constraint in (5) is included in the estimation procedure as follows. Let H_f ⊆ [R^d × {0, 1} → R] and H_π ⊆ [R^d → [0, 1]] be function classes for f and π, and let ε ∈ R be a model parameter. Then, the objective is to find the solution to the optimization problem

inf_{(f, π, ε) ∈ H_f × H_π × R}  L_FL(f, π; D) + λ Ω_OR(f, π, ε; D),  (6)

where L_FL is the so-called factual loss (see Section 3.2.1) between the estimated models and the observed (factual) data. The term Ω_OR is our new orthogonal regularization, which enforces the orthogonality constraint using the model parameter ε (see Section 3.2.2). The variable λ ∈ R_+ is a hyperparameter controlling the strength of the regularization. We describe both components of (6), i.e., factual loss and orthogonal regularization, in the following.

3.2.1. FACTUAL LOSS L FL

Similar to previous work (e.g., Shi et al., 2019), the outcome model f and the propensity score model π are estimated using the observed data and the factual loss given by

L_FL(f, π; D) = (1/n) Σ_{i=1}^n [(f(X_i, T_i) - Y_i)² + α CrossEntropy(π(X_i), T_i)],  (7)

where α ∈ R_+ is a hyperparameter weighting the terms of the factual loss. We note the following about the estimation of the outcome model f in (7). In the first term of the factual loss, f is fitted to the observed outcomes and, thus, no information about the unobserved outcomes is used. As a consequence, the model learns about the observed outcomes, but not about the unobserved outcomes. State-of-the-art methods rely exclusively on the factual loss to estimate the outcomes (e.g., Shalit et al., 2017). This is based on the assumptions that similar subjects have similar outcomes (Yao et al., 2018) and that, for every subject in the data, there is a similar subject with the opposite treatment. However, this is unlikely in the presence of selection bias, i.e., when treatment and control groups differ systematically. This pitfall is addressed by our orthogonal regularization in the following.
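A minimal NumPy implementation of (7), evaluated on hypothetical predictions (the paper's actual models are neural networks):

```python
import numpy as np

def cross_entropy(p, t, eps=1e-7):
    """Binary cross-entropy between propensity p = pi(x) and treatment t."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def factual_loss(f_pred, y, pi_pred, t, alpha=1.0):
    """L_FL from Eq. (7): squared error on the observed (factual) outcome
    plus alpha-weighted cross-entropy for the propensity model."""
    return np.mean((f_pred - y) ** 2 + alpha * cross_entropy(pi_pred, t))

# Tiny worked example with hypothetical predictions.
y = np.array([1.0, 2.0]); t = np.array([1, 0])
f_pred = np.array([1.5, 2.0])           # f(x_i, t_i)
pi_pred = np.array([0.8, 0.3])          # pi(x_i)
print(round(factual_loss(f_pred, y, pi_pred, t), 4))  # → 0.4149
```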

3.2.2. ORTHOGONAL REGULARIZATION Ω OR

We now specify the orthogonal regularization term Ω_OR that ensures that the orthogonality constraint in (5) is satisfied. We first state Ω_OR and then explain each of its parts. The orthogonal regularization term is given by

Ω_OR(f, π, ε; D) = (1/n) Σ_{i=1}^n (Y*_i(0) - f̃(X_i, 0))²,  (8)

with Y*_i(0) = Y_i - ψ* T_i, ψ* = (1/n) Σ_{i=1}^n [f(X_i, 1) - f(X_i, 0)]  (9)

and f̃(X_i, t) = f(X_i, t) + ε (T_i - π(X_i)), for t ∈ {0, 1},  (10)

where the pseudo outcome Y*_i(0) and the perturbation function f̃ are required to learn outcomes that are orthogonal to the treatment assignment, as described in the following.

Pseudo outcome. The untreated outcome of a subject can be expressed as Y_i(0) = Y_i - ψ(X_i) T_i, where ψ(X) = f(X, 1) - f(X, 0) is the true treatment effect at X. Hence, if we had access to the treatment effect ψ(X), we would also have access to the untreated outcome Y_i(0) even if we did not observe it. We use the average treatment effect of the current model fit, i.e., ψ* in (9), as a proxy for ψ(X_i). This creates a pseudo outcome, Y*_i(0) = Y_i - ψ* T_i. Under sufficient conditions, this converges to the true outcome (see Section 4).

Perturbation function. In order to ensure that the orthogonality constraint ⟨Y(t) - f(X, t), T - π(X)⟩ = 0 is satisfied, we use a perturbation function f̃. Simply adding the inner product as a regularization term is not sufficient, since this does not guarantee that the orthogonality constraint holds at a solution of (6). A similar approach is used in targeted minimum loss estimation (e.g., van der Laan & Rose, 2011) and, more generally, in dual problems in optimization (Zalinescu, 2002). We extend the function f to a perturbation function f̃ using the model parameter ε as in (10). As a result, solving the optimization problem in (6) forces the outcome estimates f̃ to satisfy the orthogonality constraint. Mathematically, this can be seen by taking the partial derivative of (6) with respect to ε and setting it to zero (note that L_FL does not depend on ε), i.e.,

0 = ∂/∂ε [L_FL + λ Ω_OR(f, π, ε; D)] |_{ε=ε̂} = -2λ (1/n) Σ_{i=1}^n (Y*_i(0) - f̃(X_i, 0))(T_i - π(X_i)),

and, hence, ⟨Y*(0) - f̃(X, 0), T - π(X)⟩ = 0. It is analytically sufficient to regularize only the untreated outcome.foot_4 Whenever one of the outcomes is orthogonal to the treatment assignment, so is the other; a proof is provided in Appendix A. The outcomes are then estimated using f̃(x, t) and, therefore, the estimate for the ATE is ψ̂ = (1/n) Σ_{i=1}^n [f̃(X_i, 1) - f̃(X_i, 0)], which coincides with ψ*, since the perturbation terms in (10) cancel.
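Since L_FL does not depend on ε, minimizing over ε for fixed f and π is a one-dimensional least-squares problem with a closed-form solution. The sketch below (toy data and hypothetical model fits, not the paper's trained networks) verifies that at the minimizer the perturbed outcomes satisfy the orthogonality constraint exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Hypothetical fitted models on toy data; only the mechanics of Omega_OR matter.
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))                    # propensity model pi(x)
T = rng.binomial(1, pi)
Y = X + 2.0 * T + rng.normal(size=n)         # observed outcome
f0 = 0.9 * X - 0.1                           # imperfect model for f(x, 0)
f1 = 0.9 * X + 1.8                           # imperfect model for f(x, 1)

psi_star = np.mean(f1 - f0)                  # ATE of the current fit, Eq. (9)
Y0_star = Y - psi_star * T                   # pseudo outcome Y*_i(0)

# Omega_OR(eps) = (1/n) sum_i (Y*_i(0) - f~(X_i, 0))^2 with f~ = f0 + eps*(T - pi).
resid, w = Y0_star - f0, T - pi
eps_hat = np.mean(resid * w) / np.mean(w * w)  # closed-form minimizer over eps

# At the minimizer, the perturbed outcomes satisfy the orthogonality constraint:
ortho = np.mean((Y0_star - (f0 + eps_hat * w)) * w)
print(round(eps_hat, 3), abs(ortho) < 1e-8)
```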

3.3. DEEP ORTHOGONAL NETWORKS FOR UNCONFOUNDED TREATMENTS

Our regularization framework works with any outcome model f and propensity score model π that are estimated through a loss function. The regularization framework ensures that the outcomes are estimated such that they are orthogonal to the treatment assignment. We introduce a specific variant of our regularization framework based on feedforward neural networks, which yields deep orthogonal networks for unconfounded treatments (DONUT). Neural networks are a suitable model class due to their strong predictive performance. The architecture of DONUT is set as follows. For the outcome model f, we use the basic architecture of TARNet (Shalit et al., 2017). TARNet uses a deep feedforward neural network to produce a representation layer, followed by two 2-layer neural networks that predict each of the potential outcomes from the shared representation. For the propensity score model π, we use a logistic regression (Rosenbaum & Rubin, 1983). The choice of feedforward neural networks is further supported by the theoretical discussion in Section 4.
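The architecture described above can be sketched as a single forward pass. The weights below are random, untrained stand-ins; layer sizes follow the text, while the covariate dimension is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

def layer(d_in, d_out):
    """Dense layer with He-style initialization (random stand-in weights)."""
    return rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out)

d = 25                                         # number of covariates (hypothetical)
W_rep, b_rep = layer(d, 200)                   # shared representation layer (size 200)
heads = {t: (layer(200, 100), layer(100, 1)) for t in (0, 1)}  # 2-layer outcome heads
w_pi = rng.normal(scale=0.1, size=d)           # logistic-regression propensity model

def forward(x):
    """Forward pass: shared representation, one head per potential outcome, pi(x)."""
    phi = relu(x @ W_rep + b_rep)              # shared representation
    y_hat = []
    for t in (0, 1):
        (W1, b1), (W2, b2) = heads[t]
        h = relu(phi @ W1 + b1)
        y_hat.append((h @ W2 + b2).squeeze(-1))  # f(x, t)
    pi_hat = 1.0 / (1.0 + np.exp(-(x @ w_pi)))   # pi(x) via logistic regression
    return y_hat[0], y_hat[1], pi_hat

x = rng.normal(size=(4, d))
f0, f1, pi = forward(x)
print(f0.shape, f1.shape, pi.shape)
```

In DONUT, both outcome heads and the propensity model are trained jointly with the objective in (6).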

4. THEORETICAL GUARANTEES

We prove sufficient conditions under which our regularization framework yields an asymptotically normalfoot_5 estimator ψ̂ for the true ATE ψ. Asymptotic normality is particularly favorable for an estimator, as such estimators converge to the true ATE at a rate of 1/√n. We further show that, compared to other estimators, our estimator has strictly smaller asymptotic variance.

Theorem 1 (Asymptotic Normality). Suppose that

1. The estimator η̂ = (f̂, π̂) for the outcome and propensity score model converges to some η̄ = (f̄, π̄) in the sense that ||η̂ - η̄|| = o_p(1), where either f̄ = f or π̄ = π (or both) corresponds to the true function.

2. The treatment effect is homogeneous, i.e., ∀x ∈ R^d: ψ(x) = ψ.

3. The estimators f̂ and π̂ take values in P-Donsker classes, i.e., H_f, H_π ∈ CLT(P).foot_6

Then ψ̂ is asymptotically normal, i.e.,

√n (ψ̂ - ψ) →_d N(0, σ² / E[Var(T | X)]),

where σ² is the variance of the outcome. The proof is provided in Appendix C. We make the following remarks about the result and its conditions. Condition 1 requires that either the model for the outcomes f or the model for the propensity score π (or both) is correctly specified, but not necessarily both. This means that our estimator is doubly robust, since it is consistent under correct specification of either f or π. In particular, neural networks converge at a fast enough rate to invoke the double robustness condition (Farrell et al., 2018). Condition 2 can easily be relaxed to any specification of ψ with finitely many parameters, given that the appropriate identification criteria hold; this changes the orthogonal regularization in (8) accordingly (see Appendix D for details). However, the advantage of Condition 2 is that it explains why the asymptotic variance of ψ̂ is smaller compared to other estimators. In particular, the difference in asymptotic variance to some estimators (e.g., the inverse probability weighted estimator) can be sizable if the propensity score is close to zero or one. Often, this difference is not offset by the weaker restrictions imposed by heterogeneous treatment effects (e.g., Vansteelandt & Joffe, 2014). Condition 3 captures a large class of estimators for f and π from which we can choose. An overview of P-Donsker classes is given in Mikosch et al. (1997). In particular, the class of feedforward neural networks is a P-Donsker class, since Lipschitz parametric functions are P-Donsker, and any Lipschitz transformation of a P-Donsker function is again P-Donsker. For mathematical details, see Appendix E. Together with the convergence rate required for double robustness, this provides theoretical justification for the use of neural networks in our regularization framework, and therefore in DONUT.

Similarity to partially linear regression. We find an interesting similarity between our estimator ψ̂ and the estimator obtained by partially linear regression (PLR) (Chernozhukov et al., 2018). The analytical expression of ψ̂ yielded by our regularization framework in (8) is given in (31) in Appendix C. This is similar to the estimator obtained by PLR after partialling the effect of X out from T (see (1.5) in Chernozhukov et al. (2018)). However, PLR separately estimates the nuisance functions f and π and then plugs them into the estimator. Our approach does not use the analytical expression as a plug-in estimator; instead, the estimator arises directly from solving the optimization problem in (6) via (8). The difference becomes apparent in the experiments (Section 5), where we include PLR as a baseline and find that the performance of DONUT is superior.

Comparison to other estimators. We compare the asymptotic behavior of our estimator ψ̂ to other regular estimators (e.g., Funk et al. (2011); Nie & Wager (2017)).
We find that, under the conditions of Theorem 1, there does not exist a regular estimator that achieves strictly smaller asymptotic variance than our estimator. The reason is as follows. We prove in Appendix C that our regularization framework yields an asymptotically normal estimator for the ATE. In particular, we prove that our estimator is efficient (see Appendix C for the efficient influence function), and therefore it achieves the efficiency bound. As a consequence, among all regular estimators, our estimator achieves the smallest asymptotic variance. In Appendix G, we give an example in which we compare the inverse probability weighted estimator to our estimator. We show that the difference in asymptotic variance becomes particularly pronounced in the presence of selection bias (i.e., when treatment and control groups differ systematically). To summarize, we prove sufficient conditions under which our estimator ψ̂ is asymptotically normal. Hence, our proposed estimator converges to the true ATE ψ at a fast rate. No other regular estimator achieves strictly smaller asymptotic variance, since we prove that our estimator is efficient. Since feedforward neural networks converge at a fast enough rate to invoke double robustness and belong to a P-Donsker class, these results hold true when using feedforward neural networks in our regularization framework, which yields DONUT.

5. EXPERIMENTS

Our proposed DONUT is evaluated against state-of-the-art baselines, where we find that its ATE estimation is superior (Section 5.2). Its benefits appear especially for problems subject to selection bias, which we confirm in a simulation study (Section 5.3) that also corroborates the theoretical guarantees from Section 4.

5.1. SETUP

Evaluating methods for estimating causal effects is challenging, as we rarely have access to ground-truth causal effects. Established procedures for evaluating such methods rely on semi-synthetic data that reflect the real world. Our experimental setup follows established procedure regarding datasets, baselines, and performance metrics (e.g., Johansson et al., 2016; Shalit et al., 2017).

Datasets. We evaluate all methods across four benchmark datasets for causal inference: IHDP (e.g., Johansson et al., 2016), Twins (e.g., Yoon et al., 2018), ACIC 2018 (e.g., Shi et al., 2019), and Jobs (e.g., Shalit et al., 2017). The first three datasets are semi-synthetic, while the last originated from an RCT. Details on IHDP, Twins, ACIC 2018, and Jobs are provided in Appendix H.

Training details. DONUT is trained using the regularization framework in (6), where both the outcome model and the propensity score model are trained jointly using stochastic gradient descent with momentum. The hidden layer size is 200 for the representation layers and 100 for the outcome layers, similar to Shalit et al. (2017); Shi et al. (2019). The hyperparameter α in the factual loss (7) is set to 1, and λ in the orthogonal regularization (8) is determined by hyperparameter optimization over {10^k}_{k=-2}^{2}. For IHDP, we follow established practice (e.g., Shalit et al., 2017) and average over 1,000 realizations of the outcomes with 63/27/10 train/validation/test splits. Following Shalit et al. (2017); Yoon et al. (2018); Shi et al. (2019), we average over 100 different train/validation/test splits for Twins and Jobs, and over 10 splits for each dataset for ACIC, all with ratios 56/24/20.

Baselines.
We compare DONUT against 17 state-of-the-art methods for estimating treatment effects, organized into the following groups: (i) Regression methods: linear regression with treatment as a covariate (OLS/LR-1), separate linear regressors for each treatment (OLS/LR-2), and balancing linear regression (BLR) (Johansson et al., 2016); (ii) Matching methods: k-nearest neighbors (k-NN) (Crump et al., 2008); (iii) Tree methods: Bayesian additive regression trees (BART) (Chipman et al., 2012), random forest (R-Forest) (Breiman, 2001), and causal forest (C-Forest) (Wager & Athey, 2018); (iv) Gaussian process methods: causal multi-task Gaussian process (CMGP) (Alaa & van der Schaar, 2017) and debiased Gaussian process (D-GP) (Ray & Szabo, 2019); (v) Neural network methods: balancing neural network (BNN) (Johansson et al., 2016), treatment-agnostic representation network (TARNet) (Shalit et al., 2017), counterfactual regression with Wasserstein distance (CFR-WASS) (Shalit et al., 2017), generative adversarial networks (GANITE) (Yoon et al., 2018), and Dragonnet (Shi et al., 2019); (vi) Plug-in estimators: augmented inverse probability weighted estimator (AIPWE) (Cao et al., 2009), targeted maximum likelihood estimator (TMLE) (van der Laan & Rubin, 2006), and partially linear regression (PLR) (Chernozhukov et al., 2018), all using the outcome and propensity score model of Dragonnet as nuisance functions.

Performance metrics. Following established procedure, we report the following metrics for each dataset. For IHDP and ACIC 2018, we use the absolute error in average treatment effect (Johansson et al., 2016):

ε_ATE = |(1/n) Σ_{i=1}^n (f(x_i, 1) - f(x_i, 0)) - (1/n) Σ_{i=1}^n (f̂(x_i, 1) - f̂(x_i, 0))|.

For Twins, we use the absolute error in observed average treatment effect (Yoon et al., 2018):

ε_ATE = |(1/n) Σ_{i=1}^n (y_i(1) - y_i(0)) - (1/n) Σ_{i=1}^n (ŷ_i(1) - ŷ_i(0))|.
For Jobs, all treated subjects T were part of the original randomized sample E, and hence the true average treatment effect on the treated can be computed as ATT = |T|^{-1} Σ_{i∈T} y_i - |C ∩ E|^{-1} Σ_{i∈C∩E} y_i, where C is the control group. Similar to Shalit et al. (2017), we then use the error ε_ATT = |ATT - |T|^{-1} Σ_{i∈T} (f̂(x_i, 1) - f̂(x_i, 0))|.
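The ε_ATE metric reduces to a one-liner (toy numbers, purely illustrative):

```python
import numpy as np

def eps_ate(tau_true, tau_hat):
    """Absolute error in ATE: |mean of true effects - mean of estimated effects|."""
    return abs(np.mean(tau_true) - np.mean(tau_hat))

# Toy values for illustration: tau_i = f(x_i, 1) - f(x_i, 0) per subject.
tau_true = np.array([2.0, 3.0])
tau_hat = np.array([1.7, 3.2])
print(eps_ate(tau_true, tau_hat))   # |2.5 - 2.45| = 0.05 up to float rounding
```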

5.2. RESULTS

IHDP, Twins, and Jobs have been used to evaluate many methods for estimating treatment effects. Table 1 presents the results of the experiments on IHDP, Twins, and Jobs. Overall, DONUT achieves competitive performance across all datasets. On IHDP, the CMGP baseline achieves a slightly lower estimation error (mean: .11 (CMGP) vs. .13 (DONUT)), but at the drawback of a larger standard deviation (std.: .10 (CMGP) vs. .01 (DONUT)). However, as shown by previous work, the small sample size and the limited simulation settings of IHDP make it difficult to draw conclusions about methods (e.g., Yoon et al., 2018; Shi et al., 2019). In contrast, DONUT achieves superior performance on both Twins and Jobs, where the number of samples is much larger (Twins: n = 11,400; Jobs: n = 3,212). On these datasets, DONUT is state-of-the-art. In particular, we point out the difference to PLR, which, as discussed in Section 4, uses the analytical expression of our estimator similar to Chernozhukov et al. (2018), but as a plug-in estimator. For a fair comparison, we use the outcome models of Dragonnet as nuisance functions. We find that DONUT achieves superior performance. Among neural network-based methods (i.e., BNN, TARNet, CFR-WASS, GANITE, Dragonnet, and DONUT), DONUT performs best across all datasets. In comparison to TARNet, which shares the basic architecture of DONUT (but without orthogonal regularization), we achieve a substantial reduction in ATE estimation error across all datasets (out-of-sample error reduction: 32.1 % on IHDP, 78.1 % on Twins, and 45.5 % on Jobs). This demonstrates the effectiveness of our regularization framework. We further evaluate DONUT using ACIC 2018. This collection of datasets was introduced for evaluating neural networks for estimating average treatment effects (Shi et al., 2019). We compare DONUT against the most competitive and current state-of-the-art method, i.e., Dragonnet. In addition, we compare against TARNet (i.e., DONUT without orthogonal regularization). Table 2 presents the results of the experiments on ACIC 2018. The main observation is that DONUT improves estimation relative to TARNet (DONUT without orthogonal regularization) and relative to Dragonnet across a large collection of datasets. These results further confirm the benefit of including orthogonal regularization in the estimation procedure.

5.3. SIMULATION STUDY ON SELECTION BIAS

To evaluate the robustness of DONUT with regard to selection bias (i.e., when treatment and control groups differ substantially), we generate synthetic data with varying selection bias according to a protocol similar to Yao et al. (2018); Yoon et al. (2018). Details on the protocol are provided in Appendix I. We compare DONUT against Dragonnet, which makes use of the AIPWE to estimate the average treatment effect. This comparison is particularly interesting due to the results in Section 4, where we showed that, compared to any other regular estimator, our estimator attains the smallest asymptotic variance, a difference that is especially pronounced in the presence of selection bias. Figure 1 presents the mean and standard deviation of ε_ATE for varying selection bias. We report two major insights: (i) Dragonnet is outperformed by DONUT across different levels of selection bias. As selection bias increases, the estimation error of DONUT remains stable. (ii) The standard deviation of both DONUT and Dragonnet increases as selection bias increases. However, in line with our finding that DONUT has the smallest asymptotic variance among all regular estimators, the standard deviation of DONUT remains consistently smaller than that of Dragonnet. Moreover, this difference becomes more pronounced the larger the selection bias. Hence, the empirical properties of our estimator coincide with the theory derived in Section 4.

6. CONCLUSION

Understanding causal effects is crucial for reliable decision-making. In this paper, we present a regularization framework for estimating average causal effects. We formalize unconfoundedness as an orthogonality constraint that is used to learn outcomes that are orthogonal to the treatment assignment. We prove theoretical guarantees that our regularization framework yields an asymptotically normal estimator. Based on this, we develop DONUT, which leverages the predictive capabilities of neural networks to estimate average causal effects. Experiments on datasets show that, in most cases, DONUT outperforms the state-of-the-art substantially. This work provides an interesting avenue for future research on causal inference. We hypothesize that most existing models can be improved by

A SUFFICIENCY OF REGULARIZING THE UNTREATED OUTCOME

We show that it is sufficient to regularize Y(0), since orthogonality of Y(0) implies orthogonality of Y(1) to T given X. Suppose Y(0) satisfies the orthogonality constraint, i.e.,

⟨Y(0) - f(X, 0), T - π(X)⟩ = (1/n) Σ_{i=1}^n (Y_i(0) - f(X_i, 0))(T_i - π(X_i)) = 0.

Then,

⟨Y(1) - f(X, 1), T - π(X)⟩ = (1/n) Σ_{i=1}^n (Y_i(1) - f(X_i, 1))(T_i - π(X_i))  (15)
= (1/n) Σ_{i=1}^n (Y_i(1) - E[Y_i | X_i, T = 1])(T_i - π(X_i))  (16)
= (1/n) Σ_{i=1}^n (Y_i(1) - E[Y_i(1) | X_i])(T_i - π(X_i))  (17)
= (1/n) Σ_{i=1}^n (Y_i + ψ(X_i)(1 - T_i) - E[Y_i + ψ(X_i)(1 - T_i) | X_i])(T_i - π(X_i))  (18)
= (1/n) Σ_{i=1}^n (Y_i - ψ(X_i) T_i - E[Y_i - ψ(X_i) T_i | X_i])(T_i - π(X_i))  (19)
= (1/n) Σ_{i=1}^n (Y_i(0) - E[Y_i(0) | X_i])(T_i - π(X_i))  (20)
= (1/n) Σ_{i=1}^n (Y_i(0) - E[Y_i | X_i, T = 0])(T_i - π(X_i))  (21)
= (1/n) Σ_{i=1}^n (Y_i(0) - f(X_i, 0))(T_i - π(X_i))  (22)
= ⟨Y(0) - f(X, 0), T - π(X)⟩ = 0,

using that Y(1) = Y + ψ(X)(1 - T) in (18) and Y(0) = Y - ψ(X) T in (20). Hence, ⟨Y(1) - f(X, 1), T - π(X)⟩ is zero and, therefore, Y(1) is orthogonal to T given X. Nevertheless, it is straightforward to adapt the orthogonal regularizer in (8) to accommodate both outcomes. Consider the following adapted orthogonal regularizer

Ω_OR(f, π, ε_0, ε_1; D) = (1/n) Σ_{i=1}^n (Y*_i(0) - f̃(X_i, 0))² + (1/n) Σ_{i=1}^n (Y*_i(1) - f̃(X_i, 1))²,  (24)

with Y*_i(0) = Y_i - ψ* T_i, Y*_i(1) = Y_i + ψ* (1 - T_i), ψ* = (1/n) Σ_{i=1}^n [f(X_i, 1) - f(X_i, 0)]  (25)

and f̃(X_i, 0) = f(X_i, 0) + ε_0 (T_i - π(X_i)), f̃(X_i, 1) = f(X_i, 1) + ε_1 (T_i - π(X_i)),  (26)

where ε_0 and ε_1 are additional model parameters. Similar to the result in Section 3.2.2, this yields that ⟨Y*(0) - f̃(X, 0), T - π(X)⟩ and ⟨Y*(1) - f̃(X, 1), T - π(X)⟩ are zero when the partial derivatives of (24) with respect to ε_0 and ε_1 are zero.

B ASYMPTOTICALLY NORMAL ESTIMATORS

In this section, we briefly discuss asymptotically normal estimators; we refer to van der Vaart (1998) for an in-depth discussion. Suppose ψ̂ is an estimator for the true parameter ψ. Then, ψ̂ is an asymptotically normal estimator if it satisfies

ψ̂ - ψ = (1/n) Σ_{i=1}^n φ(Z_i) + o_p(1/√n),

where φ(Z) is referred to as the influence function at Z = (X, T, Y) (Kandasamy et al., 2015). The influence function has mean zero and finite variance (i.e., E[φ(Z)] = 0 and Var(φ(Z)) < ∞). The existence and uniqueness of the influence function follow from the Riesz representation theorem. Asymptotic normality is a favorable property of an estimator, since, by the central limit theorem,

√n (ψ̂ - ψ) →_d N(0, Var(φ(Z))).  (29)

Hence, an asymptotically normal estimator is asymptotically normally distributed and unbiased. Moreover, its standard deviation shrinks in proportion to 1/√n as the sample size grows, and its asymptotic variance is given by the variance of the influence function. As a consequence, the distribution of the estimator ψ̂ converges weakly to a Dirac delta function centered at the true ψ.
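As a textbook illustration (the sample mean, not DONUT's estimator), the influence function of ψ̂ = (1/n) Σ_i Z_i for ψ = E[Z] is φ(z) = z - ψ, and the variance of √n (ψ̂ - ψ) stabilizes at Var(φ(Z)) as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)

# For the sample mean of Z with E[Z] = psi, the influence function is
# phi(z) = z - psi, so sqrt(n)*(psi_hat - psi) -> N(0, Var(Z)).
psi, var = 1.0, 4.0
reps = 2000

def scaled_errors(n):
    """Draw `reps` samples of size n and return sqrt(n)*(psi_hat - psi)."""
    z = rng.normal(psi, np.sqrt(var), size=(reps, n))
    return np.sqrt(n) * (z.mean(axis=1) - psi)

# The variance of the scaled error is stable in n and equals Var(phi(Z)) = 4.
v_small, v_large = scaled_errors(50).var(), scaled_errors(2000).var()
print(round(v_small, 1), round(v_large, 1))
```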

C PROOF OF THEOREM 1

In observational studies, the covariates X are often high-dimensional, and only limited knowledge about the nuisance functions (i.e., the outcome and propensity score functions f and π) is available. In such cases, it is reasonable to use flexible, data-adaptive methods for estimating the nuisance functions, e.g., neural networks. However, the complexity of these methods makes asymptotic analysis difficult, because the estimators used to construct η̂ = (f̂, π̂) are not described by a single finite-dimensional parameter. Nevertheless, under some conditions, we can learn about the asymptotics of ψ̂ using tools from empirical process theory. First, we introduce some notation. Throughout the proof, we use P{f(Z)} = ∫ f(z) dP to denote the expectation of f(Z) for a random variable Z (treating the function f as fixed). Hence, P{f̂(Z)} is random if f̂ is random (e.g., estimated from the sample). In contrast, E[f̂(Z)] is a fixed, non-random quantity, which averages over the randomness in both Z and f̂ and thus does not equal P{f̂(Z)} except when f̂ = f is fixed and non-random. Moreover, we let P_n = (1/n) Σ_{i=1}^n δ_{Z_i} denote the empirical measure, such that sample averages can be written as (1/n) Σ_{i=1}^n f(Z_i) = ∫ f(z) dP_n = P_n{f(Z)}. For clarity, we denote the true nuisance functions and the true average treatment effect by η_0 = (f_0, π_0) and ψ_0, respectively. We further denote the triplet (X, T, Y) by Z. For the perturbed outcome model f̃ and ψ̂ = (1/n) Σ_{i=1}^n [f̃(X_i, 1) - f̃(X_i, 0)] from (8), the regularization framework yields

⟨Y*(0) - f̃(X, 0), T - π̂(X)⟩ = (1/n) Σ_{i=1}^n (Y_i - ψ̂ T_i - f̃(X_i, 0))(T_i - π̂(X_i)) = 0

by construction. Solving the above for ψ̂ yields an analytical expression for the estimator of the average treatment effect,

ψ̂ = [(1/n) Σ_{i=1}^n (Y_i - f̃(X_i, 0))(T_i - π̂(X_i))] / [(1/n) Σ_{i=1}^n T_i (T_i - π̂(X_i))].  (31)

Therefore, the estimator is given by ψ̂ = P_n{m(Z; η̂)}, where

m(Z; η) = (Y - f(X, 0))(T - π(X)) / P{T (T - π(X))}

and η = (f, π) denotes the nuisance functions.
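The closed-form expression (31) is easy to check numerically. The sketch below (toy data with homogeneous effect ψ_0 = 2, not from the paper) also illustrates the double robustness analyzed in this proof: the estimate remains correct when f is misspecified but π is correct.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Toy data-generating process with homogeneous effect psi_0 = 2 (illustrative).
X = rng.normal(size=n)
pi0 = 1.0 / (1.0 + np.exp(-X))            # true propensity score pi_0(x)
T = rng.binomial(1, pi0)
Y = X + 2.0 * T + rng.normal(size=n)      # f_0(x, 0) = x, psi_0 = 2

def psi_hat(f0, pi):
    """Estimator (31): the ratio implied by the orthogonality constraint."""
    return float(np.mean((Y - f0) * (T - pi)) / np.mean(T * (T - pi)))

both_correct = psi_hat(f0=X, pi=pi0)       # correct f and pi
wrong_f = psi_hat(f0=np.zeros(n), pi=pi0)  # misspecified f, correct pi

print(round(both_correct, 2), round(wrong_f, 2))
```

Both calls return estimates close to 2, although the misspecified outcome model yields a noisier estimate.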
We recall the conditions in Theorem 1. Suppose the estimator $\hat{\eta} = (\hat{f}, \hat{\pi})$ converges to some $\bar{\eta} = (\bar{f}, \bar{\pi})$ in the sense that $\|\hat{\eta} - \bar{\eta}\| = o_p(1)$, where either $\bar{f} = f_0$ or $\bar{\pi} = \pi_0$ (or both) corresponds to the true nuisance function. Thus, at least one nuisance estimator needs to converge to the correct function, but the other may be misspecified. Then,
$$P\{m(Z; \bar{\eta})\} = P\{m(Z; \eta_0)\} = \psi_0,$$
which follows from the straightforward-to-check fact that $P\{m(Z; \pi, f_0)\} = P\{m(Z; \pi_0, f)\} = \psi_0$ for any $\pi$ and $f$. Consider the decomposition
$$\hat{\psi} - \psi_0 = P_n\{m(Z; \hat{\eta})\} - P\{m(Z; \bar{\eta})\} = (P_n - P)\{m(Z; \hat{\eta})\} + P\{m(Z; \hat{\eta}) - m(Z; \bar{\eta})\}.$$
If the estimators of the nuisance functions $\hat{\eta}$ take values in P-Donsker classes, then $m(Z; \hat{\eta})$ also belongs to a P-Donsker class, since Lipschitz transformations of Donsker functions are again Donsker functions. The Donsker property, together with the continuous mapping theorem, yields that $(P_n - P)\{m(Z; \hat{\eta})\}$ is asymptotically equivalent to $(P_n - P)\{m(Z; \bar{\eta})\}$ up to an $o_p(1/\sqrt{n})$ error (see Mikosch et al. (1997) for more details on P-Donsker classes). Therefore,
$$\hat{\psi} - \psi_0 = (P_n - P)\{m(Z; \bar{\eta})\} + P\{m(Z; \hat{\eta}) - m(Z; \bar{\eta})\} + o_p(1/\sqrt{n}). \quad (33)$$
It remains to show that $P\{m(Z; \hat{\eta}) - m(Z; \bar{\eta})\}$ is asymptotically negligible. Using $P\{m(Z; \bar{\eta})\} = P\{m(Z; \eta_0)\}$, this term equals
$$P\left\{ \frac{(T - \hat{\pi}(X))(Y - \hat{f}(X, 0))}{P\{T(T - \hat{\pi}(X))\}} - \frac{(T - \pi_0(X))(Y - f_0(X, 0))}{P\{T(T - \pi_0(X))\}} \right\} \quad (34)$$
$$= P\left\{ \pi_0(X) \left( \frac{(1 - \hat{\pi}(X))(f_0(X, 1) - \hat{f}(X, 0))}{P\{T(T - \hat{\pi}(X))\}} - \frac{(1 - \pi_0(X))(f_0(X, 1) - f_0(X, 0))}{P\{T(T - \pi_0(X))\}} \right) \right\} - P\left\{ \frac{(1 - \pi_0(X))\hat{\pi}(X)(f_0(X, 0) - \hat{f}(X, 0))}{P\{T(T - \hat{\pi}(X))\}} \right\} \quad (35)$$
$$= P\left\{ \pi_0(X) \left( \frac{(1 - \hat{\pi}(X))(f_0(X, 1) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} - \frac{(1 - \pi_0(X))(f_0(X, 1) - f_0(X, 0))}{P\{\pi_0(X)(1 - \pi_0(X))\}} \right) \right\} - P\left\{ \frac{(1 - \pi_0(X))\hat{\pi}(X)(f_0(X, 0) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\} \quad (37)$$
$$= P\left\{ \frac{\pi_0(X)(1 - \hat{\pi}(X))(f_0(X, 1) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\} - \psi_0 - P\left\{ \frac{(1 - \pi_0(X))\hat{\pi}(X)(f_0(X, 0) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\} \quad (39)$$
$$= P\left\{ \frac{\pi_0(X)(1 - \hat{\pi}(X))(f_0(X, 1) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} - \frac{\pi_0(X)(1 - \hat{\pi}(X))(f_0(X, 1) - f_0(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\} - P\left\{ \frac{(1 - \pi_0(X))\hat{\pi}(X)(f_0(X, 0) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\}, \quad (41)$$
where we use $P\{m(Z; \bar{\eta})\} = P\{m(Z; \eta_0)\}$ in (34), iterated expectations in (35), and $P\{T(T - \hat{\pi}(X))\} = P\{\pi_0(X)(1 - \hat{\pi}(X))\}$ (and likewise for $\pi_0$ in place of $\hat{\pi}$) in (37). In (39), and similarly in (41), we use that $\psi_0(X) = \psi_0$ and, therefore, $f_0(X, 1) - f_0(X, 0) = \psi_0$. As a result, by simplifying, the above equals
$$P\left\{ \frac{(\pi_0(X) - \hat{\pi}(X))(f_0(X, 0) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\}. \quad (43)$$
Therefore, since $\pi_0$ and $\hat{\pi}$ are bounded away from zero and one, the Cauchy-Schwarz inequality (i.e., $P\{fg\} \leq \|f\| \, \|g\|$) implies that, up to a multiplicative constant, $|P\{m(Z; \hat{\eta}) - m(Z; \bar{\eta})\}|$ is bounded above by
$$\|\pi_0(X) - \hat{\pi}(X)\| \, \|f_0(X, 0) - \hat{f}(X, 0)\|. \quad (44)$$
Thus, if, for example, $\hat{\pi}$ is based on a correctly specified parametric model (e.g., logistic regression), so that $\|\hat{\pi} - \pi_0\| = o_p(1/\sqrt{n})$, then we only need $\hat{f}$ to be consistent, i.e., $\|\hat{f} - f_0\| = o_p(1)$, to make the product term $P\{m(Z; \hat{\eta}) - m(Z; \bar{\eta})\} = o_p(1/\sqrt{n})$ asymptotically negligible. Then, the doubly robust estimator satisfies
$$\hat{\psi} - \psi_0 = (P_n - P)\{m(Z; \eta_0)\} + o_p(1/\sqrt{n})$$
and it is efficient with influence function $\varphi(Z; \psi, \eta) = m(Z; \eta) - \psi$. This proves the asymptotic normality of $\hat{\psi}$. It remains to show that the asymptotic variance of $\hat{\psi}$ equals
$$\frac{\sigma^2}{P\{\mathrm{Var}(T \mid X)\}}. \quad (45)$$
As seen in Appendix B, the asymptotic variance of an estimator is the variance of its influence function. From the above, we know that the efficient influence function of $\hat{\psi}$ is $\varphi(Z; \psi, \eta) = m(Z; \eta) - \psi$. Thus, using that $P\{\varphi(Z; \psi_0, \eta_0)\} = 0$ (by definition),
$$\mathrm{Var}(\varphi(Z; \psi_0, \eta_0)) = P\{\varphi(Z; \psi_0, \eta_0)^2\}. \quad (46)$$
A direct computation of $P\{\varphi(Z; \psi_0, \eta_0)^2\}$, using iterated expectations together with $\mathrm{Var}(T \mid X) = \pi_0(X)(1 - \pi_0(X))$ and $\mathrm{Var}(Y \mid T, X) = \sigma^2$, then yields (45), which completes the proof.

Note: We leave the derivation of a unifying theory for future work. A revised version of this work can be found in Hatt & Feuerriegel (2021).

Table: Results for estimating average treatment effects on IHDP, Twins, and Jobs. Lower is better.

Footnotes:
- Code available at github.com/anonymous/donut (anonymized for peer review).
- The propensity score is defined as π(x) = P(T = 1 | X = x) (Rosenbaum & Rubin, 1983), which is equivalent to E[T | X = x] for binary treatments.
- We use the notion of orthogonality with respect to the inner product in (5). This differs from Neyman orthogonality in Nie & Wager (2017) and Chernozhukov et al. (2018), which requires the Gâteaux derivative of a debiased score function to vanish at the true parameter. In contrast, our orthogonality constraint exploits unconfoundedness to ensure that the outcomes are orthogonal (w.r.t. the inner product in (5)) to the treatment assignment.
- This can be seen by a distinction of cases: if T_i = 1, then Y_i(0) = Y_i − ψ(X_i); if T_i = 0, then Y_i(0) = Y_i.
- It is straightforward to extend the orthogonal regularization term to incorporate both outcomes by adding the same term for the treated outcome Y(1), including the pseudo-outcome Y*(1) and the perturbation function f(x, 1) to enforce orthogonality. We demonstrate this in Appendix A.
- We refer to Appendix B for a brief discussion of asymptotically normal estimators.
- A class F of measurable functions on a probability space (Ω, A, P) is called a P-Donsker class if, for $G_n^P = \sqrt{n}(P_n - P)$, the empirical process $\{G_n^P f : f \in F\}_{n \geq 1}$ converges weakly to a P-Brownian bridge $G^P$. Donsker classes include parametric classes, but also many infinite-dimensional classes, e.g., smooth functions and bounded monotone functions. See Mikosch et al. (1997) for more details.
- $o_p(1/r_n)$ denotes the usual stochastic order notation: $X_n = o_p(1/r_n)$ means that $r_n X_n \to 0$ in probability.
- In the paper, E[f(Z)] is used similarly to P{f(Z)}. We introduce the notation P{f(Z)} here to make clear that we integrate only over the randomness in Z and not over the function estimate $\hat{f}$.
- We provide the complete dataset at github.com/anonymous/donut (anonymized for peer review).
- We provide the list of unique dataset identification numbers at github.com/anonymous/donut (anonymized for peer review) for reproducibility.
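The double robustness discussed above can be checked numerically: with a correctly specified propensity score, the closed-form estimator recovers ψ_0 even under a deliberately misspecified outcome model. The data-generating process below is our own choice, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(size=n)
pi0 = 0.3 + 0.4 * X                               # true propensity score
T = rng.binomial(1, pi0)
Y = 1.0 + X + 2.0 * T + rng.normal(0.0, 0.5, n)   # true ATE psi_0 = 2

# Outcome model deliberately misspecified (f_hat(x, 0) = 0),
# propensity score correctly specified.
psi_hat = np.sum((Y - 0.0) * (T - pi0)) / np.sum(T * (T - pi0))
print(psi_hat)   # concentrates around psi_0 = 2
```

Setting the outcome model to zero leaves the estimator consistent because the product bound (44) vanishes whenever one of the two nuisance functions is correct.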

D ORTHOGONAL REGULARIZATION UNDER LINEAR SPECIFICATION OF THE TREATMENT EFFECT

In Theorem 1, condition 2 ensures the homogeneity of the treatment effect, i.e., ψ(X) = ψ. This can easily be relaxed to any specification of ψ, as long as it has finitely many parameters and the appropriate identification criteria hold (for a linear specification, this is the non-singularity of the design matrix). In this section, we explain how the orthogonal regularization in (8) changes when a linear specification of the treatment effect is considered, i.e., ψ(x) = θᵀx with θ ∈ R^d. The orthogonal regularizer retains the form of (8), and the perturbation function f remains unchanged. The pseudo-outcome is now given by Y*_i(0) = Y_i − (θ̂ᵀX_i)T_i, where θ̂ is given by a system of linear equations that sets the empirical inner product of the residual Y_i − θᵀX_i T_i − f̂(X_i, 0) with (T_i − π̂(X_i)) v(T_i, X_i) to zero. Here, v(t, x) is an arbitrary function of the same dimension as θ, which ensures that the system of linear equations possesses a unique solution; for instance, in the case of ψ(x) = θᵀx, the choice v(t_i, x_i) = x_i t_i can be made. The procedure is as follows. First, the system of linear equations is solved for θ̂. Second, the pseudo-outcome Y*_i(0) = Y_i − (θ̂ᵀX_i)T_i is computed. Finally, this pseudo-outcome is plugged into the regularizer. The proof of asymptotic normality proceeds similarly to the one under a homogeneous treatment effect.
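Under the choice v(t_i, x_i) = x_i t_i, the moment conditions reduce to a d × d linear system in θ. The sketch below is our reading of that system (the displayed equations are not reproduced in the text, so this is an illustrative instantiation, not the authors' exact implementation); `f0_hat` and `pi_hat` are assumed to be pre-computed nuisance estimates evaluated at the sample points:

```python
import numpy as np

def solve_theta(Y, T, X, f0_hat, pi_hat):
    """Solve sum_i (Y_i - theta^T x_i t_i - f0_hat_i)(t_i - pi_hat_i) x_i t_i = 0
    for theta (moment equations with the choice v(t_i, x_i) = x_i t_i)."""
    w = T * (T - pi_hat)            # t_i (t_i - pi_hat_i); t_i^2 = t_i for binary t_i
    A = X.T @ (X * w[:, None])      # sum_i w_i x_i x_i^T  (must be non-singular)
    b = X.T @ (w * (Y - f0_hat))    # sum_i w_i (Y_i - f0_hat_i) x_i
    return np.linalg.solve(A, b)
```

The system is linear in θ because T_i² = T_i for binary treatments, so the θ-dependent part collects into the matrix A.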

E FEEDFORWARD NEURAL NETWORKS ARE IN P-DONSKER CLASS

We show that feedforward neural networks lie in a P-Donsker class. A feedforward neural network $f_d$ of depth $d \in \mathbb{N}_{\geq 1}$ is defined by the recursion
$$f_1(x) = \sigma(W_1 x), \qquad f_i(x) = \sigma(W_i f_{i-1}(x)), \quad i = 2, \ldots, d,$$
for matrices $\{W_i\}_{i=1}^{d}$ of appropriate dimensions and an activation function $\sigma$, which is applied element-wise. $f_1$ is clearly in a P-Donsker class, as it is Lipschitz parametric. The Donsker property is preserved under Lipschitz transformations. Hence, $f_i(x)$, for $i = 2, \ldots, d$, is in a P-Donsker class if the activation function is a Lipschitz transformation. This is the case for most common activation functions; we give two examples in the following.

The ReLU activation function, $\sigma_{\mathrm{ReLU}}(x) = \max(0, x)$, preserves the Donsker property, since a constant function is clearly in a P-Donsker class and the maximum of two P-Donsker functions is again a P-Donsker function, due to the preservation under Lipschitz transformations. The same follows for the ELU activation function,
$$\sigma_{\mathrm{ELU}}(x) = \begin{cases} x, & x > 0, \\ \alpha(e^{x} - 1), & x \leq 0, \end{cases}$$
where $\alpha \geq 0$, since it is a Lipschitz transformation.
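The Lipschitz property invoked above is easy to check numerically for the two example activations; this is an illustrative sanity check (our own, not part of the proof), estimating the Lipschitz constant from secant slopes on a grid:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def lipschitz_constant(f, lo=-5.0, hi=5.0, m=2001):
    """Empirical Lipschitz constant: max secant slope |f(x2)-f(x1)| / |x2-x1|."""
    x = np.linspace(lo, hi, m)
    slopes = np.abs(np.diff(f(x))) / np.abs(np.diff(x))
    return slopes.max()
```

For α = 1, both activations have Lipschitz constant at most 1, so composing them with linear maps preserves the Donsker property as argued above.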

F ASYMPTOTICS OF ψ UNDER HETEROGENEOUS TREATMENT EFFECTS

Our estimator retains a useful interpretation under heterogeneous treatment effects. In this case, $\hat{\psi}$ converges in probability to
$$\frac{P\{\mathrm{Var}(T \mid X)\, \psi(X)\}}{P\{\mathrm{Var}(T \mid X)\}},$$
where $\psi(X) = f_0(X, 1) - f_0(X, 0)$ is the true treatment effect at $X$. This can be interpreted as a weighted average of the treatment effects $\psi(X)$. As a consequence, most weight is given to subpopulations with large $\mathrm{Var}(T \mid X)$, i.e., subpopulations that are most informative about the treatment effect.

We adopt the same notation as in the proof in Appendix C and resume from there. From the proof in Appendix C, we know that
$$\hat{\psi} = (P_n - P)\{m(Z; \bar{\eta})\} + P\{m(Z; \hat{\eta})\} + o_p(1/\sqrt{n}),$$
since the estimators of the nuisance functions $\hat{\eta} = (\hat{f}, \hat{\pi})$ belong to a P-Donsker class and, therefore, $m(Z; \hat{\eta})$ belongs to a P-Donsker class. Similarly to the proof in Appendix C, it is left to investigate the term $P\{m(Z; \hat{\eta})\}$; when the treatment effect is heterogeneous, this term no longer reduces to a constant $\psi_0$. Again, by iterated expectations, this term equals
$$P\{m(Z; \hat{\eta})\} = P\left\{ \frac{(\pi_0(X) - \hat{\pi}(X))(f_0(X, 0) - \hat{f}(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\} + P\left\{ \frac{\pi_0(X)(1 - \hat{\pi}(X))(f_0(X, 1) - f_0(X, 0))}{P\{\pi_0(X)(1 - \hat{\pi}(X))\}} \right\}.$$
The first term can be bounded from above (up to a multiplicative constant) by the Cauchy-Schwarz inequality and, therefore, is asymptotically negligible if one of the nuisance functions is correctly specified. Thus, only the second term remains. Using $\psi(X) = f_0(X, 1) - f_0(X, 0)$ and letting $\hat{\pi}$ converge to $\pi_0$ (correctly specified propensity score), the second term can be written as
$$P\left\{ \frac{\pi_0(X)(1 - \pi_0(X))\, \psi(X)}{P\{\pi_0(X)(1 - \pi_0(X))\}} \right\}.$$
Since $\pi_0(X)(1 - \pi_0(X)) = \mathrm{Var}(T \mid X)$, this concludes the proof.
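The weighted-average interpretation can be illustrated by simulation: with true nuisance functions and a heterogeneous effect ψ(x), the estimator concentrates around P{Var(T | X)ψ(X)}/P{Var(T | X)} rather than the unweighted mean of ψ(X). The data-generating process below is our own, for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
X = rng.uniform(size=n)
pi0 = 0.1 + 0.6 * X                      # true propensity score, varies with X
T = rng.binomial(1, pi0)
psi_x = 1.0 + 2.0 * X                    # heterogeneous treatment effect psi(X)
f0 = X                                   # true control outcome f_0(x, 0)
Y = f0 + psi_x * T + rng.normal(0.0, 0.1, n)

# Closed-form estimator with true nuisance functions plugged in
psi_hat = np.sum((Y - f0) * (T - pi0)) / np.sum(T * (T - pi0))

# Variance-weighted average of psi(X): P{Var(T|X) psi(X)} / P{Var(T|X)}
var_t = pi0 * (1.0 - pi0)                # Var(T | X)
weighted = np.sum(var_t * psi_x) / np.sum(var_t)
```

In this setup, psi_hat tracks the variance-weighted average (≈ 2.10) rather than the unweighted mean of ψ(X) (= 2.0), because subpopulations with larger Var(T | X) receive more weight.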

G COMPARISON TO INVERSE PROBABILITY WEIGHTED ESTIMATOR

We compare our estimator ψ̂ to the inverse probability weighted (IPW) estimator (e.g., Funk et al., 2011). The IPW estimator is a popular estimator for treatment effects due to its improvement in efficiency and reduction in bias compared to unweighted estimators. Under identical conditions as in Theorem 1, the IPW estimator, denoted by ψ̂_IPW, is asymptotically normal with asymptotic variance σ² E[1/Var(T | X)] (e.g., Kennedy, 2016). We can thus compare the asymptotic behavior of our estimator ψ̂ and the IPW estimator ψ̂_IPW. Both estimators are asymptotically unbiased, but they differ in asymptotic variance, as the following results show.

Corollary 1. Under the conditions in Theorem 1, the asymptotic variance of ψ̂ is strictly smaller than the asymptotic variance of ψ̂_IPW.

Proof. The statement follows from Jensen's inequality: since x ↦ 1/x is strictly convex on (0, ∞), we have 1/E[Var(T | X)] < E[1/Var(T | X)] for non-degenerate Var(T | X).

The difference between the asymptotic variances becomes particularly pronounced in the presence of selection bias, i.e., when the treatment group is systematically different from the control group.

Corollary 2. If the propensity score π(x) is close to 0 or 1 for some x ∈ R^d, then the asymptotic variance of ψ̂_IPW can take arbitrarily large values.

Proof. For some small ε > 0, let π(X) ∈ [0, 1] \ (ε, 1 − ε). Then, by the Bhatia-Davis inequality and since T is binary, Var(T | X) = π(X)(1 − π(X)). Hence, Var(T | X) ≤ ε(1 − ε) ≤ ε, and, therefore, E[1/Var(T | X)] ≥ 1/ε, which proves the statement.
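Corollaries 1 and 2 can be illustrated with a toy computation (our own illustrative setup): consider a population in which half the units have propensity ε and half have propensity 0.5. As ε → 0, the IPW variance explodes while the variance of ψ̂ stays bounded:

```python
import numpy as np

def asymptotic_variances(eps, sigma2=1.0):
    """Compare sigma^2 / P{Var(T|X)} (our estimator) with
    sigma^2 * E[1/Var(T|X)] (IPW) for a two-subpopulation toy example."""
    pi = np.array([eps, 0.5])                # two subpopulations, equal weight
    var_t = pi * (1.0 - pi)                  # Var(T | X) = pi(1 - pi)
    v_ours = sigma2 / var_t.mean()           # asymptotic variance of psi_hat
    v_ipw = sigma2 * (1.0 / var_t).mean()    # asymptotic variance of psi_hat_IPW
    return v_ours, v_ipw
```

By Jensen's inequality, `v_ours <= v_ipw` always holds; for ε = 0.001 the IPW variance exceeds 500 while our estimator's variance stays below 10.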

H DETAILED DESCRIPTION OF THE DATASETS

H.1 IHDP

Hill (2011) introduced a semi-synthetic dataset created from the Infant Health and Development Program (IHDP). This dataset is based on a randomized experiment that examines the effect of home visits by specialists on future cognitive scores. The dataset consists of 747 children (t = 1: 139; t = 0: 608) with 25 covariates. Similarly to Shalit et al. (2017), we use 1,000 realizations from setting A in the NPCI package (Dorie, 2016).

H.2 TWINS

This dataset comprises all births in the USA between 1989 and 1991 (Almond et al., 2005). Only twins are considered among these births. Treatment (i.e., T = 1) is defined as being the heavier twin (and T = 0 as being the lighter twin). The outcome is defined as 1-year mortality. There are 30 covariates available for each pair of twins that relate to the parents, the pregnancy, and the birth: marital status; race; residence; number of previous births; pregnancy risk factors; quality of care during pregnancy; and number of gestation weeks prior to birth. Only twins that weigh less than 2 kg and have no missing covariates (list-wise deletion) are taken into account. This yields a complete dataset (without missing data). The final cohort consists of 11,400 twin pairs, whose mortality rate is 17.7% for lighter twins and 16.1% for heavier ones. In this setting, we observe both T = 0 (lighter twin) and T = 1 (heavier twin) for each pair of twins; therefore, the true treatment effect in this dataset is known. In order to simulate an observational study, one of the twins is selectively observed based on the covariates (which leads to selection bias) as follows: T | x ∼ Bern(Sigmoid(wᵀx + n)), where w ∼ U((−0.1, 0.1)^{30×1}) and n ∼ N(0, 0.1).
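The selective-observation mechanism can be sketched as follows (function and variable names are ours; we read N(0, 0.1) as mean 0 and standard deviation 0.1):

```python
import numpy as np

def assign_observed_twin(x, rng):
    """Simulate which twin is observed: T | x ~ Bern(Sigmoid(w^T x + n)),
    with w ~ U((-0.1, 0.1)^{30x1}) and n ~ N(0, 0.1)."""
    n_samples, n_cov = x.shape                       # n_cov = 30 for Twins
    w = rng.uniform(-0.1, 0.1, size=n_cov)
    noise = rng.normal(0.0, 0.1, size=n_samples)
    p = 1.0 / (1.0 + np.exp(-(x @ w + noise)))       # sigmoid
    return rng.binomial(1, p)                        # T = 1: heavier twin observed
```

Because w depends on the covariates, the observed twin is correlated with x, which induces the intended selection bias.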

H.3 JOBS

The Jobs dataset, studied in LaLonde (1986), consists of randomized data based on the National Supported Work program and non-randomized observational study data. A (random) subset of the randomized data is used to evaluate the algorithms. The dataset consists of 722 randomized samples (T = 1: 297; T = 0: 425) and 2,490 non-randomized samples (T = 1: 0; T = 0: 2,490), all with 7 covariates.

H.4 ACIC 2018

ACIC 2018 is a collection of semi-synthetic datasets derived from the linked birth and infant death data (LBIDD) (MacDorman & Atkinson, 1998) and was developed for the 2018 Atlantic Causal Inference Conference (ACIC) competition (Shimoni et al., 2018). The simulation includes 63 different data-generating processes with sample sizes ranging from 1,000 to 50,000. Each dataset is a realization from a separate distribution, which itself is randomly drawn in accordance with the settings of the data-generating process. Similarly to Shi et al. (2019), we randomly pick 3 datasets of size either 5k or 10k for each of the 63 data-generating process settings and exclude all datasets with indication of strong selection bias. This yields a total of 97 datasets.

I DATA GENERATING PROCESS FOR SIMULATION STUDY

For the simulation study, we follow a protocol similar to Yao et al. (2018) and Yoon et al. (2018). We generate 2,500 untreated samples from N(0_{10×1}, 0.5 · ΣΣᵀ) and 5,000 treated samples from N(μ_1, 0.5 · ΣΣᵀ), where Σ ∼ U((0, 1)^{10×10}). Varying μ_1 yields different levels of selection bias, which we measure by the Kullback-Leibler divergence: the larger the Kullback-Leibler divergence, the greater the distributional distance between the treatment and control groups and, thus, the larger the selection bias. The outcome is generated as Y = wᵀx + t + n given X = x and T = t, where w ∼ U((−1, 1)^{10×1}) and n ∼ N(0, 0.1).
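The protocol above can be sketched as follows (names are ours; we treat μ_1 as a constant shift applied to all 10 coordinates and read N(0, 0.1) for the noise as standard deviation 0.1):

```python
import numpy as np

def simulate(mu1_shift, rng):
    """Generate 2,500 control and 5,000 treated samples as described above."""
    d = 10
    Sigma = rng.uniform(0.0, 1.0, size=(d, d))
    cov = 0.5 * Sigma @ Sigma.T                  # covariance 0.5 * Sigma Sigma^T
    x0 = rng.multivariate_normal(np.zeros(d), cov, size=2500)            # control
    x1 = rng.multivariate_normal(np.full(d, mu1_shift), cov, size=5000)  # treated
    X = np.vstack([x0, x1])
    T = np.concatenate([np.zeros(2500), np.ones(5000)])
    w = rng.uniform(-1.0, 1.0, size=d)
    noise = rng.normal(0.0, 0.1, size=X.shape[0])
    Y = X @ w + T + noise                        # Y = w^T x + t + n
    return X, T, Y
```

Increasing `mu1_shift` moves the treated distribution away from the control distribution, which increases the Kullback-Leibler divergence and hence the selection bias.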

