ESTIMATING TREATMENT EFFECTS VIA ORTHOGONAL REGULARIZATION

Abstract

Decision-making often requires accurate estimation of causal effects from observational data. This is challenging as outcomes of alternative decisions are not observed and have to be estimated. Previous methods estimate outcomes based on unconfoundedness but neglect any constraints that unconfoundedness imposes on the outcomes. In this paper, we propose a novel regularization framework in which we formalize unconfoundedness as an orthogonality constraint. We provide theoretical guarantees that this yields an asymptotically normal estimator for the average causal effect. Compared to other estimators, its asymptotic variance is strictly smaller. Based on our regularization framework, we develop deep orthogonal networks for unconfounded treatments (DONUT) which learn outcomes that are orthogonal to the treatment assignment. Using a variety of benchmark datasets for causal inference, we demonstrate that DONUT outperforms the state-of-the-art substantially.

1. INTRODUCTION

Estimating the causal effect of an intervention (i.e., treatment effect) is integral to individual decision-making in many domains, such as marketing (Brodersen et al., 2015; Hatt & Feuerriegel, 2020), economics (Heckman et al., 1997), and epidemiology (Robins et al., 2000). For instance, in order to control an epidemic, it is relevant for public decision-makers to estimate the causal effect of school closures (intervention) on the infection rate (outcome). The causal effect of an intervention can be estimated in two ways: randomized controlled trials (RCTs) and observational studies. RCTs are widely recognized as the gold standard for estimating causal effects, yet conducting RCTs is often infeasible (Robins et al., 2000). For instance, randomly allocating different policy interventions during an epidemic might be unethical and impractical. Unlike RCTs, observational studies use observed data to infer causal effects. For this, covariates must be collected that contain all confounders (i.e., variables that affect both treatment and outcome). This is becoming increasingly common due to the ease of access to rich data. In this paper, we estimate the average causal effect of a treatment from observational data.

In order to estimate the causal effect of a treatment, the outcome of an alternative treatment has to be estimated. However, this is challenging, since we do not know what the outcome would have been if another treatment had been applied. Existing methods for estimating treatment effects use the treatment assignment as a feature and train regression models to estimate the outcomes (Funk et al., 2011; Kallus, 2017b). Methods based on nearest neighbors and matching find similar subjects (Ho et al., 2007; Crump et al., 2008; Kallus, 2017a; 2020). Tree- and forest-based methods (Wager & Athey, 2018) estimate the treatment effect at the leaf nodes and combine many weak learners into expressive ensemble models.
Gaussian process-based methods provide uncertainty quantification (Alaa & van der Schaar, 2017; Ray & Szabo, 2019). Weighting-based approaches re-weight the outcomes using weights based on covariate and treatment data (Kallus, 2018). For instance, Fong et al. (2018) and Yiu & Su (2018) seek weights such that the treatment assignment is unassociated with the covariates. However, they do not require the treatment assignment to be unassociated with the potential outcomes. Doubly robust methods combine a model for the outcomes and a model for the treatment propensity in a manner that is robust to misspecification (Funk et al., 2011; Benkeser et al., 2017; Chernozhukov et al., 2018). Recently, deep learning has been successful for this task due to its strong predictive performance and its ability to learn representations of the data (e.g., Johansson et al., 2016; Louizos et al., 2017; Shalit et al., 2017; Yao et al., 2018; Yoon et al., 2018; Shi et al., 2019). To ensure identifiability of the causal effect, state-of-the-art methods for estimating treatment effects rely on unconfoundedness (i.e., all confounders are measured and thus included in the covariates). Hence, unconfoundedness is assumed for identifiability; yet, during estimation of the model parameters, the implications that unconfoundedness has for the unobserved outcomes have been neglected.

Contribution.¹ In this paper, (i) we introduce a regularization framework that exploits unconfoundedness. To this end, we formalize unconfoundedness as an orthogonality constraint. This constraint is used during estimation of the model parameters to ensure that the outcomes are orthogonal to the treatment assignment. We prove sufficient conditions under which this yields an asymptotically normal estimator for the average causal effect. Compared to other estimators, its asymptotic variance is strictly smaller.
(ii) Based on our regularization framework, we develop deep orthogonal networks for unconfounded treatments (DONUT) for estimating average causal effects. DONUT leverages the predictive capabilities of neural networks to learn outcomes that are orthogonal to the treatment assignment. Using a variety of benchmark datasets for causal inference, we demonstrate that DONUT outperforms the state-of-the-art substantially.

2. PROBLEM SETUP

Our objective is to estimate the average treatment effect (ATE) of a binary treatment from observational data. For this, we build upon the Neyman-Rubin potential outcomes framework (Rubin, 2005). Consider a population in which every subject i is described by the d-dimensional covariates X_i ∈ R^d. Each subject is assigned a treatment T_i ∈ {0, 1}. The random variable Y_i(1) corresponds to the outcome under treatment, i.e., T_i = 1, whereas Y_i(0) corresponds to the outcome under no treatment, i.e., T_i = 0. These two random variables, Y_i(1), Y_i(0) ∈ R, are known as the potential outcomes. Due to the fundamental problem of causal inference, only one of the potential outcomes is observed, never both. The observed outcome is denoted by Y_i. Our aim is to estimate the average treatment effect

    ψ = E[Y(1) - Y(0)].    (1)

The following standard assumptions are sufficient for identifiability of the causal effect (Imbens & Rubin, 2015): consistency (i.e., ∀t ∈ {0, 1}: Y = Y(t) if T = t); positivity (i.e., ∀x ∈ R^d: 0 < P(T = 1 | X = x) < 1); and unconfoundedness. Unconfoundedness assumes that all confounders are measured and, hence, conditioning on them blocks all backdoor paths. This is equivalent to assuming that the potential outcomes Y(1) and Y(0) are independent of the assigned treatment T given the covariates X, i.e., Y(1), Y(0) ⊥⊥ T | X. Based on this, the ATE is equal to

    ψ = E[E[Y | X, T = 1] - E[Y | X, T = 0]].

Our task is to estimate the function f(x, t) = E[Y | X = x, T = t] for all x ∈ R^d and t ∈ {0, 1} based on observational data D = {(X_i, T_i, Y_i)}_{i=1}^n.
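Given an estimate of f, the ATE follows by averaging f(x, 1) - f(x, 0) over the empirical covariate distribution. The following is a minimal plug-in sketch on simulated data; the linear data-generating process, the noise scale, and the true effect of 2.0 are illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observational data: the covariate X confounds T and Y.
n = 5000
X = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-X))            # P(T = 1 | X)
T = rng.binomial(1, propensity).astype(float)
Y = 2.0 * T + X + rng.normal(scale=0.1, size=n)  # true ATE = 2.0

# Fit f(x, t) = E[Y | X = x, T = t] by least squares on features (1, x, t).
A = np.column_stack([np.ones(n), X, T])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Plug-in ATE: average f(x, 1) - f(x, 0) over the empirical covariates.
mu1 = np.column_stack([np.ones(n), X, np.ones(n)]) @ beta
mu0 = np.column_stack([np.ones(n), X, np.zeros(n)]) @ beta
ate_hat = float(np.mean(mu1 - mu0))
```

Because the outcome model conditions on the confounder X, the naive difference in group means is biased while the plug-in estimate recovers the true effect.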

3. ORTHOGONAL REGULARIZATION FOR ESTIMATING TREATMENT EFFECTS

The key idea of our regularization framework is to exploit the implications for the outcomes that result from unconfoundedness. For this, we formalize unconfoundedness as an orthogonality constraint, which ensures that the estimated outcomes are orthogonal to the treatment assignment. We later introduce a specific variant of our regularization framework based on neural networks, which yields DONUT.
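One way such a constraint can be operationalized is to penalize the squared sample covariance between the residualized treatment T - ê(X), where ê is a propensity estimate, and each estimated potential outcome. This is a sketch under our own assumptions, not necessarily the exact regularizer introduced later:

```python
import numpy as np

def orthogonality_penalty(t, e_hat, y0_hat, y1_hat):
    """Squared sample covariance between the residualized treatment
    t - e_hat and each estimated potential outcome.

    Driving this penalty to zero pushes the fitted outcomes toward
    satisfying Y(0), Y(1) independent of T given X. Penalizing both
    potential-outcome heads is an illustrative choice.
    """
    resid_t = t - e_hat
    cov0 = np.mean((resid_t - resid_t.mean()) * (y0_hat - y0_hat.mean()))
    cov1 = np.mean((resid_t - resid_t.mean()) * (y1_hat - y1_hat.mean()))
    return cov0 ** 2 + cov1 ** 2

def total_loss(y, t, e_hat, y0_hat, y1_hat, lam=1.0):
    """Factual mean-squared error plus the orthogonality regularizer."""
    y_pred = t * y1_hat + (1 - t) * y0_hat  # predicted observed outcome
    mse = np.mean((y - y_pred) ** 2)
    return mse + lam * orthogonality_penalty(t, e_hat, y0_hat, y1_hat)
```

The penalty vanishes when the fitted outcomes carry no linear association with the treatment beyond what the propensity explains, and grows when the outcome model leaks treatment information.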



¹ Code available at github.com/anonymous/donut (anonymized for peer review).

