NORMALIZING FLOWS FOR INTERVENTIONAL DENSITY ESTIMATION

Abstract

Existing machine learning methods for causal inference usually estimate quantities expressed via the mean of potential outcomes (e.g., average treatment effect). However, such quantities do not capture the full information about the distribution of potential outcomes. In this work, we estimate the density of potential outcomes after interventions from observational data. For this, we propose a novel, fully-parametric deep learning method called Interventional Normalizing Flows. Specifically, we combine two normalizing flows, namely (i) a teacher flow for estimating nuisance parameters and (ii) a student flow for a parametric estimation of the density of potential outcomes. We further develop a tractable optimization objective based on a one-step bias correction for an efficient and doubly robust estimation of the student flow parameters. As a result, our Interventional Normalizing Flows offer a properly normalized density estimator. Across various experiments, we demonstrate that our Interventional Normalizing Flows are expressive and highly effective, and scale well with both sample size and high-dimensional confounding. To the best of our knowledge, our Interventional Normalizing Flows are the first fully-parametric, deep learning method for density estimation of potential outcomes.

1. INTRODUCTION

Causal inference increasingly makes use of machine learning methods to estimate treatment effects from observational data (e.g., van der Laan et al., 2011; Künzel et al., 2019; Curth & van der Schaar, 2021; Kennedy, 2022). This is relevant for various fields including medicine (e.g., Bica et al., 2021), marketing (e.g., Yang et al., 2020), and policy-making (e.g., Hünermund et al., 2021). Here, causal inference from observational data promises great value, especially when experiments for determining treatment effects are costly or even unethical. The vast majority of machine learning methods for causal inference estimate averaged quantities expressed by the (conditional) mean of potential outcomes. Examples of such quantities are the average treatment effect (ATE) (e.g., Shi et al., 2019; Hatt & Feuerriegel, 2021), the individual treatment effect (ITE) (e.g., Shalit et al., 2017; Hassanpour & Greiner, 2019; Zhang et al., 2020), and treatment-response curves (e.g., Bica et al., 2020; Nie et al., 2021). Importantly, these estimates only describe averages without distributional properties. However, making decisions based on averaged causal quantities can be misleading and, in some applications, even dangerous (Spiegelhalter, 2017; van der Bles et al., 2019). On the one hand, if potential outcomes have different variances or numbers of modes, relying on averaged quantities provides incomplete information about the potential outcomes and may inadvertently lead to local, rather than global, optima during decision-making. On the other hand, distributional knowledge is needed to account for uncertainty in potential outcomes, and thus informs how likely a certain outcome is. For example, in medicine, knowing the distribution of potential outcomes is highly important (Gische & Voelkle, 2021): it gives the probability that the potential outcome lies in a desired range, and thus defines the probability of treatment success or failure.
Motivated by this, we aim to estimate the density of potential outcomes. An example highlighting the need for estimating the density of potential outcomes is shown in Fig. 1. Here, we simulated outcomes according to the structural causal model (SCM)

X ∼ 0.5 N(0; 1) + 0.5 N(3; 1),
A | X ∼ Bern( N(X; 0, 1) / (N(X; 0, 1) + N(X; b, 1)) ),
Y | X, A ∼ N( A (X² − 1.82 X + 2.0) + (1 − A) (2.18 X + 1.5); 1 ),

where N(x; µ, σ²) denotes the density of the normal distribution and b = 3 is a covariate shift that regulates the probability of treatment assignment. The potential outcomes Y[a] can be sampled by setting the treatment to a specific value in the equation for A (cf. Appendix B). At the same time, by flipping the treatment assignment in this equation, we obtain counterfactual outcomes Y[a] | A = a′. We observe that the potential outcomes have the same mean (i.e., E(Y[0]) = E(Y[1])) and the same variance (i.e., var(Y[0]) = var(Y[1])). Hence, the ground-truth ATE equals zero. Nevertheless, the distributions of potential outcomes (i.e., P(Y[a])) are clearly different. Hence, in medical practice, acting upon the ATE without knowledge of the distributions of potential outcomes could have severe, negative effects. To show this, let us consider a "do nothing" treatment (a = 0) and some medical treatment (a = 1). Further, let us consider an outcome to be successful if some risk score Y is below the threshold of five. Then, the probability of treatment success (i.e., P{Y[1] < 5.0} ≈ 0.63) is much larger than the probability of success after the "do nothing" treatment (i.e., P{Y[0] < 5.0} ≈ 0.51), highlighting the importance of treatment.

In this paper, we aim to estimate the density of potential outcomes after intervention a, i.e., P(Y[a] = y). From this point on, we refer to this task as interventional density estimation (IDE). Estimating interventional densities has several crucial advantages: it allows one to identify multi-modalities in the distribution of potential outcomes, to estimate quantiles of the distribution, and to compute the probability with which a potential outcome lies in a certain range.
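The quantities in this motivating example can be checked with a short Monte Carlo simulation of the SCM. The snippet below is a minimal sketch in plain NumPy (variable names are ours): under an intervention do(A = a), the treatment-assignment equation is bypassed, so the propensity term is not needed for sampling potential outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Covariate: mixture 0.5 * N(0, 1) + 0.5 * N(3, 1)
comp = rng.random(n) < 0.5
x = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.0, n))

# Potential outcomes: the intervention do(A = a) replaces the treatment
# assignment, so A is set to a constant in the outcome equation.
eps = rng.normal(0.0, 1.0, n)
y1 = x**2 - 1.82 * x + 2.0 + eps   # Y[1]
y0 = 2.18 * x + 1.5 + eps          # Y[0]

print(y1.mean(), y0.mean())                   # equal means, so ATE is ~0
print((y1 < 5.0).mean(), (y0 < 5.0).mean())   # success probabilities differ
```

Reusing the same noise draws for both potential outcomes is harmless here, since only the marginal distributions P(Y[a]) are of interest.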
Importantly, traditional density estimation methods are not applicable to IDE due to the fundamental problem of causal inference: counterfactual outcomes are typically never observed and, hence, a sample from the ground-truth interventional distribution is also inaccessible. In prior literature, Kennedy et al. (2021) introduced a theory for efficient semi-parametric IDE, but without a flexible algorithmic instantiation in the form of a method. Existing literature also offers some specific methods for IDE, which are either semi- or non-parametric.¹ Examples are kernel density estimation (Kim et al., 2018) and kernel mean embeddings of distributions (Muandet et al., 2021). However, both methods scale well with neither the sample size nor the dimensionality of the covariates. Furthermore, both methods have an additional, crucial limitation: the estimated densities can be unnormalized or even return negative values (which, by definition, is not possible). Fully-parametric methods, on the other hand, have several practical advantages: they automatically provide properly normalized density estimators, they allow one to sample from the estimated density, and they typically scale well to large and high-dimensional datasets. However, to the best of our knowledge, there is no fully-parametric, deep learning method for IDE. In this paper, we develop a novel, fully-parametric deep learning method: Interventional Normalizing Flows (INFs). Our INFs build upon normalizing flows (NFs) (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015), which we carefully adapt for causal inference. This requires several non-trivial adaptations. Specifically, we combine two NFs: (i) a teacher flow for estimating nuisance parameters, and (ii) a student flow for a parametric estimation of the density of potential outcomes. Here, we construct a novel, tractable optimization objective based on a one-step bias correction to allow for an efficient and doubly robust estimation.
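For intuition on why flow-based estimators are normalized by construction: a normalizing flow models a density via the change-of-variables formula, p_Y(y) = p_Z(f⁻¹(y)) |d f⁻¹(y)/dy|, for an invertible map f and a base density p_Z. The sketch below is a generic one-dimensional illustration with hypothetical flow parameters a and b, not the INF architecture itself; whatever the parameters, the resulting density integrates to one.

```python
import numpy as np

def std_normal_pdf(z):
    # Density of the standard normal base distribution Z ~ N(0, 1)
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def flow_density(y, a=0.5, b=1.0):
    """Density of Y = f(Z) = exp(a * Z + b), Z ~ N(0, 1).

    Change of variables: p_Y(y) = p_Z(f^{-1}(y)) * |d f^{-1}(y) / dy|,
    with f^{-1}(y) = (log(y) - b) / a and |d f^{-1} / dy| = 1 / (a * y).
    """
    z = (np.log(y) - b) / a
    return std_normal_pdf(z) / (a * y)

# Normalization check on a fine grid: the estimated density integrates to ~1
# for any choice of the flow parameters -- no extra normalization is needed.
y = np.linspace(1e-6, 60.0, 400_000)
mass = np.sum(flow_density(y)) * (y[1] - y[0])
print(round(mass, 3))
```

This normalization holds for any invertible f, which is precisely what kernel-based IDE estimators cannot guarantee.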
Finally, we develop a two-step training procedure to train both the teacher and the student flows. Overall, our main contributions are the following:²



¹ We distinguish the interventional distribution (i.e., P(Y [a])) and the counterfactual distribution (i.e., P(Y [a] | A = a′)), which are different in general. This can be seen by comparing plots (a) vs. (b) and (c) in Fig. 1. For further information, we refer to Appendix B.
² Code is available at https://anonymous.4open.science/r/AnonymousInterFlow-E2F3.




Figure 1: Motivating example showing the densities of the observational, interventional, and counterfactual distributions of the outcome Y. These are simulated via the structural causal model on the right (here, N(x; µ, σ²) denotes the density of the normal distribution, and b = 3 is a covariate shift that regulates the probability of treatment assignment). The potential outcomes have different distributions but the same mean E(Y [0]) = E(Y [1]) ≈ 4.77 and the same standard deviation std(Y [0]) = std(Y [1]) ≈ 4.06. Here, Y [a] is the potential outcome given treatment a. (a) Interventional distributions. (b) and (c) Observational and counterfactual distributions for the same outcomes. As shown here, the observational, interventional, and counterfactual distributions can be vastly different.

