NORMALIZING FLOWS FOR INTERVENTIONAL DENSITY ESTIMATION

Abstract

Existing machine learning methods for causal inference usually estimate quantities expressed via the mean of potential outcomes (e.g., average treatment effect). However, such quantities do not capture the full information about the distribution of potential outcomes. In this work, we estimate the density of potential outcomes after interventions from observational data. For this, we propose a novel, fully-parametric deep learning method called Interventional Normalizing Flows. Specifically, we combine two normalizing flows, namely (i) a teacher flow for estimating nuisance parameters and (ii) a student flow for a parametric estimation of the density of potential outcomes. We further develop a tractable optimization objective based on a one-step bias correction for an efficient and doubly robust estimation of the student flow parameters. As a result our Interventional Normalizing Flows offer a properly normalized density estimator. Across various experiments, we demonstrate that our Interventional Normalizing Flows are expressive and highly effective, and scale well with both sample size and high-dimensional confounding. To the best of our knowledge, our Interventional Normalizing Flows are the first fully-parametric, deep learning method for density estimation of potential outcomes.

1. INTRODUCTION

Causal inference increasingly makes use of machine learning methods to estimate treatment effects from observational data (e.g., van der Laan et al., 2011; Künzel et al., 2019; Curth & van der Schaar, 2021; Kennedy, 2022). This is relevant for various fields including medicine (e.g., Bica et al., 2021), marketing (e.g., Yang et al., 2020), and policy-making (e.g., Hünermund et al., 2021). Here, causal inference from observational data promises great value, especially when experiments for determining treatment effects are costly or even unethical. The vast majority of the machine learning methods for causal inference estimate averaged quantities expressed by the (conditional) mean of potential outcomes. Examples of such quantities are the average treatment effect (ATE) (e.g., Shi et al., 2019; Hatt & Feuerriegel, 2021), the individual treatment effect (ITE) (e.g., Shalit et al., 2017; Hassanpour & Greiner, 2019; Zhang et al., 2020), and treatment-response curves (e.g., Bica et al., 2020; Nie et al., 2021). Importantly, these estimates only describe averages without distributional properties. However, making decisions based on averaged causal quantities can be misleading and, in some applications, even dangerous (Spiegelhalter, 2017; van der Bles et al., 2019). On the one hand, if potential outcomes have different variances or numbers of modes, relying on the average quantities provides incomplete information about potential outcomes, and may inadvertently lead to local (and not global) optima during decision-making. On the other hand, distributional knowledge is needed to account for uncertainty in potential outcomes, and thus informs how likely a certain outcome is. For example, in medicine, knowing the distribution of potential outcomes is highly important (Gische & Voelkle, 2021): it gives the probability that the potential outcome lies in a desired range, and thus defines the probability of treatment success or failure.
Motivated by this, we aim to estimate the density of potential outcomes. An example highlighting the need for estimating the density of potential outcomes is shown in Fig. 1. Here, we simulated outcomes according to a given structural causal model (SCM):

$$X \sim \text{Mixture}\big(0.5\,N(0; 1) + 0.5\,N(3; 1)\big), \qquad A \sim \text{Bern}\left(\frac{N(X; 0, 1)}{N(X; 0, 1) + N(X; b, 1)}\right), \qquad Y \sim N\big(A\,(X^2 - 1.82X + 2.0) + (1 - A)\,(2.18X + 1.5);\; 1\big).$$

The potential outcomes Y[a] can be sampled by setting the treatment to a specific value in the equation for A. Under this SCM, the ground-truth ATE equals zero. Nevertheless, the distributions of potential outcomes (i. e., P(Y[a])) are clearly different. Hence, in medical practice, acting upon the ATE without knowledge of the distributions of potential outcomes could have severe, negative effects. To show this, let us consider a "do nothing" treatment (a = 0) and some medical treatment (a = 1). Further, let us consider an outcome to be successful if some risk score Y is below the threshold of five. Then, the probability of treatment success (i. e., P{Y[1] < 5.0} ≈ 0.63) is much larger than the probability of success after the "do nothing" treatment (i. e., P{Y[0] < 5.0} ≈ 0.51), highlighting the importance of treatment. In this paper, we aim to estimate the density of potential outcomes after intervention a, i. e., P(Y[a] = y). From this point on, we refer to this task as interventional density estimation (IDE). Estimating the density of interventions has several crucial advantages: it allows one to identify multi-modalities in the distribution of potential outcomes; it allows one to estimate quantiles of the distribution; and it allows one to compute the probability with which a potential outcome lies in a certain range.
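The motivating example can be checked numerically. The following is a minimal sketch, not the authors' simulation code: it samples both potential outcomes from the outcome equation of the SCM above (using only Python's standard library; the function names, sample size, and seed are illustrative assumptions) and estimates the ATE and the two success probabilities by Monte Carlo.

```python
import random

def sample_x(rng):
    # X ~ 0.5 N(0, 1) + 0.5 N(3, 1)
    mean = 0.0 if rng.random() < 0.5 else 3.0
    return rng.gauss(mean, 1.0)

def potential_outcome(x, a, rng):
    # Y[a]: fix the treatment to a in the outcome equation (unit noise variance).
    mean = (x * x - 1.82 * x + 2.0) if a == 1 else (2.18 * x + 1.5)
    return rng.gauss(mean, 1.0)

def simulate(n=200_000, seed=0):
    rng = random.Random(seed)
    xs = [sample_x(rng) for _ in range(n)]
    y0 = [potential_outcome(x, 0, rng) for x in xs]
    y1 = [potential_outcome(x, 1, rng) for x in xs]
    ate = sum(y1) / n - sum(y0) / n            # difference of potential-outcome means
    p0 = sum(y < 5.0 for y in y0) / n          # P(Y[0] < 5)
    p1 = sum(y < 5.0 for y in y1) / n          # P(Y[1] < 5)
    return ate, p0, p1

ate, p0, p1 = simulate()
print(f"ATE ~ {ate:.3f}, P(Y[0] < 5) ~ {p0:.3f}, P(Y[1] < 5) ~ {p1:.3f}")
```

The estimated ATE should be close to zero while the two success probabilities differ noticeably, which is the point the example makes: equal means do not imply equal distributions.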
Importantly, traditional density estimation methods are not applicable for IDE due to the fundamental problem of causal inference: that is, the counterfactual outcomes are typically never observed, and, hence, the sample from the ground-truth interventional distribution is also inaccessible. In prior literature, Kennedy et al. (2021) introduced a theory for efficient semi-parametric IDE estimation, but without a flexible algorithmic instantiation in the form of a method. Existing literature also offers some specific methods for IDE, which are either semi- or non-parametric. Examples are kernel density estimation (Kim et al., 2018) and kernel mean embeddings of distributions (Muandet et al., 2021). However, neither method scales well with the sample size or with the dimensionality of covariates. Furthermore, both methods have an additional, crucial limitation: estimated densities could be unnormalized or even return negative values (which, by definition, is not possible). Fully-parametric methods, on the other hand, have several practical advantages: they automatically provide properly normalized density estimators, they allow one to sample from the estimated density, and they typically scale well with large and high-dimensional datasets. However, to the best of our knowledge, there is no fully-parametric, deep learning method for IDE. In this paper, we develop a novel, fully-parametric deep learning method: Interventional Normalizing Flows (INFs). Our INFs build upon normalizing flows (NFs) (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015), which we carefully adapt for causal inference. This requires several non-trivial adaptations. Specifically, we combine two NFs: (i) a teacher flow for estimating nuisance parameters, and (ii) a student flow for a parametric estimation of the density of potential outcomes. Here, we construct a novel, tractable optimization objective based on a one-step bias correction to allow for an efficient and doubly robust estimation.
Finally, we develop a two-step training procedure to train both the teacher and the student flows. Overall, our main contributions are the following:

1. We introduce the first fully-parametric, deep learning method for interventional density estimation, called Interventional Normalizing Flows (INFs). Our INFs provide a properly normalized density estimator.
2. We derive a tractable optimization problem with a one-step bias correction for efficient and doubly robust estimation. To solve it, we propose a two-step training procedure with our INFs.
3. We demonstrate in various experiments that our INFs are highly expressive and effective. A major advantage owed to the parametric form of the student flow is that our INFs scale well to both large and high-dimensional datasets in comparison to other non- and semi-parametric methods.

2. RELATED WORK

Recently, there has been a great interest in using machine learning and, specifically, deep learning for estimating causal quantities. Examples are machine learning for estimating ATEs (e.g., Shi et al., 2019; Hatt & Feuerriegel, 2021) , ITEs (e.g., Johansson et al., 2016; Alaa & van der Schaar, 2018; Wager & Athey, 2018; Curth & van der Schaar, 2021) , and treatment-response curves (e.g., Bica et al., 2020; Schwab et al., 2020; Nie et al., 2021) . In this regard, some papers proposed uncertainty-aware methods, e. g., by using the variance of potential outcomes (Alaa & van der Schaar, 2017; Jesson et al., 2020) , or the conditional outcome distribution (Jesson et al., 2021; 2022) . However, the aforementioned works are all concerned with estimating averaged causal quantities expressed via the mean of potential outcomes. In contrast, there are only a few papers that estimate the density of outcomes after intervention. Kennedy et al. (2021) introduced a theory for efficient semi-parametric estimation. The theory also lends to a hypothetical estimator as a solution to an integral equation, namely a bias-corrected moment condition. However, the theory comes without an algorithmic instantiation in form of a method. We later adopt the theoretical framework and convert the bias-corrected moment condition into a tractable optimization objective, which we can then solve very effectively with deep learning.

2.1. INTERVENTIONAL DENSITY ESTIMATION

Table 1 lists existing methods for IDE. Importantly, these are either non-parametric or semi-parametric. Kim et al. (2018) developed a doubly robust kernel density estimation (KDE) via an efficient estimation of density functionals. Muandet et al. (2021) proposed kernel mean embeddings of distributions (DKME), which provide a non-parametric plug-in estimator. However, both methods (Kim et al., 2018; Muandet et al., 2021) have limitations. First, they do not provide a properly normalized density estimator. Hence, the estimated densities can be unnormalized or even negative, which, by definition, is not possible. Second, they do not offer direct sampling, which would allow one to sample from the estimated density without an additional algorithm. This may complicate computations of the test log-probability or empirical Wasserstein distance during evaluation. Third, both non-parametric and semi-parametric methods typically do not scale well. This is unlike fully-parametric methods, which scale well to both large and high-dimensional datasets. However, so far, there is no fully-parametric, deep learning method for IDE. The methods for IDE above (Kim et al., 2018; Muandet et al., 2021; Kennedy et al., 2021) build upon general assumptions for causal identifiability. We later adopt the same assumptions for IDE (see Section 3), and we then develop a fully-parametric, deep learning method called INFs. Our method has three favorable properties: it yields a proper density estimator, it allows for direct sampling, and it scales well.

2.2. EFFICIENT ESTIMATION

In the context of treatment effect estimation, so-called augmented inverse propensity of treatment weighted (A-IPTW) estimators were developed for efficient, semi-parametric estimation of target estimands (parameters) (Robins, 2000). Formally, A-IPTW estimation performs a first-order bias correction of plug-in models (Bickel et al., 1993; Chernozhukov et al., 2018). A-IPTW estimation also offers the property of being doubly robust, i. e., it attains fast convergence rates even if one of the nuisance parameter estimators converges slowly (Kennedy, 2020). Although Kennedy et al. (2021) formulated an integral equation for semi- and fully-parametric efficient IDE estimation (see Eq. 19 therein), no flexible algorithmic instantiations in the form of a method have been implemented so far. Later, we reformulate this equation as a tractable optimization problem, and, thereby, turn our INFs into an efficient and doubly robust IDE estimator.

2.3. NORMALIZING FLOWS

Normalizing flows were introduced for expressive variational approximations in variational autoencoders (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015). One practical benefit of NFs is that they yield universal density approximators (Dinh et al., 2014; 2017; Huang et al., 2018; Durkan et al., 2019). Furthermore, NFs can be leveraged for conditional density estimation (e. g., via so-called hypernetworks (Trippe & Turner, 2018)). Normalizing flows were used for causal inference, but in a different setting from ours (see Appendix A). We provide a background on normalizing flows in Appendix B.

Research gap: Existing methods for IDE are either non- or semi-parametric. To the best of our knowledge, our work is the first to propose a fully-parametric, deep learning method for IDE.

3. SETUP: INTERVENTIONAL DENSITY ESTIMATION

Notation. Let P(Z) be a distribution of a random variable Z, and let P(Z = z) be its density or probability mass function. Let $\pi_a(x) = P(A = a \mid X = x)$ denote the propensity score. Further, 1(·) is the indicator function; $\mathbb{P}_n\{f(X)\} = \frac{1}{n}\sum_{i=1}^{n} f(X_i)$ is the sample average of a random variable f(X); and $\mathbb{P}^{B}_{b}\{f(X)\}$ is the average evaluated on a minibatch B of size b. For readability, we sometimes highlight random variables and the corresponding averaging operator in green color. Furthermore, P(Y | X, A) is the conditional distribution of the outcome Y.

Problem statement. In this work, we aim at estimating the interventional density from observational data, namely P(Y[a] = y). To compare the goodness-of-fit of different estimators, we evaluate the distributional distance between the ground-truth interventional density and the estimated density. Such distributional distances include, e.g., the average log-probability and the empirical Wasserstein distance. We build upon the standard setting of the potential outcomes framework (Rubin, 1974), where Y[a] stands for the potential outcome after intervening on treatment by setting it to a. That is, we consider an observational sample $\mathcal{D}$ with $d_X$-dimensional covariates $X \in \mathcal{X} \subseteq \mathbb{R}^{d_X}$, a treatment $A \in \{0, 1\}$, and a $d_Y$-dimensional continuous outcome $Y \in \mathcal{Y} \subseteq \mathbb{R}^{d_Y}$, drawn i.i.d. We consider $d_Y = 1$ if not stated explicitly. We assume the treatment to be binary, but note that our INFs also work with categorical treatments. We denote $\mathcal{D} = \{X_i, A_i, Y_i\}_{i=1}^{n} \sim P(X, A, Y)$, where n is the sample size, and i is the index of an observation. For example, in critical care, the patient covariates X are different risk factors (e.g., age, gender, weight, prior diseases), the treatment is whether a ventilator is applied, and the outcome is the probability of patient survival. The covariates X are also called confounders if $P(Y[a]) \neq P(Y \mid A = a)$.

Identifiability.
To identify the interventional density, we make the following identifiability assumptions with respect to the data-generating mechanism of $\mathcal{D}$: (1) Positivity: For some $\epsilon > 0$, $P\{1 - \epsilon \geq \pi_a(X) \geq \epsilon\} = 1$. (2) Consistency: If A = a for some patient, then Y = Y[a]. (3) Exchangeability: $A \perp\!\!\!\perp Y[a] \mid X$ for all a. Note that these assumptions are standard in the literature (Kim et al., 2018; Kennedy et al., 2021; Muandet et al., 2021). Under assumptions (1)-(3), the density of the interventional distribution P(Y[a]) can be expressed in terms of the observational distribution with back-door adjustment, i.e.,

$$P(Y[a] = y) = \int_{x \in \mathcal{X}} P(Y = y \mid X = x, A = a)\, P(X = x)\, dx = E_{X \sim P(X)}\big[P(Y = y \mid X, A = a)\big],$$

where P(Y = y | X, A) is the conditional density of the outcome. For more details on the potential outcomes framework and identifiability, we refer to Appendix B.

Plug-in estimator. A straightforward approach for IDE (Robins & Rotnitzky, 2001) is the following: first, one estimates the conditional outcome distribution, $\hat{P}(Y \mid X, A)$ (here, any method for conditional density estimation could be used). Then, one takes a sample average over covariates X:

$$\hat{P}^{PI}(Y[a] = y) = \mathbb{P}_n\big\{\hat{P}(Y = y \mid X, A = a)\big\}. \qquad (2)$$

This estimator is an unbiased but inefficient estimator of the interventional density, which is known as the semi-parametric plug-in estimator. Semi-parametric IDE, unlike, e. g., semi-parametric ATE estimation, is highly problematic. For large sample sizes, the semi-parametric estimator requires averaging over the full sample for each evaluation point. Motivated by this, we aim to develop a fully-parametric estimator.
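The plug-in estimator from Eq. (2) can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: in place of a learned conditional density estimator, the sketch reuses the known outcome equation of the SCM from the introduction as a stand-in for the estimated conditional density, and all function names are hypothetical.

```python
import math
import random

def normal_pdf(y, mean, std=1.0):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def cond_density(y, x, a):
    # Stand-in (assumption) for the estimated conditional density
    # P_hat(Y = y | X = x, A = a): the outcome equation of the SCM in Section 1.
    mean = (x * x - 1.82 * x + 2.0) if a == 1 else (2.18 * x + 1.5)
    return normal_pdf(y, mean)

def plug_in_density(y, xs, a):
    # Eq. (2): average the conditional density over ALL observed covariates.
    # Each evaluation point y therefore costs O(n).
    return sum(cond_density(y, x, a) for x in xs) / len(xs)

rng = random.Random(0)
xs = [rng.gauss(0.0 if rng.random() < 0.5 else 3.0, 1.0) for _ in range(1000)]

# Riemann sum over a wide grid: the plug-in estimate should integrate to ~1.
h = 0.2
grid = [-20.0 + h * j for j in range(451)]  # covers [-20, 70]
mass = {a: h * sum(plug_in_density(y, xs, a) for y in grid) for a in (0, 1)}
print(mass)
```

The inner loop over `xs` makes the cost visible: every single density evaluation touches the full sample, which is the scalability issue that motivates a fully-parametric estimator.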

4. THEORETICAL BACKGROUND FOR FULLY-PARAMETRIC IDE

In this section, we introduce a theory for fully-parametric estimation of interventional density. First, we describe a parametric plug-in estimator as a solution to the moment condition (Kennedy et al., 2021). We call this estimator the covariate-adjusted estimator. Second, we develop a one-step bias correction for efficient estimation.

We [...]

$$\beta_a = \arg\min_{\beta_a} E_{Y^a \sim P(Y[a])}\big[-\log g(Y^a; \beta_a)\big], \qquad (3)$$

where $\beta_a$ are called projection parameters as they project the true interventional density onto a class $\{g(\cdot\,; \beta_a);\ \beta_a \in \mathbb{R}^d\}$. [...]

$$m(\beta_a) = E_{Y^a \sim P(Y[a])}\big[T(Y^a; \beta_a)\big] = E_{X \sim P(X)}\Big[E\big[T(Y; \beta_a) \mid X, A = a\big]\Big]. \qquad (4)$$

Here, the moment condition is the expected score function of the potential outcome. Throughout the paper, we assume that the moment condition has a unique solution, and, therefore, the minimization task in Eq. (3) and the root-finding task in Eq. (4) are equivalent. In practice, we have neither observations from the interventional distribution nor counterfactual outcomes. Therefore, we cannot use the ground-truth P(Y[a]) but, instead, must use the plug-in estimator distribution from Eq. (2). Specifically, we can obtain a plug-in estimator of projection parameters, i. e., $\hat{\beta}^{PI}_a$, either by minimizing a cross-entropy loss or by solving the moment condition, both of which are equivalent:

$$\hat{\beta}^{PI}_a = \arg\min_{\beta_a} E_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\big[-\log g(Y^a; \beta_a)\big] \iff \hat{m}^{PI}(\beta_a) = E_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\big[T(Y^a; \beta_a)\big] \overset{!}{=} 0. \qquad (5)$$

Then, we can define a parametric covariate-adjusted (CA) estimator as $\hat{P}^{CA}(Y[a] = y) = g(y; \hat{\beta}^{PI}_a)$. By choosing a sufficiently expressive class of densities for both g and the conditional density estimator $\hat{P}(Y \mid X, A)$ (e. g., normalizing flows), CA can be shown to consistently estimate the interventional density (see Appendix B.5 in Kennedy et al. (2021)).

4.2. EFFICIENT ESTIMATION VIA ONE-STEP BIAS CORRECTION

In the following, we aim to develop an efficient estimator of the projection parameter $\beta_a$ from Eq. (3) or, equivalently, the moment condition $m(\beta_a)$ at fixed $\beta_a$ from Eq. (4). For this, we make use of semi-parametric efficiency theory (van der Laan & Robins, 2003; Kennedy et al., 2021). We provide a background on efficiency theory in Appendix B. Kennedy (2022) showed that the efficient influence function $\phi_a(T; P)$ for the functional $E(E(T \mid X, A = a))$ equals

$$\phi_a(T; P) = \frac{1(A = a)}{\pi_a(X)}\Big(T - E(T \mid X, A = a)\Big) + E(T \mid X, A = a) - E_{X \sim P(X)}\big(E(T \mid X, A = a)\big). \qquad (6)$$

Here, we use red color to show the nuisance parameters of P that are influencing the functional. We emphasize that the nuisance parameters (i. e., the propensity score and conditional expectations/probabilities) can be either known or estimated. The efficient influence function in Eq. (6) allows us to construct an efficient estimator of the moment condition. Following Kennedy et al. (2021), we transform the plug-in estimator $\hat{m}^{PI}(\beta_a)$ from Eq. (5) into an efficient estimator with the help of a one-step bias correction. In our case, the bias-corrected moment condition has the following form:

$$\hat{m}^{A\text{-}IPTW}(\beta_a) = \hat{m}^{PI}(\beta_a) + \mathbb{P}_n\big\{\phi_a(T(Y; \beta_a); \hat{P})\big\} \overset{!}{=} 0,$$

where $\hat{P} = \{\hat{\pi}_a(x), \hat{P}(Y \mid X, A)\}$ are the estimated nuisance parameters of P. The estimated nuisance parameters are simultaneously used for plug-in estimation of the moment condition. We call the solution of the bias-corrected moment equation, $\hat{\beta}^{A\text{-}IPTW}_a$, an augmented inverse propensity of treatment weighted (A-IPTW) estimator of the projection parameters. Unlike the CA estimator, the A-IPTW estimator achieves efficiency and possesses a double robustness property. We now transform the bias-corrected moment condition into the following tractable optimization task (see Appendix C for details):

$$\hat{\beta}^{A\text{-}IPTW}_a = \arg\min_{\beta_a} \Bigg[ \underbrace{E_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\big[-\log g(Y^a; \beta_a)\big]}_{\text{cross-entropy loss}} + \underbrace{\mathbb{P}_n\bigg\{\frac{1(A = a)}{\hat{\pi}_a(X)}\Big(-\log g(Y; \beta_a) + E_{Y \sim \hat{P}(Y \mid X, A = a)}\big[\log g(Y; \beta_a)\big]\Big)\bigg\}}_{\text{one-step bias correction}} \Bigg].$$
Previously, Kennedy et al. (2021) proposed to directly solve the bias-corrected moment condition, i. e., a system of nonlinear equations, which is in general much harder to solve, even computationally. In contrast, we develop an optimization objective that can be directly incorporated into the loss of a deep learning density estimator.
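As a short worked check of the bias-correction term in the objective above, consider the idealized case where both nuisance parameters are exact. Writing $h(Y) = -\log g(Y; \beta_a)$ as shorthand (a symbol introduced here only for illustration), the correction term has zero conditional mean given X, so in expectation it leaves the cross-entropy part of the objective unchanged:

```latex
\begin{aligned}
E\!\left[\frac{1(A = a)}{\pi_a(X)}\Big(h(Y) - E\big[h(Y) \mid X, A = a\big]\Big) \,\Big|\, X\right]
&= \frac{P(A = a \mid X)}{\pi_a(X)}\Big(E\big[h(Y) \mid X, A = a\big] - E\big[h(Y) \mid X, A = a\big]\Big) \\
&= 0 .
\end{aligned}
```

The first equality uses $E[1(A = a)\, h(Y) \mid X] = P(A = a \mid X)\, E[h(Y) \mid X, A = a]$. When the nuisance estimates are imperfect, the term is no longer zero on average, and this is where the correction contributes.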

5. INTERVENTIONAL NORMALIZING FLOWS

In the following, we describe our Interventional Normalizing Flows: a fully-parametric method for interventional density estimation via deep learning. First, we describe all the components of our architecture and, then, introduce an efficient estimation using one-step bias correction.

5.1. COMPONENTS

In our INFs, we combine two normalizing flows, which we refer to as (i) teacher flow and (ii) student flow (see Fig. 2). The rationale for this is based on our derivations in Section 4, according to which a fully-parametric IDE requires two models: (i) one for the estimation of nuisance parameters, and (ii) one for the subsequent optimization of the learning objective with respect to projection parameters. Accordingly, the two NFs in our INFs have different objectives: (i) the teacher flow estimates the nuisance parameters (i.e., the propensity score and the conditional outcome distribution); and (ii) the student flow uses the estimated nuisance parameters to estimate the projection parameters.

(i) Teacher flow. The teacher flow has three components: two fully-connected (FC) subnetworks and a conditional normalizing flow parameterized by θ. The first FC subnetwork ($FC_1$) takes the covariates X as input and, then, outputs a representation $R \in \mathbb{R}^r$ together with a propensity score $\hat{\pi}_a(X)$. The second FC subnetwork ($FC_2$) takes the representation R and the observed treatment A as input and, then, outputs the parameters of the flow, conditioned on X and A, i. e., θ(X, A). Together, $FC_1$ and $FC_2$ form a so-called hypernetwork (Ha et al., 2017) for the conditional normalizing flow, which allows us to learn the conditional outcome distribution via back-propagation. Let $L_t$ be the loss of the teacher flow. Here, we combine a conditional negative log-likelihood ($L_{NLL}$) and a binary cross-entropy loss for the propensity score ($L_\pi$), i.e.,

$$L_t(\hat{P}, \hat{\pi}_a) = \mathbb{P}_n\{L_{NLL} + \alpha L_\pi\} \quad \text{with} \quad L_{NLL} = -\log \hat{P}(Y = Y \mid X, A); \quad L_\pi = \text{BCE}(\hat{\pi}_A(X), A),$$

where α > 0 is a hyperparameter. In general, conditional normalizing flows are prone to overfitting when trained via a conditional negative log-likelihood. To address this, we later employ noise regularization (Rothfuss et al., 2019) in the conditional density estimation.

(ii) Student flow.
The student flow uses the outputs of the teacher flow and then learns the interventional distribution. We first describe the naïve variant of the student flow without one-step bias correction (we introduce this later in Section 5.2). Different from the conditional normalizing flow in the teacher flow, the student flow is a non-conditional normalizing flow, parameterized by $\beta_a$. Specifically, we consider two separate normalizing flows, that is, one for each potential outcome (i.e., a = 0 and a = 1, respectively). To fit the student flow, we must solve the moment condition from Eq. (5) or, equivalently, minimize a cross-entropy loss ($L_{CE}$). Here, we use a tractable approximation via numeric integration:

$$L_{CE}(\beta_a) = E_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\big[-\log g(Y^a; \beta_a)\big] = -\int_{y \in \mathcal{Y}} \log g(y; \beta_a)\, \mathbb{P}_n\{\hat{P}(Y = y \mid X, A = a)\}\, dy \approx \begin{cases} -h \sum_{j=1}^{K} \log g(y_j; \beta_a)\, \mathbb{P}_n\{\hat{P}(Y = y_j \mid X, A = a)\}, & \text{if } d_Y = 1, \\ -\mathbb{P}_K\{\log g(Y^a; \beta_a)\}, & \text{if } d_Y > 1, \end{cases} \qquad (9)$$

where $y_{\min} \leq y_1 < \dots < y_K \leq y_{\max}$ is an equidistant grid of points on $\mathcal{Y}$ with step size h, and $\{Y^a_j\}_{j=1}^{K}$ is an i.i.d. sample drawn from $\mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}$.

Training. To train both components in our INFs, we make use of a two-step training procedure. Specifically, we first fit the nuisance parameters with the teacher flow. Then, we freeze the parameters of the teacher flow and fit the student flow. We additionally employ an exponential moving average (EMA) of the student parameters with a smoothing hyperparameter γ to stabilize the training for small minibatch sizes (Polyak & Juditsky, 1992). We show the full algorithm in Appendix D and further implementation details in Appendix E.

Inference time. One main advantage of our teacher-student model is that the student flow has constant inference time (e.g., during the evaluation phase). Hence, contrary to state-of-the-art baselines, the inference time of our INFs does not depend on the dimensionality of covariates or the size of the training data. This is a major advantage over semi-parametric plug-in estimators.
For a detailed runtime comparison, we refer to Appendix K. As such, the student flow allows our method to scale well to large datasets as in medicine (Johnson et al., 2016) .
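The exponential moving average used to stabilize student training can be illustrated with a small sketch. The exact update rule is given in the paper's Appendix D, which is not reproduced here, so the common form below (new average = γ · old average + (1 − γ) · current parameters) is an assumption, as are the function name and the toy values.

```python
def ema_update(ema_params, params, gamma):
    # Assumed standard EMA form: gamma close to 1 gives slow, smooth updates.
    return [gamma * e + (1.0 - gamma) * p for e, p in zip(ema_params, params)]

# Toy run: noisy iterates of a single parameter around 1.0 (hypothetical values).
iterates = [[1.3], [0.7], [1.2], [0.8], [1.1], [0.9]]
ema = list(iterates[0])
for params in iterates[1:]:
    ema = ema_update(ema, params, gamma=0.9)
print(ema)
```

In such a scheme the averaged parameters, rather than the last noisy iterate, would be used at evaluation time.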

5.2. ONE-STEP BIAS CORRECTION

To provide an efficient estimation for the parameters of the student flow, we augment the cross-entropy loss (Eq. (9)) with a one-step bias correction. To evaluate the bias correction term, we need to compute an approximation of the conditional cross-entropy loss ($L_{CCE}(X; \beta_a)$) as in Eq. (9). We thus compute

$$L_{CCE}(X; \beta_a) = E_{Y \sim \hat{P}(Y \mid X, A = a)}\big[-\log g(Y; \beta_a)\big] \approx \begin{cases} -h \sum_{j=1}^{K} \log g(y_j; \beta_a)\, \hat{P}(Y = y_j \mid X, A = a), & \text{if } d_Y = 1, \\ -\mathbb{P}_K\{\log g(Y^{X,a}; \beta_a)\}, & \text{if } d_Y > 1, \end{cases}$$

where $y_{\min} \leq y_1 < \dots < y_K \leq y_{\max}$ is an equidistant grid of points on $\mathcal{Y}$ with step size h, and $\{Y^{X,a}_j\}_{j=1}^{K}$ is an i.i.d. sample drawn from $\hat{P}(Y \mid X, A = a)$. Finally, we obtain the loss of the student flow ($L_s$), which is now suitable for our A-IPTW estimation from Eq. (9). This yields

$$L_s(\beta_a) = L_{CE}(\beta_a) + \mathbb{P}_n\bigg\{\frac{1(A = a)}{\hat{\pi}_a(X)}\Big(-\log g(Y; \beta_a) - L_{CCE}(X; \beta_a)\Big)\bigg\}.$$
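The student loss for the one-dimensional case can be sketched as follows. This is a simplified illustration, not the authors' implementation: the student g is a plain Gaussian instead of a normalizing flow, the nuisance functions are the true ones of a hypothetical toy data-generating process (X ~ N(0, 1), P(A = 1 | X) = sigmoid(X), Y ~ N(X + A, 1)) instead of teacher-flow outputs, and all names are assumptions.

```python
import math
import random

def normal_pdf(y, mean, std):
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def student_loss(data, grid, h, cond_density, propensity, log_g, a):
    """L_s = L_CE + P_n{ 1(A = a) / pi_a(X) * (-log g(Y) - L_CCE(X)) }, d_Y = 1."""
    n = len(data)
    log_g_grid = [log_g(y) for y in grid]
    # L_CCE(X_i): grid approximation of E_{Y ~ P_hat(Y | X_i, A = a)}[-log g(Y)].
    l_cce = [
        -h * sum(lg * cond_density(y, x, a) for lg, y in zip(log_g_grid, grid))
        for x, _, _ in data
    ]
    # The grid sum is linear in the density, so L_CE is the sample mean of L_CCE.
    l_ce = sum(l_cce) / n
    # One-step bias correction: only units that received treatment a contribute.
    correction = sum(
        (-log_g(y_obs) - c) / propensity(x, a)
        for (x, a_obs, y_obs), c in zip(data, l_cce)
        if a_obs == a
    ) / n
    return l_ce + correction

def propensity(x, a):  # true propensity of the toy process (assumption)
    p1 = 1.0 / (1.0 + math.exp(-x))
    return p1 if a == 1 else 1.0 - p1

def cond_density(y, x, a):  # true conditional density of the toy process
    return normal_pdf(y, x + a, 1.0)

def gaussian_log_g(mu, sigma):  # Gaussian student in place of a flow
    return lambda y: -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
data = []
for _ in range(1000):
    x = rng.gauss(0.0, 1.0)
    a_obs = 1 if rng.random() < propensity(x, 1) else 0
    data.append((x, a_obs, rng.gauss(x + a_obs, 1.0)))

h = 0.1
grid = [-9.0 + h * j for j in range(201)]  # covers [-9, 11]
# In the toy process Y[1] ~ N(1, 2), so the first student matches the truth.
loss_good = student_loss(data, grid, h, cond_density, propensity, gaussian_log_g(1.0, math.sqrt(2.0)), a=1)
loss_bad = student_loss(data, grid, h, cond_density, propensity, gaussian_log_g(3.0, math.sqrt(2.0)), a=1)
print(loss_good, loss_bad)
```

The student whose parameters match the true interventional density attains the lower loss. In an actual training loop, `log_g` would be the log-density of the student flow and the loss would be minimized by gradient descent over minibatches.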

6. EXPERIMENTS

6.1. OVERVIEW

To show the effectiveness of our INFs, we use established (semi-)synthetic datasets that have been previously used for treatment effect estimation (Shi et al., 2019; Curth & van der Schaar, 2021). The benefit of (semi-)synthetic datasets is that both factual and counterfactual outcomes are available (i.e., $Y^f_i$ and $Y^{cf}_i$). Therefore, we can obtain a sample from the ground-truth interventional distribution, i. e., $Y[a]_i = 1(A_i = a)\, Y^f_i + 1(A_i \neq a)\, Y^{cf}_i$, which we can then use for IDE benchmarking.

Evaluation metric. We use the average log-probability as our standard metric for comparing density estimators. It is given by

$$\text{log-prob}_{\mathcal{D}} = \frac{1}{n} \sum_{i=1}^{n} \log \hat{P}(Y[a] = Y[a]_i),$$

where higher values indicate a better fit. In expectation, the average log-probability is upper-bounded by the negative entropy of the ground-truth interventional distribution, which, in general, is different for each potential outcome. Therefore, we separately report the results for each potential outcome. Of note, maximizing the average log-probability is equivalent to minimizing the empirical KL-divergence.

Baselines. We use state-of-the-art IDE baselines (see Sec. 2.1): (1) an extended TARNet (TARNet*) (Shalit et al., 2017) estimating the mean of a conditional homoscedastic normal distribution; (2) mixture density networks (MDNs) (Bishop, 1994); (3) conditional normalizing flow (CNF) (Trippe & Turner, 2018); (4) kernel density estimation (KDE) (Kim et al., 2018); and (5) distributional kernel mean embeddings (DKME) (Muandet et al., 2021). TARNet*, MDNs, and CNF are semi-parametric plug-in estimators (see Eq. (2)). Importantly, KDE and DKME do not guarantee a proper density estimation (unlike our INFs). We thus performed an additional re-normalization and clipping of negative values, so that we can use the average log-probability as an evaluation metric. Details on the baselines are in Appendix F, and hyperparameter tuning is reported in Appendix G.

Ablation studies.
We compare three variants of our INFs: (1) INFs (main): Our INFs as introduced above using A-IPTW estimation. (2) INFs w/o stud flow: A simplified variant which uses only the conditional density estimation from the teacher flow as a semi-parametric plug-in estimator, and thus without student flow. This variant is identical to the CNF baseline. (3) INFs w/o bias corr: We use the covariate-adjusted fully-parametric estimator, where the student flow only uses the cross-entropy loss from Eq. ( 9) but without one-step bias correction. The ablations have the same hyperparameters as our main method for better comparability.
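The average log-probability metric from the evaluation paragraph above is simple to compute once samples from the ground-truth interventional distribution are available. The sketch below is illustrative only: it uses Gaussian candidates in place of fitted density estimators, and the names and toy distributions are assumptions.

```python
import math
import random

def average_log_prob(samples, log_density):
    # (1/n) * sum_i log P_hat(Y[a] = Y[a]_i); higher values indicate a better fit.
    return sum(log_density(y) for y in samples) / len(samples)

def normal_logpdf(mean, std):
    return lambda y: -0.5 * ((y - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
# Stand-in for a ground-truth interventional sample: Y[a] ~ N(0, 1).
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]

score_true = average_log_prob(samples, normal_logpdf(0.0, 1.0))     # well-specified
score_shifted = average_log_prob(samples, normal_logpdf(1.0, 1.0))  # misspecified mean
print(score_true, score_shifted)
```

The well-specified candidate scores close to the negative entropy of N(0, 1), roughly −1.42, and the misspecified one scores lower.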

6.2. RESULTS

Synthetic data. We generate synthetic data using the SCM ($d_X = 1$) from Fig. 1. Here, we vary the covariate shift b, which controls the overlap between the treated and non-treated population. Notably, low values of b correspond to the case where both populations are similar or the same, while high values of b result in the violation of the positivity assumption. Further details on the synthetic dataset are provided in Appendix H. Fig. 3 shows the results. Our INFs achieve clear performance improvements over the baselines, especially for larger b. Moreover, the ablation studies confirm that our proposed deep learning architecture with one-step bias correction is superior. In Appendix I, we additionally provide a two-dimensional benchmark, where our INFs prove their effectiveness.

IHDP dataset. The Infant Health and Development Program (IHDP) (Hill, 2011) is a semi-synthetic dataset with two synthetic potential outcomes generated from real-world medical covariates (n = 747, $d_X = 25$, see details in Appendix H). Here, we used ten train/test splits (90%/10%) and performed hyperparameter tuning based on the first split. Results are in Table 2. TARNet* is known to entail a ground-truth conditional distribution model and should thus not be interpreted as a baseline but as an upper performance bound. Our INFs reach an equally good performance and, importantly, outperform all the other baselines for both potential outcomes. The ablation study again confirms that our main INFs are superior to the other variants without the student flow and without bias correction. In Appendix J, we repeat the evaluation using the empirical Wasserstein distance with similar findings.

HC-MNIST dataset. The hidden confounding MNIST (HC-MNIST) dataset is a semi-synthetic dataset (Jesson et al., 2021), constructed on top of the canonical image dataset of handwritten digits (MNIST) (LeCun, 1998). To satisfy the exchangeability assumption, we add a hidden confounder to the set of all covariates, i. e., 28x28 images ($d_X = 784 + 1$). For dataset details, see Appendix H. For our experiments, we use only the train subset of the original MNIST (n = 42,000). Here, we used ten random train/test splits (80%/20%) and tuned hyperparameters on the first split. Table 4 shows the results of the experiments. Note that the non- and semi-parametric baselines suffer from scalability issues and were thus excluded. Further, our INFs outperform the variant without a bias correction, i. e., the only other available baseline.

Scalability. Experiments with the ACIC 2018 and HC-MNIST datasets showed high effectiveness of our INFs for datasets with large sample sizes (n > 25,000) and with high-dimensional covariates ($d_X > 100$). We provide a runtime comparison in Appendix K. For HC-MNIST, non- and semi-parametric methods even become completely impractical due to memory and time constraints. Importantly, this is a major advantage of our fully-parametric IDE estimator, INFs, over semi-parametric plug-in estimators and other baselines.

Case study. We performed a case study using data from California's tobacco control program to estimate its effect on tobacco sales. Previously, the evidence was primarily based on point estimates without information on the interventional density (Abadie et al., 2010). Our INFs suggest that the program led to a large reduction in tobacco sales.

Discussion. Interestingly, both components are important for the final performance (see our ablation studies). First, the teacher flow with the help of noise regularization performs consistent estimation of the nuisance parameters. Second, the student flow uses estimated nuisance parameters to solve the optimization objective. The student flow is not redundant but crucial for computational performance. While simple NFs have a similar estimation performance in terms of goodness-of-fit, only our INFs have constant inference time (e.g., during the evaluation phase regardless of the data size). This is a major advantage of parametric treatment effect estimators over semi-parametric plug-in estimators.

A RELATED WORK: NORMALIZING FLOWS FOR CAUSAL INFERENCE

NFs have been used in the wider area of causal inference, yet in vastly different tasks than ours. Examples include robust prediction by employing causal mechanisms (Müller et al., 2021); combining interventional and observational datasets (Ilse et al., 2021); and causal discovery (Brouillard et al., 2020). Further, several works aim to model Bayesian networks or structural causal models (SCMs) with known or unknown causal diagrams. For example, NFs were used as a probabilistic model for Bayesian networks aimed at causal discovery, as well as downstream interventional and counterfactual inference (Khemakhem et al., 2021; Wang et al., 2021; Wehenkel & Louppe, 2021). Balgi et al. (2022) build upon a temporal SCM with exogenous noise, where NFs are used for interventional and counterfactual queries. Importantly, all the aforementioned methods assume continuous variables in SCMs and independence of exogenous noise.^6 Hence, these methods are not applicable in our case, which considers semi-Markovian SCMs and which is thus a different inference task.^7 In sum, NFs have not yet been adapted to IDE, which is our novelty.

B BACKGROUND MATERIALS

B.1 NORMALIZING FLOWS

Normalizing flows (NFs) (Tabak & Vanden-Eijnden, 2010; Rezende & Mohamed, 2015) are flexible probabilistic models with a tractable density. A normalizing flow describes the change of the density of a continuous random variable after applying a sequence of invertible transformations. Given a random variable $Z$ with some known density $P(Z = \cdot)$, e.g., normal or uniform, we define a transformed variable

$$X = t(Z), \qquad Z \sim P(Z), \tag{11}$$

where $t(\cdot) : \mathcal{Z} \to \mathcal{X}$ denotes an invertible forward transformation with inverse $t^{-1}(\cdot) : \mathcal{X} \to \mathcal{Z}$. Importantly, the transformation is defined between spaces of the same dimensionality, $d_Z = d_X$. To find the distribution of $X$, we can apply the multivariate change of variables formula

$$P(X = x) = P(Z = z) \left| \det \frac{dZ}{dX} \right| = P(Z = t^{-1}(x)) \left| \det \frac{dt^{-1}}{dX}(x) \right|, \tag{12}$$

where $\det \frac{dt^{-1}}{dX}(x)$ is the Jacobian determinant of the inverse transformation $t^{-1}(\cdot)$. Then, using the inverse function theorem, we obtain

$$\frac{dt^{-1}}{dX} = \left( \frac{dt}{dZ} \right)^{-1}, \tag{13}$$

so that the Jacobian of the inverse transformation can be substituted with the inverse Jacobian of the forward transformation. Using the properties of the determinant, Eq. (12) can be simplified to

$$P(X = x) = P(Z = t^{-1}(x)) \left| \det \frac{dt}{dZ}\big(t^{-1}(x)\big) \right|^{-1}. \tag{14}$$

The name normalizing comes from the fact that any regular continuous distribution $X$ can be transformed to a normal $Z$ with a specific $t^{-1}(\cdot)$. We can construct arbitrarily complex densities by applying a composition of $K$ transformations $t_1, t_2, \ldots, t_K$:

$$X = Z_K = t_K(Z_{K-1}) = t_K(t_{K-1}(Z_{K-2})) = \ldots = t_K \circ \ldots \circ t_1(Z_0), \tag{15}$$

where $Z_0$ is called the base distribution. One calls this chain of transformations a flow. Finally, the density of $X$ can be found recursively as

$$P(Z_K = z_K) = P(Z_{K-1} = z_{K-1}) \left| \det \frac{dt_K}{dZ_{K-1}}(z_{K-1}) \right|^{-1} = P(Z_0 = z_0) \prod_{k=1}^{K} \left| \det \frac{dt_k}{dZ_{k-1}}(z_{k-1}) \right|^{-1}, \tag{16}$$

where $z_0, z_1, \ldots, z_K$ are found via Eq. (15).

Consequently, we can directly evaluate the log-likelihood of an observation $X_i = Z_{K,i}$ and, with a proper parametrization of the transformations, back-propagate through it. Examples of simple transformations include affine, planar, and radial transformations (Rezende & Mohamed, 2015).
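As a small illustration of Eqs. (14)-(16), the following sketch evaluates the log-density of a one-dimensional flow built from affine layers with a standard normal base distribution. The function names and the layer parameters are hypothetical and chosen only for this example; the neural spline flows used in the paper are not reproduced here.

```python
import math

def std_normal_logpdf(z):
    """Log-density of the base distribution Z_0 ~ N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_density(x, layers):
    """log P(X = x) for X = t_K(... t_1(Z_0)) with affine layers t_k(z) = scale * z + shift.

    `layers` lists (scale, shift) in the order in which they are applied to Z_0.
    """
    z = x
    log_det_sum = 0.0
    # Invert the layers from last to first and accumulate log|det dt_k/dz| = log|scale|.
    for scale, shift in reversed(layers):
        z = (z - shift) / scale
        log_det_sum += math.log(abs(scale))
    # Eq. (16) in log-space: base log-density minus the sum of log-determinants.
    return std_normal_logpdf(z) - log_det_sum

# Hypothetical two-layer flow: x = 0.5 * (2 * z + 1) - 3 = z - 2.5, i.e., X ~ N(-2.5, 1).
layers = [(2.0, 1.0), (0.5, -3.0)]
log_p = flow_log_density(-2.5, layers)
```

For affine layers the result can be checked in closed form, since the composed flow is again a normal distribution; richer layers only change how the inverse and the log-determinant are computed.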

B.2 CAUSAL MODEL AND IDENTIFICATION

In this section, we provide a brief background on the underlying causal model in this paper, using both the potential outcomes and the structural causal model framework. These frameworks are equivalent in the sense that they both allow for identification of the interventional density and yield the same statistical estimand.

Potential outcomes framework. The observed variables in our model are covariates $X \in \mathcal{X} \subseteq \mathbb{R}^{d_X}$, a treatment $A \in \{0, 1\}$, and a $d_Y$-dimensional continuous outcome $Y \in \mathcal{Y} \subseteq \mathbb{R}^{d_Y}$. In the main paper, we used the potential outcomes framework (Rubin, 1974) to define the causal estimands. In particular, we defined $Y[a]$ as the potential outcome after intervening on the treatment by setting it to $a$. Imposing Assumptions (1)-(3) from Section 3 allows us to identify the interventional density (our causal estimand) via

$$P(Y[a] = y) = \int_{\mathcal{X}} P(Y = y \mid X = x, A = a)\, P(X = x)\, dx = \mathbb{E}_{X \sim P(X)}\, P(Y = y \mid X, A = a). \tag{17}$$

SCM framework. Equivalently to the potential outcomes framework, we can also define the interventional density within the structural causal model (SCM) framework (Pearl, 2009; Bareinboim et al., 2022). More precisely, we can define a (semi-Markovian) SCM by introducing independent exogenous latent variables $U_A \sim P(U_A)$ and $U_{XY} \sim P(U_{XY})$, and the functional assignments $X := f_X(U_{XY})$, $A := f_A(X, U_A)$, and $Y := f_Y(X, A, U_{XY})$. Here, $X$, $A$, and $Y$ are observed endogenous variables, satisfying Assumptions (1)-(3). We show a corresponding causal graph in Fig. 4.

Figure 4: Causal graph corresponding to the potential outcome framework assumptions.

Interventions vs. counterfactuals. We follow Pearl's hierarchy of causal inference (Bareinboim et al., 2022) and distinguish the interventional and the counterfactual distribution. In SCM language, we can use Pearl's do-notation $do(A = a)$ to denote an intervention on the treatment $A$. This corresponds to setting $A = a$ in a graph $G_a$ where all arrows from parent nodes of $A$ to $A$ are removed. We can then define the potential outcome $Y[a]$ via its interventional density $P(Y[a] = y) = P(Y = y \mid do(A = a))$ and obtain the identification result from Eq. (17). In contrast, counterfactual queries aim to answer individualized questions of the form "what would have happened if we had used a different treatment, given an already treated or untreated population?" We can then define the counterfactual density as $P(Y[a'] = y \mid A = a)$ for some different treatment $a' \neq a$. This is the distributional equivalent of the average treatment effect on the treated (ATT). However, most of the treatment effect estimation literature focuses on interventional causal estimands (such as the ATE). Our paper is therefore in line with previous work. We acknowledge that other papers oftentimes call interventional distributions counterfactual distributions for simplicity.

Comparison to other identification strategies. For the identification of the interventional density, we rely on the three main assumptions of positivity, consistency, and exchangeability (or, equivalently, on the back-door adjustment from Eq. (17)). This is a common setup in treatment effect estimation (van der Laan & Rubin, 2006; Shalit et al., 2017; Wager & Athey, 2018). More complex adjustment rules (e.g., front-door adjustment, adjustment for the napkin graph) have the following limitations: (1) they require more unusual, complex assumptions which are often violated in practice; and (2) they require a complex efficient estimation theory (Vowels et al., 2022). Nevertheless, this could be an interesting direction for future research.
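To make the back-door adjustment in Eq. (17) concrete, the following sketch contrasts the interventional density with the observational density $P(Y = y \mid A = a)$ in a toy model with a single binary covariate. All numbers (covariate distribution, propensity scores, conditional means) are hypothetical and serve only as an illustration; they are unrelated to the SCM of Fig. 1.

```python
import math

def normal_pdf(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical toy model with a binary covariate X.
P_X = {0: 0.5, 1: 0.5}                # P(X = x)
PROPENSITY = {0: 0.2, 1: 0.8}         # P(A = 1 | X = x)
COND_MEAN = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}  # E(Y | X = x, A = a), keyed by (x, a)

def conditional_density(y, x, a):
    """P(Y = y | X = x, A = a): a unit-variance normal in this toy model."""
    return normal_pdf(y, COND_MEAN[(x, a)], 1.0)

def interventional_density(y, a):
    """Eq. (17): average the conditional density over the marginal P(X)."""
    return sum(conditional_density(y, x, a) * P_X[x] for x in P_X)

def observational_density(y, a):
    """P(Y = y | A = a): average over P(X | A = a), obtained via Bayes' rule."""
    p_a_given_x = {x: PROPENSITY[x] if a == 1 else 1.0 - PROPENSITY[x] for x in P_X}
    p_a = sum(p_a_given_x[x] * P_X[x] for x in P_X)
    return sum(conditional_density(y, x, a) * p_a_given_x[x] * P_X[x] / p_a for x in P_X)
```

Because the treatment assignment depends on $X$ in this toy model, the two densities differ; they coincide only when $P(X \mid A = a) = P(X)$.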

B.3 EFFICIENCY THEORY AND INFLUENCE FUNCTIONS

In this section, we give a brief background on semi-parametric efficiency theory and influence functions. Our background builds upon Kennedy et al. (2021), and we thus refer to it for mathematical details and further explanations.

Let us consider a semi-parametric statistical model $\{P \in \mathcal{P}\}$, where $\mathcal{P}$ is a family of probability measures. We are interested in estimating a functional $\psi : \mathcal{P} \to \mathbb{R}$. If $\psi$ is sufficiently smooth, it admits the so-called von Mises or distributional Taylor expansion

$$\psi(\bar{P}) - \psi(P) = \int \phi(t, \bar{P})\, d(\bar{P} - P)(t) + R_2(\bar{P}, P), \tag{20}$$

where $\bar{P} \in \mathcal{P}$, $R_2(\bar{P}, P)$ is a second-order remainder term, and $\phi(t, P)$ is the so-called efficient influence function of $\psi$, satisfying $\int \phi(t, P)\, dP(t) = 0$ and $\int \phi(t, P)^2\, dP(t) < \infty$.

The efficient influence function $\phi(\cdot, \cdot)$ plays an important role in the theory of efficient semi-parametric estimation. Under certain assumptions, it can be shown that, for any sequence of estimators $\hat{\psi}_n$, it holds that

$$\inf_{\delta > 0}\, \liminf_{n \to \infty}\, \sup_{\mathrm{TV}(P, Q) < \delta}\, n\, \mathbb{E}_Q\big[ (\hat{\psi}_n - \psi(Q))^2 \big] \geq \mathrm{var}\big(\phi(T, P)\big), \tag{21}$$

where TV denotes the total variation distance. Hence, $\phi$ characterizes the best possible variance an estimator can achieve (in a local minimax sense).

Let now $\hat{P}$ be an estimator of $P$ and $\psi(\hat{P})$ the so-called plug-in estimator of $\psi(P)$. The von Mises expansion from Eq. (20) implies that $\psi(\hat{P})$ has a first-order plug-in bias, because

$$\psi(\hat{P}) - \psi(P) = -\int \phi(t, \hat{P})\, dP(t) + R_2(\hat{P}, P), \tag{22}$$

which follows from $\int \phi(t, \hat{P})\, d\hat{P}(t) = 0$. A simple way to correct for the plug-in bias is to estimate the bias term on the right-hand side of Eq. (22) and add it to the plug-in estimator via

$$\hat{\psi}_{\text{A-IPTW}} = \psi(\hat{P}) + \mathbb{P}_n\{\phi(T, \hat{P})\}. \tag{23}$$

Under certain assumptions, it can be shown that $\sqrt{n}\,(\hat{\psi}_{\text{A-IPTW}} - \psi(P))$ is asymptotically normal with mean zero and variance $\mathrm{var}(\phi(T, P))$. Hence, by Eq. (21), $\hat{\psi}_{\text{A-IPTW}}$ is (asymptotically) efficient in the sense that it is consistent and attains the best possible variance.

Application to interventional density estimation. We now return to the specific statistical model in our paper, i.e., we aim at interventional density estimation. In other words, the estimand $\psi(P)$ we are interested in is the function

$$P(Y[a] = \cdot) = \mathbb{E}_{X \sim P(X)}\, P(Y = \cdot \mid X, A = a).$$

Given an estimator $\hat{P}(Y = \cdot \mid X, A = a)$ of the conditional density, the corresponding plug-in estimator is

$$\hat{P}_{\text{PI}}(Y[a] = \cdot) = \mathbb{P}_n\{\hat{P}(Y = \cdot \mid X, A = a)\}. \tag{25}$$

As described above, this estimator suffers from plug-in bias and is not efficient. However, a one-step bias correction for our setting is not as simple, because the interventional density is a functional target estimand and, hence, infinite-dimensional. As a remedy, Kennedy et al. (2021) propose an elegant solution by introducing the finite-dimensional projection parameter

$$\beta_a = \arg\min_{\beta_a}\, \mathrm{KL}\big( P(Y[a]) \,\|\, g(\cdot\,; \beta_a) \big),$$

which is equivalent to solving the moment condition

$$m(\beta_a) = \mathbb{E}_{X \sim P(X)}\, \mathbb{E}\big( T(Y; \beta_a) \mid X, A = a \big) \overset{!}{=} 0,$$

where $T = T(Y; \beta_a) = -\nabla_{\beta_a} \log g(Y; \beta_a)$. The advantage of this approach is that the moment $m(\beta_a)$ is a finite-dimensional quantity, which means that efficiency theory can be applied. The plug-in estimator for the moment is

$$\hat{m}_{\text{PI}}(\beta_a) = \mathbb{E}_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\, T(Y^a; \beta_a). \tag{28}$$

Kennedy et al. (2021) also derived the efficient influence function for the moment:

$$\phi_a(T; P) = \frac{\mathbb{1}(A = a)}{\pi_a(X)} \big( T - \mathbb{E}(T \mid X, A = a) \big) + \mathbb{E}(T \mid X, A = a) - \mathbb{E}_{X \sim P(X)}\big( \mathbb{E}(T \mid X, A = a) \big).$$

A bias-corrected estimator for the projection parameter can then be obtained by solving

$$\hat{m}_{\text{A-IPTW}}(\beta_a) = \hat{m}_{\text{PI}}(\beta_a) + \mathbb{P}_n\{\phi_a(T(Y; \beta_a); \hat{P})\} \overset{!}{=} 0. \tag{30}$$

Estimating the projection parameter via Eq. (30) requires solving a (potentially high-dimensional) system of non-linear equations, which is often infeasible in practice. Hence, as a remedy, we propose in this paper to reformulate Eq. (30) as an optimization problem which can be incorporated directly into the loss of a neural network (see Appendix C).
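The one-step correction in Eq. (23) may be easier to see for a scalar target. The sketch below applies the same recipe to the mean of a potential outcome, $\mathbb{E}(Y[a])$, where the score $T$ is replaced by the outcome $Y$ itself. This is an analogy under stated assumptions, not the density estimator of the paper: the function names and the four-observation dataset are hypothetical, and the nuisance estimates are passed in as plain lists.

```python
def plug_in_mean(mu_hat):
    """Plug-in estimator: average of the estimated conditional means mu_hat_a(X_i)."""
    return sum(mu_hat) / len(mu_hat)

def one_step_mean(y, a_obs, a, pi_hat, mu_hat):
    """A-IPTW estimator: plug-in estimator plus the sample average of the
    inverse-propensity-weighted residuals of the units that received treatment a."""
    n = len(y)
    correction = sum(
        (1.0 if a_obs[i] == a else 0.0) / pi_hat[i] * (y[i] - mu_hat[i])
        for i in range(n)
    ) / n
    return plug_in_mean(mu_hat) + correction

# Hypothetical data: the outcome model is badly misspecified (all zeros),
# but the propensity score estimate is correct (0.5 for every unit).
y = [1.0, 2.0, 3.0, 4.0]
a_obs = [1, 0, 1, 0]
pi_hat = [0.5, 0.5, 0.5, 0.5]
mu_hat = [0.0, 0.0, 0.0, 0.0]

estimate = one_step_mean(y, a_obs, 1, pi_hat, mu_hat)
```

In this toy case, the plug-in estimator returns 0 because of the misspecified outcome model, whereas the corrected estimator recovers the mean outcome of the treated units, which illustrates the double robustness property mentioned in the text.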

C BIAS-CORRECTED MOMENT CONDITION AS AN OPTIMIZATION TASK

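We aim to transform the bias-corrected moment condition

$$\hat{m}_{\text{A-IPTW}}(\beta_a) = \hat{m}_{\text{PI}}(\beta_a) + \mathbb{P}_n\{\phi_a(T(Y; \beta_a); \hat{P})\} \tag{31}$$

into an optimization objective. Note that the plug-in estimator of the moment condition $\hat{m}_{\text{PI}}(\beta_a)$ can be rewritten as

$$\hat{m}_{\text{PI}}(\beta_a) = \mathbb{E}_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\, T(Y^a; \beta_a) = \int_{\mathcal{Y}} T(y; \beta_a)\, \mathbb{P}_n\{\hat{P}(Y = y \mid X, A = a)\}\, dy \tag{32}$$

$$= \mathbb{P}_n\Big\{ \int_{\mathcal{Y}} T(y; \beta_a)\, \hat{P}(Y = y \mid X, A = a)\, dy \Big\} = \mathbb{P}_n\big\{ \hat{\mathbb{E}}\big( T(Y; \beta_a) \mid X, A = a \big) \big\}, \tag{33}$$

where the last equality follows from the definition of the conditional expectation. Notably, the moment condition could be equivalently solved with either the conditional distribution, $\hat{P}(Y \mid X, A = a)$, or the functional regression, $\hat{\mathbb{E}}(T(Y; \beta_a) \mid X, A = a)$.

Let us unroll the bias correction term of Eq. (7):

$$\mathbb{P}_n\{\phi_a(T; \hat{P})\} = \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)} \big( T - \hat{\mathbb{E}}(T \mid X, A = a) \big) + \hat{\mathbb{E}}(T \mid X, A = a) - \mathbb{P}_n\{\hat{\mathbb{E}}(T \mid X, A = a)\} \Big\} \tag{34}$$

$$= \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)} \big( T - \hat{\mathbb{E}}(T \mid X, A = a) \big) + \hat{\mathbb{E}}(T \mid X, A = a) \Big\} - \mathbb{P}_n\{\hat{\mathbb{E}}(T \mid X, A = a)\}, \tag{35}$$

where $\hat{\pi}_a$ and $\hat{\mathbb{E}}$ denote the estimated nuisance parameters. Here, the last term is in fact the negative plug-in estimator of the moment condition, i.e., $-\hat{m}_{\text{PI}}(\beta_a)$. Therefore, we can simplify the one-step bias-corrected moment condition via

$$\hat{m}_{\text{A-IPTW}}(\beta_a) = \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)} \big( T - \hat{\mathbb{E}}(T \mid X, A = a) \big) + \hat{\mathbb{E}}(T \mid X, A = a) \Big\} \tag{36}$$

$$= \mathbb{E}_{Y^a \sim \mathbb{P}_n\{\hat{P}(Y \mid X, A = a)\}}\, T(Y^a; \beta_a) + \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)} \Big( T(Y; \beta_a) - \mathbb{E}_{Y \sim \hat{P}(Y \mid X, A = a)}\, T(Y; \beta_a) \Big) \Big\}, \tag{37}$$

where we use the conditional density estimator rather than an estimator of the functional regression. This allows us to transform the A-IPTW moment condition into an optimization objective (Eq. (8)) by taking the antiderivative with respect to $\beta_a$.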
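The antiderivative step can be verified directly. The worked equation below writes down a candidate objective $\mathcal{L}(\beta_a)$ and checks that its gradient equals the moment condition in Eq. (37), using $T(Y; \beta_a) = -\nabla_{\beta_a} \log g(Y; \beta_a)$. The symbol $\mathcal{L}$ and the exact arrangement of terms are assumptions reconstructed from the losses in Algorithm 1 (Appendix D); they are not a verbatim copy of Eq. (8).

```latex
% Assumption: \mathcal{L}(\beta_a) is reconstructed from Algorithm 1, not copied from Eq. (8).
\begin{align*}
\mathcal{L}(\beta_a)
  &= -\int_{\mathcal{Y}} \log g(y; \beta_a)\,
      \mathbb{P}_n\{\hat{P}(Y = y \mid X, A = a)\}\, dy \\
  &\quad + \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)}
      \Big( -\log g(Y; \beta_a)
      + \int_{\mathcal{Y}} \log g(y; \beta_a)\, \hat{P}(Y = y \mid X, A = a)\, dy \Big) \Big\}, \\[4pt]
\nabla_{\beta_a} \mathcal{L}(\beta_a)
  &= \mathbb{P}_n\big\{ \hat{\mathbb{E}}(T \mid X, A = a) \big\}
   + \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)}
      \big( T - \hat{\mathbb{E}}(T \mid X, A = a) \big) \Big\}
   = \hat{m}_{\text{A-IPTW}}(\beta_a).
\end{align*}
```

Hence, any stationary point of $\mathcal{L}$ solves the bias-corrected moment condition, which is what permits training the student flow with gradient-based optimizers.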

D TWO-STEP TRAINING PROCEDURE

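Our INFs are trained with a two-step procedure, which is shown in Algorithm 1. Recall that we use noise regularization as the main regularization technique for the teacher flow, and an exponential moving average (EMA) for the student flow to stabilize training. A-IPTW estimation is also known to become unstable in finite samples (Shi et al., 2019), because the inverse of a small propensity score becomes very large. Thus, we manually discard observations with too small a propensity score ($\hat{\pi}_a(X) < 0.05$) from the bias correction.

Algorithm 1

Init: parameters of the teacher flow: $\text{FC}_1^{(0)}$, $\text{FC}_2^{(0)}$
▷ Fitting the teacher flow
for $i = 0$ to $n_{\text{iter},t}$ do
    $\mathcal{B} = \{X, A, Y\}$ ← minibatch of size $b_t$
    $R, \hat{\pi}_a(X)$ ← $\text{FC}_1^{(i)}(X)$
    $\xi_x \sim N(0, \sigma_x^2)$; $\xi_y \sim N(0, \sigma_y^2)$; $\tilde{R}$ ← $R + \xi_x$; $\tilde{Y}$ ← $Y + \xi_y$    ▷ Noise regularization
    $\theta(X, A)$ ← $\text{FC}_2^{(i)}(A, \tilde{R})$
    $\hat{P}(Y \mid X, A)$ ← normalizing flow with parameters $\theta(X, A)$
    $\mathcal{L}_{\text{NLL}}$ ← $-\log \hat{P}(Y = \tilde{Y} \mid X, A)$
    $\mathcal{L}_{\pi}$ ← $\text{BCE}(\hat{\pi}_A(X), A)$
    $\mathcal{L}_t(\hat{P}, \hat{\pi}_a)$ ← $\mathbb{P}_{\mathcal{B}}^{b_t}\{\mathcal{L}_{\text{NLL}} + \alpha \mathcal{L}_{\pi}\}$
    $\text{FC}_1^{(i+1)}, \text{FC}_2^{(i+1)}$ ← optimization step wrt. $\mathcal{L}_t(\hat{P}, \hat{\pi}_a)$ with learning rate $\eta_t$
end for
Output: nuisance parameters: $\hat{P}(Y \mid X, A)$, $\hat{\pi}_a(X)$

Init: parameters of the student flows: $\beta_a^{(0)}$, $\beta_{a,\text{EMA}}^{(0)}$ ← $\beta_a^{(0)}$
▷ Fitting the student flows
for $i = 0$ to $n_{\text{iter},s}$ do
    $\mathcal{B} = \{X, A, Y\}$ ← minibatch of size $b_s$
    for $a \in \{0, 1\}$ do
        $\mathcal{L}_{\text{CE}}(\beta_a^{(i)})$ ← $-h \sum_{j=1}^{K} \log g(y_j; \beta_a^{(i)})\, \mathbb{P}_{\mathcal{B}}^{b_s}\{\hat{P}(Y = y_j \mid X, A = a)\}$
        $\mathcal{L}_{\text{CCE}}(X; \beta_a^{(i)})$ ← $-h \sum_{j=1}^{K} \log g(y_j; \beta_a^{(i)})\, \hat{P}(Y = y_j \mid X, A = a)$
        bias correction$(\beta_a^{(i)})$ ← $\mathbb{P}_{\mathcal{B}}^{b_s}\Big\{ \frac{\mathbb{1}(A = a\ \&\ \hat{\pi}_a(X) \geq 0.05)}{\hat{\pi}_a(X)} \big( -\log g(Y; \beta_a^{(i)}) - \mathcal{L}_{\text{CCE}}(X; \beta_a^{(i)}) \big) \Big\}$
        $\mathcal{L}_s(\beta_a^{(i)})$ ← $\mathcal{L}_{\text{CE}}(\beta_a^{(i)})$ + bias correction$(\beta_a^{(i)})$
        $\beta_a^{(i+1)}$ ← optimization step wrt. $\mathcal{L}_s(\beta_a^{(i)})$ with learning rate $\eta_s$
        $\beta_{a,\text{EMA}}^{(i+1)}$ ← $\gamma \beta_{a,\text{EMA}}^{(i)} + (1 - \gamma) \beta_a^{(i+1)}$    ▷ EMA update
    end for
end for
Output: $\hat{\beta}_a^{\text{A-IPTW}}$ ← $\beta_{a,\text{EMA}}^{(n_{\text{iter},s})}$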
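The student-flow part of Algorithm 1 can be sketched in a few lines. The code below only evaluates the loss $\mathcal{L}_s$ for one minibatch and performs the EMA update; it assumes that the teacher's conditional density, the propensity score, and the student log-density are available as plain Python callables, and it omits the gradient step. All function and argument names are hypothetical and are not taken from the authors' implementation.

```python
import math

def normal_logpdf(y, mean, std):
    return -0.5 * ((y - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def student_loss(batch, cond_density, propensity, log_g, a, grid, trim=0.05):
    """L_s = L_CE + bias correction for treatment a on one minibatch.

    batch:        list of (x, a_obs, y) tuples
    cond_density: teacher estimate of P(Y = y | X = x, A = a), called as cond_density(y, x, a)
    propensity:   teacher estimate of P(A = a | X = x), called as propensity(x, a)
    log_g:        student log-density log g(y; beta_a), called as log_g(y)
    grid:         equally spaced outcome grid y_1, ..., y_K
    """
    h = grid[1] - grid[0]
    n = len(batch)

    def cce(x):
        # Conditional cross-entropy L_CCE(X; beta_a), approximated on the grid.
        return -h * sum(log_g(yj) * cond_density(yj, x, a) for yj in grid)

    # L_CE uses the batch-averaged teacher density; by linearity, it equals the batch mean of L_CCE.
    l_ce = sum(cce(x) for x, _, _ in batch) / n

    correction = 0.0
    for x, a_obs, y in batch:
        pi = propensity(x, a)
        # Only units with treatment a and a sufficiently large propensity score contribute.
        if a_obs == a and pi >= trim:
            correction += (-log_g(y) - cce(x)) / pi
    return l_ce + correction / n

def ema_update(beta_ema, beta_new, gamma=0.995):
    """Exponential moving average of the student parameters."""
    return [gamma * e + (1.0 - gamma) * b for e, b in zip(beta_ema, beta_new)]
```

In an actual implementation, `log_g` would be a differentiable flow, and the loss would be minimized with an automatic differentiation framework; the trimming threshold of 0.05 and the EMA coefficient follow the values stated in the text.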

E INFS IMPLEMENTATION DETAILS

Implementation. We implemented our INFs using PyTorch and Pyro. For both the teacher and the student flow, we employ neural spline flows (Durkan et al., 2019) with a standard normal ($N(0, 1)$) base distribution. Neural spline flows construct an invertible transformation of the base distribution with the help of monotonic rational-quadratic splines. They are characterized by two main hyperparameters: the number of knots $n_{\text{knots}}$ and the span of the transformation interval $[-B, B]$. Here, $n_{\text{knots}}$ controls the smoothness of the estimated density and $B$ affects the support of the transformation. In our experiments, we heuristically set $B = y_{\max} - y_{\min} + 5$. For the teacher flow, we use fully-connected subnetworks, each with one hidden layer (with $h = 10$ hidden units), and the dimensionality of the representation is set to $r = 10$.

Training. During training (see the full algorithm in Appendix D), we adopt noise regularization (Rothfuss et al., 2019) and add independent Gaussian noise $\xi_x \sim N(0, \sigma_x^2)$, $\xi_y \sim N(0, \sigma_y^2)$ to the representation and the output of the teacher flow, i.e., $\tilde{R} = R + \xi_x$; $\tilde{Y} = Y + \xi_y$. For faster learning, we approximate the full-sample average $\mathbb{P}_n\{\cdot\}$ with a minibatch average $\mathbb{P}_{\mathcal{B}}^{b}\{\cdot\}$ for all the losses, where $b$ is the minibatch size. We use stochastic gradient descent (SGD) for fitting the parameters of the teacher flow and the Adam optimizer (Kingma & Ba, 2015) for the student flow, with learning rates $\eta_t$ and $\eta_s$, respectively. We fix the weighting hyperparameter of the loss to $\alpha = 1$ and the EMA smoothing hyperparameter to $\gamma = 0.995$. The grid size for approximating the cross-entropy loss is set to $K = 100$. Both $y_{\min}$ and $y_{\max}$ are set to the empirical minimum and maximum of the train sub-sample. Note that we would need sample splitting for training both flows to guarantee the asymptotic properties, i.e., efficiency and double robustness (see Kennedy et al., 2021, Remark 5). Nevertheless, we used all data for both components and trained our INFs with an auxiliary regularization, because sample splitting can affect the performance in settings with limited data. This is consistent with previous work on deep learning for efficient treatment effect estimation (Curth & van der Schaar, 2021).

Hyperparameter tuning. We perform extensive hyperparameter tuning only for the teacher flow. The tuned hyperparameters include, e.g., the number of knots of the neural spline flows $n_{\text{knots},t}$, the minibatch size $b_t$, the learning rate $\eta_t$, and the intensities of the noise regularization $\sigma_x^2$, $\sigma_y^2$. In contrast, we found that the student flow works well with the same plain set of hyperparameters in almost all experiments. These include the minibatch size $b_s = 64$ and the learning rate $\eta_s = 0.005$. The number of knots $n_{\text{knots},s}$ is chosen by hand for each dataset. Further details on hyperparameter tuning are provided in Appendix G.
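Two of the simpler ingredients above, the heuristic for the spline interval and the noise regularization, can be sketched as follows. The helper names are hypothetical, and the bound assumes the reading $B = y_{\max} - y_{\min} + 5$ of the heuristic; the noise helper perturbs a plain list of scalars, whereas the paper applies the noise to the learned representation and to the outcome.

```python
import random

def spline_bound(y_train, margin=5.0):
    """Half-width B of the spline interval [-B, B], assuming B = y_max - y_min + margin."""
    return max(y_train) - min(y_train) + margin

def add_gaussian_noise(values, sigma, rng):
    """Noise regularization: add independent N(0, sigma^2) noise to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]

rng = random.Random(0)                      # fixed seed for reproducibility
y_train = [1.0, 2.5, 4.0]                   # hypothetical outcomes
bound = spline_bound(y_train)               # half-width of the spline interval
y_noisy = add_gaussian_noise(y_train, sigma=0.1, rng=rng)
```

Fresh noise would be drawn in every training iteration, so that the teacher flow never sees exactly the same targets twice.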

F BASELINES

In the following, we describe the baseline methods in detail. These are two naïve semi-parametric plug-in estimators: mixture density networks (MDNs) (Bishop, 1994) and conditional normalizing flow (CNF) (Trippe & Turner, 2018) . Further, we use two state-of-the-art IDE baselines: kernel density estimation (KDE) (Kim et al., 2018) and distributional kernel mean embeddings (DKME) (Muandet et al., 2021) .

F.1 NAÏVE SEMI-PARAMETRIC PLUG-IN ESTIMATORS

Semi-parametric plug-in estimators estimate the conditional outcome distribution and perform averaging over covariates during evaluation, as introduced in Eq. (2). TARNet*, MDNs, and CNF make use of hypernetworks (Ha et al., 2017), which take the covariates $X$ and the treatment $A$ as input and output the parameters $\theta(X, A)$ of the estimated conditional distribution $\hat{P}(Y \mid X, A)$. Hypernetwork architectures are considered to be state-of-the-art for neural conditional density estimation and can be found in, e.g., Gaussian mixtures (Bishop, 1994), variational autoencoders (Kingma & Welling, 2013), and normalizing flows (Trippe & Turner, 2018). For comparability, we use the same network structure as for the teacher flow in our INFs as the hypernetwork for the conditional distribution parameters. This gives two fully-connected subnetworks stacked on each other, i.e., $\text{FC}_1$ and $\text{FC}_2$, as introduced in Section 5.1. To regularize both conditional distribution estimators, we use noise regularization (Rothfuss et al., 2019).

TARNet*. The treatment-agnostic representation network (TARNet) (Shalit et al., 2017) was proposed to estimate nuisance parameters for the ITE, i.e., conditional means of outcomes. To obtain density estimates as outputs, we report results from an extended variant, which we refer to as TARNet*. Specifically, we extended the original TARNet by modeling the conditional outcome distribution as a homoscedastic normal distribution. For this, we add one unconditional standard deviation parameter, $\sigma$, so that the conditional density equals

$$\hat{P}(Y = y \mid X, A) = N(y; \mu(X, A), \sigma^2),$$

where $N(y; \mu, \sigma^2)$ is the density of the normal distribution, and $\mu(X, A)$ is the conditional mean of the outcome. Notably, we do not use the two separate outcome heads (as in the original TARNet) but only one, i.e., $\text{FC}_2$. This is crucial to ensure a fair comparison with the other plug-in estimators. We estimate the standard deviation $\sigma$ using maximum likelihood. Note also that TARNet* is restricted to normal conditional outcome distributions and thus is not a universal density estimator. In contrast to our INFs, TARNet* is unable to capture heavy-tailed, multi-modal, and skewed distributions.

MDNs. Mixture density networks (Bishop, 1994) are built on top of a mixture of normal distributions and can approximate any density arbitrarily well (Titterington et al., 1985), i.e.,

$$\hat{P}(Y = y \mid X, A) = \sum_{j=1}^{n_C} w_j(X, A)\, N\big(y; \mu_j(X, A), \sigma_j^2(X, A)\big),$$

where $n_C$ is the number of mixture components, $w_j \geq 0$ with $\sum_{j=1}^{n_C} w_j = 1$ are the mixture weights, and $N(y; \mu_j, \sigma_j^2)$ is the density of the normal distribution. In the case of MDNs, the hypernetwork outputs the logits of the mixture weights and the parameters of the normal distributions (i.e., mean and logarithm of the standard deviation), i.e., $\theta = \{\text{logits}(w_j), \mu_j, \log \sigma_j\}$. Here, the number of mixture components $n_C$ controls the smoothness of the estimator and represents the main hyperparameter for tuning.

CNF. We implement the conditional normalizing flow (Trippe & Turner, 2018) with the help of neural spline flows (Durkan et al., 2019). Neural spline flows construct an invertible function parameterized by $\theta$, i.e., $f(\cdot\,; \theta) : \mathbb{R} \to \mathbb{R}$, which is a monotonic rational-quadratic spline with $n_{\text{knots}}$ knots. This spline transforms the density of a base distribution on the interval $[-B, B]$. Outside of the interval, $f(\cdot)$ equals the identity function. This allows us to perform flexible parametric density estimation with the help of the change of variables formula, i.e.,

$$\hat{P}(Y = y \mid X, A) = N\big( f^{-1}(y; \theta(X, A)); 0, 1 \big) \left| \frac{df}{dY}\big( f^{-1}(y; \theta(X, A)) \big) \right|^{-1},$$

where $f^{-1}(\cdot\,; \theta)$ is the inverse transformation, and the density of the standard normal distribution $N(y; 0, 1)$ serves as the base distribution. As already discussed in Appendix E, $B$ affects the support of the transformation, and the number of knots $n_{\text{knots}}$ controls the smoothness of the estimator and represents the main hyperparameter for tuning.
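As an illustration of the MDN parametrization described above, the following sketch evaluates a mixture density from hypernetwork-style outputs, i.e., logits of the mixture weights, means, and log standard deviations. The function names and the example parameters are hypothetical; in the baseline, these values would be produced by the hypernetwork for a given $(X, A)$.

```python
import math

def softmax(logits):
    """Map unconstrained logits to mixture weights that are non-negative and sum to one."""
    m = max(logits)                              # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mdn_density(y, logits, means, log_stds):
    """Density of a normal mixture with parameters theta = {logits(w_j), mu_j, log sigma_j}."""
    density = 0.0
    for w, mu, log_sigma in zip(softmax(logits), means, log_stds):
        sigma = math.exp(log_sigma)              # exponentiating guarantees sigma > 0
        density += w * math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return density

# Hypothetical output of the hypernetwork for one (X, A) pair: two equally weighted modes.
theta = {"logits": [0.0, 0.0], "means": [-2.0, 2.0], "log_stds": [0.0, 0.0]}
p = mdn_density(0.0, theta["logits"], theta["means"], theta["log_stds"])
```

Parametrizing the weights via logits and the scales via logarithms keeps the network outputs unconstrained while the resulting density remains valid.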

F.2 KERNEL DENSITY ESTIMATION (KDE)

Kernel density estimation (KDE) is a semi-parametric method for IDE (Kim et al., 2018). It builds upon the idea of a density functional, namely $T_y(Y; h_a)$, to transform a random variable $Y$ into a proper density via

$$T_y(Y; h_a) = \frac{1}{h_a} K\Big( \frac{\|Y - y\|_2}{h_a} \Big) = \frac{1}{h_a \sqrt{2\pi}} \exp\Big( -\frac{\|Y - y\|_2^2}{2 h_a^2} \Big), \tag{41}$$

where $K(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$ is a radial basis function (RBF), $h_a$ is a treatment-specific smoothing parameter called the bandwidth, and $\|\cdot\|_2$ is the $L_2$-norm. Robins & Rotnitzky (2001) proposed a semi-parametric plug-in estimator of the interventional density

$$\hat{P}_{\text{PI}}(Y[a] = y) = \mathbb{P}_n\big\{ \hat{\mathbb{E}}\big( T_y(Y; h_a) \mid X, A = a \big) \big\},$$

where $\hat{\mu}_{a,y}(X) = \hat{\mathbb{E}}(T_y(Y; h_a) \mid X, A = a)$ is a functional regression of $T_y(Y; h_a)$ on $X$ and $A$. Kim et al. (2018) further extended this estimator to an efficient, A-IPTW-style semi-parametric estimator

$$\hat{P}_{\text{A-IPTW}}(Y[a] = y) = \mathbb{P}_n\Big\{ \frac{\mathbb{1}(A = a)}{\hat{\pi}_a(X)} \big( T_y(Y; h_a) - \hat{\mu}_{a,y}(X) \big) + \hat{\mu}_{a,y}(X) \Big\}, \tag{43}$$

where $\hat{\pi}_a(X)$ is an estimator of the propensity score. The main challenge here is building the functional regression $\hat{\mu}_{a,y}(X)$. Unfortunately, the work by Kim et al. (2018) does not provide effective, practical solutions. Moreover, Eq. (43) does not guarantee that the estimated density is proper, i.e., integrates to 1 and is non-negative, especially in a small-sample regime or when the propensity score has extremely low values.

To estimate the nuisance parameters, namely, the propensity score and the functional regression, we use the same network structure as for the teacher flow of our INFs (see Section 5.1). In this way, we estimate the propensity score and perform a functional regression with two joined, fully-connected subnetworks (i.e., $\text{FC}_1$ and $\text{FC}_2$). The first subnetwork, $\text{FC}_1$, outputs a representation $R$ and estimates the propensity score. The second subnetwork, $\text{FC}_2$, then takes the representation $R$ and the treatment $A$, and performs an outcome regression: $\hat{Y} = \hat{\mathbb{E}}(Y \mid X, A)$. The functional regression, i.e., Eq. (41), is predicted via $\hat{\mu}_{a,y}(X) = T_y(\hat{\mathbb{E}}(Y \mid X, A); h_a)$. Although this is a biased estimator of $\mu_{a,y}(X)$, it ensures a proper normalization, i.e., $\int_{\mathcal{Y}} \hat{\mu}_{a,y}(X)\, dy = 1$. To fit $\text{FC}_1$ and $\text{FC}_2$, we use the sum of the mean-squared error ($\mathcal{L}_{\text{MSE}}$) and binary cross-entropy ($\mathcal{L}_{\pi}$) losses via

$$\mathcal{L}_{\text{KDE}}(\hat{\mathbb{E}}, \hat{\pi}_a) = \mathbb{P}_n\{\mathcal{L}_{\text{MSE}} + \alpha \mathcal{L}_{\pi}\} \quad \text{with} \quad \mathcal{L}_{\text{MSE}} = (\hat{Y} - Y)^2; \quad \mathcal{L}_{\pi} = \text{BCE}(\hat{\pi}_A(X), A), \tag{44}$$

where $\alpha$ is a hyperparameter. In our experiments, we set $\alpha = 1$ and fit the nuisance parameters (i.e., $\hat{\pi}_a$ and $\hat{\mathbb{E}}(Y \mid X, A)$) using the Adam optimizer with $n_{\text{iter}} = 10{,}000$ iterations. Both the learning rate $\eta$ and the minibatch size $b$ are subject to hyperparameter tuning. We employ a median heuristic (Garreau et al., 2017) for choosing the bandwidth $h_a$, i.e.,

$$h_a^{\text{med}} = \frac{1}{2} \text{Median}\big\{ \|Y_i - Y_j\|_2^2 \mid A = a \big\}, \quad 1 \leq i < j \leq n, \tag{45}$$

where $\|\cdot\|_2$ is the $L_2$-norm, and where $Y_i, Y_j$ are observations from the train subset, conditioned on $A = a$. To address the numeric instability of the A-IPTW estimator, we discard observations with too small propensity scores ($\hat{\pi}_a(X) < 0.05$) from the averaging in Eq. (43), similarly to our INFs.

$\mathcal{D} = \{X_i^0, Y_i^0\}_{i=1}^{n_0} \cup \{X_i^1, Y_i^1\}_{i=1}^{n_1}$. Then, $\mu_{Y \mid X, A=a}$ can be estimated via

$$\hat{\mu}_{Y \mid X, A=a}(y) = \sum_{i=1}^{n_a} w_i^a(X)\, l_a(y, Y_i^a),$$

$$\big( w_1^a(X), \ldots, w_{n_a}^a(X) \big)^\top = (K_a + n_a \varepsilon I)^{-1} k_a(X) \in \mathbb{R}^{n_a},$$

$$k_a(X) = \big( k(X, X_1^a), \ldots, k(X, X_{n_a}^a) \big)^\top \in \mathbb{R}^{n_a},$$

where $I \in \mathbb{R}^{n_a \times n_a}$ is an identity matrix, $\varepsilon > 0$ is a regularization hyperparameter, $K_a \in \mathbb{R}^{n_a \times n_a}$ is a kernel matrix with elements $K_{ij}^a = k(X_i^a, X_j^a)$, and $k(\cdot, \cdot)$ is a second kernel representing conditional dependencies between $X$ and $Y$ (Grünewälder et al., 2012). Muandet et al. (2021) further developed a KME for the interventional distribution, i.e., $\mu_{Y[a]}$, and its empirical estimate, $\hat{\mu}_{Y[a]}$:

$$\mu_{Y[a]}(y) = \mathbb{E}_{X \sim P(X)}\, \mu_{Y \mid X, A=a}(y), \tag{50}$$

$$\hat{\mu}_{Y[a]}(y) = \mathbb{P}_n\{\hat{\mu}_{Y \mid X, A=a}(y)\} = \sum_{i=1}^{n_a} \beta_i^a\, l_a(y, Y_i^a),$$

$$(\beta_1^a, \ldots, \beta_{n_a}^a)^\top = (K_a + n_a \varepsilon I)^{-1} \tilde{K}_a 1_m \in \mathbb{R}^{n_a},$$

where $\tilde{K}_a \in \mathbb{R}^{n_a \times n}$ is a kernel matrix with elements $\tilde{K}_{ij}^a = k(X_i^a, X_j)$, and $1_m = (1/n, \ldots, 1/n)^\top$. For our experiments, we choose both kernels, i.e., the outcome kernel, $l_a(\cdot, \cdot)$, and the conditional kernel, $k(\cdot, \cdot)$, to be RBF kernels with bandwidth parameters $h_{a,l}$ and $h_k$, respectively. Therefore, $\hat{\mu}_{Y[a]}(y)$ represents a valid interventional density estimator. Nevertheless, due to small sample sizes, some $\beta_i^a$ could be negative, so that the estimated density ends up having negative values. We set the bandwidth of the outcome kernel, $h_{a,l}$, according to the median heuristic from Eq. (45). The bandwidth of the conditional kernel $h_k$ and the regularization hyperparameter $\varepsilon$ are subject to hyperparameter tuning. Motivated by the interpretation of the conditional mean embedding as kernel ridge regression (Grünewälder et al., 2012), we use the out-of-sample MSE of the ridge regression with parameters $h_k$ and $\varepsilon$ as a tuning criterion.

For the experiments with HC-MNIST, we use a larger network size for our INFs (compared to the other benchmarking experiments) to allow for more flexibility. We set the number of hidden units in the fully-connected subnetworks to $h = 30$, and the dimensionality of the representation to $r = 30$. We also increase the number of training iterations to $n_{\text{iter},t} = 15{,}000$ and $n_{\text{iter},s} = 5{,}000$.

L CASE STUDY: CALIFORNIA'S TOBACCO CONTROL PROGRAM

Overview. To show a real-world application of our INFs, we provide additional results using a case study where we evaluate the effect of California's tobacco control program (Abadie et al., 2010). This refers to the effect of Proposition 99, a large-scale tobacco control program introduced in California after 1988. Proposition 99 increased California's cigarette tax by 25 cents per pack and earmarked the tax revenues for health and anti-smoking education. The main conclusion of Abadie et al. (2010) is that the effects of the tobacco control program are much larger than previously reported. The dataset has also found widespread use in causal inference ever since (e.g., Bellot & van der Schaar, 2021).
[Figure: empirical conditional densities $P(Y = y \mid A = a)$ and estimated interventional densities $\hat{P}_{\text{A-IPTW}}(Y[a] = y)$ for $a \in \{0, 1\}$, plotted over $y$.]

In the original paper (Abadie et al., 2010), the results were based on a synthetic control method but without providing density estimates.

Dataset. After an initial pre-processing, the dataset consists of 39 states, including California. For each state, we observe several covariates (e.g., beer consumption per capita, GDP per capita, retail price, and percent of people aged 15-24) and the outcome, i.e., cigarette sales per capita. These are recorded annually for each year from 1970 to 2001. Further details on the dataset are in Abadie et al. (2010). To apply our INFs, we make several strong assumptions. First, as there is only one treated state, it is impossible to satisfy the positivity assumption. Therefore, we consider a tuple (state, year) as an independent unit of measurement, thus obtaining $n = 1209$ observations with 12 treated observations (i.e., those of the state of California after 1989). We also add the year as a covariate, which gives $d_X = 4 + 1$. We acknowledge that, even after this pre-processing, we still cannot formally guarantee the independence between units of measurement, as the observations of one state over time are not independent. Second, we assume that consistency holds and that there is no spillover effect between neighboring states, so that the potential outcome of one state is independent of the others.

Results. We plot the empirical conditional and the estimated interventional distributions in Fig. 8. The results are in line with the conclusion of Abadie et al. (2010). Our main finding is that the introduction of Proposition 99 ($a = 1$) in all the states from 1970 would substantially reduce tobacco sales. In particular, the mass of the interventional density is shifted to the left, which reflects the reduction in consumption. As a robustness check, we analyze the role of the smoothness hyperparameter. Our conclusion remains consistent if one specifies a different smoothness hyperparameter for the student flow, i.e., $n_{\text{knots},s} = 5$ and $10$. The specification of this hyperparameter is based on the prior knowledge of a researcher and cannot be chosen via observational data. However, we find consistent evidence of a positive effect.



1. We distinguish the interventional distribution (i.e., $P(Y[a])$) and the counterfactual distribution (i.e., $P(Y[a] \mid A = a')$), which are different in general. This can be seen by comparing plots (a) vs. (b) and (c) in Fig. 1. For further information, we refer to Appendix B.
2. Code is available at https://anonymous.4open.science/r/AnonymousInterFlow-E2F3.
3. This is a standard approach in neural conditional density estimation (Bishop, 1994; Kingma & Welling, 2013).
4. One can use a single normalizing flow with a hypernetwork for categorical treatments.
5. MDNs were previously used to estimate the conditional distribution of the outcome for quantifying the ignorance regions of ITE estimation (Jesson et al., 2021; 2022). However, this is different from our IDE task.
6. This is commonly known as a causal Markov condition.
7. This is stated in our identifiability assumptions: there is no limitation on the exogenous noise independence between outcome and covariates. Hence, our setting is more general.
8. https://anonymous.4open.science/r/AnonymousInterFlow-E2F3
9. After one-hot-encoding of categorical covariates.
10. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html




Figure 1: Motivating example showing the densities of the observational, interventional, and counterfactual distributions of the outcome $Y$. These are simulated via the structural causal model on the right (here, $N(x; \mu, \sigma^2)$ are densities of the normal distribution, and $b = 3$ is a covariate shift, which regulates the probability of treatment assignment). The potential outcomes have different distributions but the same mean $\mathbb{E}(Y[0]) = \mathbb{E}(Y[1]) \approx 4.77$ and the same variance $\mathrm{var}(Y[0]) = \mathrm{var}(Y[1]) \approx 4.06$. Here, $Y[a]$ is the potential outcome given treatment $a$. (a) Interventional distributions. (b) and (c) Observational and counterfactual distributions for the same outcomes. As shown here, the observational, interventional, and counterfactual distributions can be vastly different.

Appendix B). At the same time, by flipping the treatment assignment in this equation, we obtain counterfactual outcomes $Y[a] \mid A = a'$. We observe that the potential outcomes have the same mean (i.e., $\mathbb{E}(Y[0]) = \mathbb{E}(Y[1])$) and the same variance (i.e., $\mathrm{var}(Y[0]) = \mathrm{var}(Y[1])$). Hence, the ground-truth ATE equals zero. Nevertheless, the distributions of the potential outcomes (i.e., $P(Y[a])$) are clearly different. Hence, in medical practice, acting upon the ATE without knowledge of the distributions of potential outcomes could have severe, negative effects. To show this, let us consider a "do nothing" treatment ($a = 0$) and some medical treatment ($a = 1$). Further, let us consider an outcome to be successful if some risk score $Y$ is below the threshold of five. Then, the probability of treatment success (i.e., $P\{Y[1] < 5.0\} \approx 0.63$) is much larger than the probability of success after the "do nothing" treatment (i.e., $P\{Y[0] < 5.0\} \approx 0.51$), highlighting the importance of treatment.
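The point of the motivating example, namely that equal means and variances do not imply equal success probabilities, can be reproduced with a minimal sketch. The two distributions below are hypothetical and are not the SCM of Fig. 1: a unimodal normal and a bimodal normal mixture that share the same mean and variance.

```python
import math

def normal_cdf(y, mean, std):
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))

def mixture_cdf(y, components):
    """CDF of a normal mixture; components is a list of (weight, mean, std)."""
    return sum(w * normal_cdf(y, m, s) for w, m, s in components)

def mixture_mean(components):
    return sum(w * m for w, m, _ in components)

def mixture_var(components):
    mean = mixture_mean(components)
    return sum(w * (s * s + m * m) for w, m, s in components) - mean * mean

# Hypothetical potential outcome distributions with identical mean (5.0) and variance (1.0).
untreated = [(1.0, 5.0, 1.0)]                      # unimodal
treated = [(0.5, 4.2, 0.6), (0.5, 5.8, 0.6)]       # bimodal

threshold = 4.5                                    # hypothetical success threshold
p_success_untreated = mixture_cdf(threshold, untreated)
p_success_treated = mixture_cdf(threshold, treated)
```

Although the average treatment effect is zero here, the probability of falling below the threshold differs between the two distributions, which is exactly the kind of information that only a density estimate provides.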

Figure 2: Overview of Interventional Normalizing Flows. Our INFs combine two normalizing flows, which we call "teacher flow" and "student flow". The teacher flow estimates the nuisance parameters, i.e., the propensity score $\pi_a(X)$ and the conditional outcome distribution $\mathbb{P}(Y \mid X, A)$. The student flow utilizes them to estimate the projection parameters $\hat{\beta}_a^{\text{A-IPTW}}$. Both teacher and student flows are fitted via a two-step training procedure.

COVARIATE-ADJUSTED ESTIMATOR

Let the $d$-dimensional random variable $T(Y; \beta_a) = -\nabla_{\beta_a} \log g(Y; \beta_a)$ denote the score function. Following Kennedy et al. (2021), the projection parameters can be equivalently expressed as a solution to the moment condition $m(\beta_a)$

Figure 3: Results for synthetic data based on the SCM from Figure 1. Reported: mean over ten-fold train-test splits. Some runs for MDNs resulted in an out-sample log-probability of $-\infty$ and, thus, are not shown.

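Given an initial estimator $\hat{\mathbb{P}}(Y = \cdot \mid X, A = a)$ and the marginal empirical probability measure $\mathbb{P}_n\{\cdot\}$, the plug-in estimator becomes $\hat{\mathbb{P}}^{\text{PI}}(Y[a] = \cdot) = \mathbb{P}_n\{\hat{\mathbb{P}}(Y = \cdot \mid X, A = a)\}$.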
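The plug-in estimator can be read as an average of the fitted conditional density over the observed covariates. The sketch below illustrates this under stated assumptions: `cond_density` is a hypothetical stand-in for a fitted conditional density estimator (here a unit-variance Gaussian with a linear mean), and the covariates are simulated; neither is taken from the paper.

```python
import random
from statistics import NormalDist


def cond_density(y, x, a):
    """Hypothetical fitted conditional density of Y given X = x and A = a."""
    return NormalDist(mu=2.0 * a + x, sigma=1.0).pdf(y)


def plug_in_density(y, covariates, a):
    """Average the conditional density over the empirical covariate measure P_n."""
    return sum(cond_density(y, x, a) for x in covariates) / len(covariates)


rng = random.Random(0)
covariates = [rng.gauss(0.0, 1.0) for _ in range(500)]  # simulated X_1, ..., X_n

# Evaluate the plug-in density for a = 1 on a grid of outcome values.
step = 0.05
grid = [-10.0 + step * k for k in range(481)]  # covers [-10, 14]
density_a1 = [plug_in_density(y, covariates, a=1) for y in grid]

# Because each conditional density integrates to one, so does their average.
total_mass = sum(density_a1) * step
print(f"approximate total mass: {total_mass:.4f}")
```

Averaging properly normalized conditional densities yields a properly normalized estimate, which the Riemann sum at the end checks numerically.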

A bias-corrected estimator for the projection parameter can be obtained by solving $\hat{m}^{\text{A-IPTW}}(\beta_a) = \hat{m}^{\text{PI}}(\beta_a) + \mathbb{P}_n\{\phi_a(T(Y; \beta_a); \hat{\mathbb{P}})\} \overset{!}{=} 0$.

Note that the plug-in estimator of the moment condition, $\hat{m}^{\text{PI}}(\beta_a)$, can be rewritten as

\begin{align}
\hat{m}^{\text{PI}}(\beta_a) &= \mathbb{E}_{Y_a \sim \mathbb{P}_n\{\hat{\mathbb{P}}(Y \mid X, A = a)\}}\, T(Y_a; \beta_a) = \int_{\mathcal{Y}} T(y; \beta_a)\, \mathbb{P}_n\{\hat{\mathbb{P}}(Y = y \mid X, A = a)\}\, dy \tag{32} \\
&= \mathbb{P}_n\left\{\int_{\mathcal{Y}} T(y; \beta_a)\, \hat{\mathbb{P}}(Y = y \mid X, A = a)\, dy\right\} = \mathbb{P}_n\left\{\hat{\mathbb{E}}\big(T(Y; \beta_a) \mid X, A = a\big)\right\}, \tag{33}
\end{align}

where the last equality follows from the definition of the conditional expectation. Notably, we see that the moment condition could be equivalently solved with either the conditional distribution, $\hat{\mathbb{P}}(Y \mid X, A = a)$, or with the functional regression, $\hat{\mathbb{E}}(T(Y; \beta_a) \mid X, A = a)$.
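The difference between the plug-in and the bias-corrected moment condition can be illustrated with a small simulation. This is a sketch under several assumptions that are not taken from the paper: the student density $g$ is a unit-variance Gaussian with mean $\beta$ (so $T(y; \beta) = -(y - \beta)$), the correction term is assumed to take the standard augmented IPTW form $\frac{\mathbb{1}(A = a)}{\pi_a(X)}\big(T(Y; \beta_a) - \hat{\mathbb{E}}(T(Y; \beta_a) \mid X, A = a)\big)$, the true propensity score is used, and the outcome nuisance is deliberately misspecified. All function names are hypothetical.

```python
import math
import random


def score_T(y, beta):
    """T(y; beta) = -d/dbeta log g(y; beta) for a unit-variance Gaussian g."""
    return -(y - beta)


def simulate(n, seed=0):
    """Toy observational data with confounded treatment; E[Y[1]] = 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        pi1 = 1.0 / (1.0 + math.exp(-x))   # true propensity of a = 1
        a = 1 if rng.random() < pi1 else 0
        y = x + a + rng.gauss(0.0, 1.0)
        data.append((x, a, y, pi1))
    return data


def cond_mean_T(x, beta):
    """Misspecified nuisance: pretends E[Y | X = x, A = 1] = 0.5 for every x.

    T is linear in y, so its conditional mean is T evaluated at that value.
    """
    return score_T(0.5, beta)


def moment_pi(beta, data):
    """Plug-in moment: empirical mean of the estimated conditional mean of T."""
    return sum(cond_mean_T(x, beta) for x, _, _, _ in data) / len(data)


def moment_aiptw(beta, data):
    """Plug-in moment plus the (assumed) augmented IPTW correction for a = 1."""
    correction = 0.0
    for x, a, y, pi1 in data:
        weight = 1.0 / pi1 if a == 1 else 0.0
        correction += weight * (score_T(y, beta) - cond_mean_T(x, beta))
    return moment_pi(beta, data) + correction / len(data)


def solve(moment, data, lo=-10.0, hi=10.0, iters=30):
    """Bisection root search; both moments above are increasing in beta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment(mid, data) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


data = simulate(20000)
beta_pi = solve(moment_pi, data)
beta_aiptw = solve(moment_aiptw, data)
print(f"plug-in root: {beta_pi:.3f}, bias-corrected root: {beta_aiptw:.3f}")
```

In this toy example, the plug-in root inherits the error of the misspecified outcome nuisance, whereas the corrected root should land close to the true value of one because the propensity score is correct.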

Fig. 5 shows both the ground-truth interventional ($\mathbb{P}(Y[a])$) and observational ($\mathbb{P}(Y \mid A = a)$) distributions together with the A-IPTW estimator of our INFs ($\hat{\mathbb{P}}^{\text{A-IPTW}}(Y[a])$). Remarkably, the interventional distributions in HC-MNIST are multi-modal and differ substantially from the observational distributions.

Figure 5: Empirical ground-truth interventional and conditional distributions of the HC-MNIST synthetic outcome. We also plot our INFs density estimator, i.e., $\hat{\mathbb{P}}^{\text{A-IPTW}}(Y[a])$.

Figure 6: Detailed results for ACIC 2016. For each dataset, we perform five random train-test splits, tune the baselines on the first split, and evaluate the average in-sample / out-sample log-probability for each of the two potential outcomes separately. Shown: median over five runs and improvement of our INFs (main), when they score better than other baselines.

Figure 7: Detailed results for ACIC 2018, sorted with respect to sample sizes. For each dataset, we perform five random train-test splits, tune the baselines on the first split, and evaluate the average in-sample / out-sample log-probability for each of the two potential outcomes separately. Shown: median over five runs and improvement of our INFs (main), when they score better than other baselines.

Figure 8: Empirical ground-truth conditional and estimated interventional distributions of cigarette sales per capita from 1970 to 2001. Treatment $a = 1$ corresponds to the introduction of Proposition 99, that is, a comprehensive tobacco tax along with educational programs. We plot our INFs density estimator, $\hat{\mathbb{P}}^{\text{A-IPTW}}(Y[a])$, with different values of the smoothness hyperparameter $n_{\text{knots},s}$ of the student flow.

Overview of methods for interventional density estimation from observational data.

Results for IHDP dataset. Reported: mean ± sd.

Results for ACIC 2016 and ACIC 2018. Reported: % of runs with the best performance.

Dorie et al., 2019; Shimoni et al., 2018) (see details in Appendix H). We select 15 random datasets from ACIC 2016 and 24 random datasets (4 of each of 6 sizes) from ACIC 2018. We perform five random train/test splits (80%/20%) for each dataset, tune hyperparameters on the first split, and evaluate the average in- and out-sample log-probability on every split. Table 3 provides the performance comparison. Again, our INFs show a clear performance improvement over both baselines and other model variants. Compared to MDNs as the second-best method, our INFs scale much better in terms of runtime, especially for large sample sizes (see Appendix K).

Results for HC-MNIST. Reported: mean ± sd. over ten random train-test splits.

Training procedure of INFs

Input: number of iterations $n_{\text{iter},t}$, $n_{\text{iter},s}$; minibatch sizes $b_t$, $b_s$; learning rates $\eta_t$, $\eta_s$; intensities of the noise regularization $\sigma_x^2$, $\sigma_y^2$; EMA smoothing $\gamma$; grid size $K$.

F.3 DKME

Distributional kernel mean embeddings (DKME) (Muandet et al., 2021) is a non-parametric plug-in estimator of interventional densities. This method builds a kernel mean embedding (KME), namely, $\mu_{Y \mid X, A=a}$, for the conditional distribution $\mathbb{P}(Y \mid X, A = a)$. Here, $l_a(\cdot, \cdot)$ is a measurable positive definite kernel associated with a reproducing kernel Hilbert space $\mathcal{H}$, so that $\mu_{Y \mid X, A=a}$ provides a mapping from the space of conditional distributions to the space of functions $\mathcal{H}$. If $l_a(\cdot, \cdot)$ is properly normalized, then $\mu_{Y \mid X, A=a}(y)$ is in fact a conditional density estimator.

To estimate the KME of the conditional outcome distribution (conditional mean embedding), we use the i.i.d. sample $\mathcal{D} = \{X_i, A_i, Y_i\}_{i=1}^{n}$ and split it into control and treated subsamples:


a = 1 103565436.csv 103742774.csv 114529266.csv 126207042.csv 129048260.csv 134157592.csv 142307479.csv 146384837.csv 166227859.csv 170631108.csv 171703928.csv 171980973.csv 174570990.csv 178541573.csv 186625088.csv 187767828.csv 200561702.csv 201870256.csv 206102672.csv 207685059.csv 211300829.csv 213344131.csv 224612511.csv 235427943.csv 238974649.csv 247513359.csv 250368854.csv 259040773.csv 259543770.csv 262933260.csv 270968850.csv 271787298.csv 272163561.csv 279971358.csv 283645672.csv 297653149.csv 298036285.csv 303833897.csv 310780972.csv 312245601.csv 313668696.csv 323744148.csv 323803854.csv 32472771.csv 325079976.csv 336720379.csv 34127908.csv 34296078.csv 347023904.csv 349098715.csv 349833705.csv 36796048.csv 39000213.csv 39529979.csv 46976133.csv 47360868.csv 48450540.csv 48813636.csv 48977235.csv 53430236.csv 56895733.csv 57390631.csv 58126234.csv 58515705.csv 66096714.csv 69816059.csv 73068755.csv 74339461.csv 76039753.csv 7693621.csv 78747024.csv 80148713.csv 86739305.csv 9256039.csv

G HYPERPARAMETER TUNING

We performed hyperparameter tuning for all the baselines based on five-fold cross-validation using the train subset. For each baseline, we performed a grid search with respect to different tuning criteria, evaluated on the validation subsets. Table 5 shows grids for hyperparameter tuning and other parameters, such as tuning criteria, number of training iterations, and optimizers. We aimed for a fair comparison and thus kept the number of parameters, network structures, and grid size similar across models. For the sake of reproducibility, we make the chosen hyperparameters for all the experiments public (see YAML files in our GitHub 8 ).

Importantly, to facilitate the convergence of baseline methods, we additionally perform a standard normalization of both factual and counterfactual outcomes for all the datasets.

We sample $n = 1000$ observations from the SCM (Fig. 1) and use a ten-fold split for train/test samples (90%/10%). We separately perform hyperparameter tuning based on the first split for each baseline and each level $b$. We then report the average out-sample log-likelihood over ten folds.

H.2 IHDP DATASET

The IHDP dataset (Hill, 2011) uses a real-world dataset with 25 covariates (6 continuous, 19 binary) and one binary treatment, capturing aspects related to children and their mothers. Both treated and untreated synthetic outcomes of IHDP are sampled from different conditional normal distributions. These distributions are homoscedastic ($\sigma^2 = 1$) but have substantially different conditional means. We used the setting "B" in (Hill, 2011) with an SCM in which $\beta$, $W$, $\omega$ are constant parameters of the simulation. For further details, we refer to (Hill, 2011).

H.3 ACIC 2016 & 2018 DATASETS

Covariates of ACIC 2016 are taken from a large study of developmental disorders (Niswander, 1972), and covariates of ACIC 2018 are derived from the linked birth and infant death data (MacDorman & Atkinson, 1998). ACIC 2016 and ACIC 2018 differ in the number of true confounders, the varying level of overlap, and the form of conditional outcome distributions. ACIC 2016 has 77 different data-generating mechanisms with 100 equal-sized samples for each mechanism ($n = 4802$, $d_X = 82$). 9 ACIC 2018 provides 63 distinct data-generating mechanisms with around 40 non-equal-sized samples for each mechanism ($n$ ranges from 1,000 to 50,000, $d_X = 177$). Notably, ACIC 2018 has a constant ITE for most of the datasets, but heterogeneous propensity scores.

H.4 HC-MNIST

Jesson et al. (2021) introduced a complex, high-dimensional, semi-synthetic dataset based on the MNIST image dataset (LeCun, 1998), namely HC-MNIST. This dataset maps high-dimensional images onto a one-dimensional manifold, where potential outcomes depend in a complex way on the average intensity of light and the label of an image. The treatment also uses this one-dimensional summary, $\phi$, together with an additional (hidden) synthetic confounder, $U$. This is described by an SCM in which $c$ is the label of the digit from the sampled image $X$; $\mu_{N_x}$ is the average intensity of the sampled image; $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the average intensities of the images with the label $c$; and $\text{Min}_c = -2 + \frac{4}{10} c$, $\text{Max}_c = -2 + \frac{4}{10}(c + 1)$. The parameter $\Gamma^*$ defines which factor influences the treatment assignment to a larger extent, i.e., the additional confounder or the one-dimensional summary. We set $\Gamma^* = \exp(1)$. For further details, we refer to (Jesson et al., 2021).

I SYNTHETIC TWO-DIMENSIONAL DATA

In the following, we benchmark our INFs for estimating an interventional density of a multi-dimensional outcome, $d_Y = 2$.

Noisy moons synthetic data. We used a standard two-dimensional toy data generator, namely moons data. 10 It draws samples from two interleaving half-circles with different noise levels $\varepsilon$. The noise level controls the level of overlap between the two half-circles (a higher $\sigma$ corresponds to a better overlap and, thus, to a better satisfaction of the positivity assumption). We draw $n = 1000$ observations of two-dimensional covariates, i.e., $d_X = 2$, and use inclusion in the top or bottom semi-circle as the treatment. Finally, we generate the synthetic outcome by rotating the covariates by a random treatment-specific angle, i.e., $\alpha_0$ and $\alpha_1$, where $1_2 = (1, 1)^T$ and $R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$ is an $\alpha$-angle rotation matrix. We set $\sigma = 0.75$.

For the benchmarking with the noisy moons data, we increased the number of training iterations for all the plug-in methods ($n_{\text{iter}} = 10000$) and for our INFs ($n_{\text{iter},t} = 10000$, $n_{\text{iter},s} = 5000$). To model the two-dimensional (conditional) density, we employed an auto-regressive extension of neural spline flows (Dolatabadi et al., 2020). We decreased the number of sampled points for approximating the cross-entropy, $K = 70$, to speed up the training, and set the number of knots for the student flow to $n_{\text{knots},s} = 5$.

Results. Table 6 shows the results. Here, our INFs (main) score second best in terms of in-sample performance, but, more importantly, best in out-sample performance. MDNs, although scoring best in terms of in-sample average log-probability, do not generalize well. Finally, we again confirm that our INFs are superior to their ablations and other existing methods, e.g., KDE and DKME.

Table 6: Results for synthetic experiments using the noisy moons synthetic data. The performance is benchmarked using the empirical in-sample / out-sample average log-probability for the two potential outcomes (i.e., $a = 0$ and $a = 1$). Reported: mean ± standard deviation over ten-fold train-test splits.

J ADDITIONAL RESULTS

J.1 IHDP DATASET

Here, we provide additional results for the IHDP dataset with an alternative evaluation metric, that is, the empirical Wasserstein distance.

Evaluation metric. For one-dimensional outcomes, the Wasserstein distance between two distributions can be simply expressed via quantile functions,

$$W_1(\mathbb{P}_1, \mathbb{P}_2) = \int_0^1 \left| F_1^{-1}(q) - F_2^{-1}(q) \right| \, dq,$$

where $F_1^{-1}(q)$ and $F_2^{-1}(q)$ are quantile functions of $\mathbb{P}_1$ and $\mathbb{P}_2$, respectively. The Wasserstein distance is not upper-bounded and equals zero if and only if both distributions are the same. Here, we compute the empirical Wasserstein distance, i.e., $\hat{W}_1$, based on empirical quantile functions. This requires two samples: one from the ground-truth interventional distribution and another from the estimated density. Therefore, methods which do not provide direct sampling (e.g., KDE and DKME) cannot be used for evaluation.

Table 7 shows the results. Note that TARNet* is the plug-in estimator with the ground-truth conditional density estimator for this specific dataset, owing to how the data was constructed. Hence, we do not interpret TARNet* as a baseline but rather interpret it as a bound for the best performance. We see that all baselines (MDNs, CNF, INFs w/o bias correction) are inferior by a large margin. In contrast, our INFs achieve a performance similar to the bound. In particular, our INFs perform overall best: our INFs are superior to the two other naïve plug-in estimators and the variant of INFs without bias correction. In sum, the results corroborate our findings from the main paper and add to the effectiveness of our INFs.

Table 7: Additional results for semi-synthetic experiments using the IHDP dataset. The performance is benchmarked using the empirical in-sample / out-sample Wasserstein distance (i.e., $\hat{W}_1^{\text{in}}$ and $\hat{W}_1^{\text{out}}$) for the two potential outcomes (i.e., $a = 0$ and $a = 1$). Reported: mean ± standard deviation over ten-fold train-test splits.
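The quantile-based evaluation metric described in this section can be sketched in a few lines. The code below is an illustrative approximation, not the evaluation code of the paper: it averages the absolute difference between two step-function empirical quantile functions over a midpoint grid of quantile levels, and the function name `empirical_w1`, the grid size, and the toy samples are assumptions.

```python
import random


def empirical_w1(sample_1, sample_2, n_quantiles=1000):
    """Approximate W_1 by averaging |F1^{-1}(q) - F2^{-1}(q)| over a grid of q."""
    sorted_1, sorted_2 = sorted(sample_1), sorted(sample_2)

    def quantile(sorted_sample, q):
        # Step-function empirical quantile: order statistic at index floor(q * n).
        idx = min(int(q * len(sorted_sample)), len(sorted_sample) - 1)
        return sorted_sample[idx]

    # Midpoint grid of quantile levels in (0, 1).
    levels = [(k + 0.5) / n_quantiles for k in range(n_quantiles)]
    total = sum(abs(quantile(sorted_1, q) - quantile(sorted_2, q)) for q in levels)
    return total / n_quantiles


rng = random.Random(0)
ground_truth = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # toy "true" sample
shifted = [v + 2.0 for v in ground_truth]                  # toy "estimated" sample

print(empirical_w1(ground_truth, ground_truth))  # identical samples
print(empirical_w1(ground_truth, shifted))       # pure location shift by 2
```

For a pure location shift, every quantile moves by the same amount, so the distance equals the size of the shift; this makes a convenient sanity check before applying the metric to samples drawn from an estimated density.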

