NEURAL CAUSAL MODELS FOR COUNTERFACTUAL IDENTIFICATION AND ESTIMATION

Abstract

Evaluating hypothetical statements about how the world would be had a different course of action been taken is arguably one of the key capabilities expected from modern AI systems. Counterfactual reasoning underpins discussions in fairness, the determination of blame and responsibility, credit assignment, and regret. In this paper, we study the evaluation of counterfactual statements through neural models. Specifically, we tackle two causal problems required to make such evaluations, i.e., counterfactual identification and estimation from an arbitrary combination of observational and experimental data. First, we show that neural causal models (NCMs) are expressive enough and encode the structural constraints necessary for performing counterfactual reasoning. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions, and show that this algorithm is sound and complete for deciding counterfactual identification in general settings. Third, considering the practical implications of these results, we introduce a new strategy for modeling NCMs using generative adversarial networks. Simulations corroborate the proposed methodology.

1. INTRODUCTION

Counterfactual reasoning is one of humans' high-level cognitive capabilities, used across a wide range of affairs, including determining how objects interact, assigning responsibility, credit and blame, and articulating explanations. Counterfactual statements underpin prototypical questions of the form "what if-" and "why-", which inquire about hypothetical worlds that have not necessarily been realized (Pearl & Mackenzie, 2018). If a patient, Alice, had taken a drug and died, one may wonder: "why did Alice die?"; "was it the drug that killed her?"; "would she be alive had she not taken the drug?". In the context of fairness, why did an applicant, Joe, not get the job offer? Would the outcome have changed had Joe been a Ph.D.? Or perhaps of a different race? These are examples of fundamental questions about attribution and explanation, which evoke hypothetical scenarios that disagree with the current reality and which not even experimental studies can reconstruct. We build on the semantics of counterfactuals based on a generative process called a structural causal model (SCM) (Pearl, 2000). A fully instantiated SCM M* describes a collection of causal mechanisms and a distribution over exogenous conditions. Each M* induces families of qualitatively different distributions related to the activities of seeing (called observational), doing (interventional), and imagining (counterfactual), which together are known as the ladder of causation (Pearl & Mackenzie, 2018; Bareinboim et al., 2022), also called the Pearl Causal Hierarchy (PCH). The PCH is a containment hierarchy in which distributions can be put in increasingly refined layers: observational content goes into layer 1 (L1); experimental into layer 2 (L2); counterfactual into layer 3 (L3). It is understood that there are questions about layers 2 and 3 that cannot be answered (i.e.
are underdetermined) even given all the information in the world about layer 1; further, layer 3 questions are still underdetermined given data from layers 1 and 2 (Bareinboim et al., 2022; Ibeling & Icard, 2020). Counterfactuals represent the finest-grained type of knowledge encoded in the PCH, so naturally, having the ability to evaluate counterfactual distributions is an attractive proposition. In practice, a fully specified model M* is almost never observable, which leads to the question: how can a counterfactual statement, from L3*, be evaluated using a combination of observational and experimental data (from L1* and L2*)? This question embodies the challenge of cross-layer inferences, which entail solving two challenging causal problems in tandem: identification and estimation. In the more traditional causal inference literature, there are different symbolic methods for solving these problems in various settings and under different assumptions. In the context of identification, there exists an arsenal of results that includes celebrated methods such as Pearl's do-calculus (Pearl, 1995), along with different algorithmic methods for inferences about L2-distributions (Tian & Pearl, 2002; Shpitser & Pearl, 2006; Huang & Valtorta, 2006; Bareinboim & Pearl, 2012; Lee et al., 2019; Lee & Bareinboim, 2020; 2021) and L3-distributions (Heckman, 1992; Pearl, 2001; Avin et al., 2005; Shpitser & Pearl, 2009; Shpitser & Sherman, 2018; Zhang & Bareinboim, 2018; Correa et al., 2021). On the estimation side, there are various methods, including the celebrated propensity score/IPW for the backdoor case (Rubin, 1978; Horvitz & Thompson, 1952; Kennedy, 2019; Kallus & Uehara, 2020) and some more relaxed settings (Fulcher et al., 2019; Jung et al., 2020; 2021), but the literature is somewhat scarcer and less developed. In fact, there is a lack of estimation methods for L3 quantities in most settings.
On another thread in the literature, deep learning methods have achieved outstanding empirical success in solving a wide range of tasks in fields such as computer vision (Krizhevsky et al., 2012), speech recognition (Graves & Jaitly, 2014), and game playing (Mnih et al., 2013). One key feature of deep learning is its ability to allow inferences to scale with the data to high-dimensional settings. We study here the suitability of the neural approach for tackling the problems of causal identification and estimation, while trying to leverage the benefits of these new advances experienced in non-causal settings. The idea behind the approach pursued here is illustrated in Fig. 1. Specifically, we will search for a neural model M̂ (r.h.s.) that has the same generative capability as the true, unobserved SCM M* (l.h.s.); in other words, M̂ should be able to generate the same observed/inputted data, i.e., L1 = L1* and L2 = L2*. To tackle this task in practice, we use an inductive bias for the neural model in the form of a causal diagram (Pearl, 2000; Spirtes et al., 2000; Bareinboim & Pearl, 2016), which is a parsimonious description of the mechanisms (F*) and exogenous conditions (P(U*)) of the generating SCM. The question then becomes: under what conditions can a model trained using this combination of qualitative inductive bias and the available data be suitable to answer questions about hypothetical counterfactual worlds, as if we had access to the true M*? There exists a growing literature that leverages modern neural methods to solve causal inference tasks. Our approach, based on proxy causal models, will answer causal queries by direct evaluation through a parameterized neural model M̂ fitted on the data generated by M*.
For instance, some recent work solves the estimation of interventional (L2) or counterfactual (L3) distributions from observational (L1) data in Markovian settings, implemented through architectures such as GANs, flows, GNNs, and VGAEs (Kocaoglu et al., 2018; Pawlowski et al., 2020; Zecevic et al., 2021; Sanchez-Martin et al., 2021). In some real-world settings, Markovianity is too stringent a condition (see discussion in App. D.4) and may be violated, which leads to the separation between layers 1 and 2 and, in turn, to issues of causal identification. The proxy approach discussed above was pursued in Xia et al. (2021) to solve the identification and estimation of interventional distributions (L2) from observational data (L1) in non-Markovian settings. That work introduced an object we leverage throughout this paper called the Neural Causal Model (NCM, for short), which is a class of SCMs constrained to neural network functions and fixed distributions over the exogenous variables. While NCMs have been shown to be able to solve the identification and estimation tasks for L2 queries, their potential for counterfactual inferences is still largely unexplored, and existing implementations have been constrained to low-dimensional settings. Despite all the progress achieved so far, no practical methods exist for estimating counterfactual (L3) distributions in the general setting where an arbitrary combination of observational (L1) and experimental (L2) distributions is available and unobserved confounders exist (i.e., Markovianity does not hold). Hence, in addition to providing the first neural method of counterfactual identification, this paper establishes the first general counterfactual estimation technique even among non-neural methods, leveraging the neural toolkit for scalable inferences. Specifically, our contributions are: 1.
We prove that when fitted with a graphical inductive bias, the NCM encodes the L3-constraints necessary for performing counterfactual inference (Thm. 1), and that NCMs are still expressive enough to model the underlying data-generating model, which is not necessarily a neural network (Thm. 2). 2. We show that counterfactual identification within a neural proxy model setting is equivalent to established symbolic approaches (Thm. 3). We leverage this duality to develop an optimization procedure (Alg. 1) for counterfactual identification and estimation that is both sound and complete (Corol. 2). The approach is general in that it accepts any combination of inputs from L1 and L2, works for any causal diagram, and does not require the Markovianity assumption to hold. 3. We develop a new approach to modeling the NCM using generative adversarial networks (GANs) (Goodfellow et al., 2014), capable of robustly scaling inferences to high dimensions (Alg. 3). We then show how GAN-NCMs can solve the challenging optimization problems involved in identifying and estimating counterfactuals in practice. Experiments are provided in Sec. 5 and proofs in Appendix A. All supplementary material can be found in the full technical report (Xia et al., 2022). Preliminaries. We now introduce the notation and definitions used throughout the paper. We use uppercase letters (X) to denote random variables and lowercase letters (x) to denote corresponding values. Similarly, bold uppercase (X) and lowercase (x) letters denote sets of random variables and values, respectively. We use D_X to denote the domain of X, and D_X = D_{X_1} × ··· × D_{X_k} for the domain of X = {X_1, ..., X_k}. We denote by P(X = x) (often shortened to P(x)) the probability of X taking the values x under the probability distribution P(X). We utilize the basic semantic framework of structural causal models (SCMs), as defined in (Pearl, 2000, Ch. 7).
An SCM M consists of endogenous variables V, exogenous variables U with distribution P(U), and mechanisms F. F contains a function f_{V_i} for each variable V_i that maps endogenous parents Pa_{V_i} and exogenous parents U_{V_i} to V_i. Each M induces a causal diagram G, where every V_i ∈ V is a vertex, there is a directed arrow (V_j → V_i) for every V_i ∈ V and V_j ∈ Pa_{V_i}, and there is a dashed bidirected arrow (V_j ↔ V_i) for every pair V_i, V_j ∈ V such that U_{V_i} and U_{V_j} are not independent. For further details, see (Bareinboim et al., 2022, Def. 13/16, Thm. 4). The exogenous U_{V_i}'s are not assumed independent (i.e., Markovianity is not required). Our treatment is constrained to recursive SCMs (implying acyclic causal diagrams) with finite domains over V (see Apps. A/E for details). Each SCM M assigns values to each counterfactual distribution as follows. Definition 1 (Layer 3 Valuation). An SCM M induces layer L3(M), a set of distributions over V, each with the form P(Y*) = P(Y_{1[x_1]}, Y_{2[x_2]}, ...) such that

P^M(y_{1[x_1]}, y_{2[x_2]}, ...) = ∫_{D_U} 1{Y_{1[x_1]}(u) = y_1, Y_{2[x_2]}(u) = y_2, ...} dP(u),

where Y_{i[x_i]}(u) is evaluated under F_{x_i} := {f_{V_j} : V_j ∈ V \ X_i} ∪ {f_X ← x : X ∈ X_i}. Each Y_i corresponds to a set of variables in a world where the original mechanisms f_X are replaced with constants x_i for each X ∈ X_i; this is also known as the mutilation procedure. This procedure corresponds to interventions, and we use subscripts to denote the intervening variables (e.g., Y_x), or subscripts with brackets when the variables are indexed (e.g., Y_{1[x_1]}). For instance, P(y_x, y'_{x'}) is the probability of the joint counterfactual event Y = y had X been x and Y = y' had X been x'. SCM M_2 is said to be P^{(L_i)}-consistent (for short, L_i-consistent) with SCM M_1 if L_i(M_1) = L_i(M_2). We will use Z to denote a set of quantities from layer 2 (i.e., Z = {P(V_{z_k})}_{k=1}^ℓ), and we use Z(M) to denote those same quantities induced by SCM M (i.e.
Z(M) = {P^M(V_{z_k})}_{k=1}^ℓ). We use neural causal models (NCMs) as a substitute (proxy) model for the true SCM, as follows. Definition 2 (G-Constrained Neural Causal Model (G-NCM) (Xia et al., 2021, Def. 7)). Given a causal diagram G, a G-constrained Neural Causal Model (for short, G-NCM) M̂(θ) over variables V with parameters θ = {θ_{V_i} : V_i ∈ V} is an SCM ⟨Û, V, F̂, P(Û)⟩ such that (1) Û = {Û_C : C ∈ C(G)}, where C(G) is the set of all maximal cliques over bidirected edges of G, and D_Û = [0, 1] for all Û ∈ Û; (2) F̂ = {f̂_{V_i} : V_i ∈ V}, where each f̂_{V_i} is a feedforward neural network parameterized by θ_{V_i} ∈ θ mapping values of Û_{V_i} ∪ Pa_{V_i} to values of V_i, for Û_{V_i} = {Û_C : Û_C ∈ Û s.t. V_i ∈ C} and Pa_{V_i} = Pa_G(V_i); (3) P(Û) is defined s.t. Û ~ Unif(0, 1) for each Û ∈ Û.
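To make Def. 2 concrete, the following is a minimal sketch of a G-NCM for the "bow" graph, X → Y with a bidirected edge X ↔ Y, so that C(G) contains a single clique C = {X, Y} and hence a single Unif(0, 1) noise. Tiny logistic units stand in for the feedforward networks f̂_{V_i}; the class name, parameter shapes, and all numeric values are illustrative assumptions, not the paper's implementation.

```python
import math
import random

# A minimal sketch of a G-NCM (Def. 2) for the "bow" graph: X -> Y with
# a bidirected edge X <-> Y.  One exogenous noise U_XY ~ Unif(0, 1) is
# shared by both variables; logistic units stand in for neural networks.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class GNCM:
    def __init__(self, theta_x, theta_y):
        self.theta_x = theta_x  # parameters of f_X(u_XY)
        self.theta_y = theta_y  # parameters of f_Y(x, u_XY)

    def sample_u(self):
        # One Unif(0, 1) variable per bidirected clique of G.
        return {"U_XY": random.random()}

    def f_x(self, u):
        a, b = self.theta_x
        return int(sigmoid(a * u["U_XY"] + b) > 0.5)

    def f_y(self, x, u):
        a, b, c = self.theta_y
        return int(sigmoid(a * x + b * u["U_XY"] + c) > 0.5)

    def sample(self, do_x=None):
        # Evaluate V in topological order; do_x mutilates f_X (Def. 1).
        u = self.sample_u()
        x = self.f_x(u) if do_x is None else do_x
        y = self.f_y(x, u)
        return x, y
```

Because X and Y read the same U_XY, the model can encode unobserved confounding between them; training (Sec. 4) would adjust θ so the induced distributions match the given datasets.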

2. NEURAL CAUSAL MODELS FOR COUNTERFACTUAL INFERENCE

We first recall that inferences about higher layers of the PCH generated by the true SCM M* cannot be made in general through an NCM M̂ trained only on lower-layer data (Bareinboim et al., 2022; Xia et al., 2021). This impossibility motivated the use of the inductive bias in the form of a causal diagram G in the construction of the NCM in Def. 2, which ascertains that the G-consistency property holds. (See App. D.1 for further discussion.) We next define consistency w.r.t. each layer, which will be key for a more fine-grained discussion later on. Definition 3 (G^(L_i)-Consistency). Let G be the causal diagram induced by the SCM M*. For any SCM M, M is said to be G^(L_i)-consistent (w.r.t. M*) if L_i(M) satisfies all layer-i equality constraints implied by G. This generalization is subtle since, regardless of which L_i is used with the definition, the causal diagram G generated by M* is the same. The difference lies in the implied constraints. For instance, if an SCM M is G^(L1)-consistent, that means that G is a Bayesian network for the observational distribution of M, implying independences readable through d-separation (Pearl, 1988). If M is G^(L2)-consistent, that means that G is a causal Bayesian network (CBN) (Bareinboim et al., 2022, Def. 16) for the interventional distributions of M. While several SCMs could share the same d-separation constraints as M*, fewer share all L2 constraints encoded by the CBN. G-consistency at higher layers imposes a stricter set of constraints, narrowing down the set of compatible SCMs. There also exist constraints of layer 3 that are important for counterfactual inferences. To motivate the use of such constraints, consider an example inspired by the multi-armed bandit problem. A casino has 3 slot machines, labeled "0", "1", and "2".
Every day, the casino assigns one machine a good payout, one a bad payout, and one an average payout, with chances of winning represented by exogenous variables U_+, U_-, and U_=, respectively. A customer comes every day and plays a slot machine. X represents their choice of machine, and Y is a binary variable representing whether they win. Suppose a data scientist creates a model of the situation, and she hypothesizes that the casino predicts the customer's choice based on their mood (U_M) and will always assign the predicted machine the average payout to maintain profits. Her model is described by the SCM M':

    U = {U_M, U_+, U_=, U_-}, with U_M ∈ {0, 1, 2} and U_+, U_=, U_- ∈ {0, 1};
    V = {X, Y}, with X ∈ {0, 1, 2} and Y ∈ {0, 1};
    F: f_X(u_M) = u_M, and
       f_Y(x, u_M, u_+, u_=, u_-) = u_= if x = u_M; u_- if x = (u_M - 1) mod 3; u_+ if x = (u_M + 1) mod 3;
    P(U): P(U_M = i) = 1/3 for i ∈ {0, 1, 2}, P(U_+ = 1) = 0.6, P(U_= = 1) = 0.4, P(U_- = 1) = 0.2.

It turns out that in this model P(y_x) = P(y | x). For example, P(Y = 1 | X = 0) = P(U_= = 1) = 0.4, and P(Y_{X=0} = 1) = P(U_M = 0)P(U_= = 1) + P(U_M = 1)P(U_- = 1) + P(U_M = 2)P(U_+ = 1) = (1/3)(0.4) + (1/3)(0.2) + (1/3)(0.6) = 0.4. Suppose the true model M* employed by the casino (and unknown to the customers and the data scientist) induces the graph G = {X → Y}. Interestingly enough, M' would be G^(L2)-consistent with M* since M' is compatible with all L2-constraints, including P(y_x) = P(y | x) and P(x_y) = P(x). However, and perhaps surprisingly, it fails to be G^(L3)-consistent. A further constraint implied by G on the third layer is that P(y_x | x') = P(y_x), which is not true of M'. To witness, note that P(Y_{X=0} = 1 | X = 2) = P(U_+ = 1) = 0.6 in M', which means that if the customer chose machine 2, they would have had a higher payout had they chosen machine 0. This does not match P(Y_{X=0} = 1) = 0.4, computed earlier, so M' fails to encode the L3-constraints implied by G.
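The numbers in this example can be verified mechanically. The sketch below enumerates the exogenous domain of the data scientist's model and weights each configuration by P(U); no assumptions are made beyond the SCM as stated in the text.

```python
from itertools import product

# Exhaustive verification of the casino model from the text: enumerate
# the exogenous domain and weight each configuration by P(U).
P1 = {"+": 0.6, "=": 0.4, "-": 0.2}  # P(U_+ = 1), P(U_= = 1), P(U_- = 1)

def f_y(x, mood, u):
    # u maps "+", "=", "-" to the realized payout indicators.
    if x == mood:
        return u["="]
    if x == (mood - 1) % 3:
        return u["-"]
    return u["+"]  # remaining case: x == (mood + 1) % 3

def prob(event):
    total = 0.0
    for mood, up, ue, un in product(range(3), (0, 1), (0, 1), (0, 1)):
        u = {"+": up, "=": ue, "-": un}
        w = 1 / 3  # P(U_M = mood)
        for key in u:
            w *= P1[key] if u[key] == 1 else 1 - P1[key]
        if event(mood, u):
            total += w
    return total

# f_X(u_M) = u_M, so the factual choice X equals the mood.
p_obs = prob(lambda m, u: m == 0 and f_y(0, m, u) == 1) / prob(lambda m, u: m == 0)
p_do = prob(lambda m, u: f_y(0, m, u) == 1)             # P(Y_{X=0} = 1)
p_ctf = prob(lambda m, u: m == 2 and f_y(0, m, u) == 1) / prob(lambda m, u: m == 2)
```

Here p_obs and p_do both come out to 0.4, while p_ctf is 0.6, reproducing the violated L3-constraint P(y_x | x') = P(y_x) discussed above.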
In general, the causal diagram encodes a family of L3-constraints which we leverage to make cross-layer inferences. A more detailed discussion can be found in Appendix D. We show next that NCMs encode all of the equality constraints related to L3, in addition to the known L2-constraints. Theorem 1 (NCM G^(L3)-Consistency). Any G-NCM M̂(θ) is G^(L3)-consistent. This will be a key result for performing inferences at the counterfactual level. Similar to how constraints about layer 2 distributions help bridge the gap between layers 1 and 2, layer 3 constraints allow us to extend our inference capabilities into layer 3. (In fact, most of L3's distributions are not obtainable through experimentation.) While this graphical inductive bias is powerful, the set of NCMs constrained by G is no less expressive than the set of SCMs constrained by G, as shown next. Theorem 2 (L3-G Expressiveness). For any SCM M* that induces causal diagram G, there exists a G-NCM M̂(θ) = ⟨Û, V, F̂, P(Û)⟩ s.t. M̂ is L3-consistent w.r.t. M*. This result ascertains that the NCM class is as expressive as, and therefore contains the same generative capabilities as, the original generating model. More interestingly, even if the original SCM M* does not belong to the NCM class but comes from the larger space of SCMs, there exists an NCM M̂(θ) capable of expressing the collection of distributions from all layers of the PCH induced by it. A visual representation of these two results is shown in Fig. 2. The space of all SCMs is called Ω*, and the subspace that contains all SCMs G^(L_i)-consistent w.r.t. the true SCM M* (black dot) is called Ω*(G^(L_i)). Note that the G^(L_i) space shrinks with higher layers, indicating a more constrained space with fewer SCMs. Thm. 1 states that all G-NCMs (Ω(G)) are within Ω*(G^(L3)), and Thm. 2 states that all SCMs in Ω*(G^(L3)) can be represented by a corresponding G-NCM on all three layers.
It may seem intuitive that the G-NCM has these two properties by construction, but these properties are nontrivial and, in fact, not enjoyed by many model classes. Examples can be found in Appendix D. Together, these two theorems ensure that the NCM has both the constraints and the expressiveness necessary for counterfactual inference, elaborated further in the next section.

3. NEURAL COUNTERFACTUAL IDENTIFICATION

The problem of identification is concerned with determining whether a certain quantity is computable from a combination of assumptions, usually encoded in the form of a causal diagram, and a collection of distributions (Pearl, 2000, p. 77). This challenge stems from the fact that even though the space of SCMs (or NCMs) is constrained upon assuming a certain causal diagram, the quantity of interest may still be underdetermined. In words, there are many SCMs that are compatible with the same diagram G but generate different answers for the target distribution. In this section, we investigate the problem of identification and decide whether counterfactual quantities (from L3) can be inferred from a combination of a subset of L2 and L1 datasets together with G, as formally defined next. Definition 4 (Neural Counterfactual Identification). Consider an SCM M* and the corresponding causal diagram G. Let Z = {P(V_{z_k})}_{k=1}^ℓ be a collection of available interventional (or observational, if Z_k = ∅) distributions from M*. The counterfactual query P(Y* = y* | X* = x*) is said to be neural identifiable (identifiable, for short) from the set of G-constrained NCMs Ω(G) and Z if and only if P^{M_1}(y* | x*) = P^{M_2}(y* | x*) for every pair of models M_1, M_2 ∈ Ω(G) s.t. they match M* on all distributions in Z (i.e., Z(M*) = Z(M_1) = Z(M_2) > 0). From a symbolic standpoint, a counterfactual quantity P(y* | x*) is identifiable from G and Z if all SCMs that induce the distributions of Z and abide by the constraints of G also agree on P(y* | x*). This is illustrated in Fig. 3. In the definition above, the search is constrained to the NCM subspace (shown in light gray) within the space of SCMs (dark gray). It may be concerning that the true SCM M* might not be an NCM, as we alluded to earlier. The next result ascertains that identification within the constrained space of NCMs is actually equivalent to identification in the original SCM space.
Theorem 3 (Counterfactual Graphical-Neural Equivalence (Dual ID)). Let Ω*, Ω be the spaces including all SCMs and NCMs, respectively. Consider the true SCM M* and the corresponding causal diagram G. Let Q = P(y* | x*) be the target query and Z the set of observational and interventional distributions available. Then, Q is neural identifiable from Ω(G) and Z if and only if it is identifiable from G and Z.

[Fig. 3: the spaces Ω* and Ω, containing models M*, M̂_1, M̂_2 that match on the (L1, L2) data distributions, Z(M*) = Z(M̂_1) = Z(M̂_2), but may disagree on the (L3) query P(y* | x*).]

Corollary 1. Let M̂ ∈ Ω(G) be a G-constrained NCM such that Z(M̂) = Z(M*). If Q is identifiable from G and Z, then Q is computable via Eq. 1 from M̂. Corol. 1 states that once identification is established, the counterfactual query can be inferred through the NCM M̂, as if it were the true SCM M*, by directly applying layer 3's definition to M̂ (Def. 1). Remarkably, this result holds even if M* does not match M̂ in either the mechanisms F or the exogenous distribution P(U); it only requires some specific properties: G^(L3)-consistency, matching Z, and identifiability. Without these properties, inferences performed on M̂ would bear no meaningful information about the ground truth. To understand this subtlety, refer to the examples in App. D. Building on these results, we demonstrate through the procedure NeuralID (Alg. 1) how to decide the identifiability of counterfactual quantities. The specific optimization procedure searches explicitly in the space of NCMs for two models that respectively minimize and maximize the target query while maintaining consistency with the provided data distributions in Z. If the two models match on the target query Q, then the effect is identifiable, and the value is returned; otherwise, the effect is non-identifiable.

Algorithm 1: NeuralID
Input: causal diagram G, query P(y* | x*), distributions Z(M*)
Output: P^{M*}(y* | x*) if identifiable, FAIL otherwise
1  M̂ ← NCM(V, G)  // from Def. 2
2  θ*_min ← argmin_θ P^{M̂(θ)}(y* | x*) s.t. Z(M̂(θ)) = Z(M*)
3  θ*_max ← argmax_θ P^{M̂(θ)}(y* | x*) s.t. Z(M̂(θ)) = Z(M*)
4  if P^{M̂(θ*_min)}(y* | x*) ≠ P^{M̂(θ*_max)}(y* | x*) then return FAIL
5  else return P^{M̂(θ*_min)}(y* | x*)

The implementation of how to enforce these consistency constraints in practice is somewhat challenging. We note two nontrivial details that are abstracted away in the description of Alg. 1. First, although training to fit a single, observational dataset is straightforward, it is not as clear how to simultaneously maintain consistency with the multiple datasets in Z. Second, unlike with simpler interventional queries, it is not clear how to search the parameter space in a way that maximizes or minimizes a counterfactual query, which may be more involved due to nesting (e.g., P(Y_{Z_{X=0}})) or evaluating the same variable in multiple worlds (e.g., P(Y_{X=0}, Y_{X=1})). The details of how to solve these issues are discussed in Sec. 4. Interestingly, this approach is qualitatively different from classical, symbolic methods, which avoid operating in the space of SCMs directly. Still, in principle, this alternative approach does not imply any loss in functionality, as evident from the next result. Corollary 2 (Soundness and Completeness). Let Ω* be the set of all SCMs, M* ∈ Ω* be the true SCM inducing causal diagram G, Q = P(y* | x*) be a query of interest, and Q̂ be the result of running Alg. 1 with inputs Z(M*) > 0, G, and Q. Then Q is identifiable from G and Z if and only if Q̂ is not FAIL. Moreover, if Q̂ is not FAIL, then Q̂ = P^{M*}(y* | x*). In words, the procedure NeuralID is both necessary and sufficient in this very general setting, implying that for any instance involving any arbitrary diagram, datasets, or query, the identification status of the query is always classified correctly by this algorithm.
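To illustrate the logic of Alg. 1 without the neural machinery, the sketch below replaces gradient-based NCM training with a random search over a hand-parameterized family of SCMs for the bow graph (X → Y with unobserved confounding between X and Y); the family, the observed probabilities, and the tolerance are all illustrative assumptions. Every member of the family matches the same observational P(X, Y) exactly, yet the query Q = P(Y_{X=1} = 1) spreads over an interval, so the min/max discrepancy test correctly reports non-identifiability.

```python
import random

# Toy stand-in for NeuralID (Alg. 1): instead of training two NCMs by
# gradient descent, randomly search a family of SCMs that all match
# the observed P(X, Y), tracking the min and max of the target query
# Q = P(Y_{X=1} = 1).  Binary bow graph: X -> Y plus X <-> Y.

p_x1, q_y1_x1 = 0.5, 0.7  # observed P(X=1) and P(Y=1 | X=1); illustrative

def query(t):
    # t = P(Y_{X=1}=1 | X=0) is left free by the observational data:
    # every t in [0, 1] yields an SCM consistent with P(X, Y).
    return p_x1 * q_y1_x1 + (1 - p_x1) * t

random.seed(1)
q_min, q_max = 1.0, 0.0
for _ in range(10_000):
    q = query(random.random())
    q_min, q_max = min(q_min, q), max(q_max, q)

# Alg. 1's decision rule: identifiable iff the two extrema agree.
identifiable = abs(q_max - q_min) < 0.01
```

The search recovers approximately the interval [0.35, 0.85], so `identifiable` is False, matching the classical result that P(y_x) is not identifiable from observational data in the bow graph.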

4. NEURAL COUNTERFACTUAL ESTIMATION

We developed a procedure for identifying counterfactual quantities that is both sound and complete under ideal conditions (e.g., unlimited data, perfect optimization), which is encouraging. In this section, we build on these results and establish a more practical approach to solving these tasks under imperfect optimization and finite samples. Consider a G-NCM M̂ constructed as specified by Def. 2. Any counterfactual statement Y* = (Y_{1[x_1]}, Y_{2[x_2]}, ...) can be evaluated from an NCM M̂ for a specific setting of Û = û by computing the corresponding values of Y_i in the mutilated submodel M̂_{x_i} for each i, that is, Y*^{M̂}(û) = (Y_{1[x_1]}^{M̂}(û), Y_{2[x_2]}^{M̂}(û), ...). Sampling can then be done following the natural approach delineated by Alg. 2.

Algorithm 2: NCM Counterfactual Sampling
Input: NCM M̂(θ) = ⟨Û, V, F̂, P(Û)⟩, counterfactual Y*, conditional X* = x*, number of samples m
Output: m samples from P^{M̂}(Y* | x*)
1  Function M̂.sample(Y*, x*, m):
2      S ← ∅
3      while |S| < m do
4          û ← P(Û).sample()
5          if X*^{M̂(θ)}(û) = x* then
6              S.add(Y*^{M̂(θ)}(û))
7      return S

In words, the distribution P(Y* | x*) can be sampled from NCM M̂ by (1) sampling instances of P(Û), (2) computing their corresponding value for X* (via Eq. 3) while rejecting cases that do not match x*, and (3) returning the corresponding value for Y* (via Eq. 3) for the remaining instances. Following this procedure, a counterfactual P(Y* = y* | X* = x*) can be estimated from the NCM through a Monte Carlo approach, instantiating Eq. 4 as follows:

P^{M̂}(y* | x*) ≈ [Σ_{j=1}^m 1{Y*^{M̂}(û_j) = y*, X*^{M̂}(û_j) = x*}] / [Σ_{j=1}^m 1{X*^{M̂}(û_j) = x*}],

where {û_j}_{j=1}^m is a set of m samples from P(Û). Alg. 3 demonstrates how to solve the challenging optimization task in lines 2 and 3 of Alg. 1. The first step is to learn parameters such that the distributions induced by the NCM M̂ match the true distributions in Z. While Alg.
1 describes the inputs in the form of L2-distributions, Z(M*) = {P^{M*}(V_{z_k})}_{k=1}^ℓ, in most settings one has the empirical versions of such distributions in the form of finite datasets, {P̂^{M*}(V_{z_k}) = {v_{z_k,i}}_{i=1}^{n_k}}_{k=1}^ℓ. One way to train M̂ to match M* on the distribution P(V_{z_k}) is to compare the distribution of data points in P̂^{M*}(V_{z_k}) with the distribution of samples from M̂, P̂^{M̂}(V_{z_k}). The two empirical distributions can be compared using a divergence function D_P, which returns a smaller value when the two distributions are "similar". The goal is then to minimize D_P(P̂^{M̂}(V_{z_k}), P̂^{M*}(V_{z_k})) for each k ∈ {1, ..., ℓ}. In this work, a generative adversarial approach (Goodfellow et al., 2014) is taken to train the NCM, and D_P is computed using a discriminator network. In addition to fitting the datasets, the second challenge of Alg. 1 is to simultaneously maximize or minimize the query of interest Q = P(y* | x*). This can be done by first computing samples of P(Y* | x*) from M̂ via Alg. 2, denoted Q̂, and then minimizing (or maximizing) the "distance" between Q̂ and Q. Essentially, samples of Y* are penalized based on how similar they are to the correct values y*. For example, if the query to maximize is P(Y = 1), and a value of 0.6 is sampled for Y from M̂, then the goal could be to minimize the squared error (1 - 0.6)^2. In general, a distance metric D_Q is used to compute the distance between Q̂ and Q, and we use log loss for D_Q since our experiments involve binary variables. The corresponding updates in Alg. 3 (lines 12-15) are L_min ← L_min - λD_Q(Q̂_min, Q) and L_max ← L_max + λD_Q(Q̂_max, Q), followed by the gradient steps θ_min ← θ_min - η∇L_min and θ_max ← θ_max - η∇L_max. For this reason, an NCM trained with this approach will be referred to as a GAN-NCM. More details about the architecture and hyperparameters used throughout this work can be found in Appendix B.
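The rejection-sampling step of Alg. 2 and the Monte Carlo estimate it feeds can be sketched in a self-contained way. The toy SCM below is entirely hypothetical, chosen with a single Unif(0, 1) noise so that the target P(Y_{X=0} = 1 | X = 1) = 0.8 can be checked by hand.

```python
import random

# Sketch of Alg. 2: estimate P(Y_{X=0}=1 | X=1) by rejection-sampling
# exogenous draws.  Toy (hypothetical) SCM with a single shared noise:
#   U ~ Unif(0, 1),  f_X(u) = 1{u > 0.5},  f_Y(x, u) = 1{u + 0.3*x > 0.6}.

def f_x(u):
    return int(u > 0.5)

def f_y(x, u):
    return int(u + 0.3 * x > 0.6)

def sample_ctf(m, seed=0):
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < m:
        u = rng.random()
        if f_x(u) == 1:                  # keep only draws with factual X = 1
            accepted.append(f_y(0, u))   # evaluate Y under do(X = 0)
    return accepted

samples = sample_ctf(20_000)
estimate = sum(samples) / len(samples)   # Monte Carlo estimate of the query
```

Analytically, conditioning on X = 1 restricts U to (0.5, 1), on which f_Y(0, u) = 1 exactly when u > 0.6, giving 0.4/0.5 = 0.8; the estimate converges to that value as m grows.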
Putting D_P and D_Q together, the objective L(M̂, {P̂^{M*}(V_{z_k})}_{k=1}^ℓ) can be written as

Σ_{k=1}^ℓ D_P(P̂^{M̂}(V_{z_k}), P̂^{M*}(V_{z_k})) ± λ D_Q(Q̂, Q),

where λ is initially set to a high value and decreases during training. Optimization may be done using gradient descent. After training, the two values of Q induced by M̂(θ_min) and M̂(θ_max) are compared with a hypothesis-testing procedure to decide identifiability. Eq. 4 is used as Q's estimate whenever identifiable.
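The assembly of this objective can be sketched as follows. Here D_P is a simple L1 distance between empirical distributions, a stand-in for the discriminator-based divergence actually used, D_Q is log loss on the model's probability of the target event, and the ± sign selects between the minimizing and maximizing models; the function names and numeric choices are illustrative assumptions.

```python
import math

# Sketch of the combined GAN-NCM objective: one D_P term per dataset in
# Z, plus a lambda-weighted query term D_Q with a +/- sign.

def d_p(p_model, p_data):
    """L1 distance between two empirical distributions (dicts of
    outcome -> probability); a stand-in for the discriminator."""
    keys = set(p_model) | set(p_data)
    return sum(abs(p_model.get(k, 0.0) - p_data.get(k, 0.0)) for k in keys)

def d_q(q_hat):
    """Log loss of the model's estimate q_hat of P(y* | x*)."""
    return -math.log(max(q_hat, 1e-12))

def combined_loss(model_dists, data_dists, q_hat, lam, maximize):
    # Sigma_k D_P(...) +/- lam * D_Q: the maximizing model adds the
    # query term (driving samples toward y*); the minimizer subtracts it.
    fit = sum(d_p(pm, pd) for pm, pd in zip(model_dists, data_dists))
    sign = 1.0 if maximize else -1.0
    return fit + sign * lam * d_q(q_hat)
```

In a full training loop, `lam` would be annealed downward across epochs so that data consistency dominates by the end of training, matching the schedule described above.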

5. EXPERIMENTAL EVALUATION

We first evaluate the NCM's ability to identify counterfactual distributions through Alg. 3. Each setting consists of a target query (Q), a causal diagram (G), and a set of input distributions (Z); in total, we test 32 variations. Specifically, we evaluate the identifiability of four queries Q: (1) the Average Treatment Effect (ATE), (2) the Effect of Treatment on the Treated (ETT) (Pearl, 2000, Eq. 8.18), (3) the Natural Direct Effect (NDE) (Pearl, 2001, Eq. 6), and (4) the Counterfactual Direct Effect (Ctf-DE) (Zhang & Bareinboim, 2018, Eq. 3); each expression is shown at the top of Fig. 4. The four graphs used are shown on the figure's left side and represent general structures found throughout the mediation and fairness literature (Pearl, 2001; Zhang & Bareinboim, 2018). The variable X encodes the treatment/decision, Y the outcome, Z observed features, and W mediating variables. Lastly, we consider a setting in which only observational data is available (Z = {P(V)}) and another in which additional experimental data on X is available (Z = {P(V), P(V_x)}). In the experiments shown, all variables are 1-dimensional binary variables except Z, whose dimensionality d is adjusted across experiments. The background color of each setting indicates whether the query Q is identifiable (blue) or not identifiable (yellow) from the inputted G and Z. Given the sheer volume of variations, we summarize the experiments below and provide further discussion and details in Appendix B. We implement two approaches to ground the discussion around NCMs, one based on GANs (GAN-NCM) and another based on maximum likelihood (MLE-NCM). The former was discussed in the previous section, and the latter is quite natural in statistical settings. The experiments (Fig. 4) show that the GAN-NCM has, on average, higher accuracy.
The MLE-NCM performs slightly better in ID cases (blue), but its performance drops significantly in non-ID cases (yellow), suggesting it may be biased toward returning ID in all cases. The GAN-NCM is also shown to achieve decent performance at d = 16, where the MLE-NCM fails to work. We plot the run time of the two approaches in Fig. 5, which shows that the MLE-NCM scales poorly compared to the GAN-NCM; this pattern is observed in all settings. Intuitively, this is not surprising: the MLE-NCM explicitly computes a likelihood for every value in every variable's domain, the size of which grows exponentially with the dimensionality (d), while the GAN-NCM avoids this by implicitly fitting distributions through P(Û) and F̂ and directly outputting samples. For the identifiable cases (blue background), the target Q is estimated through Eq. 4 after training. Results are shown in Fig. 6. The MLE-NCM serves as a benchmark for 1-dimensional cases since, intuitively, the data distributions can be learned more accurately when modeled explicitly. Still, even when d = 1, the GAN-NCM achieves competitive results in most settings and consistently achieves an error under 0.05 with more samples. The GAN-NCM maintains this consistency even at d = 16, demonstrating its robustness when scaling to higher dimensions. Overall, the GAN-NCM is shown to be effective at identifying and estimating counterfactual distributions even in high dimensions. As expected, the MLE-NCM may achieve lower error in some 1-d settings, but the GAN-NCM may be preferred for scalability. Moreover, an incorrect ID conclusion in a non-ID case may be dangerous for downstream decision-making, as the resulting estimate will likely be incorrect or misleading. The GAN-NCM is evidently more robust in such non-ID cases while still performing competitively in ID cases. Further experiments and discussions are provided in App. B.

6. CONCLUSIONS

We developed in this work a neural approach to the problems of counterfactual identification and estimation using neural causal models (NCMs). Specifically, we first showed that, with the graphical inductive bias, NCMs are capable of encoding counterfactual (L3) constraints while remaining expressive enough to represent any generating SCM (Thms. 1, 2). We then showed that NCMs can solve any counterfactual identification instance (Thm. 3, Corol. 1). Given these theoretical properties, we introduced a sound and complete algorithm (Alg. 1, Corol. 2) for identifying and estimating counterfactuals in general non-Markovian settings given arbitrary datasets from L1 and L2. We developed a GAN-based approach to implement this algorithm in practice (Alg. 3) and empirically demonstrated its ability to scale inferences. From a neural perspective, counterfactual reasoning under a causal inductive bias allows deep models to be trained with improved interpretability and generalizability. From a causal perspective, neural nets can now provide tools to solve counterfactual inference problems previously understood only in theory.



Footnotes:

1. One of our motivations is that these methods showed great promise at estimating effects from observational data under backdoor/ignorability conditions (Shalit et al., 2017; Louizos et al., 2017; Li & Fu, 2017; Johansson et al., 2016; Yao et al., 2018; Yoon et al., 2018; Kallus, 2020; Shi et al., 2019; Du et al., 2020; Guo et al., 2020).
2. This represents an extreme case where all L1- and L2-distributions are provided as data. In practice, this may be unrealistic, and our method takes as input any arbitrary subset of distributions from L1 and L2.
3. When imposed on neural models, they enforce equality constraints connecting layer 1 and layer 2 quantities, defined formally through the causal Bayesian network (CBN) data structure (Bareinboim et al., 2022, Def. 16).
4. In general, M does not need to be, and will not be, equal to the true SCM M*.
5. Layer 3 differs from lower layers even in Markovian models; see Bareinboim et al. (2022, Ex. 7).
6. Witty et al. (2021) shows a related approach taking the Bayesian route; for further details, see Appendix C.
- We say identification from G and Z instead of Ω*(G) and Z because existing symbolic approaches (e.g., do-calculus) solve the identification problem directly on top of the graph instead of the space of SCMs.
- Other choices of D_P include the KL divergence or Maximum Mean Discrepancy (MMD) (Gretton et al., 2012).
- The code is publicly available at: https://github.com/CausalAILab/NCMCounterfactuals
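One of the footnotes above notes that the discrepancy D_P can be instantiated with the KL divergence or MMD (Gretton et al., 2012). As a small illustration of the latter, here is a minimal (biased, V-statistic) estimator of squared MMD with an RBF kernel between two sample sets; the function name and gamma value are illustrative choices, not the paper's implementation:

```python
import numpy as np

def mmd2_rbf(x, y, gamma=1.0):
    """Biased V-statistic estimate of squared MMD with the RBF kernel
    k(a, b) = exp(-gamma * ||a - b||^2); x, y are (n, dim) arrays."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2_rbf(rng.normal(0, 1, (500, 1)), rng.normal(0, 1, (500, 1)))
diff = mmd2_rbf(rng.normal(0, 1, (500, 1)), rng.normal(3, 1, (500, 1)))
# samples from matched distributions score near zero; mismatched, far from it
```

Any such discrepancy only needs samples from the model, which is why it pairs naturally with the implicit, sample-producing GAN-NCM.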



Figure 1: The l.h.s. contains the true SCM M * that induces PCH's three layers. The r.h.s. contains a neural model M constrained by inductive bias G (entailed by M * ) and matching M * on L 1 and L 2 through training.

Figure 2: Model-theoretic visualization of Thms. 1 and 2.

Algorithm 1 (NeuralID): identifying/estimating counterfactual queries with NCMs.
Input: query Q = P(y*|x*), L2 datasets Z(M*), and causal diagram G.
Output: P^{M*}(y*|x*).
If P^{M(θ*_min)}(y*|x*) = P^{M(θ*_max)}(y*|x*), return P^{M(θ*_max)}(y*|x*). // choose min or max arbitrarily
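The decision rule at the heart of NeuralID can be sketched in a few lines. Here `q_min` and `q_max` stand for the query evaluated on the trained θ*_min and θ*_max models, and `tau` is a Monte Carlo tolerance for the equality check; all three names are illustrative, not from the paper's code:

```python
def neural_id(q_min: float, q_max: float, tau: float = 0.03):
    """If the query-minimizing and query-maximizing models that both fit
    the data agree (up to tolerance tau), the query is identifiable and
    either value can be returned; otherwise identification FAILs."""
    if abs(q_max - q_min) > tau:
        return None            # FAIL: Q is not identifiable from G and Z
    return q_max               # choose min or max arbitrarily
```

The soundness of this rule rests on Thm. 3: disagreement between the two extremes exhibits two data-consistent NCMs inducing different values of Q, i.e., a witness of non-identifiability.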

Figure 4: Experimental results on deciding identifiability of counterfactual queries with NCMs. GAN-NCM (green) is compared with MLE-NCM (orange) for settings with d = 1. GAN-NCM performance is also shown for d = 16 (dashed green). Blue (resp. yellow) backgrounds on plots correspond to a ground truth of ID (resp. non-ID). ID cases are numbered for reference in later plots.

Algorithm 3: Training the model.
Input: data {P^{M*}(V_{z_k}) = {v_{z_k,i}}_{i=1}^{n_k}}_{k=1}, query Q = P(y*|x*), causal diagram G, number of Monte Carlo samples m, regularization constant λ, learning rate η, training epochs T.
1: M ← NCM(V, G) // from Def. 2
2: Initialize parameters θ_min and θ_max
3: for t ← 1 to T do
4: …
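To convey the role of the ±λ·Q regularization in the min/max training of θ_min and θ_max, here is a deliberately stripped-down sketch on a one-parameter toy model (an assumption for illustration: θ directly parameterizes the query, the data-fit term is a squared error, and optimization is by grid search rather than gradient descent with learning rate η):

```python
import numpy as np

obs_p = 0.7          # stand-in for the observed data distribution
lam = 0.01           # regularization constant (lambda in Alg. 3)

def loss(theta, sign):
    """Data-fit term plus signed query penalty: sign=+1 trains the
    query-minimizing model, sign=-1 the query-maximizing one."""
    data_fit = (theta - obs_p) ** 2        # stand-in for the NLL/GAN loss
    return data_fit + sign * lam * theta   # +/- lambda * Q(theta)

grid = np.linspace(0, 1, 1001)
theta_min = grid[np.argmin([loss(t, +1) for t in grid])]
theta_max = grid[np.argmin([loss(t, -1) for t in grid])]
```

The two solutions straddle the data-compatible value (0.695 vs. 0.705 here); in the actual algorithm, a gap between the induced query values that persists as λ shrinks is exactly the non-ID signal NeuralID checks for.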

Figure 5: Results comparing run times of 100 epochs of training between GAN-NCM (green) and MLE-NCM (orange) in the first graph of Fig. 4 as the dimensionality d of Z scales higher.

Figure 6: Results on estimating identifiable cases from Fig. 4 (corresponding numbers shown on the right). Mean Absolute Error (MAE) is plotted (with 95% confidence) for each setting for varying sample sizes. Results are shown for GAN-NCM (solid green) and MLE-NCM (orange) with d = 1 and also GAN-NCM with d = 16 (dashed green).
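For completeness, a point in a Fig. 6-style plot can be computed as the mean absolute error over repeated runs with a normal-approximation 95% confidence interval; the helper below is a generic sketch (the numbers are made-up placeholders, not experimental results):

```python
import numpy as np

def mae_with_ci(estimates, truth):
    """MAE over repeated runs, with a 1.96-sigma (95%) normal-approx CI
    on the mean of the absolute errors."""
    errs = np.abs(np.asarray(estimates, dtype=float) - truth)
    mae = errs.mean()
    half = 1.96 * errs.std(ddof=1) / np.sqrt(len(errs))
    return mae, (mae - half, mae + half)

mae, (lo, hi) = mae_with_ci([0.42, 0.38, 0.41, 0.37, 0.43], truth=0.40)
```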

Consider the true SCM M* ∈ Ω*, causal diagram G, a set of available distributions Z, and a target query Q equal to

ACKNOWLEDGEMENTS

This research was supported in part by the NSF, ONR, AFOSR, DoE, Amazon, JP Morgan, and The Alfred P. Sloan Foundation.

