NEURAL CAUSAL MODELS FOR COUNTERFACTUAL IDENTIFICATION AND ESTIMATION

Abstract

Evaluating hypothetical statements about how the world would be had a different course of action been taken is arguably one key capability expected from modern AI systems. Counterfactual reasoning underpins discussions in fairness, the determination of blame and responsibility, credit assignment, and regret. In this paper, we study the evaluation of counterfactual statements through neural models. Specifically, we tackle two causal problems required to make such evaluations, i.e., counterfactual identification and estimation from an arbitrary combination of observational and experimental data. First, we show that neural causal models (NCMs) are expressive enough and encode the structural constraints necessary for performing counterfactual reasoning. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions. We show that this algorithm is sound and complete for deciding counterfactual identification in general settings. Third, considering the practical implications of these results, we introduce a new strategy for modeling NCMs using generative adversarial networks. Simulations corroborate the proposed methodology.

1. INTRODUCTION

Counterfactual reasoning is one of humans' high-level cognitive capabilities, used across a wide range of affairs, including determining how objects interact, assigning responsibility, credit, and blame, and articulating explanations. Counterfactual statements underpin prototypical questions of the form "what if-" and "why-", which inquire about hypothetical worlds that have not necessarily been realized (Pearl & Mackenzie, 2018). If a patient, Alice, had taken a drug and died, one may wonder, "why did Alice die?"; "was it the drug that killed her?"; "would she be alive had she not taken the drug?". In the context of fairness, why did an applicant, Joe, not get the job offer? Would the outcome have changed had Joe been a Ph.D.? Or perhaps of a different race? These are examples of fundamental questions about attribution and explanation, which evoke hypothetical scenarios that disagree with the current reality and which not even experimental studies can reconstruct. We build on the semantics of counterfactuals based on a generative process called a structural causal model (SCM) (Pearl, 2000). A fully instantiated SCM M* describes a collection of causal mechanisms and a distribution over exogenous conditions. Each M* induces families of qualitatively different distributions related to the activities of seeing (called observational), doing (interventional), and imagining (counterfactual), which together are known as the ladder of causation (Pearl & Mackenzie, 2018; Bareinboim et al., 2022), also called the Pearl Causal Hierarchy (PCH). The PCH is a containment hierarchy in which distributions can be put in increasingly refined layers: observational content goes into layer 1 (L1); experimental into layer 2 (L2); counterfactual into layer 3 (L3). It is understood that there are questions about layers 2 and 3 that cannot be answered (i.e.
are underdetermined), even given all information in the world about layer 1; further, layer 3 questions are still underdetermined given data from layers 1 and 2 (Bareinboim et al., 2022; Ibeling & Icard, 2020). Counterfactuals represent the finest-grained type of knowledge encoded in the PCH, so naturally, having the ability to evaluate counterfactual distributions is an attractive proposition. In practice, a fully specified model M* is almost never observable, which leads to the question: how can a counterfactual statement, from L3*, be evaluated using a combination of observational and experimental data (from L1* and L2*)? This question embodies the challenge of cross-layer inferences, which entail solving two challenging causal problems in tandem, identification and estimation. In the more traditional literature of causal inference, there are different symbolic methods for solving these problems in various settings and under different assumptions. In the context of identification, there exists an arsenal of results that includes celebrated methods such as Pearl's do-calculus (Pearl, 1995). In the context of estimation, results exist for the special case of backdoor/ignorability (Rubin, 1978; Horvitz & Thompson, 1952; Kennedy, 2019; Kallus & Uehara, 2020) and some more relaxed settings (Fulcher et al., 2019; Jung et al., 2020; 2021), but the literature is somewhat scarcer and less developed. In fact, there is a lack of estimation methods for L3 quantities in most settings. On another thread in the literature, deep learning methods have achieved outstanding empirical success in solving a wide range of tasks in fields such as computer vision (Krizhevsky et al., 2012), speech recognition (Graves & Jaitly, 2014), and game playing (Mnih et al., 2013). One key feature of deep learning is its ability to allow inferences to scale with the data to high-dimensional settings.
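To make the three layers concrete, the following is a minimal sketch (our own toy example, not a model from the paper) of a fully specified SCM with binary variables: layer 1 samples are produced by running the mechanisms, layer 2 replaces a mechanism with a constant, and layer 3 follows the standard abduction-action-prediction procedure on the exogenous noise.

```python
import random

# Toy SCM M*: exogenous noise U = (Ux, Uy), mechanisms F = (fx, fy).
# Hypothetical example for illustration only.
def sample_u(rng):
    return {"Ux": rng.random() < 0.5, "Uy": rng.random() < 0.1}

def fx(u):            # X := Ux
    return int(u["Ux"])

def fy(x, u):         # Y := X XOR Uy
    return int(x != u["Uy"])

# Layer 1 (observational): sample (X, Y) by running the mechanisms.
def observe(rng):
    u = sample_u(rng)
    x = fx(u)
    return x, fy(x, u)

# Layer 2 (interventional): do(X=1) replaces fx with the constant 1.
def intervene_x1(rng):
    u = sample_u(rng)
    return fy(1, u)

# Layer 3 (counterfactual): "what would Y have been had X been 1,
# given that we observed X = x_obs, Y = y_obs?"
def counterfactual_y_x1(x_obs, y_obs):
    uy = int(x_obs != y_obs)   # abduction: recover Uy from Y = X XOR Uy
    return int(1 != uy)        # action + prediction under do(X=1)

print(counterfactual_y_x1(0, 0))  # -> 1: had X been 1, Y would have flipped
```

Note that the counterfactual computation requires access to the mechanisms themselves, not just the distributions they induce; this is exactly the information that is unavailable when only data from lower layers is observed.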
We study here the suitability of the neural approach to tackle the problems of causal identification and estimation while trying to leverage the benefits of these new advances experienced in non-causal settings. There exists a growing literature that leverages modern neural methods to solve causal inference tasks.[1] The idea behind the approach pursued here is illustrated in Fig. 1. Specifically, we will search for a neural model M (r.h.s.) that has the same generative capability as the true, unobserved SCM M* (l.h.s.); in other words, M should be able to generate the same observed/inputted data, i.e., L1 = L1* and L2 = L2*.[2] To tackle this task in practice, we use an inductive bias for the neural model in the form of a causal diagram (Pearl, 2000; Spirtes et al., 2000; Bareinboim & Pearl, 2016), which is a parsimonious description of the mechanisms (F*) and exogenous conditions (P(U*)) of the generating SCM.[3] The question then becomes: under what conditions can a model trained using this combination of qualitative inductive bias and the available data be suitable to answer questions about hypothetical counterfactual worlds, as if we had access to the true M*? Our approach, based on proxy causal models, will answer causal queries by direct evaluation of a parameterized neural model M fitted on the data generated by M*.[4] For instance, some recent work solves the estimation of interventional (L2) or counterfactual (L3) distributions from observational (L1) data in Markovian settings, implemented through architectures such as GANs, flows, GNNs, and VGAEs (Kocaoglu et al., 2018; Pawlowski et al., 2020; Zecevic et al., 2021; Sanchez-Martin et al., 2021). In some real-world settings, Markovianity is too stringent a condition (see discussion in App. D.4) and may be violated, which leads to the separation between layers 1 and 2 and, in turn, issues of causal identification.[5] The proxy approach discussed above was pursued in Xia et al. (2021) to solve the identification and estimation of interventional distributions (L2) from observational data (L1) in non-Markovian settings.[6] This work introduced an object we leverage throughout this paper called the Neural Causal Model (NCM, for short), which is a class of SCMs constrained to neural network functions and fixed distributions over the exogenous variables. While



[1] One of our motivations is that these methods showed great promise at estimating effects from observational data under backdoor/ignorability conditions (Shalit et al., 2017; Louizos et al., 2017; Li & Fu, 2017; Johansson et al., 2016; Yao et al., 2018; Yoon et al., 2018; Kallus, 2020; Shi et al., 2019; Du et al., 2020; Guo et al., 2020).
[2] This represents an extreme case where all L1- and L2-distributions are provided as data. In practice, this may be unrealistic, and our method takes as input any arbitrary subset of distributions from L1 and L2.
[3] When imposed on neural models, they enforce equality constraints connecting layer-1 and layer-2 quantities, defined formally through the causal Bayesian network (CBN) data structure (Bareinboim et al., 2022, Def. 16).
[4] In general, M does not need to, and will not, be equal to the true SCM M*.
[5] Layer 3 differs from lower layers even in Markovian models; see Bareinboim et al. (2022, Ex. 7).
[6] Witty et al. (2021) shows a related approach taking the Bayesian route; for further details, see Appendix C.



Figure 1: The l.h.s. contains the true SCM M* that induces the PCH's three layers. The r.h.s. contains a neural model M constrained by inductive bias G (entailed by M*) and matching M* on L1 and L2 through training.
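The NCM construction described above (neural network mechanisms with fixed exogenous distributions, wired according to the causal diagram G) can be sketched as follows. This is a minimal, untrained illustration under our own assumptions; the architecture, names, and noise choices are not prescribed by the paper.

```python
import numpy as np

# Minimal NCM sketch for a two-variable diagram X -> Y with a bidirected
# edge X <-> Y: both mechanisms share the exogenous input U_xy, which is
# how the NCM encodes latent confounding. Weights are untrained.
rng = np.random.default_rng(0)

def mlp(params, inp):
    """Tiny feedforward net with a sigmoid output in (0, 1)."""
    W1, b1, W2, b2 = params
    h = np.tanh(inp @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

def init(in_dim, hidden=8):
    return (rng.normal(size=(in_dim, hidden)), np.zeros(hidden),
            rng.normal(size=(hidden, 1)), np.zeros(1))

f_x = init(1)   # X := f_x(U_xy); shared noise realizes the bidirected edge
f_y = init(2)   # Y := f_y(X, U_xy)

def sample(n, do_x=None):
    """Draw n samples; do_x=None gives L1 samples, do_x=v gives L2 samples."""
    u = rng.uniform(size=(n, 1))   # fixed exogenous distribution: Unif(0, 1)
    if do_x is None:
        x = (mlp(f_x, u) > 0.5).astype(float)
    else:
        x = np.full((n, 1), float(do_x))   # intervention replaces f_x
    y = (mlp(f_y, np.concatenate([x, u], axis=1)) > 0.5).astype(float)
    return x, y

x_obs, y_obs = sample(1000)            # layer-1 (observational) samples
_, y_do1 = sample(1000, do_x=1.0)      # layer-2 samples under do(X=1)
```

Training would adjust the weights of f_x and f_y so that the model's L1- and L2-distributions match the data generated by M*, which is the matching condition depicted in the figure; the GAN-based strategy for doing so is developed later in the paper.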

