CAUSAL REASONING IN THE PRESENCE OF LATENT CONFOUNDERS VIA NEURAL ADMG LEARNING

Abstract

Latent confounding has been a long-standing obstacle for causal reasoning from observational data. One popular approach is to model the data using acyclic directed mixed graphs (ADMGs), which describe ancestral relations between variables using directed and bidirected edges. However, existing methods using ADMGs are based on either linear functional assumptions or a discrete search that is complicated to use and lacks computational tractability for large datasets. In this work, we further extend the existing body of work and develop a novel gradient-based approach to learning an ADMG with non-linear functional relations from observational data. We first show that the presence of latent confounding is identifiable under the assumptions of bow-free ADMGs with non-linear additive noise models. With this insight, we propose a novel neural causal model based on autoregressive flows for ADMG learning. This not only enables us to determine complex causal structural relationships behind the data in the presence of latent confounding, but also to estimate their functional relationships (and hence treatment effects) simultaneously. We further validate our approach via experiments on both synthetic and real-world datasets, and demonstrate competitive performance against relevant baselines.

1. INTRODUCTION

Learning causal relationships and estimating treatment effects from observational studies is a fundamental problem in causal machine learning, with important applications in many areas of the social and natural sciences (Pearl, 2010; Spirtes, 2010). They enable us to answer questions of a causal nature; for example, what is the effect on a patient's expected lifespan if I increase the dose of drug X? However, many existing methods for causal discovery and inference rely on the assumption that all necessary information is available. This assumption is often untenable in practice. Indeed, an important, yet often overlooked, form of causal relationship is latent confounding; that is, when two variables have an unobserved common cause (Verma & Pearl, 1990). If not properly accounted for, the presence of latent confounding can lead to incorrect evaluation of causal quantities of interest (Pearl, 2009). Traditional causal discovery methods that account for latent confounding, such as the fast causal inference (FCI) algorithm (Spirtes et al., 2000) and its extensions (Colombo et al., 2012; Claassen et al., 2013; Chen et al., 2021), rely on uncovering an equivalence class of acyclic directed mixed graphs (ADMGs) that share the same conditional independencies. Without additional assumptions, however, these methods may return uninformative results, as they cannot distinguish between members of the same Markov equivalence class (Bellot & van der Schaar, 2021). More recently, causal discovery methods based on structural causal models (SCMs) (Pearl, 1998) have been developed for the latent confounding setting (Nowzohour et al., 2017; Wang & Drton, 2020; Maeda & Shimizu, 2020; 2021; Bhattacharya et al., 2021). By assuming that causal effects follow specific functional forms, these methods have the advantage of being able to distinguish between members of the same Markov equivalence class (Glymour et al., 2019).
Yet, existing approaches either rely on restrictive linear functional assumptions (Bhattacharya et al., 2021; Maeda & Shimizu, 2020; Bellot & van der Schaar, 2021), and/or on a discrete search over the space of causal graphs (Maeda & Shimizu, 2021) that is computationally burdensome and unintuitive to use. As a result, modeling non-linear causal relationships between variables in the presence of latent confounders in a scalable way remains an outstanding task. In this work, we utilize recent advances in differentiable causal discovery (Zheng et al., 2018; Bhattacharya et al., 2021) and neural causal models (Lachapelle et al., 2019; Morales-Alvarez et al., 2022; Geffner et al., 2022) to overcome these limitations. Our core contribution is to extend the framework of differentiable ADMG discovery for linear models (Bhattacharya et al., 2021) to non-linear cases using neural causal models. This enables us to build scalable and flexible methods capable of discovering non-linear, potentially confounded relationships between variables and performing subsequent causal inference. Specifically, our contributions are:

1. Sufficient conditions for ADMG identifiability with non-linear SCMs (Section 4). We assume: i) the functional relationships follow a non-linear additive noise SCM; ii) the effects of observed and latent variables do not modulate each other; and iii) every latent variable confounds a pair of non-adjacent observed nodes. Under these assumptions, the underlying ground-truth ADMG causal graph is identifiable. This serves as a foundation for designing ADMG identification algorithms for flexible, non-linear SCMs based on deep generative models.

2. A novel gradient-based framework for learning ADMGs from observational data (Section 5). Based on our theoretical results, we propose Neural ADMG Learning (N-ADMG), a neural autoregressive-flow-based model capable of learning complex non-linear causal relationships with latent confounding. N-ADMG utilizes variational inference to approximate posteriors over causal graphs and latent variables, whilst simultaneously learning the model parameters via gradient-based optimization. This is more efficient and accurate than discrete search methods, allowing us to replace task-specific search procedures with general-purpose optimizers.

3. Empirical evaluation on synthetic and real-world datasets (Section 6). We evaluate N-ADMG on a variety of synthetic and real-world datasets, comparing its performance with a number of existing causal discovery and inference algorithms. We find that N-ADMG provides competitive or state-of-the-art results on a range of causal reasoning tasks.

2. RELATED WORK

Causal discovery with latent confounding. Constraint-based causal discovery methods in the presence of latent confounding have been well studied (Spirtes et al., 2000; Zhang, 2008; Colombo et al., 2012; Claassen et al., 2013; Chen et al., 2021). Without further assumptions, these approaches can only identify a Markov equivalence class of causal structures (Spirtes et al., 2000). When certain assumptions are made on the data-generating process in the form of SCMs (Pearl, 1998), additional constraints can help identify the true causal structure. In the most general case, additional nonparametric constraints have been identified (Verma & Pearl, 1990; Shpitser et al., 2014; Evans, 2016). Further refinement can be made through the assumption of stricter SCMs. For example, in the linear Gaussian additive noise model (ANM) case, Nowzohour et al. (2017) propose a score-based approach for finding an equivalence class of bow-free acyclic path diagrams. Both Maeda & Shimizu (2020) and Wang & Drton (2020) develop independence-test-based approaches for the linear non-Gaussian ANM case, with Maeda & Shimizu (2021) extending this to more general cases.

Differentiable characterization of causal discovery. All aforementioned approaches employ a search over a discrete space of causal structures, which often requires task-specific search procedures and imposes a computational burden for large-scale problems. More recently, Zheng et al. (2018) proposed a differentiable constraint on directed acyclic graphs (DAGs), framing graph structure learning as a differentiable constrained optimization task in the absence of latent confounders. This was generalized to the latent confounding case by Bhattacharya et al. (2021) through differentiable algebraic constraints that characterize the space of ADMGs. Nonetheless, this work is limited in that it only considers linear Gaussian ANMs.

3.1. STRUCTURAL CAUSAL MODELS IN THE ABSENCE OF LATENT CONFOUNDING

Unlike constraint-based approaches such as the PC algorithm, SCMs capture the asymmetry between causal directions through functional assumptions on the data-generating process, and have been central to many recent developments in causal discovery (Glymour et al., 2019). Given a directed acyclic graph (DAG) $G$ on nodes $\{1, \dots, D\}$, an SCM describes the random variable $x = (x_1, \dots, x_D)$ by $x_i = f_i(x_{\mathrm{pa}(i;G)}, \epsilon_i)$, where $\epsilon_i$ is an exogenous noise variable that is independent of all other variables in the model, $\mathrm{pa}(i;G)$ denotes the set of parents of node $i$ in $G$, and $f_i$ describes how $x_i$ depends on its parents and the noise $\epsilon_i$. We focus on additive noise SCMs, commonly referred to as additive noise models (ANMs), which take the form

$x_i = f_i(x_{\mathrm{pa}(i;G)}, \epsilon_i) = f_i(x_{\mathrm{pa}(i;G)}) + \epsilon_i$, or $x = f_G(x) + \epsilon$ in vector form. (1)

This induces a joint observational distribution $p_\theta(x^n|G)$, where $\theta$ denotes the parameters of the functions $\{f_i\}$. Under the additive noise model in Equation 1, the DAG $G$ is identifiable assuming causal minimality and no latent confounders (Peters et al., 2014).
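To make the ANM concrete, the following sketch (our own helper, not code from the paper) draws samples from Equation 1 by visiting nodes in topological order; the choice of functions $f_i$ and Gaussian noise is purely illustrative.

```python
import numpy as np

def sample_anm(G_D, fns, n_samples, noise_std=1.0, seed=0):
    """Draw samples from x_i = f_i(x_pa(i;G)) + eps_i (Equation 1).

    G_D : (D, D) binary adjacency with G_D[i, j] = 1 meaning x_i -> x_j.
    fns : list of D functions; fns[j] maps an (n, |pa(j)|) array of parent
          values to an (n,) array.
    """
    rng = np.random.default_rng(seed)
    D = G_D.shape[0]
    # Simple topological sort: repeatedly emit nodes whose parents are all emitted.
    order, placed = [], set()
    while len(order) < D:
        for j in range(D):
            if j not in placed and all(i in placed for i in np.flatnonzero(G_D[:, j])):
                order.append(j)
                placed.add(j)
    x = np.zeros((n_samples, D))
    for j in order:
        parents = np.flatnonzero(G_D[:, j])
        x[:, j] = fns[j](x[:, parents]) + noise_std * rng.standard_normal(n_samples)
    return x
```

For instance, a two-node chain $x_1 \to x_2$ with $f_2 = \tanh$ can be simulated by passing the corresponding adjacency matrix and function list.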

3.2. GRAPHICAL REPRESENTATION OF LATENT CONFOUNDERS USING ADMGS

One of the most widely used graphical representations of causal relationships involving latent confounding is the acyclic directed mixed graph (ADMG). ADMGs are an extension of DAGs that contain both directed edges (→) and bidirected edges (↔) between variables. More concretely, the directed edge $x_i \to x_j$ indicates that $x_i$ is an ancestor of $x_j$, and the bidirected edge $x_i \leftrightarrow x_j$ indicates that $x_i$ and $x_j$ share a common, unobserved ancestor (Richardson & Spirtes, 2002; Tian & Pearl, 2002). An ADMG $G$ over a collection of $D$ variables $x = (x_1, \dots, x_D)$ can be described using two binary adjacency matrices: $G_D \in \{0,1\}^{D \times D}$, for which an entry of 1 in position $(i, j)$ indicates the presence of the directed edge $x_i \to x_j$, and $G_B \in \{0,1\}^{D \times D}$, for which an entry of 1 in position $(i, j)$ indicates the presence of the bidirected edge $x_i \leftrightarrow x_j$. Throughout this paper, we will use the graph notation $G$ to denote the tuple $G = \{G_D, G_B\}$. When using ADMGs to represent latent confounding, causal discovery amounts to learning the matrices $G_D$ and $G_B$. As with a DAG, an SCM can be specified to describe the causal relationships implied by an ADMG through the so-called magnification process. As formulated in Peña (2016), whenever a bidirected edge $x_i \leftrightarrow x_j$ is present according to $G_B$, we explicitly add a latent node $u_m$ to represent the latent parent (confounder) of $x_i$ and $x_j$. The SCM of $x_i$ can then be written as $x_i = f_i(x_{\mathrm{pa}(i;G_D)}, u_{\mathrm{pa}(i;G_B)}) + \epsilon_i$, where $u_{\mathrm{pa}(i;G_B)}$ denotes the latent parents of $x_i$ among the latent nodes $u = (u_1, \dots, u_M)$ added in the magnification process. In compact form, we can write $[x, u] = f_{\tilde G}(x, u) + \epsilon$. This 'magnified SCM' will serve as a practical device for learning ADMGs in this paper. As with DAGs, magnified SCMs induce an observational distribution on $x$, denoted by $p_\theta(x^n; G)$.
Note that, given an ADMG, the magnified SCM is not unique, as latent variables may be shared. For example, $x_1 \leftrightarrow x_2$, $x_2 \leftrightarrow x_3$, $x_3 \leftrightarrow x_1$ can be magnified both as $\{x_1 \leftarrow u_1 \to x_2,\; x_2 \leftarrow u_2 \to x_3,\; x_3 \leftarrow u_3 \to x_1\}$ and as $\{u_1 \to x_1,\; u_1 \to x_2,\; u_1 \to x_3\}$. Therefore, ADMG identifiability does not imply the structural identifiability of the magnified SCM. In this paper, we focus on ADMG identifiability.
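The magnification process is mechanical and easy to sketch. The construction below (our own helper) uses the canonical one-fresh-latent-per-bidirected-edge choice and returns the adjacency of the magnified DAG over $v = (x, u)$, with latents appended after the observed nodes:

```python
import numpy as np

def magnify(G_D, G_B):
    """Magnify an ADMG (G_D, G_B): add one explicit latent node u_m per
    bidirected edge x_i <-> x_j, as in the magnification process of
    Section 3.2 (one of several valid magnifications)."""
    D = G_D.shape[0]
    # Each bidirected edge contributes one parentless latent confounder.
    pairs = [(i, j) for i in range(D) for j in range(i + 1, D)
             if G_B[i, j] or G_B[j, i]]
    M = len(pairs)
    A = np.zeros((D + M, D + M), dtype=int)
    A[:D, :D] = G_D
    for m, (i, j) in enumerate(pairs):
        A[D + m, i] = 1  # u_m -> x_i
        A[D + m, j] = 1  # u_m -> x_j
    return A
```

By construction the added latents are parentless, consistent with the exogeneity of confounders assumed later in Assumption 2.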

3.3. DEALING WITH GRAPH UNCERTAINTY

Additive noise SCMs only guarantee DAG identifiability in the limit of infinite data. In the finite-data regime, there is inherent uncertainty in the causal relationships. The Bayesian approach to causal discovery accounts for this uncertainty in a principled manner using probabilistic modeling (Heckerman et al., 2006), in which one's belief in the true causal graph is updated using Bayes' rule. This approach is grounded in causal decision theory, which states that rational decision-makers maximize the expected utility over a probability distribution representing their belief in causal graphs (Soto et al., 2019). We can model the causal graph $G$ jointly with observations $x^1, \dots, x^N$ as $p_\theta(x^1, \dots, x^N, G) = p(G) \prod_{n=1}^N p_\theta(x^n | G)$, where $\theta$ denotes the parameters of the likelihood relating the observations to the underlying causal graph. In general, the posterior distribution $p_\theta(G | x^1, \dots, x^N)$ is intractable. One approach to circumvent this is variational inference (Jordan et al., 1999; Zhang et al., 2018), in which we seek an approximate posterior $q_\phi(G)$ that minimises the KL divergence $\mathrm{KL}[q_\phi(G) \,\|\, p_\theta(G | x^1, \dots, x^N)]$. This can be achieved through maximization of the evidence lower bound (ELBO), given by

$\mathcal{L}_{\mathrm{ELBO}} = \sum_{n=1}^N \mathbb{E}_{q_\phi(G)}[\log p_\theta(x^n | G)] - \mathrm{KL}[q_\phi(G) \,\|\, p(G)]$.

An additional convenience of this approach is that the ELBO is a lower bound on the marginal log-likelihood $\log p_\theta(x^1, \dots, x^N)$, so it can also be maximized with respect to the model parameters $\theta$ to find parameters that approximately maximize the likelihood of the data (Geffner et al., 2022).
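When the graph space is small enough to enumerate, the ELBO above reduces to a finite weighted sum that can be computed exactly; the helper below (our own illustration, not the paper's code) makes the expected log-likelihood and KL terms explicit.

```python
import numpy as np

def elbo(log_lik, q, log_prior):
    """ELBO for a finite set of K candidate graphs.

    log_lik  : (K,) array, sum_n log p(x^n | G_k) for each graph G_k.
    q        : (K,) variational probabilities q(G_k).
    log_prior: (K,) log prior probabilities log p(G_k).
    """
    q = np.asarray(q, dtype=float)
    # KL[q(G) || p(G)] over the finite graph set.
    kl = np.sum(q * (np.log(np.clip(q, 1e-12, None)) - log_prior))
    # E_q[log p(x | G)] - KL term.
    return float(np.sum(q * log_lik) - kl)
```

As a sanity check, when $q(G)$ equals the prior and the log-likelihood is constant across graphs, the ELBO equals that constant.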

4. ESTABLISHING ADMG IDENTIFIABILITY UNDER NON-LINEAR SCMS

To build flexible methods that are capable of discovering causal relationships in the presence of latent confounding, we first need to establish the identifiability of ADMGs under non-linear SCMs. The concept of structural identifiability of ADMGs is formalized in the following definition:

Definition 1 (ADMG structural identifiability). For a distribution $p_\theta(x; G)$, the ADMG $G = \{G_D, G_B\}$ is said to be structurally identifiable from $p_\theta(x; G)$ if there exists no other distribution $p_{\theta'}(x; G')$ such that $G \neq G'$ and $p_\theta(x; G) = p_{\theta'}(x; G')$.

Assuming that our model is correctly specified and $p_{\theta_0}(x; G_0)$ denotes the true data-generating distribution, ADMG structural identifiability guarantees that if we find some $p_\theta(x; G) = p_{\theta_0}(x; G_0)$ (by, e.g., maximum likelihood learning), then we can recover $G = G_0$. In this section, we seek to establish sufficient conditions under which ADMG identifiability is satisfied. Let $x = (x_1, \dots, x_D)$ be a collection of observed random variables, and $u = (u_1, \dots, u_M)$ be a collection of unobserved (latent) random variables. Our first assumption is that the data-generating process can be expressed as a specific non-linear additive noise SCM, in which the effects of the observed and latent variables do not modulate each other:

Assumption 1. We assume that the data-generating process takes the form

$[x, u]^\top = f_{G_D, x}(x; \theta) + f_{G_B, u}(u; \theta) + \epsilon$ (5)

where each element of $\epsilon$ is independent of all other variables in the model, and $\theta$ denotes the parameters of the non-linear functions $f_{G_D, x}$ and $f_{G_B, u}$.

This assumption is one of the elements that separates us from the previous work of Bhattacharya et al. (2021) and Maeda & Shimizu (2020) (where linearity is assumed) and Maeda & Shimizu (2021) (where the effects are assumed to be fully decoupled, $x_i = \sum_m f_{im}(x_m \in x_{\mathrm{pa}(i;G_D)}) + \sum_k g_{ik}(u_k \in u_{\mathrm{pa}(i;G_B)}) + \epsilon_i$).
As discussed in Section 3.2, the discovery of an ADMG between observed variables amounts to the discovery of their ancestral relationships. The mapping between ADMGs and their magnified SCMs is therefore not one-to-one, which might cause issues when designing causal discovery methods using SCM-based approaches. To further simplify the underlying latent structure (without losing too much generality), our second assumption is that every latent variable is a parentless common cause of a pair of non-adjacent observed variables:

Assumption 2 (Latent variables confound pairs of non-adjacent observed variables). For each latent variable $u_k$ in the data-generating process, there exists a non-adjacent pair $x_i$ and $x_j$ that is unique to $u_k$, such that $x_i \leftarrow u_k \to x_j$.

Arguments for Assumption 2 are made by Pearl & Verma (1992), who show that this family of causal graphs is very flexible, and can produce the same conditional independencies amongst the observed variables as any given causal graph. Therefore, it has been argued that, without loss of generality, we can assume latent variables to be exogenous, with exactly two non-adjacent children.¹ This assumption also implies that we can specify a magnified SCM for the ADMG as shown in Section 3.2, which allows us to evaluate and optimize the induced likelihood later on.

Given Assumptions 1 and 2, we now provide three lemmas which allow us to identify the ADMG matrices $G_D$ and $G_B$. These lemmas extend the previous results of Maeda & Shimizu (2021) under our new assumptions. Note that all functions $g_i$ and $g_j$ in the lemmas denote functions satisfying the residual faithfulness condition (Maeda & Shimizu, 2021), as detailed in Appendix A.

Lemma 1 (Case 1). Given Assumptions 1 and 2, $[G_B]_{i,j} = 1$ and $[G_D]_{i,j} = 0$ if and only if

$\forall g_i, g_j:\; (x_i - g_i(x_{-i})) \not\perp (x_j - g_j(x_{-j}))$. (6)

Lemma 2 (Case 2). Given Assumptions 1 and 2, $[G_B]_{i,j} = 0$ and $[G_D]_{i,j} = 0$ if and only if

$\exists g_i, g_j:\; (x_i - g_i(x_{-(i,j)})) \perp (x_j - g_j(x_{-(i,j)}))$. (7)

Lemma 3 (Case 3). Given Assumptions 1 and 2, $[G_B]_{i,j} = 0$ and $[G_D]_{i,j} = 1$ if and only if

$\forall g_i, g_j:\; (x_i - g_i(x_{-(i,j)})) \not\perp (x_j - g_j(x_{-j}))$ (8)

and $\exists g_i, g_j:\; (x_i - g_i(x_{-i})) \perp (x_j - g_j(x_{-(i,j)}))$.

Since each case leads to mutually exclusive conditional independence/dependence constraints, for any $p_\theta(x; G)$ specified under the assumptions above there cannot exist some $p_{\theta'}(x; G')$ such that $G \neq G'$ and $p_\theta(x; G) = p_{\theta'}(x; G')$; thus structural identifiability is satisfied. This gives our ADMG identifiability result:

Proposition 1 (Identifiability of ADMGs under non-linear SCMs). Assume Assumptions 1 and 2 hold. Then, $G_D$ and $G_B$ are identifiable for the data-generating process specified in Equation 5.

Whilst in theory we could test for the conditions stated in Lemmas 1 to 3 directly, a more efficient approach is to use differentiable maximum likelihood learning. Assuming the model is correctly specified, ADMG identifiability ensures that the maximum likelihood estimate recovers the true graph in the limit of infinite data (Section 5.5). Hence, we can design ADMG identification algorithms via maximum likelihood learning.
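The residual (in)dependence conditions in Lemmas 1 to 3 can, in principle, be checked with nonparametric regressions followed by an independence test. The sketch below uses a k-NN regression and a crude permutation test over correlation features as stand-ins for the functions $g_i$ and the test; it is only an illustrative proxy for the exact residual-faithfulness conditions, not the paper's procedure.

```python
import numpy as np

def knn_residual(target, covariates, k=10):
    """Residual of a k-NN regression of `target` on `covariates`; a crude
    stand-in for the nonlinear regressions g_i in Lemmas 1-3."""
    if covariates.size == 0:
        return target - target.mean()
    d = np.sum((covariates[:, None, :] - covariates[None, :, :]) ** 2, axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours (incl. self)
    return target - target[idx].mean(axis=1)

def residuals_dependent(r_i, r_j, n_perm=200, level=0.95, seed=0):
    """Permutation test of dependence using max |corr| over (r, r^2) features."""
    rng = np.random.default_rng(seed)
    def stat(a, b):
        c = np.corrcoef(np.vstack([a, a ** 2, b, b ** 2]))[:2, 2:]
        return np.abs(c).max()
    s0 = stat(r_i, r_j)
    null = [stat(r_i, rng.permutation(r_j)) for _ in range(n_perm)]
    return bool(s0 > np.quantile(null, level))
```

For a latent-confounded pair (the Lemma 1 pattern), the residuals remain dependent no matter what is regressed out, which the proxy test detects on simulated data.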

5. NEURAL ADMG LEARNING

Whilst we have outlined sufficient conditions under which the underlying ADMG causal structure can be identified, this does not directly provide a framework through which causal discovery and inference can be performed. In this section, we formulate a practical framework for gradient-based ADMG identification. Three challenges remain: 1. How can we parameterize the magnified SCM models for ADMGs to enable learning flexible causal relationships? 2. How can we optimize our model in the space of ADMG graphs assumed in Section 4? 3. How do we learn the ADMG causal structure efficiently, whilst accounting for the missing data ($u$) and for graph uncertainty in the finite-data regime? In this section, we present Neural ADMG Learning (N-ADMG), a novel framework that addresses all three challenges.

5.1. NEURAL AUTO-REGRESSIVE FLOW PARAMETERIZATION

We assume that the model used to learn ADMGs from data is correctly specified; that is, it can be written in the same magnified SCM form as in Equation 5:

$[x, u]^\top = f_{G_D, x}(x; \theta) + f_{G_B, u}(u; \theta) + \epsilon$. (10)

Following Khemakhem et al. (2020), we factorise the likelihood $p_\theta(x^n, u^n | G)$ induced by Equation 10 in an autoregressive manner. We can rearrange Equation 10 as

$\epsilon = v - f_{G_D, x}(x; \theta) - f_{G_B, u}(u; \theta) := g_{\tilde G}(v; \theta) = v - f_{\tilde G}(v; \theta)$

where $v = (x, u) \in \mathbb{R}^{D+M}$, and $\tilde G$ is the magnified adjacency matrix on $v$, with $\tilde G_{i,j} = 1$ if and only if $v_i \to v_j$. This allows us to express the likelihood as

$p_\theta(v^n | \tilde G) = p_\epsilon(g_{\tilde G}(v^n; \theta)) = \prod_{i=1}^{D+M} p_{\epsilon_i}(g_{\tilde G}(v^n; \theta)_i)$.

Note that we have omitted the Jacobian-determinant term, as it is equal to one since $\tilde G$ is acyclic (Mooij et al., 2011). Following Geffner et al. (2022), we adopt an efficient, flexible parameterization for the functions $f_i$, taking the form (consistent with Assumption 1)

$f_i(v) = \xi_{1,i}\Big(\textstyle\sum_{v_j \in x} \tilde G_{j,i}\, \ell_j(v_j)\Big) + \xi_{2,i}\Big(\textstyle\sum_{v_j \in u} \tilde G_{j,i}\, \ell_j(v_j)\Big)$

where $\xi_{1,i}$, $\xi_{2,i}$ and $\ell_i$ ($i = 1, \dots, D+M$) are MLPs. A naïve implementation would require training $3(D+M)$ neural networks. Instead, we construct these MLPs so that their weights are shared across nodes as $\xi_{1,i}(\cdot) = \xi_1(e_i, \cdot)$, $\xi_{2,i}(\cdot) = \xi_2(e_i, \cdot)$ and $\ell_i(\cdot) = \ell(e_i, \cdot)$, with $e_i$ a trainable embedding that identifies the output and input nodes respectively.
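A minimal numpy sketch of this weight-sharing scheme follows; the dimensions, initializers, and the concatenation of $e_i$ with the inputs are our own illustrative choices, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hid, d_out):
    return (0.1 * rng.standard_normal((d_in, d_hid)), np.zeros(d_hid),
            0.1 * rng.standard_normal((d_hid, d_out)), np.zeros(d_out))

def mlp(params, z):
    W1, b1, W2, b2 = params
    return np.tanh(z @ W1 + b1) @ W2 + b2

def f_i(i, v, A, emb, l_p, xi1_p, xi2_p, D):
    """Structural function f_i(v) with weights shared across nodes: a single
    l(e_j, v_j) embeds every parent value, and xi_1 / xi_2 aggregate the
    observed- and latent-parent embeddings separately, consistent with
    Assumption 1 (no modulation between x and u effects)."""
    h = mlp(l_p, np.zeros((1, emb.shape[1] + 1))).shape[1]
    agg_x, agg_u = np.zeros(h), np.zeros(h)
    for j in np.flatnonzero(A[:, i]):  # parents of v_i in the magnified graph
        e_j = mlp(l_p, np.concatenate([emb[j], [v[j]]])[None, :])[0]
        if j < D:
            agg_x += e_j   # observed parent
        else:
            agg_u += e_j   # latent parent
    head = lambda p, agg: mlp(p, np.concatenate([emb[i], agg])[None, :])[0, 0]
    return head(xi1_p, agg_x) + head(xi2_p, agg_u)
```

Only three shared networks ($\ell$, $\xi_1$, $\xi_2$) plus the per-node embeddings are trained, rather than $3(D+M)$ separate MLPs.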

5.2. ADMG LEARNING VIA MAXIMIZING EVIDENCE LOWER BOUND

Our ADMG identifiability theory in Section 4 suggests that the ground-truth ADMG graph can, in principle, be recovered via maximum likelihood learning of $p_\theta(x|G)$. However, the aforementioned challenges remain: given a finite number of observations $x^1, \dots, x^N$, how do we deal with the corresponding missing data $u^1, \dots, u^N$ while learning the ADMG? How do we account for graph uncertainty and ambiguity in the finite-data regime? To address these issues, N-ADMG takes a Bayesian approach to ADMG learning. As in Section 3.3, we jointly model the distribution over the ADMG causal graph $G$, the observations $x^1, \dots, x^N$, and the corresponding latent variables $u^1, \dots, u^N$, as

$p_\theta(x^1, u^1, \dots, x^N, u^N, G) = p(G) \prod_{n=1}^N p_\theta(x^n, u^n | G)$

where $p_\theta(x^n, u^n | G)$ is the neural SCM model specified in Section 5.1, $\theta$ denotes the corresponding model parameters, and $p(G)$ is a prior distribution over graphs. Our goal is to learn both the model parameters $\theta$ and an approximation to the posterior, $q_\phi(u^1, \dots, u^N, G) \approx p_\theta(u^1, \dots, u^N, G | x^1, \dots, x^N)$. This can be achieved jointly using the variational inference framework (Zhang et al., 2018; Kingma & Welling, 2013), in which we maximize the evidence lower bound (ELBO) $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) \leq \sum_n \log p_\theta(x^n)$, given by

$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(G)}\Big[\sum_{n=1}^N \mathbb{E}_{q_\phi(u^n | x^n, G)}[\log p_\theta(x^n | u^n, G)]\Big] - \mathrm{KL}[q_\phi(G) \,\|\, p(G)] - \mathbb{E}_{q_\phi(G)}\Big[\sum_{n=1}^N \mathrm{KL}[q_\phi(u^n | x^n, G) \,\|\, p_\theta(u^n | G)]\Big]$.

In the following sections, we describe our choice of the graph prior $p(G)$ and the approximate posterior $q_\phi(u^1, \dots, u^N, G) = \prod_n q_\phi(u^n | x^n, G)\, q_\phi(G)$. We will also demonstrate that maximizing $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$ recovers the true ADMG causal graph in the limit of infinite data (Section 5.5).
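To make the three terms of this ELBO concrete, here is a Monte Carlo estimator for an enumerable graph space, with a Gaussian $q_\phi(u^n|x^n, G)$ and a standard-normal $p_\theta(u|G)$ (simplifications we adopt purely for illustration; `encode` and `log_lik` are hypothetical callables standing in for the inference network and the neural SCM likelihood).

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)], per data point."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def mc_elbo(x, encode, log_lik, q_G, log_prior_G, n_mc=8, seed=0):
    """Monte Carlo estimate of the three-term N-ADMG ELBO over K graphs.

    encode(x, k) -> (mu, logvar) of q(u|x, G_k); log_lik(x, u, k) -> scalar
    sum_n log p(x^n | u^n, G_k); q_G, log_prior_G -> (K,) arrays.
    """
    rng = np.random.default_rng(seed)
    elbo = 0.0
    for k, qk in enumerate(q_G):
        mu, logvar = encode(x, k)
        rec = 0.0
        for _ in range(n_mc):  # reparameterized samples u = mu + sigma * eps
            u = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
            rec += log_lik(x, u, k) / n_mc
        elbo += qk * (rec - np.sum(gaussian_kl(mu, logvar)))
    elbo -= np.sum(q_G * (np.log(np.clip(q_G, 1e-12, None)) - log_prior_G))
    return float(elbo)
```

When the reconstruction term is identically zero and $q(u|x)$ matches its prior, only the graph KL remains, which vanishes for $q(G) = p(G)$.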

5.3. CHOICE OF PRIOR OVER ADMG GRAPHS

As discussed in Section 3.2, the ADMG $G$ can be parameterized by two binary adjacency matrices: $G_D$, whose entries indicate the presence of a directed edge, and $G_B$, whose entries indicate the presence of a bidirected edge. As discussed in Section 4, a necessary assumption for structural identifiability is that each latent variable is a parentless confounder of a pair of non-adjacent observed variables. This further implies that the underlying ADMG must be bow-free (a directed and a bidirected edge cannot both exist between the same pair of observed variables). This constraint can be imposed by leveraging the bow-free constraint penalty introduced by Bhattacharya et al. (2021),

$h(G_D, G_B) = \mathrm{trace}\big(e^{G_D}\big) - D + \mathrm{sum}(G_D \circ G_B)$

which is non-negative, and zero only if $(G_D, G_B)$ is a bow-free ADMG. As suggested by Geffner et al. (2022), we implement the prior as

$p(G) \propto \exp\big(-\lambda_{s_1} \|G_D\|_F^2 - \lambda_{s_2} \|G_B\|_F^2 - \rho\, h(G_D, G_B)^2 - \alpha\, h(G_D, G_B)\big)$

where the coefficients $\alpha$ and $\rho$ are increased whilst maximizing $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$, following an augmented Lagrangian scheme (Nemirovski, 1999). Prior knowledge about the sparseness of the graph is introduced by penalizing the norms $\|G_D\|_F^2$ and $\|G_B\|_F^2$ with scaling coefficients $\lambda_{s_1}$ and $\lambda_{s_2}$.
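Both terms of the penalty are straightforward to compute; a small numpy sketch (using a truncated power series for the matrix exponential, which is adequate for small binary $G_D$):

```python
import numpy as np

def h_bowfree(G_D, G_B):
    """Bow-free ADMG penalty of Bhattacharya et al. (2021):
    trace(exp(G_D)) - D is zero iff G_D is acyclic, and
    sum(G_D * G_B) is zero iff no pair carries both edge types (no bow)."""
    D = G_D.shape[0]
    E, term = np.eye(D), np.eye(D)
    for k in range(1, 2 * D):        # truncated series for exp(G_D)
        term = term @ G_D / k
        E = E + term
    return np.trace(E) - D + np.sum(G_D * G_B)
```

A DAG with no bows scores zero; a directed cycle or a bow makes the penalty strictly positive, which the augmented Lagrangian then drives back to zero.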

5.4. CHOICE OF VARIATIONAL APPROXIMATION

We seek to approximate the intractable true posterior $p_\theta(u^1, \dots, u^N, G | x^1, \dots, x^N)$ using the variational distribution $q_\phi(u^1, \dots, u^N, G)$. We assume the factorized approximate posterior $q_\phi(u^1, \dots, u^N, G) = q_\phi(G) \prod_{n=1}^N q_\phi(u^n | x^n)$. For $q_\phi(G)$, we use a product of Bernoulli distributions, one for each potential directed edge in $G_D$ and bidirected edge in $G_B$. For $G_D$, edge existence and edge orientation are parameterized separately using the ENCO parameterization (Lippe et al., 2021). For $q_\phi(u^n | x^n)$, we apply amortized variational inference as in the VAE literature (Kingma & Welling, 2013), where $q_\phi(u^n | x^n)$ is parameterized as a Gaussian distribution whose mean and variance are determined by passing $x^n$ through an inference MLP.

5.5. MAXIMIZING $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$ RECOVERS THE GROUND TRUTH ADMG

In Section 4, we proved the structural identifiability of ADMGs. In this section, we further show that, under certain assumptions, maximizing $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$ recovers the true ADMG graph (denoted by $G_0$) in the infinite data limit. This result is stated in the following proposition:

Proposition 2 (Maximizing $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$ recovers the ground truth ADMG). Assume that:

• Assumptions 1 and 2 (and hence the identifiability of ADMGs) hold for the model $p_\theta(x; G)$.
• The model is correctly specified ($\exists \theta^*$ such that $p_{\theta^*}(x; G_0)$ recovers the data-generating process).
• Regularity condition: for all $\theta$ and $G$ we have $\mathbb{E}_{p(x; G_0)}[|\log p_\theta(x; G)|] < \infty$.
• The variational family of $q_\phi(u | x, G)$ is flexible enough, i.e., it contains $p_\theta(u | x, G)$.

Then, the solution $(\theta', q'_\phi(G))$ that maximizes $\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$ satisfies $q'_\phi(G) = \delta(G = G_0)$.

The proof of Proposition 2 can be found in Appendix B; it justifies performing causal discovery by maximizing the ELBO of the N-ADMG model. Once the model has been trained and the ADMG has been recovered, we can use N-ADMG to perform causal inference, as detailed in Appendix C.
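The ENCO-style factorization of $q_\phi(G_D)$ used in Section 5.4 separates edge existence from orientation; a minimal sketch of how two logit matrices combine into directed-edge probabilities (the variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def directed_edge_probs(gamma, theta):
    """ENCO-style parameterization of q(G_D): gamma[i, j] is the logit for
    an edge existing between nodes i and j, and theta[i, j] (with
    theta[i, j] = -theta[j, i]) the logit for it being oriented i -> j,
    so q([G_D]_{i,j} = 1) = sigmoid(gamma[i, j]) * sigmoid(theta[i, j])."""
    P = sigmoid(gamma) * sigmoid(theta)
    np.fill_diagonal(P, 0.0)  # no self-loops
    return P
```

Decoupling the two logits lets the model be confident that an edge exists while remaining uncertain about its direction, which helps gradient-based structure learning.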

6. EXPERIMENTS

We evaluate N-ADMG on both causal discovery and causal inference across a number of synthetic and real-world datasets. Note that we run our model both with and without the bow-free constraint, identified as N-BF-ADMG (our full model) and N-ADMG (for ablation purposes), respectively. We compare the performance of our model against five baselines: DECI (Geffner et al., 2022) (which we refer to as N-DAG for consistency), FCI (Spirtes et al., 2000), RCD (Maeda & Shimizu, 2020), CAM-UV (Maeda & Shimizu, 2021), and DCD (Bhattacharya et al., 2021). We evaluate causal discovery performance using F1 scores for directed and bidirected adjacency. The expected values of these metrics are reported under the learned graph posterior (which is deterministic for RCD, CAM-UV, and DCD). Causal inference is evaluated using the expected ATE, as described in Appendix C. We evaluate the causal inference performance of the causal discovery benchmarks by fixing $q(G)$ to either deterministic or uniform categorical distributions on the learned causal graphs, and then learning a non-linear flow-based ANM by optimizing Equation 15 in a manner identical to N-ADMG. A full list of results and details of the experimental set-up are included in Appendix F. Our implementation will be available at https://github.com/microsoft/causica.
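Concretely, the discovery metrics reduce to an F1 score computed entrywise on each adjacency matrix; a small helper (our own, mirroring the evaluation described above):

```python
import numpy as np

def edge_f1(G_true, G_pred):
    """F1 score over the entries of two binary adjacency matrices; applied
    separately to the directed (G_D) and bidirected (G_B) matrices."""
    tp = np.sum((G_true == 1) & (G_pred == 1))
    fp = np.sum((G_true == 0) & (G_pred == 1))
    fn = np.sum((G_true == 1) & (G_pred == 0))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return float(2 * prec * rec / (prec + rec))
```

Under a graph posterior, the reported value is the expectation of this score over sampled graphs.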

6.1. SYNTHETIC FORK-COLLIDER DATASET

We construct a synthetic fork-collider dataset consisting of five nodes (Figure 1a). The data-generating process is a non-linear ANM with Gaussian noise. The variable pairs $(x_2, x_3)$ and $(x_3, x_4)$ are latent-confounded, whereas $(x_4, x_5)$ share an observed confounder. We evaluate both causal discovery and causal inference performance. For discovery, we evaluate F1 scores on both $G_D$ and $G_B$. For causal inference, we choose $x_4$ as the treatment variable and $x_2$, $x_3$ and $x_5$ as the response variables, and evaluate ATE RMSE to benchmark performance. It is therefore crucial that discovery methods do not misidentify latent-confounded variables as having a direct cause between them, as this would result in biased ATE estimates.

Figure 1: ADMG identification results on the fork-collider dataset. (a) Ground truth; (b) N-BF-ADMG; (c) N-ADMG (no constraint); (d) RCD; (e) DCD; (f) CAM-UV.

N-BF-ADMG-G and N-ADMG-G are, on average, able to recover all bidirected edges from the data, while CAM-UV can only recover half of the latent variables. Without the bow-free constraint, N-ADMG-G discovers a directed edge from $x_4$ to $x_3$, which results in poor ATE RMSE performance. The DAG-based method (N-DAG-G) is not able to deal with latent confounders, resulting in poor F1 scores on both $G_D$ and $G_B$. Linear ANM-based methods (RCD and DCD) perform significantly worse than the other methods, yielding F1 scores of 0 for the directed adjacency matrices and the largest ATE errors. This demonstrates the necessity of introducing non-linear assumptions.

6.2. RANDOM CONFOUNDED ER SYNTHETIC DATASET

We generate synthetic datasets from an ADMG extension of the Erdős-Rényi (ER) graph model (Lachapelle et al., 2019). We first sample random ADMGs from the ER model, and simulate each variable using a randomly sampled non-linear ANM. Latent confounders are then removed from the training set; see Appendix G.2 for details. We consider node/directed-edge/latent-confounder triplets $(d, e, m) \in \{(4, 6, 2), (8, 20, 6), (12, 50, 10)\}$, and identify the resulting datasets as ER($d$, $e$, $m$). Figure 2 compares the performance of N-ADMG with the baselines. All variants of N-ADMG outperform the baselines on most datasets, highlighting their effectiveness relative to other methods (even those that employ similar assumptions). As in the fork-collider dataset, methods operating under the assumption of linearity perform poorly when the data-generating process is non-linear. It is worth noting that even when the exogenous noise of N-ADMG is misspecified, its performance still exceeds that of the other methods in most cases. This robustness is a desirable property, as in many settings the form of the exogenous noise is unknown.
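A sketch of how such random ADMGs can be generated (our reading of the ER($d$, $e$, $m$) setup, with details such as edge probabilities being illustrative choices rather than the exact protocol of Appendix G.2):

```python
import numpy as np

def sample_er_admg(d, e, m, seed=0):
    """Sample a random ADMG: an Erdos-Renyi-style DAG with roughly `e`
    directed edges over `d` nodes, plus `m` latent confounders placed on
    random non-adjacent pairs (yielding a bow-free graph by construction)."""
    rng = np.random.default_rng(seed)
    G_D = np.zeros((d, d), dtype=int)
    p = e / (d * (d - 1) / 2)
    perm = rng.permutation(d)  # random topological order keeps the graph acyclic
    for a in range(d):
        for b in range(a + 1, d):
            if rng.random() < p:
                G_D[perm[a], perm[b]] = 1
    G_B = np.zeros((d, d), dtype=int)
    nonadj = [(i, j) for i in range(d) for j in range(i + 1, d)
              if G_D[i, j] == 0 and G_D[j, i] == 0]
    picks = rng.choice(len(nonadj), size=min(m, len(nonadj)), replace=False)
    for i, j in [nonadj[t] for t in picks]:
        G_B[i, j] = G_B[j, i] = 1
    return G_D, G_B
```

Sampling edges along a random topological order guarantees acyclicity, and restricting bidirected edges to non-adjacent pairs enforces Assumption 2 and bow-freeness directly.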

6.3. INFANT HEALTH AND DEVELOPMENT PROGRAM (IHDP) DATASET

For the real-world evaluation, we assess treatment effect estimation performance on the Infant Health and Development Program (IHDP) dataset. This dataset contains measurements of infants and their mothers collected in a randomized experiment. The main task is to estimate the effect of home visits by specialists on infants' future cognitive test scores, where the ground-truth outcomes are simulated as in Hill (2011). To make the task more challenging, additional confounding is introduced by removing a subset (non-white mothers) of the treated population. More details can be found in Appendix G.3. We first perform causal discovery to learn the underlying ADMG of the dataset, and then perform causal inference. Since the true causal graph is unknown, we evaluate the causal inference performance of each method by estimating the ATE RMSE. Apart from the aforementioned baselines, we introduce four more methods here: PC-DwL (PC algorithm for discovery, DoWhy (Sharma et al., 2021) linear adjustment for inference); PC-DwNL (PC for discovery, DoWhy double machine learning for inference); N-DAG-G-DwL (N-DAG Gaussian for discovery, linear adjustment for inference); and N-DAG-S-DwL (N-DAG Spline for discovery, linear adjustment for inference). Results are summarized in Figures 3a to 3c. Generally, models with non-Gaussian exogenous noise assumptions tend to have lower ATE estimation errors, while models with linear assumptions (RCD and DCD) have the worst ATE RMSE. Interestingly, the DoWhy-based plug-in estimators tend to worsen the performance of the SCM models. However, regardless of the assumptions made on the exogenous noise, our methods (N-BF-ADMG-G and N-BF-ADMG-S) consistently outperform all other baselines with the same noise assumption. It is evident that for causal inference on real-world datasets, the ability of N-BF-ADMG to handle latent confounding and non-linear causal relationships is very effective.

7. CONCLUSION AND FUTURE WORK

In this work, we proposed Neural ADMG Learning (N-ADMG), a novel framework for gradient-based causal reasoning in the presence of latent confounding for nonlinear SCMs. We established identifiability theory for nonlinear ADMGs under latent confounding, and proposed a practical ADMG learning algorithm that is both flexible and efficient. In future work, we will extend our framework to more general settings (e.g., where the effects of observed and latent variables can modulate each other in certain forms, or where latent variables can confound adjacent variables), and improve certain modelling choices, such as the quality of the variational approximations over both the causal graph and the latent variables.

REPRODUCIBILITY STATEMENT

A number of efforts have been and will be made for the sake of reproducibility. First, we open-source our package on our GitHub page at github.com/microsoft/causica/tree/v0.0.0. In addition, this paper includes clear explanations of all assumptions and complete proofs of our claims in the Appendix, as well as model settings and hyperparameters. All datasets used in this paper are either publicly available, or synthetic data whose generation process is described in detail.

APPENDIX

A PROOF OF LEMMAS 1 TO 3

First, we describe the residual faithfulness condition inherited from Maeda & Shimizu (2021), which is used in our Lemmas 1 to 3:

Definition 2 (Residual faithfulness condition). We say that nonlinear functions g_i, g_j satisfy the residual faithfulness condition if, for any two arbitrary subsets M and N of x, whenever (x_i - g_i(M)) and (x_j - g_j(N)) both have terms involving the same exogenous noise ϵ_k, then (x_i - g_i(M)) and (x_j - g_j(N)) are mutually dependent.

Next, we provide the proofs of Lemmas 1 to 3 in Section 4. To prove these lemmas, we need the help of the following lemma, which extends Lemma A of Maeda & Shimizu (2021), except that we do not assume any form for g_i other than non-linearity.

Lemma 4. Let s(x_i) denote an arbitrary function of x_i. The residual of s(x_i) regressed onto x_{-i} cannot be independent of ϵ_i: ∀g_i, ¬[(s(x_i) - g_i(x_{-i})) ⫫ ϵ_i].

Proof. Assume that [(s(x_i) - g_i(x_{-i})) ⫫ ϵ_i] holds for some g_i. Then x_{-i} must contain at least one descendant of x_i, since g_i(x_{-i}) must depend on the noise ϵ_i in order to cancel the effect of ϵ_i in s(x_i). We can express g_i(x_{-i}) as a function u_i(ϵ) of the exogenous noises. Since g_i operates on variables defined by non-linear transformations of the exogenous noise terms, we cannot decompose u_i as a_i(ϵ_{-i}) + b_i(ϵ_i). Because x_{-i} contains a descendant of x_i, ϵ_{-i} includes at least one noise term ϵ_k satisfying x_i ⫫ ϵ_k (i.e., ϵ_k does not appear in x_i). Thus, the terms containing ϵ_i cannot be fully removed from s(x_i) - g_i(x_{-i}), and so [(s(x_i) - g_i(x_{-i})) ⫫ ϵ_i] does not hold.

A.1 PROOF OF LEMMA 1

Proof. Define g_i and g_j as g_i(x_{-i}) = f_{i,x}(pa_x(i)) + g'_i(x_{-i}) and g_j(x_{-j}) = f_{j,x}(pa_x(j)) + g'_j(x_{-j}), respectively. Then, Equation 6 becomes equivalent to ∀g'_i, g'_j, ¬[(f_{i,u}(pa_u(i)) + ϵ_i - g'_i(x_{-i})) ⫫ (f_{j,u}(pa_u(j)) + ϵ_j - g'_j(x_{-j}))].
Given Lemma 4 and following the same arguments as in Maeda & Shimizu (2021), this is equivalent to ¬[(f_{i,u}(pa_u(i)) + ϵ_i) ⫫ (f_{j,u}(pa_u(j)) + ϵ_j)]. (21) Since ϵ_i ⫫ ϵ_j, we have ¬(f_{i,u}(pa_u(i)) ⫫ ϵ_j) ∨ ¬(ϵ_i ⫫ f_{j,u}(pa_u(j))) ∨ ¬(f_{i,u}(pa_u(i)) ⫫ f_{j,u}(pa_u(j))). The first disjunct implies the existence of an unobserved mediator between x_j and x_i, the second implies the existence of an unobserved mediator between x_i and x_j, and the third implies the existence of an unobserved confounder. Given the assumption that latent variables are confounders, together with minimality, this indicates the presence of a latent confounder and no direct cause between x_i and x_j.

A.2 PROOF OF LEMMA 2

Proof. When Equation 7 holds, Equation 6 does not. Thus, there is no unobserved confounder between x_i and x_j. Now assume that x_j is a direct cause of x_i, and that Equation 7 is satisfied for some g_i and g_j. Then x_i contains a nonlinear function of ϵ_j that cannot be removed by g_i(x_{-(i,j)}), and thus ¬[(x_i - g_i(x_{-(i,j)})) ⫫ ϵ_j]. Similarly, ¬[(x_j - g_j(x_{-(i,j)})) ⫫ ϵ_j]. By the residual faithfulness condition, we therefore have ¬[(x_i - g_i(x_{-(i,j)})) ⫫ (x_j - g_j(x_{-(i,j)}))], which contradicts our initial assumption. The same arguments apply when x_i is a direct cause of x_j, implying that there can be no direct causal relationship between x_i and x_j.

A.3 PROOF OF LEMMA 3

Proof. When Equation 9 holds, Equation 6 does not, and so there is no latent confounder between x_i and x_j. When Equation 8 holds, Equation 7 does not, and so there is a direct causal relationship between x_i and x_j. Assume that x_j is a direct cause of x_i. Define g_i(x_{-i}) = f_{i,x}(pa_x(i)) and g_j(x_{-(i,j)}) = f_{j,x}(pa_x(j)), giving x_i - g_i(x_{-i}) = f_{i,u}(pa_u(i)) + ϵ_i and x_j - g_j(x_{-(i,j)}) = f_{j,u}(pa_u(j)) + ϵ_j. When there is no latent confounder, f_{i,u}(pa_u(i)) ⫫ f_{j,u}(pa_u(j)). Thus, (f_{i,u}(pa_u(i)) + ϵ_i) ⫫ (f_{j,u}(pa_u(j)) + ϵ_j) holds, and Equation 9 is satisfied. Now, assume instead that x_i is a direct cause of x_j. Using Lemma 4, we have ∀g_i, ¬[(x_i - g_i(x_{-i})) ⫫ ϵ_i]. Similarly, since x_j is a function of x_i, we also have ∀g_j, ¬[(x_j - g_j(x_{-(i,j)})) ⫫ ϵ_i]. Collectively, by the residual faithfulness condition, this implies ∀g_i, g_j, ¬[(x_i - g_i(x_{-i})) ⫫ (x_j - g_j(x_{-(i,j)}))], which contradicts Equation 9. Thus, if Equation 9 is satisfied, then x_j is a direct cause of x_i.

B PROOF OF PROPOSITION 2

To prove Proposition 2, we need the following lemma:

Lemma 5. Assume a variational distribution q_ϕ(G) over a space of graphs G_ϕ, where each graph G ∈ G_ϕ has a non-zero associated weight w_ϕ(G). With the soft prior p(G) defined as in Equation 17 and bounded λ_1, λ_2, ρ, α, we have lim_{N→∞} (1/N) KL[q_ϕ(G) ∥ p(G)] = 0.

Proof. This follows directly from Lemma 1 of Geffner et al. (2022).

Now we can proceed to prove Proposition 2:

Proof. For N-ADMG, in the infinite data limit, L_ELBO becomes

lim_{N→∞} (1/N) Σ_{n=1}^N E_{q_ϕ(G) q(u_n|x_n,G)}[log p_θ(x_n, u_n|G)] + (1/N) Σ_{n=1}^N H[q(u_n|x_n, G)] - (1/N) KL[q_ϕ(G) ∥ p(G)]
= lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{G∈G_ϕ} w_ϕ(G) log p_θ(x_n|G) - (1/N) Σ_{n=1}^N E_{q_ϕ(G)}[KL[q(u_n|x_n, G) ∥ p_θ(u_n|x_n, G)]], (27)

where the vanishing of the (1/N) KL[q_ϕ(G) ∥ p(G)] term follows from Lemma 5. Given fixed θ, the optimal posterior q*(u_n|x_n, G) satisfies q*(u_n|x_n, G) = p_θ(u_n|x_n, G) due to the flexibility assumption. Thus,

lim_{N→∞} L_ELBO(θ, ϕ, q*(u_n|x_n, G)) = lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{G∈G_ϕ} w_ϕ(G) log p_θ(x_n|G) = ∫ p(x; G_0) Σ_{G∈G_ϕ} w_ϕ(G) log p_θ(x|G) dx, (28)

where p(x; G_0) denotes the data-generating distribution under the ground truth ADMG G_0. Let (θ*, G*) = arg max ∫ p(x; G_0) log p_θ(x|G) dx be the MLE solution. Since Σ_{G∈G_ϕ} w_ϕ(G) = 1 and w_ϕ(G) > 0, we have

Σ_{G∈G_ϕ} w_ϕ(G) E_{p(x;G_0)}[log p_{θ_G}(x|G)] ≤ E_{p(x;G_0)}[log p_{θ*}(x|G*)],

where the optimal value of the left-hand side is achieved when every graph G ∈ G_ϕ and its associated parameters θ_G satisfy

E_{p(x;G_0)}[log p_{θ_G}(x|G)] = E_{p(x;G_0)}[log p_{θ*}(x|G*)]. (29)

Since the model is correctly specified, the MLE solution (θ*, G*) satisfies E_{p(x;G_0)}[log p_{θ*}(x|G*)] = E_{p(x;G_0)}[log p(x; G_0)]. Therefore, Equation 29 implies that every graph G' ∈ G_ϕ satisfies G' = G_0 under the regularity condition; or equivalently, G_ϕ = {G_0}. This proves our statement that q*_ϕ(G) = δ(G = G_0).

C ESTIMATING TREATMENT EFFECTS

For all experiments we consider, the causal quantity of interest we wish to estimate is the expected average treatment effect (ATE), E_{q_ϕ(G)}[ATE(a, b|G)], where the expectation is taken with respect to our learned posterior over causal graphs q_ϕ(G): E_{q_ϕ(G)}[ATE(a, b|G)] = E_{q_ϕ(G)}[ E_{p(x_Y|do(x_T=a),G)}[x_Y] - E_{p(x_Y|do(x_T=b),G)}[x_Y] ]. This requires samples from p(x_Y|do(x_T = b), G) = p(x_Y|x_T = b, G_{do(x_T)}), where G_{do(x_T)} is the 'mutilated' graph obtained by removing all incoming edges into x_T. We achieve this by simulating the learnt SCM on G_{do(x_T)} whilst keeping x_T fixed at its intervened value. Note that q_ϕ(u|x) is not used to estimate the ATE; it suffices to sample u from the prior distribution p(u). In our setting, the inference network q_ϕ(u|x) is only used as a means through which the likelihood of the data can be evaluated efficiently, and thus the model parameters learned. This is similar in spirit to VAEs (Kingma & Welling, 2013).
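As a concrete illustration, the Monte Carlo procedure above can be sketched as follows. This is a simplified sketch, not our exact implementation: `sample_graph` and `simulate_scm` are hypothetical stand-ins for the learnt graph posterior q_ϕ(G) and the ancestral SCM simulator on the mutilated graph (with u drawn from the prior p(u)).

```python
import numpy as np

def estimate_ate(sample_graph, simulate_scm, t_idx, y_idx, a, b,
                 n_graphs=1000, n_samples=2):
    """Monte Carlo estimate of E_{q(G)}[ATE(a, b | G)] (hypothetical API).

    sample_graph():      draws a graph G ~ q_phi(G).
    simulate_scm(G, do): ancestrally simulates the learnt SCM on the mutilated
                         graph, with `do` a dict {variable index: clamped value}
                         and latent u drawn from the prior p(u); returns one
                         sample of all observed variables.
    """
    effects = []
    for _ in range(n_graphs):
        G = sample_graph()
        # Interventional samples of the outcome under do(x_T = a) and do(x_T = b)
        y_a = [simulate_scm(G, do={t_idx: a})[y_idx] for _ in range(n_samples)]
        y_b = [simulate_scm(G, do={t_idx: b})[y_idx] for _ in range(n_samples)]
        effects.append(np.mean(y_a) - np.mean(y_b))
    return float(np.mean(effects))
```

With 1000 graphs and two outcome samples per graph, this matches the 2000-sample estimator described in Appendix F.2.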

D RELATED WORK ON CAUSAL INFERENCE UNDER LATENT CONFOUNDING

Attempting to perform causal inference in the presence of latent confounding can lead to biased estimates (Pearl, 2012). Whilst the observed data distribution may still be identifiable, estimating causal effects is not (Spirtes et al., 2000). A recent line of work has made progress in the case where the effects of multiple interventions are being estimated (Tran & Blei, 2017; Ranganath & Perotte, 2018; Wang & Blei, 2019; D'Amour, 2019). An alternative approach is to assume identifiability of the joint distribution over both latent and observed variables given only the observations. Louizos et al. (2017) point out that there are many cases in which this is possible (Khemakhem et al., 2020; Kingma & Welling, 2013). Nevertheless, all these methods assume the underlying causal graph is known. More recently, Mohammad-Taheri et al. (2021) argue that a DAG latent variable model trained on data can be used for downstream causal inference tasks even if its parameters are non-identifiable, as long as the query can be identified from the observed variables according to the do-calculus.

E ADDITIONAL DISCUSSIONS ON STRUCTURAL IDENTIFIABILITY OF LATENT VARIABLES

In Section 3.2, we argued that identifiability of the ADMG does not imply structural identifiability of the magnified SCM. In this section, we present further discussion of the identifiability of latent structures (of magnified SCMs). In general, these examples demonstrate that for linear non-Gaussian ANMs the structure of latent variables can be refined beyond the assumption of latent confounders acting between pairs of non-adjacent observed variables, whilst the same techniques cannot achieve this for non-linear ANMs.

E.1 DETERMINING CAUSAL STRUCTURE AMONGST LATENT VARIABLES

Recently, Cai et al. (2019) demonstrated that it is possible to discover the structure amongst latent variables using their so-called Triad constraints. Their method is limited to the linear non-Gaussian ANM case. In this section, we demonstrate that an analogous constraint is not available for non-linear ANMs.

Figure 4: The two possible latent variable structures considered by Cai et al. (2019): in both, u_1 → x_i (via f_1i), u_2 → x_j (via f_2j) and u_2 → x_k (via f_2k); in (a) u_1 → u_2 (via f_12), whilst in (b) u_2 → u_1 (via f_21).

Consider the causal graphs shown in Figures 4a and 4b.

Lemma 6 (Linear non-Gaussian identifiability). In the linear non-Gaussian ANM case, Equation 31 is satisfied only for the causal graph shown in Figure 4b. In the non-linear ANM case, Equation 31 is not satisfied for either Figure 4a or Figure 4b:

∃g, (x_i - g(x_j)) ⫫ x_k. (31)

Sketch of Proof. For the causal graph in Figure 4a, Equation 31 is equivalent to ∃g, f_{1i}(u_1) + ϵ_i - g(f_{2j}(f_{12}(u_1), u_2) + ϵ_j) ⫫ f_{2k}(f_{12}(u_1), u_2) + ϵ_k. On the right, we have a function of the latent variables u_1 and u_2, namely f_{2k}(f_{12}(u_1), u_2). On the left, we have a function of the same noise terms, f_{1i}(u_1) - g(f_{2j}(f_{12}(u_1), u_2) + ϵ_j). To remove u_1 from both sides, we require g to be non-zero, and so a term including u_2 remains. To remove u_2, we again require g to be non-zero, and so a term including u_1 remains. Thus, Equation 31 is not satisfied in either the linear non-Gaussian or the non-linear ANM case. For the causal graph in Figure 4b, Equation 31 is equivalent to ∃g, f_{1i}(f_{21}(u_2), u_1) + ϵ_i - g(f_{2j}(u_2) + ϵ_j) ⫫ f_{2k}(u_2) + ϵ_k. In the linear case, we can construct a linear g that removes u_2 from the left side so that Equation 31 holds (i.e., g = f_{1i} f_{21} / f_{2j}). In the non-linear case, u_1 and u_2 are coupled in the leftmost term and cannot be removed by a function involving only u_2 and ϵ_j. Hence, Equation 31 is violated.
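To make the linear-case cancellation for Figure 4b concrete, the following display writes out the construction sketched above; the coefficients a and the noise η_1 are our notation for the linear analogues of the mechanisms f and the exogenous noise of u_1:

```latex
% Linear non-Gaussian instance of Figure 4b: a_{21}, a_{1i}, a_{2j}, a_{2k}
% are the linear analogues of f_{21}, f_{1i}, f_{2j}, f_{2k}, and \eta_1 is
% the exogenous noise of u_1.
\begin{align*}
u_1 &= a_{21} u_2 + \eta_1, \qquad
x_i = a_{1i} u_1 + \epsilon_i, \qquad
x_j = a_{2j} u_2 + \epsilon_j, \qquad
x_k = a_{2k} u_2 + \epsilon_k, \\
x_i - \frac{a_{1i} a_{21}}{a_{2j}}\, x_j
  &= a_{1i}\,\eta_1 + \epsilon_i - \frac{a_{1i} a_{21}}{a_{2j}}\,\epsilon_j
  \;\perp\!\!\!\perp\; x_k .
\end{align*}
```

The linear regression g(x_j) = (a_{1i} a_{21} / a_{2j}) x_j removes every u_2 term from x_i, leaving a residual that involves only noises which never enter x_k; with non-linear mechanisms no such g exists, as u_1 and u_2 remain coupled inside f_{1i}.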

E.2 DETERMINING THE NUMBER OF LATENT CONFOUNDERS

Here, we consider whether the number of latent confounders can be determined in the non-linear ANM case. Lemma 7 shows that in the linear non-Gaussian case, the number of latent confounders acting between a triplet of confounded observed variables can be determined (by verifying certain constraints on the marginal distributions of the observed variables), whilst the same approach cannot be used in the non-linear ANM case. Consider the two causal graphs shown in Figures 5a and 5b.

Lemma 7 (Linear non-Gaussian identifiability). In the linear non-Gaussian ANM case, Equation 34 is satisfied only for the causal graph shown in Figure 5b. In the non-linear ANM case, Equation 34 is not satisfied for either Figure 5a or Figure 5b:

∃g, (x_i - g(x_j)) ⫫ x_k. (34)

Sketch of Proof. For Figure 5b, Equation 34 is equivalent to ∃g, f_i(u) + ϵ_i - g(f_j(u) + ϵ_j) ⫫ f_k(u) + ϵ_k. In the linear non-Gaussian case, it is straightforward to set g = f_i / f_j to remove the common noise term u from the left side and make the two sides independent. In the non-linear case, when f_i ≠ f_j the common noise term u cannot be removed from the left side, as g must be non-linear and thus produces a term that involves both u and ϵ_j. For Figure 5a, Equation 34 is equivalent to ∃g, f_i(u_1, u_2) + ϵ_i - g(f_j(u_1, u_3) + ϵ_j) ⫫ f_k(u_2, u_3) + ϵ_k. Here u_2 cannot be removed from the left side, and so Equation 34 does not hold in either the linear non-Gaussian or the non-linear case.

Figure 5: Two possible latent structures that confound each pair of the variables x_i, x_j and x_k: (a) three pairwise confounders, u_1 acting on (x_i, x_j), u_2 on (x_i, x_k) and u_3 on (x_j, x_k); (b) a single confounder u acting on all of x_i, x_j and x_k.

F OPTIMISATION DETAILS

F.1 OPTIMISATION DETAILS FOR N-ADMG

As discussed in Section 5, we gradually increase the prior hyperparameters ρ and α throughout training. This is done using an augmented Lagrangian optimisation procedure.
The optimisation process interleaves two steps: i) optimise the objective for fixed values of ρ and α for a certain number of steps; and ii) update the values of ρ and α. Steps i) and ii) are repeated until convergence, or until the maximum number of optimisation steps is reached. We describe these two steps in more detail below. Step i). The objective is optimised for fixed values of ρ and α using Adam. We use a learning rate of 10^-3 for the model parameters and 5 × 10^-3 for the variational parameters. We optimise the objective for a maximum of 5000 steps or until convergence (we stop early and move to step ii) if the loss does not improve for 1500 optimisation steps). During training, we reduce the learning rate by a factor of 10 if the training loss does not improve for 1000 steps, a maximum of two times. If we reach this condition a third time, we assume optimisation has converged and move to step ii). We apply annealing to the KL-divergence between the approximate posterior and the prior over the latent variables. The annealing constant is fixed within each step i), and increased linearly over the first optimisation loops. Step ii). We initialise ρ = 1 and α = 0. At the beginning of step i), we measure the DAG/bow-free penalty P_1 = E_{q_ϕ(G)}[h(G)]. At the beginning of step ii), we measure this penalty again as P_2 = E_{q_ϕ(G)}[h(G)]. If P_2 < 0.65 P_1, we leave ρ unchanged and update α ← α + ρP_2. Otherwise, we leave α unchanged and update ρ ← 10ρ. We repeat steps i) and ii) a maximum of 30 times or until convergence (measured as α or ρ reaching a maximum value, which we set to 10^3 for both).
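The outer loop of this procedure can be sketched as follows. This is a hedged illustration with a hypothetical API: `step_inner` stands in for the Adam loop of step i), and `penalty` for the evaluation of E_{q_ϕ(G)}[h(G)]; the 0.65 threshold and the tenfold ρ increase follow the description above.

```python
def augmented_lagrangian(step_inner, penalty, max_outer=30,
                         rho_max=1e3, alpha_max=1e3):
    """Sketch of the interleaved optimisation loop (hypothetical API).

    step_inner(rho, alpha): runs the inner optimisation with rho, alpha fixed.
    penalty():              evaluates the DAG / bow-free penalty E_q[h(G)].
    """
    rho, alpha = 1.0, 0.0
    for _ in range(max_outer):
        p1 = penalty()            # penalty before the inner loop (step i)
        step_inner(rho, alpha)    # step i): optimise objective, rho/alpha fixed
        p2 = penalty()            # penalty after the inner loop (step ii)
        if p2 < 0.65 * p1:
            alpha += rho * p2     # sufficient decrease: strengthen linear term
        else:
            rho *= 10.0           # insufficient decrease: strengthen quadratic term
        if rho >= rho_max or alpha >= alpha_max:
            break                 # treat as converged
    return rho, alpha
```

The design choice here is standard for augmented Lagrangian schemes: the quadratic weight ρ is only escalated when the penalty stops shrinking, keeping the objective well-conditioned early in training.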

F.2 ADDITIONAL HYPERPARAMETERS

Prior hyperparameters. We use the sparsity-inducing prior hyperparameters λ_{s,1} = λ_{s,2} = 5. ELBO approximation. We construct an approximation to the ELBO in Equation 15 using a single sample from the approximate posteriors. To evaluate the gradients of the ELBO, we use the Gumbel-softmax method with a hard forward pass and a soft backward pass, with a temperature of 0.25. Neural network architectures. The functions ξ_1, ξ_2 and ℓ used in the likelihood, as well as the inference network used to parameterise q_ϕ(u|x), are all two-hidden-layer MLPs with 80 hidden units per hidden layer. Non-Gaussian noise model. For the non-Gaussian noise model, we parameterise the exogenous noise distributions with learnable splines (the 'Spline' model variants). ATE estimation. For ATE estimation, we compute expectations by drawing 1000 graphs from the learnt posterior; for each graph we draw two samples of x_Y, for a total of 2000 samples, which we use to form a Monte Carlo estimate.
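The hard-forward/soft-backward Gumbel-softmax estimator mentioned above can be sketched in a few lines of PyTorch. This is a generic illustration of the straight-through trick (equivalent to `hard=True` in `torch.nn.functional.gumbel_softmax`), not our exact implementation:

```python
import torch

def gumbel_softmax_st(logits, tau=0.25):
    """Straight-through Gumbel-softmax: hard one-hot forward, soft backward."""
    # Soft relaxed sample; its gradient is used in the backward pass.
    soft = torch.nn.functional.gumbel_softmax(logits, tau=tau, hard=False)
    # Hard one-hot sample used in the forward pass.
    index = soft.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(soft).scatter_(-1, index, 1.0)
    # Forward value equals `hard`; gradients flow through `soft`.
    return hard + soft - soft.detach()
```

The low temperature (0.25) keeps the soft sample close to one-hot, so the straight-through gradient bias stays small while edge indicators remain strictly binary in the forward pass.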

G DATASET DETAILS

G.1 SYNTHETIC FORK-COLLIDER DATASET

We constructed a 2000-sample synthetic dataset with the causal structure shown in Figure 1 by sampling from the following SEM:

[u_1, u_2, ϵ_1, ϵ_2, ϵ_3, ϵ_4, ϵ_5]^T ∼ N(0, I)
x_1 = ϵ_1
x_2 = √6 exp(-u_1^2) + 0.1 ϵ_2
x_3 = √6 exp(-u_1^2) + √6 exp(-u_2^2) + 0.2 ϵ_3
x_4 = √6 exp(-u_2^2) + √6 exp(-x_1^2) + 0.1 ϵ_4
x_5 = √6 exp(-x_1^2) + 0.1 ϵ_5. (37)

Variables u_1 and u_2 are latent confounders acting on the variable pairs (x_2, x_3) and (x_3, x_4), respectively.
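The SEM above can be sampled with a few lines of NumPy; the function name and seed handling are ours:

```python
import numpy as np

def sample_fork_collider(n=2000, seed=0):
    """Sample the fork-collider SEM of Equation 37; u_1, u_2 are discarded."""
    rng = np.random.default_rng(seed)
    u1, u2 = rng.standard_normal(n), rng.standard_normal(n)
    e = rng.standard_normal((5, n))
    s6 = np.sqrt(6.0)
    x1 = e[0]
    x2 = s6 * np.exp(-u1**2) + 0.1 * e[1]
    x3 = s6 * np.exp(-u1**2) + s6 * np.exp(-u2**2) + 0.2 * e[2]
    x4 = s6 * np.exp(-u2**2) + s6 * np.exp(-x1**2) + 0.1 * e[3]
    x5 = s6 * np.exp(-x1**2) + 0.1 * e[4]
    # Only the observed variables are returned; the confounders stay latent.
    return np.stack([x1, x2, x3, x4, x5], axis=1)
```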

G.2 LATENT CONFOUNDED ER DATASET

We generate synthetic datasets from an ADMG extension of the Erdős-Rényi (ER) graph model (Lachapelle et al., 2019; Zheng et al., 2020). An ER(d, e, m) dataset is generated according to the following procedure:

1. Generate the d × d directed adjacency matrix G_D of an ADMG from the Erdős-Rényi (ER) graph model, with e expected directed edges;
2. Simulate the d × d bidirected adjacency matrix G_B via random Bernoulli sampling, with m expected bidirected edges;
3. Simulate exogenous noises ϵ_i from a zero-mean Gaussian distribution with standard deviation 0.1;
4. Simulate latent variables u from a zero-mean Gaussian distribution with standard deviation 0.1;
5. Simulate each observed variable as x_i = f_i(x_{pa(i;G_D)}) + g_i(u_{pa(i;G_B)}) + ϵ_i, where f_i, g_i are randomly sampled nonlinear functions of the form y = w^T e^{-x^2};
6. Remove u from the sampled dataset.
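The steps above can be sketched as follows. This is a simplified illustration under our own assumptions: one latent confounder is attached per bidirected edge, the directed graph is drawn as an upper-triangular ER matrix (hence acyclic), and the nonlinearities take the stated form y = w^T e^{-x^2}.

```python
import numpy as np

def sample_er_admg(d, e, m, n=1000, seed=0):
    """Sample an ER(d, e, m) ADMG dataset (simplified sketch)."""
    rng = np.random.default_rng(seed)
    n_pairs = d * (d - 1) / 2
    # 1. Directed adjacency: upper-triangular ER graph with e expected edges.
    G_D = np.triu(rng.random((d, d)) < e / n_pairs, k=1).astype(int)
    # 2. Bidirected adjacency: symmetric Bernoulli with m expected edges.
    B = np.triu(rng.random((d, d)) < m / n_pairs, k=1).astype(int)
    G_B = B + B.T
    # 3-4. Exogenous noises and one latent confounder per bidirected edge.
    eps = 0.1 * rng.standard_normal((n, d))
    edges_bi = [tuple(p) for p in np.argwhere(np.triu(G_B, k=1))]
    u = 0.1 * rng.standard_normal((n, len(edges_bi)))
    # 5. Observed variables in topological order, y = w^T exp(-x^2) mechanisms.
    w = rng.standard_normal(d)
    v = rng.standard_normal(max(len(edges_bi), 1))
    x = np.zeros((n, d))
    for i in range(d):
        pa = np.where(G_D[:, i])[0]
        conf = [k for k, (a, b) in enumerate(edges_bi) if i in (a, b)]
        fx = np.exp(-x[:, pa] ** 2) @ w[pa] if len(pa) else 0.0
        gu = np.exp(-u[:, conf] ** 2) @ v[conf] if len(conf) else 0.0
        x[:, i] = fx + gu + eps[:, i]
    # 6. Only x is returned; u is discarded.
    return x, G_D, G_B
```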

G.3 IHDP DATASET DETAILS

This dataset contains measurements of both infants (birth weight, head circumference, etc.) and their mothers (smoked cigarettes, drank alcohol, took drugs, etc.), collected in a real-life randomised experiment. The main task is to estimate the effect of home visits by specialists on the future cognitive test scores of infants. The outcomes of treatments are simulated artificially as in Hill (2011); hence the outcomes of both treatments (home visits or not) for each subject are known. Note that for each subject, our models are exposed to only one of the treatments; the other potential/counterfactual outcomes are hidden from the model, and are only used for the purpose of ATE evaluation. To make the task more challenging, additional confounding is manually introduced by removing a subset (non-white mothers) of the treated children population. In this way, we construct an IHDP dataset of 747 individuals with 6 continuous covariates and 19 binary covariates. We use 10 replicates of different simulations based on setting B (log-linear response surfaces) of Hill (2011), which can be downloaded from https://github.com/AMLab-Amsterdam/CEVAE. We use a 70%/30% train-test split ratio. Before training our models, all continuous covariates are normalised.

H RUN TIME COMPARISON

Here, we compare the run time of N-BF-ADMG and CAM-UV on a synthetically generated ER(12, 50, 10) dataset. The results are shown in Figure 6. N-BF-ADMG is trained for 30k epochs to ensure convergence; its run time could be further reduced via early stopping. The results in Figure 6 highlight that methods based on continuous optimisation (N-BF-ADMG) offer a significant improvement in run time relative to methods based on conditional independence tests (CAM-UV) for large datasets. In combination with the results in the main paper, this shows the empirical validity of a deep learning approach for identifying ADMGs that scales to larger datasets without sacrificing accuracy.



Meanwhile, Evans (2016) shows that this graph family cannot induce all possible observational distributions. In Appendix E, we show that in the linear non-Gaussian ANM case there exist certain constraints on marginal distributions that cannot be satisfied under this assumption. Nonetheless, we could not derive such constraints for the non-linear case.



Figure 2: Causal discovery results for synthetic ER datasets. For readability, the N-ADMG results are connected with lines. The figure shows mean results across five randomly generated datasets.

Figure 3: Causal inference results for the IHDP dataset: (a) Gaussian exogenous noise; (b) Spline exogenous noise; (c) DoWhy baselines. The figure shows mean ± standard error results across five random initialisations.

Figure 6: Run time results for N-BF-ADMG and CAM-UV on a 12 variable synthetic ER dataset. The figure shows mean results ± standard deviation across five randomly generated datasets.

Causal discovery and inference results for the fork-collider dataset. The table shows the mean and standard error results across five different random seeds.

