LEARNING LATENT STRUCTURAL CAUSAL MODELS

Abstract

Causal learning has long concerned itself with the accurate recovery of underlying causal mechanisms. Such causal modelling enables better explanations of out-of-distribution data. Prior works on causal learning assume that the high-level causal variables are given. However, in machine learning tasks one often operates on low-level data such as image pixels or high-dimensional vectors. In such settings, the entire Structural Causal Model (SCM), comprising its structure, parameters, and high-level causal variables, is unobserved and needs to be learnt from the low-level data. We treat this problem as Bayesian inference of the latent SCM given low-level data. For linear Gaussian additive noise SCMs, we present a tractable approximate inference method which performs joint inference over the causal variables, structure, and parameters of the latent SCM from random, known interventions. Experiments are performed on synthetic datasets and a causally generated image dataset to demonstrate the efficacy of our approach. We also perform image generation from unseen interventions, thereby verifying out-of-distribution generalization for the proposed causal model.

1. INTRODUCTION

Learning variables of interest and uncovering causal dependencies is crucial for intelligent systems to reason and predict in scenarios that differ from the training distribution. In the causality literature, causal variables and mechanisms are often assumed to be known. This knowledge enables reasoning and prediction under unseen interventions. In machine learning, however, one does not have direct access to the underlying variables of interest, nor to the causal structure and mechanisms that relate them. Rather, these have to be learned from observed low-level data, like the pixels of an image, which are usually high-dimensional. A learned causal model can then be useful for generalizing to out-of-distribution data (Scherrer et al., 2022; Ke et al., 2021), estimating the effect of interventions (Pearl, 2009; Schölkopf et al., 2021), disentangling underlying factors of variation (Bengio et al., 2012; Wang and Jordan, 2021), and transfer learning (Schoelkopf et al., 2012; Bengio et al., 2019). Structure learning (Spirtes et al., 2000; Zheng et al., 2018) recovers the structure and parameters of the Structural Causal Model (SCM) (Pearl, 2009) that best explain some observed high-level causal variables. In causal machine learning and representation learning, however, these causal variables may no longer be observable. This serves as the motivation for our work. We address the problem of learning the entire SCM, consisting of its causal variables, structure, and parameters, which is latent, by learning to generate the observed low-level data. Since one often operates in low-data regimes or non-identifiable settings, we adopt a Bayesian formulation so as to quantify epistemic uncertainty over the learned latent SCM. Given a dataset, we use variational inference to learn a joint posterior over the causal variables, structure, and parameters of the latent SCM.
To the best of our knowledge, ours is the first work to address the problem of Bayesian causal discovery in linear Gaussian latent SCMs from low-level data, where the causal variables are unobserved. Our contributions are as follows:
• We propose a general algorithm for Bayesian causal discovery in the latent space of a generative model, learning a distribution over the causal variables, structure, and parameters of linear Gaussian latent SCMs with random, known interventions. Figure 1 illustrates an overview of the proposed method.
• By learning the structure and parameters of a latent SCM, we implicitly induce a joint distribution over the causal variables; sampling from this distribution is equivalent to ancestral sampling through the latent SCM. As such, we address a challenging simultaneous optimization problem that is often encountered during causal discovery in latent space: one cannot find the right graph without the right causal variables, and vice versa.
• On a synthetically generated dataset and an image dataset used to benchmark causal model performance (Ke et al., 2021), we evaluate our method along three axes, uncovering causal variables, structure, and parameters, consistently outperforming baselines. We demonstrate its ability to perform image generation from unseen interventional distributions.

2. PRELIMINARIES

An SCM defines a set of endogenous (causal) variables, where each variable Z_i has a set of direct causes and a corresponding exogenous noise variable ϵ_i. The direct causes are subsets of the other endogenous variables. If the causal parent assignment is assumed to be acyclic, then an SCM is associated with a Directed Acyclic Graph (DAG) G = (V, E), where V corresponds to the endogenous variables and E encodes direct cause-effect relationships. The exact value z_i taken on by a causal variable Z_i is given by a local causal mechanism f_i conditioned on the values z_{pa_G(i)} of its parents, the parameters Θ_i, and the node's noise variable ϵ_i, as given in equation 1. For linear Gaussian additive noise SCMs with equal noise variance, i.e., the setting that we focus on in this work, all the f_i are linear functions and Θ denotes the weighted adjacency matrix W, where each W_ji is the edge weight of j → i. The linear Gaussian additive noise SCM thus reduces to equation 2:

z_i = f_i(z_{pa_G(i)}, Θ_i, ϵ_i) ,   (1)

z_i = Σ_{j ∈ pa_G(i)} W_ji · z_j + ϵ_i .   (2)
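Equation 2 can be simulated directly by ancestral sampling: visiting nodes in topological order, each z_i is a weighted sum of its parents' values plus Gaussian noise. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def ancestral_sample(W, sigma, n_samples, rng):
    """Draw samples from the linear Gaussian SCM of equation 2.

    W[j, i] holds the edge weight of j -> i; nodes are assumed to be
    topologically ordered, so W is strictly upper triangular."""
    d = W.shape[0]
    Z = np.zeros((n_samples, d))
    eps = rng.normal(0.0, sigma, size=(n_samples, d))
    for i in range(d):                        # visit nodes in topological order
        Z[:, i] = Z @ W[:, i] + eps[:, i]     # z_i = sum_j W_ji * z_j + eps_i
    return Z

rng = np.random.default_rng(0)
W = np.array([[0.0, 1.0, 0.0],                # chain z0 -> z1 -> z2
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
Z = ancestral_sample(W, sigma=0.1, n_samples=2000, rng=rng)
```

For the unit-weight chain above, the noise of every ancestor accumulates, so Var(z_2) = 3σ².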

2.2. CAUSAL DISCOVERY

Structure learning in prior work refers to learning a DAG according to some optimization criterion, with or without a notion of causality (e.g., He et al. (2019)). The task of causal discovery, on the other hand, is more specific in that it refers to learning the structure (and in some cases also the parameters) of SCMs, and subscribes to notions of causality and interventions like those of Pearl (2009). That is, the methods aim to estimate (G, Θ). These approaches often resort to modular likelihood scores over causal variables, like the BGe score (Geiger and Heckerman, 1994; Kuipers et al., 2022) and the BDe score (Heckerman et al., 1995), to learn the right structure. However, these methods all assume a dataset of observed causal variables. They either obtain a maximum likelihood estimate, G* = argmax_G p(Z | G) or (G*, Θ*) = argmax_{G,Θ} p(Z | G, Θ), or, in the case of Bayesian causal discovery (Heckerman et al., 1997), variational inference is typically used to fit a joint posterior distribution q_ϕ(G, Θ) to the true posterior p(G, Θ | Z) by minimizing the KL divergence between the two,

D_KL( q_ϕ(G, Θ) || p(G, Θ | Z) ) = −E_{(G,Θ)∼q_ϕ} [ log p(Z | G, Θ) − log ( q_ϕ(G, Θ) / p(G, Θ) ) ] + log p(Z) ,

where p(G, Θ) is a prior over the structure and parameters of the SCM, possibly encoding DAGness. Figure 2 shows the Bayesian Network (BN) over which inference is performed for causal discovery tasks. In more realistic scenarios, the learner does not directly observe the causal variables; they must be learned from low-level data. The causal variables, structure, and parameters are then part of a latent SCM. The goal of causal representation learning models is to perform inference of, and generation from, the true latent SCM. Yang et al. (2021) propose a Causal VAE, but in a supervised setup where one has labels on the causal variables and the focus is on disentanglement. Kocaoglu et al. (2018) present causal generative models trained in an adversarial manner, but assume observations of the causal variables.

2.3. LATENT CAUSAL DISCOVERY

Given the right causal structure as a prior, such work focuses on generation from conditional and interventional distributions. In both the causal representation learning and causal generative model scenarios mentioned above, the Ground Truth (GT) causal graph and parameters of the latent SCM are arbitrarily defined on real datasets and the setting is supervised. Contrary to this, our setting is unsupervised and we are interested in recovering the GT underlying SCM and causal variables that generate the low-level observed data. We define this as the problem of latent causal discovery, and the BN over which we want to perform inference is given in figure 3. In the upcoming sections, we discuss related work, formulate our problem setup, propose an algorithm for Bayesian latent causal discovery, evaluate it with experiments on causally created vector data and image data, and sample from unseen interventional image distributions to showcase the generalization of learned latent SCMs.

3. RELATED WORK

Prior work can be classified into Bayesian (Koivisto and Sood, 2004; Heckerman et al., 2006; Friedman and Koller, 2013) or maximum likelihood (Brouillard et al., 2020; Wei et al., 2020; Ng et al., 2022) methods that learn the structure and parameters of SCMs using either score-based (Kass and Raftery, 1995; Barron et al., 1998; Heckerman et al., 1995) or constraint-based (Cheng et al., 2002; Lehmann and Romano, 2005) approaches.

Causal discovery: Work in this category assumes that the causal variables are observed and does not operate on low-level data (Spirtes et al., 2000; Viinikka et al., 2020; Yu et al., 2021; Zhang et al., 2022). Peters and Bühlmann (2014) establish identifiability of linear Gaussian SCMs with equal noise variances, the setting we assume for our latent SCM.

Structure learning with latent variables: Markham and Grosse-Wentrup (2020) introduce the concept of Measurement Dependence Inducing Latent Causal Models (MCM). The proposed algorithm finds a minimal MCM that induces the dependencies between observed variables. However, similar to VAEs, the method assumes no causal links between the latent variables. Kivva et al. (2021) provide conditions under which the number of latent variables and the structure can be uniquely identified for discrete latent variables, given that the adjacency matrix between the hidden and measurement variables has linearly independent columns. Elidan et al. (2000) detect the signature of hidden variables using semi-cliques and then perform structure learning using the structural-EM algorithm (Friedman, 1998) for discrete random variables. Anandkumar et al. (2012) and Silva et al. (2006) consider the identifiability of linear Bayesian networks when some variables are unobserved. In the former work, the identifiability results hold only for particular classes of DAGs that follow certain structural constraints. Assuming non-Gaussian noise and that certain sets of latents have a lower bound on the number of pure measurement child variables, Xie et al. (2020) propose the GIN condition to identify the structure between latent confounders.
The formulation in the above works involves SCMs where some variables are observed and others are unobserved. In contrast, the entire SCM is latent in our setup. Finally, GraphVAE (He et al., 2019) learns a structure between latent variables but does not incorporate notions of causality. Lopez-Paz et al. (2016) establish observable causal footprints in images. We refer the reader to section A.9 for a discussion on identifiability.

4.1. PROBLEM SCENARIO

We are given a dataset D = {x^(1), ..., x^(N)}, where each x^(i) is a high-dimensional observation; for simplicity, we assume x^(i) is a vector in R^D, but the setup extends to other inputs as well. We assume that there exist latent causal variables Z = {z^(i) ∈ R^d}_{i=1}^N with d ≤ D that explain the data D, and that these latent variables belong to a GT SCM with structure G_GT and parameters Θ_GT. We wish to invert the data generation process g : (Z, G_GT, Θ_GT) → D. In this setting, we also have access to the intervention targets I = {I^(i)}_{i=1}^N, where each I^(i) ∈ {0, 1}^d. The j-th dimension of I^(i) takes the value 1 if node j was intervened on in data sample i, and 0 otherwise. We take X, Z, G, Θ to be random variables over the low-level data, the latent causal variables, the SCM structure, and the SCM parameters, respectively.
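To make the role of the intervention targets concrete: under a perfect intervention on node j, the value of z_j is set externally, so z_j no longer depends on its parents. On a weighted adjacency matrix this amounts to zeroing the incoming edges (the j-th column) for every node flagged in I^(i). A small illustration (the helper name is ours):

```python
import numpy as np

def mutilate(W, intervention_target):
    """Zero out incoming edges of intervened nodes: a perfectly intervened
    node stops listening to its parents."""
    W_int = W.copy()
    W_int[:, intervention_target.astype(bool)] = 0.0
    return W_int

W = np.array([[0.0, 1.2],
              [0.0, 0.0]])            # single edge z0 -> z1 with weight 1.2
I_i = np.array([0, 1])                # node 1 is intervened on in sample i
W_int = mutilate(W, I_i)              # the edge into z1 is cut
```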

4.2. GENERAL METHOD

We aim to obtain a posterior estimate over the entire latent SCM, p(Z, G, Θ | D). Computing the true posterior analytically requires calculating the marginal likelihood p(D), which quickly becomes intractable since the number of possible DAGs grows super-exponentially with the number of nodes. Thus, we resort to variational inference (Blei et al., 2017), which provides a tractable way to learn an approximate posterior q_ϕ(Z, G, Θ) with variational parameters ϕ, close to the true posterior p(Z, G, Θ | D), by maximizing the Evidence Lower Bound (ELBO),

L(ψ, ϕ) = E_{q_ϕ(Z,G,Θ)} [ log p_ψ(D | Z, G, Θ) − log ( q_ϕ(Z, G, Θ) / p(Z, G, Θ) ) ] ,   (5)

where p(Z, G, Θ) is the prior and p_ψ(D | Z, G, Θ) is the likelihood model with parameters ψ, which maps the latent variables to the high-dimensional observations. One approach to learning this posterior would be to factorize it as

q_ϕ(Z, G, Θ) = q_ϕ(Z) · q_ϕ(G, Θ | Z) .   (6)

Given a way to obtain q_ϕ(Z), the conditional q_ϕ(G, Θ | Z) could be obtained using existing Bayesian structure learning methods. Otherwise, one has to perform a hard simultaneous optimization, alternating between optimizing Z and optimizing (G, Θ) given an estimate of Z. The difficulty of such an alternating optimization is discussed in Brehmer et al. (2022).

Alternate factorization of the posterior: Rather than factorizing as in equation 6, we propose to introduce a variational distribution q_ϕ(G, Θ) only over structures and parameters, so that the approximation is given by q_ϕ(Z, G, Θ) = p(Z | G, Θ) · q_ϕ(G, Θ). The advantage of this factorization is that the true distribution p(Z | G, Θ) over Z is completely determined by the SCM given (G, Θ) and the exogenous noise variables (assumed to be Gaussian). This conveniently avoids the hard simultaneous optimization problem mentioned above, since optimizing for q_ϕ(Z) is not necessary.
Hence, equation 5 simplifies to

L(ψ, ϕ) = E_{q_ϕ(Z,G,Θ)} [ log p_ψ(D | Z) − log ( q_ϕ(G, Θ) / p(G, Θ) ) ] ,   (7)

since the remaining term log ( p(Z | G, Θ) / p(Z | G, Θ) ) cancels to zero. Such a posterior can be used to obtain an SCM by sampling Ĝ and Θ̂ from the approximate posterior. As long as the samples Ĝ are always acyclic, one can perform ancestral sampling through the SCM to obtain predictions ẑ^(i) of the causal variables. For additive noise models like equation 2, these samples are already reparameterized and differentiable with respect to their parameters. The samples of causal variables are then fed to the likelihood model to predict samples x̂^(i) that reconstruct the observed data x^(i).
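Written out with the factorizations q_ϕ(Z, G, Θ) = p(Z | G, Θ) q_ϕ(G, Θ) and p(Z, G, Θ) = p(Z | G, Θ) p(G, Θ), the cancellation is immediate:

$$
\begin{aligned}
\mathcal{L}(\psi,\phi)
&= \mathbb{E}_{q_\phi(Z,G,\Theta)}\left[\log p_\psi(\mathcal{D}\mid Z) - \log\frac{p(Z\mid G,\Theta)\, q_\phi(G,\Theta)}{p(Z\mid G,\Theta)\, p(G,\Theta)}\right]\\
&= \mathbb{E}_{q_\phi(Z,G,\Theta)}\left[\log p_\psi(\mathcal{D}\mid Z)\right] - \mathbb{E}_{q_\phi(G,\Theta)}\left[\log\frac{q_\phi(G,\Theta)}{p(G,\Theta)}\right].
\end{aligned}
$$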

4.3. POSTERIOR PARAMETERIZATIONS AND PRIORS

For linear Gaussian latent SCMs, the focus of this work, learning a posterior over (G, Θ) is equivalent to learning q_ϕ(W, Σ), a posterior over weighted adjacency matrices W and noise covariances Σ. We follow an approach similar to Cundy et al. (2021) and express W via a permutation matrix P and a lower triangular edge weight matrix L, according to W = Pᵀ Lᵀ P. Here, L is defined in the space of all weighted adjacency matrices with a fixed node ordering, where node j can be a parent of node i only if j > i. Search over permutations corresponds to search over different node orderings; thus, W and Σ parameterize the space of SCMs. Further, we factorize the approximate posterior as

q_ϕ(G, Θ) ≡ q_ϕ(W, Σ) ≡ q_ϕ(P, L, Σ) = q_ϕ(P | L, Σ) · q_ϕ(L, Σ) .   (8)

Combining equations 7 and 8 leads to the following ELBO, which has to be maximized (derived in A.1); the overall method is summarized in algorithm 1:

L(ψ, ϕ) = E_{q_ϕ(L,Σ)} [ E_{q_ϕ(P|L,Σ)} [ E_{p(Z|P,L,Σ)} [ log p_ψ(D | Z) ] − log ( q_ϕ(P | L, Σ) / p(P) ) ] − log ( q_ϕ(L, Σ) / (p(L) p(Σ)) ) ]

Distribution over (L, Σ): The posterior distribution q_ϕ(L, Σ) has (d(d−1)/2 + 1) elements to be learnt in the equal noise variance setting. It is parameterized as a normal distribution with diagonal covariance. For the prior p(L) over the edge weights, we promote sparse DAGs by using a horseshoe prior (Carvalho et al., 2009), similar to Cundy et al. (2021). A Gaussian prior is defined over log Σ.

Distribution over P: Since the entries of P are discrete, optimizing over P directly is combinatorial and quickly becomes intractable as d grows. This can be handled by relaxing the discrete permutation learning problem to a continuous optimization problem.
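The decomposition W = Pᵀ Lᵀ P guarantees acyclicity by construction: any permutation of a strictly triangular matrix is the adjacency matrix of a DAG. A numpy sketch of the construction, together with a simple acyclicity check (for a DAG, the trace of (I + A)^d equals d, because all powers of a nilpotent binary adjacency A are traceless):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# strictly lower triangular edge weights in the fixed ordering
# (node j can only be a parent of node i if j > i)
L = np.tril(rng.uniform(0.5, 2.0, size=(d, d)), k=-1)

perm = rng.permutation(d)
P = np.eye(d)[perm]                  # random permutation matrix

W = P.T @ L.T @ P                    # weighted adjacency in the data's node ordering

# acyclicity check: trace((I + A)^d) == d iff the binary adjacency A is a DAG
A = (W != 0).astype(float)
h = np.trace(np.linalg.matrix_power(np.eye(d) + A, d))
```

Since permuting rows and columns only relabels nodes, W has exactly the same edges as L, just expressed in a different node ordering.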
This is commonly done by introducing a Gumbel-Sinkhorn distribution (Mena et al., 2018), where one computes S((T + γ)/τ); here T is the parameter of the Gumbel-Sinkhorn distribution, γ is a matrix of standard Gumbel noise, and τ is a fixed temperature parameter. The logits T are predicted by passing the predicted (L, Σ) through an MLP. In the limit of infinite iterations and as τ → 0, sampling from the distribution returns a doubly stochastic matrix. During the forward pass, a hard permutation P is obtained by using the Hungarian algorithm (Kuhn, 1955), which allows τ → 0. During the backward pass, the soft permutation is used to calculate gradients, similar to (Cundy et al., 2021; Charpentier et al., 2022). We use a uniform prior p(P) over permutations.

Algorithm 1: Bayesian latent causal discovery to learn G, Θ, Z from high-dimensional data
Input: D, I
Output: Posterior samples over G, Θ, Z
1: Initialize q_ϕ(L, Σ), MLP_ϕ(T), p_ψ(X | Z), τ, and set learning rate α
2: for num epochs do
3:   (L, Σ) ∼ q_ϕ(L, Σ)
4:   T ← MLP_ϕ(T)(L, Σ)              ▷ Compute logits for sampling from q_ϕ(P | L, Σ)
5:   γ ∈ R^{d×d} ∼ standard Gumbel
6:   P_soft ← Sinkhorn((T + γ)/τ)
7:   P_hard ← Hungarian(P_soft; τ → 0)
8:   W ← Pᵀ Lᵀ P
9:   for i ← 1 to N do
10:    C^(i) ← argwhere(I^(i) = 1)
11:    W' ← copy(W)
12:    W'[:, C^(i)] ← 0              ▷ Mutate the weighted adjacency matrix according to I^(i)
13:    W_{I^(i)} ← W'
14:    ẑ^(i) ← AncestralSample(W_{I^(i)}, Σ)
15:  end for
16:  Ẑ ← {ẑ^(i)}_{i=1}^N
17:  D̂ ∼ p_ψ(X | Z = Ẑ)
18:  ψ ← ψ + α · ∇_ψ L(ψ, ϕ)         ▷ Update network parameters
19:  ϕ ← ϕ + α · ∇_ϕ L(ψ, ϕ)
20: end for
21: return binary(W), (W, Σ), Ẑ
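A small sketch of the relaxation in numpy/scipy (the paper's implementation is in JAX): Sinkhorn iterations alternately normalize the rows and columns of exp((T + γ)/τ), converging to a doubly stochastic matrix, and the Hungarian algorithm rounds it to a hard permutation in the forward pass.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import linear_sum_assignment

def sinkhorn(log_alpha, n_iters=200):
    """Alternate row/column normalization in log space (Mena et al., 2018)."""
    for _ in range(n_iters):
        log_alpha = log_alpha - logsumexp(log_alpha, axis=1, keepdims=True)
        log_alpha = log_alpha - logsumexp(log_alpha, axis=0, keepdims=True)
    return np.exp(log_alpha)

rng = np.random.default_rng(0)
d, tau = 4, 0.2
T = rng.normal(size=(d, d))           # logits (predicted from (L, Sigma) by an MLP)
gamma = rng.gumbel(size=(d, d))       # standard Gumbel noise
P_soft = sinkhorn((T + gamma) / tau)  # soft, approximately doubly stochastic

# forward pass: round to a hard permutation with the Hungarian algorithm
row, col = linear_sum_assignment(-P_soft)   # maximize the assigned mass
P_hard = np.zeros((d, d))
P_hard[row, col] = 1.0
```

In training, gradients flow through P_soft while P_hard is used in the forward computation (a straight-through-style estimator).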

5. EXPERIMENTS AND EVALUATION

We perform experiments to evaluate the learned posterior over (Z, G, Θ) of the true linear Gaussian latent SCM from high-dimensional data, aiming to highlight the performance of our proposed method on latent causal discovery. As proper evaluation in such a setting requires access to the GT causal graph that generated the high-dimensional observations, we test our method against baselines on synthetically generated vector data and in the realistic case of learning the SCM from pixels in the chemistry environment dataset of Ke et al. (2021), both of which have a GT causal structure to compare with. Further, we evaluate the ability of our model to sample images from unseen interventional distributions. Baselines: Since we are, to the best of our knowledge, the first to study this setting of Bayesian learning of latent SCMs from low-level observations, we are not aware of prior methods that solve this task. However, we compare our approach against two baselines: (i) a VAE, which has a marginal independence assumption between latent variables and thus a predefined (edgeless) structure in the latent space, and (ii) GraphVAE (He et al., 2019), which learns a structure between latent variables. For all baselines, we treat the learned latent variables as causal variables and compare the recovered structure, parameters, and causal variables. Since GraphVAE does not learn parameters, we fix the edge weight of all predicted edges to 1.

Evaluation metrics:

To evaluate the learned structure, we use two metrics commonly used in the literature: the expected Structural Hamming Distance (E-SHD, lower is better), which computes the SHD (number of edge flips, removals, or additions) between the predicted and GT graphs and then takes an expectation over the SHDs of posterior DAG samples, and the Area Under the Receiver Operating Characteristic curve (AUROC, higher is better), where a score of 0.5 corresponds to a random DAG baseline. To evaluate the learned parameters of the linear Gaussian latent SCM, we use the Mean Squared Error (MSE, lower is better) between the true and predicted edge weights. To evaluate the learned causal variables, we use the Mean Correlation Coefficient (MCC, higher is better), following Hyvarinen and Morioka (2017), Zimmermann et al. (2021), and Ahuja et al. (2022b), which calculates a score between the true and predicted causal variables. See appendix A.3 and A.8 for training curves and more extensive evaluations along other metrics. All our implementations are in JAX (Bradbury et al., 2018) and results are presented over 20 random DAGs.
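The structure and representation metrics are straightforward to compute from samples. A sketch of SHD and MCC (helper names are ours; MCC matches predicted latent dimensions to ground-truth ones with the Hungarian algorithm before averaging the absolute correlations):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shd(A_true, A_pred):
    """Edge additions + removals, with a reversed edge counted once (a flip)."""
    added = (A_pred == 1) & (A_true == 0)
    removed = (A_true == 1) & (A_pred == 0)
    flips = added & removed.T           # an addition paired with the reverse removal
    return int(added.sum() + removed.sum() - flips.sum())

def mcc(Z_true, Z_pred):
    """Mean Correlation Coefficient under the best matching of latent dims."""
    d = Z_true.shape[1]
    corr = np.abs(np.corrcoef(Z_true.T, Z_pred.T)[:d, d:])
    row, col = linear_sum_assignment(-corr)   # maximize matched correlation
    return corr[row, col].mean()

A_true = np.array([[0, 1], [0, 0]])           # edge 0 -> 1
A_flip = np.array([[0, 0], [1, 0]])           # reversed edge, SHD = 1
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
```

Because MCC maximizes over matchings, it is invariant to the latent-permutation ambiguity discussed in section 5.1.2.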

5.1. EXPERIMENTS ON SYNTHETIC DATA

We evaluate our proposed method against the baselines on a synthetically generated dataset, where we have complete control over the data generation procedure.

5.1.1. SYNTHETIC VECTOR DATA GENERATION

To generate high-dimensional vector data with a known causal structure, we first generate a random DAG and linear SCM parameters, and obtain the true causal variables by ancestral sampling. These are then used to generate the corresponding high-dimensional dataset with a random projection function. Generating the DAG and causal variables: Following many works in the literature, we sample random Erdős-Rényi (ER) DAGs (Erdos et al., 1960) with expected degrees in {1, 2, 4}. For every edge in this DAG, we sample the magnitude of the edge weight uniformly as |L| ∼ U(0.5, 2.0), and we randomly sample the permutation matrix. We perform ancestral sampling through this random DAG with intervention targets I to obtain Z, and then project it to D dimensions to obtain {x^(i)}_{i=1}^N. Generating high-dimensional vectors from causal variables: We consider two different ways of generating the high-dimensional data from the causal variables obtained in the previous step: (i) x^(i) is a random linear projection of the causal variables z^(i) from R^d to R^D, according to x = zP, where P ∈ R^{d×D} is a random projection matrix; (ii) x^(i) is a nonlinear projection of the causal variables z^(i), modeled by a 3-layer MLP.
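The full synthetic pipeline can be sketched in numpy under our own naming. We pick the ER edge probability so the expected number of edges matches degree·d (the sign of each weight is chosen at random here, an assumption on our part), and use the closed form of ancestral sampling for linear SCMs, z = (I − Wᵀ)^{-1} ϵ:

```python
import numpy as np

def random_linear_scm(d, exp_edges_per_node, rng):
    """Random ER DAG with edge-weight magnitudes drawn from U(0.5, 2.0)."""
    p = min(1.0, 2.0 * exp_edges_per_node / (d - 1))
    mask = np.tril(rng.random((d, d)) < p, k=-1)      # strictly lower triangular
    weights = rng.uniform(0.5, 2.0, (d, d)) * rng.choice([-1.0, 1.0], (d, d))
    L = mask * weights
    P = np.eye(d)[rng.permutation(d)]                 # random node ordering
    return P.T @ L.T @ P

d, D, N = 5, 100, 500
rng = np.random.default_rng(0)
W = random_linear_scm(d, exp_edges_per_node=1, rng=rng)

eps = rng.normal(0.0, 0.1, size=(N, d))
Z = np.linalg.solve(np.eye(d) - W.T, eps.T).T         # z = (I - W^T)^{-1} eps
X = Z @ rng.normal(size=(d, D))                       # random linear projection
```

The closed form is valid because W is permutation-similar to a strictly triangular (hence nilpotent) matrix, so I − Wᵀ is always invertible.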

5.1.2. RESULTS ON SYNTHETIC VECTOR DATA

Results on linear projection of causal variables: We present results on the learned causal variables, structure, and parameters in two scenarios: (i) when the true node ordering or permutation is given (e.g., as in He et al. (2019)), and (ii) when the node ordering is not given and one has to additionally infer the permutation P. For d = 5, 10, 20 nodes projected to D = 100 dimensions, we evaluate our algorithm on synthetic ER-1, ER-2, and ER-4 DAGs. The model was trained for 5000 epochs so as to reach convergence. Figure 4 summarizes the results for d = 5 nodes, for which we use 500 observational data points and 2000 interventional data points. Of the 2000 interventional data points, we generate 100 random interventional data points per set over 20 intervention sets. When the permutation is given, the proposed method recovers the causal graph correctly in all cases, achieving an E-SHD of 0 and an AUROC of 1. When the permutation is learned, the proposed method still recovers the true causal structure very well. This is not the case with the baseline methods, VAE and GraphVAE, which perform significantly worse on most of the metrics.

Results on nonlinear projection of causal variables: For d = 5, 10, 20 nodes projected to D = 100 dimensions, we evaluate our algorithm on synthetic ER-1, ER-2, and ER-4 DAGs, given the permutation. Figure 5 summarizes the results for 5 and 10 nodes. As in the linear case, the proposed method recovers the true causal structure and the true causal variables, and is significantly better than the VAE and GraphVAE baselines on all the metrics considered. For experiments in this setting, we noticed empirically that learning the permutation is hard and performs not much differently from a null-graph baseline (figure 13). This observation complements the identifiability result that recovery of latent variables is possible only up to a permutation in latent causal models (Brehmer et al., 2022; Liu et al., 2022) for general nonlinear mappings between causal variables and low-level data. It supports our observation of not being able to learn the permutation in nonlinear projection settings, but once the permutation is given to the model, it quickly recovers the SCM (figures 5, 13). Refer to figure 14 (in the Appendix) for results on d = 20 nodes.



5.2. RESULTS ON LEARNING LATENT SCMS FROM PIXEL DATA

Dataset and Setup: A major challenge in evaluating latent causal discovery models on images is that it is hard to obtain images with a corresponding GT graph and parameters. Other works (Kocaoglu et al., 2018; Yang et al., 2021; Shen et al., 2021) handle this by assuming the dataset is generated from certain causal variables (assumed to be attributes like gender, baldness, etc.) and a causal structure that is heuristically set by experts, usually on the CelebA dataset (Liu et al., 2015). This makes evaluation particularly noisy. Given these limitations, we verify whether our model can perform latent causal discovery by evaluating on images from the chemistry dataset proposed in Ke et al. (2021), a scenario where all GT factors are known. We use the environment to generate blocks of different intensities according to a linear Gaussian latent SCM, where the parent block colors affect the child block colors, and then obtain the corresponding images of blocks. The dataset allows generating pixel data from random DAGs and linear SCMs. For this step, we use the same technique to generate causal variables as in the synthetic dataset section. As in the experiments on nonlinear vector data, we are given the node ordering in this setting. Results: We perform experiments to evaluate latent causal discovery from pixels and known interventions. The results are summarized in figure 6. The proposed approach recovers the SCM significantly better than the baseline approaches on all the metrics, even in this realistic dataset. In figure 7, we also assess the ability of the model to sample images from unseen interventions in the chemistry dataset by comparing the generated images with GT interventional samples. The matching intensity of each block corresponds to matching causal variables, which demonstrates model generalization.

LIMITATIONS

Our approach makes a number of assumptions. First, we assume the latent SCM is linear Gaussian; removing this restriction would make the approach more general. Second, we assume access to the number of latent causal variables d. Extensions should consider how to infer the number of latent variables. We also assume known intervention targets, which might be restrictive for real-world applications. Future work could overcome this limitation by inferring interventions as in Hägele et al. (2022). Finally, we have assumed the feasibility of interventions and known causal orderings for some of our experiments. In reality, some interventions could be infeasible and the causal ordering of the latent variables might not be known.

CONCLUSION

The details regarding the synthetic generation of the DAGs, causal variables, and the high-dimensional data are given in section 5. The code for data generation, the models used, and all the experiments is available at anonymous.4open.science/r/anon-biols-86E7. The average runtime of all experiments is documented in table 3. Appendix A.8 further evaluates the experiments along additional metrics.

Since the log evidence log p(D) is a constant, minimizing the KL divergence corresponds to maximizing the following ELBO:

max_{ϕ,ψ} E_{(L,Σ)∼q_ϕ(L,Σ)} [ E_{P∼q_ϕ(P|L,Σ)} [ E_{Z∼p(Z|P,L,Σ)} [ log p_ψ(D | Z) ] − log ( q_ϕ(P | L, Σ) / p(P) ) ] − log ( q_ϕ(L, Σ) / (p(L) p(Σ)) ) ]

A.2 IMPLEMENTATION DETAILS

For all our experiments, we use the AdaBelief optimizer (Zhuang et al., 2020) with ϵ = 10^{-8} and a learning rate of 0.0008. τ is set to 0.2. Our experiments are fairly robust with respect to hyperparameters and we did not perform hyperparameter tuning for any of our experiments. Tables 1 and 2 summarize the network details for MLP_ϕ(T) and the decoder p_ψ(X | Z).

Here, we study how the performance and recovery of the latent SCM are affected by the number of intervention types in the dataset. The number of intervention types refers to the different combinations of nodes we perform interventions on. For all experiments in this subsection, we use 100 interventional samples per type of intervention. Figures 15 and 16 show results on vector data where the high-dimensional vector is a linear projection of the causal variables. Figures 17 and 18 summarize results on vector data where the high-dimensional vector is a nonlinear projection of the causal variables. Figure 19 shows the complete evaluation on 18 different metrics for d = 5 nodes. The dataset consists of 500 observational points and 2000 interventional points.
To sample the 2000 interventional points, we randomly choose 20 intervention sets, and for each intervention set we sample 100 data points with random intervention values.

A.9 DISCUSSION ON IDENTIFIABILITY

Identifiability is an important topic of discussion when drawing conclusions about recovering the structure of a causal model. In our work, however, we approximate a full posterior distribution over latent SCMs instead of returning a single graph. In this setting, questions of identifiability become less critical, as we can assign probabilities to many candidate graphs (and parameters) to express our level of confidence that a particular SCM yields the correct causal conclusions. This is a softer guarantee than what identifiability would provide. Nevertheless, we refer the reader to recent works in causal representation learning (Brehmer et al., 2022; Ahuja et al., 2022a;c) that have given identifiability guarantees in conditions similar to ours. In particular, we rely on the identifiability results presented in Brehmer et al. (2022). But identifiability does not imply learnability. Thus, in this work, we are concerned with the problem of how one can devise a principled, practical algorithm to learn a distribution over latent SCMs.



A permutation matrix P ∈ {0, 1}^{d×d} is a bistochastic matrix with Σ_i p_ij = 1 ∀j and Σ_j p_ij = 1 ∀i. An intervention set is defined as a set of nodes on which an intervention is performed.



Figure 1: Model architecture of the proposed generative model for the Bayesian latent causal discovery task to learn latent SCM from low-level data.

Figure 3: BN for the latent causal discovery task that generalizes standard causal discovery setups

Supervised causal representation learning: Brehmer et al. (2022) present identifiability theory for learning causal representations and propose an algorithm assuming access to pairs of observational and interventional data. Ahuja et al. (2022a) study identifiability in a similar setup with the use of sparse perturbations. Ahuja et al. (2022c) discuss identifiability for causal representation learning when one has access to interventional data. Kocaoglu et al. (2018), Shen et al. (2021), and Moraffah et al. (2020) introduce generative models that use an SCM-based prior in latent space. In Shen et al. (2021), the goal is to learn causally disentangled variables. Yang et al. (2021) learn a DAG but assume complete access to the causal variables.

Figure 4: Learning the latent SCM (i) given a node ordering (top) and (ii) over node orderings (bottom) for linear projection of causal variables for d = 5 nodes, D = 100 dimensions: E-SHD (↓), AUROC (↑), MCC (↑), MSE (↓).

Under review as a conference paper at ICLR 2023

Figure 5: Learning the latent SCM for nonlinear projection of causal variables for d = 5 (top) and d = 10 (bottom) nodes, D = 100 dimensions, given the node ordering. E-SHD (↓), AUROC (↑), MCC (↑), MSE (↓).

Figures 11 and 12 (in the Appendix) show the results for d = 10 and d = 20 nodes.

Figure 6: Learning the latent SCM from pixels of the chemistry dataset for d = 5 (top) and d = 10 nodes (bottom), given the node ordering. E-SHD (↓), AUROC (↑), MCC (↑), MSE (↓)

Figure 7: Image sampling from 10 random, unseen interventions: Mean over GT (top row) and predicted (bottom row) image samples in the chemistry dataset for d = 5 nodes.

Figures 8, 9, and 10 summarize the training curves for d = 6, 20, 50 nodes on ER DAGs of degrees 1, 2, and 4. Each plot contains 3 lines that show training with observational, single-interventional, and multi-interventional data.

Figure 14: Learning the latent SCM given a node ordering for nonlinear projection of causal variables for d = 20 nodes, D = 100 dimensions.

Figure 19: Training curves: Metric versus number of iterations for d = 5 nodes linearly projected to D = 100 dimensions, with 20 intervention sets, 100 interventional samples per intervention set, and 500 observational samples.

We presented a tractable approximate inference technique to perform Bayesian latent causal discovery, jointly inferring the causal variables, structure, and parameters of linear Gaussian latent SCMs under random, known interventions from low-level data. The learned causal model is also shown to generalize to unseen interventions. Our Bayesian formulation allows uncertainty quantification and mutual information estimation, which is well suited for extensions to active causal discovery. Extending the proposed method to learn nonlinear, non-Gaussian latent SCMs from unknown interventions would also open doors to general algorithms that can learn causal representations.

Table 1: Network architecture for MLP_ϕ(T)

Table 2: Network architecture for the decoder p_ψ(X | Z) (columns: layer type, layer output, activation)

A APPENDIX

A.1 DERIVATION OF THE ELBO

We want to minimize the KL divergence between the true and approximate posterior, D_KL( q_ϕ(Z, G, Θ) || p(Z, G, Θ | D) ), where the prior and posterior factorize as (according to the explanation in 4.2) p(Z, G, Θ) = p(Z | G, Θ) p(G, Θ) and q_ϕ(Z, G, Θ) = p(Z | G, Θ) q_ϕ(G, Θ). Thus, D_KL( q_ϕ(Z, G, Θ) || p(Z, G, Θ | D) ) reduces to:

−E_{(G,Θ)∼q_ϕ(G,Θ)} [ E_{Z∼p(Z|G,Θ)} [ log p_ψ(D | Z) ] − log ( q_ϕ(G, Θ) / p(G, Θ) ) ] + log p(D)

= −E_{(P,L,Σ)∼q_ϕ(P,L,Σ)} [ E_{Z∼p(Z|P,L,Σ)} [ log p_ψ(D | Z) ] − log ( q_ϕ(P, L, Σ) / p(P, L, Σ) ) ] + log p(D)   (from 8)

= −E_{(L,Σ)∼q_ϕ(L,Σ)} [ E_{P∼q_ϕ(P|L,Σ)} [ E_{Z∼p(Z|P,L,Σ)} [ log p_ψ(D | Z) ] − log ( q_ϕ(P | L, Σ) / p(P) ) ] − log ( q_ϕ(L, Σ) / (p(L) p(Σ)) ) ] + log p(D)

(2022) is in a setting where the edges exist not just between latent causal variables but with high-dimensional variables in the dataset as well. Other efforts include (Shimizu et al., 2011; Lopez-Paz and Oquab, 2016; Yu et al., 2019; Ghoshal and Honorio, 2018; Ng et al., 2020; Li et al., 2022).

A.8 COMPLETE EVALUATION

For the experiments already presented in the main text, this section contains a more comprehensive evaluation on the following metrics: • SHD_C: the expected CPDAG SHD between the skeletons of the GT and predicted DAGs.

