DIFFUSION MODELS FOR CAUSAL DISCOVERY VIA TOPOLOGICAL ORDERING

Abstract

Discovering causal relations from observational data becomes possible with additional assumptions, such as constraining the functional relations to be nonlinear with additive noise (ANM). Even with strong assumptions, causal discovery involves an expensive search over the space of directed acyclic graphs (DAGs). Topological ordering approaches reduce the optimisation space of causal discovery by searching over permutations rather than over graphs. For ANMs, the Hessian of the data log-likelihood can be used to find leaf nodes in a causal graph, which yields its topological ordering. However, existing computational methods for obtaining the Hessian do not scale as the number of variables and the number of samples increase. Therefore, inspired by recent innovations in diffusion probabilistic models (DPMs), we propose DiffAN, a topological ordering algorithm that leverages DPMs for learning a Hessian function. We introduce theory for updating the learned Hessian without re-training the neural network, and we show that computing with a subset of samples gives an accurate approximation of the ordering, which allows scaling to datasets with more samples and variables. We show empirically that our method scales exceptionally well to datasets with up to 500 nodes and up to $10^5$ samples while still performing on par with state-of-the-art causal discovery methods on small datasets.

1. INTRODUCTION

Figure 1: Run time in seconds for different sample sizes, for discovery of causal graphs with 500 nodes. Most causal discovery methods have prohibitive run time and memory cost for datasets with many samples; the previous state-of-the-art SCORE algorithm (Rolland et al., 2022), which is included in this plot, cannot be computed beyond 2000 samples on a machine with 64GB of RAM. By contrast, our method DiffAN has a reasonable run time even for numbers of samples two orders of magnitude larger than most existing methods can handle.

Understanding the causal structure of a problem is important for areas such as economics, biology (Sachs et al., 2005) and healthcare (Sanchez et al., 2022), especially when reasoning about the effect of interventions. When interventional data from randomised trials are not available, causal discovery methods (Glymour et al., 2019) may be employed to discover the causal structure of a problem solely from observational data. Causal structure is typically modelled as a directed acyclic graph (DAG) $G$ in which each node is associated with a random variable and each edge represents a causal mechanism, i.e. how one variable influences another. However, learning such a model from data is NP-hard (Chickering, 1996). Traditional methods search the DAG space by testing for conditional independence between variables (Spirtes et al., 1993) or by optimising some goodness-of-fit measure (Chickering, 2002). Unfortunately, solving the search problem with a greedy combinatorial optimisation method can be expensive and does not scale to high-dimensional problems. In line with previous work (Teyssier & Koller, 2005; Park & Klabjan, 2017; Bühlmann et al., 2014; Solus et al., 2021; Wang et al., 2021; Rolland et al., 2022), we can speed up the combinatorial search problem over the space of DAGs by rephrasing it as a topological ordering task, ordering from leaf nodes to root nodes.
The search space over DAGs with $d$ nodes and $(d^2 - d)/2$ possible edges is much larger than the space of permutations over $d$ variables. Once a topological ordering of the nodes is found, the potential causal relations between later (cause) and earlier (effect) nodes can be pruned with a feature selection algorithm (e.g. Bühlmann et al. (2014)) to yield a graph which is naturally directed and acyclic without further optimisation.

Recently, Rolland et al. (2022) proposed the SCORE algorithm for topological ordering. SCORE uses the Hessian of the data log-likelihood, $\nabla_x^2 \log p(x)$. By verifying which elements of the diagonal of $\nabla_x^2 \log p(x)$ are constant across all data points, leaf nodes can be iteratively identified and removed. Rolland et al. (2022) estimate the Hessian point-wise with a second-order Stein gradient estimator (Li & Turner, 2018) over a radial basis function (RBF) kernel. However, point-wise estimation with kernels scales poorly to datasets with a large number of samples $n$ because it requires inverting an $n \times n$ kernel matrix.

Here, we enable scalable causal discovery by utilising neural networks (NNs) trained with denoising diffusion instead of the kernel-based estimation of Rolland et al. (2022). We use the ordering procedure of Rolland et al. (2022), which requires re-computing the score's Jacobian at each iteration. Training NNs at each iteration would not be feasible. Therefore, we derive a theoretical analysis that allows updating the learned score without re-training. In addition, the NN is trained over the entire dataset ($n$ samples) but only a subsample is used for finding leaf nodes. Thus, once the score model is learned, we can use it to order the graph at a cost that is constant in $n$, enabling causal discovery for large datasets in high-dimensional settings.
Interestingly, our algorithm does not require architectural constraints on the neural network, unlike previous causal discovery methods based on neural networks (Lachapelle et al., 2020; Zheng et al., 2020; Yu et al., 2019; Ng et al., 2022). Our training procedure does not learn the causal mechanism directly, but the score of the data distribution.

Contributions.

In summary, we propose DiffAN, an identifiable algorithm leveraging a diffusion probabilistic model for topological ordering that enables causal discovery assuming an additive noise model: (i) To the best of our knowledge, we present the first causal discovery algorithm based on denoising diffusion training which allows scaling to datasets with up to 500 variables and $10^5$ samples. The score estimated with the diffusion model is used to find and remove leaf nodes iteratively; (ii) We estimate the second-order derivatives (score's Jacobian or Hessian) of a data distribution using neural networks with diffusion training via backpropagation; (iii) The proposed deciduous score (Section 3) allows efficient causal discovery without re-training the score model at each iteration. When a leaf node is removed, the score of the new distribution can be estimated from the original score (before leaf removal) and its Jacobian.

2.1. PROBLEM DEFINITION

We consider the problem of discovering the causal structure between $d$ variables, given a probability distribution $p(x)$ from which a $d$-dimensional random vector $x = (x_1, \dots, x_d)$ can be sampled. We assume that the true causal structure is described by a DAG $G$ containing $d$ nodes. Each node represents a random variable $x_i$ and edges represent the presence of causal relations between them. In other words, $G$ defines a structural causal model (SCM) consisting of a collection of assignments $x_i := f_i(Pa(x_i), \epsilon_i)$, where $Pa(x_i)$ are the parents of $x_i$ in $G$, and $\epsilon_i$ is a noise term independent of $x_i$, also called exogenous noise. The $\epsilon_i$ are i.i.d. from a smooth distribution $p_\epsilon$. The SCM entails a unique distribution $p(x) = \prod_{i=1}^{d} p(x_i \mid Pa(x_i))$ over the variables $x$ (Peters et al., 2017). The observational input data are $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of samples. The target output is an adjacency matrix $A \in \mathbb{R}^{d \times d}$.

The topological ordering (also called causal ordering or causal list) of a DAG $G$ is defined as a non-unique permutation $\pi$ of the $d$ nodes such that a given node always appears in the list before its descendants. Formally, $\pi_i < \pi_j \iff j \in De_G(x_i)$, where $De_G(x_i)$ are the descendants of the $i$th node in $G$ (Appendix B in Peters et al. (2017)).
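To make the definition of a topological ordering concrete, the following minimal sketch checks whether a candidate permutation is a valid ordering of a DAG. It assumes, purely for illustration, that the adjacency matrix uses `adjacency[i][j] == 1` for an edge $x_i \to x_j$ and that the ordering lists causes before effects; the function name `is_topological_order` is hypothetical and not part of the paper.

```python
def is_topological_order(order, adjacency):
    """Return True if every edge i -> j has node i placed before node j.

    Assumption (not from the paper): adjacency[i][j] == 1 encodes x_i -> x_j.
    Checking direct edges is enough, since descendants follow by transitivity.
    """
    d = len(adjacency)
    if sorted(order) != list(range(d)):
        return False  # not a permutation of the d nodes
    position = {node: rank for rank, node in enumerate(order)}
    for i in range(d):
        for j in range(d):
            if adjacency[i][j] and position[i] > position[j]:
                return False  # a parent appears after its child
    return True


# Toy DAG: x0 -> x1 and x0 -> x2 (x1 and x2 are both leaves).
A = [[0, 1, 1],
     [0, 0, 0],
     [0, 0, 0]]
```

Both `[0, 1, 2]` and `[0, 2, 1]` pass the check for this toy graph, which illustrates why the ordering is described as non-unique.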

2.2. NONLINEAR ADDITIVE NOISE MODELS

Learning a unique $A$ from $X$ with observational data requires additional assumptions. A common class of methods called additive noise models (ANM) (Shimizu et al., 2006; Hoyer et al., 2008; Peters et al., 2014; Bühlmann et al., 2014) explores asymmetries in the data by imposing functional assumptions on the data generation process. In most cases, they assume that assignments take the form $x_i := f_i(Pa(x_i)) + \epsilon_i$ with $\epsilon_i \sim p_\epsilon$. Here we focus on the case described by Peters et al. (2014) where $f_i$ is nonlinear. We use the notation $f_i$ for $f_i(Pa(x_i))$ because the arguments of $f_i$ will always be $Pa(x_i)$ throughout this paper. We highlight that $f_i$ does not depend on $x_i$.

Identifiability. We assume that the SCM follows an additive noise model (ANM) which is known to be identifiable from observational data (Hoyer et al., 2008; Peters et al., 2014). We also assume causal sufficiency, i.e. there are no hidden variables that are a common cause of at least two observed variables. In addition, Corollary 33 from Peters et al. (2014) states that the true topological ordering of the DAG, as in our setting, is identifiable from a $p(x)$ generated by an ANM without requiring causal minimality assumptions.

Finding Leaves with the Score. Rolland et al. (2022) propose that the score of an ANM with distribution $p(x)$ can be used to find leaves. Before presenting how to find the leaves, we derive, following Lemma 2 in Rolland et al. (2022), an analytical expression for the score, which can be written as

$$
\begin{aligned}
\nabla_{x_j} \log p(x) &= \nabla_{x_j} \log \prod_{i=1}^{d} p(x_i \mid Pa(x_i)) \\
&= \nabla_{x_j} \sum_{i=1}^{d} \log p(x_i \mid Pa(x_i)) \\
&= \nabla_{x_j} \sum_{i=1}^{d} \log p_\epsilon(x_i - f_i) \qquad \triangleright \text{ using } \epsilon_i = x_i - f_i \\
&= \frac{\partial \log p_\epsilon(x_j - f_j)}{\partial x_j} - \sum_{i \in Ch(x_j)} \frac{\partial f_i}{\partial x_j} \frac{\partial \log p_\epsilon(x_i - f_i)}{\partial x},
\end{aligned}
\tag{1}
$$

where $Ch(x_j)$ denotes the children of $x_j$. We now proceed, based on Rolland et al. (2022), to derive a condition which can be used to find leaf nodes.

Lemma 1.
Given a nonlinear ANM with a noise distribution $p_\epsilon$ and a leaf node $j$; assume that $\frac{\partial^2 \log p_\epsilon}{\partial x^2} = a$, where $a$ is a constant, then

$$
\mathrm{Var}_X\left[H_{j,j}(\log p(x))\right] = 0. \tag{2}
$$

See proof in Appendix A.1.

Remark. Lemma 1 enables finding leaf nodes based on the diagonal of the log-likelihood's Hessian. Rolland et al. (2022), using a similar conclusion, propose a topological ordering algorithm that iteratively finds and removes leaf nodes from the dataset. At each iteration, Rolland et al. (2022) re-compute the Hessian with a kernel-based estimation method. In this paper, we develop a more efficient algorithm for learning the Hessian in high dimensions and for a large number of samples. Note that Rolland et al. (2022) prove that Equation 2 can identify leaves in nonlinear ANMs with Gaussian noise. We derive a formulation which, instead, requires the second-order derivative of the log noise density to be constant. Indeed, the condition $\frac{\partial^2 \log p_\epsilon}{\partial x^2} = a$ is true for $p_\epsilon$ following a Gaussian distribution, which is consistent with Rolland et al. (2022), but could potentially be true for other distributions as well.
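As a worked illustration of Lemma 1 (an example constructed here, not taken from the paper), consider the two-variable ANM $x_1 = \epsilon_1$, $x_2 = \sin(x_1) + \epsilon_2$ with standard Gaussian noise, so that $a = -1$:

```latex
\begin{aligned}
\log p(x) &= -\tfrac{1}{2} x_1^2 - \tfrac{1}{2}\bigl(x_2 - \sin x_1\bigr)^2 + \text{const}, \\
\nabla_{x_1} \log p(x) &= -x_1 + \cos x_1 \,\bigl(x_2 - \sin x_1\bigr), \qquad
\nabla_{x_2} \log p(x) = -\bigl(x_2 - \sin x_1\bigr), \\
H_{2,2}(\log p(x)) &= -1, \\
H_{1,1}(\log p(x)) &= -1 - \cos^2 x_1 - \sin x_1 \,\bigl(x_2 - \sin x_1\bigr).
\end{aligned}
```

Here $H_{2,2}$ is constant, so its variance over the data is zero and $x_2$ is identified as the leaf, whereas $H_{1,1}$ changes with $x$ and therefore has non-zero variance.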

2.3. DIFFUSION MODELS APPROXIMATE THE SCORE

The process of learning to denoise (Vincent, 2011) can approximate that of matching the score (Hyvärinen, 2005). A diffusion process gradually adds noise to a data distribution over time. Diffusion probabilistic models (DPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) learn to reverse the diffusion process, starting with noise and recovering the data distribution. The diffusion process gradually adds Gaussian noise, with a time-dependent variance $\alpha_t$, to a sample $x_0 \sim p_{data}(x)$ from the data distribution. Thus, the noisy variable $x_t$, with $t \in [0, T]$, corresponds to a version of $x_0$ perturbed by Gaussian noise following $p(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I\right)$, where $\alpha_t := \prod_{j=0}^{t} (1 - \beta_j)$, $\beta_j$ is the variance scheduled between $[\beta_{min}, \beta_{max}]$ and $I$ is the identity matrix. DPMs (Ho et al., 2020) are learned with a weighted sum of denoising score matching objectives at different perturbation scales with

$$
\theta^* = \arg\min_\theta \mathbb{E}_{x_0, t, \epsilon}\left[\lambda(t)\, \left\|\epsilon_\theta(x_t, t) - \epsilon\right\|_2^2\right], \tag{3}
$$

where $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon$, with $x_0 \sim p(x)$ being a sample from the data distribution, $t \sim U(0, T)$ and $\epsilon \sim \mathcal{N}(0, I)$ being the noise. $\lambda(t)$ is a loss weighting term following Ho et al. (2020).

Remark. Throughout this paper, we leverage the fact that the trained model $\epsilon_\theta$ approximates the score $\nabla_{x_j} \log p(x)$ of the data (Song & Ermon, 2019).
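The forward perturbation and the per-sample denoising loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the linear schedule, the values of `beta_min` and `beta_max`, the choice $\lambda(t) = 1$, and all function names are illustrative and not taken from the paper, and no neural network is trained here.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product alpha_t = prod_{j <= t} (1 - beta_j)."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def perturb(x0, t, betas, rng):
    """Sample x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps; return (x_t, eps)."""
    a = alpha_bar(t, betas)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def denoising_loss(eps_pred, eps):
    """Squared error ||eps_pred - eps||^2 for one sample, assuming lambda(t) = 1."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))


# Linear schedule between beta_min and beta_max (illustrative values only).
T = 100
beta_min, beta_max = 1e-4, 2e-2
betas = [beta_min + (beta_max - beta_min) * j / (T - 1) for j in range(T)]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
xt, eps = perturb(x0, t=10, betas=betas, rng=rng)
```

A standard identity for Gaussian perturbations is $\nabla_{x_t} \log p_t(x_t) = -\mathbb{E}[\epsilon \mid x_t] / \sqrt{1 - \alpha_t}$, which is why a well-trained noise predictor can stand in for the score up to a scaling that is constant for a fixed $t$.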

3. THE DECIDUOUS SCORE

Discovering the complete topological ordering with the distribution's Hessian (Rolland et al., 2022) is done by finding the leaf node (Equation 2), appending the leaf node $x_l$ to the ordering list $\pi$ and removing the data column corresponding to $x_l$ from $X$ before the next iteration, repeating this $d - 1$ times. Rolland et al. (2022) estimate the score's Jacobian (Hessian) at each iteration. Instead, we explore an alternative approach that does not require estimation of a new score after each leaf removal. In particular, we describe how to adjust the score of a distribution after each leaf removal, terming this a "deciduous score". We obtain an analytical expression for the deciduous score and derive a way of computing it, based on the original score before leaf removal. In this section, we only assume that $p(x)$ is described by an ANM; we pose no additional assumptions on the noise distribution.

Definition 1. Consider a DAG $G$ which entails a distribution $p(x) = \prod_{i=1}^{d} p(x_i \mid Pa(x_i))$. Let $p(x_{-l}) = \frac{p(x)}{p(x_l \mid Pa(x_l))}$ be $p(x)$ without the random variable corresponding to the leaf node $x_l$. The deciduous score $\nabla \log p(x_{-l}) \in \mathbb{R}^{d-1}$ is the score of the distribution $p(x_{-l})$.

Lemma 2. Given an ANM which entails a distribution $p(x)$, we can use Equation 1 to find an analytical expression for an additive residue $\Delta_l$ between the distribution's score $\nabla \log p(x)$ and its deciduous score $\nabla \log p(x_{-l})$ such that

$$
\Delta_l = \nabla \log p(x) - \nabla \log p(x_{-l}). \tag{4}
$$

In particular, $\Delta_l$ is a vector $\{\delta_j \mid \forall j \in [1, \dots, d] \setminus l\}$ where the residue w.r.t. a node $x_j$ can be denoted as

$$
\delta_j = \nabla_{x_j} \log p(x) - \nabla_{x_j} \log p(x_{-l}) = -\frac{\partial f_l}{\partial x_j} \frac{\partial \log p_\epsilon(x_l - f_l)}{\partial x}. \tag{5}
$$

If $x_j \notin Pa(x_l)$, $\delta_j = 0$.

Proof. Observing Equation 1, the score $\nabla_{x_j} \log p(x)$ only depends on the following random variables: (i) $Pa(x_j)$, (ii) $Ch(x_j)$, and (iii) $Pa(Ch(x_j))$. We consider $x_l$ to be a leaf node, therefore $\nabla_{x_j} \log p(x)$ only depends on $x_l$ if $x_j \in Pa(x_l)$.
If $x_j \in Pa(x_l)$, the only term of $\nabla_{x_j} \log p(x)$ that depends on $x_l$ is one of the terms inside the summation.

Figure 2: Topological ordering with diffusion models by iteratively finding leaf nodes. At each iteration, one leaf node is found using Equation 9. In the subsequent iteration, the previous leaves are removed, reducing the search space. After topological ordering (as illustrated on the right side), the presence of edges (causal mechanisms) between variables can be inferred such that parents of each variable are selected from the preceding variables in the ordered list. Spurious edges can be pruned with feature selection as a post-processing step (Bühlmann et al., 2014; Lachapelle et al., 2020; Rolland et al., 2022).

However, we wish to estimate the deciduous score $\nabla \log p(x_{-l})$ without direct access to the function $f_l$, to its derivative, or to the distribution $p_\epsilon$. Therefore, we now derive an expression for $\Delta_l$ using solely the score and the Hessian of $\log p(x)$.

Theorem 1. Consider an ANM of distribution $p(x)$ with score $\nabla \log p(x)$ and the score's Jacobian $H(\log p(x))$. The additive residue $\Delta_l$ necessary for computing the deciduous score (as in Lemma 2) can be estimated with

$$
\Delta_l = H_l(\log p(x)) \cdot \frac{\nabla_{x_l} \log p(x)}{H_{l,l}(\log p(x))}. \tag{6}
$$

See proof in Appendix A.2.
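Theorem 1 can be checked by hand on a small example. The sketch below uses a two-variable Gaussian ANM chosen here for illustration ($x_1 \sim \mathcal{N}(0, 1)$, $x_2 = \sin(x_1) + \epsilon_2$ with unit-variance noise), for which the score and the Hessian row of the leaf are available in closed form. The function names are hypothetical; the point is only that subtracting the residue of Equation 6 from the score recovers the score of $p(x_1)$, which is $-x_1$.

```python
import math


def score(x1, x2):
    """Score of p(x1, x2) for x1 ~ N(0, 1), x2 = sin(x1) + N(0, 1)."""
    r = x2 - math.sin(x1)                 # residual of the leaf x2
    return (-x1 + math.cos(x1) * r,       # d log p / d x1
            -r)                           # d log p / d x2


def hessian_leaf_row(x1, x2):
    """Row of the Hessian belonging to the leaf x2: (H_{2,1}, H_{2,2})."""
    return (math.cos(x1), -1.0)


def residue(x1, x2):
    """delta_1 from Theorem 1: H_{l,1} * score_l / H_{l,l} with leaf l = 2."""
    h_l1, h_ll = hessian_leaf_row(x1, x2)
    return h_l1 * score(x1, x2)[1] / h_ll


def deciduous_score(x1, x2):
    """Score of p(x1), obtained by subtracting the residue (Equation 4)."""
    return score(x1, x2)[0] - residue(x1, x2)
```

For any input, `deciduous_score(x1, x2)` agrees with `-x1` up to floating-point error, independently of the value of the removed leaf `x2`.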

4. CAUSAL DISCOVERY WITH DIFFUSION MODELS

DPMs approximate the score of the data distribution (Song & Ermon, 2019) . In this section, we explore how to use DPMs to perform leaf discovery and compute the deciduous score, based on Theorem 1, for iteratively finding and removing leaf nodes without re-training the score.

4.1. APPROXIMATING THE SCORE'S JACOBIAN VIA DIFFUSION TRAINING

The score's Jacobian can be approximated by learning the score $\epsilon_\theta$ with denoising diffusion training of neural networks and back-propagating (Rumelhart et al., 1986) from the output to the input variables. It can be written, for an input data point $x \in \mathbb{R}^d$, as

$$
H_{i,j} \log p(x) \approx \nabla_{i,j}\, \epsilon_\theta(x, t), \tag{7}
$$

where $\nabla_{i,j}\, \epsilon_\theta(x, t)$ means the $i$th output of $\epsilon_\theta$ is backpropagated to the $j$th input. The diagonal of the Hessian in Equation 7 can then be used for finding leaf nodes as in Equation 2. In a two-variable setting, it is sufficient for causal discovery to (i) train a diffusion model (Equation 3); (ii) approximate the score's Jacobian via backpropagation (Equation 7); (iii) compute the variance of the diagonal across all data points; (iv) identify the variable with the lowest variance as the effect (Equation 2). We illustrate in Appendix C the Hessian of a two-variable SCM computed with a diffusion model.
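Steps (ii) to (iv) of the two-variable recipe can be sketched without a trained model. In the illustration below, two substitutions are made and should be read as assumptions: an analytic score of a toy Gaussian ANM stands in for the trained $\epsilon_\theta$, and central finite differences stand in for backpropagation. All function names are hypothetical.

```python
import math
import random
import statistics


def score(x):
    """Analytic score of x0 ~ N(0, 1), x1 = sin(x0) + N(0, 1); stand-in for eps_theta."""
    r = x[1] - math.sin(x[0])
    return [-x[0] + math.cos(x[0]) * r, -r]


def shifted(x, i, h):
    """Copy of x with coordinate i moved by h."""
    y = list(x)
    y[i] += h
    return y


def jacobian_diagonal(fn, x, h=1e-4):
    """Central differences of d fn_i / d x_i (stand-in for backpropagation)."""
    return [(fn(shifted(x, i, h))[i] - fn(shifted(x, i, -h))[i]) / (2.0 * h)
            for i in range(len(x))]


def find_leaf(fn, samples):
    """Index of the variable whose Hessian-diagonal entry has the lowest variance."""
    diagonals = [jacobian_diagonal(fn, x) for x in samples]
    variances = [statistics.pvariance(column) for column in zip(*diagonals)]
    leaf = min(range(len(variances)), key=variances.__getitem__)
    return leaf, variances


rng = random.Random(0)
samples = []
for _ in range(200):
    cause = rng.gauss(0.0, 1.0)
    samples.append([cause, math.sin(cause) + rng.gauss(0.0, 1.0)])

leaf, variances = find_leaf(score, samples)
```

In this toy setting the second variable is selected as the effect, because its diagonal entry is constant across samples while the entry of the cause varies.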

4.2. TOPOLOGICAL ORDERING

When a DAG contains more than two nodes, the process of finding leaf nodes (i.e. the topological order) needs to be done iteratively, as illustrated in Figure 2. The naive (greedy) approach would be to remove the leaf node from the dataset, recompute the score, and compute the variance of the new distribution's Hessian to identify the next leaf node (Rolland et al., 2022). Since we employ diffusion models to estimate the score, this equates to re-training the model each time a leaf is removed.

We hence propose a method to compute the deciduous score $\nabla \log p(x_{-l})$ using Theorem 1 to remove leaves from the initial score without re-training the neural network. In particular, assuming that a leaf $x_l$ is found, the residue $\Delta_l$ can be approximated with

$$
\Delta_l(x, t) \approx \nabla_l\, \epsilon_\theta(x, t) \cdot \frac{\epsilon_\theta(x, t)_l}{\nabla_{l,l}\, \epsilon_\theta(x, t)}, \tag{8}
$$

where $\epsilon_\theta(x, t)_l$ is the output corresponding to the leaf node. Note that the term $\nabla_l\, \epsilon_\theta(x, t)$ is a vector of size $d$ and the other term is a scalar. During topological ordering, we compute $\Delta_\pi$, which is the summation of $\Delta_l$ over all leaves already discovered and appended to $\pi$. Naturally, we only compute $\Delta_l$ w.r.t. nodes $x_j \notin \pi$ because nodes $x_j \in \pi$ have already been ordered and are no longer taken into account.

In practice, we observe that training $\epsilon_\theta$ on $X$ but using a subsample $B \in \mathbb{R}^{k \times d}$ of size $k$ randomly sampled from $X$ increases speed without compromising performance (see Section 4.3). In addition, analysing Equation 5, the absolute value of the residue $\delta_l$ decreases if the values of $x_l$ are set to zero once the leaf node is discovered. Therefore, we apply a mask $M_\pi \in \{0, 1\}^{k \times d}$ over leaves discovered in the previous iterations and compute only the Jacobian of the outputs corresponding to $x_{-l}$. $M_\pi$ is updated after each iteration based on the ordered nodes $\pi$. We then find a leaf node according to

$$
\text{leaf} = \arg\min_{x_i \in x} \mathrm{Var}_B\left[\nabla_x\left(\text{score}(M_\pi \odot B, t)\right)\right], \tag{9}
$$

where $\epsilon_\theta$ is a DPM trained with Equation 3. See Appendix E.3 for the choice of $t$.
This topological ordering procedure is formally described in Algorithm 1, where score($-\pi$) means that we only consider the outputs for nodes $x_j \notin \pi$.

Algorithm 1: Topological Ordering with DiffAN
Input: $X \in \mathbb{R}^{n \times d}$, trained diffusion model $\epsilon_\theta$, ordering batch size $k$
$\pi = []$, $\Delta_\pi = 0^{k \times d}$, $M_\pi = 1^{k \times d}$, score $= \epsilon_\theta$
while $\|\pi\| \neq d$ do
    $B \xleftarrow{k} X$ // Randomly sample a batch of k elements
    $B \leftarrow B \odot M_\pi$ // Mask removed leaves
    $\Delta_\pi = \text{Get}\Delta_\pi(\text{score}, B)$ //

Complexity on n. Our method separates learning the score $\epsilon_\theta$ from computing the variance of the Hessian's diagonal across data points, in contrast to Rolland et al. (2022). We use all $n$ samples in $X$ for learning the score function with diffusion training (Equation 3). It does not involve expensive constrained optimisation techniques and we train the model for a fixed number of epochs (which is linear in $n$) or until reaching the early stopping criterion. We use an MLP that grows in width with $d$, but this does not significantly affect complexity. Therefore, we consider training to be $O(n)$. Moreover, Algorithm 1 is computed over a batch $B$ with size $k < n$ instead of the entire dataset $X$, as described in Equation 9. Note that the number of samples $k$ in $B$ can be arbitrarily small and constant for different datasets. In Section 5.2, we verify that the accuracy of causal discovery initially improves as $k$ is increased but eventually tapers off.

Complexity on d. Once $\epsilon_\theta$ is trained, a topological ordering can be obtained by running $\nabla_x \epsilon_\theta(x, t)$ $d$ times. Moreover, computing the Jacobian of the score requires back-propagating the gradients $d - i$ times, where $i$ is the number of nodes already ordered in a given iteration. Finally, computing the deciduous score's residue (Equation 8) means computing the gradient of the $i$ nodes. This results in a complexity of $O(d \cdot (d - i) \cdot i)$ with $i$ varying from 0 to $d$, which can be described by $O(d^3)$. The final topological ordering complexity is therefore $O(n + d^3)$.

DiffAN Masking.
We verify empirically that the masking procedure described in Section 4.2 can significantly reduce the absolute value of the deciduous score's residue while maintaining causal discovery capabilities. In DiffAN Masking, we neither re-train $\epsilon_\theta$ nor compute the deciduous score. This ordering algorithm is an approximation, but it has been shown to work well in practice while showing remarkable scalability. DiffAN Masking has $O(n + d^2)$ ordering complexity.
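The iterative ordering with the deciduous score (Theorem 1) can be sketched on a three-node chain. This is a simplified illustration under several assumptions, not the paper's implementation: an analytic score of the toy chain $x_0 \to x_1 \to x_2$ replaces the trained $\epsilon_\theta$, central finite differences replace backpropagation, and neither masking nor sub-sampling of the batch is used. All names are hypothetical.

```python
import math
import random
import statistics


def chain_score(x):
    """Analytic score of x0 -> x1 -> x2 with unit Gaussian noise (stand-in for eps_theta)."""
    r1 = x[1] - math.sin(x[0])      # x1 = sin(x0) + noise
    r2 = x[2] - x[1] ** 2           # x2 = x1^2 + noise
    return {0: -x[0] + math.cos(x[0]) * r1,
            1: -r1 + 2.0 * x[1] * r2,
            2: -r2}


def jacobian_row(fn, x, out, cols, h=1e-3):
    """Central differences of d fn_out / d x_j for j in cols (stand-in for backprop)."""
    row = {}
    for j in cols:
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        row[j] = (fn(up)[out] - fn(down)[out]) / (2.0 * h)
    return row


def remove_leaf(fn, leaf, active):
    """Deciduous score: subtract the residue of Theorem 1 for the removed leaf."""
    rest = [j for j in active if j != leaf]

    def new_fn(x):
        s = fn(x)
        row = jacobian_row(fn, x, leaf, rest + [leaf])
        return {j: s[j] - row[j] * s[leaf] / row[leaf] for j in rest}

    return new_fn, rest


def order_nodes(fn, samples, d):
    """Return nodes from leaves to roots via the lowest-variance Hessian diagonal."""
    active, leaves_first = list(range(d)), []
    while len(active) > 1:
        var = {i: statistics.pvariance(
                   [jacobian_row(fn, x, i, [i])[i] for x in samples])
               for i in active}
        leaf = min(active, key=var.get)
        leaves_first.append(leaf)
        fn, active = remove_leaf(fn, leaf, active)   # no re-estimation of the score
    return leaves_first + active


rng = random.Random(0)
samples = []
for _ in range(100):
    v0 = rng.gauss(0.0, 1.0)
    v1 = math.sin(v0) + rng.gauss(0.0, 1.0)
    samples.append([v0, v1, v1 ** 2 + rng.gauss(0.0, 1.0)])

leaves_first = order_nodes(chain_score, samples, d=3)
```

On this toy chain the nodes are returned in the order `[2, 1, 0]`, i.e. from the leaf back to the root. Each call to `remove_leaf` wraps the previous score function rather than fitting a new one, which mirrors the idea of updating the score without re-training.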

5. EXPERIMENTS

In our experiments, we train a NN with a DPM objective to perform topological ordering and follow this with a pruning post-processing step (Bühlmann et al., 2014). The performance is evaluated on synthetic and real data and compared to state-of-the-art causal discovery methods from observational data, which are either ordering-based or gradient-based methods.

NN architecture. We use a 4-layer multilayer perceptron (MLP) with LeakyReLU and layer normalisation.

Metrics. We use the Structural Hamming Distance (SHD), Structural Intervention Distance (SID) (Peters & Bühlmann, 2015), Order Divergence (Rolland et al., 2022) and run time in seconds. See Appendix D.3 for details of each metric.

Baselines. We use CAM (Bühlmann et al., 2014), GranDAG (Lachapelle et al., 2020) and SCORE (Rolland et al., 2022). We apply the pruning procedure of Bühlmann et al. (2014) to all methods. See detailed results in Appendix D. Experiments with real data from the Sachs (Sachs et al., 2005) and SynTReN (Van den Bulcke et al., 2006) datasets are in Appendix E.1.
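The Order Divergence metric listed above (defined in Appendix D.3) counts the edges that can no longer be recovered once an ordering is fixed. A minimal sketch follows, assuming that `order` lists nodes from causes to effects and that `adjacency[i][j] == 1` encodes the edge $x_i \to x_j$; both conventions and the function name are assumptions made for this illustration.

```python
def order_divergence(order, adjacency):
    """Count edges i -> j whose parent i is placed after its child j in `order`.

    Assumptions: `order` lists nodes from causes to effects and
    adjacency[i][j] == 1 encodes the edge x_i -> x_j.
    """
    position = {node: rank for rank, node in enumerate(order)}
    d = len(adjacency)
    return sum(adjacency[i][j]
               for i in range(d) for j in range(d)
               if position[i] > position[j])


# Chain x0 -> x1 -> x2.
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
```

For this chain, the correct ordering `[0, 1, 2]` has divergence 0, swapping the first two nodes loses one edge, and the fully reversed ordering loses both.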

5.1. SYNTHETIC DATA

In this experiment, we consider causal relationships with $f_i$ being a function sampled from a Gaussian Process (GP) with a radial basis function kernel of bandwidth one. We generate data from additive noise models whose noise follows a Gaussian, Exponential or Laplace distribution with noise scales in the intervals {[0.4, 0.8], [0.8, 1.2], [1, 1]}, which are known to be identifiable (Peters et al., 2014). The causal graph is generated using the Erdős-Rényi (ER) (Erdős et al., 1960) and Scale Free (SF) (Bollobás et al., 2003) models. For a fixed number of nodes $d$, we vary the sparsity of the sampled graph by setting the average number of edges to be either $d$ or $5d$. We use the notation [d][graph type][sparsity] for indicating experiments over different synthetic datasets. We show that DiffAN performs on par with baselines while being extremely fast; see Figure 3. We also explore the role of overfitting in Appendix E.2, the difference between DiffAN with masking only and the greedy
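One common way to sample an Erdős-Rényi DAG with a target average number of edges is sketched below. This is an assumption about how such graphs can be generated, not a description of the paper's data pipeline: acyclicity is enforced by only allowing edges that follow a random node permutation, and the function name and adjacency convention (`adjacency[i][j] == 1` for $x_i \to x_j$) are illustrative.

```python
import random


def sample_er_dag(d, expected_edges, rng):
    """Sample an Erdos-Renyi DAG over d nodes with a given expected edge count.

    Assumption: adjacency[i][j] == 1 encodes x_i -> x_j. Edges are only drawn
    from earlier to later nodes of a random permutation, so no cycle can form.
    """
    max_edges = d * (d - 1) / 2
    p = min(1.0, expected_edges / max_edges)   # per-pair edge probability
    order = list(range(d))
    rng.shuffle(order)
    adjacency = [[0] * d for _ in range(d)]
    for a in range(d):
        for b in range(a + 1, d):
            if rng.random() < p:
                adjacency[order[a]][order[b]] = 1
    return adjacency, order


rng = random.Random(0)
d = 20
adjacency, order = sample_er_dag(d, expected_edges=d, rng=rng)  # on average d edges
```

Note that when the requested number of edges exceeds $d(d-1)/2$, the edge probability is capped at 1 and the result is the complete DAG for the sampled permutation.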

5.2. SCALING UP WITH DIFFAN MASKING

We now verify how DiffAN scales to bigger datasets, in terms of the number of samples $n$. Here, we use DiffAN Masking because computing the residue with DiffAN would be too expensive for very big $d$. We evaluate only the topological ordering, ignoring the final pruning step. Therefore, the performance will be measured solely with the Order Divergence metric.

Scaling to large datasets. We evaluate how DiffAN compares to SCORE (Rolland et al., 2022), the previous state-of-the-art, in terms of run time (in seconds) and performance (order divergence) over datasets with $d = 500$ and different sample sizes $n \in \{10^2, \dots, 10^5\}$. The error bars show results across 6 datasets (different samples of ER and SF graphs). As illustrated in Figure 1, DiffAN is the more tractable option as the size of the dataset increases. SCORE relies on inverting a very large $n \times n$ matrix, which is expensive in memory and computation for large $n$. Running SCORE for $d = 500$ and $n > 2000$ is intractable on a machine with 64GB of RAM. Figure 4 (left) shows that DiffAN can learn from bigger datasets and therefore achieves better results as the sample size increases.

Ordering batch size. An important aspect of our method, discussed in Section 4.3, that allows scalability in terms of $n$ is the separation between learning the score function $\epsilon_\theta$ and computing the Hessian variance across a batch of size $k$, with $k < n$. Therefore, we show empirically, as illustrated in Figure 4 (right), that decreasing $k$ does not strongly impact performance for datasets with $d \in \{10, 20, 50\}$.

6. RELATED WORKS

Ordering-based Causal Discovery. The observation that a causal DAG can be partially represented with a topological ordering goes back to Verma & Pearl (1990). Searching the topological ordering space instead of searching over the space of DAGs has been done with greedy Markov Chain Monte Carlo (MCMC) (Friedman & Koller, 2003), greedy hill-climbing search (Teyssier & Koller, 2005), arc search (Park & Klabjan, 2017), restricted maximum likelihood estimators (Bühlmann et al., 2014), sparsest permutation (Raskutti & Uhler, 2018; Lam et al., 2022; Solus et al., 2021), and reinforcement learning (Wang et al., 2021). In linear additive models, Ghoshal & Honorio (2018) and Chen et al. (2019) propose an approach, under some assumptions on the noise variances, to discover the causal graph by sequentially identifying leaves based on an estimation of the precision matrix.

Hessian of the Log-likelihood. Estimating $H(\log p(x))$ is the most expensive task of the ordering algorithm. Our baseline (Rolland et al., 2022) proposes an extension of Li & Turner (2018) which utilises Stein's identity over an RBF kernel (Schölkopf & Smola, 2002). The method of Rolland et al. (2022) cannot obtain gradient estimates at positions outside the training samples. Therefore, evaluating the Hessian over a subsample of the training dataset is not possible. Other kernel-based approaches that rely on spectral decomposition (Shi et al., 2018) solve this issue and can be promising future directions. Most importantly, computing the kernel matrix is expensive in memory and computation as $n$ grows. There are, however, methods (Achlioptas et al., 2001; Halko et al., 2011; Si et al., 2017) that help scale kernel techniques, which were not considered in the present work. Other approaches are also possible: deep likelihood methods such as normalizing flows (Durkan et al., 2019; Dinh et al., 2017) can be trained and the Hessian then computed via backpropagation.
This would require two backpropagation passes, giving $O(d^2)$ complexity, and be less scalable than denoising diffusion. Indeed, preliminary experiments proved impractical in our high-dimensional settings. We use DPMs because they can efficiently approximate the Hessian with a single backpropagation pass while allowing Hessian evaluation on a subsample of the training dataset. It has been shown (Song & Ermon, 2019) that denoising diffusion can better capture the score than simple denoising (Vincent, 2011) because noise at multiple scales explores regions of low data density.

7. CONCLUSION

We have presented a scalable method using DPMs for causal discovery. Since DPMs approximate the score of the data distribution, they can be used to efficiently compute the log-likelihood's Hessian by backpropagating each element of the output with respect to each element of the input. The deciduous score allows adjusting the score to remove the contribution of the leaf most recently removed, avoiding re-training the NN. Our empirical results show that neural networks can be efficiently used for topological ordering in high-dimensional graphs (up to 500 nodes) and with large datasets (up to $10^5$ samples). Our deciduous score can be used with other Hessian estimation techniques as long as obtaining the score and its full Jacobian is possible from a trained model, e.g. sliced score matching (Song et al., 2020) and approximate backpropagation (Kingma & Cun, 2010). Updating the score is more practical than re-training in most settings with neural networks. Therefore, our theoretical result enables the community to efficiently apply new score estimation methods to topological ordering. Moreover, DPMs have previously been used as generative models in the context of causal estimation (Sanchez & Tsaftaris, 2022). In this work, we have not explored the generative aspect, as Geffner et al. (2022) do with normalising flows. Finally, another promising direction involves constraining the NN architecture as in Lachapelle et al. (2020) with constrained optimisation losses (Zheng et al., 2018).

A PROOFS

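We re-write Equation 1 here for improved readability:

$$
\nabla_{x_j} \log p(x) = \frac{\partial \log p_\epsilon(x_j - f_j)}{\partial x_j} - \sum_{i \in Ch(x_j)} \frac{\partial f_i}{\partial x_j} \frac{\partial \log p_\epsilon(x_i - f_i)}{\partial x}. \tag{10}
$$

A.1 PROOF LEMMA 1

Proof. We start by showing the "⇐" direction by differentiating Equation 1 w.r.t. $x_j$. If $x_j$ is a leaf (denoted $x_l$), only the first term of the equation is present; taking its derivative results in

$$
H_{l,l}(\log p(x)) = \frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2} \cdot \frac{d(x_l - f_l)}{dx_l} = \frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2}. \tag{11}
$$

Therefore, if $x_l$ is a leaf and $\frac{\partial^2 \log p_\epsilon}{\partial x^2} = a$, then $\mathrm{Var}_X\left[H_{l,l}(\log p(x))\right] = 0$.

The remainder of the proof follows Rolland et al. (2022) (which was done for Gaussian noise only): we prove by contradiction that "⇒" is also true. In particular, if we consider that $x_j$ is not a leaf and $H_{j,j} \log p(x) = c$, with $c$ being a constant, we can write

$$
\nabla_{x_j} \log p(x) = c\, x_j + g(x_{-j}). \tag{12}
$$

Replacing Equation 12 into Equation 1, we have

$$
c\, x_j + g(x_{-j}) = \frac{\partial \log p_\epsilon(x_j - f_j)}{\partial x} - \sum_{x_i \in Ch(x_j)} \frac{\partial f_i}{\partial x_j} \frac{\partial \log p_\epsilon(x_i - f_i)}{\partial x}. \tag{13}
$$

Let $x_c \in Ch(x_j)$ such that $x_c \notin Pa(Ch(x_j))$. Such an $x_c$ always exists since $x_j$ is not a leaf, and it suffices to pick a child of $x_j$ appearing at the last position in some topological order. If we isolate the terms depending on $x_c$ on the RHS of Equation 13, we have

$$
c\, x_j - \frac{\partial \log p_\epsilon(x_j - f_j)}{\partial x} + \sum_{x_i \in Ch(x_j),\, x_i \neq x_c} \frac{\partial f_i}{\partial x_j} \frac{\partial \log p_\epsilon(x_i - f_i)}{\partial x} = -\frac{\partial f_c}{\partial x_j} \frac{\partial \log p_\epsilon(x_c - f_c)}{\partial x} - g(x_{-j}). \tag{14}
$$

Differentiating both sides w.r.t. $x_c$, since the LHS of Equation 14 does not depend on $x_c$, we can write

$$
\frac{\partial}{\partial x_c}\left(\frac{\partial f_c}{\partial x_j} \frac{\partial \log p_\epsilon(x_c - f_c)}{\partial x} + g(x_{-j})\right) = 0 \;\Rightarrow\; \frac{\partial f_c}{\partial x_j} \cdot \underbrace{\frac{\partial^2 \log p_\epsilon(x_c - f_c)}{\partial x^2}}_{a} = -\frac{\partial g(x_{-j})}{\partial x_c}. \tag{15}
$$

Since $g$ does not depend on $x_j$, $\frac{\partial f_c}{\partial x_j}$ does not depend on $x_j$ either, implying that $f_c$ is linear in $x_j$, contradicting the non-linearity assumption.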

A.2 PROOF THEOREM 1

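Proof. Using Equation 1, we will derive expressions for each of the elements in Equation 6 and show that it is equivalent to $\delta_j$ in Equation 5. First, note that the score of a leaf node $x_l$ can be denoted as

$$
\nabla_{x_l} \log p(x) = \frac{\partial \log p_\epsilon(x_l - f_l)}{\partial x_l} = \frac{\partial \log p_\epsilon(x_l - f_l)}{\partial x} \cdot \frac{d(x_l - f_l)}{dx_l} = \frac{\partial \log p_\epsilon(x_l - f_l)}{\partial x}. \tag{16}
$$

Second, replacing Equation 16 into each element of $H_l(\log p(x)) \in \mathbb{R}^d$, we can write

$$
H_{l,j}(\log p(x)) = \frac{\partial}{\partial x_j}\left[\nabla_{x_l} \log p(x)\right] = \frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2} \cdot \frac{d(x_l - f_l)}{dx_j} = -\frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2} \cdot \frac{df_l}{dx_j}, \tag{17}
$$

where the last equality holds for $j \neq l$. If $j = l$ in Equation 17, we have

$$
H_{l,l}(\log p(x)) = \frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2} \cdot \frac{d(x_l - f_l)}{dx_l} = \frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2}. \tag{18}
$$

Finally, replacing Equations 16, 17 and 18 into Equation 6 for a single node $x_j$, if $j \neq l$, we have

$$
\delta_j = \frac{H_{l,j}(\log p(x)) \cdot \nabla_{x_l} \log p(x)}{H_{l,l}(\log p(x))} = \frac{-\frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2} \cdot \frac{df_l}{dx_j} \cdot \frac{\partial \log p_\epsilon(x_l - f_l)}{\partial x}}{\frac{\partial^2 \log p_\epsilon(x_l - f_l)}{\partial x^2}} = -\frac{df_l}{dx_j} \cdot \frac{\partial \log p_\epsilon(x_l - f_l)}{\partial x}. \tag{19}
$$

The last expression in Equation 19 is the same as Equation 5 from Lemma 2, proving that $\delta_j$ can be written using the first- and second-order derivatives of the log-likelihood.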

B SCORE OF NONLINEAR ANM WITH GAUSSIAN NOISE

An SCM entails a distribution

$$p(x) = \prod_{i=1}^{d} p(x_i \mid \mathrm{Pa}(x_i)) \tag{20}$$

over the variables $x$ (Peters et al., 2017). By assuming that the noise variables $\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)$ and inserting the ANM function, Equation 20 can be written as

$$\log p(x) = -\frac{1}{2} \sum_{i=1}^{d} \left(\frac{x_i - f_i(\mathrm{Pa}(x_i))}{\sigma_i}\right)^2 - \frac{1}{2} \sum_{i=1}^{d} \log(2\pi\sigma_i^2). \tag{21}$$

The score of $p(x)$ can hence be written as

$$\nabla_{x_j} \log p(x) = -\frac{x_j - f_j(\mathrm{Pa}(x_j))}{\sigma_j^2} + \sum_{i \in \mathrm{Ch}(x_j)} \frac{\partial f_i}{\partial x_j}(\mathrm{Pa}(x_i)) \, \frac{x_i - f_i(\mathrm{Pa}(x_i))}{\sigma_i^2}. \tag{22}$$
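As a sanity check on the Gaussian score expression above, the following sketch compares it with a finite-difference gradient of the Gaussian log-likelihood. The three-variable graph (x0 → x1, x0 → x2), the mechanisms `f1` and `f2`, and the noise scales in `SIGMA` are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

# Hypothetical ANM: x0 is a root with two children, x1 and x2 (both leaves).
SIGMA = np.array([1.0, 0.5, 2.0])  # assumed noise standard deviations


def f1(x0):
    return np.sin(x0)


def f2(x0):
    return x0 ** 2


def log_p(x):
    """Gaussian ANM log-likelihood for one sample x of shape (3,)."""
    resid = np.array([x[0], x[1] - f1(x[0]), x[2] - f2(x[0])])
    return -0.5 * np.sum((resid / SIGMA) ** 2) - 0.5 * np.sum(np.log(2 * np.pi * SIGMA ** 2))


def score(x):
    """Analytic score: own-noise term plus one term per child."""
    r1 = (x[1] - f1(x[0])) / SIGMA[1] ** 2
    r2 = (x[2] - f2(x[0])) / SIGMA[2] ** 2
    # df1/dx0 = cos(x0), df2/dx0 = 2 * x0
    s0 = -x[0] / SIGMA[0] ** 2 + np.cos(x[0]) * r1 + 2 * x[0] * r2
    return np.array([s0, -r1, -r2])


def numeric_score(x, h=1e-5):
    """Central finite-difference gradient of log_p."""
    grad = np.zeros(3)
    for j in range(3):
        e = np.zeros(3)
        e[j] = h
        grad[j] = (log_p(x + e) - log_p(x - e)) / (2 * h)
    return grad


x = np.array([0.3, -1.2, 0.8])
max_abs_error = np.max(np.abs(score(x) - numeric_score(x)))
```

The leaf entries of the score contain only the node's own noise term, while the root entry also collects one term per child, which is the structure exploited for leaf detection.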

D.3 METRICS

For each method, we compute the following metrics.

SHD. The Structural Hamming Distance between the output and the true causal graph counts the number of missing, falsely detected, or reversed edges.

SID. The Structural Intervention Distance is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements (Peters & Bühlmann, 2015).

Order Divergence. Rolland et al. (2022) propose this quantity for measuring how well the topological order is estimated. For an ordering $\pi$ and a target adjacency matrix $A$, we define the topological order divergence $D_{top}(\pi, A)$ as

$$D_{top}(\pi, A) = \sum_{i=1}^{d} \sum_{j : \pi_i > \pi_j} A_{ij}. \tag{23}$$

If $\pi$ is a correct topological order for $A$, then $D_{top}(\pi, A) = 0$. Otherwise, $D_{top}(\pi, A)$ counts the number of edges that cannot be recovered due to the choice of topological order. Therefore, it provides a lower bound on the SHD of the final algorithm (irrespective of the pruning method).

E OTHER RESULTS

E.1 REAL DATA

We consider two real datasets: (i) Sachs: a protein signaling network based on expression levels of proteins and phospholipids (Sachs et al., 2005). We consider only the observational data (n = 853 samples) since our method targets discovery of causal mechanisms when only observational data is available. The ground truth causal graph given by Sachs et al. (2005) has 11 nodes and 17 edges. (ii) SynTReN: we also evaluate the models on a pseudo-real dataset sampled from the SynTReN generator (Van den Bulcke et al., 2006). Results, in Table 2, show that our method is competitive against other state-of-the-art causal discovery baselines on real datasets.

The data used for topological ordering (inference) is a subset of the training data, so it is not obvious whether overfitting would be an issue with our algorithm. We therefore run an experiment in which one set of runs uses a fixed number of 2000 epochs, which we consider high, and another set uses early stopping. On average across all 20-node datasets, the early stopping strategy yields an order divergence of 9.5, whilst the overfitted models yield 11.1, showing that the method does not benefit from overfitting.
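The order divergence used in this comparison (defined in Section D.3) can be computed in a few lines. The sketch below is a minimal illustration under two assumed conventions that the text does not fix: `order` lists nodes from roots to leaves, and `adjacency[i, j] = 1` encodes an edge i → j. The function name `order_divergence` is hypothetical.

```python
import numpy as np


def order_divergence(order, adjacency):
    """Count edges i -> j whose source i is placed after its target j.

    order: list of node indices, roots first and leaves last (assumed convention).
    adjacency: (d, d) array with adjacency[i, j] = 1 when i -> j (assumed convention).
    """
    adjacency = np.asarray(adjacency)
    position = np.empty(len(order), dtype=int)
    position[np.asarray(order)] = np.arange(len(order))  # position[i] plays the role of pi_i
    later = position[:, None] > position[None, :]  # later[i, j] is True when pi_i > pi_j
    return int(np.sum(adjacency * later))


# Chain graph 0 -> 1 -> 2.
chain = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
correct = order_divergence([0, 1, 2], chain)    # valid topological order
reversed_ = order_divergence([2, 1, 0], chain)  # both edges violated
one_swap = order_divergence([1, 0, 2], chain)   # only edge 0 -> 1 violated
```

A divergence of zero means that every true edge is still recoverable by a pruning step applied to the fully connected graph induced by the order.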

E.3 OPTIMAL t FOR SCORE ESTIMATION

As noted by Vincent (2011), the best approximation of the score by a learned denoising function is obtained when the training noise level is low, i.e. when the signal-to-noise ratio (SNR) is high. In diffusion model training, t = 0 corresponds to the coefficient with the highest SNR. However, we found empirically that the value of t giving the best score estimate varies somewhat unpredictably. Therefore, we run the leaf finding function (Equation 9) N times for values of t evenly spaced in the [0, T] interval and choose the best leaf based on majority vote. We show in Figure 6 that majority voting is a better approach than choosing a constant value for t.
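The voting procedure can be sketched as follows. This is a minimal illustration, not the released implementation: `hessian_diag_at` is a hypothetical callable standing in for the diffusion-model estimate of the Hessian diagonal at step t, and the leaf criterion (smallest variance of the diagonal entry across samples) is an assumption based on Lemma 1.

```python
from collections import Counter

import numpy as np


def leaf_by_majority_vote(hessian_diag_at, T, n_votes):
    """Pick a leaf by voting over evenly spaced diffusion steps.

    hessian_diag_at: hypothetical callable t -> array of shape (n_samples, d)
        holding the estimated Hessian diagonal at diffusion step t.
    """
    steps = np.linspace(0, T, n_votes).round().astype(int)
    votes = []
    for t in steps:
        diag = hessian_diag_at(int(t))
        # Assumed leaf criterion: the node whose diagonal entry varies least.
        votes.append(int(np.argmin(diag.var(axis=0))))
    leaf, _ = Counter(votes).most_common(1)[0]
    return leaf, votes


rng = np.random.default_rng(0)


def toy_hessian_diag(t):
    """Stand-in estimator: node 2 looks like the leaf except at t == 50."""
    scales = np.array([1.0, 0.5, 0.1]) if t != 50 else np.array([0.1, 0.5, 1.0])
    return rng.normal(size=(256, 3)) * scales


leaf, votes = leaf_by_majority_vote(toy_hessian_diag, T=100, n_votes=5)
```

In this toy setting a single unreliable step does not change the selected leaf, which is the motivation for voting rather than fixing one value of t.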

E.4 ABLATIONS OF DIFFAN MASKING AND GREEDY

We now verify how DiffAN masking and DiffAN greedy compare against the original version detailed in the main text, which computes the deciduous score. Here, we use the same datasets described in Section 5.1, which comprise 4 synthetic dataset types (20ER1, 20ER5, 20SF1, 20SF5) with 27 variations over seeds, noise type and noise scale.

DiffAN Greedy. A greedy version of the algorithm re-trains $\epsilon_\theta$ after each leaf removal iteration. In this case, the deciduous score is not computed, decreasing the complexity w.r.t. d but increasing it w.r.t. n. DiffAN greedy has $O(nd^2)$ ordering complexity.

Figure 6: The distribution of order divergence measured for different values of t is highly variable. Therefore, we show that we obtain a better approximation with majority voting.

We observe, in Figure 7, that the greedy version performs the best but it is the slowest, as seen in

E.5 DETAILED RESULTS

We present the numerical results for the violin plots in Section 5.1 in Tables 3 and 4. The results are presented as mean and standard deviation, with statistics acquired over experiments with 3 seeds.



We refer to nodes without children in a DAG G as leaves.

An analogy to deciduous trees which seasonally shed leaves during autumn.

The Jacobian of a neural network can be efficiently computed with auto-differentiation libraries such as functorch (Horace He, 2021).

The diffusion model itself is an approximation of the score, therefore its gradients are approximations of the score derivatives.

Such as the Augmented Lagrangian method (Zheng et al., 2018; Lachapelle et al., 2020).



// Sum of Equation 8 over π
score = score(−π) + Δ_π // Update score with residue
leaf = GetLeaf(score, B) // Equation 9
π = [leaf, π] // Append leaf to ordered list
M_{:,leaf} = 0 // Set discovered leaf to zero
end
Output: Topological order π

4.3 COMPUTATIONAL COMPLEXITY AND PRACTICAL CONSIDERATIONS

We now study the complexity of topological ordering with DiffAN w.r.t. the number of samples n and the number of variables d in a dataset. In addition, we discuss the complexities of a greedy version as well as of an approximation which only utilises masking.

Figure 3: SHD (left) and run time in seconds (right) for experiments on synthetic graphs with 20 nodes. The variation in the violin plots comes from 3 different seeds over datasets generated with 3 different noise types and 3 different noise scales. Therefore, we run a total of 27 experiments for each method and synthetic dataset type.

Figure 4: Accuracy of DiffAN as the dataset size is scaled up for datasets with 500 variables and increasing numbers of data samples n (left) and as the batch size for computing the Hessian variance is changed (right). We show 95% confidence intervals over 6 datasets which have different graph structures sampled from different graph types (ER/SF).

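C VISUALISATION OF THE SCORE'S JACOBIAN FOR TWO VARIABLES

We consider a two-variable problem where the causal mechanisms are $B = f_\omega(A) + \epsilon_B$ and $A = \epsilon_A$, with $\epsilon_A, \epsilon_B \sim \mathcal{N}(0, 1)$ and $f_\omega$ being a two-layer MLP with randomly initialised weights. Note that, in Figure 5, the variance of $\frac{\partial^2 \log p(A,B)}{\partial B^2}$, while not 0 as predicted by Equation 2, is smaller than that of $\frac{\partial^2 \log p(A,B)}{\partial A^2}$, allowing discovery of the true causal direction.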
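The two-variable setting can also be reproduced with the exact log-density in place of a trained diffusion model, which shows the idealised behaviour that the learned estimate approximates. In the sketch below the hidden width (16), the tanh activation, the weight scaling and the sample size are all assumptions made for illustration; the Hessian diagonal is obtained by finite differences, not by differentiating a network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised two-layer MLP f_w: R -> R with tanh hidden units (assumed sizes).
W1, b1 = rng.normal(size=16), rng.normal(size=16)
W2 = rng.normal(size=16) / 4.0


def f(a):
    return np.tanh(np.outer(a, W1) + b1) @ W2


def log_p(a, b):
    """Exact log-density of A = eps_A, B = f(A) + eps_B with unit Gaussian noise."""
    return -0.5 * a ** 2 - 0.5 * (b - f(a)) ** 2 - np.log(2 * np.pi)


def hessian_diag(a, b, h=1e-3):
    """Central finite differences for d^2 log p / dA^2 and d^2 log p / dB^2."""
    base = log_p(a, b)
    d2a = (log_p(a + h, b) - 2 * base + log_p(a - h, b)) / h ** 2
    d2b = (log_p(a, b + h) - 2 * base + log_p(a, b - h)) / h ** 2
    return d2a, d2b


a = rng.normal(size=2000)
b = f(a) + rng.normal(size=2000)
d2a, d2b = hessian_diag(a, b)
var_a, var_b = d2a.var(), d2b.var()
leaf = "B" if var_b < var_a else "A"  # node with the smaller variance is taken as the leaf
```

With the exact density the second derivative w.r.t. B is constant up to numerical error, so its variance is essentially zero, whereas the second derivative w.r.t. A varies with the sample because of the nonlinearity of f.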

Figure 5: Visualisation of the diagonal of the score's Jacobian estimated with a diffusion model as in Equation 7 for a two-variable SCM where A → B.



Figure 7: SID metric for different versions of DiffAN.

Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric Xing. Learning Sparse Nonparametric DAGs. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 3414-3425. PMLR, 9 2020.

MLP architecture. The hyperparameters of each Linear layer depend on d such that big = max(1024, 5 * d) and small = max(128, 3 * d).

We now describe the hyperparameters for the diffusion training. We use T = 100 time steps, and β_t is linearly scheduled between β_min = 0.0001 and β_max = 0.02. The model is trained according to Equation 3, which follows Ho et al. (2020). During sampling, t is sampled from a [...] Linear layers, LeakyReLU activation function, Layer Normalization and Dropout in the first layer. The full architecture is detailed in Table 1.
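The width rule and the noise schedule stated above translate directly into code. The following is a minimal sketch; the function names `layer_widths` and `linear_beta_schedule` are hypothetical, and the cumulative product `alpha_bar` is the standard quantity from Ho et al. (2020), included here as an assumption about how the schedule is consumed.

```python
import numpy as np


def layer_widths(d):
    """Hidden widths from the rule big = max(1024, 5 * d), small = max(128, 3 * d)."""
    return max(1024, 5 * d), max(128, 3 * d)


def linear_beta_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    """Linearly spaced beta_t and the cumulative product of (1 - beta_t)."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar


big_20, small_20 = layer_widths(20)     # small graphs fall back to the minimum widths
big_500, small_500 = layer_widths(500)  # large graphs scale the widths with d
betas, alpha_bar = linear_beta_schedule()
```

For d = 20 the minimum widths apply, while for d = 500 both widths grow linearly with the number of variables.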

SHD and SID results over real datasets.

Scale Free (SF) graphs.

Noise     Method          mean ± std        mean ± std        mean ± std
          DiffAN          46.83 ± 11.48     243.50 ± 34.68     30.03 ± 2.17
          DiffAN Greedy   43.50 ± 6.89      236.83 ± 23.17    173.13 ± 3.52
          GranDAG         60.67 ± 8.80      275.00 ± 22.56    284.34 ± 5.33
          SCORE           38.50 ± 9.14      180.33 ± 57.44     19.79 ± 2.92
gauss     CAM             46.83 ± 8.11      199.17 ± 53.13    130.41 ± 20.55
          DiffAN          50.83 ± 4.62      259.50 ± 47.54     28.58 ± 1.16
          DiffAN Greedy   45.17 ± 6.15      224.17 ± 41.80    174.90 ± 3.69
          GranDAG         61.67 ± 4.03      241.67 ± 43.11    292.77 ± 7.29
          SCORE           44.50 ± 5.24      217.83 ± 50.05     19.77 ± 2.51
laplace   CAM             49.50 ± 9.01      191.33 ± 27.43    160.53 ± 23.16
          DiffAN          52.00 ± 5.66      230.67 ± 50.66     29.05 ± 1.60
          DiffAN Greedy   47.00 ± 9.94      191.33 ± 26.16    173.42 ± 3.52
          GranDAG         65.00 ± 7.13      262.50 ± 44.23    295.40 ± 7.72
          SCORE           46.67 ± 10.03     193.83 ± 34.14     43.84 ± 27.14

8. ACKNOWLEDGEMENT

This work was supported by the University of Edinburgh, the Royal Academy of Engineering and Canon Medical Research Europe via P. Sanchez's PhD studentship. S.A. Tsaftaris acknowledges the support of Canon Medical and the Royal Academy of Engineering and the Research Chairs and Senior Research Fellowships scheme (grant RCSRF1819\825).

