DIFFUSION MODELS FOR CAUSAL DISCOVERY VIA TOPOLOGICAL ORDERING

Abstract

Discovering causal relations from observational data becomes possible with additional assumptions such as considering the functional relations to be constrained as nonlinear with additive noise (ANM). Even with strong assumptions, causal discovery involves an expensive search problem over the space of directed acyclic graphs (DAGs). Topological ordering approaches reduce the optimisation space of causal discovery by searching over a permutation rather than graph space. For ANMs, the Hessian of the data log-likelihood can be used for finding leaf nodes in a causal graph, allowing its topological ordering. However, existing computational methods for obtaining the Hessian do not scale as the number of variables and the number of samples increase. Therefore, inspired by recent innovations in diffusion probabilistic models (DPMs), we propose DiffAN, a topological ordering algorithm that leverages DPMs for learning a Hessian function. We introduce theory for updating the learned Hessian without re-training the neural network, and we show that computing with a subset of samples gives an accurate approximation of the ordering, which allows scaling to datasets with more samples and variables. We show empirically that our method scales exceptionally well to datasets with up to 500 nodes and up to 10^5 samples, while still performing on par with state-of-the-art causal discovery methods on small datasets.

1. INTRODUCTION

Figure 1: Run time in seconds for different sample sizes, for discovery of causal graphs with 500 nodes. Most causal discovery methods have prohibitive run time and memory costs for datasets with many samples; the previous state-of-the-art SCORE algorithm (Rolland et al., 2022), which is included in this graph, cannot be computed beyond 2000 samples on a machine with 64GB of RAM. By contrast, our method DiffAN has a reasonable run time even for sample sizes two orders of magnitude larger than most existing methods can handle.

Understanding the causal structure of a problem is important for areas such as economics, biology (Sachs et al., 2005) and healthcare (Sanchez et al., 2022), especially when reasoning about the effect of interventions. When interventional data from randomised trials are not available, causal discovery methods (Glymour et al., 2019) may be employed to discover the causal structure of a problem solely from observational data. Causal structure is typically modelled as a directed acyclic graph (DAG) G in which each node is associated with a random variable and each edge represents a causal mechanism, i.e. how one variable influences another. However, learning such a model from data is NP-hard (Chickering, 1996). Traditional methods search the DAG space by testing for conditional independence between variables (Spirtes et al., 1993) or by optimising a goodness-of-fit measure (Chickering, 2002). Unfortunately, solving the search problem with a greedy combinatorial optimisation method can be expensive and does not scale to high-dimensional problems. In line with previous work (Teyssier & Koller, 2005; Park & Klabjan, 2017; Bühlmann et al., 2014; Solus et al., 2021; Wang et al., 2021; Rolland et al., 2022), we speed up the combinatorial search over the space of DAGs by rephrasing it as a topological ordering task, ordering from leaf nodes to root nodes.
The search space over DAGs with d nodes and (d² − d)/2 possible edges is much larger than the space of permutations over d variables. Once a topological ordering of the nodes is found, the potential causal relations between later (cause) and earlier (effect) nodes can be pruned with a feature selection algorithm (e.g. Bühlmann et al. (2014)) to yield a graph which is naturally directed and acyclic without further optimisation. Recently, Rolland et al. (2022) proposed the SCORE algorithm for topological ordering. SCORE uses the Hessian of the data log-likelihood, ∇²_x log p(x). By verifying which elements of the diagonal of ∇²_x log p(x) are constant across all data points, leaf nodes can be iteratively identified and removed. Rolland et al. (2022) estimate the Hessian point-wise with a second-order Stein gradient estimator (Li & Turner, 2018) over a radial basis function (RBF) kernel. However, point-wise estimation with kernels scales poorly to datasets with a large number of samples n because it requires inverting an n × n kernel matrix. Here, we enable scalable causal discovery by utilising neural networks (NNs) trained with denoising diffusion instead of Rolland et al.'s kernel-based estimation. We follow the ordering procedure of Rolland et al. (2022), which requires re-computing the score's Jacobian at each iteration. Re-training NNs at each iteration would not be feasible. Therefore, we derive a theoretical analysis that allows updating the learned score without re-training. In addition, the NN is trained over the entire dataset (n samples), but only a subsample is used for finding leaf nodes. Thus, once the score model is learned, we can use it to order the graph with complexity constant in n, enabling causal discovery for large datasets in high-dimensional settings.
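To make the leaf-finding criterion concrete, the sketch below applies it to a hypothetical two-variable ANM whose joint log-density is known analytically, approximating the Hessian diagonal by finite differences rather than by a Stein estimator or a learned score. The mechanism x2 = x1² + ε and all names are illustrative assumptions, not part of SCORE or DiffAN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy ANM: x1 ~ N(0,1), x2 = x1^2 + eps, eps ~ N(0,1).
n = 200
x1 = rng.normal(size=n)
x2 = x1 ** 2 + rng.normal(size=n)
X = np.stack([x1, x2], axis=1)

def log_p(x):
    # Analytic joint log-density (up to an additive constant) of the toy ANM.
    return -0.5 * x[0] ** 2 - 0.5 * (x[1] - x[0] ** 2) ** 2

def hessian_diag(f, x, h=1e-4):
    # Central finite differences for the diagonal of the Hessian of f at x.
    d = len(x)
    out = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        out[i] = (f(x + e) - 2 * f(x) + f(x - e)) / h ** 2
    return out

# Hessian diagonal evaluated at every sample, shape (n, d).
H = np.array([hessian_diag(log_p, x) for x in X])

# SCORE's criterion: a leaf node has a Hessian diagonal entry that is
# constant across samples, i.e. (near-)zero variance.
variances = H.var(axis=0)
leaf = int(np.argmin(variances))
print(variances, leaf)  # the leaf should be node 1 (x2)
```

Here ∂²/∂x2² log p(x) = −1 for every sample, while ∂²/∂x1² log p(x) = −1 + 2x2 − 6x1² varies with the data, so the variance test singles out x2 as the leaf.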
Interestingly, our algorithm does not require architectural constraints on the neural network, unlike previous causal discovery methods based on neural networks (Lachapelle et al., 2020; Zheng et al., 2020; Yu et al., 2019; Ng et al., 2022). Our training procedure does not learn the causal mechanisms directly, but the score of the data distribution. Contributions. In summary, we propose DiffAN, an identifiable algorithm leveraging a diffusion probabilistic model for topological ordering that enables causal discovery assuming an additive noise model: (i) To the best of our knowledge, we present the first causal discovery algorithm based on denoising diffusion training, which allows scaling to datasets with up to 500 variables and 10^5 samples. The score estimated with the diffusion model is used to find and remove leaf nodes iteratively; (ii) We estimate the second-order derivatives (the score's Jacobian, i.e. the Hessian) of a data distribution using neural networks with diffusion training via backpropagation; (iii) The proposed deciduous score (Section 3) allows efficient causal discovery without re-training the score model at each iteration. When a leaf node is removed, the score of the new distribution can be estimated from the original score (before leaf removal) and its Jacobian.
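The iterative find-and-remove loop behind contribution (i) can be sketched end-to-end on a hypothetical three-node chain x1 → x2 → x3, with an analytic log-density standing in for the learned diffusion score. Dropping a removed leaf's conditional term below plays the role that the deciduous score update plays for a learned model; all mechanisms and names are illustrative assumptions, not DiffAN's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain ANM: x1 -> x2 -> x3.
n = 100
x1 = rng.normal(size=n)
x2 = np.sin(x1) + rng.normal(size=n)
x3 = x2 ** 2 + rng.normal(size=n)
X = np.stack([x1, x2, x3], axis=1)

# One analytic conditional log-density term per node; removing a leaf
# simply drops its term (a stand-in for the deciduous score update).
terms = {
    0: lambda x: -0.5 * x[0] ** 2,
    1: lambda x: -0.5 * (x[1] - np.sin(x[0])) ** 2,
    2: lambda x: -0.5 * (x[2] - x[1] ** 2) ** 2,
}

def hessian_diag(f, x, idx, h=1e-4):
    # Finite-difference Hessian diagonal of f at x, restricted to nodes idx.
    out = []
    for i in idx:
        e = np.zeros(len(x))
        e[i] = h
        out.append((f(x + e) - 2 * f(x) + f(x - e)) / h ** 2)
    return np.array(out)

remaining = [0, 1, 2]
order = []  # topological ordering, leaves first
while len(remaining) > 1:
    # Log-density of the remaining variables: sum of the surviving terms.
    f = lambda x, t=tuple(terms[i] for i in remaining): sum(g(x) for g in t)
    H = np.array([hessian_diag(f, x, remaining) for x in X])
    # Leaf = node whose Hessian diagonal entry is constant across samples.
    leaf = remaining[int(np.argmin(H.var(axis=0)))]
    order.append(leaf)
    remaining.remove(leaf)
order.append(remaining[0])
print(order)  # expect [2, 1, 0]: x3 is removed first, x1 is the root
```

Each iteration removes exactly one leaf, so the loop runs d − 1 times; because the removed node is a leaf, the marginal over the remaining variables keeps exactly the surviving conditional terms.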

2. PRELIMINARIES

2.1 PROBLEM DEFINITION

We consider the problem of discovering the causal structure between d variables, given a probability distribution p(x) from which a d-dimensional random vector x = (x_1, . . . , x_d) can be sampled. We assume that the true causal structure is described by a DAG G containing d nodes. Each node represents a random variable x_i and edges represent the presence of causal relations between them. In other words, we can say that G defines a structural causal model (SCM) consisting of a collection of assignments x_i := f_i(Pa(x_i), ϵ_i), where Pa(x_i) are the parents of x_i in G, and ϵ_i is a noise term, also called exogenous noise, independent of Pa(x_i). The ϵ_i are i.i.d. from a smooth distribution p_ϵ. The SCM entails a unique distribution p(x) = ∏_{i=1}^{d} p(x_i | Pa(x_i)) over the variables x (Peters et al., 2017). The observational input data are X ∈ R^{n×d}, where n is the number of samples. The target output is an adjacency matrix A ∈ R^{d×d}.
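As a minimal sketch of this setup, the snippet below samples X ∈ R^{n×d} from a hypothetical three-variable SCM and writes down the adjacency matrix A that a discovery method would aim to recover. The particular assignments f_i are invented for illustration and are not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3

# Hypothetical SCM with assignments x_i := f_i(Pa(x_i), eps_i):
#   x1 := eps1
#   x2 := x1^2 + eps2
#   x3 := sin(x1) + x2 + eps3
eps = rng.normal(size=(n, d))  # i.i.d. exogenous noise
X = np.empty((n, d))
X[:, 0] = eps[:, 0]
X[:, 1] = X[:, 0] ** 2 + eps[:, 1]
X[:, 2] = np.sin(X[:, 0]) + X[:, 1] + eps[:, 2]

# Target adjacency matrix: A[i, j] = 1 iff x_i -> x_j in G.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])

print(X.shape, A.shape)  # (1000, 3) (3, 3)
```

A valid topological ordering here lists x3 first as a leaf and x1 last as the root; the pruning step would then decide which of the admissible edges are present.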

