TRUST YOUR ∇: GRADIENT-BASED INTERVENTION TARGETING FOR CAUSAL DISCOVERY

Abstract

Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve identifiability, interventional samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments on simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.

[Figure 1 panels: Gradient-based Causal Discovery; Gradient-based Intervention Targeting (GIT) Score; Intervention Acquisition]

Figure 1: Overview of GIT's usage in a gradient-based causal discovery framework. The framework infers a posterior distribution over graphs from observational and interventional data (denoted D_obs and D_int) through gradient-based optimization. GIT then uses the distribution over graphs and the gradient estimator ∇L(•) to score the intervention targets based on the magnitude of the estimated gradients. The intervention target with the highest score is selected and the intervention is performed on it. New interventional data D_int^new are then collected and the procedure is repeated.

1. INTRODUCTION

Estimating causal structure from data, commonly known as causal discovery or causal structure learning, is central to the progress of science (Pearl, 2009). Methods for causal discovery have been successfully deployed in various fields, such as biology (Sachs et al., 2005; Triantafillou et al., 2017; Glymour et al., 2019), medicine (Shen et al., 2020; Castro et al., 2020; Wu et al., 2022), earth system science (Ebert-Uphoff & Deng, 2012), and neuroscience (Sanchez-Romero et al., 2019). In general, real-world systems can often be explained as a modular composition of smaller parts connected by causal relationships. Knowing the underlying structure is crucial for making robust predictions about the system after a perturbation (or treatment) is applied (Peters et al., 2016). Moreover, such knowledge decompositions have been shown to enable sample-efficient learning and fast adaptation to distribution shifts by updating only a subset of parameters (Bengio et al., 2019; Scherrer et al., 2022).

To identify a system's causal structure uniquely, observational data (i.e., data obtained directly from the system, without interference) are, in general, insufficient and only allow recovery of the causal structure up to the Markov Equivalence Class (MEC) (Spirtes et al., 2000a; Peters et al., 2017).
Such a class contains multiple graphs that explain the observational data equally well. To overcome the limited identifiability, causal discovery algorithms commonly leverage interventional data (Hauser & Bühlmann, 2012; Brouillard et al., 2020; Ke et al., 2019), which are acquired by manipulating a part of the system (Spirtes et al., 2000b; Pearl, 2009).

Without an experimental design strategy, intervention targets (i.e., the variables on which the manipulation is performed) are usually chosen at random before conducting an experiment. While collecting enough interventional samples under a random strategy enables identification (Eberhardt & Scheines, 2007; Eberhardt, 2012), such an acquisition technique neglects the current evidence and can be wasteful, as acquiring interventional data might be costly (e.g., additional experiments in a chemistry lab) (Peters et al., 2017). Consequently, the field of experimental design (Lindley, 1956; Murphy, 2001; Tong & Koller, 2001) is concerned with acquiring interventional data in a targeted manner to minimize the number of required experiments.

In this work, we introduce a simple yet effective approach to actively choose intervention targets, called Gradient-based Intervention Targeting, or GIT for short (see Figure 1). GIT is a scoring-based method (i.e., the intervention with the highest score is selected) that relies on "imaginary" interventional data and is simple to implement on top of existing gradient-based causal discovery methods. GIT requires access to a parametric causal graph model and a loss function, typically based on interventional or fused (observational and interventional) data. The model and loss function can be defined exclusively for GIT purposes or provided by the underlying gradient-based causal discovery framework. Given these, the GIT scores reflect the expected magnitudes of the gradients of the loss function with respect to the model's structural parameters. Intuitively, GIT selects the intervention on which the model is most mistaken, i.e., the one that can lead to the largest model update.
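As a rough illustration of the scoring idea described above, the following Python sketch scores each candidate target by the average gradient magnitude on "imaginary" interventional data. It is a toy under stated assumptions, not the implementation used in this work: the edge-logit parameterization, the restriction to edges i -> j with i < j, the unit-weight linear mechanism, the squared-error loss, and all function names (sample_graph, imagine_batch, grad_norm, git_scores, select_target) are hypothetical choices made only for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_graph(logits, rng):
    """Sample an adjacency matrix from independent edge probabilities.

    Toy assumption: only edges i -> j with i < j are allowed, so every
    sample is a DAG and index order is a valid topological order.
    """
    n = len(logits)
    return [[1 if i < j and rng.random() < sigmoid(logits[i][j]) else 0
             for j in range(n)] for i in range(n)]


def imagine_batch(graph, target, n_samples, rng):
    """Draw 'imaginary' samples under a hard intervention on `target`."""
    n = len(graph)
    batch = []
    for _ in range(n_samples):
        x = [0.0] * n
        for j in range(n):
            if j == target:
                x[j] = rng.gauss(0.0, 1.0)  # intervened node ignores its parents
            else:
                # Assumed mechanism: sum of parents plus small Gaussian noise.
                x[j] = sum(graph[i][j] * x[i] for i in range(j)) + rng.gauss(0.0, 0.1)
        batch.append(x)
    return batch


def grad_norm(logits, batch, target):
    """L2 norm of the batch-averaged gradient of a squared-error loss
    with respect to the edge logits (the structural parameters)."""
    n = len(logits)
    grad = [[0.0] * n for _ in range(n)]
    for x in batch:
        for j in range(n):
            if j == target:
                continue  # the intervened node is excluded from the loss
            probs = [sigmoid(logits[i][j]) for i in range(j)]
            resid = x[j] - sum(p * x[i] for i, p in enumerate(probs))
            for i, p in enumerate(probs):
                # d/d(logit) of (x_j - sum_i p_i x_i)^2, averaged over the batch
                grad[i][j] += -2.0 * resid * x[i] * p * (1.0 - p) / len(batch)
    return math.sqrt(sum(g * g for row in grad for g in row))


def git_scores(logits, n_graphs=50, n_samples=20, seed=0):
    """Score each candidate target by its mean gradient norm over sampled graphs."""
    rng = random.Random(seed)
    scores = []
    for target in range(len(logits)):
        total = 0.0
        for _ in range(n_graphs):
            graph = sample_graph(logits, rng)
            batch = imagine_batch(graph, target, n_samples, rng)
            total += grad_norm(logits, batch, target)
        scores.append(total / n_graphs)
    return scores


def select_target(scores):
    """Return the index of the highest-scoring intervention target."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Calling git_scores on a 3-node matrix of zero logits (every allowed edge has probability 0.5) and passing the result to select_target returns the node to intervene on next. In this toy, logits that are already saturated yield near-zero gradients and hence near-zero scores, which mirrors the intuition that little can be learned where the model is already confident.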

Our contributions include:

• We introduce GIT, a method for active intervention targeting in gradient-based causal discovery, which can be applied on top of various causal discovery frameworks.
• We conduct extensive experiments on synthetic and real-world graphs. We demonstrate that, compared against competitive baselines, our method typically reduces the amount of interventional data needed to discover the causal structure. GIT is particularly efficient in the low-data regime and is thus recommended when access to interventional data is limited.
• We perform additional analyses which suggest that the good performance of GIT stems from its ability to focus on highly informative nodes.

2. RELATED WORK

Experimental Design / Intervention Design. There are two major classes of methods for selecting optimal interventions for causal discovery. One class of approaches is based on graph-theoretical properties. Typically, a completed partially directed acyclic graph (CPDAG), describing an equivalence class of DAGs, is first specified. Then, either substructures, such as cliques or trees, are investigated and used to inform decisions (He & Geng, 2008; Eberhardt, 2012; Squires et al., 2020; Greenewald et al., 2019), or edges of a proposed graph are iteratively refined until a prescribed budget is reached (Ghassami et al., 2018; 2019; Kocaoglu et al., 2017; Lindgren et al., 2018). The most severe limitation of graph-theoretical approaches is that misspecification of the CPDAG at the beginning of the process can degrade the final solution. Another class of methods is based on Bayesian Optimal Experimental Design (Lindley, 1956), which aims to select interventions with the highest mutual information (MI) between the observations and model parameters. MI is approximated in different ways: AIT

Gradient-based Causal Structure Learning. The appealing properties of neural networks have sparked a flurry of gradient-based causal structure learning methods. The most prevalent approaches are unsupervised formulations that optimize a data-dependent scoring metric (for instance, penalized log-likelihood) to find the best causal graph G. Existing unsupervised methods that can incorporate interventional data, or can be extended to do so, can be categorized based on the underlying optimization formulation into: (i) frameworks with a joint optimization objective (Brouillard et al., 2020; Lorch et al., 2021; Cundy et al., 2021; Annadani et al., 2021; Geffner et al., 2022; Deleu et al.,

