ORDERING-BASED CAUSAL DISCOVERY WITH REINFORCEMENT LEARNING

Abstract

Discovering causal relations among a set of variables is a long-standing problem in many empirical sciences. Recently, Reinforcement Learning (RL) has achieved promising results in causal discovery. However, searching the space of directed graphs directly and enforcing acyclicity through implicit penalties tends to be inefficient and restricts such methods to small problems. In this work, we instead use RL to search over the space of variable orderings, which is much smaller than the space of directed graphs and sidesteps the need to deal with acyclicity. Specifically, we formulate the ordering search problem as a Markov decision process, and use different reward designs to optimize the ordering-generating model. A generated ordering is then processed by variable selection methods to obtain the final directed acyclic graph. In contrast to other causal discovery methods, our method can also utilize a pretrained model to accelerate training. Experiments on both synthetic and real-world datasets show that the proposed method outperforms other baselines on important metrics, even on large graph tasks.

1. INTRODUCTION

Identifying causal structure from observational data is an important but challenging task in many applications. The problem can be formulated as finding a Directed Acyclic Graph (DAG) that minimizes some score function defined w.r.t. the observed data. Although there exist well-studied score functions such as the Bayesian Information Criterion (BIC) and Minimum Description Length (MDL) (Schwarz et al., 1978; Chickering, 2002), searching over the space of DAGs is known to be NP-hard, even when each node has at most two parents (Chickering, 1996). Consequently, traditional methods mostly rely on local heuristics to perform the search, including greedy hill climbing and Greedy Equivalence Search (GES), which explores the space of Markov equivalence classes (Chickering, 1996). Along with various search strategies, existing methods have also considered reducing the search space while meeting the DAG constraint. A useful practice is to cast the causal structure learning problem as that of learning an optimal ordering of variables (Koller & Friedman, 2009): the ordering space is significantly smaller than the space of directed graphs, and searching over orderings avoids dealing with acyclicity constraints (Teyssier & Koller, 2005). Many algorithms have been used to search for orderings, such as genetic algorithms (Larranaga et al., 1996), Markov chain Monte Carlo (Friedman & Koller, 2003), and greedy local hill climbing (Teyssier & Koller, 2005). However, these algorithms often cannot find the best ordering effectively. Recently, with differentiable score functions, several gradient-based methods have been proposed based on a smooth characterization of acyclicity, including NOTEARS (Zheng et al., 2018) for linear causal models and several subsequent works, e.g., Yu et al. (2019); Lachapelle et al. (2020); Ng et al. (2019b;a), which use neural networks to model nonlinear causal relationships. As another attempt, Zhu et al.
(2020) utilize Reinforcement Learning (RL) as a search strategy to find the best DAG in the space of directed graphs, and the approach can be combined with a wide range of score functions. Unfortunately, its good performance is achieved only with around 30 variables, for at least two reasons: 1) the action space, consisting of directed graphs, is tremendous for large-scale problems and hard to explore efficiently; and 2) it has to compute scores for the many non-DAGs generated during training, but computing scores w.r.t. the data is generally time-consuming. Due to its search nature, the RL-based approach may not achieve performance close to that of gradient-based methods that directly optimize the same (differentiable) score function on large causal discovery problems.

Under review as a conference paper at 2021

In this work, we propose an RL-based approach for causal discovery, named Causal discovery with Ordering-based Reinforcement Learning (CORL), which combines RL with the ordering-based paradigm so that the powerful search ability of RL can be exploited to find the best ordering efficiently. To this end, we formulate the ordering search problem as a Markov Decision Process (MDP), and then use different reward designs for RL to optimize the ordering-generating model. In addition, a pretrained model can be incorporated into our approach to accelerate training. A generated ordering is pruned to the final DAG by variable selection. The proposed approach is evaluated on both synthetic and real datasets to validate its effectiveness. In particular, it achieves much better performance than the previous RL-based method on both linear and non-linear data models, even outperforms NOTEARS, a gradient-based method, on 150-node linear data models, and is competitive with the Causal Additive Model (CAM) on non-linear data models.
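To make the pruning step concrete, the following is a minimal sketch of our own (not the paper's exact procedure) that turns a variable ordering into a DAG: each variable is regressed on its predecessors in the ordering, and only edges with coefficients above a hypothetical threshold `thresh` are kept.

```python
import numpy as np

def ordering_to_dag(X, order, thresh=0.3):
    """Prune a variable ordering to a DAG: regress each variable on its
    predecessors in the ordering and keep edges with large coefficients.
    (Illustrative least-squares pruning; thresh is a free parameter.)"""
    d = X.shape[1]
    W = np.zeros((d, d))  # W[i, j] != 0 means edge i -> j
    for pos, j in enumerate(order):
        parents = order[:pos]
        if not parents:
            continue
        coef, *_ = np.linalg.lstsq(X[:, parents], X[:, j], rcond=None)
        for p, c in zip(parents, coef):
            if abs(c) > thresh:
                W[p, j] = c
    return W

# Toy example with the true chain x0 -> x1 -> x2:
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = 2.0 * x0 + 0.1 * rng.normal(size=500)
x2 = -1.5 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x0, x1, x2])
W = ordering_to_dag(X, [0, 1, 2])
print((np.abs(W) > 0).astype(int))  # adjacency consistent with x0 -> x1 -> x2
```

Note that a correct ordering is a superset constraint: it permits all edges from earlier to later variables, and the variable selection step is what removes the spurious ones (here, the candidate edge x0 -> x2).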

2. RELATED WORKS

Existing causal discovery methods roughly fall into three classes. The first class, as described in the introduction, comprises the so-called score-based methods. Besides the mentioned BIC/MDL scores, other score functions include the Bayesian Gaussian equivalent score (Geiger & Heckerman, 1994), the generalized score based on (conditional) independence relationships (Huang et al., 2018), and a recently proposed meta-transfer score (Bengio et al., 2020). The second class, the constraint-based methods such as fast causal inference and PC (Spirtes et al., 2000; Zhang, 2008), first find the causal skeleton and then decide the orientations of the edges up to the Markov equivalence class. Such methods usually involve multiple independence tests; the test results may conflict, and handling these conflicts is not easy. The last class of methods relies on properly defined functional causal models, including the Linear Non-Gaussian Acyclic Model (LiNGAM), the nonlinear Additive Noise Model (ANM) (Hoyer et al., 2009; Peters et al., 2014), and the post-nonlinear causal model (Zhang & Hyvärinen, 2009). By placing certain assumptions on the class of causal functions and/or noise distributions, these methods can distinguish different DAGs in the same Markov equivalence class.

Since we particularly consider ordering-based approaches, we present a more detailed review of such methods here. Most ordering-based methods belong to the class of score-based methods. Besides the mentioned heuristic search algorithms, Schmidt et al. (2007) proposed L1OBS, which conducts variable selection using ℓ1-regularization paths, building on Teyssier & Koller (2005). Scanagatta et al. (2015) further proposed an ordering exploration method based on an approximated score function so as to scale to thousands of variables. The Causal Additive Model (CAM) was developed by Bühlmann et al. (2014) for nonlinear data models. Some recent ordering-based methods, such as sparsest permutation (Raskutti & Uhler, 2018) and greedy sparsest permutation (Solus et al., 2017), guarantee consistency up to the Markov equivalence class, based on conditional independence relations and certain assumptions such as faithfulness. A variant of greedy sparsest permutation was also proposed by Bernstein et al. (2020) for the setting with latent variables. In the present work, we mainly target identifiable cases, whose assumptions may differ from theirs.

In addition, several exact algorithms such as dynamic programming (Silander & Myllymäki, 2006; Perrier et al., 2008) and integer or linear programming (Jaakkola et al., 2010; Cussens, 2011; Bartlett & Cussens, 2017) have been used to solve the causal discovery problem. However, these exact algorithms are usually efficient only on small graphs (De Campos & Ji, 2011); to handle larger problems with hundreds of variables, they need to incorporate heuristic search (Xiang & Kim, 2013) or limit the maximum number of parents. Recently, RL has been used to tackle several combinatorial problems, such as maximum cut and the traveling salesman problem, owing to their relatively simple reward mechanisms (Khalil et al., 2017). In combination with the encoder-decoder based pointer networks (Vinyals et al., 2015), Bello et al. (2016) showed that RL can generalize better than supervised training, even when optimal solutions are available as labeled data. Kool et al. (2019) further used an attention-based encoder-decoder model for improved performance. These works aim to learn a policy that serves as a combinatorial solver for a particular type of combinatorial problem with a fixed structure. However, causal discovery tasks generally differ in causal relationships, data types, graph structures, etc., and are moreover typically off-line with focus on a single causal graph. As such, we use RL as a search strategy, similar to Zhu et al. (2020); Zoph & Le (2017); nevertheless, a pretrained model or policy can offer a good starting point to speed up training, as shown in our evaluation results (Figure 3).
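The size gap that motivates ordering-based search can be made concrete: with d variables there are d! orderings, while the number of labeled DAGs grows super-exponentially. A small sketch of this comparison, using Robinson's classic recurrence for counting labeled DAGs (standard combinatorics, not a result of this paper):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def num_dags(d):
    """Number of labeled DAGs on d nodes (Robinson's recurrence)."""
    if d == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(d, k) * 2 ** (k * (d - k)) * num_dags(d - k)
               for k in range(1, d + 1))

for d in (3, 5, 10):
    print(d, factorial(d), num_dags(d))
# e.g. for d = 10 there are 10! = 3628800 orderings but over 4e18 DAGs
```

Of course, each ordering still admits up to 2^(d(d-1)/2) edge subsets, which is why a subsequent variable selection (pruning) step is needed; the point is that the discrete search itself only has to cover d! candidates, all of them acyclic by construction.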

