ORDERING-BASED CAUSAL DISCOVERY WITH REINFORCEMENT LEARNING

Abstract

Discovering causal relations among a set of variables is a long-standing question in many empirical sciences. Recently, Reinforcement Learning (RL) has achieved promising results in causal discovery. However, searching the space of directed graphs directly and enforcing acyclicity through implicit penalties tends to be inefficient and restricts such methods to small problems. In this work, we instead use RL to search for an ordering in the variable ordering space, which is much smaller than the space of directed graphs and avoids dealing with acyclicity altogether. Specifically, we formulate the ordering search problem as a Markov decision process and use different reward designs to optimize the ordering generating model. A generated ordering is then processed with variable selection methods to obtain the final directed acyclic graph. In contrast to other causal discovery methods, our method can also utilize a pretrained model to accelerate training. We conduct experiments on both synthetic and real-world datasets and show that the proposed method outperforms other baselines on important metrics, even on large graph tasks.

1. INTRODUCTION

Identifying causal structure from observational data is an important but challenging task in many applications. The problem can be formulated as finding a Directed Acyclic Graph (DAG) that minimizes some score function defined w.r.t. the observed data. Although there exist well-studied score functions such as the Bayesian Information Criterion (BIC) and Minimum Description Length (MDL) (Schwarz et al., 1978; Chickering, 2002), searching over the space of DAGs is known to be NP-hard, even if each node has at most two parents (Chickering, 1996). Consequently, traditional methods mostly rely on local heuristics to perform the search, including greedy hill climbing and Greedy Equivalence Search (GES), which explores the Markov equivalence classes (Chickering, 1996). Along with various search strategies, existing methods have also considered reducing the search space while meeting the DAG constraint. A useful practice is to cast the causal structure learning problem as that of learning an optimal ordering of variables (Koller & Friedman, 2009), because the ordering space is significantly smaller than the space of directed graphs, and searching over the ordering space avoids dealing with acyclicity constraints (Teyssier & Koller, 2005). Many algorithms have been used to search for orderings, such as genetic algorithms (Larranaga et al., 1996), Markov chain Monte Carlo (Friedman & Koller, 2003) and greedy local hill-climbing (Teyssier & Koller, 2005). However, these algorithms often cannot find the best ordering effectively. Recently, with differentiable score functions, several gradient-based methods have been proposed based on a smooth characterization of acyclicity, including NOTEARS (Zheng et al., 2018) for linear causal models and several subsequent works, e.g., Yu et al. (2019); Lachapelle et al. (2020); Ng et al. (2019b;a), which use neural networks to model nonlinear causal relationships.

As another attempt, Zhu et al. (2020) utilize Reinforcement Learning (RL) as a search strategy to find the best-scoring DAG in the graph space, and their method can be combined with a wide range of score functions. Unfortunately, its good performance is achieved only with around 30 variables, for at least two reasons: 1) the action space, consisting of directed graphs, is tremendous for large scale problems and hard to explore efficiently; and 2) it has to compute scores for many non-DAGs generated during training, and computing scores w.r.t. data is generally time-consuming. Due to its search nature, the RL-based approach may thus fail to achieve performance close to that of gradient-based methods that directly optimize the same (differentiable) score function on large causal discovery problems.

In this work, we propose an RL-based approach for causal discovery, named Causal discovery with Ordering-based Reinforcement Learning (CORL), which combines RL with the ordering-based paradigm so that we can exploit the powerful search ability of RL to find the best ordering efficiently. To achieve this, we formulate the ordering search problem as a Markov Decision Process (MDP) and use different reward designs for RL to optimize the ordering generating model. In addition, we note that a pretrained model can be incorporated into our method to accelerate training. A generated ordering is pruned into the final DAG by variable selection. The proposed approach is evaluated on both synthetic and real datasets to validate its effectiveness. In particular, the proposed method achieves much better performance than the previous RL-based method on both linear and non-linear data models, even outperforms NOTEARS, a gradient-based method, on 150-node linear data models, and is competitive with the Causal Additive Model (CAM) on non-linear data models.

2. RELATED WORKS

Existing causal discovery methods roughly fall into three classes. The first class, as described in the introduction, comprises the so-called score-based methods. Besides the aforementioned BIC/MDL scores, other score functions include the Bayesian Gaussian equivalent score (Geiger & Heckerman, 1994), the generalized score based on (conditional) independence relationships (Huang et al., 2018), and a recently proposed meta-transfer score (Bengio et al., 2020). A second class of methods, such as Fast Causal Inference and PC (Spirtes et al., 2000; Zhang, 2008), first find the causal skeleton and then decide the orientations of the edges up to the Markov equivalence class; these are viewed as constraint-based methods. Such methods usually involve multiple independence testing problems; the testing results may conflict, and handling such conflicts is not easy. The last class of methods relies on properly defined functional causal models, including the Linear Non-Gaussian Acyclic Model (LiNGAM), the nonlinear Additive Noise Model (ANM) (Hoyer et al., 2009; Peters et al., 2014), and the post-nonlinear causal model (Zhang & Hyvärinen, 2009). By placing certain assumptions on the class of causal functions and/or noise distributions, these methods can distinguish different DAGs in the same Markov equivalence class.

Since we particularly consider ordering-based approaches, we present a more detailed review of such methods. Most ordering-based methods belong to the class of score-based methods. Besides the heuristic search algorithms mentioned above, Schmidt et al. (2007) proposed L1OBS, which conducts variable selection using L1-regularization paths on top of Teyssier & Koller (2005). Scanagatta et al. (2015) further proposed an ordering exploration method based on an approximated score function so as to scale to thousands of variables. The Causal Additive Model (CAM) was developed by Bühlmann et al. (2014) for nonlinear data models.
Some recent ordering-based methods, such as sparsest permutation (Raskutti & Uhler, 2018) and greedy sparsest permutation (Solus et al., 2017), can guarantee consistency up to the Markov equivalence class, based on certain conditional independence relations and assumptions like faithfulness. A variant of greedy sparsest permutation was also proposed by Bernstein et al. (2020) for the setting with latent variables. In the present work, we mainly work on identifiable cases, which may rest on different assumptions from theirs. In addition, several exact algorithms, such as dynamic programming (Silander & Myllymäki, 2006; Perrier et al., 2008) and integer or linear programming (Jaakkola et al., 2010; Cussens, 2011; Bartlett & Cussens, 2017), have been used to solve the causal discovery problem. However, these exact algorithms are usually efficient only on small graphs (De Campos & Ji, 2011); to handle larger problems with hundreds of variables, they need to incorporate heuristic search (Xiang & Kim, 2013) or limit the maximum number of parents.

Recently, RL has been used to tackle several combinatorial problems, such as maximum cut and the traveling salesman problem, thanks to their relatively simple reward mechanisms (Khalil et al., 2017). In combination with encoder-decoder based pointer networks (Vinyals et al., 2015), Bello et al. (2016) showed that RL can generalize better than supervised training, even when optimal solutions are used as labeled data. Kool et al. (2019) further used an attention based encoder-decoder model for improved performance. These works aim to learn a policy that serves as a combinatorial solver for a particular type of combinatorial problem with a fixed structure. However, causal discovery tasks generally differ in causal relationships, data types, graph structures, etc., and are typically off-line with the focus on a single causal graph. As such, we use RL as a search strategy, similar to Zhu et al. (2020); Zoph & Le (2017); nevertheless, a pretrained model or policy can offer a good starting point to speed up training, as shown in our evaluation results (Figure 3).

3.1. CAUSAL STRUCTURE LEARNING

Let G = (V, E) denote a DAG with d nodes, where V = {v_1, ..., v_d} is the set of nodes and E = {(v_i, v_j) | i, j = 1, ..., d} is the set of directed edges, with (v_i, v_j) an edge from v_i to v_j. Each node v_j is associated with a random variable X_j. The probability model associated with G factorizes as p(X_1, ..., X_d) = Π_{j=1}^d p(X_j | Pa(X_j)), where p(X_j | Pa(X_j)) is the conditional probability distribution of X_j given its parents Pa(X_j) := {X_k | (v_k, v_j) ∈ E}. We assume that the observed data are generated by a Structural Equation Model (SEM) with additive noise:

X_j := f_{θ_j}(Pa(X_j)) + ε_j,  j = 1, ..., d,

where f_{θ_j}, parameterized by θ_j, represents the functional relationship between X_j and its parents, and the ε_j's are jointly independent additive noise variables. We assume causal minimality, which for this SEM is equivalent to requiring that each f_{θ_j} is not constant in any X_k ∈ Pa(X_j) (Peters et al., 2014). We are given a sample X = [x_1, ..., x_d] ∈ R^{m×d}, where x_j is a vector of m observations of the random variable X_j. Our goal in this paper is to find the DAG G that maximizes the BIC score (or another well-studied score), defined as

Score_BIC(G) = Σ_{j=1}^d [ Σ_{k=1}^m log p(x_j^k | Pa(x_j)^k; θ_j) - (|θ_j| / 2) log m ],   (1)

where |θ_j| is the number of free parameters in p(x_j^k | Pa(x_j)^k; θ_j). For linear causal relationships, |θ_j| = |Pa(X_j)|, the number of parents, up to some constant factor.

Figure 1: An example of the correspondence between an ordering Π := {v_2, v_4, v_1, v_3} (bottom) and a fully-connected DAG (top).

The problem of finding a directed graph that satisfies the acyclicity constraint can be cast as that of finding a variable ordering (Teyssier & Koller, 2005; Schmidt et al., 2007). The score of an ordering is usually defined as the score of the best DAG that is consistent with the given ordering.
Specifically, let Π denote an ordering of the nodes in V, where the length of the ordering is |Π| = |V| and Π is indexed from 1. If node v_j ∈ V lies in the p-th position, then Π(p) = v_j. The notation Π_{≺v_j} denotes the set of nodes in V that precede node v_j in Π. One can establish a canonical correspondence between an ordering Π and a fully-connected DAG G_Π; an example with four nodes is presented in Figure 1. A given DAG G can be consistent with more than one ordering, and the set of such orderings is given by

Φ(G) := {Π : the fully-connected DAG G_Π is a super-DAG of G},   (2)

where a super-DAG of G is a DAG whose edge set is a superset of that of G. The score of an ordering is usually defined as the score of the best DAG that is consistent with the given ordering (Teyssier & Koller, 2005; Peters et al., 2014). We provide a formal statement in Proposition 1 showing that it is possible to find a correct ordering with high probability in the large sample limit. Therefore, the search for the true DAG G* can be decomposed into two phases: finding a correct ordering and performing variable selection (feature selection); the latter finds the optimal DAG that is consistent with the ordering found in the first phase.

Proposition 1. Suppose that an identifiable SEM with true causal DAG G* on X = {X_j}_{j=1}^d induces the distribution P(X). Let G_Π be the fully-connected DAG that corresponds to an ordering Π. If there is an SEM with G_Π inducing the same distribution P(X), then G_Π must be a super-DAG of G*, i.e., every edge in G* is covered by G_Π.

Proof. The SEM with G_Π may not be causally minimal, but it can be reduced to an SEM satisfying the causal minimality condition (Peters et al., 2014). Let G̃_Π denote the causal graph of the reduced SEM with the same distribution P(X). Since the original SEM is assumed to be identifiable, i.e., the distribution P(X) corresponds to a unique true graph, G̃_Π is identical to G*. The proof is complete by noticing that G_Π is a super-DAG of G̃_Π.
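The ordering-to-super-DAG correspondence used throughout this section is easy to make concrete. The following sketch (with illustrative helper names `ordering_to_super_dag` and `consistent`, not from the paper) builds the edge set of G_Π and checks the consistency condition in Equation (2):

```python
def ordering_to_super_dag(ordering):
    """Edge set of the fully-connected DAG G_Pi induced by an ordering:
    every node points to all nodes appearing later in Pi."""
    return {(u, v) for i, u in enumerate(ordering) for v in ordering[i + 1:]}

def consistent(ordering, dag_edges):
    """A DAG is consistent with an ordering iff G_Pi is a super-DAG of it."""
    return set(dag_edges) <= ordering_to_super_dag(ordering)

# The four-node example of Figure 1, Pi := {v2, v4, v1, v3}:
pi = ["v2", "v4", "v1", "v3"]
# G_Pi has d*(d-1)/2 = 6 edges; any DAG covered by it, e.g. with edges
# v2 -> v1 and v4 -> v3, is consistent with Pi.
```

Note that a sparser DAG is typically consistent with many orderings, which is exactly why Φ(G) in Equation (2) is a set.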

3.2. REINFORCEMENT LEARNING

Standard RL is usually formulated as an MDP over environment states s ∈ S and agent actions a ∈ A, under (unknown) environment dynamics defined by a transition probability T(s'|s, a). We use π_φ(a|s) to denote the policy, parameterized by φ, which outputs a discrete (or continuous) distribution used to select an action from the action space A based on the state s. For episodic tasks, a trajectory τ = {s_t, a_t}_{t=0}^T, where T is the finite time horizon, can be collected by executing the policy repeatedly. In many cases, an immediate reward r(s, a) is received when the agent executes an action. The objective of RL is to learn a policy that maximizes the expected cumulative reward along a trajectory, i.e., J(φ) = E_{π_φ}[R_0] with R_0 = Σ_{t=0}^T γ^t r_t(s_t, a_t), where γ ∈ (0, 1] is a discount factor. In some scenarios, the reward is only received at the terminal time (an episodic reward), and J(φ) = E_{π_φ}[R(τ)] with R(τ) = r_T(s_T, a_T).

4. METHOD

In this section, we first show how to model the ordering search problem as an MDP; we then describe how to use RL to find the optimal ordering and how to process the searched ordering into the final DAG; finally, we discuss computational complexity.

4.1. ORDERING SEARCH AS A MARKOV DECISION PROCESS

We can regard the variable ordering search problem as a multi-stage decision problem in which one variable is selected at each decision step. Sorting the selected variables by decision step yields a variable ordering, which we define as the searched ordering. The decision-making process is Markovian, and its elements can be defined as follows:

State One could directly take the sample data x_j of each variable X_j as a state s_j. However, preliminary experiments show that it is difficult for feed-forward neural network models to capture the underlying causal relationships directly from observed data used as states, and that data pre-processed by a module conventionally named an encoder helps to find better orderings; see Appendix A.1. The encoder module embeds each x_j into s_j, and all the embedded states constitute the space S := {s_1, ..., s_d}. Adding the initial state s_0 yields the complete state space Ŝ := S ∪ {s_0}. We use ŝ_t to denote the state encountered at the t-th decision step.

Action At each decision step, we select an action (variable) from the action space constituted by all the variables, so the space size equals the number of variables, |A| = d. Compared to the previous RL-based method that searches the graph space of size O(2^{d×d}) (Zhu et al., 2020), this search space is much smaller. Note that, by the definition of an ordering, the first selected variable is a source node and the last selected variable is a sink node.
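The gap between the two search spaces is easy to quantify. A small sketch, using the loose 2^{d×d} count for directed graphs quoted above:

```python
import math

def graph_space_size(d):
    """Directed-graph search space of the previous RL approach: all binary
    adjacency matrices over d nodes, O(2^(d*d))."""
    return 2 ** (d * d)

def ordering_space_size(d):
    """Variable-ordering search space used by CORL: d! permutations."""
    return math.factorial(d)

# For d = 30: 30! is about 2.7e32 orderings versus 2^900, about 8.5e270,
# directed graphs, so the ordering space is smaller by hundreds of orders
# of magnitude, and each decision step chooses among only |A| = d actions.
```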

State transition The state transition depends on the action selected at the current decision step: if variable v_j is selected at the t-th decision step, the state transitions to s_j ∈ S, the j-th output of the Transformer encoder, i.e., ŝ_{t+1} = s_j.

Reward As described in Section 3.1, only the variables selected in previous decision steps can be potential parents of the currently selected variable. Hence, we design rewards for two cases: dense rewards and episodic rewards. In the dense reward case, we exploit the decomposability of the score function (the BIC score) to compute an immediate reward for the current decision step based on the already selected potential parent variables: if v_j is selected at time step t, the immediate reward is

r_t = Σ_{k=1}^m log p(x_j^k | U(X_j); θ_j) - (|θ_j| / 2) log m,

where U(X_j) denotes the potential parent set of X_j, consisting of the variables associated with the nodes in Π_{≺v_j}. In the episodic reward case, we directly compute a score as an episodic reward for a complete variable ordering, regardless of whether the score function is decomposable: R(τ) = r_T(ŝ, a) = Score_BIC(G_Π), where Score_BIC is defined in Equation (1), with the set Pa(X_j) of each variable X_j replaced by the potential parent set U(X_j).
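Under a linear-Gaussian modelling assumption, the immediate reward above can be sketched as follows. This is a hypothetical helper, not the paper's implementation: the Gaussian log-likelihood is evaluated at the least-squares residuals of x_j regressed on the already selected variables, and the BIC penalty counts one parameter per potential parent:

```python
import numpy as np

def dense_reward_linear(X, j, potential_parents):
    """Immediate reward for selecting variable j given the already selected
    variables U(X_j): Gaussian log-likelihood of the least-squares residuals
    minus the BIC penalty (|theta_j| / 2) * log(m). Illustrative only."""
    m = X.shape[0]
    x_j = X[:, j]
    if potential_parents:
        P = X[:, potential_parents]
        beta, *_ = np.linalg.lstsq(P, x_j, rcond=None)
        resid = x_j - P @ beta
        k = len(potential_parents)
    else:
        resid = x_j - x_j.mean()
        k = 0
    sigma2 = max(float(np.mean(resid ** 2)), 1e-12)  # MLE noise variance
    loglik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1.0)
    return loglik - 0.5 * k * np.log(m)
```

Summing these immediate rewards over a complete ordering recovers the episodic score, which is exactly the decomposability the dense reward exploits.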

4.2. IMPLEMENTATION AND OPTIMIZATION WITH REINFORCEMENT LEARNING

We describe the neural network architectures implemented in our method, which consist of an encoder and a decoder, as shown in Figure 2. Here we briefly describe the model architectures and leave details regarding model parameters to Appendix A.

Figure 2: Illustration of the policy model. The encoder embeds the observed data x_j into the state s_j. An action a_t is selected by the decoder according to the given state ŝ_t at each time step t.

Encoder The encoder f^enc_{φ_e}: X → S maps the observed data to the embedding space S = {s_1, ..., s_d}. For sample efficiency, we follow Zhu et al. (2020) and randomly draw n samples from the dataset X at each episode to construct X̄ ∈ R^{n×d}, using X̄ instead of X. We also set each embedding s_j to the same dimension, i.e., s_j ∈ R^n. For the encoder choice, we conduct an empirical comparison among several possible structures, such as a self-attention based encoder (Vaswani et al., 2017) and an LSTM structure. Empirically, we confirm that the self-attention based encoder from the Transformer performs best; it is also used in Zhu et al. (2020). Please find more details in Appendix A.1.

Decoder The decoder f^dec_{φ_d}: Ŝ → A maps the state space Ŝ to the action space A. Among several decoder choices (see Appendix A.1 for an empirical comparison), we pick an LSTM based structure that proves effective in our experiments. Although the initial state is usually generated randomly, we set it to s_0 = (1/d) Σ_{i=1}^d s_i, considering that the source node is fixed in a correct ordering. We then restrict each node to be selected only once by masking the already selected nodes, so as to generate a valid variable ordering (Vinyals et al., 2015).

Optimization The objective is to learn a policy that maximizes J(φ) = E_{π_φ}[R], where π_φ denotes the policy model parameterized by the parameters {φ_e, φ_d} of the encoder f^enc and decoder f^dec. Based on the above definitions, policy gradient (Sutton & Barto, 2018) is used to optimize the parameters of the ordering generating model. For the dense reward case, the policy gradient can be written as

∇J(φ) = E_{π_φ}[ Σ_{t=0}^T R_t ∇_φ log π_φ(a_t | ŝ_t) ],

where R_t = Σ_{l=0}^{T-t} γ^l r_{t+l} denotes the return at time step t. We denote the algorithm using this reward design as CORL-1. For the episodic reward case, we have the policy gradient

∇J(φ) = E_{π_φ}[ R(τ) Σ_{t=0}^T ∇_φ log π_φ(a_t | ŝ_t) ];

the algorithm using this reward design is denoted as CORL-2. Using a parametric baseline to estimate the expected score typically improves learning (Konda & Tsitsiklis, 2000). Therefore, we introduce a critic network V_{φ_v}(ŝ_t), parameterized by φ_v, which learns the expected return given a state ŝ_t. It is trained with stochastic gradient descent using the Adam optimizer on a mean squared error objective between its predicted value and the actual return. Details about the critic network parameters are given in Appendix A.2. Inspired by the benefits of pretrained models in other tasks (Hinton & Salakhutdinov, 2012), we also consider incorporating one into our method to accelerate training.
Usually, one can obtain some observed data with a known causal structure or correct ordering, e.g., by simulation or from real data with a labeled graph. Hence, we pretrain a policy model with such data in a supervised way. So far, we have presented CORL in a general manner without specifying which distribution family is used during the evaluation of rewards. In principle, any distribution family can be employed as long as its log-likelihood can be computed; no differentiability is required. However, it is not always clear whether maximizing the accumulated reward recovers the correct ordering. This depends on both the modelling choice of the reward and the underlying SEM; in fact, if the causal relationships fall into the chosen model functions and the right distribution family is assumed, then given infinite samples the optimal accumulated reward, corresponding to the negative log-likelihood, must be achieved by a super-DAG of the underlying graph according to Proposition 1. In practice, we can only apply approximate model functions and also need to assume a certain distribution family for calculating the reward.

Our method is summarized in Algorithm 1. In addition, we record the decomposed scores for each variable v_j with different parental sets Π_{≺v_j} to avoid repeated computations, which are generally time-consuming (see Section 4.4).

Algorithm 1 CORL
Require: observed data X; initial parameters φ_e, φ_d and φ_v; two empty buffers D and D_score; initial value BestScore (negative infinity) and a random ordering BestOrdering.
1: while not terminated do
2:   draw a batch of samples from X, encode them to S and calculate the initial state ŝ_0
3:   for t = 0, 1, ..., T do
4:     collect a batch of data ⟨ŝ_t, a_t, r_t⟩ with π_φ: D = D ∪ {⟨ŝ_t, a_t, r_t⟩}
5:     if ⟨v_t, Π_{≺v_t}, r_t⟩ is not in D_score then
6:       store ⟨v_t, Π_{≺v_t}, r_t⟩ in D_score
7:     end if
8:   end for
9:   update the policy parameters {φ_e, φ_d} and the critic parameter φ_v with the data in D
10:  if the achieved episode score is better than BestScore then
11:    update BestScore and BestOrdering
12:  end if
13: end while
14: get the final DAG by pruning the BestOrdering
Although we cannot guarantee finding the optimal ordering, because policy gradient can at most guarantee local convergence (Sutton et al., 2000) and we only have access to the empirical log-likelihood, we remark that the orderings obtained by CORL still achieve good performance in the experiments, compared with consistent methods like GES and PC.
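The score caching in Algorithm 1 (buffer D_score) can be sketched as follows. `ScoreCache` and `episode_score` are illustrative names; the key point is that the local score of v_j depends only on the set of preceding variables, so a set-valued key lets different orderings share computed scores:

```python
class ScoreCache:
    """Memoized decomposed scores, mirroring the buffer D_score of
    Algorithm 1: scores are keyed by the variable and the (order-
    independent) frozenset of its potential parents."""

    def __init__(self, local_score):
        self.local_score = local_score  # callable: (j, parents) -> float
        self.table = {}
        self.hits = 0

    def __call__(self, j, potential_parents):
        key = (j, frozenset(potential_parents))
        if key in self.table:
            self.hits += 1
        else:
            self.table[key] = self.local_score(j, potential_parents)
        return self.table[key]

def episode_score(cache, ordering):
    """Episodic reward of a complete ordering: sum of local scores, each
    conditioned on the variables selected at earlier decision steps."""
    return sum(cache(v, ordering[:t]) for t, v in enumerate(ordering))
```

Since reward evaluation dominates the running time (Section 4.4), such memoization directly reduces the most expensive part of training.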

4.3. VARIABLE SELECTION

If an estimated ordering Π is consistent, then we obtain a fully-connected DAG (super-DAG) G_Π of the underlying DAG G. One can then pursue consistent estimation of intervention distributions based on Π without any additional need to find the true underlying DAG G* (Bühlmann et al., 2014). For other purposes, however, we need to recover the true graph from the fully-connected DAG. Several efficient methods exist, such as sparse candidate pruning (Teyssier & Koller, 2005), significance testing of covariates (Bühlmann et al., 2014), the group Lasso (Ravikumar et al., 2009), and its improved version with a sparsity-smoothness penalty proposed by Meier et al. (2009). For linear data models, we apply linear regression to the obtained fully-connected DAG and then threshold to prune edges with small weights, as similarly used by Zheng et al. (2018); Yu et al. (2019); Zhu et al. (2020). For nonlinear models, we follow the pruning process used by Bühlmann et al. (2014); Lachapelle et al. (2020): for each variable X_j, one fits a generalized additive model against the current parents of X_j and then applies significance testing of covariates, declaring significance if the reported p-value is less than or equal to 0.001.
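For the linear case, the regress-then-threshold pruning can be sketched as follows (`prune_linear` is an illustrative name; the 0.3 threshold matches the value used in Section 5.1):

```python
import numpy as np

def prune_linear(X, ordering, threshold=0.3):
    """Variable selection for linear models: regress each variable on all
    of its predecessors in the ordering, then keep only edges whose
    estimated coefficient exceeds the threshold in absolute value."""
    d = X.shape[1]
    W = np.zeros((d, d))  # W[i, j] != 0 encodes a kept edge i -> j
    for t, j in enumerate(ordering):
        parents = ordering[:t]
        if not parents:
            continue
        beta, *_ = np.linalg.lstsq(X[:, parents], X[:, j], rcond=None)
        for i, b in zip(parents, beta):
            if abs(b) > threshold:
                W[i, j] = b
    return W
```

Spurious edges in the super-DAG receive near-zero coefficients in the regression and are removed by the threshold, while true edges survive with their estimated weights.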

4.4. COMPUTATIONAL COMPLEXITY

To learn an ordering, CORL relies on proper training of the RL model. Policy gradient and stochastic gradient are adopted to train the actor and critic, respectively, which is the standard choice in RL (Konda & Tsitsiklis, 2000). Similar to RL-BIC2 (Zhu et al., 2020), CORL requires evaluating the rewards at each episode at O(dm^2 + d^3) computational cost if linear functions are adopted to model the causal relations, but it does not need to compute the matrix exponential term with O(d^3) cost, thanks to the use of ordering search. In addition, CORL formulates causal discovery as a multi-stage decision process, and we observe that CORL needs fewer episodes than RL-BIC2 before the episode reward converges (see Appendix C). We suspect that the significant reduction in the size of the action space reduces the complexity of the RL policy, leading to higher sample efficiency. Evaluating the Transformer encoder and the LSTM decoder in CORL takes O(nd^2) and O(dn^2), respectively. However, we find that computing rewards dominates the total running time (around 95% and 87% for 30- and 100-node linear data models, respectively). Speeding up the reward calculation would help extend our approach to larger problems, which is left as future work. In contrast with typical RL applications, we treat RL here as a search strategy for causal discovery, aiming to find an ordering that achieves the best score and then applying a variable selection method to remove redundant edges. Nevertheless, for the pretraining part, whose goal is a good initialization, we may want sufficient generalization ability and hence consider diverse datasets with different numbers of nodes, noise types, causal relationships, etc.

5. EXPERIMENTS

In this section, we evaluate our methods against a number of baselines on synthetic datasets with linear and non-linear causal relationships, as well as on a real dataset. Specifically, the baselines include ICA-LiNGAM (Shimizu et al., 2006); three ordering-based approaches, L1OBS (Schmidt et al., 2007), CAM (Bühlmann et al., 2014) and A* Lasso (Xiang & Kim, 2013); some recent gradient-based approaches, NOTEARS (Zheng et al., 2018), DAG-GNN (Yu et al., 2019) and GraN-DAG (Lachapelle et al., 2020); and the RL-based approach RL-BIC2 (Zhu et al., 2020). For all compared algorithms, we use their original implementations (see Appendix B.1 for details) and the recommended hyper-parameters unless otherwise stated. The synthetic datasets vary along five dimensions: level of edge sparsity, graph type, number of nodes, causal functions, and sample size. Two graph sampling schemes, Erdös-Rényi (ER) and Scale-Free (SF), are considered in our experiments; we denote d-node ER or SF graphs with on average hd edges as ERh or SFh. Two common metrics are considered: True Positive Rate (TPR) and Structural Hamming Distance (SHD). The former indicates the proportion of true edges that are correctly discovered (Jain et al., 2017); it can hence be used to measure the quality of an ordering, and higher is better. The latter counts the total number of missing, falsely detected, or reversed edges, and smaller is better.
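Over directed edge sets, the two metrics can be computed as follows (a sketch; published SHD implementations differ in details, and we use the common convention that a reversed edge counts as a single error):

```python
def shd(true_edges, est_edges):
    """Structural Hamming Distance over directed edge sets: reversed,
    extra, and missing edges each count once."""
    reversed_ = {(u, v) for (u, v) in est_edges
                 if (v, u) in true_edges and (u, v) not in true_edges}
    extra = {e for e in est_edges if e not in true_edges} - reversed_
    missing = {(u, v) for (u, v) in true_edges
               if (u, v) not in est_edges and (v, u) not in est_edges}
    return len(reversed_) + len(extra) + len(missing)

def tpr(true_edges, est_edges):
    """True Positive Rate: fraction of true edges recovered with the
    correct orientation."""
    return len(set(true_edges) & set(est_edges)) / max(len(true_edges), 1)
```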

5.1. LINEAR DATA MODELS WITH GAUSSIAN AND NON-GAUSSIAN NOISE

We evaluate the proposed methods on Linear Gaussian (LG) data models with equal-variance Gaussian noise and on LiNGAM data models; the true DAGs in both cases are known to be identifiable (Peters & Bühlmann, 2014; Shimizu et al., 2006). We set h ∈ {2, 5} and d ∈ {30, 50, 100} and generate observed data following the procedure of Zheng et al. (2018) (see Appendix B.2 for details). For variable selection, we apply a threshold of 0.3 to the estimated coefficients. Tables 1 & 2 present results only for 30- and 100-node LG data models, since the conclusions do not change for graphs with 50 nodes (see Appendix D for 50-node graphs). We report the performance of the popular ICA-LiNGAM, GraN-DAG and CAM in Appendix D, since they are almost never on par with the best methods presented in this section. CORL-1 and CORL-2 also achieve consistently good results on the LiNGAM datasets, which are reported in Appendix E due to space limits. We now examine Tables 1 & 2 (the values in parentheses represent the standard deviation across datasets per task). Across all settings, CORL-1 and CORL-2 are the best performing methods, both in terms of TPR and SHD, while NOTEARS and DAG-GNN are not far behind. Although both CORL-1 and CORL-2 achieve the desired performance, CORL-2 achieves slightly better SHD than CORL-1. This differs somewhat from the usual understanding that RL learns more easily from dense rewards than from episodic rewards. We take the 100-node LG ER2 data models as an example and show the training reward curves of CORL-1 and CORL-2 in Figure 3. CORL-2 converges faster to a better result than CORL-1, which is consistent with CORL-2 achieving slightly better TPR and SHD than CORL-1. We conjecture that this is because it is difficult for the critic to learn to predict the score accurately for every state, whereas in the episodic reward case it only needs to predict the score accurately for the initial state.
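For reference, data generation for these experiments follows the style of Zheng et al. (2018); the sketch below captures the usual recipe (ER graph, random edge weights, equal-variance Gaussian noise), but the weight range and other parameter names are illustrative, not the exact published configuration:

```python
import numpy as np

def simulate_er_linear_gaussian(d, h, m, w_range=(0.5, 2.0), seed=0):
    """Sample an ERh DAG on d nodes (about h*d edges), draw edge weights
    uniformly from +/-[w_range], and generate m samples from the linear
    SEM X = X W + E with equal-variance Gaussian noise. Illustrative."""
    rng = np.random.default_rng(seed)
    # A strictly lower-triangular mask under a random permutation
    # guarantees acyclicity; edge probability gives ~h*d edges overall.
    p = 2.0 * h / (d - 1)
    B = np.tril(rng.random((d, d)) < p, k=-1).astype(float)
    perm = rng.permutation(d)
    B = B[np.ix_(perm, perm)]
    W = B * rng.uniform(*w_range, size=(d, d)) * rng.choice([-1, 1], size=(d, d))
    E = rng.normal(size=(m, d))
    X = E @ np.linalg.inv(np.eye(d) - W)  # solves X = X W + E
    return X, W
```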

Pretraining

We show the training reward curve of CORL-2-pretrain in Figure 3, i.e., CORL-2 initialized with a model pretrained in the supervised way. The test task is unseen in the pretraining datasets, which contain 30-node LiNGAM ER2, 50-node LiNGAM ER2, 30-node LiNGAM SF2 and 30-node GP ER1 data. Compared to CORL-2 with random initialization (CORL-2), the pretrained model clearly accelerates learning. A consistent conclusion can be drawn from the corresponding experiment with CORL-1; see Appendix G. In addition, we consider using a policy model learned only on a 30-node LiNGAM data model as the pretrained model for the 100-node LG task. We observe that the performance is similar to that obtained with the pretrained model from the supervised way, so we do not repeat the results here.

5.2. NON-LINEAR MODEL WITH GAUSSIAN PROCESS

In this experiment, we use Gaussian Processes (GPs) to model the causal relationships: each causal relation f_j is a function sampled from a GP with a radial basis function kernel of bandwidth one, with standard Gaussian noise; this model is known to be identifiable according to Peters et al. (2014). We set h = 1 and 4 to obtain ER1 and ER4 graphs, respectively, and generate data accordingly (see Appendix B.2 for details). Our method is evaluated with different sample sizes; the results for m = 500 are reported here (see Appendix F for additional results). Note that, due to the cost of the reward calculation, we only experiment on nonlinear data models with up to 30 nodes. The variable selection method used here is the CAM pruning from Bühlmann et al. (2014). The results on 10-node and 30-node datasets with ER1 and ER4 graphs are shown in Figure 3. Here we only consider baselines that are competitive on nonlinear data models, among which CAM is a very strong ordering-based baseline. Although GraN-DAG achieves better results than DAG-GNN, both are worse than CAM overall. We believe this is because 500 samples are too few for GraN-DAG and DAG-GNN to learn a good model. RL-BIC2 performs well on the 10-node datasets but achieves poor results on the 30-node datasets, probably due to its lack of scalability. CAM, CORL-1 and CORL-2 all perform well, with CORL-2 achieving the best results on 10-node graphs and results slightly worse than CAM on 30-node graphs. All of these methods do better on ER1 than on ER4, especially on 30-node graphs.

5.3. REAL DATA

The Sachs dataset (Sachs et al., 2005), whose true graph has 11 nodes and 17 edges, is widely used in research on graphical models. The expression levels of proteins and phospholipids in the dataset can be used to discover the underlying protein signaling network. The observational dataset has m = 853 samples, which we use to discover the causal structure; a GP models the causal relationships in our method. In this experiment, CORL-1, CORL-2 and RL-BIC2 achieve the best SHD of 11. CAM, GraN-DAG and ICA-LiNGAM achieve SHDs of 12, 13 and 14, respectively. DAG-GNN and NOTEARS result in SHDs of 16 and 19, respectively, whereas an empty graph already has an SHD of 17.
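For reference, the SHD values above count the edge additions, deletions and reversals needed to turn the estimated graph into the true one, so an empty estimate scores the number of true edges. Below is a minimal sketch of this metric with edges as (parent, child) pairs and a reversed edge counted as a single error; this is our own illustration, not the paper's evaluation code.

```python
def shd(true_edges, est_edges):
    """Structural Hamming Distance between two DAGs given as edge sets."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    # spurious edges: estimated but absent from the true graph in both directions
    extra = {(i, j) for (i, j) in est_edges - true_edges
             if (j, i) not in true_edges}
    # missing edges: true but absent from the estimate in both directions
    missing = {(i, j) for (i, j) in true_edges - est_edges
               if (j, i) not in est_edges}
    # reversed edges: estimated opposite to the true orientation
    reversed_ = {(i, j) for (i, j) in est_edges
                 if (j, i) in true_edges and (i, j) not in true_edges}
    return len(extra) + len(missing) + len(reversed_)
```

Under this convention, an empty estimate against a 17-edge true graph gives SHD 17, matching the number reported for the Sachs dataset.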

APPENDIX A NETWORK ARCHITECTURES AND HYPER-PARAMETERS

A.1 MULTIPLE NETWORK ARCHITECTURE DESIGNS FOR ENCODER AND DECODER

There are a variety of neural network modules for the encoder and decoder. Here we consider some representative choices: a Multi-Layer Perceptron (MLP) module, an LSTM-based recurrent neural network module, and the self-attention based encoder of the Transformer. In addition, we use the original observed data directly as the state, i.e., no encoder module, to show the necessity of an encoder; this variant is denoted Null. The MLP consists of 3-layer feed-forward neural networks with 256, 512 and 256 units. The LSTM and Transformer architectures are those introduced in Appendix A.2. The empirical results of CORL-2 on 30-node LG ER2 datasets are reported in Table 3. We observe that the LSTM decoder achieves better performance than the MLP decoder, which suggests that the LSTM is more effective than the MLP for sequential decision tasks. Besides, the overall performance with a neural network encoder is better than that of Null, showing that pre-processing the data with an encoder module is necessary. Among all these encoders, the Transformer encoder achieves the best results; a similar conclusion was drawn in Zhu et al. (2020). We hypothesize that the Transformer encoder benefits from its self-attention scheme. Table 3: Empirical results of CORL-2 with different choices of encoder and decoder on 30-node LG ER2 datasets. The smaller the SHD the better, and the higher the TPR the better.


A.2 DETAILS OF NETWORK ARCHITECTURES AND HYPER-PARAMETERS

Both CORL-1 and CORL-2 use the actor-critic algorithm. We use the Adam optimizer with learning rates 1e-4 and 1e-3 for the actor and critic, respectively. The discount factor γ is set to 0.98. The actor consists of an encoder and a decoder.

We illustrate the structure of the Transformer encoder used in our experiments in Figure 5. It consists of a feed-forward layer with 256 units and three blocks. Each block is composed of a multi-head attention network with 8 heads and 2-layer feed-forward neural networks with 1024 and 256 units; each feed-forward layer is followed by a normalization layer. Given a batch of observed samples with shape b × d × n, where b denotes the batch size, d the number of nodes and n the number of observed samples per variable in a batch, the final output of the encoder is a batch of embedded states with shape b × d × 256.

We illustrate the structure of the decoder in Figure 6; it is mainly an LSTM similar to the decoder proposed by Vinyals et al. (2015). The LSTM takes a state as input and outputs an embedding, which is mapped to the action space A by feed-forward neural networks, a soft-max module and the pointer mechanism (Vinyals et al., 2015). The outputs of the encoder are processed into the initial hidden state h_0 of the decoder. The LSTM has 256 hidden units, and all feed-forward neural networks in the decoder have 256 units.

The critic uses 3-layer feed-forward neural networks with 512, 256 and 1 units. It takes a state ŝ as input and outputs a predicted value for the current policy given ŝ. For CORL-1, the critic predicts a score for each state; for CORL-2, it takes the initial state ŝ0 as input and directly outputs a predicted value for a complete ordering.
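The decoding loop can be sketched as follows: at each step, variables already placed in the ordering are masked out, a soft-max over the remaining logits gives a distribution over actions, and the next variable is sampled. The score_fn below is a hypothetical placeholder for the pointer-network logits, not the actual LSTM computation.

```python
import math
import random

def masked_softmax(logits, mask):
    """Soft-max over logits, giving zero probability to masked-out
    (already selected) positions."""
    exps = [math.exp(l) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def decode_ordering(score_fn, d, rng):
    """Stochastically decode a variable ordering of length d.

    score_fn(t, j) stands in for the decoder's logit of choosing
    variable j at step t.
    """
    available = [True] * d
    ordering = []
    for t in range(d):
        logits = [score_fn(t, j) for j in range(d)]
        probs = masked_softmax(logits, available)
        # sample an action (variable) from the masked distribution
        j = rng.choices(range(d), weights=probs)[0]
        ordering.append(j)
        available[j] = False
    return ordering

rng = random.Random(0)
order = decode_ordering(lambda t, j: 0.1 * j, d=5, rng=rng)
```

Masking guarantees each variable appears exactly once, so the output is always a valid ordering regardless of the logits.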

B BASELINES AND DATASETS

B.1 DETAILS OF BASELINES

The details of all the baselines considered in our experiments are listed as follows:
• ICA-LiNGAM assumes a linear non-Gaussian additive model for the data generating procedure and applies independent component analysis to recover the weighted adjacency matrix. This method usually achieves good performance on LiNGAM datasets, but it provides no guarantee for linear Gaussian datasets.foot_0
• NOTEARS recovers the causal graph by estimating the weighted adjacency matrix with a least squares loss and a smooth characterization of the acyclicity constraint.foot_1
• DAG-GNN formulates causal discovery in the framework of a variational autoencoder and optimizes a weighted adjacency matrix with the evidence lower bound and a modified smooth characterization of acyclicity as the loss function.foot_2
• GraN-DAG models the conditional distribution of each variable given its parents with feed-forward neural networks. It uses the smooth acyclicity constraint from NOTEARS to find a DAG that maximizes the log-likelihood of the observed samples.foot_3
• RL-BIC2 formulates causal discovery as a one-step decision making process and uses the score function and acyclicity constraint from NOTEARS to calculate the reward for the recovered directed graph.foot_4
• CAM conducts a greedy estimation procedure that starts with an empty DAG and at each iteration adds the edge (v_k, v_j) between nodes v_k and v_j that yields the largest gain in log-likelihood. A searched ordering is pruned to the final DAG by applying significance testing of covariates. CAM also performs preliminary neighborhood selection to reduce the size of the searched ordering space.foot_5
• L1OBS performs heuristic search (greedy hill-climbing with tabu lists) through the space of topological orderings to find an ordering with the best score, then uses L1 variable selection to prune the searched ordering (a fully-connected DAG) to the final DAG.foot_6
• A* Lasso with a limited queue size incorporates a heuristic scheme into a dynamic programming based method. It first prunes the search space using A* Lasso, then further prunes it by limiting the size of the priority queue in the OPEN list of A* Lasso. The queue size usually needs to be tuned to balance runtime against solution quality.foot_7
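The ordering-to-DAG step shared by the ordering-based methods above can be sketched as follows. An ordering implies a fully-connected DAG (every earlier variable is a candidate parent of every later one), which is then pruned. As a simplified stand-in for L1 variable selection or significance testing, this illustration keeps an edge only when a univariate least-squares coefficient exceeds a magnitude threshold; the threshold value is an arbitrary choice.

```python
import random

def prune_ordering(ordering, data, thresh=0.3):
    """Prune the fully-connected DAG implied by a variable ordering.

    Simplified stand-in for L1 variable selection / significance testing:
    keep edge (k -> j) only if the least-squares coefficient when x_j is
    regressed on x_k alone exceeds `thresh` in magnitude.
    """
    n = len(data)
    cols = list(zip(*data))  # cols[j] holds all samples of variable j

    def ols_coef(x, y):
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return sxy / sxx if sxx else 0.0

    edges = set()
    for pos, j in enumerate(ordering):
        for k in ordering[:pos]:  # only earlier variables can be parents
            if abs(ols_coef(cols[k], cols[j])) > thresh:
                edges.add((k, j))
    return edges

# toy data: x1 depends strongly on x0, x2 is independent noise
rng = random.Random(0)
data = []
for _ in range(200):
    x0, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append([x0, 2.0 * x0 + 0.1 * rng.gauss(0, 1), x2])
edges = prune_ordering([0, 2, 1], data)
```

By construction every returned edge respects the ordering, so the pruned graph is always acyclic.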

B.2 DETAILS ON DATASETS GENERATION

We generate synthetic datasets that vary along five dimensions: level of edge sparsity, graph type, number of nodes, causal functions and sample size. For each task we sample 5 datasets with the required number of examples as follows: a ground-truth DAG G is drawn randomly from either the Erdös-Rényi (ER) or Scale-Free (SF) graph model; then, the data are generated according to a specific sampling scheme. Specifically, for the Linear Gaussian (LG) case, we set h ∈ {2, 5} and d ∈ {30, 50, 100} to obtain ER and SF graphs (different types) with different levels of edge sparsity and different numbers of nodes. Then 5 datasets of 3000 examples are generated for each task following X = W^T X + ε, where W ∈ R^{d×d} denotes the weight matrix obtained by assigning edge weights independently from Unif([-2, -0.5] ∪ [0.5, 2]). Note that the ε's are standard Gaussian noises with equal variances across variables, so that the LG data model is identifiable (Peters & Bühlmann, 2014). For the LiNGAM data model, the datasets are generated in a similar way to LG, but the sampling of ε differs: following Shimizu et al. (2006), non-Gaussian noises are obtained by passing Gaussian noise samples through a power nonlinearity. LiNGAM is shown to be identifiable in Shimizu et al. (2006). Another data generating process is the GP. We first obtain graphs with different densities and numbers of nodes; then datasets with different sample sizes are generated following X_j = f_j(Pa(X_j)) + ε_j with jointly independent Gaussian noises for all j, where each f_j is sampled from a GP with a radial basis function kernel of bandwidth one. This setting is known to be identifiable according to Peters et al. (2014). Note that, due to the cost of the reward calculation, we only experimented on nonlinear data models with up to 30 nodes.
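The LG sampling scheme above can be sketched in a few lines: orienting ER edges from lower to higher index yields a DAG, edge weights are drawn from Unif([-2, -0.5] ∪ [0.5, 2]), and X = W^T X + ε is evaluated variable by variable in topological order. This is our own pure-Python illustration, not the released generation code.

```python
import random

def random_er_dag(d, p, rng):
    """Random DAG: orient each Erdos-Renyi edge from lower to higher index."""
    return {(i, j) for i in range(d) for j in range(i + 1, d)
            if rng.random() < p}

def sample_lg(d, edges, n, rng):
    """Draw n samples from X = W^T X + eps in topological order (0..d-1),
    with weights from Unif([-2,-0.5] U [0.5,2]) and standard Gaussian noise."""
    w = {}
    for e in edges:
        mag = rng.uniform(0.5, 2.0)
        w[e] = mag if rng.random() < 0.5 else -mag
    data = []
    for _ in range(n):
        x = [0.0] * d
        for j in range(d):  # parents have smaller index, so already sampled
            x[j] = sum(w[(i, j)] * x[i] for i in range(j) if (i, j) in edges)
            x[j] += rng.gauss(0.0, 1.0)
        data.append(x)
    return data

rng = random.Random(0)
edges = random_er_dag(5, 0.4, rng)
samples = sample_lg(5, edges, 100, rng)
```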

C TOTAL NUMBER OF EPISODES BEFORE CONVERGENCE

Table 4 reports the total number of episodes required for CORL-2 and RL-BIC2 to converge, averaged over five datasets. Note that the episodic reward is evaluated once per episode. CORL formulates causal discovery as a multi-stage decision process, and we observe that CORL requires fewer episodes than RL-BIC2 before the episodic reward converges. We suspect that the significant reduction in the size of the action space reduces the complexity of the RL policy model and thus leads to higher sample efficiency. Some runtimes are also provided here (CORL-2 total runtime ≈ 15 minutes against RL-BIC2 ≈ 3 hours for 30-node ER2 graphs, ≈ 4 hours against ≈ 14 hours for 50-node ER2 graphs, and CORL-2 ≈ 7 hours for 100-node ER2 graphs). We set the maximal runtime to 24 hours; RL-BIC2 did not converge within that time on 100-node graphs, hence we do not report it. Note that these runtimes may be significantly reduced by parallelizing the evaluation of the reward.

CORL-2: 1.0 (0.3), 1.1 (0.4), 1.9 (0.3), 2.4 (0.3), 2.3 (0.5), 2.9 (0.4)
RL-BIC2: 3.9 (0.5), 4.1 (0.6), 3.4 (0.4), 3.5 (0.5), ×, ×

D ADDITIONAL RESULTS ON LINEAR GAUSSIAN DATASETS

The results for 50-node LG data models are presented in Table 5; the conclusions are similar to those of the 30- and 100-node experiments. The results of ICA-LiNGAM, GraN-DAG and CAM on LG data models are presented in Table 6. Their performance does not compare favorably to CORL-1 or CORL-2 on LG datasets. It is not surprising that ICA-LiNGAM does not perform well, because the algorithm is specifically designed for non-Gaussian noise and provides no guarantee for LG data models. We hypothesize that CAM's poor performance on LG data models stems from its use of nonlinear rather than linear regression. As for GraN-DAG, it uses 2-layer feed-forward neural networks to model the causal relationships, which may be unable to learn a good linear relationship in this experiment.

E RESULTS ON LINGAM DATASETS

Here, we report the empirical results on 30-, 50- and 100-node LiNGAM datasets in Table 7. The observed samples are generated according to the same procedure as for the linear Gaussian datasets (see Appendix B.2 for details); the non-Gaussian noise is obtained by passing Gaussian noise samples through a power nonlinearity. For L1OBS, we increased the authors' recommended number of evaluations from 2500 to 10000. For A* Lasso, we set the queue size to 10, 500 and 1000, and report the best result over these settings. The results of L1OBS and A* Lasso reported here use our pruning method. For the other baselines, we use the recommended hyper-parameters. Among all these algorithms, ICA-LiNGAM can recover the true graph on most of the LiNGAM data models, because it is specifically designed for non-Gaussian noise. CORL-1 and CORL-2 achieve consistently good results compared with the other baselines.

F RESULTS ON 20-NODE GP DATASETS WITH DIFFERENT SAMPLE SIZES

We take the 20-node GP data models as an example to show the performance of our method with different sample sizes. We set h = 4 to obtain ER4 graphs and generate data accordingly. We present the empirical results in Table 8. Since previous experiments have shown that CORL-2 is slightly better than CORL-1, we only report the results of CORL-2 here. As CAM is the most competitive baseline, we also report its results on these datasets. The TPR reported here is calculated based on the variable ordering. We can see that, as the sample size decreases, CORL-2 ends up outperforming CAM. We believe this is because CORL-2 benefits from the exploratory ability of RL.
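One plausible reading of a TPR "calculated based on the variable ordering", used here purely as our own illustration, treats a true edge (i -> j) as recoverable whenever i precedes j in the ordering, since pruning the implied fully-connected DAG can then still retain that edge:

```python
def ordering_tpr(true_edges, ordering):
    """Fraction of true edges (i -> j) whose parent i precedes child j
    in the given ordering; an assumed reading of ordering-based TPR."""
    if not true_edges:
        return 1.0  # vacuously perfect
    pos = {v: t for t, v in enumerate(ordering)}
    recoverable = sum(1 for (i, j) in true_edges if pos[i] < pos[j])
    return recoverable / len(true_edges)
```

This isolates the quality of the learned ordering from the quality of the subsequent pruning step.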

G CORL-1 WITH A PRETRAINED MODEL

We show the training reward curves of CORL-1 and CORL-1-pretrain, which is CORL-1 based on a pretrained model, in Figure 7. To obtain a pretrained model with good generalization ability, we combine various types of data described in Appendix B.2 with different levels of edge sparsity, graph



foot_0: https://sites.google.com/site/sshimizu06/lingam
foot_1: https://github.com/xunzheng/notears
foot_2: https://github.com/fishmoon1234/DAG-GNN
foot_3: https://github.com/kurowasan/GraN-DAG
foot_4: https://github.com/huawei-noah/trustworthyAI/tree/master/Causal_Structure_Learning/Causal_Discovery_RL
foot_5: https://cran.r-project.org/web/packages/CAM
foot_6: https://www.cs.ubc.ca/~murphyk/Software/DAGlearn/
foot_7: http://www.cs.cmu.edu/~jingx/software/AstarLasso.zip



Figure 3: Learning curves of CORL-1, CORL-2 and CORL-2-pretrain on a 100-node linear Gaussian dataset.

Figure 4: The empirical results on GP data models with 10 and 30 nodes.

Figure 5: Illustration of the Transformer encoder. The encoder embeds the observed data x_j of each variable j into the state s_j. Notation block@3 denotes three blocks here.

Figure 6: Illustration of the LSTM decoder. At each time step, it maps the state ŝ_t to a distribution over the action space A := {a_1, . . . , a_d}; an action (variable) can then be selected randomly according to the distribution.

Store r_t in D_score to avoid repeated computations.

Table 1: Empirical results for ER and SF graphs of 30 nodes with LG data.

Table 2: Empirical results for ER and SF graphs of 100 nodes with LG data.

Table 4: Total number of iterations (×10^3) before RL converges on LG data.

Table 5: Empirical results for ER and SF graphs of 50 nodes with LG data. The higher the TPR the better, the smaller the SHD the better.

Table 6: Empirical results of ICA-LiNGAM, GraN-DAG and CAM (against CORL-2 for reference) for ER and SF graphs with LG data. The higher the TPR the better, the smaller the SHD the better.

Table 7: Empirical results on 30-, 50- and 100-node LiNGAM ER2 datasets. The smaller the SHD the better, the higher the TPR the better.

6. CONCLUSION

In this work, we have proposed an RL-based approach for causal discovery named CORL. It searches the space of variable orderings instead of the space of directed graphs. We formulate ordering search as an MDP and propose CORL-1 and CORL-2 for training the ordering generating model. A generated ordering is then pruned by variable selection to obtain the final DAG. The empirical results on synthetic and real datasets show that our approach is promising.


type and number of nodes, and data generating process to construct observation samples. Next, we train a policy model by supervised learning on the mixed datasets. Finally, we use the pretrained model as a starting point on a task never seen during pretraining. From Figure 7, we can observe that although the pretrained model is trained on other types of datasets, it still accelerates training when used as the initial model on the LG dataset task. This suggests that our policy model may learn some implicit knowledge that transfers across tasks.

