SCORE-BASED CAUSAL DISCOVERY FROM HETEROGENEOUS DATA

Abstract

Causal discovery has witnessed significant progress over the past decades. Most algorithms in causal discovery consider a single domain with a fixed distribution. However, it is commonplace to encounter heterogeneous data (data from different domains with distribution shifts). Applying existing methods on such heterogeneous data may lead to spurious edges or incorrect directions in the learned graph. In this paper, we develop a novel score-based approach for causal discovery from heterogeneous data. Specifically, we propose a Multiple-Domain Score Search (MDSS) algorithm, which is guaranteed to find the correct graph skeleton asymptotically. Furthermore, benefiting from distribution shifts, MDSS enables the detection of more causal directions than previous algorithms designed for single domain data. The proposed MDSS can be readily incorporated into off-the-shelf search strategies, such as the greedy search and the policy-gradient-based search. Theoretical analyses and extensive experiments on both synthetic and real data demonstrate the efficacy of our method.

1. INTRODUCTION

Discovering causal relations among variables is a fundamental problem in various fields such as economics, biology, drug testing, and commercial decision making. Because conducting randomized controlled trials is usually expensive or even infeasible, discovering causal relations from observational data, i.e. causal discovery (Pearl, 2000; Spirtes et al., 2000) , has received much attention over the past few decades. Early causal discovery algorithms can be roughly categorized into two types: constraint-based ones (e.g. PC (Spirtes et al., 2000) ) and score-based ones (e.g. GES (Chickering, 2002) ). In general, these methods cannot uniquely identify the causal graph but are guaranteed to output a Markov equivalence class. Since the seminal work by Shimizu et al. (2006) , several methods have been developed, achieving identifiability of the whole causal structure by making use of constrained Functional Causal Models (FCMs), including the linear non-Gaussian model (Shimizu et al., 2006) , the nonlinear additive noise model (Hoyer et al., 2009) , and the post-nonlinear model (Zhang & Hyvärinen, 2009) . Recently, Zheng et al. (2018) proposed a score-based method that formulates the causal discovery problem as continuous optimization with a structural constraint that ensures acyclicity. Based on the continuous structural constraint, several researchers further proposed to model the causal relations by neural networks (NNs) (Lachapelle et al., 2019; Yu et al., 2019; Zheng et al., 2019) . Another recent work Zhu & Chen (2019) used reinforcement learning (RL) for causal discovery, where the RL agent searches over the graph space and outputs a graph that fits the data best. The above approaches are designed for data from a single domain with a fixed causal model, with the limitation that many of the edge directions cannot be determined without strong functional constraints. 
In addition, the sample size of data from one domain is usually not large enough to guarantee small statistical estimation errors. One way to improve statistical reliability is to combine datasets from multiple domains, as in P-value meta-analyses (Lee, 2015; Marot et al., 2009). The idea of combining multiple-domain data is also common in learning with mixtures of Bayesian networks (Thiesson et al., 1998). However, while mixtures of Bayesian networks are usually used for density estimation, the purpose of causal analysis from multiple-domain data is completely different: it aims at discovering the underlying causal graphs for all domains. Regarding causal analysis from multiple-domain data, a key challenge is data heterogeneity: the data distribution may vary across domains. For example, in fMRI hippocampus signal analysis, the connection strength among different brain regions may change across subjects (domains). Due to the distribution shift, directly pooling the data from multiple domains may lead to spurious edges. To tackle this issue, different approaches have been investigated, including sliding windows (Calhoun et al., 2014), online change point detection (Adams & MacKay, 2007), online undirected graph learning (Talih & Hengartner, 2005), locally stationary structure tracking (Kummerfeld & Danks, 2013), and regime-aware learning (Bendtsen, 2016). However, these methods may suffer from high estimation variance due to sample scarcity, large type II errors, and a large number of statistical tests. Huang et al. (2015) recover causal relations with changing modules by exploiting certain types of smoothness of the change, but their method does not explicitly locate the changing causal modules. Other related methods, including Xing et al. (2010) and Song et al. (2009), reduce to online parameter learning because the causal directions are given.
By utilizing the invariance property (Hoover, 1990; Tian & Pearl, 2001; Peters et al., 2016) and the more general independent change mechanism (Pearl, 2000), Ghassami et al. (2018) recently developed two methods, identical boundaries (IB) and minimal changes (MC), for causal discovery from multi-domain data. However, these methods 1) assume causal sufficiency (i.e., all common causes of variables are measured), which usually does not hold in real circumstances, 2) are designed for linear systems only, and 3) are not capable of identifying causal directions from more than ten domains. Huang et al. (2019) proposed a more general approach called CD-NOD for both linear and nonlinear heterogeneous data, extending the PC algorithm to tackle the heterogeneity issue. However, inheriting the drawbacks of constraint-based methods, CD-NOD involves a multiple testing problem and is time-consuming due to a large number of independence tests. To overcome the limitations of existing work, we propose a Multiple-Domain Score Search (MDSS) method for causal discovery from heterogeneous data, which enjoys the following properties. (1) To avoid spurious edges when combining multi-domain data, MDSS searches over the space of augmented graphs, which include an additional domain index as a surrogate variable to characterize the distribution shift. (2) The changing causal modules can be immediately identified from the recovered augmented graph. (3) Benefiting from causal invariance and the independent change mechanism, MDSS uses a novel Multiple-Domain Score (MDS) to identify more causal directions beyond those in the Markov equivalence class from distribution-shifted data. (4) MDSS can be readily incorporated into off-the-shelf search strategies, is time-efficient, and is applicable to both linear and nonlinear data.
(5) Theoretically, we show that MDSS is guaranteed to find the correct graph skeleton asymptotically and to identify more causal directions than traditional score-based and constraint-based algorithms. Empirical studies on both synthetic and real data demonstrate the efficacy of our method.

2. METHODOLOGY

In this section, we start with a brief introduction to causal discovery and distribution shifts (Section 2.1), and then introduce our proposed Multiple-Domain Score Search (MDSS) in Sections 2.2 and 2.3. In Section 2.2, MDSS starts with a predefined graph search algorithm to learn the skeleton of the causal graph, using the linear Bayesian information criterion (BIC) score or the nonlinear generalized score (GS (Huang et al., 2018)) on the augmented causal system. Then in Section 2.3, MDSS further identifies causal directions with the Multiple-Domain Score (MDS), based on the skeleton identified in Section 2.2. Both theoretically and empirically, we show that MDSS can identify more directions than algorithms designed for i.i.d. or stationary data.

2.1. BACKGROUND IN CAUSAL DISCOVERY AND DISTRIBUTION SHIFTS

The basic causal discovery problem can be formulated as follows. Suppose there are d observable random variables V = (V_1, ..., V_d). Each variable satisfies the generating process V_i = f_i(PA_i, ε_i), where f_i is a function that models the causal relation between V_i and its parents PA_i, and ε_i is a noise variable with non-zero variance. All noise variables are mutually independent. The task of causal discovery is to recover the causal adjacency matrix B given the observed data matrix X ∈ R^{T×d}, where B_ij = 1 indicates that V_i is a parent of V_j, and T is the sample size. We denote the underlying causal graph over V as G_0. For each V_i, we call P(V_i | PA_i) its causal module. For a single domain, the joint probability factorizes as P(V) = ∏_{i=1}^d P(V_i | PA_i). Suppose there are n domains with distribution shifts (i.e., P(V) changes across domains), which implies that some causal modules change across domains. The changes may be caused by variation of the functional models, causal strengths, or noise variances. Furthermore, we make the following assumptions.

Assumption 1. The changes of causal modules can be represented as functions of the domain index C, denoted by g(C).

Assumption 2. There is no confounder within each single dataset, but we allow the changes of different causal modules to be dependent.

Remark: If changes in several causal modules are dependent, they can be regarded as special "confounders" that simultaneously affect these causal modules. As a consequence of such confounders, causal discovery algorithms designed for i.i.d. or stationary data may output erroneous edges (see Section 3.1 for an illustration). Thus, causal discovery from multiple-domain data with distribution shifts (i.e., heterogeneous data) can be much harder than from single-domain data.
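To make the setup concrete, the following sketch (our own illustration with hypothetical variable names and mechanisms, not the paper's code) generates heterogeneous data from a three-variable chain in which only one causal module changes with the domain index C:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_domain(b, T=100):
    """One domain of the chain V1 -> V2 -> V3.

    Mechanisms f_i and noise variances are shared across domains;
    only the intercept b of V2's module changes with the domain
    index C, so P(V2|V1) is the (single) changing causal module.
    """
    e = rng.normal(size=(T, 3))
    V1 = e[:, 0]
    V2 = 0.8 * V1 + b + e[:, 1]   # changing module P(V2 | V1)
    V3 = -0.5 * V2 + e[:, 2]      # invariant module P(V3 | V2)
    return np.column_stack([V1, V2, V3])

# n = 5 domains; pooling appends the surrogate variable C.
domains = [generate_domain(b=rng.uniform(-2, 2)) for _ in range(5)]
C = np.repeat(np.arange(5), 100)
X_aug = np.column_stack([np.vstack(domains), C])  # shape (nT, d+1)
```

Pooling the domains without the C column would make V2's marginal distribution a mixture, which is exactly the situation that produces spurious edges for methods assuming a fixed distribution.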

2.2. SKELETON ESTIMATION ON AUGMENTED GRAPHS

With Assumptions 1 and 2, it is natural to consider g(C) as an extra variable in order to remove any potential influence caused by these special confounders. We assume that there are L such confounders (g_1(C), ..., g_L(C)). The causal relation between each observable variable V_i and its parents PA_i can be formalized as V_i = f_i(PA_i, g^i(C), θ_i(C), ε_i), (1) where g^i(C) ⊆ {g_l(C)}_{l=1}^L is the set of confounders that influence V_i, and θ_i(C) denotes the effective parameters in V_i's causal module, which are also assumed to be functions of C and mutually independent across variables. Let G_0 be the underlying causal graph over V. We denote by G_aug the graph over V ∪ {g_l(C)}_{l=1}^L ∪ {θ_i(C)}_{i=1}^d that results from adding arrows g^i(C) → V_i and θ_i(C) → V_i to G_0 for each V_i in V. We call G_aug an augmented graph (see Figure 1(d) for an example), which satisfies the following assumption.

Assumption 3. The joint distribution over V ∪ {g_l(C)}_{l=1}^L ∪ {θ_i(C)}_{i=1}^d is Markov and faithful to G_aug.

To remove the potential influence from confounders and recover causal relations from multiple domains, one way is to perform causal discovery on the augmented graph. While {g_l(C)}_{l=1}^L and {θ_i(C)}_{i=1}^d are not directly observed, we take C as a surrogate variable (Huang et al., 2019) for them, because C is always available as a domain index. Given Assumptions 1, 2 and 3, one can apply any score-based method over V ∪ {C} to recover the causal relations among the variables V as if {g_l(C)}_{l=1}^L ∪ {θ_i(C)}_{i=1}^d were known. For simplicity, we refer to the graph over V ∪ {C} as the augmented graph as well. Since C is the domain index, P(C) follows a discrete uniform distribution. Correspondingly, the generating process of non-stationary data can be described as follows: first we generate random values from P(C), and then we generate data points over V according to the SEM in Equation 1.
Finally, the generated data points are sorted in ascending order according to the values of C (i.e., data points with the same value of C are regarded as belonging to the same domain). In other words, we observe the distribution P(V|C), where P(V|C) may change across different values of C, resulting in non-stationary data. Note that if we do not include C in the system explicitly, the samples of V are not i.i.d. However, after explicitly including the domain index C, P(V, C) is fixed, and thus the pooled data are i.i.d. samples from the distribution P(V, C). Before stating our main result, we first give the definitions of globally and locally consistent scoring criteria, which will be used throughout the paper.

Definition 1 (Globally Consistent Scoring Criterion). Let D be a dataset consisting of T i.i.d. samples from some distribution P(·), and let H and G be any DAGs. A scoring criterion S is globally consistent if the following two properties hold as T → ∞: 1. If H contains P and G does not contain P, then S(H, D) > S(G, D). 2. If H and G both contain P, and G contains fewer parameters than H, then S(H, D) < S(G, D).

Definition 2 (Locally Consistent Scoring Criterion). Let D be a dataset consisting of T i.i.d. samples from some distribution P(·). Let G be any DAG, and let G′ be the DAG that results from adding the edge V_i → V_j to G. A scoring criterion S(G, D) is locally consistent if the following two properties hold as T → ∞: 1. If V_j is not independent of V_i given PA_j^G, then S(G′, D) > S(G, D). 2. If V_j is independent of V_i given PA_j^G, then S(G′, D) < S(G, D).

It has been shown that the BIC score and the GS score are both globally and locally consistent (Chickering, 2002; Huang et al., 2018). The procedure for skeleton estimation on augmented graphs is described in Algorithm 1. The predefined graph search algorithms will be discussed in Section 2.4.
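As an illustration of local consistency, the following sketch (a simplified stand-in we wrote for exposition, not the paper's implementation) computes a linear-Gaussian BIC score for a candidate DAG and checks that adding a truly present edge increases the score:

```python
import numpy as np

def bic_score(X, parents):
    """Linear-Gaussian BIC score of a DAG on data X.

    `parents[j]` lists the parent columns of variable j. The score is
    sum_j [loglik_j - 0.5 * (#params_j) * ln T]; higher is better,
    matching the locally consistent criterion in the text.
    """
    T, d = X.shape
    score = 0.0
    for j in range(d):
        pa = parents[j]
        r = X[:, j]
        if pa:
            P = X[:, pa]
            beta, *_ = np.linalg.lstsq(P, r, rcond=None)
            r = r - P @ beta
        var = r.var()
        loglik = -0.5 * T * (np.log(2 * np.pi * var) + 1)
        score += loglik - 0.5 * (len(pa) + 1) * np.log(T)
    return score

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 2.0 * x + rng.normal(size=2000)
X = np.column_stack([x, y])
# Adding the true edge x -> y raises the score: the dependence
# between x and y makes the extra parameter worth its BIC penalty.
assert bic_score(X, {0: [], 1: [0]}) > bic_score(X, {0: [], 1: []})
```

Note that in the linear-Gaussian case both orientations of a single edge score equally (they are Markov equivalent), which is why the paper needs the surrogate variable C and the MDS penalty to go beyond the equivalence class.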
Apart from the recovered skeleton over V, the changing modules can be detected as well, in Step 4 of Algorithm 1. It is important to note that we allow causal relations to be either linear or nonlinear. If they are nonlinear, we apply GS as the score function. When they are linear, although we could also use GS, we use linear BIC instead because it is less prone to overfitting on linear data and is computationally more efficient.

Algorithm 1 Skeleton Search on Augmented Graph
Input: n datasets, each with T observations, d variables, and domain index C.
Output: skeleton S of G_aug's subgraph G_1 over V, and the variables V_C ⊆ V that are connected with C.
1: Pool all datasets with an extra surrogate variable C to form a data matrix X ∈ R^{nT×(d+1)}.
2: Use the predefined graph search algorithm with BIC or GS plus acyclicity constraints to recover the augmented graph, eliminating any direction V_i → C with the prior that no variable V_i affects the domain index. This step yields the recovered augmented graph Ĝ_aug.
3: Extract the subgraph G_1 of Ĝ_aug over V and its skeleton S.
4: Output the variables V_C that are adjacent to C in Ĝ_aug; their causal modules are the changing ones.

Proof of the theorem is given in Appendix A.1. Intuitively, the theorem means that, by maximizing the score, we obtain an augmented graph in the same Markov equivalence class as the true augmented graph.

2.3. CAUSAL DIRECTION DETERMINATION BY MULTIPLE-DOMAIN-SCORE

For each variable V_k ∈ V_C, we show that it is possible to determine the directions of the edges that connect to V_k. Let V_l denote any variable that connects to V_k. There are two possible cases. 1) V_l ∉ V_C. In this case, C − V_k − V_l forms an unshielded triple. It is natural to incorporate the prior that C → V_k (i.e., the change of domain leads to the distribution shift of V_k). There are then two possible patterns: C → V_k → V_l and C → V_k ← V_l, which we denote as P and P′ respectively. For P, the causal mechanism P(effect|cause) is invariant while P(cause) changes. For P′, P(cause) is invariant while the causal mechanism P(effect|cause) changes, which is complementary to the invariance of causal mechanisms. The direction between V_k and V_l can be determined as long as a globally consistent score is used. Specifically, suppose P is the true causal pattern underlying the generative distribution; then the score of P will be larger than that of P′ if the score function is globally consistent and decomposable, because compared with P, P′ eliminates a conditional independence (C ⊥⊥ V_l | V_k) that actually holds in the generative distribution. This direction determination is not achievable for algorithms designed for stationary data from a single domain (because the domain index cannot be used as an additional variable in that case). To utilize the prior, we simply eliminate any direction V_i → C, as described in Step 2 of Algorithm 1. Figure 1(a) gives a graphical illustration of this case. 2) V_l ∈ V_C. In this case, both V_k and V_l are connected to C, which is much more difficult than case 1). We propose a novel multiple-domain score (MDS) that utilizes the property of independent changes of causal modules to determine causal directions based on the causal skeleton derived from Algorithm 1. To explain the idea, we take the two-variable case as an example.
Here we assume the true causal direction is V_1 → V_2. Figure 1(b) stands for the case where θ_1 and θ_2 are independent (we drop the domain index C in the notation for simplicity); in other words, P(V_1; θ_1) and P(V_2|V_1; θ_2) change independently. If the recovered direction is reversed (see Figure 1(c)), we factorize the joint distribution as P(V_1, V_2; θ′_1, θ′_2) = P(V_2; θ′_2) P(V_1|V_2; θ′_1), where θ′_2 and θ′_1 are assumed to be sufficient for P(V_2) and P(V_1|V_2) respectively. Since V_1 ← V_2 is not the true direction, θ′_1 and θ′_2 are not independent; they are determined jointly by θ_1 and θ_2. Based on this observation, the causal direction can be determined by comparing the dependence between θ_1 and θ_2 with the dependence between θ′_1 and θ′_2, and choosing the direction with the smaller dependence.

Figure 1: (a) In skeleton estimation on the augmented graph, we force the direction to be C → V_i if the algorithm finds a link between them. (b)(c) The true causal graph over V_1 and V_2 (with parameters θ_1(C), θ_2(C)) and the graph with reversed causal direction (with parameters θ′_1(C), θ′_2(C)) in the two-variable example. (d) A two-variable example where a confounder g(C) exists.

For linear systems, the dependence can be described with covariance. θ_1 and θ_2 can be easily obtained by regressing V_1 and V_2 on their respective parents. We first perform the regression within each domain and then calculate the covariance between θ_1(C) and θ_2(C) across domains. When more than two variables are connected to C, we denote this set of variables by V_C with cardinality m. For each variable V_C^k ∈ V_C and its parents PA_C^k ⊆ V_C, we calculate the sum of the dependence between the parameters of V_C^k's causal module and the parameters of the causal module of each variable in PA_C^k.
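The linear two-variable case can be sketched as follows (our own toy simulation; the specific module parameterization — intercept, slope, residual variance — and the changing quantities are illustrative choices, not the paper's exact setup). The mean of V_1 and the noise variance of V_2 change independently across domains, so the dependence between the per-domain module parameters is smaller in the true direction than in the reversed one:

```python
import numpy as np

rng = np.random.default_rng(3)

def module_params(y, x):
    """Per-domain parameters of the module P(y|x): intercept, slope,
    residual variance. x=None gives the marginal mean and variance."""
    if x is None:
        return np.array([y.mean(), y.var()])
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return np.array([coef[0], coef[1], resid.var()])

def dependence(thetas_a, thetas_b):
    """Sum of |cov| over every parameter pair of two modules, i.e.
    the covariance penalty of MDS_linear (lambda_3 omitted)."""
    A, B = np.array(thetas_a), np.array(thetas_b)
    return sum(abs(np.cov(A[:, i], B[:, j])[0, 1])
               for i in range(A.shape[1]) for j in range(B.shape[1]))

# True model V1 -> V2: the mean of V1 and the noise scale of V2
# change independently across 100 domains.
fwd1, fwd2, rev1, rev2 = [], [], [], []
for _ in range(100):
    mu, s = rng.uniform(1, 3), rng.uniform(0.7, 1.4)
    V1 = mu + rng.normal(size=200)
    V2 = V1 + s * rng.normal(size=200)
    fwd1.append(module_params(V1, None)); fwd2.append(module_params(V2, V1))
    rev2.append(module_params(V2, None)); rev1.append(module_params(V1, V2))

# Independent changes make the causal direction the less dependent one.
assert dependence(fwd1, fwd2) < dependence(rev1, rev2)
```

In the reversed factorization, the intercept of the V_1|V_2 regression mixes both changing quantities, which is exactly the induced dependence the MDS penalty detects.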
To incorporate the minimization of this dependence into a score-based method, we propose the MDS for linear systems: MDS_linear = (1/n) Σ_{i=1}^n (d ln T_i − 2 ln L_i) + λ_1 I(G ∉ DAGs) + λ_2 h(A) + (λ_3/m) Σ_{k=1}^m |cov(θ_{V_C^k}, θ_{PA_C^k})|, (3) where n, d, T_i and L_i represent the number of domains, the number of variables (assumed the same for all domains), the sample size, and the maximized likelihood for domain i, respectively. λ_1, λ_2 and λ_3 are regularization coefficients; λ_3 is fixed to 0.001 in our experiments, while λ_1 and λ_2 are adjusted dynamically during training (see Zhu & Chen (2019) for how λ_1 and λ_2 are adjusted). A is the weighted adjacency matrix recovered by the algorithm, m is the number of nonstationary variables, and cov(·) is the covariance operator. The first term in Equation 3 is the average BIC over the n domains; the second and third terms are the acyclicity constraints proposed by Zhu & Chen (2019) to narrow the search space down to DAGs (see Appendix A.2 for details). For nonlinear systems, θ cannot be calculated explicitly. In the two-variable case, the dependence between θ_1 and θ_2 can be characterized by the dependence between P(V_1) and P(V_2|V_1), under the assumption that θ_1 and θ_2 are sufficient for the corresponding distribution modules. To calculate the dependence between P(V_1) and P(V_2|V_1), Huang et al. (2019) propose to first use kernel embeddings of the distributions P(V_1) and P(V_2|V_1), and then measure their dependence with an extended Hilbert Schmidt Independence Criterion (HSIC (Gretton et al., 2008)) in a Reproducing Kernel Hilbert Space (RKHS). When more than two variables are connected to C, for each variable V_C^k ∈ V_C and its parents PA_C^k ⊆ V_C, we calculate the dependence between P(PA_C^k) and P(V_C^k | PA_C^k).
We propose the corresponding MDS for nonlinear systems by integrating this dependence with GS: MDS_nonlinear = (1/n) Σ_{i=1}^n GS_i + λ_1 I(G ∉ DAGs) + λ_2 h(A) + (λ_3/m) Σ_{k=1}^m HSIC(μ_{V_C^k | PA_C^k}, μ_{PA_C^k}), (4) where GS_i is the generalized score for domain i, and μ_{V_C^k | PA_C^k} and μ_{PA_C^k} are the kernel embeddings of the distributions P(V_C^k | PA_C^k) and P(PA_C^k) respectively. HSIC(·) is the HSIC operator that measures the dependence between two random variables. See Appendix A.3, A.4 and A.5 for brief descriptions of GS, kernel embeddings of distributions, and HSIC respectively.

Degeneration issue. If we apply search strategies over the entire space of graphs over V to optimize MDS, the MDS penalty (the fourth term in Equations 3 and 4) tends to eliminate all edges between each V_C^k and its parents, because the dependence between θ_{V_C^k} and the empty set (i.e., no parents for V_C^k) is 0. We call this the degeneration issue. To tackle it, we optimize the MDS score based on the skeleton of the graph G_1 from Algorithm 1. Specifically, we fix the skeleton S of G_1 and apply search strategies over the space defined by S to optimize the MDS score; in other words, MDS is optimized by only altering the direction of each edge in S. With this solution to the degeneration issue, we claim that the proposed MDS can recover more correct directions than G_1, which is supported by Theorems 2 and 3 (see Appendix A.6 and A.7 for proofs). Let D be the pooling of n datasets with distribution shifts and G_0 the DAG underlying the distribution of D. Let S be the skeleton of G_0 and G_1 (recall that G_1 is in the same equivalence class as G_0, so they have the same skeleton). Let E_1 be the set of edges that exist in both G_0 and G_1 but have different directions in G_0 and G_1. Let E_2 ⊆ E_1 be the set of edges whose left node (or variable) and right node are both nonstationary. Let n_1 and n_2 be the cardinalities of E_1 and E_2.
Let E*_2 ⊆ E_2 be the set of edges whose directions are correctly determined by G*, and let n* denote the cardinality of E*_2. Given that E_2 is not empty and G_1 and G* have the same skeleton S, G* is in the same equivalence class as G_0 and 0 ≤ n* ≤ n_2. Theorems 2 and 3 mainly state that, under proper assumptions, the directions of some edges whose left node and right node are both nonstationary can be correctly determined by the proposed method.

Confounding case. When the confounder g(C) exists (e.g., Figure 1(d)), the above approach still works if the influence from the confounder is not too strong, for the following reason: for the correct direction, the dependence between θ_1 and θ_2 comes only from the confounder, while for the wrong direction, the dependence comes from the confounder as well as from the reversed factorization.
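For intuition about the nonlinear dependence term, the following sketch shows the plain (biased) empirical HSIC estimator on scalar variables; the paper's MDS instead applies an extended HSIC to kernel embeddings of distribution modules, so treat this as a minimal illustration of the criterion only:

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels:
    HSIC = (1/n^2) tr(K H L H). Larger values mean more dependence;
    the estimate is always non-negative."""
    n = len(x)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(gram(x) @ H @ gram(y) @ H) / n ** 2

rng = np.random.default_rng(4)
a = rng.normal(size=300)
b_dep = np.sin(a) + 0.1 * rng.normal(size=300)  # nonlinearly dependent on a
b_ind = rng.normal(size=300)                    # independent of a
assert hsic(a, b_dep) > hsic(a, b_ind)
```

Unlike covariance, HSIC also detects the nonlinear, non-monotonic dependence in `b_dep`, which is why it replaces the covariance term when moving from Equation 3 to Equation 4.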

2.4. GRAPH SEARCH STRATEGIES

The proposed MDSS can be readily incorporated into off-the-shelf search strategies. In this paper, we adopt the policy-gradient-based search strategy (Zhu & Chen, 2019) to search for the optimal causal structure. Compared with other search strategies, such as greedy equivalence search (GES (Chickering, 2002)), max-min hill climbing (Tsamardinos et al., 2006), and direct search that treats the weighted graph adjacency matrix as parameters (Zheng et al., 2018; Yu et al., 2019; Lachapelle et al., 2019), the policy-gradient-based search by a reinforcement learning (RL) agent with a stochastic policy can automatically determine where to search given the uncertainty information of the learned policy, which is updated promptly by the stream of reward signals (Zhu & Chen, 2019). This RL-based graph search has been shown empirically to outperform the other search strategies mentioned above. The idea of causal discovery with RL can be summarized as follows. The algorithm uses an encoder-decoder neural network to generate directed graphs from the observed data, which are then used to compute rewards consisting of the predefined score function as well as regularization terms for acyclicity. The encoder-decoder model can be regarded as an "actor" that learns to generate "actions" (i.e., graph adjacency matrices) in an actor-critic algorithm, a setup commonly used in RL. The reward function can be regarded as the "environment" that evaluates how good an "action" is (i.e., how well the produced graph adjacency matrix fits the observed data). The weights of the encoder-decoder model are trained by policy gradient and stochastic optimization methods. The output of the algorithm is the graph that achieves the best reward during the training process. To integrate MDS with the policy-gradient-based search, we replace the predefined score function in the original paper (where BIC is used) with MDS.
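A bare-bones version of the idea can be sketched on a two-variable toy problem. The snippet below is our own drastically simplified stand-in: a Bernoulli edge policy trained by REINFORCE against a crude fit-based reward, rather than the encoder-decoder actor and MDS reward of the actual method. Since the two single-edge linear-Gaussian graphs are Markov equivalent, the policy here mainly learns to include an edge at all rather than to orient it:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data from the linear SEM V0 -> V1.
V0 = rng.normal(size=500)
V1 = 1.5 * V0 + 0.5 * rng.normal(size=500)
X = np.column_stack([V0, V1])

def reward(A):
    """Negative log residual variance of each module under graph A:
    a crude stand-in for the score-based reward."""
    total = 0.0
    for j in range(2):
        pa = [i for i in range(2) if A[i, j]]
        r = X[:, j]
        if pa:
            P = X[:, pa]
            r = r - P @ np.linalg.lstsq(P, r, rcond=None)[0]
        total -= np.log(r.var())
    return total

# Bernoulli policy over the two candidate edges 0->1 and 1->0.
logits, baseline = np.zeros(2), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-logits))
    a = (rng.random(2) < p).astype(float)
    if a[0] and a[1]:                 # reject 2-cycles (crude
        a[rng.integers(2)] = 0.0      # acyclicity constraint;
    A = np.array([[0, a[0]], [a[1], 0]])  # slightly biases the gradient)
    r = reward(A)
    # REINFORCE update with a moving-average baseline.
    logits += 0.1 * (r - baseline) * (a - p)
    baseline = 0.9 * baseline + 0.1 * r
```

After training, the policy assigns high probability to graphs containing an edge, because any single-edge graph earns a much larger reward than the empty graph.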
Apart from policy-gradient-based search, we also experiment with greedy equivalence search, details of which can be found in Appendix A.8. The complete search procedure is described in Algorithm 2.

Algorithm 2 Multiple-domain Score Search

Input: n datasets, each with T observations, d variables, and domain index C.
Output: causal graph G_2 over V.
1: Execute Algorithm 1 on all datasets and the corresponding domain index; obtain skeleton S and nonstationary variables V_C.
2: Execute the predefined graph search algorithm with MDS in the space defined by S; output G_2 over V.
3: Perform any pruning method on G_2 if needed.
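For small skeletons, Step 2 can be illustrated by brute force: enumerate every orientation of the fixed skeleton, discard cyclic ones, and keep the orientation with the best score. The sketch below uses a toy stand-in score in place of MDS; the exhaustive enumeration is for illustration only (real search strategies scale far better):

```python
import numpy as np
from itertools import product

def is_dag(A):
    """Kahn-style check: a directed graph is a DAG iff nodes can be
    repeatedly peeled off once they have no remaining incoming edge."""
    alive = list(range(len(A)))
    while alive:
        roots = [i for i in alive if not any(A[j][i] for j in alive)]
        if not roots:
            return False
        alive = [i for i in alive if i not in roots]
    return True

def orient_skeleton(skeleton_edges, d, score):
    """Try every orientation of the fixed skeleton; among acyclic
    orientations, return the adjacency matrix with the best score."""
    best, best_A = -np.inf, None
    for flips in product([False, True], repeat=len(skeleton_edges)):
        A = np.zeros((d, d), dtype=int)
        for (u, v), flip in zip(skeleton_edges, flips):
            if flip:
                u, v = v, u
            A[u, v] = 1
        if is_dag(A):
            s = score(A)
            if s > best:
                best, best_A = s, A
    return best_A

# Toy score preferring edges out of node 0 (a stand-in for MDS).
toy_score = lambda A: A[0].sum()
A = orient_skeleton([(0, 1), (1, 2)], d=3, score=toy_score)
assert A[0, 1] == 1
```

Because only edge directions vary, the degeneration issue described in Section 2.3 cannot occur: no edge of S can be dropped.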

3. EXPERIMENTS

In this section, we conduct empirical studies to show the effectiveness of our MDSS method combined with the MDS score. We compare MDSS to well-known causal discovery algorithms designed for i.i.d. or stationary data from a single domain (GES (Chickering, 2002), PC (Spirtes et al., 2000), LiNGAM (Shimizu et al., 2006), NO-TEARS (Zheng et al., 2018), and RL (Zhu & Chen, 2019)), as well as algorithms designed for heterogeneous data from multiple domains (CD-NOD (Huang et al., 2019), MC and IB (Ghassami et al., 2018)). The comparison is made on both synthetic and real data. We evaluate the estimated graphs using three metrics: True Negative Rate (TNR), True Positive Rate (TPR), and Structural Hamming Distance (SHD, i.e., the smallest number of edge additions, deletions, and reversals needed to convert the estimated graph into the true DAG). A lower SHD indicates a better estimate of the causal graph. For algorithms that output a completed partially directed acyclic graph (CPDAG), we randomly choose a direction for each undirected edge.
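These metrics are straightforward to compute from adjacency matrices; the following sketch (our own implementation of the standard definitions, not the paper's evaluation code) illustrates them on a three-node example:

```python
import numpy as np

def metrics(est, true):
    """TPR/TNR over directed edge decisions, plus SHD counted as
    skeleton additions + skeleton deletions + direction reversals."""
    est, true = np.asarray(est, bool), np.asarray(true, bool)
    off = ~np.eye(len(true), dtype=bool)          # ignore the diagonal
    tpr = (est & true)[off].sum() / max(true[off].sum(), 1)
    tnr = (~est & ~true)[off].sum() / max((~true)[off].sum(), 1)
    skel_e, skel_t = est | est.T, true | true.T
    additions = (skel_e & ~skel_t)[off].sum() // 2
    deletions = (~skel_e & skel_t)[off].sum() // 2
    reversals = (true & est.T & ~est).sum()       # edge present, flipped
    return tpr, tnr, additions + deletions + reversals

true = np.zeros((3, 3), bool); true[0, 1] = true[1, 2] = True
est = np.zeros((3, 3), bool); est[0, 1] = est[2, 1] = est[0, 2] = True
tpr, tnr, shd = metrics(est, true)
assert tpr == 0.5 and shd == 2   # one reversal, one spurious edge
```

Spurious edges from pooled heterogeneous data show up here as additions, which lowers TNR and raises SHD even when TPR is unchanged.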

3.1. A TOY EXAMPLE

We use a synthetic toy example to illustrate the influence of confounders g(C) for algorithms (we use RL) designed for homogeneous data, and demonstrate that MDSS can avoid such influence and further identify more directions. See Appendix A.9 for this example.

3.2. SYNTHETIC DATA

In this section, we conduct extensive experiments with MDSS and other causal discovery algorithms on linear and nonlinear synthetic data. We denote by n the number of datasets, each with d variables and T observations. We set n ∈ {6, 7, ..., 20}, d ∈ {6, 7, 8}, and T = 100 for both linear and nonlinear data. We repeat each setting 20 times with DAGs randomly generated by the Erdős-Rényi (ER) model with parameter 0.3. Each variable V_i is chosen to be nonstationary with probability 0.6. Similar to Section 3.1, linear data are generated using the linear SEM V_i = w_i PA_i + b_i + ε_i; we fix w_i and ε_i across domains and vary b_i if V_i is chosen as nonstationary. Nonlinear data are generated using the nonlinear SEM V_i = f_i(PA_i) + b_i + ε_i, where f_i(·) is randomly picked from {sin(·), cos(·), sigmoid(·)}; b_i varies if V_i is nonstationary while ε_i stays invariant. We first consider the setting with n = 6 and d = 10. MDSS, MC, IB, and CD-NOD are tested on data from all domains. GES, PC, LiNGAM, NO-TEARS, and RL are tested on data from all domains as well as on data from a single, randomly chosen domain. For GES, we use fast GES (FGES (Ramsey et al., 2017)), an improved version of the original GES. The means and standard deviations are reported in Tables 1 and 2. As we can see, MDSS outperforms the other algorithms on both linear and nonlinear data. The performance of PC, FGES, LiNGAM, NO-TEARS, and RL on the pooled all-domain data is worse than on single-domain data: despite a minor increase in TPR, their TNR decreases dramatically when data from all domains are used. This phenomenon further supports our claim that distribution shifts introduce spurious edges if not properly dealt with. We further compare the performance of MDSS, IB, MC, and CD-NOD while varying d and n. The results are reported in Figure 2.
The black curves in rows 2 and 3 of Figure 2 are shorter than the others because CD-NOD takes too much time to give any result when d > 6 and n > 15. According to the results, MDSS outperforms the others in most cases. To demonstrate that the proposed MDS contributes to the performance, we conduct ablation studies: we keep the directions from Step 3 of Algorithm 1 and output G_1 directly (Algorithm 2, i.e. the MDS search, is not executed), using the same experimental setting as in Table 1, and report the resulting TPR, TNR, and SHD.

A.2 ACYCLICITY CONSTRAINTS

Zheng et al. (2018) formulates the structure learning problem as a purely continuous optimization problem via a new characterization of acyclicity that is not only smooth but also exact, proposing to measure the "DAG-ness" of a graph by h(A) = tr(e^{A∘A}) − d, (5) where A is a weighted adjacency matrix, A∘A denotes the elementwise product, and d is the number of nodes in the graph. The function h(·) satisfies the following properties:
• h(A) = 0 if and only if A is acyclic.
• The value of h quantifies the "DAG-ness" of the graph.
• h is smooth.
• h and its derivatives are easy to compute.
Further, Zhu & Chen (2019) finds that h(A), which is non-negative, can be small for certain cyclic graphs, and that its minimum over all non-DAGs is not easy to compute. As a consequence, a very large penalty weight on h(A) would be required to obtain exact DAGs if h(A) alone were used. To address this issue, Zhu & Chen (2019) proposes an additional acyclicity penalty term I(G ∉ DAGs), the indicator function w.r.t. acyclicity, to induce exact DAGs. The combination of the two acyclicity constraints can be written as λ_1 I(G ∉ DAGs) + λ_2 h(A), which corresponds to the second and third terms in our proposed MDS. See the original papers, Zheng et al. (2018) and Zhu & Chen (2019), for more details.
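Equation 5 is easy to verify numerically; the sketch below computes h(A) = tr(e^{A∘A}) − d using a truncated power series for the matrix exponential (to keep the example dependency-free) and checks both properties on a DAG and a cyclic graph:

```python
import numpy as np

def h(A):
    """NOTEARS acyclicity measure h(A) = tr(exp(A*A)) - d, where
    A*A is the elementwise square. Computed via the series
    tr(exp(M)) - d = sum_{k>=1} tr(M^k)/k!, truncated at k = 24."""
    M = A * A
    P = np.eye(len(A))
    total = 0.0
    for k in range(1, 25):
        P = P @ M / k          # P = M^k / k!
        total += np.trace(P)
    return total

dag = np.array([[0.0, 1.5], [0.0, 0.0]])     # 0 -> 1 only
cyclic = np.array([[0.0, 1.5], [0.7, 0.0]])  # 0 <-> 1, a 2-cycle
assert abs(h(dag)) < 1e-8
assert h(cyclic) > 0
```

Every term tr(M^k) counts weighted closed walks of length k, so the sum vanishes exactly when the graph has no directed cycle.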

A.3 GENERALIZED SCORE

We use the generalized score (GS (Huang et al., 2018)) as a model selection criterion to measure how well a graph fits the data. Here we give a brief introduction to the calculation of GS. Assume X is a random variable with domain 𝒳, and H_X is a reproducing kernel Hilbert space (RKHS) on 𝒳 with continuous feature mapping φ_X : 𝒳 → H_X. Similarly we define variables Y and Z with domains 𝒴, 𝒵, corresponding RKHSs H_Y, H_Z, and feature mappings φ_Y, φ_Z. Let Z̃ := (Y, Z) and consider the following two regression functions in the RKHS: φ_X(X) = F_1(Z) + U_1, φ_X(X) = F_2(Z̃) + U_2, (6) where F_1 : 𝒵 → H_X and F_2 : 𝒵̃ → H_X. If X ⊥⊥ Y | Z, the following equation holds: E_Z[Var_{X|Z}[φ_X(X)|Z]] = E_Z̃[Var_{X|Z̃}[φ_X(X)|Z̃]], which means it is not useful to incorporate Y as a predictor of X given Z, so the first model (i.e., the model with less complexity) in Equation 6 is preferred. Cross-validated likelihood is used to express this preference. To perform cross-validation, the whole dataset D is split into a training set and a test set, and this procedure is repeated K times, i.e., K-fold cross-validation. Let D^{(k)}_{1,i} and D^{(k)}_{0,i} be the data of X_i and its parents in the kth training set and validation set respectively. The GS of DAG G_h using cross-validated likelihood is calculated as S_CV(G_h; D) = Σ_{i=1}^m S_CV(X_i, PA_i^{G_h}) = Σ_{i=1}^m (1/K) Σ_{k=1}^K ℓ(F̂_i^{(k)} | D^{(k)}_{0,i}), where PA_i^{G_h} are the parents of X_i in graph G_h, F̂_i^{(k)} is the regression function estimated from the kth training data D^{(k)}_{1,i}, and ℓ(F̂_i^{(k)} | D^{(k)}_{0,i}) is the log-likelihood evaluated on the kth validation set with the estimated regression function. Another type of GS based on marginal likelihood is also proposed; see Huang et al. (2018) for more details.
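The cross-validated likelihood idea can be sketched with ordinary linear regression standing in for the RKHS regression of GS (a simplification we make for illustration; GS regresses feature maps, not raw values, and uses kernel regressors):

```python
import numpy as np

def cv_loglik(y, X, K=5):
    """K-fold cross-validated Gaussian log-likelihood of regressing
    y on X (X=None predicts with the training mean): a plug-in
    stand-in for the per-variable term of the GS score."""
    n = len(y)
    folds = np.array_split(np.arange(n), K)
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.setdiff1d(np.arange(n), test)
        if X is None:
            pred_tr = np.full(len(train), y[train].mean())
            pred_te = np.full(len(test), y[train].mean())
        else:
            A_tr = np.column_stack([np.ones(len(train)), X[train]])
            A_te = np.column_stack([np.ones(len(test)), X[test]])
            coef, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
            pred_tr, pred_te = A_tr @ coef, A_te @ coef
        var = np.mean((y[train] - pred_tr) ** 2)
        total += np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (y[test] - pred_te) ** 2 / var))
    return total / K   # average over folds, as in the GS definition

rng = np.random.default_rng(5)
x = rng.normal(size=500)
y = np.sin(2 * x) + 0.3 * rng.normal(size=500)
# Using x as a predictor of y raises the held-out likelihood, so the
# graph with the edge x -> y is preferred over the empty graph.
assert cv_loglik(y, x.reshape(-1, 1)) > cv_loglik(y, None)
```

Because the likelihood is evaluated on held-out folds, adding an uninformative parent does not raise the score, which is the complexity control that makes the criterion consistent.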

A.4 KERNEL EMBEDDING OF DISTRIBUTIONS

According to Equation 4 in our paper, we need to calculate the dependence between the distributions P(V^k_C | PA^k) and P(PA^k). The dependence between random variables can be measured by the Hilbert-Schmidt Independence Criterion (HSIC), which will be discussed in the next section. To transform the distributions of data from different domains into random variables in an RKHS, we use kernel embeddings of conditional distributions (Song et al., 2013). In the rest of this section, we denote PA^k by X and V^k_C by Y for simplicity. Let 𝒳 be the domain of X, and let (H, k) be a reproducing kernel Hilbert space (RKHS) with a measurable kernel on 𝒳. Let φ(x) ∈ H be a continuous feature mapping φ_X : 𝒳 → H. Similar notations are used for variables Y and C. We define the cross-covariance operator C_{YX} : H → G as C_{YX} := E_{YX}[φ(X) ⊗ ψ(Y)]. The kernel embedding of the conditional distribution P(X | C = c_n) for data from a given domain C = c_n can be calculated as

μ_{X|C=c_n} = C_{XC} C_{CC}^{-1} φ(c_n).

The empirical estimate of μ_{X|C=c_n} is

μ̂_{X|C=c_n} = (1/N) Φ_x Φ_c^T ((1/N) Φ_c Φ_c^T + λI)^{-1} φ(c_n) = Φ_x (K_c + λI)^{-1} k_{c,c_n},

where N is the sample size, Φ_x := [φ(x_1), …, φ(x_N)], Φ_c := [φ(c_1), …, φ(c_N)], K_c(c_t, c_{t'}) = ⟨φ(c_t), φ(c_{t'})⟩, and k_{c,c_n} := [k(c_1, c_n), …, k(c_N, c_n)]^T. The corresponding Gram matrix with a Gaussian kernel of width σ_x is

M^H_X = exp( −( diag(M^l_X) · 1_N + 1_N · diag(M^l_X) − 2 M^l_X ) / (2σ_x²) ),

where diag(·) sets all off-diagonal entries of a matrix to zero, and 1_N is an N × N matrix with all entries equal to 1. M^l_X is the Gram matrix with a linear kernel:

M^l_X = K_c (K_c + λI)^{-1} K_x (K_c + λI)^{-1} K_c,

whose (c, c′) entry can be calculated as

M^l_X(c, c′) = ⟨μ̂_{X|C=c}, μ̂_{X|C=c′}⟩ = k_{c,c}^T (K_c + λI)^{-1} Φ_x^T Φ_x (K_c + λI)^{-1} k_{c,c′} = k_{c,c}^T (K_c + λI)^{-1} K_x (K_c + λI)^{-1} k_{c,c′}.

Similarly, we can calculate the empirical kernel embedding of the conditional distribution P(Y | X, C = c_n) and the corresponding Gram matrix, which we denote by μ̂_{Y|X,C=c_n} and M^G_{Y|X}, respectively. For more details about kernel embeddings of distributions, see Song et al. (2013) and Huang et al. (2019).
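The Gram matrices above can be computed directly from the kernel matrices. Below is a minimal NumPy sketch (our own illustration, with hypothetical helper names `rbf_gram` and `embedding_grams`; for simplicity the same Gaussian width is used for X and C):

```python
import numpy as np

def rbf_gram(v, sigma=1.0):
    """Gaussian-kernel Gram matrix of a 1-D sample."""
    return np.exp(-((v[:, None] - v[None, :]) ** 2) / (2 * sigma ** 2))

def embedding_grams(x, c, lam=1e-3, sigma_x=1.0):
    """M^l_X = K_c (K_c + lam I)^-1 K_x (K_c + lam I)^-1 K_c, and its Gaussian version M^H_X."""
    N = len(x)
    K_x, K_c = rbf_gram(x, sigma_x), rbf_gram(c, sigma_x)
    R = np.linalg.solve(K_c + lam * np.eye(N), K_c)  # (K_c + lam I)^-1 K_c
    M_l = R.T @ K_x @ R
    d = np.diag(M_l)  # diagonal of the linear-kernel Gram matrix
    # Gaussian kernel between embeddings: ||mu_c - mu_c'||^2 = d_c + d_c' - 2 M_l(c, c')
    M_h = np.exp(-(d[:, None] + d[None, :] - 2 * M_l) / (2 * sigma_x ** 2))
    return M_l, M_h

rng = np.random.default_rng(1)
c = rng.normal(size=50)
x = c + 0.2 * rng.normal(size=50)
M_l, M_h = embedding_grams(x, c)
```

Both matrices are symmetric, and the diagonal of M^H_X is identically one because each embedding has zero distance to itself.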

A.5 EXTENDED HILBERT SCHMIDT INDEPENDENCE CRITERION

With the notations and results of the previous section, we can calculate the dependence between P(X) and P(Y | X) with the extended Hilbert-Schmidt Independence Criterion:

HSIC_{P(X),P(Y|X)} = (1/(N−1)²) tr(M^H_X H M^G_{Y|X} H),

where H is the centering matrix, with entries H_ij := δ_ij − N^{-1}. Huang et al. (2019) uses a normalized version of the estimated HSIC, which is invariant to the scales of M^H_X and M^G_{Y|X}:

HSIC^N_{P(X),P(Y|X)} = HSIC_{P(X),P(Y|X)} / ( [1/(N−1)] tr(M^H_X H) · [1/(N−1)] tr(M^G_{Y|X} H) ) = tr(M^H_X H M^G_{Y|X} H) / ( tr(M^H_X H) tr(M^G_{Y|X} H) ).   (15)
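Since the (N−1)^{-2} factors cancel in the normalization, the normalized statistic reduces to a ratio of traces. A minimal sketch (illustrative helper names, with plain RBF Gram matrices standing in for M^H_X and M^G_{Y|X}):

```python
import numpy as np

def rbf_gram(v, sigma=1.0):
    return np.exp(-((v[:, None] - v[None, :]) ** 2) / (2 * sigma ** 2))

def normalized_hsic(M1, M2):
    """tr(M1 H M2 H) / (tr(M1 H) tr(M2 H)); the 1/(N-1)^2 factors cancel out."""
    N = M1.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N  # centering matrix, H_ij = delta_ij - 1/N
    A, B = M1 @ H, M2 @ H
    return float(np.trace(A @ B) / (np.trace(A) * np.trace(B)))

rng = np.random.default_rng(2)
x = rng.normal(size=200)
h_dep = normalized_hsic(rbf_gram(x), rbf_gram(x + 0.1 * rng.normal(size=200)))
h_ind = normalized_hsic(rbf_gram(x), rbf_gram(rng.normal(size=200)))
```

On the samples above, the dependent pair yields a much larger value than the independent pair, and the normalized statistic stays within [0, 1], matching its role as a scale-invariant dependence measure.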



Here, a larger score means the corresponding graph is closer to the equivalence class of the true DAG, while the MDS defined in Section 2.3 should be regarded as a type of "loss function" that needs to be minimized.



3: Discard the index variable in G_aug to obtain the induced subgraph G_1. Discard the directions in G_1 and output the skeleton S of G_1.
4: Detect changing causal modules by inspecting the G_aug recovered in Step 2, and output V_C.

The validity of searching over augmented graphs is guaranteed by Theorem 1.

Theorem 1. Let D be the pooling of all datasets, and let D_C be the augmented dataset with the domain index as an extra random variable. Let G_0 be the underlying causal graph for the distribution of D over V, and G_C be the underlying causal graph for the distribution of D_C over V ∪ {C}. Denote by G′_C any graph obtained from G_C by 1. adding edges, 2. deleting edges, or 3. reversing edges, in a way that changes the conditional independence relations of G_C. Then S(G_C, D_C) > S(G′_C, D_C), where S is any globally consistent scoring criterion.
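Constructing the augmented dataset D_C used by the theorem is straightforward; a minimal sketch (with the illustrative helper name `augment_with_domain_index`):

```python
import numpy as np

def augment_with_domain_index(datasets):
    """Pool per-domain datasets and append the domain index C as an extra variable."""
    blocks = [np.column_stack([data, np.full(len(data), idx, dtype=float)])
              for idx, data in enumerate(datasets)]
    return np.vstack(blocks)

# Two toy domains over the same 2 variables.
d0 = np.zeros((3, 2))
d1 = np.ones((4, 2))
D_C = augment_with_domain_index([d0, d1])  # shape (7, 3): V1, V2, and C
```

Any off-the-shelf structure search can then be run on D_C, treating C as just another node of the augmented graph.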

Theorem 2. For linear systems, let G* = argmin_G MDS_linear(G, D), let E*_2 ⊆ E_2 be the set of edges whose directions are correctly determined by G*, and let n* denote the cardinality of E*_2. Given that E_2 is not empty and that G_1 and G* have the same skeleton S, G* is in the same equivalence class as G_0 and n* = n_2.

Theorem 3. For nonlinear systems, let G* = argmin_G MDS_nonlinear(G, D), let E*



Figure: Empirical results for MDSS, MC, IB and CD-NOD on linear and nonlinear data.

Figure: Empirical results for PC, FGES, LiNGAM, NOTEARS and RL on linear and nonlinear data.

A.2 ACYCLICITY CONSTRAINTS

Causal discovery from samples of a joint distribution is a challenging combinatorial problem because the search space is super-exponential in the number of graph nodes. Recently, Zheng et al. (2018) recast the problem as purely continuous optimization.


0.58 ± 0.21, 0.92 ± 0.07 and 7.95 ± 4.51 for linear data, and 0.32 ± 0.22, 0.58 ± 0.22 and 6.20 ± 0.26 for nonlinear data. Compared with the results of MDSS in Table 1, it is clear that the MDS search identifies more directions and boosts the performance.

3.3. REAL DATA

We apply MDSS to the fMRI hippocampus dataset (Poldrack et al., 2015). This dataset records signals from six separate brain regions of a single subject in resting state over 84 successive days: perirhinal cortex (PRC), parahippocampal cortex (PHC), entorhinal cortex (ERC), subiculum (Sub), CA1, and CA3/Dentate Gyrus (DG). The records for each day can be regarded as one domain; we select 10 of them. The results from MDSS, MC, IB and CD-NOD are given in Figure 3.

4. CONCLUSIONS

This paper proposes a Multiple-Domain Score Search (MDSS) algorithm for causal discovery from heterogeneous data. It first performs skeleton learning over the space of augmented graphs. Then a Multiple-Domain Score (MDS) is used to determine causal directions based on the skeleton of the recovered graph. The MDS is built on distribution shifts across domains and the assumption of independent change. Compared with previous methods, MDSS can remove the influence of distribution shifts and further recover more causal directions. In future work, we aim to improve MDSS in two aspects: 1) score calculation takes more time than training the NNs used for searching, so it is essential to optimize the score computation to accelerate the entire framework; 2) the current framework of MDSS cannot deal with the more general case where causal directions themselves change across domains, although this phenomenon does occur in some real-world circumstances.

• If G′ is not in the same equivalence class as G_0 and n′ < n_2, then MDS_linear(G′, D) > MDS_linear(G*, D) clearly holds according to the above two cases. All three cases above contradict the condition that these three graphs are DAGs that minimize MDS_nonlinear(G, D).

A.7 PROOF OF THEOREM 3

The conclusion that 0 ≤ n* ≤ n_2 clearly holds. The conclusion that G* is in the same equivalence class as G_0 can be proved similarly to the proof of Theorem 2. MDS_nonlinear is not able to guarantee n* = n_2, mainly because GS is not score-equivalent.

A.8 EXPERIMENTS WITH GREEDY EQUIVALENCE SEARCH

The proposed MDSS can be readily incorporated into off-the-shelf search strategies. In the main paper, we adopt the policy-gradient-based search strategy (Zhu & Chen, 2019) to search for the optimal causal structure. In this section, we demonstrate that greedy equivalence search (Chickering, 2002) can also be utilized as the search strategy. Similar to the two-stage search in the main paper, we first perform greedy equivalence search on the augmented graphs (i.e. graphs with the domain index as an additional node) to optimize the score (BIC for linear systems and GS for nonlinear systems). The output of this step is an equivalence class of the augmented graph. Then we utilize the distribution shifts of the nonstationary variables to detect more edge directions.

Consider the same setting as in Section 3.2 (i.e., n = 6 and d = 10). In the linear case, the TPR, TNR and SHD for MDSS with greedy equivalence search are 0.67 ± 0.15, 0.75 ± 0.16 and 4.75 ± 2.05, respectively. In the nonlinear case, they are 0.69 ± 0.13, 0.63 ± 0.21 and 5.80 ± 2.67, respectively. Compared with the results in Table 1: 1) MDSS with greedy equivalence search outperforms MC, IB and CD-NOD in both the linear and nonlinear cases; 2) although MDSS with greedy equivalence search is not as good as MDSS with policy-gradient-based search in the linear case, it achieves comparable results in the nonlinear case.

A.9 A TOY EXAMPLE

The example consists of 10 linear datasets with 4 variables, whose underlying causal graph is given in Figure 4(a). We use the linear SEM V_i = w_i PA_i + b_i + ε_i to generate the data. For each variable V_i, w_i is fixed to 1 and ε_i is fixed as standard Gaussian noise across all datasets. To introduce distribution shifts, we vary b_i across datasets; here b_3 is chosen to be invariant (i.e. V_3 is stationary). We first run RL on a single dataset (randomly chosen from the 10 datasets) and on the pooling of all datasets, with results shown in Figures 4(b) and 4(c). RL misidentifies the direction between V_1 and V_2 in both cases, which is expected because RL uses BIC plus an acyclicity constraint as its score function, and BIC is score-equivalent. Furthermore, when multiple datasets with distribution shifts are used, RL outputs erroneous edges (i.e. edges between V_1, V_4 and between V_2, V_4) due to confounders. Next, we run MDSS on the pooling of all datasets, with the result shown in Figure 4(d). MDSS correctly detects the variables with changing causal modules, and also recovers all the directions among the 4 variables correctly.
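For concreteness, the toy data generation can be sketched as follows. This is our own illustrative reconstruction: the chain structure and the offset ranges are assumptions, since Figure 4(a) and the exact b_i values are not reproduced here.

```python
import numpy as np

def gen_toy_domains(n_domains=10, n=500, seed=0):
    """Linear SEM V_i = w_i * PA_i + b_i + eps_i with w_i = 1 and standard Gaussian
    noise fixed across domains; only the offsets b_i shift, with b_3 held invariant.
    The graph is an ASSUMED chain V1 -> V2 -> V3 -> V4, used purely for illustration."""
    rng = np.random.default_rng(seed)
    domains = []
    for _ in range(n_domains):
        b = rng.uniform(-2.0, 2.0, size=4)  # domain-specific offsets (assumed range)
        b[2] = 0.0  # b_3 invariant, so the causal module of V_3 is stationary
        v1 = b[0] + rng.normal(size=n)
        v2 = v1 + b[1] + rng.normal(size=n)
        v3 = v2 + b[2] + rng.normal(size=n)
        v4 = v3 + b[3] + rng.normal(size=n)
        domains.append(np.column_stack([v1, v2, v3, v4]))
    return domains

domains = gen_toy_domains()
```

Because b_3 is fixed, the conditional V_3 | V_2 is identical in every domain even though the marginal of V_3 still shifts through its parent; this is exactly the stationary-module property the example exploits.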

