BOOSTING DIFFERENTIABLE CAUSAL DISCOVERY VIA ADAPTIVE SAMPLE REWEIGHTING

Abstract

Under stringent model-type and variable-distribution assumptions, differentiable score-based causal discovery methods learn a directed acyclic graph (DAG) from observational data by evaluating candidate graphs over an average score function. Despite great success in low-dimensional linear systems, it has been observed that these approaches overly exploit easier-to-fit samples, thus inevitably learning spurious edges. Worse still, the common homogeneity assumption can be easily violated, due to the widespread existence of heterogeneous data in the real world, resulting in performance vulnerability when noise distributions vary. We propose a simple yet effective model-agnostic framework to boost causal discovery performance by dynamically learning adaptive weights for the Reweighted Score function, ReScore for short, where the weights are tailored quantitatively to the importance of each sample. Intuitively, we leverage a bilevel optimization scheme to alternately train a standard DAG learner and reweight the samples: that is, we upweight the samples the learner fails to fit and downweight the samples from which the learner easily extracts spurious information. Extensive experiments on both synthetic and real-world datasets are carried out to validate the effectiveness of ReScore. We observe consistent and significant boosts in structure learning performance. Furthermore, we visualize how ReScore concurrently mitigates the influence of spurious edges and generalizes to heterogeneous data. Finally, we provide a theoretical analysis to guarantee the structure identifiability and the weight-adaptive properties of ReScore in linear systems. Our codes are available at https://github.com/anzhang314/ReScore.

1. INTRODUCTION

Learning causal structure from purely observational data (i.e., causal discovery) is a fundamental but daunting task (Chickering et al., 2004; Shen et al., 2020). It strives to identify causal relationships between variables and encode the conditional independencies as a directed acyclic graph (DAG). Differentiable score-based optimization is a crucial enabler of causal discovery (Vowels et al., 2021). Specifically, it is formulated as a continuous constrained optimization problem by minimizing an average score function under a smooth acyclicity constraint. To ensure the structure is fully or partially identifiable (see Section 2), researchers impose stringent restrictions on the model parametric family (e.g., linear, additive) and common assumptions on variable distributions (e.g., data homogeneity) (Peters et al., 2014; Ng et al., 2019a). Following this scheme, recent follow-on studies (Kalainathan et al., 2018; Ng et al., 2019b; Zhu et al., 2020; Khemakhem et al., 2021; Yu et al., 2021) extend the formulation to general nonlinear problems by utilizing a variety of deep learning models. However, upon careful inspection, we spot and justify two unsatisfactory behaviors of the current differentiable score-based methods:
• Differentiable score-based causal discovery is error-prone to learning spurious edges or reversed causal directions between variables, which derails the structure learning accuracy (He et al., 2021; Ng et al., 2022). We substantiate our claim with an illustrative example, as shown in Figure 1 (see another example in Appendix D.3.1). We find that even the fundamental chain structure in a linear system is easily misidentified by the state-of-the-art method, NOTEARS (Zheng et al., 2018).

[Figure 1: A simple example of a basic chain structure for which NOTEARS would learn spurious edges, while ReScore can help to mitigate the bad influence.]
• Despite being appealing on synthetic data, differentiable score-based methods suffer from severe performance degradation when encountering heterogeneous data (Huang et al., 2020; 2019). Considering Figure 1 again, NOTEARS is susceptible to learning redundant causations when the distributions of the noise variables vary.
Taking a closer look at this dominant scheme (i.e., optimizing the DAG learner via an average score function under strict assumptions), we ascribe these undesirable behaviors to its inherent limitations:
• The collected datasets naturally include an overwhelming number of easy samples and a small number of informative samples that might contain crucial causation information (Shrivastava et al., 2016). Scoring the samples on average deprives the discovery process of the ability to differentiate sample importance, so easy samples dominate the learning of the DAG. As a result, prevailing score-based techniques fail to learn true causal relationships but instead yield the easier-to-fit spurious edges.
• Noise distribution shifts are inevitable and common in real-world training, as observations are typically collected at different periods, environments, locations, and so forth (Arjovsky et al., 2019). As a result, the strong assumption of noise homogeneity for a differentiable DAG learner is easily violated on real-world data (Peters et al., 2016). A line of works dedicated to heterogeneous data (Ghassami et al., 2018; Wang et al., 2022) can successfully address this issue. However, they often require explicit domain annotations (i.e., an ideal partition according to the heterogeneity underlying the data) for each sample, which are prohibitively expensive and hard to obtain (Creager et al., 2021), further limiting their applicability.
To reshape the optimization scheme and resolve these limitations, we propose to adaptively reweight the samples, which de facto concurrently mitigates the influence of spurious edges and generalizes to heterogeneous data.
The core idea is to discover and upweight a set of less-fitted samples that offer additional insight into depicting the causal edges, compared to the samples easily fitted via spurious edges. Focusing more on less-fitted samples enables the DAG learner to effectively generalize to heterogeneous data, especially in real-world scenarios whose samples typically come from disadvantaged domains. However, due to the difficulty of accessing domain annotations, distinguishing such disadvantaged but informative samples and adaptively assigning their weights is challenging. Towards this end, we present a simple yet effective model-agnostic optimization framework, coined ReScore, which automatically learns to reweight the samples and optimize the differentiable DAG learner, without any knowledge of domain annotations. Specifically, we frame adaptive weight learning and differentiable DAG learning as a bilevel optimization problem, where the outer-level problem is solved subject to the optimal value of the inner-level problem:
• In the inner loop, the DAG learner is first fixed and evaluated by the reweighted score function to quantify its reliance on easier-to-fit samples, and then the instance-wise weights are adaptively optimized to drive the DAG learner toward its worst case.
• In the outer loop, upon the reweighted observational data whose weights are determined by the inner loop, any differentiable score-based causal discovery method can be applied to optimize the DAG learner and refine the causal structure.
Benefiting from this optimization scheme, our ReScore has three desirable properties. First, it is a model-agnostic technique that can empower any differentiable score-based causal discovery method. Moreover, we theoretically show that ReScore inherits structure identifiability from the original causal discovery method in linear systems (cf. Theorem 1).
Second, ReScore jointly mitigates the negative effects of spurious edge learning and the performance drop on heterogeneous data via auto-learnable adaptive weights. The theoretical analysis in Section 3.3 (cf. Theorem 2) validates the oracle adaptive property of the weights. Third, ReScore boosts causal discovery performance by a large margin. Surprisingly, it performs competitively with or even outperforms CD-NOD (Huang et al., 2020) and DICD (Wang et al., 2022), which require domain annotations, on heterogeneous synthetic data and real-world data (cf. Section 4.2).

2. DIFFERENTIABLE CAUSAL DISCOVERY

We begin by introducing the task formulation of causal discovery and the identifiability issue. We then present the differentiable score-based scheme to optimize the DAG learner.
Task Formulation. Causal discovery aims to infer the Structural Causal Model (SCM) (Pearl, 2000; Pearl et al., 2016) from observational data, which best describes the data generating procedure. Formally, let X ∈ R^{n×d} be a matrix of observational data, which consists of n independent and identically distributed (i.i.d.) random vectors X = (X_1, . . . , X_d) ∈ R^d. Given X, we aim to learn an SCM (P_X, G), which encodes a causal directed acyclic graph (DAG) with a structural equation model (SEM) to reveal the data generation from the distribution of variables X. Specifically, we denote the DAG by G = (V(G), E(G)), where V(G) is the variable set and E(G) collects the directed causal edges between variables. We present the joint distribution over X as P_X, which is Markov w.r.t. G. The probability density function of P_X factorizes as p(x) = ∏_{i=1}^{d} P(x_i | x_{pa(i)}), where pa(i) = {j ∈ V(G) : X_j → X_i ∈ E(G)} is the set of parents of variable X_i in G and P(x_i | x_{pa(i)}) is the conditional probability density of X_i given X_{pa(i)}. As a result, the SEM can be formulated as a collection of d structural equations:

    X_i = f_i(X_{pa(i)}, N_i),  i = 1, . . . , d,    (1)

where f_i : R^{|X_{pa(i)}|} → R can be any linear or nonlinear function, and N = (N_1, . . . , N_d) are jointly independent noise variables.
Identifiability Issue. In general, without further assumptions on the SEM (cf. Equation 1), it is not possible to uniquely learn the DAG G using only observations of P_X. This is the identifiability issue in causal discovery (Lachapelle et al., 2020). Nonetheless, under an assumption on the SEM, the DAG G is said to be identifiable over P_X if no other SEM can encode the same distribution P_X with a different DAG under the same assumption.
To guarantee identifiability, most prior studies restrict the form of the structural equations to be additive w.r.t. the noises, i.e., additive noise models (ANM). Assuming an ANM, as long as the structural equations are linear with non-Gaussian errors (Shimizu et al., 2006; Loh & Bühlmann, 2014), linear Gaussian with equal noise variances (Peters & Bühlmann, 2014), or nonlinear under mild conditions (Hoyer et al., 2008; Zhang & Hyvarinen, 2009; Peters et al., 2014), the DAG G is identifiable.
Solution to Causal Discovery. Prevailing causal discovery approaches roughly fall into two lines: constraint- and score-based methods (Spirtes & Zhang, 2016; Glymour et al., 2019). Specifically, constraint-based methods (Spirtes et al., 1995; Spirtes & Glymour, 1991; Colombo et al., 2012) determine the causal graph up to its Markov equivalence class, based on conditional independence tests under certain assumptions. Score-based methods (Vowels et al., 2021) evaluate candidate graphs with a predefined score function and search the DAG space for the optimal graph. Here we focus on the score-based line.
Score-based Causal Discovery. With a slight abuse of notation, G refers to a directed graph in the rest of the paper. Formally, the score-based scheme casts the task of DAG learning as a combinatorial optimization problem:

    min_G S(G; X) = L(G; X) + λ R_sparse(G),  s.t. G ∈ DAG,    (2)

This problem consists of two ingredients: the combinatorial acyclicity constraint G ∈ DAG and the score function S(G; X). The score function is composed of two terms: (1) the goodness-of-fit measure L(G; X) = (1/n) Σ_{i=1}^{n} l(x_i, f(x_i)), where l(x_i, f(x_i)) represents the loss of fitting observation x_i; and (2) the sparsity regularization R_sparse(G), stipulating that the total number of edges in G should be penalized. Here λ is a hyperparameter controlling the regularization strength.
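As a toy illustration (not the authors' implementation), the score function S(G; X) above can be instantiated for a linear SEM parameterized by a weighted adjacency matrix W, with a least-squares fit term and an L1 relaxation of the sparsity penalty; the function name and parameterization here are assumptions made for the sketch:

```python
import numpy as np

def score(W, X, lam=0.1):
    """S(G; X) = L(G; X) + lam * R_sparse(G) for a linear SEM X = XW + N.

    L is the average least-squares fitting loss over the n observations;
    R_sparse is an L1 relaxation of the edge-count penalty.
    """
    n = X.shape[0]
    residual = X - X @ W                      # f(x_i) = W^T x_i in the linear case
    goodness_of_fit = (residual ** 2).sum() / (2 * n)
    sparsity = np.abs(W).sum()
    return goodness_of_fit + lam * sparsity
```

Minimizing this score alone is not enough: the acyclicity constraint G ∈ DAG must be enforced separately, which is exactly what the differentiable DAGness constraints below provide.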
Next, we will elaborate on the previous implementations of these two major ingredients. To implement S(G; X), various approaches have been proposed, such as penalized least-squares loss (Zheng et al., 2020; 2018; Ng et al., 2019b), Evidence Lower Bound (ELBO) (Yu et al., 2019), log-likelihood with complexity regularizers (Kalainathan et al., 2018; Van de Geer & Bühlmann, 2013; Ng et al., 2020), Maximum Mean Discrepancy (MMD) (Goudet et al., 2018), Bayesian Information Criterion (BIC) (Geiger & Heckerman, 1994; Zhu et al., 2020), Bayesian Dirichlet equivalence uniform (BDeu) score (Heckerman et al., 1995), Bayesian Gaussian equivalent (BGe) score (Kuipers et al., 2014), and others (Huang et al., 2018; Bach & Jordan, 2002; Sokolova et al., 2014). As G ∈ DAG enforces G to be acyclic, it becomes the main obstacle to the score-based scheme. Prior studies propose various approaches to search in the acyclic space, such as greedy search (Chickering, 2002; Hauser & Bühlmann, 2012), hill-climbing (Gámez et al., 2011; Tsamardinos et al., 2006), dynamic programming (Silander & Myllymäki, 2006; Koivisto & Sood, 2004), A* (Yuan & Malone, 2013), and integer linear programming (Jaakkola et al., 2010; Cussens, 2011).
Differentiable Score-based Optimization. Different from the aforementioned search approaches, NOTEARS (Zheng et al., 2018) reframes the combinatorial optimization problem as a continuous constrained optimization problem:

    min_G S(G; X),  s.t. H(G) = 0,    (3)

where H(G) = 0 is a differentiable equality DAG constraint. The constraint H(G) = 0 depicts the "DAGness" of G's adjacency matrix A(G) ∈ {0, 1}^{d×d}. Specifically, [A(G)]_{ij} = 1 if the causal edge X_j → X_i exists in E(G), otherwise [A(G)]_{ij} = 0. Prevailing implementations of the DAGness constraint are H(G) = Tr(e^{A⊙A}) − d (Zheng et al., 2018), H(G) = Tr[(I + αA ⊙ A)^d] − d (Yu et al., 2019), and others (Wei et al., 2020; Kyono et al., 2020; Bello et al., 2022; Zhu et al., 2021). As a result, the optimization problem in Equation 3 can be further formulated via the augmented Lagrangian method as:

    min_G S(G; X) + P_DAG(G),    (4)

where P_DAG(G) = αH(G) + (ρ/2)|H(G)|^2 is the penalty term enforcing DAGness on G, ρ > 0 is a penalty parameter, and α is the Lagrange multiplier.
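For intuition, the polynomial DAGness form H(G) = Tr[(I + αA ⊙ A)^d] − d can be computed with plain matrix powers; the sketch below (function name ours, with α fixed to 1/d for illustration) is zero exactly when the weighted graph is acyclic and positive otherwise:

```python
import numpy as np

def h_dagness(W):
    """H(G) = Tr[(I + (A ⊙ A)/d)^d] - d for a weighted adjacency matrix W.

    A ⊙ A is entrywise non-negative, so every directed cycle contributes
    a positive amount to the trace; H vanishes iff W encodes a DAG.
    """
    d = W.shape[0]
    M = np.eye(d) + (W * W) / d
    return np.trace(np.linalg.matrix_power(M, d)) - d
```

NOTEARS' exponential form Tr(e^{A⊙A}) − d behaves analogously; either variant supplies the gradient signal that the augmented Lagrangian penalty P_DAG(G) uses to push G toward acyclicity.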

3. METHODOLOGY OF RESCORE

On the basis of differentiable score-based causal discovery methods, we first devise our ReScore and then present its desirable properties.

3.1. BILEVEL FORMULATION OF RESCORE

Aiming to learn the causal structure accurately in practical scenarios, we focus on observational data that is heterogeneous and contains a large proportion of easy samples. Standard differentiable score-based causal discovery methods apply the average score function to all samples equally, and thus inherently rely on easy samples to obtain a high average goodness-of-fit. As a result, the DAG learner is error-prone to constructing easier-to-fit spurious edges based on the easy samples, while ignoring the causal relationship information maintained in hard samples. Assuming the oracle importance of each sample is known, we can assign distinct weights to different samples and formulate the reweighted score function S_w(G; X) instead of the average score function:

    S_w(G; X) = L_w(G; X) + λ R_sparse(G) = Σ_{i=1}^{n} w_i l(x_i, f(x_i)) + λ R_sparse(G),    (5)

where w = (w_1, . . . , w_n) is a sample reweighting vector of length n, wherein w_i indicates the importance of the i-th observed sample x_i. However, the oracle sample importance is usually unavailable in real-world scenarios. The problem hence becomes how to automatically learn an appropriate sample reweighting vector w. Intuitively, samples easily fitted with spurious edges should contribute less to DAG learning, while samples that do not hold spurious edges but contain critical information about causal edges should be more important. We therefore use a simple heuristic of downweighting the easier-to-fit but less informative samples, and upweighting the less-fitted but more informative samples. This inspires us to learn to allocate weights adaptively, with the aim of maximizing the influence of less well-fitted samples and failing the DAG learner. Formally, we cast the overall framework of reweighting samples to boost causal discovery as the following bilevel optimization problem:

    min_G S_{w*}(G; X) + P_DAG(G),  s.t. w* ∈ argmax_{w ∈ C(τ)} S_w(G; X),    (6)

where C(τ) := {w : 0 < τ/n ≤ w_1, . . . , w_n ≤ 1/(τn), Σ_{i=1}^{n} w_i = 1} for a cutoff threshold τ ∈ (0, 1). The deviation of the weight distribution from the uniform distribution is bounded by the hyperparameter τ. Clearly, Equation 6 consists of two objectives, where the inner-level objective (i.e., optimizing w by maximizing the reweighted score function) is nested within the outer-level objective (i.e., optimizing G by minimizing the differentiable score-based loss). Solving the outer-level problem is subject to the optimal value of the inner-level problem.
Now we introduce how to solve this bilevel optimization problem. In the inner loop, we first fix the DAG learner, which evaluates the error of each observed sample x_i, ∀i ∈ {1, . . . , n}, and then maximize the reweighted score function to learn the corresponding weights w*_i. In the outer loop, upon the reweighted observations whose weights are determined in the inner loop, we minimize the reweighted score function to optimize the DAG learner. By alternately training the inner and outer loops, the importance of each sample is adaptively estimated based on the DAG learner's error, which in turn gradually guides the DAG learner to perform better on the informative samples. It is worth highlighting that this ReScore scheme can be applied to any differentiable score-based causal discovery method listed in Section 2. The procedure of training ReScore is outlined in Algorithm 1. Furthermore, our ReScore has the following desirable advantages:
• As shown in Section 3.2, under mild conditions, ReScore inherits the identifiability property of the original differentiable score-based causal discovery method.
• ReScore is able to generate adaptive weights for the observations through bilevel optimization, so as to distinguish the more informative samples and fulfill their potential to guide DAG learning. This is consistent with our theoretical analysis in Section 3.3 and empirical results in Section 4.2.
• ReScore is widely applicable to various types of data and models. In other words, it is model-agnostic and can effectively handle heterogeneous data without knowing the domain annotations in advance. Detailed ReScore performance can be found in Section 4.
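To make the inner loop concrete: for a fixed DAG learner, the inner maximization in Equation 6 is a linear program over a box-constrained simplex, so its solution simply pushes the largest-loss samples toward the upper bound 1/(τn) and leaves the rest at the floor τ/n. A minimal NumPy sketch (the function name is ours, and the released code may instead take gradient steps on w):

```python
import numpy as np

def inner_loop_weights(losses, tau):
    """Maximize sum_i w_i * l_i over C(tau) = {tau/n <= w_i <= 1/(tau*n),
    sum_i w_i = 1}: greedily pour the remaining probability mass onto the
    samples with the largest losses."""
    n = len(losses)
    lo, hi = tau / n, 1.0 / (tau * n)
    w = np.full(n, lo)                    # start every weight at the floor
    budget = 1.0 - n * lo                 # mass left to distribute (= 1 - tau)
    for i in np.argsort(losses)[::-1]:    # largest-loss samples first
        add = min(hi - lo, budget)
        w[i] += add
        budget -= add
        if budget <= 0.0:
            break
    return w
```

The outer loop then minimizes S_{w*}(G; X) + P_DAG(G) with these weights frozen, and the two steps alternate.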

3.2. THEORETICAL ANALYSIS ON IDENTIFIABILITY

The graph identifiability issue is the primary challenge hindering the development of structure learning. As an optimization framework, the most desired property of ReScore is the capacity to ensure graph identifiability while substantially boosting the performance of the differentiable score-based DAG learner. We develop Theorem 1, which guarantees DAG identifiability when using ReScore. Rendering a DAG theoretically identifiable requires three standard steps (Peters et al., 2014; Zheng et al., 2020; Ng et al., 2022): (1) assuming a particular restricted family of functions and data distributions for the SEM in Equation 1; (2) theoretically proving the identifiability of the SEM; and (3) developing an optimization algorithm with a predefined score function and showing that the learned DAG asymptotically converges to the ground-truth DAG. Clearly, ReScore naturally inherits the original identifiability of a specific SEM as stated in Section 2. Consequently, the key concern lies in the third step: whether the DAG learned by our new optimization framework with the reweighted score function S_w(G; X) can asymptotically converge to the ground-truth DAG. To address this, we present the following theorem. Specifically, it demonstrates that, by guaranteeing the equivalence of the optimization problems (Equation 2 and Equation 6) in linear systems, the bounded weights do not affect the consistency results in the identifiability analysis. See the detailed proof in Appendix C.1.
Theorem 1. Suppose the SEM in Equation 1 is linear and the size of the observational data X is n. As the data size increases, i.e., n → ∞,

    argmin_G S_w(G; X) + P_DAG(G) − argmin_G S(G; X) + P_DAG(G) → 0 almost surely

in the following cases:
a. using the least-squares loss L(G; X) = (1/2n) ∥X − f(X)∥²_F;
b. using the negative log-likelihood loss with standard Gaussian noise.
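A one-line bound conveys the intuition behind Theorem 1 (a sketch only; the full consistency argument is in Appendix C.1). Since every weight in C(τ) satisfies τ/n ≤ w_i ≤ 1/(τn), for non-negative per-sample losses such as least squares the reweighted fit term is sandwiched between constant multiples of the average one:

```latex
\tau \, L(G; X)
\;=\; \frac{\tau}{n} \sum_{i=1}^{n} l\big(x_i, f(x_i)\big)
\;\le\; L_w(G; X)
\;=\; \sum_{i=1}^{n} w_i \, l\big(x_i, f(x_i)\big)
\;\le\; \frac{1}{\tau n} \sum_{i=1}^{n} l\big(x_i, f(x_i)\big)
\;=\; \frac{1}{\tau} \, L(G; X),
```

so graphs with small average loss cannot incur large reweighted loss and vice versa; the proof in Appendix C.1 upgrades this equivalence of near-minimizers to the almost-sure convergence statement as n → ∞.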

Remark:

The identifiability property of ReScore with two most common score functions, namely least-square loss and negative log-likelihood loss, is proved in Theorem 1. Similar conclusions can be easily derived for other loss functions, which we will explore in future work.

3.3. ORACLE PROPERTY OF ADAPTIVE WEIGHTS

Our ReScore suggests assigning varying degrees of importance to different observational samples. At its core is a simple yet effective heuristic: the less-fitted samples are more important than the easier-to-fit samples, as they do not hold spurious edges but contain critical information about the causal edges. Hence, mining hard-to-learn causation information is promising for helping DAG learners mitigate the negative influence of spurious edges. The following theorem shows the adaptiveness property of ReScore: instead of treating all samples equally, ReScore tends to upweight the importance of hard but informative samples while downweighting the reliance on easier-to-fit samples.
Theorem 2. Suppose that in the optimization phase, the i-th observation has a larger error than the j-th observation, in the sense that l(x_i, f(x_i)) > l(x_j, f(x_j)), where i, j ∈ {1, . . . , n}. Then w*_i ≥ w*_j, where w*_i, w*_j are the optimal weights in Equation 6. The equality holds if and only if w*_i = w*_j = τ/n or w*_i = w*_j = 1/(τn).
See Appendix C.2 for the detailed proof. It is simple to infer that, following the inner loop that maximizes the reweighted score function S_w(G; X), the observations are ranked by the learned adaptive weights w*. That is, an observation equipped with a higher weight will have a greater impact on the subsequent outer loop and dominate the DAG learning.

4. EXPERIMENTS

We aim to answer the following research questions:
• RQ1: As a model-agnostic framework, can ReScore widely strengthen the differentiable score-based causal discovery baselines?
• RQ2: How does ReScore perform when the noise distribution varies? Can ReScore effectively learn adaptive weights that successfully identify the important samples?
Baselines. To answer the first question (RQ1), we implement various backbone models, including NOTEARS (Zheng et al., 2018) and GOLEM (Ng et al., 2020) in linear systems, and NOTEARS-MLP (Zheng et al., 2020) and GraN-DAG (Lachapelle et al., 2020) in nonlinear settings. To answer the second question (RQ2), we compare GOLEM+ReScore and NOTEARS-MLP+ReScore to a SOTA baseline, CD-NOD (Huang et al., 2020), and a recently proposed approach, DICD (Wang et al., 2022), both of which require ground-truth domain annotations. For a comprehensive comparison, extensive experiments are conducted on both homogeneous and heterogeneous synthetic datasets as well as a real-world benchmark dataset, Sachs (Sachs et al., 2005). On Sachs, GES (Chickering, 2002), a benchmark discrete score-based causal discovery method, is also considered. A detailed description of the employed baselines can be found in Appendix D.1.
Evaluation Metrics. To evaluate the quality of structure learning, four metrics are reported: True Positive Rate (TPR), False Discovery Rate (FDR), Structural Hamming Distance (SHD), and Structural Intervention Distance (SID) (Peters & Bühlmann, 2015), averaged over ten random trials.
Simulations. The generated data differ along three dimensions: the number of nodes, the degree of edge sparsity, and the type of graph. Two well-known graph sampling models, Erdos-Renyi (ER) and scale-free (SF) (Barabási & Albert, 1999), are considered with kd expected edges (denoted ERk or SFk) and d = {10, 20, 50} nodes.
Specifically, in linear settings, similar to (Zheng et al., 2018; Gao et al., 2021), the coefficients are assigned following U(−2, −0.5) ∪ U(0.5, 2) with additive standard Gaussian noise. In nonlinear settings, following (Zheng et al., 2020), the ground-truth SEM in Equation 1 is generated under a Gaussian process (GP) with a radial basis function kernel of bandwidth one, where each f_i(·) is an additive noise model with N_i an i.i.d. standard normal variable. Both settings are known to be fully identifiable (Peters & Bühlmann, 2014; Peters et al., 2014). For each graph, 10 datasets of 2,000 samples are generated, and the mean and standard deviation of each metric are reported for a fair comparison.
Results. Tables 1 and 9 and the tables in Appendix D.4 report the empirical results on both linear and nonlinear synthetic data. The error bars depict the standard deviation across datasets over ten trials. The red and blue percentages refer, respectively, to the increase and decrease of ReScore relative to the original score-based methods in each metric. The best-performing methods are in bold. We find that:
• ReScore consistently and significantly strengthens the score-based methods for structure learning across all datasets. In particular, it achieves substantial gains over the state-of-the-art baselines of around 3% to 60% in terms of SHD, revealing a lower number of missing, falsely detected, and reversed edges. We attribute the improvements to the dynamically learned adaptive weights, which boost the quality of score-based DAG learners. With a closer look at the TPR and FDR, ReScore typically lowers FDR by eliminating spurious edges and enhances TPR by actively identifying more correct edges. This clearly demonstrates that ReScore effectively filters and upweights the more informative samples to better extract the causal relationships.
Figure 2 also illustrates the clear trend that ReScore increasingly excels over NOTEARS-MLP as the sparsity penalty climbs. Additionally, as Table 7 indicates, ReScore adds only a negligible amount of computational complexity compared to the backbone score-based DAG learners.
• Score-based causal discovery baselines suffer from a severe performance drop on high-dimensional dense graph data. Despite the advances beyond linear settings, NOTEARS-MLP and GraN-DAG fail to scale to more than 50 nodes in SF4 and ER4 graphs, mainly due to difficulties in enforcing acyclicity in high-dimensional dense graph data (Varando, 2020; Lippe et al., 2022). Specifically, the TPR of GraN-DAG and NOTEARS-MLP on SF4 graphs with 50 nodes is lower than 0.2, which indicates that they cannot even accurately detect 40 edges out of the 200 ground-truth edges. ReScore, as an optimization framework, relies heavily on the performance of the score-based backbone.

4.2 EVALUATIONS ON HETEROGENEOUS DATA

4.2.1 EVALUATIONS ON SYNTHETIC HETEROGENEOUS DATA

Motivations. It is commonplace to encounter heterogeneous data in real-world applications, of which the underlying causal generating process remains stable but the noise distributions may vary. DAG learners designed specifically for heterogeneous data tend to assume strict conditions and require the group annotation of each sample. Group annotations, however, are extremely costly and challenging to obtain. We conjecture that a robust DAG learner should be able to handle heterogeneous data without group annotations.
Simulations. Synthetic heterogeneous data in both linear and nonlinear settings (n = 1000, d = 20, ER2) containing two distinct groups are also considered. 10% of the observations come from the disadvantaged group, where half of the noise variables N_i defined in Equation 1 follow N(0, 1) and the remaining half follow N(0, 0.1). The other 90% of the observations are generated from the dominant group, where the scales of the noise variables are flipped.
Results.
To evaluate whether ReScore can handle heterogeneous data without requiring group annotations, by automatically identifying and upweighting informative samples, we compare baseline+ReScore to CD-NOD and DICD, two SOTA causal discovery approaches that rely on group annotations and are developed for heterogeneous data. Additionally, a non-adaptive reweighting method called baseline+IPS is taken into account, in which sample weights are inversely proportional to group sizes. Specifically, we divide the observations into two subgroups. Obviously, a single sample from the disadvantaged group is more informative than a sample from the dominant group, as it offers additional insight into depicting the causal edges. As Figure 3 shows, dots of different colours are mixed and scattered at the beginning of training. After multiple iterations of the inner and outer loops, the red dots from the disadvantaged group are gradually identified and assigned relatively larger weights compared to the blue dots with the same measure of fitness. This illustrates the effectiveness of ReScore and further offers insight into its performance improvements on heterogeneous data. Overall, all figures show clear positive trends, i.e., the underrepresented samples tend to receive larger weights. These results validate the adaptive-weight property in Theorem 2. Table 2 indicates that ReScore drives impressive performance breakthroughs on heterogeneous data, achieving competitive or even lower SHD without group annotations compared to CD-NOD and DICD, which serve as the lower bound. Specifically, both GOLEM and NOTEARS-MLP suffer a notorious performance drop when the homogeneity assumption is violated, which hinders them from scaling up to real-world large-scale applications.
We ascribe this hurdle to blindly scoring the observational samples evenly, rather than distilling the crucial group information from the distribution shift of the noise variables. To better highlight the significance of the adaptive property, we also take baseline+IPS into account, which views the ratio of group sizes as the propensity score and exploits its inverse to reweight each sample's loss. Baseline+IPS suffers severe performance drops in terms of TPR, revealing the limitation of fixed weights. In stark contrast, benefiting from adaptive weights, ReScore can extract group information from heterogeneous data and achieve a more profound understanding of causation, leading to higher DAG learning quality. This validates that ReScore endows the backbone score-based DAG learner with better robustness against heterogeneous data and alleviates the negative influence of spurious edges.

4.2.2 EVALUATIONS ON REAL HETEROGENEOUS DATA

Sachs (Sachs et al., 2005) contains measurements of multiple phosphorylated proteins and phospholipid components, recorded simultaneously in a large number of individual primary human immune system cells. In Sachs, nine different perturbation conditions are applied to sets of individual cells, each of which administers certain reagents to the cells. With the annotations of perturbation conditions, we consider Sachs as real-world heterogeneous data (Mooij et al., 2020). We train baselines on 7,466 samples, where the ground-truth graph (11 nodes and 17 edges) is widely accepted by the biological community. As Table 3 illustrates, ReScore steadily and prominently boosts all baselines, including both differentiable and discrete score-based causal discovery approaches, w.r.t. the SHD and SID metrics. This clearly shows the effectiveness of ReScore in mitigating the reliance on easier-to-fit samples.
With a closer look at the TPR and FDR, baseline+ReScore surpasses the corresponding state-of-the-art baseline by a large margin in most cases, indicating that ReScore helps predict more correct edges and fewer false edges. Remarkably, compared to CD-NOD, which is designed for heterogeneous data and utilizes the annotations as prior knowledge, GES+ReScore obtains competitive TPR without using ground-truth annotations. Moreover, GraN-DAG+ReScore reaches the same SHD as CD-NOD while predicting 15 and 18 edges, respectively. These findings validate the potential of ReScore as a promising research direction for enhancing the generalization and accuracy of DAG learning methods on real-world data.
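For reference, the linear ER simulation and the two-group heterogeneous variant described above can be sketched as follows. This is a simplified reimplementation under stated assumptions, not the released code: acyclicity is obtained by masking a strictly upper-triangular matrix (so the topological order is the column order), and N(0, 0.1) is read as a variance of 0.1:

```python
import numpy as np

def simulate_linear_er(d, n, k=2, heterogeneous=False, rng=None):
    """Sample an ER-style DAG with ~k*d expected edges, draw edge weights
    from U(-2, -0.5) ∪ U(0.5, 2), and generate n observations of the
    linear SEM X = XW + N with (optionally heterogeneous) Gaussian noise."""
    rng = np.random.default_rng(rng)
    p = min(1.0, 2.0 * k / (d - 1))               # edge prob. for ~k*d edges
    mask = np.triu(rng.random((d, d)) < p, k=1)   # upper-triangular => acyclic
    signs = rng.choice([-1.0, 1.0], size=(d, d))
    W = mask * signs * rng.uniform(0.5, 2.0, size=(d, d))
    scale = np.ones((n, d))
    if heterogeneous:
        # 10% minority group: half the noises are N(0, 1), half N(0, 0.1);
        # the 90% dominant group flips the two scales.
        scale[:, d // 2:] = np.sqrt(0.1)
        scale[n // 10:] = scale[n // 10:][:, ::-1]
    noise = scale * rng.standard_normal((n, d))
    X = np.zeros((n, d))
    for i in range(d):                  # column order is a topological order
        X[:, i] = X @ W[:, i] + noise[:, i]
    return W, X
```

A DAG learner can then be scored against the returned W, e.g., by comparing the support of its estimated adjacency matrix via SHD, TPR, and FDR.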

5. CONCLUSION

Today's differentiable score-based causal discovery approaches are still far from being able to accurately recover causal structures, despite their great success on synthetic linear data. In this paper, we proposed ReScore, a simple-yet-effective model-agnostic optimization framework that simultaneously mitigates spurious edge learning and generalizes to heterogeneous data by utilizing learnable adaptive weights. Grounded by theoretical proofs and empirical visualization studies, ReScore successfully identifies the informative samples and yields a consistent and significant boost in DAG learning. Extensive experiments verify that the remarkable improvement of ReScore on a variety of synthetic and real-world datasets indeed comes from the adaptive weights. Two limitations of ReScore will be addressed in future work. First, the performance of ReScore is closely tied to the causal discovery backbone model, which leads to only minor improvements when the backbone method fails. Second, having empirically explored the sensitivity to pure noise samples in Appendix D.3.2, we will theoretically analyze and further enhance the robustness of ReScore against such noise. This is expected to substantially improve DAG learning quality and to distinguish truly informative samples from pure noise samples. We believe that ReScore provides a promising research direction for diagnosing performance degradation on nonlinear and heterogeneous data in structure learning, and will inspire more work in the future.

A RELATED WORK

Differentiable score-based causal discovery methods. Learning the directed acyclic graph (DAG) from purely observational data is challenging, owing mainly to the intractable combinatorial nature of the acyclic graph space. A recent breakthrough, NOTEARS (Zheng et al., 2018), formulates the discrete DAG constraint as a continuous equality constraint, resulting in a differentiable score-based optimization problem. Subsequent works extend the formulation to nonlinear problems using a variety of deep learning models, such as neural networks (NOTEARS+ (Zheng et al., 2020), GraN-DAG (Lachapelle et al., 2020), CASTLE (Kyono et al., 2020), MCSL (Ng et al., 2019a), DARING (He et al., 2021)), generative autoencoders (CGNN (Goudet et al., 2018), CausalVAE (Yang et al., 2021), ICL (Wang et al., 2020), DAG-GAN (Gao et al., 2021)), graph neural networks (DAG-GNN (Yu et al., 2019), GAE (Ng et al., 2019b)), generative adversarial networks (SAM (Kalainathan et al., 2018), ICL (Wang et al., 2020)), and reinforcement learning (RL-BIC (Zhu et al., 2020)). Multi-domain causal structure learning. Most multi-domain causal structure learning methods are constraint-based and adopt diverse definitions of domains. In our paper, multi-domain or multi-group refers to heterogeneous data whose underlying causal generating process remains stable while the distributions of noise variables may vary. In the literature, our definition of multi-domain is consistent with MC (Ghassami et al., 2018), CD-NOD (Huang et al., 2020), LRE (Ghassami et al., 2017), DICD (Wang et al., 2022), and others (Peters et al., 2016).
In addition to the strict requirement of knowing the domain annotations in advance, the majority of structure learning models dedicated to heterogeneous data exhibit limited applicability, due to linear-case assumptions (Ghassami et al., 2018; 2017), identification of causal directions only (Huang et al., 2019; Cai et al., 2020), or time-consuming inference (Huang et al., 2020).

B ALGORITHM OF RESCORE

Input: observational data X, DAG learner θ_G, reweighting model θ_w, k_1 = 0, k_2 = 0
for k_1 ≤ K_outer do
    Fix the reweighting model parameters θ_w
    Calculate w^* by applying the threshold [τ/n, 1/(nτ)]
    Optimize θ_G by minimizing S_{w^*}(G; X) + P_DAG(G)    # Outer optimization in Equation 6
    if k_1 ≥ k_reweight then
        for k_2 ≤ K_inner do
            Fix the DAG learner's parameters θ_G
            Get w from θ_w by applying the threshold [τ/n, 1/(nτ)]
            Optimize θ_w by maximizing S_w(G; X)    # Inner optimization in Equation 6
            k_2 ← k_2 + 1
        end for
        k_2 ← 0
    end if
    k_1 ← k_1 + 1
end for
return the predicted G from the DAG learner

C IN-DEPTH ANALYSIS OF RESCORE

C.1 PROOF OF THEOREM 1

Theorem 1. Suppose the SEM in Equation 1 is linear and the size of the observational data X is n. As the data size increases, i.e., n → ∞,

arg min_G S_w(G; X) + P_DAG(G) − arg min_G S(G; X) + P_DAG(G) →a.s. 0

in the following cases: a. using the least-squares loss L(G; X) = (1/2n) ∥X − f(X)∥²_F; b. using the negative log-likelihood loss with standard Gaussian noise.

Proof. Let B = (β_1, . . . , β_d) ∈ R^{d×d} be the weighted adjacency matrix of the SEM; the linear SEM can then be written in matrix form:

X = XB + N,    (7)

where E(N | X) = 0, Var(N | X) = diag(σ_1², . . . , σ_d²), and B_ii = 0 since X_i cannot be a parent of itself. Let X ∈ R^{n×d} be the observational data and N ∈ R^{n×d} the corresponding errors; then X = XB + N. The original and reweighted objectives are

S(B; X) + P_DAG(B) = L(B; X) + λ R_sparse(B) + P_DAG(B),
S_w(B; X) + P_DAG(B) = L_w(B; X) + λ R_sparse(B) + P_DAG(B).

Since only the first (goodness-of-fit) terms differ, we consider this term alone. For the least-squares loss, the optimization problem is

min_B L_w(B; X) = min_B Σ_{i=1}^n w_i ℓ(x_i, x_i B),  s.t. B_jj = 0, j = 1, . . . , d.

Let W = diag(w_1, . . . , w_n) be the n-dimensional diagonal matrix, and rewrite the loss function as

L_w(B; X) = Σ_{i=1}^n w_i ∥x_i − x_i B∥²_2 = Σ_{i=1}^n w_i (x_i − x_i B)(x_i − x_i B)^⊤ = Σ_{i=1}^n Σ_{j=1}^d w_i (X_ij − x_i β_j)² = Σ_{j=1}^d (x^j − X β_j)^⊤ W (x^j − X β_j),

where x^j is the j-th column of X. Let D_j be the d-dimensional identity matrix with its j-th diagonal element set to 0, for j = 1, . . . , d. The above optimization can then be written without the restriction:

min_B L_w(B; X) = min_B Σ_{j=1}^d (x^j − X D_j β_j)^⊤ W (x^j − X D_j β_j)
= min_B Σ_{j=1}^d [(x^j)^⊤ W x^j − 2 (x^j)^⊤ W X D_j β_j + β_j^⊤ D_j^⊤ X^⊤ W X D_j β_j].
The partial derivative of the loss function with respect to β_j is

∂L_w(B; X)/∂β_j = ∂[(x^j)^⊤ W x^j − 2 (x^j)^⊤ W X D_j β_j + β_j^⊤ D_j^⊤ X^⊤ W X D_j β_j]/∂β_j
= −2 D_j^⊤ X^⊤ W x^j + 2 D_j^⊤ X^⊤ W X D_j β_j.

Setting the partial derivative to zero produces the optimal parameter:

β̂_j = D_j^⊤ (X^⊤ W X)^{-1} D_j D_j^⊤ X^⊤ W x^j
= D_j^⊤ (X^⊤ W X)^{-1} D_j D_j^⊤ X^⊤ W (X D_j β_j + N^j)
= D_j β_j + D_j (X^⊤ W X)^{-1} X^⊤ W N^j,    (8)

where N^j ∈ R^n is the j-th column of N. The second equality holds because x^j = X D_j β_j + N^j. Similarly, one can obtain the optimal parameter for the ordinary least-squares loss:

β̃_j = D_j β_j + D_j (X^⊤ X)^{-1} X^⊤ N^j.    (9)

The only difference between Equation 8 and Equation 9 lies in the second term. Computing the mean and variance matrix of the second term in Equation 8, we get

E[(X^⊤ W X)^{-1} X^⊤ W N^j] = E[ E[(X^⊤ W X)^{-1} X^⊤ W N^j | X] ] = E[(X^⊤ W X)^{-1} X^⊤ W · E(N^j | X)] = 0,

Var[(X^⊤ W X)^{-1} X^⊤ W N^j]
= E[(X^⊤ W X)^{-1} X^⊤ W N^j (N^j)^⊤ W X (X^⊤ W X)^{-1}] − E[(X^⊤ W X)^{-1} X^⊤ W N^j] E[(X^⊤ W X)^{-1} X^⊤ W N^j]^⊤
= E[ (X^⊤ W X)^{-1} X^⊤ W · E(N^j (N^j)^⊤ | X) · W X (X^⊤ W X)^{-1} ]
= σ_j² E[(X^⊤ W X)^{-1} X^⊤ W² X (X^⊤ W X)^{-1}].

The last equality holds because E(N | X) = 0 and Var(N | X) = diag(σ_1², . . . , σ_d²), so that, conditionally on X, the entries of N^j are i.i.d. with mean zero and variance σ_j², i.e., E(N^j (N^j)^⊤ | X) = σ_j² I_n. Since w ∈ C(τ), the variance matrix is finite. By Kolmogorov's strong law of large numbers, the second term converges to zero almost surely; thus β̂_j →a.s. D_j β_j, the same limit as in the ordinary case. Since the noises N_1, . . . , N_d are jointly independent, the same argument applies to every other j ∈ {1, . . . , d}. Let B̂ = (β̂_1, . . . , β̂_d) and B̃ = (β̃_1, . . . , β̃_d); then B̂ − B̃ →a.s. 0. This establishes the convergence for case a.
For case b., since the noise follows a Gaussian distribution, i.e., X − XB = N = (N_1, . . . , N_d) ∼ N(0, diag(σ_1², . . . , σ_d²)), the loss function (the negative log-likelihood) is

L_w(B; X) = − Σ_{i=1}^n w_i Σ_{j=1}^d [ log(1/(σ_j √(2π))) − (X_ij − x_i β_j)²/(2σ_j²) ]
= Σ_{j=1}^d Σ_{i=1}^n w_i log(σ_j √(2π)) + Σ_{j=1}^d Σ_{i=1}^n (w_i/(2σ_j²)) (X_ij − x_i β_j)²
= Σ_{j=1}^d Σ_{i=1}^n w_i log(σ_j √(2π)) + Σ_{j=1}^d (1/(2σ_j²)) (x^j − X β_j)^⊤ W (x^j − X β_j).    (10)

Minimizing this loss function w.r.t. B is equivalent to minimizing the second term in Equation 10:

min_B L_w(B; X) ⟺ min_B Σ_{j=1}^d (1/(2σ_j²)) (x^j − X β_j)^⊤ W (x^j − X β_j).

The right-hand side is identical to the loss function in case a. up to the coefficients 1/(2σ_j²), j = 1, . . . , d. Therefore, the same argument yields the equivalence result for case b. This completes the proofs of the two special cases.
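The convergence claimed in Theorem 1 for case a. can be sanity-checked numerically. The snippet below is our own minimal numpy sketch (not the paper's released code): it simulates a linear chain SEM, draws weights inside the box of C(τ) (renormalized to sum to one), and compares the reweighted and ordinary least-squares minimizers computed column-wise under the B_jj = 0 restriction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 50_000, 3, 0.5
B = np.zeros((d, d))
B[0, 1], B[1, 2] = 1.5, -0.8                 # chain X1 -> X2 -> X3
N = rng.normal(size=(n, d))                  # Gaussian noise, sigma_j = 1
X = N @ np.linalg.inv(np.eye(d) - B)         # solves X = XB + N

w = rng.uniform(tau / n, 1 / (n * tau), n)   # weights inside the box of C(tau)
w /= w.sum()                                 # renormalized to sum to 1

def ls_minimizer(X, w):
    """Column-wise weighted least squares with the B_jj = 0 restriction (D_j)."""
    d = X.shape[1]
    B_hat = np.zeros((d, d))
    for j in range(d):
        idx = [k for k in range(d) if k != j]      # drop the j-th regressor
        Xw = X[:, idx] * w[:, None]                # rows of X^T W
        B_hat[idx, j] = np.linalg.solve(Xw.T @ X[:, idx], Xw.T @ X[:, j])
    return B_hat

B_w = ls_minimizer(X, w)                     # reweighted minimizer
B_o = ls_minimizer(X, np.full(n, 1 / n))     # ordinary minimizer
gap = np.abs(B_w - B_o).max()                # shrinks as n grows (Theorem 1)
```

As n grows, `gap` vanishes, while both minimizers converge to the same limit, as the proof of case a. predicts.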

C.2 PROOF OF THEOREM 2

Theorem 2. Suppose that in the optimization phase the i-th observation has a larger error than the j-th observation, in the sense that l(x_i, f(x_i)) > l(x_j, f(x_j)), where i, j ∈ {1, . . . , n}. Then w^*_i ≥ w^*_j, where w^*_i, w^*_j are the optimal weights in Equation 6. The equality holds if and only if w^*_i = w^*_j = τ/n or w^*_i = w^*_j = 1/(nτ).

Proof. We prove the theorem by contradiction. Without loss of generality, let i = 1, j = 2, and suppose w^*_1 < w^*_2. Since w^* ∈ C(τ), one can find a small constant ε ∈ (0, min{1/(nτ) − w^*_1, w^*_2 − τ/n}) such that w^{**} = (w^*_1 + ε, w^*_2 − ε, w^*_3, . . . , w^*_n) ∈ C(τ). Therefore,

S_{w^*}(G; X) − S_{w^{**}}(G; X) = w^*_1 · l(x_1, f(x_1)) + w^*_2 · l(x_2, f(x_2)) − [(w^*_1 + ε) · l(x_1, f(x_1)) + (w^*_2 − ε) · l(x_2, f(x_2))]
= ε · [l(x_2, f(x_2)) − l(x_1, f(x_1))] < 0,    (11)

which contradicts w^* ∈ arg max_w S_w(G; X). Thus, by contradiction, w^*_1 ≥ w^*_2, as stated in the theorem. When τ/n < w^*_1 = w^*_2 < 1/(nτ), we can again find a small ε ∈ (0, min{1/(nτ) − w^*_1, w^*_2 − τ/n}) such that Equation 11 holds; similarly S_{w^*}(G; X) < S_{w^{**}}(G; X), so by contradiction the equality can only hold at w^*_1 = w^*_2 = τ/n or w^*_1 = w^*_2 = 1/(nτ).

D SUPPLEMENTARY EXPERIMENTS

D.1 BASELINES

We select seven state-of-the-art causal discovery methods as baselines for comparison:

• NOTEARS (Zheng et al., 2018) is a breakthrough work that first recasts the combinatorial graph search problem as a continuous optimization problem in linear settings. NOTEARS estimates the true causal graph by minimizing the reconstruction loss under the continuous acyclicity constraint.

• NOTEARS-MLP (Zheng et al., 2020) is an extension of NOTEARS to nonlinear settings, approximating the generative SEM by an MLP while applying the continuous acyclicity constraint only to the first layer of the MLP.
• GraN-DAG (Lachapelle et al., 2020) adapts the continuous constrained optimization formulation to allow for nonlinear relationships between variables using neural networks, and makes use of a final pruning step to remove spurious edges, thus achieving good results in nonlinear settings. • GOLEM (Ng et al., 2020) improves on the least-squares score function (Zheng et al., 2018) by proposing a score function that directly maximizes the data likelihood. They show that the likelihood-based score function with soft sparsity regularization is sufficient to asymptotically learn a DAG equivalent to the ground-truth DAG. • DICD (Wang et al., 2022) aims to discover the environment-invariant causation while removing the environment-dependent correlation, based on ground-truth domain annotations. • CD-NOD (Huang et al., 2020) is a constraint-based causal discovery method designed for heterogeneous data, i.e., datasets from different environments. CD-NOD utilizes the independent changes across environments to predict causal orientations and proposes constraint-based and kernel-based methods to find the causal structure. • GES (Chickering, 2002) is a score-based search algorithm that searches over the space of equivalence classes of Bayesian network structures.

D.2 EXPERIMENTAL SETTINGS

For NOTEARS, we follow the original linear implementation. For GOLEM, we adopt the GOLEM-NV setting from the original repo. For NOTEARS-MLP, we follow the original nonlinear implementation, which consists of a Multilayer Perceptron (MLP) with two hidden layers of ten neurons each and ReLU activation functions (except for the Sachs dataset, which uses only one hidden layer, inheriting the settings from Zheng et al. (2020)). For GraN-DAG, we employ the pns, training, and cam-pruning stages from the original code and tune the three pipeline stages together for the best performance. The ReScore adaptive weights learning model for all nonlinear baselines consists of two hidden layers with ReLU activations; for linear baselines, the number of layers is reduced to one. All experiments are conducted on a single Tesla V100 GPU. The detailed hyperparameter search space for each method is shown in Table 4.
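To make the alternating loop of Algorithm 1 concrete, the following is our own minimal numpy sketch for a linear learner. It is an illustration, not the released code: the acyclicity penalty P_DAG and the warm-up step k_reweight are omitted, the adaptive weights learning model is collapsed to per-sample logits, and all names (`project_to_C`, `rescore_linear`, learning rates) are hypothetical.

```python
import numpy as np

def project_to_C(theta, tau):
    """Logits -> softmax weights, clipped to [tau/n, 1/(n*tau)], renormalized."""
    n = len(theta)
    w = np.exp(theta - theta.max())
    w /= w.sum()
    w = np.clip(w, tau / n, 1 / (n * tau))
    return w / w.sum()

def rescore_linear(X, tau=0.9, K_outer=100, K_inner=5, lr_g=0.05, lr_w=0.5):
    n, d = X.shape
    theta_w = np.zeros(n)            # per-sample weight logits (inner variable)
    B = np.zeros((d, d))             # linear learner's parameters (outer variable)
    for _ in range(K_outer):
        w = project_to_C(theta_w, tau)            # fix theta_w, obtain w*
        R = X - X @ B                             # per-sample residuals
        grad_B = -2 * X.T @ (w[:, None] * R)      # grad of sum_i w_i ||x_i - x_i B||^2
        B -= lr_g * grad_B
        np.fill_diagonal(B, 0.0)                  # keep B_ii = 0
        for _ in range(K_inner):                  # inner step: raise weighted loss
            losses = ((X - X @ B) ** 2).sum(axis=1)
            theta_w += lr_w * (losses - losses.mean())   # up-weight hard samples
    return B, project_to_C(theta_w, tau)
```

With τ close to 1 the clipped weights stay near uniform and the loop reduces to the unweighted backbone, which matches the role of τ discussed in Appendix D.3.3.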

D.3 STUDY ON RESCORE

D.3.1 ILLUSTRATIVE EXAMPLES OF RESCORE

Motivations. To fully comprehend the benefits of reweighting, two research hypotheses need to be verified. First, we must establish the fundamental premise of ReScore, namely that real-world datasets inevitably include samples of varying importance; in other words, many informative samples come from disadvantaged groups in real-world scenarios. Second, we must confirm that the adaptive weights learned by ReScore faithfully reflect sample importance, i.e., less-fitted samples typically come from disadvantaged groups and are more important than well-fitted samples. Simulations. The real-world Sachs (Sachs et al., 2005) dataset naturally contains nine groups, where each group corresponds to a different experimental condition. We first rank the importance of each group in Sachs, using the average weight learned by ReScore for each group as the criterion. We then eliminate 500 randomly selected samples from one specific group, run NOTEARS-MLP, and report its DAG accuracy inferred from the remaining samples. Note that the sample size in each group, ranging from 700 to 900, is fairly balanced. Results. Table 5 clearly shows a degradation trend w.r.t. SHD and TPR metrics as the importance of the deleted group grows. Specifically, removing samples from disadvantaged groups such as Groups 8 and 0, which have the highest average weights, significantly degrades the DAG learning quality. In contrast, the SHD and TPR of NOTEARS-MLP can be maintained or even slightly improved by excluding the samples from groups with relatively low average weights. This illustrates that samples of different importance are naturally present in real-world datasets, and that ReScore is capable of successfully extracting this importance.
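The group-importance ranking used in the simulations above reduces to averaging the learned weights per group. A small sketch (the helper name and the toy weights and labels are illustrative, not from the paper):

```python
import numpy as np

def rank_groups_by_weight(weights, groups):
    """Rank groups by the average ReScore weight of their samples:
    a higher average weight marks a more informative (disadvantaged) group."""
    weights = np.asarray(weights, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    avg = np.array([weights[groups == g].mean() for g in labels])
    order = np.argsort(-avg)                 # descending importance
    return labels[order], avg[order]

# Toy example: samples from group 8 carry higher weights than group 0.
labels, avg = rank_groups_by_weight([0.4, 0.3, 0.2, 0.1], [8, 8, 0, 0])
```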
D.3.2 ROBUSTNESS TO PURE NOISE SAMPLES

To evaluate the robustness of ReScore against pure noise samples, we conduct the following experiments. Simulations. We corrupt a fraction p_corrupt of the samples in nonlinear settings (n = 2000, d = 20, ER2), where the noise samples are generated from a different structural causal model. We try a broad range of p_corrupt = {0, 0.01, 0.02, 0.05, 0.08, 0.1, 0.2, 0.3, 0.5}. Results. Table 6 reports the performance of NOTEARS-MLP and two ReScore variants (no cut-off threshold and optimal τ) when encountering pure noise data. The best-performing methods are in bold; Imp.% measures the relative improvement of ReScore (Optimal τ) over the backbone NOTEARS-MLP. We observe that ReScore (Optimal τ) consistently yields remarkable improvements over NOTEARS-MLP when less than 20% of the samples are corrupted. These results demonstrate the robustness of ReScore when handling data that contains a small proportion of pure noise. Surprisingly, even when the cut-off threshold τ is set close to 0, ReScore still achieves relative gains over the baseline when less than 10% of the samples are pure noise, although it is more sensitive to noise samples than with the optimal cut-off threshold τ. These findings further support the effectiveness of adaptive weights and show the potential of ReScore.
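The corruption protocol above can be sketched as follows; this is our own illustration of the mixing step (the helper name is hypothetical), where `X_noise` stands for data generated by a different structural causal model:

```python
import numpy as np

def mix_in_noise_samples(X_clean, X_noise, p_corrupt, rng):
    """Replace a fraction p_corrupt of rows in X_clean with rows drawn
    from data generated by a different structural causal model."""
    n = len(X_clean)
    m = int(round(p_corrupt * n))
    X = X_clean.copy()
    rows = rng.choice(n, size=m, replace=False)          # samples to corrupt
    X[rows] = X_noise[rng.choice(len(X_noise), size=m, replace=False)]
    return X
```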

D.3.3 EFFECT OF HYPERPARAMETER τ .

We investigate the effect of the cut-off threshold τ on the performance of ReScore. Intuitively, ReScore relies on the hyperparameter τ to balance hard sample mining against robustness to extremely noisy samples. On one hand, setting the threshold closer to 0 results in no weight clipping and leaves the model susceptible to noise, yielding sub-optimal performance. On the other hand, setting the threshold closer to 1 disables the reweighting scheme and eventually reduces ReScore to its backbone model. We conduct experiments under different settings of τ using n = 2000 samples generated from the GP model on ER4 graphs with d = 20 nodes. The weight distribution under the best-performing threshold τ = 0.9 and the trend of SHD w.r.t. τ are shown in Figure 4. One can observe that ReScore obtains its best performance at τ = 0.9, while a smaller or larger threshold results in sub-optimal performance. Furthermore, we find that across different settings, the optimal threshold τ usually falls in the range [0.7, 0.99]. This indicates that ReScore performs best when adaptive reweighting is conducted within a restricted range.
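The two limiting behaviors of τ described above can be seen directly from the thresholding step. Below is a sketch of the clipping operation (our illustration; the released code may implement the projection differently):

```python
import numpy as np

def clip_weights(w, tau):
    """Clip sample weights to [tau/n, 1/(n*tau)] and renormalize to sum to 1."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    w = np.clip(w, tau / n, 1 / (n * tau))
    return w / w.sum()

w_raw = np.array([0.70, 0.20, 0.05, 0.05])
w_hi = clip_weights(w_raw, 1.0)   # tau -> 1: uniform weights, backbone recovered
w_lo = clip_weights(w_raw, 0.1)   # small tau: adaptive weights left untouched
```

At τ = 1 both bounds collapse to 1/n, so every sample receives weight 1/n; at small τ the box is loose and the adaptive weights pass through unchanged.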

D.3.4 SENSITIVITY TO NEURAL NETWORK COMPLEXITY.

We also investigate the effect of the number of hidden units in the adaptive weights learning model of ReScore. We plot the TPR, FDR, SHD, and SID with a varying number of hidden units, ranging from 10 to 100, in nonlinear settings, using n = 600 and n = 2,000 samples generated from the GP model on ER4 graphs with d = 10 nodes. Detailed results can be found in Figure 5. One can observe that our model is stable as the number of neurons increases, illustrating the insensitivity of ReScore w.r.t. the number of neurons in the adaptive weights learning model.



Figure 2: Performance comparison between NOTEARS-MLP and ReScore on ER4 graphs of 10 nodes on nonlinear synthetic datasets. The hyperparameter λ defined in Equation 2 refers to the graph sparsity. See more results in Appendix D.4.

4.1 OVERALL PERFORMANCE COMPARISON ON SYNTHETIC DATA (RQ1)

Figure 3: Illustration of adaptive weights learned by ReScore w.r.t. sample loss on both linear and nonlinear synthetic data. For each dataset, the left and right plots refer to the distribution of adaptive weights at the first and last epochs in the outer loop, respectively (i.e., the value of w * , when k 1 = 0 and k 1 = K outer in Algorithm 1, respectively). The disadvantaged but more informative samples are represented by the red dots. The dominant and easy samples, in contrast, are in blue.


Figure 4: Study of varying τ in ReScore model.

Figure 6: Performance comparison between NOTEARS-MLP and ReScore on ER2 graphs of 10 nodes on nonlinear synthetic datasets. The hyperparameter λ defined in Equation 2 refers to the graph sparsity.

Figure 8: Performance comparison between NOTEARS-MLP and ReScore on ER4 graphs of 20 nodes on nonlinear synthetic datasets. The hyperparameter λ defined in Equation 2 refers to the graph sparsity.

Results for ER graphs of 10 nodes on linear and nonlinear synthetic datasets.

Results on heterogeneous data.

The performance comparison on Sachs dataset.

Hyperparameter search spaces for each algorithm.

Performance comparison for removing samples in different groups

SHD for p_corrupt percentage noise samples.

Motivations. A basic assumption of ReScore is that no pure-noise outliers are involved in the training process. Otherwise, the DAG learner might get overwhelmed by arbitrarily up-weighting less well-fitted samples, in this case pure noise data. The good news is that the cut-off threshold constraint w ∈ C(τ) = {w : 0 < τ/n ≤ w_1, . . . , w_n ≤ 1/(nτ), Σ_{i=1}^n w_i = 1} prevents overexploitation of pure noise samples, which further strengthens ReScore's ability to withstand outliers.

In terms of time complexity, we report the training time of each baseline and its ReScore counterpart on Sachs. Compared with the backbone methods, ReScore adds very little computing cost to training. More experimental results on both the linear and nonlinear synthetic data are reported in Figures 6-8 and Tables 8-11. The error bars depict the standard deviation across datasets over ten trials. The red and blue percentages refer to the increase and decrease, respectively, of ReScore relative to the original score-based causal discovery methods in each metric. The best-performing methods per task are in bold.

Training cost on Sachs (seconds per iteration/in total).

Results for ER graphs of 20 nodes on linear and nonlinear synthetic datasets.

Results for ER graphs of 50 nodes on linear and nonlinear synthetic datasets.

Results for SF graphs of 10 nodes on linear and nonlinear synthetic datasets.

Results for SF graphs of 20 nodes on linear and nonlinear synthetic datasets.

Results for SF graphs of 50 nodes on linear and nonlinear synthetic datasets.

ACKNOWLEDGMENTS

This research is supported by the Sea-NExT Joint Lab, the National Natural Science Foundation of China (9227010114), and CCCD Key Lab of the Ministry of Culture and Tourism.

