ON LOW RANK DIRECTED ACYCLIC GRAPHS AND CAUSAL STRUCTURE LEARNING

Abstract

Despite several important advances in recent years, learning causal structures represented by directed acyclic graphs (DAGs) remains a challenging task in high dimensional settings when the graphs to be learned are not sparse. In this paper, we propose to exploit a low rank assumption regarding the (weighted) adjacency matrix of a DAG causal model to mitigate this problem. We demonstrate how to adapt existing methods for causal structure learning to take advantage of this assumption and establish several useful results relating interpretable graphical conditions to the low rank assumption. In particular, we show that the maximum rank is highly related to hubs, suggesting that scale-free networks which are frequently encountered in real applications tend to be low rank. We also provide empirical evidence for the utility of our low rank adaptations, especially on relatively large and dense graphs. Not only do they outperform existing algorithms when the low rank condition is satisfied, the performance is also competitive even though the rank of the underlying DAG may not be as low as is assumed.

1. INTRODUCTION

An important goal in many sciences is to discover the underlying causal structures in various domains, both for explaining and understanding phenomena and for predicting the effects of interventions (Pearl, 2009). Due to the relative abundance of passively observed data as opposed to experimental data, how to learn causal structures from purely observational data has been vigorously investigated (Peters et al., 2017; Spirtes et al., 2000). In this context, causal structures are usually represented by directed acyclic graphs (DAGs) over a set of random variables. For this task, existing methods can be roughly categorized into two classes: constraint-based and score-based. The former use statistical tests to extract from data a number of constraints in the form of conditional (in)dependence and seek to identify the class of causal structures compatible with those constraints (Meek, 1995; Spirtes et al., 2000; Zhang, 2008). The latter employ a score function to evaluate candidate causal structures relative to data and seek to locate the causal structure (or a class of causal structures) with the optimal score. Due to the combinatorial nature of the acyclicity constraint (Chickering, 1996; He et al., 2015), most score-based methods rely on local heuristics to perform the search. A particular example is the greedy equivalence search (GES) algorithm (Chickering, 2002), which can find an optimal solution with infinite data and proper model assumptions.

Recently, Zheng et al. (2018) introduced a smooth acyclicity constraint w.r.t. the graph adjacency matrix, and the task on linear data models was then formulated as a continuous optimization problem with least-squares loss. This change of perspective allows using deep learning techniques to model causal mechanisms and has already given rise to several new algorithms for causal structure learning with non-linear data, e.g., Yu et al. (2019); Ng et al. (2019a; b); Ke et al. (2019); Lachapelle et al. (2020); Zheng et al. (2020), among others.

While these new algorithms represent the current state of the art in many settings, their performance generally degrades when the target DAG becomes large and relatively dense, as seen from the empirical results reported in the cited works and also in this paper. This issue is of course a challenge for other approaches as well. Ramsey et al. (2017) proposed fast GES for impressively large problems, but it works reasonably well only when the large structure is very sparse. The max-min hill-climbing (MMHC) algorithm (Tsamardinos et al., 2006) relies on local learning methods that often do not perform well when the target node has a large neighborhood. How to improve the performance on relatively large and dense DAGs is therefore an important question.

In this work, we study the potential of exploiting a kind of low rank assumption on the DAG structure to help address this problem. The rank of a graph that concerns us is the algebraic rank of its associated weighted adjacency matrix. Similar to the role of a sparsity assumption on graph structures, we treat the low rank assumption as methodological, and it is not restricted to a particular DAG learning method. However, unlike the sparsity assumption, it is much less apparent when DAGs tend to be low rank and how low rank DAGs behave. Thus, besides demonstrating the utility of exploiting a low rank assumption in causal structure learning, another important goal is to improve our understanding of the low rank assumption by relating the rank of a graph to its graphical structure. Such a result also enables us to characterize the rank of a graph from several structural priors and helps to choose rank related hyperparameters for the learning algorithm.
Our contributions are summarized as follows:
• We show how to adapt existing causal structure learning methods to take advantage of the low rank assumption, and provide a strategy to select rank related hyperparameters utilizing the lower and upper bounds on the true rank, if they are available.
• To improve our understanding of low rank DAGs, we establish some lower bounds on the rank of a DAG in terms of simple graphical conditions, which imply necessary conditions for DAGs to be low rank.
• We also show that the maximum possible rank of weighted adjacency matrices associated with a directed graph is highly related to hubs in the graph, which suggests that scale-free networks tend to be low rank. From this result, we derive several graphical conditions to bound the rank of a DAG from above, providing simple sufficient conditions for low rank.
• Empirically, we demonstrate that the low rank adaptations are indeed useful. Not only do they outperform the original algorithms when the low rank condition is satisfied, but their performance is also very competitive even when the true rank is not as low as is assumed.

Related Work

The low rank assumption is frequently adopted in graph-based applications (Smith et al., 2012; Zhou et al., 2013; Yao & Kwok, 2016; Frot et al., 2019), matrix completion and factorization (Recht, 2011; Koltchinskii et al., 2011; Cao et al., 2015; Davenport & Romberg, 2016), network sciences (Hsieh et al., 2012; Huang et al., 2013; Zhang et al., 2017) and so on, but to the best of our knowledge, it has not been used on the DAG structures in the context of learning causal DAGs. We note two works, Barik & Honorio (2019) and Tichavský & Vomlel (2018), that assume low rank conditional probability tables in learning Bayesian networks, which is a different setting from ours. Also related are existing works that studied the rank of real weighted matrices described by a given simple directed/undirected graph. However, most works only considered the zero-nonzero pattern of off-diagonal entries (see, e.g., Fallat & Hogben (2007); Hogben (2010); Mitchell et al. (2010)), whereas we also take into account the diagonal entries. This difference is crucial: if one only considers the off-diagonal entries, then the maximum rank over all possible weighted matrices is trivial and is always equal to the number of vertices. Consequently, many works focus on the minimum rank of a given graph, but characterizing the minimum rank exactly remains open, except for some special graph structures like trees (Hogben, 2010). Apart from these works, Edmonds (1967) studied algebraically the maximum rank for matrices with a common zero-nonzero pattern. In Section 4, we use this result to relate the maximum possible rank to a more interpretable graphical condition, which further implies several structural conditions of DAGs that may be easier to obtain in practice.

2.1. GRAPH TERMINOLOGY

A graph G is defined as a pair (V, E), where $V = \{X_1, X_2, \dots, X_d\}$ is the vertex set and $E \subset V^2$ denotes the edge set. We are particularly interested in directed (acyclic) graphs in the context of causal structure learning. For any $S \subset V$, we use pa(S, G), ch(S, G), and adj(S, G) to denote the union of all parents, children, and adjacent vertices of the nodes of S in G, respectively. A graph is called weighted if every edge in the graph is associated with a non-zero value. We will work with weighted graphs and treat unweighted graphs as a special case where the edge weights are set to 1. Weighted graphs can be treated algebraically via weighted adjacency matrices. Specifically, the weighted adjacency matrix of a weighted graph G is a matrix $W \in \mathbb{R}^{d \times d}$, where W(i, j) is the weight of edge $X_i \to X_j$ and $W(i, j) \neq 0$ if and only if $X_i \to X_j$ exists in G. The binary adjacency matrix $A \in \{0, 1\}^{d \times d}$ is such that A(i, j) = 1 if $X_i \to X_j$ in G and A(i, j) = 0 otherwise. The rank of a weighted graph is defined as the rank of the associated weighted adjacency matrix.
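To make these definitions concrete, the following minimal sketch builds the weighted and binary adjacency matrices of a small graph and computes their ranks exactly over the rationals. The 4-node graph, its weights, and the helper names (`weighted_adjacency`, `binary_adjacency`, `matrix_rank`) are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction

def weighted_adjacency(d, weighted_edges):
    """Build W where W[i][j] is the weight of edge i -> j (0 means no edge)."""
    W = [[Fraction(0)] * d for _ in range(d)]
    for i, j, w in weighted_edges:
        if w == 0:
            raise ValueError("edge weights must be non-zero")
        W[i][j] = Fraction(w)
    return W

def binary_adjacency(W):
    """A[i][j] = 1 exactly when W[i][j] is non-zero."""
    return [[1 if w != 0 else 0 for w in row] for row in W]

def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            if factor != 0:
                m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank

# Hypothetical DAG (0-based indices): X1 -> X3, X2 -> X3, X1 -> X4, X2 -> X4.
edges = [(0, 2, 1), (1, 2, 1), (0, 3, 1), (1, 3, 2)]
W = weighted_adjacency(4, edges)
A = binary_adjacency(W)
rank_W = matrix_rank(W)  # rank of the weighted graph
rank_A = matrix_rank(A)  # rank of the unweighted special case
```

In this toy example the two ranks differ, which illustrates that the rank of a weighted graph depends on the edge weights and not only on the edge pattern.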

2.2. CAUSAL STRUCTURE LEARNING AND RECENT GRADIENT-BASED METHODS

A commonly used model in causal structure learning is the structural equation model (SEM), which describes the data generating procedure. In a slight abuse of notation, we also use the $X_i$'s to denote the random variables associated with the nodes in a graph G. Assuming that G is a DAG, the SEM is given by $X_i = f_i(\mathrm{pa}(X_i, G), \epsilon_i)$, $i = 1, 2, \dots, d$, where $f_i$ is a deterministic function and the $\epsilon_i$'s are jointly independent noises. The SEM induces a marginal distribution P(X) over $X = [X_1, X_2, \dots, X_d]^T$, and G and P(X) are said to form a causal Bayesian network (Pearl, 2009; Spirtes et al., 2000). The problem of causal structure learning is to infer the underlying causal DAG G based on the marginal distribution P(X), or more practically, an empirical version consisting of a number of i.i.d. observations from P(X).

We next briefly review recently developed gradient-based methods that rely on a smooth characterization of acyclicity of directed graphs. These methods aim to find a DAG that optimizes a score function and can be categorized into two classes. The first class of methods explicitly associates the target causal model with a weighted adjacency matrix W and then estimates W by solving optimization problems of the following form:

$$\min_{W, \phi} \; \mathbb{E}_{X \sim P(X)} \, S\big(X, h(X; W, \phi)\big), \quad \text{subject to } \operatorname{trace}\big(e^{W \circ W}\big) - d = 0, \qquad (1)$$

where $h : \mathbb{R}^d \to \mathbb{R}^d$ is a model function parameterized by W (and possibly other parameters $\phi$) that aims to reconstruct X, $S(\cdot, \cdot)$ denotes a score function between the true and reconstructed variables, $\circ$ denotes the element-wise product, and $e^M$ is the matrix exponential of a square matrix M. The constraint was proposed by Zheng et al. (2018); it is smooth and holds if and only if W indicates a DAG.
Methods in this class include: NOTEARS (Zheng et al., 2018), which targets linear models, with $h(X; W, \phi) = W^T X$ and $S(\cdot, \cdot)$ being the Frobenius norm or equivalently the least-squares loss; and DAG-GNN (Yu et al., 2019) and the graph autoencoder approach (Ng et al., 2019b), where neural networks are used for the function h with $\phi$ being the weights of the neural networks, and the score function can be chosen as the evidence lower bound (Kingma & Welling, 2013). A sparsity inducing term may be further added when the causal graph is assumed to be sparse. These objectives are equivalent to or are variants of some well studied score functions like the penalized maximum likelihood (Chickering, 2002; Van de Geer et al., 2013; Loh & Bühlmann, 2014).

The second class uses certain functions, with parameter $\theta$, to construct a weighted adjacency matrix $W(\theta)$ (or a binary one $A(\theta)$) to represent the causal structure. These methods can be summarized as

$$\min_{\theta, \phi} \; \mathbb{E}_{X \sim P(X)} \, S\big(X, h(X; W(\theta), \phi)\big), \quad \text{subject to } \operatorname{trace}\big(e^{W(\theta) \circ W(\theta)}\big) - d = 0.$$

For example, GraN-DAG (Lachapelle et al., 2020) and NOTEARS-MLP (Zheng et al., 2020) respectively use neural network path products and partial derivatives between variables to construct $W(\theta)$. The binary matrix $A(\theta)$ can be obtained by sampling according to some distributions with learnable parameters, as used by Kalainathan et al. (2018).

Before ending this section, we remark that while the gradient-based methods intend to learn a causal DAG, the learned DAG may not be identical to the underlying one for general SEMs due to Markov equivalence (Spirtes et al., 2000; Peters et al., 2017). For such cases, one may convert the obtained DAG to its corresponding Completed Partially Directed Acyclic Graph (CPDAG) as the estimate. Nevertheless, if the SEM is identifiable and a proper score function is used, then the exact solution to the optimization problem is consistent, i.e., it coincides with the true graph with probability 1; see, e.g., Shimizu et al. (2006); Peters & Bühlmann (2013); Peters et al. (2014); Zhang & Hyvärinen (2009). For further details and other technical issues like parameter optimization of the gradient-based methods, we refer the reader to the cited works and references therein.
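As a small illustration of the acyclicity constraint reviewed in this section, the sketch below evaluates $\operatorname{trace}(e^{W \circ W}) - d$ for a hypothetical 3-node DAG and for the same graph with one extra edge that closes a cycle. The matrix exponential is approximated by a truncated Taylor series, which is an assumption made here to keep the example dependency-free; it is adequate only for small matrices with moderate entries and is not how the cited implementations compute it.

```python
def hadamard_square(W):
    """Element-wise product W ∘ W."""
    return [[w * w for w in row] for row in W]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace_expm(M, terms=30):
    """trace(e^M) via the truncated Taylor series sum_k trace(M^k) / k!."""
    n = len(M)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # M^0
    total = float(n)  # trace of the identity
    factorial = 1.0
    for k in range(1, terms):
        power = matmul(power, M)
        factorial *= k
        total += sum(power[i][i] for i in range(n)) / factorial
    return total

def acyclicity(W):
    """h(W) = trace(exp(W ∘ W)) - d, which is zero iff W has no directed cycle."""
    return trace_expm(hadamard_square(W)) - len(W)

# Hypothetical weights: X1 -> X2, X1 -> X3, X2 -> X3 (a DAG).
W_dag = [[0.0, 1.2, -0.7],
         [0.0, 0.0, 0.9],
         [0.0, 0.0, 0.0]]
# Adding X3 -> X1 closes the directed cycle X1 -> X2 -> X3 -> X1.
W_cyc = [row[:] for row in W_dag]
W_cyc[2][0] = 0.5

h_dag = acyclicity(W_dag)
h_cyc = acyclicity(W_cyc)
```

The value is zero for the DAG because every power of a nilpotent matrix has zero trace, and strictly positive once a cycle is present.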

3. EXPLOITING LOW RANK ASSUMPTION IN CAUSAL STRUCTURE LEARNING

This section shows how to adapt existing gradient-based methods to take advantage of the low rank assumption, by providing a way for each class to utilize this assumption using techniques from the matrix completion literature. We remark that our adaptations with the low rank assumption are not restricted to a particular learning algorithm; other DAG learning methods may potentially combine one of the proposed modifications for learning low rank causal graphs, too.

Matrix Factorization. Since the weighted adjacency matrix W is explicitly optimized in the first class of methods, we can apply the matrix factorization technique. Specifically, with an estimate r for the graph rank, we can factorize W as $W = U V^T$ with $U, V \in \mathbb{R}^{d \times r}$. Problem (1) then amounts to optimizing U and V to minimize the score function under the DAG constraint, and it has the same solution W (obtained from the product $U V^T$) as the original problem if r is greater than or equal to the true rank. Furthermore, if $r \ll d$, we have a much reduced number of parameters to optimize.

Nuclear Norm. For the second class of methods, the adjacency matrix $W(\theta)$ is not an explicit parameter to be optimized. In such a case, we can adopt a commonly used technique and add a nuclear norm term $\lambda \|W(\theta)\|_*$, with $\lambda > 0$ being a tuning parameter, to the objective to induce low-rankness.

The optimization procedures in these recent structure learning methods can directly incorporate the two adaptations as they are all gradient-based, though some extra care needs to be taken. Appendix C provides a detailed description of the optimization procedure and our implementation. The second approach is also feasible for the first class of methods, but we find that it does not work as well as the matrix factorization approach, possibly due to the singular value decomposition needed to compute the (sub-)gradient w.r.t. W at each optimization step.

An astute reader may have noticed that we assumed a proper rank estimate r or a proper penalty parameter $\lambda$. Yet knowing exactly the rank of the graph to be learned can be difficult in practice. Similar to the sparsity assumption, one may determine the hyperparameters r and $\lambda$ with the help of a validation dataset (or by cross-validation if the observed dataset is not sufficiently large). Alternatively, we can try different choices of the hyperparameters and then apply a traditional score-based method where the search space is restricted to the resulting DAGs. However, since we are more concerned with relatively large and dense problems, there may be too many possible ranks to choose from. As such, a lower bound $r_l$ and an upper bound $r_u$ on the graph rank would be beneficial: we need only consider ranks in $[r_l, r_u]$ in the matrix factorization method, while the bounds are still useful for the nuclear norm approach by providing qualitative information: the lower the upper bound, the higher the tuning parameter $\lambda$ should be chosen. Moreover, a lower bound can also be used to assess the low rank assumption, i.e., if the lower bound is high, then the low rank assumption is likely to fail.
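The effect of the factorization $W = U V^T$ on the rank and on the number of free parameters can be sketched as follows. The dimensions, the random integer factors, and the helper names are assumptions made for illustration; the sketch does not enforce acyclicity and is not the optimization procedure of Appendix C.

```python
import random
from fractions import Fraction

def low_rank_product(U, V):
    """Return W = U V^T for factors U and V of shape d x r."""
    return [[sum(u * v for u, v in zip(U[i], V[j])) for j in range(len(V))]
            for i in range(len(U))]

def matrix_rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            if factor != 0:
                m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank

d, r = 8, 2                      # hypothetical sizes with r much smaller than d
rng = random.Random(0)
U = [[rng.randint(-3, 3) for _ in range(r)] for _ in range(d)]
V = [[rng.randint(-3, 3) for _ in range(r)] for _ in range(d)]
W = low_rank_product(U, V)

rank_W = matrix_rank(W)          # can never exceed r by construction
params_full = d * d              # entries of W when optimized directly
params_factored = 2 * d * r      # entries of U and V
```

With these sizes the factorized model has 32 parameters instead of 64; the two counts coincide when r = d/2.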

4. GRAPHICAL BOUNDS ON RANKS

Obtaining exact algebraic information of a DAG such as its rank and eigenvalues may be infeasible in practice, because it may require a full knowledge of the graph to be learned. On the other hand, structural information, such as graph connectivity, distributions of in-degrees and out-degrees, and an estimate of number of hubs, is sometimes more accessible. As such, this section is devoted to relating the rank of a graph to more easily interpretable graphical conditions, for the sake of a better understanding of what kinds of DAGs tend to satisfy the low rank assumption and for lower and upper bounds on the graph rank from certain structural priors.

4.1. PROBLEM SETTING

Consider a DAG G = (V, E) with weighted adjacency matrix W and binary adjacency matrix A. We aim to seek upper and lower bounds on rank(W) using only the graphical structure. Specifically, we focus on the weighted adjacency matrices with the same binary adjacency matrix A, i.e., $\mathcal{W}_A = \{W \in \mathbb{R}^{d \times d} ;\ \operatorname{sign}(|W|) = A\}$, where $\operatorname{sign}(\cdot)$ and $|\cdot|$ are the point-wise sign and absolute value functions, respectively. Notice that there exist a trivial upper bound d − 1 and a trivial lower bound 0 for any DAG, but they are generally too loose for our purpose. In the following, we investigate the maximum rank $\max\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\}$ and the minimum rank $\min\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\}$ to find tighter upper and lower bounds for any $W \in \mathcal{W}_A$. Before introducing two useful graph concepts, we comment that low rank DAGs are not necessarily sparse and vice versa; see a discussion in Appendix A.

Definition 1 (Height). Given a DAG G = (V, E) and a vertex $X_i \in V$, the height of $X_i$, denoted by $l(X_i)$, is defined as the length of the longest directed path starting from $X_i$. The height of G, denoted by $l(G)$, is the length of the longest path in G.

[Figure 1: an example DAG G whose vertices, grouped by height, are $V_0 = \{X_1, X_2, X_3, X_4\}$, $V_1 = \{X_5, X_6, X_7\}$, $V_2 = \{X_8, X_9\}$, and $V_3 = \{X_{10}, X_{11}, X_{12}\}$.]

Definition 2 (Head-tail vertex cover). Let G = (V, E) be a directed graph and H, T be two subsets of V. (H, T) is called a head-tail vertex cover of G if every edge in G has its head vertex in H or its tail vertex in T. The size of a head-tail vertex cover (H, T) is defined as |H| + |T|.

As an example, Figure 1c shows a head-tail vertex cover of G in Figure 1a, where $H = \{X_2, X_4, X_8\}$ (red nodes) and $T = \{X_8, X_9, X_{10}\}$ (blue nodes). The size of this vertex cover is 6.
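The two definitions can be checked mechanically. The sketch below computes vertex heights and verifies a head-tail vertex cover on a hypothetical 5-node DAG with one hub; the edge list of Figure 1 is not reproduced in the text, so this graph and the function names are assumptions for illustration only.

```python
def heights(d, edges):
    """Height of each vertex: length of the longest directed path starting there.

    Assumes the edge list describes a DAG over vertices 0, ..., d - 1.
    """
    children = {u: [] for u in range(d)}
    for u, v in edges:
        children[u].append(v)
    memo = {}

    def height(u):
        if u not in memo:
            # A vertex without children has height 0.
            memo[u] = 1 + max((height(v) for v in children[u]), default=-1)
        return memo[u]

    return [height(u) for u in range(d)]

def is_head_tail_cover(edges, H, T):
    """Each edge u -> v needs its head v in H or its tail u in T."""
    return all(v in H or u in T for u, v in edges)

# Hypothetical DAG in which vertex 2 is a hub: 0 -> 2, 1 -> 2, 2 -> 3, 2 -> 4.
edges = [(0, 2), (1, 2), (2, 3), (2, 4)]
h = heights(5, edges)
graph_height = max(h)

cover_ok = is_head_tail_cover(edges, H={2}, T={2})     # size |H| + |T| = 2
cover_bad = is_head_tail_cover(edges, H={2}, T=set())  # misses 2 -> 3 and 2 -> 4
```

Here a single hub placed in both H and T covers all four edges, so a cover of size 2 exists even though the graph has four edges.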

4.2. LOWER BOUNDS

We first study lower bounds on the rank of a weighted DAG. Define $V_{-1} = \emptyset$ and $V_s = \{X_i ;\ l(X_i) = s\}$ for $s = 0, 1, \dots, l(G)$. Denote by $G_{s,s-1}$ the induced subgraph of G over $V_s \cup V_{s-1}$. Let $C(G_{s,s-1})$ be the set of non-singleton connected components of $G_{s,s-1}$ and $|C(G_{s,s-1})|$ its cardinality. We have the following lower bounds.

Theorem 1. Let G be a DAG with binary adjacency matrix A. Then

$$\min\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\} \;\geq\; \sum_{s=1}^{l(G)} |C(G_{s,s-1})| \;\geq\; l(G).$$

All the proofs in this paper are provided in Appendix B. Theorem 1 shows that rank(W) is greater than or equal to the sum of the numbers of non-singleton connected components in each $G_{s,s-1}$. As each $G_{s,s-1}$ has at least one non-singleton connected component, we obtain the second inequality. In other words, the rank of a weighted DAG is at least as high as the length of the longest directed path. As an example, consider the graph shown in Figure 1. One can verify that $\min\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\} = 6$, $|C(G_{1,0})| = 2$, $|C(G_{2,1})| = 1$, $|C(G_{3,2})| = 1$, and $l(G) = 3$. Thus, we have $\min\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\} = 6 > 2 + 1 + 1 = 4 > 3$. We remark that the bounds in Theorem 1 may be loose in some cases. Characterizing the minimum rank exactly is an ongoing research problem (Hogben, 2010).
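The lower bound of Theorem 1 can be computed from the edge list alone. The following sketch groups vertices by height, forms each induced subgraph over two consecutive height levels, and counts its non-singleton connected components (taken here as weakly connected components, which is our reading of the definition). The example graphs and function names are hypothetical.

```python
def heights(d, edges):
    """Length of the longest directed path starting from each vertex (DAG assumed)."""
    children = {u: [] for u in range(d)}
    for u, v in edges:
        children[u].append(v)
    memo = {}

    def height(u):
        if u not in memo:
            memo[u] = 1 + max((height(v) for v in children[u]), default=-1)
        return memo[u]

    return [height(u) for u in range(d)]

def count_nonsingleton_components(vertices, edges):
    """Number of connected components with at least two vertices (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = {}
    for v in vertices:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size > 1)

def theorem1_lower_bound(d, edges):
    """Return (sum over s of |C(G_{s,s-1})|, l(G)) for a DAG given by its edges."""
    h = heights(d, edges)
    total = 0
    for s in range(1, max(h) + 1):
        layer = {i for i in range(d) if h[i] in (s, s - 1)}
        sub_edges = [(u, v) for u, v in edges if u in layer and v in layer]
        total += count_nonsingleton_components(layer, sub_edges)
    return total, max(h)

# Hypothetical examples: a directed path, and two disjoint edges.
bound_path = theorem1_lower_bound(4, [(0, 1), (1, 2), (2, 3)])
bound_pairs = theorem1_lower_bound(4, [(0, 2), (1, 3)])
```

For the directed path both quantities equal 3, whereas for two disjoint edges the component count (2) is strictly larger than the height (1), showing that the first bound can be tighter than the second.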

4.3. UPPER BOUNDS

We now turn to the more important issue for our purpose, namely upper bounds on rank(W). The next theorem shows that $\max\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\}$ can be characterized exactly in graphical terms.

Theorem 2. Let G be a directed graph with binary adjacency matrix A. Then $\max\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\}$ is equal to the minimum size of a head-tail vertex cover of G, that is,

$$\max\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\} = \min\{|H| + |T| ;\ (H, T) \text{ is a head-tail vertex cover of } G\}.$$

We comment that Theorem 2 holds for all directed graphs (not only DAGs), which may be of independent interest to other applications. A head-tail vertex cover of minimum size is called a minimum head-tail vertex cover, which in general is not unique. For a head-tail vertex cover (H, T), the vertices in H cover all the edges pointing towards these vertices while the vertices in T cover the edges pointing away. A head-tail cover of a relatively small size then indicates the presence of hubs, that is, vertices with relatively high in-degrees or out-degrees. Therefore, Theorem 2 suggests that the maximum rank of a weighted DAG is highly related to the presence of hubs: a DAG with many hubs tends to have low rank. Intuitively, a hub of high in-degree (out-degree) is a common effect (cause) of a number of direct causes (effect variables), comprising many V-structures (inverted V-structures). For example, in Figure 1a, $X_8$ is a hub of V-structures and $X_9$ is a hub of inverted V-structures.

Such features are fairly common in real graph structures. Appendix A presents a real network, called pathfinder, which describes the causal relations among 109 variables (Heckerman et al., 1992), with the center node being the parent of a large number of other nodes. The famous scale-free (SF) graphs also tend to have hubs. A scale-free graph is one whose distribution of degree k follows a power law: $P(k) \sim k^{-\gamma}$, where $\gamma$ is the power parameter, typically within [2, 3], and P(k) denotes the fraction of nodes with degree k (Nikolova & Aluru, 2012).
It is observed that many real-world networks are scale-free, and some of them, such as gene regulatory networks, protein networks, and financial system networks, may be viewed as causal networks (Guelzim et al., 2002; Barabasi & Oltvai, 2004; Hartemink, 2005; Eguíluz et al., 2005; Gao & en Ren, 2013; Ramsey et al., 2017). In particular, Barabasi & Oltvai (2004) claimed that most protein networks, some of which are directed and acyclic due to irreversible reactions, are the results of growth processes and preferential attachments, probably due to gene duplication.

Figure 2: 100-node graphs.

Empirically, the ranks of scale-free graphs are relatively low, especially in comparison to Erdős-Rényi (ER) random graphs (Mihail & Papadimitriou, 2002). Figure 2 provides a simulated example where $\gamma$ is chosen from {2, 3} and each reported value is over 100 random runs. As the graph becomes denser, the graph rank also increases. However, for scale-free graphs with a relatively large $\gamma$, the increase of their ranks is much slower than that of Erdős-Rényi graphs; indeed, their ranks tend to stay fairly low even when the graph degree is large.

Theorem 2 can also be used to generate a low rank graph, or more precisely, a random DAG with a given rank r and a properly specified graph degree. Here we briefly describe the idea and leave the detailed algorithm to Appendix C.1: first, generate a graph with r edges and rank r; next, sample a random edge without replacement and add it to the graph if doing so does not increase the size of the minimum head-tail vertex cover; repeat the previous step until the pre-specified degree is reached or no edge can be added to the graph; finally, assign the edge weights randomly according to a continuous distribution, so that the weighted graph has rank r with high probability.

The next two theorems report some looser but simpler upper bounds on rank(W). Theorem 3. 
Let G be a DAG with binary adjacency matrix A, and denote the set of vertices with at least one parent by $V_{ch}$ and those with at least one child by $V_{pa}$. Then we have

$$\max\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\} \leq \begin{cases} \sum_{s=1}^{l(G)} \min\big(|V_s|, |\mathrm{ch}(V_s)|\big) \leq |V_{pa}|, \\ \sum_{s=0}^{l(G)-1} \min\big(|V_s|, |\mathrm{pa}(V_s)|\big) \leq |V_{ch}|, \\ |V| - \max\{|V_s| ;\ 0 \leq s \leq l(G)\}. \end{cases} \qquad (4)$$

Since $V_{ch}$ and $V_{pa}$ are the non-root and the non-leaf vertices, respectively, the first two inequalities of (4) indicate that the maximum rank is bounded from above by the number of non-root vertices and also by the number of non-leaf vertices. The last inequality of (4) is a generalization of the first two, which implies that the rank is likely to be low if most vertices have the same height.

Theorem 4. Let G be a DAG with binary adjacency matrix A. Denote by skeleton(A) and moral(A) the binary adjacency matrices of the skeleton and moral graph of G, respectively. Then we have

$$\max\{\operatorname{rank}(W) ;\ W \in \mathcal{W}_A\} \leq \max\{\operatorname{rank}(W) ;\ \operatorname{sign}(|W|) = \mathrm{skeleton}(A)\} \leq \max\{\operatorname{rank}(W) ;\ \operatorname{sign}(|W|) = \mathrm{moral}(A)\}.$$

The skeleton of a DAG is the undirected graph obtained by removing all the arrowheads, and the moral graph is the undirected graph where two vertices are adjacent if they are adjacent or if they share a common child in the DAG. This result is useful when the skeleton or the moral graph can be accurately estimated and the corresponding rank is low. In practice, we may use all available structural priors to obtain upper bounds on the underlying rank and choose the lowest one as our estimate.
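The maximum rank in Theorem 2 can be evaluated without enumerating covers. A head-tail vertex cover is a vertex cover of the bipartite graph that has one "tail" copy and one "head" copy of every vertex and one bipartite edge per directed edge, so by König's theorem its minimum size equals the size of a maximum matching in that bipartite graph. The sketch below uses this standard equivalence with a simple augmenting-path matching; it is our illustration, not the algorithm of Appendix C.1, and the example graphs are hypothetical. The simple bounds $|V_{pa}|$ and $|V_{ch}|$ from Theorem 3 are computed alongside for comparison.

```python
def max_rank(d, edges):
    """Minimum head-tail vertex cover size, computed as a maximum bipartite matching."""
    out = {u: [] for u in range(d)}
    for u, v in edges:
        out[u].append(v)
    match_head = {}  # head vertex -> tail vertex currently matched to it

    def augment(u, seen):
        # Try to match tail u, re-routing earlier matches along an augmenting path.
        for v in out[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_head or augment(match_head[v], seen):
                match_head[v] = u
                return True
        return False

    return sum(1 for u in range(d) if augment(u, set()))

def simple_upper_bounds(edges):
    """(|V_pa|, |V_ch|): vertices with at least one child / at least one parent."""
    n_pa = len({u for u, _ in edges})
    n_ch = len({v for _, v in edges})
    return n_pa, n_ch

# Hypothetical graphs: a star with hub 0, a DAG with a central hub 2, and a path.
star = [(0, j) for j in range(1, 6)]
hub = [(0, 2), (1, 2), (2, 3), (2, 4)]
path = [(0, 1), (1, 2), (2, 3)]

rank_star = max_rank(6, star)
rank_hub = max_rank(5, hub)
rank_path = max_rank(4, path)
bounds_hub = simple_upper_bounds(hub)
```

The star has maximum rank 1 regardless of how many children the hub has, while the sparse directed path reaches the largest rank possible for a 4-node DAG.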

5. EXPERIMENTS

This section reports empirical results of the low rank adaptations of existing methods, compared with their original versions. We choose NOTEARS (Zheng et al., 2018) for linear SEMs by adopting the matrix factorization approach, denoted as NOTEARS-low-rank, and use the nuclear norm approach in combination with GraN-DAG (Lachapelle et al., 2020) for a non-linear data model. Again we remark that the two methods are only demonstrations of the utility of the low rank assumption, which can potentially be combined with other methods as well. For more information, we also include several benchmark methods: fast GES (Ramsey et al., 2017), PC (Spirtes et al., 2000), MMHC (Tsamardinos et al., 2006), and ICA-LiNGAM (Shimizu et al., 2006), which is specifically designed for non-Gaussian noises, for linear SEMs; and DAG-GNN (Yu et al., 2019), NOTEARS-MLP (Zheng et al., 2020), and CAM (Bühlmann et al., 2014) for the non-linear case. Their implementations are described in Appendix C.

We consider randomly sampled DAGs with specified ranks (the generating procedure was described in Section 4.3 and is given as Algorithm 1 in Appendix C.1), scale-free graphs, and a real network structure. For linear SEMs, the weights are uniformly sampled from [-2, -0.5] ∪ [0.5, 2] and the noises are either standard Gaussian or standard exponential. For non-linear SEMs, we use an additive Gaussian noise model with functions sampled from Gaussian processes with an RBF kernel of bandwidth one. These data models are known to be identifiable (Shimizu et al., 2006; Peters & Bühlmann, 2013; Peters et al., 2014). From each SEM, we then generate n = 3,000 observations. We repeat each experiment setting ten times over different seeds. Detailed information about the setup can be found in Appendix C.3. Below we mainly report the structural Hamming distance (SHD), which takes into account both false positives and false negatives; a smaller SHD indicates a better estimate.
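The linear data model described above can be sketched as follows: edge weights are drawn uniformly from [-2, -0.5] ∪ [0.5, 2] and each variable is a weighted sum of its parents plus standard Gaussian or standard exponential noise. The 3-node graph, the seed, and the function names are assumptions for illustration, and the sketch assumes the variables are already indexed in a topological order; the actual setup is the one in Appendix C.3.

```python
import random

def sample_weight(rng):
    """Draw a weight uniformly from [-2, -0.5] ∪ [0.5, 2]."""
    magnitude = rng.uniform(0.5, 2.0)
    return magnitude if rng.random() < 0.5 else -magnitude

def sample_linear_sem(W, n, rng, noise="gaussian"):
    """Draw n observations of X_j = sum_i W[i][j] * X_i + eps_j.

    Assumes W[i][j] is non-zero only for i < j (topologically ordered DAG).
    """
    d = len(W)
    data = []
    for _ in range(n):
        x = [0.0] * d
        for j in range(d):
            if noise == "gaussian":
                eps = rng.gauss(0.0, 1.0)    # standard Gaussian noise
            else:
                eps = rng.expovariate(1.0)   # standard exponential noise
            x[j] = sum(W[i][j] * x[i] for i in range(j)) + eps
        data.append(x)
    return data

rng = random.Random(0)
d = 3
# Hypothetical DAG: X1 -> X2, X1 -> X3, X2 -> X3.
W = [[0.0] * d for _ in range(d)]
for i, j in [(0, 1), (0, 2), (1, 2)]:
    W[i][j] = sample_weight(rng)

data = sample_linear_sem(W, n=3000, rng=rng, noise="gaussian")
```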

5.1. LINEAR SEMS WITH RANK-SPECIFIED GRAPHS

We first consider linear SEMs on rank-specified graphs, with number of nodes d ∈ {100, 300}, rank r = 0.1d, and average degree k ∈ {2, 4, 6, 8}. The true rank is assumed to be known and is used as the rank parameter r in NOTEARS-low-rank. For a better visualization, Figure 3 only reports the average SHDs, while the true positive rate, false discovery rate, and running time are left to Appendix D. We also show the results after using the interquartile range rule to remove outlier SHDs. We observe that the low rank assumption can greatly improve the performance of NOTEARS, reducing the SHDs by at least a half. For this data model, fast GES has much higher SHDs (see also Appendix D). PC is too slow (for example, it did not finish in 16 hours for a dataset with 100 nodes and degree 6), because some nodes may have a high in-degree. For the same reason, the skeleton may not be well estimated by MMHC; its performance is slightly worse than that of fast GES and is not reported.

For more information regarding the role of sparsity, we include NOTEARS with an $\ell_1$ penalty, named NOTEARS-L1. Here the $\ell_1$ penalty weight is chosen from {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}. Instead of relying on an additional validation dataset, we treat NOTEARS-L1 favorably by picking the lowest SHD obtained from different weights for each dataset. As seen from Figure 3a, NOTEARS-L1 is slightly better than NOTEARS when the average degree is 2, but is largely outperformed with relatively dense graphs. This observation was also reported in Zheng et al. (2018). We conjecture that this is because our experiments consider relatively sufficient data and dense graphs. Moreover, the thresholding procedure controls false discoveries and may have a similar effect to the $\ell_1$ penalty.

Appendix D.1 studies graphs with higher ranks, where it is observed that the advantage of NOTEARS-low-rank over NOTEARS decreases when the rank of the underlying DAG increases. Nevertheless, NOTEARS-low-rank is still competitive when the true rank is d/2 and the factorized matrix has the same number of parameters as NOTEARS. We also conduct an empirical analysis with different sample sizes in Appendix D.2, which shows that NOTEARS-low-rank performs reasonably well when the sample size is small and tends to perform better with a larger number of samples. Due to the space limit, further details are deferred to the appendix.
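For readers unfamiliar with the metric, the sketch below shows one common way to compute the SHD between two directed graphs, preceded by a thresholding step that turns an estimated weight matrix into a binary graph. The convention used here (each missing, extra, or reversed edge counts once) and the threshold value 0.3 are assumptions for illustration; the exact convention and threshold used in the experiments are those of the cited implementations and Appendix C.

```python
def threshold(W, tau):
    """Binary graph keeping the edges whose absolute weight exceeds tau."""
    return [[1 if abs(w) > tau else 0 for w in row] for row in W]

def shd(A_true, A_est):
    """Structural Hamming distance between two binary adjacency matrices.

    Every unordered vertex pair whose edge status differs (missing, extra,
    or reversed edge) contributes one to the distance.
    """
    d = len(A_true)
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            true_pair = (A_true[i][j], A_true[j][i])
            est_pair = (A_est[i][j], A_est[j][i])
            if true_pair != est_pair:
                dist += 1
    return dist

# Hypothetical ground truth: X1 -> X2 -> X3.
A_true = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 0]]
# Hypothetical estimated weights: X2 -> X1 (reversed), X2 -> X3, X1 -> X3 (extra).
W_est = [[0.0, 0.05, 0.9],
         [-1.1, 0.0, 0.8],
         [0.02, 0.0, 0.0]]
A_est = threshold(W_est, tau=0.3)   # tau is a hypothetical value
distance = shd(A_true, A_est)
```

In this toy case the reversed edge and the extra edge each contribute one, and the small spurious weights are removed by thresholding before the comparison.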

5.2. LINEAR SEMS WITH SCALE-FREE GRAPHS

We next consider scale-free graphs with d = 100 nodes, average degree k = 6, and power $\gamma$ = 2.5. For this experiment, the minimum, maximum, and mean ranks of the generated graphs are 14, 24, and 18.7, respectively. Here we choose the rank parameter r from {20, 30, 40} for NOTEARS-low-rank. As seen from Figure 4, NOTEARS-low-rank with rank parameter r = 20 performs the best, even though there are graphs with ranks greater than 20.

Figure 4: Scale-free graphs.

5.3. SENSITIVITY OF RANK PARAMETERS AND VALIDATION

So far we have assumed that the true rank or an accurate estimate is known. In this experiment, we conduct an empirical analysis with different rank parameters for the linear Gaussian data model on rank-specified graphs with 100 nodes, degree 8, and rank 10. We also include the validation-based approach, where 2,000 samples are chosen as the training dataset and the rest as the validation dataset. We use the lower and upper bounds derived in Theorems 1 and 3 to obtain a range of possible rank parameters, assuming that the corresponding structural priors are available. Within this range, we then select 7 evenly distributed rank parameters to be used with NOTEARS-low-rank to learn causal graphs. Finally, we evaluate each learned DAG using the validation dataset and choose the DAG with the best score as our estimate. As seen from Figure 5, NOTEARS-low-rank performs the best when the rank parameter is identical to the true rank, while the rank parameter chosen by validation has almost the same performance. Compared with NOTEARS on the same datasets, the low rank version performs well across a range of rank parameters. Although this validation approach increases the total running time, which depends on the number of candidate rank parameters, we believe that this is acceptable given the gained accuracy and the fact that this strategy has been frequently adopted for tuning hyperparameters in practice.

Figure 5: Different rank parameters.

Figure 6: Non-linear SEMs.
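The validation procedure described above can be sketched as a small grid search: pick evenly spaced rank parameters between the lower and upper bounds, fit one model per candidate, and keep the candidate with the best validation score. The functions `fit` and `validation_score` are hypothetical placeholders supplied by the caller, the bounds 4 and 40 are made-up values, and "lower is better" is an assumption about the score; none of these come from the paper's implementation.

```python
def candidate_ranks(r_lower, r_upper, count=7):
    """Evenly spaced integer rank parameters in [r_lower, r_upper]."""
    if count < 2 or r_upper <= r_lower:
        return [r_lower]
    step = (r_upper - r_lower) / (count - 1)
    # A set removes duplicates that rounding may create for narrow ranges.
    return sorted({round(r_lower + k * step) for k in range(count)})

def select_by_validation(candidates, fit, validation_score):
    """Return the candidate whose fitted model has the lowest validation score.

    fit(r) trains a model with rank parameter r on the training split;
    validation_score(model) evaluates it on the held-out split.
    """
    scored = [(validation_score(fit(r)), r) for r in candidates]
    return min(scored)[1]

# Hypothetical bounds, standing in for values derived from structural priors.
ranks = candidate_ranks(4, 40, count=7)

# Toy stand-ins: the "model" is just the rank, scored by distance to rank 10.
best_rank = select_by_validation(
    ranks,
    fit=lambda r: r,
    validation_score=lambda model: abs(model - 10),
)
```

The running time grows linearly with the number of candidates, which is the cost noted in the text.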

5.4. NON-LINEAR SEMS

For non-linear data models, we pick rank-specified graphs with 50 nodes, rank 5, and average degree k ∈ {2, 4, 6, 8}. To our knowledge, the selected benchmark methods CAM, NOTEARS-MLP, and GraN-DAG are state-of-the-art methods on this data model. As a demonstration of the low rank assumption, we apply the nuclear norm approach to GraN-DAG and choose the penalty weight from {0.3, 0.5, 1.0}. For validation, we use the same splitting ratio as in Section 5.3 and consider more penalty weights, from {0.1, 0.2, 0.3, 0.5, 1, 2, 5}. Similarly, the learned graph that achieves the best score on the validation dataset is chosen as the final estimate.

We also apply the proposed method to the arth150 gene network, which is a DAG containing 107 genes and 150 edges. Its maximum rank is 40. Since the real dataset has only 22 samples, we instead use simulated data from linear Gaussian SEMs. We pick r from {36, 40, 44} and also use validation to select the rank parameter. We apply NOTEARS-L1, where the $\ell_1$ penalty weight is chosen from {0.05, 0.1, 0.2}, and similarly treat this method favorably by picking the lowest SHD for each dataset. The mean and median SHDs are shown in Figure 7. Using Student's t-test, we find that at significance level 0.1, the results obtained with r = 44 and with the validation approach are significantly better than those of NOTEARS. This experiment demonstrates again the utility of the low rank assumption, even when the true rank of the graph is not very low.

6. CONCLUDING REMARKS

This paper studies the potential of the low rank assumption in causal structure learning. Empirically, we show that the low rank adaptations perform noticeably better than existing algorithms when the low rank condition is satisfied, and also deliver competitive performance when the rank is not as low as is assumed. Theoretically, we provide an improved understanding of what kinds of graphs tend to be low rank and a way to obtain bounds on the underlying rank from several structural priors. We treat the present work as our first step toward incorporating low-rankness into causal DAG learning. A future direction is to approximate a high rank DAG with a low rank one (possibly adding an additional DAG that is sparse). While there is a rich literature on low rank approximations of matrices and on combining low-rankness with sparsity, it is non-trivial to determine under what conditions such an approximation is guaranteed to be effective for learning causal DAGs. Another direction is to compare the low rank assumption to other structural or parametric priors affecting model selection through marginal likelihood (Eggeling et al., 2019; Silander et al., 2007). Finally, it is also interesting to investigate whether a low rank DAG model implies any useful behavior in the data.

Minimum rank of the graph in Figure 1. We first show that the minimum rank of the DAG structure in Figure 1 is 6. It is clear that the 6-th to 10-th rows of $A$ are always linearly independent, so it suffices to show that the 11-th row is linearly independent of the 6-th to 10-th rows. To see this, notice that if the 11-th row is a linear combination of the 6-th to 10-th rows, then $A(11, 1)$ would be non-zero, which is a contradiction.

The pathfinder and arth150 networks. Figure 8 visualizes the pathfinder and arth150 networks that are mentioned in Sections 4.3 and 5, respectively. Both networks can be found at http://www.bnlearn.com/bnrepository. As one can see, these two networks contain hubs: the center node in the pathfinder network has a large number of children, while the arth150 network contains many 'small' hubs, each of which has 5 to 10 children. We also notice that nearly all the hubs in the two networks have high out-degrees.

Consider a directed linear graph with $d$ vertices, i.e., $X_1 \to X_2 \to \cdots \to X_d$: it has only $d - 1$ edges, while the rank of its binary adjacency matrix is $d - 1$. According to Theorems 1 and 2, the maximum and minimum ranks of a directed linear graph are equal to its number of edges. Thus, directed linear graphs are sparse but have high ranks. On the other hand, for some non-sparse graphs, we can assign the edge weights so that the resulting graphs have low ranks. A simple example is a fully connected directed balanced bipartite graph, as shown in Figure 9. The definition of bipartite graphs can be found in Appendix B.1. A bipartite graph is called balanced if its two parts contain the same number of vertices. The rank of a fully connected balanced bipartite graph with $d$ vertices is 1 if all the edge weights are the same (e.g., the binary adjacency matrix), but the number of edges is $d^2/4$.

We also notice that there exist some connections between the maximum rank and the graph degree, or more precisely, the total number of edges in the graph, according to Theorem 2. Intuitively, if the graph is dense, then we need more vertices to cover all the edges. Thus, the size of the minimum head-tail vertex cover should be large. Explicitly providing a formula that relates these two graph parameters is an interesting problem, which will be explored in the future.
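The contrast between sparsity and rank drawn above can be checked numerically. The snippet below is a minimal sketch using NumPy; the choice $d = 8$ and the convention that entry $(i, j)$ encodes the edge $X_i \to X_j$ are assumptions made only for this illustration.

```python
import numpy as np

d = 8  # illustrative size; any even d works

# Directed linear graph X_1 -> X_2 -> ... -> X_d with d - 1 edges.
chain = np.zeros((d, d))
for i in range(d - 1):
    chain[i, i + 1] = 1.0

# Fully connected balanced bipartite graph: every vertex in the first half
# points to every vertex in the second half, giving d^2 / 4 edges.
bipartite = np.zeros((d, d))
bipartite[: d // 2, d // 2:] = 1.0

chain_edges = int(chain.sum())
chain_rank = int(np.linalg.matrix_rank(chain))
bipartite_edges = int(bipartite.sum())
bipartite_rank = int(np.linalg.matrix_rank(bipartite))

print("chain:", chain_edges, "edges, rank", chain_rank)
print("bipartite:", bipartite_edges, "edges, rank", bipartite_rank)
```

The chain has $d - 1 = 7$ edges and rank 7, whereas the bipartite graph has $d^2/4 = 16$ edges and rank 1.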

B PROOFS

In this section, we present proofs for the theorems given in the main content.

B.1 PRELIMINARIES

A bipartite graph is a graph whose vertex set $V$ can be partitioned into two disjoint subsets $V_0$ and $V_1$, such that the vertices within each subset are not adjacent to one another. $V_0$ and $V_1$ are called the parts of the graph. A matching of a graph is a subset of its edges where no two of them share a common endpoint. A vertex cover of a graph is a subset of the vertex set where every edge in the graph has at least one endpoint in the subset. The size of a matching (vertex cover) is the number of edges (vertices) in the matching (vertex cover). A maximum matching of a graph is a matching of the largest possible size and a minimum vertex cover is a vertex cover of the smallest possible size. An important result about bipartite graphs is König's theorem (Dénes, 1931), which states that the size of a minimum vertex cover is equal to the size of a maximum matching in a bipartite graph.

Based on the heights of vertices in $V$, we can define a weak ordering among the vertices: $X_i \succ X_j$ if and only if $l(X_i) > l(X_j)$, and $X_i \sim X_j$ if and only if $l(X_i) = l(X_j)$. Given this weak ordering, we can group the vertices by their heights, and the resulting graph shows a hierarchical structure; see Figure 1 in the main text for an example. This hierarchical representation has some simple and useful properties. Let $V_s = \{X_i ; l(X_i) = s\}$, $s = 0, 1, \ldots, l(G)$, and let $V_{-1} = \emptyset$. We have: (1) for any given $s \in \{0, 1, \ldots, l(G)\}$ and two distinct vertices $X_1, X_2 \in V_s$, $X_1$ and $X_2$ are not adjacent, and (2) for any given $s \in \{1, 2, \ldots, l(G)\}$ and $X_i \in V_s$, there is at least one vertex in $V_{s-1}$ which is a child of $X_i$. If we denote the induced subgraph of $G$ over $V_s \cup V_{s-1}$ by $G_{s,s-1}$, then $G_{s,s-1}$ is a bipartite graph with $V_s$ and $V_{s-1}$ as parts, and singletons in $G_{s,s-1}$ (i.e., vertices that are not endpoints of any edge) only appear in $V_{s-1}$. For ease of presentation, we occasionally use index $i$ to represent variable $X_i$ in the following sections.
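The grouping of vertices by height can be illustrated with a short sketch. It assumes that the height $l(X)$ is the length of the longest directed path from $X$ to a vertex without children, which is consistent with properties (1) and (2) above; the toy graph and the function names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


def heights(children: Dict[str, List[str]]) -> Dict[str, int]:
    """Height of each vertex: longest directed path from it to a sink."""

    @lru_cache(maxsize=None)
    def height(v: str) -> int:
        kids = children.get(v, [])
        return 0 if not kids else 1 + max(height(c) for c in kids)

    vertices = set(children) | {c for kids in children.values() for c in kids}
    return {v: height(v) for v in sorted(vertices)}


def layers(children: Dict[str, List[str]]) -> List[List[str]]:
    """Group vertices by height, so that layers(...)[s] plays the role of V_s."""
    h = heights(children)
    grouped: List[List[str]] = [[] for _ in range(max(h.values()) + 1)]
    for v, s in h.items():
        grouped[s].append(v)
    return grouped


# Toy DAG: X4 -> X2, X4 -> X3, X2 -> X1, X3 -> X1, X5 -> X1.
toy = {"X4": ["X2", "X3"], "X3": ["X1"], "X2": ["X1"], "X5": ["X1"]}
toy_layers = layers(toy)
```

In this toy graph, $V_0 = \{X_1\}$, $V_1 = \{X_2, X_3, X_5\}$, and $V_2 = \{X_4\}$; no two vertices in the same layer are adjacent, and every vertex in $V_s$ with $s \geq 1$ has a child in $V_{s-1}$.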

B.2 PROOF OF THEOREM 1

Proof. Let $G = (V, E)$. Consider an equivalence relation, denoted by $\sim$, among vertices in $V$ defined as follows: for any $X_i, X_j \in V$, $X_i \sim X_j$ if and only if $l(X_i) = l(X_j)$ and $X_i$ and $X_j$ are connected. Here, connected means that there is a path between $X_i$ and $X_j$. Below we use $C(X_i)$ to denote the equivalence class containing $X_i$. Next, we define a weak ordering $\pi$ on $V/\sim$, i.e., the equivalence classes induced by $\sim$, by letting $C(X_i) \succeq_\pi C(X_j)$ if and only if $l(X_i) \geq l(X_j)$. Then, we extend $\pi$ to a total ordering $\rho$ on $V/\sim$. The ordering $\rho$ also induces a weak ordering (denoted by $\bar{\rho}$) on $V$: $X_i \succeq_{\bar{\rho}} X_j$ if and only if $C(X_i) \succeq_\rho C(X_j)$. Finally, we extend $\bar{\rho}$ to a total ordering $\gamma$ on $V$. It can be verified that $\gamma$ is a topological ordering of $G$, that is, if we relabel the vertices according to $\gamma$, then $X_i \in \mathrm{pa}(X_j, G)$ if and only if $i > j$ and $X_i$ and $X_j$ are adjacent, and the adjacency matrix of $G$ becomes lower triangular.

Assume that the vertices of $G$ are relabeled according to $\gamma$; we consider the binary adjacency matrix $A$ of the resulting graph throughout the rest of this proof. Note that relabelling is equivalent to applying a permutation to the adjacency matrix, which does not change the rank. Let $V_0 = \{1, 2, \ldots, k_1 - 1\}$ for some $k_1 \geq 2$. Then the $k_1$-th row of $A$, denoted by $A(k_1, \cdot)$, is the first non-zero row vector of $A$. Letting $S = \{A(k_1, \cdot)\}$, $S$ contains a subset of linearly independent vector(s) of the first $k_1$ rows of $A$. Suppose that we have visited the first $m$ rows of $A$ and $S = \{A(k_1, \cdot), A(k_2, \cdot), \ldots, A(k_t, \cdot)\}$ contains a subset of linearly independent vector(s) of the first $m$ rows of $A$, where $k_1 \leq m < d$. If $X_{m+1} \nsim X_{k_t}$, then we add $A(m+1, \cdot)$ to $S$; otherwise, we keep $S$ unchanged. We claim that the vectors in $S$ are still linearly independent after the above step. Clearly, if we do not add any new vector, then $S$ contains only linearly independent vectors.

To show the other case, note that if $l(X_{m+1}) > l(X_{k_t}) \geq \cdots \geq l(X_{k_1})$, then there is an index $i \in V_{l(X_{m+1})-1}$ such that $A(m+1, i) \neq 0$, by the definition of height. Since $l(X_{m+1}) > l(X_{k_t})$, we have $l(X_{k_t}) \leq l(X_{m+1}) - 1$ and thus $A(k_j, i) = 0$ for all $j = 1, 2, \ldots, t$. Therefore, $A(m+1, \cdot)$ cannot be linearly represented by $\{A(k_j, \cdot) ; j = 1, 2, \ldots, t\}$ and the vectors in $S$ are linearly independent. On the other hand, if $l(X_{m+1}) = l(X_{k_t})$, then the definition of the equivalence relation $\sim$ implies that $X_{m+1}$ and $X_{k_t}$ are disconnected, which means that $X_{m+1}$ and $X_{k_t}$ do not share a common child in $V_{l(X_{m+1})-1}$. Consequently, there is an index $i \in V_{l(X_{m+1})-1}$ such that $A(m+1, i) \neq 0$ but $A(k_t, i) = 0$. Similarly, we can show that $A(k_j, i) = 0$ for all $j = 1, 2, \ldots, t$. Thus, the vectors in $S$ are still linearly independent.

After visiting all the rows in $A$, the number of vectors in $S$ is equal to $\sum_{s=1}^{l(G)} |C(G_{s,s-1})|$ based on the definition of $\sim$. The second inequality can be shown by noting that $C(G_{s,s-1})$ has at least one element. The proof is complete.

B.3 PROOF OF THEOREM 2

Proof. Denote the directed graph by $G = (V, E)$. Edmonds (1967, Theorem 1) showed that $\max\{\operatorname{rank}(W) ; W \in \mathcal{W}_A\}$ is equal to the maximum number of nonzero entries of $A$, no two of which lie in a common row or column. Therefore, it suffices to show that the latter quantity is equal to the size of the minimum head-tail vertex cover. Let $V' = V_0' \cup V_1'$, where $V_0' = V \times \{0\} = \{(X_i, 0) ; X_i \in V\}$ and $V_1' = V \times \{1\} = \{(X_i, 1) ; X_i \in V\}$. Now define a bipartite graph $B = (V', E')$ where $E' = \{(X_i, 0) \to (X_j, 1) ; (X_i, X_j) \in E\}$. Denote by $M$ a set of nonzero entries of $A$ such that no two entries lie in the same row or column. Notice that $M$ can be viewed as an edge set in which no two edges share a common endpoint. Thus, $M$ is a matching of $B$. Conversely, it can be shown by similar arguments that any matching of $B$ corresponds to a set of nonzero entries of $A$, no two of which lie in a common row or column. Therefore, $\max\{\operatorname{rank}(W) ; W \in \mathcal{W}_A\}$ equals the size of the maximum matching of $B$, and hence the size of the minimum vertex cover of $B$ according to König's theorem. Note that any vertex cover of $B$ can be equivalently transformed into a head-tail vertex cover of $G$, by letting $H$ and $T$ be the subsets of the vertex cover containing all variables in $V_0'$ and all variables in $V_1'$, respectively. Thus, $\max\{\operatorname{rank}(W) ; W \in \mathcal{W}_A\}$ is equal to the size of the minimum head-tail vertex cover.
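The reduction in this proof suggests a simple way to compute the maximum rank of a given graph structure: build the bipartite graph $B$ and compute a maximum matching. The sketch below uses a standard augmenting-path search (often called Kuhn's algorithm), which is a choice made for this illustration and not something prescribed by the proof; vertices are 0-indexed and the function name is hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # (i, j) encodes the directed edge X_i -> X_j


def max_matching_size(num_vertices: int, edges: Set[Edge]) -> int:
    """Size of a maximum matching in the bipartite graph B built from G.

    Left vertex u stands for the copy (X_u, 0), right vertex v for (X_v, 1).
    """
    adj: List[List[int]] = [[] for _ in range(num_vertices)]
    for i, j in edges:
        adj[i].append(j)
    partner = [-1] * num_vertices  # left partner of each right vertex

    def augment(u: int, seen: List[bool]) -> bool:
        # Look for an augmenting path that starts at left vertex u.
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if partner[v] == -1 or augment(partner[v], seen):
                    partner[v] = u
                    return True
        return False

    return sum(augment(u, [False] * num_vertices) for u in range(num_vertices))


chain = {(i, i + 1) for i in range(5)}                      # X_0 -> ... -> X_5
bipartite = {(i, j) for i in range(3) for j in range(3, 6)}  # complete, balanced
star = {(0, j) for j in range(1, 6)}                        # a single hub
```

For these three structures the maximum matching sizes are 5, 3, and 1: the chain attains the largest possible value for its number of edges, while a single hub with many children keeps the maximum rank at 1.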

B.4 PROOF OF THEOREM 3

Proof. We start with the first inequality in Equation (4). Let $h_1, \ldots, h_p$ denote the heights $s$ where $|V_s| < |\mathrm{ch}(V_s)|$, and $t_1, \ldots, t_q$ the heights $s$ where $|V_s| > |\mathrm{ch}(V_s)|$. Let $H = \cup_{i=1}^{p} V_{h_i}$ and $T = \cup_{i=1}^{q} V_{t_i}$. It is straightforward to see that $(H, T)$ is a head-tail vertex cover. Thus, Equation (4) holds according to Theorem 2. The second inequality can be shown similarly and its proof is omitted. For the third inequality, let $m = \operatorname{argmax}\{|V_s| : 0 \leq s \leq l(G)\}$, and define $H = \cup_{i>m} V_i$ and $T = \cup_{i<m} V_i$. Then $(H, T)$ is also a head-tail vertex cover, and the third inequality follows from Theorem 2 as well.
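The cover used for the third inequality can be verified in a few lines. The following is a sketch under two assumptions that are not restated in this appendix: every edge points from a vertex of larger height to a vertex of smaller height, and the size of a head-tail vertex cover $(H, T)$ is $|H| + |T|$.

```latex
% Any edge X_a -> X_b has X_a in V_i and X_b in V_j with i > j.
X_a \to X_b,\quad X_a \in V_i,\ X_b \in V_j,\ i > j
\;\Longrightarrow\;
\begin{cases}
X_a \in H = \bigcup_{s > m} V_s, & \text{if } i > m,\\[2pt]
X_b \in T = \bigcup_{s < m} V_s, & \text{if } i \le m \ (\text{so } j < m),
\end{cases}
\qquad
|H| + |T| \;=\; \sum_{s \neq m} |V_s| \;=\; d - \max_{0 \le s \le l(G)} |V_s| .
```

Under these assumptions, Theorem 2 then bounds the maximum rank by $d - \max_s |V_s|$, so a graph with one very large layer cannot have a high rank.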

B.5 PROOF OF THEOREM 4

Proof. Notice that Theorem 2 holds for all directed graphs. This theorem then follows by treating the skeleton and the moral graph as directed graphs with loops, i.e., an undirected edge X i -X j is treated as two directed edges X i → X j and X j → X i .

C IMPLEMENTATION DETAILS

In this section, we present an algorithm to generate a random DAG with a given rank, low rank versions of NOTEARS and GraN-DAG, and a description of our experimental settings.

C.1 GENERATING RANDOM DAGS

In Section 4.3, we briefly discuss the idea of generating a random DAG with a given rank. We now describe the detailed procedure in Algorithm 1. In particular, we aim to generate a random DAG with $d$ nodes, average degree $k$, and rank $r$. The first part of Algorithm 1 after initialization is to sample a number $N$, representing the total number of edges, from a binomial distribution $B(d(d-1)/2, p)$ where $p = k/(d-1)$. If $N < r$, Algorithm 1 returns FAIL, since a graph with $N < r$ edges can never have rank $r$. Otherwise, Algorithm 1 samples an initial graph with $r$ edges and rank $r$ by choosing $r$ edges such that no two of them share the same head point or the same tail point, i.e., each row and each column of the corresponding adjacency matrix has at most one non-zero entry. Then, Algorithm 1 sequentially samples an edge from $R$, which contains all possible edges, removes it from $R$, and checks whether adding this edge to the graph changes the size of the minimum head-tail vertex cover. If not, the edge is added to the graph; otherwise, it is discarded. Discarding a rejected edge is safe because if a graph $G$ is a super-graph of another graph $H$, then the size of the minimum head-tail vertex cover of $G$ is no less than that of $H$. We repeat the above sampling procedure until there is no edge left in $R$ or the number of edges in the resulting graph reaches $N$. If the latter happens, the algorithm returns the generated graph; otherwise, it returns FAIL. The theoretical basis of Algorithm 1 is Theorem 2. Note that the algorithm may not return a valid graph if the desired number $N$ of edges cannot be reached. This could happen if the input rank is too low while the input average degree is too high. With our experiment settings, we find it rare for Algorithm 1 to fail to return a desired graph.

Algorithm 1 Generating random DAGs
Require: Number of nodes d, average degree k, and rank r.
Ensure: A randomly sampled DAG with the number of nodes d, average degree k, and rank r.
1: Set M = empty graph, M_p = ∅, and R = {(i, j); i < j, i, j = 1, 2, ..., d}.
2: Set p = k/(d − 1).
3: Sample N from B(d(d − 1)/2, p).
4: if N < r then
5:   return FAIL
6: end if
7: Sample r indices from 1, ..., d − 1 and store them in M_p in descending order.
8: for each i in M_p do
9:   Sample an index j from i + 1 to d.
10:   Add (i, j) to M and remove (i, j) from R.
11: end for
12: while R ≠ ∅ and |M| < N do
13:   Sample an edge (i, j) from R and remove it from R.
14:   if adding (i, j) to M does not change the size of the minimum head-tail vertex cover of M then
15:     Add (i, j) to M.
16:   end if
17: end while
18: if |M| < N then
19:   return FAIL
20: end if
21: return M
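The procedure can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: vertices are 0-indexed, the minimum head-tail vertex cover size is computed as a maximum bipartite matching (justified by Theorem 2 and König's theorem), the initial edges are forced to have distinct targets as the text requires, and `random_dag` and `matching_size` are hypothetical names.

```python
import random
from typing import List, Optional, Set, Tuple

Edge = Tuple[int, int]  # (i, j) with i < j encodes the edge X_i -> X_j


def matching_size(d: int, edges: Set[Edge]) -> int:
    """Maximum bipartite matching size, used here as the cover size."""
    adj: List[List[int]] = [[] for _ in range(d)]
    for i, j in edges:
        adj[i].append(j)
    partner = [-1] * d

    def augment(u: int, seen: List[bool]) -> bool:
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if partner[v] == -1 or augment(partner[v], seen):
                    partner[v] = u
                    return True
        return False

    return sum(augment(u, [False] * d) for u in range(d))


def random_dag(d: int, k: float, r: int,
               rng: random.Random) -> Optional[Set[Edge]]:
    """Sketch of Algorithm 1; returns None where the algorithm returns FAIL."""
    p = k / (d - 1)
    pool = [(i, j) for i in range(d) for j in range(i + 1, d)]  # the set R
    # N ~ Binomial(d(d-1)/2, p), drawn as a sum of Bernoulli trials.
    n_edges = sum(rng.random() < p for _ in pool)
    if n_edges < r:
        return None
    # Initial graph: r edges with pairwise distinct sources and targets.
    graph: Set[Edge] = set()
    used_targets: Set[int] = set()
    for i in sorted(rng.sample(range(d - 1), r), reverse=True):
        j = rng.choice([t for t in range(i + 1, d) if t not in used_targets])
        used_targets.add(j)
        graph.add((i, j))
    # Visit the remaining edges in random order (sampling without replacement).
    rest = [e for e in pool if e not in graph]
    rng.shuffle(rest)
    for edge in rest:
        if len(graph) >= n_edges:
            break
        if matching_size(d, graph | {edge}) == r:  # cover size unchanged
            graph.add(edge)
    return graph if len(graph) >= n_edges else None
```

Processing the sampled sources in descending order guarantees that an unused target is always available, since the $t$-th source processed has at least $t$ candidate targets and only $t - 1$ of them can already be taken.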

C.2 OPTIMIZATION

For this part, we consider a dataset consisting of $n$ i.i.d. observations from $P(X)$; consequently, the expectations in Problems (1) and (2) are replaced by empirical means. Denote the design matrix by $X \in \mathbb{R}^{n \times d}$, where each row of $X$ corresponds to an observation and each column represents a variable. Here we use NOTEARS (Zheng et al., 2018) and GraN-DAG (Lachapelle et al., 2020) from each class of methods as examples and describe their low rank versions in the following. Other gradient-based methods and their optimization procedures can be similarly modified to incorporate the low rank assumption.

C.2.1 NOTEARS WITH LOW RANK ASSUMPTION

Following Section 3, the optimization problem in our work can be written as

$$\min_{U, V}\ \frac{1}{2n}\left\|X - XUV^T\right\|_F^2, \quad \text{subject to } \operatorname{trace}\left(e^{UV^T \circ UV^T}\right) - d = 0, \qquad (5)$$

where $U, V \in \mathbb{R}^{d \times r}$ and $\circ$ is the point-wise product. The constraint in Problem (5) holds if and only if $UV^T$ is a weighted adjacency matrix of a DAG. This problem can then be solved by standard numerical optimization methods such as the augmented Lagrangian method (Bertsekas, 1999). In particular, the augmented Lagrangian is given by

$$L_\rho(U, V, \alpha) = \frac{1}{2n}\left\|X - XUV^T\right\|_F^2 + \alpha\, g(UV^T) + \frac{\rho}{2}\left|g(UV^T)\right|^2,$$

where $g(UV^T) := \operatorname{trace}\left(e^{UV^T \circ UV^T}\right) - d$, $\alpha$ is the Lagrange multiplier, and $\rho > 0$ is the penalty parameter. The optimization procedure is summarized in Algorithm 2, similar to Zheng et al. (2018, Algorithm 1).

Algorithm 2 Optimization procedure for NOTEARS-low-rank
Require: Design matrix $X$, starting point $(U_0, V_0, \alpha_0)$, rate $c \in (0, 1)$, tolerance $\epsilon > 0$, and threshold $w > 0$.
Ensure: Locally optimal parameter $W^*$.
1: for t = 1, 2, ... do
2:   (Solve primal) $U_{t+1}, V_{t+1} \leftarrow \arg\min_{U,V} L_\rho(U, V, \alpha_t)$ with $\rho$ such that $g(U_{t+1}V_{t+1}^T) < c\, g(U_t V_t^T)$.
3:   (Dual ascent) $\alpha_{t+1} \leftarrow \alpha_t + \rho\, g(U_{t+1}V_{t+1}^T)$.
4:   if $g(U_{t+1}V_{t+1}^T) < \epsilon$ then
5:     Set $U^* = U_{t+1}$ and $V^* = V_{t+1}$.
6:     break
7:   end if
8: end for
9: (Thresholding) Set $W^* = U^* V^{*T} \circ 1(|U^* V^{*T}| > w)$.
10: return $W^*$
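To make the objective concrete, the acyclicity term $g$ and the augmented Lagrangian can be evaluated as in the sketch below. It is a minimal NumPy illustration, not the authors' code: the matrix exponential is approximated by a truncated power series (an assumption made to keep the example dependency-free; a dedicated routine such as `scipy.linalg.expm` would normally be preferable), and the function names are hypothetical.

```python
import numpy as np


def trace_expm(m: np.ndarray, terms: int = 30) -> float:
    """Approximate trace(exp(m)) with a truncated power series."""
    d = m.shape[0]
    power = np.eye(d)
    total = float(d)  # the k = 0 term is trace(I) = d
    for k in range(1, terms + 1):
        power = power @ m / k  # power now equals m^k / k!
        total += float(np.trace(power))
    return total


def g(u: np.ndarray, v: np.ndarray) -> float:
    """Acyclicity term trace(exp(W o W)) - d with W = U V^T."""
    w = u @ v.T
    return trace_expm(w * w) - w.shape[0]


def augmented_lagrangian(x: np.ndarray, u: np.ndarray, v: np.ndarray,
                         alpha: float, rho: float) -> float:
    """Least-squares loss plus the linear and quadratic penalty terms."""
    n = x.shape[0]
    w = u @ v.T
    loss = float(np.sum((x - x @ w) ** 2)) / (2 * n)
    h = g(u, v)
    return loss + alpha * h + 0.5 * rho * h ** 2


# Rank-1 factorization of a DAG with edges X_1 -> X_2 and X_1 -> X_3.
u_dag = np.array([[1.0], [0.0], [0.0]])
v_dag = np.array([[0.0], [2.0], [-1.0]])
# Factorization of a 2-cycle X_1 <-> X_2.
u_cyc = np.eye(2)
v_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])
```

For the DAG factorization `g` evaluates to zero, while for the 2-cycle it is strictly positive, which is what the penalty terms in $L_\rho$ act on.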
Notice that here we do not include the $\ell_1$ penalty term (except for the first and last experiments in Sections 5.1 and 5.5, respectively), for the following reasons: (1) the thresholding procedure can also control false discoveries; (2) we consider relatively sufficient data for the experiments and NOTEARS with thresholding has been shown in Zheng et al. (2018) to perform consistently well even when the graph is sparse; (3) we are more concerned with relatively large and dense graphs, so a sparsity assumption may be harmful, as shown also by Zheng et al. (2018); (4) the $\ell_1$ penalty term requires a tuning parameter, which itself is not easy to choose.

Zheng et al. (2018) used L-BFGS to solve the unconstrained subproblem in Step 2. We alternatively use the Newton conjugate gradient method that is written in C. Empirically, these two optimizers behave similarly in terms of the quality of the estimate, while the latter runs much faster thanks to its C implementation. The DAG constraint may not be satisfied exactly by iterative numerical methods, so it is common practice to pick a small tolerance, followed by a thresholding procedure on the estimated entries to obtain exact DAGs. In our implementation, we choose $U_0$ and $V_0$ to be the first $r$ columns of the $d \times d$ identity matrix. Other parameter choices are: $\alpha_0 = 0$, $c = 0.25$, $\epsilon = 10^{-6}$, and $w = 0.3$, similar to those used in related methods on the same datasets (e.g., Zheng et al. (2018); Yu et al. (2019); Zhu et al. (2020)). The chosen threshold $w = 0.3$ works well in our experiments and in the experiments of related works that use the same data model. In case the thresholded matrix is not a DAG, one may further increase the threshold until the resulting matrix corresponds to a DAG.

After obtaining $W^*$, we add an additional pruning step: we use linear regression to refit the dataset based on the structure indicated by $W^*$ and then apply another thresholding (with $w = 0.3$) to the refitted weighted adjacency matrix.
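The thresholding and pruning steps can be sketched as below. This is a minimal illustration under stated assumptions: entry $(i, j)$ of the weight matrix is taken to be the weight of the edge $X_i \to X_j$ (consistent with the loss $\|X - XW\|_F^2$), the refit uses ordinary least squares without an intercept, and `threshold` and `refit_and_prune` are hypothetical names.

```python
import numpy as np


def threshold(w: np.ndarray, level: float = 0.3) -> np.ndarray:
    """Zero out entries whose absolute value does not exceed `level`."""
    return w * (np.abs(w) > level)


def refit_and_prune(x: np.ndarray, w_star: np.ndarray,
                    level: float = 0.3) -> np.ndarray:
    """Refit each variable on its estimated parents, then threshold again."""
    d = w_star.shape[0]
    refit = np.zeros((d, d))
    for j in range(d):
        parents = np.flatnonzero(w_star[:, j])
        if parents.size:
            coef, *_ = np.linalg.lstsq(x[:, parents], x[:, j], rcond=None)
            refit[parents, j] = coef
    return threshold(refit, level)


# Toy data: X_0 -> X_1 with weight 1.5, and X_2 independent of the rest.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.standard_normal(n)
x1 = 1.5 * x0 + rng.standard_normal(n)
x2 = rng.standard_normal(n)
data = np.column_stack([x0, x1, x2])

# Hypothetical estimate with one correct edge and one spurious edge X_1 -> X_2.
w_est = np.zeros((3, 3))
w_est[0, 1] = 1.2
w_est[1, 2] = 0.4
pruned = refit_and_prune(data, w_est)
```

In this toy example the refit moves the weight of the true edge close to 1.5, while the refitted weight of the spurious edge is close to zero and is removed by the second thresholding.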
Both the Newton conjugate gradient optimizer and the pruning technique are also applied to NOTEARS, which not only accelerates the optimization but also improves its performance by yielding a much lower SHD, particularly for large and dense graphs. See Appendix D.3 for an empirical comparison.

• CAM (Peters et al., 2014): its code is available through the CRAN R package repository at https://cran.r-project.org/web/packages/CAM.
• NOTEARS (Zheng et al., 2018) and NOTEARS-MLP (Zheng et al., 2020): code is available at the first author's GitHub repository https://github.com/xunzheng/notears.
• GraN-DAG (Lachapelle et al., 2020): an implementation is available at the first author's GitHub repository https://github.com/kurowasan/GraN-DAG. Note that for graphs of 50 nodes or more, GraN-DAG performs a preliminary neighborhood selection step to avoid overfitting.
• DAG-GNN (Yu et al., 2019): the code is available at the first author's GitHub repository https://github.com/fishmoon1234/DAG-GNN.
• ICA-LiNGAM (Shimizu et al., 2006): an implementation is available at https://sites.google.com/site/sshimizu06/lingam.

In the experiments, we mostly use default hyperparameters unless otherwise stated.

We next empirically study the consistency of NOTEARS-low-rank. Again, we use rank-specified random graphs (sampled according to Algorithm 1) with d = 100 nodes, degree k = 8, rank r = 10, and linear Gaussian SEMs. We also assume that the true rank is known. We fix the rank parameter r = 10 and use different sample sizes ranging from 200 to 5,000. From Figure 11, NOTEARS-low-rank performs reasonably well when the sample size is small and tends to perform better with a larger number of samples.



Here we choose ICA-LiNGAM, rather than alternative LiNGAM methods such as DirectLiNGAM (Shimizu et al., 2011), based on our empirical observation. Specifically, an implementation of ICA-LiNGAM performs noticeably better than DirectLiNGAM on relatively dense graphs. A detailed discussion and an empirical comparison can be found in Appendix D.4.



Ke et al. (2019); Ng et al. (2019a); Zhu et al. (2020).

Figure 1: A DAG $G$ with 12 vertices, 12 edges and height 3, where $V_0 = \{X_1, X_2, X_3, X_4\}$, $V_1 = \{X_5, X_6, X_7\}$, $V_2 = \{X_8, X_9\}$, and $V_3 = \{X_{10}, X_{11}, X_{12}\}$.

Figure 3: Average SHDs on rank-specified graphs. The models are linear SEMs with (a)-(b) Gaussian noises, and (c) exponential noises. The true rank is assumed to be known.

Figure 7: Real network.

Figure 8: The pathfinder (left) and arth150 (right) networks.

Figure 9: A fully connected directed balanced bipartite graph G and its binary adjacency matrix.

Algorithm 1 (steps 10 to 12)
10: Add (i, j) to M and remove (i, j) from R.
11: end for
12: while R ≠ ∅ and |M| < N do

Algorithm 1 (steps 14 to 18)
14: if adding (i, j) to M does not change the size of the minimum head-tail vertex cover of M then
15: Add (i, j) to M.
16: end if
17: end while
18: if |M| < N then

LINEAR SEMS WITH HIGHER RANKS

This experiment considers graphs of higher ranks. We use rank-specified random graphs with d = 100 nodes and rank r ∈ {30, 35, 40, 45, 50} on linear Gaussian SEMs. The results are shown in Figures 10a and 10b with degrees 2 and 8, respectively. We observe that as the rank of the underlying graph becomes higher, the advantage of NOTEARS-low-rank over NOTEARS decreases. Nonetheless, NOTEARS-low-rank with rank r = 50 is still comparable to NOTEARS, and has a lower average SHD after removing outlier SHDs using the interquartile range rule.

Figure 10: Average SHDs on rank-specified graphs with higher ranks. The true rank is assumed to be known.

Algorithm 1 (steps 7 to 8)
7: Sample r indices from 1, ..., d − 1 and store them in M_p in descending order.
8: for each i in M_p do

Detailed results for linear Gaussian data model with equal noise variances.

DETAILED RESULTS FOR EXPERIMENT 4 WITH NON-LINEAR SEMS

Table 3 reports the detailed SHDs for each method in Section 5.4. We also mark in bold the best results from methods with or without low rank modifications.

Table 3: Detailed SHDs for Experiment 4 with non-linear SEMs.

A EXAMPLES AND DISCUSSIONS

We provide more examples and discussions in this section.

C.2.2 GRAN-DAG WITH LOW RANK ASSUMPTION

We next consider a low rank version of GraN-DAG. The optimization problem can be written as

$$\min_{\theta}\ -\frac{1}{n}\sum_{l=1}^{n}\sum_{i=1}^{d} \log p\left(X_i^{(l)} \,\middle|\, \mathrm{pa}(X_i, W(\theta))^{(l)} ; \theta\right) + \lambda \left\|W(\theta)\right\|_*, \quad \text{subject to } \operatorname{trace}\left(e^{W(\theta)}\right) - d = 0, \qquad (6)$$

where $X_i^{(l)}$ is the $l$-th sample of variable $X_i$ and $\mathrm{pa}(X_i, W(\theta))^{(l)}$ means the $l$-th sample of $X_i$'s parents indicated by the adjacency matrix $W(\theta)$. Here, $\theta$ denotes the parameters of the neural networks, and $W(\theta)$, which has non-negative entries, is obtained from the neural network path products.

Problem (6) can be solved similarly using the augmented Lagrangian method. The procedure is similar to Algorithm 2 and is the same as that used by GraN-DAG, with slight modifications: (1) the subproblem in Step 2 is approximately solved using first-order methods; (2) the thresholding at Step 9 is replaced by a variable selection method proposed by Bühlmann et al. (2014). The same variable selection or pruning method is adopted by two other benchmark methods, CAM and NOTEARS-MLP, in our experiment. Please refer to Lachapelle et al. (2020) and Bühlmann et al. (2014) for further details.

C.3 EXPERIMENT SETUP

In our experiments, we consider three data models: linear Gaussian SEMs, linear non-Gaussian SEMs (linear exponential SEMs), and non-linear SEMs (Gaussian processes). Given a randomly generated DAG $G$, the associated SEM is generated as follows:

Linear Gaussian. A linear Gaussian SEM is given by

$$X_i = \sum_{X_j \in \mathrm{pa}(X_i, G)} W(j, i)\, X_j + \epsilon_i, \quad i = 1, 2, \ldots, d, \qquad (7)$$

where $\mathrm{pa}(X_i, G)$ denotes $X_i$'s parents in $G$ and the $\epsilon_i$'s are jointly independent standard Gaussian noises. In our experiments, the weights $W(i, j)$ are uniformly sampled from $[-2, -0.5] \cup [0.5, 2]$.

Linear Exponential. A linear exponential SEM is also generated according to Equation (7), where the $\epsilon_i$'s are replaced by jointly independent Exp(1) random variables. The weights $W(i, j)$ are again sampled uniformly from $[-2, -0.5] \cup [0.5, 2]$.

Gaussian Processes. We consider the following additive noise model:

$$X_i = f_i\left(\mathrm{pa}(X_i, G)\right) + \epsilon_i, \quad i = 1, 2, \ldots, d,$$

where the $\epsilon_i$'s are jointly independent standard Gaussian noises and the $f_i$'s are functions sampled from Gaussian processes with an RBF kernel of bandwidth one.

We sample 3,000 observations according to the SEM. The reported results of each setting are summarized over 10 repetitions with different seeds. The experiments are run on a Linux workstation with a 16-core Intel Xeon 3.20GHz CPU and 128GB RAM.
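The two linear data models can be simulated with a few lines of NumPy. The sketch below is illustrative only: it assumes that entry $(i, j)$ of the weight matrix is the weight of the edge $X_i \to X_j$, so that the data satisfy $X = XW + E$ row by row, and the function names are hypothetical.

```python
import numpy as np


def sample_weights(adj: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Draw edge weights uniformly from [-2, -0.5] U [0.5, 2]."""
    magnitude = rng.uniform(0.5, 2.0, size=adj.shape)
    sign = rng.choice([-1.0, 1.0], size=adj.shape)
    return adj * magnitude * sign


def simulate_linear_sem(w: np.ndarray, n: int, rng: np.random.Generator,
                        noise: str = "gaussian") -> np.ndarray:
    """Sample n observations from a linear SEM with weight matrix w."""
    d = w.shape[0]
    if noise == "gaussian":
        e = rng.standard_normal((n, d))
    elif noise == "exponential":
        e = rng.exponential(1.0, size=(n, d))
    else:
        raise ValueError(f"unknown noise type: {noise}")
    # X = X W + E  <=>  X (I - W) = E; I - W is invertible when w is a DAG.
    return np.linalg.solve((np.eye(d) - w).T, e.T).T


rng = np.random.default_rng(0)
adj = np.triu(np.ones((4, 4)), k=1)  # a dense DAG on 4 vertices
w = sample_weights(adj, rng)
x_gauss = simulate_linear_sem(w, 3000, rng)
x_exp = simulate_linear_sem(w, 3000, rng, noise="exponential")
```

Solving the linear system avoids looping over the vertices in topological order; for very large graphs, an explicit pass in topological order may be the cheaper option.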

C.4 BENCHMARK METHODS

Existing causal structure learning methods used in our experiments all have available implementations, as listed below:

• GES and PC: an implementation of both methods is available through the py-causal package at https://github.com/bd2kccd/py-causal. Note that the py-causal implementation is based on the CMU TETRAD project, in which the version of GES is in fact the fast GES algorithm proposed by Ramsey et al. (2017).
• MMHC (Tsamardinos et al., 2006): an implementation is available in the bnlearn package at https://CRAN.R-project.org/package=bnlearn.

D.3 FURTHER PRUNING

We compare the empirical results before and after applying the additional pruning technique described in Appendix C.2. The graphs are rank-specified with d ∈ {100, 300} nodes, rank r = 0.1d , and degree k ∈ {2, 4, 6, 8}. We again use linear Gaussian data model with equal noise variances to generate the datasets. The average SHDs are reported in Figure 12 . We see that applying an additional pruning step indeed improves the final performance of both NOTEARS and NOTEARS-low-rank, especially on relatively large and dense graphs. 

D.4 AN EMPIRICAL COMPARISON BETWEEN ICA-LINGAM AND DIRECTLINGAM

To the best of our knowledge, there are two Python implementations of ICA-LiNGAM (Shimizu et al., 2006) released by the authors, available at https://sites.google.com/site/sshimizu06/lingam and https://github.com/cdt15/lingam, respectively, where the latter is a Python package containing several LiNGAM-related methods. In the following, we use ICA-LiNGAM-pre and ICA-LiNGAM-cdt to denote these two implementations, respectively. For DirectLiNGAM (Shimizu et al., 2011), we only find a Python implementation, available in the same Python package that contains ICA-LiNGAM-cdt.

Here we run DirectLiNGAM, ICA-LiNGAM-cdt, and ICA-LiNGAM-pre on linear exponential data models with 100-node and rank-10 graphs. The mean SHDs are reported in Table 1. Based on this experimental result as well as our past experience, DirectLiNGAM usually performs (slightly) better than ICA-LiNGAM-cdt, while ICA-LiNGAM-pre performs noticeably (if not much) better on relatively dense and large graphs. We are more concerned with relatively large and dense graphs and hence report the results achieved by ICA-LiNGAM-pre in the main paper.

Here the true rank is assumed to be known and is used as the rank parameter in NOTEARS-low-rank. We also test (fast) GES, MMHC, and PC. However, PC is too slow since some nodes may have a high in-degree (i.e., hubs) in large, dense, and low rank graphs. For the same reason, the skeleton may not be correctly estimated by MMHC, which has a similar performance to that of GES. Therefore, we only include the results of GES for comparison. We treat GES favorably by regarding undirected edges as true positives if the true graph has a directed edge in place of the undirected ones.

