ON LOW RANK DIRECTED ACYCLIC GRAPHS AND CAUSAL STRUCTURE LEARNING

Abstract

Despite several important advances in recent years, learning causal structures represented by directed acyclic graphs (DAGs) remains a challenging task in high dimensional settings when the graphs to be learned are not sparse. In this paper, we propose to exploit a low rank assumption regarding the (weighted) adjacency matrix of a DAG causal model to mitigate this problem. We demonstrate how to adapt existing methods for causal structure learning to take advantage of this assumption and establish several useful results relating interpretable graphical conditions to the low rank assumption. In particular, we show that the maximum rank is highly related to hubs, suggesting that scale-free networks, which are frequently encountered in real applications, tend to be low rank. We also provide empirical evidence for the utility of our low rank adaptations, especially on relatively large and dense graphs. Not only do they outperform existing algorithms when the low rank condition is satisfied, but their performance is also competitive even when the rank of the underlying DAG is not as low as is assumed.

1. INTRODUCTION

An important goal in many sciences is to discover the underlying causal structures in various domains, both for the purpose of explaining and understanding phenomena, and for the purpose of predicting the effects of interventions (Pearl, 2009). Due to the relative abundance of passively observed data as opposed to experimental data, how to learn causal structures from purely observational data has been vigorously investigated (Peters et al., 2017; Spirtes et al., 2000). In this context, causal structures are usually represented by directed acyclic graphs (DAGs) over a set of random variables. For this task, existing methods can be roughly categorized into two classes: constraint-based and score-based. The former use statistical tests to extract from data a number of constraints in the form of conditional (in)dependencies and seek to identify the class of causal structures compatible with those constraints (Meek, 1995; Spirtes et al., 2000; Zhang, 2008). The latter employ a score function to evaluate candidate causal structures relative to the data and seek to locate the causal structure (or a class of causal structures) with the optimal score. Due to the combinatorial nature of the acyclicity constraint (Chickering, 1996; He et al., 2015), most score-based methods rely on local heuristics to perform the search. A particular example is the greedy equivalence search (GES) algorithm (Chickering, 2002), which can find an optimal solution with infinite data and proper model assumptions. More recently, continuous optimization based methods have been developed, e.g., Zheng et al. (2018); Yu et al. (2019); Ng et al. (2019b;a); Ke et al. (2019); Lachapelle et al. (2020); Zheng et al. (2020), among others. While these new algorithms represent the current state of the art in many settings, their performance generally degrades when the target DAG becomes large and relatively dense, as seen from the empirical results reported in the referred works and also in this paper. This issue also poses a challenge to other approaches. Ramsey et al. (2017) proposed fast GES for impressively large problems, but it works reasonably well only when the large structure is very sparse.
The max-min hill-climbing (MMHC) algorithm (Tsamardinos et al., 2006) relies on local learning methods that often do not perform well when the target node has a large neighborhood. How to improve the performance on relatively large and dense DAGs is therefore an important question. In this work, we study the potential of exploiting a kind of low rank assumption on the DAG structure to help address this problem. The rank of a graph that concerns us is the algebraic rank of its associated weighted adjacency matrix. Similar to the role of a sparsity assumption on graph structures, we treat the low rank assumption as methodological, and it is not restricted to a particular DAG learning method. However, unlike the sparsity assumption, it is much less apparent when DAGs tend to be low rank and how low rank DAGs behave. Thus, besides demonstrating the utility of exploiting a low rank assumption in causal structure learning, another important goal is to improve our understanding of the low rank assumption by relating the rank of a graph to its graphical structure. Such a result also enables us to characterize the rank of a graph from several structural priors and helps to choose rank related hyperparameters for the learning algorithm. Our contributions are summarized as follows:

• We show how to adapt existing causal structure learning methods to take advantage of the low rank assumption, and provide a strategy to select rank related hyperparameters utilizing the lower and upper bounds on the true rank, if they are available.

• To improve our understanding of low rank DAGs, we establish lower bounds on the rank of a DAG in terms of simple graphical conditions, which imply necessary conditions for DAGs to be low rank.

• We show that the maximum possible rank of weighted adjacency matrices associated with a directed graph is highly related to hubs in the graph, which suggests that scale-free networks tend to be low rank. From this result, we derive several graphical conditions that bound the rank of a DAG from above, providing simple sufficient conditions for low rank.

• Empirically, we demonstrate that the low rank adaptations are indeed useful. Not only do they outperform the original algorithms when the low rank condition is satisfied, but the performance is also very competitive even when the true rank is not as low as is assumed.
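To give intuition for the connection between hubs and rank stated above, the following minimal sketch (not from the paper; assuming NumPy is available) constructs a small "hub" DAG in which one node points to all others and all others point to a common sink. Although the graph has many edges, its weighted adjacency matrix has rank 2:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

# A hub DAG: X1 points to every other node, and every node X2..X_{d-1}
# points to Xd. All edges follow the index order, so the graph is acyclic.
# Edge weights are arbitrary non-zero values.
W = np.zeros((d, d))
W[0, 1:] = rng.uniform(0.5, 2.0, size=d - 1)           # hub: X1 -> X2..Xd
W[1:d - 1, d - 1] = rng.uniform(0.5, 2.0, size=d - 2)  # X2..X_{d-1} -> Xd

# The graph has 2d - 3 = 37 edges, yet every row of W is a linear
# combination of just two vectors (the hub's row and the sink indicator),
# so the matrix has rank 2.
print(np.linalg.matrix_rank(W))  # 2
```

By contrast, a dense DAG whose edges are spread uniformly over node pairs typically has an adjacency matrix of nearly full rank, which is the regime where the low rank adaptations discussed here are not expected to help.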

Related Work

The low rank assumption is frequently adopted in graph-based applications (Smith et al., 2012; Zhou et al., 2013; Yao & Kwok, 2016; Frot et al., 2019), matrix completion and factorization (Recht, 2011; Koltchinskii et al., 2011; Cao et al., 2015; Davenport & Romberg, 2016), network science (Hsieh et al., 2012; Huang et al., 2013; Zhang et al., 2017), and so on, but to the best of our knowledge, it has not been applied to the DAG structure itself in the context of learning causal DAGs. We note two works, Barik & Honorio (2019) and Tichavský & Vomlel (2018), that assume low rank conditional probability tables in learning Bayesian networks, which differs from our setting. Also related are existing works that study the rank of real weighted matrices described by a given simple directed/undirected graph. However, most of these works consider only the zero-nonzero pattern of the off-diagonal entries (see, e.g., Fallat & Hogben (2007); Hogben (2010); Mitchell et al. (2010)), whereas we also take into account the diagonal entries. This difference is crucial: if one considers only the off-diagonal entries, then the maximum rank over all possible weighted matrices is trivial and is always equal to the number of vertices. Consequently, many works focus on the minimum rank of a given graph, but characterizing the minimum rank exactly remains open, except for some special graph structures like trees (Hogben, 2010). Apart from these works, Edmonds (1967) studied algebraically the maximum rank for matrices with a common zero-nonzero pattern. In Section 4, we use this result to relate the maximum possible rank to a more interpretable graphical condition, which further implies several structural conditions on DAGs that may be easier to obtain in practice.

2. PRELIMINARIES

2.1 GRAPH TERMINOLOGY

A graph G is defined as a pair (V, E), where V = {X_1, X_2, ..., X_d} is the vertex set and E ⊂ V × V denotes the edge set. We are particularly interested in directed (acyclic) graphs in the context of causal structure learning. For any S ⊂ V, we use pa(S, G), ch(S, G), and adj(S, G) to denote the union of all parents, children, and adjacent vertices of the nodes of S in G, respectively. A graph is called weighted if every edge in the graph is associated with a non-zero value. We will work with weighted graphs and treat unweighted graphs as a special case where the edge weights are set to 1. Weighted graphs can be treated algebraically via weighted adjacency matrices. Specifically, the weighted adjacency matrix of a weighted graph G is a matrix W ∈ R^{d×d}, where W(i, j) is the weight of the edge X_i → X_j and W(i, j) ≠ 0 if and only if X_i → X_j exists in G. The binary adjacency matrix of G is obtained by replacing every non-zero entry of W with 1.
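As a minimal illustration of these definitions (not from the paper; assuming NumPy and 0-based indexing, so row i and column j correspond to X_{i+1} and X_{j+1}), consider the DAG X1 → X2 → X3 with an additional edge X1 → X3:

```python
import numpy as np

# Weighted adjacency matrix of the DAG X1 -> X2 -> X3, X1 -> X3:
# W[i, j] holds the (non-zero) weight of the edge from node i to node j,
# and W[i, j] = 0 means the edge is absent.
W = np.array([
    [0.0, 1.5, -0.7],
    [0.0, 0.0,  2.0],
    [0.0, 0.0,  0.0],
])

# Binary adjacency matrix: replace every non-zero weight by 1.
A = (W != 0).astype(int)

# pa({X3}, G) = {X1, X2}: the parents of X_j are the rows with a
# non-zero entry in column j.
parents_of_X3 = np.flatnonzero(A[:, 2])  # array([0, 1])
print(A)
print(parents_of_X3)
```

The rank of this graph, in the sense used throughout the paper, is simply `np.linalg.matrix_rank(W)`.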



Recently, Zheng et al. (2018) introduced a smooth acyclicity constraint w.r.t. the graph adjacency matrix, and the task on linear data models was then formulated as a continuous optimization problem with a least-squares loss. This change of perspective allows using deep learning techniques to model causal mechanisms and has already given rise to several new algorithms for causal structure learning with non-linear data, e.g., Yu et al. (2019); Ng et al. (2019b;a); Ke et al. (2019); Lachapelle et al. (2020); Zheng et al. (2020).
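The smooth acyclicity constraint of Zheng et al. (2018) can be sketched as follows (a minimal illustration, assuming NumPy and SciPy are available): h(W) = tr(exp(W ∘ W)) − d, where ∘ is the elementwise product, equals zero if and only if the graph of W is acyclic, so acyclicity becomes an equality constraint in a continuous program.

```python
import numpy as np
from scipy.linalg import expm

def h(W):
    """Smooth acyclicity measure of Zheng et al. (2018):
    h(W) = tr(exp(W * W)) - d, zero iff the graph of W is acyclic."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

# Acyclic: X1 -> X2 -> X3 (strictly upper triangular, hence nilpotent).
W_dag = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])

# Cyclic: X1 -> X2 and X2 -> X1.
W_cyc = np.array([[0., 1., 0.],
                  [1., 0., 0.],
                  [0., 0., 0.]])

print(h(W_dag))  # ≈ 0 (acyclic)
print(h(W_cyc))  # > 0 (the 2-cycle contributes to the trace)
```

Intuitively, the (i, i) entry of exp(W ∘ W) accumulates the weights of all closed walks from X_i back to itself, so the trace exceeds d exactly when a directed cycle is present.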

