GENERALIZED PRECISION MATRIX FOR SCALABLE ESTIMATION OF NONPARAMETRIC MARKOV NETWORKS

Abstract

A Markov network characterizes the conditional independence structure, or Markov property, among a set of random variables. Existing work focuses on specific families of distributions (e.g., exponential families) and/or certain structures of graphs, and most of them can only handle variables of a single data type (continuous or discrete). In this work, we characterize the conditional independence structure in general distributions for all data types (i.e., continuous, discrete, and mixed-type) with a Generalized Precision Matrix (GPM). Besides, we also allow general functional relations among variables, thus giving rise to a Markov network structure learning algorithm in one of the most general settings. To deal with the computational challenge of the problem, especially for large graphs, we unify all cases under the same umbrella of a regularized score matching framework. We validate the theoretical results and demonstrate the scalability empirically in various settings.

1. INTRODUCTION

Markov networks (also known as Markov random fields) represent conditional dependencies among random variables. They provide clear semantics in a graphical manner to cope with uncertainty in probability theory, with a wide application in fields including physics (Cimini et al., 2019) , chemistry (Dodani et al., 2016) , biology (Jaimovich et al., 2006), and sociology (Carrington et al., 2005) . The undirected nature of edges also allows cyclic, overlapping, or hierarchical interactions (Shen et al., 2009) . To estimate the Markov network from observational data, existing work focuses on certain parametric families of distributions, a majority of which study the Gaussian case. By assuming that the variables are from a multivariate Gaussian distribution, the dependencies can be well represented by the support of the precision, or inverse covariance, matrix according to Hammersley-Clifford theorem (Besag, 1974; Grimmett, 1973) . Together with various statistical estimators (e.g., the graphical lasso (Friedman et al., 2008) and neighborhood selection (Meinshausen & Bühlmann, 2006) ), this connection between the precision matrix and graphical structure has been well exploited in the Gaussian case in the past decades (Yuan, 2010; Ravikumar et al., 2011) . However, methods for Gaussian graphical models fail to correctly capture dependencies among variables deviating from Gaussian or including nonlinearity (Raskutti et al., 2008; Ravikumar et al., 2011) . While non-Gaussianity is more common in real-world data generating process, few results are applicable to Markov network structure learning on non-Gaussian data. In the discrete setting, Ravikumar et al. (2010) showed that a binary Ising model can be recovered by neighborhood selection using ℓ 1 penalized logistic regression. 
Loh & Wainwright (2013) encoded extra structural relations in the proposed generalized covariance matrix to model the dependencies for Markov networks with certain structures (e.g., tree structures or graphs with only singleton separator sets) among variables from exponential families. Several approaches allowed estimation for non-Gaussian continuous variables while most of them assumed parametric assumptions such as the exponential families (Yang et al., 2015; Lin et al., 2016; Suggala et al., 2017) or Gaussian copulas (Liu et al., 2009; 2012; Harris & Drton, 2013) . These methods illustrate the possibility of reliable Markov network estimations in several non-Gaussian cases, but still, the models are restricted to specific parametric families of distributions and/or structures of conditional independencies. Concerned with describing Markov properties of non-Gaussian data with general continuous distributions, Morrison et al. (2017) used the second-order derivatives to encode the conditional independence structure. Specifically, their approach is based on a theorem that the zero pattern in the Hessian matrix of the log-density determines the conditional independencies between non-Gaussian continuous variables (Spantini et al., 2018) . A method based on transport map, i.e., Sparsity Identification in Non-Gaussian distributions (SING) (Baptista et al., 2021) , is then designed to estimate the data density from samples, and the structure is derived from the estimated density. This approach achieves consistent Markov network structure recovery in a general non-Gaussian continuous setting. However, methods relying on the Hessian matrix cannot cope with discrete or mixed-type data. In addition, density estimation, especially for non-Gaussian data, can be computationally challenging for large graphs, limiting the scalability of this approach. 
Kernel-based Conditional Independence test (KCI) (Zhang et al., 2012) and Generalized Score (GS) (Huang et al., 2018) can handle the mixed-type case for structure learning, but as kernel-based methods they are computationally demanding, since their complexity scales cubically in the number of samples. To deal with these remaining obstacles, we explore a Generalized Precision Matrix (GPM) for nonparametric Markov network learning. Based on necessary and sufficient conditions for conditional independence in the continuous, discrete, and mixed-type cases, GPM characterizes Markov network structures with arbitrary data types. Moreover, our work does not constrain the distribution to belong to specific families, such as exponential families, nor does it require the distribution to be normalized. It also makes no specific assumptions on the functional relations among variables. To the best of our knowledge, the proposed GPM illustrates the feasibility of Markov network structure learning in one of the most general nonparametric settings. Furthermore, we put all these cases under the same umbrella of an estimation framework based on regularized score matching, as an extension of the score matching framework (Hyvärinen & Dayan, 2005). Unlike the previous approach (SING), which applies a transport map to estimate the data density for general continuous distributions, our framework only requires estimating the model score function parameterized by a deep model, from which the characterization matrix of the Markov network structure can be directly calculated. To facilitate the estimation process, we also apply suitable penalties on the characterization matrix to encourage sparse entries. In addition, we adopt recent advances in score matching (Song et al., 2020) to further scale up the process. Our method therefore narrows the gap between reliable structure learning and scalable deep learning techniques. We validate the theoretical results experimentally and demonstrate the scalability of the approach.

2. GENERALIZED PRECISION MATRIX

Suppose that we observe a collection of random variables $X = (X_1, \dots, X_d)$. Our goal is to discover the underlying Markov network structure, an undirected graph $G$ comprising a set of vertices $V = \{1, \dots, d\}$ and a set of edges $E$. The edges $E$ encode the conditional independence relations through the global Markov property: for any disjoint subsets $A$, $B$, and $C$ of the vertex set $V$ such that $C$ separates $A$ and $B$, $X_A$ and $X_B$ are conditionally independent given $X_C$, i.e., $X_A \perp\!\!\!\perp X_B \mid X_C$. Throughout this paper, we use an uppercase letter to denote a random variable and a lowercase letter with subscripts to denote the value of a random variable (e.g., $X_i = x_i$ for the value of $X_i$). For a discrete variable, say $X_i$, we denote its support by $\{x_{i1}, \dots, x_{iM_i}\}$, where $M_i$ is its cardinality.

As an alternative characterization of the conditional independence relations encoded by the graph, the pairwise Markov property requires that every pair of non-adjacent variables in the graph is conditionally independent given the remaining variables. That is, for any $i \neq j$, an edge between $X_i$ and $X_j$ is absent if and only if $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$. The conditioning set consisting of all remaining variables is essential. According to Lauritzen (1996), the pairwise Markov property is equivalent to the global one when the density is strictly positive. In order to estimate nonparametric Markov networks in this setting, we explore generalized characterizations of conditional independence for all types of data (i.e., continuous, discrete, and mixed-type) without distributional constraints. We start from learning conditional independence structures in continuous data with a procedure inspired by Spantini et al. (2018), and then propose new characterizations for discrete and mixed-type data.

Ideally, we aim to construct a Generalized Precision Matrix $\Omega$ that satisfies the following desiderata:

a. For any $i \neq j$, if $\Omega_{i,j} = 0$, then $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$;
b. The probability measure is not restricted to specific families but only needs to be strictly positive;
c. The undirected graph $G$ is not restricted to certain structures;
d. For continuous variables, the density has continuous derivatives up to second order w.r.t. the Lebesgue measure;
e. For discrete variables, the cardinality is not restricted;
f. To enable a practical estimation procedure, $\Omega$ is differentiable w.r.t. $X$.

Property (a) is the characterization of the pairwise Markov property. Properties (b) and (c) differentiate our work from most previous works, which assume Gaussianity and/or certain structures of the conditional independence. Properties (d) and (e) further raise the difficulty of our task because, in addition to not being restricted to a specific family of distributions, our characterization $\Omega$ has to be available for all data types (i.e., continuous, discrete, and mixed-type) under mild assumptions. For discrete variables, Property (e) removes the limitation on cardinality, thus differentiating our work from those focusing on the binary Ising model. Property (f) allows us to incorporate an $\ell_1$ regularization term in the estimation procedure and make use of gradient-based optimization.

2.1. CHARACTERIZATION FOR CONTINUOUS DATA

We aim to find the necessary and sufficient conditions for $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$. By definition, if $X_i$ is conditionally independent of $X_j$ given all remaining variables $X_{V \setminus \{i,j\}}$, we can factor the probability density function (PDF) $p_X$ as follows:

$$p_X(x) = p(x_i \mid x_{V \setminus \{i,j\}})\, p(x_j \mid x_{V \setminus \{i,j\}})\, p(x_{V \setminus \{i,j\}}). \tag{1}$$

After taking the logarithm, no term on the right-hand side depends on both $x_i$ and $x_j$. Together with the assumption that $p_X$ has continuous derivatives up to second order w.r.t. the Lebesgue measure, we have

$$\frac{\partial^2 \log p_X}{\partial x_i\, \partial x_j} = 0. \tag{2}$$

Conversely, the solution of Eq. (2) is given by $\log p_X(x) = g(x_{1:i-1}, x_{i+1:d}) + h(x_{1:j-1}, x_{j+1:d})$ for some functions $g, h : \mathbb{R}^{d-1} \to \mathbb{R}$. It thus follows that $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$. This connection between pairwise conditional independence and cross derivatives of the log-density has been observed in Spantini et al. (2018). Methods based on this connection have also been proposed recently (Morrison et al., 2017; Baptista et al., 2021). Following Baptista et al. (2021), one can characterize the conditional independence between $X_i$ and $X_j$ in a continuous distribution as

$$\Omega^{[c]}_{i,j} := \left( \mathbb{E}_{p_X}\!\left[ f^{[c]}_{i,j}(x)^2 \right] \right)^{\frac{1}{2}}, \tag{3}$$

where $f^{[c]}_{i,j}(\cdot)$ denotes the LHS of Eq. (2) and $[c]$ is a type label denoting continuous data. In practice, $p_X$ is the empirical PDF. The group structure of $\Omega^{[c]}$ could help achieve simultaneous sparse approximation (Yuan & Lin, 2006; Huang & Zhang, 2010) when it is used inside an $\ell_1$ regularizer in the estimation, which we describe in Sec. 3. We apply the same group structure in both the discrete and mixed-type cases, but skip its reintroduction for brevity. The characterization of the Markov property is as follows.

Corollary 1. Assume
i. $X = (X_1, \dots, X_d)$ is a set of continuous variables.
ii. The PDFs of $X$ are strictly positive and smooth.
iii. The characterization matrix $\Omega^{[c]}$ is defined according to Eq. (3).
Then for any $i \neq j$, $\Omega^{[c]}_{i,j} = 0$ implies $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$.

The proof is shown in Appx. A.1. It is worth noting that Cor. 1 also covers the Gaussian case, where the cross-derivatives of the log-density correspond to entries of the precision (inverse covariance) matrix (Drton et al., 2008), thus generalizing previous work assuming Gaussianity. Hence, the support of $\Omega^{[c]}$ characterizes conditional independence among continuous variables for general distributions.
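As an illustration of Eq. (3) in the Gaussian special case mentioned above, the following minimal sketch evaluates the cross derivatives of a known log-density by finite differences and averages their squares over samples. It is not the estimation procedure of this paper: the log-density is given rather than learned, and the function names (`log_density`, `cross_derivative`, `gpm_continuous`), the step size `h`, and the example precision matrix are illustrative assumptions.

```python
import numpy as np

def log_density(x, theta):
    # Unnormalized zero-mean Gaussian log-density: -0.5 * x^T Theta x
    return -0.5 * x @ theta @ x

def cross_derivative(f, x, i, j, h=1e-2):
    # Central finite-difference estimate of d^2 f / (dx_i dx_j)
    e_i = np.zeros_like(x)
    e_i[i] = h
    e_j = np.zeros_like(x)
    e_j[j] = h
    return (f(x + e_i + e_j) - f(x + e_i - e_j)
            - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h * h)

def gpm_continuous(f, samples):
    # Omega_ij = sqrt(mean over samples of squared cross derivative), cf. Eq. (3)
    d = samples.shape[1]
    omega = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            sq = [cross_derivative(f, x, i, j) ** 2 for x in samples]
            omega[i, j] = omega[j, i] = np.sqrt(np.mean(sq))
    return omega

# Hypothetical precision matrix with a missing (0, 2) edge
theta = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.7],
                  [0.0, 0.7, 2.0]])
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(3), np.linalg.inv(theta), size=200)
omega = gpm_continuous(lambda x: log_density(x, theta), samples)
```

For this quadratic log-density the cross derivative equals $-\Theta_{ij}$ at every point, so the off-diagonal entries of `omega` recover $|\Theta_{ij}|$ and the zero at position (0, 2) reflects the missing edge.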

2.2. CHARACTERIZATION FOR DISCRETE DATA

Since most previous work focuses on the Gaussian setting, and work on non-Gaussian distributions is mostly restricted to the exponential family, the characterization for continuous data discussed in Sec. 2.1 broadens the scope of reliable Markov network learning. However, that characterization is not applicable to discrete data because the gradient does not exist. In this section, we provide a characterization of the Markov network structure in the discrete case. As in the continuous case, the key ingredient is a necessary and sufficient condition for conditional independence in discrete data, which we establish in the following theorem. We write $m(x_{ik}, x_{jl}, z)$ for the joint probability mass function (PMF) of $\{X_i, X_j, Z\}$, simplified from $m_{X_i, X_j, Z}(x_{ik}, x_{jl}, z)$.

Theorem 1. Denote $V$ as a set of discrete variables and $X_i, X_j \in V$. For brevity, denote $V \setminus \{X_i, X_j\}$ as $Z$. Let $\{x_{i1}, \dots, x_{iM_i}\}$ and $\{x_{j1}, \dots, x_{jM_j}\}$ be the supports of variables $X_i$ and $X_j$, respectively. Denote $z$ as any value(s) of $Z$. Then, $X_i \perp\!\!\!\perp X_j \mid Z$ if and only if, for all $k \in [M_i]$ and $l \in [M_j]$ with $k \neq 1$ and $l \neq 1$, we have

$$\big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl}, z) - \log m(x_{ik}, x_{jl}, z)\big) = 0. \tag{4}$$

Proof sketch. For the sufficient condition, we want to show that the general solution to Eq. (4) has no term that takes the values of both $X_i$ and $X_j$. We first iterate over all possible differences w.r.t. $X_j$ to get the discrete score function of $X_i$, which does not take the value of $X_j$ as an argument. Then we obtain the desired solution by summation over all possible differences w.r.t. $X_i$. For the necessary condition, we decompose the PMF according to the conditional independence to obtain Eq. (4). The full proof is provided in Appx. A.2.

Based on Thm. 1, we propose the characterization matrix of conditional independence for discrete data, $\Omega^{[d]}$, as follows:

$$\Omega^{[d]}_{i,j} := \mathbb{E}_{m_X}\!\left[ \sum_{k,l} f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z)^2 \right], \tag{5}$$

where $f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z)$ denotes the LHS of Eq. (4) and $[d]$ is a type label denoting discrete data. The support of the matrix above satisfies the pairwise Markov property and characterizes the Markov network structure, as formally stated below with its proof in Appx. A.3.

Corollary 2. Assume
i. $X = (X_1, \dots, X_d)$ is a set of discrete variables.
ii. The PMFs of $X$ are strictly positive.
iii. The characterization matrix $\Omega^{[d]}$ is defined according to Eq. (5).
Then for any $i \neq j$, $\Omega^{[d]}_{i,j} = 0$ implies $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$.

Therefore, we have a characterization matrix $\Omega^{[d]}$ that represents the conditional independence structure for discrete data. It is worth noting that, unlike the generalized covariance matrix in Loh & Wainwright (2012), which only applies to certain structures among variables from exponential families, the proposed characterization matrix $\Omega^{[d]}$ encodes the Markov properties for general discrete distributions without any structural constraints. Also, compared with Ravikumar et al. (2010), Thm. 1 can be applied to general graphical models beyond binary Ising models and does not rely on the structural condition. Nor does it limit the cardinalities of discrete variables. Hence, Theorem 1 sheds light on characterizing arbitrary conditional independence structures for general discrete distributions.
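The condition in Eq. (4) can be checked numerically on a small joint PMF table. The sketch below is illustrative only: it assumes a known PMF stored as an array `m[x_i, x_j, z]` with index 0 playing the role of the reference values $x_{i1}$ and $x_{j1}$, a single discrete $Z$, and hypothetical helper names (`f_discrete`, `gpm_discrete`). Since the contrasts depend on the data only through $z$, the expectation in Eq. (5) is taken here over the marginal of $Z$.

```python
import numpy as np

def f_discrete(m, k, l):
    # LHS of Eq. (4) for every value z, with m indexed as m[x_i, x_j, z]
    log_m = np.log(m)
    return (log_m[0, 0, :] - log_m[k, 0, :]) - (log_m[0, l, :] - log_m[k, l, :])

def gpm_discrete(m):
    # Sum of squared contrasts over k, l >= 2, averaged over Z (cf. Eq. (5))
    m_z = m.sum(axis=(0, 1))
    total = np.zeros(m.shape[2])
    for k in range(1, m.shape[0]):
        for l in range(1, m.shape[1]):
            total += f_discrete(m, k, l) ** 2
    return float(np.sum(m_z * total))

rng = np.random.default_rng(0)
p_z = np.array([0.4, 0.6])
p_i_given_z = rng.dirichlet(np.ones(3), size=2).T  # shape (3, 2), columns sum to 1
p_j_given_z = rng.dirichlet(np.ones(4), size=2).T  # shape (4, 2), columns sum to 1

# Joint PMF in which X_i and X_j are conditionally independent given Z
m_indep = np.einsum('iz,jz,z->ijz', p_i_given_z, p_j_given_z, p_z)
# Generic strictly positive joint PMF with no such structure
m_dep = rng.dirichlet(np.ones(3 * 4 * 2)).reshape(3, 4, 2)

omega_indep = gpm_discrete(m_indep)
omega_dep = gpm_discrete(m_dep)
```

Under these assumptions, `omega_indep` vanishes up to floating-point error while `omega_dep` is clearly positive, matching the "if and only if" statement of Thm. 1.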

2.3. CHARACTERIZATION FOR MIXED-TYPE DATA

In the previous sections, we have presented characterizations of conditional independence structures for both general continuous and general discrete distributions. However, it is common for real-world datasets to contain a mixture of continuous and discrete variables. Unfortunately, most works focus on either continuous or discrete data, and previous results for mixed-type data are mostly based on the conditional Gaussian distribution (Lauritzen et al., 1989; Edwards, 1990; Lauritzen, 1996; Fellinghauer et al., 2013; Lee & Hastie, 2015; Cheng et al., 2017). As in the continuous and discrete settings, in this section we introduce a novel characterization of the pairwise Markov property for general distributions with mixed data types. We first provide necessary and sufficient conditions of conditional independence for mixed-type data in the following theorem, with the full proof given in Appx. A.4.

Theorem 2. Denote $V$ as a set of mixed-type variables and $X_i, X_j \in V$, where $X_i$ is discrete and $X_j$ is continuous. Let $\{x_{i1}, \dots, x_{iM_i}\}$ be the support of variable $X_i$. For brevity, denote $V \setminus \{X_i, X_j\}$ as $Z$. Denote $z$ as any value(s) of $Z$ and $x_j$ as any value of the continuous variable $X_j$. Then, $X_i \perp\!\!\!\perp X_j \mid Z$ if and only if, for all $k \in [M_i]$ with $k \neq 1$, we have

$$\frac{\partial \log \big( p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1}) \big)}{\partial x_j} - \frac{\partial \log \big( p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik}) \big)}{\partial x_j} = 0. \tag{6}$$

Proof sketch. Similar to the proof sketch of Thm. 1, we consider $X_i$ and $X_j$ separately to construct the desired general solution of Eq. (6) for the sufficient condition. For the necessary condition, we decompose the density function according to the conditional independence to obtain Eq. (6).

Based on Thm. 2, we propose to characterize the conditional independence between $X_i$ and $X_j$ given all remaining variables with the GPM $\Omega^{[m]}$, whose elements are defined as

$$\Omega^{[m]}_{i,j} := \begin{cases} \mathbb{E}_{\pi_X}\!\left[ f^{[c]}_{i,j}(x)^2 \right] & \text{if } X_i \in \mathbf{X}_c,\ X_j \in \mathbf{X}_c, \\ \mathbb{E}_{\pi_X}\!\left[ \sum_{k,l} f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z)^2 \right] & \text{if } X_i \in \mathbf{X}_d,\ X_j \in \mathbf{X}_d, \\ \mathbb{E}_{\pi_X}\!\left[ \sum_{k} f^{[m]}(x_i, x_{j1}, x_{jk}, z)^2 \right] & \text{if } X_i \in \mathbf{X}_c,\ X_j \in \mathbf{X}_d, \\ \mathbb{E}_{\pi_X}\!\left[ \sum_{k} f^{[m]}(x_j, x_{i1}, x_{ik}, z)^2 \right] & \text{if } X_i \in \mathbf{X}_d,\ X_j \in \mathbf{X}_c, \end{cases} \tag{7}$$

where $f^{[m]}$ denotes the LHS of Eq. (6) and $\pi_X$ is the probability function. The type label $[m]$ denotes mixed-type data. $\mathbf{X}_c$ and $\mathbf{X}_d$ are the sets of continuous and discrete variables, respectively. Its characterization of the Markov property is as follows.

Corollary 3. Assume
i. $X = (X_1, \dots, X_d)$ is a set of variables containing both continuous and discrete variables.
ii. For continuous variables, the PDFs are strictly positive and smooth.
iii. For discrete variables, the PMFs are strictly positive.
iv. The characterization matrix $\Omega^{[m]}$ is defined according to Eq. (7).
Then for any $i \neq j$, $\Omega^{[m]}_{i,j} = 0$ implies $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$.

The proof is given in Appx. A.5. The GPM $\Omega^{[m]}$ encodes the pairwise Markov property for mixed-type data. More general than previous works, it does not require specific families of distributions, structures of the underlying graph, or cardinalities of discrete variables.
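A small worked instance of Eq. (6) may help fix ideas. The sketch below assumes a toy model, not one used in this paper: a three-valued discrete $X_i$ with probabilities `weights`, a continuous $X_j \mid X_i = k \sim \mathcal{N}(\mu_k, 1)$, and an empty $Z$. The helper names and the finite-difference step `h` are illustrative assumptions. In this model the LHS of Eq. (6) reduces to $\mu_1 - \mu_k$, so it vanishes exactly when $X_j$ does not depend on $X_i$.

```python
import numpy as np

def log_joint(k, x_j, weights, means):
    # log of m(x_i = k) * N(x_j; means[k], 1)
    return np.log(weights[k]) - 0.5 * (x_j - means[k]) ** 2 - 0.5 * np.log(2 * np.pi)

def f_mixed(k, x_j, weights, means, h=1e-3):
    # LHS of Eq. (6): d/dx_j of the log-joint at the reference value (index 0)
    # minus the same derivative at value k, via central differences
    def d_dx(idx):
        return (log_joint(idx, x_j + h, weights, means)
                - log_joint(idx, x_j - h, weights, means)) / (2 * h)
    return d_dx(0) - d_dx(k)

def gpm_mixed(weights, means, xs):
    # Sample average of the sum over k >= 2 of squared contrasts (cf. Eq. (7))
    return float(np.mean([sum(f_mixed(k, x, weights, means) ** 2
                              for k in range(1, len(weights))) for x in xs]))

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.5, 0.2])

def sample_xj(means, n=500):
    # Draw X_i from `weights`, then X_j from the matching unit-variance Gaussian
    ks = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(loc=np.asarray(means)[ks], scale=1.0)

means_indep = [0.0, 0.0, 0.0]   # X_j does not depend on X_i
means_dep = [0.0, 1.0, -1.0]    # X_j depends on X_i through its mean
omega_indep = gpm_mixed(weights, means_indep, sample_xj(means_indep))
omega_dep = gpm_mixed(weights, means_dep, sample_xj(means_dep))
```

With the dependent means above, the two contrasts are $-1$ and $+1$ at every point, so `omega_dep` equals 2 up to numerical error, whereas `omega_indep` is numerically zero.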

3. SCALABLE ESTIMATION WITH REGULARIZED SCORE MATCHING

In Sec. 2, we provide characterizations of conditional independencies for general distribution in continuous, discrete, and mixed-type settings. Based on the introduced necessary and sufficient conditions, these characterizations generalize previous work and establish one of the foundations for nonparametric estimation of Markov network structures with minimal assumptions. In addition to general characterizations of the Markov property with theoretical guarantees (i.e., GPM), a scalable estimation framework is necessary for reliable and practical structure learning. Ideally, we would like to exploit the advancements on scalable deep learning models. Hence, we introduce a regularized score matching-based framework for all considered settings (i.e., general distributions of continuous, discrete, and mixed-type variables).

3.1. ESTIMATION FOR CONTINUOUS DATA

We start with the continuous setting. Denote $p(x; \theta)$ as a parameterized density model with a parameter vector $\theta$. The goal is to estimate the parameter $\theta$ from the observations $x$. We aim to optimize the following objective function, which is based on the Fisher divergence:

$$O_c(\theta) = \frac{1}{2} \int_{x \in \mathbb{R}^d} p_X(x)\, \big\| \nabla_x \log p(x; \theta) - \nabla_x \log p_X(x) \big\|^2\, dx + \rho_\lambda(\Omega^{[c]}), \tag{8}$$

where $p_X$ is the data density, $\rho_\lambda(\cdot)$ denotes a sparsity penalty function, and $\lambda$ is the penalty parameter with domain $[0, 1]$. $\Omega^{[c]}$ is defined in Eq. (3) as our characterization of the conditional independence structure for continuous data. If we assume the model is not degenerate, i.e., different values of $\theta$ correspond to different PDFs, the asymptotic consistency of the optimization has been shown in Thm. 2 of Hyvärinen & Dayan (2005). We impose a sparsity penalty to account for finite-sample errors in practice. Following the strategy in Hyvärinen & Dayan (2005) and Pham & Garat (1997), one can also remove the data log-density $\log p_X$ from Eq. (8) by optimizing the following objective, which is equivalent to Eq. (8):

$$O_c(\theta) = \int_{x \in \mathbb{R}^d} p_X(x) \sum_{i=1}^{d} \left[ \frac{1}{2} \big\| \nabla_{x_i} \log p(x; \theta) \big\|^2 + H_{x_i}\big(\log p(x; \theta)\big) \right] dx + \rho_\lambda(\Omega^{[c]}), \tag{9}$$

where $H$ denotes the Hessian, so that $H_{x_i}$ is its $i$-th diagonal entry, i.e., the second derivative w.r.t. $x_i$. The proof is directly based on Hyvärinen & Dayan (2005), and we include it (Lemma 1) in Appx. A.6.1 for completeness. It is worth noting that previous work on Markov network structure learning with general continuous distributions (SING (Morrison et al., 2017; Baptista et al., 2021)) applies a transport map to estimate the data density from samples, which can be computationally challenging for non-Gaussian data with a large number of variables. Thus, it may not be scalable, as suggested by Fig. 1 and Table 1. To avoid this, the proposed regularized score matching allows us to optimize the objective function by estimating only the model score function. Moreover, the estimated model score function directly leads to the characterization matrix $\Omega^{[c]}$ by taking further derivatives, thus efficiently giving rise to the estimated Markov network structure. After training, the expectation in Eq. (3) is computed over the parameterized model $p(x; \theta)$.
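To make the data-density-free form of the objective in Eq. (9) concrete, the following sketch evaluates its empirical version for a zero-mean Gaussian model $\log p(x; \Theta) = -\tfrac{1}{2} x^\top \Theta x + \text{const}$ with symmetric $\Theta$, a deliberately simple stand-in for the deep score model used in this paper. In this model the score is $-\Theta x$, the diagonal Hessian terms sum to $-\mathrm{tr}(\Theta)$, and $\Omega^{[c]}_{i,j} = |\Theta_{ij}|$, so an $\ell_1$ penalty on the off-diagonal entries plays the role of $\rho_\lambda$. The function name and the example precision matrix are assumptions.

```python
import numpy as np

def score_matching_objective(theta, x, lam=0.0):
    # Empirical Eq. (9) for log p(x; theta) = -0.5 x^T theta x + const (theta symmetric):
    # each row of `score` is the gradient of the model log-density at one sample,
    # and the sum of second derivatives w.r.t. each x_i equals -trace(theta).
    score = -x @ theta
    fit = 0.5 * np.mean(np.sum(score ** 2, axis=1)) - np.trace(theta)
    off_diag = theta - np.diag(np.diag(theta))
    return fit + lam * np.abs(off_diag).sum()

rng = np.random.default_rng(0)
true_theta = np.array([[2.0, 0.6, 0.0],
                       [0.6, 2.0, 0.6],
                       [0.0, 0.6, 2.0]])
x = rng.multivariate_normal(np.zeros(3), np.linalg.inv(true_theta), size=5000)

# Without the penalty, the minimizer is the inverse of the empirical second-moment matrix
second_moment = x.T @ x / len(x)
theta_hat = np.linalg.inv(second_moment)

j_hat = score_matching_objective(theta_hat, x)
j_true = score_matching_objective(true_theta, x)
```

Note that the objective never evaluates the data density or a normalizing constant. With `lam > 0` no closed form is available in general and a gradient-based optimizer would be used instead, which is the setting the framework above targets.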

3.2. ESTIMATION FOR DISCRETE DATA

For estimation in the discrete case, one cannot directly apply the method introduced for the continuous case, since the gradient, on which the continuous score function is based, is not defined for discrete data. An intuitive solution is to replace the gradient with a general linear operator $\mathcal{L}$ (Lyu, 2012). One also needs to replace integration with summation and the PDF with the PMF. For instance, Eq. (8) can be reformulated as follows:

$$O_d(\theta) = \frac{1}{2} \sum_{x} m_X(x) \left\| \frac{\mathcal{L}(m(x; \theta))}{m(x; \theta)} - \frac{\mathcal{L}(m_X(x))}{m_X(x)} \right\|^2 + \rho_\lambda(\Omega^{[d]}), \tag{10}$$

where $m$ denotes a PMF. In this formulation, $\mathcal{L}(\cdot)$ is a generalized version of the score function for discrete data. As shown in Lyu (2012), $O_d(\theta)$ keeps the computational advantages of score matching for continuous data, i.e., the normalizing partition function cancels out and the formulation can be transformed into an expectation of functions of the unnormalized model. In order to guarantee the consistency of score matching based on Eq. (10), the linear operator $\mathcal{L}(\cdot)$ needs to be complete according to the following definition.

Definition 1 (Completeness (Lyu, 2012)). A linear operator $\mathcal{L}(\cdot)$ is complete if $\frac{\mathcal{L}(p(x))}{p(x)} = \frac{\mathcal{L}(q(x))}{q(x)}$ implies $p(x) = q(x)$ almost everywhere, where $p(x)$ and $q(x)$ are two PMFs.

According to Defn. 1, Lyu (2012) used the marginalization operator $\mathcal{M}(\cdot) : \mathcal{F}_1 \to \mathcal{F}_d$ as a choice for $\mathcal{L}(\cdot)$, which is defined as

$$\mathcal{M}(f(x)) = \begin{pmatrix} \vdots \\ \mathcal{M}_i(f(x)) \\ \vdots \end{pmatrix} = \begin{pmatrix} \vdots \\ \sum_{x_i} f(x) \\ \vdots \end{pmatrix}, \tag{11}$$

where $f \in \mathcal{F}_1$. We can observe that $\mathcal{M}_i(f(x))$ is the marginal density of $x_{\setminus i}$, where $x_{\setminus i}$ denotes the vector $x$ after dropping the $i$-th element (i.e., marginalization). The completeness of $\mathcal{M}(\cdot)$ has been shown in Brook (1964) and is included as Lemma 3 in Lyu (2012). We have

$$O_d(\theta) = \frac{1}{2} \sum_{x} m_X(x) \left\| \frac{\mathcal{M}(m(x; \theta))}{m(x; \theta)} - \frac{\mathcal{M}(m_X(x))}{m_X(x)} \right\|^2 + \rho_\lambda(\Omega^{[d]}). \tag{12}$$

Thus, it is plausible for us to replace the gradient with $\mathcal{M}(\cdot)$ for discrete data. However, one key advantage of regularized score matching is that it does not have to explicitly estimate the data density (here, the data PMF $m_X(x)$). As shown by Lyu (2012), we can also optimize Eq. (12) in a similar way, which is equivalent to optimizing the following objective:

$$O_d(\theta) = \frac{1}{2} \sum_{x} m_X(x) \sum_{i=1}^{d} \left[ \left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right)^{2} - 2\, \mathcal{M}_i\!\left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right) \right] + \rho_\lambda(\Omega^{[d]}). \tag{13}$$

The simplification follows directly from results in Lyu (2012), of which the corresponding lemma (Lemma 2) is formalized with its proof in Appx. A.6.2 for completeness. Based on Thm. 1 and Thm. 2, similar to the continuous case, we can estimate Markov network structures for general distributions in the discrete setting under the same umbrella of regularized score matching.
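The marginalization operator of Eq. (11) and the cancellation of the normalizing constant can be illustrated on a small table. In this sketch, which is an assumption-laden toy rather than the estimator itself, a model over three discrete variables is stored as a positive array, and the hypothetical helper `marginalization_ratio` returns $\mathcal{M}_i(f)(x) / f(x)$ for every configuration $x$.

```python
import numpy as np

def marginalization_ratio(f, axis):
    # M_i(f)(x) / f(x), where M_i sums f over the values of the variable on `axis`
    return f.sum(axis=axis, keepdims=True) / f

rng = np.random.default_rng(0)
# Positive, unnormalized table over three discrete variables with 3, 4, and 2 values
unnormalized = rng.uniform(0.5, 2.0, size=(3, 4, 2))
pmf = unnormalized / unnormalized.sum()

ratio_unnormalized = marginalization_ratio(unnormalized, axis=0)
ratio_pmf = marginalization_ratio(pmf, axis=0)

# The reciprocal of the ratio is the conditional PMF of the first variable given the rest
conditional = 1.0 / ratio_pmf
```

The two ratios coincide, which is why the objective can be evaluated on an unnormalized model; and since the reciprocal of the ratio is the conditional PMF $m(x_i \mid x_{\setminus i})$, the ratios for all $i$ carry the full conditionals that completeness (Defn. 1) relies on.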

3.3. ESTIMATION FOR MIXED-TYPE DATA

For mixed-type data, we define the objective function as follows:

$$O_m(\theta) = \mathbb{E}_{\pi_X}\!\left[ \sum_{i} s_i(x; \theta) \right] + \rho_\lambda(\Omega^{[m]}), \tag{14}$$

where

$$s_i(x; \theta) := \begin{cases} \dfrac{1}{2} \big\| \nabla_{x_i} \log \pi(x; \theta) \big\|^2 + H_{x_i}\big(\log \pi(x; \theta)\big) & X_i \in \mathbf{X}_c, \\[2ex] \dfrac{1}{2} \left( \dfrac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right)^{2} - \mathcal{M}_i\!\left( \dfrac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right) & X_i \in \mathbf{X}_d. \end{cases} \tag{15}$$

Here, the density $\pi$ is strictly positive. In essence, $O_m(\theta)$ is a regularized version of the combination of the objective functions for the continuous and discrete cases. Because $\Omega^{[m]}$ also encodes the dependencies between continuous and discrete variables, we can estimate its support for mixed-type data without assuming group structures of data types. The following corollary guarantees the consistency, where we define $O'_m(\theta)$ as $O_m(\theta) - \rho_\lambda(\Omega^{[m]})$.

Corollary 4. Assume
i. The data density $\pi_X(\cdot)$ is equal to $\pi(\cdot; \theta^*)$ for some $\theta^*$.
ii. The data density $\pi_X(\cdot)$ and the model density $\pi(\cdot; \theta)$ are strictly positive. $\pi_X(\cdot)$ and $\pi(\cdot; \theta)$ are differentiable and twice-differentiable, respectively, w.r.t. continuous variables. For some $\theta^*$, $\pi_X(\cdot) = \pi(\cdot; \theta^*)$ and no other parameter value gives a density that is equal to $\pi(\cdot; \theta^*)$ almost everywhere.
iii. The expectations $\mathbb{E}_{\pi_X} \| \log \pi(x; \theta) \|^2$ and $\mathbb{E}_{\pi_X} \| \log \pi_X(x) \|^2$ are finite for any $\theta$, and $\pi_X(x) \log \pi(x; \theta)$ goes to zero for any $\theta$ when $\|x\| \to \infty$.
Then $O'_m(\theta) = 0$ implies $\theta = \theta^*$.

Cor. 4 follows from Lemma 1 (Hyvärinen & Dayan, 2005) and Lemma 2 (Lyu, 2012), which are included in Appx. A.6. Together with Thm. 2, one can estimate Markov network structures for mixed-type data in a general setting.

3.4. SPARSITY REGULARIZATION

By minimizing the objective function $O(\theta) \in \{O_c(\theta), O_d(\theta), O_m(\theta)\}$, our goal is essentially to perform a model selection task, i.e., to learn the support of $\Omega \in \{\Omega^{[c]}, \Omega^{[d]}, \Omega^{[m]}\}$. Here, using an $\ell_0$ penalty may be computationally infeasible because it leads to a discrete optimization problem that is difficult to solve. Following previous works (Tibshirani, 1996), we adopt the $\ell_1$ regularizer $\rho_\lambda(\Omega) = \lambda \|\Omega\|_1$. In particular, the high-dimensional support recovery of the $\ell_1$ regularizer has been extensively studied in the literature; for instance, see Wainwright (2009) for variable selection and Ravikumar et al. (2008) for Gaussian graphical model selection. Although the $\ell_1$ regularizer induces sparsity, it may lead to bias in the resulting solution and thereby worsen the performance (Fan & Li, 2001; Breheny & Huang, 2011). This is because the $\ell_1$ norm increases linearly with the absolute value of nonzero entries, unlike the $\ell_0$ norm, which is constant for nonzero entries. Therefore, we experiment with the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001), the minimax concave penalty (MCP) (Zhang, 2010), and the adaptive $\ell_1$ penalty (Zou, 2006) in this work, which help remedy the bias issue of $\ell_1$ regularization. Specifically, the SCAD and MCP penalties may be interpreted as hybrids of the $\ell_0$ and $\ell_1$ penalties, while the adaptive $\ell_1$ penalty reweights the penalty coefficient $\lambda$ by an initial estimate of $\Omega$ obtained without regularization. Furthermore, the support recovery of the $\ell_1$ penalty relies on the incoherence condition in various cases (Wainwright, 2009; Ravikumar et al., 2008; 2011), which may be a rather strong assumption in practice, whereas the SCAD and MCP penalties do not require it (Loh & Wainwright, 2017). Thus, we adopt the SCAD penalty according to experimental results (Fig. 7 in Appx. B). We integrate the SCAD penalty in all cases but introduce it only here for brevity.
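For reference, the SCAD penalty of Fan & Li (2001) can be written in a few lines. The sketch below uses the standard piecewise form with shape parameter $a = 3.7$, the value suggested by Fan & Li; the value of $a$ and the way the penalty is aggregated over entries of $\Omega$ in this paper are not stated in this section, so both are assumptions here.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    # Elementwise SCAD penalty (Fan & Li, 2001):
    #   lam * |t|                                   if |t| <= lam
    #   (2*a*lam*|t| - t^2 - lam^2) / (2*(a - 1))   if lam < |t| <= a*lam
    #   lam^2 * (a + 1) / 2                         if |t| > a*lam
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quadratic = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    constant = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

def l1_penalty(t, lam):
    # Plain l1 penalty for comparison: grows linearly without bound
    return lam * np.abs(np.asarray(t, dtype=float))

entries = np.array([0.0, 0.05, 0.2, 1.0, 5.0])
scad_values = scad_penalty(entries, lam=0.1)
l1_values = l1_penalty(entries, lam=0.1)
```

Near zero the two penalties agree, which preserves the sparsity-inducing behavior, while for large entries SCAD flattens to a constant, which is the property that reduces the bias discussed above.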

4. EXPERIMENTS

Setup. We conduct experiments on two sets of distributions: (1) Butterfly distributions (Morrison et al., 2017; Baptista et al., 2021) and (2) distributions from random graphs. For the Butterfly distribution in the continuous setting, we have $r$ i.i.d. pairs of random variables $(P_i, Q_i)$ defined as $P_i \sim \mathcal{N}(0, 1)$ and $Q_i = W_i P_i$ with $W_i \sim \mathcal{N}(0, 1)$ and $W_i \perp\!\!\!\perp P_i$. We replace the Gaussian distribution with the Multinomial distribution for the discrete case, and mix the two different types of pairs for the mixed-type case with a uniformly sampled proportion. For distributions from random graphs, we first generate a random decomposable directed acyclic graph. Then, for the continuous case, the data are sampled from nonlinear structural equation models (SEMs) with exogenous noises from an exponential distribution. We employ a multilayer perceptron (MLP) with randomly generated weights as the nonlinear function. For the discrete case, variables are generated via randomly parameterized Multinomial distributions of the variable being simulated and its discrete parents (Andrews et al., 2018). For the mixed-type case, we simulate data with the process described in

Results. We first conduct comparisons on general distributions for all data types (i.e., discrete, continuous, and mixed-type) with different numbers of variables and a sample size of 1000. Among the considered methods, both KCI and GS are applicable to the estimation of Markov network structures for general distributions with all data types. SING can only deal with continuous data and is therefore applied only in the continuous setting. We also include (semi)parametric methods (GLASSO and NPN) as baselines in the considered general settings. We use the Hamming distance between the estimated graph and the ground-truth graph as the metric. All results are from 5 trials with different random seeds. Missing results are due to either timeout (i.e., > 1 day) or out-of-memory (OOM) errors. For the Butterfly distributions (Fig. 1), one can observe that KCI, GS, and our method almost recover the true structures for all data types. In the more complex setting (i.e., distributions from random graphs, Fig. 2), our method outperforms the others on most datasets. This suggests that, compared to the baselines, our method may have more pronounced advantages in more complicated scenarios. Meanwhile, the running times of KCI, GS, and SING are significantly longer than that of our method (Table 1). In addition, SING and GS cannot scale beyond 12 and 18 variables, respectively. GLASSO and NPN are remarkably fast but fail to accurately recover the structure in the general setting. NPN performs worse than GLASSO in structure recovery, which may be due to a mismatch between its nonparanormal assumption and the general mixed-type setting. We also conduct experiments on large graphs, with {250, 500, . . . , 5000} continuous variables from Butterfly distributions. Other settings are identical to those for smaller graphs. From Fig. 3, we observe that the running time is approximately linear w.r.t. the number of variables. Moreover, all experiments are conducted on CPUs, while our framework could easily be deployed on GPUs. This suggests the potential of taking advantage of recent advances in computation, especially for deep models, to further improve the scalability.
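The continuous Butterfly distribution and the evaluation metric described above can be sketched as follows. The column ordering $(P_1, Q_1, \dots, P_r, Q_r)$, the helper names, and the convention of counting each undirected edge once in the Hamming distance are assumptions made for illustration. The ground-truth graph used here, one edge per pair $(P_i, Q_i)$, follows from the pairs being mutually independent.

```python
import numpy as np

def sample_butterfly(r, n, rng):
    # r independent pairs (P_i, Q_i) with P_i ~ N(0, 1), W_i ~ N(0, 1), Q_i = W_i * P_i
    p = rng.standard_normal((n, r))
    w = rng.standard_normal((n, r))
    q = w * p
    x = np.empty((n, 2 * r))
    x[:, 0::2] = p   # even columns hold P_1, ..., P_r
    x[:, 1::2] = q   # odd columns hold Q_1, ..., Q_r
    return x

def butterfly_adjacency(r):
    # Ground-truth Markov network: a single edge within each pair (P_i, Q_i)
    adj = np.zeros((2 * r, 2 * r), dtype=int)
    for i in range(r):
        adj[2 * i, 2 * i + 1] = adj[2 * i + 1, 2 * i] = 1
    return adj

def hamming_distance(adj_a, adj_b):
    # Number of undirected edges present in exactly one of the two graphs
    upper = np.triu_indices_from(adj_a, k=1)
    return int(np.sum(adj_a[upper] != adj_b[upper]))

x = sample_butterfly(r=3, n=20000, rng=np.random.default_rng(0))
truth = butterfly_adjacency(3)
empty_graph = np.zeros_like(truth)

linear_corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
abs_corr = np.corrcoef(np.abs(x[:, 0]), np.abs(x[:, 1]))[0, 1]
```

In this construction $P_i$ and $Q_i$ are uncorrelated yet strongly dependent (their absolute values are clearly correlated), which is why covariance-based estimators tend to miss these edges.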

5. CONCLUSION

We provide a scalable estimation framework based on regularized score matching for nonparametric Markov network structures. We first introduce necessary and sufficient conditions of conditional independence among variables in general distributions for all data types (i.e., continuous, discrete, and mixed-type) without specific assumptions on functional relations among variables, thus giving rise to the corresponding characterizations of the structure, i.e., Generalized Precision Matrix. Then, we unify all these cases under the same umbrella of the estimation framework based on regularized score matching. Appropriate penalties on the characterization matrix are introduced to promote constantly sparse entries for stable estimation. We validate our theoretical claims experimentally in various settings. Future work includes exploring the connection between Markov networks and causal graphs.

A.2 PROOF OF THEOREM 1

Theorem 1. Denote $V$ as a set of discrete variables and $X_i, X_j \in V$. For brevity, denote $V \setminus \{X_i, X_j\}$ as $Z$. Let $\{x_{i1}, \ldots, x_{iM_i}\}$ and $\{x_{j1}, \ldots, x_{jM_j}\}$ be the supports of variables $X_i$ and $X_j$. Denote $z$ as any value(s) of $Z$. Then, $X_i \perp\!\!\!\perp X_j \mid Z$ if and only if, for all $k \in [M_i]$ and $l \in [M_j]$ with $k \neq 1$ and $l \neq 1$, we have

$$\big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl}, z) - \log m(x_{ik}, x_{jl}, z)\big) = 0. \tag{4}$$

Proof. Sufficient condition. Without loss of generality, let us consider three discrete variables, i.e., $\{X_i, X_j, Z\}$. Let $\{x_{i1}, \ldots, x_{iM_i}\}$ and $\{x_{j1}, \ldots, x_{jM_j}\}$ be the supports of variables $X_i$ and $X_j$, respectively. Consider the case that the finite difference of the discrete score function of $X_i$ w.r.t. $X_j$ equals zero. When $X_i = x_{i1}$, $X_j = x_{j1}$, and differences are considered w.r.t. $x_{ik'}$ and $x_{jl'}$, we have

$$\big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik'}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl'}, z) - \log m(x_{ik'}, x_{jl'}, z)\big) = 0,$$

where $m(x_{i1}, x_{j1}, z)$ is the joint PMF simplified from $m_{X_i, X_j, Z}\{x_{i1}, x_{j1}, z\}$. By iterating all possible differences w.r.t. $X_j$, for all $l$ in $\{2, \ldots, M_j\}$, we have

$$\big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik'}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl}, z) - \log m(x_{ik'}, x_{jl}, z)\big) = 0. \tag{19}$$

Define the discrete score function of $X_i$ as $g(x_{i1}, x_{ik'}, \gamma) = \log m(x_{i1}, \gamma) - \log m(x_{ik'}, \gamma)$, where $\gamma$ denotes other variables. Eq. 19 means that $g(x_{i1}, x_{ik'}, \gamma)$ does not take the value of $X_j$ as an argument when the LHS of Eq. 19 equals zero. As a result, Eq. 19 could be formulated as

$$\log m(x_{i1}, x_{j1}, z) = \log m(x_{ik'}, x_{j1}, z) + g(x_{i1}, x_{ik'}, \gamma).$$

Then by iterating all possible differences w.r.t. $X_i$, for all $k$ in $I_{x_i} = \{2, \ldots, M_i\}$, we have

$$\log m(x_{i1}, x_{j1}, z) = \log m(x_{ik}, x_{j1}, z) + g(x_{i1}, x_{ik}, \gamma).$$

By summation, we have

$$
\begin{aligned}
(M_i - 1) \log m(x_{i1}, x_{j1}, z) &= \sum_{k \in I_{x_i}} \big(\log m(x_{ik}, x_{j1}, z) + g(x_{i1}, x_{ik}, \gamma)\big) \\
&= \sum_{k \in I_{x_i}} \log m(x_{ik}, x_{j1}, z) + \sum_{k \in I_{x_i}} g(x_{i1}, x_{ik}, \gamma),
\end{aligned}
$$

which implies that

$$
\begin{aligned}
M_i \log m(x_{i1}, x_{j1}, z) &= \sum_{k=1}^{M_i} \log m(x_{ik}, x_{j1}, z) + \sum_{k \in I_{x_i}} g(x_{i1}, x_{ik}, \gamma), \\
\log m(x_{i1}, x_{j1}, z) &= \frac{1}{M_i} \left( \sum_{k=1}^{M_i} \log m(x_{ik}, x_{j1}, z) + \sum_{k \in I_{x_i}} g(x_{i1}, x_{ik}, \gamma) \right).
\end{aligned}
$$

Because $\sum_{k=1}^{M_i} \log m(x_{ik}, x_{j1}, z)$ covers all possible values of $X_i$, this term does not depend on the specific value of $X_i$. Besides, the other term $\sum_{k \in I_{x_i}} g(x_{i1}, x_{ik}, \gamma)$ does not depend on $X_j$. It is worth noting that $x_{i1}$ could be any value of $X_i$ w.l.o.g. Therefore, when the finite difference of the discrete score function of $X_i$ w.r.t. $X_j$ equals zero (after some aggregation of samples), $X_i \perp\!\!\!\perp X_j \mid Z$.

Necessary condition. When $X_i \perp\!\!\!\perp X_j \mid Z$, we could decompose $m(x_i, x_j, z)$ as $m_{X_i \mid Z}(x_i \mid z)\, m_{X_j \mid Z}(x_j \mid z)\, m_Z(z)$. This implies that, for all $k$ in $\{2, \ldots, M_i\}$ and $l$ in $\{2, \ldots, M_j\}$, we have

$$
\begin{aligned}
&\big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl}, z) - \log m(x_{ik}, x_{jl}, z)\big) \\
&= \Big[\log\big(m_{X_i \mid Z}(x_{i1} \mid z)\, m_{X_j \mid Z}(x_{j1} \mid z)\, m_Z(z)\big) - \log\big(m_{X_i \mid Z}(x_{ik} \mid z)\, m_{X_j \mid Z}(x_{j1} \mid z)\, m_Z(z)\big)\Big] \\
&\quad - \Big[\log\big(m_{X_i \mid Z}(x_{i1} \mid z)\, m_{X_j \mid Z}(x_{jl} \mid z)\, m_Z(z)\big) - \log\big(m_{X_i \mid Z}(x_{ik} \mid z)\, m_{X_j \mid Z}(x_{jl} \mid z)\, m_Z(z)\big)\Big] \\
&= \Big[\big(\log m_{X_i \mid Z}(x_{i1} \mid z) + \log m_{X_j \mid Z}(x_{j1} \mid z) + \log m_Z(z)\big) - \big(\log m_{X_i \mid Z}(x_{ik} \mid z) + \log m_{X_j \mid Z}(x_{j1} \mid z) + \log m_Z(z)\big)\Big] \\
&\quad - \Big[\big(\log m_{X_i \mid Z}(x_{i1} \mid z) + \log m_{X_j \mid Z}(x_{jl} \mid z) + \log m_Z(z)\big) - \big(\log m_{X_i \mid Z}(x_{ik} \mid z) + \log m_{X_j \mid Z}(x_{jl} \mid z) + \log m_Z(z)\big)\Big] \\
&= 0.
\end{aligned}
\tag{24}
$$

Therefore, when $X_i \perp\!\!\!\perp X_j \mid Z$, the finite difference of the discrete score function of $X_i$ w.r.t. $X_j$ equals zero. The proof is complete.

A.3 PROOF OF COROLLARY 2

Corollary 2. Assume
i. $X = (X_1, \ldots, X_d)$ is a set of discrete variables.
ii. The PMFs of $X$ are strictly positive.
iii. The characterization matrix $\Omega^{[d]}$ is defined according to Eq. (5).

Then for any $i \neq j$, $\Omega^{[d]}_{i,j} = 0$ implies $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$.

Proof. According to Eq. (5), we have

$$\Omega^{[d]}_{i,j} := \mathbb{E}_{m_X}\left[ \sum_{k,l} f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z)^2 \right],$$

where $f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z)$ denotes the LHS of Eq. (4), i.e.,

$$f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z) = \big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl}, z) - \log m(x_{ik}, x_{jl}, z)\big).$$

Thus, if $\Omega^{[d]}_{i,j} = 0$, we must have

$$\big(\log m(x_{i1}, x_{j1}, z) - \log m(x_{ik}, x_{j1}, z)\big) - \big(\log m(x_{i1}, x_{jl}, z) - \log m(x_{ik}, x_{jl}, z)\big) = 0,$$

for all $k \in [M_i]$ and $l \in [M_j]$, where $M_i$ and $M_j$ denote the cardinalities of $X_i$ and $X_j$, respectively. Based on Theorem 1, we have $X_i \perp\!\!\!\perp X_j \mid Z$. The proof is complete.
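The condition in Eq. (4) can be checked numerically on a small joint PMF table. The following is a minimal sketch, not the authors' implementation: the helper name `cross_differences`, the table shapes, and the random test distributions are illustrative assumptions. It evaluates the left-hand side of Eq. (4) for every $(k, l, z)$ and shows that it vanishes for a PMF built as $m_Z\, m_{X_i \mid Z}\, m_{X_j \mid Z}$ but not for a generic PMF.

```python
import numpy as np

def cross_differences(pmf, i, j):
    """Left-hand side of Eq. (4) for all (k, l, z), given a strictly positive PMF array.

    pmf has one axis per variable; axes i and j are the pair being tested.
    The first support value on each axis plays the role of x_{i1} and x_{j1}.
    """
    logm = np.moveaxis(np.log(pmf), (i, j), (0, 1))  # shape (M_i, M_j, ...)
    # d_i[k, l, z] = log m(x_{i1}, x_{jl}, z) - log m(x_{ik}, x_{jl}, z)
    d_i = logm[:1] - logm
    # f[k, l, z] = d_i[k, 0, z] - d_i[k, l, z]
    return d_i[:, :1] - d_i

rng = np.random.default_rng(0)

# Hypothetical example: X_i has 3 values, X_j has 4 values, Z has 2 values.
m_z = rng.dirichlet(np.ones(2))                    # m_Z(z)
m_i_given_z = rng.dirichlet(np.ones(3), size=2)    # rows indexed by z
m_j_given_z = rng.dirichlet(np.ones(4), size=2)    # rows indexed by z

# Conditionally independent joint PMF: m(x_i, x_j, z) = m(z) m(x_i | z) m(x_j | z).
pmf_ci = np.einsum("z,zi,zj->ijz", m_z, m_i_given_z, m_j_given_z)

# Generic joint PMF with no conditional independence imposed.
pmf_dep = rng.dirichlet(np.ones(24)).reshape(3, 4, 2)

max_ci = np.abs(cross_differences(pmf_ci, 0, 1)).max()
max_dep = np.abs(cross_differences(pmf_dep, 0, 1)).max()
print("max |f| under conditional independence:", max_ci)
print("max |f| for a generic PMF:", max_dep)
```

The sketch works on the exact PMF; with data, the log-PMF would have to be estimated, which is the role of the score matching framework in the main text.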

A.4 PROOF OF THEOREM 2

Theorem 2. Denote $V$ as a set of mixed-type variables and $X_i, X_j \in V$, where $X_i$ is discrete and $X_j$ is continuous. Let $\{x_{i1}, \ldots, x_{iM_i}\}$ be the support of variable $X_i$. For brevity, denote $V \setminus \{X_i, X_j\}$ as $Z$. Denote $z$ as any value(s) of $Z$ and $x_j$ as any value of the continuous variable $X_j$. Then, $X_i \perp\!\!\!\perp X_j \mid Z$ if and only if, for all $k \in [M_i]$ with $k \neq 1$, we have

$$\frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big)}{\partial x_j} - \frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big)}{\partial x_j} = 0.$$

Proof. Sufficient condition. Without loss of generality, let us consider three variables, i.e., $\{X_i, X_j, Z\}$:

$$x_i \in \{x_{i1}, \ldots, x_{iM_i}\}, \qquad x_j \in \mathbb{R},$$

where we set $X_i = x_i$ as the discrete variable and $X_j = x_j$ as the continuous variable w.l.o.g. Note that we do not constrain the type of $Z = z$ here but set $Z$ as continuous for brevity. Consider the case that the finite difference of the score function of $X_j$ w.r.t. $X_i$ equals zero. Also, we define $p$ as the p.d.f. and $m$ as the p.m.f. We first consider the difference between $x_{i1}$ and $x_{ik'}$:

$$\frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big)}{\partial x_j} - \frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik'})\, m_{X_i}(x_{ik'})\big)}{\partial x_j} = 0.$$

By iterating all possible differences w.r.t. $x_i$, for all $k$ in $I_{x_i} = \{2, \ldots, M_i\}$, we have

$$\frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big)}{\partial x_j} - \frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big)}{\partial x_j} = 0,$$

which is equivalent to

$$\frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big)}{\partial x_j} = \frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big)}{\partial x_j}.$$

Then by integrating both sides w.r.t. $x_j$, we have

$$\log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big) = \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big) + C_k,$$

where $C_k$ is a constant. We then apply a summation as follows:

$$(M_i - 1) \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big) = \sum_{k \in I_{x_i}} \Big( \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big) + C_k \Big),$$

which implies that

$$\log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big) = \frac{1}{M_i} \left( \sum_{k=1}^{M_i} \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big) + \sum_{k \in I_{x_i}} C_k \right).$$

Because $\sum_{k=1}^{M_i} \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big)$ covers all possible values of $k$, this term does not depend on the specific value of $X_i$. Besides, $C_k$ does not depend on $X_j$. Therefore, by iterating all possible differences of $X_i$, we could see that when the finite difference of the score function of $X_j$ w.r.t. $X_i$ equals zero (after some aggregation of samples), $X_i \perp\!\!\!\perp X_j \mid Z$.

It is noteworthy that another "symmetric" case, where the derivative of the discrete score function of $X_i$ w.r.t. $X_j$ equals zero, is as follows:

$$\frac{\partial \Big[ \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big) - \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big) \Big]}{\partial x_j} = 0,$$

which is equivalent to

$$\frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big)}{\partial x_j} - \frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big)}{\partial x_j} = 0.$$

Thus, we only need to consider Eq. 36, which is the case that the finite difference of the discrete score function of $X_i$ w.r.t. $X_j$ equals zero.

Necessary condition. When $X_i \perp\!\!\!\perp X_j \mid Z$, we could decompose $p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})$ as $m_{X_i \mid Z}(x_{i1} \mid z)\, p_{X_j \mid Z}(x_j \mid z)\, p_Z(z)$. This implies that, for all $k$ in $\{2, \ldots, M_i\}$, we have

$$
\begin{aligned}
&\frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{i1})\, m_{X_i}(x_{i1})\big)}{\partial x_j} - \frac{\partial \log\big(p_{X_j, Z \mid X_i}(x_j, z \mid x_{ik})\, m_{X_i}(x_{ik})\big)}{\partial x_j} \\
&= \frac{\partial \log\big(m_{X_i \mid Z}(x_{i1} \mid z)\, p_{X_j \mid Z}(x_j \mid z)\, p_Z(z)\big)}{\partial x_j} - \frac{\partial \log\big(m_{X_i \mid Z}(x_{ik} \mid z)\, p_{X_j \mid Z}(x_j \mid z)\, p_Z(z)\big)}{\partial x_j} \\
&= \frac{\partial \big(\log m_{X_i \mid Z}(x_{i1} \mid z) + \log p_{X_j \mid Z}(x_j \mid z) + \log p_Z(z)\big)}{\partial x_j} - \frac{\partial \big(\log m_{X_i \mid Z}(x_{ik} \mid z) + \log p_{X_j \mid Z}(x_j \mid z) + \log p_Z(z)\big)}{\partial x_j} \\
&= \frac{\partial \log p_{X_j \mid Z}(x_j \mid z)}{\partial x_j} - \frac{\partial \log p_{X_j \mid Z}(x_j \mid z)}{\partial x_j} \\
&= 0.
\end{aligned}
\tag{37}
$$

Therefore, when $X_i \perp\!\!\!\perp X_j \mid Z$, the finite difference of the score function of $X_j$ w.r.t. $X_i$ equals zero. The proof is complete.

A.5 PROOF OF COROLLARY 3

Corollary 3. Assume
i. $X = (X_1, \ldots, X_d)$ is a set of variables containing both continuous and discrete variables.
ii. For continuous variables, the PDFs are strictly positive and smooth.
iii. For discrete variables, the PMFs are strictly positive.
iv. The characterization matrix $\Omega^{[m]}$ is defined according to Eq. (7).

Then for any $i \neq j$, $\Omega^{[m]}_{i,j} = 0$ implies $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$.

Proof. According to Eq. (7), we have

$$
\Omega^{[m]}_{i,j} :=
\begin{cases}
\mathbb{E}_{\pi_X}\left[ f^{[c]}_{i,j}(x)^2 \right] & \text{if } X_i \in X_c,\ X_j \in X_c, \\
\mathbb{E}_{\pi_X}\left[ \sum_{k,l} f^{[d]}(x_{i1}, x_{ik}, x_{j1}, x_{jl}, z)^2 \right] & \text{if } X_i \in X_d,\ X_j \in X_d, \\
\mathbb{E}_{\pi_X}\left[ \sum_{k} f^{[m]}(x_i, x_{j1}, x_{jk}, z)^2 \right] & \text{if } X_i \in X_c,\ X_j \in X_d, \\
\mathbb{E}_{\pi_X}\left[ \sum_{k} f^{[m]}(x_j, x_{i1}, x_{ik}, z)^2 \right] & \text{if } X_i \in X_d,\ X_j \in X_c,
\end{cases}
$$

where $X_c$ and $X_d$ are the sets of continuous and discrete variables, respectively. We have already proved the first two cases (i.e., $\{X_i \in X_c, X_j \in X_c\}$ and $\{X_i \in X_d, X_j \in X_d\}$) in the proofs of Cor. 1 and Cor. 2, respectively. So here we focus on the other two cases. We start from the third case, where $\{X_i \in X_c, X_j \in X_d\}$. We have

$$f^{[m]}(x_i, x_{j1}, x_{jk}, z) = \frac{\partial \log\big(p_{X_i, Z \mid X_j}(x_i, z \mid x_{j1})\, m_{X_j}(x_{j1})\big)}{\partial x_i} - \frac{\partial \log\big(p_{X_i, Z \mid X_j}(x_i, z \mid x_{jk})\, m_{X_j}(x_{jk})\big)}{\partial x_i}. \tag{39}$$

Thus, if $\Omega^{[m]}_{i,j} = 0$ for $\{X_i \in X_c, X_j \in X_d\}$, we must have

$$\frac{\partial \log\big(p_{X_i, Z \mid X_j}(x_i, z \mid x_{j1})\, m_{X_j}(x_{j1})\big)}{\partial x_i} - \frac{\partial \log\big(p_{X_i, Z \mid X_j}(x_i, z \mid x_{jk})\, m_{X_j}(x_{jk})\big)}{\partial x_i} = 0,$$

for all $k \in [M_j]$, where $M_j$ denotes the cardinality of $X_j$. Thus, according to Theorem 2, we have $X_i \perp\!\!\!\perp X_j \mid Z$ if $\Omega^{[m]}_{i,j} = 0$ for $\{X_i \in X_c, X_j \in X_d\}$. A similar derivation applies to the last case, where $\{X_i \in X_d, X_j \in X_c\}$.
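The mixed-type condition of Theorem 2 can be illustrated on a toy model. The sketch below is an assumption-laden example rather than the paper's estimator: the three-variable model (a discrete $X_i$ with three values, a Gaussian $Z$, and a Gaussian $X_j$ whose mean is $0.5 z$ plus a per-category shift), the function names, and the central-difference derivative are all hypothetical choices. When every category has the same shift, $X_j$ depends only on $Z$, and the differences of the $x_j$-score across categories vanish; otherwise they equal the differences of the shifts.

```python
import numpy as np

def log_joint(xi, xj, z, shift):
    """Log of p(z) * m(x_i | z) * p(x_j | x_i, z) for a toy mixed-type model."""
    log_pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)          # Z ~ N(0, 1)
    logits = np.array([0.0, z, -z])                          # X_i | Z: softmax over 3 categories
    log_mi = logits[xi] - np.log(np.sum(np.exp(logits)))
    mean = 0.5 * z + shift[xi]                               # X_j | X_i, Z ~ N(mean, 1)
    log_pj = -0.5 * (xj - mean) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_pz + log_mi + log_pj

def score_xj(xi, xj, z, shift, eps=1e-5):
    """Central-difference approximation of d/dx_j of the log joint."""
    return (log_joint(xi, xj + eps, z, shift) - log_joint(xi, xj - eps, z, shift)) / (2 * eps)

def mixed_differences(xj, z, shift):
    """Score at the reference category x_{i1} minus the score at each other category x_{ik}."""
    ref = score_xj(0, xj, z, shift)
    return np.array([ref - score_xj(k, xj, z, shift) for k in range(1, len(shift))])

# Same shift for every category: X_i and X_j are conditionally independent given Z.
diff_ci = mixed_differences(xj=0.7, z=-0.4, shift=[0.3, 0.3, 0.3])
# Category-specific shifts: X_j depends on X_i even after conditioning on Z.
diff_dep = mixed_differences(xj=0.7, z=-0.4, shift=[0.0, 1.0, -1.0])
print(diff_ci, diff_dep)
```

In this toy model the score difference between categories 1 and $k$ is exactly `shift[0] - shift[k]`, so the dependent case does not vanish at any evaluation point.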

A.6 PROOF OF COROLLARY 4

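We first introduce the following lemmas and their proofs for completeness.

A.6.1 PROOF OF LEMMA 1

Lemma 1. [directly from Thm. 1 in (Hyvärinen & Dayan, 2005)] Assume
i. $X = (X_1, \ldots, X_d)$ is a set of continuous variables.
ii. The data PDF $p_X(x)$ is differentiable. The model PDF $p(x; \theta)$ is twice-differentiable. Both of them are strictly positive.
iii. The expectations $\mathbb{E}_x \|\log p(x; \theta)\|^2$ and $\mathbb{E}_x \|\log p_X(x)\|^2$ are finite for any $\theta$, and $p_X(x) \log p(x; \theta)$ goes to zero for any $\theta$ when $\|x\| \to \infty$.

Then Eq. (8) is equivalent to

$$O_c(\theta) = \int_{x \in \mathbb{R}^d} p_X(x) \sum_{i=1}^{d} \left[ \frac{1}{2} \|\nabla_{x_i} \log p(x; \theta)\|^2 + H_{x_i}(\log p(x; \theta)) \right] dx + \rho_\lambda(\Omega^{[c]}).$$

Proof. Based on Eq. (8), we have

$$O_c(\theta) = \frac{1}{2} \int p_X(x)\, \|\nabla_x \log p(x; \theta) - \nabla_x \log p_X(x)\|^2\, dx + \rho_\lambda(\Omega^{[c]}),$$

where $\rho_\lambda(\cdot)$ denotes a sparsity penalty function and $\lambda$ is the penalty parameter. This is equivalent to

$$O_c(\theta) = \frac{1}{2} \int p_X(x) \Big[ \|\nabla_x \log p(x; \theta)\|^2 + \|\nabla_x \log p_X(x)\|^2 - 2 (\nabla_x \log p_X(x))^\top (\nabla_x \log p(x; \theta)) \Big] dx + \rho_\lambda(\Omega^{[c]}).$$

We first consider the integral for the following part

$$- \int p_X(x)\, (\nabla_x \log p_X(x))^\top (\nabla_x \log p(x; \theta))\, dx,$$

by which we could obtain

$$
\begin{aligned}
&- \sum_i \int p_X(x)\, (\nabla_{x_i} \log p_X(x))\, (\nabla_{x_i} \log p(x; \theta))\, dx \\
&= - \sum_i \int (\nabla_{x_i} p_X(x))\, (\nabla_{x_i} \log p(x; \theta))\, dx \\
&= - \sum_i \int \left[ \int \nabla_{x_i} p_X(x)\, (\nabla_{x_i} \log p(x; \theta))\, dx_1 \right] d(x_2, \ldots, x_d) \\
&\overset{(\star)}{=} - \sum_i \int \bigg[ \lim_{a \to \infty,\, b \to -\infty} \Big[ p_X(a, x_2, \ldots, x_d)\, \nabla_{x_i} \log p(a, x_2, \ldots, x_d; \theta) - p_X(b, x_2, \ldots, x_d)\, \nabla_{x_i} \log p(b, x_2, \ldots, x_d; \theta) \Big] \\
&\qquad\qquad - \int \frac{\partial^2 \log p(x; \theta)}{\partial x_i^2}\, p_X(x)\, dx_1 \bigg]\, d(x_2, \ldots, x_d),
\end{aligned}
$$

where Eq. $(\star)$ holds because, if we assume $f$ and $g$ are both differentiable, we have

$$\frac{\partial f(x) g(x)}{\partial x_1} = f(x) \frac{\partial g(x)}{\partial x_1} + g(x) \frac{\partial f(x)}{\partial x_1}.$$

The integration by parts is written out for the first coordinate; for $i \neq 1$, the cases follow similarly. Because we assume $p_X(x) \log p(x; \theta)$ goes to zero for any $\theta$ when $\|x\| \to \infty$, the limit is zero. Thus, we have proven that

$$- \sum_i \int p_X(x)\, \nabla_{x_i} \log p_X(x)\, \nabla_{x_i} \log p(x; \theta)\, dx = \sum_i \int \frac{\partial^2 \log p(x; \theta)}{\partial x_i^2}\, p_X(x)\, dx.$$

By injecting it into Eq. (43), we obtain

$$O_c(\theta) = \int p_X(x) \left[ \frac{1}{2} \|\nabla_x \log p(x; \theta)\|^2 + \frac{1}{2} \|\nabla_x \log p_X(x)\|^2 + \operatorname{tr}\big(H_x(\log p(x; \theta))\big) \right] dx + \rho_\lambda(\Omega^{[c]}).$$

Because $\frac{1}{2} \|\nabla_x \log p_X(x)\|^2$ does not depend on $\theta$, we could ignore it. Then we have

$$O_c(\theta) = \int p_X(x) \sum_{i=1}^{d} \left[ \frac{1}{2} \|\nabla_{x_i} \log p(x; \theta)\|^2 + H_{x_i}(\log p(x; \theta)) \right] dx + \rho_\lambda(\Omega^{[c]}).$$

Thus, the proof is complete.

A.6.2 PROOF OF LEMMA 2

Lemma 2. [directly from (Lyu, 2012)] Assume
i. $X = (X_1, \ldots, X_d)$ is a set of discrete variables.
ii. The data PMF $m_X(x)$ and the model PMF $m(x; \theta)$ are strictly positive.

Then Eq. (12) is equivalent to

$$O_d(\theta) = \sum_x m_X(x) \sum_{i=1}^{d} \left[ \left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right)^2 - 2 \mathcal{M}_i\left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right) \right] + \rho_\lambda(\Omega^{[d]}).$$

Proof. Based on Eq. (12), we have

$$O_d(\theta) = \sum_x m_X(x) \left\| \frac{\mathcal{M}(m(x; \theta))}{m(x; \theta)} - \frac{\mathcal{M}(m_X(x))}{m_X(x)} \right\|^2 + \rho_\lambda(\Omega^{[d]}),$$

which implies

$$
\begin{aligned}
O_d(\theta) &= \sum_x m_X(x) \sum_{i=1}^{d} \left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} - \frac{\mathcal{M}_i(m_X(x))}{m_X(x)} \right)^2 + \rho_\lambda(\Omega^{[d]}) \\
&\overset{(\star)}{=} \sum_x m_X(x) \sum_{i=1}^{d} \left[ \left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right)^2 - 2 \mathcal{M}_i\left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right) \right] + \rho_\lambda(\Omega^{[d]}),
\end{aligned}
$$

where Eq. $(\star)$ is due to the fact that $\left( \frac{\mathcal{M}_i(m_X(x))}{m_X(x)} \right)^2$ does not take $\theta$ as an argument. The proof is complete.

The corollary then follows from these lemmas and is stated as follows.

Corollary 4. Assume
i. The data density $\pi_X(\cdot)$ is equal to $\pi(\cdot; \theta^*)$ for some $\theta^*$.
ii. The data density $\pi_X(\cdot)$ and model density $\pi(\cdot; \theta)$ are strictly positive. $\pi_X(\cdot)$ and $\pi(\cdot; \theta)$ are differentiable and twice-differentiable, respectively, w.r.t. continuous variables. For some $\theta^*$, $\pi_X(\cdot) = \pi(\cdot; \theta^*)$ and no other parameter value gives a density that is equal to $\pi(\cdot; \theta^*)$ almost everywhere.
iii. The expectations $\mathbb{E}_{\pi_X} \|\log \pi(x; \theta)\|^2$ and $\mathbb{E}_{\pi_X} \|\log \pi_X(x)\|^2$ are finite for any $\theta$, and $\pi_X(x) \log \pi(x; \theta)$ goes to zero for any $\theta$ when $\|x\| \to \infty$.

Then $O'_m(\theta) = 0$ implies $\theta = \theta^*$.

Proof. $O'_m(\theta)$ is defined as follows:

$$O'_m(\theta) = \mathbb{E}_{\pi_X}\left[ \sum_i s_i(x; \theta) \right],$$

where

$$
s_i(x; \theta) :=
\begin{cases}
\frac{1}{2} \|\nabla_{x_i} \log \pi(x; \theta)\|^2 + H_{x_i}(\log \pi(x; \theta)) & X_i \in X_c, \\
\frac{1}{2} \left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right)^2 - \mathcal{M}_i\left( \frac{\mathcal{M}_i(m(x; \theta))}{m(x; \theta)} \right) & X_i \in X_d,
\end{cases}
$$

where the probability function $\pi$ is strictly positive. According to Lemma 1 and Lemma 2, both cases of $s_i(\cdot; \theta)$ in $O'_m(\theta)$ are equivalent to

$$\frac{1}{2} \mathbb{E}_{\pi} \left\| \frac{g(\pi(\cdot; \theta))}{\pi(\cdot; \theta)} - \frac{g(\pi(\cdot))}{\pi(\cdot)} \right\|^2,$$

where $g$ denotes the gradient operator for continuous variables or the marginalization operator for discrete variables. If $O'_m(\theta) = 0$, $s_i(x; \theta)$ must equal zero for any $i$. Because of Brook's Lemma (Brook, 1964), which is also included in Lyu (2012) as Lemma 3, the marginalization operator $\mathcal{M}$ is complete (Defn. 1). Thus, for the discrete variables, we could replace the gradient in the continuous score function with the marginalization operator while preserving the same local consistency as for the continuous variables, which is shown by Theorem 2 in Hyvärinen & Dayan (2005).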
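To make the continuous objective of Lemma 1 concrete, the following sketch evaluates its empirical version for a zero-mean Gaussian model with symmetric precision matrix $\Theta$, a deliberately simple assumption that is not the model class used in the paper. In that case $\nabla_x \log p(x; \Theta) = -\Theta x$ and the trace of the Hessian is $-\operatorname{tr}(\Theta)$, so the objective reduces to $\frac{1}{2}\operatorname{tr}(\Theta S \Theta) - \operatorname{tr}(\Theta)$ with $S$ the sample second-moment matrix, whose minimizer is $S^{-1}$. The sparsity penalty $\rho_\lambda$ is omitted here, and all names are illustrative.

```python
import numpy as np

def sm_objective(theta, X):
    """Empirical score matching objective for p(x; theta) proportional to exp(-0.5 x^T theta x).

    Averages 0.5 * ||grad_x log p||^2 + trace(Hessian_x log p) over the rows of X.
    No sparsity penalty is included in this sketch.
    """
    grad_log_p = -X @ theta                    # each row is (-theta x)^T for symmetric theta
    return np.mean(0.5 * np.sum(grad_log_p**2, axis=1)) - np.trace(theta)

rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = rng.standard_normal((2000, 3)) @ mixing    # correlated, zero-mean samples
S = X.T @ X / len(X)                           # sample second-moment matrix

theta_hat = np.linalg.inv(S)                   # closed-form minimizer in this Gaussian case
stationarity = 0.5 * (S @ theta_hat + theta_hat @ S) - np.eye(3)

# Any symmetric perturbation should increase the objective.
D = rng.standard_normal((3, 3))
D = 0.5 * (D + D.T)
print(sm_objective(theta_hat, X), sm_objective(theta_hat + 0.1 * D, X))
```

The point of the example is that the objective depends on the data only through expectations of model quantities, so the unknown data score $\nabla_x \log p_X$ never has to be estimated.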

B EXPERIMENTS

B.1 GENERATING PROCESS FOR MIXED-TYPE DATA

For the mixed-type case, we simulate data with the process described in Andrews et al. (2018), the details of which are included here for completeness. After generating a random decomposable DAG, we first assign a data type (continuous or discrete) to each variable with equal probability. For variables without parents in the ground-truth graph, we sample their values from Gaussian and Multinomial distributions, respectively. Then, for each continuous variable, we create a temporary discretized version by applying equal-frequency binning. The number of bins is chosen uniformly between 2 and 5, inclusive. The cardinality of each discrete variable is chosen uniformly between 2 and 4, inclusive. The randomly generated decomposable DAGs are moralized to obtain the ground-truth Markov network structures. Next, for variables with parents in the ground-truth graph, we sample their values as follows. For each continuous variable, we first partition the samples according to its discrete parents. Then the values of the continuous variable are generated by randomly parameterizing the coefficients of a regression for each partition. For each discrete variable, we generate its values by randomly parameterizing Multinomial distributions of the target variable and its discrete parents (temporary or not). After the simulation, all temporary discretized variables are removed.
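Two steps of the process above, equal-frequency binning and per-partition regression for a continuous child, can be sketched as follows. This is an illustrative reading of the description, not the simulator of Andrews et al. (2018): the function names, the standard-normal coefficients and noise, and the single continuous and single discrete parent are all assumptions.

```python
import numpy as np

def equal_frequency_bins(values, n_bins):
    """Label each value with its equal-frequency bin index in {0, ..., n_bins - 1}."""
    # Interior quantiles serve as the cut points between bins.
    cuts = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, values, side="right")

def sample_continuous_child(parent_cont, parent_disc, rng):
    """Generate a continuous child with a separate random regression per discrete level."""
    child = np.empty(len(parent_cont))
    for level in np.unique(parent_disc):
        mask = parent_disc == level
        intercept, coef = rng.normal(size=2)       # random regression parameters (assumed N(0, 1))
        noise = rng.normal(size=mask.sum())        # additive noise (assumed N(0, 1))
        child[mask] = intercept + coef * parent_cont[mask] + noise
    return child

rng = np.random.default_rng(0)
n = 1000
x_cont = rng.normal(size=n)                            # continuous root variable
x_disc = rng.choice(3, size=n, p=[0.2, 0.5, 0.3])      # discrete root variable with 3 levels

n_bins = int(rng.integers(2, 6))                       # uniform on {2, 3, 4, 5}
x_cont_binned = equal_frequency_bins(x_cont, n_bins)   # temporary discretized copy

child = sample_continuous_child(x_cont, x_disc, rng)
print(n_bins, np.bincount(x_cont_binned), child[:3])
```

The temporary binned copy would only be used when a discrete child needs a discrete stand-in for a continuous parent, and it is discarded after simulation, as the text describes.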

B.2 INFLUENCE OF THE SAMPLE SIZE

In this section, we report additional experimental results with a larger sample size. We conduct experiments for all settings (continuous, discrete, and mixed-type) with different numbers of variables (d ∈ {4, 6, . . . , 20}) and 10000 samples. The results are summarized in Fig. 4, Fig. 5, and Fig. 6. One can observe that both KCI and GS fail in all settings, indicating that they cannot scale well with large sample sizes. This is because the complexities of KCI and GS grow cubically in the number of samples, which is one of the motivations for the development of our method. In addition, SING cannot scale beyond 6 variables because of out-of-memory (OOM) errors. At the same time, our method works well across all datasets without any scalability issues. Together with the better performance illustrated in Sec. 4 (note that one can even go beyond 5000 variables, e.g., it takes 7725 seconds for 10,000 variables in our setting), we believe the potential of our method is not only theoretically exciting but also empirically clear in both consistency and scalability.

B.3 INFLUENCE OF DIFFERENT PENALTY FUNCTIONS

To explore the effect of different regularization functions, we compare the results of our method with different sparsity penalties, which are shown in Fig. 7. The experiments are conducted on Butterfly distributions with the number of continuous variables ranging from 4 to 20 and a sample size of 1000. One can observe that SCAD and MCP outperform the other penalties, while SCAD performs slightly better than MCP in general. Adaptive ℓ1 (Zou, 2006) also shows an advantage over the original ℓ1 penalty. This suggests the importance of appropriate penalty functions.
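For reference, the standard closed forms of the SCAD and MCP penalties compared above can be written in a few lines. This is a generic sketch of the textbook definitions; the default shape parameters (`a=3.7` for SCAD, `gamma=3.0` for MCP) are conventional choices assumed here, not values reported in this paper, and how the penalty is applied to the characterization matrix is not shown.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty: linear near zero, quadratic transition, constant beyond a * lam."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    constant = lam**2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

def mcp_penalty(t, lam, gamma=3.0):
    """MCP penalty: concave quadratic up to gamma * lam, constant afterwards."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= gamma * lam, lam * t - t**2 / (2 * gamma), gamma * lam**2 / 2)

def l1_penalty(t, lam):
    """Plain l1 penalty, shown for comparison."""
    return lam * np.abs(np.asarray(t, dtype=float))

grid = np.array([0.0, 0.5, 1.0, 2.0, 5.0, 10.0])
print(l1_penalty(grid, 1.0))
print(scad_penalty(grid, 1.0))
print(mcp_penalty(grid, 1.0))
```

Unlike ℓ1, both nonconvex penalties flatten out for large entries, so strong dependencies are not shrunk while small entries are still pushed to zero.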

C DISCUSSION

C.1 TOWARDS NONPARAMETRIC CAUSAL DISCOVERY

In this section, we briefly discuss the implications of our proposed Markov network estimation method for causal discovery, whose goal is to learn graphical models with causal interpretations. The major classes of approaches for causal discovery are constraint-based approaches that utilize conditional independence tests and score-based approaches that optimize a specific score function. Among them, PC (Spirtes & Glymour, 1991) with the kernel-based conditional independence test (Zhang et al., 2012) and GES (Chickering, 2002) with the generalized score (Huang et al., 2018) are able to handle nonparametric cases with assumptions such as causal sufficiency. Both of these approaches rely on kernel methods whose computational complexity is cubic w.r.t. the number of samples. Therefore, the running time could be long if the sample size is large. Furthermore, when the number of variables is large, the search procedure may involve computing the kernel-based conditional independence test or score function many times, which may also increase the running time. As shown by Loh & Bühlmann (2014); Ng et al. (2021) in the linear Gaussian case, the Markov network (i.e., the support of the inverse covariance matrix of the distribution) is guaranteed to be a super-structure of the ground-truth directed acyclic graph (DAG) under a specific type of faithfulness assumption. That is, the super-structure contains all edges of the true DAG. Using this idea, they showed that the Markov network may be used to restrict the search space of score-based approaches for causal discovery, which improves the scalability. Their works focus only on the linear case and adopt classical methods like graphical Lasso (Friedman et al., 2008) to estimate the Markov network.
In this work, the nonparametric Markov network estimated by our proposed procedure could potentially be used as a super-structure to restrict the search space for nonparametric causal discovery methods, i.e., (kernel-based) PC and GES. Similar to (Loh & Bühlmann, 2014; Ng et al., 2021) , this may help reduce the running time and improve the scalability of these methods.
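The super-structure idea can be illustrated with a small moralization routine: moralizing a DAG (connecting parents that share a child and dropping edge directions) yields an undirected graph that contains every edge of the DAG, so an estimated Markov network can serve as a list of candidate edges for a downstream search. The sketch below is a hypothetical illustration; the adjacency convention `dag[p, c] = 1` for `p -> c` and the helper names are assumptions, and no causal discovery algorithm is implemented.

```python
import numpy as np

def moralize(dag):
    """Return the moral graph (boolean, symmetric) of a DAG adjacency matrix.

    dag[p, c] != 0 means there is a directed edge p -> c.
    """
    dag = np.asarray(dag) != 0
    skeleton = dag | dag.T                                    # drop edge directions
    # (dag @ dag.T)[p, q] counts the common children of p and q.
    married = (dag.astype(int) @ dag.T.astype(int)) > 0
    moral = skeleton | married
    np.fill_diagonal(moral, False)                            # no self-loops
    return moral

def candidate_edges(markov_network):
    """Undirected pairs (i, j), i < j, that a restricted search would be allowed to consider."""
    return [tuple(pair) for pair in np.argwhere(np.triu(markov_network, k=1))]

# Hypothetical 4-node DAG: 0 -> 2 <- 1 (a collider) and 2 -> 3.
dag = np.array([[0, 0, 1, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]])

moral = moralize(dag)
skeleton = (dag + dag.T) > 0
print(candidate_edges(moral))
```

In this example the moral graph has four of the six possible undirected edges, so a search restricted to it never needs to test the pairs (0, 3) and (1, 3).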



For any set $S \subset V$, we write $X_S = \{X_i : i \in S\}$.



Figure 1: Hamming distances for Butterfly distributions.

Figure 2: Hamming distances for distributions from random graphs.

Figure 3: Running time for large graphs.

Figure 4: Hamming distances for continuous data, n = 10000.

Because this procedure estimates causal structures represented by completed partially directed acyclic graphs (CPDAGs), we moralize the results to obtain the Markov network structures.

(SING) Sparsity Identification in Non-Gaussian distributions (Morrison et al., 2017; Baptista et al., 2021) is an algorithm designed for the estimation of Markov networks in non-Gaussian continuous distributions. It applies a transport map to estimate the data density.

(GLASSO) Graphical Lasso is a classical sparse penalized estimator for the inverse covariance matrix.

(NPN) GLASSO with the nonparanormal transformation (Liu et al., 2009).

Running time for 12 variables.

6. ACKNOWLEDGMENT

We thank the anonymous reviewers for their constructive comments. This project was partially supported by the National Institutes of Health (NIH) under Contract R01HL159805, by the NSF-Convergence Accelerator Track-D award #2134901, by a grant from Apple Inc., a grant from KDDI Research Inc, and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research.


ii. The PDFs of $X$ are strictly positive and smooth.
iii. The characterization matrix $\Omega^{[c]}$ is defined according to Eq. (3).

Then for any $i \neq j$, $\Omega^{[c]}_{i,j} = 0$ implies $X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}}$.

Proof. According to Eq. (3), it is clear that when $\Omega^{[c]}_{i,j} = 0$, we have $\frac{\partial^2 \log p_X}{\partial x_i \partial x_j} = 0$, which is a necessary and sufficient condition for $x_i \perp\!\!\!\perp x_j \mid x_{V \setminus \{i,j\}}$ for strictly positive, smooth, and continuous distributions, as shown in Sec. 2.1. It is worth noting that it also applies in the Gaussian case, where $X \sim \mathcal{N}(\mu, \Sigma)$ is a Gaussian vector with mean $\mu$ and non-singular covariance $\Sigma$. In this case, we have

$$\frac{\partial^2 \log p_X}{\partial x_i \partial x_j} = -(\Sigma^{-1})_{i,j},$$

where $(\Sigma^{-1})_{i,j}$ denotes the corresponding entry of the inverse covariance matrix. This well-known property of the Gaussian distribution was also shown in Baptista et al. (2021); Drton et al. (2008). Because the inverse covariance matrix encodes the conditional independence structure when variables are from Gaussian distributions, $\Omega^{[c]}$ characterizes the Markov property for the Gaussian case.
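The Gaussian remark can be verified numerically: the Hessian of the log-density of a Gaussian equals the negative inverse covariance at every point, so a zero in the precision matrix appears as a zero mixed second derivative. The following is a small sketch with a hypothetical chain-structured precision matrix and a finite-difference Hessian; the names and the step size are illustrative assumptions.

```python
import numpy as np

def log_gaussian(x, mu, prec):
    """Gaussian log-density up to an additive constant that does not depend on x."""
    diff = x - mu
    return -0.5 * diff @ prec @ diff

def hessian_fd(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at the point x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# Hypothetical chain X_1 - X_2 - X_3: the (1, 3) entry of the precision matrix is zero.
prec = np.array([[2.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 2.0]])
mu = np.array([0.5, -1.0, 2.0])

rng = np.random.default_rng(0)
x0 = rng.normal(size=3)                       # arbitrary evaluation point
H = hessian_fd(lambda x: log_gaussian(x, mu, prec), x0)
print(np.round(H, 6))
```

Because the log-density is quadratic, the Hessian does not depend on the evaluation point, which is why the support of the precision matrix alone determines the graph in the Gaussian case.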

