GENERALIZED PRECISION MATRIX FOR SCALABLE ESTIMATION OF NONPARAMETRIC MARKOV NETWORKS

Abstract

A Markov network characterizes the conditional independence structure, or Markov property, among a set of random variables. Existing work focuses on specific families of distributions (e.g., exponential families) and/or certain graph structures, and most methods can only handle variables of a single data type (continuous or discrete). In this work, we characterize the conditional independence structure in general distributions for all data types (i.e., continuous, discrete, and mixed-type) with a Generalized Precision Matrix (GPM). Moreover, we allow general functional relations among variables, giving rise to a Markov network structure learning algorithm in one of the most general settings. To address the computational challenge of the problem, especially for large graphs, we unify all cases under the same umbrella of a regularized score matching framework. We validate the theoretical results and demonstrate the scalability of our method empirically in various settings.

1. INTRODUCTION

Markov networks (also known as Markov random fields) represent conditional dependencies among random variables. They provide clear semantics in a graphical manner to cope with uncertainty in probability theory, with wide applications in fields including physics (Cimini et al., 2019), chemistry (Dodani et al., 2016), biology (Jaimovich et al., 2006), and sociology (Carrington et al., 2005). The undirected nature of the edges also allows cyclic, overlapping, or hierarchical interactions (Shen et al., 2009). To estimate a Markov network from observational data, existing work focuses on certain parametric families of distributions, a majority of which study the Gaussian case. By assuming that the variables follow a multivariate Gaussian distribution, the dependencies can be well represented by the support of the precision, or inverse covariance, matrix according to the Hammersley-Clifford theorem (Besag, 1974; Grimmett, 1973). Together with various statistical estimators (e.g., the graphical lasso (Friedman et al., 2008) and neighborhood selection (Meinshausen & Bühlmann, 2006)), this connection between the precision matrix and the graphical structure has been well exploited in the Gaussian case over the past decades (Yuan, 2010; Ravikumar et al., 2011). However, methods for Gaussian graphical models fail to correctly capture dependencies among variables that deviate from Gaussianity or interact nonlinearly (Raskutti et al., 2008; Ravikumar et al., 2011). While non-Gaussianity is common in real-world data-generating processes, few results are applicable to Markov network structure learning on non-Gaussian data. In the discrete setting, Ravikumar et al. (2010) showed that a binary Ising model can be recovered by neighborhood selection using ℓ1-penalized logistic regression.
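The classical Gaussian connection above can be illustrated with a minimal numerical sketch (ours, not one of the cited estimators): for a three-variable Gaussian chain X1 – X2 – X3, the covariance matrix is dense, yet the zero entry of the precision matrix marks the missing edge.

```python
import numpy as np

# Chain structure X1 - X2 - X3: X1 and X3 are conditionally
# independent given X2, so the precision matrix has Theta[0, 2] = 0.
Theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Theta)  # covariance of the corresponding Gaussian

# The covariance itself is dense: X1 and X3 are marginally correlated ...
assert abs(Sigma[0, 2]) > 1e-6
# ... but inverting it recovers the sparse support, i.e. the absent edge.
assert abs(np.linalg.inv(Sigma)[0, 2]) < 1e-8
```

This is exactly the pattern the graphical lasso and neighborhood selection exploit: estimate the support of the inverse covariance rather than the covariance itself.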
Loh & Wainwright (2013) encoded extra structural relations in a proposed generalized covariance matrix to model the dependencies for Markov networks with certain structures (e.g., tree structures or graphs with only singleton separator sets) among variables from exponential families. Several approaches allow estimation for non-Gaussian continuous variables, but most of them rely on parametric assumptions such as exponential families (Yang et al., 2015; Lin et al., 2016; Suggala et al., 2017) or Gaussian copulas (Liu et al., 2009; 2012; Harris & Drton, 2013). These methods illustrate the possibility of reliable Markov network estimation in several non-Gaussian cases, but the models remain restricted to specific parametric families of distributions and/or structures of conditional independencies. Concerned with describing the Markov properties of non-Gaussian data with general continuous distributions, Morrison et al. (2017) used second-order derivatives to encode the conditional independence structure. Specifically, their approach is based on a theorem stating that the zero pattern in the Hessian matrix of the log-density determines the conditional independencies between non-Gaussian continuous variables (Spantini et al., 2018). A method based on transport maps, namely Sparsity Identification in Non-Gaussian distributions (SING) (Baptista et al., 2021), was then designed to estimate the data density from samples, from which the structure is derived. This approach achieves consistent Markov network structure recovery in a general non-Gaussian continuous setting. However, methods relying on the Hessian matrix cannot cope with discrete or mixed-type data. In addition, density estimation, especially for non-Gaussian data, can be computationally challenging for large graphs, limiting the scalability of this approach.
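The Hessian-based characterization can be checked numerically. The toy (unnormalized) non-Gaussian log-density below is our own illustration, not one from the cited works: it has chain structure X1 – X2 – X3, so the mixed partial derivative of the log-density with respect to the non-adjacent pair (X1, X3) vanishes everywhere, while the one for the adjacent pair (X1, X2) does not.

```python
import numpy as np

# Unnormalized non-Gaussian log-density with chain structure X1 - X2 - X3:
# no term couples x1 and x3 directly.
def log_p(x):
    x1, x2, x3 = x
    return -(x1**4 + x1 * x2 + x2**2 * x3**2)

def mixed_partial(f, x, i, j, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx_i dx_j) at x."""
    def shifted(si, sj):
        y = x.copy()
        y[i] += si * h
        y[j] += sj * h
        return f(y)
    return (shifted(1, 1) - shifted(1, -1)
            - shifted(-1, 1) + shifted(-1, -1)) / (4 * h**2)

x0 = np.array([0.3, -0.7, 1.1])
print(abs(mixed_partial(log_p, x0, 0, 2)))  # vanishes: edge (X1, X3) absent
print(abs(mixed_partial(log_p, x0, 0, 1)))  # nonzero: edge (X1, X2) present
```

SING operates in the reverse direction: it first estimates the density via a transport map and then reads the structure off the Hessian's zero pattern, which is precisely the step that becomes expensive for large graphs.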
The Kernel-based Conditional Independence test (KCI) (Zhang et al., 2012) and the Generalized Score (GS) (Huang et al., 2018) can handle the mixed-type case for structure learning, but, as kernel-based methods, they are computationally challenging since their complexity scales cubically in the number of samples. To deal with these remaining obstacles, we explore a Generalized Precision Matrix (GPM) for nonparametric Markov network learning. Based on necessary and sufficient conditions for conditional independence in the continuous, discrete, and mixed-type cases, the GPM characterizes Markov network structures for arbitrary data types. Moreover, our work does not constrain the distribution to be of a specific family, such as the exponential family, or to be normalized. Notably, we also make no specific assumptions on the functional relations among variables. To the best of our knowledge, the proposed GPM illustrates the feasibility of Markov network structure learning in one of the most general nonparametric settings. Furthermore, we put all these cases under the umbrella of a single estimation framework based on regularized score matching, as an extension of the score matching framework (Hyvärinen & Dayan, 2005). Different from the previous approach (SING), which applies a transport map to estimate the data density for general continuous distributions, our framework only needs to estimate the model score function, parameterized by a deep model, from which the characterization matrix of the Markov network structure can be directly calculated. To facilitate the estimation process, we also exploit suitable penalties on the characterization matrix to encourage constantly sparse entries. Besides, we adopt recent advances in score matching (Song et al., 2020) to further scale up the process. Our method therefore narrows the gap between reliable structure learning and scalable deep learning techniques.
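To make the score matching idea concrete, here is a toy sketch (our own, not the paper's deep-model estimator): for the Gaussian family with model score s_Θ(x) = −Θx, the Hyvärinen objective J(Θ) = E[½‖s_Θ(x)‖² + div s_Θ(x)] = E[½ xᵀΘᵀΘx] − tr(Θ) has the closed-form minimizer Θ = E[xxᵀ]⁻¹, i.e., the precision matrix, recovered without ever normalizing or evaluating the density.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth chain precision matrix (zero in the (0, 2) entry).
Theta_true = np.array([[ 2.0, -1.0,  0.0],
                       [-1.0,  2.0, -1.0],
                       [ 0.0, -1.0,  2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Theta_true),
                            size=200_000)

# With model score s_Theta(x) = -Theta x, the score matching objective
#   J(Theta) = E[0.5 * ||Theta x||^2] - tr(Theta)
# is minimized in closed form by the inverse empirical second moment.
S = X.T @ X / len(X)            # empirical second moment E[x x']
Theta_hat = np.linalg.inv(S)    # score matching estimate of the precision

print(np.round(Theta_hat, 2))   # close to Theta_true; (0, 2) entry near zero
```

In the general nonparametric setting the closed form is unavailable; the paper instead parameterizes the score with a deep model and minimizes the regularized objective by gradient descent, but the principle, fitting the score rather than the density, is the same.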
We validate the theoretical results experimentally and illustrate the scalability of our method.

2. GENERALIZED PRECISION MATRIX

Suppose that we observe a collection of random variables X = (X_1, . . . , X_d). Our goal is to discover the underlying Markov network structure. Specifically, it is an undirected graph G comprising a set of vertices V = {1, . . . , d} and a set of edges E. The edges E encode the conditional independence relations, or the global Markov property: for any disjoint subsets A, B, and C of the vertex set V such that C separates A and B, X_A and X_B are conditionally independent given X_C, i.e., X_A ⊥⊥ X_B | X_C.¹ Throughout this paper, we use an uppercase letter to denote a random variable and a lowercase letter with subscripts to denote its value (e.g., X_i = x_i for the value of X_i). For a discrete variable, say X_i, we denote its support by {x_{i1}, . . . , x_{iM_i}}, where M_i is its cardinality. As an alternative characterization of the conditional independence relations encoded by the graph, the pairwise Markov property requires that every pair of non-adjacent variables in the graph be conditionally independent given the remaining variables. That is, for any i ≠ j, the edge between X_i and X_j is absent if and only if X_i and X_j are conditionally independent given the remaining variables, i.e., X_i ⊥⊥ X_j | X_{V\{i,j}}. The conditioning set consisting of all remaining variables is essential. According to Lauritzen (1996), the pairwise Markov property is equivalent to the global one when the density is strictly positive. In order to estimate nonparametric Markov networks in this setting, we explore generalized characterizations of conditional independence for all data types (i.e., continuous, discrete, and mixed-type) without distributional constraints. We start by learning conditional independence structures in continuous data with a procedure inspired by Spantini et al. (2018), and then propose new characterizations for discrete and mixed-type data.
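Under the pairwise Markov property, recovering the graph from any precision-style characterization matrix amounts to thresholding its off-diagonal support. The helper below is purely illustrative (the function name and tolerance are ours, not the paper's):

```python
import numpy as np

def edges_from_support(K, tol=1e-8):
    """List undirected edges (i, j), i < j, from the off-diagonal
    support of a precision-style characterization matrix K."""
    d = K.shape[0]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(K[i, j]) > tol]

# Chain-structured example: only edges (0, 1) and (1, 2) are present.
K = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
print(edges_from_support(K))  # [(0, 1), (1, 2)]
```

In practice the estimated matrix is only approximately sparse, which is why the estimation framework adds penalties that push the irrelevant entries toward zero before such a support read-out.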



¹ For any set S ⊂ V, we write X_S = {X_i : i ∈ S}.

