ESTIMATION OF NUMBER OF COMMUNITIES IN ASSORTATIVE SPARSE NETWORKS

Abstract

Most community detection algorithms assume the number of communities, K, to be known a priori. Among the various approaches that have been proposed to estimate K, the non-parametric methods based on the spectral properties of the Bethe Hessian matrices have gained popularity for their simplicity, computational efficiency, and robust performance irrespective of the sparsity of the input data. Recently, one such method has been shown to estimate K consistently if the input network is generated from the (semi-dense) stochastic block model, when the average of the expected degrees (d) of all the nodes in the network satisfies $d \gg \log(N)$ (N being the number of nodes in the network). In this paper, we prove some finite sample results that hold for $d = o(\log(N))$, which in turn show that the estimation of K based on the spectra of the Bethe Hessian matrices is consistent not only for the semi-dense regime, but also for the sub-logarithmic sparse regime when $1 \ll d \ll \log(N)$. Thus, our estimation procedure is a robust method for a wide range of problem settings, regardless of the sparsity of the network input.

1. INTRODUCTION

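Statistical analysis of network data has become an extensively studied field within statistics and machine learning (see Goldenberg et al., 2010; Kolaczyk & Csárdi, 2014; Newman, 2018 for reviews). Network datasets arise in several disciplines. Examples include networks originating from the biosciences, such as gene regulation networks (Emmert-Streib et al. (2014)), protein-protein interaction networks (De Las Rivas & Fontanillo (2010)), structural (Rubinov & Sporns (2010)) and functional (Friston (2011)) networks of the brain, and epidemiological networks (Reis et al. (2007)); networks originating from social media such as Facebook, Twitter and LinkedIn (Faloutsos et al. (2010)); citation and collaboration networks (Lehmann et al. (2003)); and information and technological networks such as internet-based networks (Adamic & Glance (2005)), power networks (Pagani & Aiello (2013)) and cell-tower networks (Isaacman et al. (2011)). Active areas of research include both developing statistical methodologies for network data analysis and deriving the theoretical properties of these methods. In this paper, we focus on networks with community structure and on estimating the number of communities in networks with an arbitrary sparsity level.
The last two decades saw a resurgence of interest in the problem popularly known as "community detection". A common formulation is to partition the N nodes of a graph into K communities such that the edge densities within communities differ from those between communities, where K is assumed to be known a priori. Estimating the number of communities, K, has recently become an active topic in the literature. While the initial focus in estimating K was on developing algorithms, with support drawn from domain-specific intuition and empirical studies using the Stochastic Block Model (SBM), first proposed in Holland et al. (1983) (such as Saade et al. (2014a) and Yan et al. (2018)), there has been recent progress on the theoretical understanding of the number of communities. Bickel & Sarkar (2015) and Lei et al. (2016) proposed hypothesis testing approaches based on principal eigenvalues or singular values. Likelihood-based methods using the BIC criterion were proposed by Wang et al. (2017) and Hu et al. (2019). From a Bayesian perspective, Riolo et al. (2017) discussed priors for the number of communities under the SBM and designed a Markov chain Monte Carlo algorithm, Kemp et al. (2006) presented a nonparametric Bayesian approach for detecting concept systems, Xu et al. (2006) introduced an infinite-state latent variable as part of a Dirichlet process mixture model, and Cerqueira & Leonardi (2020) proposed an estimator based on integrated likelihood for the SBM. Rosvall & Bergstrom (2007) introduced the concept of the minimum description length (MDL) to describe network modularities in partitioning networks, and Peixoto (2013) employed MDL to detect the number of communities. Chen & Lei (2018) and Li et al. (2020) proposed cross-validation based approaches with theoretical guarantees to estimate K. Yan et al. (2018) proposed a semi-definite programming approach, and Ma et al. (2018) proposed an estimator based on the loss of binary segmentation using a pseudo-likelihood ratio. All of these approaches had theoretical guarantees. However, all the theoretical results were obtained under the assumption that the mean density of the network is greater than log(N).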
Methods based on the spectrum of certain classes of matrices have become increasingly popular in recent years as non-parametric alternatives that are more computationally efficient and applicable to a wider range of settings. Most notably, the non-backtracking matrices (e.g., Krzakala et al. (2013), Saade et al. (2014b), Coste & Zhu (2019), Bordenave et al. (2015), Saade et al. (2016)) and the Bethe Hessian matrices (e.g., Saade et al. (2015b), Lelarge (2018), Dall'Amico et al. (2019), Saade et al. (2015a), Dall'Amico et al. (2020), Saade et al. (2014a), Le & Levina (2015)) have received much attention due to their non-parametric form and competitive performance in the presence of degree heterogeneity and sparsity. In particular, unlike the non-backtracking operator, the Bethe Hessian is a real symmetric operator and hence offers additional computational advantages. Through simulations, Saade et al. (2014a) demonstrated that the Bethe Hessian outperformed the non-backtracking operator, belief propagation, and the adjacency matrix in clustering, in both accuracy and efficiency. Le & Levina (2015) proved the consistency of the method based on the spectrum of the Bethe Hessian operator in semi-dense regimes, i.e., with the expected degree $d \gg \log(N)$ and with the scalar parameter chosen from the two values commonly used in the literature, which are based on heuristics for assortative and disassortative networks. However, other than these two candidate values and their variations, no other values of the scalar parameter are known to ensure consistency in any regime. Furthermore, real-world networks are generally much sparser, and no theoretical result in the literature guarantees the effectiveness of the Bethe Hessian operator in sparser regimes.
Our contribution. In this paper, we contribute to the theoretical understanding of the Bethe Hessian operator in estimating K for networks generated from the SBM in any regime, regardless of the sparsity. We have three main contributions.
• We show that the method of estimating K based on the spectral properties of the Bethe Hessian matrix ("spectral method") is statistically consistent, even in regimes more sparse than those previously considered in the literature, with the expected degree $1 \ll d \ll \log(N)$. The precise definition of d is given in §2.1.
• We provide the first-of-its-kind interval of values for the scalar parameter of the Bethe Hessian operator that serves as a sufficient condition for the spectral method to correctly estimate K asymptotically in network data.
• Through extensive simulations, we demonstrate that for any value chosen from the interval for the scalar parameter, the spectral method correctly estimates K in networks regardless of sparsity. We also consider the heuristics-based values commonly used in the literature for the scalar parameter in the context of the interval.
The paper is arranged as follows. We present the definitions and a formal problem statement in §2. We present our main theoretical result and a sketch of the proof in §3, followed by empirical methods in §4. The simulation results and concluding remarks are given in §5 and §6, respectively.
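As a rough illustration of the spectral method described above, the following minimal sketch counts the negative eigenvalues of a Bethe Hessian matrix. The Bethe Hessian is not defined in this part of the paper, so the sketch assumes the form commonly used in the cited literature, $H(r) = (r^2 - 1)I - rA + D$, where A is the adjacency matrix, D is the diagonal matrix of node degrees, and r is the scalar parameter. The function names, the default choice of r as the square root of the mean observed degree, and the deterministic "ring of cliques" toy graph are all illustrative assumptions; they are not the authors' estimation procedure or simulation setup.

```python
import numpy as np


def bethe_hessian(A, r):
    """Return H(r) = (r^2 - 1) I - r A + D for a symmetric 0/1 adjacency matrix A.

    This form is assumed from the Bethe Hessian literature, not taken from
    the text above.
    """
    degrees = A.sum(axis=1)
    return (r ** 2 - 1.0) * np.eye(A.shape[0]) - r * A + np.diag(degrees)


def estimate_num_communities(A, r=None):
    """Estimate K as the number of negative eigenvalues of H(r).

    If r is not supplied, fall back to sqrt(mean observed degree), a
    heuristic assumed here for assortative networks.
    """
    if r is None:
        r = np.sqrt(A.sum() / A.shape[0])
    eigenvalues = np.linalg.eigvalsh(bethe_hessian(A, r))
    return int(np.sum(eigenvalues < 0))


def ring_of_cliques(num_cliques, clique_size):
    """Toy assortative graph (hypothetical): cliques joined in a ring by single edges."""
    n = num_cliques * clique_size
    A = np.zeros((n, n))
    for c in range(num_cliques):
        lo = c * clique_size
        # Fully connect the nodes of clique c.
        A[lo:lo + clique_size, lo:lo + clique_size] = 1.0
        # Bridge the last node of this clique to the first node of the next one.
        u = lo + clique_size - 1
        v = ((c + 1) % num_cliques) * clique_size
        A[u, v] = A[v, u] = 1.0
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A


A = ring_of_cliques(num_cliques=4, clique_size=10)
print(estimate_num_communities(A))  # prints 4
```

The toy graph is chosen because its spectrum is well separated, so the count is easy to verify by hand. On sparse random networks the outcome depends on the choice of r, which is the question the interval in the second contribution addresses.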

2. PRELIMINARIES

2.1 NOTATION

An adjacency matrix, denoted by A, is a random matrix whose rows and columns are labeled by nodes $i, j \in [N]$, where $A_{ij} = 1$ if there is an edge between nodes i and j and 0 otherwise, and $[N]$ denotes the set $\{1, \ldots, N\}$. The mean observed degree is denoted by $d := \frac{1}{N} \mathbf{1}_N^T A \mathbf{1}_N$ and the




