ESTIMATION OF NUMBER OF COMMUNITIES IN ASSORTATIVE SPARSE NETWORKS

Abstract

Most community detection algorithms assume the number of communities, K, to be known a priori. Among the various approaches that have been proposed to estimate K, the non-parametric methods based on the spectral properties of the Bethe Hessian matrices have gained much popularity for their simplicity, computational efficiency, and robust performance irrespective of the sparsity of the input data. Recently, one such method has been shown to estimate K consistently if the input network is generated from the (semi-dense) stochastic block model, when the average of the expected degrees (d) of all the nodes in the network satisfies d ≫ log(N) (N being the number of nodes in the network). In this paper, we prove some finite sample results that hold for d = o(log(N)), which in turn show that the estimation of K based on the spectra of the Bethe Hessian matrices is consistent not only for the semi-dense regime, but also for the sub-logarithmic sparse regime when 1 ≪ d ≪ log(N). Thus, our estimation procedure is a robust method for a wide range of problem settings, regardless of the sparsity of the network input.
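As an illustration of the spectral idea summarized above, the following is a minimal sketch, not the exact procedure analyzed in this paper. It assumes the standard Bethe Hessian H(r) = (r^2 - 1) I - r A + D of Saade et al. (2014a), where A is the adjacency matrix and D the diagonal degree matrix, and it takes r to be the square root of the average observed degree, which is one common choice. The function names sample_sbm, bethe_hessian, and estimate_k, as well as the block sizes and edge probabilities, are hypothetical and chosen only for this example.

```python
import numpy as np


def sample_sbm(sizes, p_in, p_out, rng):
    """Sample a symmetric 0/1 adjacency matrix from a simple assortative SBM.

    Hypothetical helper: nodes in the same block connect with probability
    p_in, nodes in different blocks with probability p_out.
    """
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p_in, p_out)
    # Draw each edge once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(float)


def bethe_hessian(A, r):
    """Return H(r) = (r^2 - 1) I - r A + D for adjacency matrix A."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    return (r ** 2 - 1.0) * np.eye(n) - r * A + D


def estimate_k(A):
    """Estimate K as the number of negative eigenvalues of H(r).

    Assumption: r = sqrt(average degree); other choices of r exist.
    """
    avg_degree = A.sum() / A.shape[0]
    r = np.sqrt(avg_degree)
    eigvals = np.linalg.eigvalsh(bethe_hessian(A, r))
    return int(np.sum(eigvals < 0))


rng = np.random.default_rng(0)
A = sample_sbm([100, 100, 100], p_in=0.20, p_out=0.02, rng=rng)
k_hat = estimate_k(A)
print("estimated number of communities:", k_hat)
```

On a deterministic toy input, two disjoint 5-cliques, every degree is 4, so r = 2 and H(r) = 7I - 2A; the adjacency eigenvalue 4 (one per clique) maps to -1 and the eigenvalue -1 maps to 9, so the sketch returns 2. On sampled networks the count can deviate from the true K, which is why finite-sample guarantees of the kind studied here matter.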

1. INTRODUCTION

Statistical analysis of network data has now become an extensively studied field within statistics and machine learning (see Goldenberg et al., 2010; Kolaczyk & Csárdi, 2014; Newman, 2018, for reviews). Network datasets show up in several disciplines. Examples include networks originating from biosciences, such as gene regulation networks (Emmert-Streib et al. (2014)), protein-protein interaction networks (De Las Rivas & Fontanillo (2010)), structural (Rubinov & Sporns (2010)) and functional (Friston (2011)) networks of the brain, and epidemiological networks (Reis et al. (2007)); networks originating from social media such as Facebook, Twitter and LinkedIn (Faloutsos et al. (2010)); citation and collaboration networks (Lehmann et al. (2003)); and information and technological networks such as internet-based networks (Adamic & Glance (2005)), power networks (Pagani & Aiello (2013)) and cell-tower networks (Isaacman et al. (2011)).

There are several active areas of research in developing statistical methodologies for network data analysis and in deriving the theoretical properties of these methods. In this paper, we focus on networks with community structure and on finding the number of communities in networks with an arbitrary sparsity level. The last two decades saw a resurgence of interest in a problem popularly known as "community detection". A common problem definition is to partition N nodes in a graph into K communities such that the edge density within communities differs from the edge density between communities, where K is assumed to be known a priori.

Estimating the number of communities (K) has recently become an active topic in the literature. While the initial focus in the literature on estimating K was on developing algorithms supported by domain-specific intuition and empirical studies using the Stochastic Block Model (SBM), first proposed in Holland et al. (1983) (such as Saade et al. (2014a) and Yan et al. (2018)), there has been recent progress toward a theoretical understanding of the number of communities. Bickel & Sarkar (2015) and Lei et al. (2016) proposed hypothesis testing approaches based on principal eigenvalues or singular values. Some likelihood-based methods using the BIC criterion were proposed by Wang et al. (2017) and Hu et al. (2019). From a Bayesian perspective, Riolo et al. (2017) discussed priors for the number of communities under the SBM and designed a Markov chain Monte Carlo algorithm, Kemp et al. (2006) presented a nonparametric Bayesian approach for detecting concept systems, Xu et al. (2006) introduced an infinite-state latent

