COMBINATORIAL-PROBABILISTIC TRADE-OFF: P-VALUES OF COMMUNITY PROPERTIES TEST IN THE STOCHASTIC BLOCK MODELS

Abstract

We propose an inferential framework for testing general combinatorial community properties of the stochastic block model. We aim to test whether a certain community property is satisfied, e.g., whether a given set of nodes belongs to the same community, and to provide p-values for uncertainty quantification. Our framework is applicable to all symmetric community properties. To ease the challenges caused by the combinatorial nature of community properties, we develop a novel shadowing bootstrap method. By exploiting the symmetry, our method finds a shadowing representative of the true assignment, which greatly reduces the number of assignments in the alternative that must be tested. In theory, we introduce a combinatorial distance between two community classes and show a combinatorial-probabilistic trade-off phenomenon. Our test is honest as long as the product of the combinatorial distance between two community property classes and the probabilistic distance between two connection probabilities is sufficiently large. In addition, we show that this trade-off also exists in the information-theoretic lower bound. We also conduct numerical experiments to show the validity of our method.

1. INTRODUCTION

Clustering, the presence of node communities in the underlying graph, is an important feature in network studies. A community structure partitions the nodes into subgroups within which a higher level of connectivity is observed. Its broad spectrum of applications includes sociology (Wasserman & Faust, 1994), biology (Barabási & Oltvai, 2004), physics (Newman, 2003), and the Internet (Albert et al., 1999). The stochastic block model (SBM) (Holland et al., 1983) is one of the most widely studied statistical models of network community structure. It is a random graph model that divides the nodes into disjoint communities and assigns the probability of connection between two nodes according to their community memberships. One of the central problems studied for the SBM is community detection. However, most existing research has focused on estimating the community labeling without uncertainty quantification (Choi et al., 2012; Mossel et al., 2012; Airoldi et al., 2013; Massoulié, 2014; Abbe et al., 2016; Mossel et al., 2016). Some fundamental limits of community recovery have also been established. For example, Abbe et al. (2016) showed the optimal phase transition for the exact recovery of the community assignments using maximum likelihood. The semi-definite relaxation methods (Abbe et al., 2016; Hajek et al., 2016; Agarwal et al., 2017; Bandeira, 2018) and the spectral methods (Yun & Proutiere, 2014; Abbe & Sandon, 2015; Gao et al., 2017; Abbe et al., 2020) have also been shown to be optimal for exact recovery. Beyond exact recovery, Zhang & Zhou (2016) quantified the statistical rate of community estimation via the mismatch ratio and established the minimax rate of the mismatch ratio for community detection.
In summary, community estimation studies have two major limitations: 1) they do not provide p-values to evaluate the uncertainty of the estimation, which are essential in many scientific applications, and 2) they require the recovery of community assignments for all nodes, whereas in many scientific applications we are interested in the community properties of a specific subset of nodes, e.g., whether two sets of nodes belong to the same community. We formulate the following examples of statistical hypotheses for illustration.

