COMBINATORIAL-PROBABILISTIC TRADE-OFF: P-VALUES OF COMMUNITY PROPERTIES TEST IN THE STOCHASTIC BLOCK MODELS

Abstract

We propose an inferential framework for testing general combinatorial community properties of the stochastic block model. We aim to test whether a given community property is satisfied, e.g., whether a given set of nodes belongs to the same community, and to provide p-values for uncertainty quantification. Our framework is applicable to all symmetric community properties. To ease the challenges caused by the combinatorial nature of community properties, we develop a novel shadowing bootstrap method. Exploiting the symmetry, our method finds a shadowing representative of the true assignment, which greatly reduces the number of assignments that must be tested in the alternative. In theory, we introduce a combinatorial distance between two community classes and show a combinatorial-probabilistic trade-off phenomenon. Our test is honest as long as the product of the combinatorial distance between the two community property classes and the probabilistic distance between the two connection probabilities is sufficiently large. Moreover, we show that the same trade-off also appears in the information-theoretic lower bound. We also conduct numerical experiments to show the validity of our method.

1. INTRODUCTION

Clustering is an important feature in network studies; it refers to the presence of node communities in the underlying graph. Communities partition the nodes into subgroups within which a higher level of connectivity is observed. The broad spectrum of applications includes sociology (Wasserman & Faust, 1994), biology (Barabási & Oltvai, 2004), physics (Newman, 2003) and the internet (Albert et al., 1999). The stochastic block model (SBM) (Holland et al., 1983) is one of the most widely studied statistical models of network community structure. It is a random graph model that divides the nodes into disjoint communities and assigns the probability of connection between two nodes according to their community memberships. One of the central problems in previous studies of the SBM is community detection. However, most of the existing research has focused on estimating the community labeling without uncertainty quantification (Choi et al., 2012; Mossel et al., 2012; Airoldi et al., 2013; Massoulié, 2014; Abbe et al., 2016; Mossel et al., 2016). Some fundamental limits of community recovery have also been established. For example, Abbe et al. (2016) showed the optimal phase transition for the exact recovery of the community assignments using maximum likelihood. The semi-definite relaxation methods (Abbe et al., 2016; Hajek et al., 2016; Agarwal et al., 2017; Bandeira, 2018) and the spectral methods (Yun & Proutiere, 2014; Abbe & Sandon, 2015; Gao et al., 2017; Abbe et al., 2020) have also been shown to be optimal for exact recovery. Beyond exact recovery, Zhang & Zhou (2016) quantified the statistical rate of community estimation via the mismatch ratio and established the minimax rate of the mismatch ratio for community detection.
In summary, community estimation studies have two major limitations: 1) they do not provide p-values to evaluate the uncertainty of the estimation, which are essential in many scientific applications, and 2) they require the recovery of community assignments for all nodes, while in many scientific applications we are interested in the community properties of a specific subset of nodes, e.g., whether two sets of nodes belong to the same community. We formulate the following examples of statistical hypotheses for illustration.

Example 1.1 (Same community test for m nodes). We want to test whether $m$ given nodes are in the same community. Without loss of generality, we consider the hypotheses

$H_0$: Nodes $1, \ldots, m$ belong to the same community,
$H_1$: There exist two nodes $1 \le j \ne k \le m$ belonging to two different communities.

Example 1.2 (Group community test). For two groups of nodes, we know a priori that the nodes within each group belong to the same community. We aim to further test whether these two groups belong to the same community. We denote one node set by $S_m = \{1, \ldots, m\}$ and the other by $S_{m'} = \{m + 1, \ldots, m + m'\}$. The group community hypotheses are

$H_0$: Nodes in $S_m \cup S_{m'}$ belong to the same community,
$H_1$: Nodes in $S_m$ belong to community $a$, but nodes in $S_{m'}$ belong to community $b \ne a$.

Example 1.3 (Equal-sized communities test). Given an SBM with $n$ nodes and $K$ communities, we aim to test whether all communities have the same size. Namely, we aim to test the hypotheses

$H_0$: Each community has size $n/K$,
$H_1$: At least one of the communities has size not equal to $n/K$.

In order to conduct hypothesis tests including the above examples, we develop a general community property test. We consider the SBM with $n$ nodes and $K$ communities. Denote the community assignment of the nodes by $z = (z(1), \ldots, z(n)) \in \{1, \ldots, K\}^n$.
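The null hypotheses in Examples 1.1 to 1.3 can each be read as a predicate on an assignment vector $z$, i.e., as a membership check for the null class. The following is a minimal Python sketch of that reading only; it is not the testing procedure of this paper. The function names `same_community`, `groups_merge` and `equal_sized` are hypothetical, labels are assumed to be integers in 1..K, and node indices are assumed to be 0-based.

```python
from collections import Counter
from typing import Sequence


def same_community(z: Sequence[int], nodes: Sequence[int]) -> bool:
    """Null of Example 1.1: all listed nodes (0-based indices) share one label."""
    return len({z[i] for i in nodes}) <= 1


def groups_merge(z: Sequence[int], group_a: Sequence[int], group_b: Sequence[int]) -> bool:
    """Null of Example 1.2: the union of the two groups lies in one community."""
    return same_community(z, list(group_a) + list(group_b))


def equal_sized(z: Sequence[int], K: int) -> bool:
    """Null of Example 1.3: each of the K communities has exactly n/K nodes."""
    n = len(z)
    if n % K != 0:
        return False
    counts = Counter(z)
    return all(counts.get(k, 0) == n // K for k in range(1, K + 1))


# Illustrative assignment with n = 6 nodes and K = 2 communities.
z = [1, 1, 1, 2, 2, 2]
print(same_community(z, [0, 1, 2]))   # nodes 0, 1, 2 all carry label 1
print(groups_merge(z, [0, 1], [3, 4]))  # the two groups carry different labels
print(equal_sized(z, 2))              # both communities have 3 nodes
```

In practice $z$ is unobserved, so these predicates only describe the classes being compared; the test itself must decide between them from the observed graph.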
The homogeneous SBM assumes that the edges of the random graph are independent Bernoulli random variables with connection probability $p$ if $z(i) = z(j)$ and $q$ if $z(i) \ne z(j)$. We reparameterize $p, q$ as $p = \rho_n \lambda_1$ and $q = \rho_n \lambda_2$, where $\lambda_1$ and $\lambda_2$ are constants independent of $n$, and $\rho_n$ is the signal strength. Let $C_0, C_1 \subset \{1, \ldots, K\}^n$ be two disjoint community assignment families. We are interested in the general community property test:

$$H_0 : z \in C_0 \quad \text{versus} \quad H_1 : z \in C_1. \tag{1.1}$$

We characterize the hardness of the test by two kinds of "distances": the probabilistic distance between $p$ and $q$, and the combinatorial distance between $C_0$ and $C_1$. The existing literature on community detection has focused only on the probabilistic distance, e.g., the Rényi divergence $I(p, q) = -2 \log\left(\sqrt{pq} + \sqrt{(1 - p)(1 - q)}\right)$ (Zhang & Zhou, 2016). In comparison, our paper introduces a novel combinatorial distance between $C_0$ and $C_1$, denoted by $d(C_0, C_1)$ (see Definition 2.4), and proposes a general testing method that is honest and powerful when

Combinatorial-Probabilistic Trade-Off:
$$I(p, q)\, d(C_0, C_1) = \Omega(n^{\epsilon}) \tag{1.2}$$

for some arbitrarily small $\epsilon > 0$. On the other hand, we show the minimax lower bound of the test in the sense that $H_0$ and $H_1$ in (1.1) cannot be differentiated when $I(p, q)\, d(C_0, C_1) \le c \log n$ for some constant $c > 0$ (see Theorem 3.2 and Theorem 4.1). The product of $I(p, q)$ and $d(C_0, C_1)$ reveals the trade-off between the probabilistic distance and the combinatorial distance in the general community property test.
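To make the probabilistic distance concrete, the sketch below evaluates the Rényi divergence $I(p, q)$ under the reparameterization $p = \rho_n \lambda_1$, $q = \rho_n \lambda_2$. The function names `renyi_divergence` and `connection_probs` and the numerical values are illustrative assumptions, not part of the paper. The combinatorial distance $d(C_0, C_1)$ is not computed here, since its definition (Definition 2.4) is not reproduced in this section.

```python
import math
from typing import Tuple


def renyi_divergence(p: float, q: float) -> float:
    """I(p, q) = -2 log( sqrt(pq) + sqrt((1-p)(1-q)) ) for p, q in [0, 1]."""
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must lie in [0, 1]")
    s = math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
    if s == 0.0:
        # Happens only for {p, q} = {0, 1}: the two edge laws are disjoint.
        return math.inf
    return -2.0 * math.log(s)


def connection_probs(rho_n: float, lam1: float, lam2: float) -> Tuple[float, float]:
    """Reparameterization p = rho_n * lam1, q = rho_n * lam2."""
    return rho_n * lam1, rho_n * lam2


# Hypothetical values: the divergence shrinks as the signal strength rho_n decreases.
for rho_n in (1e-1, 1e-2, 1e-3):
    p, q = connection_probs(rho_n, lam1=3.0, lam2=1.0)
    print(rho_n, renyi_divergence(p, q))
```

For small $\rho_n$, a first-order expansion gives $I(p, q) \approx \rho_n (\sqrt{\lambda_1} - \sqrt{\lambda_2})^2$, so a sparser graph has to be compensated by a larger combinatorial distance for the product in (1.2) to stay large.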

2. COMMUNITY PROPERTIES OF THE STOCHASTIC BLOCK MODEL

In our paper, we consider the fixed assignment stochastic block model, denoted by $M(n, K, p, q, z)$. Denote $[n] = \{1, \ldots, n\}$ for any integer $n$. For simplicity, we start with the scenario where the community sizes are even, and will generalize to the uneven case in Appendix B. We denote the even assignment class by $K_n := \{z \in [K]^n : |\{i : z(i) = k\}| = n/K, \ \forall k \in [K]\}$. In our paper, we assume $K$ to be bounded. Let $A \in \{0, 1\}^{n \times n}$ be the symmetric adjacency matrix of the random graph generated from the SBM. In the remainder of the paper, we study the community property test given an observation of the adjacency matrix $A \sim M(n, K, p, q, z)$.
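The sampling model can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the authors' implementation: `even_assignment` and `sample_sbm` are hypothetical names, the graph is assumed to have no self-loops (the diagonal of $A$ is set to zero), and the adjacency matrix is stored as a list of lists.

```python
import random
from typing import List, Sequence


def even_assignment(n: int, K: int) -> List[int]:
    """One element of the even class: labels 1..K, each used exactly n/K times."""
    if n % K != 0:
        raise ValueError("n must be divisible by K")
    return [k for k in range(1, K + 1) for _ in range(n // K)]


def sample_sbm(z: Sequence[int], p: float, q: float, seed: int = 0) -> List[List[int]]:
    """Draw a symmetric 0/1 adjacency matrix from the homogeneous SBM.

    Each pair i < j is connected independently with probability p if
    z[i] == z[j] and q otherwise. Assumption: no self-loops.
    """
    rng = random.Random(seed)
    n = len(z)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if z[i] == z[j] else q
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1  # keep A symmetric
    return A


z = even_assignment(n=6, K=3)
A = sample_sbm(z, p=0.8, q=0.1, seed=42)
for row in A:
    print(row)
```

The double loop costs $O(n^2)$ time, which is adequate for small illustrations; a vectorized sampler would be preferable for large $n$.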

2.1. SYMMETRIC COMMUNITY PROPERTIES

In this section, we aim to define the community property and the distance between two community families. In general, we say a community property is a subset of $[K]^n$. However, such a definition is too general and may include some ill-posed examples. For instance, if we can transfer one



We refer to Theorem 3.2 and Theorem 4.1 for the rigorous arguments about the upper and lower bounds.

