PHASE TRANSITION FOR DETECTING A SMALL COMMUNITY IN A LARGE NETWORK

Abstract

How to detect a small community in a large network is an interesting problem, including clique detection as a special case, where a naive degree-based $\chi^2$-test was shown to be powerful in the presence of an Erdős-Rényi background. Using Sinkhorn's theorem, we show that the signal captured by the $\chi^2$-test may be a modeling artifact, and it may disappear once we replace the Erdős-Rényi model by a broader network model. We show that the recent SgnQ test is more appropriate for such a setting. The test is optimal in detecting communities with sizes comparable to the whole network, but has never been studied for our setting, which is substantially different and more challenging. Using a degree-corrected block model (DCBM), we establish phase transitions of this testing problem concerning the size of the small community and the edge densities in small and large communities. When the size of the small community is larger than $\sqrt{n}$, the SgnQ test is optimal for it attains the computational lower bound (CLB), the information lower bound for methods allowing polynomial computation time. When the size of the small community is smaller than $\sqrt{n}$, we establish the parameter regime where the SgnQ test has full power and make some conjectures of the CLB. We also study the classical information lower bound (LB) and show that there is always a gap between the CLB and LB in our range of interest.

1. INTRODUCTION

Consider an undirected network with $n$ nodes and $K$ communities. We assume $n$ is large and, for convenience, that the network is connected. We are interested in testing whether $K = 1$ or $K > 1$ when the sizes of some of the communities are much smaller than $n$ (communities are scientifically meaningful but mathematically hard to define; intuitively, they are clusters of nodes that have more edges "within" than "across" (Jin, 2015; Zhao et al., 2012)). The problem is a special case of network global testing, a topic that has received a lot of attention (e.g., Jin et al. (2018; 2021b)). However, existing works focused on the so-called balanced case, where the sizes of communities are at the same order. Our case is severely unbalanced, where the sizes of some communities are much smaller than $n$ (e.g., $n^{\varepsilon}$). The problem also includes clique detection (a problem of primary interest in graph learning (Alon et al., 1998; Ron & Feige, 2010)) as a special case. Along this line, Arias-Castro & Verzelen (2014) and Verzelen & Arias-Castro (2015) have made remarkable progress, showing that a naive degree-based $\chi^2$-test is optimal in an Erdős-Rényi setting when the clique size is in a certain range. However, the signal captured by the $\chi^2$-test may be a modeling artifact: it may disappear once we replace the models in Arias-Castro & Verzelen (2014); Verzelen & Arias-Castro (2015) by a properly broader model. When this happens, the $\chi^2$-test will be asymptotically powerless in the whole range of parameter space.

We explain the idea with the popular Degree-Corrected Block Model (DCBM) (Karrer & Newman, 2011), though it is valid in broader settings. Let $A \in \mathbb{R}^{n,n}$ be the network adjacency matrix, where $A(i,j) \in \{0,1\}$ indicates whether there is an edge between nodes $i$ and $j$, $1 \le i, j \le n$. By convention, we do not allow for self-edges, so the diagonals of $A$ are always 0. Suppose there are $K$ communities, $C_1, \ldots, C_K$. For each node $i$, $1 \le i \le n$, we use a parameter $\theta_i$ to model the degree heterogeneity and a vector $\pi_i$ to model the membership: when $i \in C_k$, $\pi_i(\ell) = 1$ if $\ell = k$ and $\pi_i(\ell) = 0$ otherwise.
For a $K \times K$ symmetric and irreducible non-negative matrix $P$ that models the community structure, DCBM assumes that the upper triangle of $A$ contains independent Bernoulli random variables satisfying

$\mathbb{P}(A(i,j) = 1) = \theta_i \theta_j \pi_i' P \pi_j, \qquad 1 \le i, j \le n.$ (1.1)

In practice, we interpret $P(k,\ell)$ as the baseline connecting probability between communities $k$ and $\ell$. Write $\theta = (\theta_1, \theta_2, \ldots, \theta_n)'$, $\Pi = [\pi_1, \pi_2, \ldots, \pi_n]'$, and $\Theta = \mathrm{diag}(\theta) \equiv \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_n)$. Introduce $n \times n$ matrices $\Omega$ and $W$ by $\Omega = \Theta \Pi P \Pi' \Theta$ and $W = A - \mathbb{E}[A]$. We can re-write (1.1) as

$A = \Omega - \mathrm{diag}(\Omega) + W.$ (1.2)

We call $\Omega$ the Bernoulli probability matrix and $W$ the noise matrix. When the $\theta_i$ in the same community are equal, DCBM reduces to the Stochastic Block Model (SBM) (Holland et al., 1983). When $K = 1$, the SBM reduces to the Erdős-Rényi model, where $\Omega(i,j)$ takes the same value for all $1 \le i, j \le n$.

We first describe why the signal captured by the $\chi^2$-test in Arias-Castro & Verzelen (2014); Verzelen & Arias-Castro (2015) is a modeling artifact. Using Sinkhorn's matrix scaling theorem (Sinkhorn, 1974), it is possible to build a null DCBM with $K = 1$ that has no community structure and an alternative DCBM with $K \ge 2$ and clear community structure such that the two models have the same expected degrees. Thus, we do not expect that a degree-based test such as the $\chi^2$-test can tell them apart. We make this Sinkhorn argument precise in Section 2.1 and show the failure of the $\chi^2$-test in Theorem 2.3. In the Erdős-Rényi setting in Arias-Castro & Verzelen (2014), the null has one parameter and the alternative has two parameters. In such a setting, we cannot have degree-matching. In these cases, a naive degree-based $\chi^2$-test may have good power, but it is due to the very specific models they choose. For clique detection in more realistic settings, we prefer to use a broader model such as the DCBM, where by the degree-matching argument above, the $\chi^2$-test is asymptotically powerless.
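To make the model (1.1)-(1.2) concrete, the following is a minimal sketch of how one might build $\Omega$ and draw an adjacency matrix from a DCBM. The function names, the seed, and all numerical values ($n$, $N$, the range of $\theta$, the entries of $P$) are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def dcbm_probability_matrix(theta, labels, P):
    """Omega = Theta Pi P Pi' Theta for hard labels in {0, ..., K-1} (hypothetical helper)."""
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    P = np.asarray(P, dtype=float)
    # Entry (i, j) equals theta_i * theta_j * P[label_i, label_j].
    return np.outer(theta, theta) * P[np.ix_(labels, labels)]

def sample_dcbm(theta, labels, P, rng):
    """Draw a symmetric 0/1 adjacency matrix with zero diagonal, following (1.1)."""
    omega = dcbm_probability_matrix(theta, labels, P)
    if omega.min() < 0 or omega.max() > 1:
        raise ValueError("all edge probabilities must lie in [0, 1]")
    n = omega.shape[0]
    # Independent Bernoulli draws on the strict upper triangle, then symmetrize.
    upper = np.triu(rng.random((n, n)) < omega, k=1)
    return (upper | upper.T).astype(int)

# Illustrative (assumed) parameters: one small and one large community.
rng = np.random.default_rng(0)
n, N = 200, 20
labels = np.array([0] * N + [1] * (n - N))    # label 0 marks the small community
theta = rng.uniform(0.8, 1.2, size=n)
theta *= n / theta.sum()                       # normalize so that ||theta||_1 = n
P = np.array([[0.4, 0.1], [0.1, 0.1]])

A = sample_dcbm(theta, labels, P, rng)
omega = dcbm_probability_matrix(theta, labels, P)
W = A - (omega - np.diag(np.diag(omega)))      # noise matrix in (1.2)
```

Because self-edges are excluded, the mean of $A$ is $\Omega - \mathrm{diag}(\Omega)$ rather than $\Omega$ itself, which is why the diagonal is subtracted before forming $W$.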
This motivates us to look for a different test. One candidate is the scan statistic (Bogerd et al., 2021). However, a scan statistic is only computationally feasible when each time we scan a very small subset of nodes. For example, if each time we only scan a finite number of nodes, then the computational cost is polynomial; we call the test the Economic Scan Test (EST). Another candidate may come from the Signed-Polygon test family (Jin et al., 2021b), including the Signed-Quadrilateral (SgnQ) as a special case. Let $\hat\eta = (1_n' A 1_n)^{-1/2} A 1_n$ and $\hat A = A - \hat\eta \hat\eta'$. Define

$Q_n = \sum_{i_1, i_2, i_3, i_4 (dist)} \hat A_{i_1 i_2} \hat A_{i_2 i_3} \hat A_{i_3 i_4} \hat A_{i_4 i_1},$

where the shorthand (dist) indicates we sum over distinct indices. The SgnQ test statistic is

$\psi_n = \bigl[Q_n - 2(\|\hat\eta\|^2 - 1)^2\bigr] \big/ \sqrt{8(\|\hat\eta\|^2 - 1)^4}.$ (1.3)

SgnQ is computationally attractive because it can be evaluated in time $O(n^2 d)$, where $d$ is the average degree of the network (Jin et al., 2021b). Moreover, it was shown in Jin et al. (2021b) that (a) when $K = 1$ (the null case), $\psi_n \to N(0,1)$, and (b) when $K > 1$ and all communities are at the same order (i.e., a balanced alternative case), the SgnQ test achieves the classical information lower bound (LB) for global testing and so is optimal.

Unfortunately, our case is much more delicate: the signal of interest is contained in a community with a size that is much smaller than $n$ (e.g., $n^{\varepsilon}$), so the signal can be easily overshadowed by the noise term of $Q_n$. Even in the simple alternative case where we only have two communities (with sizes $N$ and $n - N$), it is unclear (a) how the lower bounds vary as $N/n \to 0$, and especially whether there is a gap between the computational lower bound (CLB) and the classical information lower bound (LB), and (b) to what extent the SgnQ test attains the CLB and so is optimal.
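The definition of $\psi_n$ in (1.3) can be sketched in a few lines of NumPy. This is an illustrative dense implementation under stated assumptions, not the authors' code: it uses an inclusion-exclusion identity for the sum over distinct indices (my own derivation, checked against a brute-force enumeration), it costs $O(n^3)$ rather than the $O(n^2 d)$ cited above, and every function name is hypothetical.

```python
import itertools
import math
import numpy as np

def sgnq_cycle_sum_bruteforce(M):
    """Reference O(n^4) sum of M[i,j]*M[j,k]*M[k,l]*M[l,i] over distinct (i, j, k, l)."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    return sum(M[i, j] * M[j, k] * M[k, l] * M[l, i]
               for i, j, k, l in itertools.permutations(range(n), 4))

def sgnq_cycle_sum(M):
    """Same sum for a symmetric M, via inclusion-exclusion on closed 4-walks."""
    B = np.array(M, dtype=float)
    np.fill_diagonal(B, 0.0)              # distinct indices never use diagonal entries
    B2 = B * B
    closed_walks = np.trace(np.linalg.matrix_power(B, 4))
    row_sq = B2.sum(axis=1)
    # Remove walks with i1 == i3 or i2 == i4, then add back those with both.
    return closed_walks - 2.0 * np.sum(row_sq ** 2) + np.sum(B2 ** 2)

def sgnq_statistic(A):
    """psi_n from (1.3) for a symmetric 0/1 adjacency matrix with zero diagonal."""
    A = np.asarray(A, dtype=float)
    ones = np.ones(A.shape[0])
    eta = (A @ ones) / math.sqrt(ones @ A @ ones)
    A_hat = A - np.outer(eta, eta)
    q_n = sgnq_cycle_sum(A_hat)
    s = eta @ eta - 1.0
    return (q_n - 2.0 * s ** 2) / math.sqrt(8.0 * s ** 4)

def one_sided_p_value(psi):
    """P(N(0,1) >= psi), the normal approximation suggested by the limiting null."""
    return 0.5 * math.erfc(psi / math.sqrt(2.0))

# Illustrative use on a simulated Erdos-Renyi graph (assumed n and edge probability).
rng = np.random.default_rng(1)
n = 60
upper = np.triu(rng.random((n, n)) < 0.2, k=1)
A = (upper | upper.T).astype(float)
psi = sgnq_statistic(A)
p_value = one_sided_p_value(psi)
```

The brute-force routine is only there as a correctness reference for small matrices; for networks of realistic size one would use the closed-form sum or a sparse implementation.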

1.1. RESULTS AND CONTRIBUTIONS

We consider the problem of detecting a small community in the DCBM. In this work, we specifically focus on the case $K = 2$ as this problem already displays a rich set of phase transitions, and we believe it captures the essential behavior for constant $K > 1$. Let $N \ll n$ denote the size of this small community under the alternative. Our first contribution analyzes the power of SgnQ for this problem, extending results of Jin et al. (2021b) that focus on the balanced case. Let $\lambda_1 = \lambda_1(\Omega)$. In Section 2.2, we define a population counterpart $\tilde\Omega$ of $\hat A$ and let $\tilde\lambda_1 = \lambda_1(\tilde\Omega)$. We show that SgnQ has full power if $\tilde\lambda_1 / \sqrt{\lambda_1} \to \infty$, which reduces to $N(a - c)/\sqrt{nc} \to \infty$ in the SBM case.

For optimality, we obtain a computational lower bound (CLB), relying on the low-degree polynomial conjecture, which is a standard approach in studying CLB (e.g., Kunisky et al. (2019)). Consider a case where $K = 2$ and we have a small community with size $N$. Suppose the edge probabilities within the community and outside the community are $a$ and $c$, where $a > c$. The quantity $(a - c)/\sqrt{c}$ acts as the node-wise signal-to-noise ratio (SNR) for the detection problem. When $N \gg \sqrt{n}$, we find that the CLB is completely determined by $N$ and the node-wise SNR; moreover, SgnQ matches the CLB and is optimal. When $N \ll \sqrt{n}$, the situation is more subtle: if the node-wise SNR $(a - c)/\sqrt{c} \to 0$ (weak signal case), we show the problem is computationally hard and the LB depends on $N$ and the node-wise SNR. If $(a - c)/\sqrt{c} \gg n^{1/2}$ (strong signal case), then SgnQ solves the detection problem. In the range $1 \ll (a - c)/\sqrt{c} \ll n^{1/2}$ (moderate signal case), the CLB depends on not only $N$ and the node-wise SNR but also the background edge density $c$. In this regime, we make conjectures of the CLB, from the study of the aforementioned economic scan test (EST). Our results are summarized in Figure 1 and explained in full detail in Section 2.7.
We also obtain the classical information lower bound (LB), and discover that as $N/n \to 0$, there is a big gap between the CLB and the LB. Notably, the LB is achieved by an (inefficient) signed scan test. In the balanced case in Jin et al. (2021b), the SgnQ test is optimal among all tests (even those that are allowed unbounded computation time), and such a gap does not exist. We also show that the naive degree-based $\chi^2$-test is asymptotically powerless due to the aforementioned degree-matching phenomenon. Our statistical lower bound, computational lower bound, and the powerlessness of the $\chi^2$-test based on degree-matching are also valid for all $K > 2$ since any model with $K \ge 2$ contains $K = 2$ as a special case. We also expect that our lower bounds are tight for these broader models and that our lower bound constructions for $K = 2$ represent the least favorable cases when community sizes are severely unbalanced.

Compared to Verzelen & Arias-Castro (2015); Arias-Castro & Verzelen (2014), we consider network global testing in a more realistic setting, and show that optimal tests there (i.e., a naive degree-based $\chi^2$-test) may be asymptotically powerless here. Compared with Bogerd et al. (2021), our setting is very different (they considered a setting where both the null and alternative are DCBM with $K = 1$). Compared to the study in the balanced case (e.g., Jin et al. (2018; 2021b); Gao & Lafferty (2017)), our study is more challenging for two reasons. First, in the balanced case, there is no gap between the UB (the upper bound provided by the SgnQ test) and the LB, so there is no need to derive the CLB, which is usually technically demanding. Second, the size of the smaller community can get as small as $n^{\varepsilon}$, where $\varepsilon > 0$ is any constant. Due to this imbalance in community sizes, the techniques of Jin et al. (2021b) do not directly apply.
As a result, our proof involves the careful study of the 256 terms that compose SgnQ, which requires using bounds tailored specifically for the severely unbalanced case. Our study of the CLB is connected to that of Hajek et al. (2015) in the Erdős-Rényi setting of Arias-Castro & Verzelen (2014). Hajek et al. (2015) proved via computational reducibility that the naive $\chi^2$-test is the optimal polynomial-time test (conditionally on the planted clique hypothesis). We also note the work of Chen & Xu (2016), which studied a $K$-cluster generalization of the Erdős-Rényi model of Arias-Castro & Verzelen (2014); Verzelen & Arias-Castro (2015) and provided conjectures of the CLB. Compared to our setting, these models are very different because the expected degree profiles of the null and alternative differ significantly. In this work we consider the DCBM, where due to the subtle phenomenon of degree matching between the null and alternative hypotheses, both the CLB and the LB are different from those obtained by Hajek et al. (2015).

Notations:

We use $1_n$ to denote an $n$-dimensional vector of ones. For a vector $\theta = (\theta_1, \ldots, \theta_n)$, $\mathrm{diag}(\theta)$ is the diagonal matrix where the $i$-th diagonal entry is $\theta_i$. For a matrix $\Omega \in \mathbb{R}^{n \times n}$, $\mathrm{diag}(\Omega)$ is the diagonal matrix where the $i$-th diagonal entry is $\Omega(i,i)$. For a vector $\theta \in \mathbb{R}^n$, $\theta_{\max} = \max\{\theta_1, \ldots, \theta_n\}$ and $\theta_{\min} = \min\{\theta_1, \ldots, \theta_n\}$. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \asymp b_n$ if $c_1 \le a_n/b_n \le c_2$ for constants $c_2 > c_1 > 0$. We say $a_n \sim b_n$ if $a_n/b_n = 1 + o(1)$.

2. MAIN RESULTS

In Section 2.1, following our discussion on Sinkhorn's theorem in Section 1, we introduce calibrations (including conditions on identifiability and balance) that are appropriate for severely unbalanced DCBM and illustrate with some examples. In Sections 2.2-2.3, we analyze the power of the SgnQ test and compare it with the $\chi^2$-test. In Sections 2.4-2.5, we discuss the information lower bounds (both the LB and CLB) and show that the SgnQ test is optimal among polynomial-time tests when $N \gg \sqrt{n}$. In Section 2.6, we study the EST and make some conjectures of the CLB when $N \ll \sqrt{n}$. In Section 2.7, we summarize our results and present the phase transitions.

2.1. DCBM FOR SEVERELY UNBALANCED NETWORKS: IDENTIFIABILITY, BALANCE METRICS, AND GLOBAL TESTING

In the DCBM (1.1)-(1.2), $\Omega = \Theta \Pi P \Pi' \Theta$. It is known that the matrices $(\Theta, \Pi, P)$ are not identifiable. One issue is that $(\Pi, P)$ are only unique up to a permutation: for a $K \times K$ permutation matrix $Q$, $\Pi P \Pi' = (\Pi Q)(Q' P Q)(\Pi Q)'$. This issue is easily fixable in applications so is usually neglected. A bigger issue is that $(\Theta, P)$ are not uniquely defined. For example, fixing a positive diagonal matrix $D \in \mathbb{R}^{K \times K}$, let $P^* = DPD$ and $\Theta^* = \mathrm{diag}(\theta_1^*, \theta_2^*, \ldots, \theta_n^*)$, where $\theta_i^* = \theta_i / D(k,k)$ if $i \in C_k$, $1 \le k \le K$. It is seen that $\Theta \Pi P \Pi' \Theta = \Theta^* \Pi P^* \Pi' \Theta^*$, so $(\Theta, P)$ are not uniquely defined.

To motivate our identifiability condition, we formalize the degree-matching argument discussed in the introduction. Fix $(\theta, P)$ and let $h = (h_1, \ldots, h_K)'$, where $h_k > 0$ is the fraction of nodes in community $k$, $1 \le k \le K$. By the main result of Sinkhorn (1974), there is a unique positive diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_K)$ such that $DPDh = 1_K$. Consider a pair of two DCBMs, a null with $K = 1$ and an alternative with $K > 1$, with parameters $\Omega = \Theta 1_n 1_n' \Theta \equiv \theta\theta'$ and $\Omega^*(i,j) = \theta_i^* \theta_j^* \pi_i' P \pi_j$ with $\theta_i^* = d_k \theta_i$ if $i \in C_k$, $1 \le k \le K$, respectively. Direct calculation shows that node $i$ has the same expected degree under the null and alternative.

There are many ways to resolve the issue. For example, in the balanced case (e.g., Jin et al. (2021b; 2022)), we can resolve it by requiring that $P$ has unit diagonals. However, for our case, this is inappropriate. Recall that, in practice, $P(k,\ell)$ represents the baseline connecting probability between communities $k$ and $\ell$. If we forcefully rescale $P$ to have a unit diagonal here, both $P$ and $\Theta$ lose their practical meanings. Motivated by the degree-matching argument, we propose an identifiability condition that is more appropriate for the severely unbalanced DCBM.
By our discussion in Section 1, for any DCBM with a Bernoulli probability matrix $\Omega$, we can always use Sinkhorn's theorem to define $(\Theta, P)$ (while $\Pi$ is unchanged) such that for the new $(\Theta, P)$, $\Omega = \Theta \Pi P \Pi' \Theta$ and $Ph \propto 1_K$, where $h = (h_1, \ldots, h_K)'$ and $h_k > 0$ is the fraction of nodes in community $k$, $1 \le k \le K$. This motivates the following identifiability condition (which is more appropriate for our case):

$\|\theta\|_1 = n, \qquad Ph \propto 1_K, \qquad$ where $h_k$ is the fraction of nodes in $C_k$, $1 \le k \le K$. (2.1)

Lemma 2.1. For any $\Omega$ that satisfies the DCBM (1.2) and has positive diagonal elements, we can always find $(\Theta, \Pi, P)$ such that $\Omega = \Theta \Pi P \Pi' \Theta$ and (2.1) holds. Also, any $(\Theta, P)$ that satisfy $\Omega = \Theta \Pi P \Pi' \Theta$ and (2.1) are unique.

Moreover, for network balance, the following two vectors in $\mathbb{R}^K$ are natural metrics:

$d = (\|\theta\|_1)^{-1} \Pi' \Theta 1_n, \qquad g = (\|\theta\|)^{-2} \Pi' \Theta^2 \Pi 1_K.$ (2.2)

In the balanced case (e.g., Jin et al. (2021b; 2022)), we usually assume the entries of $d$ and $g$ are at the same order. For our setting, this is not the case. Next we introduce the null and alternative hypotheses that we consider. Under each hypothesis, we impose the identifiability condition (2.1).

General null model for the DCBM. When $K = 1$ and $h = 1$, $P$ is a scalar (say, $P = \alpha$), and $\Omega = \alpha\theta\theta'$ satisfies $\|\theta\|_1 = n$ by (2.1). The expected total degree is $\alpha(\|\theta\|_1^2 - \|\theta\|^2) \sim \alpha\|\theta\|_1^2 = n^2\alpha$ under mild conditions, so we view $\alpha$ as the parameter for network sparsity. In this model, $d = g = 1$.

Alternative model for the DCBM. We assume $K = 2$ and that the sizes of the two communities, $C_0$ and $C_1$, are $n - N$ and $N$, respectively. For some positive numbers $a, b, c$, we have

$P = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, \qquad \Omega(i,j) = \begin{cases} \theta_i \theta_j \cdot a, & \text{if } i, j \in C_1, \\ \theta_i \theta_j \cdot c, & \text{if } i, j \in C_0, \\ \theta_i \theta_j \cdot b, & \text{otherwise.} \end{cases}$ (2.3)

In the classical clique detection problem (e.g., Bogerd et al. (2021)), $a$ and $c$ are the baseline probabilities that two nodes have an edge when both of them are in the clique and outside the clique, respectively. By (2.1), $a\epsilon + b(1 - \epsilon) = b\epsilon + c(1 - \epsilon)$ if we write $\epsilon = N/n$.
Therefore,

$b = \bigl(c(n - N) - aN\bigr) / (n - 2N).$ (2.4)

Note that this is the direct result of Sinkhorn's theorem and the parameter calibration we choose, not a condition we choose for technical convenience. Write $d = (d_0, d_1)'$ and $g = (g_0, g_1)'$. It is seen that $d_0 = 1 - d_1$, $g_0 = 1 - g_1$, $d_1 = \|\theta\|_1^{-1} \sum_{i \in C_1} \theta_i$, and $g_1 = \|\theta\|^{-2} \sum_{i \in C_1} \theta_i^2$. If all $\theta_i$ are at the same order, then $d_1 \asymp g_1 \asymp N/n$ and $d_0 \sim g_0 \sim 1$. We also observe that $b = c + O(a\epsilon)$ with $\epsilon = N/n$, which makes the problem seem very close to Arias-Castro & Verzelen (2014); Bogerd et al. (2021), although in fact the problems are quite different.

Extension. An extension of our alternative is that, for the $K$ communities, the sizes of $m$ of them are at the order of $N$, for an $N \ll n$ and an integer $m$, $1 \le m < K$, and the sizes of the remaining $K - m$ are at the order of $n$. In this case, $m$ entries of $d$ are $O(N/n)$ and the other entries are $O(1)$; the same holds for $g$.
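The calibration (2.4) and the resulting degree matching can be checked numerically. The sketch below works in the SBM special case $\theta = 1_n$ and uses made-up values of $n$, $N$, $a$, $c$ (assumptions for illustration only); it compares the row sums $\Omega 1_n$ under the alternative with those of the matched one-parameter null, ignoring the small diagonal correction caused by the absence of self-edges.

```python
import numpy as np

def off_diagonal_b(a, c, n, N):
    """Value of b implied by (2.4), i.e. by requiring P h to be proportional to 1_K."""
    return (c * (n - N) - a * N) / (n - 2 * N)

# Hypothetical parameter values, chosen only for illustration.
n, N, a, c = 1000, 50, 0.5, 0.1
b = off_diagonal_b(a, c, n, N)
eps = N / n

P = np.array([[a, b], [b, c]])      # first row/column: small community C_1
h = np.array([eps, 1.0 - eps])      # community fractions (C_1, C_0)
row_sums = P @ h                    # equal entries are exactly the condition P h ∝ 1_K

# SBM case theta = 1_n: alternative Omega versus the matched null alpha * 1 1'.
idx = np.array([0] * N + [1] * (n - N))
omega_alt = P[np.ix_(idx, idx)]
alpha = row_sums[0]
omega_null = np.full((n, n), alpha)

degree_alt = omega_alt.sum(axis=1)   # Omega 1_n under the alternative
degree_null = omega_null.sum(axis=1) # Omega 1_n under the null
max_gap = np.max(np.abs(degree_alt - degree_null))
```

With these values `b` stays positive and `max_gap` is zero up to floating-point error, so a statistic that depends on the data only through the degrees has essentially nothing to detect.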

2.2. THE SGNQ TEST: LIMITING NULL, P-VALUE, AND POWER

In the null case, $K = 1$ and we assume $\Omega = \alpha\theta\theta'$, where $\|\theta\|_1 = n$. As $n \to \infty$, both $(\alpha, \theta)$ may vary with $n$. Write $\theta_{\max} = \|\theta\|_\infty$. We assume

$n\alpha \to \infty, \qquad \text{and} \qquad \alpha\theta_{\max}^2 \log(n^2\alpha) \to 0.$ (2.5)

The following theorem is adapted from Jin et al. (2021b) and the proof is omitted.

Theorem 2.1 (Limiting null of the SgnQ statistic). Suppose the null hypothesis is true and the regularity conditions (2.1) and (2.5) hold. As $n \to \infty$, $\psi_n \to N(0,1)$ in law.

We have two comments. First, since the DCBM has many parameters (even in the null case), it is not an easy task to find a test statistic with a limiting null that is completely parameter free. For example, if we use the largest eigenvalue of $A$ as the test statistic, it is unclear how to normalize it so as to have such a limiting null. Second, since the limiting null is completely explicit, we can approximate the (one-sided) p-value of $\psi_n$ by $\mathbb{P}(N(0,1) \ge \psi_n)$. The p-values are useful in practice, as we show in our numerical experiments. For example, using a recent data set on the statisticians' publications (Ji et al., 2022), for each author, we can construct an ego network and apply the SgnQ test. We can then use the p-value to measure the co-authorship diversity of the author. Also, in many hierarchical community detection algorithms (which are presumably recursive, aiming to estimate the tree structure of communities), we can use the p-values to determine whether we should further divide a sub-community in each stage of the algorithm (e.g., Ji et al. (2022)).

The power of the SgnQ test hinges on the matrix $\tilde\Omega = \Omega - (1_n'\Omega 1_n)^{-1}\Omega 1_n 1_n'\Omega$. By basic algebra,

$\tilde\Omega = \Theta\Pi\tilde P\Pi'\Theta, \qquad \text{where } \tilde P = P - (d'Pd)^{-1}Pdd'P.$ (2.6)

Let $\tilde\lambda_1$ be the largest (in magnitude) eigenvalue of $\tilde\Omega$. Lemma 2.2 is proved in the supplement.

Lemma 2.2. The rank and trace of the matrix $\tilde\Omega$ are $K - 1$ and $\|\theta\|^2\,\mathrm{diag}(\tilde P)'g$, respectively. When $K = 2$,

$\tilde\lambda_1 = \mathrm{trace}(\tilde\Omega) = \|\theta\|^2 (ac - b^2)(d_0^2 g_1 + d_1^2 g_0) / (a d_1^2 + 2b d_0 d_1 + c d_0^2).$
As a result of this lemma, we observe that in the SBM case, $d = h$ and thus $\tilde\lambda_1 \asymp N(a - c)$. To see intuitively that the power of the SgnQ test hinges on $\tilde\lambda_1^4 / \lambda_1^2$, if we heuristically replace the terms of SgnQ by population counterparts, we obtain

$Q_n = \sum_{i_1, i_2, i_3, i_4 (dist)} \hat A_{i_1 i_2} \hat A_{i_2 i_3} \hat A_{i_3 i_4} \hat A_{i_4 i_1} \approx \mathrm{trace}([\Omega - \tilde\eta\tilde\eta']^4) = \mathrm{trace}(\tilde\Omega^4) = \tilde\lambda_1^4,$

where $\tilde\eta = (1_n'\Omega 1_n)^{-1/2}\Omega 1_n$ is the population counterpart of $\hat\eta$.

We now formally discuss the power of the SgnQ test. We focus on the alternative hypothesis in Section 2.1. Let $d = (d_1, d_0)'$ and $g = (g_1, g_0)'$ be as in (2.2), and let $\theta_{\max,0} = \max_{i \in C_0} \theta_i$ and $\theta_{\max,1} = \max_{i \in C_1} \theta_i$. Suppose

$d_1 \asymp g_1 \asymp N/n, \qquad a\theta_{\max,1}^2 = O(1), \qquad cn \to \infty, \qquad c\theta_{\max,0}^2 \log(n^2 c) \to 0.$ (2.7)

These conditions are mild. For example, when the $\theta_i$'s are at the same order, the first condition in (2.7) automatically holds, and the other conditions in (2.7) hold if $a \le C$ for an absolute constant $C > 0$, $cn \to \infty$, and $c\log(n) \to 0$. Fixing $0 < \kappa < 1$, let $z_\kappa > 0$ be the value such that $\mathbb{P}(N(0,1) \ge z_\kappa) = \kappa$.

Corollary 2.1. Suppose the same conditions of Theorem 2.2 hold, and additionally $\theta_{\max} \le C\theta_{\min}$ so all $\theta_i$ are at the same order. In this case, $\lambda_1 \asymp cn$ and $|\tilde\lambda_1| \asymp N(a - c)$, and the power of the level-$\kappa$ SgnQ test tends to 1 if $N(a - c)/\sqrt{cn} \to \infty$.

In Theorem 2.2 and Corollary 2.1, if $\kappa = \kappa_n$ and $\kappa_n \to 0$ slowly enough, then the results continue to hold, and the sum of Type I and Type II errors of the SgnQ test at level $\kappa_n$ tends to 0.

The power of the SgnQ test was only studied in the balanced case (Jin et al., 2021b), but our setting is a severely unbalanced case, where the community sizes are at different orders, as are the entries of $d$ and $g$. In the balanced case, the signal-to-noise ratio of SgnQ is governed by $|\lambda_2|/\sqrt{\lambda_1}$, but in our setting, the signal-to-noise ratio is governed by $|\tilde\lambda_1|/\sqrt{\lambda_1}$. The proof is also subtly different.
Since the entries of P are at different orders, many terms deemed negligible in the power analysis of the balanced case may become non-negligible in the unbalanced case and require careful analysis.
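As a sanity check on Lemma 2.2, the closed-form expression for $\tilde\lambda_1$ when $K = 2$ can be compared with a direct eigenvalue computation. The sketch below is illustrative only: the function names, the heterogeneous $\theta$, and the values of $n$, $N$, $a$, $c$ are assumptions, with $b$ set through (2.4).

```python
import numpy as np

def centered_omega(omega):
    """Omega_tilde = Omega - (1' Omega 1)^{-1} Omega 1 1' Omega."""
    row = omega.sum(axis=1)
    return omega - np.outer(row, row) / row.sum()

def lambda_tilde_formula(theta, in_c1, a, b, c):
    """K = 2 closed form from Lemma 2.2; in_c1 flags nodes of the small community C_1."""
    theta = np.asarray(theta, dtype=float)
    d1 = theta[in_c1].sum() / theta.sum()
    g1 = (theta[in_c1] ** 2).sum() / (theta ** 2).sum()
    d0, g0 = 1.0 - d1, 1.0 - g1
    numerator = (a * c - b ** 2) * (d0 ** 2 * g1 + d1 ** 2 * g0)
    denominator = a * d1 ** 2 + 2 * b * d0 * d1 + c * d0 ** 2
    return (theta ** 2).sum() * numerator / denominator

# Hypothetical parameters for illustration.
rng = np.random.default_rng(3)
n, N, a, c = 300, 30, 0.5, 0.1
b = (c * (n - N) - a * N) / (n - 2 * N)          # calibration (2.4)
in_c1 = np.arange(n) < N
theta = rng.uniform(0.8, 1.2, size=n)
theta *= n / theta.sum()                          # ||theta||_1 = n

P = np.array([[a, b], [b, c]])
idx = np.where(in_c1, 0, 1)                       # row 0 of P belongs to C_1
omega = np.outer(theta, theta) * P[np.ix_(idx, idx)]

omega_tilde = centered_omega(omega)
eigenvalues = np.linalg.eigvalsh(omega_tilde)
lambda_tilde_numeric = eigenvalues[np.argmax(np.abs(eigenvalues))]
lambda_tilde_closed = lambda_tilde_formula(theta, in_c1, a, b, c)
```

Since $\tilde\Omega$ has rank $K - 1 = 1$ here, its single nonzero eigenvalue coincides with its trace, and both agree with the closed form up to floating-point error; the value is of the same order as $N(a - c)$.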

2.3. COMPARISON WITH THE NAIVE DEGREE-BASED χ 2 -TEST

Consider a setting where $\Omega = \alpha\Theta 1_n 1_n'\Theta \equiv \alpha\theta\theta'$ under the null and $\Omega = \Theta\Pi P\Pi'\Theta$ under the alternative, and (2.1) holds. When $\theta$ is unknown, it is unclear how to apply the $\chi^2$-test: the null case has $n$ unknown parameters $\theta_1, \ldots, \theta_n$, and we need to use the degrees to estimate $\theta_i$ first. As a result, the resultant $\chi^2$-statistic may be trivially 0. Therefore, we consider a simpler SBM case where $\theta = 1_n$. In this case, $\Omega = \alpha 1_n 1_n'$ under the null and $\Omega = \Pi P \Pi'$ under the alternative, and the null case has only one unknown parameter $\alpha$. Let $y_i$ be the degree of node $i$, and let $\hat\alpha = [n(n-1)]^{-1} 1_n' A 1_n$. The $\chi^2$-statistic is

$X_n = \sum_{i=1}^n (y_i - n\hat\alpha)^2 / [(n-1)\hat\alpha(1 - \hat\alpha)].$ (2.8)

It is seen that as $n\alpha \to \infty$ and $\alpha \to 0$, $(X_n - n)/\sqrt{2n} \to N(0,1)$ in law. For a fixed level $\kappa \in (0,1)$, consider the $\chi^2$-test that rejects the null if and only if $(X_n - n)/\sqrt{2n} > z_\kappa$. Let $\alpha_0 = n^{-2}(1_n'\Omega 1_n)$. Since $\Omega 1_n = n\Pi P h$ and $\alpha_0 = h'Ph$ when $\theta = 1_n$, the power of the $\chi^2$-test hinges on the quantity

$(n\alpha_0)^{-1}\|\Omega 1_n - n\alpha_0 1_n\|^2 = n\alpha_0\,\|(h'Ph)^{-1}\Pi P h - 1_n\|^2,$

which equals 0 if $Ph \propto 1_K$. The next theorem is proved in the supplement.

Theorem 2.3. Suppose $\theta = 1_n$ and (2.7) holds. If $|\tilde\lambda_1|/\sqrt{\lambda_1} \to \infty$ under the alternative hypothesis, the power of the level-$\kappa$ SgnQ test goes to 1, while the power of the level-$\kappa$ $\chi^2$-test goes to $\kappa$.
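For reference, the statistic (2.8) and its standardized form can be computed as follows. This is a minimal sketch that follows the displayed formula literally; the function name and the simulated Erdős-Rényi example (its size, edge probability, and seed) are assumptions made for illustration.

```python
import math
import numpy as np

def chi2_degree_statistic(A):
    """Return (X_n, (X_n - n) / sqrt(2n)) as defined in (2.8) and the text after it."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    y = A.sum(axis=1)                              # node degrees y_i
    alpha_hat = A.sum() / (n * (n - 1))            # [n(n-1)]^{-1} 1' A 1
    x_n = np.sum((y - n * alpha_hat) ** 2) / ((n - 1) * alpha_hat * (1 - alpha_hat))
    return x_n, (x_n - n) / math.sqrt(2 * n)

# Illustrative use on a simulated Erdos-Renyi graph (assumed parameters).
rng = np.random.default_rng(4)
n = 400
upper = np.triu(rng.random((n, n)) < 0.05, k=1)
A = (upper | upper.T).astype(float)
x_n, z = chi2_degree_statistic(A)
reject_at_5_percent = z > 1.6449                   # approximate z_kappa for kappa = 0.05
```

The statistic uses the data only through the degrees $y_i$, which is exactly why it cannot separate a null and an alternative whose expected degrees have been matched.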

2.4. THE STATISTICAL LOWER BOUND AND THE OPTIMALITY OF THE SCAN TEST

For lower bounds, it is standard to consider a random-membership DCBM (Jin et al., 2021b), where $\|\theta\|_1 = n$, $P$ is as in (2.3)-(2.4), and for a number $N \ll n$, $\Pi = [\pi_1, \pi_2, \ldots, \pi_n]'$ satisfies

$\pi_i = (X_i, 1 - X_i), \qquad \text{where } X_i \text{ are iid Bernoulli}(\varepsilon) \text{ with } \varepsilon = N/n.$ (2.9)

Theorem 2.4 (Statistical lower bound). Consider the null and alternative hypotheses of Section 2.1, and assume that (2.9) is satisfied, $\theta_{\max} \le C\theta_{\min}$ and $Nc/\log n \to \infty$. If $\sqrt{N}(a - c)/\sqrt{c} \to 0$, then for any test, the sum of the type-I and type-II errors tends to 1.

To show the tightness of this lower bound, we introduce the signed scan test, by adapting the idea in Arias-Castro & Verzelen (2014) from the SBM case to the DCBM case. Unlike the SgnQ test and the $\chi^2$-test, the signed scan test is not a polynomial-time test, but it provides sharper upper bounds. Let $\hat\eta$ be the same as in (1.3). For any subset $S \subset \{1, 2, \ldots, n\}$, let $1_S \in \mathbb{R}^n$ be the vector whose $i$th coordinate is $1\{i \in S\}$. Define the signed scan statistic φ_sc = max_{S ⊂ {1, …

Unfortunately, the signed scan test is not polynomial-time computable. Does there exist a polynomial-time computable test that is optimal? We address this in the next section.

2.5. THE COMPUTATIONAL LOWER BOUND

Consider the same hypothesis pair as in Section 2.4, where $K = 2$, $P$ is as in (2.3)-(2.4), and $\Pi$ is as in (2.9). For simplicity, we only consider the SBM, i.e., $\theta_i \equiv 1$. The low-degree polynomial argument has recently emerged as a major tool for predicting average-case computational barriers in a wide range of high-dimensional problems (Hopkins & Steurer, 2017; Hopkins et al., 2017). Many powerful methods, such as spectral algorithms and approximate message passing, can be formulated as functions of the input data, where the functions are polynomials with degree at most logarithmic in the problem dimension. In comparison to many other schemes for developing computational lower bounds, the low-degree polynomial method yields the same threshold for various average-case hardness problems, such as community detection in the SBM (Hopkins & Steurer, 2017) and (hyper)planted clique detection (Hopkins, 2018; Luo & Zhang, 2022). The foundation of the low-degree polynomial argument is the following low-degree polynomial conjecture (Hopkins et al., 2017):

Conjecture 2.1 (Adapted from Kunisky et al. (2019)). Let $P_n$ and $Q_n$ denote a sequence of probability measures with sample space $\mathbb{R}^{n^k}$ where $k = O(1)$. Suppose that every polynomial $f$ of degree $O(\log n)$ with $\mathbb{E}_{Q_n} f^2 = 1$ is bounded under $P_n$ with high probability as $n \to \infty$ and that some further regularity conditions hold. Then there is no polynomial-time test distinguishing $P_n$ from $Q_n$ with type I and type II error tending to 0 as $n \to \infty$.

We refer to Hopkins (2018) for a precise statement of this conjecture's required regularity conditions. The low-degree polynomial computational lower bound for our testing problem is as follows.

Theorem 2.6 (Computational lower bound). Consider the null and alternative hypotheses in Section 2.1, and assume $\theta_i \equiv 1$ and (2.9) holds. As $n \to \infty$, assume $c < a$, $c < 1 - \delta$ for a constant $\delta > 0$, $N < n/3$, $D = O(\log n)$, and

$\limsup_{n \to \infty} \left\{ \left( \log_n \frac{N}{\sqrt{n}} + \log_n \frac{a - c}{\sqrt{c}} \right) \vee \left( (D/2 - 1) \log_n \frac{a - c}{\sqrt{c}} \right) \right\} < 0.$
For any series of degree-$D$ polynomials $\phi_n : A \to \mathbb{R}$, whenever $\mathbb{E}_{H_0}\phi_n(A) = 0$ and $\mathrm{Var}_{H_0}(\phi_n(A)) = 1$, we must have $\mathbb{E}_{H_1}\phi_n(A) = o(1)$. This implies that if Conjecture 2.1 is true, there is no consistent polynomial-time test for this problem.

… EST ≥ e. EST can be computed in time $O(n^v)$, which is polynomial time. For simplicity, we consider the SBM, i.e., where $\theta = 1_n$, and a specific setting of parameters for the null and alternative hypotheses.

Theorem 2.7 (Power of EST). Suppose $\beta \in [1/2, 1)$ and $0 < \omega < \delta < 1$ are fixed constants. Under the alternative, suppose $\theta = 1_n$, (2.9) holds, $N = n^{1-\beta}$, $a = n^{-\omega}$, and $c = n^{-\delta}$. Under the null, suppose $\theta = 1_n$ and $\alpha = a(N/n) + b(1 - N/n)$. If $\omega/(1 - \beta) < \delta$, the sum of type I and type II errors of the EST with $v$ and $e$ satisfying $\omega/(1 - \beta) < v/e < \delta$ tends to 0.

Theorem 2.7 follows from standard results in probabilistic combinatorics (Alon & Spencer, 2016). It is conjectured in Bhaskara et al. (2010) that EST attains the CLB in the Erdős-Rényi setting considered by Arias-Castro & Verzelen (2014); Verzelen & Arias-Castro (2015). This suggests that the CLB in Theorem 2.6 is likely not tight when $N = o(\sqrt{n})$ and $(a - c)/\sqrt{c} \to \infty$. However, this is not because our inequalities in proving the CLB are loose. A possible reason is that the prediction from the low-degree polynomial conjecture does not provide a tight bound. It remains an open question whether other computational infeasibility frameworks provide a tight CLB in our problem.

2.7. THE PHASE TRANSITION

We describe more precisely our results in terms of the phase transitions shown in Figure 1. Consider the null and alternative hypotheses from Section 2.1. For illustration purposes, we fix constants $\beta \in (0, 1)$ and $\gamma \in \mathbb{R}$ and assume that $N = n^{1-\beta}$ and $(a - c)/\sqrt{c} = n^{-\gamma}$. In the two-dimensional space of $(\gamma, \beta)$, the regions $\beta < 1/2$ and $\beta > 1/2$ correspond to the size of the small community being $\gg \sqrt{n}$ and $o(\sqrt{n})$, respectively (since $N = n^{1-\beta}$), and the regions $\gamma > 0$, $-1/2 < \gamma < 0$, and $\gamma < -1/2$ correspond to the 'weak node-wise signal', 'moderate node-wise signal', and 'strong node-wise signal' cases, respectively. See Figure 1. By our results in Section 2.4, the testing problem is statistically impossible if $\beta + 2\gamma > 1$ (orange region). By our results in Section 2.2, SgnQ has full power if $\beta + \gamma < 1/2$ (blue region). Our results in Section 2.5 state that the testing problem is computationally infeasible if both $\gamma > 0$ and $\beta + \gamma > 1/2$ (green and orange regions). Combining these results, when $\beta < 1/2$, we have a complete understanding of the LB and CLB.
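The three conditions above translate directly into a small lookup on the $(\gamma, \beta)$ plane. The sketch below is only a restatement of those inequalities under the parametrization $N = n^{1-\beta}$, $(a - c)/\sqrt{c} = n^{-\gamma}$; the function name and region labels are hypothetical, boundary cases (equalities) are not addressed by the stated results, and points satisfying none of the three conditions are reported as not settled.

```python
def phase_region(gamma: float, beta: float) -> str:
    """Classify a point (gamma, beta) using the inequalities stated in Section 2.7."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    if beta + 2.0 * gamma > 1.0:
        # Section 2.4: below the statistical lower bound (orange region).
        return "statistically impossible"
    if gamma > 0.0 and beta + gamma > 0.5:
        # Section 2.5: low-degree hardness, conditional on Conjecture 2.1 (green region).
        return "computationally infeasible"
    if beta + gamma < 0.5:
        # Section 2.2: SgnQ has full power (blue region).
        return "SgnQ has full power"
    # Remaining points (small N with moderate signal, or boundary cases).
    return "not settled by Sections 2.2, 2.4, 2.5"

# A few illustrative points (gamma, beta).
examples = {
    (0.6, 0.5): phase_region(0.6, 0.5),
    (0.1, 0.6): phase_region(0.1, 0.6),
    (-0.2, 0.3): phase_region(-0.2, 0.3),
    (-0.2, 0.8): phase_region(-0.2, 0.8),
}
```

The order of the checks matters only for readability: whenever $\beta + 2\gamma > 1$ with $\beta \in (0,1)$, the conditions $\gamma > 0$ and $\beta + \gamma > 1/2$ also hold, so the statistically impossible region sits inside the computationally infeasible one.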



Verzelen & Arias-Castro (2015) have made remarkable progress. In detail, they considered the problem of testing whether a graph is generated from a one-parameter Erdős-Rényi model or a two-parameter model: for any nodes $1 \le i, j \le n$, the probability that they have an edge equals $b$ if $i, j$ both are in a small planted subset and equals $a$ otherwise. A remarkable conclusion of these papers is: a naive degree-based $\chi^2$-test is optimal, provided that the clique size is in a certain range. Therefore, at first glance, it seems that the problem has been elegantly solved, at least to some extent. Unfortunately, recent progress in network testing tells a very different story: the signal captured by the $\chi^2$-test may be a modeling artifact. It may disappear once we replace the models in Arias-Castro & Verzelen (2014); Verzelen & Arias-Castro (2015) by a properly broader model.

In this work we use $M'$ to denote the transpose of a matrix or vector $M$.

Note that the node-wise SNR captures the ratio of the mean difference and standard deviation of Bernoulli($a$) versus Bernoulli($c$), which motivates our terminology.



Figure 1: Phase diagram ($(a - c)/\sqrt{c} = n^{-\gamma}$ and $N = n^{1-\beta}$).

Figure 2: Left: Null distribution of SgnQ ($n = 500$). Middle and right: Power comparison of SgnQ and $\chi^2$ ($n = 100$, $N = 10$, 50 repetitions). We consider a 2-community SBM with $P_{11} = a$, $P_{22} = 0.1$, $P_{12} = 0.1$ (middle plot) and P 12 = an-(a+0.1)N n

The level-$\kappa$ SgnQ test rejects the null if and only if $\psi_n \ge z_\kappa$, where $\psi_n$ is as in (1.3). Theorem 2.2 and Corollary 2.1 are proved in the supplement. Recall that our alternative hypothesis is defined in Section 2.1. By power we mean the probability that the null hypothesis is rejected under the alternative, minimized over all possible alternative DCBMs satisfying our regularity conditions.

Theorem 2.5 (Tightness of the statistical lower bound). Consider the signed scan test (2.10) that rejects the null hypothesis if $\phi_{sc} > t_n$. Under the assumptions of Theorem 2.4, if $\sqrt{N}(a - c)/\sqrt{c\log(n)} \to \infty$, then there exists a sequence $t_n$ such that the sum of type I and type II errors of the signed scan test tends to 0. Therefore, the lower bound is sharp, up to log-factors, and the signed scan test is nearly optimal.

Acknowledgments. We thank the anonymous referees for their helpful comments. We thank Louis Cammarata for assistance with the simulations in Section A.3. J. Jin was partially supported by NSF grant DMS-2015469. Z.T. Ke was supported in part by NSF CAREER Grant DMS-1943902. A.R. Zhang acknowledges the grant NSF CAREER-2203741.

3. NUMERICAL RESULTS

Simulations. First, in Figure 2 (left panel), we demonstrate the asymptotic normality of SgnQ under a null of the form $\Omega = \theta\theta'$, where the $\theta_i$ are generated i.i.d. from Pareto(4, 0.375). Though the degree heterogeneity is severe, SgnQ properly standardized is approximately standard normal under the null. Next, in Figure 2 (middle and right panels), we compare the power of SgnQ and the $\chi^2$-test in an asymmetric and a symmetric SBM. As our theory predicts, both tests are powerful when the degrees are not calibrated, but only SgnQ is powerful in the symmetric case. We also compare the power of SgnQ with the scan test to show evidence of a statistical-computational gap. We relegate these experiments to the supplement.

Real data: Next we demonstrate the effectiveness of SgnQ in detecting small communities in coauthorship networks studied in Ji et al. (2022). In Example 1, we consider the personalized network of Raymond Carroll, whose nodes consist of his coauthors for papers in a set of 36 statistics journals from the time period 1975-2015. An edge is placed between two coauthors if they wrote a paper in this set of journals during the specified time period. The SgnQ p-value for Carroll's personalized network $G_{\mathrm{Carroll}}$ is 0.02, which suggests the presence of more than one community. In Ji et al. (2022), the authors identify a small cluster of coauthors from a collaboration with the National Cancer Institute. We applied the SCORE community detection module with $K = 2$ (e.g., Ke & Jin (2022)) and obtained a larger community $G^0_{\mathrm{Carroll}}$ of size 218 and a smaller community $G^1_{\mathrm{Carroll}}$ of size 17. Precisely, we removed Carroll from his network, applied SCORE on the remaining giant component, and defined $G^0_{\mathrm{Carroll}}$ to be the complement of the smaller community. The SgnQ p-values in the table below suggest that both $G^0_{\mathrm{Carroll}}$ and $G^1_{\mathrm{Carroll}}$ are tightly clustered. Refer to the supplement for a visualization of Carroll's network and its smaller community labeled by author names.
In Example 2, we consider three different coauthorship networks G old , G recent , and G new corresponding to the time periods (i) 1975-1997, (ii) 1995-2007, and (iii) 2005-2015 for the journals AoS, Bka, JASA, and JRSSB. Nodes are given by authors, and an edge is placed between two authors if they coauthored at least one paper in one of these journals during the corresponding time period. For each network, we follow a procedure similar to that of the first example. First, we compute the SgnQ p-value, which turns out to be ≈ 0 (up to 16 digits of precision) for all networks. For each i ∈ {old, recent, new}, we apply SCORE with K = 2 to G i and compute the SgnQ p-value on both resulting communities, which we call G 0 i and G 1 i . We refer to the table below for the results. For G old and G recent , SCORE with K = 2 extracts a small community. The SgnQ p-value further supports the hypothesis that this small community is well-connected. In the last network, SCORE splits G new into two similarly sized pieces whose p-values suggest that they can be split further into smaller subcommunities.

Discussions: Global testing is a fundamental problem and often the starting point of a long line of research. For example, in the literature on Gaussian models, certain methods started as global testing tools but later grew into tools for variable selection, classification, and clustering, and motivated much further research (e.g., Donoho & Jin (2004; 2015)). The SgnQ test may also motivate tools for many other problems, such as estimating the location of the clique and clustering more generally. For example, in Jin et al. (2022), the SgnQ test motivated a tool for estimating the number of communities (see also Ma et al. (2021)). SgnQ is also extendable to clique detection in a tensor (Yuan et al., 2021; Jin et al., 2021a) and to network change point detection. The LB and CLB we obtain in this paper are also useful for studying other problems, such as clique estimation. If one cannot tell whether there is a clique in the network, then it is impossible to estimate the clique. Therefore, the LB and CLB are also valid for the clique estimation problem (Alon et al., 1998; Ron & Feige, 2010).

The limiting distribution of SgnQ is N (0, 1). This is not easy to achieve with other testing ideas, such as those based on the leading eigenvalues of the adjacency matrix: there, the limiting distribution depends on many unknown parameters and is hard to normalize (Liu et al., 2019). The p-value of the SgnQ test is easy to approximate and is also useful in applications. For example, we can use it to measure the research diversity of a given author. Consider the ego sub-network of an author in a large co-authorship or citation network. A smaller p-value suggests that the ego network has more than one community and that the author has more diverse interests. The p-values can also be useful as a stopping criterion in hierarchical community detection modules.
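To make the p-value computation concrete, the following is a minimal Python sketch, not the code used for the experiments above. It rests on several assumptions that are not stated in this section: the statistic is taken to be the standardized signed-quadrilateral sum ψ = [Q − 2(‖η̂‖² − 1)²] / √(8(‖η̂‖² − 1)⁴) with η̂ = A1/√(1'A1) and Q the sum of Â(i, j)Â(j, k)Â(k, l)Â(l, i) over distinct indices for Â = A − η̂η̂'; the p-value is taken to be the one-sided upper tail of N (0, 1); Pareto(4, 0.375) is read as shape 4 and scale 0.375; and edge probabilities θ i θ j above 1 are clipped. All function names are hypothetical.

```python
import math
import numpy as np


def distinct_quadrilateral_sum(M):
    """Sum of M[i,j]*M[j,k]*M[k,l]*M[l,i] over distinct i, j, k, l (M symmetric)."""
    B = np.array(M, dtype=float)
    np.fill_diagonal(B, 0.0)              # adjacent indices must differ
    B2 = B * B
    row = B2.sum(axis=1)
    trace_b4 = np.sum((B @ B) ** 2)       # tr(B^4) = ||B^2||_F^2 for symmetric B
    # Inclusion-exclusion: remove cycles with i == k or j == l,
    # then add back those with both coincidences (counted twice).
    return trace_b4 - 2.0 * np.sum(row ** 2) + np.sum(B2 ** 2)


def sgnq_statistic(A):
    """Standardized SgnQ statistic (formula assumed, see the lead-in)."""
    A = np.asarray(A, dtype=float)
    ones = np.ones(A.shape[0])
    total = ones @ A @ ones               # 1'A1, twice the number of edges
    eta = (A @ ones) / math.sqrt(total)
    A_hat = A - np.outer(eta, eta)        # remove the fitted rank-one null part
    Q = distinct_quadrilateral_sum(A_hat)
    s = eta @ eta - 1.0
    return (Q - 2.0 * s ** 2) / math.sqrt(8.0 * s ** 4)


def upper_tail_p_value(psi):
    """One-sided p-value P(Z > psi) for Z ~ N(0, 1) (one-sidedness assumed)."""
    return 0.5 * math.erfc(psi / math.sqrt(2.0))


def simulate_null_network(n, rng):
    """Draw A with Omega = theta theta' and theta_i i.i.d. Pareto(shape 4, scale 0.375)."""
    theta = 0.375 * (1.0 + rng.pareto(4.0, size=n))   # classical Pareto via Lomax
    P = np.clip(np.outer(theta, theta), 0.0, 1.0)     # clipping is an assumption
    upper = np.triu(rng.random((n, n)) < P, k=1)      # independent edges, no self-edges
    return (upper | upper.T).astype(float)


if __name__ == "__main__":
    rng = np.random.default_rng(2023)
    A = simulate_null_network(300, rng)
    psi = sgnq_statistic(A)
    print("psi =", psi, "p-value =", upper_tail_p_value(psi))
```

The closed form in `distinct_quadrilateral_sum` avoids the O(n⁴) loop over index quadruples, so the cost is dominated by one n × n matrix product. Applying `sgnq_statistic` to the adjacency matrix of an ego sub-network or of a community returned by a clustering step would give the kind of p-value discussed above, under the same assumptions.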

