SEMI-SUPERVISED COMMUNITY DETECTION VIA STRUCTURAL SIMILARITY METRICS

Abstract

Motivated by social network analysis and network-based recommendation systems, we study a semi-supervised community detection problem in which the objective is to estimate the community label of a new node using the network topology and partially observed community labels of existing nodes. The network is modeled using a degree-corrected stochastic block model, which allows for severe degree heterogeneity and potentially non-assortative communities. We propose an algorithm that computes a 'structural similarity metric' between the new node and each of the K communities by aggregating labeled and unlabeled data. The estimated label of the new node is the value of k that maximizes this similarity metric. Our method is fast and numerically outperforms existing semi-supervised algorithms. Theoretically, we derive explicit bounds for the misclassification error and show the efficiency of our method by comparing it with an ideal classifier. To the best of our knowledge, ours is the first semi-supervised community detection algorithm that offers theoretical guarantees.

Recall that A is the n × n adjacency matrix on the existing nodes and Y_L contains the community labels of nodes in L. Write [n] = {1, 2, . . . , n} and let U = [n] \ L denote the set of unlabeled nodes. We index the new node by n + 1 and let X ∈ R^n be the binary vector consisting of the edges between the new node and the existing nodes. Denote by Ā the adjacency matrix for the network on the (n + 1) nodes.

The DCBM model and structural equivalence of communities. We model Ā with the degree-corrected block model (DCBM) (Karrer & Newman, 2011).

1. INTRODUCTION

Nowadays, large network data are frequently observed in social media (such as Facebook, Twitter, and LinkedIn), in science, and in social science. Learning the latent community structure in a network is of particular interest. For example, community analysis is useful in designing recommendation systems (Debnath et al., 2008), measuring scholarly impact (Ji et al., 2022), and reconstructing pseudo-dynamics in single-cell data (Liu et al., 2018). In this paper, we consider a semi-supervised community detection setting: we are given a symmetric network with n nodes, and denote by A ∈ R^{n×n} the adjacency matrix, where A_ij ∈ {0, 1} indicates whether there is an edge between nodes i and j. Suppose the nodes partition into K non-overlapping communities C_1, C_2, . . . , C_K. For a subset L ⊂ {1, 2, . . . , n}, we observe the true community label y_i ∈ {1, 2, . . . , K} for each i ∈ L. Write m = |L| and Y_L = (y_i)_{i∈L}. In this context, there are two related semi-supervised community detection problems: (i) in-sample classification, where the goal is to classify all the existing unlabeled nodes; (ii) prediction, where the goal is to classify a new node joining the network. Notably, the in-sample classification problem can be easily reduced to the prediction problem: we can successively single out each existing unlabeled node, regard it as the "new node", and predict its label by applying an algorithm for the prediction problem. Hence, for most of the paper, we focus on the prediction problem and defer the study of in-sample classification to Section 3. In the prediction problem, let X ∈ {0, 1}^n denote the vector consisting of edges between the new node and each of the existing nodes. Given (A, Y_L, X), our goal is to estimate the community label of the new node. This problem has multiple applications. Consider news suggestion or online advertising for a new Facebook user (Shapira et al., 2013).
Given a big Facebook network of existing users, for a small fraction of nodes (e.g., active users), we may have good information about the communities to which they belong, whereas for the majority of users, we only observe whom they link to. We are interested in estimating the community label of the new user in order to personalize news or ad recommendations. For another example, in a co-citation network of researchers (Ji et al., 2022), each community might be interpreted as a group of researchers working on the same research area. We frequently have a clear understanding of the research areas of some authors (e.g., senior authors), and we intend to use this knowledge to determine the community to which a new node (e.g., a junior author) belongs. The statistical literature on community detection has mainly focused on the unsupervised setting (Bickel & Chen, 2009; Rohe et al., 2011; Jin, 2015; Gao et al., 2018; Li et al., 2021). The semi-supervised setting is less studied. Leng & Ma (2019) offers a comprehensive literature review of semi-supervised community detection algorithms. Liu et al. (2014) and Ji et al. (2016) derive systems of linear equations for the community labels from physics-based arguments and predict the labels by solving those equations. Zhou et al. (2018) leverages belief functions to propagate labels across the network, so that one can estimate the label of a node through its belief. Betzel et al. (2018) extracts patterns in size and structural composition from the known communities and searches for similar patterns in the graph. Yang et al. (2015) unifies a number of community detection algorithms based on non-negative matrix factorization or spectral clustering in the unsupervised setting, and adapts them to the semi-supervised scenario by adding various regularization terms that encourage the estimated labels for nodes in L to match their observed labels.
However, the existing methods still face challenges. First, many of them employ the heuristic that a node tends to have more edges with nodes in the same community than with those in other communities. This holds only when communities are assortative, but non-assortative communities are also seen in real networks (Goldenberg et al., 2010; Betzel et al., 2018); for instance, Facebook users sharing similar restaurant preferences are not necessarily friends of each other. Second, real networks often have severe degree heterogeneity (i.e., the degrees of some nodes can be many times larger than the degrees of other nodes), but most semi-supervised community detection algorithms do not handle degree heterogeneity. Third, the optimization-based algorithms (Yang et al., 2015) solve non-convex problems and face the issue of local minima. Last, to the best of our knowledge, none of the existing methods have theoretical guarantees. Attributed network clustering is a problem related to community detection, for which many algorithms have been developed (see Chunaev et al. (2019) for a nice survey). Graph neural networks (GNNs) have reported great successes in attributed network clustering. Kipf & Welling (2016) proposes a graph convolutional network (GCN) approach to semi-supervised community detection, and Jin et al. (2019) combines GNN with the Markov random field to predict node labels. However, GNN is designed for the setting where each node has a large number of attributes and these attributes contain rich information about community labels. The key question in GNN research is how to utilize the graph to better propagate messages. In contrast, we are interested in the scenario where it is infeasible or costly to collect node attributes. For instance, it is easy to construct a co-authorship network from bibtex files, but collecting features of authors is much harder. Additionally, a number of benchmark network datasets do not have attributes (e.g.,
Caltech (Red et al., 2011; Traud et al., 2012), Simmons (Red et al., 2011; Traud et al., 2012), and Polblogs (Adamic & Glance, 2005)). It is unclear how to implement GNN on these datasets. In Section 4, we briefly study the performance of GNN with self-created node features from the 1-hop representation, graph topology, and node embeddings. Our experiments indicate that GNN is often not suitable for the case of no node attributes. We propose a new algorithm for semi-supervised community detection to address the limitations of existing methods. We adopt the DCBM model (Karrer & Newman, 2011) for networks, which models degree heterogeneity and allows for both assortative and non-assortative communities. Inspired by the viewpoint of Goldenberg et al. (2010) that a 'community' is a group of 'structurally equivalent' nodes, we design a structural similarity metric between the new node and each of the K communities. This metric aggregates information in both labeled and unlabeled nodes. We then estimate the community label of the new node by the k that maximizes this similarity metric. Our method is easy to implement, computationally fast, and compares favorably with other methods in numerical experiments. In theory, we derive explicit bounds for the misclassification probability of our method under the DCBM model. We also study the efficiency of our method by comparing its misclassification probability with that of an ideal classifier having access to the community labels of all nodes.

We define a K-dimensional membership vector π_i ∈ {e_1, e_2, . . . , e_K}, where the e_k's are the standard basis vectors of R^K. We encode the community labels by π_i, where π_i = e_k if and only if y_i = k. For a symmetric nonnegative matrix P ∈ R^{K×K} and a degree parameter θ_i ∈ (0, 1] for each node i, we assume that the upper triangle of Ā contains independent Bernoulli variables, where

P(Ā_ij = 1) = θ_i θ_j · π_i′ P π_j, for all 1 ≤ i ≠ j ≤ n + 1. (1)
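To make the model concrete, here is a minimal numpy sketch (our own illustration, not part of the paper's pipeline) that samples an adjacency matrix from model (1); the parameter values below are arbitrary assumptions.

```python
import numpy as np

def sample_dcbm(theta, Pi, P, rng):
    """Sample a symmetric adjacency matrix from model (1).

    theta : (n,) degree parameters in (0, 1]
    Pi    : (n, K) one-hot membership matrix (row i is pi_i)
    P     : (K, K) symmetric nonnegative community matrix
    """
    Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)  # E[A_ij] off the diagonal
    upper = rng.random(Omega.shape) < Omega           # independent Bernoulli draws
    A = np.triu(upper, k=1)                           # keep the upper triangle
    return (A + A.T).astype(int)                      # symmetrize; diagonal stays 0

rng = np.random.default_rng(0)
n, K = 300, 2
theta = rng.uniform(0.1, 0.5, size=n)                 # heterogeneous degrees
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]
P = np.array([[1.0, 0.3], [0.3, 1.0]])                # an assortative example
A = sample_dcbm(theta, Pi, P, rng)
```

The resulting A is binary, symmetric, and hollow, matching the model's upper-triangle Bernoulli specification.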
When the θ_i are equal, the DCBM reduces to the stochastic block model (SBM). Compared with the SBM, the DCBM is more flexible, as it accommodates degree heterogeneity. For a matrix M or a vector v, let diag(M) and diag(v) denote the diagonal matrices whose diagonals are the diagonal of M and the vector v, respectively. Write θ = (θ_1, θ_2, . . . , θ_{n+1})′, Θ = diag(θ), and Π = [π_1, π_2, . . . , π_{n+1}]′ ∈ R^{(n+1)×K}. Model (1) yields that

Ā = Ω − diag(Ω) + W, where Ω = ΘΠPΠ′Θ and W = Ā − E[Ā]. (2)

Here, Ω is a low-rank matrix that captures the 'signal', W is a generalized Wigner matrix that captures the 'noise', and diag(Ω) introduces a bias to the 'signal' whose effect is usually negligible. The DCBM belongs to the family of block models for networks. In block models, it is not necessarily true that the edge densities within a community are higher than those between different communities. Communities with higher within-community edge densities are called assortative communities. However, non-assortative communities also appear in many real networks (Goldenberg et al., 2010; Betzel et al., 2018). For instance, in news and ad recommendation, we are interested in identifying a group of users who have similar behaviors, but they may not be densely connected to each other. Goldenberg et al. (2010) introduced an intuitive notion of structural equivalence: two nodes are structurally equivalent if their connectivity with similar nodes is similar. They argued that a 'community' in block models is a group of structurally equivalent nodes. This way of defining communities is more general than that of assortative communities. We introduce a rigorous description of structural equivalence in the DCBM model. For two vectors u and v, define ψ(u, v) = arccos⟨u/∥u∥, v/∥v∥⟩, the angle between the two vectors. Let Ā_i be the ith column of Ā. This vector describes the 'behavior' of node i in the network. Recall that Ω is as in (2).
When the signal-to-noise ratio is sufficiently large, Ā_i ≈ Ω_i, where Ω_i is the ith column of Ω. We therefore approximate the angle between Ā_i and Ā_j by the angle between Ω_i and Ω_j. By the DCBM model, for a node i in community k, Ω_i = θ_i ΘΠP e_k, where e_k is the kth standard basis vector of R^K. It follows that for i ∈ C_k and j ∈ C_ℓ, the degree parameters θ_i and θ_j cancel out in our structural similarity:

cos ψ(Ω_i, Ω_j) = ⟨θ_i ΘΠP e_k, θ_j ΘΠP e_ℓ⟩ / (∥θ_i ΘΠP e_k∥ · ∥θ_j ΘΠP e_ℓ∥) = M_kℓ / √(M_kk M_ℓℓ), with M := P Π′Θ²ΠP. (3)

It is seen that cos ψ(Ω_i, Ω_j) does not depend on the degree parameters of the nodes and is solely determined by the community memberships. When k = ℓ (i.e., i and j are in the same community), cos ψ(Ω_i, Ω_j) = 1, which means the angle between the two vectors is zero. When k ≠ ℓ, as long as P is non-singular and Π has full column rank, M is a positive-definite matrix. It follows that cos ψ(Ω_i, Ω_j) < 1 and that the angle between Ω_i and Ω_j is nonzero.

Example 1. Suppose K = 2 and P ∈ R^{2×2} is such that the diagonal entries are 1 and the off-diagonal entries are b, for some b > 0 with b ≠ 1, and max_i{θ_i} < min{1/b, 1} (to guarantee that all entries of Ω are smaller than 1). For simplicity, we assume Σ_{i∈C_1} θ_i² = Σ_{i∈C_2} θ_i². It can be shown that M is proportional to the matrix whose diagonal entries are (1 + b²) and whose off-diagonal entries are 2b. When b < 1, the communities are assortative, and when b > 1, the communities are non-assortative. However, regardless of the value of b, the off-diagonal entries of M are always strictly smaller than the diagonal entries (since 2b < 1 + b² whenever b ≠ 1), so that cos ψ(Ω_i, Ω_j) < 1 for nodes in distinct communities.
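The cancellation of the degree parameters in (3) is easy to verify numerically. The following sketch (our own illustration, with arbitrary parameters) checks that cos ψ(Ω_i, Ω_j) equals M_kℓ/√(M_kk M_ℓℓ), even for a non-assortative P (off-diagonal b > 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 2
theta = rng.uniform(0.05, 0.5, size=n)        # heterogeneous degree parameters
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]
P = np.array([[1.0, 1.5], [1.5, 1.0]])        # b = 1.5 > 1: non-assortative
Theta = np.diag(theta)

Omega = Theta @ Pi @ P @ Pi.T @ Theta         # signal matrix, as in (2)
M = P @ Pi.T @ Theta @ Theta @ Pi @ P         # M = P Pi' Theta^2 Pi P, as in (3)

def cos_angle(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

i, i2 = np.flatnonzero(labels == 0)[:2]       # two nodes in the first community
j = np.flatnonzero(labels == 1)[0]            # one node in the second community

# cross-community cosine matches M_{12}/sqrt(M_11 M_22), independent of theta
pred = M[0, 1] / np.sqrt(M[0, 0] * M[1, 1])
assert abs(cos_angle(Omega[:, i], Omega[:, j]) - pred) < 1e-10
assert pred < 1.0                             # nonzero angle across communities
# same-community cosine is exactly 1 (zero angle)
assert abs(cos_angle(Omega[:, i], Omega[:, i2]) - 1.0) < 1e-10
```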

2.2. Semi-supervised community detection

Inspired by (3), we propose assigning a community label to the new node based on its 'similarity' to the labeled nodes. For each 1 ≤ k ≤ K, assume that L ∩ C_k ≠ ∅ and define a vector A^(k) ∈ R^n by A^(k)_j = Σ_{i∈L∩C_k} A_ij, for 1 ≤ j ≤ n. The vector A^(k) describes the 'aggregated behavior' of all labeled nodes in community k. Recall that X ∈ R^n contains the edges between the new node and all the existing nodes. We can estimate the community label of the new node by

ŷ = arg min_{1≤k≤K} ψ(A^(k), X). (4)

We call (4) the AngleMin estimate. Note that each A^(k) is an n-dimensional vector whose construction uses both A_LL and A_LU. Therefore, A^(k) aggregates information from both labeled and unlabeled nodes, and so AngleMin is indeed a semi-supervised approach. The estimate in (4) still has room for improvement. First, A^(k) and X are high-dimensional random vectors, each entry of which is a sum of independent Bernoulli variables. When the network is very sparse or the communities are heavily imbalanced in size or degree, the large-deviation bound for ψ(A^(k), X) can be unsatisfactory. Second, recall that our observed data include A and X. Denote by A_LL the submatrix of A restricted to L × L and by X_L the subvector of X restricted to L; other notations are similar. In (4), only (A_LL, A_LU, X) are used, and the information in A_UU is wasted. We now propose a variant of (4). For any vector x ∈ R^n, let x_L and x_U be the sub-vectors restricted to indices in L and U, respectively. Let 1^(k) denote the |L|-dimensional vector indicating whether each labeled node is in community k. Given any |U| × K matrix H = [h_1, h_2, . . . , h_K], define

f(x; H) = [x_L′1^(1), . . . , x_L′1^(K), x_U′h_1, . . . , x_U′h_K]′ ∈ R^{2K}.

The mapping f(·; H) creates a low-dimensional projection of x. Suppose we now apply this mapping to A^(k). In the projected vector, each entry is a weighted sum of a large number of entries of A^(k).
Since A^(k) contains independent entries, it follows from large-deviation inequalities that each entry of f(A^(k); H) has a nice asymptotic tail behavior. This resolves the first issue above. We then modify the AngleMin estimate in (4) to the following estimate, which we call AngleMin+:

ŷ(H) = arg min_{1≤k≤K} ψ(f(A^(k); H), f(X; H)).

AngleMin+ requires an input H. Our theory suggests that H has to satisfy two conditions: (a) The spectral norm of H′H is O(|U|). In fact, given any H, we can always multiply it by a scalar so that ∥H′H∥ is of order |U|. Hence, this condition says that the scaling of H should be properly set to balance the contributions from labeled and unlabeled nodes. (b) The minimum singular value of H′Θ_UU Π_U has to be at least a constant times ∥H∥∥Θ_UU Π_U∥, where Θ_UU is the submatrix of Θ restricted to the (U, U) block and Π_U is the submatrix of Π restricted to the rows in U. This condition prevents the columns of H from being orthogonal to the columns of Θ_UU Π_U, and it guarantees that the last K entries of f(x; H) retain enough information about the unlabeled nodes. We construct a data-driven H from A_UU by taking advantage of existing unsupervised community detection algorithms such as Gao et al. (2018); Jin et al. (2021). Let Π̂_U = [π̂_i]_{i∈U} be the community labels obtained by applying a community detection algorithm to the sub-network restricted to the unlabeled nodes, where π̂_i = e_k if and only if node i is clustered to community k. We propose using Ĥ = Π̂_U. This choice of H always satisfies condition (a). Furthermore, under mild regularity conditions, as long as the clustering error fraction is bounded by a constant, this H also satisfies condition (b). We note that the information in A_UU has been absorbed into H, which resolves the second issue above. Combining the choice Ĥ = Π̂_U with the AngleMin+ estimate gives a two-stage algorithm for estimating y.
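To fix ideas, the two-stage procedure can be sketched as follows (our own self-contained numpy illustration; the function names are ours, and for simplicity the unsupervised step is abstracted as a given cluster assignment on U rather than an actual run of SCORE+):

```python
import numpy as np

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def f_proj(x, L_idx, U_idx, Y_L, clusters_U, K):
    """The 2K-dimensional projection f(x; H), with H built from cluster labels."""
    out = np.empty(2 * K)
    for k in range(K):
        out[k] = x[L_idx][Y_L == k].sum()             # x_L' 1^(k)
        out[K + k] = x[U_idx][clusters_U == k].sum()  # x_U' h_k
    return out

def angle_min_plus(A, X, L_idx, Y_L, clusters_U, K):
    """AngleMin+: return the estimated label of the new node (0-indexed)."""
    n = A.shape[0]
    U_idx = np.setdiff1d(np.arange(n), L_idx)
    fX = f_proj(X, L_idx, U_idx, Y_L, clusters_U, K)
    psis = []
    for k in range(K):
        Ak = A[L_idx[Y_L == k], :].sum(axis=0)        # aggregated behavior A^(k)
        psis.append(angle(f_proj(Ak, L_idx, U_idx, Y_L, clusters_U, K), fX))
    return int(np.argmin(psis))
```

In practice `clusters_U` would come from an unsupervised algorithm such as SCORE+; note that permuting its labels permutes the last K coordinates of both projected vectors simultaneously, so the estimate is unchanged (cf. Remark 1).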
Remark 1: A nice property of AngleMin+ is that it tolerates an arbitrary permutation of the communities in Π̂_U. In other words, the communities output by the unsupervised community detection algorithm do not need to have a one-to-one correspondence with the communities on the labeled nodes. To see why, consider an arbitrary permutation of the columns of Π̂_U. By the definition of f(·; H), this permutes the last K entries of f(x; H), simultaneously for all x. However, the angle between f(A^(k); H) and f(X; H) is unchanged, and so ŷ(H) is unchanged. This property brings a lot of practical convenience. When K is large or the signals are weak, it is challenging (both computationally and statistically) to match the communities in Π̂_U with those in Π_L. Our method avoids this issue. Remark 2: AngleMin+ is flexible enough to accommodate other choices of H. Some unsupervised community detection algorithms provide both Π̂_U and Θ̂_UU (Jin et al., 2022). We may use H ∝ Θ̂_UU Π̂_U (subject to a re-scaling to satisfy condition (a)). This H down-weights the contribution of low-degree unlabeled nodes in the last K entries of f(x; H), which is beneficial if the signals are weak and the degree heterogeneity is severe. Another choice is H ∝ Ξ̂_(U) Λ̂_(U)^{-1}, where Λ̂_(U) is the diagonal matrix containing the K largest eigenvalues (in magnitude) of A_UU and Ξ̂_(U) is the associated matrix of eigenvectors. For this H, we do not even need to run a community detection algorithm on A_UU. We may also use spectral embedding (Rubin-Delanchy et al., 2017). Remark 3: The local refinement algorithm (Gao et al., 2018) may be adapted to the semi-supervised setting, but it requires prior knowledge of assortativity or dis-assortativity and a strong balance condition on the average degrees of communities. When these conditions are not satisfied, we can construct examples where the error rate of AngleMin+ is o(1) but the error rate of local refinement is 0.5. See Section C.

2.3. The choice of the unsupervised community detection algorithm

We discuss how to obtain Π̂_U. In the statistical literature, there are several approaches to unsupervised community detection. The first is modularity maximization (Girvan & Newman, 2002), which searches over cluster assignments and selects the one that maximizes an empirical modularity function. The second is spectral clustering (Jin, 2015), which applies k-means clustering to the rows of the matrix of empirical eigenvectors. Other methods include post-processing the output of spectral clustering by majority vote (Gao et al., 2018). Not every method handles degree heterogeneity and non-assortative communities as in the DCBM model. We use a recent spectral algorithm, SCORE+ (Jin et al., 2021), which allows for both severe degree heterogeneity and non-assortative communities.

SCORE+:

We tentatively write A_UU = A and |U| = n, and assume the network (on unlabeled nodes) is connected (otherwise, consider its giant component). SCORE+ first computes L = D_τ^{-1/2} A D_τ^{-1/2}, where D_τ = diag(d_1, . . . , d_n) + 0.1·d_max·I_n and d_i is the degree of node i. Let λ̂_k be the kth eigenvalue (in magnitude) of L and let ξ̂_k be the associated eigenvector. Let r = K or r = K + 1 (see Jin et al. (2021) for details). Define R ∈ R^{n×(r−1)} by R_ik = (λ̂_{k+1}/λ̂_1) · [ξ̂_{k+1}(i)/ξ̂_1(i)]. Run k-means on the rows of R.
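A compact dense-matrix sketch of these steps (our own illustration; it fixes r = K and leaves the final k-means step to an off-the-shelf routine such as scikit-learn's KMeans):

```python
import numpy as np

def score_plus_embedding(A, K):
    """Return the SCORE+ embedding matrix R; its rows are then fed to k-means."""
    d = A.sum(axis=1).astype(float)
    d_tau = d + 0.1 * d.max()                    # diagonal of D_tau
    L = A / np.sqrt(np.outer(d_tau, d_tau))      # D_tau^{-1/2} A D_tau^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    order = np.argsort(-np.abs(vals))            # sort eigenvalues by magnitude
    lam, xi = vals[order[:K]], vecs[:, order[:K]]
    # R_{ik} = (lam_{k+1}/lam_1) * xi_{k+1}(i) / xi_1(i), with r = K
    return (lam[1:] / lam[0]) * (xi[:, 1:] / xi[:, [0]])
```

For a two-community network, R is an n × 1 matrix whose rows split into two well-separated clusters when the signal is strong.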

3. THEORETICAL PROPERTIES

We assume that the observed adjacency matrix Ā follows the DCBM model in (1)-(2). From now on, let θ* denote the degree parameter of the new node n + 1. Suppose k* ∈ {1, 2, . . . , K} is its true community label, and the corresponding K-dimensional membership vector is π* = e_{k*}. In (2), θ and P are not identifiable. To ensure identifiability, we assume that all diagonal entries of P are equal to 1 (if this is not true, we replace P by [diag(P)]^{-1/2} P [diag(P)]^{-1/2} and each θ_i in community k by θ_i√(P_kk), while keeping Ω = ΘΠPΠ′Θ unchanged). In the asymptotic framework, we fix K and assume n → ∞. We need some regularity conditions. For any symmetric matrix B, let ∥B∥_max denote its entry-wise maximum norm and λ_min(B) its minimum eigenvalue (in magnitude). We assume, for a constant C_1 > 0 and a positive sequence β_n (which may tend to 0),

∥P∥_max ≤ C_1, |λ_min(P)| ≥ β_n. (8)

For 1 ≤ k ≤ K, let θ^(k) ∈ R^n be the vector with θ^(k)_i = θ_i·1{i ∈ C_k}, and let θ^(k)_L and θ^(k)_U be the sub-vectors restricted to indices in L and U, respectively. We assume, for a constant C_2 > 0 and a properly small constant c_3 > 0,

max_k ∥θ^(k)∥_1 ≤ C_2 min_k ∥θ^(k)∥_1, ∥θ^(k)_L∥_2² / (∥θ^(k)_L∥_1 ∥θ∥_1) ≤ c_3 β_n, for all 1 ≤ k ≤ K. (9)

These conditions are mild. Consider (8). For identifiability, P is already scaled so that P_kk = 1 for all k. It is thus a mild condition to assume ∥P∥_max ≤ C_1. The condition |λ_min(P)| ≥ β_n is also mild, because we allow β_n → 0. Here, β_n captures the 'dissimilarity' of communities. To see this, consider a special P whose diagonal entries are 1 and whose off-diagonal entries are all equal to b; in this example, |1 − b| captures the difference between within-community and between-community connectivity, and it can be shown that |λ_min(P)| = |1 − b|. Consider (9). The first condition requires that the total degrees in different communities are balanced, which is mild. The second condition concerns degree heterogeneity.
Let θ_max and θ̄ be the maximum and the average of the θ_i, respectively. The left-hand side of the second inequality in (9) is O(n^{-1}·θ_max/θ̄), so this condition is satisfied as long as θ_max/θ̄ = O(nβ_n). This is a very mild requirement.

3.1. The misclassification error of AngleMin+

For any |U| × K matrix H, let ψ̂_k(H) = ψ(f(A^(k); H), f(X; H)) be the quantity minimized by AngleMin+, which estimates the community label of the new node by finding the minimum of ψ̂_1(H), . . . , ψ̂_K(H), with H = Π̂_U. We first introduce a counterpart of ψ̂_k(H). Recall that Ω, given in (2), is the 'signal' matrix. Let Ω^(k) ∈ R^n be given by Ω^(k)_j = Σ_{i∈L∩C_k} Ω_ij, for 1 ≤ j ≤ n, and define

ψ_k(H) = ψ(f(Ω^(k); H), f(E[X]; H)), for 1 ≤ k ≤ K.

The next lemma gives an explicit expression of ψ_k(H) for an arbitrary H.

Lemma 1. Consider the DCBM model where (8)-(9) are satisfied. Define three K × K matrices: G_LL = Π_L′Θ_LL Π_L, G_UU = Π_U′Θ_UU Π_U, and Q = G_UU^{-1} Π_U′Θ_UU H. For 1 ≤ k ≤ K, ψ_k(H) = arccos( M_{kk*} / √(M_kk M_{k*k*}) ), where M = P(G_LL² + G_UU QQ′G_UU)P.

The choice of H is flexible. For convenience, we focus on the class of H that are eligible community membership matrices, i.e., H = Π̂_U. Our theory can be easily extended to more general forms of H.

Definition 1. For any b_0 ∈ (0, 1), we say that Π̂_U is b_0-correct if min_T Σ_{i∈U} θ_i·1{T π̂_i ≠ π_i} ≤ b_0∥θ∥_1, where the minimum is taken over all permutations T of the K columns of Π̂_U.

The next two theorems study ψ_k(H) and ψ̂_k(H), respectively, for H = Π̂_U.

Theorem 1. Consider the DCBM model where (8)-(9) hold. Let k* denote the true community label of the new node. Suppose Π̂_U is b_0-correct, for a constant b_0 ∈ (0, 1). When b_0 is properly small, there exists a constant c_0 > 0, which does not depend on b_0, such that ψ_{k*}(Π̂_U) = 0 and min_{k≠k*}{ψ_k(Π̂_U)} ≥ c_0 β_n.

Theorem 2. Consider the DCBM model where (8)-(9) hold.
There exists a constant C > 0 such that for any δ ∈ (0, 1/2), with probability 1 − δ, simultaneously for 1 ≤ k ≤ K,

|ψ̂_k(Π̂_U) − ψ_k(Π̂_U)| ≤ C·√( log(1/δ) / (∥θ∥_1 · min{θ*, ∥θ^(k)_L∥_1}) ) + C·∥θ^(k)_L∥_2² / (∥θ^(k)_L∥_1 ∥θ∥_1).

Write ψ̂_k = ψ̂_k(Π̂_U) and ψ_k = ψ_k(Π̂_U) for short. When max_k{|ψ̂_k − ψ_k|} < (1/2)·min_{k≠k*}{ψ_k}, the community label of the new node is correctly estimated. We can immediately translate the results of Theorems 1-2 into an upper bound for the misclassification probability.

Corollary 1. Consider the DCBM model where (8)-(9) hold. Suppose for some constants b_0 ∈ (0, 1) and ϵ ∈ (0, 1/2), Π̂_U is b_0-correct with probability 1 − ϵ. When b_0 is properly small, there exist constants C_0 > 0 and C > 0, which do not depend on (b_0, ϵ), such that P(ŷ ≠ k*) ≤ ϵ + C·Σ_{k=1}^K exp( −C_0 β_n² ∥θ∥_1 · min{θ*, ∥θ^(k)_L∥_1} ).

Remark 4: When min_k ∥θ^(k)_L∥_1 ≥ O(θ*), the stochastic noise in X dominates the error, and the misclassification probability in Corollary 1 does not improve with more label information. Typically, the error rate is then the same as in the ideal case where Π_U is known (except that there is no ϵ in the ideal case). Hence, a little label information can make AngleMin+ perform almost as well as a fully supervised algorithm that possesses all the label information. We formalize this in Section 3.2.

Remark 5: Notice that min_T Σ_{i∈U} θ_i·1{T π̂_i ≠ π_i} ≤ (1/K!)·Σ_T Σ_{i∈U} θ_i·1{T π̂_i ≠ π_i} ≤ ((K−1)/K)·∥θ_U∥_1, where the sum is over all K! permutations T. Therefore, if ∥θ_L∥_1 ≥ (1 − K b_0/(K−1))·∥θ∥_1, then min_T Σ_{i∈U} θ_i·1{T π̂_i ≠ π_i} ≤ b_0∥θ∥_1 always holds. In other words, as long as the information in the labels is strong enough, AngleMin+ does not require any assumption on the unsupervised community detection algorithm. For AngleMin+ to be consistent, we need the bound in Corollary 1 to be o(1). This requires that, for a small constant b_0, Π̂_U is b_0-correct with probability 1 − o(1).
This is a mild requirement and can be achieved by several unsupervised community detection algorithms. The next corollary studies the specific version of AngleMin+ in which Π̂_U is obtained from SCORE+.

Corollary 2. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain Π̂_U and plug it into AngleMin+. As n → ∞, suppose for some constant q_0 > 0, min_{i∈U} θ_i ≥ q_0 max_{i∈U} θ_i, β_n∥θ_U∥ ≥ q_0 log(n), β_n²∥θ∥_1 θ* → ∞, and β_n²∥θ∥_1 min_k{∥θ^(k)_L∥_1} → ∞. Then, P(ŷ ≠ k*) → 0, so the AngleMin+ estimate is consistent.

3.2. Comparison with an information theoretical lower bound

We compare the performance of AngleMin+ with an ideal estimate that has access to all model parameters except the community label k* of the new node. For simplicity, we first consider the case K = 2. For any label predictor ỹ for the new node, define Risk(ỹ) = Σ_{k*∈[K]} P(ỹ ≠ k* | π* = e_{k*}).

Lemma 2. Consider a DCBM with K = 2 and P = (1 − b)I_2 + b·1_2 1_2′. Suppose θ* = o(1), θ*/min_k ∥θ^(k)_L∥_1 = o(1), 1 − b = o(1), and ∥θ^(1)_L∥_1/∥θ^(2)_L∥_1 = ∥θ^(1)_U∥_1/∥θ^(2)_U∥_1 = 1. There exists a constant c_4 > 0 such that

inf_ỹ {Risk(ỹ)} ≥ c_4 exp( −2[1 + o(1)] · ((1−b)²/8) · θ*(∥θ_L∥_1 + ∥θ_U∥_1) ),

where the infimum is taken over all measurable functions of A, X, and the parameters Π_L, Π_U, Θ, P, θ*. For AngleMin+, suppose the second part of condition (9) holds with c_3 = o(1) and Π̂_U is b̂_0-correct with b̂_0 → 0 almost surely. There is a constant C_4 > 0 such that

Risk(ŷ) ≤ C_4 exp( −[1 − o(1)] · ((1−b)²/8) · θ* · (∥θ_L∥_1² + ∥θ_U∥_1²)² / (∥θ_L∥_1³ + ∥θ_U∥_1³) ).

Lemma 2 indicates that the classification error of AngleMin+ nearly matches the information-theoretic lower bound for an algorithm that knows all the parameters except π*, apart from a mild difference in the exponents. This difference comes from two sources. The first is the extra factor of 2 in the exponent of the lower bound, which is largely an artifact of proof techniques, because we bound the total variation distance by the Hellinger distance (the total variation distance is hard to analyze directly). The second is the difference between ∥θ_L∥_1 + ∥θ_U∥_1 in inf_ỹ{Risk(ỹ)} and (∥θ_L∥_1² + ∥θ_U∥_1²)²/(∥θ_L∥_1³ + ∥θ_U∥_1³) in Risk(ŷ). Note that

(∥θ_L∥_1² + ∥θ_U∥_1²)²/(∥θ_L∥_1³ + ∥θ_U∥_1³) ≤ ∥θ_L∥_1 + ∥θ_U∥_1 ≤ 1.125·(∥θ_L∥_1² + ∥θ_U∥_1²)²/(∥θ_L∥_1³ + ∥θ_U∥_1³),

so this difference is quite mild. It arises from the fact that AngleMin+ does not aggregate the information in the labeled and unlabeled data by adding the first and last K coordinates of f(x; H) together.
The reason we do not do this is that unsupervised community detection methods only provide class labels up to a permutation, and in practice it is very hard to estimate this permutation, which would make the algorithm extremely unstable. To conclude, the difference between the error rate of our method and the information-theoretic lower bound is mild, demonstrating that our algorithm is nearly optimal. For a general K, we have a similar conclusion:

Theorem 3. Suppose the conditions of Corollary 1 hold, where b_0 is properly small, and suppose that Π̂_U is b_0-correct. Furthermore, assume that for a sufficiently large constant C_3, θ* ≤ 1/C_3 and θ* ≤ min_{k∈[K]} C_3∥θ^(k)_L∥_1, and that for a constant r_0 > 0, min_{k≠ℓ}{P_kℓ} ≥ r_0. Then, there is a constant c_2 = c_2(K, C_1, C_2, C_3, c_3, r_0) > 0 such that [−log(c_2·Risk(ŷ))]/[−log(inf_ỹ{Risk(ỹ)})] ≥ c_2.

3.3. In-sample Classification

In this part, we briefly discuss the in-sample classification problem. Formally, our goal is to estimate π_i for all i ∈ U. As mentioned in Section 1, an in-sample classification algorithm can be directly derived from AngleMin+: for each i ∈ U, predict the label of i by

ŷ_i(H) = arg min_{1≤k≤K} ψ( f(A^(k)_{−i}; H_i), f(A_{−i,i}; H_i) ),

where A^(k)_{−i} is the subvector of A^(k) with the ith entry removed, A_{−i,i} is the subvector of A_i with the ith entry removed, and H_i is a (|U| − 1) × K projection matrix that may differ across i. As discussed in Section 2.2, the choices of H_i are quite flexible. For theoretical convenience, we focus on the case H_i = Π̂_{U\{i}}. For any in-sample classifier ỹ = (ỹ_i)_{i∈U} ∈ [K]^{|U|}, define the in-sample risk Risk_ins(ỹ) = (1/|U|)·Σ_{i∈U} Σ_{k*∈[K]} P(ỹ_i ≠ k* | π_i = e_{k*}). For the above in-sample classification algorithm, we have theoretical results on consistency and efficiency similar to those in Sections 3.1-3.2, under some very mild conditions:

Theorem 4. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain Π̂_{U\{i}} and plug it into the above algorithm. As n → ∞, suppose for some constant q_0 > 0, min_{i∈U} θ_i ≥ q_0 max_{i∈U} θ_i, β_n∥θ_U∥ ≥ q_0 log(n), β_n²∥θ∥_1 min_{i∈U} θ_i → ∞, and β_n²∥θ∥_1 min_k{∥θ^(k)_L∥_1} → ∞. Then, (1/|U|)·Σ_{i∈U} P(ŷ_i ≠ y_i) → 0, so the above in-sample classification algorithm is consistent.

Theorem 5. Suppose the conditions of Corollary 1 hold, where b_0 is properly small, and suppose that Π̂_{U\{i}} is b_0-correct for all i ∈ U. Furthermore, assume that for a sufficiently large constant C_3, max_{i∈U} θ_i ≤ 1/C_3, max_{i∈U} θ_i ≤ min_{k∈[K]} C_3∥θ^(k)_L∥_1, log(|U|) ≤ C_3 β_n²∥θ∥_1 min_{i∈U} θ_i, and that for a constant r_0 > 0, min_{k≠ℓ}{P_kℓ} ≥ r_0.
Then, there is a constant c_21 = c_21(K, C_1, C_2, C_3, c_3, r_0) > 0 such that [−log(c_21·Risk_ins(ŷ))]/[−log(inf_ỹ{Risk_ins(ỹ)})] ≥ c_21, so the above in-sample classification algorithm is efficient.
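The leave-one-out scheme at the start of this subsection can be sketched as follows (our own illustration; as in Section 2.2, the unsupervised step is abstracted as a given cluster assignment on U, with the entry for node i dropped when predicting node i):

```python
import numpy as np

def angle(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def in_sample_predict(A, L_idx, Y_L, clusters_U, K):
    """Predict a label for every unlabeled node by treating it as the 'new node'."""
    n = A.shape[0]
    U_idx = np.setdiff1d(np.arange(n), L_idx)
    preds = np.empty(len(U_idx), dtype=int)
    for pos, i in enumerate(U_idx):
        U_keep = np.delete(U_idx, pos)          # U \ {i}
        H_lbl = np.delete(clusters_U, pos)      # cluster labels on U \ {i}

        def f(v):                               # f(.; H_i): the i-th entry of v
            out = np.empty(2 * K)               # never enters the sums below,
            for k in range(K):                  # which implements its removal
                out[k] = v[L_idx][Y_L == k].sum()
                out[K + k] = v[U_keep][H_lbl == k].sum()
            return out

        fx = f(A[:, i])                         # X = A_{-i,i}
        psis = [angle(f(A[L_idx[Y_L == k], :].sum(axis=0)), fx) for k in range(K)]
        preds[pos] = int(np.argmin(psis))
    return U_idx, preds
```

This runs AngleMin+ once per unlabeled node, matching the reduction from in-sample classification to prediction described in Section 1.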

4. EMPIRICAL STUDY

We study the performance of AngleMin+, where Π̂_U is obtained from SCORE+ (Jin et al., 2021). We compare our methods with SNMF (Yang et al., 2015) (a representative semi-supervised approach) and SCORE+ (a fully unsupervised approach). We also compare our algorithm with typical GNN methods (Kipf & Welling, 2016) in the real-data part. Simulations: To illustrate how the information in A_UU improves the classification accuracy, we also include AngleMin in (4) in the simulations. In addition, to cast light on how the information in the unlabeled data ameliorates the classification accuracy, we consider a special version of AngleMin+ that is fed only A_LL and X_L. It ignores the unlabeled data and only uses the subnetwork consisting of labeled nodes. We call it AngleMin+(subnetwork). This method is practically uninteresting, but it serves as a representative of the fully supervised approach that ignores unlabeled nodes. We simulate data from the DCBM with (n, K) = (500, 3). To generate P, we draw its off-diagonal entries from Uniform(0, 1) and then symmetrize it. We generate the degree heterogeneity parameters θ_i i.i.d. from one of the following 4 distributions: n^{-0.5}·log(n)·Gamma(3.5), n^{-0.25}·Gamma(3.5), n^{-0.5}·log(n)·Pareto(3.5), and n^{-0.25}·Pareto(3.5). They cover most scenarios: Gamma distributions have considerable mass near 0, so the network has severely low-degree nodes; Pareto distributions have heavy tails, so the network has severely high-degree nodes. The scaling n^{-0.5}·log(n) corresponds to the sparse regime, where the average node degree is ≍ log²(n), and n^{-0.25} corresponds to the dense regime, with average node degree ≍ √n. We consider two cases of Π: the balanced case (bal.), in which the π_i are i.i.d. from Multinomial(1/3, 1/3, 1/3), and the imbalanced case (imbal.), in which the π_i are i.i.d. from Multinomial(0.2, 0.2, 0.6). We repeat the simulation 100 times.
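For concreteness, the simulation parameters can be generated as follows (our own sketch; numpy's `pareto` draws are shifted by 1 here to obtain a standard Pareto with tail index 3.5, and any truncation of θ_i at 1 is a detail we leave out as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3

# The four degree-heterogeneity settings for theta_i
theta_settings = {
    "sparse, Gamma":  n**-0.5 * np.log(n) * rng.gamma(3.5, size=n),
    "dense, Gamma":   n**-0.25 * rng.gamma(3.5, size=n),
    "sparse, Pareto": n**-0.5 * np.log(n) * (rng.pareto(3.5, size=n) + 1),
    "dense, Pareto":  n**-0.25 * (rng.pareto(3.5, size=n) + 1),
}

# P: Uniform(0,1) off-diagonals, symmetrized, unit diagonal
P = rng.uniform(0, 1, size=(K, K))
P = (P + P.T) / 2
np.fill_diagonal(P, 1.0)

# Balanced vs. imbalanced community memberships
labels_bal = rng.choice(K, size=n, p=[1/3, 1/3, 1/3])
labels_imbal = rng.choice(K, size=n, p=[0.2, 0.2, 0.6])
```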
Our results are presented in Figure 1, which shows the average classification error of each algorithm as the number of labeled nodes N_L increases. The plots indicate that AngleMin+ outperforms the other methods in all cases. Furthermore, although AngleMin is not as accurate as AngleMin+ when N_L is small, it still surpasses all the other approaches except AngleMin+ in most scenarios. Compared with the supervised and unsupervised methods, which use only part of the data, AngleMin+ gains a great deal of accuracy by leveraging both the labeled and unlabeled data. Real data: We consider three benchmark datasets for community detection: Caltech (Traud et al., 2012), Simmons (Traud et al., 2012), and Polblogs (Adamic & Glance, 2005). For each dataset, we separate the nodes into 10 folds and treat each fold as the test data in turn, with the other 9 folds as training data. In the training network, we randomly choose n_L nodes as labeled nodes. We then estimate the label of each node in the test data and report the misclassification error rate (averaged over the 10 folds). We consider n_L/n ∈ {0.3, 0.5, 0.7}, where n is the number of nodes in the training data. The results are shown in Table 1. In most cases, AngleMin+ significantly outperforms the other methods (unsupervised or semi-supervised). Additionally, we notice that on the Polblogs data, the standard deviation of the error of SCORE+ is quite large, indicating that its performance is unstable. Remarkably, even though AngleMin+ uses SCORE+ for initialization, its performance is nearly unaffected: it still achieves low means and standard deviations in misclassification error. This is consistent with our theory in Section 3. We also compare the running times of the different methods (see Section B of the appendix) and find that AngleMin+ is much faster than SNMF. GNN is a popular approach for attributed node clustering.
Although GNNs are not designed for the case of no node attributes, we are still interested in whether they can be adapted to our setting via self-created features. We take the GCN method of Kipf & Welling (2016) and consider 6 schemes for creating a feature vector for each node: i) a 50-dimensional constant vector of 1's; ii) a 50-dimensional randomly generated feature vector; iii) the n-dimensional adjacency vector; iv) the vector of landing probabilities (LP) (Li et al., 2019), which contains network topology information; v) the embedding vector from node2vec (Grover & Leskovec, 2016); and vi) a practically infeasible vector e_i' A Π ∈ R^K (which uses the true Π). The results are in Table 1. GCN performs unsatisfactorily regardless of how the features are created. For example, propagating messages with all-1 vectors appears to cause over-smoothing, and using adjacency vectors as node features means that the size of the feature-transformation linear layers grows with the number of nodes in the network, which can heavily overfit due to too many parameters. We conclude that it is not easy to adapt GNNs to the case of no node attributes. For a fairer comparison, we also consider a real network, Citeseer (Sen et al., 2008), that contains node features. We consider two state-of-the-art semi-supervised GNN algorithms, GCN (Kipf & Welling, 2016) and MasG (Jin et al., 2019). Our methods can also be generalized to accommodate node features. Using the "fusion" idea surveyed in Chunaev et al. (2019), we "fuse" the adjacency matrix Ā (on n + 1 nodes) and the node features into a weighted adjacency matrix Āfuse (see the appendix for details). We denote its top-left block by A_fuse ∈ R^{n×n} and its last column by X_fuse ∈ R^n, and apply AngleMin+ with (A, X) replaced by (A_fuse, X_fuse). The misclassification error averaged over 10 data splits is reported in Table 2. The error rates of GCN and MasG are quoted from the original papers, which are based on one particular data split.
We also re-run GCN on our 10 data splits. 
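To make the "fuse" step described above concrete, here is a small hypothetical sketch. The function name `fuse_adjacency`, the cosine-similarity construction, and the mixing weight `alpha` are all illustrative assumptions on our part; the paper's exact construction is given in its appendix:

```python
import numpy as np

def fuse_adjacency(A_bar, F, alpha=0.5):
    """Hypothetical 'fusion' of an adjacency matrix with node features.

    A_bar: (n+1) x (n+1) binary adjacency; F: (n+1) x d feature matrix;
    alpha: mixing weight. All three names are illustrative assumptions.
    """
    # Cosine similarity between feature vectors, mapped into [0, 1].
    norms = np.clip(np.linalg.norm(F, axis=1, keepdims=True), 1e-12, None)
    Fn = F / norms
    S = (Fn @ Fn.T + 1.0) / 2.0
    np.fill_diagonal(S, 0.0)
    return alpha * A_bar + (1.0 - alpha) * S
```

One would then take A_fuse as the top-left n × n block of the fused matrix and X_fuse as the first n entries of its last column, and run AngleMin+ on (A_fuse, X_fuse).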

Conclusion and discussions:

In this paper, we propose AngleMin+, a fast semi-supervised community detection algorithm based on a structural similarity metric under the DCBM. Our method can handle degree heterogeneity and non-assortative networks, is computationally fast, and possesses favorable theoretical properties on consistency and efficiency. Our algorithm also performs well on both simulated and real data, indicating its practical utility. There are possible extensions of our method. It does not directly handle soft labels (a.k.a. mixed memberships), where the available label information is the probability of a node belonging to each community. We are currently working to address this by extending our algorithm to the degree-corrected mixed membership model (DCMM) and developing sharp theory for it.

ACKNOWLEDGMENTS

This work is partially supported by the NSF CAREER grant DMS-1943902.

ETHICS STATEMENT

This paper proposes a novel semi-supervised community detection algorithm, AngleMin+, based on a structural similarity metric under the DCBM. Our method could be maliciously used to identify certain groups of people, such as dissenters. This is a common drawback of all community detection algorithms, and we believe it can be mitigated by replacing the network data with differentially private counterparts. All the real data we use come from public datasets that we have clearly cited, and we do not expect them to raise privacy issues or other potential problems.

REPRODUCIBILITY STATEMENT

We provide detailed theory for our algorithm AngleMin+: we derive explicit bounds for the misclassification probability of our method under the DCBM and show that it is consistent. We also study the efficiency of our method by comparing its misclassification probability with that of an ideal classifier that has access to the community labels of all nodes. Additionally, we provide clear explanations of and insights into our theory. All the proofs, together with some generalizations of our theory, are available in the appendix. We also perform empirical studies of our proposed algorithms in both simulated and real data settings, covering a large number of scenarios in each case. All the code is available in the supplementary materials. For reference, the prediction step of AngleMin+ is: let 1^(k) indicate the labeled nodes with y_i = k, 1 ≤ k ≤ K, and let H = Π̂_U; compute x = [X_L' Π_L, X_U' H]' and v_k = [e_k' Π_L' A_LL Π_L, e_k' Π_L' A_LU H]' for 1 ≤ k ≤ K; let k* minimize the angle between v_k and x over 1 ≤ k ≤ K (if there is a tie, pick the smaller k), and output ŷ = k*. The running time comparison (Section B of the appendix) shows that AngleMin+ is much faster than all the other algorithms; this is one of the merits of our method.
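The prediction step restated above translates directly into code. A minimal sketch (the function name and 0-indexed communities are conventions of this sketch, and the estimated membership matrix Π̂_U is passed in as H):

```python
import numpy as np

def anglemin_plus(A, X, Pi_L, H, L_idx, U_idx):
    """Sketch of the AngleMin+ prediction step.

    A: n x n adjacency matrix; X: length-n edge vector of the new node;
    Pi_L: |L| x K membership matrix of the labeled nodes; H: |U| x K estimated
    membership of the unlabeled nodes; L_idx / U_idx: index sets.
    """
    A_LL = A[np.ix_(L_idx, L_idx)]
    A_LU = A[np.ix_(L_idx, U_idx)]
    # x aggregates the new node's edges toward each (estimated) community.
    x = np.concatenate([X[L_idx] @ Pi_L, X[U_idx] @ H])
    K = Pi_L.shape[1]
    angles = []
    for k in range(K):
        e_k = np.eye(K)[k]
        # v_k aggregates community k's edges in the same coordinates.
        v_k = np.concatenate([e_k @ Pi_L.T @ A_LL @ Pi_L,
                              e_k @ Pi_L.T @ A_LU @ H])
        cos = v_k @ x / max(np.linalg.norm(v_k) * np.linalg.norm(x), 1e-12)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return int(np.argmin(angles))  # argmin breaks ties at the smaller k
```

On a toy two-community graph where the new node attaches only to one community, the estimate picks that community.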

C COMPARISON WITH LOCAL REFINEMENT ALGORITHM

We first illustrate with an example why local refinement may not work, and then explain the insight behind it. Consider a network with n = 4m nodes and K = 2 communities. Suppose there are 2m labeled nodes: m of them in community C1 with degree heterogeneity θ = 0.8, and the other m in community C2 with degree heterogeneity θ = 0.5. There are 2m unlabeled nodes: m of them in community C1 with degree heterogeneity θ = 0.6, and the other m in community C2 with degree heterogeneity θ = 0.7. The P matrix has diagonal entries 1 and off-diagonal entries 0.9:

P = ( 1, 0.9 ; 0.9, 1 ).

Under this setting, all the assumptions in our paper are satisfied. Recall that the prototypical refinement algorithm, Algorithm 2 of Gao et al. (2018), is

ŷ_i = argmax_{u∈[K]} (1 / |{j : ŷ⁰(j) = u}|) Σ_{j: ŷ⁰(j)=u} A_ij,

where ŷ⁰ is an initial vector of community labels and ŷ is the refined label vector. For the semi-supervised setting, one may consider the following modification of the local refinement algorithm: (i) apply the local refinement algorithm, with the known labels, to assign the nodes in U; (ii) with the labels of all nodes, update the label of every node by applying the same refinement procedure. Under the setting of our toy example, in step (i), all the unlabeled nodes that are actually in community C2 will be assigned to community C1 with probability converging to 1 as n → ∞. The reason is that for any unlabeled node i actually in community C2: when u = 1, {A_ij : j ∈ L, y_j = u} are i.i.d. Bern(θ_i θ_j P_21) = Bern(θ_i · 0.8 · 0.9) = Bern(0.72 θ_i); when u = 2, {A_ij : j ∈ L, y_j = u} are i.i.d. Bern(θ_i θ_j P_22) = Bern(θ_i · 0.5 · 1) = Bern(0.5 θ_i). Hence, by the law of large numbers,

(1 / |{j ∈ L : y_j = u}|) Σ_{j∈L: y_j=u} A_ij → 0.72 θ_i a.s. if u = 1, and → 0.5 θ_i a.s. if u = 2.

Consequently, the prototypical refinement algorithm will incorrectly assign all the unlabeled nodes that are actually in community C2 to C1 with probability converging to 1 as n → ∞. This causes a classification error of at least 50%. Given the huge classification error in step (i), step (ii) will also perform poorly. By a similar argument via the law of large numbers, it can be shown that after step (ii) the algorithm will still assign all the unlabeled nodes that are actually in community C2 to C1 with probability converging to 1 as n → ∞. In other words, even if the local refinement algorithm is applied to the whole network, a classification error of at least 50% always remains. Even when all the node labels are known, applying the local refinement algorithm can still cause severe errors. Consider our toy example again, and suppose we know the labels of all the nodes and run the local refinement algorithm on these known labels in an attempt to purify them. By the law of large numbers, it is not hard to show that for any node i actually in community C2,

(1 / |{j : y_j = u}|) Σ_{j: y_j=u} A_ij → 0.63 θ_i a.s. if u = 1, and → 0.6 θ_i a.s. if u = 2.

Consequently, as in the previous cases, the local refinement algorithm will incorrectly assign all the nodes that are actually in community C2 to C1 with probability converging to 1 as n → ∞. This causes a classification error of at least 50%, even though the input to the algorithm is the true label vector. To conclude, local refinement algorithms may not work under the broad settings of our paper. Intrinsically, label refinement is quite challenging even under moderate degree heterogeneity, not to mention in non-assortative networks. The local refinement algorithm works theoretically because strong assumptions on degree heterogeneity are imposed. For instance, it is required that the mean of the degree heterogeneity parameters in each community is 1 + o(1), which means the network is extremely dense and the degree heterogeneity parameters across communities are strongly balanced. Neither assumption is likely to hold in the real world, where most networks are sparse and imbalanced. Gao et al. (2018) is a very good paper, but we think that local refinement or similar algorithms may not be good choices for our problem.
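The arithmetic behind the toy example can be checked directly. A minimal sketch of the step (i) limiting averages for one unlabeled C2 node (taking θ_i = 0.7 as in the example):

```python
# Limiting averages from the toy example: an unlabeled node i truly in C2.
theta_i = 0.7                      # its degree heterogeneity parameter
theta_L1, theta_L2 = 0.8, 0.5      # labeled-node parameters in C1, C2
P21, P22 = 0.9, 1.0                # relevant entries of P

mean_toward_C1 = theta_i * theta_L1 * P21   # 0.72 * theta_i
mean_toward_C2 = theta_i * theta_L2 * P22   # 0.50 * theta_i

# Refinement compares these averages and (wrongly) prefers C1.
assert mean_toward_C1 > mean_toward_C2
```

Because the average toward the wrong community dominates for every such node, the refinement step misassigns all of them, exactly as argued above.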

D GENERALIZATION OF LEMMA 2

In the main paper, for smoothness and readability, we do not present the most general form of Lemma 2. A more general version of the lemma is obtained by relaxing the condition ∥θ_L^(1)∥_1 / ∥θ_L^(2)∥_1 = ∥θ_U^(1)∥_1 / ∥θ_U^(2)∥_1 = 1. Please see Section J for more details.

E PRELIMINARIES

For any positive integer N, define [N] = {1, 2, ..., N}. For a matrix D and two index sets S1, S2, define D_{S1S2} to be the submatrix (D_ij)_{i∈S1, j∈S2}, D_{S1•} to be the submatrix (D_ij)_{i∈S1, j∈L∪U}, and D_{•S2} to be the submatrix (D_ij)_{i∈L∪U, j∈S2}. The two main assumptions (8) and (9) of the main paper are restated below for convenience:

∥P∥_max ≤ C1, |λ_min(P)| ≥ β_n, (8)

max_{1≤k≤K}{∥θ^(k)∥_1} / min_{1≤k≤K}{∥θ^(k)∥_1} ≤ C2, max_{1≤k≤K} ∥θ_L^(k)∥_2 / (∥θ_L^(k)∥_1 ∥θ∥_1) ≤ c3 β_n, (9)

where the constant c3 is properly small; we specify this precisely in our proofs. A number of lemmas used in our proofs are presented below. The following lemma shows that sin x and x have the same order.

Lemma 3. Let x ∈ R. When x ≥ 0, sin x ≤ x; when x ∈ [0, π/2], sin x ≥ (2/π)x.

Lemma 3 is elementary, but for completeness we provide a proof.

Proof. Let g1(x) = sin x − x. Then (d/dx)g1(x) = cos x − 1 ≤ 0, so g1 is monotonically decreasing on R. As a result, when x ≥ 0, g1(x) ≤ g1(0) = 0; therefore, when x ≥ 0, sin x ≤ x. Let g2(x) = sin x − (2/π)x. Then (d/dx)g2(x) = cos x − 2/π. Since cos x is monotonically decreasing on [0, π/2], (d/dx)g2(x) ≥ 0 when x ∈ [0, arccos(2/π)] and (d/dx)g2(x) ≤ 0 when x ∈ [arccos(2/π), π/2]. Hence g2 is monotonically increasing on [0, arccos(2/π)] and monotonically decreasing on [arccos(2/π), π/2]. As a result, when x ∈ [0, π/2], g2(x) ≥ min{g2(0), g2(π/2)} = 0; therefore, when x ∈ [0, π/2], sin x ≥ (2/π)x.

The following lemma shows that the angle ψ(u, v) in Definition 1 satisfies the triangle inequality, so it can be regarded as a sort of "metric".

Lemma 4 (Angle Inequality). Let x, y, z be real vectors. Then ψ(x, z) ≤ ψ(x, y) + ψ(y, z).

A proof of Lemma 4 can be found in Gustafson & Rao (1997), p. 56; for completeness, we also provide one here. Proof.
Let x̃ = x/∥x∥, ỹ = y/∥y∥, z̃ = z/∥z∥; then cos ψ(x, y) = ⟨x̃, ỹ⟩, cos ψ(y, z) = ⟨ỹ, z̃⟩, and cos ψ(x, z) = ⟨x̃, z̃⟩. Consider the matrix

G = ( 1, cos ψ(x, y), cos ψ(x, z) ; cos ψ(x, y), 1, cos ψ(y, z) ; cos ψ(x, z), cos ψ(y, z), 1 ),

which is the Gram matrix of (x̃, ỹ, z̃): for any c = (c1, c2, c3)' ∈ R³, c'Gc = ⟨c1 x̃ + c2 ỹ + c3 z̃, c1 x̃ + c2 ỹ + c3 z̃⟩ ≥ 0. Also, G is symmetric. Therefore, G is positive semi-definite, and det(G) ≥ 0. In other words,

1 − cos²ψ(x, y) − cos²ψ(y, z) − cos²ψ(x, z) + 2 cos ψ(x, y) cos ψ(y, z) cos ψ(x, z) ≥ 0.

The above inequality can be rewritten as

(1 − cos²ψ(x, y))(1 − cos²ψ(y, z)) ≥ (cos ψ(x, y) cos ψ(y, z) − cos ψ(x, z))²,

or (sin ψ(x, y) sin ψ(y, z))² ≥ (cos ψ(x, y) cos ψ(y, z) − cos ψ(x, z))². By the definition of arccos, ψ(x, y), ψ(y, z) ∈ [0, π], so sin ψ(x, y) sin ψ(y, z) ≥ 0. Therefore,

−sin ψ(x, y) sin ψ(y, z) ≤ cos ψ(x, y) cos ψ(y, z) − cos ψ(x, z) ≤ sin ψ(x, y) sin ψ(y, z),

which gives cos ψ(x, z) ≥ cos ψ(x, y) cos ψ(y, z) − sin ψ(x, y) sin ψ(y, z) = cos(ψ(x, y) + ψ(y, z)). If ψ(x, y) + ψ(y, z) > π, then because ψ(x, z) ∈ [0, π] by the definition of arccos, it is immediate that ψ(x, z) ≤ ψ(x, y) + ψ(y, z). If ψ(x, y) + ψ(y, z) ≤ π, recall that ψ(x, y), ψ(y, z) ∈ [0, π], hence ψ(x, y) + ψ(y, z) ∈ [0, π]; also ψ(x, z) ∈ [0, π]. Since cos is monotonically decreasing on [0, π], we obtain ψ(x, z) ≤ ψ(x, y) + ψ(y, z). In all cases, ψ(x, z) ≤ ψ(x, y) + ψ(y, z).

The following lemma relates angles to Euclidean distance.

Lemma 5. Suppose x, y ∈ R^m and ∥y∥ < ∥x∥. Then ψ(x, x + y) ≤ arcsin(∥y∥/∥x∥), with equality if and only if ⟨y, x + y⟩ = 0.

Proof. Let ρ = ∥y∥/∥x∥ and ψ0 = ψ(x, y), so that ⟨x, y⟩ = ∥x∥∥y∥ cos ψ0. Notice that ρ²(ρ + cos ψ0)² ≥ 0. This can be rewritten as (1 + ρ cos ψ0)² ≥ (1 + ρ² + 2ρ cos ψ0)(1 − ρ²). Since ∥y∥ < ∥x∥, we have ρ < 1 and 1 + ρ cos ψ0 > 0.
Hence

(1 + ρ cos ψ0) / √(1 + ρ² + 2ρ cos ψ0) ≥ √(1 − ρ²).

Plugging in ρ = ∥y∥/∥x∥, we have

(∥x∥² + ∥x∥∥y∥ cos ψ0) / (∥x∥ √(∥x∥² + ∥y∥² + 2∥x∥∥y∥ cos ψ0)) ≥ √(1 − ∥y∥²/∥x∥²).

Since ⟨x, y⟩ = ∥x∥∥y∥ cos ψ0, this becomes

⟨x, x + y⟩ / (∥x∥ √(⟨x + y, x + y⟩)) ≥ √(1 − ∥y∥²/∥x∥²).

In other words, cos ψ(x, x + y) ≥ √(1 − ∥y∥²/∥x∥²) = cos(arcsin(∥y∥/∥x∥)). Since ∥y∥/∥x∥ ≥ 0, arcsin(∥y∥/∥x∥) ∈ [0, π/2]. Therefore, by the monotonicity of cos on [0, π/2], ψ(x, x + y) ≤ arcsin(∥y∥/∥x∥). Equality holds if and only if ρ²(ρ + cos ψ0)² = 0, or equivalently (∥y∥² + ∥x∥∥y∥ cos ψ0)² = 0, which reduces to ⟨y, x + y⟩ = 0.

F PROOF OF LEMMA 1

Lemma 1. Consider the DCBM model where (8)–(9) are satisfied. We define three K × K matrices: G_LL = Π_L' Θ_LL Π_L, G_UU = Π_U' Θ_UU Π_U, and Q = G_UU^{-1} Π_U' Θ_UU H. For 1 ≤ k ≤ K,

ψ_k(H) = arccos( M_{kk*} / (√(M_{kk}) √(M_{k*k*})) ), where M = P (G_LL² + G_UU Q Q' G_UU) P.

Proof. Recall that

ψ_k(H) = ψ( f(Ω^(k); H), f(EX; H) ), for 1 ≤ k ≤ K, (11)

where f(x; H) = [x_L' 1^(1), ..., x_L' 1^(K), x_U' h_1, ..., x_U' h_K]' = [x_L' Π_L, x_U' H]'.
( ) and Ω (k) j = i∈L∩C k Ω ij = e ′ k Π ′ L Ω L• e j which indicates Ω (k) = (e ′ k Π ′ L Ω L• ) ′ Hence f (Ω (k) ; H) = f ((e ′ k Π ′ L Ω L• ) ′ ; H) = [e ′ k Π ′ L Ω LL Π L , e ′ k Π ′ L Ω LU H] ′ = [e ′ k Π ′ L Θ LL Π L P Π T L Θ LL Π L , e ′ k Π ′ L Θ LL Π L P Π T U Θ U U H] ′ = [Π ′ L Θ LL Π L , H ′ Π ′ U Θ U U ] ′ P Π ′ L Θ LL Π L e k = [G LL , Q ′ G U U ] ′ P Π ′ L Θ LL Π L e k Notice that (Π ′ L Θ LL Π L ) kl = 1 (k) Θ LL 1 (l) = ∥θ (k) L ∥ 1 , k = l 0, k ̸ = l In other words, Π ′ L Θ LL Π L = diag ∥θ (1) L ∥ 1 , ..., ∥θ (K) L ∥ 1 Hence f (Ω (k) ; H) = ∥θ (k) L ∥ 1 [G LL , Q ′ G U U ] ′ P e k Similarly, f (EX; H) = θ * [G LL , Q ′ G U U ] ′ P e k * Therefore, ⟨f (Ω (k) ; H), f (EX; H)⟩ = (∥θ (k) L ∥ 1 [G LL , Q ′ G U U ] ′ P e k ) ′ θ * [G LL , Q ′ G U U ] ′ P e k * = θ * ∥θ (k) L ∥ 1 e ′ k P [G LL , G U U Q][G LL , Q ′ G U U ] ′ P e k * = θ * ∥θ (k) L ∥ 1 e ′ k P G 2 LL + G U U QQ ′ G U U P e k * = θ * ∥θ (k) L ∥ 1 e ′ k M e k * = θ * ∥θ (k) L ∥ 1 M kk * (14) Similarly, ∥f (Ω (k) ; H)∥ = ⟨f (Ω (k) ; H), f (Ω (k) ; H)⟩ = ∥θ (k) L ∥ 1 M kk (15) ∥f (EX; H)∥ = ⟨f (EX; H), f (EX; H)⟩ = θ * M k * k * (16) Hence, ψ k (H) = ψ f (Ω (k) ; H), f (EX; H) = arccos ⟨f (Ω (k) ; H), f (EX; H)⟩ ∥f (Ω (k) ; H)∥∥f (EX; H)∥ = arccos θ * ∥θ (k) L ∥ 1 M kk * ∥θ (k) L ∥ 1 √ M kk θ * √ M k * k * = arccos M kk * √ M kk √ M k * k * G PROOF OF THEOREM 1 Theorem 1. Consider the DCBM model where ( 8)-( 9) hold. Let k * denote the true community label of the new node. Suppose ΠU is b 0 -correct, for a constant b 0 ∈ (0, 1). When b 0 is properly small, there exists a constant c 0 > 0, which does not depend on b 0 , such that ψ k * ( ΠU ) = 0 and min k̸ =k * {ψ k ( ΠU )} ≥ c 0 β n . Proof. Define G LL = Π ′ L Θ LL Π L , G U U = Π ′ U Θ U U Π U , Q = G -1 U U Π ′ U Θ U U ΠU , and M = P G 2 LL + G U U QQ ′ G U U P as in Lemma 1. 
According to Lemma 1, ψ k ( ΠU ) = arccos M kk * √ M kk √ M k * k * Hence, ψ k * ( ΠU ) = arccos M k * k * √ M k * k * √ M k * k * = arccos 1 = 0 When k ̸ = k * , according to Lemma 3, ψ k ( ΠU ) = 2 • 1 2 ψ k ( ΠU ) ≥ 2 sin 1 2 ψ k ( ΠU ) = 2 1 -cos ψ k ( ΠU ) 2 = 2 1 -cos arccos M kk * √ M kk √ M k * k * = 2 1 - M kk * √ M kk √ M k * k * (18) Let D M = diag(M 11 , ..., M KK ), M = D -1 2 M M D -1 2 M . Then (e k -e k * ) ′ M (e k -e k * ) = Mkk + Mk * k * -Mkk * -Mk * k = M kk √ M kk √ M kk + M k * k * √ M k * k * √ M k * k * - M kk * √ M kk √ M k * k * - M k * k M k * k * √ M kk = 2 1 - M kk * √ M kk √ M k * k * (19) Hence, ψ k ( ΠU ) ≥ (e k -e k * ) ′ M (e k -e k * ) M is affected by ΠU and is complicated to evaluate directly. Hence, we would first evaluate its oracle version and then reduce the noisy version to the oracle version. Define the oracle version of M as follow, where ΠU is replaced by Π U M (0) = P G 2 LL + G 2 U U P Similarly, define the oracle version of D M , D M (0) = diag(M (0) 11 , ..., M KK ), and the oracle version of M , M (0) = D -1 2 M (0) M (0) D -1 2 M (0) Oracle Case We first study the oracle case |α ′ M (0) α|. 
Since G LL = Π ′ L Θ LL Π L = diag ∥θ (1) L ∥ 1 , ..., ∥θ (K) L ∥ 1 , G U U = Π ′ U Θ U U Π U = diag ∥θ (1) U ∥ 1 , ..., ∥θ (K) U ∥ 1 , which indicates that G 2 LL + G 2 U U = diag ∥θ (k) L ∥ 2 1 + ∥θ (k) U ∥ 2 1 K k=1 , for any vector α ∈ R k , |α ′ M (0) α| = |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| = |α ′ D -1 2 M (0) P G 2 LL + G 2 U U P D -1 2 M (0) α| ≥ ∥P D -1 2 M (0) α∥ 2 min k (∥θ (k) L ∥ 2 1 + ∥θ (k) U ∥ 2 1 ) (Cauchy-Schwartz Inequality) ≥ |α ′ D -1 2 M (0) P P ′ D -1 2 M (0) α| min k 1 2 (∥θ (k) L ∥ 1 + ∥θ (k) U ∥ 1 ) 2 ≥ 1 2 λ min (P ) 2 ∥D -1 2 M (0) α∥ 2 min k (∥θ (k) ∥ 1 ) 2 (Condition (8)) ≥ 1 2 β 2 n |αD -1 M (0) α| min k (∥θ (k) ∥ 1 ) 2 ≥ 1 2 β 2 n ∥α∥ 2 min k (∥θ (k) L ∥ 2 1 + ∥θ (k) U ∥ 2 1 ) -1 min k (∥θ (k) ∥ 1 ) 2 ≥ 1 2 β 2 n ∥α∥ 2 min k (∥θ (k) L ∥ 1 + ∥θ (k) U ∥ 1 ) 2 -1 (min k ∥θ (k) ∥ 1 ) 2 = 1 2 β 2 n ∥α∥ 2 min k ∥θ (k) ∥ 1 max k ∥θ (k) ∥ 1 2 (Condition (9)) ≥ β 2 n ∥α∥ 2 2C 2 2 (21) It remains to study the noisy case. We reduce the noisy case to the oracle case through the following lemma. Lemma 6. Denote C 5 = 8K 2 √ KC 2 2 b 0 ∥θ U ∥ 1 ∥θ∥ 1 (22) Suppose that C 5 ≤ 1 4 . Then, for any vector α ∈ R k , |α ′ M α| ≥ 1 -3C 5 1 -C 5 |α ′ M (0) α| ≥ 1 3 |α ′ M (0) α| The proof of Lemma 6 is quite tedious and we would defer it to the end of this section. Published as a conference paper at ICLR 2023 Set b 0 ≤ 1 32K 2 √ KC 2 2 , then C 5 ≤ ∥θ U ∥1 4∥θ∥1 ≤ 1 4 . 
As a result, combining (21) with Lemma 6, we have for any vector α ∈ R k , |α ′ M α| ≥ 1 3 |α ′ M (0) α| ≥ β 2 n ∥α∥ 2 6C 2 2 (24) Hence, take α = e k -e k * in ( 24) and combine it with (20), we obtain that for any k ∈ [K] ψ k ( ΠU ) ≥ (e k -e k * ) ′ M (e k -e k * ) ≥ 1 6C 2 2 β 2 n ∥e k -e k * ∥ 2 = 1 3C 2 2 β n (25) Therefore, set c 0 = 1 3C 2 2 , we have min k̸ =k * {ψ k ( ΠU )} ≥ c 0 β n In all, when b 0 is properly small such that b 0 ≤ 1 32K 2 √ KC 2 2 , there exists constant c 0 = 1 3C 2 2 > 0 not depending on b 0 such that ψ k * ( ΠU ) = 0 and min k̸ =k * {ψ k ( ΠU )} ≥ c 0 β n . G.1 PROOF OF LEMMA 6 Proof. For any vector α ∈ R k , |α ′ M α -α ′ M (0) α| = |α ′ D -1 2 M M D -1 2 M α -α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| = |α ′ D -1 2 M (M -M (0) )D -1 2 M α + α ′ D -1 2 M M (0) D -1 2 M α -α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| ≤ |α ′ D -1 2 M (M -M (0) )D -1 2 M α| + |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| The first part on the RHS of ( 27), |α ′ D -1 2 M (M -M (0) )D -1 2 M α| = |α ′ D -1 2 M P Π ′ U Θ U U ( ΠU Π′ U -Π U Π ′ U )Θ U U Π U P D -1 2 M α| ≤ ∥P D -1 2 M α∥ 2 ∥Π ′ U Θ U U ( ΠU Π′ U -Π U Π ′ U )Θ U U Π U ∥ 2 ≤ √ K∥P D -1 2 M α∥ 2 ∥Π ′ U Θ U U ( ΠU Π′ U -Π U Π ′ U )Θ U U Π U ∥ ∞ (28) Denote G (d) = Π ′ U Θ U U ( ΠU Π′ U -Π U Π ′ U )Θ U U Π U . Define η l l = i∈U ,πi=e l ,πi=el θ i . In other words, η l l is the sum of the degree heterogeneity parameters of all the nodes in U with true label l and estimated label l. Then, (Π ′ U Θ U U ΠU ) l l = η l l (Π ′ U Θ U U Π U ) l l = i∈U ,πi=e l ,πi=el θ i = I l= l s∈[K] η ls where I l= l is the indicator function of event {l = l}. Hence, G (d) l l = (Π ′ U Θ U U ( ΠU Π′ U -Π U Π ′ U )Θ U U Π U ) l l = ((Π ′ U Θ U U ΠU )(Π ′ U Θ U U ΠU ) ′ ) l l -((Π ′ U Θ U U Π U )(Π ′ U Θ U U Π U ) ′ ) l l = s∈[K] η ls η ls -I l= l( s∈[K] η ls ) 2 (29) Since ΠU is b 0 correct, there exists permutation T of K columns of ΠU such that i∈U θ i •1{T πi ̸ = π i } ≤ b 0 ∥θ∥ 1 . 
Let r = r(l) satisfies e r = T -1 e l . When l = l, we have |G (d) l l | = | s∈[K] η 2 ls -( s∈[K] η ls ) 2 | = ( s∈[K] η ls ) 2 - s∈[K] η 2 ls ≤ ( s∈[K] η ls ) 2 -η 2 lr = ( s̸ =r η ls )(η lr + s∈[K] η ls ) ≤ 2( s̸ =r η ls )( s∈[K] η ls ) When l ̸ = l, we have |G (d) l l | = | s∈[K] η ls η ls | = s∈[K] η ls η ls (31) Therefore, ∥G (d) ∥ ∞ ≤ l, l∈[K] |G (d) l l | = l∈[K] |G (d) ll | + l̸ = l |G (d) ll | ≤ l∈[K] 2( s̸ =r η ls )( s∈[K] η ls ) + l̸ = l s∈[K] η ls η ls ≤ 2 max l s∈[K] η ls l∈[K],s̸ =r(l) η ls + s∈[K] l̸ = l η ls η ls = 2 max l s∈[K] η ls l∈[K],s̸ =r(l) η ls + s∈[K] l∈[k] η ls 2 - l∈[k] η 2 ls ≤ 2 max l s∈[K] η ls l∈[K],s̸ =r(l) η ls + s∈[K] l∈[k] η ls 2 - r(l)=s η 2 ls = 2 max l s∈[K] η ls l∈[K],s̸ =r(l) η ls + s∈[K] r(l)̸ =s η ls l∈[K] η ls + r(l)=s η ls ≤ 2 max l s∈[K] η ls l∈[K],s̸ =r(l) η ls + 2 s∈[K] r(l)̸ =s η ls l∈[K] η ls ≤ 2 max l s∈[K] η ls l∈[K],s̸ =r(l) η ls + 2 max s l∈[K] η ls s∈[K] r(l)̸ =s η ls = 2 l∈[K],s̸ =r(l) η ls max l s∈[K] η ls + max s l∈[K] η ls ≤ 2 l∈[K],s̸ =r(l) η ls l∈[K] s∈[K] η ls + s∈[K] l∈[K] η ls = 4∥θ U ∥ 1 l∈[K],s̸ =r(l) η ls (32) Recall that T satisfies i∈U θ i • 1{T πi ̸ = π i } ≤ b 0 ∥θ∥ 1 , hence l∈[K],s̸ =r(l) η ls = l∈[K],s̸ =r(l), i∈U ,πi=e l ,πi=es θ i = i∈U θ i • 1{T πi ̸ = π i } ≤ b 0 ∥θ∥ 1 Therefore, ∥G (d) ∥ ∞ ≤ 4b 0 ∥θ U ∥ 1 ∥θ∥ 1 (33) Plugging ( 33) into (28), we obtain |α ′ D -1 2 M (M -M (0) )D -1 2 M α| ≤ 4 √ Kb 0 ∥θ U ∥ 1 ∥θ∥ 1 ∥P D -1 2 M α∥ 2 On the other hand, |α ′ D -1 2 M M (0) D -1 2 M α| = |α ′ D -1 2 M P G 2 LL + G 2 U U P D -1 2 M α| Since G LL = Π ′ L Θ LL Π L = diag(∥θ L ∥ 1 , ..., ∥θ (K) L ∥ 1 ), G U U = Π ′ U Θ U U Π U = diag(∥θ (1) U ∥ 1 , ..., ∥θ (K) U ∥ 1 ), |α ′ D -1 2 M M (0) D -1 2 M α| ≥ ∥P D -1 2 M α∥ 2 min k (∥θ (k) L ∥ 2 1 + ∥θ (k) U ∥ 2 1 ) (Cauchy-Schwartz Inequality) ≥ ∥P D -1 2 M α∥ 2 min k 1 2 (∥θ (k) L ∥ 1 + ∥θ (k) U ∥ 1 ) 2 = 1 2 ∥P D -1 2 M α∥ 2 min k ∥θ (k) ∥ 2 1 Recall condition (9) in the main paper, max 1≤k≤K {∥θ (k) ∥ 1 } min 1≤k≤K {∥θ (k) ∥ 1 } ≤ C 2 Hence |α ′ D 
-1 2 M M (0) D -1 2 M α| ≥ 1 2 ∥P D -1 2 M α∥ 2 ( 1 C 2 max k ∥θ (k) ∥ 1 ) 2 ≥ 1 2 ∥P D -1 2 M α∥ 2 ( 1 KC 2 ∥θ∥ 1 ) 2 = 1 2K 2 C 2 2 ∥P D -1 2 M α∥ 2 ∥θ∥ 2 1 (35) Comparing ( 35) with (34), we obtain |α ′ D -1 2 M (M -M (0) )D -1 2 ≤ 8K 2 √ KC 2 2 b 0 ∥θ U ∥ 1 ∥θ∥ 1 |α ′ D -1 2 M M (0) D -1 2 M α| ≤ C 5 |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| + C 5 |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| Consequently, we bound the first part of ( 27) by the second part of ( 27). It remains to bound the second part of ( 27). Since D M , D M (0) , M (0) are all diagonal matrices, we can rewrite the second part on the LHS of ( 27) as follows: |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| = |α ′ D -1 2 M (0) (M (0) ) 1 2 (M (0) ) -1 2 D 1 2 M (0) D -1 2 M M (0) D -1 2 M D 1 2 M (0) (M (0) ) -1 2 -1 (M (0) ) 1 2 D -1 2 M (0) α| = |α ′ D -1 2 M (0) (M (0) ) 1 2 D M (0) D -1 M -1 (M (0) ) 1 2 D -1 2 M (0) α| ≤ λ max D M (0) D -1 M -1 ∥(M (0) ) 1 2 D -1 2 M (0) α∥ 2 = max k∈[K] M (0) kk M kk -1 • ∥(M (0) ) 1 2 D -1 2 M (0) α∥ 2 = max k∈[K] 1 M (0) kk (M (0) -M ) kk -1 • |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| Notice that for any k ∈ [K] M (0) kk (M (0) -M ) kk = e ′ k M (0) e k e ′ k (M (0) -M )e k = e ′ k P G 2 LL + G 2 U U P e k e ′ k P G (d) P e k ≥ ∥P e k ∥ 2 min k (∥θ (k) L ∥ 2 1 + ∥θ (k) U ∥ 2 1 ) ∥P e k ∥ 2 ∥G (d) ∥ 2 (Cauchy-Schwartz Inequality) ≥ min k 1 2 (∥θ (k) L ∥ 1 + ∥θ (k) U ∥ 1 ) 2 √ K∥G (d) ∥ ∞ (Plugging in (33)) ≥ min k (∥θ (k) ∥ 1 ) 2 8 √ Kb 0 ∥θ U ∥ 1 ∥θ∥ 1 (Condition (9)) ≥ (max k 1 C2 ∥θ (k) ∥ 1 ) 2 8 √ Kb 0 ∥θ U ∥ 1 ∥θ∥ 1 ≥ ( 1 C2K ∥θ∥ 1 ) 2 8 √ Kb 0 ∥θ U ∥ 1 ∥θ∥ 1 = 1 C 5 ≥ 4 > 1 (38) Plugging ( 38) into (37), we have |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| ≤ max k∈[K] 1 M (0) kk (M (0) -M ) kk -1 • |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| ≤ max k∈[K] 1 1 C5 -1 • |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| = C 5 1 -C 5 |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| Combining ( 27), (36), and (39), we have |α 
′ M α -α ′ M (0) α| ≤ |α ′ D -1 2 M (M -M (0) )D -1 2 M α| + |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| ≤ C 5 |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| + C 5 |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| + |α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| = C 5 |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| + (C 5 + 1)|α ′ D -1 2 M M (0) D -1 2 M -D -1 2 M (0) M (0) D -1 2 M (0) α| ≤ C 5 |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| + (C 5 + 1) C 5 1 -C 5 |α ′ D -1 2 M (0) M (0) D -1 2 M (0) α| = 2C 5 1 -C 5 |α ′ M (0) α| Hence for any vector α ∈ R k , |α ′ M α| ≥ |α ′ M (0) α| -|α ′ M α -α ′ M (0) α| ≥ 1 -3C 5 1 -C 5 |α ′ M (0) α| To conclude, in this subsection, we successfully reduce the noisy case |α ′ M α| to the oracle case |α ′ M (0) α|. Result (41) will also be used in the proof of other claims.
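As an informal numerical sanity check (not part of the proofs), the elementary geometric facts used throughout this appendix, Lemma 3 (sin x is sandwiched between 2x/π and x on [0, π/2]), Lemma 4 (the angle triangle inequality), and Lemma 5 (the arcsin perturbation bound), can be verified on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def angle(u, v):
    """psi(u, v) = arccos(<u, v> / (||u|| ||v||)), as in Definition 1."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Lemma 3: 2x/pi <= sin(x) <= x on [0, pi/2].
xs = np.linspace(0.0, np.pi / 2, 1001)
lemma3_ok = bool(np.all(np.sin(xs) <= xs + 1e-12)
                 and np.all(np.sin(xs) >= 2 * xs / np.pi - 1e-12))

# Lemma 4: psi(x, z) <= psi(x, y) + psi(y, z).
lemma4_ok = all(
    angle(x, z) <= angle(x, y) + angle(y, z) + 1e-9
    for x, y, z in (rng.standard_normal((3, 5)) for _ in range(2000))
)

# Lemma 5: psi(x, x + y) <= arcsin(||y|| / ||x||) whenever ||y|| < ||x||.
lemma5_ok = True
for _ in range(2000):
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)
    y *= 0.9 * rng.random() * np.linalg.norm(x) / np.linalg.norm(y)  # force ||y|| < ||x||
    if angle(x, x + y) > np.arcsin(np.linalg.norm(y) / np.linalg.norm(x)) + 1e-9:
        lemma5_ok = False
```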

H PROOF OF THEOREM 2

Theorem 2. Consider the DCBM model where ( 8)-( 9) hold. There exists constant C > 0, such that for any δ ∈ (0, 1/2), with probability 1 -δ, simultaneously for 1 ≤ k ≤ K, | ψk ( ΠU ) -ψ k ( ΠU )| ≤ C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } + ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 . To prove Theorem 2, we need a famous concentration inequality, Bernstein inequality: Lemma 7 (Bernstein inequality). Suppose X 1 , ..., X n are independent random variables such that EX i = 0, |X i | ≤ b and V ar(X i ) ≤ σ 2 i for all i. Let σ 2 = n -1 n i=1 σ 2 i . Then, for any t > 0, P n -1 | n i=1 X i | ≥ t ≤ 2 exp - nt 2 /2 σ 2 + bt/3 The proof of Lemma 7, Bernstein inequality, can be seen in most probability textbooks such as Uspensky (1937) . Proof. Recall that for k ∈ [K], ψk ( ΠU ) = ψ(f (A (k) ; ΠU ), f (X; H)), ψ k ( ΠU ) = ψ f (Ω (k) ; ΠU ), f (EX; ΠU ) . Denote v k = f (A (k) ; ΠU ), v * = f (X; ΠU ), ṽk = f (Ω (k) ; ΠU ), ṽ * = f (EX; ΠU ), k ∈ [K]. Then, by Lemma 4, ψk ( ΠU ) = ψ(v k , v * ) ≤ ψ(v k , ṽk ) + ψ(ṽ k , v * ) ≤ ψ(v k , ṽk ) + ψ(ṽ k , ṽ * ) + ψ(ṽ * , v * ) = ψ k ( ΠU ) + ψ(v k , ṽk ) + ψ(ṽ * , v * ) Similarly, ψ k ( ΠU ) ≤ ψk ( ΠU ) + ψ(v k , ṽk ) + ψ(ṽ * , v * ) Therefore, | ψk ( ΠU ) -ψ k ( ΠU )| ≤ ψ(v k , ṽk ) + ψ(ṽ * , v * ). ( ) For any ϕ 1 , ..., ϕ K ≥ 0. P ∀k ∈ [K], | ψk ( ΠU ) -ψ k ( ΠU )| ≤ ϕ k = 1 -P ∃k ∈ [K], | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k ≥ 1 - K k=1 P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k (43) By definition of ψ, ψk ( ΠU ), ψ k ( ΠU ) ∈ [0, π]. Hence, | ψk ( ΠU ) -ψ k ( ΠU )| ∈ [0, π]. As a result, when ϕ k ≥ π, P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k = 0 When ϕ k < π, by (42), P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k ≤ P ψ(v k , ṽk ) + ψ(ṽ * , v * ) > ϕ k ≤ P ψ(v k , ṽk ) > 1 2 ϕ k or ψ(ṽ * , v * ) > 1 2 ϕ k ≤ P ψ(v k , ṽk ) > 1 2 ϕ k + P ψ(ṽ * , v * ) > 1 2 ϕ k (44) By lemma 5, when ∥v k -ṽk ∥ < ∥ṽ k ∥ ψ(v k , ṽk ) ≤ arcsin ∥v k -ṽk ∥ ∥ṽ k ∥ Hence, for any ϕ ∈ [0, π 2 ), ∥v k -ṽk ∥ ≤ sin(ϕ)∥ṽ k ∥ implies ψ(v k , ṽk ) ≤ ϕ. 
As a result, for any ϕ ∈ [0, π 2 ), ψ(v k , ṽk ) > ϕ implies ∥v k -ṽk ∥ > sin(ϕ)∥ṽ k ∥. Similarly, for any ϕ ∈ [0, π 2 ), ψ(v * , ṽ * ) > ϕ implies ∥v * -ṽ * ∥ ≥ sin(ϕ)∥ṽ * ∥. By definition of ϕ k , ϕ k ≥ 0. Hence, when ϕ k < π, 1 2 ϕ k ∈ [0, π 2 ). Plugging the above results into (44), we have P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k ≤ P ψ(v k , ṽk ) > 1 2 ϕ k + P ψ(ṽ * , v * ) > 1 2 ϕ k ≤ P ∥v k -ṽk ∥ ≥ sin( 1 2 ϕ k )∥ṽ k ∥ + P ∥v * -ṽ * ∥ ≥ sin( 1 2 ϕ k )∥ṽ * ∥ ≤ P ∃l ∈ [2K], |(v k -ṽk ) l | ≥ 1 √ K sin( 1 2 ϕ k )∥ṽ k ∥ + P ∃l ∈ [2K], |(v * -ṽ * ) l | ≥ 1 √ K sin( 1 2 ϕ k )∥ṽ * ∥ ≤ 2K l=1 P |(v k -ṽk ) l | ≥ 1 √ K sin( 1 2 ϕ k )∥ṽ k ∥ + 2K l=1 P |(v * -ṽ * ) l | ≥ 1 √ K sin( 1 2 ϕ k )∥ṽ * ∥ (45) Since when ϕ k < π, 1 2 ϕ k ∈ [0, π 2 ], by Lemma 3, sin( 1 2 ϕ k ) ≥ 2 π 1 2 ϕ k = 1 π ϕ k . Plugging back to (45), we have when ϕ k < π, P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k ≤ 2K l=1 P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ + 2K l=1 P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ (46) It remains to evaluate P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ and P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ , which are illustrated in the following two lemmas. Lemma 8. Define C 6 = C 2 16 √ 2π 2 C2K 2 ( √ K+ 1 3 ) . When ϕ k ≥ 2 √ 2πC 2 K 2 ∥θ (k) L ∥ 2 ∥θ (k) L ∥1∥θ∥1 , P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ ≤ 2 exp - C 6 C 2 ϕ 2 k ∥θ (k) L ∥ 1 ∥θ∥ 1 Lemma 9. Define C 7 = C 2 2 √ 2π 2 C2K 2 ( √ K+ 3 ) . Then, P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ ≤ 2 exp - C 7 C 2 ϕ 2 k θ * ∥θ∥ 1 The proof of Lemma 8 and 9 are quite tedious. We would defer their proofs to the end of this section. Choose ϕ k = ϕ k (C, δ) = C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } + ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 . 
Then leveraging on Lemma 8 and 9, we have when C ≥ 2 √ 2πK 2 , P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ ≤ 2 exp - C 6 C 2 log(1/δ)∥θ (k) L ∥ 1 ∥θ∥ 1 C 2 ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } ≤ 2 exp (-C 6 log(1/δ)) = 2δ C6 (47) Published as a conference paper at ICLR 2023 P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ ≤ 2 exp - C 7 C 2 log(1/δ)θ * ∥θ∥ 1 C 2 ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } ≤ 2 exp (-C 7 log(1/δ)) = 2δ C7 (48) Plugging ( 47) and ( 48) back to ( 46), leveraging on the fact that δ ≤ 1 2 < 1, we obtain when ϕ k < π, and C ≥ 2 √ 2πK 2 , P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k ≤ 4Kδ C6 + 4Kδ C7 ≤ 8Kδ C6 Recall that when ϕ k ≥ π, P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k = 0 . In all, we have that when C ≥ 2 √ 2πK 2 , P | ψk ( ΠU ) -ψ k ( ΠU )| > ϕ k ≤ 8Kδ C6 Substituting ( 49) into ( 43), we obtain that when C ≥ 2 √ 2πK 2 , P ∀k ∈ [K], | ψk ( ΠU ) -ψ k ( ΠU )| ≤ ϕ k ≥ 1 -8K 2 δ C6 Hence, it suffices to make 8K 2 δ C6 ≤ δ. Choose C = max{2 √ 2πK 2 , 16 √ 2π 2 C 2 K 2 ( √ K + 1 3 )(1 + log(8K 2 ) log 2 )}, Then C 6 -1 ≥ log(8K 2 ) log 2 ≥ 1. Since δ ≤ 1 2 , δ C6-1 ≤ 1 2 C6-1 ≤ 1 2 log(8K 2 ) log 2 = 1 8K 2 As a result, 8K 2 δ C6 ≤ δ. Hence, choose C as in (51), then C >), and for any δ ∈ (0, 1/2), P ∀k ∈ [K], | ψk ( ΠU ) -ψ k ( ΠU )| ≤ ϕ k ≥ 1 -δ (52) To conclude, there exists constant C > 0, such that for any δ ∈ (0, 1/2), with probability 1 -δ, simultaneously for 1 ≤ k ≤ K, | ψk ( ΠU ) -ψ k ( ΠU )| ≤ C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } . H.1 PROOF OF LEMMA 8 Proof. 
When l ∈ [K], (ṽ k ) l = (f (Ω (k) ; ΠU )) l = Ω (k) 1 (l) = i∈C k ∩L j∈C l ∩L Ω ij When l ∈ {K + 1, ..., 2K}, define Ĉl = {i ∈ U : πi = e l-K } , then (ṽ k ) l = (f (Ω (k) ; ΠU )) l = Ω (k) ΠU l = i∈C k ∩L j∈ Ĉl Ω ij Hence ∥ṽ k ∥ = l∈[2K] (ṽ k ) 2 l (Cauchy-Schwartz) ≥ 1 2K ( l∈[2K] (ṽ k ) l ) 2 = 1 √ 2K | l∈[2K] (ṽ k ) l | = 1 √ 2K | K l=1 i∈C k ∩L j∈C l ∩L Ω ij + 2K l=K+1 i∈C k ∩L j∈ Ĉl Ω ij | = 1 √ 2K | i∈C k ∩L j∈L Ω ij + i∈C k ∩L j∈ Û Ω ij | = 1 √ 2K i∈C k ∩L j∈[n] Ω ij ≥ 1 √ 2K i∈C k ∩L j∈C k Ω ij = 1 √ 2K i∈C k ∩L j∈C k θ i θ j P kk (Identifiability condition) = 1 √ 2K i∈C k ∩L j∈C k θ i θ j = 1 √ 2K ∥θ (k) L ∥ 1 ∥θ (k) ∥ 1 ≥ 1 √ 2K ∥θ (k) L ∥ 1 min l∈[K] ∥θ (l) ∥ 1 (Condition (9)) ≥ 1 C 2 √ 2K ∥θ (k) L ∥ 1 max l∈[K] ∥θ (l) ∥ 1 ≥ 1 C 2 K √ 2K ∥θ (k) L ∥ 1 ∥θ∥ 1 (53) When l ∈ [K], (v k ) l = (f (A (k) ; ΠU )) l = A (k) 1 (l) = i∈C k ∩L j∈C l ∩L A ij Recall that, (ṽ k ) l = (f (Ω (k) ; ΠU )) l = Ω (k) 1 (l) = i∈C k ∩L j∈C l ∩L Ω ij So |(v k -ṽk ) l | = i∈C k ∩L j∈C l ∩L (A ij -Ω ij ) When l ∈ [K]\{k}, since ΠU only depends on A U U , it is independent of A LL . Hence, given ΠU , {A ij -Ω ij : i ∈ C k ∩ L, j ∈ C l ∩ L} are a collection of |C k ∩ L||C l ∩ L| independent random variables. Furthermore, given ΠU , for any i ∈ C k ∩ L, j ∈ C l ∩ L, E A ij -Ω ij | ΠU = E A ij | ΠU -Ω ij = Ω ij -Ω ij = 0 Also, -1 ≤ -Ω ij ≤ A ij -Ω ij ≤ A ij ≤ 1 So |A ij -Ω ij | ≤ 1. 
Additionally, var A ij -Ω ij = var A ij | ΠU = Ω ij (1 -Ω ij ) ≤ Ω ij Therefore, denote n kl = |C k ∩ L||C l ∩ L|, by Lemma 7, P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ = E P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ | ΠU = E   P 1 n kl | i∈C k ∩L j∈C l ∩L (A ij -Ω ij )| ≥ 1 π √ Kn kl ϕ k ∥ṽ k ∥ | ΠU   ≤ 2E exp   - 1 2 n kl 1 π √ Kn kl ϕ k ∥ṽ k ∥ 2 1 n kl i∈C k ∩L j∈C l ∩L Ω ij + 1 3 1 π √ Kn kl ϕ k ∥ṽ k ∥    = 2E exp - ϕ 2 k 2π √ K ∥ṽ k ∥ 2 π √ K i∈C k ∩L j∈C l ∩L Ω ij + 1 3 ϕ k ∥ṽ k ∥ = 2E exp - ϕ 2 k 2π √ K ∥ṽ k ∥ 2 π √ K|(ṽ k ) l | + 1 3 ϕ k ∥ṽ k ∥ = 2E exp   - ϕ 2 k ∥ṽ k ∥ 2π √ K 1 π √ K |(ṽ k ) l | ∥ṽ k ∥ + 1 3 ϕ k   ≤ 2E exp - ϕ 2 k ∥ṽ k ∥ 2π √ K(π √ K + π 3 ) (54) When l = k, {A ij -Ω ij : i, j ∈ C k ∩L, i < j} are a collection of 1 2 |C k ∩L|(|C k ∩L|-1) independent random variables. Furthermore, for any i, j ∈ C k ∩ L, i < j, E(A ij -Ω ij ) = EA ij -Ω ij = Ω ij -Ω ij = 0 Also, -1 ≤ -Ω ij ≤ A ij -Ω ij ≤ A ij ≤ 1 So |A ij -Ω ij | ≤ 1. Additionally, var(A ij -Ω ij ) = var(A ij ) = Ω ij (1 -Ω ij ) ≤ Ω ij Denote n kk = 1 2 |C k ∩ L|(|C k ∩ L| -1), we have P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ = P 1 n kk | i∈C k ∩L j∈C k ∩L (A ij -Ω ij )| ≥ 1 π √ Kn kk ϕ k ∥ṽ k ∥ = P 1 n kk |2 i<j∈C k ∩L (A ij -Ω ij ) + i∈C k ∩L Ω ii | ≥ 1 π √ Kn kk ϕ k ∥ṽ k ∥ ≤ P 1 n kk |2 i<j∈C k ∩L (A ij -Ω ij )| ≥ 1 π √ Kn kk ϕ k ∥ṽ k ∥ - 1 n kk i∈C k ∩L Ω ii = P 1 n kk | i<j∈C k ∩L (A ij -Ω ij )| ≥ 1 2π √ Kn kk ϕ k ∥ṽ k ∥ - 1 2n kk i∈C k ∩L Ω ii = E   P 1 n kk | i<j∈C k ∩L (A ij -Ω ij )| ≥ 1 2π √ Kn kk ϕ k ∥ṽ k ∥ - 1 2n kk i∈C k ∩L Ω ii | ΠU   Notice that ϕ k ≥ C ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 . 
Since ϕ k ≥ 2 √ 2πC 2 K 2 ∥θ (k) L ∥ 2 ∥θ (k) L ∥1∥θ∥1 ∥ṽ k ∥, 1 2π √ Kn kk ϕ k ∥ṽ k ∥ ≥ 2 √ 2πK 2 2π √ Kn kk ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 ∥ṽ k ∥ ((By (53))) ≥ C 2 K √ 2K n kk ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 1 C 2 K √ 2K ∥θ (k) L ∥ 1 ∥θ∥ 1 = 2∥ 1 2n kk θ (k) L ∥ 2 = 2 1 2n kk i∈C k ∩L θ 2 i (Identifiability condition) = C √ 2πK 2 1 2n kk i∈C k ∩L θ 2 i P kk = 2 1 2n kk i∈C k ∩L Ω ii Therefore, by Lemma 7, P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ = E   P 1 n kk | i<j∈C k ∩L (A ij -Ω ij )| ≥ 1 2π √ Kn kk ϕ k ∥ṽ k ∥ - 1 2 1 2π √ Kn kk ϕ k ∥ṽ k ∥ | ΠU   ≤ 2E exp   - 1 2 n kk 1 4π √ Kn kk ϕ k ∥ṽ k ∥ 2 1 n kk i<j∈C k ∩L Ω ij + 1 3 1 2π √ Kn kk ϕ k ∥ṽ k ∥ -1 2n kk i∈C k ∩L Ω ii    ≤ 2E exp   - 1 16π √ K ϕ k ∥ṽ k ∥ 2 π √ K • 2 i<j∈C k ∩L Ω ij + 1 3 ϕ k ∥ṽ k ∥    ≤ 2E exp   - 1 16π √ K ϕ k ∥ṽ k ∥ 2 π √ K i,j∈C k ∩L Ω ij + 1 3 ϕ k ∥ṽ k ∥    = 2E exp - ϕ 2 k 16π √ K ∥ṽ k ∥ 2 π √ K|(ṽ k ) k | + 1 3 ϕ k ∥ṽ k ∥ = 2E exp   - ϕ 2 k ∥ṽ k ∥ 16π √ K 1 π √ K |(ṽ k ) k | ∥ṽ k ∥ + 1 3 ϕ k   ≤ 2E exp - ϕ 2 k ∥ṽ k ∥ 16π √ K(π √ K + π 3 ) When l ∈ {K + 1, ..., 2K}, recall Ĉl = {i ∈ U : πi = e l-K } So (v k ) l = (f (A (k) ; ΠU )) l = A (k) ΠU l = i∈C k ∩L j∈ Ĉl A ij Recall that (ṽ k ) l = (f (Ω (k) ; ΠU )) l = Ω (k) ΠU l = i∈C k ∩L j∈ Ĉl Ω ij So |(v k -ṽk ) l | = i∈C k ∩L j∈ Ĉl (A ij -Ω ij ) Since ΠU only depends on A U U , it is independent of A LU . Hence, given ΠU , {A ij -Ω ij : i ∈ C k ∩ L, j ∈ Ĉl } are a collection of |C k ∩ L|| Ĉl | independent random variables. Furthermore, given ΠU , for any i ∈ C k ∩ L, j ∈ Ĉl , E A ij -Ω ij | ΠU = E A ij | ΠU -Ω ij = Ω ij -Ω ij = 0 Also, -1 ≤ -Ω ij ≤ A ij -Ω ij ≤ A ij ≤ 1 So |A ij -Ω ij | ≤ 1. 
Additionally, var A ij -Ω ij = var A ij | ΠU = Ω ij (1 -Ω ij ) ≤ Ω ij Therefore, denote nkl = |C k ∩ L|| Ĉl |, by Lemma 7, P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ = E P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ | ΠU = E   P 1 nkl | i∈C k ∩L j∈ Ĉl (A ij -Ω ij )| ≥ 1 π √ K nkl ϕ k ∥ṽ k ∥ | ΠU   ≤ 2E exp   - 1 2 nkl 1 π √ K nkl ϕ k ∥ṽ k ∥ 2 1 nkl i∈C k ∩L j∈ Ĉl Ω ij + 1 3 1 π √ K nkl ϕ k ∥ṽ k ∥    = 2E exp - ϕ 2 k 2π √ K ∥ṽ k ∥ 2 π √ K i∈C k ∩L j∈ Ĉl Ω ij + 1 3 ϕ k ∥ṽ k ∥ = 2E exp - ϕ 2 k 2π √ K ∥ṽ k ∥ 2 π √ K|(ṽ k ) l | + 1 3 ϕ k ∥ṽ k ∥ = 2E exp   - ϕ 2 k ∥ṽ k ∥ 2π √ K 1 π √ K |(ṽ k ) l | ∥ṽ k ∥ + 1 3 ϕ k   ≤ 2E exp - ϕ 2 k ∥ṽ k ∥ 2π √ K(π √ K + π 3 ) In all, for any l ∈ [2K], P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ ≤ 2E exp - ϕ 2 k ∥ṽ k ∥ 16π √ K(π √ K + π 3 ) Plugging ( 53) into (59), we obtain P |(v k -ṽk ) l | ≥ 1 π √ K ϕ k ∥ṽ k ∥ ≤ 2 exp - ϕ 2 k ∥θ (k) L ∥ 1 ∥θ∥ 1 16 √ 2π 2 C 2 K 2 ( √ K + 1 3 ) = 2 exp - C 6 C 2 ϕ 2 k ∥θ (k) L ∥ 1 ∥θ∥ 1 That concludes the proof.
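The repeated tail bounds above all come from Lemma 7, a Bernstein-type inequality for sums of bounded, centered, independent variables. As a quick numerical sanity check (an illustration only, with toy parameters; the constants 2 and 1/3 follow the standard Bernstein form used in the displays above), one can compare the bound with the empirical tail of a centered Bernoulli average:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t, trials = 1000, 0.05, 0.015, 4000

# empirical tail of |mean(X) - p| for i.i.d. X_i ~ Bern(p); var(X_i) <= p
devs = rng.binomial(1, p, (trials, n)).mean(axis=1) - p
emp_tail = np.mean(np.abs(devs) >= t)

# Bernstein: P(|mean - p| >= t) <= 2 exp(-n t^2 / (2 (sigma^2 + t/3)))
bound = 2 * np.exp(-n * t**2 / (2 * (p * (1 - p) + t / 3)))
print(emp_tail <= bound)  # -> True
```

The empirical tail (around 0.03 here) sits well below the Bernstein bound (around 0.23), as it must.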

H.2 PROOF OF LEMMA 9

The proof of Lemma 9 is nearly the same as Lemma 8. For the completeness of our paper, we will present a proof of Lemma 9 as follows. Proof. When l ∈ [K], (v * ) l = (f (X; ΠU )) l = X1 (l) = j∈C l ∩L X j Similarly, (ṽ * ) l = (f (E[X]; ΠU )) l = E[X]1 (l) = j∈C l ∩L E[X j ] So |(v * -ṽ * ) l | = j∈C l ∩L (X j -E[X j ]) When l ∈ [K], since ΠU only depends on A U U , it is independent of X. Hence, given ΠU , {X j - E[X j ] : j ∈ C l ∩ L} are a collection of |C l ∩ L| independent random variables. Furthermore, given ΠU , for any j ∈ C l ∩ L, E X j -E[X j ]| ΠU = E X j | ΠU -E[X j ] = E[X j ] -E[X j ] = 0 Also, -1 ≤ -E[X j ] ≤ X j -E[X j ] ≤ X j ≤ 1 So |X j -E[X j ]| ≤ 1. Additionally, var X j -E[X j ] = var X j | ΠU = E[X j ](1 -E[X j ]) ≤ E[X j ] Therefore, denote n l = |C l ∩ L|, by Lemma 7, P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ = E P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ | ΠU = E   P 1 n l | j∈C l ∩L (X j -E[X j ])| ≥ 1 π √ Kn l ϕ k ∥ṽ * ∥ | ΠU   ≤ 2E exp   - 1 2 n l 1 π √ Kn l ϕ k ∥ṽ * ∥ 2 1 n l j∈C l ∩L E[X j ] + 1 3 1 π √ Kn l ϕ k ∥ṽ * ∥    = 2E exp - ϕ 2 k 2π √ K ∥ṽ * ∥ 2 π √ K j∈C l ∩L E[X j ] + 1 3 ϕ k ∥ṽ * ∥ = 2E exp - ϕ 2 k 2π √ K ∥ṽ * ∥ 2 π √ K|(ṽ * ) l | + 1 3 ϕ k ∥ṽ * ∥ = 2E exp   - ϕ 2 k ∥ṽ * ∥ 2π √ K 1 π √ K |(ṽ * ) l | ∥ṽ * ∥ + 1 3 ϕ k   ≤ 2E exp - ϕ 2 k ∥ṽ * ∥ 2π √ K(π √ K + π 3 ) When l ∈ {K + 1, ..., 2K}, define Ĉl = {i ∈ U : πi = e l-K } Then (v * ) l = (f (X; ΠU )) l = X1 (l) = j∈ Ĉl X j Similarly, (ṽ * ) l = (f (E[X]; ΠU )) l = E[X]1 (l) = j∈ Ĉl E[X j ] So |(v * -ṽ * ) l | = j∈ Ĉl (X j -E[X j ]) When l ∈ [K], since ΠU only depends on A U U , it is independent of X. Hence, given ΠU , {X j - E[X j ] : j ∈ Ĉl } are a collection of |C l ∩ L| independent random variables. Furthermore, given ΠU , for any j ∈ Ĉl , E X j -E[X j ]| ΠU = E X j | ΠU -E[X j ] = E[X j ] -E[X j ] = 0 Also, -1 ≤ -E[X j ] ≤ X j -E[X j ] ≤ X j ≤ 1 So |X j -E[X j ]| ≤ 1. 
Additionally, var X j -E[X j ] = var X j | ΠU = E[X j ](1 -E[X j ]) ≤ E[X j ] Therefore, denote nl = | Ĉl |, by Lemma 7, P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ = E P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ | ΠU = E   P 1 nl | j∈ Ĉl (X j -E[X j ])| ≥ 1 π √ K nl ϕ k ∥ṽ * ∥ | ΠU   ≤ 2E exp   - 1 2 nl 1 π √ K nl ϕ k ∥ṽ * ∥ 2 1 nl j∈ Ĉl E[X j ] + 1 3 1 π √ K nl ϕ k ∥ṽ * ∥    = 2E exp - ϕ 2 k 2π √ K ∥ṽ * ∥ 2 π √ K j∈ Ĉl E[X j ] + 1 3 ϕ k ∥ṽ * ∥ = 2E exp - ϕ 2 k 2π √ K ∥ṽ * ∥ 2 π √ K|(ṽ * ) l | + 1 3 ϕ k ∥ṽ * ∥ = 2E exp   - ϕ 2 k ∥ṽ * ∥ 2π √ K 1 π √ K |(ṽ * ) l | ∥ṽ * ∥ + 1 3 ϕ k   ≤ 2E exp - ϕ 2 k ∥ṽ * ∥ 2π √ K(π √ K + π 3 ) In all, for any l ∈ [2K], P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ ≤ 2E exp - ϕ 2 k ∥ṽ * ∥ 2π √ K(π √ K + π 3 ) Notice that ∥ṽ * ∥ = l∈[2K] (ṽ * ) 2 l (Cauchy-Schwartz) ≥ 1 2K ( l∈[2K] (ṽ * ) l ) 2 = 1 √ 2K | l∈[2K] (ṽ * ) l | = 1 √ 2K | K l=1 j∈ Ĉl E[X j ] + 2K l=K+1 j∈ Ĉl E[X j ]| = 1 √ 2K | j∈L E[X j ] + j∈ Û E[X j ]| = 1 √ 2K j∈[n] E[X j ] ≥ 1 √ 2K j∈C k * E[X j ] = 1 √ 2K j∈C k θ * θ j P k * k * (Identifiability condition) = 1 √ 2K j∈C k θ * θ j = 1 √ 2K θ * ∥θ (k) ∥ 1 ≥ 1 √ 2K θ * min l∈[K] ∥θ (l) ∥ 1 (Condition (9)) ≥ 1 C 2 √ 2K θ * max l∈[K] ∥θ (l) ∥ 1 ≥ 1 C 2 K √ 2K θ * ∥θ∥ 1 Plugging ( 64) into (63), we obtain P |(v * -ṽ * ) l | ≥ 1 π √ K ϕ k ∥ṽ * ∥ ≤ 2 exp - ϕ 2 k θ * ∥θ∥ 1 2 √ 2π 2 C 2 K 2 ( √ K + 1 3 ) = 2 exp - C 7 C 2 ϕ 2 k θ * ∥θ∥ 1 That concludes the proof. I PROOF OF COROLLARY 1, 2 I.1 PROOF OF COROLLARY 1 Corollary 1. Consider the DCBM model where ( 8)-( 9) hold. Suppose for some constants b 0 ∈ (0, 1) and ϵ ∈ (0, 1/2), ΠU is b 0 -correct with probability 1 -ϵ. When b 0 is properly small, there exist constants C 0 > 0 and C > 0, which do not depend on (b 0 , ϵ), such that P(ŷ ̸ = k * ) ≤ ϵ + C K k=1 exp -C 0 β 2 n ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } . Proof. Let B 0 be the event that ΠU is b 0 -correct. 
Then, P(ŷ ̸ = k * ) = P(ŷ ̸ = k * , B C 0 ) + P ŷ ̸ = k * , B 0 ≤ P(B C 0 ) + P ∃k ̸ = k * , ψk ( ΠU ) ≤ ψk * ( ΠU ) , B 0 ≤ ϵ + P ∃k ̸ = k * , ψ k ( ΠU ) -ψk ( ΠU ) + ψk * ( ΠU ) -ψ k * ( ΠU ) ≥ ψ k ( ΠU ) -ψ k * ( ΠU ) , B 0 ≤ ϵ + P ∃k ̸ = k * , ψ k ( ΠU ) -ψk ( ΠU ) + ψk * ( ΠU ) -ψ k * ( ΠU ) ≥ ψ k ( ΠU ) -ψ k * ( ΠU ) , B 0 By Theorem 1, when b 0 is properly small, B 0 implies that there exists a constant c 0 > 0, which does not depend on b 0 , such that ψ k * ( ΠU ) = 0 and min k̸ =k * {ψ k ( ΠU )} ≥ c 0 β n . Substituting this result into (66), we have P(ŷ ̸ = k * ) ≤ ϵ + P ∃k ̸ = k * , ψ k ( ΠU ) -ψk ( ΠU ) + ψk * ( ΠU ) -ψ k * ( ΠU ) ≥ c 0 β n , B 0 ≤ ϵ + P ∃k ̸ = k * , ψ k ( ΠU ) -ψk ( ΠU ) + ψk * ( ΠU ) -ψ k * ( ΠU ) ≥ c 0 β n ≤ ϵ + P ∃k ∈ [K], ψ k ( ΠU ) -ψk ( ΠU ) ≥ 1 2 c 0 β n ≤ ϵ + P ∃k ∈ [K], ψ k ( ΠU ) -ψk ( ΠU ) > 1 3 c 0 β n (67) According to Theorem 2, there exists a constant C > 0, such that for any δ ∈ (0, 1/2), with probability 1 -δ, simultaneously for 1 ≤ k ≤ K, | ψk ( ΠU ) -ψ k ( ΠU )| ≤ C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } + ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 . Take C 0 = c 2 0 36C 2 , δ = exp -C 0 β 2 n ∥θ∥ 1 • min θ * , min k∈[K] ∥θ (k) L ∥ 1 Then C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } = C log 1/ exp -C 0 β 2 n ∥θ∥ 1 • min θ * , min k∈[K] ∥θ (k) L ∥ 1 ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } = C C 0 β 2 n ∥θ∥ 1 • min θ * , min k∈[K] ∥θ (k) L ∥ 1 ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } ≤ 1 6 c 0 β n (68) On the other hand, take c 3 in condition ( 9) properly small such that c 3 ≤ c0 6C , then according to condition (9), C ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 ≤ C • c 3 β n ≤ 1 6 c 0 β n Combining ( 68) and ( 69), we have C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } + ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 ≤ 1 3 c 0 β n Therefore, when δ < 1 2 , by Theorem 2, with probability 1 -δ, simultaneously for 1 ≤ k ≤ K, | ψk ( ΠU ) -ψ k ( ΠU )| ≤ 1 3 c 0 β n . 
As a result, when δ < 1 2 , P ∃k ∈ [K], ψ k ( ΠU ) -ψk ( ΠU ) > 1 3 c 0 β n ≤ δ When δ ≥ 1 2 , P ∃k ∈ [K], ψ k ( ΠU ) -ψk ( ΠU ) > 1 3 c 0 β n ≤ 1 ≤ 2δ Hence in total, we have P ∃k ∈ [K], ψ k ( ΠU ) -ψk ( ΠU ) > 1 3 c 0 β n ≤ 2δ Plugging ( 70) into (67), we obtain P(ŷ ̸ = k * ) ≤ ϵ + 2δ = ϵ + 2 exp -C 0 β 2 n ∥θ∥ 1 • min θ * , min k∈[K] ∥θ (k) L ∥ 1 ≤ ϵ + 2 K k=1 exp -C 0 β 2 n ∥θ∥ 1 • min{θ * , ∥θ (k) L ∥ 1 } Choose C = 2, we obtain P(ŷ ̸ = k * ) ≤ ϵ + C K k=1 exp -C 0 β 2 n ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } To conclude, when b 0 is properly small, there exist constants C 0 = c 2 0 36C 2 > 0 and C = 2 > 0, which do not depend on (b 0 , ϵ), such that P(ŷ ̸ = k * ) ≤ ϵ + C K k=1 exp -C 0 β 2 n ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } .

I.2 PROOF OF COROLLARY 2

Corollary 2. Consider the DCBM model where (8)-(9) hold. We apply SCORE+ to obtain Π̂_U and plug it into AngleMin+. As n → ∞, suppose for some constant q_0 > 0, min_{i∈U} θ_i ≥ q_0 max_{i∈U} θ_i, β_n∥θ_U∥ ≥ q_0 log(n), β_n^2∥θ∥_1 θ_* → ∞, and β_n^2∥θ∥_1 min_k{∥θ^(k)_L∥_1} → ∞. Then, P(ŷ ≠ k_*) → 0, so the AngleMin+ estimate is consistent.

Proof. By Corollary 1, letting ϵ be the probability that the Π̂_U obtained through SCORE+ is not b_0-correct, we have

P(ŷ ≠ k_*) ≤ ϵ + C Σ_{k=1}^{K} exp( −C_0 β_n^2 ∥θ∥_1 · min{θ_*, ∥θ^(k)_L∥_1} ).

Since β_n^2∥θ∥_1 θ_* → ∞ and β_n^2∥θ∥_1 min_k{∥θ^(k)_L∥_1} → ∞, the sum on the right-hand side tends to 0. By Theorem 2.2 in Jin et al. (2021), when q_0 is sufficiently large, min_{i∈U} θ_i ≥ q_0 max_{i∈U} θ_i and β_n∥θ_U∥ ≥ q_0 log(n) imply that ϵ → 0. Combining the two facts, P(ŷ ≠ k_*) → 0, so the AngleMin+ estimate is consistent.

J PROOF OF LEMMA 2

As mentioned in section D, in the main paper, for the smoothness and comprehensibility of the text, we do not present the most general form of Lemma 2. Here, we present both the original version, Lemma 2, and the generalized version, Lemma 2' below, where we relax the assumption that ∥θ (1) L ∥1 ∥θ (2) L ∥1 = ∥θ (1) U ∥1 ∥θ (2) U ∥1 = 1 to the much weaker assumption: the first part of condition (9) in the main text, which only assumes that ∥θ (1) ∥ 1 and ∥θ (2) ∥ 1 are of the same order. Lemma 2. Consider a DCBM with K = 2 and P = (1 -b)I 2 + b1 2 1 ′ 2 . Suppose θ * = o(1), θ * min k ∥θ (k) L ∥1 = o(1), 1 -b = o(1), ∥θ (1) L ∥1 ∥θ (2) L ∥1 = ∥θ (1) U ∥1 ∥θ (2) U ∥1 = 1. There exists a constant c 4 > 0 such that inf ỹ {Risk(ỹ)} ≥ c 4 exp -2[1 + o(1)] (1 -b) 2 8 • θ * (∥θ L ∥ 1 + ∥θ U ∥ 1 ) , ( ) where the infimum is taken over all measurable functions of A, X, and parameters Π L , Π U , Θ, P , θ * . In AngleMin+, suppose the second part of condition 9 holds with c 3 = o(1), ΠU is b0 -correct with b0 a.s. → 0. There is a constant C 4 > 0 such that, Risk(ŷ) ≤ C 4 exp -[1 -o(1)] (1 -b) 2 8 • θ * (∥θ L ∥ 2 1 + ∥θ U ∥ 2 1 ) 2 ∥θ L ∥ 3 1 + ∥θ U ∥ 3 1 . ( ) Lemma 2'. Consider a DCBM with K = 2 and P = (1 -b)I 2 + b1 2 1 ′ 2 . Suppose 1 -b = o(1). There exists a constant c 4 > 0 such that inf ỹ {Risk(ỹ)} ≥ c 4 exp -2[1 + o(1)] (1 -b) 2 8 • θ * (∥θ L ∥ 1 + ∥θ U ∥ 1 ) , ( ) where the infimum is taken over all measurable functions of A, X, and parameters Π L , Π U , Θ, P , θ * . In AngleMin+, suppose condition 9 holds with c 3 = o(1), θ * = o(1), θ * min k ∥θ (k) L ∥1 = o(1), ΠU is b0 -correct with b0 a.s. → 0. There is a constant C 4 > 0 such that, Risk(ŷ) ≤ C 4 exp   -[1 -o(1)] (1 -b) 2 8 • θ * 4 ∥θ (1) L ∥ 3 1 +∥θ (1) U ∥ 3 1 (∥θ (1) L ∥ 2 1 +∥θ (1) U ∥ 2 1 ) 2 + ∥θ (2) L ∥ 3 1 +∥θ (2) U ∥ 3 1 (∥θ (2) L ∥ 2 1 +∥θ (2) U ∥ 2 1 ) 2    . (76') When conditions of Lemma 2 hold, conditions of Lemma 2' hold. 
Also, with ∥θ (1) L ∥1 ∥θ (2) L ∥1 = ∥θ (1) U ∥1 ∥θ (2) U ∥1 = 1 assumed in Lemma 2, the results of Lemma 2' imply the results of 2. Therefore, it suffices to prove the generalized version, Lemma 2'. We prove the lower bound (75) and upper bound (76') separately. J.1 PROOF OF LOWER BOUND (75) Proof. Let P (1) and P (2) be the joint distribution of A and X given π * = e 1 and π * = e 2 , respectively. For a random variable or vector or matrix Y , let P According to Theorem 2.2 in Section 2.4.2 of Tsybakov ( 2009), inf ỹ {Risk(ỹ)} ≥ 2 • 1 2 (1 -H 2 (P (1) , P (2) )(1 -H 2 (P (1) , P (2) )/4)) = 1 -1 -1 - 1 2 H 2 (P (1) , P (2) ) 2 ≥ 1 -1 - 1 2 1 - 1 2 H 2 (P (1) , P (2) ) 2 = 1 2 1 - 1 2 H 2 (P (1) , P (2) ) 2 where H 2 (P (1) , P (2) ) = dP (1) -dP (2) 2 is the Hellinger distance between P (1) and P (2) . As in Section 2.4 of Tsybakov ( 2009), one key property of Hellinger distance is that if Q (1) and Q (2) are product measures, Q (1) = ⊗ N i=1 Q (1) i , Q (2) = ⊗ N i=1 Q (2) i , then H 2 (Q (1) , Q (2) ) = 2 1 - N i=1 1 - H 2 (Q (1) i , Q (2) i ) 2 Notice that for k = 1, 2, since according to DCBM, A, X 1 , ..., X n are independent, P (k) = P (k) A × ⊗ n i=1 P (k) Xi Combining ( 77) and ( 78), we obtain H 2 (P (1) , P (2) ) = 2 1 -1 - H 2 (P A , P (2) A ) 2 n i=1 1 - H 2 (P Xi , P Xi ) 2 (79) Given π * = e 1 and π * = e 2 , according to DCBM, the distribution of A remains the same. As a result, H 2 (P (1) A , P A ) = 0. On the other hand, for k = 1, 2 and i ∈ [n], according to DCBM model P (k) Xi ∼ Bern(θ * θ i P kki ) (By (83)) ≥ cos(ϕ x -ϕ y ) -1 cos(ϕ x -ϕ y ) = -2 sin 2 (ϕx-ϕy) 2 cos(ϕ x -ϕ y ) = - 1 2 4 sin 2 (ϕx-ϕy) 2 (sin ϕ x -sin ϕ y ) 2 1 cos(ϕ x -ϕ y ) (sin ϕ x -sin ϕ y ) 2 = - 1 2 4 sin 2 (ϕx-ϕy)

2

(2 sin (ϕx-ϕy) 2 cos (ϕx+ϕy) 2 ) 2 1 cos(ϕ x -ϕ y ) (x -y) 2 = - 1 2 (x -y) 2 1 cos(|ϕ x -ϕ y |) cos 2 (ϕx+ϕy) 2 (84) Since x, y ∈ [0, a], ϕ x , ϕ y ∈ [0, arcsin a].As a result, |ϕ x -ϕ y |, ∈ [0, arcsin a]. Plugging this result back into (84), we have g(x, y) ≥ - 1 2 (x -y) 2 1 cos(arcsin a) cos 2 arcsin a = - 1 2(1 -a 2 ) 3 2 (x -y) 2 This concludes the proof. Back to the proof of lower bound (75). Define θ a = θ * max i∈[n] θ i (max{1, b}), then for any k ∈ {1, 2}, i ∈ [n], θ * θ i P kki ≤ θ a . Therefore, applying Lemma 10 in (82), we have when θ a < 1, inf ỹ {Risk(ỹ)} ≥ 1 2 exp 2 n i=1 - 1 2(1 -(θ a ) 2 ) 3 2 θ * θ i P 1ki -θ * θ i P 2ki 2 = 1 2 exp - 1 (1 -(θ a ) 2 ) 3 2 n i=1 θ * θ i P 1ki -P 2ki 2 = 1 2 exp - 1 (1 -(θ a ) 2 ) 3 2 n i=1 θ * θ i (1 - √ b) 2 = 1 2 exp - 1 (1 -(θ a ) 2 ) 3 2 θ * ∥θ∥ 1 (1 -b) 2 (1 + √ b) 2 = 1 2 exp -2 4 (1 + √ b) 2 (1 -(θ a ) 2 ) 3 2 (1 -b) 2 8 • θ * (∥θ L ∥ 1 + ∥θ U ∥ 1 ) Since θ * = o(1), b = 1 -o(1) , and by DCBM model, max i∈[n] θ i ≤, we have θ a = θ * max i∈[n] θ i (max{1, b}) = o(1). Since b = 1 -o(1), (1 + √ b) 2 → 4. Therefore, 4 (1+ √ b) 2 (1-(θ a ) 2 ) 3 2 = 1 -o(1) . Substituting these results into (86), we obtain inf ỹ {Risk(ỹ)} ≥ 1 2 exp -2[1 + o(1)] (1 -b) 2 8 • θ * (∥θ L ∥ 1 + ∥θ U ∥ 1 ) , This concludes our proof of lower bound (75), with c 4 = 1 2 . J.2 PROOF OF UPPER BOUND (76') Proof. When K = 2, Risk(ŷ) = P(ŷ = 2|π * = e 1 ) + P(ŷ = 1|π * = e 2 ). The evaluation of P(ŷ = 2|π * = e 1 ) and P(ŷ = 1|π * = e 2 ) are exactly the same. Without the loss of generosity, we would focus on P(ŷ = 2|π * = e 1 ). Recall that in the proof of 2, we define v k = f (A (k) ; ΠU ), v * = f (X; ΠU ), ṽk = f (Ω (k) ; ΠU ), ṽ * = f (EX; ΠU ), k ∈ [K]. 
We have P(ŷ = 2|π * = e 1 ) = P(ψ(v 2 , v * ) ≥ ψ(v 1 , v * )|π * = e 1 ) = P ⟨v * , v 2 ⟩ ∥v * ∥∥v 2 ∥ ≥ ⟨v * , v 1 ⟩ ∥v * ∥∥v 1 ∥ π * = e 1 = P ⟨v * , v 2 ∥v 2 ∥ - v 1 ∥v 1 ∥ ⟩ ≥ 0 π * = e 1 Recall that in proof of Lemma 8 9, we define Ĉk = {i ∈ U : πi = e k-K }, k = K + 1, ..., 2K. Let w = v2 ∥v2∥ -v1 ∥v1∥ . Since when l ∈ [K], (v * ) l = (f (X; ΠU )) l = X1 (l) = j∈C l ∩L X j ; when l ∈ {K + 1, ..., 2K}, (v * ) l = (f (X; ΠU )) l = X1 (l) = j∈ Ĉl X j we have P(ŷ = 2|π * = e 1 ) = P K k=1 w k i∈C k ∩L X i + 2K k=K+1 w k i∈ Ĉk X i ≥ π * = e 1 = P K k=1 i∈C k ∩L w k (X i -EX i ) + 2K k=K+1 i∈ Ĉk w k (X i -EX i ) ≥ - K k=1 i∈C k ∩L w k EX i - 2K k=K+1 i∈ Ĉk w k EX i π * = e 1 ≤ P K k=1 i∈C k ∩L w k (X i -EX i ) + 2K k=K+1 i∈ Ĉk w k (X i -EX i ) ≥ K k=1 i∈C k ∩L w k EX i + 2K k=K+1 i∈ Ĉk w k EX i π * = e 1 = E P K k=1 i∈C k ∩L w k (X i -EX i ) + 2K k=K+1 i∈ Ĉk w k (X i -EX i ) ≥ K k=1 i∈C k ∩L w k EX i + 2K k=K+1 i∈ Ĉk w k EX i A, π * = e 1 Since A and X are independent and for k ∈ [2K], w k is measurable with respect to A, given A, {w k (X i -EX i ) : k ∈ [K], i ∈ C k ∩ L} ∪ {w k (X i -EX i ) : k ∈ [2K]\[K], i ∈ Ĉk } are a collection of independent random variables. Also, for any k ∈ [2K], i ∈ [n], E[w k (X i -EX i )|A] = w k (E[X i |A] -EX i ) = w k (EX i -EX i ) = 0 Furthermore, -1 ≤ -EX i ≤ X i -EX i ≤ X i ≤ 1 So |w k (X i -EX i )| ≤ max k∈[2K] |w k |. Additionally, var(w k (X i -EX i )|A) = w 2 k var(X i ) = w 2 k EX i (1 -EX i ) ≤ w 2 k EX i Let t = 1 n K k=1 i∈C k ∩L w k EX i + 2K k=K+1 i∈ Ĉk w k EX i = 1 n |⟨w, ṽ * ⟩| σ 2 = 1 n K k=1 i∈C k ∩L w 2 k EX i + 2K k=K+1 i∈ Ĉk w 2 k EX i = 1 n ⟨w • w, ṽ * ⟩ where w • w is defined as (w 2 1 , ..., w 2 2K ). 
Then by Lemma 7, P(ŷ = 2|π * = e 1 ) ≤ E P 1 n K k=1 i∈C k ∩L w k (X i -EX i ) + 2K k=K+1 i∈ Ĉk w k (X i -EX i ) ≥ 1 n K k=1 i∈C k ∩L w k EX i + 2K k=K+1 i∈ Ĉk w k EX i A, π * = e 1 ≤ 2E exp - 1 2 nt 2 σ 2 + 1 3 max k∈[2K] |w k |t = 2E exp - 1 2 ⟨w, ṽ * ⟩ 2 ⟨w • w, ṽ * ⟩ + 1 3 max k∈[2K] |w k ||⟨w, ṽ * ⟩| (90) By Lemma 8, when ϕ ≥ 2 √ 2πC 2 K 2 ∥θ (k) L ∥ 2 ∥θ (k) L ∥1∥θ∥1 , P |(v k -ṽk ) l | ≥ 1 π √ K ϕ∥ṽ k ∥ ≤ 2 exp - C 6 C 2 ϕ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 Take ϕ = max{2 √ 2πC 2 K 2 ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 , |1 -b| θ * min k∈[K] ∥θ (k) L ∥ 1 0.25 } Then because θ * min k∈[K] ∥θ (k) L ∥1 = o(1) and by condition 9, ∥θ (k) L ∥ 2 ∥θ (k) L ∥1∥θ∥1 ≤ c 3 β n = o(|1 -b|), we have ϕ = o(|1 -b|). Also, P ∃k ∈ [K], l ∈ [2K], |(v k -ṽk ) l | ≥ 1 π √ K ϕ∥ṽ k ∥ ≤ k∈[K] l∈[2K] P |(v k -ṽk ) l | ≥ 1 π √ K ϕ∥ṽ k ∥ ≤ k∈[K] l∈[2K] 2 exp - C 6 C 2 ϕ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 = k∈[K] l∈[2K] 2 exp - C 6 C 2 (1 -b) 2 θ * min k∈[K] ∥θ (k) L ∥ 1 -0.5 ∥θ (k) L ∥ 1 min k∈[K] ∥θ (k) L ∥ 1 θ * ∥θ∥ 1 ≤ 4K 2 exp - 1 o(1) (1 -b) 2 θ * ∥θ∥ 1 ≪ inf ỹ {Risk(ỹ)} Hence, we can focus on the case where for all k ∈ [K], l ∈ [2K], |(v k -ṽk ) l | = o(|1 -b|)∥ṽ k ∥. Until the end of the proof, we assume that we are under this case. We first evaluate ⟨w, ṽ * ⟩. Let w = ṽ2 ∥ṽ2∥ -ṽ1 ∥ṽ1∥ . 
Then, |⟨w, ṽ * ⟩ -⟨ w, ṽ * ⟩| = |⟨( v 2 ∥v 2 ∥ - v 1 ∥v 1 ∥ ) -( ṽ2 ∥ṽ 2 ∥ - ṽ1 ∥ṽ 1 ∥ ), ṽ * ⟩| = ∥ṽ * ∥ • |⟨( v 2 ∥v 2 ∥ - ṽ2 ∥ṽ 2 ∥ ) -( v 1 ∥v 1 ∥ - ṽ1 ∥ṽ 1 ∥ ), ṽ * ∥ṽ * ∥ ⟩| = ∥ṽ * ∥ • |(cos ψ(v 2 , ṽ * ) -cos ψ(ṽ 2 , ṽ * )) -(cos ψ(v 1 , ṽ * ) -cos ψ(ṽ 1 , ṽ * ))| = ∥ṽ * ∥ • | -2 sin ψ(v 2 , ṽ * ) -ψ(ṽ 2 , ṽ * ) 2 sin ψ(v 2 , ṽ * ) + ψ(ṽ 2 , ṽ * ) 2 + 2 sin ψ(v 1 , ṽ * ) -ψ(ṽ 1 , ṽ * ) 2 sin ψ(v 1 , ṽ * ) + ψ(ṽ 1 , ṽ * ) 2 | ≤ 2∥ṽ * ∥ • sin |ψ(v 2 , ṽ * ) -ψ(ṽ 2 , ṽ * )| 2 sin ψ(v 2 , ṽ * ) + ψ(ṽ 2 , ṽ * ) 2 + 2∥ṽ * ∥ • sin |ψ(v 1 , ṽ * ) -ψ(ṽ 1 , ṽ * )| 2 sin ψ(v 1 , ṽ * ) + ψ(ṽ 1 , ṽ * ) 2 (By Lemma 4) ≤ 2∥ṽ * ∥ • sin ψ(v 2 , ṽ2 ) 2 sin 2ψ(ṽ 2 , ṽ * ) + ψ(v 2 , ṽ2 ) 2 + 2∥ṽ * ∥ • sin ψ(v 1 , ṽ1 ) 2 sin 2ψ(ṽ 1 , ṽ * ) + ψ(v 1 , ṽ1 ) 2 (92) Since π * = 1, b0 → 0, by Theorem 1, ψ(ṽ 1 , ṽ * ) = ψ 1 = 0, ψ(ṽ 2 , ṽ * ) = ψ 2 ≥ c 0 β n = c 0 |1 -b|. On the other hand, that for all k ∈ [K], l ∈ [2K], |(v k -ṽk ) l | = o(|1 -b|)∥ṽ k ∥ indicates that ∥v k -ṽk ∥ = o(|1 -b|)∥ṽ k ∥. By lemma 5, this implies that ψ(v 1 , ṽ1 ) = o(|1 -b|), k = 1, 2. Therefore, |⟨w, ṽ * ⟩ -⟨ w, ṽ * ⟩| ≤ o(1) • 2∥ṽ * ∥ sin ψ(ṽ 2 , ṽ * ) 2 sin ψ(ṽ 2 , ṽ * ) = o(1) • 2∥ṽ * ∥ sin ψ(ṽ 2 , ṽ * ) 2 2 sin ψ(ṽ 2 , ṽ * ) 2 cos ψ(ṽ 2 , ṽ * ) 2 = o(1) • ∥ṽ * ∥ sin 2 ψ(ṽ 2 , ṽ * ) 2 cos ψ(ṽ 2 , ṽ * ) 2 ≤ o(1) • ∥ṽ * ∥ sin 2 ψ(ṽ 2 , ṽ * ) 2 ≤ o(1) • ∥ṽ * ∥(1 -cos ψ(ṽ 2 , ṽ * )) = o(1) • ∥ṽ * ∥(cos ψ(ṽ 1 , ṽ * ) -cos ψ(ṽ 2 , ṽ * )) = o(1) • ∥ṽ * ∥⟨ ṽ1 ∥ṽ 1 ∥ - ṽ2 ∥ṽ 2 ∥ , ṽ * ∥ṽ * ∥ ⟩ = o(1) • (-⟨ w, ṽ * ⟩) ≤ o(1) • |⟨ w, ṽ * ⟩| (93) Therefore, ⟨w, ṽ * ⟩ = (1 + o(1))⟨ w, ṽ * ⟩. Let η kl = πi=e k ,πi=e l θ i . Let µ (k) a = ∥θ (k) a ∥ 1 , a ∈ {L, U}, k ∈ [K]. 
Then, ṽ1 = µ (1) L (µ (1) L , bµ L , µ U -η 12 + bη 21 , bµ U + η 12 -bη 21 ) ṽ2 = µ (2) L (bµ (1) L , µ (2) L , bµ (1) U -bη 12 + η 21 , µ (2) U + bη 12 -η 21 ) ṽ * = θ * (µ (1) L , bµ (2) L , µ (1) U -η 12 + bη 21 , bµ (2) U + η 12 -bη 21 ) Hence, -⟨ w, ṽ * ⟩ = ∥ṽ * ∥ ⟨ṽ 1 ṽ * ⟩ ∥ṽ 1 ∥∥ṽ * ∥ -⟨ṽ 2 ṽ * ⟩ ∥ṽ 2 ∥∥ṽ * ∥ Since for all k ∈ [K], l ∈ [2K], |(v k -ṽk ) l | = o(|1 -b|)∥ṽ k ∥, we have |w k -wk | = o(|1 -b|), k = 1, 2, ..., 2K. Hence, denote γ = (µ (2) L ) 2 +(µ (2) U ) 2 -(µ (1) L ) 2 -(µ (1) U ) 2 ∥ν2∥ 2 , we obtain w 3 = - 1 -b ∥ν 1 ∥ µ (1) U (1 + γ) + o(1 -b) Similarly, we can show that w 1 = - 1 -b ∥ν 1 ∥ µ (1) L (1 + γ) + o(1 -b) (102) w 2 = 1 -b ∥ν 2 ∥ µ (2) L (1 -γ) + o(1 -b) w 4 = 1 -b ∥ν 2 ∥ µ (2) U (1 -γ) + o(1 -b) As a result, ⟨w • w, ν * ⟩ = a∈{L,U } K k=1 ( 1 -b ∥ν k ∥ µ (k) a (1 -(-1) k γ) + o(1 -b)) 2 θ * µ (k) a = θ * (1 -b) 2 a∈{L,U } K k=1 (µ (k) a ) 3 (1 -(-1) k γ) 2 ∥ν k ∥ 2 + 2o(1) (µ (k) a ) 2 (1 -(-1) k γ) ∥ν k ∥ 2 + o(1) 2 µ (k) a ∥ν k ∥ 2 When ∥θ∥ 1 = o(1), the bounds (75) and (76 ′ ) both become trivial, so we can focus on the case where ∥θ∥ 1 ≥ O(1). In this case, (µ (k) a ) 2 (1-(-1) k γ) ∥ν k ∥ 2 ≤ O(1), µ (k) a ∥ν k ∥ 2 ≤ O(1). On the other hand, for k ∈ [K], a∈{L,U } (µ (k) a ) 3 ∥ν k ∥ 2 = (µ (k) L ) 3 + (µ (k) U ) 3 ∥ν k ∥ 2 (Holder Inequality) ≥ (µ (k) L + µ (k) U ) 3 4∥ν k ∥ 2 = ∥θ (k) ∥ 3 1 4∥ν k ∥ 2 ≥ min k∈[K] ∥θ (k) ∥ 3 1 4∥ν k ∥ 2 (By (9)) ≥ max k∈[K] ∥θ (k) ∥ 3 1 4C 3 2 ∥ν k ∥ 2 ≥ ∥θ∥ 3 1 4K 3 C 3 2 ∥ν k ∥ 2 ≥ O(1) Since 1 -γ and 1 + γ cannot be both o(1), we have a∈{L,U } K k=1 (µ (k) a ) 3 (1 -(-1) k γ) 2 ∥ν k ∥ 2 ≥ O(1) Therefore, ⟨w • w, ν * ⟩ = (1 + o(1))θ * (1 -b) 2 a∈{L,U } K k=1 (µ (k) a ) 3 (1 -(-1) k γ) 2 ∥ν k ∥ 2 = (1 + o(1))4θ * (1 -b) 2 • ((µ (1) L ) 3 + (µ (1) U ) 3 )((µ (2) L ) 2 + (µ (2) U ) 2 ) 2 + ((µ (2) L ) 3 + (µ (2) U ) 3 )((µ (1) L ) 2 + (µ (1) U ) 2 ) 2 ((µ (1) L ) 2 + (µ (1) U ) 2 + (µ (2) L ) 2 + (µ (2) U ) 2 ) 3 (107) By (101) (102) (103) (104), max k∈[2K] |w k | = O(1 -b) = o(1) . 
Substituting this result together with ( 97) and ( 107) into (90), we obtain P(ŷ = 2|π * = e 1 ) = 2 exp   -(1 -o(1)) (1 -b) 2 8 • θ * 4 ∥θ (1) L ∥ 3 1 +∥θ (1) U ∥ 3 1 (∥θ (1) L ∥ 2 1 +∥θ (1) U ∥ 2 1 ) 2 + ∥θ (2) L ∥ 3 1 +∥θ (2) U ∥ 3 1 (∥θ (2) L ∥ 2 1 +∥θ (2) U ∥ 2 1 ) 2    Similarly, P(ŷ = 1|π * = e 2 ) ≤ 2 exp   -(1 -o(1)) (1 -b) 2 8 • θ * 4 ∥θ (1) L ∥ 3 1 +∥θ (1) U ∥ 3 1 (∥θ (1) L ∥ 2 1 +∥θ (1) U ∥ 2 1 ) 2 + ∥θ (2) L ∥ 3 1 +∥θ (2) U ∥ 3 1 (∥θ (2) L ∥ 2 1 +∥θ (2) U ∥ 2 1 ) 2    Therefore, Risk(ŷ) ≤ 4 exp   -(1 -o(1)) (1 -b) 2 8 • θ * 4 ∥θ (1) L ∥ 3 1 +∥θ (1) U ∥ 3 1 (∥θ (1) L ∥ 2 1 +∥θ (1) U ∥ 2 1 ) 2 + ∥θ (2) L ∥ 3 1 +∥θ (2) U ∥ 3 1 (∥θ (2) L ∥ 2 1 +∥θ (2) U ∥ 2 1 ) 2    . ( ) Taking C 4 = 4, we conclude the proof.
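The lower-bound half of this argument runs through the Hellinger distance: the two-point testing bound inf_ỹ Risk(ỹ) ≥ (1/2)(1 − H²/2)² and the product rule H²(⊗P_i, ⊗Q_i) = 2(1 − Π_i(1 − H²(P_i, Q_i)/2)) used in Section J.1. A small numerical illustration for product Bernoulli measures (the parameters here are illustrative, not the P^(1), P^(2) of the proof):

```python
import numpy as np

def hellinger2_bernoulli(p, q):
    """H^2 between Bern(p) and Bern(q), computed coordinatewise."""
    return (np.sqrt(p) - np.sqrt(q))**2 + (np.sqrt(1 - p) - np.sqrt(1 - q))**2

# two product measures over 50 independent Bernoulli coordinates
p = np.full(50, 0.10)
q = np.full(50, 0.12)

# product rule: H^2(P, Q) = 2 (1 - prod_i (1 - H^2_i / 2))
H2 = 2 * (1 - np.prod(1 - hellinger2_bernoulli(p, q) / 2))

# resulting two-point testing lower bound on the risk
risk_lower = 0.5 * (1 - H2 / 2)**2
print(0 < risk_lower <= 0.5)  # -> True
```

With coordinates this close, the lower bound stays near 1/2: no test can reliably distinguish the two product measures.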

K PROOF OF THEOREM 3

Theorem 3. Suppose the conditions of Corollary 1 hold, where b 0 is properly small , and suppose that ΠU is b 0 -correct. Furthermore, we assume for sufficiently large constant C 3 , θ * ≤ 1 C3 , θ * ≤ min k∈[K] C 3 ∥θ (k) L ∥ 1 , and for a constant r 0 > 0, min k̸ =ℓ {P kℓ } ≥ r 0 . Then, there is a constant c2 = c2 (K, C 1 , C 2 , C 3 , c 3 , r 0 ) > 0 such that [-log(c 2 Risk(ŷ))]/[-log(inf ỹ {Risk(ỹ)})] ≥ c2 . Proof. On one hand, for any k, k * ∈ [K], k ̸ = k * , using exactly the same proof as in Section J.1, we can show that when C 3 > C 1 , inf ỹ (P(ŷ = k|π * = e k * ) + P(ŷ = k * |π * = k)) ≥ 1 2 exp 2 n i=1 - 1 2(1 -(θ a ) 2 ) 3 2 θ * θ i P kki -θ * θ i P k * ki 2 (111) where k i is the true label of node i, θ a = max i∈[n] max k∈[K] θ * θ i P kki . According to DCBM model and condition (8), θ a ≤ C C3 . Hence take Let B 0 be the event that ΠU is b 0 -correct. When inf ỹ {Risk(ỹ)} is replaced by the version conditioning on B 0 , since X and A are independent, and B 0 ∈ σ(A), conditioning on B 0 or not does not affect the distribution of X. On the other hand, for any k, k * , since π * does not affect the distribution of A, the distribution of A|B 0 , π * = e k and A|B 0 , π * = e k * are the same, so their Hellinger distance is still 0. Hence, all the proofs in Section J and above remain unaffected. In other words, one does not gain a lot of information from B 0 . C 3 ≥ √ 2C 1 , then inf ỹ {Risk(ỹ)} ≥ max k̸ =k * ∈[K] On the other hand, notice that proof of Theorem 2 still works conditioning on B 0 . In other words, there exists constant C > 0, such that given B 0 , for any δ ∈ (0, 1/2), with probability 1 -δ, simultaneously for 1 ≤ k ≤ K, | ψk ( ΠU ) -ψ k ( ΠU )| ≤ C log(1/δ) ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } + ∥θ (k) L ∥ 2 ∥θ (k) L ∥ 1 ∥θ∥ 1 . Define βk = ψ k ( ΠU ). 
Replacing c 0 β n by βk and replicating the proof of Corollary 1 , we can show that  P(ŷ ̸ = k * |B 0 , π * = e k * ) ≤ C K k=1 exp - C 0 c 2 0 β2 k ∥θ∥ 1 • min{θ * , ∥θ L ∥ 1 } Since θ * ≤ min k∈[K] C 3 ∥θ (k) L ∥ 1 , min{θ * , ∥θ By Lemma 6, denote C 5 = 8K 2 √ KC 2 2 b 0 ∥θ U ∥ 1 ∥θ∥ 1 Suppose that C 5 ≤ 1 4 . Then, for any vector α ∈ R k , 116) where recall that in Section G, we define Therefore, plugging the above result into (117), we have 8)-( 9) hold. We apply SCORE+ to obtain ΠU\{i} and plug it into the above algorithm. As n → ∞, suppose for some constant q 0 > 0 , min i∈U θ i ≥ q 0 max i∈U θ i , β n ∥θ U ∥ ≥ q 0 log(n), β 2 n ∥θ∥ 1 min i∈U θ i → ∞, and β 2 n ∥θ∥ 1 min k {∥θ (k) |α ′ M α| ≥ 1 -3C 5 1 -C 5 |α ′ M (0) α| ≥ 1 3 |α ′ M (0) α| M (0) = P G 2 LL + G 2 U U P D M (0) = diag(M β2 k ≥ 2 3   1 - M (0) kk * M (0) kk M (0) k * k *   = 2 3   1 - K l=1 λ 2 l P kl P k * l K l=1 λ 2 l P 2 kl K l=1 λ 2 l P 2 k * l   = 2 3   1 -1 - ( K l=1 λ 2 l P 2 kl )( K l=1 λ 2 l P 2 k * l ) -( K l=1 λ 2 l P kl P k * l ) 2 ( K l=1 λ 2 l P 2 kl )( K l=1 λ 2 l P 2 k * l )   ≥ 1 3 ( K l=1 λ 2 l P 2 kl )( K l=1 λ 2 l P 2 k * l ) -( K l=1 λ 2 l P kl P k * l ) 2 ( K l=1 λ 2 l P 2 kl )( K l=1 λ 2 l P 2 k * l ) L ∥ 1 } → ∞. Then, 1 |U | i∈U P(ŷ i ̸ = k i ) → 0, so the in-sample classification algorithm in section 3 is consistent. ((∥θ

≥ -log(

(k) ∥ 1 + ∥θ (k * ) ∥ 1 -θ i * )(1 -P k * k ) 2 ) (The true label of node i * is k * ) ≥ θ i * min k̸ =k * ∈[K] (∥θ (k) ∥ 1 (1 -P k * k ) 2 ) ( ) By assumption 8, for any k ∈ [K], (1  -P k * k ) 2 = (1 -P k * k ) 2 (1 + √ P k * k ) 2 = (2 -2P k * k ) 2 4(1 + √ P k * k ) 2 ( By assumption 9, for any k ∈ [K] ∥θ (k) ∥ 1 ≥ min k̸ =k * ∈[K] ∥θ (k) ∥ 1 ≥ 1 C 2 max k̸ =k * ∈[K] ∥θ (k) ∥ 1 ≥ 1 KC 2 k̸ =k * ∈[K] ∥θ (k) ∥ 1 = 1 KC 2 ∥θ∥ 1 (135) From the assumption, log(|U|) ≤ C 3 β 2 n ∥θ∥ 1 min i∈U θ i . Plugging (134) (135) into (133), we obtain  I i * ≥ θ i * min k̸ =k * ∈[K] ( 1 KC 2 (1 + √ C 1 ) 2 β 2 n ∥θ∥ 1 ) ≥ 1 KC 2 (1 + √ C 1 ) 2 β 2 n ∥θ∥ 1 min i∈U θ i ≥ 1 KC 2 C 3 (1 + √ C 1 ) 2 log(|U|) ≥ min{1, C 8 2 √ 2 + KC 2 C 3 (1 + √ C 1 ) 2 } ≥ c21 This concludes our proof.
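The efficiency statements above compare exponents of the achieved and ideal risks. For the K = 2 setting of Lemma 2, the exponents in the lower bound (75) and upper bound (76) can be compared with elementary arithmetic. A sketch with toy values (sL, sU stand for ∥θ_L∥_1 and ∥θ_U∥_1; none of these numbers come from the paper, and the o(1) factors are dropped):

```python
def lower_exp(theta_star, sL, sU, b):
    # exponent in the minimax lower bound (75), o(1) terms dropped
    return 2 * (1 - b)**2 / 8 * theta_star * (sL + sU)

def upper_exp(theta_star, sL, sU, b):
    # exponent in the AngleMin+ upper bound (76), o(1) terms dropped
    return (1 - b)**2 / 8 * theta_star * (sL**2 + sU**2)**2 / (sL**3 + sU**3)

# balanced labeled/unlabeled degree mass: the two exponents differ by exactly 2
r_balanced = lower_exp(.05, 10, 10, .9) / upper_exp(.05, 10, 10, .9)
# imbalanced mass: the ratio exceeds 2 only slightly
r_skewed = lower_exp(.05, 1, 19, .9) / upper_exp(.05, 1, 19, .9)
print(round(r_balanced, 3), round(r_skewed, 3))  # -> 2.0 2.094
```

Since (sL² + sU²)²/(sL³ + sU³) equals sL + sU in the balanced case, the exponent of AngleMin+ is within a constant factor of the ideal one, which is the content of the efficiency comparison.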



In AngleMin+, H serves to reduce noise. For example, let X, Y ∈ R^{2m} be two independent random Bernoulli vectors with EX = EY = (.1, . . . , .1, .4, . . . , .4)′. As m → ∞, it can be shown that cos ψ(X, Y) → 0.34 ≠ 1 almost surely, even though X and Y have identical means. If we instead project X and Y into R² by summing the first m coordinates and the last m coordinates separately, then as m → ∞, the cosine of the angle between the projected vectors converges to 1 almost surely.
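This claim can be verified by a short simulation (a sketch; the sample cosine matches the 0.34 limit only approximately at finite m):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200_000
p = np.r_[np.full(m, 0.1), np.full(m, 0.4)]   # EX = EY = (.1,...,.1,.4,...,.4)'

X = rng.binomial(1, p).astype(float)
Y = rng.binomial(1, p).astype(float)

def cos_angle(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw = cos_angle(X, Y)                          # -> 0.34, not 1, despite EX = EY

# summing the first m and the last m coordinates = projecting into R^2
proj = lambda v: np.array([v[:m].sum(), v[m:].sum()])
projected = cos_angle(proj(X), proj(Y))        # -> 1

print(round(raw, 2), round(projected, 3))
```

The limit 0.34 is (0.1² + 0.4²)/2 divided by (0.1 + 0.4)/2, while the projected vectors both concentrate around (0.1m, 0.4m) and hence become parallel.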




Figure1: Simulations (n = 500, K = 3; data are generated from DCBM). In each plot, the x-axis is the number of labeled nodes, and the y-axis is the average misclassification rate over 100 repetitions.

For a random variable, vector, or matrix Y, let P^(1)_Y and P^(2)_Y denote the distribution of Y given π_* = e_1 and π_* = e_2, respectively.






Average misclassification error over 10 data splits, with standard deviations in parentheses.

Error rates on Citeseer, where node attributes are available. Error rates marked with * are quoted from the literature and are based on one particular data split; all others are averaged over 10 data splits.

PSEUDO CODE OF THE ALGORITHM

Below is the pseudocode of AngleMin+, which is deferred to the appendix due to the page limit.

Input: number of communities K, adjacency matrix A ∈ R^{n×n}, community labels y_i for nodes i ∈ L, and the vector X ∈ R^n of edges between a new node and the existing nodes.
Output: estimated community label ŷ of the new node.
1. Unsupervised community detection: Apply a community detection algorithm (e.g., SCORE+ in Section 2) on A_{UU}, and let Π̂_U = [π̂_i]_{i∈U} store the estimated community labels, where π̂_i = e_k if and only if node i is clustered into community k, 1 ≤ k ≤ K.
2. Assigning the community label to a new node: Let Π_L = [π_i]_{i∈L} contain the community memberships of labeled nodes, where π_i = e_k if and only if y_i = k.
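The prediction step can be sketched as follows. This is a simplified illustration, not the exact AngleMin+ procedure: we use plain cosine similarity between community-aggregated edge counts, restrict attention to labeled nodes, and the helper name `predict_label` is our own.

```python
import numpy as np

def predict_label(A, labels, X, K):
    """Simplified sketch: aggregate the new node's edges by community and pick
    the community whose labeled members have the most similar aggregated
    connection profile (cosine similarity). `labels` holds values 1..K for
    labeled nodes and 0 for unlabeled ones."""
    labeled = np.flatnonzero(labels > 0)
    # community-wise edge counts of the new node, restricted to labeled nodes
    x_profile = np.array([X[labeled][labels[labeled] == k + 1].sum()
                          for k in range(K)], dtype=float)
    best_k, best_sim = 0, -np.inf
    for k in range(K):
        members = labeled[labels[labeled] == k + 1]
        # aggregated connection profile of community k's labeled members
        center = np.array([A[np.ix_(members, labeled[labels[labeled] == l + 1])].sum()
                           for l in range(K)], dtype=float)
        sim = x_profile @ center / (np.linalg.norm(x_profile) * np.linalg.norm(center) + 1e-12)
        if sim > best_sim:
            best_k, best_sim = k + 1, sim
    return best_k
```

In the full algorithm, the unlabeled nodes also contribute through the estimated memberships Π̂_U from step 1; here we keep only the labeled part to show the angle-maximization idea.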



Running time on the Caltech, Simons, and Polblogs networks. The quantities outside and inside the parentheses are the mean and standard deviation of the running time, respectively.

⟨x, x⟩  ⟨x, ỹ⟩  ⟨x, z⟩
⟨ỹ, x⟩  ⟨ỹ, ỹ⟩  ⟨ỹ, z⟩
⟨z, x⟩  ⟨z, ỹ⟩  ⟨z, z⟩
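The 3 × 3 array of inner products above is the Gram matrix of x, ỹ, z. For illustration (with toy vectors of our own choosing), it equals VᵀV when the three vectors are stacked as the columns of V; any such matrix is symmetric and positive semi-definite.

```python
import numpy as np

# toy vectors standing in for x, ỹ, z; purely illustrative
x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 1.0])
z = np.array([1.0, 1.0, 0.0])

V = np.column_stack([x, y, z])
G = V.T @ V   # G[i, j] = <v_i, v_j>, the Gram matrix
```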

… λ_k^2 (P_{kl} P_{k*k} − P_{kk} P_{k*l})^2 + λ_l^2 (P_{kl} − P_{k*l})^2 + λ_{k*}^2 (P_{kl} − P_{kk*} P_{k*l})^2

−log( (1/4) CK^2 Risk(ŷ_{i*}) ) − log( (1/(2|U|)) Risk(ỹ) ) …

θ_{i*} min_{k̸=k*∈[K]} ( Σ_{l=1}^K ∥θ^(l)∥_1 (P_{kl} − P_{k*l})^2 − θ_{i*} (1 − P_{kk*})^2 )
≥ θ_{i*} min_{k̸=k*∈[K]} ( ∥θ^(k)∥_1 (P_{kk} − P_{k*k})^2 + ∥θ^(k*)∥_1 (P_{kk*} − P_{k*k*})^2 − θ_{i*} (1 − P_{kk*})^2 )   (by the identification condition that P_{kk} = P_{k*k*} = 1)
= θ_{i*} min …

By the identification condition that P_{kk} = P_{k*k*} = 1.

In other words, log(|U|) ≤ K C_2 C_3 (1 + √C_1)^2 I_{i*}. Similar to the proof of Theorem 3, notice that since I_{i*} ≥ 0, when C_8 ≥ 2√2 + K C_2 C_3 (1 + √C_1)^2,

−log(c_21 Risk_ins(ŷ)) − log(inf_ỹ {Risk_ins(ỹ)}) ≥ …

ANNEX

where k_i is the true label of node i. As a result, the affinity between P^(1)_{X_i} and P^(2)_{X_i} admits a closed form. For x, y ∈ R, denote g(x, y) = log( xy + √((1 − x^2)(1 − y^2)) ). Then

1 − (1/2) H^2( P^(1)_{X_i}, P^(2)_{X_i} ) = exp( g( √(θ* θ_i P_{1 k_i}), √(θ* θ_i P_{2 k_i}) ) ).

Hence, by (79), … Substituting (81) into (76), we obtain … To bound the RHS of (82), we need to evaluate g. The following lemma shows that g(x, y) ≈ −(1/2)(x − y)^2.

Lemma 10. Suppose that 0 ≤ x, y ≤ a < 1. Then, …

Proof. We first prove a short inequality on the logarithm. For z > 0, define … where … Notice that … On the other hand, … Similarly, … We turn to ⟨w ∘ w, ṽ*⟩.
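As a numerical sanity check (our own script, not part of the proof): with the square-root reparametrization, exp g(√p₁, √p₂) recovers the Hellinger affinity of two Bernoulli distributions, and g(x, y) is close to −½(x − y)² for small x, y, as Lemma 10 asserts.

```python
import math

def g(x, y):
    # g(x, y) = log(xy + sqrt((1 - x^2)(1 - y^2)))
    return math.log(x * y + math.sqrt((1 - x * x) * (1 - y * y)))

def bernoulli_affinity(p1, p2):
    # Hellinger affinity of Bernoulli(p1) and Bernoulli(p2):
    # sum over the two outcomes of sqrt(P1 * P2)
    return math.sqrt(p1 * p2) + math.sqrt((1 - p1) * (1 - p2))

p1, p2 = 0.02, 0.05
lhs = math.exp(g(math.sqrt(p1), math.sqrt(p2)))   # exp g at the square roots
rhs = bernoulli_affinity(p1, p2)                  # equals lhs exactly
approx = -0.5 * (math.sqrt(p1) - math.sqrt(p2)) ** 2   # Lemma 10 approximation
```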

Denote that

… μ̂_L, μ̂^(1)_L, b̂ μ̂_L + b̂ μ̂^(2)_U, −η̂_12 + b̂ η̂_21, ∥ν_1∥ … = (b̂ μ̂ …

Since …, we have w_3 = −b̂ μ̂^(1) … On one hand, by the Cauchy–Schwarz inequality, … On the other hand, … Plugging (121), (122), (123) into (120), we obtain … Substituting (124) into (113), we obtain … Notice that the assumptions of Theorem 4 directly imply the assumptions of Corollary 2 when taking i* as the new node. Hence, regarding i* as the new node and leveraging Corollary 2, we have … In other words, the in-sample classification algorithm in Section 3 is consistent.

L.2 PROOF OF THEOREM 5

Theorem 5. Suppose the conditions of Corollary 1 hold, where b_0 is properly small, and suppose that Π̂_{U\{i}} is b_0-correct for all i ∈ U. Furthermore, we assume that for a sufficiently large constant C_3, … n ∥θ∥_1 min_{i∈U} θ_i, and that for a constant r_0 > 0, min_{k̸=ℓ} {P_{kℓ}} ≥ r_0. Then, there is a constant c_21 …

The minimizer of Risk_ins(ỹ) may not exist, so we define ỹ^(0) to be an approximate minimizer such that Risk_ins(ỹ^(0)) ≤ 2 inf_ỹ {Risk_ins(ỹ)}. By the definition of the infimum, such a ỹ^(0) always exists as long as inf_ỹ {Risk_ins(ỹ)} > 0. Notice that for any … Regarding node i as the new node and leveraging (112), we know that inf_{ỹ_i} {Risk(ỹ_i)} > 0. Hence, inf_ỹ {Risk_ins(ỹ)} ≥ inf_{ỹ_i} (1/|U|) {Risk(ỹ_i)} > 0 (note that we are not taking n or |U| → ∞ here), and ỹ^(0) is well-defined.

Let i* = arg max_{i∈U} Risk(ŷ_i), and let k* be the true label of i*. Regard [n] \ {i*} as the existing nodes in the network and i* as the new node. By (112) and (126) …

