SEMI-SUPERVISED COMMUNITY DETECTION VIA STRUCTURAL SIMILARITY METRICS

Abstract

Motivated by social network analysis and network-based recommendation systems, we study a semi-supervised community detection problem in which the objective is to estimate the community label of a new node using the network topology and partially observed community labels of existing nodes. The network is modeled using a degree-corrected stochastic block model, which allows for severe degree heterogeneity and potentially non-assortative communities. We propose an algorithm that computes a 'structural similarity metric' between the new node and each of the K communities by aggregating labeled and unlabeled data. The estimated label of the new node is the value of k that maximizes this similarity metric. Our method is fast and numerically outperforms existing semi-supervised algorithms. Theoretically, we derive explicit bounds for the misclassification error and show the efficiency of our method by comparing it with an ideal classifier. To the best of our knowledge, this is the first semi-supervised community detection algorithm that offers theoretical guarantees.

1. INTRODUCTION

Large network data are now frequently observed on social media (such as Facebook, Twitter, and LinkedIn), in science, and in social science. Learning the latent community structure in a network is of particular interest. For example, community analysis is useful in designing recommendation systems (Debnath et al., 2008), measuring scholarly impact (Ji et al., 2022), and reconstructing pseudo-dynamics in single-cell data (Liu et al., 2018).

In this paper, we consider a semi-supervised community detection setting: we are given a symmetric network with n nodes and adjacency matrix A ∈ R^{n×n}, where A_ij ∈ {0, 1} indicates whether there is an edge between nodes i and j. Suppose the nodes partition into K non-overlapping communities C_1, C_2, ..., C_K. For a subset L ⊂ {1, 2, ..., n}, we observe the true community label y_i ∈ {1, 2, ..., K} for each i ∈ L. Write m = |L| and Y_L = (y_i)_{i∈L}. In this context, there are two related semi-supervised community detection problems: (i) in-sample classification, where the goal is to classify all the existing unlabeled nodes; and (ii) prediction, where the goal is to classify a new node joining the network. Notably, the in-sample classification problem can be easily reduced to the prediction problem: we can successively single out each existing unlabeled node, regard it as the "new node", and then predict its label by applying an algorithm for the prediction problem. Hence, for most of the paper, we focus on the prediction problem and defer the study of in-sample classification to Section 3.

In the prediction problem, let X ∈ {0, 1}^n denote the vector of edges between the new node and each of the existing nodes. Given (A, Y_L, X), our goal is to estimate the community label of the new node. This problem has multiple applications. Consider news suggestion or online advertising for a new Facebook user (Shapira et al., 2013).
Given a large Facebook network of existing users, for a small fraction of nodes (e.g., active users), we may have good information about the communities to which they belong, whereas for the majority of users, we observe only whom they link to. We are interested in estimating the community label of the new user in order to personalize news or ad recommendations. As another example, in a co-citation network of researchers (Ji et al., 2022), each community might be interpreted as a group of researchers working in the same research area. We frequently have a clear understanding of the research areas of some authors (e.g., senior authors), and we intend to use this knowledge to determine the community to which a new node (e.g., a junior author) belongs.

The statistical literature on community detection has mainly focused on the unsupervised setting (Bickel & Chen, 2009; Rohe et al., 2011; Jin, 2015; Gao et al., 2018; Li et al., 2021). The semi-supervised setting is less studied. Leng & Ma (2019) offer a comprehensive literature review of semi-supervised community detection algorithms. Liu et al. (2014) and Ji et al. (2016) derive systems of linear equations for the community labels through physics theory, and predict the labels by solving those equations. Zhou et al. (2018) leverage the belief function to propagate labels across the network, so that one can estimate the label of a node through its belief. Betzel et al. (2018) extract several patterns in size and structural composition across the known communities and search for similar patterns in the graph. Yang et al. (2015) unify a number of different community detection algorithms based on non-negative matrix factorization or spectral clustering under the unsupervised setting, and fit them into the semi-supervised scenario by adding various regularization terms that encourage the estimated labels for nodes in L to match the clustering behavior of their observed labels.
However, the existing methods still face challenges. First, many of them rely on the heuristic that a node tends to have more edges with nodes in its own community than with nodes in other communities. This holds only when communities are assortative. But non-assortative communities are also seen in real networks (Goldenberg et al., 2010; Betzel et al., 2018); for instance, Facebook users sharing similar restaurant preferences are not necessarily friends of each other. Second, real networks often have severe degree heterogeneity (i.e., the degrees of some nodes can be many times larger than the degrees of other nodes), but most semi-supervised community detection algorithms do not handle degree heterogeneity. Third, the optimization-based algorithms (Yang et al., 2015) solve non-convex problems and face the issue of local minima. Last, to the best of our knowledge, none of the existing methods has theoretical guarantees.

Attributed network clustering is a problem related to community detection, for which many algorithms have been developed (see Chunaev et al. (2019) for a survey). Graph neural networks (GNNs) have reported great success in attributed network clustering. Kipf & Welling (2016) propose a graph convolutional network (GCN) approach to semi-supervised community detection, and Jin et al. (2019) combine GNNs with a Markov random field to predict node labels. However, GNNs are designed for the setting where each node has a large number of attributes and these attributes contain rich information about community labels. The key question in GNN research is how to utilize the graph to better propagate messages. In contrast, we are interested in the scenario where it is infeasible or costly to collect node attributes. For instance, it is easy to construct a co-authorship network from BibTeX files, but collecting features of authors is much harder. Additionally, a number of benchmark network datasets do not have attributes (e.g., Caltech (Red et al., 2011; Traud et al., 2012), Simmons (Red et al., 2011; Traud et al., 2012), and Polblogs (Adamic & Glance, 2005)). It is unclear how to implement GNNs on these datasets. In Section 4, we briefly study the performance of GNNs with self-created nodal features from 1-hop representation, graph topology, and node embedding. Our experiments indicate that GNNs are often not suitable when node attributes are unavailable.

We propose a new algorithm for semi-supervised community detection to address the limitations of existing methods. We adopt the DCBM model (Karrer & Newman, 2011) for networks, which models degree heterogeneity and allows for both assortative and non-assortative communities. Inspired by the viewpoint of Goldenberg et al. (2010) that a 'community' is a group of 'structurally equivalent' nodes, we design a structural similarity metric between the new node and each of the K communities. This metric aggregates information from both labeled and unlabeled nodes. We then estimate the community label of the new node by the k that maximizes this similarity metric. Our method is easy to implement, computationally fast, and compares favorably with other methods in numerical experiments. In theory, we derive explicit bounds for the misclassification probability of our method under the DCBM model. We also study the efficiency of our method by comparing its misclassification probability with that of an ideal classifier that has access to the community labels of all nodes.
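The reduction from in-sample classification to prediction described in the introduction can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names, the 0-based labels, and the convention that -1 marks an unlabeled node are choices of this sketch, and `neighbor_vote` is a deliberately naive stand-in predictor (it relies on assortativity), not the structural similarity method proposed in this paper.

```python
import numpy as np

def in_sample_classify(A, labels, predict):
    """Classify each unlabeled node by treating it, in turn, as the 'new node'.

    A       : (n, n) symmetric 0/1 adjacency matrix.
    labels  : length-n int array; community in {0, ..., K-1} if observed, -1 if not
              (a convention assumed by this sketch).
    predict : hypothetical callable (A_rest, labels_rest, x) -> int that solves
              the prediction problem for a single new node.
    """
    A = np.asarray(A)
    labels = np.asarray(labels)
    n = A.shape[0]
    out = labels.copy()
    for i in np.flatnonzero(labels < 0):
        keep = np.arange(n) != i            # every node except i
        A_rest = A[np.ix_(keep, keep)]      # network on the remaining n - 1 nodes
        x = A[i, keep]                      # edges between node i and the rest
        # Only the originally observed labels are passed to the predictor.
        out[i] = predict(A_rest, labels[keep], x)
    return out

def neighbor_vote(A_rest, labels_rest, x):
    """Stand-in predictor: most common label among labeled neighbors."""
    nb = labels_rest[(x == 1) & (labels_rest >= 0)]
    if nb.size == 0:
        return 0                            # arbitrary fallback when no labeled neighbor
    return int(np.bincount(nb).argmax())

# Toy network: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A_toy = np.zeros((6, 6), dtype=int)
for i, j in edges:
    A_toy[i, j] = A_toy[j, i] = 1
labels_toy = np.array([0, 0, -1, 1, -1, 1])
estimated = in_sample_classify(A_toy, labels_toy, neighbor_vote)
```

Any algorithm for the prediction problem can be passed in place of `neighbor_vote`; the wrapper itself makes no assumption about assortativity or degree homogeneity.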

2. SEMI-SUPERVISED COMMUNITY DETECTION

Recall that A is the n × n adjacency matrix on the existing nodes and Y_L contains the community labels of nodes in L. Write [n] = {1, 2, ..., n} and let U = [n] \ L denote the set of unlabeled nodes. We index the new node by n + 1 and let X ∈ R^n be the binary vector consisting of the edges between the new node and existing nodes. Denote by Ā the adjacency matrix for the network of (n + 1) nodes.
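To make the objects (A, Y_L, X), L, U, and Ā concrete, the following sketch simulates a small network and splits it into these pieces. It assumes the standard DCBM form of Karrer & Newman (2011), in which nodes i and j are joined independently with probability theta_i * theta_j * P[k_i, k_j]; the names `theta`, `comm`, and `P`, the parameter values, and the 0-based indexing are choices of this sketch and need not match the paper's own parameterization.

```python
import numpy as np

def sample_dcbm(theta, comm, P, rng):
    """Sample a symmetric 0/1 adjacency matrix without self-loops.

    Edge (i, j), i < j, is present independently with probability
    theta[i] * theta[j] * P[comm[i], comm[j]], clipped to [0, 1].
    """
    theta = np.asarray(theta, dtype=float)
    comm = np.asarray(comm)
    P = np.asarray(P, dtype=float)
    prob = np.clip(np.outer(theta, theta) * P[np.ix_(comm, comm)], 0.0, 1.0)
    upper = np.triu(rng.random(prob.shape) < prob, k=1)  # strict upper triangle
    return (upper | upper.T).astype(int)                 # symmetrize

rng = np.random.default_rng(0)
n, K = 200, 2
comm = rng.integers(0, K, size=n + 1)       # true communities (0-based here)
theta = rng.uniform(0.2, 1.0, size=n + 1)   # degree parameters (illustrative values)
P = np.array([[0.2, 0.8],
              [0.8, 0.2]])                  # a non-assortative example

A_bar = sample_dcbm(theta, comm, P, rng)    # network on all n + 1 nodes
A = A_bar[:n, :n]                           # existing nodes
X = A_bar[n, :n]                            # edges of the new node (last index)

L = rng.choice(n, size=20, replace=False)   # labeled nodes, m = 20
U = np.setdiff1d(np.arange(n), L)           # unlabeled nodes
Y_L = comm[L]                               # the only labels the analyst observes
```

In this sketch the new node sits at position n in 0-based indexing, which corresponds to index n + 1 in the text; the off-diagonal entries of P exceed the diagonal ones to show that the model does not require assortative communities.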

2.1 The DCBM model and structural equivalence of communities

We model Ā with the degree-corrected block model (DCBM) (Karrer & Newman, 2011). Define a K-dimensional membership

