SEMI-SUPERVISED COMMUNITY DETECTION VIA STRUCTURAL SIMILARITY METRICS

Abstract

Motivated by social network analysis and network-based recommendation systems, we study a semi-supervised community detection problem in which the objective is to estimate the community label of a new node using the network topology and partially observed community labels of existing nodes. The network is modeled using a degree-corrected stochastic block model, which allows for severe degree heterogeneity and potentially non-assortative communities. We propose an algorithm that computes a 'structural similarity metric' between the new node and each of the K communities by aggregating labeled and unlabeled data. The estimated label of the new node is the value of k that maximizes this similarity metric. Our method is fast and outperforms existing semi-supervised algorithms in numerical experiments. Theoretically, we derive explicit bounds for the misclassification error and show the efficiency of our method by comparing it with an ideal classifier. To the best of our knowledge, this is the first semi-supervised community detection algorithm that offers theoretical guarantees.

1. INTRODUCTION

Large network data sets are now common in social media (such as Facebook, Twitter, and LinkedIn), in science, and in social science. Learning the latent community structure in a network is of particular interest. For example, community analysis is useful in designing recommendation systems (Debnath et al., 2008), measuring scholarly impact (Ji et al., 2022), and reconstructing pseudo-dynamics in single-cell data (Liu et al., 2018).

In this paper, we consider a semi-supervised community detection setting: we are given a symmetric network with $n$ nodes, and denote by $A \in \mathbb{R}^{n \times n}$ the adjacency matrix, where $A_{ij} \in \{0, 1\}$ indicates whether there is an edge between nodes $i$ and $j$. Suppose the nodes partition into $K$ non-overlapping communities $C_1, C_2, \ldots, C_K$. For a subset $L \subset \{1, 2, \ldots, n\}$, we observe the true community label $y_i \in \{1, 2, \ldots, K\}$ for each $i \in L$. Write $m = |L|$ and $Y_L = (y_i)_{i \in L}$. In this context, there are two related semi-supervised community detection problems: (i) in-sample classification, where the goal is to classify all the existing unlabeled nodes; (ii) prediction, where the goal is to classify a new node joining the network. Notably, the in-sample classification problem can be easily reduced to the prediction problem: we can successively single out each existing unlabeled node, regard it as the "new node", and then predict its label by applying an algorithm for the prediction problem. Hence, for most of the paper, we focus on the prediction problem and defer the study of in-sample classification to Section 3. In the prediction problem, let $X \in \{0, 1\}^n$ denote the vector consisting of edges between the new node and each of the existing nodes. Given $(A, Y_L, X)$, our goal is to estimate the community label of the new node.

This problem has multiple applications. Consider news suggestions or online advertising for a new Facebook user (Shapira et al., 2013). Given a large Facebook network of existing users, for a small fraction of nodes (e.g., active users), we may have good information about the communities to which they belong, whereas for the majority of users, we only observe who they link to. We are interested in estimating the community label of the new user in order to personalize news or ad recommendations. As another example, in a co-citation network of researchers (Ji et al., 2022), each community might be interpreted as a group of researchers working in the same research area. We frequently have a clear understanding of the research areas of some authors (e.g., senior authors), and we intend to use this knowledge to determine the community to which a new node (e.g., a junior author) belongs.
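
The reduction from in-sample classification to prediction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`in_sample_via_prediction`, `edge_fraction_predictor`) and the array-based data layout are hypothetical, and the default predictor is a naive stand-in that assumes assortative communities. It is not the structural similarity method proposed in this paper, and any prediction algorithm taking the inputs $(A, Y_L, X)$ could be plugged in instead.

```python
import numpy as np


def edge_fraction_predictor(A_rest, labeled_idx, labeled_y, x, K):
    """Hypothetical stand-in predictor (NOT the paper's method).

    Returns the community k whose labeled members have the largest
    fraction of edges to the new node. A_rest is unused here; it is
    kept so that any predictor using (A, Y_L, X) fits this signature.
    """
    scores = np.zeros(K)
    for k in range(1, K + 1):
        members = labeled_idx[labeled_y == k]
        if members.size > 0:
            scores[k - 1] = x[members].mean()
    return int(np.argmax(scores)) + 1  # labels are 1, ..., K


def in_sample_via_prediction(A, labeled_idx, labeled_y, K,
                             predict=edge_fraction_predictor):
    """Classify every unlabeled node by treating it as the 'new node'."""
    n = A.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    labeled_y = np.asarray(labeled_y)
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)

    estimates = {}
    for i in unlabeled:
        keep = np.delete(np.arange(n), i)   # the other n - 1 nodes
        A_rest = A[np.ix_(keep, keep)]      # network without node i
        x = A[i, keep]                      # edges from node i to the rest
        # Positions of the labeled nodes inside the reduced network.
        pos = np.searchsorted(keep, labeled_idx)
        estimates[int(i)] = predict(A_rest, pos, labeled_y, x, K)
    return estimates
```

Each pass removes one unlabeled node from the adjacency matrix, uses its row as the edge vector $X$, and calls the predictor on the remaining network, which mirrors the "single out each unlabeled node" argument in the text.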

