SUBSAMPLING IN LARGE GRAPHS USING RICCI CURVATURE

Abstract

In recent decades, an increasing number of large graphs with millions of nodes have been collected and constructed. Despite their utility, analyzing such graphs is hindered by high computational costs and visualization difficulties. To tackle these issues, researchers have developed graph subsampling approaches that extract representative nodes to provide a rough sketch of the graph that preserves its global properties. By selecting representative nodes, these methods help researchers estimate statistics of the large graph, e.g., the number of communities, from the subsample. However, the available subsampling methods, such as the degree node sampler and the random walk sampler, tend to overlook minority communities because they prioritize nodes with high degrees. To address this limitation, we propose leveraging the community information hidden within the graph for subsampling. Although the community structure is typically unknown, geometric methods can reveal it by defining an analog of Ricci curvature on the graph, known as the Ollivier Ricci (OR) curvature. We introduce a new subsampling algorithm, the Ollivier Ricci curvature Gradient-based subsampling (ORG-sub) algorithm, based on our asymptotic results regarding the OR curvature of within-community and between-community edges. The ORG-sub algorithm makes two significant contributions. First, it comes with a rigorous theoretical guarantee that the probability of including all communities in the final subgraph converges to one. Second, extensive experiments on synthetic and benchmark datasets demonstrate the advantages of our algorithm.

1. INTRODUCTION

As we enter the big data era, our capacity to access large graphs (a.k.a. networks) has provided unprecedented opportunities and challenges. For example, Wang et al. (2011) presented a Twitter social network, which has more than 190 million nodes (users) who generate more than 65 million edges (tweets) every day. Such huge networks enable researchers to tackle more complex problems. However, they pose great challenges for storage, visualization, and analysis, since their sheer volume renders many computational methods infeasible.

Graph subsampling. Graph subsampling is a commonly used technique to address this scalability issue because of its simplicity and efficiency. It aims to take a subgraph that preserves critical features of the full graph. Various graph subsampling methods that preserve different graph properties have been proposed, including node sampling (Mall et al., 2013; Zeng et al., 2019), edge sampling (Krishnamurthy et al., 2005), and exploration sampling (Goodman, 1961; Leskovec et al., 2005; Hübler et al., 2008; Maiya & Berger-Wolf, 2010). Researchers evaluate graph subsampling approaches by measuring the similarity between the features of the original graph and those of the subgraph. Such features include the degree distribution (Adamic et al., 2001), the minimum cut (Hu & Lau, 2013), and the number of triangles (Seshadhri et al., 2014). An important graph feature that has received less attention is the number of communities (denoted by M), which plays a crucial role in identifying the community structures. Network data often have natural communities, and the identification of these communities helps answer vital questions in a

