SUBSAMPLING IN LARGE GRAPHS USING RICCI CURVATURE

Abstract

In recent decades, an increasing number of large graphs with millions of nodes have been collected and constructed. Despite their utility, analyzing such graphs is hindered by high computational costs and visualization difficulties. To tackle these issues, researchers have developed graph subsampling approaches that extract representative nodes to provide a rough sketch of the graph that preserves its global properties. By selecting representative nodes, these graph subsampling methods can help researchers estimate graph statistics, e.g., the number of communities, of the large graph from the subsample. However, the available subsampling methods, such as the degree-based node sampler and the random walk sampler, tend to overlook minority communities, as they prioritize nodes with high degrees. To address this limitation, we propose leveraging community information hidden within the graph for subsampling. Although the community structure is typically unknown, geometric methods can reveal community structure information by defining an analog of Ricci curvature on the graph, known as the Ollivier Ricci (OR) curvature. We introduce a new subsampling algorithm, the Ollivier Ricci curvature Gradient-based subsampling (ORG-sub) algorithm, based on our asymptotic results on the OR curvatures of within-community and between-communities edges. The ORG-sub algorithm makes two significant contributions: first, it provides a rigorous theoretical guarantee that the probability of taking all communities into the final subgraph converges to one; second, extensive experiments on synthetic and benchmark datasets demonstrate the advantages of our algorithm.

1. INTRODUCTION

As we enter the big data era, our capacity to access large graphs (a.k.a. networks) brings unprecedented opportunities and challenges. For example, Wang et al. (2011) presented a Twitter social network with more than 190 million nodes (users) who generate more than 65 million edges (tweets) every day. Such huge networks enable researchers to tackle more complex problems, but they pose great challenges for storage, visualization, and analysis, since their sheer volume renders many computational methods infeasible.

Graph subsampling. Graph subsampling is a commonly used technique to address this scalability issue because of its simplicity and efficiency. It aims to take a subgraph that preserves critical features of the full graph. Various graph subsampling methods that preserve different graph properties have been proposed, including node sampling (Mall et al., 2013; Zeng et al., 2019), edge sampling (Krishnamurthy et al., 2005), and exploration sampling (Goodman, 1961; Leskovec et al., 2005; Hübler et al., 2008; Maiya & Berger-Wolf, 2010). Researchers evaluate graph subsampling approaches by measuring the similarity between features of the original graph and those of the subgraph, such as the degree distribution (Adamic et al., 2001), the minimum cut (Hu & Lau, 2013), and the number of triangles (Seshadhri et al., 2014). An important graph feature that has received less attention is the number of communities (denoted by M), which plays a crucial role in identifying community structures. Network data often have natural communities, and identifying these communities helps answer vital questions in a variety of fields (Rohe et al., 2011). For example, communities in social networks may represent groups of people who share a similar interest, and communities in protein-protein interaction networks could be regulatory modules of interacting proteins (Rohe et al., 2011).
In this paper, we focus on the setting where M is a fixed model parameter with a ground truth. This setting is widely used in many models, e.g., the stochastic block model (SBM) (Holland et al., 1983) and its variants such as the degree-corrected SBM (DCBM) (Karrer & Newman, 2011). The SBM family is arguably the most widely used generative model for community detection from a theoretical perspective. In fact, Vaca-Ramírez & Peixoto (2022) performed a systematic analysis of the quality of fit of the SBM on 275 real networks and observed that the "SBM is capable of providing an accurate description for the majority of networks considered". There are also settings where the number of communities is not fixed; for example, Olhede & Wolfe (2014) used the SBM to approximate a nonparametric graphon model, in which case the number of communities is a hyperparameter. The latter case is beyond the scope of this paper. Under the setting where M is a model parameter, many community detection methods have been proposed, such as modularity maximization (Newman, 2006; Good et al., 2010), spectral clustering (Von Luxburg, 2007; Rohe et al., 2011; Liu et al., 2018), and pseudo-likelihood based methods (Amini et al., 2013; Wang et al., 2021). The theoretical properties of most community detection methods, such as consistency and asymptotic distributions, are established under the assumption that M is known (Ma et al., 2021), and M is usually required as an input to these algorithms. In practice, however, M is unknown, which significantly diminishes the usefulness of the aforementioned methods. Existing methods for estimating M are usually very expensive; for example, the cross-validation method proposed by Li et al. (2020) has a computational cost that is cubic in the number of nodes n. With thousands or millions of nodes, this cost is unaffordable.
Thus, it is highly desirable to have subsampling methods that yield subgraphs with ñ ≪ n nodes while preserving the number of communities M, so that we can obtain an accurate estimate at a reduced computational cost. Despite many successful applications, existing subsampling methods tend to leave out minority communities (i.e., communities with a smaller number of nodes) because nodes with high degrees are more likely to be sampled into the subgraph. Consequently, these methods usually underestimate M, especially for graphs with imbalanced community structures. To overcome these shortcomings, we develop a graph subsampling method whose subgraphs can be used to accurately estimate M. Achieving this goal is challenging since the community structure is hidden and unavailable. Fortunately, recent studies indicate that community structure is a geometric phenomenon when a graph is viewed as a Riemannian geometric object (Ni et al., 2015), and insights into the community information can be obtained by applying geometric methods to a graph (Ni et al., 2015; 2019; Sia et al., 2019).

Ollivier Ricci Curvature of Graphs. In particular, a graph can be regarded as a discrete version of a Riemannian manifold (Ni et al., 2019). A node of the graph is analogous to a point on a Riemannian manifold, and a pair of connected nodes is analogous to two points connected by a geodesic. The partition of a graph into communities is analogous to the geometric decomposition of a Riemannian manifold (Ni et al., 2019). Ricci curvature is a key tool for such a decomposition: it measures how a Riemannian manifold deviates from a flat one. Recently, Lin et al. (2011) defined an analog of Ricci curvature for graphs, the Ollivier Ricci (abbreviated OR) curvature.
Previous empirical results have shown that the OR curvature is related to the connectivity of graphs (Ni et al., 2015; Gosztolai & Arnaudon, 2021), but this relationship has been insufficiently explored from a theoretical perspective. In this paper, we theoretically show that the OR curvatures of edges within a densely connected community are asymptotically larger than those of edges between two sparsely connected communities as the size of the graph increases.

Ollivier Ricci Curvature Gradient Based Graph Subsampling. Based on our theoretical result, we propose an OR curvature gradient-based graph subsampling algorithm (abbreviated ORG-sub). Specifically, ORG-sub randomly chooses one edge as the starting point of the subgraph and calculates its OR curvature. ORG-sub then gradually expands the subgraph by taking the next edge whose OR curvature differs most from the OR curvature of the previously taken edge. We define this difference as the OR curvature gradient (ORG) and use it to guide the expansion, i.e., to direct which edge ORG-sub takes next. All edges taken by ORG-sub form the final subgraph. The proposed ORG-sub enjoys two main advantages. First, ORG-sub has a rigorous theoretical guarantee: under the SBM scenario, we prove that the probability of ORG-sub taking all communities into the final subgraph converges to one faster than for the random walk algorithm, indicating that even nodes in minority communities are subsampled. More importantly, we theoretically show that the estimate of M from the subsampled graph converges to the M of the full graph. Second, extensive empirical experiments on simulated and real-world datasets show that the estimator based on ORG-sub subgraphs accurately estimates M while greatly reducing the computational cost.

2.1. NOTATIONS AND DEFINITIONS

A graph consists of nodes connected by edges. Since directed graphs do not serve as discrete approximations of Riemannian manifolds (geodesics are always bidirectional), we focus on undirected graphs in this paper.

2.2. OLLIVIER RICCI CURVATURE OF GRAPHS

A Riemannian manifold (M, g) is defined as a smooth manifold M equipped with a metric g : TM × TM → R, where TM is the tangent space of M (Lee, 2018).

Figure 1: The left graph is tree-structured, corresponding to a hyperbolic manifold with negative curvature. The middle graph is grid-structured, corresponding to a Euclidean space with zero curvature. The right graph is the complete graph K_6, which is embedded in a spherical surface with positive curvature.

The Ricci curvature of a Riemannian manifold measures the deviation of the manifold from a flat manifold (Lee, 2006; Ni et al., 2015; Samal et al., 2018). "Flat" means that the distance between two points equals the distance between the two local spaces centered around them. In the discrete version, the space defined on a node is uniformly distributed over the node's neighborhood, so the distance between two such spaces can be interpreted as the average distance between the neighborhoods of the two nodes. Through this analogy, the counterpart of Ricci curvature, the Ollivier Ricci curvature (Lin et al., 2011), is defined on graphs. Previous empirical results show that the OR curvature of a graph is related to the Ricci curvature of the underlying manifold: a graph with large (or small) OR curvature corresponds to an underlying manifold with large (or small) Ricci curvature (Ni et al., 2015; Samal et al., 2018; Ni et al., 2019; Gosztolai & Arnaudon, 2021). Theoretical results have shown that the OR curvature of random geometric graphs converges to the Ricci curvature of the underlying Riemannian manifold (Boguna et al., 2021; van der Hoorn et al., 2021). We use a toy example to illustrate this relationship.
In Figure 1, the hyperbolic space (with negative curvature) is especially suited for tree-structured graphs (Nickel & Kiela, 2017), the Euclidean space (with zero curvature) is especially suited for grid-structured graphs, and the spherical space (with positive curvature) is especially suited for complete graphs (Ni et al., 2015). We now introduce the definition of the OR curvature of a graph. For α ∈ [0, 1] and any node u with degree d_u, we first define the probability distribution m_u^α of node u:

m_u^\alpha(x) = \begin{cases} \alpha & \text{if } x = u \\ (1-\alpha)/d_u & \text{if } x \in \delta(u) \\ 0 & \text{otherwise} \end{cases}   (1)

Here α is the probability mass kept at node u itself, and the rest is distributed uniformly over the neighborhood δ(u). Following Ni et al. (2018) and Ye et al. (2019), we set α = 0.5, which means that each node keeps 50% of the probability mass for itself. Let d(u, v) be the geodesic distance between nodes u and v, i.e., the length of a shortest path between the two nodes. We now define the Wasserstein distance between two nodes as the transportation distance between their probability distributions m_u^α and m_v^α.

Definition 2.1 (Wasserstein distance). Let X be a metric space with two probability distributions μ_1 and μ_2 of mass 1. A transportation plan from μ_1 to μ_2 is a mapping ξ : X × X → [0, 1] satisfying \sum_y \xi(x, y) = \mu_1(x) and \sum_x \xi(x, y) = \mu_2(y). Substituting the node distributions m_u^α and m_v^α for μ_1 and μ_2, the Wasserstein distance between two nodes is

W(m_u^\alpha, m_v^\alpha) = \inf_\xi \sum_{x, y \in V} \xi(x, y)\, d(x, y).   (2)

The Wasserstein distance W(m_u^α, m_v^α) can be computed by linear programming (Lin et al., 2011). The OR curvature (Ollivier, 2007; Lin et al., 2011) of the edge (u, v) is then defined as follows.

Definition 2.2 (Ollivier Ricci Curvature). Given the Wasserstein distance W(m_u^α, m_v^α) and the geodesic distance d(u, v), the Ollivier Ricci curvature is

\kappa(u, v) = 1 - \frac{W(m_u^\alpha, m_v^\alpha)}{d(u, v)}.   (3)
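To make the definitions concrete, the following sketch computes κ(u, v) on a toy graph by solving the transport problem in Definition 2.1 as a small linear program. This is our own illustration (the graph representation and helper names are not from the paper); scipy's `linprog` is used for the optimal transport step.

```python
from collections import deque
from scipy.optimize import linprog

def bfs_dist(adj, s):
    """Geodesic (shortest-path) distances from s in an unweighted graph."""
    d = {s: 0}
    q = deque([s])
    while q:
        x = q.popleft()
        for y in adj[x]:
            if y not in d:
                d[y] = d[x] + 1
                q.append(y)
    return d

def node_measure(adj, u, alpha=0.5):
    """m_u^alpha: mass alpha at u, the rest spread uniformly over neighbors."""
    m = {u: alpha}
    for x in adj[u]:
        m[x] = (1 - alpha) / len(adj[u])
    return m

def ollivier_ricci(adj, u, v, alpha=0.5):
    """kappa(u, v) = 1 - W(m_u, m_v) / d(u, v), with W solved by a small LP."""
    mu, mv = node_measure(adj, u, alpha), node_measure(adj, v, alpha)
    src, dst = list(mu), list(mv)
    dist = {s: bfs_dist(adj, s) for s in src}
    # Transport cost: geodesic distance for each source/destination pair.
    c = [dist[x][y] for x in src for y in dst]
    # Marginal constraints: rows sum to m_u, columns sum to m_v.
    A_eq, b_eq = [], []
    for i, x in enumerate(src):
        row = [0.0] * (len(src) * len(dst))
        for j in range(len(dst)):
            row[i * len(dst) + j] = 1.0
        A_eq.append(row)
        b_eq.append(mu[x])
    for j, y in enumerate(dst):
        col = [0.0] * (len(src) * len(dst))
        for i in range(len(src)):
            col[i * len(dst) + j] = 1.0
        A_eq.append(col)
        b_eq.append(mv[y])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return 1 - res.fun / dist[u][v]

# Triangle graph: W = 0.25 (move 0.25 of mass from u to v), so kappa = 0.75.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(round(ollivier_ricci(triangle, 0, 1), 6))  # 0.75
```

On the triangle with α = 0.5, each endpoint keeps mass 0.5 and spreads 0.25 to each neighbor; the only mass that must move is 0.25 from u to v at distance 1, giving W = 0.25 and κ = 0.75.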

3. OLLIVIER RICCI CURVATURE GRADIENT AND COMMUNITY STRUCTURE

The OR curvature of a graph reveals properties of the underlying Riemannian manifold and provides insights into the community structure of graphs. Previous work has shown that a large (or small) curvature corresponds to an edge that is more (or less) connected than a grid (Ni et al., 2015; Samal et al., 2018; Ni et al., 2019; Gosztolai & Arnaudon, 2021). As shown in Figure 2, the left graph has five communities, and the right one has three communities generated by the stochastic block model; warmer colors indicate smaller curvature, and colder colors indicate larger curvature. Previous empirical results and the above toy example relate the OR curvature of a graph to its community structure. However, a rigorous theoretical proof that the OR curvature of a within-community edge is larger than that of a between-communities edge has not yet been developed. In Theorem 3.3, we prove that the lower bound of the within-community edges' OR curvatures exceeds the upper bound of the between-communities edges' OR curvatures in stochastic block models (SBM).

3.1. ASYMPTOTIC RESULTS OF OR CURVATURE GRADIENT IN SBM

Lemma 3.1 (Ollivier-Ricci curvature of random graphs generated by stochastic block models). Consider a graph with M communities generated from a stochastic block model SBM\big(\{p_{in}^{(i)}\}_{i=1}^M, p_{out}, \{n_i\}_{i=1}^M\big), where p_{in}^{(i)} represents the probability of an edge connecting nodes in the same community B_i, p_{out} represents the probability of an edge connecting nodes from different communities, and \{n_i\}_{i=1}^M are the sizes of the communities. We assume that, for any i = 1, \ldots, M, p_{in}^{(i)} > p_{out}, and we write p_\vee = \max_i p_{in}^{(i)}. We denote by x_{B_i} a node x from community B_i, by (x_{B_i}, y_{B_j}) the edge whose endpoints x and y are from B_i and B_j, respectively, and by \kappa(x_{B_i}, y_{B_j}) the edge's OR curvature. The following statements hold for the OR curvature.

1. If p_{in}^{(i)} n_i + p_{out}(n - n_i) > \frac{8}{3}\ln n_i, then for the lower bound of the within-community OR curvature, we almost surely have

\kappa(x_{B_i}, y_{B_i}) \ge \frac{(n_i - 2)\, p_{in}^{(i)\,2} + (n - n_i)\, p_{out}^2}{(n_i - 1)\, p_{in}^{(i)} + (n - n_i)\, p_{out}} + O\left(\sqrt{\frac{\ln n}{p_{out}^2\, n}}\right).   (4)

2. If p_{in}^{(i)\,2} n_i + p_{out}^2 (n - n_i) > 4\ln n_i, then for the upper bound of the within-community OR curvature, we almost surely have

\kappa(x_{B_i}, y_{B_i}) \le p_{in}^{(i)} + O\left(\sqrt{\frac{\ln n_i}{n}}\right).   (5)

3. If p_{out} > 3\ln n_\vee \Big/ \left(\frac{p_{in}^{(i)}}{p_{out}}(n_i + n_j) + n - n_i - n_j\right), then for the upper bound of the between-communities OR curvature, we almost surely have

\kappa(x_{B_i}, y_{B_j}) \le \frac{n - n_i - n_j}{n - 1}\, p_{out} + \frac{n_i - n_j - 2}{n - 1}\, p_\vee + O\left(p_\vee \sqrt{\frac{\ln n}{p_{out}\, n}}\right).   (6)

Lemma 3.2. Combining statements 1 and 2 in Lemma 3.1, almost surely, for any (x'_{B_i}, y'_{B_i}) \in \Delta((x_{B_i}, y_{B_i})), where \Delta(e) denotes the set of edges adjacent to e, we have

|\kappa(x_{B_i}, y_{B_i}) - \kappa(x'_{B_i}, y'_{B_i})| \le \frac{p_{out}\,(p_{in}^{(i)} - p_{out})}{\frac{n_i - 1}{n - n_i}\, p_{in}^{(i)} + p_{out}} + O\left(\frac{1}{p_{out}\, n}\right).   (7)

Statements 1 and 2 in Lemma 3.1 together show that, under certain conditions, the probability that the OR curvature of a within-community edge falls in the interval \left(\frac{(n_i-2)p_{in}^{(i)\,2} + (n-n_i)p_{out}^2}{(n_i-1)p_{in}^{(i)} + (n-n_i)p_{out}},\; p_{in}^{(i)}\right) goes to 1 as the graph size n goes to infinity. Statement 3 shows that, under certain conditions, the probability that the OR curvature of a between-communities edge is no larger than \frac{n-n_i-n_j}{n-1}p_{out} + \frac{n_i-n_j-2}{n-1}p_\vee goes to 1 as n goes to infinity. Thus, Lemma 3.2 concludes that the probability that the maximum difference between two within-community edges' OR curvatures is bounded by \frac{p_{out}(p_{in}^{(i)} - p_{out})}{\frac{n_i-1}{n-n_i}p_{in}^{(i)} + p_{out}} goes to 1 as n goes to infinity.
Since the largest OR curvature difference between two within-community edges can be bounded, we can prove that this difference is small enough to be distinguished from the difference between \kappa(x_{B_i}, y_{B_i}) and \kappa(x^*_{B_i}, y^*_{B_j}).

Theorem 3.3. Combining statements 1 and 3 in Lemma 3.1, if p_\vee < p_{in}^{(i)} \frac{n}{n_i + n_j}, p_{in}^{(i)} < \frac{p_{out}(n - n_i)(n + n_i + n_j)}{n^2 - (n_i + n_j)(n - 2n_i)}, and the conditions given in Lemma 3.1 are satisfied, then almost surely, for any (x^*_{B_i}, y^*_{B_j}) \in \Delta((x_{B_i}, y_{B_i})), we have

\max_{(x'_{B_i}, y'_{B_i}) \in \Delta((x_{B_i}, y_{B_i}))} |\kappa(x_{B_i}, y_{B_i}) - \kappa(x'_{B_i}, y'_{B_i})| < \kappa(x_{B_i}, y_{B_i}) - \kappa(x^*_{B_i}, y^*_{B_j}).

Theorem 3.3 shows that the OR curvature of a within-community edge \kappa(x_{B_i}, y_{B_i}) exceeds the OR curvature of a between-communities edge \kappa(x_{B_i}, y_{B_j}) by more than the maximum of |\kappa(x_{B_i}, y_{B_i}) - \kappa(x'_{B_i}, y'_{B_i})|. The theorem holds under mild conditions: they are satisfied when p_{out} is small enough and p_\vee is not too large compared to p_{out}. When the graph is large, e.g., with thousands or millions of nodes, the constraints on p_{out} and p_{in}^{(i)} are loosened, and the inequality in Theorem 3.3 is easily satisfied.
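As a rough numeric sanity check of the leading terms of the bounds (ignoring the O(·) remainders), one can plug in concrete SBM parameters; the values below are our own illustrative choice, not from the paper, and `p_vee` stands for the largest within-community probability p_∨.

```python
# Illustrative SBM parameters (our assumption, not from the paper).
n, n_i, n_j = 1000, 100, 100
p_in, p_out = 0.8, 0.05
p_vee = p_in  # largest within-community edge probability p_∨

# Leading term of the within-community lower bound (Lemma 3.1, statement 1).
lower_within = ((n_i - 2) * p_in ** 2 + (n - n_i) * p_out ** 2) / (
    (n_i - 1) * p_in + (n - n_i) * p_out
)

# Leading term of the between-communities upper bound (statement 3).
upper_between = ((n - n_i - n_j) / (n - 1)) * p_out + (
    (n_i - n_j - 2) / (n - 1)
) * p_vee

print(round(lower_within, 3), round(upper_between, 3))  # 0.523 0.038
```

With these parameters the within-community lower bound (about 0.523) clearly dominates the between-communities upper bound (about 0.038), which is the gap that Theorem 3.3 exploits.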

3.2. OR CURVATURE GRADIENT AND COMMUNITIES

We define the OR curvature gradient (ORG) as |κ(x^{(j)}, y^{(j)}) − κ(x^{(i)}, y^{(i)})|, where (x^{(j)}, y^{(j)}) ∈ ∆((x^{(i)}, y^{(i)})), i.e., the difference between the OR curvatures of two adjacent edges. According to the theoretical results, the within-community ORG |κ(x_{B_i}, y_{B_i}) − κ(x'_{B_i}, y'_{B_i})| is significantly smaller than the between-communities ORG |κ(x_{B_i}, y_{B_i}) − κ(x*_{B_i}, y*_{B_j})|. This motivates us to propose a subsampling algorithm based on the ORG, which extracts the community information with a theoretical guarantee.

4. OR CURVATURE GRADIENT BASED GRAPH SUBSAMPLING

Existing graph subsampling methods usually rely on degree information or random walks through the graph. However, the subsampled graph tends to leave out minority communities (i.e., communities with a smaller number of nodes), since these methods tend to sample high-degree nodes. Because they do not exploit the community structure information during expansion, the available subsampling algorithms usually underestimate M. As discussed above, the community structure relates to the geometry of the underlying Riemannian manifold. We therefore develop a fast and efficient ORG-based subsampling algorithm that traverses the communities of the graph instead of being trapped in a single community, and we prove that the probability of the proposed ORG-sub taking all communities of the full graph into the subsample converges to one as the subsample size increases. Consequently, the subsampled graph obtained by the ORG-sub algorithm captures a larger proportion of the minority communities than other subsampling algorithms and helps estimate M more accurately, as shown in Figure 3. In addition, we theoretically show that the estimate of M from the subsample converges to the true M.

4.1. SUBSAMPLING ALGORITHM

The ORG carries the community structure information since it differentiates within-community edges from between-communities edges with the theoretical guarantee in Theorem 3.3. We can therefore use the ORG to guide the expansion of the subgraph toward more communities. First, we randomly sample an edge (x, y) ∈ E as the starting edge; the probability of obtaining an edge (x, y) at the start is P((x^{(1)}, y^{(1)})) = 1/|E|. To take advantage of the theoretical properties of the ORG, the ORG-based subsampler expands to the next edge whose OR curvature shows the greatest difference from the OR curvature of the previously taken edge. The edge subsampled in the (i+1)-th step, (x^{(i+1)}, y^{(i+1)}), given the edge subsampled in the i-th step, (x^{(i)}, y^{(i)}), is

(x^{(i+1)}, y^{(i+1)}) = \arg\max_{(x, y) \in \Delta((x^{(i)}, y^{(i)}))} |\kappa(x, y) - \kappa(x^{(i)}, y^{(i)})|.

Since the within-community ORG is smaller than the between-communities ORG, the proposed subsampler expands to another community instead of being trapped in one community. After the subsampler stops expanding or the subsampling budget is used up, we obtain the subsampled nodes and the subgraph they induce. Details are given in Algorithm 1.

Algorithm 1
Input: Graph G; number of nodes to be subsampled ñ; OR curvature of the graph G calculated with parameter α; subsampled node set S = ∅.
Initialization: Randomly choose a node v_0 as the start of sampling; set t = 0.
Cold start: Among all neighbors of v_0, randomly add v_1 to S; set t = 1.
While |S| < ñ:
• Step 1: Given the edge e_t = (v_{t-1}, v_t) selected in the t-th step, get the curvatures of e_t's neighboring edges ∆(e_t).
• Step 2: Select the edge e_{t+1} in the edge set ∆(e_t) that has the greatest ORG.
• Step 3: Add the nodes of e_{t+1} to the subsampled node set S.
• Set t = t + 1.
Output: Subsampled node set S and the induced subgraph G[S].
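A minimal sketch of the expansion loop in Algorithm 1 follows. It assumes the edge OR curvatures have been precomputed and are passed in as a dict keyed by the frozenset of endpoints; skipping already-taken edges is our own implementation choice to guarantee the loop makes progress (the paper does not spell this detail out).

```python
import random

def org_sub(adj, curv, n_sub, seed=0):
    """ORG-sub sketch: repeatedly move to the adjacent edge with the largest
    OR curvature gradient. `curv` maps frozenset({u, v}) -> kappa(u, v)."""
    rng = random.Random(seed)
    v0 = rng.choice(sorted(adj))            # random starting node
    v1 = rng.choice(sorted(adj[v0]))        # cold start: random neighbor
    S, e = {v0, v1}, (v0, v1)
    taken = {frozenset(e)}                  # our choice: never revisit an edge
    while len(S) < n_sub:
        x, y = e
        # Delta(e): edges sharing an endpoint with e, not yet taken.
        cand = [
            (a, b)
            for a in (x, y)
            for b in adj[a]
            if frozenset((a, b)) not in taken
        ]
        if not cand:
            break                           # nothing left to expand into
        e = max(cand, key=lambda f: abs(curv[frozenset(f)] - curv[frozenset(e)]))
        taken.add(frozenset(e))
        S.update(e)
    return S

# Path graph 0-1-2-3-4 with arbitrary curvature values.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
curv = {frozenset((0, 1)): 0.5, frozenset((1, 2)): 0.1,
        frozenset((2, 3)): 0.4, frozenset((3, 4)): 0.2}
print(len(org_sub(adj, curv, 4)))  # 4
```

On this path graph, any starting edge lets the sampler keep absorbing adjacent untaken edges until the budget of ñ = 4 nodes is reached.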

4.2. ORG-SUB FOR ESTIMATING M

Applying the proposed ORG-sub algorithm to the full graph in Figure 3(A), we obtain the subsample shown in Figure 3(B). Compared with the subsample obtained by the degree-based subsampler in Figure 3(C), the proposed algorithm subsamples more nodes in minority communities. This toy example illustrates that ORG-sub better preserves the community structure information, so the subsampled graph can help estimate M of the full graph. We take multiple subgraphs by applying ORG-sub r times. For each subsampled graph G[S_i] of size ñ obtained at the i-th repetition, we compute an estimate \hat{M}(G[S_i]); the mean of the \hat{M}(G[S_i]) is our final estimate of M. For the estimation algorithm, we use the state-of-the-art network cross-validation method (Li et al., 2020).

Theorem 4.1 guarantees that every community B_i of G is represented in the ORG-sub subgraph G[S]: P(v \in B_i \mid v \in G[S]) \to 1. In addition, based on Theorem 4.1, we prove the theoretical advantage of our method over a popular graph sampling method, random walk sampling (Lovász, 1996). Let G[S] denote a sampled subgraph of ñ nodes obtained by the ORG-sub method, and G_{RW}[S] a sampled subgraph obtained by the random walk sampling method. Let v_{RW} denote a node in G_{RW}[S]. Corollary 4.1.1 shows that P(v \in B_i \mid v \in G[S]) is greater than P(v_{RW} \in B_i \mid v_{RW} \in G_{RW}[S]); that is, ORG-sub is more likely to obtain a subgraph containing nodes from different communities.

Corollary 4.1.1 (Theoretical advantage over random walk). Under the assumptions in Theorem 4.1, for any community B_i in graph G, we have P(v_{RW} \in B_i \mid v_{RW} \in G_{RW}[S]) < P(v \in B_i \mid v \in G[S]).

Theorem 4.1 theoretically guarantees that each community in the full graph is subsampled with high probability; thus, the convergence of the estimate of M produced by the ORG-sub algorithm can be theoretically guaranteed by Theorem 3 in Li et al. (2020).
Since cross-validation methods tend not to overestimate M in SBMs (Li et al., 2020; Chen & Lei, 2018), we can provide a theoretical guarantee against under-estimation for the proposed ORG-sub-based estimator \hat{M}(G[S_i]).

Corollary 4.1.2. If the subsample of size ñ satisfies ñρ log ñ → ∞, where ρ is the edge density, then as the number of subsampling repetitions r → ∞, the probability that the subsampled graphs underestimate the true M vanishes:

P\left(\frac{1}{r}\sum_{i=1}^{r} \hat{M}(G[S_i]) < M\right) \to 0.

Choice of the subsample size. When implementing our algorithm, we need to specify the size of the subsamples, which determines the performance and efficiency of the algorithm. Empirically, we plot the estimate of M as a function of the subsampling proportion in Figure 4; in applications, we can select the elbow where the estimate of M starts to converge, to save computational resources and stabilize the estimation results.
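The estimation pipeline of Section 4.2 then reduces to averaging over r independent subsamples. The sketch below uses stand-in functions: `subsample` and `estimate_M` are placeholders for ORG-sub and the network cross-validation estimator of Li et al. (2020), both assumptions of this illustration.

```python
from statistics import mean

def estimate_communities(G, subsample, estimate_M, n_sub, r=3):
    """Run the subsampler r times and average the per-subgraph estimates."""
    estimates = [estimate_M(G, subsample(G, n_sub, seed=i)) for i in range(r)]
    return mean(estimates)

# Toy stand-ins: a "graph" with known labels and an oracle estimator that
# simply counts the distinct labels appearing in the subsample.
G = {"labels": [0] * 50 + [1] * 30 + [2] * 20}
subsample = lambda G, n_sub, seed: list(range(n_sub))          # placeholder
estimate_M = lambda G, S: len({G["labels"][v] for v in S})     # oracle on S
print(estimate_communities(G, subsample, estimate_M, n_sub=100))  # 3
```

With the oracle estimator and a subsample covering all 100 nodes, every repetition sees all three labels, so the averaged estimate is exactly 3.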

5. EXPERIMENT

We evaluate the performance of our algorithm on both synthetic and real-world datasets. We use the mean absolute error, MAE = \frac{1}{r}\sum_{i=1}^{r} |M - \hat{M}(G[S_i])|, to evaluate the accuracy of the subsampling-based estimation. We compare the proposed method with (1) the Degree-Based Node sampler (DBN) (Adamic et al., 2001); (2) the Community Structure Expansion sampler (CSE) (Maiya & Berger-Wolf, 2010), a community-structure-preserving sampler; and six benchmark exploration-based graph subsampling methods, including (3) the Metropolis-Hastings Random Walk sampler (MHRW) (Hübler et al., 2008); (4) the Forest Fire Sampler (FFS) (Leskovec et al., 2005); (5) the Snowball sampler (Goodman, 1961); (6) the Random Walk sampler (RW) (Gjoka et al., 2010); and (7) the Multi-Dimensional Random Walk sampler (MDRW) (Ribeiro & Towsley, 2010). For the other methods, we set the hyperparameters to the defaults in the package Little Ball of Fur (Rozemberczki et al., 2020b). We replicate the experiments 30 times under each setting and compare the MAE of all methods. All experiments are conducted on a 40-core machine (3.00 GHz) with an NVIDIA Tesla V100 GPU.

Synthetic Dataset

We generate synthetic datasets with stochastic block models (SBM) and degree-corrected block models (DCBM), which assign a community label to each node. We set the community proportions to (3/4, 1/10, 1/12, 1/15) with 900 nodes in total. The out-in-ratio, i.e., the ratio of the number of between-communities edges to the number of within-community edges, controls the noise level: a higher ratio yields a noisier graph. The degree-corrected model adjusts node degrees according to a power-law distribution. Given the probability of an edge within a community (p_in = 0.8), we vary the probability of an edge between communities, p_out ∈ {0.06, 0.08, 0.10, 0.12}, and the subsampling proportion, {0.1, 0.12, 0.14, 0.16}, from low to high for both block models to observe how our method performs, compared with others, as the setting becomes more challenging. More details about the generation of the DCBM datasets are presented in Section B.1.1 of the Appendix.
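The SBM setup above can be reproduced in a few lines of numpy; the sampling scheme (upper triangle then symmetrization) is a standard construction, and the seed is arbitrary.

```python
import numpy as np

# SBM of Section 5: 900 nodes, community proportions (3/4, 1/10, 1/12, 1/15),
# p_in = 0.8, p_out = 0.06 (the lowest value in the sweep).
n = 900
proportions = [3 / 4, 1 / 10, 1 / 12, 1 / 15]
sizes = [round(n * p) for p in proportions]
labels = np.repeat(np.arange(len(sizes)), sizes)

rng = np.random.default_rng(0)
same = labels[:, None] == labels[None, :]
probs = np.where(same, 0.8, 0.06)
upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample upper triangle only
A = upper | upper.T                                # symmetric, no self-loops
print(sizes)  # [675, 90, 75, 60] -- one large majority, three minorities
```

The resulting size vector makes the imbalance explicit: three-quarters of the nodes sit in one community, which is exactly the regime where degree-driven samplers miss the minorities.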

Real-world Dataset

We use five widely-used real-world graph datasets with labeled community structures to validate the performance of our method: Polbooks, Facebook, Cora, Polblogs, and PubMed (Rossi & Ahmed, 2015; Rozemberczki et al., 2020a; Sen et al., 2008). All these graphs are treated as unweighted and undirected, and all self-loop edges and isolated nodes are removed. Table A.4 in the Appendix summarizes the network statistics of these datasets: the number of nodes ranges from 105 to 19,717, and the network density ranges from 0.0001 to 0.04, so the networks we consider span a wide range. The number of subsampling repetitions r for estimating M is set to 3, which is enough to obtain a stable estimate (Li et al., 2020).

5.2. RESULTS OF SYNTHETIC DATASET

The results on the SBM and DCBM datasets are reported in Figure 5; the error bars show the standard deviation over 30 replications. We set the subsampling proportion (prop) to 0.1 when varying the between-communities edge probability p_out, and set p_out = 0.06 when varying prop. Our method obtains a more accurate estimate of M than the other methods. Detailed results for each combination of p_out and prop are presented in Tables A.1 and A.2 in the Appendix. We also compare the computation time for estimating M on the whole graph versus on the subgraph. The time for estimation on the subgraph, together with the time for subsampling, is still much shorter than the time for estimation on the whole graph, consistent with our complexity analysis of the algorithm. Details about computational time are presented in Section B.1.2 in the Appendix.

5.3. RESULTS OF REAL-WORLD DATASET

We compare the performance of the proposed ORG-sub algorithm with subsamples generated by different algorithms using the MAE defined above. From Table 1, we observe that the ORG-sub performance on PubMed and Polbooks is better than that on Facebook, which shows that ORG-sub performs better when the communities of the graph are more imbalanced. The true M of each dataset is recorded in the first column of Table 1. We test the performance of the different algorithms with a broad range of sampling proportions, from 0.5% to 30%. In particular, for the small networks Polbooks and Facebook, we consider sampling proportions of 10%, 20%, and 30%; for the medium-sized networks Polblogs and Cora, 5%, 10%, and 20%; and for the large network PubMed, 0.5%, 2%, and 5%. In addition, the computation time on the subgraph (together with the subsampling time) is still much shorter than on the whole graph. Details about computational time are presented in Section B.2.2 in the Appendix.

6. CONCLUSION

In this paper, we propose a novel Ollivier-Ricci curvature gradient-based graph subsampling (ORG-sub) method that samples a subgraph by maximizing the OR curvature gradient. The contribution of the ORG-sub method to the graph subsampling literature is three-fold. First, to the best of our knowledge, we are the first to utilize the graph's internal topological information to subsample a large graph while preserving the number of communities. Second, to the best of our knowledge, we are the first to bridge the gap in the consistency theory of subsampling algorithms for estimating the number of communities in SBMs. In particular, we theoretically show that ORG-sub effectively traverses different communities and avoids being trapped in one community, and we establish the advantage of ORG-sub over a popular sampling method, the random walk. Third, we empirically show that our method outperforms existing subsampling algorithms in estimating the number of communities. An interesting future direction is to extend the theory of our method to SBM variants, including the degree-corrected SBM, the overlapping SBM, and the multi-layer SBM. Another direction is to investigate other community-related statistics that our method preserves; in fact, we have empirically observed promising results in preserving the clustering coefficient (CC) (see Section E in the Appendix), and we plan to investigate the consistency theory of the CC in the future.

Appendix for "Subsampling in Large Graphs Using Ricci Curvature"

The appendix provides the details of the proofs and of the experiments, including the parameter settings for generating the synthetic datasets, a description of the real-world datasets, and additional experimental results on synthetic and real-world datasets.

A PROOF DETAILS

Lemma A.1 (Ollivier-Ricci curvature of random graphs generated by stochastic block models). Consider a graph with M communities generated from a stochastic block model SBM\big(\{p_{in}^{(i)}\}_{i=1}^M, p_{out}, \{n_i\}_{i=1}^M\big), where p_{in}^{(i)} represents the probability of an edge connecting nodes in the same community B_i, p_{out} represents the probability of an edge connecting nodes from different communities, and \{n_i\}_{i=1}^M are the sizes of the communities. We assume that, for any i = 1, \ldots, M, p_{in}^{(i)} > p_{out}, and we write p_\vee = \max_i p_{in}^{(i)}. We denote by x_{B_i} a node x from community B_i, by (x_{B_i}, y_{B_j}) the edge whose endpoints x and y are from B_i and B_j, respectively, and by \kappa(x_{B_i}, y_{B_j}) the edge's OR curvature. The following statements hold for the OR curvature.

1. If p_{in}^{(i)} n_i + p_{out}(n - n_i) > \frac{8}{3}\ln n_i, then for the lower bound of the within-community OR curvature, we almost surely have

\kappa(x_{B_i}, y_{B_i}) \ge \frac{(n_i - 2)\, p_{in}^{(i)\,2} + (n - n_i)\, p_{out}^2}{(n_i - 1)\, p_{in}^{(i)} + (n - n_i)\, p_{out}} + O\left(\sqrt{\frac{\ln n}{p_{out}^2\, n}}\right).

2. If p_{in}^{(i)\,2} n_i + p_{out}^2 (n - n_i) > 4\ln n_i, then for the upper bound of the within-community OR curvature, we almost surely have

\kappa(x_{B_i}, y_{B_i}) \le p_{in}^{(i)} + O\left(\sqrt{\frac{\ln n_i}{n}}\right).

3. If p_{out} > 3\ln n_\vee \Big/ \left(\frac{p_{in}^{(i)}}{p_{out}}(n_i + n_j) + n - n_i - n_j\right), then for the upper bound of the between-communities OR curvature, we almost surely have

\kappa(x_{B_i}, y_{B_j}) \le \frac{n - n_i - n_j}{n - 1}\, p_{out} + \frac{n_i - n_j - 2}{n - 1}\, p_\vee + O\left(p_\vee \sqrt{\frac{\ln n}{p_{out}\, n}}\right).

We will use Chernoff's inequality to prove the lemmas.

Lemma A.2 (Chernoff's inequality). Let X_1, \ldots, X_n be independent random variables with

P(X_i = 1) = p_i, \quad P(X_i = 0) = 1 - p_i.   (14)

Consider the sum X = \sum_{i=1}^n X_i with expectation E(X) = \sum_{i=1}^n p_i. Then we have

(\text{Lower tail}) \quad P(X \le E(X) - \lambda) \le e^{-\lambda^2 / (2E(X))},   (15)

(\text{Upper tail}) \quad P(X \ge E(X) + \lambda) \le e^{-\lambda^2 / (2E(X) + 2\lambda/3)}.   (16)

WLOG, the proof of the theorem considers the situation M = 2. The conclusion can be easily generalized to arbitrary M by treating one block as B_1 and merging the remaining blocks into B_2. Before we prove the theorem, we need a few lemmas.

Lemma A.3. If p_{in}^{(i)} n_i + p_{out}(n - n_i) > \frac{8}{3}\ln n_i, then with probability at least 1 - 2/n_i, the degrees of all nodes in block B_i fall in the range

\left[\, p_{in}^{(i)}(n_i - 1) + p_{out}(n - n_i) - \sqrt{4\big(p_{in}^{(i)} n_i + p_{out}(n - n_i)\big)\ln n_i},\;\; p_{in}^{(i)}(n_i - 1) + p_{out}(n - n_i) + \sqrt{6\big(p_{in}^{(i)} n_i + p_{out}(n - n_i)\big)\ln n_i}\,\right].

Proof. Without loss of generality, we prove the first statement; the second can be proved in the same way. For each vertex v in block B_i, it is easy to show that the expected degree of v is E(d_v^{B_i}) = p_{in}^{(i)}(n_i - 1) + p_{out}(n - n_i).
Applying Chernoff's inequality with the lower tail λ = √(4(p_in^(i) n_i + p_out(n − n_i)) ln n_i), we have

P( d_v^{B_i} − (p_in^(i)(n_i − 1) + p_out(n − n_i)) ≤ −√(4(p_in^(i) n_i + p_out(n − n_i)) ln n_i) )
≤ exp( −4(p_in^(i) n_i + p_out(n − n_i)) ln n_i / [2(p_in^(i)(n_i − 1) + p_out(n − n_i))] ) ≤ 1/n_i^2.

Applying Chernoff's inequality with the upper tail λ = √(6(p_in^(i) n_i + p_out(n − n_i)) ln n_i), we have

P( d_v^{B_i} − (p_in^(i)(n_i − 1) + p_out(n − n_i)) ≥ √(6(p_in^(i) n_i + p_out(n − n_i)) ln n_i) )
≤ exp( −6(p_in^(i) n_i + p_out(n − n_i)) ln n_i / [2(p_in^(i)(n_i − 1) + p_out(n − n_i)) + (2/3)√(6(p_in^(i) n_i + p_out(n − n_i)) ln n_i)] ) < 1/n_i^2.

In the last step, we used the assumption p_in^(i) n_i + p_out(n − n_i) > (8/3) ln n_i. By a union bound, the probability that some vertex v in block B_i has d_v^{B_i} outside the stated range is at most n_i (1/n_i^2 + 1/n_i^2) = 2/n_i.

The co-degree d_xy of a pair of vertices (x, y) is the cardinality of the common neighborhood of x and y. Roughly speaking, when p_in^(i) and p_out are large, the co-degree follows a binomial distribution; when p_in^(i) and p_out are small, it follows a Poisson distribution. We can therefore expect all co-degrees within a block and between blocks to concentrate in a small interval.

Lemma A.4. If (p_in^(i))^2 n_i + p_out^2 (n − n_i) > 4 ln n_i, then, with probability at least 1 − 1/n_i, all co-degrees of pairs of vertices in block B_i fall in the range

[(p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i) − √(6((p_in^(i))^2 n_i + p_out^2 (n − n_i)) ln n_i), (p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i) + √(9((p_in^(i))^2 n_i + p_out^2 (n − n_i)) ln n_i)].
If p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j) > 3 ln n_∨, then, with probability at least 1 − 2/n_∨, all co-degrees of pairs of vertices x ∈ B_i and y ∈ B_j fall in the range

[p_in^(i) p_out (n_i − 1) + p_out^2 (n − n_i − n_j) + p_in^(j) p_out (n_j − 1) − √(6(p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j)) ln n_∨), p_in^(i) p_out (n_i − 1) + p_out^2 (n − n_i − n_j) + p_in^(j) p_out (n_j − 1) + √(9(p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j)) ln n_∨)].

Proof. For a pair of vertices x ∈ B_i and y ∈ B_i, the co-degree d_{x_{B_i} y_{B_i}} is the sum of n − 2 independent random variables X_1, ..., X_{n−2} with expectation E(d_{x_{B_i} y_{B_i}}) = (p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i). Applying Chernoff's inequality with the lower tail λ = √(6((p_in^(i))^2 n_i + p_out^2 (n − n_i)) ln n_i), we have

P( d_{x_{B_i} y_{B_i}} − ((p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i)) < −λ ) ≤ exp( −6((p_in^(i))^2 n_i + p_out^2 (n − n_i)) ln n_i / [2((p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i))] ) ≤ 1/n_i^3.

Since (p_in^(i))^2 n_i + p_out^2 (n − n_i) > 4 ln n_i, applying Chernoff's inequality with the upper tail λ = √(9((p_in^(i))^2 n_i + p_out^2 (n − n_i)) ln n_i) gives

P( d_{x_{B_i} y_{B_i}} − ((p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i)) > λ ) ≤ exp( −9((p_in^(i))^2 n_i + p_out^2 (n − n_i)) ln n_i / [2((p_in^(i))^2 (n_i − 2) + p_out^2 (n − n_i)) + (2/3)λ] ) < 1/n_i^3.

The number of pairs is at most n_i^2 / 2, so the sum of the probabilities of the bad events is at most (n_i^2 / 2)(1/n_i^3 + 1/n_i^3) = 1/n_i.

For a pair of vertices x ∈ B_i and y ∈ B_j, the co-degree d_{x_{B_i} y_{B_j}} is the sum of n − 2 independent random variables X_1, ..., X_{n−2} with expectation E(d_{x_{B_i} y_{B_j}}) = p_in^(i) p_out (n_i − 1) + p_out^2 (n − n_i − n_j) + p_in^(j) p_out (n_j − 1).
Applying Chernoff's inequality with the lower tail λ = √(6(p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j)) ln n_∨), we have

P( d_{x_{B_i} y_{B_j}} − (p_in^(i) p_out (n_i − 1) + p_out^2 (n − n_i − n_j) + p_in^(j) p_out (n_j − 1)) < −λ )
≤ exp( −λ^2 / [2(p_in^(i) p_out (n_i − 1) + p_out^2 (n − n_i − n_j) + p_in^(j) p_out (n_j − 1))] ) ≤ 1/n_∨^3.

Since p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j) > 3 ln n_∨, applying Chernoff's inequality with the upper tail λ = √(9(p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j)) ln n_∨) gives

P( d_{x_{B_i} y_{B_j}} − (p_in^(i) p_out (n_i − 1) + p_out^2 (n − n_i − n_j) + p_in^(j) p_out (n_j − 1)) > λ )
≤ exp( −λ^2 / [2(p_in^(i) p_out n_i + p_in^(j) p_out n_j + p_out^2 (n − n_i − n_j)) + (2/3)λ] ) < 1/n_∨^3.

The number of pairs is at most n_∨^2, so the sum of the probabilities of the bad events is at most n_∨^2 (1/n_∨^3 + 1/n_∨^3) = 2/n_∨.

Referring to the lemma in Lin et al. (2011), we can construct lower and upper bounds on the Ricci curvature of graphs generated by the stochastic block model.

Lemma A.5. Suppose that ϕ : Γ(x)\N(y) → Γ(y)\N(x) is an injective mapping. Then we have

κ(x, y) ≥ 1 − (1/d_y) Σ_{u ∈ Γ(x)\N(y)} d(u, ϕ(u)) + 1/d_x − 3(d_y − d_x)/d_y,

κ(x, y) = lim_{α→1} [1 − W(m_x^α, m_y^α)] / (1 − α) ≤ (d_xy + 1)/d_x + 1/d_y.

Lemma A.6. For the lower bound, we construct a matching M from Γ(x)\N(y) to Γ(y)\N(x) as follows. Let U_0 = Γ(x)\N(y) and V_0 = Γ(y)\N(x). Pick a vertex u_1 ∈ U_0 and reveal the neighborhood of u_1 in V_0. Pick a vertex in this neighborhood and denote it by v_1. Let U_1 = U_0\{u_1} and V_1 = V_0\{v_1}, and continue this process.
The process ends when Γ(u_{i+1}) ∩ V_i = ∅. The probability that the maximum matching between U_0 and V_0 has size at most k is less than

Σ_{i=1}^{k} (1 − p_out)^{|V_0| − i} < (1/p_out)(1 − p_out)^{|V_0| − k} ≤ n e^{−p_out(|V_0| − k)}.

Choose k = ⌊|V_0| − (3 ln n)/p_out⌋. Then, with probability at least 1 − 1/n^2, there is a matching M of size k between Γ(x)\N(y) and Γ(y)\N(x). We extend the matching M to an injective mapping ϕ : Γ(x)\N(y) → Γ(y)\N(x) arbitrarily. Applying Lemma A.5, with probability at least 1 − 4/n, we have

κ(x, y) ≥ 1 − (1/d_y) Σ_{u ∈ Γ(x)\N(y)} d(u, ϕ(u)) + 1/d_x − 3(d_y − d_x)/d_y
≥ 1 − (1/d_y)(k + 3(|V_0| − k)) + 1/d_x − 3(d_y − d_x)/d_y
≥ d_xy/d_y − (2/d_y)(3 ln n / p_out) − 3(d_y − d_x)/d_y.

Proof (of Lemma A.1). Given the ranges of the degrees and co-degrees, and the lower and upper bounds on the Ricci curvature in terms of d_x and d_xy, we can prove the statements of Lemma A.1 as follows. For vertices x, y ∈ B_i, given p_in^(i) n_i + p_out(n − n_i) > (8/3) ln n_i, we have

κ(x_{B_i}, y_{B_i}) ≤ [(n_i − 2)(p_in^(i))^2 + (n − n_i) p_out^2 + √(9 ln n_i (n_i (p_in^(i))^2 + (n − n_i) p_out^2)) + 2] / [(n_i − 1) p_in^(i) + (n − n_i) p_out − √(4(n_i p_in^(i) + (n − n_i) p_out) ln n_i)] = p_in^(i) + O(√(ln n_i / n)).

For vertices x, y ∈ B_i, given (p_in^(i))^2 n_i + p_out^2 (n − n_i) > 4 ln n_i, we have

κ(x_{B_i}, y_{B_i}) ≥ [(n_i − 2)(p_in^(i))^2 + (n − n_i) p_out^2 − √(6 ln n_i (n_i (p_in^(i))^2 + (n − n_i) p_out^2))] / [(n_i − 1) p_in^(i) + (n − n_i) p_out + √(6 ln n_i (n_i p_in^(i) + (n − n_i) p_out))]
− (6 ln n / p_out) / [(n_i − 1) p_in^(i) + (n − n_i) p_out + √(4 ln n_i (n_i p_in^(i) + (n − n_i) p_out))]
− 6√(6 ln n_i (n_i p_in^(i) + (n − n_i) p_out)) / [(n_i − 1) p_in^(i) + (n − n_i) p_out + √(4 ln n_i (n_i p_in^(i) + (n − n_i) p_out))]

≥ [(n_i − 2)(p_in^(i))^2 + (n − n_i) p_out^2] / [(n_i − 1) p_in^(i) + (n − n_i) p_out] − O((1/p_out^2) √(ln n / n)) − O((p_in^(i)/p_out^2) √(ln n_i / n)).
If (p_in^(i))^2 ln n_i ≤ ln n, the second error term is absorbed into the first, and we have

κ(x_{B_i}, y_{B_i}) ≥ [(n_i − 2)(p_in^(i))^2 + (n − n_i) p_out^2] / [(n_i − 1) p_in^(i) + (n − n_i) p_out] − O((1/p_out^2) √(ln n / n)).

For vertices x ∈ B_i and y ∈ B_j, let n_∨ = max{n_i}_{i=1}^M and p_∨ = max{p_in^(i)}_{i=1}^M. If p_out > 3 ln n_∨ / [(p_in^(i)/p_out)(n_i + n_j) + n − n_i − n_j], we have

κ(x_{B_i}, y_{B_j}) ≤ √(3 ln n_∨ [(n_i − 1) p_in^(i) p_out + (n − n_i − n_j) p_out^2 + (n_j − 1) p_in^(j) p_out]) / [(n_i − 1) p_in^(i) + n_j p_out − √(2 ln n_∨ (n_i p_in^(i) + n_j p_out))]
+ 1 / [(n_i − 1) p_in^(i) + n_j p_out − √(2 ln n_∨ (n_i p_in^(i) + n_j p_out))]
+ [(n_i − 1) p_in^(i) p_out + (n − n_i − n_j) p_out^2 + (n_j − 1) p_in^(j) p_out] / [(n_i − 1) p_in^(i) + n_j p_out − √(2 ln n_∨ (n_i p_in^(i) + n_j p_out))]
+ 1 / [(n_j − 1) p_in^(j) + n_i p_out − √(2 ln n_∨ (n_j p_in^(j) + n_i p_out))]

≤ [√(3 ln n_∨ [(n_i − 1) p_in^(i) p_out + (n − n_i − n_j) p_out^2 + (n_j − 1) p_in^(j) p_out]) + 2] / [(n − 1) p_out − 2√(n p_∨ ln n)]
+ [(n_i − 1) p_in^(i) p_out + (n − n_i − n_j) p_out^2 + (n_j − 1) p_in^(j) p_out] / [(n − 1) p_out − 2√(n p_∨ ln n)]

≤ O( √(n_∨ (p_in^(i) + p_in^(j) + p_out) ln n_∨) / ((n − 1) p_out) ) + [(n_i + n_j − 2)/(n − 1)] p_∨ + [(n − n_i − n_j)/(n − 1)] p_out.

Lemma A.7. Combining the conclusions of statement 1 and statement 2 of Lemma A.1, almost surely, for any (x'_{B_i}, y'_{B_i}) ∈ Δ((x_{B_i}, y_{B_i})), we have

|κ(x_{B_i}, y_{B_i}) − κ(x'_{B_i}, y'_{B_i})| ≤ p_out (p_in^(i) − p_out) / [((n_i − 1)/(n − n_i)) p_in^(i) + p_out] + O(1/(p_out n)).

Proof. Lemma A.7 follows from the lower and upper bounds on the within-community edges' OR curvature given in Lemma A.1.

The hyper-parameters of the seven benchmark methods are set to the defaults in the package Little Ball of Fur. The rejection constraint of the MHRW method is set to 1; the burning probability of FFS is 0.4; the bound on the degree of the Snowball Sampler is set to 50. The hyper-parameter used to calculate the Ricci curvature of graphs is set to α = 0.5. The number of subsampling replications r in Algorithm 2 for estimating the number of communities is set to 3, which is enough to obtain a stable estimate. All experiments are conducted on a workstation with a 40-core CPU (3.00 GHz) and an NVIDIA Tesla V100 GPU.
We use stochastic block models (SBM) and degree-corrected block models (DCBM), which assign a community to each node. To create graphs with unbalanced communities, we set the community proportions to (3/4, 1/10, 1/12, 1/15) with 900 nodes. The out-in ratio controls the ratio between the probability of a between-communities edge and that of a within-community edge; a higher out-in ratio represents a noisier graph, i.e., one in which the communities are harder to distinguish. We set the probability of an edge within a block to 0.8 and vary the probability of an edge between blocks (pout ∈ {0.06, 0.08, 0.10, 0.12}) for both SBM and DCBM. The higher the pout value (and hence the out-in ratio), the noisier the graph. We also vary the subsampling proportion ({0.1, 0.12, 0.14, 0.16}) from low to high for both block models to observe how our method performs compared with others as the settings change. The degree-corrected model corrects the node degrees using a power-law distribution, so more parameters need to be set before generating graphs from the DCBM. We set the average node degree to 40, which is consistent with the setting in Li et al. (2020), under which the assumptions in our algorithm hold. The node degrees follow a power-law distribution with lower bound 1 and scaling parameter 5. We use these synthetic datasets with hidden community structures to evaluate the performance of our method. We replicate the experiments 30 times under each setting and compare the error in estimating the number of communities and the computation time. The average over the 30 replications of each setting is recorded in Table A.1 and Table A.2. In these tables, column Prop denotes the subsampling proportion and column Pout denotes the probability of an edge between communities; the remaining columns report the results of the different subsampling methods. The performance of our algorithm is better than that of the other methods.
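The unbalanced SBM setting above can be sketched in a few lines. This is a minimal illustration of the generative model (pure Python, not the paper's data-generation code); `sample_sbm` is an illustrative helper name.

```python
# Sketch of the synthetic SBM setting above: community proportions
# (3/4, 1/10, 1/12, 1/15) of 900 nodes, within-block edge probability 0.8,
# and between-block probability p_out. Illustrative only, not the paper's code.
import random

def sample_sbm(sizes, p_in, p_out, seed=0):
    rng = random.Random(seed)
    # block label of each node, laid out block by block
    labels = [b for b, s in enumerate(sizes) for _ in range(s)]
    n = len(labels)
    edges = [
        (i, j)
        for i in range(n) for j in range(i + 1, n)
        if rng.random() < (p_in if labels[i] == labels[j] else p_out)
    ]
    return labels, edges

sizes = [round(900 * r) for r in (3/4, 1/10, 1/12, 1/15)]  # 675, 90, 75, 60
labels, edges = sample_sbm(sizes, p_in=0.8, p_out=0.06)
print(sum(sizes), len(edges))
```

With pout = 0.06 this produces roughly 2 × 10^5 edges, most of them within the large first block, matching the "unbalanced communities" regime discussed above.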
As for the results for SBM in Table A.2, the error in estimating the number of communities decreases as the subsampling proportion increases, and increases as the observed graphs become noisier (as pout increases). As for the computation time recorded in Table A.3 for SBM, we observe that the estimation time on the full sample is two orders of magnitude larger than the combined time of subsampling and estimation. The comparisons of the errors in estimating the number of communities are presented in the main text. The computation time is also reported in Table A.5. Similar to the synthetic datasets, the estimation time on the full sample is two orders of magnitude larger than the combined time of subsampling (including the time for computing the OR curvature of each edge in the graph) and estimation.

B.1.2 ADDITIONAL RESULTS OF SYNTHETIC DATASETS

The results also show that the computation time for estimating the number of communities using the subsampling methods is much shorter than that using the full dataset. The complexity of all methods is influenced by the size of the graph, especially the number of nodes, so it is unsurprising that the computation time of all methods for the Cora dataset is much longer than for the other datasets.

B.3 ADDITIONAL EMPIRICAL RESULTS

In this section, we provide additional empirical results. In real-world networks, the ground truth is not available. We use datasets that are widely adopted for community detection tasks, where the communities are labeled with domain knowledge. The manually labeled community structures correspond to different underlying aspects of the nodes. For example, in the Cora citation network (where a node represents a paper), the community label corresponds to the paper's subject (e.g., neural networks). We acknowledge that the manually labeled number of communities might not be the ground truth, which is a limitation for real networks. To address this limitation, we also employ another evaluation metric: instead of comparing our estimate with the manually labeled number of communities, we compare it with the estimate obtained from the full data. These comparisons evaluate whether the subsampled subgraph can serve as a good surrogate for carrying out computations of interest on the full data. Table A.6 shows the estimation difference between the subsampled and full graphs for four datasets, i.e., Polbooks, Facebook, Cora, and Polblogs. We do not show results for PubMed, since the estimate for the PubMed full graph is unavailable due to prohibitive computation. As we can see, under this new metric, our method ORG-sub still outperforms the other subsampling methods. Indeed, both the theory and the empirical results show that the performance of our method depends on the edge density. On the one hand, Corollary 4.1.2 basically assumes that ρ > log(ñ)/ñ, where ñ is the number of sampled nodes and ρ is the edge density. When ñ = 100, we require ρ > 0.046; when ñ = 1000, we only require ρ > 0.007. As the subsample size becomes larger, the constraint imposed on the edge density becomes weaker.
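The two density thresholds quoted above follow directly from the constraint ρ > log(ñ)/ñ (natural logarithm); a one-line check:

```python
# Evaluate the density constraint rho > log(n~)/n~ from Corollary 4.1.2
# at the two subsample sizes quoted in the text.
import math

for n_sub in (100, 1000):
    print(n_sub, round(math.log(n_sub) / n_sub, 3))
# -> 100 0.046
# -> 1000 0.007
```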
On the other hand, we include additional simulation studies to show the empirical performance of our method under different network densities. In particular, we use the same SBM setting as in Section 5.1 to generate synthetic data, except that we let p_in = 0.8λ and p_out = 0.1λ, where λ ∈ {0.2, 0.4, ..., 1} controls the edge density level. Here, we fix the sampling proportion at 10%.

Empirically, we calculated the CC value of the full graph and that of the subsampled graph for two real datasets, Facebook and Cora, with the subsampling proportion set to 10%. We replicated the sampling procedure 100 times and thus obtained 100 subsamples, from which we computed the 95% confidence interval (CI) as mean ± 1.96 × sd (standard deviation). For the Facebook data, the CI of the CC value is [0.420, 0.537], and the CC value of the full graph is 0.476. For the Cora data, the CI is [0.234, 0.290], and the CC value of the full graph is 0.238. As we can see, the 95% CI covers the CC value of the full graph for both real datasets. These observations suggest that our subsampling algorithm can preserve CC to some extent. The consistency theory for CC under our method is under investigation, and the results will be reported in future publications.
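The CI construction above (mean ± 1.96 × sd over subsample replicates) is straightforward; the sketch below uses synthetic stand-in values for the 100 CC replicates, not the paper's actual subsamples.

```python
# How the 95% CI above is formed: mean +/- 1.96 * sd over per-subsample CC
# values. The replicate values below are synthetic stand-ins (Gaussian noise
# around 0.47), NOT the paper's data; only the recipe is illustrated.
import math, random

random.seed(1)
cc_replicates = [0.47 + random.gauss(0, 0.03) for _ in range(100)]

m = sum(cc_replicates) / len(cc_replicates)
sd = math.sqrt(sum((x - m) ** 2 for x in cc_replicates) / (len(cc_replicates) - 1))
ci = (m - 1.96 * sd, m + 1.96 * sd)

full_graph_cc = 0.476  # full-graph CC value for Facebook quoted in the text
print(ci[0] <= full_graph_cc <= ci[1])  # does the CI cover the full-graph CC?
```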



Denote a graph by G = ⟨V, E⟩, where V is the node set and E is the edge set. The operator | · | returns the cardinality of a set. We denote the number of nodes in a graph by |V| (|V| = n), the number of edges by |E|, and the neighborhood set of a node v by δ(v) = {w | (v, w) ∈ E}. The degree of a node v is defined as the cardinality of its neighborhood: d_v = |δ(v)|. For a subset S ⊆ V, the subgraph induced by S is denoted by G[S] = (S, E_S), where E_S = {(v_S, w_S) ∈ E | v_S ∈ S, w_S ∈ S}. The neighborhood of a node set S is defined by N(S) = ∪_{v∈S} δ(v), and the neighboring edge set of an edge e = (u, v) is Δ(e) = {(x, y) | x ∈ {u, v}, y ∈ δ(x)\{u, v}}. Let M(G[S]) denote the estimator of M obtained by using the subgraph G[S] for estimation.
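The notation above translates directly into code. A minimal sketch (helper names are ours, for illustration only): δ(v), the induced subgraph G[S], and the neighboring edge set Δ(e).

```python
# Direct transcription of the notation above, for a graph given as a set of
# (undirected) edge tuples. Helper names are illustrative.
def neighbors(E, v):
    # delta(v) = {w | (v, w) in E}, treating edges as undirected
    return {w for (a, b) in E for v2, w in ((a, b), (b, a)) if v2 == v}

def induced_subgraph(E, S):
    # E_S keeps exactly the edges with both endpoints in S, giving G[S]
    return {(u, v) for (u, v) in E if u in S and v in S}

def neighboring_edges(E, e):
    # Delta(e) for e = (u, v): edges touching u or v, minus e itself
    u, v = e
    return {(x, y) for x in (u, v) for y in neighbors(E, x) - {u, v}}

E = {(0, 1), (1, 2), (2, 3), (3, 4)}           # a path on 5 nodes
print(neighbors(E, 2))                          # -> {1, 3}
print(induced_subgraph(E, {0, 1, 2}))           # -> {(0, 1), (1, 2)}
print(neighboring_edges(E, (1, 2)))             # -> {(1, 0), (2, 3)}
```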

Figure 2: The cold-colored edges have large OR curvatures, and the warm-colored edges have small OR curvatures. Nodes with different colors belong to different communities.
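The curvature contrast in Figure 2 can be reproduced on a toy two-community graph. The sketch below computes the OR curvature of an edge as κ(x, y) = 1 − W(m_x, m_y), with lazy random-walk measures of idleness α = 0.5 (the value used in our experiments) and W obtained from an exact optimal-transport linear program. This is an illustrative implementation under those assumptions, not the authors' code; `bfs_dist` and `orc` are hypothetical helper names.

```python
# Sketch: Ollivier-Ricci curvature kappa(x, y) = 1 - W(m_x, m_y) for an edge
# (x, y), where m_x places mass alpha on x and (1 - alpha)/deg(x) on each
# neighbor, and W is the Wasserstein-1 distance solved as an exact transport LP.
from itertools import product
from collections import deque
from scipy.optimize import linprog

def bfs_dist(adj, s):
    # shortest-path (hop) distances from s via breadth-first search
    d = {s: 0}; q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1; q.append(v)
    return d

def orc(adj, x, y, alpha=0.5):
    mx = {x: alpha}; my = {y: alpha}
    for v in adj[x]: mx[v] = mx.get(v, 0) + (1 - alpha) / len(adj[x])
    for v in adj[y]: my[v] = my.get(v, 0) + (1 - alpha) / len(adj[y])
    sup_x, sup_y = list(mx), list(my)
    dist = {u: bfs_dist(adj, u) for u in sup_x}
    # Transport LP: minimize sum d(u, v) * t[u, v] subject to the marginals.
    c = [dist[u][v] for u, v in product(sup_x, sup_y)]
    A_eq, b_eq = [], []
    for i, u in enumerate(sup_x):   # row marginals equal m_x
        A_eq.append([1 if ii == i else 0
                     for ii, _ in product(range(len(sup_x)), sup_y)])
        b_eq.append(mx[u])
    for j, v in enumerate(sup_y):   # column marginals equal m_y
        A_eq.append([1 if jj == j else 0
                     for _, jj in product(sup_x, range(len(sup_y)))])
        b_eq.append(my[v])
    W = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
    return 1 - W   # edge length d(x, y) = 1

# Two triangles joined by a single bridge edge (2, 3): the within-triangle
# edge (0, 1) gets a higher curvature than the between-community bridge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(orc(adj, 0, 1) > orc(adj, 2, 3))
```

On this toy graph the within-triangle edge has κ = 0.75 while the bridge edge is negatively curved, mirroring the within- versus between-community gap that Lemma A.1 quantifies.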

We assume p_in^(i) > p_out, and that n_i/n converges to r_i (0 < r_i < 1) as n goes to infinity. Let n_∨ = max{n_i}_{i=1}^M and p_∨ = max{p_in^(i)}_{i=1}^M.

Figure 3: (A): A graph generated by an SBM with community sizes (650, 50, 50, 50), pin = 0.8, and pout = 0.1. (B): The graph subsampled by the proposed ORG-sub algorithm; the subsampling proportion is 10%. (C): The graph subsampled by the degree-based subsampling algorithm; the subsampling proportion is 10%.

Figure 4: The estimate of M as a function of the subsampling proportion for SBM and DCBM, respectively.

Computational complexity of Algorithm 1. We analyze the time complexity of Algorithm 1 step by step. The time complexity of querying the neighboring edges and their corresponding curvatures is of order O(n d̄) by the Nys-Sink algorithm (Altschuler et al., 2019), where d̄ is the average degree. The time complexity of the sorting step, which finds the neighboring edge whose curvature differs most from the current edge's, is O(d̄ log d̄). Thus, the time complexity of taking one step is of order O(n d̄). To sample ñ nodes, we need to run about ñ steps and then take the subgraph induced by the sampled node set S. Since obtaining an induced subgraph of size ñ takes time of order O(ñ d̄), the total complexity of the sampling procedure is O(n ñ d̄).
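The per-step selection rule described above can be sketched compactly. This is our reading of the greedy step (move to the neighboring edge whose OR curvature differs most from the current edge's), with hypothetical helper names and precomputed curvatures; it is not the authors' implementation of Algorithm 1.

```python
# Sketch of one greedy step of the curvature-gradient walk described above:
# from the current edge, pick the edge in Delta(cur) maximizing the absolute
# curvature difference. `delta` and `curv` are illustrative inputs.
def next_edge(cur, delta, curv):
    # delta: dict mapping an edge to its neighboring edge set Delta(e)
    # curv:  dict mapping edges to precomputed OR curvatures
    return max(delta[cur], key=lambda e: abs(curv[e] - curv[cur]))

# Toy example: from a flat-curvature within-community edge, the step jumps
# to the negatively curved bridge edge rather than to a similar neighbor.
curv = {("a", "b"): 0.5, ("b", "c"): 0.45, ("b", "d"): -0.3}
delta = {("a", "b"): [("b", "c"), ("b", "d")]}
print(next_edge(("a", "b"), delta, curv))  # -> ('b', 'd')
```

This is the mechanism behind the expansion guarantee of Theorem A.9: between-community edges have the most different curvature (Theorem A.8), so the walk is pulled across community boundaries.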

Figure 5: The function of the performance of MAE with respect to pout and prop for DCBM dataset.




Figure A.1: The degree distribution of the five datasets.

Table 1 records the average error of the estimation of M over 30 replications. Still, our algorithm outperforms the other algorithms in most cases. Let n_1, ..., n_M denote the numbers of nodes in the M communities. We calculate the normalized Shannon entropy of n_1, ..., n_M as a metric of the imbalance level, i.e., IM = 1 − H / log M, where H = −Σ_{k=1}^M (n_k/n) log(n_k/n); a higher IM value means that the communities are more imbalanced. Table A.4 in the appendix summarizes the IM values. As we can see there, both PubMed and Polbooks are more imbalanced than Facebook.
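The imbalance metric is a few lines of code. The sketch below assumes the reading IM = 1 − H / log M, where H is the Shannon entropy of the community proportions (our interpretation of the normalized-entropy definition in the text, which is truncated in this copy).

```python
# Imbalance level IM = 1 - H / log(M), with H the Shannon entropy of the
# community proportions n_k / n. Assumed reading of the definition in the text.
import math

def imbalance(sizes):
    n, M = sum(sizes), len(sizes)
    H = -sum((s / n) * math.log(s / n) for s in sizes)
    return 1 - H / math.log(M)

print(imbalance([225, 225, 225, 225]))          # perfectly balanced -> 0.0
print(round(imbalance([675, 90, 75, 60]), 3))   # unbalanced SBM setting -> 0.399
```

Higher values mean the node mass concentrates in fewer communities; the unbalanced synthetic setting above scores about 0.4, while equal-sized communities score 0.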

Comparison of the performance on the error of the estimation of M for each dataset and subsampling method.

Table A.1: DCBM: Error of the estimation of the number of communities for different subsampling methods under different settings.

Table A.2: SBM: Error of the estimation of the number of communities for different subsampling methods under different settings.

Table A.3: SBM: Comparison of the computation time (in seconds) for estimating the number of communities using the full dataset versus the sampled dataset, for different subsampling methods under different settings.

Polbooks: This is a network of books about US politics published around the 2004 presidential election and sold by the online bookseller Amazon.com. The books are divided into four communities by NI-LPA. Edges between books represent frequent co-purchasing of books by the same buyers.

Facebook: This is an ego-network dataset of the "friend circles" of one anonymous Facebook user. The network forms friend circles, such as family members, high school friends, or other friends, that are "hand-labeled" by the user.

PubMed: The PubMed dataset describes the citation relationships among scientific publications classified into seven classes. After preprocessing, 19,717 nodes and 44,338 links remain.

Table A.4 presents the summary statistics of the real networks used in this paper. As we can see, these datasets cover different levels of network size, density, number of communities, clustering coefficient (CC), and imbalance. The degree distributions of all the datasets are presented in Figure A.1.

Table A.4: Key features of the real-world datasets. Node, Edge, and M denote the number of nodes, edges, and communities of the graph, respectively. Density is the edge density, i.e., the ratio of the number of edges in the actual graph to that in the complete graph. CC is the clustering coefficient of the whole graph. IM is the imbalance level.



Table A.6: The difference between the estimates obtained from the subsampled graph and from the full graph.

ACKNOWLEDGMENTS

We acknowledge partial support from NSF awards DMS-1903226, DMS-1925066, and DMS-2124493, and NIH grant R01GM1222080. We thank the referees for their valuable comments and suggestions, which helped us improve the quality of this paper, and we are grateful for the time and effort they put into the review process.


Published as a conference paper at ICLR 2023

Theorem A.8. Combining the conclusions of statement 1 and statement 3 of Lemma A.1, if p_∨ < n^2 − (n_i + n_j)(n − 2n_i) and the conditions of Lemma A.1 hold, then, almost surely, for any (x*_{B_i}, y*_{B_j}) ∈ Δ((x_{B_i}, y_{B_i})), we have

max_{(x'_{B_i}, y'_{B_i}) ∈ Δ((x_{B_i}, y_{B_i}))} |κ(x_{B_i}, y_{B_i}) − κ(x'_{B_i}, y'_{B_i})| < κ(x_{B_i}, y_{B_i}) − κ(x*_{B_i}, y*_{B_j}).

Proof. We first obtain a lower bound on the OR curvature difference between the within-community edge κ(x_{B_i}, y_{B_i}) and the between-communities edge κ(x*_{B_i}, y*_{B_j}), which follows from the lower bound on κ(x_{B_i}, y_{B_i}) and the upper bound on κ(x*_{B_i}, y*_{B_j}) in Lemma A.1. Given the upper bound on the OR curvature difference between edges in the same community, we can then show that, under the stated conditions, this upper bound is smaller than the lower bound obtained in the previous step. Denote r_i = n_i/n and r_j = n_j/n.

Theorem A.9 (Probability of the ORG-sub's Expansion to Communities). Given a graph G satisfying the conditions of Theorem A.8, we apply the ORG-sub Algorithm 1 to G and take a graph subsample of size ñ, denoted by G[S]. Under the conditions of Theorem A.8, and if ñ/n → 1, the probability that the final subgraph G[S] taken by ORG-sub contains nodes from community B_i converges to one.

Proof. Without loss of generality, we show that the theorem holds for a two-block stochastic block model (SBM). We assume the graph is generated from a two-block SBM, where the probability of an edge within a block is p_in and the probability of an edge between the blocks is p_out. We denote the subsampled graph by G[S], where S is a subset of the full vertex set V of the graph G and G[S] is the subgraph induced by S. Denote by P the probability matrix generating the adjacency matrix A, where the size of the first community is n_1, the size of the second community is n_2, and the number of nodes of the graph is n = n_1 + n_2.
The node set of community 1 is B_1 = {v_1, ..., v_{n_1}}, and the node set of community 2 is B_2 = {v_{n_1+1}, ..., v_n}. Denote each entry of the probability matrix P by p_ij; the observed adjacency matrix A is generated entrywise from these probabilities, and the blocks of A are defined accordingly. We denote the edge subsampled at the l-th step by e_ij^(l), and the neighboring edge set of an edge e_ij by Δ(e_ij). At each step, the probability of subsampling an edge within a given block is a constant. After subsampling one edge in each step, the nodes connected by the subsampled edges are added to the node set S. We denote by P(ṽ^(l) ∈ B_o) the probability that the node ṽ added at step l belongs to community B_o, where o ∈ {1, 2}. After ñ steps, one can write down the probability that the obtained subsample G[S] contains a node ṽ from the node set B_1 and, analogously, the probability that it contains a node ṽ from the node set B_2. Since the probability of missing a community can be made small when n_1 (resp. n_2) and ñ are large, the proposed ORG-sub subsampler traverses to new communities with high probability. When p_out is small, that is, when the edge density between communities is relatively small, ORG-sub needs more steps to expand to another community. As the number of steps taken by the proposed subsampler increases, we can prove that the probability of the proposed ORG-sub algorithm taking all communities into the final subgraph converges to one under the two-block stochastic block model. In the general case, where the graph is generated by an SBM with multiple communities, we can always regard the communities that the subsampler has never visited as one block and the remaining communities as the other block. Given the conclusion for the two-block model, ORG-sub gradually expands the subsampled graph into the unvisited block; thus, ORG-sub can expand to any community under a multi-block stochastic block model.

Corollary A.9.1 (Theoretical advantage over random walk).
Under the assumptions of Theorem 4.1, for any community B_i in the graph G, the probability that the ORG-sub subsample contains nodes from B_i is at least as large as the corresponding probability for the random walk subsampler.

Proof. For the proposed method, Theorem A.9 shows that, after walking ñ steps, the probability that the subsample G[S] obtained by ORG-sub contains vertices from community B_i converges to one. After ñ steps, one can likewise write down the probability that the subsample G_RW[S] obtained by the random walk subsampler contains vertices from community B_i, and compare the two probabilities.

Corollary A.9.2. If the subsample of size ñ satisfies ñρ/log ñ → ∞, where ρ is the edge density, then, as the number of subsampling replications r → ∞, the probability that the subsampled graph underestimates the true M converges to zero.

Proof. From Theorem A.9, we know that, if we take enough subsampling steps, the true number of communities of the subgraph G[S] equals the true number of communities M of the original graph G. Theorem 3 of Li et al. (2020) proves one-sided consistency of estimating the number of communities of a stochastic block model under the following assumptions. First, the expected node degree λ_n = nρ satisfies λ_n/log n → ∞, where n is the number of nodes of the graph G. Second, there exists a constant γ such that min_k n_k ≥ γn, where n_k is the size of the k-th community. Since our subsampled graph satisfies ñρ/log ñ → ∞, we can verify that the subsampled subgraph satisfies the assumptions of Theorem 3 of Li et al. (2020). Hence the estimate M(G[S_i]) corresponding to S_i, the i-th subsample, is consistent from below, and as the number of subsample replications increases, the probability that the final estimate underestimates M converges to zero.

B DETAILS OF EXPERIMENTS

We evaluate the performance of our algorithm on both synthetic and real-world datasets using the metrics presented in the main text. We also compare the proposed method with seven benchmark exploration-based graph subsampling methods: the Metropolis-Hastings Random Walk Sampler (MHRW), Forest Fire Sampler (FFS), Snowball Sampler, Community Structure Expansion Sampler (CSE), Degree-Based Node sampler (DBN), Random Walk Sampler (RW), and Multi-Dimensional Random Walk Sampler (MDRW). We set the hyper-parameters of the seven benchmark methods to the defaults in the package Little Ball of Fur. We fix the sampling proportion at 10%, that is, we sample 90 nodes. Table A.7 shows the edge density and average estimation error under different λ. As we can see, increasing edge density comes with lower error.

Node-based sampling. Random Node (RN) sampling (Leskovec & Faloutsos, 2006) and Degree-Based Node (DBN) sampling (Leskovec & Faloutsos, 2006) are the two most common node-based sampling methods. RN selects a set of nodes uniformly at random from the graph, while DBN selects a node with probability proportional to its degree. DBN has been shown to favor high-degree nodes.

Edge-based sampling. Random Edge (RE) sampling (Leskovec & Faloutsos, 2006) generates an induced subgraph by selecting edges uniformly at random. Variants of RE have been proposed, such as Random Node-Edge (RNE) sampling (Rafiei, 2005), which randomly selects a node and then randomly chooses an adjacent edge. Previous studies (Leskovec & Faloutsos, 2006) have demonstrated that neither RE nor RNE preserves community structures, because the resulting sampled graphs are often sparsely connected. Meanwhile, both RE and RNE slightly favor high-degree nodes, because the probability of selecting a node increases with its degree.

Exploration-based sampling. Snowball sampling (Goodman, 1961) and Forest Fire sampling (FFS) (Leskovec et al., 2005) are two basic exploration-based sampling methods, selecting a fixed fraction of the neighbors visited at each iteration.
Random walk (RW) sampling (Lovász, 1996) is also a popular exploration-based graph sampling method, which selects the next node uniformly at random from the neighbors of the currently selected node. A major limitation of RW is that it is inherently biased toward visiting high-degree nodes. To overcome this drawback, researchers proposed the Metropolis-Hastings Random Walk Sampler (MHRW) (Hübler et al., 2008) and the Multi-Dimensional Random Walk Sampler (MDRW) (Ribeiro & Towsley, 2010). Most of the aforementioned literature aims at obtaining a subgraph that preserves certain network summary statistics, such as the degree distribution; only a few works aim at obtaining a subgraph that preserves community information. Maiya & Berger-Wolf (2010) proposed a local greedy-search-based community structure expansion sampling (CSE) method to optimize the preservation of community structures. Despite many successful applications, existing graph sampling methods all have various limitations. In particular, node-based and edge-based sampling methods sample nodes or edges independently, ignoring the neighborhoods of seed nodes, and may obtain a disconnected subgraph from a connected graph (Wu et al., 2016). For RW and FFS, it has been shown that they can get trapped inside communities and leave other communities out of the sample entirely (Wu et al., 2016). While the MHRW algorithm ensures that the subgraph preserves the degree distribution, its performance depends on its sample acceptance ratio, and the acceptance ratio of MHRW is typically very low in real-world networks. Therefore, MHRW generally suffers from the sample rejection problem, which clearly degrades its performance. For CSE, the selection procedure for the next node is not based on a mathematical framework, and one cannot compute the probability of visiting sampled nodes in CSE (Salehi et al., 2012).
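For concreteness, the basic RW sampler discussed above fits in a few lines. This is a generic sketch of the method (not any package's implementation); `random_walk_sample` is an illustrative name.

```python
# Minimal sketch of a simple random-walk (RW) sampler: repeatedly hop to a
# uniformly random neighbor, collecting visited nodes until the target size
# is reached. The stationary visit probability is proportional to degree,
# which is exactly the high-degree bias discussed above.
import random

def random_walk_sample(adj, n_target, start, seed=0):
    rng = random.Random(seed)
    cur, sampled = start, {start}
    while len(sampled) < n_target:
        cur = rng.choice(adj[cur])   # uniform step to a neighbor
        sampled.add(cur)
    return sampled

# Star with a pendant path: the hub 0 is revisited constantly, so the walk
# wastes steps on the high-degree node before reaching the path's end.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
S = random_walk_sample(adj, 4, start=0)
print(sorted(S))
```

In an SBM with small p_out, the same mechanism keeps the walk inside dense communities, which is the trapping behavior (Wu et al., 2016) that ORG-sub's curvature-gradient step is designed to escape.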

E EMPIRICAL INVESTIGATION OF FUTURE WORK

Theoretically, the OR curvature has been proven to be related to several network summaries concerning communities, such as the eigenvalues of the graph Laplacian (Bauer et al., 2011) and the clustering coefficient (CC) (Jost & Liu, 2014).

