LEARN LOW-DIMENSIONAL SHORTEST-PATH REPRESENTATION OF LARGE-SCALE AND COMPLEX GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

Estimation of shortest-path (SP) distance lies at the heart of network analysis tasks. With the rapid emergence of large-scale and complex graphs, approximate SP-representing algorithms that transform a graph into compact, low-dimensional representations are critical for fast and scalable online analysis. Among different approaches, learning-based representation methods have achieved breakthroughs in both response time and accuracy. Several competitive learning-based works heuristically leverage truncated random walks and optimization over arbitrary linkages to learn SP representations. However, they are limited in both exploration range and distance preservation. In this paper, we propose an efficient and interpretable SP representation method called Betweenness Centrality-based Distance Resampling (BCDR). First, we prove that a betweenness centrality-based random walk covers a wider exploration range of distances owing to its awareness of high-order path structures. Second, we leverage distance resampling to simulate random shortest paths from the original paths, and prove that optimizing over such shortest paths preserves distance relations by implicitly decomposing an SP distance-based similarity matrix. BCDR yields an average improvement of 25% in accuracy and 25-30% in query speed over all existing approximate methods when evaluated on a broad class of real-world and synthetic graphs with diverse sizes and structures.

1. INTRODUCTION

Estimation of shortest-path (SP) distance lies at the heart of many network analysis tasks, such as centrality computation (Schönfeld & Pfeffer, 2021), node separation (Houidi et al., 2020), and community detection (Zhang et al., 2020; Asif et al., 2022). It also directly contributes to numerous downstream applications, including point-of-interest (POI) search (Qi et al., 2020; Chen et al., 2021a), social relationship analysis (Carlton, 2020; Melkonian et al., 2021), biomedical structure prediction (Yue et al., 2019; Sokolowski & Wasserman, 2021), learning theory (Yang et al., 2021; Yuan et al., 2021), optimization (Rahmad Syah et al., 2021; Jiang et al., 2021b), etc. Nowadays, a key challenge in computing SP distance is its prohibitive complexity on very large and complex graphs. For example, on a sparse undirected graph with N nodes and k queries, the time complexities of the A* (Hart et al., 1968) and Dijkstra (Thorup & Zwick, 2004) algorithms are up to O(kN) and O(kN log N) for unweighted and weighted graphs, respectively. To address this issue, various methods (Cohen et al., 2003; Fu et al., 2013; Akiba et al., 2013; Delling et al., 2014; Farhan et al., 2019; Liu et al., 2021) attempt to answer exact distances online in microseconds via indexing or compression techniques, but they suffer huge storage costs for all-pairs SP distance representations and fail to reflect latent sub-structures in graphs for scalable queries (see Figure 1). Highly concise SP representation for large-scale and complex graphs thus remains understudied. Accordingly, a surging number of approximate SP-representing algorithms that transform a graph into compact, low-dimensional representations are critical for fast and scalable online analysis.
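To make the per-query cost concrete, the following minimal Python sketch (on a hypothetical toy graph; not code from any cited system) runs Dijkstra's algorithm from scratch for a query. Exact methods repeat this traversal, or a large index lookup, for every one of the k queries, which is precisely the cost that compact SP representations amortize away:

```python
import heapq

# Hypothetical toy graph as adjacency lists with unit edge weights.
graph = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2, 4],
    4: [3],
}

def dijkstra(source):
    """Exact single-source shortest paths; each online query repeats this traversal."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in graph[u]:
            nd = d + 1  # unit weights here; a weighted graph would add w(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# k queries cost k full traversals; a low-dimensional SP representation
# instead answers each query with a constant-size embedding comparison.
assert dijkstra(0)[4] == 3
```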
They can be categorized into oracle-based (Thorup & Zwick, 2004; Baswana & Kavitha, 2006), landmark-based (Potamias et al., 2009; Sarma et al., 2010; Gubichev et al., 2010), and learning-based (Rizi et al., 2018; Schlötterer et al., 2019; Qi et al., 2020; Jiang et al., 2021a) SP representation methods. Among these categories, learning-based methods offer high accuracy and short response time (see Table 1), owing much to flexible node embeddings in a metric space.

Table 1: Overall comparison of approaches to SP representation on the DBLP dataset (A.8.4). PTC: preprocessing time complexity; PSC: preprocessing space complexity; RTC: response time complexity; TSC: total storage cost for answering online distance queries; RT: real response time; AL: accuracy loss, measured by mRE (see Equation 1). N: the number of nodes in the graph; L(N): the average label size of each node, which grows with N; D: the amortized degree of each node. α0, L, n, d, w, l, c and β are hyperparameters of the corresponding models.

Several competitive learning-based works (Rizi et al., 2018; Schlötterer et al., 2019) heuristically leverage truncated random walks and the optimization of node-cooccurrence likelihood over arbitrary linkages to learn SP representations, which once achieved state-of-the-art approximation quality. However, they are not without limitations in efficiency and interpretability. On one side, a random walk is an unconstrained node sequence from the root and possesses a limited exploration range of distances, resulting in uncaptured distance relations with remote nodes. This is because each transition is not directed specifically toward or away from the root, especially after several walk steps, which prevents the walk from visiting remote nodes within a limited number of steps (see Figure 2a).
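The limited exploration range can be illustrated empirically. The sketch below (a hypothetical path graph; `truncated_random_walk` is an illustrative helper of ours, not code from the cited methods) simulates uniform truncated random walks and measures how far they actually travel from the root, which on a path graph equals the SP distance reached:

```python
import random

# Hypothetical path graph 0-1-2-...-9, where SP distance equals index difference.
N = 10
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)}

def truncated_random_walk(root, walk_length, rng):
    """Uniform truncated random walk, as used by DeepWalk-style SP learners."""
    walk = [root]
    for _ in range(walk_length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)
walks = [truncated_random_walk(0, 8, rng) for _ in range(1000)]

# A walk of length 8 could in principle reach distance 7, but unbiased
# transitions keep doubling back, so remote nodes are rarely visited.
avg_reach = sum(max(w) for w in walks) / len(walks)
assert 1 <= avg_reach < 7
```

In practice the average farthest node reached stays well below the walk-length bound, so distance relations with remote nodes remain unobserved by the training objective.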
On the other side, optimization over arbitrary linkages reflects an excessively versatile local similarity among nodes, which carries inaccurate distance relations from the original graph into the embedding space. In effect, it imposes an overly general metric on node correlation: the more edges or paths exist between two nodes, the stronger the correlation they share. Consequently, there are many ways to simulate a strong correlation between two nodes (e.g., adding mutual edges, or deleting an edge to other nodes) even when these operations do not affect their actual SP distance (see Figures 2c and 2d). A detailed discussion of related work on SP representation, and of the motivation for estimating accurate SP distance, can be found in Appendix A.1. In this paper, we address the above shortcomings by proposing an efficient and interpretable SP representation method called Betweenness Centrality-based Distance Resampling (BCDR). It improves the approximation quality of SP representations with two components. The first is a betweenness centrality (BC)-based random walk, which explores a wider range of distance correlations on the graph due to its awareness of high-order path structures. To the best of our knowledge, no existing method combines betweenness centrality and random walks to learn SP representations. We prove that BC-based transitions are prone to jump out of local neighborhoods compared to random



Figure 1: Differences between approximate (ours) and exact (PLL (Akiba et al., 2013), an efficient implementation of the hub-labeling method) SP representation methods in terms of storage cost (megabytes, MB) and response time (nanoseconds, ns). We simulate a group of Bernoulli random graphs with |V| nodes, where each edge is included independently with probability p. (a) and (c) show that the storage cost of exact representations increases dramatically with graph size. (b) and (d) show the longer response time of exact methods, induced by random access to massive amounts of information.
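The asymptotic trend behind Figure 1 can be sketched with simple back-of-the-envelope arithmetic. The snippet below is illustrative only: it compares a naive all-pairs distance table, which grows as O(N²), against d-dimensional embeddings, which grow as O(Nd); the embedding dimension (128) and entry size (4 bytes) are assumptions, and actual hub-labeling indices store per-node labels rather than a full table:

```python
# Illustrative storage arithmetic: O(N^2) all-pairs table vs O(N*d) embeddings.
def storage_mb(n_nodes, dim=None, bytes_per_entry=4):
    """Storage in MB for a full distance table (dim=None) or embeddings."""
    entries = n_nodes * dim if dim is not None else n_nodes * n_nodes
    return entries * bytes_per_entry / 1e6

for n in (10_000, 100_000, 1_000_000):
    exact = storage_mb(n)            # all-pairs distance table
    approx = storage_mb(n, dim=128)  # assumed 128-dim embeddings
    print(f"N={n:>9,}: table {exact:,.0f} MB vs embeddings {approx:,.0f} MB")
```

The quadratic term dominates quickly, which matches the dramatic storage growth of exact representations in panels (a) and (c).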

