LEARN LOW-DIMENSIONAL SHORTEST-PATH REPRESENTATION OF LARGE-SCALE AND COMPLEX GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

Estimation of shortest-path (SP) distance lies at the heart of network analysis tasks. With the rapid emergence of large-scale and complex graphs, approximate SP-representing algorithms that transform a graph into compact, low-dimensional representations are critical for fast and scalable online analysis. Among the different approaches, learning-based representation methods have made a breakthrough in both response time and accuracy. Several competitive learning-based works heuristically leverage truncated random walks and optimization on arbitrary linkage for SP representation learning. However, they are limited in both exploration range and distance preservation. In this paper, we propose an efficient and interpretable SP representation method called Betweenness Centrality-based Distance Resampling (BCDR). First, we prove that a betweenness centrality-based random walk covers a wider exploration range of distances because it is aware of high-order path structures. Second, we leverage distance resampling to simulate random shortest paths from the original paths, and we prove that optimization on such shortest paths preserves distance relations by implicitly decomposing an SP distance-based similarity matrix. BCDR yields an average improvement of 25% in accuracy and 25-30% in query speed compared to all existing approximate methods when evaluated on a broad class of real-world and synthetic graphs with diverse sizes and structures.

1. INTRODUCTION

Estimation of shortest-path (SP) distance lies at the heart of many network analysis tasks, such as centrality computation (Schönfeld & Pfeffer, 2021), node separation (Houidi et al., 2020), and community detection (Zhang et al., 2020; Asif et al., 2022). It also directly contributes to numerous downstream applications, including point of interest (POI) search (Qi et al., 2020; Chen et al., 2021a), social relationship analysis (Carlton, 2020; Melkonian et al., 2021), biomedical structure prediction (Yue et al., 2019; Sokolowski & Wasserman, 2021), learning theory (Yang et al., 2021; Yuan et al., 2021), optimization (Rahmad Syah et al., 2021; Jiang et al., 2021b), etc. A key challenge in computing SP distance is the prohibitive complexity on very large and complex graphs. For example, for a sparse undirected graph with N nodes and k queries, the time complexities of A* (Hart et al., 1968) and Dijkstra's algorithm (Thorup & Zwick, 2004) are up to O(kN) and O(kN log N) for unweighted and weighted graphs, respectively. To address this issue, various methods (Cohen et al., 2003; Fu et al., 2013; Akiba et al., 2013; Delling et al., 2014; Farhan et al., 2019; Liu et al., 2021) attempt to answer exact distance queries online in microseconds via indexing or compression techniques, but they suffer huge storage costs for all-pairs SP distance representations and fail to reflect latent sub-structures in graphs for scalable queries (see Figure 1). Highly concise SP representations for large-scale and complex graphs remain understudied. To this end, a growing number of approximate SP-representing algorithms, which transform a graph into compact and low-dimensional representations, have become critical for fast and scalable online analysis.
They can be categorized into oracle-based (Thorup & Zwick, 2004; Baswana & Kavitha, 2006), landmark-based (Potamias et al., 2009; Sarma et al., 2010; Gubichev et al., 2010), and learning-based (Rizi et al., 2018; Schlötterer et al., 2019; Qi et al., 2020; Jiang et al., 2021a) SP representation methods. Among these categories, learning-based methods offer high accuracy and short response time (see Table 1), owing much to flexible node embeddings in a metric space.
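To make the contrast between exact and approximate SP queries concrete, the following is a minimal Python sketch, not the method proposed in this paper. It assumes an unweighted, undirected graph stored as an adjacency dictionary, and it illustrates the general landmark idea (precompute distances to a few landmarks, then bound a query with the triangle inequality). The function names, the uniform random landmark selection, and the data layout are hypothetical choices made for illustration only.

```python
from collections import deque
import random


def bfs_distances(adj, source):
    """Exact hop distances from `source` to every reachable node.

    One call costs O(N + M), which is O(N) on a sparse graph; answering
    k queries this way therefore costs O(kN).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_landmark_index(adj, num_landmarks, seed=0):
    """Precompute exact distances from a few landmarks (offline step).

    Assumption: landmarks are drawn uniformly at random; published
    landmark methods typically use more careful selection strategies.
    """
    rng = random.Random(seed)
    landmarks = rng.sample(sorted(adj), num_landmarks)
    return {lm: bfs_distances(adj, lm) for lm in landmarks}


def estimate_distance(index, u, v):
    """Upper bound on d(u, v): min over landmarks of d(u, l) + d(l, v).

    Uses d(l, u) = d(u, l), which holds because the graph is undirected.
    Returns infinity if no landmark reaches both endpoints.
    """
    best = float("inf")
    for dist in index.values():
        if u in dist and v in dist:
            best = min(best, dist[u] + dist[v])
    return best


# Toy example: a path graph 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
exact = bfs_distances(path_graph, 0)[4]
index = build_landmark_index(path_graph, num_landmarks=2)
approx = estimate_distance(index, 0, 4)
```

The online query touches only the precomputed landmark tables, so its cost depends on the number of landmarks rather than on N. The price is that the estimate is only an upper bound on the true distance. Learning-based methods replace the landmark tables with node embeddings in a metric space.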

