LEARN LOW-DIMENSIONAL SHORTEST-PATH REPRESENTATION OF LARGE-SCALE AND COMPLEX GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

Estimation of shortest-path (SP) distance lies at the heart of network analysis tasks. With the rapid emergence of large-scale and complex graphs, approximate SP-representing algorithms that transform a graph into compact, low-dimensional representations are critical for fast and scalable online analysis. Among different approaches, learning-based representation methods have made breakthroughs in both response time and accuracy. Several competitive learning-based works heuristically leverage truncated random walks and optimization on arbitrary linkage for SP representation learning. However, they have limitations in both exploration range and distance preservation. We propose in this paper an efficient and interpretable SP representation method called Betweenness Centrality-based Distance Resampling (BCDR). First, we prove that a betweenness centrality-based random walk occupies a wider exploration range of distance due to its awareness of high-order path structures. Second, we leverage distance resampling to simulate random shortest paths from original paths and prove that optimization on such shortest paths preserves distance relations via implicitly decomposing an SP distance-based similarity matrix. BCDR yields an average improvement of 25% in accuracy and 25-30% in query speed over all existing approximate methods when evaluated on a broad class of real-world and synthetic graphs with diverse sizes and structures.

1. INTRODUCTION

Estimation of shortest-path (SP) distance lies at the heart of many network analysis tasks, such as centrality computation (Schönfeld & Pfeffer, 2021), node separation (Houidi et al., 2020), and community detection (Zhang et al., 2020; Asif et al., 2022), and directly contributes to numerous downstream applications, including point of interest (POI) search (Qi et al., 2020; Chen et al., 2021a), social relationship analysis (Carlton, 2020; Melkonian et al., 2021), biomedical structure prediction (Yue et al., 2019; Sokolowski & Wasserman, 2021), learning theory (Yang et al., 2021; Yuan et al., 2021), and optimization (Rahmad Syah et al., 2021; Jiang et al., 2021b). Nowadays, a key challenge in computing SP distance is its prohibitive complexity on very large and complex graphs: e.g., for a sparse undirected graph with N nodes and k queries, the time complexity of the A* (Hart et al., 1968) and Dijkstra (Thorup & Zwick, 2004) algorithms is up to O(kN) and O(kN log N) for unweighted and weighted graphs, respectively. Regarding this issue, various methods (Cohen et al., 2003; Fu et al., 2013; Akiba et al., 2013; Delling et al., 2014; Farhan et al., 2019; Liu et al., 2021) attempt to answer exact distances online in microseconds via indexing or compression techniques, but they suffer huge storage costs for all-pair SP distance representations and fail to reflect latent sub-structures in graphs for scalable queries (see Figure 1). Highly concise SP representation for large-scale and complex graphs thus remains understudied, and approximate SP-representing algorithms that transform a graph into compact and low-dimensional representations are critical for fast and scalable online analysis.
They can be categorized into oracle-based (Thorup & Zwick, 2004; Baswana & Kavitha, 2006), landmark-based (Potamias et al., 2009; Sarma et al., 2010; Gubichev et al., 2010), and learning-based (Rizi et al., 2018; Schlötterer et al., 2019; Qi et al., 2020; Jiang et al., 2021a) SP representation methods. Among these categories, learning-based methods achieve high accuracy and short response time (see Table 1), owing much to flexible node embeddings in a metric space.

Figure 1: Differences between approximate (ours) and exact (PLL (Akiba et al., 2013), an efficient implementation of the hub-labeling method) SP representation methods regarding storage cost (megabytes, MB) and response time (nanoseconds, ns). We simulate a group of Bernoulli random graphs with |V| nodes, where each edge is present independently with probability p. (a) and (c) show that the storage cost of exact representations increases dramatically with the graph size. (b) and (d) reflect the longer response time of exact methods, induced by random access to massive information.

Table 1: Overall comparison of approaches to SP representation on the DBLP dataset (A.8.4). PTC: preprocessing time complexity, PSC: preprocessing space complexity, RTC: response time complexity, TSC: total storage cost for answering online distance queries, RT: real response time, AL: accuracy loss measured by mRE (see Equation 1). N: the number of nodes in the graph, L(N): the average label size of each node, which grows with N, D: the amortized degree of each node. α_0, L, n, d, w, l, c, and β are hyper-parameters of the corresponding models.

Several competitive learning-based works (Rizi et al., 2018; Schlötterer et al., 2019) heuristically leverage truncated random walks and optimize node co-occurrence likelihood on arbitrary linkage to learn SP representations, which once achieved state-of-the-art approximation quality.
However, they have limitations in both efficiency and interpretability. On one side, a random walk is an unconstrained node sequence from the root and possesses a limited exploration range of distance, resulting in uncaptured distance relations with remote nodes. This is because each transition carries no explicit tendency to move towards or beyond the root, especially after several walk steps, which prevents the walk from visiting remote nodes within limited steps (see Figure 2a). On the other side, optimization on arbitrary linkage reflects an excessively versatile local similarity among nodes, which carries inaccurate distance relations from the original graph into the embedding space. In fact, it imposes an overly general metric on node correlation, wherein the more edges or paths exist between two nodes, the stronger their correlation. That means there are many ways to simulate a strong correlation between two nodes (e.g., adding mutual edges, deleting an edge to other nodes) even when such operations do not change their actual SP distance (see Figures 2c and 2d). A detailed discussion of related work on SP representation and the motivation for estimating accurate SP distance can be found in Appendix A.1. In this paper, we address the above shortcomings by proposing an efficient and interpretable SP representation method called Betweenness Centrality-based Distance Resampling (BCDR). It improves the approximation quality of SP representations with two components. The first is the betweenness centrality (BC)-based random walk, which explores a wider range of distance correlation on the graph due to its awareness of high-order path structures. To the best of our knowledge, no existing method combines betweenness centrality and random walks to learn SP representations. We prove that a BC-based transition is more prone to jump out of local neighborhoods than a random transition.
The second is distance resampling, which preserves accurate SP distance relations via implicitly decomposing an SP distance-based similarity matrix. In essence, it simulates the observation of random SPs from original walk paths and exerts desirable constraints on node representations to preserve distance relations over the graph. We summarize the major contributions as follows:
i) We propose the BC-based random walk as an efficient strategy for exploring a wider range of SP distance within limited walk steps (see Section 3.1).
ii) We propose distance resampling to preserve accurate distance relations among nodes and learn an interpretable SP representation (see Section 3.2).
iii) We evaluate BCDR on a broad class of real-world and synthetic graphs; it yields an average improvement of 25% in accuracy and 25-30% in query speed over all existing methods (see Section 4).

2. PRELIMINARY

Notation: G = (V, E) denotes an undirected graph, with V = {v_i} the set of nodes and E = {(v_i, v_j)} the set of undirected edges; N = |V|, M = |E|. We use Z ∈ R^{N×d} to denote the matrix of embedded node vectors, where d is the embedding size and the i-th row of Z corresponds to v_i. A path p_ij of length l ∈ N+ on graph G is an ordered node sequence (v_i, v_{a_1}, ..., v_{a_{l-1}}, v_j) in which every node except the last has an edge to its successor. The shortest path p̃_ij is a path of minimum length D_ij between v_i and v_j, and the SP distance matrix D comprises the entries {D_ij}. A node v_i's neighborhood N_i is the set of nodes sharing an edge with v_i, i.e., N_i = {v_j | (v_i, v_j) ∈ E}. For high-order neighborhoods of v_i, N_i^{(h)} is the set of nodes h hops away from v_i, i.e., {v_j | D_ij = h}. To avoid confusion with the symbol for paths, we use P(·) to denote a probability distribution in this paper. A truncated random walk W_i rooted at node v_i of length l is a random vector (W_i^1, W_i^2, ..., W_i^l), where W_i^k is a node chosen from the neighborhood of W_i^{k-1} for k = 1, ..., l, with initial probability P(W_i^0 = v_i) ≡ 1. P_{W_i} is the categorical distribution of nodes on W_i, where the probability of each node reflects its frequency of occurrence on the sampled paths.
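As a concrete illustration of the truncated random walk defined above, the following minimal sketch (the function name and the toy 4-cycle graph are ours, not from the paper) simulates a walk W_i of fixed length l from a root node:

```python
import random

def truncated_random_walk(neighbors, root, length, seed=None):
    """Simulate a truncated random walk of fixed length from `root`.

    `neighbors` maps each node to the list of its adjacent nodes (the
    neighborhood N_i from the Notation paragraph). Returns the node
    sequence (W^0, W^1, ..., W^l) with W^0 = root.
    """
    rng = random.Random(seed)
    walk = [root]
    for _ in range(length):
        # Each step picks uniformly from the current node's neighborhood.
        walk.append(rng.choice(neighbors[walk[-1]]))
    return walk

# Toy 4-cycle: 0-1-2-3-0.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path = truncated_random_walk(neighbors, root=0, length=5, seed=42)
```

Each consecutive pair on `path` is an edge of the graph, but, as Section 3 argues, the position of a node on the walk need not track its SP distance from the root.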

Problem Definition & Metrics:

The evaluation of approximate SP representation methods is divided into two stages. For the offline stage, we evaluate the processing time and memory usage of constructing SP representations, as well as the storage size of such representations. For the online stage, we evaluate query speed, memory usage, and approximation quality. To evaluate query speed and memory usage, one million query requests for arbitrary node pairs are performed, and the memory and average response time per node pair are recorded. For approximation quality, the commonly used metrics are the mean relative error (mRE) and mean absolute error (mAE). For a group of SP distance queries Q = {(v_i, v_j)}, mRE is defined as the relative loss of the estimate D̂_ij with respect to the real value D_ij, while mAE measures the absolute gap between them:

mRE := (1/|Q|) Σ_{(v_i,v_j)∈Q} |D̂_ij − D_ij| / D_ij,    mAE := (1/|Q|) Σ_{(v_i,v_j)∈Q} |D̂_ij − D_ij|.    (1)
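The two metrics of Equation 1 can be computed directly from a query set; the sketch below (the function name and toy values are illustrative, and distances are assumed to be stored in dictionaries keyed by node pair):

```python
def mre_mae(queries, d_hat, d_true):
    """Compute mRE and mAE over a query set Q = {(v_i, v_j)} (Equation 1).

    `d_hat` and `d_true` map node pairs to the estimated and exact SP
    distance, respectively.
    """
    rel = [abs(d_hat[q] - d_true[q]) / d_true[q] for q in queries]
    abs_err = [abs(d_hat[q] - d_true[q]) for q in queries]
    return sum(rel) / len(queries), sum(abs_err) / len(queries)

Q = [(0, 1), (0, 2)]
mre, mae = mre_mae(Q, {(0, 1): 1, (0, 2): 3}, {(0, 1): 1, (0, 2): 2})
# mRE = (0 + 0.5)/2 = 0.25, mAE = (0 + 1)/2 = 0.5
```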

3. METHOD

Although the random walk (RW) is universally accepted as an efficient serialization strategy for similarity measurement on graphs (Grover & Leskovec, 2016; Zhuang & Ma, 2018), we argue that the intuitive practice of RW for representing SP structures has several limitations. Consider a walk path p = (v_a, v_{a_1}, v_{a_2}, ..., v_{a_l}) ∈ P_a sampled by stochastic selection over neighborhoods from the root node v_a. Distance measured along p (i.e., the order on the walk) is not consistent with distance on the graph (see Figure 2b) since the node sequence is unconstrained, i.e., for v_{a_i}, v_{a_j} ∈ p with 1 ≤ i ≤ j ≤ l, i ≤ j does not imply D_{a a_i} ≤ D_{a a_j}, where i and j are the indices of v_{a_i} and v_{a_j} on p. Therefore, optimizing node co-occurrence likelihood on such walk paths incurs two problems.
Problem 1: Limited exploration range of distance. The exploration range of a rooted random walk is not proportional to its length, since each transition has an agnostic tendency to move towards or beyond the root after a few steps (see Figure 2a).
Problem 2: Intractability of distance relations on paths. The distance measured on walk paths may not reflect the SP distance on the graph because of the unbalanced number of edges between different nodes (see Figures 2c and 2d).
In this section, we describe our method in detail as a principled way of representing SP structures. We discuss two techniques, the BC-based random walk and distance resampling, to address the above problems respectively, and present the corresponding theoretical analysis for their interpretability. A time- and space-efficient implementation of BCDR integrating these techniques is given in Appendix A.2.

3.1. BC-BASED RANDOM WALK FOR WIDER EXPLORATION RANGE OF DISTANCE

Definition 1 (Betweenness Centrality). Let G = (V, E) be an undirected graph and v_i, v_s, v_t arbitrary nodes in V. Let σ_st(v_i) be the number of shortest paths between v_s and v_t that pass through v_i, and σ_st the total number of shortest paths between v_s and v_t. The betweenness centrality of v_i is

BC(v_i) = Σ_{s ≠ i ≠ t} σ_st(v_i) / σ_st.    (2)

To address Problem 1, we propose the BC-based random walk. As stated in Definition 1, BC(v_i) measures how likely v_i is to lie on the SPs of arbitrary node pairs. We therefore consider a node with a large BC value vital for driving the walk away from the root node, since such a node offers an easy route to many other nodes in few steps. To leverage this property, the BC-based random walk W_a = (W_a^1, W_a^2, ..., W_a^j, ...) rooted at v_a prefers neighbors with large BC values when simulating walk paths, i.e.,

P(W_a^j = v_m | W_a^{j-1} = v_n) = BC(v_m) / Σ_{v_k ∈ N_n} BC(v_k),    v_m ∈ N_n.    (3)

Theorem 2, proved in Appendix A.3, indicates that the BC-based random walk is prone to transit from N_a^{(h)} to N_a^{(h+1)}, leading to a wider exploration range as measured by the graph's intrinsic SP distance. We therefore conclude that the BC-based random walk is a competitive walking pattern regarding the exploration range of SP distance, since it possesses a strong tendency to jump out of local neighborhoods. We further verify this conclusion by comparison with existing RW techniques in Section 4.2.
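The BC-biased transition of Equation 3 can be sketched as follows; the neighbor lists and BC values of the toy path graph are hand-set for illustration rather than computed from Definition 1, and the function name is ours:

```python
import random

def bc_transition(neighbors, bc, current, rng):
    """One BC-biased step (Equation 3): pick the next node from N_n with
    probability proportional to its betweenness centrality BC(v_m)."""
    cand = neighbors[current]
    weights = [bc[v] for v in cand]
    return rng.choices(cand, weights=weights, k=1)[0]

# Toy path graph 0-1-2-3-4; interior nodes get higher (illustrative) BC values.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
bc = {0: 0.0, 1: 3.0, 2: 4.0, 3: 3.0, 4: 0.0}
rng = random.Random(0)
walk = [0]
for _ in range(4):
    walk.append(bc_transition(neighbors, bc, walk[-1], rng))
```

Starting from the endpoint 0, the walk is pulled towards the central node 2 because the low-BC endpoints receive zero transition weight, illustrating the tendency to leave local neighborhoods.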

3.2. DISTANCE RESAMPLING FOR SP DISTANCE PRESERVATION

To address Problem 2, we propose distance resampling. We first illustrate a general RW-based graph learning paradigm in Figure 3b and clarify the differences between our approach and others. The basic idea of RW-based methods is to learn node-level embeddings Z from pieces of observation (i.e., walk paths), so that Z reflects the structural bias of the graph. Specifically, for the naive RW strategy and its variants used in other approaches, the observation is a set of stochastic paths reflecting arbitrary linkage between nodes, which drives Z to preserve pointwise mutual information (PMI) similarity (proved in Levy & Goldberg (2014); Shaosheng et al. (2015)). Unfortunately, PMI similarity has no direct connection with SP distance and causes the problems depicted in Figures 2c and 2d. To fit Z with correct information about SP structures, we instead intend to observe random shortest paths. This is feasible since the SP problem always has optimal substructure, i.e., the subpath between two nodes on any SP is itself an SP between those nodes. However, the prohibitive complexity of computing all pairs of SPs forbids such sufficient observation (see both technical and empirical comparisons between BCDR and directly sampling SPs for optimization in Appendix A.7). As an alternative, we propose a resampling strategy that transforms BC random paths into approximate random SPs with efficient linear processing time and better performance. We first formulate the SP representation problem from the RW-based learning perspective. We refer to the random SP walk W̃_i as an ideal walking pattern whose transitions reflect the probability of each shortest path passing through v_i; paths sampled from W̃_i are thus prone to be SPs rooted at v_i. For sufficient observation of SPs, we have an optimization objective on W̃_i, i.e., L(Z) = E_{v_i ∼ P(V)} [ log P(W̃_i | Z_i) ].
To reduce optimization complexity, we replace the intrinsic probability normalization with negative sampling, following Mikolov et al. (2013a;b), since we prefer an informative Z over accurate probabilities:

L_n(Z) = Σ_{v_i∈V} P(v_i) ( E_{v_j ∼ P_{W̃_i}(V)} [log σ(Z_i Z_j^T)] + λ E_{v_k ∼ P_n(V)} [log σ(−Z_i Z_k^T)] )    (4)

where P_n is the negative-sampling distribution over the graph, λ denotes the number of negative samples, and σ(x) = (1 + e^{−x})^{−1}. Note that W̃_i is expensive to extract for each node v_i, since it requires traversing all SPs passing through v_i. To address this, we revisit the node distribution P_{W_i} on the BC-based random walk W_i and construct a distribution Q_{W_i} resampled from P_{W_i} as an efficient approximation to P_{W̃_i}:

Q_{W_i}(v_j) = α^{D_ij} BC(v_j) / Σ_{v_k ∈ W_i} α^{D_ik} BC(v_k)    (5)

where α (0 < α < 1) is a hyper-parameter controlling the weight decay over distance. Finally, we maximize the following approximate objective L̃_n on W_i (instead of L_n on W̃_i):

L̃_n(Z) = Σ_{v_i∈V} P(v_i) ( E_{v_j ∼ Q_{W_i}(V)} [log σ(Z_i Z_j^T)] + λ E_{v_k ∼ P_n(V)} [log σ(−Z_i Z_k^T)] )    (6)

We state in Proposition 1 that maximizing Equation 6 with embeddings Z is equivalent to decomposing an SP distance-based similarity matrix D̂ = ZZ^T, where for any v_a and v_b the similarity in the embedding space varies linearly with the distance D_ab, namely

D̂_ab = Z_a Z_b^T = −log ε + D_ab log α    (7)

where ε is a small constant related to the negative samples, independent of v_b, and α is the hyper-parameter defined in Equation 5, with 0 < α < 1. Proposition 1 is proved in Appendix A.4 by deriving the extreme point of L̃_n with respect to Z. We then consider the preservation of SP distance relations.
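The resampled distribution Q_{W_i} of Equation 5 only needs the nodes of one walk, their SP distances from the root, and their BC values; a minimal sketch with hypothetical toy inputs:

```python
def resampled_distribution(walk_nodes, dist_from_root, bc, alpha=0.5):
    """Distance-resampled distribution over one walk's nodes (Equation 5).

    Each node v_j on the walk W_i is reweighted by alpha**D_ij * BC(v_j),
    so nearby, highly central nodes dominate the resampled observation.
    """
    w = {v: (alpha ** dist_from_root[v]) * bc[v] for v in walk_nodes}
    z = sum(w.values())
    return {v: wv / z for v, wv in w.items()}

# Toy walk visiting nodes 1, 2, 3 at distances 1, 2, 3 from the root,
# with illustrative BC values.
q = resampled_distribution([1, 2, 3], {1: 1, 2: 2, 3: 3},
                           {1: 3.0, 2: 4.0, 3: 3.0}, alpha=0.5)
```

With 0 < α < 1 the weights decay geometrically in distance, so the probability mass concentrates on nodes close to the root, mimicking how shortest paths are dominated by near, central nodes.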
Some studies on metric learning (Hermans et al., 2017; Zeng et al., 2020) have revealed that a triplet of samples (v_a, v_b, v_c) being easy to learn means that if v_b shares a stronger correlation with v_a than v_c does, then the distance between v_b and v_a in the embedding space should be shorter than that between v_c and v_a. With this property, we have the following theorem (proved in A.6 by directly applying Proposition 1), which indicates that our method is consistent with distance relations under the intrinsic SP metric.
Theorem 1. Each symbol follows the definitions in Proposition 1. Let D be the global distance matrix of graph G and D_ab the graph's SP distance between nodes v_a and v_b. Then for any nodes v_a, v_b, v_c ∈ G,

(D_ab − D_ac)(D̂_ab − D̂_ac) ≤ 0.    (8)

In conclusion, we discuss here the significance of distance resampling for preserving accurate distance relations. It exerts two implicit constraints on Z to learn an interpretable SP representation. First, as stated in Proposition 1, the distance measured in the embedding space shares a strong negative correlation with that measured on the graph. Second, for any node triplet, the distance relation between any two of them is preserved according to Theorem 1. Both constraints are further verified against existing techniques in Section 4.3.
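Under Proposition 1, Theorem 1 follows because log α < 0 flips the sign of distance differences; the short check below verifies the inequality numerically (the constants ε and α are arbitrary choices satisfying the stated ranges, and the function names are ours):

```python
import math

def embedded_similarity(d_graph, alpha=0.5, eps=1e-3):
    """Similarity implied by Equation 7: D-hat_ab = -log(eps) + D_ab * log(alpha)."""
    return -math.log(eps) + d_graph * math.log(alpha)

def relation_preserved(d_ab, d_ac, alpha=0.5, eps=1e-3):
    """Theorem 1 check: (D_ab - D_ac)(D-hat_ab - D-hat_ac) <= 0."""
    gap_graph = d_ab - d_ac
    gap_embed = (embedded_similarity(d_ab, alpha, eps)
                 - embedded_similarity(d_ac, alpha, eps))
    return gap_graph * gap_embed <= 0
```

Since log α is negative, a larger graph distance always yields a smaller embedded similarity, so the product is never positive for any triplet.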

3.3. EFFICIENT IMPLEMENTATION OF BCDR ALGORITHM

We also provide a time- and space-efficient implementation of BCDR integrating the above techniques in Algorithm 1. Like previous learning-based SP representation methods (Rizi et al., 2018), we first transform the graph into low-dimensional embeddings (i.e., Z ∈ R^{N×d}) and learn a distance predictor g_φ : (R^d, R^d) → R from observed distance triplets {(Z_a, Z_b, D_ab)}. The predictor g_φ is then used to answer online distance queries. In addition, similar to Jiang et al. (2021a), we improve the prediction results via gradient boosting techniques. The detailed designs of these procedures are described in Appendix A.2, and an ablation study evaluating their impact on performance is provided in Appendix A.11.
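While the paper learns g_φ (with gradient boosting) rather than using a fixed formula, Equation 7 also admits a closed-form decoder by inverting Z_a Z_b^T = −log ε + D_ab log α. The sketch below is a hypothetical alternative to the learned predictor, shown only to make the query-time arithmetic concrete:

```python
import math

def decode_distance(z_a, z_b, alpha=0.5, eps=1e-3):
    """Hypothetical closed-form decoder implied by Equation 7 (not the
    paper's learned predictor g_phi): invert
    Z_a . Z_b = -log(eps) + D_ab * log(alpha) for D_ab."""
    dot = sum(x * y for x, y in zip(z_a, z_b))
    return (dot + math.log(eps)) / math.log(alpha)
```

If the inner product of two embeddings exactly matched Equation 7 for some distance, the decoder would recover that distance; in practice a learned g_φ absorbs the noise of approximate optimization.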

4. EXPERIMENTAL EVALUATION

In this section, we show the comprehensive performance of BCDR with 5 real-world graphs of different sizes and 6 synthetic graphs of different structures. Specifically, we evaluate BCDR on 3 small graphs (i.e., Cora, Facebook, and GrQc) and 2 large graphs (i.e., DBLP and Youtube) for its scalability (see Section 4.1), and evaluate it on 6 synthetic graphs for its representational capacity of complex structures (see Section 4.2 and 4.3). Our method is compared with strong baselines from both approximate SP representation and general graph representation learning (GRL). In the experiments, we also provide two variants of BCDR, i.e., BCDR-FC and BCDR-FQ, for accelerating the construction and querying process, respectively. A detailed description of the datasets, including statistics and visualization, is thoroughly provided in Appendix A.8.

4.1. PERFORMANCE OF APPROXIMATE SP DISTANCE QUERY

We compare BCDR with other learning-based SP representation methods (i.e., Orion (Xiaohan et al., 2010), Rigel (Xiaohan et al., 2011), DADL (Rizi et al., 2018), Path2Vec (Kutuzov et al., 2019), HALK (Schlötterer et al., 2019), and CatBoost-SDP (Jiang et al., 2021a)) as well as other approximate methods, including landmark-based (i.e., LS (Potamias et al., 2009)) and oracle-based (i.e., ADO (Thorup & Zwick, 2004)) techniques. All models are run with six 3.50GHz Intel Xeon(R) CPUs and 128GB memory, and the precomputed representations of each model are serialized with Pickle. Each baseline generally follows the default parameters discussed in its paper, with some trivial changes so that its performance can be evaluated in a unified way. The detailed parameter setups of each model are provided in Appendix A.9. Following previous work, we first compute all pairs of SP distances on each graph by BFS and uniformly sample 1,000,000 distance triplets {(v_a, v_b, D_ab)} as test samples. All baselines, including ours, are implemented purely in Python 3.9 and evaluated in the same environment. Since only unweighted graphs are considered, the outputs of each model are rounded to integers when evaluating accuracy loss. Part of the experimental results are shown in Table 2 (see Appendix A.10 for extended comparisons with GRL models). The table shows that our model not only outperforms previous models in accuracy loss on all graphs but also achieves competitive results on other metrics. In detail, for accuracy loss (mAE and mRE), BCDR answers arbitrary queries with the minimum error due to its wider exploration range of distance and distance-preserving optimization. Notably, the variants of BCDR without the boosting module (i.e., BCDR-FQ and BCDR-FC) also achieve the highest accuracy among RW-based learning approaches (i.e., DADL and HALK) with nearly the lowest storage cost.
For offline processing time (PT), memory usage (PMU), and storage cost (SC), the results show BCDR possesses powerful scalability against the growth of graph scale. Even for a graph with millions of nodes, the offline processing could be completed within several hours, and the memory usage is close to the graph size. This is because we perform BC-based random walks with a fixed length on each graph, and the size of walk data is further reduced by distance resampling. In addition, although CatBoost-SDP seems to achieve strong scalability on these metrics, we need to point out that this method does not learn any representation of nodes and completely optimizes all pairs of distance in a boosting way, which subsequently suffers higher time and space cost for online queries. For response time (RT) and memory usage (QMU) in querying, we see BCDR-FC and most other learning-based models share similar low memory overhead since each distance query could be answered by checking the node embeddings and graph adjacency matrices. Furthermore, BCDR-FQ and Orion could answer such a query within tens of nanoseconds due to the absence of double-checking on adjacency matrices. Besides, to evaluate the impact of critical components and hyper-parameters in BCDR, we further conduct an ablation study in Appendix A.11 where we discuss 6 different modifications to BCDR as well as an investigation on 9 critical parameters to show their impacts on different metrics.

4.2. EXPLORATION RANGE OF DISTANCE

As stated in Section 3.1, the exploration range of distance can be widened by the BC-based random walk (BC-RW), since it helps the walk jump out of local neighborhoods. Here, we compare BC-RW with renowned existing walk strategies, including the naive random walk (NRW) (Perozzi et al., 2014; Zhuang & Ma, 2018), the second-order random walk (SORW) (Grover & Leskovec, 2016), and random surfing (RS) (Cao et al., 2016). We also consider a DFS-like random walk (DFS-RW) as a strong baseline by setting a very small q in Node2Vec for deep exploration. The methods are tested on 6 synthetic graphs with diverse structures. We randomly sample 20 root nodes and, for each root, simulate 10 walks to record how many nodes at each distance are explored at every step of the walk. The ideal situation for rooted walk paths with a fixed length l is to cover nodes up to l hops away from the current root. The results on circle graphs are shown in Figure 4. They show that our proposed BC-RW is much more competitive in exploring a wider range of SP distance. Further results and analysis on different graph structures are presented in Appendix A.12.
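The exploration-range measurement can be sketched as below for the naive random walk on a circle graph (the paper's experiment additionally covers BC-RW, SORW, RS, and DFS-RW on six synthetic graphs; the function names and the 8-node cycle are our illustrative choices). The walk's reach at step k is the maximum SP distance from the root seen so far, averaged over walks:

```python
from collections import deque
import random

def bfs_distances(neighbors, root):
    """Exact SP distances from `root` on an unweighted graph (BFS)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def exploration_range(neighbors, root, length, n_walks, rng):
    """Average (over walks) of the max SP distance reached by each step."""
    dist = bfs_distances(neighbors, root)
    ranges = [0.0] * (length + 1)
    for _ in range(n_walks):
        node, best = root, 0
        for k in range(1, length + 1):
            node = rng.choice(neighbors[node])  # naive uniform transition
            best = max(best, dist[node])
            ranges[k] += best
    return [r / n_walks for r in ranges]

# 8-node cycle graph, in the spirit of the circle-graph experiments.
n = 8
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
avg_range = exploration_range(neighbors, 0, length=6,
                              n_walks=100, rng=random.Random(0))
```

On the cycle, the averaged reach grows clearly slower than one hop per step, illustrating Problem 1; replacing the uniform transition with a BC-biased one is the paper's proposed remedy.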

4.3. PRESERVATION OF DISTANCE RELATIONS

As discussed in Section 3.2, distance resampling is proposed to preserve accurate distance relations via implicitly decomposing an SP distance-based similarity matrix. Here, we show its interpretability for SP representations by visualizing the properties of the embedded vectors Z, compared with maximum likelihood optimization on other biased random walks. The environment configuration follows the previous section. First, we evaluate the relation between distance measured on the graph and in the embedding space for each node pair. Distance in the embedding space is measured by the inner product Z_i Z_j^T for given nodes v_i and v_j, and distance on the graph by SP distance. We learn embedded vectors from walk paths simulated by each walk strategy and randomly sample 100 source nodes with 100 destinations per source. The results on circle graphs are shown in Figure 5 (refer to Appendix A.13 for extended results on other graphs) and indicate that embeddings enhanced by distance resampling better maintain a linear relationship between the distance metrics of the original graph and the embedding space. The results also verify our analysis in Remark 2 regarding the relation between the SP distance-based similarity matrix D̂ and the distance matrix D. Second, we investigate how often the distance relation is violated in the embedding space, i.e., whether a pair of nodes with larger SP distance corresponds to lower embedded similarity, as described in Theorem 1. We randomly take 10,000 node triplets {(v_a, v_b, v_c)} and record whether Equation 8 is satisfied. The results on circle graphs are shown in Figure 5 (refer to Appendix A.13 for extended results on other graphs). The figure confirms that our model preserves distance relations much better than existing methods.
This is because BC-RW provides sufficient observation on each node by locating many remote nodes with a sequence of centralized nodes on a graph, and thus distance resampling based on such observation could preserve distance relations of each node within its exploration range.

5. CONCLUSION

In this paper, we propose a novel graph SP representation method called Betweenness Centrality-based Distance Resampling (BCDR) and discuss two significant techniques for an efficient and interpretable SP representation. The experimental evaluation indicates that BCDR improves approximation quality with shorter response time for SP distance queries and possesses strong scalability to large-scale and complex graphs. Notably, the node representations produced by our method also reflect highly efficient paths for high-order message passing in GNNs, which appears helpful for structural graph pooling and inference. We leave this for future work.

A.1.1 MOTIVATION FOR ESTIMATING ACCURATE SP DISTANCE

As an important global measurement on graphs, SP distance reflects the minimum travelling cost from node to node, similar to geodesic distance on manifolds. Along with the rapid emergence of large-scale graphs in many areas, space- and time-efficient estimation of accurate SP distance is urgently required in many downstream applications. In this part, we investigate the direct impact of SP distance estimation in different fields by discussing several real-world scenarios.
Case 1: find nearest points of interest in road and social networks. Points of interest (POIs) (Chen et al., 2021a) are specific point locations that someone may find useful or interesting, e.g., hotels, campsites, fuel stations, etc. A real road network may contain millions of nodes, while thousands of users may issue SP distance queries simultaneously to search for the nearest POI from their location, such as 'finding restaurants within 5 km' or 'ranking restaurant search results by distance'. To meet such demands, learning to answer SP distance queries accurately and quickly with limited computing resources is of high significance; in particular, using limited computing resources means the algorithm should be space- and time-efficient.
Less storage overhead enables the representations to be stored on users' mobile devices rather than computing SP centrally on a server, and less query time ensures that SP distance can be computed in real time (since some POIs may change position frequently over time).
Case 2: construct skeleton graphs from meshes for 3D animation. In the 3D animation literature, animating an articulated character requires constructing a skeleton graph to control the movement of the surface, i.e., placing the skeleton joints inside the character and specifying which parts of the surface attach to which bone. A critical technique (Aujay et al., 2007; Poirier & Paquette, 2009) to automatically embed a skeleton into a character relies on computing a harmonic function under the SP metric on mesh graphs. This requires finding a group of nodes that locally maximize SP distance with a user-defined node. Since the mesh of a delicately described character may have tens or hundreds of vertices, estimating and finding such nodes with the longest SP distance accurately and quickly is also well motivated.
Case 3: estimate latencies in communication networks. In large-scale communication networks, the latency between Internet hosts is defined as a round-trip measurement from one to another (i.e., SP distance), which is utilized for performance optimization in many network applications such as content distribution networks (Ratnasamy et al., 2002), multicast systems (Nogueira, 2014), distributed file systems (Rhea et al., 2003), etc.

A.1.2 APPROXIMATE SP REPRESENTATION

Hard-coding Perspective. Compared with exact SP representations that improve query speed at the expense of huge storage costs, approximate methods are designed to find compact and scalable representations with high performance in both time and space. The basic idea of these methods is to reduce the complexity of SP distance matrices. Thorup and Zwick (Thorup & Zwick, 2004) first observe that hierarchical sparse sampling of nodes can significantly reduce the number of elements in the distance matrix, so that all-pair SP distances are approximately represented by the distance relations among the sampled nodes with bounded error. They also provide a time-efficient algorithm to compute the pruned distance relations. Several later extensions (Baswana & Kavitha, 2006; Enachescu et al., 2008; Chen et al., 2009) improve the processing time and space on specific graphs. However, these methods still have limitations in space complexity and accuracy. First, the sampled distance relations take O(α_0 N^{1+α_0}) space, which is not linear in the number of nodes N, inducing scalability problems on large graphs. Second, the bounded error is often unacceptable on graphs with small diameters, since even the most accurate model (with α_0 = 2) permits an error of up to three times the real distance. To address these issues, landmark-based distance estimation methods (Potamias et al., 2009; Gubichev et al., 2010; Sarma et al., 2010) were proposed. Instead of sampling hierarchical sets of nodes, landmark-based methods preserve only the distance relations between a fixed set of nodes (called landmarks) and all others on the graph, and all-pair SP distances can then be bounded via the triangle inequality (Zheng et al., 2005; Lee et al., 2006; Mao et al., 2006), i.e., for any nodes v_a and v_b,

max_{v_l ∈ L} |D_al − D_lb| ≤ D_ab ≤ min_{v_l ∈ L} (D_al + D_lb)    (9)

where D_ab denotes the SP distance between v_a and v_b, and L denotes the set of landmarks.
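The triangle-inequality bounds of Equation 9 can be sketched directly; in the toy below (function names and the 6-node path graph are our illustrative choices, and BFS distances are recomputed per query rather than precomputed as real systems would):

```python
from collections import deque

def bfs_distances(neighbors, root):
    """Exact SP distances from `root` on an unweighted graph (BFS)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def landmark_bounds(neighbors, landmarks, a, b):
    """Lower/upper bounds on D_ab from landmark distances (Equation 9)."""
    lo, hi = 0, float("inf")
    for l in landmarks:
        d = bfs_distances(neighbors, l)
        lo = max(lo, abs(d[a] - d[b]))
        hi = min(hi, d[a] + d[b])
    return lo, hi

# 6-node path 0-1-2-3-4-5 with the single landmark {0}.
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
lo, hi = landmark_bounds(neighbors, [0], 1, 4)
# D_14 = 3; the landmark gives bounds |1 - 4| = 3 and 1 + 4 = 5.
```

The example also illustrates the limitation discussed above: with a single endpoint landmark the upper bound is loose, and tighter bounds require landmarks covering more shortest paths.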
The average accuracy can be improved by selecting proper landmarks that cover as many SPs as possible. Unfortunately, finding the optimal finite set of landmarks of minimal size has been proved NP-hard, as it maps to a set-cover problem (Balas, 1982). Therefore, several heuristic selection strategies have been discussed to tighten Equation 9 by pushing the estimate of D_ab close to its upper bound (Potamias et al., 2009). Other efforts (Gubichev et al., 2010; Tretyakov et al., 2011) store SP trees for each landmark instead of distances, at the cost of extra storage and response time. Nevertheless, the approximation performance of these models relies heavily on graph structure, since less-centralized graphs (e.g., a grid-like graph) and graphs of large diameter (e.g., a large planar graph) require a large number of distributed landmarks to cover remote pairs of nodes. Learning Perspective Instead of the hard-coding techniques mentioned above, our work steps forward from a learning perspective of SP distance estimation, which constructs general and scalable representations for arbitrary graphs. Under the low-rank assumption on SP distance matrices, the basic idea of learning-based methods is to transform the graph into a metric space while preserving the distance between pairs of nodes. As the embedding space is low-dimensional and continuous, extracting distances from learning-based SP representations is fast and scalable. However, directly optimizing the distance between all pairs is time-consuming, taking at least O(N^2) time for computing distances and for the subsequent optimization. Towards this, many graph coordinate systems (Ng & Zhang, 2002; Tang & Crovella, 2003; Costa et al., 2004; Dabek et al., 2004; Ng & Zhang, 2004) have been studied in the past years. To reduce processing complexity, a feasible learning procedure for very large graphs, later proposed in Orion (Xiaohan et al., 2010), consists of three steps.
First, perform breadth-first search (BFS) from a small landmark set L and record node pairs as well as their distances as training triplets {⟨v_l, v_a, D_la⟩} with v_l ∈ L, v_a ∈ V. Second, create a graph coordinate system M by preserving the distance relations among nodes in L, i.e.,

arg min_{L^M = {v_i^M | v_i ∈ L}} Σ_{v_i, v_j ∈ L} |D_{ij}^M − D_{ij}|,  (10)

where v_i^M denotes the embedded vector corresponding to the node v_i, and D_{ij}^M denotes the geodesic distance between v_i^M and v_j^M measured on M. Third, fix L^M and calibrate the distances between the other nodes and the landmarks iteratively. Among these steps, the metric tensor defined on M significantly affects the accuracy of distance estimation, and models embedding in Euclidean space (Rao, 1999; Lee, 2009; Xiaohan et al., 2010) and hyperbolic space (Shavitt & Tankel, 2008; Xiaohan et al., 2011) have been well studied, respectively. Inspired by the great success of graph representation learning (GRL), further work (Rizi et al., 2018; Schlötterer et al., 2019), including ours, treats M as an agnostic but definite manifold learned by GRL techniques and estimates distance based on learnable metric criteria (usually a multi-layer perceptron). Therefore, the learning task here is converted from "calibrate the position of each node" to "learn powerful metric criteria to extract distance everywhere." This paradigm achieves higher accuracy with reduced training time, even though it is unclear whether general GRL models can embed sufficient information to infer all pairs of SPs. In this paper, we thus discuss an interpretable SP representation learning method and improve the comprehensive performance of SP distance estimation.
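The first two steps above can be sketched as follows: BFS triplets from landmarks, then a stress-minimizing Euclidean embedding of the landmark set in the spirit of Equation 10 (with a squared rather than absolute deviation for a smooth gradient). The graph, step sizes, and function names are our own toy assumptions, not Orion's actual implementation.

```python
# Toy sketch of the Orion-style procedure: (1) BFS distances from
# landmarks, (2) gradient descent on the landmark stress objective.
import numpy as np
from collections import deque

def bfs(adj, s):
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def stress(X, D):
    """Sum over pairs of (||X_i - X_j|| - D_ij)^2."""
    s = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            s += (np.linalg.norm(X[i] - X[j]) - D[i, j]) ** 2
    return s

def fit_landmarks(D, dim=2, iters=800, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(len(D), dim))
    for _ in range(iters):
        G = np.zeros_like(X)
        for i in range(len(X)):
            for j in range(len(X)):
                if i != j:
                    diff = X[i] - X[j]
                    d = np.linalg.norm(diff) + 1e-9
                    G[i] += 2 * (d - D[i, j]) * diff / d
        X -= lr * G
    return X

# Toy path graph 0-1-2; all three nodes serve as landmarks here
adj = {0: [1], 1: [0, 2], 2: [1]}
L = [0, 1, 2]
D = np.array([[bfs(adj, a)[b] for b in L] for a in L], dtype=float)
X = fit_landmarks(D)
```

After fitting, the landmark stress is much lower than at random initialization; the third (calibration) step would then fix these coordinates and place the remaining nodes against them.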

A.1.3 GRAPH REPRESENTATION LEARNING

Graph representation learning (GRL) organizes symbolic objects (such as nodes, edges, and clusters) so that their similarities on the graph are well preserved in a low-dimensional embedding space. Currently, most of these methods focus on preserving arbitrary linkage on graphs by considering high-order adjacency matrices, and they fall into matrix factorization (MF) and random walk (RW) approaches. Our work shares strong connections with general RW approaches, which embed the correlation of remote nodes within linear complexity, in contrast to MF methods. The basic idea of RW-based learning methods, proposed in DeepWalk (Perozzi et al., 2014), is to condense the complicated linkage structure of a graph into a set of fixed-length node sequences in a statistical view and to learn node embeddings that reflect their co-occurrence on walk paths using the skip-gram algorithm (Mikolov et al., 2013a;b). The learning process solves a maximum likelihood optimization problem based on the observed sequences, i.e., for any nodes v_a and v_b,

arg max_{Z_a, Z_b} log σ(Z_a Z_b^T) + λ E_{v_k ∼ P_n(V)} [log σ(−Z_a Z_k^T)],

where Z_a denotes the embedded vector of v_a, λ denotes the number of negative samples, σ(·) denotes the sigmoid function σ(x) = (1 + e^{−x})^{−1}, and P_n(·) denotes the probability distribution of negative sampling. Several practical strategies have been proposed in the past few years to simulate structure-aware traversal on graphs in RW-based methods (Grover & Leskovec, 2016; Cao et al., 2016; Perozzi et al., 2017; Chen et al., 2018). In detail, to enhance sensitivity to divergent structures, Node2Vec (Grover & Leskovec, 2016) exploits a biased random walk strategy to perform combinatorial traversal on graphs, mixing breadth-first search (BFS) and depth-first search (DFS), which explores local-neighborhood linkage and correlation with remote nodes simultaneously.
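The skip-gram objective above can be illustrated with a minimal gradient-ascent update on a co-occurring pair plus negative samples; the embedding size, learning rate, and node indices are toy assumptions.

```python
# Minimal numpy sketch of one skip-gram update with negative sampling,
# mirroring the objective above. Dimensions and rates are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(Z, a, b, neg, lr=0.1):
    """One ascent step on log σ(Z_a·Z_b) + Σ_k log σ(-Z_a·Z_k)."""
    g = 1.0 - sigmoid(Z[a] @ Z[b])       # positive-pair gradient scale
    grad_a = g * Z[b]
    grad_b = g * Z[a]
    for k in neg:
        s = sigmoid(Z[a] @ Z[k])
        grad_a -= s * Z[k]               # push negatives away from Z_a
        Z[k] -= lr * s * Z[a]
    Z[a] += lr * grad_a
    Z[b] += lr * grad_b

rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(5, 8))   # 5 toy nodes, dimension 8
before = sigmoid(Z[0] @ Z[1])
for _ in range(50):
    sgns_update(Z, a=0, b=1, neg=[3, 4])
after = sigmoid(Z[0] @ Z[1])
print(after > before)                    # co-occurring pair drawn together
```

Repeated updates increase the predicted co-occurrence probability of the positive pair while pushing negatively sampled nodes away, which is exactly the behavior the objective formalizes.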
To reflect the locality around each node, a random walk with a restarting mechanism (called random surfing) is applied to learn pointwise mutual information (PMI) representations (Cao et al., 2016). To capture multi-scale representations of different-order neighborhoods, hierarchical random walks that skip some of the nodes on paths have also been proposed (Perozzi et al., 2017; Chen et al., 2018). Recently, Schlötterer et al. (2019) observed that RW-based methods perform better than others in exploring a wide range of distances and evaluated these methods as helpful for SP distance estimation. However, a specific and insightful investigation of RW-based SP representation remains open. In this paper, we discuss a novel biased random walk strategy toward high-order SP exploration and provide an explicit optimization algorithm for distance-preserving representation.

A.2 EFFICIENT IMPLEMENTATION OF BCDR ALGORITHM

We discuss here an efficient implementation integrating the techniques mentioned in Sections 3.1 and 3.2. Our algorithm is presented in Algorithm 1, covering both the construction of SP representations and the answering of online distance queries. The description and analysis of these procedures are provided as follows.

3:  approximate BCs γ := dict{v_i : 0, ∀ v_i ∈ V}
4:  for each landmark v_i ∈ L do
5:      γ[v_i] ← 1
6:      for each node v_j ∈ V reached by BFS from v_i do
7:          append (v_i, v_j, D_ij) to T
8:          γ[v_j] ← γ[v_j] + 1
    ...
20:     P_c^k := dict{}
21:     for v_j ∈ N_c ∧ v_j ∉ S_i do
22:         P_c^k[v_j] ← γ[v_j] × (2 − tanh(ζ − B[v_j]))
23:         if v_j ∈ D_i then
24:             D_i[v_j] ← min{D_i[v_j], D_i[v_c] + 1}
25:         else
26:             D_i[v_j] ← D_i[v_c] + 1
27:         end
28:     end
29:     sample next v_n from the normalized P_c^k
30:     v_c ← v_n, S_i ← S_i ∪ {v_n}
31:     c_i ← c_i + 1
32:     B[v_n] ← B[v_n] + 1
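A hedged sketch of the BC-biased transition in lines 20–32 of Algorithm 1 is given below: candidate neighbors are weighted by their approximate betweenness centrality and a visit-count decay 2 − tanh(ζ − B[v]). The toy graph, the BC estimates, and the exact weighting details are our own illustrative assumptions and may differ from the paper's listing.

```python
# Sketch of one BC-biased walk (cf. Algorithm 1, lines 20-32).
import math
import random

def bc_walk(adj, gamma, start, length, zeta=10.0, seed=0):
    """Simulate one walk of `length` nodes, biased by BC estimates."""
    rng = random.Random(seed)
    B = {v: 0 for v in adj}            # visit counters along this walk
    path, cur = [start], start
    for _ in range(length - 1):
        cand = adj[cur]
        if not cand:
            break
        # weight candidates by BC times the visit-count decay term
        w = [gamma[v] * (2 - math.tanh(zeta - B[v])) for v in cand]
        cur = rng.choices(cand, weights=w, k=1)[0]
        B[cur] += 1
        path.append(cur)
    return path

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
gamma = {0: 1, 1: 3, 2: 1, 3: 2, 4: 1}   # toy BC estimates
print(bc_walk(adj, gamma, start=0, length=6))
```

Because high-BC nodes such as 1 and 3 receive larger transition weights, the walk is more likely to cross the bridge toward node 4 than a uniform random walk would be.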

A.2.2 LEARNING EMBEDDINGS & DISTANCE PREDICTOR

To learn node embeddings Z on SP structures (lines 37 to 49), we initially sample a group of BC walk paths as distance maps {D_i} (line 39), and then resample from them to feed the skip-gram algorithm while accounting for distance decay and BC (lines 42 and 45), which preserves the distance relations discussed in Theorem 1. Besides, the resampling process also reshapes the observed node sequences from w_in × l_in to w_out × l_out. Let β = (w_in l_in)/(w_out l_out) > 1; this property reduces the actual time cost of learning embeddings to 1/β of that of learning directly on the original paths. For learning the distance predictor (line 37, and lines 50 to 54), we utilize a two-layer fully connected neural network, which takes the concatenation of two nodes' embeddings as input and outputs a scalar indicating the distance (line 50). The predictor is learned from the distance triplets T by minimizing the mean squared error (MSE) between the predicted value and the real distance using stochastic gradient descent (SGD). Finally, the parameters of the neural network φ and the node embeddings Z are stored as the graph's SP representations. To answer an online query (v_a, v_b), we additionally check whether (v_a, v_b) is an edge in the graph. Since this practice prolongs the total response time, we also preserve a fast version (BCDR-FQ) without neighbor searching for some potential applications.
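An illustrative stand-in for the distance predictor described above is sketched below: a two-layer fully connected network on concatenated node embeddings, trained with MSE and plain SGD. The sizes, the toy triplets, and the manual backward pass are all assumptions for demonstration, not the paper's exact model.

```python
# Toy two-layer distance predictor on concatenated embeddings (MSE + SGD).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # embedding size
Z = rng.normal(scale=0.5, size=(20, d))       # toy node embeddings
W1 = rng.normal(scale=0.3, size=(2 * d, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.3, size=(16, 1));     b2 = np.zeros(1)

def forward(a, b):
    x = np.concatenate([Z[a], Z[b]])          # concatenated embeddings
    h = np.maximum(0.0, x @ W1 + b1)          # ReLU hidden layer
    return h @ W2[:, 0] + b2[0], x, h

def sgd_step(a, b, target, lr=0.01):
    global W1, b1, W2, b2
    y, x, h = forward(a, b)
    g = 2.0 * (y - target)                    # d(MSE)/dy
    gh = g * W2[:, 0] * (h > 0)               # backprop through ReLU
    W2 -= lr * np.outer(h, [g]); b2 -= lr * g
    W1 -= lr * np.outer(x, gh);  b1 -= lr * gh

# Toy distance triplets (landmark, node, distance)
triplets = [(0, i, (i % 5) + 1) for i in range(1, 20)]
mse = lambda: np.mean([(forward(a, b)[0] - t) ** 2 for a, b, t in triplets])
loss_before = mse()
for _ in range(200):
    for a, b, t in triplets:
        sgd_step(a, b, t)
loss_after = mse()
print(loss_after < loss_before)
```

Even this tiny network drives the MSE down on the toy triplets, mirroring how the paper's predictor fits distances to landmarks at the offline stage.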

A.2.4 PARALLELISM

The implementation of BCDR is highly parallelizable. In detail, the construction of SP representations can be divided into three parts: simulating distance triplets, performing the BC-based random walk, and training the embeddings together with the distance predictor. First, the BFS from each landmark can be parallelized at thread level up to the size of the landmark set. Second, the BC walk paths from different roots can be simulated simultaneously. Third, the training process in the skip-gram algorithm and the neural network can be locally parallelized via matrix computations.

A.3 PROOF OF THEOREM 2

Theorem 2 states that for h > 1,

P_B(N_a^{(h)} → N_a^{(h+1)}) / P_R(N_a^{(h)} → N_a^{(h+1)}) = 1 + B(k) + C,  (12)

lim_{k → E(v_a)−1−h} [B(k) + C] = (A_2 − 1)/A_1 + C > 0,  (13)

where A_1 = |e_h(v_a)|/|f_h(v_a)|, A_2 = 1/P_R(f_h(v_a) → e_h(v_a)), and C ≥ 0.

Proof. We simplify the symbols N_a^{(h)}, N_a, e_h(v_a), f_h(v_a) as N_h, N_h, e_h, f_h for short, and define the forward and backward reachable counts

→N_h = #(∪_{i=h}^{min{E(v_a), h+k−1}} {v_j | v_j ∈ N_a^{(i)}}),
←N_h = #(∪_{i=max{0, h−k+1}}^{h} {v_j | v_j ∈ N_a^{(i)}}),

where #(·) is a counting function indicating the number of occurrences of the specified nodes, i.e., the cardinality of a sampled set. According to the definition in Theorem 2, for general random walks the choice of destination is based on uniform sampling, so N_h = |e_h| + |f_h| and

P_R(e_h → e_{h+1} | e_h → N_{h+1}) = |e_{h+1}| / N_{h+1}.

For the BC-based random walk, we first calculate the BC values of the nodes in e_{h+1} and f_{h+1}. Let BC(e_{h+1}) and BC(f_{h+1}) denote these values; the corresponding legal SP counts come from four sources: {←N_h → →N_{h+2}}, {←N_h → N_{h+1}}, {N_{h+1} → →N_{h+2}}, and {N_{h+1} → N_{h+1}}. Using BC({· → ·}) for the BC gain from each source,

BC({←N_h → →N_{h+2}}) = ←N_h · →N_{h+2},
BC({←N_h → N_{h+1}}) = ←N_h (|f_{h+1}| · 0 + |e_{h+1}| β_e^{(1)}),
BC({N_{h+1} → →N_{h+2}}) = →N_{h+2} (|f_{h+1}| · 1 + |e_{h+1}| β_e^{(1)}),
BC({N_{h+1} → N_{h+1}}) = N_{h+1}^2 β_e^{(2)},

where β_e^{(1)} denotes the average BC gain between nodes in e_{h+1} and →N_{h+2}, and β_e^{(2)} denotes the average BC gain between pairs of nodes within e_{h+1}; both are constants related to G. Therefore, we have

BC(e_{h+1}) = ←N_h →N_{h+2} + (←N_h + →N_{h+2}) |e_{h+1}| β_e^{(1)} + →N_{h+2} |f_{h+1}| + N_{h+1}^2 β_e^{(2)}.  (17)

Likewise, we calculate

BC(f_{h+1}) = ←N_h →N_{h+2} · 0 + ←N_h (|f_{h+1}| β_f^{(1)} + |e_{h+1}| · 0) + →N_{h+2} (|f_{h+1}| β_f^{(1)} + |e_{h+1}| · 0) + N_{h+1}^2 β_f^{(2)}
           = (←N_h + →N_{h+2}) |f_{h+1}| β_f^{(1)} + N_{h+1}^2 β_f^{(2)}.  (18)

Comparing the two,

BC(f_{h+1}) / BC(e_{h+1}) = [(←N_h + →N_{h+2}) |f_{h+1}| β_f^{(1)} + N_{h+1}^2 β_f^{(2)}] / [←N_h →N_{h+2} + (←N_h + →N_{h+2}) |e_{h+1}| β_e^{(1)} + →N_{h+2} |f_{h+1}| + N_{h+1}^2 β_e^{(2)}].

Note that for any N_j and N^{(x,y)} = →N_0 − ←N_x − →N_y with j, x, y ∈ {0, E(v_a)},

lim_{k → E(v_a)−1−h} N_j / N^{(x,y)} = lim_{k → E(v_a)−1−h} ε(k) = 0.  (20)

The above ratio is thus reduced to

BC(f_{h+1}) / BC(e_{h+1}) = [2 ε(k) β_f^{(1)} + ε(k^2) β_f^{(2)}] / [1 + 2 ε(k) |e_{h+1}| β_e^{(1)} + ε(k) |f_{h+1}| + ε(k^2) β_e^{(2)}] = 2 β_f^{(1)} ε(k).

Then, performing weighted random sampling based on BC gives

P_B(e_h → e_{h+1} | e_h → N_{h+1}) = |e_{h+1}| / (|e_{h+1}| + 2 |f_{h+1}| β_f^{(1)} ε(k)).

Now, we consider the relation between P_R(N_h → N_{h+1}) and P_B(N_h → N_{h+1}):

P_B(N_h → N_{h+1}) / P_R(N_h → N_{h+1})
= [P_B(e_h) P_B(e_h → N_{h+1})] / [P_R(e_h) P_R(e_h → N_{h+1})]
= [P_B(N_{h−1} → e_h) + P_B(N_{h−1} → f_h) P_B(f_h → e_h)] / [P_R(N_{h−1} → e_h) + P_R(N_{h−1} → f_h) P_R(f_h → e_h)]
= 1 + B(k) + C,  (23)

where

B(k) = |f_h| [1 − 2 |f_h| β_f^{(1)} ε(k)] [1 − P_R(f_h → e_h)] / {(1 + |f_h|/|e_h|) [|e_h| + 2 |f_h| β_f^{(1)} ε(k)] P_R(f_h → e_h)},
C = P_R(f_h → e_h) / [P_R(N_{h−1} → e_h) + P_R(N_{h−1} → f_h) P_R(f_h → e_h)].

Since C is non-negative and independent of k, we finally get

P_B(N_h → N_{h+1}) / P_R(N_h → N_{h+1}) = 1 + B(k) + C, where lim_{k → E(v_a)−1−h} [B(k) + C] = (A_2 − 1)/A_1 + C.

Remark 1. Equations 12 and 13 in Theorem 2 give definite conditions under which the BC-based random walk travels further than a uniform random walk. Since nodes in v_a's h-order neighborhood N_a^{(h)} ... On the other side, even if nodes in f_h(v_a) have fewer links to those in e_h(v_a) and A_2 thus increases, the BC-based random walk tends to transit between f_h(v_a) and e_h(v_a) instead of looping within f_h(v_a) and N_a^{(h−1)}.

A.4 PROOF OF PROPOSITION 1

Proof. In Proposition 1, we optimize an approximate objective L̃_n instead of L_n, i.e.,

L̃_n(Z) = Σ_{v_i ∈ V} P(v_i) ( E_{v_j ∼ Q̃_{W_i}(V)}[log σ(Z_i Z_j^T)] + λ E_{v_k ∼ P_n(V)}[log σ(−Z_i Z_k^T)] ).  (26)

Here, we rewrite the negative sampling term of Equation 26 as

E_{v_k ∼ P_n(W_i)}[log σ(−Z_i Z_k^T)] = Σ_{v_k ∈ W_i} P_n(v_k | v_i) log σ(−Z_i Z_k^T)
= P_n(v_j | v_i) log σ(−Z_i Z_j^T) + Σ_{v_k ∈ W_i ∖ {v_j}} P_n(v_k | v_i) log σ(−Z_i Z_k^T).

Then, for each pair of v_i ∈ V and v_j ∈ W_i, we obtain an independent objective by combining similar terms from the total likelihood L̃_n, and reach

L̃_n(Z) = Σ_{v_i ∈ V} Σ_{v_j ∈ W_i} L(Z_i, Z_j),
L(Z_i, Z_j) = P(v_i, v_j) log σ(Z_i Z_j^T) + λ P(v_i) P_n(v_j | v_i) log σ(−Z_i Z_j^T).

Let D̃ = Z Z^T. For each node pair (v_a, v_b) with SP distance D_ab, consider

D̃_ab = Z_a Z_b^T = arg max_{Z_a, Z_b} L(Z_a, Z_b).

Remember that Q̃_{W_i} is a distribution resampled from W_i, as an efficient approximation to P_{W_i}, i.e.,

Q̃_{W_i}(v_j) = α^{D_ij} BC(v_j) / Σ_{v_k ∈ W_i} α^{D_ik} BC(v_k).

Setting the derivative of L with respect to D̃_ab to zero yields

∂L/∂D̃_ab = P(v_a) α^{D_ab} γ_b σ(−D̃_ab) − (λ/κ(v_a)) P(v_a) γ_b σ(D̃_ab) = 0.

After some simplification, we get

D̃_ab = D_ab log α − log(λ/κ(v_a)).

This reveals that the significance of distance resampling is to preserve the SP distance relations between nodes that are well observed on given arbitrary walk paths. Besides, although the similarity matrix D̃ cannot directly tell the absolute distance, it shares similar properties with D: if we fix v_a as the source node of a path, the comparison between the similarities of (v_a, v_b) and (v_a, v_c) reflects the SP distance relation between them. This practical property is discussed in Theorem 1.
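The resampling distribution Q̃_{W_i} above can be sketched concretely: nodes observed on a walk W_i are re-drawn with probability proportional to α^{D_ij} · BC(v_j). The walk, distances, and BC values below are toy inputs.

```python
# Sketch of the distance-resampling distribution Q̃_{W_i}:
# weight each observed node by alpha^distance times its BC estimate.
import random

def resample(walk_nodes, dist, bc, alpha=0.35, k=5, seed=0):
    rng = random.Random(seed)
    weights = [alpha ** dist[v] * bc[v] for v in walk_nodes]
    return rng.choices(walk_nodes, weights=weights, k=k)

walk = [1, 2, 3, 4]
dist = {1: 1, 2: 2, 3: 3, 4: 4}     # hop distance from the walk root
bc = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
print(resample(walk, dist, bc))      # nearer nodes dominate the sample
```

With α = 0.35 and equal BC values, a node at distance 1 is drawn roughly 23 times as often as one at distance 4, which is how the decay α^{D_ij} implicitly encodes the distance relations in the resampled sequences.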

A.6 PROOF OF THEOREM 1

Proof. In terms of the node pair (v_a, v_b), as proved in Proposition 1, their similarity D̃_ab in the embedding space varies linearly with the SP distance D_ab on the graph, i.e.,

D̃_ab = −log ε + D_ab log α.  (37)

Likewise, for (v_a, v_c), there holds

D̃_ac = −log ε + D_ac log α,

where 0 < α < 1 and ε relies on W_a, which is independent of v_b and v_c. Then, considering the distance relation of v_a, v_b, and v_c, there holds

(D_ab − D_ac)(D̃_ab − D̃_ac) = (D_ab − D_ac)(D_ab − D_ac) log α = log α · (D_ab − D_ac)^2 ≤ 0.
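The inequality above can be checked numerically: with 0 < α < 1, the map D̃ = −log ε + D log α strictly inverts the distance ordering, so the product is non-positive for every pair. The values of α and ε below are arbitrary positive choices for illustration.

```python
# Numeric illustration of Theorem 1's ordering property.
import math

alpha, eps = 0.35, 0.2

def sim(d):
    """Embedding-space similarity as a function of SP distance."""
    return -math.log(eps) + d * math.log(alpha)

# Check (D_ab - D_ac)(sim_ab - sim_ac) <= 0 over a grid of distances
for d_ab in range(1, 6):
    for d_ac in range(1, 6):
        assert (d_ab - d_ac) * (sim(d_ab) - sim(d_ac)) <= 0
print("ordering inverted for all pairs")
```

Because log α < 0, larger SP distances always map to smaller similarities, which is exactly the relation the learned predictor exploits.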

A.7 MOTIVATION OF BCDR PROCEDURE AGAINST DIRECTLY SAMPLING SPS

We further clarify in this section the motivation for leveraging BCDR instead of directly sampling SPs. As a prerequisite, it should be acknowledged that we need sampled SPs as observations to optimize the node embeddings Z. However, performing sufficient observation of all pairs of shortest paths is time-consuming, taking at least O(N^2) time on sparse unweighted graphs. Towards this, an intuitive idea is to sample a limited number of paths that start only at a few nodes (landmarks). But this introduces a strong bias toward the landmarks and ignores many shortest paths far away from them. To alleviate this bias, BCDR aims to observe shortest paths rooted at all nodes of the graph (instead of the landmarks only). Therefore, we need strategies to overcome the huge complexity of directly sampling these paths (which would require performing BFS from every node). The proposed strategy is the BC-based random walk, where we equip the random walk with awareness of high-order SP structure so that the sampled walk paths are much more likely to be actual shortest paths. This strategy is comparatively efficient, since the sampling complexity is proportional to the path length l. The subsequent module, DR, then resamples from these paths to implicitly preserve SP distance relations in Z. According to the above discussion, a brief procedure of BCDR with its motivation can be summarized as follows.
• Estimate BC by BFS from only a few nodes (landmarks). Motivation: determine which nodes are prone to trigger high-order explorations of SP distances.
• Perform the BC-based random walk. Motivation: sufficiently observe the potential shortest paths rooted at each node.
• Leverage DR to resample approximate random shortest paths. Motivation: implicitly preserve the distance relations on the observed paths.
• Optimize Z from the observation of the resampled paths. Motivation: reflect the SP structure of the graph instead of arbitrary linkage.
Each step above possesses linear complexity with respect to N (the number of nodes in the graph). Besides, we are convinced of the necessity of the BCDR procedure and explain it carefully from both technical and empirical perspectives. From a technical perspective, directly leveraging shortest paths as observations to optimize Z has a few shortcomings.
• Prohibitive Complexity of Sufficient Observation. Observing all pairs of SP distance requires at least O(N^2) time for sparse unweighted graphs. Alternatively, an insufficient observation with linear complexity causes a loss in accuracy (see the experimental results below).
• Inflexible Path Length for Optimization. Since we leverage the skip-gram algorithm for optimizing Z, the sliding window size that serves to reconstruct the distance relations between nodes must be fixed in advance. But shortest paths rooted at a given node in fact possess significantly divergent lengths, which makes it difficult to determine a proper sliding window size across different paths: a longer window helps capture long-distance correlation but becomes indistinguishable on shorter paths, and vice versa. Alternatively, if we only select shortest paths of a certain fixed length, paths shorter than this length are ignored, thus impairing performance.
Correspondingly, the BCDR procedure overcomes the above shortcomings as follows.
• Linear Complexity of Such Observation. Instead of directly simulating shortest paths, we sample paths by the BC-based random walk and transform them into approximate random shortest paths by DR. Both of these operations have linear time complexity. Also, the optimization on such resampled paths is proved to share similar properties with that on real shortest paths by Proposition 1 and Theorem 2.
• Flexible Path Length for Optimization. Since the paths are resampled from random paths, their number and length (i.e., w_out, l_out) can be customized.
We are thus able to fix them at a certain proper length for the subsequent optimization. From an empirical perspective, we further construct and evaluate 6 competitive baselines that share the same architecture and hyper-parameters as BCDR but adopt different intuitive strategies to optimize directly on shortest paths. These baselines are described as follows.
• Shortest Paths on Landmarks only (SPoL). Since we must in any case perform BFS from the landmarks to acquire distance triplets for learning the distance predictor, we intuitively retrieve the shortest paths starting from the landmarks. This operation introduces little extra time cost. The size of the landmark set is the same as in BCDR (i.e., |L| = 80).
• Shortest Paths on Landmarks only with Fixed Length (SPoL-F). This is similar to SPoL but restricts the output walk length to a certain level (the same as BCDR, i.e., l_out = 10).
• Shortest Paths on All Nodes (SPoN). In BCDR, we perform the BC random walk from each node v_a to locate its position on the graph. Here, we instead directly sample shortest paths from v_a to other nodes. Specifically, for each source node v_a, we uniformly sample destination nodes over V and retrieve the shortest paths between them. The number and maximum length of shortest paths per node are the same as in BCDR (i.e., w_out = 40, l_out = 10).
• Shortest Paths on All Nodes with Fixed Length (SPoN-F). This is similar to SPoN but restricts the output walk length (the same as BCDR, i.e., l_out = 10).
• Shortest Paths on Arbitrary Node Pairs (SPoANP). We randomly select a group of node pairs (v_s, v_t) and retrieve one of the shortest paths between them by BFS. The number of paths is the same as the total number of walk paths over all nodes in BCDR (i.e., N × w_out).
• Shortest Paths on Arbitrary Node Pairs with Fixed Length (SPoANP-F). This is similar to SPoANP but restricts the output walk length (the same as BCDR, i.e., l_out = 10).
All of the above baselines are evaluated on the GrQc dataset, and the experimental results are presented in Table 3. We see from the table that BCDR outperforms all the baselines in approximation quality (i.e., mAE and mRE) within reasonable time. Specifically, SPoL has desirable pre-processing time, since only the shortest paths rooted at landmarks are considered, but it suffers from insufficient observation of the shortest paths that do not pass through the landmarks. SPoN and SPoANP incur huge complexity when retrieving shortest paths on the whole graph, and they perform even worse due to the uncertainty of a reasonable sliding window size. From SPoL-F, SPoN-F, and SPoANP-F, we see that even if the path length is fixed, the uncaptured shorter paths still cause a loss in accuracy.

A.8 DATASETS

To thoroughly evaluate our proposed method, we conduct experiments on real-world and synthetic graphs with divergent properties in size, structure, diameter, etc. Specifically, the real-world graphs are extracted from the Stanford Large Network Dataset Collection (Leskovec & Krevl, 2014), and the synthetic graphs are simulated according to the rules described in A.8.6. The visualization of each graph is illustrated in Figure 6, and the corresponding statistics are presented in Table 4. In the experiments, we show the efficiency and scalability of BCDR on real-world graphs of different sizes and test on smaller synthetic graphs with different structures for further analysis of exploration range and distance preservation. Brief descriptions of these graphs follow.

A.8.1 CORA

This is a graph that describes the citation relationships of papers, containing 2708 nodes and 10556 directed edges. Each node also has a predefined feature vector with 1433 dimensions.

A.8.2 FACEBOOK

This is a graph that describes the relationship among Facebook users by their social circles (or friend lists), which is collected from a group of test users. Facebook has also encoded each user with a reindexed user ID to protect their privacy.

A.8.3 GRQC

This is a graph recorded from the e-print arXiv in the period from January 1993 to April 2003, which represents co-authorship based on submissions. Each undirected edge (v_i, v_j) indicates that author v_i co-authored a paper with author v_j. If a paper has k authors, a complete subgraph on the k corresponding nodes is generated.

A.8.4 DBLP

This is a graph collected as a computer science bibliography that provides a comprehensive list of research papers in computer science. As an undirected collaboration network, each edge connects two authors who have written at least one paper together.

A.8.6 SYNTHETIC GRAPHS

We also construct smaller graphs reflecting one or more of the typical sub-structures that frequently occur in complex graphs. The simulation rules of each graph are listed as follows.
• Circle Graph (CG): a graph that contains several circles of different sizes. The simulation of circle graphs takes an iterative process where, for each newly introduced circle, a limited number of nodes (called exit nodes) connect to the previous circles.
• Tri-circle Graph (TCG): a graph that combines the properties of circle graphs and triangle graphs. Here, each circle is simulated by connecting triangle sub-graphs end to end.
• Tree Graph (TRG): a graph generated from one root to several leaves recursively. There is no cycle in tree graphs. To control the tree structure, we define a splitting probability that decays exponentially with the current depth.
• Spiral Graph (SG): a graph shaped like a spiral line. We first simulate a line graph and add edges between nodes with exponentially increasing distances by their indices on the line.
• Net Graph (NG): a graph containing grid-like connections between nodes. We define a small probability of dropping these edges stochastically.

A.9 BASELINE & PARAMETER SETUP

The parameter setups of the baselines are listed as follows. For the oracle-based method, α_0 is set to 2 for the best accuracy, as discussed in the previous work. For the landmark-based method, we choose a sufficient landmark set size of |L| = 128 and take the constrained strategy, i.e., for each selected landmark, nodes within two hops are discarded from consideration. For learning-based methods, the embedding size d is fixed at 16. In addition, the number of selected landmarks in learning-based methods is up to 80 for small graphs (i.e., Cora, Facebook, and GrQc) and 24 for large graphs (i.e., DBLP and Youtube). Other hyper-parameters of each model follow the default configurations discussed in their papers. For the baselines proposed for road networks, the coordinate-related features are omitted, since there is no coordinate assumption in our graph datasets. For general GRL methods, all baselines follow their default configurations and are further trained by linear regression to extract the distance between any two nodes. For our proposed method, we simulate w_in = 20 walks from each node, and each walk is truncated at a length of l_in = 40. Each landmark is selected randomly, up to a landmark set size of |L| = 80. The number of negative samples n is set to 1. The process of distance resampling outputs w_out = 40 walks, each of length l_out = 10. The decay coefficients of BC and distance are fixed at ζ = 10 and α = 0.35. We train the distance predictor, a two-layer perceptron, with a learning rate r = 0.01 for 15 epochs, and we train the CatBoost regressors with a grid search for their best parameters at the offline stage. For large graphs (i.e., DBLP and YouTube), we adjust the above parameters to |L| = 5, w_in = 2, ζ = 1.0. For BCDR-FC, the number of walks is halved for every graph.
For BCDR-FQ, we take the raw outputs of the distance predictor without searching first-order neighborhoods on the graph (i.e., we set τ = True). The boosting module is only utilized in BCDR and is disabled in BCDR-FQ and BCDR-FC (i.e., we set χ = False). We summarize the critical parameter settings for reproducing the results in Table 2 as follows.

A.10 EXTENDED COMPARISONS WITH GRL MODELS ON APPROXIMATION QUALITY

We present in Table 6 the experimental results of comparisons to general GRL models. Here, only the approximation quality (i.e., mAE and mRE) is evaluated, and the metrics are applied directly to the representations without quantizing the outputs to integers or checking adjacency matrices. We see from Table 6 that although general embeddings by GRL methods can preserve some local SP structures, our proposed method with an explicit SP constraint achieves better approximation quality for SP distance queries.

Table 6 (mAE / mRE on three graphs):
(Roweis & Saul, 2000): 5.6265 / 0.8445, 1.9921 / 0.6841, 4.8849 / 0.7105
LE (Roweis & Saul, 2000): 5.6393 / 0.8455, 2.0312 / 0.6998, 5.0046 / 0.7366
GF (Ahmed et al., 2013): 5.6249 / 0.8440, 1.8743 / 0.6383, 4.8562 / 0.7125
DeepWalk (Perozzi et al., 2014): 1.5183 / 0.2425, 0.9323 / 0.3289, 2.8002 / 0.4169
GraRep (Shaosheng et al., 2015): 2.6206 / 0.3830, 2.8702 / 1.0479, 4.2445 / 0.6292
Node2Vec (Grover & Leskovec, 2016): 1.3072 / 0.2115, 0.8541 / 0.2993, 1.5156 / 0.2278
NetMF (Qiu et al., 2018): 4.1736 / 0.6025, 1.6982 / 0.6163, 3.8799 / 0.5779
VERSE (Tsitsulin et al., 2018): 2.8895 / 0.4049, 1.1092 / 0.3729, 3.3436 / 0.4689
LPCA (Chanpuriya et al., 2020)

We discuss here the impact of the different components of the BCDR framework on approximation quality. In addition to the plausible post-processing operations described in Algorithm 1 (i.e., enabling τ, χ or not), we also explore other operations that influence approximation quality when pre-processing graphs. The modifications to BCDR are stated as follows, and the corresponding results evaluated on Facebook and DBLP are shown in Table 7. • no checks on adjacency.
The outputs of BCDR are accepted as predictions of SP distance without checking whether there is an immediate edge between each node pair (i.e., set τ = True).
• no global features. SP distances are solely predicted by the two-layer neural network, and the boosting module based on global distances to landmarks is omitted (i.e., set χ = False).
• no local features. SP distances are solely predicted by the boosting module, and the learning process of the two-layer neural network on local features is omitted.
• no BC. Node representations are learned without the BC-based random walk. For any node, each transition in simulating walks considers its first-order neighbors equally, ignoring their BC values.
• no DR. Node representations are learned without distance resampling. The resampling rule (i.e., Equation 5) is replaced by uniform sampling.
• degree selection. In the simulation of distance triplets and BC values, the landmarks are selected in descending order of degree instead of randomly.
We see from the table that the full BCDR approach achieves the best approximation quality and that all of the components contribute significantly to the prediction results. Specifically, regarding the checks on adjacency, it is notable that learning-based SP representation methods have much difficulty catching the distances to first-order neighbors; checking the adjacency of input node pairs is thus necessary for accurate SP prediction. For global and local features, SP distance predicted from global features performs better than that from local features, which means that leveraging global distances to each landmark helps a lot in locating a node on the graph. Furthermore, BCDR combines both global and local features for prediction and shows superior performance compared to either of them alone.
For representing local features (i.e., BC and DR), we see that both the BC-based random walk and distance resampling help to enhance the node representations with high-order SP structures, making it easier to extract distances to remote nodes. For landmark selection, we find that random selection of landmarks is more suitable for BCDR than the existing strategies. This is because we estimate BC values by performing BFS from these landmarks, and any assumption on the landmark distribution leads to a skewed numerical estimation. If only landmarks with large degrees are selected, the BC values of nodes located in dense regions will be over-estimated, which impairs the efficiency of the BC-based random walk. Then, we further investigate the parameter settings of BCDR and discuss 9 critical parameters for their impact on performance. Notably, although we describe rather detailed parameter settings in Appendix A.9, the proposed method BCDR is in fact robust and effective, and its performance does not rely sensitively on any one of them. Here, we show the impact of these parameters on the related metrics and how to tune them easily on any unweighted graph, both conceptually and practically. The following discussion and evaluation of each parameter follow its order in Table 5. d: the dimension of the node-level embeddings (i.e., Z). In our experiments, d is not a fine-tuned parameter but is fixed at a certain value (i.e., d = 16) across different models to evaluate their performance fairly. Increasing this parameter can improve accuracy, since larger embeddings can encode more valuable information about SP structures, at the expense of higher storage cost and reduced query speed. To verify this, we test BCDR with d = {2, 4, 16, 64, 128, 256} on Facebook and GrQc and evaluate the performance under these metrics.
From Tables 8 and 9, we see that the accuracy loss can be reduced by increasing d, but at the cost of a significant deterioration in storage cost and query speed. Since this paper targets a low-dimensional and accurate SP representation, the results also reveal that even at a rather low embedding dimension (e.g., d = 4), the distance relations on the graph are well preserved.

|L|: the number of landmarks for constructing distance triplets and estimating BC. This parameter mainly affects accuracy and pre-processing time, since involving more landmarks helps to alleviate harmful inductive bias on certain parts of the graph but incurs higher computing overhead. It was also observed in previous work (Rizi et al., 2018) that for large graphs with strong centrality concentrated on a few nodes, the number of landmarks can be reduced without much loss of accuracy. We evaluate BCDR with |L| = {10, 20, 40, 80, 160} on Facebook and GrQc to see the impacts on these two metrics. The results in Tables 10 and 11 show that the pre-processing time increases linearly with |L|, since performing BFS from each added landmark requires an extra O(N + M)-time traversal of the whole graph. It is also interesting to see that the number of landmarks needed for the best performance diverges between dense and sparse graphs: Facebook generally takes more than 40 landmarks, whereas only 20 are necessary for GrQc. Specifically, in relatively dense graphs (i.e., Facebook), each node carries weaker centrality due to the enriched links, so more landmarks must be observed to cover more SPs on the graph (by the hub-labeling theory of Cohen et al. (2003)). In sparse graphs (i.e., GrQc), as long as the few nodes with strong centrality are well observed, the SP distances between most node pairs are preserved, which makes a reduced number of landmarks tolerable.
w_in, l_in: the number and length of sampled BC walks at each node. These parameters affect accuracy and pre-processing time. When we simulate BC walks rooted at a given node, a large w_in suffices to observe its local structure (like BFS), while a large l_in allows wider exploration of the graph so that distances to remote nodes can be seen (like DFS). As in the previous evaluations, we test w_in = {5, 10, 20, 30, 40} and l_in = {5, 10, 20, 40, 60, 80} to investigate their respective impacts. The experimental results in Tables 12 to 15 show that the accuracy of BCDR is not sensitive to these parameters, owing largely to the efficiency of the BC walk and the distance relations well preserved by DR. Intuitively, we recommend setting l_in proportional to the diameter of the graph, so that the whole graph can be observed from any node. Also, w_in can be reduced when the connectivity of the graph is relatively weak, since the local structures are then simple to explore.

w_out, l_out: the number and length of the paths resampled (by DR) at each node. These parameters control the shape of the output node sequences that are subsequently used to optimize Z under a skip-gram procedure. To avoid losing information and to preserve the correlations in the BC walks, we keep the scale of the outputs similar to that of the inputs, i.e., w_out · l_out = Ω(w_in · l_in). To accelerate the optimization, we can further shorten l_out while keeping this scale fixed (by correspondingly expanding w_out). Note that this reshaping does not appreciably change the locality or impair performance, since DR resamples nodes from high-order neighborhoods with respect to their distance from the root, resulting in well-defined convergence, as shown in Prop. 1.
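The simulation of a single BC walk can be sketched as follows. The revisit decay used here is only one plausible form of the paper's saturation rule; the exact transition probability (line 22 of the algorithm) may differ:

```python
import random

def bc_walk(adj, gamma, root, l_in, zeta=2.0, counts=None):
    """Simulate one BC-biased walk of length l_in rooted at `root`.

    gamma[v] is the (approximate) BC value of v. The weight of a node
    shrinks as its visit counter grows, so high-BC nodes saturate after
    enough walks pass through them and rarer paths become easier to visit.
    """
    if counts is None:
        counts = {v: 0 for v in adj}
    path, cur = [root], root
    for _ in range(l_in):
        nbrs = list(adj[cur])
        weights = [gamma[v] / (1.0 + counts[v] / zeta) for v in nbrs]
        cur = random.choices(nbrs, weights=weights, k=1)[0]
        counts[cur] += 1
        path.append(cur)
    return path
```

Simulating w_in such walks per root, while sharing the visit counter across them, pools the visited nodes with their distances to the root into the map D_i that DR later consumes.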
In the experiment, we fix the scale of the output node sequences at half the scale of the BC walks (i.e., w_out · l_out = w_in · l_in / 2 = 400) and test different combinations (w_out, l_out) = {(200, 2), (100, 4), (50, 8), (40, 10), (25, 16), (16, 25), (10, 40), (8, 50), (4, 100), (2, 200)}. The results in Tables 16 and 17 reveal that the pre-processing time increases dramatically with l_out. This is because we use the whole sequence to optimize the co-occurrence likelihood between the root and the nodes in that sequence, which requires joint training with a number of node embeddings proportional to l_out. It is also shown that the accuracy does not fluctuate as significantly as the pre-processing time, indicating that a relatively small l_out helps to reduce the off-line time cost.

ζ, α: the decay coefficients of the BC values and the distance weights. These parameters mainly affect accuracy by governing the intrinsic behaviors of the BC walk and DR, respectively. Here, ζ determines how frequently a node can be enrolled in the current BC walk, which helps diversify the directions of different walks from one root. Likewise, α determines how frequently a node more hops away from the root is selected into the resampled paths, which helps distinguish neighbors of different orders. As in the previous evaluations, we test BCDR with ζ = {-1, 0, 1, 2, 4, 10, 20} and α = {0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.98} to show their impacts. From Tables 18 and 19, we see that the accuracy of BCDR is not sensitive to these parameters, although fine-tuning can improve performance on specific graphs. The choice of ζ depends on the fluctuation of centrality among neighbor nodes. Specifically, for relatively dense graphs (like Facebook) with flattened centrality over neighbors, a larger ζ resists the frequency decay of the most preferred walk paths, leading to efficient exploration of high-order distance relations.
On the contrary, since many neighbors possess similar centrality on such graphs, a quick BC decay (smaller ζ) makes the priorities of neighbor nodes indistinguishable, dragging the performance down to that of a naive random walk. The choice of α, as discussed in Remark 2, reflects a trade-off between quality (i.e., preserving accurate distance relations) and quantity (i.e., embedding relations over a wider range of nodes). In detail, a smaller α slows down the process $\hat{D}_{ab} \to 0$, allowing the relations of node pairs with larger distance $D_{ab}$ to converge, i.e., $Z_a Z_b^T \to \hat{D}_{ab} > 0$, but it makes nodes with similar distances from the root indistinguishable due to noise in the embedding space, and vice versa.
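The effect of α on the resampling distribution can be illustrated directly from the resampling rule of the algorithm (lines 40–45), where a node v_j is drawn with probability proportional to α^{D_i[v_j]} · γ[v_j]; the node names below are toy placeholders:

```python
import random

def resample_nodes(D_i, gamma, alpha, l_out, rng):
    """Distance resampling for one root: draw l_out nodes with probability
    proportional to alpha**D_i[v] * gamma[v], so alpha controls how
    strongly remote nodes are down-weighted."""
    nodes = list(D_i)
    weights = [alpha ** D_i[v] * gamma[v] for v in nodes]
    return rng.choices(nodes, weights=weights, k=l_out)

# With uniform gamma, a larger alpha admits far nodes much more often:
rng = random.Random(0)
D_i = {"root": 0, "near": 1, "far": 2}
gamma = {v: 1.0 for v in D_i}

def frac_far(alpha, n=20000):
    return resample_nodes(D_i, gamma, alpha, n, rng).count("far") / n
```

Comparing `frac_far(0.9)` against `frac_far(0.1)` shows the quantity side of the trade-off: a larger α embeds relations with a wider range of nodes, while a smaller α concentrates the samples on close, accurately ranked neighbors.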

m: the number of training epochs.

The number of epochs determines whether training is sufficient to learn an NN distance predictor. To produce the results in Table 2, we use the empirical value discussed in (Rizi et al., 2018). Here, we evaluate its impact on accuracy loss and pre-processing time.

The extended results on all synthetic graphs are shown in Figure 7. We analyze the significance of BC-RW for a wider exploration range on the different structures as follows.
• For CGs and TCGs, BC-RW tends to choose the exit nodes of each circle, since they provide a large BC gain by splitting all SPs between the inner and outer nodes of the current circle.
• For TGs, transitions on each triangle clique tend to move forward along the trunk road, since the number of nodes beyond the current clique is often larger than the number of inner nodes, contributing more SPs.
• For TRGs, each transition from the root toward the leaves is biased, since subtrees with more descendants contribute more SPs and possess larger BC values.
• For SGs, many shortcuts link nodes on the trunk road, and the BC values of shortcut nodes and other nodes are usually on par; BC-RW retains a slight advantage by keeping a relatively good balance over these nodes.
• For NGs, most nodes are passed through by SPs with similar probabilities, and, like the other walk strategies, BC-RW can hardly tell the proper direction for deeper exploration.

A.13 EXTENDED RESULTS ON PRESERVATION OF DISTANCE RELATIONS

The extended results on distance preservation are shown in Figures 8 and 9. These figures confirm that our model preserves distance relations far better than existing methods, except on TGs. For most graphs, the BC walk paths provide sufficient observation of each node by locating many remote nodes through a sequence of central nodes, so the resampling process based on these observations can preserve distance relations within the exploration range. For TGs, however, many final nodes (i.e., leaves) carry trivial significance for BC walks and are insufficiently observed, so their distance relations cannot be well calibrated.



Figure 2: Distance confusion in previous SP representation learning. (a): a random walk from v_a has much difficulty exploring beyond the current community to reach v_c. (b): node similarity on random paths misleads the measurement of SP distance, since a walk from v_a tends to steer clear of v_b at the start and return to v_b only at the end, causing an extremely weak correlation between v_a and v_b even though they share an immediate edge. (c): a sufficient number of 2-hop links between v_c and v_a induces a shorter distance in the embedding space than that between v_b and v_a. (d): v_b and v_c, sharing substantial connections, are mapped close to each other even though they have a large SP distance gap, while the divergence of the distances from v_b and v_c to v_a is also hard to extract.

N_a^{(h)} comprises two components, i.e., final nodes f_h(v_a) and connective nodes e_h(v_a). Here, nodes in f_h(v_a) have no edge with N_a^{(h+1)}, while nodes in e_h(v_a) have several edges with N_a^{(h+1)}, as illustrated in Figure 3a. Our method significantly improves performance when the number of final nodes |f_h(v_a)| is larger or there are more edges from f_h(v_a) to e_h(v_a) (see the analysis in Remark 1).
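The partition into connective and final nodes is a purely structural computation; a minimal sketch (the function name is ours, not from the paper):

```python
from collections import deque

def partition_ring(adj, v_a, h):
    """Split N_a^(h), the nodes h hops from v_a, into connective nodes
    e_h (having an edge into N_a^(h+1)) and final nodes f_h (having none)."""
    # Plain BFS distances from the root v_a.
    dist = {v_a: 0}
    q = deque([v_a])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    ring = {v for v, d in dist.items() if d == h}
    nxt = {v for v, d in dist.items() if d == h + 1}
    e_h = {v for v in ring if any(w in nxt for w in adj[v])}
    return e_h, ring - e_h  # (connective, final)

# Toy graph 0-1, 1-2, 1-3, 2-4: node 3 is a dead end at distance 2,
# so the 2-hop ring {2, 3} splits into e_2 = {2} and f_2 = {3}.
adj = {0: [1], 1: [0, 2, 3], 2: [1, 4], 3: [1], 4: [2]}
e2, f2 = partition_ring(adj, 0, 2)
```

A naive walk reaching the final node 3 must backtrack, whereas the BC-biased transition favors the connective node 2, which is the mechanism Theorem 2 formalizes.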

Figure 3: (a): the high-order neighborhood structure of v_a. The BC-based random walk enhances the transitions from f_h to e_h and from e_h to e_{h+1}, and is thus prone to jump out of local neighborhoods. (b): comparison between general RW-based graph learning and BCDR. Distance resampling transforms the observations into random shortest paths, which exerts desirable constraints for learning SP representations.

Figure 4: Exploration range of distance for different walk strategies on circle graphs. Columns from left to right: different walk strategies, i.e., NRW, SORW, RS, DFS-RW, and BC-RW (ours).

Figure 5: Row 1: measured distance from the embedding space and the original graph. Row 2: whether distance relations are violated in the embedding space. Columns from left to right: embeddings learned by different walk strategies, i.e., NRW, SORW, RS, DFS-RW, and BC-RW (ours), respectively. For ours, the walk paths are further processed by distance resampling.

construct SP representations & answer distance queries
Input: input graph G = (V, E), set of landmarks L, distance queries of node pairs Q, embedding dimension d, number of walk paths per node w_in, walk-path length l_in, BC decay coefficient ζ, number of resampled walk paths per node w_out, resampled walk-path length l_out, distance decay coefficient α, training epochs m, learning rate η_r, usage of fast query τ, usage of boosting χ.
Output: SP representation Z, predictor g_φ, estimated distance D. (optional: regressors b_1, b_2, global representation Z)
1  Def sim_DT_with_BC(G, L):
2    distance triplets T := list[]

γ
12 Def sim_BC_Walk(G, v_i, w_in, l_in, γ):
13   distance map D_i := dict{v_i : 0}
14   visit counter B := dict{v_j : 0, ∀v_j ∈ V}
15   for walk k from 0 to w_in do
16     visit sign set S_i := {v_i}
17     current node v_c := v_i
18     length c_i := 0
19     while c_i < l_in do
20       probabilities of the next candidate nodes

Def cons_BCDR(G, d, L, w_in, l_in, ζ, w_out, l_out, α, m, η):
37   T, γ ← sim_DT_with_BC(G, L); P := list[]
38   for v_i ∈ V do
39     D_i ← sim_BC_Walk(G, v_i, w_in, l_in, γ)
40     probabilities of candidate nodes P_i := dict{}
41     for v_j ∈ D_i.keys do
42       P_i[v_j] := α^{D_i[v_j]} · γ[v_j]
43     end
44     for walk k from 0 to w_out do
45       sample walk path p_i of length l_out by normalized P_i

THEOREM FOR CLARIFYING THE SIGNIFICANCE OF BC-BASED RANDOM WALK

Theorem 2. Define N_a^{(h)} = {v_j | D_aj = h} as the set of nodes that are h hops away from v_a, and write |N_a^{(h)}| for the number of nodes in the set. Nodes in N_a^{(h)} can be divided into two sets, e_h(v_a) and f_h(v_a): e_h(v_a) comprises the nodes that have an edge with a node in N_a^{(h+1)} (called connective nodes), and f_h(v_a) comprises the other nodes (called final nodes). The BC value of each node is approximated by considering only the shortest paths among nodes within a local range of k hops. Let P_R(N_a^{(h)} → N_a^{(h+1)}) denote the probability of transiting from N_a^{(h)} to N_a^{(h+1)} under a naive random walk, and P_B(N_a^{(h)} → N_a^{(h+1)}) the corresponding probability under a BC-based random walk. Let P_R(f_h(v_a) → e_h(v_a)) be the probability of transiting from f_h(v_a) to e_h(v_a). E(v_a) is the eccentricity of v_a, E(v_a) = max_{v_b ∈ G} D_ab. Then, for sufficiently large E(v_a), any node v_a ∈ G, and any

divided into e_h(v_a) and f_h(v_a) as in Figure 3a, the BC-based random walk improves the exploration distance beyond local loops and dead ends, in contrast to the naive random walk, in two respects. On one side, even if N_a^{(h)} comprises more nodes in f_h(v_a) than in e_h(v_a) and A_1 thus decreases, the BC-based random walk tends to transit within e_h(v_a) to get closer to the next desired set N_a^{(h+1)}

BC(v_b) by γ_b. According to Equation 30, the joint distribution of each pair (v_a, v_b) can be formulated as
$$P(v_a, v_b) = P(v_a) \cdot P(v_b | v_a) = P(v_a) \cdot \alpha^{D_{ab}} \gamma_b \quad (31)$$
Then, consider the formulation of $P_n(v_b | v_a)$. Recall that negative sampling follows a uniform distribution over W_a, which is simulated by the BC-based random walk, and the probability of v_b's occurrence relies on γ_b and W_a. Thus there holds
$$P_n(v_b | v_a) = \frac{\gamma_b}{\kappa(v_a)} \quad (32)$$
where κ(v_a) is proportional to the number of nodes covered by W_a on the graph. Furthermore, Equation 29 can be rewritten as
$$\hat{D}_{ab} = \arg\max_{\hat{D}_{ab}} \; P(v_a)\,\alpha^{D_{ab}}\gamma_b \log\sigma(\hat{D}_{ab}) + \frac{\lambda}{\kappa(v_a)}\,P(v_a)\,\gamma_b \log\sigma(-\hat{D}_{ab}) \quad (33)$$
The above problem is solved by setting $\partial L(v_a, v_b)/\partial \hat{D}_{ab}$ equal to zero.

Let $\epsilon = \lambda\,\kappa^{-1}(v_a)$; then there holds
$$\hat{D}_{ab} = -\log\epsilon + D_{ab}\log\alpha \quad (36)$$

A.5 REMARK FOR THE RELATIONSHIP BETWEEN DISTANCE MATRIX & DISTANCE-BASED SIMILARITY MATRIX

Remark 2. Equation 7 in Proposition 1 reveals the linear projection between the elements of D and $\hat{D}$. Here, $-\log\epsilon$ is a large positive constant on the order of $\log|W_a| - \log\lambda$, and $D_{ab}\log\alpha$ is a negative value that decreases linearly with the SP distance $D_{ab}$. It indicates that there exists a finite distance range $n \in \mathbb{N}^+$ such that, for each node $v_b \in \{v_x \,|\, D_{ax} \le n\}$, the distance relation between v_a and v_b can be well optimized by converging $Z_a Z_b^T \to \hat{D}_{ab} > 0$.
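The intermediate steps between Equation 33 and Equation 36 (Equations 34–35, lost in the extracted text) can be reconstructed as follows, using $\partial \log\sigma(x)/\partial x = \sigma(-x)$; this is a reconstruction consistent with the adjacent equations rather than the paper's verbatim derivation:

```latex
\begin{aligned}
\frac{\partial L(v_a, v_b)}{\partial \hat{D}_{ab}}
  &= P(v_a)\,\alpha^{D_{ab}}\gamma_b\,\sigma(-\hat{D}_{ab})
   - \frac{\lambda}{\kappa(v_a)}\,P(v_a)\,\gamma_b\,\sigma(\hat{D}_{ab}) = 0 \\
\Longrightarrow\quad
\frac{\sigma(\hat{D}_{ab})}{\sigma(-\hat{D}_{ab})}
  &= e^{\hat{D}_{ab}} = \frac{\kappa(v_a)}{\lambda}\,\alpha^{D_{ab}} \\
\Longrightarrow\quad
\hat{D}_{ab} &= -\log\big(\lambda\,\kappa^{-1}(v_a)\big) + D_{ab}\log\alpha ,
\end{aligned}
```

which matches Equation 36 after substituting $\epsilon = \lambda\,\kappa^{-1}(v_a)$.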

Figure 6: Visualization results of the graphs used for evaluation. (a): Cora. (b): Facebook. (c): GrQc. (d): DBLP. (e): YouTube. (f): CG. (g): TG. (h): TCG. (i): TRG. (j): SG. (k): NG.

Figure 7: Exploration range of distance when taking different walk strategies, tested on six synthetic graphs. Rows from top to bottom: different synthetic graphs, including CG, TG, TCG, TRG, SG, and NG, respectively. Columns from left to right: different walk strategies, including NRW, SORW, RS, DFS-RW, and BC-RW (ours), respectively.

Figure 8: Measured distance from the embedding space and the original graph. Rows from top to bottom: different graphs, including CG, TG, TCG, TRG, SG, and NG, respectively. Columns from left to right: embeddings learned by different walk strategies, i.e., NRW, SORW, RS, DFS-RW, and BC-RW (ours), respectively. For ours, the walk paths are further processed by distance resampling.

show in Proposition 1 that optimizing Equation 6 amounts to implicitly decomposing an SP distance-based similarity matrix, in which any v_a and v_b located far away from each other under the SP metric (i.e., with a large $D_{ab}$) are mapped with low similarity in the embedding space (i.e., a small $\hat{D}_{ab}$). Further discussion in Remark 2 shows that this similarity matrix $\hat{D}$ shares strong connections with the real SP distance matrix D on graphs. Proposition 1. Let G be an undirected graph and W_i the categorical distribution of nodes on the paths sampled by the BC-based random walk. Negative sampling on each node v_i follows a uniform distribution over W_i. Then, given sufficient observations of W_1, ..., W_N, maximizing L_n defined by Equation

Performance comparison of approximate methods on SP distance queries. PT: processing time for constructing SP representations, PMU: processing memory usage, SC: space cost for storing SP representations, RT: average response time for answering a distance query, QMU: querying memory usage. mAE and mRE are the accuracy metrics. DNF means the method did not finish within one day. We bold the top three performances and underline the top one.

Xiang Yue, Zhen Wang, Jingong Huang, Srinivasan Parthasarathy, Soheil Moosavinasab, Yungui Huang, Simon M Lin, Wen Zhang, Ping Zhang, and Huan Sun. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics, 36(4):1241-1251, 10 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz718. URL https://doi.org/10.1093/ bioinformatics/btz718.

improve the prediction results, we further integrate the distance predictor with CatBoost techniques (Dorogush et al., 2018). First, we treat the features of a node as a combination of global and local features. The local features are already represented by Z, since we construct SP representations for each node locally. For the global features of any node v_i, we directly leverage its distances to each landmark as its global embedding Z_i. We then train two CatBoost regressors (i.e., b_1, b_2) in turn. The first regressor b_1 takes the global features of two nodes as input and predicts their distance as output (line 58), while the second regressor b_2 takes as input not only these global features but also the distances predicted from both global (i.e., b_1(Z
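The two-stage boosting scheme can be sketched as follows. A tiny least-squares model stands in for the CatBoost regressors (the real method uses catboost.CatBoostRegressor), and all arrays below are synthetic placeholders chosen only to illustrate the data flow:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares stand-in for a CatBoost regressor (b1 / b2)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.hstack([Xq, np.ones((len(Xq), 1))]) @ w

rng = np.random.default_rng(0)
n, n_landmarks = 50, 4
Zg = rng.normal(size=(n, 2 * n_landmarks))  # [Z_a, Z_b]: distances to landmarks
d_local = rng.normal(size=(n, 1))           # local prediction of the two-layer NN
y = rng.normal(size=n)                      # ground-truth SP distances (synthetic)

b1 = fit_linear(Zg, y)                      # stage 1: global features -> distance
X2 = np.hstack([Zg, b1(Zg)[:, None], d_local])
b2 = fit_linear(X2, y)                      # stage 2: global + b1 output + local
pred = b2(X2)
```

The key point is the stacking: b2 sees the global features together with both upstream predictions, so it can learn to correct their residual errors.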

Comparisons between BCDR and directly sampling SPs by different intuitive strategies. PT: pre-processing time, ST: time of sampling paths, mAE: mean of Absolute Error, mRE: mean of Relative Error.

Statistics of the graphs used for evaluation. N denotes the number of nodes, M denotes the number of edges, RoBC denotes the range of BC, mBC denotes the average BC on nodes, D denotes the diameter of a graph, BFS denotes the average processing time of breadth-first search on all nodes. Each graph is considered an undirected and unweighted graph in the SP representation problem.

Parameter settings for the evaluation on 5 real-world graphs. Smaller graphs include Cora, Facebook, and GrQc; larger graphs include DBLP and YouTube.

Extended comparison to general GRL models on approximation quality

Ablation study of BCDR framework on approximation quality

Impacts of d on the performance of BCDR evaluated on Facebook

Impacts of d on the performance of BCDR evaluated on GrQc

Impacts of |L| on the performance of BCDR evaluated on Facebook

Impacts of |L| on the performance of BCDR evaluated on GrQc

Impacts of w_in on the performance of BCDR evaluated on Facebook

Impacts of l_in on the performance of BCDR evaluated on Facebook

Impacts of w_in on the performance of BCDR evaluated on GrQc

Impacts of l_in on the performance of BCDR evaluated on GrQc

Impacts of (w_out, l_out) on the performance of BCDR evaluated on Facebook

Impacts of (w_out, l_out) on the performance of BCDR evaluated on GrQc

Impacts of ζ and α on the performance of BCDR evaluated on Facebook

Impacts of ζ and α on the performance of BCDR evaluated on GrQc

Impacts of the number of epochs on the performance of BCDR evaluated on Facebook

Impacts of the number of epochs on the performance of BCDR evaluated on GrQc

The results in Tables 20 and 21 show that training with 15 epochs is generally appropriate for many real-world graphs. They also reflect that training the distance predictor for more iterations may cause over-fitting, since the training data (distance triplets) are extracted from a few landmarks, which induces harmful inductive bias on certain parts of the graph.

A.2.1 SIMULATION OF DISTANCE TRIPLETS & BC WALK

To simulate distance triplets (lines 1 to 11), we perform BFS from a fixed number of landmarks L and record their distances to each node on the graph (lines 6 and 7). L comprises nodes selected by heuristic strategies (e.g., by degree in descending order, or randomly), and the simulated triplets, obtained in linear time, reflect a sufficient group of distance relations among V × V for metric learning. Before simulating BC walk paths, the pre-computed BC of each node is also required, which takes at least O(NM) time on unweighted graphs for an exact solution. To reduce the time complexity, we adopt a time-efficient approximation by integrating this computation into the above simulation of distance triplets (line 8). Intuitively, BC(v_a) measures a relationship between v_a and the centers of the graph: nodes with larger BC values have a shorter average SP distance to the other nodes. Since BFS visits nodes in ascending order of distance, we estimate the BC of each node from its average distance to all landmarks in L without introducing extra time complexity.

Then, in the simulation of BC walk paths (lines 12 to 35), we are interested in the "nodes on these walk paths" rather than the full paths themselves; the nodes, together with their distances to the root v_i (i.e., D_i), are passed on to feed the subsequent construction (line 39). We therefore enlarge the node coverage with a decay mechanism on BC to diversify the walks. Note that the BC-based random walk, with its ability to explore remote nodes, tends to choose paths that travel further, which can cause nodes on some dead ends to be ignored. To address this, as stated in line 22, the probability of transiting to a node with large BC saturates after a sufficient number of walks have passed through it, so rarely visited paths become much easier to visit later.
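The triplet simulation and the landmark-based BC approximation can be sketched as below on a connected unweighted graph. The inverse-average-distance transform is one plausible choice of ours: the text only specifies that a shorter average distance to the landmarks should imply a larger BC score, and the paper's exact monotone transform may differ:

```python
from collections import deque

def bfs_dist(adj, src):
    """Single-source BFS distances on an unweighted graph, O(N + M)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def triplets_and_bc(adj, landmarks):
    """One BFS per landmark yields distance triplets (l, v, d) and a BC
    proxy gamma[v] that grows as the average distance to landmarks shrinks,
    reusing the very same traversals at no extra asymptotic cost."""
    T, total = [], {v: 0 for v in adj}
    for l in landmarks:
        d = bfs_dist(adj, l)
        for v, dv in d.items():
            T.append((l, v, dv))
            total[v] += dv
    k = len(landmarks)
    gamma = {v: 1.0 / (1.0 + total[v] / k) for v in adj}
    return T, gamma

# Star graph: the center 0 gets the largest BC proxy.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
T, gamma = triplets_and_bc(adj, [1, 2, 3])
```

The triplets T feed the metric-learning stage, while gamma supplies the transition weights for the BC-based random walk.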

