

Abstract

In this paper, we introduce InstantEmbedding, an efficient method for generating single-node representations using local PageRank computations. We theoretically prove that our approach produces globally consistent representations in sublinear time. We demonstrate this empirically by conducting extensive experiments on real-world datasets with over a billion edges. Our experiments confirm that InstantEmbedding requires drastically less computation time (over 9,000 times faster) and less memory (by over 8,000 times) to produce a single node's embedding than traditional methods including DeepWalk, node2vec, VERSE, and FastRP. We also show that our method produces high-quality representations, demonstrating results that meet or exceed the state of the art for unsupervised representation learning on tasks like node classification and link prediction.

1. Introduction

Graphs are widely used to represent data in which objects are connected to each other, such as social networks, chemical molecules, and knowledge graphs. A widely used approach for dealing with graphs is learning compact representations (Perozzi et al., 2014; Grover & Leskovec, 2016; Abu-El-Haija et al., 2018), i.e. a d-dimensional embedding vector for each node in a given graph. Unsupervised embeddings in particular have shown improvements in many downstream machine learning tasks, such as visualization (Maaten & Hinton, 2008), node classification (Perozzi et al., 2014) and link prediction (Abu-El-Haija et al., 2018). Importantly, since such embeddings are learned solely from the structure of the graph, they can be used across multiple tasks and applications. Graph embedding models typically assume that the graph data fits in memory (Perozzi et al., 2014) and that representations for all nodes must be generated. However, in many real-world applications, graph data is large but scarcely annotated. For example, the Friendster social graph (Yang & Leskovec, 2015) has only 30% of its 65M nodes assigned to a community. At the same time, many applications of graph embeddings, such as classifying a single data item, only require a representation for the item itself, and possibly representations of the labeled nodes. Therefore, computing a full graph embedding is at worst infeasible and at best inefficient. These observations motivate the problem we study in this paper: the Local Node Embedding problem. In this setting, the embedding for a node is restricted to using only local structural information, and cannot access the representations of other nodes in the graph or rely on trained global model state.
In addition, we require that a local method produce embeddings that are consistent with all other nodes' representations, so that the final representations can be used in the same downstream tasks that graph embeddings have proved adept at in the past. In this work, we introduce InstantEmbedding, an efficient method to generate local node embeddings on the fly, in sublinear time, that are globally consistent. Building on previous work that links embedding learning methods to matrix factorization (Tsitsulin et al., 2018; Qiu et al., 2018), our method leverages a high-order similarity matrix based on Personalized PageRank (PPR) as the foundation on which local node embeddings are computed via hashing. We offer theoretical guarantees on the locality of the computation, as well as a proof of the global consistency of the generated embeddings. We show empirically that our method produces high-quality representations on par with state-of-the-art methods, with efficiency several orders of magnitude better in clock time and memory consumption: running 9,000 times faster and using 8,000 times less memory on the largest graphs that contenders can process.

Table 1: Related work in terms of desirable properties and the computational complexity necessary to generate a single node embedding. Note that all existing methods must generate a full graph embedding, and thus are directly dependent on the total graph size, while our method can directly solve this task in sublinear time. Analysis in Section 3.2.1. [Only the InstantEmbedding row is recoverable: time O(1/(εα(1−α)) + d), space O(1/(εα(1−α)) + d) per node.]

2. Preliminaries & Related Work

2.1 Graph Embedding

Let G = (V, E) represent an unweighted graph, with a set of nodes V, |V| = n, and edges E ⊆ (V × V), |E| = m. A graph can also be represented as an adjacency matrix A ∈ {0, 1}^{n×n} where A_{u,v} = 1 iff (u, v) ∈ E.
The task of graph embedding, then, is to learn a d-dimensional node embedding matrix X ∈ R^{n×d} where X_v serves as the embedding for any node v ∈ V. We note that d ≪ n, i.e. the learned representations are low-dimensional, and the challenge of graph embedding is to best preserve graph properties (such as node similarities) in this space. Following the formalization in Abu-El-Haija et al. (2018), many graph embedding methods can be thought of as minimizing an objective of the general form: min_X L(f(X), g(A)), where f : R^{n×d} → R^{n×n} is a pairwise distance function on the embedding space, g : R^{n×n} → R^{n×n} is a distance function on the (possibly transformed) adjacency matrix, and L is a loss function over all (u, v) ∈ (V × V) pairs.

A number of graph embedding methods have been proposed. One family of methods simply learns X as a lookup dictionary of embeddings and calculates the loss via distance (Kruskal, 1964) or matrix factorization (either implicit (Perozzi et al., 2014; Grover & Leskovec, 2016) or explicit (Ou et al., 2016)). Another line of work focuses on leveraging the graph structure using neighborhood aggregation (Battaglia et al., 2016; Scarselli et al., 2008) or the Laplacian matrix of the graph (Kipf & Welling, 2016). On attributed structured data, Graph Convolutional Networks (Kipf & Welling, 2016) have been successfully applied to both supervised and unsupervised tasks (Veličković et al., 2018). However, in the absence of node-level features, Duong et al. (2019) demonstrated that these methods do not produce meaningful representations.

Graph Embedding via Random Projection. The computational efficiency brought by advances in random projection (Achlioptas, 2003; Dasgupta et al., 2010) paved the way for its adaptation in graph embedding, allowing direct construction of the embedding matrix X.
Two recent works, RandNE (Zhang et al., 2018) and FastRP (Chen et al., 2019), iteratively project the adjacency matrix to simulate higher-order interactions between nodes. As we show in the experiments, these methods suffer from high memory requirements and are not always competitive with other methods.

2.2 Local Algorithms on Graphs

Local algorithms on graphs (Suomela, 2013) solve graph problems without using the full graph. A well-studied problem in this space is personalized recommendation (Jeh & Widom, 2003), where users are represented as nodes in a graph and the goal is to recommend items to specific users by leveraging the graph structure. Classic solutions to this problem are Personalized PageRank (Gupta et al., 2013) and Collaborative Filtering (Schafer et al., 2007; He et al., 2017). Interestingly, these methods have recently been applied to graph neural networks (Klicpera et al., 2019; He et al., 2020). We now recall the definition of Personalized PageRank, one of the main ingredients of our embedding algorithm.

Definition (Personalized PageRank (PPR, variation)). Given s ∈ R^n (s_i ≥ 0, Σ_i s_i = 1), a distribution over the starting node of random walks, and α ∈ (0, 1), a decay factor, the Personalized PageRank vector π(s) ∈ R^n is defined recursively as:

    π(s) = αs + (1 − α) π(s) · (1/2)(I + D⁻¹A),

where (1/2)(I + D⁻¹A) is the lazy random-walk matrix. PPR takes as input a distribution of starting nodes s, which is typically an n-dimensional one-hot vector e_i with 1 in the i-th coordinate, enforcing local random walks starting from node i. Following this practice, we denote by π_i ∈ R^n the PPR vector starting from a single node i, and by PPR ∈ R^{n×n} the full PPR matrix for all nodes in the graph, where PPR_{i,:} = π(e_i). VERSE (Tsitsulin et al., 2018) proposes to learn node embeddings by implicitly factorizing PPR. Its stochastic approach can perform well, but lacks guarantees of stability and convergence. The idea of learning embeddings based on local random walks has also been used in the property testing framework, a line of graph algorithms aimed at analyzing the clustering structure of a graph (Kale & Seshadhri, 2008; Czumaj & Sohler, 2010; Czumaj et al., 2015; Chiplunkar et al., 2018).
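To make the recursive definition concrete, the following minimal sketch (our illustration; the 4-node toy graph and α = 0.15 are arbitrary choices) iterates the recurrence to its fixed point:

```python
import numpy as np

# Toy undirected graph as an adjacency matrix (illustrative 4-node example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
D_inv = np.diag(1.0 / A.sum(axis=1))
W = 0.5 * (np.eye(n) + D_inv @ A)   # lazy random-walk matrix (1/2)(I + D^{-1}A)

def ppr(s, alpha=0.15, iters=200):
    """Iterate pi <- alpha*s + (1-alpha)*pi*W until (approximate) convergence."""
    pi = s.copy()
    for _ in range(iters):
        pi = alpha * s + (1 - alpha) * pi @ W
    return pi

e0 = np.eye(n)[0]   # one-hot start distribution for node 0
pi0 = ppr(e0)
print(np.round(pi0, 3))   # a probability distribution concentrated near node 0
```

Since W is row-stochastic, each iterate remains a probability distribution, and the contraction factor (1 − α) guarantees convergence to the unique fixed point π(e₀).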

2.3 Problem Statement

In this work, we consider the problem of quickly embedding a single node in a graph. More formally, we consider what we term the Local Node Embedding problem: given a graph G and any node v, return a globally consistent structural representation for v using only local information around v, in time sublinear in the size of the graph. A solution to the local node embedding problem should possess the following properties:

1. Locality. The embedding for a node is computed locally, i.e. it can be produced using only local information and in time independent of the total graph size.
2. Global Consistency. A local method must produce embeddings that are globally consistent (i.e. each embedding can be related to the others, s.t. distances in the space preserve proximity).

While many node embedding approaches have been proposed (Chen et al., 2018), to the best of our knowledge we are the first to examine the local embedding problem. Furthermore, no existing methods for positional representations of nodes meet these requirements. We discuss these requirements below, and summarize related work in terms of these properties in Table 1.

Locality. While classic node embedding methods, such as DeepWalk (Perozzi et al., 2014), node2vec (Grover & Leskovec, 2016), or VERSE (Tsitsulin et al., 2018), rely on information aggregated from local subgraphs (e.g. sampled by a random walk), they do not meet our locality requirement. Specifically, they also require the representations of all surrounding nodes, resulting in a dependency on information from all nodes in the graph (in addition to a space complexity of O(nd), where d is the embedding dimension) to compute a single representation. Classic random-projection based methods also require access to the full adjacency matrix in order to compute the higher-order ranking matrix.
We briefly remark that even methods capable of local attributed subgraph embedding (such as GCN or DGI) do not meet this definition of locality, as they require a global training phase to calibrate their graph pooling functions.

Global Consistency. This property allows embeddings produced by local node embedding to be used together, perhaps as features in a model. While existing methods for node embeddings are global ones that implicitly have global consistency, this property is not trivial for a local method to achieve. Specifically, a local method must produce a node representation that resides in a space preserving proximities to all other node embeddings that may be generated, without relying on global state. One exciting implication of a globally consistent local method is that it can defer computing a representation until it is actually required for a task. For example, in a production system, one might produce representations only when they are requested for immediate classification. We propose an approach satisfying these properties in Section 3, and experimentally validate it in Section 4, followed by conclusions in Section 5.

3. Method

Here we outline our proposed approach for local node embedding. We begin by discussing the connection between a recent embedding approach and matrix factorization. Using this analysis, we then propose an embedding method based on randomly hashing the PPR matrix. We note that this approach has a tantalizing property: it can be decomposed into entirely local operations per node. With this observation in hand, we present our solution, InstantEmbedding. Finally, we analyze the algorithmic complexity of our approach, showing both that it is a local algorithm (running in time sublinear in the size of G) and that the local representations are globally consistent.

3.1 Global Embedding Using PPR

A recently proposed method for node embedding, VERSE (Tsitsulin et al., 2018), learns node embeddings using a neural network that encodes Personalized PageRank similarities. Its objective function, in the form of Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), is:

    L = Σ_{i=1}^{n} Σ_{j=1}^{n} [ PPR_{ij} log σ(x_i^T x_j) + b · E_{j′∼U} log σ(−x_i^T x_{j′}) ],

where PPR is the Personalized PageRank matrix, σ is the sigmoid function, b is the number of negative samples, and U is a uniform noise distribution from which negative samples are drawn. Like many SkipGram-style methods (Mikolov et al., 2013), this learning process can be linked to matrix factorization by the following lemma:

Lemma 3.1 (Tsitsulin et al. (2020)). VERSE implicitly factorizes the matrix log(PPR) + log n − log b into XX^T, where n is the number of nodes in the graph and b is the number of negative samples.
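The objective above can be written out directly; a small numpy sketch (ours, purely illustrative: a random row-stochastic matrix stands in for PPR, and the expectation over the uniform noise U is computed exactly as a mean) evaluating L for a given X:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def verse_nce_loss(PPR, X, b=3):
    """NCE objective: sum_ij PPR_ij * log sigma(x_i^T x_j)
       + b * E_{j'~Uniform}[log sigma(-x_i^T x_j')], summed over nodes i."""
    S = X @ X.T                                   # all pairwise scores x_i^T x_j
    pos = np.sum(PPR * np.log(sigmoid(S)))        # positive term, weighted by PPR
    neg = b * np.mean(np.log(sigmoid(-S)), axis=1).sum()  # exact expectation over U
    return pos + neg

rng = np.random.default_rng(0)
n, d = 5, 4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # row-stochastic stand-in for PPR
X = 0.1 * rng.standard_normal((n, d))
print(verse_nce_loss(P, X))   # a finite scalar; both terms are log-probabilities, so it is negative
```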

3.1.1 Hashing for Graph Embedding

Lemma 3.1 provides an incentive to find an efficient alternative to factorizing the dense similarity matrix M = log(PPR) + log n − log b. Our choice of algorithm requires two important properties: a) providing an unbiased estimator for the inner product, and b) requiring less than O(n) memory. The first property is essential to ensure we have a good sketch of M for the embedding, while the second keeps our complexity per node sublinear. To meet both requirements, we propose to use hashing (Weinberger et al., 2009) to preserve the essential similarities of PPR in expectation. We leverage two global hash functions h_d : N → {0, ..., d − 1} and h_sgn : N → {−1, 1}, sampled from universal hash families U_d and U_{−1,1} respectively, to define the hashing kernel H_{h_d,h_sgn} : R^n → R^d. Applied to an input vector x, it yields h = H_{h_d,h_sgn}(x), where

    h_i = Σ_{k ∈ h_d^{-1}(i)} x_k · h_sgn(k).

We note that although H_{h_d,h_sgn} is defined for vectors, it extends trivially to a matrix M by applying it to each row vector, i.e. H_{h_d,h_sgn}(M)_{i,:} ≡ H_{h_d,h_sgn}(M_{i,:}). In the appendix we prove the next lemma, which follows from (Weinberger et al., 2009) and establishes both of the aforementioned properties:

Lemma 3.2. The space complexity of H_{h_d,h_sgn} is O(1), and:

    E_{h_d∼U_d, h_sgn∼U_{−1,1}} [ H_{h_d,h_sgn}(M) H_{h_d,h_sgn}(M)^T ] = MM^T.

This matrix sketching technique is strongly related to the factorization proposed in Lemma 3.1. To better understand this, consider the approximation M ≈ UΣU^T. If the (asymptotic) solution of VERSE is U√Σ, then our method aims to directly approximate UΣ. We show that this rescaled solution is more computationally tractable, while still preserving the critical information for high-quality node representations. Our algorithm for global node embedding is presented in Algorithm 1.
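The hashing kernel is essentially signed feature hashing; a small sketch (ours) that implements H and checks the unbiasedness claim of Lemma 3.2 by Monte Carlo, simulating draws from the universal families with fresh random hash tables per trial:

```python
import numpy as np

def hash_kernel(x, h_d, h_sgn, d):
    """H(x)_i = sum over {k : h_d(k) = i} of x_k * h_sgn(k)."""
    h = np.zeros(d)
    np.add.at(h, h_d, x * h_sgn)   # accumulate each coordinate into its bucket
    return h

rng = np.random.default_rng(1)
n, d = 64, 8
x = rng.standard_normal(n)

# Monte-Carlo estimate of E[<H(x), H(x)>] over random hash functions.
trials = 5000
est = 0.0
for _ in range(trials):
    h_d = rng.integers(0, d, size=n)          # simulate h_d drawn from a universal family
    h_sgn = rng.choice([-1.0, 1.0], size=n)   # simulate the sign hash
    h = hash_kernel(x, h_d, h_sgn, d)
    est += h @ h
est /= trials

print(est, x @ x)   # the estimate should be close to the true inner product
```

The sign hash cancels cross-terms in expectation, which is why the estimator is unbiased for any sketch dimension d (smaller d only increases the variance).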
First, we compute the PPR matrix PPR (Line 2) with a generic approach (CreatePPRMatrix), which takes a graph and ε, the desired precision of the approximation. We remark that any of the many proposed approaches for computing such a matrix (e.g. from Jeh & Widom (2003); Andersen et al. (2007); Lofgren et al. (2014)) can be used for this calculation. As the PPR matrix can be dense, the same can be said about the implicit matrix M. Thus, we filter out the signal from non-significant PPR values by applying the max operator, and we remove the constant log b from the implicit target matrix. In lines 4-6, the provided hash functions accumulate each value into the corresponding embedding dimension.

Algorithm 1 Global Node Embedding using Personalized PageRank
Input: graph G, embedding dimension d, PPR precision ε, hash functions h_d, h_sgn
Output: embedding matrix W
1: function GraphEmbedding(G, d, ε, h_d, h_sgn)
2:   PPR ← CreatePPRMatrix(G, ε)
3:   W ← 0_{n×d}
4:   for π_i in PPR do
5:     for r_j in π_i do
6:       W[i, h_d(j)] += h_sgn(j) × max(log(r_j · n), 0)
7:   return W

Interestingly, the projection operation only uses information from each node's individual PPR vector π_i to compute its representation. In the following section, we will show that a local calculation of the PPR can be utilized to develop an entirely local algorithm for node embedding.
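A dense numpy sketch of Algorithm 1 (ours; for illustration, CreatePPRMatrix is realized by the exact closed form PPR = α(I − (1 − α)W)⁻¹ rather than an ε-approximation, and the global hash functions are simulated by fixed seeded random draws):

```python
import numpy as np

def graph_embedding(A, d, alpha=0.15, seed=0):
    """Algorithm 1 sketch: hash each row of max(log(n * PPR), 0) into d dimensions."""
    n = A.shape[0]
    W_lazy = 0.5 * (np.eye(n) + np.diag(1.0 / A.sum(axis=1)) @ A)  # lazy walk matrix
    PPR = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * W_lazy)   # exact PPR; row i = pi(e_i)
    rng = np.random.default_rng(seed)          # fixed draws stand in for global h_d, h_sgn
    h_d = rng.integers(0, d, size=n)
    h_sgn = rng.choice([-1.0, 1.0], size=n)
    W = np.zeros((n, d))
    vals = np.maximum(np.log(PPR * n), 0.0)    # filter out non-significant PPR entries
    for i in range(n):
        np.add.at(W[i], h_d, h_sgn * vals[i])  # line 6: accumulate into hashed buckets
    return W

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
E = graph_embedding(A, d=8)
print(E.shape)   # one d-dimensional row per node
```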

3.2 Local Node Embedding via InstantEmbedding

Having a local projection method, all that we require is a procedure that can calculate the PPR vector for a node in time sublinear in the size of the graph. Specifically, for InstantEmbedding we propose that the CreatePPRMatrix operation consist of invoking the SparsePPR routine from Andersen et al. (2007) once for each node i. This routine is an entirely local algorithm for efficiently constructing π_i, the PPR vector for node i, and offers strong guarantees. The next lemma follows from (Andersen et al., 2007) and formalizes the result (proof in Appendix A.4).

Lemma 3.3. For any node v and any constant ε ∈ (0, 1], SparsePPR(v, G, ε) computes an ε-approximate PPR vector π_v in time O(1/(εα)), and the support of π_v satisfies vol(Supp(π_v)) ≤ O(1/(ε(1 − α))).

We present InstantEmbedding, our algorithm for local node embedding, in Algorithm 2. As we will show, it is a self-contained solution to the local node embedding problem that can generate embeddings for individual nodes extremely efficiently. Notably, per Lemma 3.3, the local area around v explored by InstantEmbedding is independent of n. Therefore the algorithm is strictly local.

Algorithm 2 InstantEmbedding
Input: node v, graph G, embedding dimension d, PPR precision ε, hash functions h_d, h_sgn
Output: embedding vector w
1: function InstantEmbedding(v, G, d, ε, h_d, h_sgn)
2:   π_v ← SparsePPR(v, G, ε)
3:   w ← 0_d
4:   for r_j in π_v do
5:     w[h_d(j)] += h_sgn(j) × max(log(r_j · n), 0)
6:   return w
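Putting the two pieces together, a self-contained Python sketch of Algorithm 2 (our re-implementation: the push-style SparsePPR follows the spirit of Andersen et al. (2007), and the multiplicative hash functions are simple deterministic stand-ins for draws from universal families):

```python
import math
from collections import defaultdict

def h_d(j, d):
    return (j * 2654435761 % 2**32) % d                       # stand-in bucket hash

def h_sgn(j):
    return 1.0 if (j * 2246822519 % 2**32) < 2**31 else -1.0  # stand-in sign hash

def sparse_ppr(v, adj, eps=1e-4, alpha=0.15):
    """Push-style local PPR: keep pushing any node whose residual r exceeds
       eps * degree; only a neighborhood of v is ever touched."""
    pi, r = defaultdict(float), defaultdict(float)
    r[v] = 1.0
    active = [v]
    while active:
        w = active.pop()
        rw = r[w]
        if rw <= eps * len(adj[w]):
            continue
        pi[w] += alpha * rw
        r[w] = (1 - alpha) * rw / 2              # lazy walk: half the pushed mass stays
        share = (1 - alpha) * rw / (2 * len(adj[w]))
        for u in adj[w]:
            r[u] += share
            active.append(u)
        active.append(w)                          # re-check w itself
    return pi

def instant_embedding(v, adj, n, d, eps=1e-4):
    """Algorithm 2 sketch: hash the sparse PPR vector into d dimensions."""
    pi = sparse_ppr(v, adj, eps)
    w = [0.0] * d
    for j, rj in pi.items():
        w[h_d(j, d)] += h_sgn(j) * max(math.log(rj * n), 0.0)
    return w

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy undirected graph
emb = instant_embedding(0, adj, n=4, d=8)
print(emb)
```

Since h_d and h_sgn are fixed, repeated calls for the same node return the same vector, which is exactly the global-consistency property stated in Theorem 3.5.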

3.2.1 Analysis

We now prove some basic properties of our proposed approach. First, we show that the runtime of our algorithm is local and independent of n, the number of nodes in the graph. Then, we show that our local computations are globally consistent, i.e., the embedding of a node v is the same whether we compute it locally or recompute the embeddings for all nodes in the graph at the same time. Note that we focus on bounding the running time to compute the embedding for a single node in the graph. Nonetheless, the global complexity to compute all the embeddings can be obtained by multiplying our bound by n, although this is not the focus of this work. We state the following theorem and prove it in Appendix A.5.

Theorem 3.4. The InstantEmbedding(v, G, d, ε) algorithm has running time O(d + 1/(εα(1 − α))).

Besides the embedding size d, both the time and space complexity of our algorithm depend only on the approximation factor ε and the decay factor α. Both are independent of n, the number of nodes, and m, the number of edges. Notably, if O(1/(εα(1 − α))) ∈ o(n), as commonly happens in real-world applications, our algorithm runs in time sublinear in the graph size. Lastly, we note that the space complexity is also sublinear (due to Lemma 3.3), which we show in the appendix. We now turn our attention to the consistency of our algorithm, showing that for a node v the embeddings computed by InstantEmbedding and GraphEmbedding are identical. In the following we denote by GraphEmbedding(G, d, ε)_v the row computed by GraphEmbedding(G, d, ε) for node v, and we prove the following theorem (Appendix A.6).

Theorem 3.5 (Global Consistency). The output of InstantEmbedding(v, G, d, ε) equals GraphEmbedding(G, d, ε)_v.

Complexity Comparison. Table 1 compares the complexity of InstantEmbedding with that of previous works: d, n, and m stand for the embedding dimension, the number of nodes, and the number of edges, respectively.
Specifically, b ≥ 1 stands for the number of samples used in node2vec and VERSE. It is noteworthy that all previous works have time complexity depending on n, and require at least linear time w.r.t. the size of the graph. In contrast, our algorithm depends only on ε and α, and runs in time sublinear in n, the graph size. In Section 4, we experimentally verify the advantages of our principled method.

4. Experiments

In light of the theoretical guarantees about the proposed method, we perform extensive experiments to verify our two main hypotheses:

1. H1. Computing a local node embedding is more efficient than generating a global embedding.
2. H2. The local representations are consistent and of high quality, being competitive with and even surpassing state-of-the-art methods on several tasks.

We assess H1 in Section 4.2, in which we measure the efficiency of generating a single node embedding for each method. Then in Section 4.3 we validate H2 by comparing our method against the baselines on multiple datasets, using the tasks of node classification, link prediction, and visualization. To ensure a relevant and fair evaluation, we compare our method against multiple strong baselines, including DeepWalk (Perozzi et al., 2014), node2vec (Grover & Leskovec, 2016), VERSE (Tsitsulin et al., 2018), and FastRP (Chen et al., 2019). Each method was run on a virtual machine hosted on the Google Cloud Platform, with a 2.3GHz 16-core CPU and 128GB of RAM. All reported results use dimensionality d = 512 for every method. We provide extended results for 4 additional baselines: RandNE (Zhang et al., 2018), NodeSketch (Yang et al., 2019), LouvainNE (Bhowmick et al., 2020) and FREDE (Tsitsulin et al., 2020) on a subset of tasks, along with full details regarding each method and its parameterization, in Appendix B.1. For reproducibility, we release an implementation of our method.¹

4.1 Datasets and Experimental Settings

InstantEmbedding Instantiation. As presented in Section 3, our implementation relies on the choice of PPR approximation used. For instant single-node embeddings, we use the highly efficient PushFlow (Andersen et al., 2007) approximation, which enables us to dynamically load into memory at most 2/(ε(1 − α)) nodes from the full graph to compute a single PPR vector π. This is achieved by storing graphs in a binarized compressed sparse row format that allows selective reads for nodes of interest. In the special case when a full graph embedding is requested, we have the freedom to approximate the PPR in a distributed manner (we omit this from the runtime analysis, as we had no distributed implementations for the baselines, but we note that our local method is trivially parallelizable). We refer to Appendix B.5 for a study of the influence of ε on runtime and quality.

Figure 1: Our method is over 9,000 times faster than FastRP and uses over 8,000 times less memory than VERSE, the next most efficient baselines respectively, on the largest graph that these baseline methods can process.

Datasets. We perform our evaluations on 10 datasets, as presented in Table 2. Detailed descriptions, as well as scale-free and small-world measurements for these datasets, are available in the supplementary material. Note that on YouTube and Orkut the number of labeled nodes is much smaller than the total. We observe this behavior in several real-world application scenarios, where our method shines the most.

4.2 Performance Characteristics

We report the mean wall time and total memory consumption (Wolff) required to generate an embedding (d = 512) for a single node in the given dataset. Note that, due to the nature of all considered baselines, they implicitly have to generate a full graph embedding in order to obtain a single node representation. We repeat the experiment 1,000 times for InstantEmbedding, owing to its locality property; for the baselines, we average the running time over 5 experiments, and measure the memory usage once. For reference, we also report the performance of InstantEmbedding producing a full graph embedding in Appendix B.3.

Running Time. As Figure 1(a) shows, InstantEmbedding is the most scalable method, drastically outperforming all the other methods at the task of producing a single node embedding. We are over 9,000 times faster than the next fastest baseline on the largest graph both methods can process, and can scale to graphs of any size.

Memory Consumption. As Figure 1(b) shows, InstantEmbedding is the most efficient method, able to run on all datasets using negligible memory compared to the other methods. Compared to the next most memory-efficient baseline (VERSE), we are over 8,000 times more efficient on the largest graph both methods can process.

The results of the running time and memory analysis confirm hypothesis H1 and show that InstantEmbedding has a significant speed and space advantage over the baselines. The relative speedup continues to grow as the size of the datasets increases. On a dataset with over 1 billion edges (Friendster), we can compute a node embedding in 80ms, fast enough for a real-time application!

4.3 Embedding Quality

Node Classification. This task measures the semantic information preserved by the embeddings by training a simple classifier on a small fraction of labeled representations. For each method, we perform three different random splits of the data.
More details are available in Appendix B.4.1. In Table 3 we report the mean Micro-F1 scores with their respective confidence intervals (corresponding Macro-F1 scores are in the supplementary material). For each dataset, we perform Welch's t-test between our method and the best-performing contender. We observe that InstantEmbedding is remarkably good at these node classification tasks, despite its several approximations and locality restriction. Specifically, on four out of five datasets no other method is statistically significantly better than ours, and on three of these (PPI, CoCit and YouTube) we achieve the best classification results. In Figure 2, we study how our hyperparameter, the PPR approximation error ε, influences classification performance, running time, and memory consumption. There is a general sweet spot (around ε = 10⁻⁵) across datasets where InstantEmbedding outperforms competing methods while being orders of magnitude faster. Data on the other datasets is available in Section B.5.

Link Prediction. We conduct link prediction experiments to assess the capability of the produced representations to model hidden connections in the graph. For the dataset with temporal information (CoAuthor), we select data until 2014 as training data, and split the co-authorship links from 2015-2016 into two balanced partitions used as validation and test sets. For the other datasets, we uniformly sample 80% of the available edges for training (to learn embeddings on), and use the rest for validation (10%) and testing (10%). Over repeated runs, we vary the splits. More details about the experimental design are available in the supplementary material. We report results for each method in Table 4, which shows the average ROC-AUC and confidence intervals for each method. Across the datasets, our proposed method beats all baselines except VERSE; however, we do achieve the best performance on YouTube by a statistically significant margin.

Visualization.
Figure 3 presents UMAP (McInnes et al., 2018) projections on the CoCit dataset, where we grouped together similar conferences. We note that our sublinear approach is especially well suited to visualizing graph data, as visualization algorithms only require a small subset of points (typically downsampling to only thousands) to generate a visualization for a dataset. The experimental analysis of node classification, link prediction, and visualization shows that, despite relying on two different approximations (PPR and random projection), InstantEmbedding is able to very quickly produce representations that meet or exceed the state of the art in unsupervised representation learning for graph structure, confirming hypothesis H2. Interestingly, InstantEmbedding seems slightly better at node classification than at link prediction. We suspect that the randomization may effectively act as a regularizer, which is more useful for classification.

5. Conclusion

The present work has two main contributions: a) introducing and formally defining the Local Node Embedding problem, and b) presenting InstantEmbedding, a highly efficient method that selectively embeds nodes using only local information, effectively solving the aforementioned problem. As existing graph embedding methods require accessing the global graph structure at least once during the representation generation process, the novelty brought by InstantEmbedding is especially impactful in real-world scenarios where graphs outgrow the capabilities of a single machine and annotated data is scarce or expensive to produce. Selectively embedding only the critical subset of nodes for a task makes many more applications feasible in practice, while reducing the costs for others. Furthermore, we show theoretically that our method embeds a single node in space and time sublinear in the size of the graph. We also empirically show that InstantEmbedding is capable of surpassing state-of-the-art methods while being many orders of magnitude faster than them: our experiments show that we are over 9,000 times faster on large datasets, and on a graph with over 1 billion edges we can compute a representation in 80ms.

Appendix A: Proofs

A.1 VERSE as Matrix Factorization

Proof of Lemma 3.1 (following Tsitsulin et al. (2018)). Since PPR is right-stochastic and the noise distribution does not depend on j, we can decompose the positive and negative terms of the VERSE objective:

    L = Σ_{i=1}^{n} Σ_{j=1}^{n} PPR_{ij} log σ(x_i^T x_j) + (b/n) Σ_{i=1}^{n} Σ_{j′=1}^{n} log σ(−x_i^T x_{j′}).

Isolating the loss for a pair of vertices i, j:

    L_{ij} = PPR_{ij} log σ(x_i^T x_j) + (b/n) log σ(−x_i^T x_j).

We substitute z_{ij} = x_i^T x_j, use our independence assumption, and solve

    ∂L_{ij}/∂z_{ij} = PPR_{ij} σ(−z_{ij}) − (b/n) σ(z_{ij}) = 0

to get z_{ij} = log(n · PPR_{ij} / b), hence log(PPR) + log n − log b = XX^T.

A.2 Hash Kernel

Lemma A.2 (restated from Weinberger et al. (2009)). The hash kernel is unbiased:

    E_{h_d∼U_d, h_sgn∼U_{−1,1}} [ H_{h_d,h_sgn}(x)^T H_{h_d,h_sgn}(x) ] = x^T x.

Corollary A.2.1 (ref. Lemma 3.2). The space complexity of H_{h_d,h_sgn} is O(1), and:

    E_{h_d∼U_d, h_sgn∼U_{−1,1}} [ H_{h_d,h_sgn}(M) H_{h_d,h_sgn}(M)^T ] = MM^T.

Proof. We note that the space complexity required to store a hash function from a universal family is O(1). Indeed, one can choose a universal hash family such that its elements are uniquely determined by a fixed choice of keys. As an example, the multiplicative hash function (Cormen et al. (2009)) h_A(x) = ⌊n(xA mod 1)⌋ requires constant memory to store the key A ∈ (0, 1). To prove that the projection provides unbiased dot products, consider the expectation of each entry:

    E[ H_{h_d,h_sgn}(M) H_{h_d,h_sgn}(M)^T ]_{i,j} = E[ H_{h_d,h_sgn}(M_i)^T H_{h_d,h_sgn}(M_j) ]
                                                  = M_i^T M_j          (from Lemma A.2)
                                                  = (MM^T)_{i,j},

which holds for all pairs i, j.

A.3 Sparse Personalized PageRank Properties

Theorem A.3 (restated from Andersen et al. (2007)). For any starting vector v and any constant ε ∈ (0, 1], the SparsePersonalizedPageRank(v, G, ε, α) algorithm computes an ε-approximate Personalized PageRank vector p. Furthermore, the support of p satisfies vol(Supp(p)) ≤ O(1/(ε(1 − α))), and the running time of the algorithm is O(1/(εα)).

We note here that Andersen et al. (2007) prove their results for the lazy transition matrix and not the standard transition matrix that we consider here. Nevertheless, as discussed in Appendix A.7, switching between the two definitions does not change the asymptotics of their results.

Algorithm 3 SparsePersonalizedPageRank (cf. Andersen et al. (2007))
Input: node v, graph G, precision ε, return probability α
Output: PPR vector π
1: function SparsePersonalizedPageRank(v, G, ε, α)
2:   r ← 0_n (sparse)
3:   π ← 0_n (sparse)
4:   r[v] ← 1
5:   while ∃ w ∈ G, r[w] > ε × deg(w) do
6:     r′ ← r[w]
7:     π[w] ← π[w] + α r′
8:     r[w] ← (1 − α) r′ / 2
9:     r[u] ← r[u] + (1 − α) r′ / (2 deg(w)), ∀(w, u) ∈ E
10:  return π

We now focus our attention on the second part of our algorithm: projecting the sparse PPR vector into the embedding space. For each non-zero entry r_j of the PPR vector π, we compute the hash functions h_d(j), h_sgn(j) and the value max(log(r_j · n), 0) in O(1) time. The total number of iterations is equal to the support size of π, i.e. O(1/(ε(1 − α))). Finally, we note that our algorithm always generates a dense embedding vector, handled in O(d) time; in practice this term is negligible, as 1/ε ≫ d. Summing up the aforementioned bounds, we obtain the total running time of our algorithm:

    O(d + 1/(εα) + 1/(ε(1 − α))) = O(d + 1/(εα(1 − α))).

A.6 Global Consistency

Theorem A.6 (ref. Theorem 3.5). The output of InstantEmbedding(v, G, d, ε) equals GraphEmbedding(G, d, ε)_v.

Proof.
We begin by noting that, for a fixed parameterization, the SparsePersonalizedPageRank routine computes a unique vector for a given node. Analyzing the W_{v,j} entry of the embedding matrix generated by GraphEmbedding(G, d, ε), we have:

W_{v,j} = Σ_{r_k ∈ π_v} h_sgn(k) × max(log(r_k · n), 0) · I[h_d(k) = j]

The entire computation is deterministic and depends only on the chosen hash functions and the indexing of the graph. By fixing the two hash functions h_d and h_sgn, we also have that W_{v,j} = w^v_j, where w^v = InstantEmbedding(v, G, d, ε), ∀v ∈ [0..n−1], j ∈ [0..d−1].

A.7 Reparameterization

We note that Andersen et al. (2007) use a lazy random-walk transition matrix in their paper, and their analysis likewise considers a lazy random walk. Nevertheless, this does not affect the asymptotics of their results; in fact, Proposition 8.1 in Andersen et al. (2007) shows that the two definitions are equivalent up to a small change in α. More precisely, a standard Personalized PageRank with decay factor α is equivalent to a lazy random walk with decay factor α/(2−α). Therefore, all the asymptotic bounds in Andersen et al. (2007) also apply to the classic random-walk setting that we study in this paper.
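For concreteness, the push procedure of Algorithm 3 can be sketched in Python as follows. This is a minimal illustrative implementation, not the paper's code: the graph is a plain dict of adjacency lists, nodes with positive residual are tracked in a work list, and isolated nodes are assumed absent.

```python
from collections import defaultdict

def sparse_personalized_pagerank(adj, v, eps, alpha=0.15):
    """Push-based approximate PPR (Algorithm 3, after Andersen et al., 2007).

    adj: dict mapping node -> list of neighbours (no isolated nodes).
    Returns a sparse dict pi whose support has volume O(1/(eps*(1-alpha))).
    """
    r = defaultdict(float)   # residual mass not yet converted to PPR score
    pi = defaultdict(float)  # approximate PPR scores
    r[v] = 1.0
    queue = [v]              # candidate nodes with r[w] possibly above threshold
    while queue:
        w = queue.pop()
        deg_w = len(adj[w])
        if r[w] <= eps * deg_w:   # residual too small: nothing to push
            continue
        rho = r[w]
        pi[w] += alpha * rho                  # convert an alpha-fraction to score
        r[w] = (1 - alpha) * rho / 2          # keep half of the rest in place
        if r[w] > eps * deg_w:
            queue.append(w)
        spread = (1 - alpha) * rho / (2 * deg_w)
        for u in adj[w]:                      # spread the other half to neighbours
            r[u] += spread
            if r[u] > eps * len(adj[u]):
                queue.append(u)
    return dict(pi)
```

Each push converts mass αρ into π and conserves total mass, so the loop terminates after O(1/(εα)) work and π sums to 1 up to the leftover residual.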

B Experiments

B.1 Method Descriptions

We ran all baselines with 128 and 512 embedding dimensions. As we expect our method to perform better as the projection size increases, we performed an auxiliary test with embedding size 2048 for InstantEmbedding. We also observe that learning-based methods generally do not scale well with an increase of the embedding size. The following are the descriptions and individual parameterizations for each method.

• DeepWalk (Perozzi et al., 2014): Constructs node contexts from random walks and learns representations by increasing node co-occurrence likelihood, modeling the posterior distribution with hierarchical softmax. We set the number of walks per node and their length to 80, and the context window size to 10.
• node2vec (Grover & Leskovec, 2016): Samples random paths in the graph similarly to DeepWalk, while adding two parameters, p and q, that control the behaviour of the walk. Estimates the likelihood through negative sampling. We again set the number of walks per node and their length to 80, the window size to 10, the number of negative samples to 5 per node, and p = q = 1.
• VERSE (Tsitsulin et al., 2018): Minimizes its objective through gradient descent, sampling nodes from PPR random walks and negative samples from a noise distribution. We train it over 10^5 epochs and set the stopping probability to 0.15.
• FastRP (Chen et al., 2019): Computes a high-order similarity matrix as a linear combination of multi-step transition matrices and projects it into an embedding space through a sparse random matrix. We fix the linear coefficients to [0, 0, 1, 6] and the normalization parameter to −0.65.
• InstantEmbedding (this work): Approximates per-node PPR vectors with return probability α and precision ε, which are projected into the embedding space using two fixed hash functions. In all our experiments, we set α = 0.15 and ε > 1/n, where n is the number of nodes in the graph.
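The projection step of InstantEmbedding (last bullet) can be sketched as follows. The concrete hash construction here (`blake2b` keyed by a string seed) is an illustrative stand-in; the method only requires two fixed hash functions, h_d mapping node indices to dimensions and h_sgn mapping them to signs.

```python
import hashlib
import math

def _hash(key, seed, mod):
    # Deterministic stand-in for the fixed hash functions h_d and h_sgn.
    h = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big") % mod

def project_ppr(ppr, n, d=512):
    """Project a sparse PPR dict {node: score} into R^d.

    Implements e[h_d(j)] += h_sgn(j) * max(log(score_j * n), 0)
    over the non-zero PPR entries, as in the W_{v,j} expression of A.6.
    """
    emb = [0.0] * d
    for j, score in ppr.items():
        val = max(math.log(score * n), 0.0)
        if val == 0.0:
            continue  # entries below the uniform score 1/n contribute nothing
        col = _hash(j, "dim", d)                      # h_d: node -> dimension
        sgn = 1.0 if _hash(j, "sign", 2) else -1.0    # h_sgn: node -> {-1, +1}
        emb[col] += sgn * val
    return emb
```

Because the hash functions are fixed, the same sparse PPR vector always yields the same embedding, which is what makes per-node computation globally consistent.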
Four additional baselines were considered for extending a subset of our experiments, as follows:

• RandNE (Zhang et al., 2018): Linearly combines transition matrices of multiple orders, randomly projected through an orthogonal Gaussian matrix. We used transitions up to order 3, with coefficients [1, 100, 10^4, 10^5].
• NodeSketch (Yang et al., 2019): Employs a recursive sketching process using a hash function that preserves Hamming distances. Where provided, we use the recommended default parameters, and on all other datasets we choose the best-performing (order, α) parameters from {(2, 0.0001), (5, 0.001)}.
• LouvainNE (Bhowmick et al., 2020): Aggregates node representations from a successive sub-graph hierarchy. We use the recommended defaults across all datasets, with the Louvain partition strategy and a damping parameter a = 0.01.
• FREDE (Tsitsulin et al., 2020): Sketches the similarity matrix by iteratively computing per-node PPR vectors and using frequent directions. We use the provided default parameters for all datasets in order to measure running times.

Table 5: Analysis of the employed networks in terms of scale-free and small-world measures. The scale-free degree is reported as a Kolmogorov–Smirnov test between power-law and exponential/log-normal candidate distributions (R = mean log-likelihood ratio, p = degree of confidence).

B.2 Dataset Descriptions

The graph datasets we used in our experiments are as follows:

• PPI (Stark et al., 2006): Subgraph of the protein–protein interaction network for the Homo sapiens species, with ground-truth labels represented by biological states. Data originally processed by Grover & Leskovec (2016).
• BlogCatalog (Tang & Liu, 2010): Network of social interactions between bloggers. Authors specify categories for their blogs, which we use as labels.
• Microsoft Academic Graph (MAG) (mag, 2016): Collection of scientific papers, authors, journals, and conferences. Two distinct subgraphs were originally processed by Tsitsulin et al. (2018), based on co-authorship (CoAuthor) and co-citation (CoCit) relations. For the latter, labels are represented by the unique conference where each paper was published.
• Flickr (Tang & Liu, 2010): Contact network of users within 195 randomly sampled interest groups.
• YouTube (Tang & Liu, 2010): Social network of users on the video-sharing platform. Labels are represented by interest groups with at least 500 subscribers.
• Amazon2M (Chiang et al., 2019): Network of products where edges represent co-purchase relations.
• Orkut (Yang & Leskovec, 2015): Social network where users can create and join groups, which are used as ground-truth labels. We followed the approach of Tsitsulin et al. (2018) and selected only the top 50 largest groups.
• Friendster (Yang & Leskovec, 2015): Social network where users can form friendship edges with each other. It also allows users to form groups which other members can then join.

To better understand the variety of our chosen datasets, we report the scale-free and small-world characteristics of the networks in Table 5, fitting a power-law distribution to node degrees and comparing it to two other distributions (exponential and log-normal) through the Kolmogorov–Smirnov test (R = mean log-likelihood ratio, p = degree of confidence).
We note that on the CoAuthor, Flickr, YouTube, Amazon2M, and Orkut networks the null hypothesis can be rejected (p < 0.05), and we thus conclude that the log-normal distribution is a better fit (these graphs are not scale-free). Additionally, we report two small-world-related measures: the pseudo-diameter of the graphs and their global clustering coefficient (transitivity). We observe that the graphs we use in this study are diverse, covering the spectrum from small-world to large-diameter networks.
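A simplified version of this comparison can be sketched as below. It fits a continuous power law by maximum likelihood (Clauset-style) and a log-normal via SciPy, and returns the mean log-likelihood ratio R (positive favours the power law, negative the log-normal). This is an illustrative sketch under stated assumptions (continuous distributions, fixed xmin), not the exact pipeline used for Table 5.

```python
import numpy as np
from scipy import stats

def loglikelihood_ratio(degrees, xmin=1.0):
    """Mean log-likelihood ratio of power-law vs log-normal degree fits.

    R > 0 favours the power law; R < 0 favours the log-normal.
    """
    x = np.asarray([d for d in degrees if d >= xmin], dtype=float)
    n = len(x)
    # MLE exponent for the continuous power law f(x) = ((a-1)/xmin)(x/xmin)^-a
    alpha = 1.0 + n / np.sum(np.log(x / xmin))
    ll_pl = n * np.log((alpha - 1) / xmin) - alpha * np.sum(np.log(x / xmin))
    # MLE log-normal fit (location pinned to 0)
    shape, loc, scale = stats.lognorm.fit(x, floc=0)
    ll_ln = np.sum(stats.lognorm.logpdf(x, shape, loc=loc, scale=scale))
    return (ll_pl - ll_ln) / n
```

In practice one would also test the significance of R (the p value reported in Table 5) rather than rely on its sign alone.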

B.3 Runtime Analysis

B.3.1 General Setup

For the runtime analysis we use the same parameterization as described in B.1 for all methods. In the special case of InstantEmbedding, we dynamically load into memory only the subgraph required to approximate the PPR vector for a single node. We individually ran each method on a virtual machine hosted on the Google Cloud Platform, with a 2.3 GHz 16-core CPU and 128 GB of RAM.

B.3.2 Runtime: Speed

All methods, except the single-threaded FastRP, leveraged the 16 cores of our machines. Some methods did not complete all the tasks: none ran on Friendster; node2vec was unable to run on Amazon2M and Orkut; FastRP did not run on Orkut specifically for a 512-dimensional embedding. We note that all reported numbers are real execution times, also accounting for loading the data into memory. The detailed results are shown in Table 7.

For reference, we also provide the total running time for producing a full graph embedding. We note that when computing the PPR matrix of an entire graph, a local measure may be suboptimal, and we leave optimizing global run time as future work. Nevertheless, we report in Table 6 the total running time of our local method successively applied to all nodes in a graph, with an additional 4 recent baselines. All methods ran on the same machine and produced a 512-dimensional embedding for the node classification task. Of the additional baselines, only RandNE and LouvainNE could scale to Orkut, while FREDE could only produce an embedding on half of the datasets.

B.3.3 Runtime: Memory Usage

The methods that failed to complete in the previous section are also marked here accordingly. We note that, due to the local nature of our method, we can consistently keep the average memory usage under 1 MB for all datasets. This observation reinforces the sublinear guarantees of our algorithm within a good ε-regime, as stated in Lemma A.4.
The detailed results are shown in Table 8.

These tasks aim to measure the semantic information preserved by the embeddings through the generalization capacity of a simple classifier trained on a small fraction of labeled representations. All methods use 512 embedding dimensions. For each method, we perform three different splits of the data, and for our method we generate five embeddings, each time sampling a different projection matrix. We train logistic regression (LR) classifiers using Scikit-Learn (Pedregosa et al., 2011). To simulate the sparsity of labels in the real world, we train on 10% of the available labels for PPI and BlogCatalog and on only 1% for the remaining datasets, testing on the rest. We treat CoCit as a multi-class problem, as each paper is associated with the unique conference where it was published. For Orkut, we follow the approach of Tsitsulin et al. (2018), selecting only the top 50 largest communities and further filtering out nodes belonging to more than one community. In these two cases, we fit a simple logistic regression model on the available labeled nodes. The other datasets have multiple labels per node, for which we follow Perozzi et al. (2014) and use a one-vs-rest LR ensemble; when evaluating, we assume the number of correct labels K is known and select the top K probabilities from the ensemble. We report the average and 90% confidence interval for micro and macro F1-scores at different fractions of known labels.
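The multi-label protocol above (one-vs-rest LR, top-K probabilities with K assumed known) can be sketched as follows; `evaluate_multilabel` is a hypothetical helper name, and the classifier settings are illustrative rather than the exact experimental configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

def evaluate_multilabel(X_train, Y_train, X_test, Y_test):
    """One-vs-rest LR evaluation in the style of Perozzi et al. (2014).

    For each test node, keep the K highest class probabilities,
    where K is that node's true label count (assumed known).
    Returns (micro-F1, macro-F1).
    """
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_train, Y_train)
    probs = clf.predict_proba(X_test)
    preds = np.zeros_like(Y_test)
    for i, row in enumerate(probs):
        k = int(Y_test[i].sum())              # number of correct labels K
        if k:
            preds[i, np.argsort(row)[-k:]] = 1  # select the top-K classes
    return (f1_score(Y_test, preds, average="micro"),
            f1_score(Y_test, preds, average="macro"))
```

In the multi-class cases (CoCit, Orkut) the ensemble reduces to a single LR with one predicted class per node.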
Detailed node classification results are reported for PPI (Table 9), BlogCatalog (Table 10), CoCit (Table 11), Flickr (Table 12), and YouTube (Table 13). We also report experiments with 4 additional baselines in Table 14. The classification task is the same; however, for NodeSketch we used the recommended SVC classifier with a Hamming kernel, as logistic regression could not infer a separation boundary in this particular case. Additionally, for FREDE we reference the scores from the original paper, which has a comparable evaluation framework. We note that although FREDE produces better scores, it does not scale past medium-sized graphs, and its extremely high running times (Table 6) take this approach out of the original scope of our paper.

For the link prediction task we create edge embeddings by combining node representations and treat the problem as binary classification. We observed that different strategies for aggregating embeddings could maximize the performance of different methods under evaluation, so we conducted an in-depth investigation in order to provide the fairest possible evaluation. Specifically, for two node embeddings w and ŵ we adopt the following strategies for creating edge representations:

1. dot product: w·ŵ
2. cosine similarity: w·ŵ / (‖w‖‖ŵ‖)
3. Hadamard product: w ∘ ŵ
4. element-wise average: ½(w + ŵ)
5. L1 element-wise distance: |w − ŵ|
6. L2 element-wise distance: (w − ŵ) ∘ (w − ŵ)

While the first two strategies directly produce a ranking score from two embeddings, for the other four we train a logistic regression on examples from the validation set. In all cases, a likelihood score is attributed to every edge, and we report the ROC-AUC score on the test set. Taking into account that different embedding methods may induce a specific topology of the embedding space, which may in turn favour a specific edge aggregation method, for each method we consider only the strategy that consistently provides good results on all datasets.
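The six aggregation strategies above can be sketched as a single helper (an illustrative sketch; `edge_features` is a hypothetical name). Strategies 1–2 yield a scalar ranking score directly, while 3–6 yield feature vectors for the downstream logistic regression.

```python
import numpy as np

def edge_features(w, w_hat):
    """Compute the six edge-aggregation strategies for two node embeddings."""
    return {
        "dot":      float(w @ w_hat),                                    # 1
        "cosine":   float(w @ w_hat
                          / (np.linalg.norm(w) * np.linalg.norm(w_hat))),  # 2
        "hadamard": w * w_hat,                                           # 3
        "average":  (w + w_hat) / 2,                                     # 4
        "l1":       np.abs(w - w_hat),                                   # 5
        "l2":       (w - w_hat) ** 2,                                    # 6
    }
```

Keeping all six in one place makes it easy to sweep strategies per method and pick the one that is consistently strong, as done in Tables 15 and 16.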
This ensures that all methods can be objectively compared to one another, independent of the particularities of the induced embedding-space geometry. The following tables show a detailed analysis of link prediction results for BlogCatalog (Table 15) and CoAuthor (Table 16).

In order to gain insight into the effect of ε on the behaviour of our method, we test 6 values in the range [10^−1, ..., 10^−6]. We note that decreasing ε is strongly correlated with better classification performance, but also with a larger computational overhead. The only apparent exception is the Micro-F1 score on the BlogCatalog dataset, which drops suddenly when ε = 10^−6. We argue that this is because more probability mass is dispersed further away from the central node, but the max operator cuts that information away (as the number of nodes is small), and thus the resulting embedding is actually less accurate.

Figure 3 presents multiple UMAP (McInnes et al., 2018) projections on the CoCit dataset, where we grouped together similar conferences. We note that our sublinear approach is especially well suited to visualizing graph data, as visualization algorithms (such as t-SNE or UMAP) only require a small subset of points (typically downsampled to only thousands) to generate a visualization for a dataset.



Software available for reviewers and ACs in the OpenReview forum.




Figure1: Required (a) running time and (b) memory consumption to generate a node embedding (d=512) based on the edge count of each graph (|E|), with the best line fit drawn. Our method is over 9,000 times faster than FastRP and uses over 8,000 times less memory than VERSE, the next most efficient baselines respectively, in the largest graph that these baseline methods can process.

Figure 2: The impact of the choice of ε on the quality of the resulting embedding (Micro-F1 score), average running time, and peak memory increase for the YouTube dataset.

Figure 3: UMAP visualization of CoCit (d=512). Research areas (ML, DM, DB, IR).

A.1 Matrix Factorization

Lemma A.1 (restated from Tsitsulin et al. (2020); ref. Lemma 3.1). VERSE implicitly factorizes the matrix log(PPR) + log n − log b into XX⊤, where n is the number of nodes in the graph and b is the number of negative samples.

Proof. From Tsitsulin et al. (2020), following Levy & Goldberg (2014) and Qiu et al.
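On a small graph, the matrix of Lemma A.1 can be computed in closed form, since the exact PPR matrix is α(I − (1−α)D⁻¹A)⁻¹. The sketch below is illustrative only (dense linear algebra, no isolated nodes), and `verse_implicit_matrix` is a hypothetical helper name.

```python
import numpy as np

def verse_implicit_matrix(A, alpha=0.15, b=3):
    """Matrix that VERSE implicitly factorizes: log(PPR) + log n - log b.

    A: dense adjacency matrix of a graph with no isolated nodes.
    The exact PPR matrix follows from pi_v = alpha*e_v + (1-alpha)*pi_v*W,
    i.e. PPR = alpha * (I - (1-alpha) * W)^{-1} with W = D^{-1} A.
    """
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)            # random-walk transitions
    ppr = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * W)
    return np.log(ppr) + np.log(n) - np.log(b)
```

Each row of the PPR matrix is a probability distribution (it sums to 1), which can be recovered from the returned matrix by inverting the element-wise log shift.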

A.4 InstantEmbedding Locality

Lemma A.4 (ref. Lemma 3.3). The InstantEmbedding(v, G, d, ε) algorithm computes the local embedding of a node v by exploring at most O(1/(ε(1−α))) nodes in the neighborhood of v.

Proof. First recall that the only operation that explores the graph in InstantEmbedding is SparsePersonalizedPageRank, which explores a node w if and only if a neighbor w' of w had a positive score (i.e., r[w'] > ε × deg(w') held and thus π[w'] > 0), and so w' is part of the support of π. Furthermore, at the beginning of the algorithm only v is active. It follows that every node explored by the algorithm is connected to v via a path composed only of nodes with π score strictly larger than 0, so its distance from v is bounded by the support of the π vector, which is O(1/(ε(1−α))) cf. Theorem A.3.

A.5 InstantEmbedding Complexity

Theorem A.5 (ref. Theorem 3.4). The InstantEmbedding(v, G, d, ε) algorithm has running time O(d + 1/(εα(1−α))).

Proof. The first step of InstantEmbedding is computing the approximate Personalized PageRank vector. As noted in Theorem A.3, this can be done in time O(1/(εα)).

B.4 Embedding Quality

B.4.1 Quality: Node Classification

Figure 4: Effect of ε on the PPI dataset.

Figure 5: Effect of ε on the BlogCatalog dataset.


Figure 6: Effect of ε on the CoCit dataset.

Figure 7: Effect of ε on the Flickr dataset.

Figure 8: Effect of ε on the YouTube dataset.

Figure 9: UMAP visualization of CoCit (d=512). Research areas (ML, DM, DB, IR).



Average Micro-F1 classification scores and confidence intervals. Our method is marked as follows: * — above baselines; bold — no other method is statistically significantly better.

| Method           | PPI          | BlogCatalog  | CoCit         | Flickr       | YouTube       | Orkut        |
|------------------|--------------|--------------|---------------|--------------|---------------|--------------|
| DeepWalk         | 16.08 ± 0.64 | 32.48 ± 0.35 | 37.44 ± 0.67  | 31.22 ± 0.38 | 38.69 ± 1.17  | 87.67 ± 0.23 |
| node2vec         | 15.03 ± 3.18 | 33.67 ± 0.93 | 38.35 ± 1.75  | 29.80 ± 0.67 | 36.02 ± 2.01  | DNC          |
| VERSE            | 12.59 ± 2.54 | 24.64 ± 0.85 | 38.22 ± 1.34  | 25.22 ± 0.20 | 36.74 ± 1.05  | 81.52 ± 1.11 |
| FastRP           | 15.74 ± 2.19 | 33.54 ± 0.96 | 26.03 ± 2.10  | 29.85 ± 0.26 | 22.83 ± 0.41  | DNC          |
| InstantEmbedding | 17.67* ± 1.22 | 33.36 ± 0.67 | 39.95* ± 0.67 | 30.43 ± 0.79 | 40.04* ± 0.97 | 76.83 ± 1.16 |

Average ROC-AUC scores and confidence intervals for the link prediction task. Our method is marked as follows: * — above baselines; bold — no other method is statistically significantly better.

Total approximate running time for producing a 512-dimensional full graph embedding, with 4 additional recent baselines. In this scenario, InstantEmbedding produced a full graph embedding, as opposed to the originally proposed single node representation task.

Average run time (in seconds) to generate a 128-size and a 512-size node embedding for each method and each dataset with the respective standard deviation. Each experiment was run 5 times for all the methods (given their global property) except for InstantEmbedding for which we ran the experiment 1000 times (given the method's locality property). bold -improvement over the baselines; DNC -Did Not Complete.

Peak memory used (in MB) to generate a 128-size and 512-size node embedding for each method and each dataset. Each experiment was run once for all the methods (given their global property) except for InstantEmbedding for which we ran the experiment 1000 times (given the method's locality property) and report the mean peak memory consumption with the respective standard deviation. bold -improvement over the baselines; DNC -Did Not Complete.

Table 9: Classification micro and macro F1-scores for PPI (micro/macro pairs at increasing fractions of known labels).

| Method           | dim  | Micro-F1     | Macro-F1     | Micro-F1     | Macro-F1     | Micro-F1     | Macro-F1     |
|------------------|------|--------------|--------------|--------------|--------------|--------------|--------------|
| DeepWalk         | 128  | … ± 1.75     | 12.56 ± 1.84 | 21.34 ± 1.20 | 18.59 ± 1.40 | 24.44 ± 0.32 | 20.36 ± 2.74 |
| DeepWalk         | 512  | 16.08 ± 0.64 | 12.89 ± 1.66 | 19.90 ± 1.02 | 18.08 ± 1.11 | 21.51 ± 5.75 | 20.36 ± 5.05 |
| node2vec         | 128  | 15.65 ± 1.46 | 12.07 ± 1.23 | 20.97 ± 1.26 | 17.86 ± 0.85 | 23.99 ± 5.84 | 19.05 ± 2.25 |
| node2vec         | 512  | 15.03 ± 3.18 | 12.19 ± 2.34 | 21.04 ± 1.90 | 18.11 ± 2.13 | 22.02 ± 1.14 | 18.18 ± 3.47 |
| VERSE            | 128  | 14.41 ± 1.40 | 11.56 ± 1.37 | 19.63 ± 1.08 | 16.95 ± 1.61 | 22.01 ± 2.66 | 18.71 ± 0.61 |
| VERSE            | 512  | 12.59 ± 2.54 | 9.54 ± 2.22  | 13.62 ± 0.88 | 11.67 ± 0.85 | 16.00 ± 0.26 | 13.66 ± 0.53 |
| InstantEmbedding | 128  | … ± 1.36     | 11.67 ± 1.09 | 20.51 ± 0.70 | 16.89 ± 0.93 | 21.82 ± 2.47 | 17.49 ± 2.36 |
| InstantEmbedding | 512  | 17.67 ± 1.22 | 13.04 ± 1.06 | 23.50 ± 0.97 | 19.84 ± 1.34 | 25.36 ± 2.32 | 21.21 ± 2.92 |
| InstantEmbedding | 2048 | 18.77 ± 1.22 | 13.76 ± 1.41 | 24.30 ± 0.67 | 20.44 ± 0.85 | 25.85 ± 2.91 | 22.03 ± 3.84 |

B.4.2 Task: Link Prediction

Table 10: Classification micro and macro F1-scores for BlogCatalog.

Table 13: Classification micro and macro F1-scores for YouTube (micro/macro pairs at increasing fractions of known labels).

| Method           | dim  | Micro-F1     | Macro-F1     | Micro-F1     | Macro-F1     | Micro-F1     | Macro-F1     |
|------------------|------|--------------|--------------|--------------|--------------|--------------|--------------|
| DeepWalk         | 128  | … ± 1.40     | 29.04 ± 3.77 | 41.64 ± 0.15 | 34.45 ± 0.70 | 42.97 ± 0.29 | 35.62 ± 0.93 |
| DeepWalk         | 512  | 38.69 ± 1.17 | 31.11 ± 1.08 | 40.26 ± 0.38 | 35.09 ± 0.26 | 40.74 ± 0.06 | 36.14 ± 0.23 |
| VERSE            | 128  | 37.13 ± 0.41 | 28.54 ± 2.39 | 39.74 ± 0.32 | 33.87 ± 0.67 | 41.70 ± 0.38 | 35.04 ± 0.41 |
| VERSE            | 512  | 36.74 ± 1.05 | 27.16 ± 0.15 | 37.47 ± 1.37 | 32.40 ± 0.91 | 37.64 ± 0.67 | 33.00 ± 0.35 |
| node2vec         | 128  | 34.64 ± 2.63 | 25.35 ± 3.83 | 40.62 ± 1.02 | 33.26 ± 0.20 | 42.65 ± 0.70 | 35.73 ± 0.32 |
| node2vec         | 512  | 36.02 ± 2.01 | 25.03 ± 2.89 | 39.64 ± 0.44 | 33.78 ± 0.38 | 40.47 ± 0.85 | 35.01 ± 1.08 |
| FastRP           | 128  | 23.61 ± 1.61 | 6.24 ± 0.61  | 24.16 ± 0.96 | 6.64 ± 1.64  | 24.50 ± 0.29 | 7.09 ± 0.35  |
| FastRP           | 512  | 22.83 ± 0.41 | 7.21 ± 0.20  | 23.43 ± 0.55 | 8.77 ± 0.82  | 23.76 ± 0.64 | 9.56 ± 0.91  |
| InstantEmbedding | 128  | … ± 1.02     | 26.27 ± 1.36 | 40.90 ± 0.53 | 31.57 ± 0.86 | 41.78 ± 0.37 | 32.73 ± 0.51 |
| InstantEmbedding | 512  | 40.04 ± 0.97 | 27.52 ± 1.60 | 43.31 ± 0.41 | 33.98 ± 0.81 | 44.00 ± 0.42 | 35.56 ± 0.69 |
| InstantEmbedding | 2048 | 40.91 ± 0.86 | 28.34 ± 1.43 | 44.82 ± 0.49 | 35.16 ± 1.02 | 45.67 ± 0.32 | 36.90 ± 0.69 |

Approximate Micro-F1 scores with 4 additional baselines. All methods produced 512-dimensional embeddings, with the exception of FREDE, for which we reference the scores from the original paper.

Table 15: Link-prediction ROC-AUC scores for BlogCatalog. For each method, we highlight the aggregation function that consistently performs well on all datasets.

| Method           | dim  | Dot          | Cosine       | Hadamard     | Average      | L1           | L2           |
|------------------|------|--------------|--------------|--------------|--------------|--------------|--------------|
| DeepWalk         | 128  | … ± 2.45     | 63.01 ± 2.83 | 75.73 ± 1.49 | 91.51 ± 0.61 | 91.84 ± 0.88 | 82.07 ± 0.09 |
| DeepWalk         | 512  | 67.70 ± 1.58 | 62.80 ± 2.07 | 72.83 ± 0.82 | 90.94 ± 0.29 | 91.41 ± 0.67 | 83.71 ± 1.46 |
| node2vec         | 128  | 93.12 ± 0.20 | 91.85 ± 1.37 | 22.52 ± 0.41 | 89.90 ± 0.70 | 90.28 ± 1.28 | 94.41 ± 0.53 |
| node2vec         | 512  | 92.18 ± 0.12 | 90.96 ± 0.12 | 12.49 ± 1.20 | 93.89 ± 0.38 | 93.50 ± 0.76 | 93.72 ± 0.26 |
| VERSE            | 128  | 94.96 ± 0.38 | 95.10 ± 0.67 | 85.21 ± 0.88 | 75.74 ± 0.85 | 75.92 ± 0.73 | 94.07 ± 0.47 |
| VERSE            | 512  | 93.42 ± 0.35 | 93.40 ± 0.67 | 61.48 ± 0.88 | 91.52 ± 0.26 | 92.17 ± 0.61 | 93.14 ± 0.58 |
| FastRP           | 128  | 73.54 ± 0.23 | 68.16 ± 0.55 | 76.32 ± 1.90 | 85.78 ± 2.31 | 82.46 ± 2.01 | 89.25 ± 0.85 |
| FastRP           | 512  | 78.34 ± 2.80 | 70.67 ± 0.79 | 79.25 ± 1.02 | 88.68 ± 0.70 | 84.56 ± 0.76 | 90.99 ± 0.55 |
| InstantEmbedding | 128  | … ± 1.48     | 84.95 ± 4.19 | 51.57 ± 1.14 | 72.52 ± 1.71 | 64.39 ± 1.37 | 87.65 ± 0.70 |
| InstantEmbedding | 512  | 92.74 ± 0.60 | 90.77 ± 1.51 | 51.75 ± 1.16 | 83.07 ± 1.00 | 70.39 ± 1.11 | 90.63 ± 0.56 |
| InstantEmbedding | 2048 | 93.84 ± 0.33 | 93.44 ± 0.53 | 51.35 ± 1.18 | 88.95 ± 0.85 | 77.39 ± 1.02 | 92.40 ± 0.42 |

Table 16: Temporal link-prediction ROC-AUC scores for CoAuthor. For each method, we highlight the aggregation function that consistently performs well on all datasets.

| Method           | dim  | Dot          | Cosine       | Hadamard     | Average      | L1           | L2           |
|------------------|------|--------------|--------------|--------------|--------------|--------------|--------------|
| DeepWalk         | 128  | … ± 0.88     | 74.05 ± 1.58 | 83.5 ± 0.12  | 86.99 ± 0.09 | 87.21 ± 0.73 | 73.64 ± 1.72 |
| DeepWalk         | 512  | 78.42 ± 0.53 | 76.40 ± 1.87 | 82.05 ± 1.20 | 87.85 ± 0.29 | 88.43 ± 1.08 | 79.56 ± 0.70 |
| node2vec         | 128  | 80.18 ± 0.67 | 45.00 ± 1.34 | 54.59 ± 0.88 | 70.14 ± 1.31 | 70.32 ± 0.58 | 79.07 ± 0.53 |
| node2vec         | 512  | 86.09 ± 0.85 | 45.19 ± 0.20 | 42.99 ± 1.66 | 72.41 ± 1.84 | 72.70 ± 1.43 | 84.00 ± 0.38 |
| VERSE            | 128  | 93.16 ± 0.44 | 92.74 ± 0.15 | 90.85 ± 0.20 | 79.24 ± 1.49 | 80.27 ± 0.41 | 86.50 ± 0.47 |
| VERSE            | 512  | 92.75 ± 0.73 | 92.36 ± 1.08 | 90.33 ± 0.20 | 72.58 ± 1.17 | 73.82 ± 1.49 | 86.69 ± 1.02 |
| FastRP           | 128  | 60.23 ± 1.78 | 59.97 ± 1.61 | 65.08 ± 0.93 | 78.51 ± 0.64 | 77.66 ± 0.23 | 57.69 ± 1.90 |
| FastRP           | 512  | 61.16 ± 1.75 | 61.92 ± 0.85 | 70.12 ± 0.38 | 82.19 ± 2.22 | 78.51 ± 1.99 | 63.87 ± 1.49 |
| InstantEmbedding | 128  | 89.41 ± 0.67 | 88.88 ± 0.79 | 89.15 ± 0.63 | 66.19 ± 1.92 | 66.78 ± 1.90 | 83.22 ± 0.86 |
| InstantEmbedding | 512  | 90.44 ± 0.48 | 90.10 ± 0.69 | 90.60 ± 0.55 | 76.50 ± 1.44 | 75.76 ± 1.41 | 85.64 ± 0.67 |
| InstantEmbedding | 2048 | 89.45 ± 0.62 | 90.38 ± 0.60 | 90.84 ± 0.44 | 88.42 ± 0.48 | 84.83 ± 0.67 | 87.67 ± 1.07 |

