TIME-VARYING GRAPH REPRESENTATION LEARNING VIA HIGHER-ORDER SKIP-GRAM WITH NEGATIVE SAMPLING

Anonymous

Abstract

Representation learning models for graphs are a successful family of techniques that project nodes into feature spaces that can be exploited by other machine learning algorithms. Since many real-world networks are inherently dynamic, with interactions among nodes changing over time, these techniques can be defined both for static and for time-varying graphs. Here, we show how the skip-gram embedding approach can be used to perform implicit tensor factorization on different tensor representations of time-varying graphs. We show that higher-order skip-gram with negative sampling (HOSGNS) is able to disentangle the role of nodes and time, with a small fraction of the number of parameters needed by other approaches. We empirically evaluate our approach using time-resolved face-to-face proximity data, showing that the learned representations outperform state-of-the-art methods when used to solve downstream tasks such as network reconstruction. Good performance on predicting the outcome of dynamical processes such as disease spreading shows the potential of this new method to estimate contagion risk, providing early risk awareness based on contact tracing data.

1. INTRODUCTION

A great variety of natural and artificial systems can be represented as networks of elementary structural entities coupled by relations between them. The abstraction of such systems as networks helps us understand, predict and optimize their behaviour (Newman, 2003; Albert & Barabási, 2002). In this sense, node and graph embeddings have been established as standard feature representations in many learning tasks (Cai et al., 2018; Goyal & Ferrara, 2018). Node embedding methods map nodes into low-dimensional vectors that can be used to solve downstream tasks such as edge prediction, network reconstruction and node classification. Node embeddings have proven successful in achieving low-dimensional encoding of static network structures, but many real-world networks are inherently dynamic (Holme & Saramäki, 2012). Time-resolved networks are also the support of important dynamical processes, such as epidemic or rumor spreading, cascading failures, consensus formation, etc. (Barrat et al., 2008). Time-resolved node embeddings have been shown to yield improved performance for predicting the outcome of dynamical processes over networks, such as information diffusion and disease spreading (Sato et al., 2019), providing estimates of infection and contagion risk when used with contact tracing data. Since we expect more data from proximity networks to be used for contact tracing and as proxies for epidemic risk (Alsdurf et al., 2020), learning meaningful representations of time-resolved proximity networks can be of extreme importance when facing events such as epidemic outbreaks (Kapoor et al., 2020; Gao et al., 2020). The manual and automatic collection of time-resolved proximity graphs for contact tracing purposes presents an opportunity for the quick identification of possible infection clusters and infection chains.
Even before the COVID-19 pandemic, the use of wearable proximity sensors for collecting time-resolved proximity networks had been largely discussed in the literature, and many approaches have been used to describe patterns of activity and community structure, and to study spreading patterns of infectious diseases (Sapienza et al., 2015; Gauvin et al., 2014; Génois et al., 2015). Here we propose a representation learning model that performs implicit tensor factorization on different higher-order representations of time-varying graphs. The main contributions are as follows: Given that the skip-gram embedding approach implicitly performs a factorization of the shifted pointwise mutual information (PMI) matrix (Levy & Goldberg, 2014), we generalize it to perform implicit factorization of a shifted PMI tensor. We then define the steps to achieve this factorization using higher-order skip-gram with negative sampling (HOSGNS) optimization. We show how to apply 3rd-order and 4th-order SGNS on different higher-order representations of time-varying graphs. Finally, we show that time-varying graph representations learned via HOSGNS outperform state-of-the-art methods when used to solve downstream tasks, even using a fraction of the number of embedding parameters. We report the results of learning embeddings on empirical time-resolved face-to-face proximity data and using such representations as predictors for solving two different tasks: network reconstruction and predicting the outcomes of a SIR spreading process over the time-varying graph. We compare these results with state-of-the-art methods for time-varying graph representation learning.

2. PRELIMINARIES AND RELATED WORK

Skip-gram representation learning. The skip-gram model was designed to compute word embeddings in WORD2VEC (Mikolov et al., 2013), and was afterwards extended to graph node embeddings (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016). Levy & Goldberg (2014) established the relation between skip-gram trained with negative sampling (SGNS) and traditional low-rank approximation methods (Kolda & Bader, 2009; Anandkumar et al., 2014), showing the equivalence of SGNS optimization to factorizing a shifted PMI matrix (Church & Hanks, 1990). This equivalence was later recovered from diverse assumptions (Assylbekov & Takhanov, 2019; Allen et al., 2019; Melamud & Goldberger, 2017; Arora et al., 2016; Li et al., 2015), and exploited to compute closed-form expressions approximated in different graph embedding models (Qiu et al., 2018). In this work, we refer to the shifted PMI matrix as SPMI_κ = PMI − log κ, where κ is the number of negative samples.
Random walk based graph embeddings. Given an undirected, weighted and connected graph G = (V, E) with nodes i, j ∈ V, edges (i, j) ∈ E and adjacency matrix A, graph embedding methods are unsupervised models designed to map nodes into dense d-dimensional representations (d ≪ |V|) (Hamilton et al., 2017). A well-known family of approaches based on the skip-gram model consists in sampling random walks from the graph and processing node sequences as textual sentences. In DEEPWALK (Perozzi et al., 2014) and NODE2VEC (Grover & Leskovec, 2016), the skip-gram model is used to obtain node embeddings from co-occurrences in random walk realizations. Although the original implementation of DEEPWALK uses hierarchical softmax to compute embeddings, we will refer to the SGNS formulation given by Qiu et al. (2018). Since SGNS can be interpreted as a factorization of the word-context PMI matrix (Levy & Goldberg, 2014), the asymptotic form of the PMI matrix implicitly decomposed in DEEPWALK can be derived (Qiu et al., 2018).
Given the 1-step transition matrix P = D^{-1} A, where D = diag(d_1, . . . , d_{|V|}) and d_i = Σ_{j∈V} A_{ij} is the (weighted) node degree, the expected PMI for a node-context pair (i, j) occurring in a window of size T is:

E[PMI_DEEPWALK(i, j) | T] = log [ (1/(2T)) Σ_{r=1}^{T} ( p*(i) (P^r)_{ij} + p*(j) (P^r)_{ji} ) / ( p*(i) p*(j) ) ]   (2.1)

where p*(i) = d_i / vol(G) is the unique stationary distribution for random walks (Masuda et al., 2017) and vol(G) = Σ_{i,j∈V} A_{ij}. We will use this expression in Section 3.2 to build PMI tensors from higher-order graph representations.
Time-varying graphs and their algebraic representations. Time-varying graphs (Holme & Saramäki, 2012) are defined as triples H = (V, E, T), i.e. collections of events (i, j, k) ∈ E representing undirected pairwise relations among nodes at discrete times (i, j ∈ V, k ∈ T). H can be seen as a temporal sequence of static graphs {G^(k)}_{k∈T}, each with adjacency matrix A^(k) such that A^(k)_{ij} = ω(i, j, k) ∈ R is the weight of the event (i, j, k) ∈ E. We can concatenate the list of time-stamped snapshots [A^(1), . . . , A^(|T|)] to obtain a single 3rd-order tensor A_stat(H) ∈ R^{|V|×|V|×|T|} which characterizes the evolution of the graph over time. This representation has been used to discover latent community structures of temporal graphs (Gauvin et al., 2014) and to perform temporal link prediction (Dunlavy et al., 2011). Beyond the above stacked graph representation, more exhaustive representations are possible. In particular, the multi-layer approach (De Domenico et al., 2013) allows mapping the topology of a time-varying graph H into a static network G_H = (V_H, E_H) (the supra-adjacency graph), such that vertices of G_H correspond to pairs (i, k) ≡ i^(k) ∈ V × T of the original time-dependent network.
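As a concrete illustration, the expected PMI of Eq. (2.1) can be computed in closed form from a (small, dense) adjacency matrix. The following sketch is ours, not part of the paper; the function name and the dense-matrix assumption are illustrative:

```python
import numpy as np

def deepwalk_pmi(A, T=2):
    """Expected DeepWalk PMI matrix of Eq. (2.1) for a weighted adjacency A.

    P = D^{-1} A is the 1-step transition matrix and p*(i) = d_i / vol(G)
    the stationary distribution. Pairs with zero co-occurrence probability
    within the window yield -inf entries (e.g. self-pairs when T = 1).
    """
    d = A.sum(axis=1)                 # (weighted) node degrees
    vol = d.sum()                     # vol(G) = sum_ij A_ij
    P = A / d[:, None]                # row-stochastic transition matrix
    p_star = d / vol                  # stationary distribution

    # accumulate p*(i)(P^r)_ij + p*(j)(P^r)_ji over the window r = 1..T
    Pr = np.eye(A.shape[0])
    acc = np.zeros_like(A, dtype=float)
    for _ in range(T):
        Pr = Pr @ P
        term = p_star[:, None] * Pr
        acc += term + term.T
    joint = acc / (2 * T)             # expected co-occurrence probability
    return np.log(joint / np.outer(p_star, p_star))
```

For the complete graph on 3 nodes with unit weights and T = 2, each off-diagonal entry equals log(9/8), which can be checked by hand from Eq. (2.1).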
This representation can be stored in a 4th-order tensor A_dyn(H) ∈ R^{|V|×|V|×|T|×|T|}, equivalent, up to an opportune reshaping, to the adjacency matrix A(G_H) ∈ R^{|V||T|×|V||T|} associated to G_H. Multi-layer representations of time-varying networks have been used to study time-dependent centrality measures (Taylor et al., 2019) and properties of spreading processes (Valdano et al., 2015).
Time-varying graph representation learning. Given a time-varying graph H = (V, E, T), we define as temporal network embedding a model that learns from data, implicitly or explicitly, a mapping function:

f : (v, t) ∈ V × T → v^(t) ∈ R^d   (2.2)

which projects time-stamped nodes into a latent low-rank vector space that encodes structural and temporal properties of the original evolving graph. Many existing methods learn node representations from sequences of static snapshots through incremental updates in a streaming scenario: deep autoencoders (Goyal et al., 2017), SVD (Zhang et al., 2018), skip-gram (Du et al., 2018) and random walk sampling (Béres et al., 2019; Mahdavi et al., 2018; Yu et al., 2018). Another class of models learns dynamic node representations by recurrent/attention mechanisms (Goyal et al., 2020; Li et al., 2018; Sankar et al., 2020; Xu et al., 2020) or by imposing temporal stability among adjacent time intervals (Zhou et al., 2018; Zhu et al., 2016). DYANE (Sato et al., 2019) and WEG2VEC (Torricelli et al., 2020) project the dynamic graph structure into a static graph, in order to compute embeddings with WORD2VEC. Closely related are Nguyen et al. (2018) and Zhan et al. (2020), which learn node vectors according to time-respecting random walks or spreading trajectory paths. Moreover, Kumar et al. (2019) proposed an embedding framework for user-item temporal interactions, and Malik et al. (2020) suggested a tensor-based convolutional architecture for dynamic graphs.
Methods that perform well for predicting outcomes of spreading processes make use of time-respecting supra-adjacency representations such as the one proposed by Valdano et al. (2015). In this representation, random itineraries correspond to temporal paths of the original time-varying graph. The supra-adjacency representation G_H that we refer to in Section 3.2, also used in DYANE, with adjacency matrix A(G_H), is defined by two rules:
1. For each event (i, j, t_0), if i is also active at time t_1 > t_0 and at no other time-stamp between the two, we add a cross-coupling edge between supra-adjacency nodes j^(t_0) and i^(t_1). In addition, if the next interaction of j with other nodes happens at t_2 > t_0, we add an edge between i^(t_0) and j^(t_2). The weights of such edges are set to ω(i, j, t_0).
2. For every case as described above, we also add self-coupling edges (i^(t_0), i^(t_1)) and (j^(t_0), j^(t_2)), with weights set to 1.
Figure 1 shows the differences between a time-varying graph and its time-aware supra-adjacency representation, according to the formulation described above. DYANE computes, given a node i ∈ V, one vector representation for each time-stamped node i^(t) ∈ V^(T) = {(i, t) ∈ V × T : ∃ (i, j, t) ∈ E}.
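The two construction rules above can be sketched in a few lines. This is our own minimal helper, not the reference implementation: events are assumed to be given as (i, j, t, weight) tuples, and weights of repeated cross-coupling edges are accumulated (an assumption on our part):

```python
from collections import defaultdict

def supra_adjacency(events):
    """Time-respecting supra-adjacency graph (rules 1 and 2 in the text).

    `events` is an iterable of (i, j, t, w) tuples. Returns a nested dict
    adj[(u, t_u)][(v, t_v)] = weight over supra-adjacency nodes u^(t_u).
    """
    # times at which each node is active (appears in at least one event)
    active = defaultdict(set)
    for i, j, t, w in events:
        active[i].add(t)
        active[j].add(t)

    def next_active(node, t):
        """First time-stamp strictly after t at which `node` is active."""
        later = [s for s in active[node] if s > t]
        return min(later) if later else None

    adj = defaultdict(dict)
    for i, j, t0, w in events:
        t1 = next_active(i, t0)
        t2 = next_active(j, t0)
        if t1 is not None:
            # rule 1: cross-coupling j^(t0) -> i^(t1) with weight w
            adj[(j, t0)][(i, t1)] = adj[(j, t0)].get((i, t1), 0) + w
            # rule 2: self-coupling i^(t0) -> i^(t1) with weight 1
            adj[(i, t0)][(i, t1)] = 1
        if t2 is not None:
            adj[(i, t0)][(j, t2)] = adj[(i, t0)].get((j, t2), 0) + w
            adj[(j, t0)][(j, t2)] = 1
    return adj
```

For instance, with events (0, 1, 0, 2.0) and (0, 2, 1, 1.0), node 0 is active at times 0 and 1, so the first event produces the cross-coupling edge (1^(0), 0^(1)) with weight 2.0 and the self-coupling edge (0^(0), 0^(1)) with weight 1.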

3. PROPOSED METHOD

Given a time-varying graph H = (V, E, T), we propose a representation learning method that learns disentangled representations for nodes and time slices. More formally, we learn a function:

f* : (v, t) ∈ V × T → (v, t) ∈ R^d × R^d

through a number of parameters proportional to O(|V| + |T|). This embedding representation can then be reconciled with the definition in Eq. (2.2) by combining v and t into a single v^(t) representation using any combination function c : (v, t) ∈ R^d × R^d → v^(t) ∈ R^d. Starting from the existing skip-gram framework for node embeddings, we propose a higher-order generalization of skip-gram with negative sampling (HOSGNS) applied to time-varying graphs. We show that it allows us to implicitly factorize the higher-order relations that characterize tensor representations of time-varying graphs, in the same way that classical SGNS decomposes dyadic relations associated to a static graph. Similar approaches have been applied in NLP for dynamic word embeddings (Rudolph & Blei, 2018), and higher-order extensions of the skip-gram model have been proposed to learn context-dependent (Liu et al., 2015) and syntactic-aware (Cotterell et al., 2017) word representations. Moreover, tensor factorization techniques have been applied to include the temporal dimension in recommender systems (Xiong et al., 2010; Wu et al., 2019), knowledge graphs (Lacroix et al., 2020; Ma et al., 2019) and face-to-face contact networks (Sapienza et al., 2015; Gauvin et al., 2014). To our knowledge, however, this is the first work to merge SGNS with tensor factorization and apply it to learn time-varying graph embeddings.
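The parameter saving can be seen directly: storing one vector per node and one per time slice costs (|V| + |T|) · d numbers, yet any of the |V| · |T| time-stamped representations can be assembled on the fly with a combination function c. A sketch with invented sizes, using the Hadamard product as one possible choice of c:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_times, d = 100, 50, 16   # illustrative sizes, not from the paper

# disentangled factors: (|V| + |T|) * d parameters in total
node_emb = rng.normal(size=(n_nodes, d))   # one vector per node
time_emb = rng.normal(size=(n_times, d))   # one vector per time slice

def combine(v, t):
    """A possible combination function c(v, t) -> v^(t); here the
    Hadamard (element-wise) product. Any map R^d x R^d -> R^d would do."""
    return v * t

# any of the |V| * |T| time-stamped representations is recovered on demand
v_3_at_7 = combine(node_emb[3], time_emb[7])
```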

3.1. HIGHER-ORDER SKIP-GRAM WITH NEGATIVE SAMPLING AS IMPLICIT TENSOR FACTORIZATION

Here we address the problem of generalizing SGNS to learn embedding representations from higher-order co-occurrences. We analyze here the 3rd-order case, giving the description of the general N-order case in the Supplementary Information. Later in this work we focus on 3rd- and 4th-order representations, since these are the most interesting for time-varying graphs.
We consider a set of training samples D = {(i, j, k) : i ∈ W, j ∈ C, k ∈ T} obtained by collecting co-occurrences among elements from three sets W, C and T. While SGNS is limited to node-context pairs (i, j), here D is constructed with three (or more) variables, e.g. by sampling random walks over a higher-order data structure. We denote as #(i, j, k) the number of times the triple (i, j, k) appears in D. Similarly, we use #i = Σ_{j,k} #(i, j, k), #j = Σ_{i,k} #(i, j, k) and #k = Σ_{i,j} #(i, j, k) for the number of times each distinct element occurs in D, with relative frequencies P_D(i, j, k) = #(i, j, k)/|D|, P_D(i) = #i/|D|, P_D(j) = #j/|D| and P_D(k) = #k/|D|.
Optimization is performed as a binary classification task, where the objective is to discern occurrences actually coming from D from random occurrences. We define the likelihood of a single observation (i, j, k) by applying a sigmoid (σ(x) = (1 + e^{−x})^{−1}) to the higher-order inner product [[·]] of the corresponding d-dimensional representations:

P[(i, j, k) ∈ D | w_i, c_j, t_k] = σ([[w_i, c_j, t_k]]) ≡ σ( Σ_{r=1}^{d} W_{ir} C_{jr} T_{kr} )   (3.1)

where the embedding vectors w_i, c_j, t_k ∈ R^d are respectively rows of W ∈ R^{|W|×d}, C ∈ R^{|C|×d} and T ∈ R^{|T|×d}. In the 4th-order case we also have a fourth embedding matrix S ∈ R^{|S|×d} related to a fourth set S. For negative sampling we fix an observed (i, j, k) ∈ D and independently sample j_N and k_N to generate κ negative examples (i, j_N, k_N).
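The higher-order inner product and the likelihood of Eq. (3.1) are straightforward to compute; a small numpy sketch (function names ours, values illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, sigma(x) = (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + np.exp(-x))

def ho_inner(*vecs):
    """Higher-order inner product [[w, c, t, ...]]: sum over the d
    components of the element-wise product of all factor vectors."""
    prod = np.ones_like(vecs[0])
    for v in vecs:
        prod = prod * v
    return prod.sum()

# probability that the triple (i, j, k) comes from D, as in Eq. (3.1)
w_i = np.array([0.5, -1.0, 2.0])
c_j = np.array([1.0, 0.5, 0.1])
t_k = np.array([2.0, 1.0, -1.0])
p = sigmoid(ho_inner(w_i, c_j, t_k))
```

Here [[w_i, c_j, t_k]] = 1.0 − 0.5 − 0.2 = 0.3, so the sample is classified as a true co-occurrence with probability σ(0.3).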
In this way, for a single occurrence (i, j, k) ∈ D, the expected contribution to the loss is:

ℓ(i, j, k) = log σ([[w_i, c_j, t_k]]) + κ · E_{j_N, k_N ∼ P_N} [ log σ(−[[w_i, c_{j_N}, t_{k_N}]]) ]   (3.2)

where the noise distribution is the product of independent marginal probabilities, P_N(j, k) = P_D(j) · P_D(k). The global objective is then the sum of the quantities in Eq. (3.2), each weighted with the corresponding relative frequency P_D(i, j, k). The full loss function can be expressed as:

L = − Σ_{i=1}^{|W|} Σ_{j=1}^{|C|} Σ_{k=1}^{|T|} [ P_D(i, j, k) log σ([[w_i, c_j, t_k]]) + κ P_N(i, j, k) log σ(−[[w_i, c_j, t_k]]) ]   (3.3)

In the Supplementary Information we show the formal steps to obtain Eq. (3.3) for the N-order case, and that its optimization with respect to the embedding parameters satisfies the low-rank approximation of the multivariate shifted PMI tensor into the factor matrices W, C, T:

Σ_{r=1}^{d} W_{ir} C_{jr} T_{kr} ≈ log ( P_D(i, j, k) / P_N(i, j, k) ) − log κ ≡ SPMI_κ(i, j, k)   (3.4)

3.2. TIME-VARYING GRAPH EMBEDDING VIA HOSGNS

While a static graph G = (V, E) is uniquely represented by an adjacency matrix A(G) ∈ R^{|V|×|V|}, a time-varying graph H = (V, E, T) admits diverse possible higher-order adjacency relations (Section 2). Starting from these higher-order relations, we can either use them directly or use random walk realizations to build a dataset of higher-order co-occurrences. In the same spirit in which random walk realizations give rise to the dyadic co-occurrences used to learn embeddings in SGNS, we use higher-order co-occurrences to learn embeddings via HOSGNS. As discussed in Section 3.1, the statistics of higher-order relations can be summarized in multivariate PMI tensors, which derive from proper co-occurrence probabilities among elements. Once such PMI tensors are constructed, we can factorize them via HOSGNS.
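The SPMI tensor of Eq. (3.4) can be assembled directly from raw co-occurrence counts. A sketch of ours (counts and κ are illustrative; we take P_N(i, j, k) as the product of the three marginals, matching the noise distribution used for negatives):

```python
import numpy as np

def spmi_tensor(counts, kappa=5):
    """SPMI_kappa(i,j,k) = log[ P_D(i,j,k) / (P_D(i) P_D(j) P_D(k)) ] - log kappa,
    the 3rd-order tensor implicitly factorized by HOSGNS (Eq. 3.4).

    `counts` is a 3rd-order array of raw co-occurrence counts #(i, j, k).
    """
    P = counts / counts.sum()            # joint distribution P_D(i, j, k)
    Pi = P.sum(axis=(1, 2))              # marginal P_D(i)
    Pj = P.sum(axis=(0, 2))              # marginal P_D(j)
    Pk = P.sum(axis=(0, 1))              # marginal P_D(k)
    Pn = Pi[:, None, None] * Pj[None, :, None] * Pk[None, None, :]
    with np.errstate(divide="ignore"):   # unobserved triples give -inf
        return np.log(P / Pn) - np.log(kappa)
```

As a sanity check, a uniform count tensor has zero PMI everywhere, so the SPMI reduces to the constant −log κ.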
To show the versatility of this approach, we choose PMI tensors derived from two different types of higher-order relations:
1. A 3rd-order tensor P^(stat)(H) ∈ R^{|V|×|V|×|T|}, which gathers relative frequencies of node occurrences in temporal edges:

(P^(stat))_{ijk} = ω(i, j, k) / vol(H)   (3.5)

where vol(H) = Σ_{i,j,k} ω(i, j, k) is the total weight of the interactions occurring in H. These probabilities are associated to the snapshot sequence representation A_stat(H) = [A^(1), . . . , A^(|T|)] and contain information about the topological structure of H.
2. A 4th-order tensor P^(dyn)(H) ∈ R^{|V|×|V|×|T|×|T|}, which gathers occurrence probabilities of time-stamped nodes over random walks on the supra-adjacency graph G_H (as used in DYANE). Using the numerator of Eq. (2.1), tensor entries are given by:

(P^(dyn))_{ijkl} = (1/(2T)) Σ_{r=1}^{T} [ (d_{(ik)}/vol(G_H)) (P^r)_{(ik)(jl)} + (d_{(jl)}/vol(G_H)) (P^r)_{(jl)(ik)} ]   (3.6)

where (ik) and (jl) are lexicographic indices of the supra-adjacency matrix A(G_H) corresponding to nodes i^(k) and j^(l). These probabilities encode causal dependencies among temporal nodes and are correlated with dynamical properties of spreading processes.
We also combine the two representations in a single tensor that is the average of P^(stat) and P^(dyn):

(P^(stat|dyn))_{ijkl} = (1/2) [ (P^(stat))_{ijk} δ_{kl} + (P^(dyn))_{ijkl} ]   (3.7)

where δ_{kl} = 1[k = l] is the Kronecker delta.
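Eq. (3.5) and Eq. (3.7) translate directly into array operations. A dense-array sketch of ours (function names are not from the paper; P^(dyn) is assumed given, e.g. precomputed from Eq. (3.6)):

```python
import numpy as np

def p_stat(snapshots):
    """3rd-order tensor P^(stat) of Eq. (3.5): stack the snapshot adjacency
    matrices A^(1..|T|) along a time axis and normalize by vol(H)."""
    A = np.stack(snapshots, axis=-1)       # shape (|V|, |V|, |T|)
    return A / A.sum()

def p_stat_dyn(Ps, Pd):
    """Combined tensor P^(stat|dyn) of Eq. (3.7): the 3rd-order P^(stat) is
    lifted to 4th order with a Kronecker delta on the two time indices, then
    averaged with the 4th-order P^(dyn)."""
    n_t = Ps.shape[-1]
    lifted = Ps[..., None] * np.eye(n_t)   # lifted[i,j,k,l] = Ps[i,j,k] * delta_kl
    return 0.5 * (lifted + Pd)
```

Since P^(stat) and P^(dyn) each sum to one, the combined tensor is again a proper probability distribution.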
Figure 2: Representation of SGNS and HOSGNS with embedding matrices and operations on embedding vectors. Starting from a random walk realization on a static graph G = (V, E), SGNS takes as input nodes i and j within a context window of size T, and maximizes σ(w_i • c_j). HOSGNS starts from a random walk realization on a higher-order representation of the time-varying graph H = (V, E, T), takes as input nodes i^(k) (node i at time k) and j^(l) (node j at time l) within a context window of size T, and maximizes σ([[w_i, c_j, t_k, s_l]]). In both cases, for each input sample, we fix i and draw κ combinations of j or (j, k, l) from a noise distribution, and we maximize σ(−w_i • c_j) (SGNS) or σ(−[[w_i, c_j, t_k, s_l]]) (HOSGNS) with their corresponding embedding vectors (negative sampling).

The probabilities of negative examples P_N(i, j, k, . . .) are obtained as the product of the marginal distributions P_D(i), P_D(j), P_D(k), . . . Computing the objective function in Eq. (3.3) (or its 4th-order analogue) exactly is computationally expensive, but it can be approximated by a sampling strategy: picking positive tuples according to the data distribution P_D and negative ones according to independent sampling from P_N, the HOSGNS objective can be asymptotically approximated through the optimization of the following weighted cross-entropy loss:

L^(bce) = − (1/B) [ Σ_{(i,j,k,...) ∼ P_D} log σ([[w_i, c_j, t_k, . . .]]) + κ · Σ_{(i,j,k,...) ∼ P_N} log σ(−[[w_i, c_j, t_k, . . .]]) ]   (3.8)

where B is the number of samples drawn in a training step and κ is the negative sampling constant. We additionally apply the warm-up steps explained in the Supplementary Information to speed up training convergence.
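A toy SGD pass over this sampled objective, in the 3rd-order case, can be sketched as follows. This is our own simplification, not the authors' implementation: `pos` and `neg` stand in for mini-batches drawn from P_D and P_N, and the per-sample gradient is derived by hand from Eq. (3.8):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hosgns_step(W, C, T, pos, neg, kappa=1.0, lr=0.05):
    """One SGD pass over a mini-batch of the 3rd-order sampled loss (Eq. 3.8).

    `pos`/`neg` are lists of (i, j, k) index triples meant to be drawn from
    P_D and P_N. Returns the mini-batch loss (computed before each update).
    For -log sigma(s) the gradient w.r.t. s is sigma(s) - 1; for the negative
    term kappa * (-log sigma(-s)) it is kappa * sigma(s).
    """
    loss = 0.0
    for sign, weight, batch in [(+1, 1.0, pos), (-1, kappa, neg)]:
        for i, j, k in batch:
            s = (W[i] * C[j] * T[k]).sum()        # [[w_i, c_j, t_k]]
            loss += -weight * np.log(sigmoid(sign * s))
            g = weight * (sigmoid(s) - (1.0 if sign > 0 else 0.0))
            gW = g * C[j] * T[k]                  # gradients before updating
            gC = g * W[i] * T[k]
            gT = g * W[i] * C[j]
            W[i] -= lr * gW
            C[j] -= lr * gC
            T[k] -= lr * gT
    return loss / (len(pos) + len(neg))
```

Repeated calls on a fixed toy batch should drive the loss down, pushing positive triples toward high scores and negative ones toward low scores.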

4. EXPERIMENTS

For the experiments we use time-varying graphs collected by the SocioPatterns collaboration (http://www.sociopatterns.org) using wearable proximity sensors that sense the face-to-face proximity relations of the individuals wearing them. After training the proposed models (HOSGNS applied to P^(stat), P^(dyn) or P^(stat|dyn)) on each dataset, the embedding matrices W, C, T (and S for the 4th-order variants based on P^(dyn) and P^(stat|dyn)) are mapped to embedding vectors w_i, c_j, t_k (and s_l), where i, j ∈ V and k, l ∈ T, and we use them to solve different downstream tasks: node classification and temporal event reconstruction.

4.1. EXPERIMENTAL SETUP

Datasets. We performed experiments with both empirical and synthetic datasets describing face-to-face proximity of individuals. We used publicly available empirical contact data collected by the SocioPatterns collaboration (Cattuto et al., 2010), with a temporal resolution of 20 seconds, in a variety of contexts: a school ("LYONSCHOOL"), a conference ("SFHH"), a hospital ("LH10"), a high school ("THIERS13"), and offices ("INVS15") (Génois & Barrat, 2018). This is currently the largest collection of open datasets sensing proximity in the same range and temporal resolution used by modern contact tracing systems. In addition, we used social interaction data generated by the agent-based model OpenABM-Covid19 (Hinch et al., 2020) to simulate an outbreak of COVID-19 in an urban setting. We built a time-varying graph from each dataset; for the empirical data we aggregated over 600-second time windows, neglecting snapshots without registered interactions at that time scale. The weight of the link (i, j, k) is the number of events recorded between nodes (i, j) in the aggregated window k. For synthetic data we maintained the original temporal resolution and set link weights to 1. Table 1 shows statistics for each dataset.
Baselines. We compare our approach with several baseline methods from the literature on time-varying graph embeddings, which learn time-stamped node representations: (1) DYANE (Sato et al., 2019), which learns temporal node embeddings with DEEPWALK, mapping a time-varying graph into a supra-adjacency representation; (2) DYNGEM (Goyal et al., 2017), a deep autoencoder architecture which dynamically reconstructs each graph snapshot, initializing model weights with parameters learned in previous time frames; (3) DYNAMICTRIAD (Zhou et al., 2018), which captures structural information and temporal patterns of nodes by modeling the triadic closure process.
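The aggregation step described above can be sketched as follows (a hypothetical helper of ours; raw records are assumed to be (t_seconds, i, j) triples, as in the SocioPatterns release format):

```python
from collections import Counter

def aggregate(contacts, window=600):
    """Aggregate raw 20 s contact records (t, i, j) into `window`-second
    snapshots. The weight of link (i, j, k) is the number of events between
    i and j falling in window k; empty windows simply never appear."""
    weights = Counter()
    for t, i, j in contacts:
        k = t // window                       # index of the time window
        weights[(min(i, j), max(i, j), k)] += 1   # undirected: sort the pair
    return dict(weights)
```

For example, two records between nodes 1 and 2 in the first 600 seconds and one in the next window yield weights 2 and 1 for (1, 2, 0) and (1, 2, 1) respectively.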
Details about hyper-parameters used in each method can be found in the Supplementary Information.

4.2. DOWNSTREAM TASKS

Node Classification. The aim of this task is to classify nodes into epidemic states according to a SIR epidemic process with infection rate β and recovery rate µ. We simulated 30 realizations of the SIR process on top of each empirical graph with different combinations of parameters (β, µ). We used similar combinations of epidemic parameters and the same dynamical process to produce SIR states as described in Sato et al. (2019). We then trained a logistic regression to classify the epidemic states S-I-R assigned to each active node i^(k) during the unfolding of the spreading process. We combine the embedding vectors of HOSGNS using the Hadamard (element-wise) product w_i • t_k. We compared with dynamic node embeddings learned from the baselines. For a fair comparison, all models produce time-stamped node representations with dimension d = 128 as input to the logistic regression.
Temporal Event Reconstruction. In this task, we aim to determine whether an event (i, j, k) is in H = (V, E, T), i.e., whether there is an edge between nodes i and j at time k. We create a random time-varying graph H* = (V, E*, T) with the same active nodes V^(T) and |E| events that are not part of E. Embedding representations learned from H are used as features to train a logistic regression to predict whether a given event (i, j, k) is in E or in E*. We combine the embedding vectors of HOSGNS as follows: for HOSGNS(stat), we use the Hadamard product w_i • c_j • t_k; for HOSGNS(dyn) and HOSGNS(stat|dyn), we use w_i • c_j • t_k • s_k. For the baseline methods, we aggregate vector embeddings to obtain link-level representations with binary operators (Average, Hadamard, Weighted-L1, Weighted-L2 and Concat), as already used in previous works (Grover & Leskovec, 2016; Tsitsulin et al., 2018). For a fair comparison, all models are required to produce event representations with dimension d = 192. Tasks were evaluated using a train-test split.
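The feature constructions used in both tasks can be sketched directly (function names are ours; the operators are the ones listed in the text):

```python
import numpy as np

def event_features(w_i, c_j, t_k, s_k=None):
    """Feature vector for event (i, j, k) fed to the logistic regression:
    Hadamard product w_i * c_j * t_k for HOSGNS(stat); the 4th-order
    variants additionally multiply by s_k."""
    f = w_i * c_j * t_k
    return f if s_k is None else f * s_k

def binary_operators(u, v):
    """Standard operators merging two node vectors into an edge feature,
    used to aggregate baseline embeddings."""
    return {
        "average": (u + v) / 2,
        "hadamard": u * v,
        "weighted_l1": np.abs(u - v),
        "weighted_l2": (u - v) ** 2,
        "concat": np.concatenate([u, v]),
    }
```

Note that Concat doubles the dimension, which is why all models are constrained to the same final event-representation dimension for a fair comparison.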
To avoid information leakage from training to test, we randomly split V and T into train and test sets (V_tr, V_ts) and (T_tr, T_ts), with 70%/30% proportions. For node classification, only nodes in V_tr at times in T_tr were included in the train set, and only nodes in V_ts at times in T_ts were included in the test set. For temporal event reconstruction, only events with i, j ∈ V_tr and k ∈ T_tr were included in the train set, and only events with i, j ∈ V_ts and k ∈ T_ts were included in the test set.

Under review as a conference paper at ICLR 2021
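The leakage-free split described above can be sketched as follows (our own helper; items mixing train and test nodes or times are simply discarded, as implied by the text):

```python
import numpy as np

def leakage_free_split(nodes, times, test_frac=0.3, seed=0):
    """Split nodes and time stamps independently (70/30 by default), so that
    train items use only (V_tr, T_tr) and test items only (V_ts, T_ts)."""
    rng = np.random.default_rng(seed)
    nodes = rng.permutation(list(nodes))
    times = rng.permutation(list(times))
    nv = int(len(nodes) * (1 - test_frac))
    nt = int(len(times) * (1 - test_frac))
    return (set(nodes[:nv]), set(nodes[nv:])), (set(times[:nt]), set(times[nt:]))

def in_train(event, V_tr, T_tr):
    """True iff event (i, j, k) belongs entirely to the train split."""
    i, j, k = event
    return i in V_tr and j in V_tr and k in T_tr
```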

4.3. RESULTS

In this section we first show downstream task performance results for the empirical datasets, leaving results for synthetic datasets to the Supplementary Information. Synthetic datasets are used here to compare the performance of the different approaches in terms of training complexity, by measuring the number of trainable parameters and the training time with a fixed number of training steps. All approaches were evaluated for both downstream tasks in terms of Macro-F1 scores on all datasets: 5 different runs of the embedding model are evaluated on 30 different train-test splits for both downstream tasks, and we report the average score with standard error over all splits. In node classification, every SIR realization is assigned to a single embedding run to compute prediction scores. In event reconstruction, a different random realization H* is assigned to each train-test subset.
Results for the classification of nodes in epidemic states are shown in Table 2. We report here a subset of (β, µ) combinations; others are available in the Supplementary Information. DYNGEM and DYNAMICTRIAD have low scores, since they are not devised to learn from graph dynamics. HOSGNS(stat) is also not able to capture the graph dynamics, due to the static nature of P^(stat). DYANE, HOSGNS(stat|dyn) and HOSGNS(dyn) show good performance, with the two latter HOSGNS variants outperforming DYANE in most combinations of datasets and SIR parameters.
Results for the temporal event reconstruction task are reported in Table 3. Temporal event reconstruction is not performed well by DYNGEM. DYNAMICTRIAD performs better with the Weighted-L1 and Weighted-L2 operators, while DYANE performs better using Hadamard or Weighted-L2. Since the Hadamard product is explicitly used in Eq. (3.1) to optimize HOSGNS, all HOSGNS variants show their best scores with this operator. HOSGNS(stat) outperforms all approaches, setting new state-of-the-art results in this task. The P^(dyn) representation used as input to HOSGNS(dyn) does not focus on events but on dynamics, so its performance for event reconstruction is slightly below DYANE, while HOSGNS(stat|dyn) is comparable to DYANE. Results for HOSGNS models using other operators are available in the Supplementary Information. We observe an overall good performance of HOSGNS(stat|dyn) in both downstream tasks, achieving in almost all cases the second highest score, compared to the other two variants, which excel in one task but have lower performance in the other.
Training Complexity. We report the number of trainable parameters and the training time of each model. Sampling was implemented by picking positive and negative examples from a corpus of random walks sampled from a given graph. For HOSGNS(stat), random walks are sampled from the set of temporal snapshots {G^(k)}_{k∈T} with window size T = 1, and for HOSGNS(dyn), random walks are sampled from the supra-adjacency graph G_H with window size T = 10. With these sampling strategies, positive examples are drawn from the same probability distributions as defined in Eq. (3.5) and Eq. (3.6).
We recall that HOSGNS, by learning disentangled representations of nodes and time intervals, uses a number of parameters in the order of O(|V| + |T|), while models that learn node-time representations (such as DYANE) need a number of parameters that is at least O(|V| · |T|). In the Supplementary Information we include plots with two-dimensional projections of these embeddings, showing that the embedding matrices of the HOSGNS approaches successfully capture both the structure and the dynamics of the time-varying graph.

5. CONCLUSIONS

In this paper, we introduce higher-order skip-gram with negative sampling (HOSGNS) for time-varying graph representation learning. We show that this method is able to disentangle the role of nodes and time, with a small fraction of the number of parameters needed by other methods. The embedding representations learned by HOSGNS outperform other methods in the literature and set new state-of-the-art results for predicting the outcomes of dynamical processes and for temporal event reconstruction. We show that HOSGNS can be intuitively applied to time-varying graphs, but this methodology can be easily adapted to solve other representation learning problems involving multi-modal data and multi-layered graph representations.






Figure 1: A time-varying graph H with three intervals (left) and its corresponding time-respecting supra-adjacency graph G H (right).

Figure 2 summarizes the differences between graph embedding via classical SGNS and time-varying graph embedding via HOSGNS. Here, indices (i, j, k, l) correspond to (node, context, time, context-time) in a 4th-order tensor representation of H. The above tensors gather empirical probabilities P_D(i, j, k, . . .) corresponding to positive examples of observable higher-order relations.

Table 1: Summary statistics of the empirical and synthetic time-varying graph data. In order: number of nodes |V|, number of time steps |T|, number of events |E|, number of active nodes |V^(T)|,

Table 2: Macro-F1 scores for classification of nodes in epidemic states according to different SIR epidemic processes over the empirical datasets. For each (β, µ) we highlight the two highest scores and underline the best one.

Table 3: Macro-F1 scores for temporal event reconstruction in the empirical datasets. We highlight in bold the two best scores for each dataset. For baseline models we underline their highest score.

Number of trainable parameters and training time of each time-varying graph representation learning model, compared between LYONSCHOOL and the synthetic datasets. The embedding dimension is fixed to 128; technical specifications of the computing system and the hyper-parameter configuration are reported in the Supplementary Information.

AVAILABILITY

The source code and data are publicly available at [link to anonymized

