DYG2VEC: REPRESENTATION LEARNING FOR DYNAMIC GRAPHS WITH SELF-SUPERVISION

Anonymous

Abstract

The challenge in learning from dynamic graphs for predictive tasks lies in extracting fine-grained temporal motifs from an ever-evolving graph. Moreover, task labels are often scarce, costly to obtain, and highly imbalanced for large dynamic graphs. Recent advances in self-supervised learning on graphs demonstrate great potential, but focus on static graphs. State-of-the-art (SoTA) models for dynamic graphs are not only incompatible with the self-supervised learning (SSL) paradigm but also fail to forecast interactions beyond the very near future. To address these limitations, we present DyG2Vec, an SSL-compatible, efficient model for representation learning on dynamic graphs. DyG2Vec uses a window-based mechanism to generate task-agnostic node embeddings that can be used to forecast future interactions. DyG2Vec significantly outperforms SoTA baselines on benchmark datasets for downstream tasks while only requiring a fraction of the training/inference time. We adapt two SSL evaluation mechanisms to make them applicable to dynamic graphs and thus show that SSL pre-training helps learn more robust temporal node representations, especially for scenarios with few labels.

1. INTRODUCTION

Graph Neural Networks (GNNs) have recently found great success in representation learning for complex networks of interactions, as present in recommendation systems, transaction networks, and social media (Wu et al., 2020; Zhang et al., 2019; Qiu et al., 2018). However, most approaches ignore the dynamic nature of graphs encountered in many real-world domains. Dynamic graphs model complex, time-evolving interactions between entities (Kazemi et al., 2020; Skarding et al., 2021; Xue et al., 2022). Multiple works have revealed that real-world dynamic graphs possess fine-grained temporal patterns known as temporal motifs (Toivonen et al., 2007; Paranjape et al., 2017). For example, a simple pattern in social networks specifies that two users who share many friends are likely to interact in the future. A robust representation learning approach must be able to extract such temporal patterns from an ever-evolving dynamic graph in order to make accurate predictions.

Self-Supervised Representation Learning (SSL) has shown promise in achieving competitive performance for different data modalities on multiple predictive tasks (Liu et al., 2021). Given a large corpus of unlabelled data, SSL postulates that unsupervised pre-training is sufficient to learn robust representations that are predictive for downstream tasks with minimal fine-tuning. However, it is important to specify a pre-training objective function that induces good performance for the downstream tasks. Contrastive SSL methods, despite their early success, rely heavily on negative samples, extensive data augmentation, and large batch sizes (Jing et al., 2022; Garrido et al., 2022). Non-contrastive methods address these shortcomings by incorporating information-theoretic principles through architectural innovations or regularization; these closely resemble strategies employed in manifold learning and spectral embedding methods (Balestriero & LeCun, 2022).
The success of such SSL methods on sequential data (Tong et al., 2022; Eldele et al., 2021; Patrick et al., 2021) suggests that one can learn rich temporal node embeddings from dynamic graphs without direct supervision. SSL methods are attractive for dynamic graphs because it is often costly to generate ground-truth labels. Contrastive approaches are very sensitive to the quality of the negative samples, and these are challenging to identify in dynamic graphs due to the temporal evolution of interactions and the lack of semantic labels at the contextual level. As a result, it is desirable to explore non-contrastive techniques, but state-of-the-art models for dynamic graphs suffer from shortcomings that make them hard to adapt to SSL paradigms. First, they rely heavily on chronological training or a full history of interactions to construct predictions (Kumar et al., 2019; Xu et al., 2020; Rossi et al., 2020; Wang et al., 2021b). Second, the encoding modules use either inefficient message-passing procedures (Xu et al., 2020), memory blocks (Kumar et al., 2019; Rossi et al., 2020), or expensive random walk-based algorithms (Wang et al., 2021b) that are designed for edge-level tasks only. As a result, while SSL pre-training has been applied successfully to static graphs (Thakoor et al., 2022; Hassani & Khasahmadi, 2020; You et al., 2022), there has been limited success in adapting it to dynamic graphs.

In this work, we propose DyG2Vec, a novel encoder-decoder model for continuous-time dynamic graphs that benefits from a window-based architecture, which acts as a regularizer to avoid over-fitting. DyG2Vec is an efficient attention-based graph neural network that performs message-passing across structure and time to output task-agnostic node embeddings.
Experimental results on 7 benchmark datasets indicate that DyG2Vec outperforms SoTA baselines on future link prediction and dynamic node classification in terms of both performance and speed, particularly in medium- and long-range forecasting. The novelty of our model lies in its compatibility with SoTA SSL approaches: we propose a joint-embedding architecture for DyG2Vec that can benefit from non-contrastive SSL. We adapt two evaluation protocols (linear and semi-supervised probing) to the dynamic graph setting and demonstrate that the proposed SSL pre-training is effective in the low-label regime.

2. RELATED WORK

Self-supervised representation learning: Multiple works explore learning visual representations without labels (see (Liu et al., 2021) for a survey). The more recent contrastive methods generate random views of images through data augmentations, and then force representations of positive pairs to be similar while pushing apart representations of negative pairs (Chen et al., 2020a; He et al., 2019). With the goal of attaining hard negative samples, such methods typically use large batch sizes (Chen et al., 2020a) or memory banks (He et al., 2019; Chen et al., 2020b). Non-contrastive methods such as BYOL (Grill et al., 2020) and VICReg (Bardes et al., 2022) eliminate the need for negative samples through various techniques that avoid representation collapse (Jing et al., 2022). Recently, several SSL methods have been adapted to pre-train GNNs (Xie et al., 2022). Deep Graph Infomax (DGI) (Velickovic et al., 2019) and InfoGCL (Xu et al., 2021) rely on mutual information maximization or information bottlenecking between patch-level and graph-level summaries. MVGRL (Hassani & Khasahmadi, 2020) incorporates multiple views, and BGRL (Thakoor et al., 2022) adapts BYOL to graphs to eliminate the need for negative samples, which are often memory-heavy in the graph setting. The experiments demonstrate the high degree of scalability of non-contrastive methods and their effectiveness in leveraging both labeled and unlabeled data.

Representation learning for dynamic graphs: Early works on representation learning for continuous-time dynamic graphs typically divide the graph into snapshots that are encoded by a static GNN and then processed by an RNN module (Sankar et al., 2020; Pareja et al., 2020; Kazemi et al., 2020). Such methods fail to learn fine-grained temporal patterns at smaller timescales within each snapshot. Therefore, several RNN-based methods were introduced that sequentially update node embeddings as new edges arrive.
JODIE (Kumar et al., 2019) employs two RNN modules to update the source and destination embeddings, respectively, of an arriving edge. DyRep (Trivedi et al., 2019) adds a temporal attention layer to take multi-hop interactions into account when updating node embeddings. TGAT (Xu et al., 2020) uses an attention-based message-passing (AMP) architecture to aggregate messages from a historical neighborhood. TGN (Rossi et al., 2020) alleviates the expensive neighborhood aggregation of TGAT by using an RNN memory module to encode the history of each node. CaW (Wang et al., 2021b) extracts temporal patterns through an expensive procedure that samples temporal random walks and encodes them with an LSTM; this procedure must be performed for every prediction. In contrast to prior works, our method operates on a fixed window of history to generate node embeddings. Additionally, we do not recompute embeddings for every prediction, which allows for efficient computation and memory usage.

3. PROBLEM FORMULATION

A Continuous-Time Dynamic Graph (CTDG) G = (V, E, X) is a sequence of E = |E| interactions, where X = (X^V, X^E) is the set of input features containing the node features X^V ∈ R^{N×D_V} and the edge features X^E ∈ R^{E×D_E}. E = {e_1, e_2, ..., e_E} is the set of interactions. There are N = |V| nodes, and D_V and D_E are the dimensions of the node and edge feature vectors, respectively. An edge e_i = (u_i, v_i, t_i, m_i) is an interaction between any two nodes u_i, v_i ∈ V, where t_i ∈ R is a continuous timestamp and m_i ∈ X^E is an edge feature vector. For simplicity, we assume that the edges are undirected and ordered by time (i.e., t_i ≤ t_{i+1}). A temporal sub-graph G_{i,j} is defined as the set of all edges in the interval [t_i, t_j], i.e., E_{ij} = {e_k | t_i ≤ t_k ≤ t_j}. Any two nodes can interact multiple times throughout the time horizon; therefore, G is a multi-graph. Our goal is to learn a model f that maps the input graph to a representation space. The model is a pre-trainable encoder-decoder architecture, f = (g_θ, d_γ). The encoder g_θ maps a dynamic graph to node embeddings H ∈ R^{N×D_H}; the decoder d_γ performs a task-specific prediction given the embeddings. The model is parameterized by the encoder/decoder parameters (θ, γ). More concretely, H = g_θ(G), Z = d_γ(H; Ē), where Z ∈ R^{N×D_Y} is the prediction of task-specific labels (e.g., edge prediction or source node classification labels) for all edges in Ē. The node embeddings H must capture the temporal and structural dynamics of each node such that the future can be accurately predicted from the past, e.g., future edge prediction given past edges. The main distinction of this design is that, unlike previous dynamic graph models (Rossi et al., 2020; Xu et al., 2020; Wang et al., 2021b), the encoder must produce embeddings independent of the downstream task specifications.
This special trait allows the model to be compatible with the SSL paradigm, in which an encoder is pre-trained separately and then fine-tuned together with a task-specific decoder to predict the labels. To this end, we present the novel DyG2Vec framework, which can learn rich node embeddings at any timestamp t independent of the downstream task. DyG2Vec is formulated as a two-stage framework. In the first stage, we use a non-contrastive SSL method to learn the model f_SSL = (g_θ, d_ψ) over various sampled dynamic sub-graphs with self-supervision; d_ψ is an SSL decoder that is only used in the SSL pre-training stage. In the second stage, a task-specific decoder d_γ is trained on top of the pre-trained encoder g_θ to compute the outputs for the downstream tasks, e.g., future edge prediction or dynamic node classification (Xu et al., 2020; Wang et al., 2021b). We consider two example downstream tasks: future link prediction (FLP) and dynamic node classification (DNC). In each case, there is a prediction horizon of the next K interactions. The test window for FLP starting at time t_i is Ē = {(u_j, v_j, t_j, m_j) | j ∈ [i, i+K]}. This is augmented by a set of K negative edges. Each negative edge (u_j, v'_j, t_j, m_j) differs from its corresponding positive edge only in the destination node v'_j ≠ v_j, which is selected at random from all nodes. The FLP task is then binary classification over this test set of 2K edges. In the DNC task, a dynamic label is associated with each node that participates in an interaction. We are provided with {(u_j, t_j) | j ∈ [i, i+K]}, i.e., the source node and interaction time, and the goal is to predict the source node labels for the next K interactions. The performance metrics are detailed in Appendix A.4.
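The construction of the 2K-edge FLP test set can be sketched as follows; the function name, tuple layout, and seeding are illustrative assumptions rather than the paper's implementation:

```python
import random

def make_flp_eval_set(positive_edges, num_nodes, seed=0):
    """Pair each positive test edge (u, v, t, m) with a negative edge that
    differs only in the destination node v' != v, chosen uniformly at
    random from all nodes, as described in the FLP protocol above."""
    rng = random.Random(seed)
    samples = []
    for (u, v, t, m) in positive_edges:
        samples.append(((u, v, t, m), 1))        # positive edge, label 1
        v_neg = rng.randrange(num_nodes)
        while v_neg == v:                        # enforce v' != v
            v_neg = rng.randrange(num_nodes)
        samples.append(((u, v_neg, t, m), 0))    # negative edge, label 0
    return samples

# 2K = 4 samples for K = 2 positive test edges over 10 nodes.
eval_set = make_flp_eval_set([(0, 1, 0.5, None), (2, 3, 1.0, None)], num_nodes=10)
```

A binary classifier evaluated on `eval_set` then yields the AP scores reported in the experiments.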

4. METHODOLOGY

We now introduce our novel dynamic graph learning framework DyG2Vec, which achieves downstream task-agnostic representations. We first present the SSL pre-training approach with a non-contrastive loss function for dynamic graphs. We then introduce the novel window-based downstream training approach. Finally, we outline the encoder architecture. Given the full dynamic graph G_{0,E}, a set of intervals I is generated by dividing the entire time-span [t_0, t_E] into M = ⌈E/S⌉ − 1 intervals with stride S and interval length W (see Appendix A.4 for details). Let B ⊂ I be a mini-batch of intervals. Given B, the sub-graph sampler m(G, B; W) constructs the mini-batch of input graphs Ĝ = {G_{i,j} | [i,j) ∈ B}. The corresponding mini-batch of target graphs is denoted by Ḡ = {G_{j,j+K} | [i,j) ∈ B}. In principle, G_{i,j} ∈ Ĝ is an input (history) graph used to predict the target labels of the corresponding target (future) graph G_{j,j+K} ∈ Ḡ. The parameter W controls the size of the history, K controls how far the model predicts into the future, and S controls the stride between intervals. In practice, we set S = K so that each edge is predicted exactly once per epoch. Since Ḡ is only provided for training in the downstream task, SSL pre-training operates only on Ĝ, as seen in Figure 2.
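The interval construction can be illustrated with a small sketch over edge indices. The exact indexing is not spelled out here, so the arithmetic below (history windows of at most W edges, each ending where its K-edge target block begins) is an assumed reconstruction:

```python
import math

def make_intervals(num_edges, window, stride):
    """Illustrative sketch of the interval set I: M = ceil(E/S) - 1 windows,
    where the m-th window [i, j) holds up to `window` history edges and its
    target block is the next `stride` edges [j, j + K) with K = stride."""
    num_intervals = math.ceil(num_edges / stride) - 1
    intervals = []
    for m in range(num_intervals):
        j = (m + 1) * stride           # target block starts at edge index j
        i = max(0, j - window)         # history window of at most W edges
        intervals.append((i, j))
    return intervals

# With S = K, every edge after the first block lands in exactly one target.
intervals = make_intervals(num_edges=10, window=4, stride=2)
```

With these assumptions, `make_intervals(10, 4, 2)` yields four windows, `[(0, 2), (0, 4), (2, 6), (4, 8)]`, each sliding forward by the stride.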

4.1. STAGE 1: PRE-TRAINING ON DYNAMIC GRAPHS WITH SELF-SUPERVISION

We formulate a joint-embedding architecture (Bromley et al., 1993) for DyG2Vec in which two views of a mini-batch of sub-graphs are generated through random transformations. The transformations are randomly sampled from a distribution defined by a distortion pipeline. The encoder maps the views to node embeddings, which are processed by the predictor to generate node representations. We minimize an SSL objective (Eq. 2, described below) to optimize the model parameters end-to-end in the pre-training stage. See Figure 1 for the overall design of the SSL framework.

Views: The temporal distortion module generates two views of the input graphs, Ĝ′ = t′(Ĝ) and Ĝ″ = t″(Ĝ), where the transformations t′ and t″ are sampled from a distribution T over a predefined set of candidate graph transformations. In this work, we use edge dropout and edge feature masking (Thakoor et al., 2022) in the transformation pipeline. See Appendix A.4 for more details.

Embedding: The encoding model g_θ is an attention-based message-passing (AMP) neural network that produces node embeddings H.

SSL Objective: In order to learn useful representations, we minimize the VICReg regularization-based SSL loss function from (Bardes et al., 2022):

L_SSL = l(Z′, Z″) = λ s(Z′, Z″) + µ [v(Z′) + v(Z″)] + ν [c(Z′) + c(Z″)] .   (2)

In this loss function, the weights λ, µ, and ν control the emphasis placed on each of the three regularization terms. The invariance term s encourages representations of the two views to be similar. The variance term v is included to prevent the well-known collapse problem (Jing et al., 2022). The covariance term c promotes maximization of the information content of the representations. More details and complete expressions for s, v, and c are provided in Appendix A.3.
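A minimal NumPy sketch of Eq. (2) may help make the three terms concrete. The hinge and centering details follow the VICReg formulation (Bardes et al., 2022); the weight values are illustrative defaults, not DyG2Vec's hyper-parameters:

```python
import numpy as np

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, eps=1e-4):
    """Sketch of L_SSL = lam*s + mu*[v(z1)+v(z2)] + nu*[c(z1)+c(z2)] for
    two (n, d) view embeddings. Weights are illustrative defaults."""
    n, d = z1.shape
    # Invariance term s: mean squared distance between paired embeddings.
    s = np.mean(np.sum((z1 - z2) ** 2, axis=1))

    def variance(z):
        # Hinge on the per-dimension std; penalizes collapsed dimensions.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance(z):
        # Penalize off-diagonal covariance to decorrelate dimensions.
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (lam * s
            + mu * (variance(z1) + variance(z2))
            + nu * (covariance(z1) + covariance(z2)))
```

Note that fully collapsed embeddings (e.g. all-zero rows) make the invariance and covariance terms vanish but are heavily penalized by the variance hinge, which is exactly the anti-collapse mechanism described above.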
Unlike previous regularization-based SSL approaches (Chen et al., 2020a; Bardes et al., 2022) in computer vision, we do not use a projector network because the embedding dimensions are relatively small in the graph domain. The full pre-training procedure is illustrated in Figure 1. Following the pre-training stage, we replace the SSL decoder d_ψ with a task-specific downstream decoder d_γ that is trained on top of the frozen pre-trained encoder.

The window-based training strategy has several major advantages. First, the window acts as a regularizer by providing a natural inductive bias towards recent edges, which are often more predictive of the future. It also avoids costly time-based neighborhood sampling techniques (Wang et al., 2021b). Second, relying on a fixed window size for message-passing allows for constant memory and computational complexity, which is well suited to the practical online streaming-data scenario. Third, unlike previous works (Xu et al., 2020; Wang et al., 2021b), which generate separate node embeddings for each target edge, a generic encoder allows us to use the same set of embeddings for any prediction. This dramatically reduces the training/inference overhead. Another advantage of this design is that it allows the model to forecast unseen edges relatively far into the future, in contrast to existing works (Xu et al., 2020; Rossi et al., 2020) that focus on predicting the next occurring edge.
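The reuse of one set of embeddings for all K target predictions can be sketched as follows, with toy stand-ins (hypothetical `encode`, `decode`, and loss callables) in place of the real encoder and decoder:

```python
def window_training_step(encode, decode, history, targets, loss_fn):
    """Hypothetical sketch of one window-based step: the encoder runs once
    on the history window, and the resulting embeddings H are reused for
    all K target predictions (no per-edge re-encoding)."""
    H = encode(history)                     # single message-passing pass
    total = 0.0
    for (u, v, label) in targets:           # the K future interactions
        total += loss_fn(decode(H, u, v), label)
    return total / len(targets)

# Toy stand-ins: "embeddings" are interaction counts; the decoder sums them.
def toy_encode(history):
    H = {}
    for (u, v, t) in history:
        H[u] = H.get(u, 0) + 1
        H[v] = H.get(v, 0) + 1
    return H

toy_decode = lambda H, u, v: H.get(u, 0) + H.get(v, 0)
toy_loss = lambda score, label: abs(score - label)

avg_loss = window_training_step(toy_encode, toy_decode,
                                [(0, 1, 0.0), (1, 2, 1.0)],   # history window
                                [(0, 2, 2), (1, 3, 1)], toy_loss)
```

The key design point visible here is that `encode` runs once per window, so prediction cost grows with K only through the cheap decoder calls.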

4.3. DYG2VEC ENCODER ARCHITECTURE

Our encoder combines a self-attention mechanism for message-passing with the Time2Vec module (Kazemi et al., 2019) that provides relative time encoding. We also introduce a novel temporal edge encoding that efficiently captures the temporal structural relationship between nodes.

Temporal Attention Embedding: Given a dynamic graph G, the encoder g_θ computes the embedding h_i^L of node i through a series of L multi-head attention (MHA) layers (Vaswani et al., 2017) that aggregate messages from its L-hop neighborhood (Xu et al., 2020; Velickovic et al., 2018). Given a node embedding h_i^{l−1} at layer l−1, we uniformly sample N 1-hop neighborhood interactions of node i, N(i) = {e_p, ..., e_k} ⊆ E. The embedding h_i^l at layer l is calculated by:

h_i^l = W_1 h_i^{l−1} + MHA^l(q^l, K^l, V^l),  q^l = h_i^{l−1},  K^l = V^l = [Φ_p(t_p), ..., Φ_k(t_k)] .

Here, W_1 is a learnable mapping matrix, MHA^l(·) is a multi-head dot-product attention layer, and Φ_p(t_p) represents the edge feature vector of edge e_p = (u_p, v_p, t_p, m_p) ∈ N(i) at time t_p:

Φ_p(t_p) = [h_{u_p}^{l−1} || f_p(t_p) || m_p],  f_p(t_p) = ϕ(t̄_i − t_p) + Θ_p(t_p) ,   (7)
t̄_i = max {t_l | e_l ∈ N(v_p)} ,

where || denotes concatenation and ϕ(·) is a learnable Time2Vec module that helps the model be aware of the relative timespan between a sampled interaction and the most recent interaction of node v_p in the input graph. Θ_p(·) is a temporal edge encoding function, described in more detail below. In contrast to TGAT's recursive message-passing procedure (Xu et al., 2020), the message passing in our encoder is 'flat': at every iteration, the same set of node embeddings is used to propagate messages to neighbors. Our encoder performs message passing once to generate a set of node embeddings H used for all target predictions on Ḡ.

Temporal Edge Encoding: Dynamic graphs often follow evolutionary patterns that reflect how nodes interact over time (Kovanen et al., 2011).
For example, in social networks, two people who share many friends are likely to interact in the future. Therefore, we incorporate two simple yet effective temporal encoding methods that provide inductive biases to capture common structural and temporal evolutionary behaviour of dynamic graphs. The temporal edge encoding function is:

Θ_p(t_p) = W_2 [z_p(t_p) || c_p(t_p)] ,

where we incorporate (i) Temporal Degree Centrality z_p(t_p) ∈ R^2: the concatenated current degrees of nodes u_p and v_p at time t_p; and (ii) Common Neighbors c_p(t_p) ∈ R: the number of common 1-hop neighbors between nodes u_p and v_p at time t_p. By using the degree centrality as an edge feature, the model is able to learn any bias towards more frequent interactions with high-degree nodes. The number of common neighbors helps capture temporal motifs, and it is known to often have a strong positive correlation with the likelihood of a future interaction (Yao et al., 2016).
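The two statistics feeding Θ_p can be computed with a short sketch. Here degree counts distinct neighbors up to time t (the paper's multi-graph degree could equally count interactions), and all names are illustrative; in the model these raw features are mapped through the learnable W_2:

```python
def temporal_edge_features(edges, u, v, t):
    """Return ([deg(u, t), deg(v, t)], common_neighbors(u, v, t)) for an
    interaction (u, v) at time t, using only edges with timestamp <= t.
    `edges` is a list of (src, dst, ts) tuples."""
    nbrs = {}
    for (a, b, ts) in edges:
        if ts > t:                       # only the graph up to time t counts
            continue
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    n_u, n_v = nbrs.get(u, set()), nbrs.get(v, set())
    return [len(n_u), len(n_v)], len(n_u & n_v)

edges = [(0, 1, 0.0), (0, 2, 1.0), (1, 2, 2.0), (0, 3, 5.0)]
z, c = temporal_edge_features(edges, 0, 1, 3.0)   # degrees and common nbrs at t = 3
```

In the toy graph, node 2 is the single common neighbor of 0 and 1 at t = 3, so the edge (0, 1) gets z = [2, 2] and c = 1; the later edge at t = 5 is correctly ignored.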

5. EXPERIMENTS

5.1. EXPERIMENTAL SETUP

Baselines: We compare DyG2Vec to five state-of-the-art baseline models: DyRep (Trivedi et al., 2019) , JODIE (Kumar et al., 2019) , TGAT (Xu et al., 2020) , TGN (Rossi et al., 2020) , and CaW (Wang et al., 2021b) . DyRep, JODIE, and TGN sequentially update node embeddings using an RNN. TGAT applies message passing via attention on a sampled temporal subgraph. CaW samples temporal random walks and learns temporal motifs by counting node occurrences in each walk.

Downstream Tasks:

We evaluate all models on two temporal tasks: future link prediction (FLP) and dynamic node classification (DNC). In FLP, the goal is to predict the probability of a future edge occurring given the source, destination, and timestamp. For each positive edge, we sample a negative edge that the model is trained to predict as negative. The DNC task involves predicting the label of the source node of a future interaction. Both tasks are trained using the binary cross-entropy loss. Contrary to prior works, which only evaluate under the K = 1 setting, we evaluate all models under K ∈ {1, 200, 2000}. This is a more challenging evaluation scheme, as it tests the forecasting capabilities of the models across multiple time horizons. See Appendix A.4 for details. For the FLP task, we report both classification and recommendation metrics: Average Precision (AP), Mean Reciprocal Rank (MRR), and Rec@10. For the DNC task, we report the area under the curve (AUC) metric due to the prevailing issue of class imbalance in dynamic graphs.

Datasets: We use 7 real-world datasets: Wikipedia, Reddit, MOOC, and LastFM (Kumar et al., 2019); SocialEvolution, Enron, and UCI (Wang et al., 2021b). These datasets span a wide range in terms of number of nodes and interactions, time range, and repetition ratio. We perform the same 70%-15%-15% chronological split for all datasets as in (Wang et al., 2021b). The datasets are split differently under two settings: transductive and inductive. More details can be found in Appendix A.1. The code and datasets will be publicly available upon publication.

Training Protocols and Hyper-parameters: We train and evaluate the models under three different settings commonly used in vision SSL works (Grill et al., 2020; Bardes et al., 2022). In the supervised setting, DyG2Vec is initialized with random parameters, trained directly on the downstream tasks, and compared to all supervised baselines.
In the self-supervised setting, the encoder is pre-trained using our SSL framework, and its performance is measured under two evaluation protocols: linear and semi-supervised probing. In the linear evaluation setting, the decoder is trained on top of the frozen encoder and compared to the supervised counterpart. In the semi-supervised evaluation setting, the decoder is trained on top of the frozen pre-trained encoder using a random portion of the dataset (i.e., a fraction of the intervals I). The DyG2Vec encoder performs L = 3 layers of message passing. We sample N = 20 temporal neighbors at each hop, as in Xu et al. (2020). Other hyper-parameters are discussed in Appendix A.5. For the DNC task, following prior work (Rossi et al., 2020), the decoder is trained on top of a frozen encoder that is pre-trained on the future link prediction task unless explicitly stated otherwise.
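The two ranking metrics reported for FLP can be sketched as follows, assuming each test edge comes with the 1-based rank of its true destination among the scored candidate destinations:

```python
def mrr_and_recall_at_k(ranks, k=10):
    """Sketch of Mean Reciprocal Rank and Recall@k (Rec@10 in the tables).
    `ranks` holds, for each test edge, the 1-based rank assigned to the
    true destination among all candidate destinations."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    recall = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, recall

# Four test edges whose true destinations were ranked 1st, 4th, 20th, and 2nd.
mrr, rec10 = mrr_and_recall_at_k([1, 4, 20, 2])
```

Here MRR = (1 + 1/4 + 1/20 + 1/2)/4 = 0.45 and Rec@10 = 3/4, since three of the four true destinations land in the top 10.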

5.2. EXPERIMENTAL RESULTS

Future Link Prediction: We report the transductive test AP scores for future link prediction in Table 1. Unlike previous work, which focuses on K = 1, we evaluate all models under K ∈ {1, 200, 2000} to test their medium- and longer-term forecasting capabilities. Unsurprisingly, all methods degrade as K increases. Our model significantly outperforms all sequential and message-passing baselines on 5/7 of the datasets for K > 1 and is on par with SoTA (CaW) for K = 1. The gap is particularly large on the UCI and SocialEvol. datasets for K = 2000, where DyG2Vec outperforms the second-best method (CaW) by over 10% and 6%, respectively. Interestingly, while SocialEvol. is the largest dataset with ∼2M edges, our model achieves this performance while only using the last 1000 edges (see Table 9) to predict any future edge. This further cements the finding of Xu et al. (2020) that capturing recent interactions may be more important for certain tasks. Our window-based framework offers a good trade-off between capturing recent interactions and recurrent patterns, both of which have a major influence on future interactions. Appendix A.2 contains results in the inductive setting, which show that DyG2Vec is competitive with CaW on 4/7 of the datasets while using a small fraction of the computation (see Figure 3).

Dynamic Node Classification:

We evaluate DyG2Vec on 3 datasets for node classification, where the labels indicate whether a user will be banned from editing/posting after an interaction. This task is challenging both due to its dynamic nature (i.e., nodes can change labels) and the high class imbalance (only 217 of 157K interactions result in a ban). We measure performance using the AUC metric to deal with the class imbalance. Table 2 shows that DyG2Vec outperforms all baselines on 7/9 of the tasks for different K. Interestingly, the performance of all methods does not always drop as K increases. This could be explained by the fact that depending on slightly out-of-date history can help make less noisy and more consistent predictions, thus increasing performance.

Training/Inference Speed: Relying on a fixed window of history to produce task-agnostic node embeddings gives DyG2Vec a significant advantage in speed and memory. Figure 3 shows the performance and runtime per epoch of all methods on the three large datasets, LastFM, SocialEvolution, and MOOC, with K = 200. DyG2Vec is orders of magnitude faster than CaW due to the latter's expensive random walk sampling procedure. RNN-based methods such as TGN have a good runtime on LastFM and MOOC; however, they are significantly slower on SocialEvol., which has a small number of nodes (74) but a large number of interactions (∼2M). This suggests that memory-based methods are slower in settings where a node's memory is updated frequently. Furthermore, while TGAT has a similar AMP encoder, DyG2Vec improves efficiency and performance significantly. This reveals the significance of the window-based mechanism and the encoder architecture. Overall, DyG2Vec presents the best trade-off between speed and performance. This supports the capability of non-contrastive methods to learn generic representations across unlabelled large-scale dynamic graphs, which is in line with the findings for other data modalities (Bardes et al., 2022).
The Random-init baseline is surprisingly good, as observed in recent works (Thakoor et al., 2022), but is outperformed by the SSL pre-trained encoder.

Semi-supervised Learning on Dynamic Node Classification: The DNC task is challenging due to its highly imbalanced labels. Previous works alleviate this issue by pre-training the encoder on future link prediction. In Figure 5, we show that SSL is a more effective pre-training strategy for dynamic graphs than FLP, particularly in the low-label regime, where each model is trained on a portion of the target intervals I. This further cements the finding that reconstruction-based tasks such as link prediction overemphasize proximity, which can be limiting for some downstream tasks (Velickovic et al., 2019; You et al., 2020). We perform a detailed study of different instances of our framework on 3 datasets. All ablation results are reported in Figure 6.

5.3. ABLATION STUDIES

Window Size: We observe that each dataset has its own optimal window size W due to inherently different recurring temporal patterns. As observed by Xu et al. (2020), recent and/or recurrent interactions are often the most predictive of future interactions. Therefore, datasets with long-range dependencies favor larger window sizes to capture recurrent patterns, while some datasets benefit from an increased bias towards recent interactions. Our window-based framework, coupled with uniform neighbor sampling, strikes a balance between the two. Moreover, increasing the window size to 64K for UCI, which is effectively full history as the dataset has 60K edges in total, results in a 4% drop in performance. This shows that the fixed window size also contributes to performance, as it helps limit irrelevant information that is not highly predictive of future interactions.

Number of Layers:

Most datasets benefit from more embedding layers, some (e.g., MOOC) more than others. This suggests that these datasets contain higher-order temporal correlations among nodes that must be learned using long-range message passing. Overall, the results show that one can choose to sacrifice some performance to further improve the speed of DyG2Vec by decreasing the window size and the number of layers.

Temporal Edge Features: The results show a significant decrease in performance on MOOC when temporal edge features are removed (i.e., a 1-5% drop). This indicates that such temporal edge features provide useful multi-hop information about the evolution of the dynamic graph (Yao et al., 2016).

6. CONCLUSION

In this paper, we introduce DyG2Vec, a novel window-based encoder-decoder model for dynamic graphs. We present an efficient attention-based message-passing model that utilizes hierarchical multi-head attention modules to encode node embeddings across time. Furthermore, we present a joint-embedding architecture for dynamic graphs in which two views of temporal sub-graphs are encoded to minimize a non-contrastive loss function. We evaluate the SSL pre-training of DyG2Vec under both linear and semi-supervised protocols and demonstrate the effectiveness of such pre-training on benchmark datasets. Our window-based architecture allows for efficient message-passing and robust forecasting abilities. We aim to further explore ways to improve the capacity of dynamic graph models to learn long-range dependencies. Additionally, other SSL paradigms, such as masked auto-encoders, are worthwhile future explorations for dynamic graphs given their success on sequential tasks.

7. ETHICS STATEMENT

Dynamic graph neural network techniques have been commonly used for prediction tasks in social networks and recommender systems. Our techniques, as an efficient and effective variant of dynamic graph representation technique, can be used in those scenarios to further improve the model performance. However, having such an ability is a double-edged sword. On one hand, it can be beneficial to greatly improve user experience. On the other hand, there may be some concerns about the potential use of the model to exploit data relating to user behaviour and thus invade privacy. Overall, our paper does not include content which has immediate ethical concerns.

8. REPRODUCIBILITY STATEMENT

We describe the details of the benchmarks we used in Appendix A.1. We also include the full implementation details in Appendix A.4, including our detailed designs for the decoder, negative sampler, distortion pipeline for view generation, etc. In addition, we provide a justification for our choice of hyper-parameters in Appendix A.5. Finally, we elaborate on the baselines we consider in our paper, as well as their hyper-parameter details, in Appendix A.6. We believe that, with the code open-sourced upon acceptance and the detailed description of our model and the baselines provided in the paper, our work is fully reproducible.

A.1 DATASETS

We describe several real-world dynamic graph datasets which we train on:

Reddit (Kumar et al., 2019): A dataset tracking active users posting in subreddits. The data is represented as a bipartite graph whose nodes are users or subreddit communities. An edge represents a user posting on a subreddit. Each user post is mapped to an embedding vector which is used as an edge feature. A dynamic label indicates whether a user u is banned from posting after an interaction (post) at time t.

Wikipedia (Kumar et al., 2019): A dataset tracking user edits on Wikipedia pages. The data is also represented as a bipartite graph involving interactions (edits) between users and Wikipedia pages. Each user edit is mapped to an embedding vector which is treated as an edge feature. A dynamic label indicates whether a user u is banned from editing after an interaction (edit) at time t.

MOOC (Kumar et al., 2019): This dataset tracks actions performed by students on the MOOC online course platform. Nodes represent students or course items (i.e., videos, questions, etc.). A dynamic label indicates whether a student u drops out after performing an action at time t.

LastFM (Kumar et al., 2019): This dataset tracks songs that users listen to over one month. Nodes represent users or songs. Dynamic labels are not present.
UCI (Wang et al., 2021b): A dataset recording online posts made by university students on a forum.

Enron (Wang et al., 2021b): A dataset recording email communication between employees of a company over several years.

SocialEvolve (Wang et al., 2021b): A dataset tracking the evolving physical proximity between students in a dormitory over a year.

Dataset Splitting: As mentioned earlier, we follow the setup of Wang et al. (2021b) and perform a 70%-15%-15% chronological split of each dataset. The datasets are split differently under the two settings, transductive and inductive. Under the transductive setting, a dataset is split purely by time, i.e., the model is trained on the first 70% of links and tested on the rest. In the inductive setting, we test the model's prediction performance on edges involving unseen nodes. Therefore, following Wang et al. (2021b), we randomly assign 10% of the nodes to the validation and test sets and remove any interactions involving them from the training set. Additionally, to ensure an inductive setting, we remove from the test set any interactions that do not involve these nodes.
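The two splitting protocols can be sketched as follows (a minimal illustration with hypothetical helper names, not the released code; edges are (source, destination, timestamp) triples assumed sorted by time):

```python
import random

def chronological_split(edges, train_frac=0.70, val_frac=0.15):
    """70/15/15 split of time-sorted edges (the transductive setting)."""
    n = len(edges)
    t_end = int(n * train_frac)
    v_end = int(n * (train_frac + val_frac))
    return edges[:t_end], edges[t_end:v_end], edges[v_end:]

def inductive_split(edges, unseen_frac=0.10, seed=0):
    """Sketch of the inductive protocol: reserve a fraction of nodes as
    unseen, drop their edges from train, and keep only edges that touch
    an unseen node in val/test."""
    train, val, test = chronological_split(edges)
    nodes = {u for u, v, t in edges} | {v for u, v, t in edges}
    rng = random.Random(seed)
    unseen = set(rng.sample(sorted(nodes), max(1, int(len(nodes) * unseen_frac))))
    train = [e for e in train if e[0] not in unseen and e[1] not in unseen]
    val = [e for e in val if e[0] in unseen or e[1] in unseen]
    test = [e for e in test if e[0] in unseen or e[1] in unseen]
    return train, val, test
```

Note that, as in the text, the inductive filter is applied on top of the chronological split, so test edges are both later in time and incident to unseen nodes.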

A.2.1 INDUCTIVE SETTING AND RANKING METRICS

Table 5 reports the future link prediction results in the inductive setting. DyG2Vec is generally competitive with CaW while requiring 50-100× less inference and training runtime (see Figures 3 and 7, respectively). Tables 6 and 7 report the future link prediction ranking performance for K = 1 on the transductive and inductive tasks, respectively. While ranking metrics provide a more fine-grained analysis, the results are consistent with the AP results (see Tables 1 and 5). One of the advantages of the DyG2Vec framework is that, unlike prior work, it can be trained to forecast with K > 1. In fact, the parameter K can be set during training based on how much long-range forecasting is favored. In our experiments, we found that training with K = 200 achieves a good balance between long- and short-range forecasting capabilities. Moreover, the limited history of W edges forces the model to be more inductive, as it predicts based on limited long-range historical information. To better understand this, we trained both DyG2Vec and CaW with K ∈ {1, 200} and evaluated on K ∈ {1, 200, 2000}. Table 8 shows that training DyG2Vec with short-range prediction (K = 1) improves its performance to be on par with CaW on K = 1 and to outperform it for K > 1. However, as expected, this comes at the cost of a ∼2% drop in long-range forecasting (K > 1) compared to DyG2Vec trained with K = 200. On the other hand, CaW's performance drops significantly when trained with K = 200 (i.e., over a 10% drop on UCI and MOOC). We believe this is due to the sampling bias α, which may incorrectly favor recent edges over edges that occurred further in the past but can help long-range forecasting. Unfortunately, we were unable to address this by re-tuning α. An interesting direction for future research would be to study training settings under which all models have improved forecasting abilities.
$S$ is the regularized standard deviation, defined by $S(z, \epsilon) = \sqrt{\mathrm{Var}(z) + \epsilon}$; $\gamma$ is a constant value set to 1 in our experiments; and $\epsilon$ is a small scalar that prevents numerical instability. This term avoids dimensional collapse by maximizing the volume of the distribution of the mapped views in all dimensions. In other words, it prevents the well-known trivial solution in which the representations of the two views of a sample collapse to the same representation (Jing et al., 2022).

Covariance term: The covariance regularization term $c$ decorrelates the different dimensions of the representations and prevents them from encoding similar information. The covariance matrix of $Z$ is $C(Z) = \frac{1}{n}\sum_{i=1}^{n}(z_i - \bar{z})(z_i - \bar{z})^T$, where $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$. The covariance regularization term $c$ is then defined as the sum of the squared off-diagonal coefficients of the covariance matrix: $c(Z) = \frac{1}{d}\sum_{i \neq j}[C(Z)]_{i,j}^2$, where $[C(Z)]_{i,j}$ is the element at row $i$ and column $j$ of the matrix $C(Z)$. Both the variance and covariance terms help maximize the information encoded by the model in the representation space.

Invariance criterion: The invariance criterion $s$ between $Z'$ and $Z''$ is defined as the mean squared Euclidean distance between the representation vectors of the two views: $s(Z', Z'') = \frac{1}{n}\sum_i \|z'_i - z''_i\|_2^2$. The invariance term encourages the parametric mapping to keep the views of an object close in the latent space.

Finally, the SSL loss function $\mathcal{L}_{SSL}$ over a batch of representations is a weighted average of the invariance, variance, and covariance terms:
$$\mathcal{L}_{SSL} = l(Z', Z'') = \lambda s(Z', Z'') + \mu[v(Z') + v(Z'')] + \nu[c(Z') + c(Z'')].$$
In our experiments, we set $\lambda = \mu = 25$ and $\nu = 1$, following Bardes et al. (2022).
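A minimal pure-Python sketch of the three VICReg terms as defined above (function names are ours; a practical implementation would use batched tensor operations):

```python
import math

def variance_term(Z, gamma=1.0, eps=1e-4):
    """v(Z): mean over dimensions of the hinge on the per-dimension
    regularized std across the batch."""
    n, d = len(Z), len(Z[0])
    total = 0.0
    for j in range(d):
        col = [z[j] for z in Z]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n + eps)
        total += max(0.0, gamma - std)
    return total / d

def covariance_term(Z):
    """c(Z): sum of squared off-diagonal covariance entries, divided by d."""
    n, d = len(Z), len(Z[0])
    means = [sum(z[j] for z in Z) / n for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((z[i] - means[i]) * (z[j] - means[j]) for z in Z) / n
                total += cov ** 2
    return total / d

def invariance_term(Za, Zb):
    """s: mean squared Euclidean distance between paired representations."""
    n = len(Za)
    return sum(sum((a - b) ** 2 for a, b in zip(za, zb))
               for za, zb in zip(Za, Zb)) / n

def ssl_loss(Za, Zb, lam=25.0, mu=25.0, nu=1.0):
    """Weighted VICReg objective with the paper's lambda = mu = 25, nu = 1."""
    return (lam * invariance_term(Za, Zb)
            + mu * (variance_term(Za) + variance_term(Zb))
            + nu * (covariance_term(Za) + covariance_term(Zb)))
```

Note how a constant batch drives the invariance and covariance terms to zero while the variance term stays near γ, which is exactly the collapse mode the hinge penalizes.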

A.4 IMPLEMENTATION DETAILS

We train our model using the PyTorch framework (Paszke et al., 2019). The dynamic graph data and the GNN encoder architecture are implemented using PyTorch Geometric (Fey & Lenssen, 2019). The ReLU activation function is used for all models. The code and datasets are publicly available.

Window-based framework: As mentioned in Section 4, the full dynamic graph $\mathcal{G}_{0,E}$ is divided into a set of intervals $I$, generated by dividing the entire time-span into $M = \lceil E/S \rceil - 1$ intervals with stride $S$ and interval length $W$: $I = \{[\max(0, jS - W), \min(jS, E)) \mid j \in \{1, 2, \ldots, M\}\}$. Here, $W$ defines the number of edges in an interval and $S$ defines the stride. Note that we include all intervals up to but not including $[E - W, E)$, so that the target interval contains at least one edge.

Negative Sampling: For future link prediction, we sample an equal number of negative and positive (target) interactions. Negative interactions are generated by sampling a random negative destination node from the set of all possible nodes. For the Recall@10 and MRR metrics, we sample 500 random negative destinations for each positive target edge.

Decoder Architecture: Denote by $t_{max}$ the timestamp of the latest interaction, within the provided history, incident to node $u$. For future link prediction, to predict a target interaction $(u, v, t)$, our decoder maps the sum of the node embeddings of $u$ and $v$ and a time embedding of $t - t_{max}$ to an edge probability. Following Xu et al. (2020), the FLP decoder is a 2-layer MLP. For dynamic node classification, to predict the label of node $u$ for interaction $(u, v, t)$, the decoder maps the source node embedding and the time embedding of $t - t_{max}$ to class probabilities. Following Xu et al. (2020), the DNC decoder is a 3-layer MLP with a dropout layer with $p = 0.1$. The time embedding is calculated using a trainable Time2Vec module (Kazemi et al., 2019).
The time embedding allows the decoder to be time-aware and hence possibly output different predictions for the same nodes/edges at different timestamps. For downstream training, the window size $W$ is tuned on the validation set over the range {1K, 4K, 8K, 12K, 32K, 65K}. The best window size per dataset can be found in Table 9. The target window size $K$ is fixed to 200 during training. The stride is always fixed equal to $K$, so that each edge is predicted exactly once. One could augment the dataset by varying $S$, but we leave this for future work. The batch size is always set to 1; hence, we only predict one target interval of size $K$ at a time. The model could be sped up by increasing the batch size at the cost of higher memory. During SSL pre-training, we use a constant window size of 32K with stride 200. The DyG2Vec encoder hyperparameters can be found in Table 10. Following previous work (Rossi et al., 2020; Xu et al., 2020), all dynamic node classification training experiments use an L2-decay parameter $\lambda = 10^{-5}$ to alleviate over-fitting.
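The interval bookkeeping and negative sampling described above can be sketched as follows (a minimal illustration; names and signatures are ours, not from the released code):

```python
import math
import random

def make_intervals(E, W, S):
    """History intervals [max(0, jS - W), min(jS, E)) for j = 1..M,
    with M = ceil(E / S) - 1 so the target interval keeps >= 1 edge."""
    M = math.ceil(E / S) - 1
    return [(max(0, j * S - W), min(j * S, E)) for j in range(1, M + 1)]

def sample_negatives(pos_edges, all_nodes, num_neg=1, seed=0):
    """For each positive (u, v, t), corrupt the destination with a
    uniformly random node (num_neg = 1 for training, 500 for ranking)."""
    rng = random.Random(seed)
    nodes = list(all_nodes)
    return [(u, rng.choice(nodes), t)
            for (u, v, t) in pos_edges
            for _ in range(num_neg)]
```

With stride S = K, consecutive intervals advance by exactly the target-window length, so every edge appears once as a prediction target.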

A.6 BASELINES

Baselines: Following prior work (Rossi et al., 2020; Xu et al., 2020), all baselines are trained with a constant learning rate of 0.0001 using the Adam optimizer (Kingma & Ba, 2014) with batch size 200 for a total of 50 epochs. An early stopping strategy halts training if the validation AP does not improve for 5 epochs. For JODIE (Kumar et al., 2019), DyRep (Trivedi et al., 2019), and TGN (Rossi et al., 2020), we use the general framework implemented by Rossi et al. (2020). The node memory dimension is set to 172. Other encoder hyperparameters are specified in Table 10. For TGAT, we use the default hyperparameters: 2-layer neighbor sampling with 20 neighbors sampled at each hop. Other hyperparameters are specified in Table 10. For the CaW method, we tune the time decay parameter α ∈ {0.01, 0.1, 0.3, 0.5, 1, 2, 4, 10} × 10⁻⁶ and the length of the walks m ∈ {2, 3, 4, 5} on the validation set. The optimal hyperparameters for each dataset are specified in Table 11. The number of heads for the walk-based attention is fixed to 8.

A.8 ADDITIONAL RELATED WORK

Self-supervised learning for dynamic graphs: Most adaptations of SSL for dynamic graphs have focused on improving downstream task performance via auxiliary losses rather than learning general pre-trained models. Jiang et al. (2021) adapt a sub-graph contrastive learning method (Jiao et al., 2020) in which a node representation is contrasted in both structure and time. That is, for each node in the graph, a GNN encoder is trained to contrast its real temporal subgraph with a fake temporal subgraph. This is done by constructing a positive sample, a structural negative sample, and a temporal negative sample. The positive sample is a time-weighted subgraph representation. A margin triplet loss is proposed to maximize the mutual information with the positive sample while maximizing the distance to the structural and temporal negative samples.
Experiments on the downstream link prediction task under the freeze setting show improvements over baselines. However, their approach has several shortcomings. First, the initial node features are computed as one-hot encodings, which makes the method unsuitable for the inductive scenario (i.e., predicting on new nodes). Second, contrastive learning methods are known to incur high memory and computation costs due to negative sampling (Thakoor et al., 2022), which makes the method less desirable for large-scale graphs. Third, they do not include results on other downstream tasks (e.g., dynamic node classification). Lastly, they do not compare to the SoTA CaW method (Wang et al., 2021b).

Cong et al. (2022) propose the dynamic graph transformer (DGT), a transformer-based graph encoder for discrete-time dynamic graphs. DGT is composed of two tower networks that embed the temporal evolution and the topological information of the input graph. Moreover, a temporal-union graph structure is proposed to efficiently summarize the temporal evolution into one graph. DGT is trained to encode the temporal-union graph using two complementary self-supervised pretext tasks: temporal reconstruction and multi-view contrasting. The first aims to reconstruct a snapshot given the past and present, similar to how language models are trained; the latter is trained via non-contrastive learning on two views with randomly masked nodes. Altogether, DGT outperforms SoTA discrete-time baselines on several datasets for link prediction tasks. While they operate in a different domain, an interesting direction for future work would be to adapt their pre-training strategy to continuous-time dynamic graphs.

Tian et al. (2021) adapt the TGAT encoder with a self-supervised contrastive framework across time.
That is, they propose an extension to the classic contrastive learning paradigm by contrasting two nearby temporal views of the same node using a time-dependent similarity metric. Moreover, a debiased contrastive loss is used to correct the typical negative-sampling bias of contrastive learning. Experiments in the fine-tune and multi-task learning settings show that the simple TGAT encoder can be significantly improved on both future link prediction and dynamic node classification. Nonetheless, their approach has several shortcomings. First, it is built on the TGAT encoder which, as seen in Tables 1 and 5, is a weak encoder, particularly for large datasets. Second, experiments for the FLP task are limited to the Reddit and Wikipedia datasets, which are relatively easy. Lastly, the authors do not experiment under the standard settings of the graph SSL literature, such as the freeze and semi-supervised settings. Table 12 shows the results for downstream future link prediction under the freeze setting. The results show up to a 10% gap compared to DyG2Vec, particularly on datasets where the TGAT encoder under-performs (e.g., Enron, UCI).

More encoders for temporal graphs: Souza et al. (2022) is a concurrent work that establishes a series of theoretical results on temporal graph encoders. Their analysis exposes several weaknesses of both memory-based methods (e.g., TGN) and walk-based methods (e.g., CaW). Given these insights, they propose PINT, a memory-based method that leverages injective message-passing and novel relative positional encodings. The relative positional encodings count how many temporal walks of a given length exist between two nodes. Experiments show significant improvement over SoTA baselines on the link prediction task. An interesting direction for future research would be to evaluate the expressive power of DyG2Vec against baselines using their theoretical framework (e.g., the temporal WL test). Wang et al.
(2021a) adapt the vanilla transformer architecture to dynamic graphs by designing a two-stream encoder that extracts temporal and structural information from the temporal neighborhoods of any two interacting nodes. Rather than treating link prediction as a binary classification task, the authors leverage a contrastive learning strategy that maximizes the mutual information between the representations of future interacting nodes. Experiments show improved performance on future link prediction thanks to the more robust contrastive training strategy. Nonetheless, the paper does not compare to the SoTA CaW method (Wang et al., 2021b), and its experiments are limited to the future link prediction task.



Footnotes:
We are going to open-source the code upon acceptance: https://github.com/anon/anon.git
TGN framework (also used for JODIE and DyRep): https://github.com/twitter-research/tgn
TGAT: https://github.com/StatsDLMathsRecomSys/Inductive-representation-learning-on-temporal-graphs
CaW: https://github.com/snap-stanford/CAW



Figure 1: The joint embedding architecture for the non-contrastive SSL Framework. Each slice of the input dynamic graph contains edges arriving at the same continuous timestamp. B is a batch of intervals of size W . Ĝ is a batch of the corresponding input graphs of each interval.

graphs $\hat{G}_{i,j}$. We elaborate on the details of the encoder in Sec. 4.3. Prediction: The decoding head $d_\gamma$ for our self-supervised learning design consists of a node-level predictor $p_\phi$ that outputs the final representations $Z'$ and $Z''$, where $Z = p_\phi(H)$.

Figure 2: DyG2Vec Window Framework. Every slice of the dynamic graph G contains edges that arrived at the same continuous timestamp. The blue interval represents the history graph G_{i-W,i} that is encoded to make a prediction on the next K edges (yellow interval). B is a batch of intervals of W edges each. Ĝ is a batch of input graphs. Ḡ is a batch of target graphs that is only used in the downstream stage.

Figure 3: Transductive FLP Performance (Test AP for K = 200) vs Inference runtime (s) on 3 datasets. Inference time represents the time it takes to predict the whole test set.

Figure 5: Semi-Supervised Learning on Dynamic Node Classification. For each setting, DyG2Vec was trained on a varying random portion of the training data.

Figure 6: Ablation studies on 3 datasets for the FLP transductive task.

Transductive Future Link Prediction Performance in AP (Mean ± Std). Avg. Rank reports the mean rank of DyG2Vec across all datasets. Bold and underlined fonts represent the first- and second-best performance, respectively.

Transductive Dynamic Node Classification Performance in AUC (Mean ± Std) for K ∈ {1, 200, 2000}.

Linear probing AP results (Mean ± Std) on Transductive Future Link Prediction for K ∈ {1, 200, 2000}.

Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. Bringing your own view: Graph contrastive learning without prefabricated data augmentations. In Proc. Int. Conf. on Web Search and Data Mining, 2022.

Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys, 2019.

Dynamic Graph Datasets. % Repetitive Edges: percentage of edges which appear more than once in the dynamic graph. The Labels column specifies whether the dataset contains dynamic node labels or not.

Inductive Future Link Prediction Performance in AP (Mean ± Std) for K ∈ {1, 200, 2000}. Avg. Rank reports the mean rank of DyG2Vec across all datasets. Bold and underlined fonts represent the first- and second-best performance, respectively.

Transductive Setting. Ranking metrics, Recall@10 and Mean Reciprocal Rank (MRR), for future link prediction with K = 1. Avg. Rank reports the mean rank of DyG2Vec across all datasets.

Inductive Setting. Ranking metrics, Recall@10 and Mean Reciprocal Rank (MRR), for future link prediction with K = 1. Avg. Rank reports the mean rank of DyG2Vec across all datasets.

For SSL pre-training, the predictor $p_\phi$ is a simple 2-layer MLP that maps node embeddings $H$ to node representations $Z$.

Distortion Pipeline: Following static graph SSL methods such as Thakoor et al. (2022), we use edge dropout and edge feature dropout as distortions. Both distortions are applied with dropout probability $p_d = 0.3$, which we found to work best in a validation experiment exploring the values $p_d \in \{0.1, 0.15, 0.2, 0.3\}$. The edge feature dropout is applied to the temporal edge encodings introduced in Section 4.3, i.e., $z_p(t_p)$ and $c_p(t_p)$. More advanced temporal distortions, such as time distortions or edge shuffling, could be explored, but we leave those for future work. We use a constant learning rate of 0.0001 for all datasets and tasks. DyG2Vec is trained for 100 epochs for both downstream training and SSL pre-training. The model from the last epoch of pre-training is used for downstream training. For downstream evaluation, we pick the model with the best validation AP. Overall, we found that DyG2Vec converges within ∼50 epochs.
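The distortion pipeline can be sketched roughly as below (edge dropout and edge-feature dropout sharing the same p_d; helper names are ours, and a real implementation would operate on tensors):

```python
import random

def distort(edges, feats, p_d=0.3, seed=0):
    """One distorted view: drop each (edge, feature) pair with prob p_d,
    then zero out surviving edge-feature vectors with prob p_d."""
    rng = random.Random(seed)
    kept = [(e, f) for e, f in zip(edges, feats) if rng.random() >= p_d]
    out_edges, out_feats = [], []
    for e, f in kept:
        out_edges.append(e)
        out_feats.append([0.0] * len(f) if rng.random() < p_d else list(f))
    return out_edges, out_feats

def two_views(edges, feats, p_d=0.3):
    """Two independently distorted views of the same history graph,
    fed to the shared encoder during SSL pre-training."""
    return distort(edges, feats, p_d, seed=0), distort(edges, feats, p_d, seed=1)
```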

Optimal Window size W for downstream training.

Hyperparameters for DyG2Vec, RNN-based methods (JODIE, DyRep, and TGN), and TGAT.

Downstream freeze-setting test AP results (after pre-training) for K = 1. DDGCL pre-training and downstream training were run with the default parameters described in that work. DyG2Vec was run with the parameters described in Appendix A.5.

A.2.3 RUNTIME AND COMPUTATIONAL COMPLEXITY

Figure 7 shows the training time per method on 3 datasets. DyG2Vec is orders of magnitude faster than CaW and on par with memory-based methods in terms of speed. The main runtime overhead lies in how each of the baselines processes the input graph to predict a batch of K target edges. CaW samples M L-hop random walks for each target edge, followed by an expensive set-based anonymization scheme. To achieve good performance, CaW can require relatively long walks (e.g., L = 5 for Enron). In contrast, memory-based methods and TGAT sample a different L-hop subgraph for each target edge, while DyG2Vec samples a single L-hop subgraph within a constant window of size W for all target edges. Thus, assuming sparse operations in PyTorch Geometric (Fey & Lenssen, 2019) for message-passing, the encoding computational complexities are: DyG2Vec $O(LW)$; CaW $O(LMN_sK)$; and TGN and variants $O(KLN_s)$. Here, $N_s$ is the maximum number of sampled nodes in an L-hop subgraph and K is the number of target edges to predict. The main difference is the factor M, together with the fact that CaW's sampling is repeated for each of the K target edges. The factor $N_s$ comes from the complexity of message passing at each hop (assuming sparse operations). Note that DyG2Vec is limited to $O(W)$ nodes, so it does not carry this factor.
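As a back-of-the-envelope illustration of these complexity expressions (a hypothetical helper; constants and hidden dimensions are dropped, and the parameter values below are illustrative, not measured):

```python
def encoding_cost(method, L, W=None, M=None, Ns=None, K=None):
    """Rough per-window message-passing op counts, following the
    asymptotic expressions above (not wall-clock measurements)."""
    if method == "dyg2vec":
        return L * W            # one L-hop pass over a W-edge window
    if method == "caw":
        return L * M * Ns * K   # M walks per target edge, K targets
    if method == "tgn":
        return K * L * Ns       # one L-hop subgraph per target edge
    raise ValueError(f"unknown method: {method}")
```

Plugging in, say, L = 3, W = 65000 for DyG2Vec versus L = 3, M = 32, Ns = 64, K = 200 for CaW makes the extra M·K multiplier of the walk-based approach explicit.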

A.3 PRELIMINARY: VICREG

We outline the details of the VICReg (Bardes et al., 2022) method used in our SSL pre-training stage. Given $z' \in \mathbb{R}^d$ and $z'' \in \mathbb{R}^d$, the representations of two random views of an object (e.g., an image) generated through random distortions, the objective of non-contrastive SSL is two-fold. First, the output representation of one view should be maximally informative of the input representation of that view. Second, the representation of one view should be maximally predictable from the representation of the other view. These two aspects are formulated by VICReg (Bardes et al., 2022), where a combination of three loss terms (variance, covariance, and invariance) is minimized to learn useful representations while avoiding the well-known problem of collapse (Jing et al., 2022) in the mapping space. More concretely, let $Z$ be a batch of $n$ representations of dimension $d$.

Variance term: The variance regularization term $v$ is the mean over the representation dimensions of the hinge function applied to the standard deviation of the representations along the batch dimension: $v(Z) = \frac{1}{d}\sum_{j=1}^{d}\max(0, \gamma - S(Z_{:,j}, \epsilon))$. Here, $Z_{:,j}$ is column $j$ of the matrix $Z$.

