DYG2VEC: REPRESENTATION LEARNING FOR DYNAMIC GRAPHS WITH SELF-SUPERVISION
Anonymous

Abstract

The challenge in learning from dynamic graphs for predictive tasks lies in extracting fine-grained temporal motifs from an ever-evolving graph. Moreover, task labels are often scarce, costly to obtain, and highly imbalanced for large dynamic graphs. Recent advances in self-supervised learning on graphs demonstrate great potential, but focus on static graphs. State-of-the-art (SoTA) models for dynamic graphs are not only incompatible with the self-supervised learning (SSL) paradigm but also fail to forecast interactions beyond the very near future. To address these limitations, we present DyG2Vec, an SSL-compatible, efficient model for representation learning on dynamic graphs. DyG2Vec uses a window-based mechanism to generate task-agnostic node embeddings that can be used to forecast future interactions. DyG2Vec significantly outperforms SoTA baselines on benchmark datasets for downstream tasks while requiring only a fraction of the training/inference time. We adapt two SSL evaluation mechanisms to make them applicable to dynamic graphs, and show that SSL pre-training helps learn more robust temporal node representations, especially in scenarios with few labels.

1. INTRODUCTION

Graph Neural Networks (GNNs) have recently found great success in representation learning for complex networks of interactions, such as those found in recommendation systems, transaction networks, and social media (Wu et al., 2020; Zhang et al., 2019; Qiu et al., 2018). However, most approaches ignore the dynamic nature of graphs encountered in many real-world domains. Dynamic graphs model complex, time-evolving interactions between entities (Kazemi et al., 2020; Skarding et al., 2021; Xue et al., 2022). Multiple works have revealed that real-world dynamic graphs possess fine-grained temporal patterns known as temporal motifs (Toivonen et al., 2007; Paranjape et al., 2017). For example, a simple pattern in social networks specifies that two users who share many friends are likely to interact in the future. A robust representation learning approach must be able to extract such temporal patterns from an ever-evolving dynamic graph in order to make accurate predictions.

Self-Supervised Representation Learning (SSL) has shown promise in achieving competitive performance on multiple predictive tasks across different data modalities (Liu et al., 2021). Given a large corpus of unlabelled data, SSL postulates that unsupervised pre-training is sufficient to learn robust representations that are predictive for downstream tasks with minimal fine-tuning. However, it is important to specify a pre-training objective that induces good performance on the downstream tasks. Contrastive SSL methods, despite their early success, rely heavily on negative samples, extensive data augmentation, and large batch sizes (Jing et al., 2022; Garrido et al., 2022). Non-contrastive methods address these shortcomings by incorporating information-theoretic principles through architectural innovations or regularization; these closely resemble strategies employed in manifold learning and spectral embedding methods (Balestriero & LeCun, 2022).
The success of such SSL methods on sequential data (Tong et al., 2022; Eldele et al., 2021; Patrick et al., 2021) suggests that one can learn rich temporal node embeddings from dynamic graphs without direct supervision. SSL methods are attractive for dynamic graphs because it is often costly to generate ground-truth labels. Contrastive approaches are very sensitive to the quality of the negative samples, and these are challenging to identify in dynamic graphs due to the temporal evolution of interactions and the lack of semantic labels at the contextual level. As a result, it is desirable to explore non-contrastive techniques, but state-of-the-art models for dynamic graphs suffer from shortcomings that make them hard to adapt to SSL paradigms. First, they rely heavily on chronological training or a full history of interactions to construct predictions (Kumar et al., 2019; Xu et al., 2020; Rossi et al., 2020; Wang et al., 2021b). Second, the encoding modules either use inefficient message-passing procedures (Xu et al., 2020), memory blocks (Kumar et al., 2019; Rossi et al., 2020), or expensive random walk-based algorithms (Wang et al., 2021b) that are designed for edge-level tasks only. As a result, while SSL pre-training has been applied successfully to static graphs (Thakoor et al., 2022; Hassani & Khasahmadi, 2020; You et al., 2022), there has been limited success in adapting SSL pre-training to dynamic graphs.

In this work, we propose DyG2Vec¹, a novel encoder-decoder model for continuous-time dynamic graphs that benefits from a window-based architecture which acts as a regularizer to avoid over-fitting. DyG2Vec is an efficient attention-based graph neural network that performs message-passing across structure and time to output task-agnostic node embeddings.
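To make the window-based mechanism concrete, the sketch below samples a fixed-size window of the most recent interactions preceding a query time from a chronologically sorted event stream. This is an illustrative NumPy sketch under our own assumptions (the function name `sample_window` and the toy event stream are hypothetical), not the authors' exact sampling procedure.

```python
import numpy as np

def sample_window(edges, t_query, window_size):
    """Return the `window_size` most recent interactions strictly before t_query.

    `edges` is an array of (src, dst, timestamp) rows sorted by timestamp.
    A fixed window bounds both computation and the history each embedding
    can depend on, which is the regularizing effect described in the text.
    """
    history = edges[edges[:, 2] < t_query]
    return history[-window_size:]

# Toy event stream: (source, destination, timestamp), sorted by time.
edges = np.array([
    [0, 1, 1.0],
    [1, 2, 2.0],
    [0, 2, 3.0],
    [2, 3, 4.0],
    [1, 3, 5.0],
])

window = sample_window(edges, t_query=4.5, window_size=3)
print(window.shape)  # (3, 3): the 3 most recent interactions before t=4.5
```

Because the window slides with the query time, embeddings for nearby predictions reuse largely overlapping histories and need not be recomputed from scratch for every edge.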
Experimental results on 7 benchmark datasets indicate that DyG2Vec outperforms SoTA baselines on future link prediction and dynamic node classification in terms of both performance and speed, particularly for medium- and long-range forecasting. The novelty of our model lies in its compatibility with SoTA SSL approaches. That is, we propose a joint-embedding architecture for DyG2Vec that can benefit from non-contrastive SSL. We adapt two evaluation protocols (linear and semi-supervised probing) to the dynamic graph setting and demonstrate that the proposed SSL pre-training is effective in the low-label regime.

2. RELATED WORK

Self-supervised representation learning: Multiple works explore learning visual representations without labels (see (Liu et al., 2021) for a survey). The more recent contrastive methods generate random views of images through data augmentations, and then force representations of positive pairs to be similar while pushing apart representations of negative pairs (Chen et al., 2020a; He et al., 2019). With the goal of attaining hard negative samples, such methods typically use large batch sizes (Chen et al., 2020a) or memory banks (He et al., 2019; Chen et al., 2020b). Non-contrastive methods such as BYOL (Grill et al., 2020) and VICReg (Bardes et al., 2022) eliminate the need for negative samples through various techniques that avoid representation collapse (Jing et al., 2022). Recently, several SSL methods have been adapted to pre-train GNNs (Xie et al., 2022). Deep Graph Infomax (DGI) (Velickovic et al., 2019) and InfoGCL (Xu et al., 2021) rely on mutual information maximization or information bottlenecking between patch-level and graph-level summaries. MVGRL (Hassani & Khasahmadi, 2020) incorporates multiple views, and BGRL (Thakoor et al., 2022) adapts BYOL to graphs to eliminate the need for negative samples, which are often memory-heavy in the graph setting. The experiments demonstrate the high degree of scalability of non-contrastive methods and their effectiveness in leveraging both labeled and unlabeled data.

Representation learning for dynamic graphs: Early works on representation learning for continuous-time dynamic graphs typically divide the graph into snapshots that are encoded by a static GNN and then processed by an RNN module (Sankar et al., 2020; Pareja et al., 2020; Kazemi et al., 2020). Such methods fail to learn fine-grained temporal patterns at smaller timescales within each snapshot. Therefore, several RNN-based methods were introduced that sequentially update node embeddings as new edges arrive.
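As a concrete example of the non-contrastive objectives mentioned above, the sketch below implements a VICReg-style loss (Bardes et al., 2022) in NumPy: an invariance term pulls two views of the same sample together, while variance and covariance regularizers prevent representation collapse without any negative samples. The weights and the `vicreg_loss` helper are illustrative defaults, not a faithful reproduction of any particular implementation.

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style non-contrastive objective on two (batch, dim) views."""
    n, d = z_a.shape
    # Invariance: embeddings of two views of the same sample should match.
    sim = np.mean((z_a - z_b) ** 2)

    # Variance: keep each dimension's std above 1 to avoid collapse to a point.
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    # Covariance: penalize off-diagonal covariance to decorrelate dimensions.
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return (off_diag ** 2).sum() / d

    return (sim_w * sim
            + var_w * (var_term(z_a) + var_term(z_b))
            + cov_w * (cov_term(z_a) + cov_term(z_b)))

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 4))
loss = vicreg_loss(z, z)  # identical views: only the regularizers contribute
```

Because all three terms are computed within a batch, no negative-pair bookkeeping or memory bank is required, which is what makes this family attractive for large dynamic graphs.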
JODIE (Kumar et al., 2019) employs two RNN modules to update the source and destination embeddings of an arriving edge, respectively. DyRep (Trivedi et al., 2019) adds a temporal attention layer to take multi-hop interactions into account when updating node embeddings. TGAT (Xu et al., 2020) introduces an attention-based message-passing (AMP) architecture to aggregate messages from a historical neighborhood. TGN (Rossi et al., 2020) alleviates the expensive neighborhood aggregation of TGAT by using an RNN memory module to encode the history of each node. CaW (Wang et al., 2021b) extracts temporal patterns through an expensive procedure that samples temporal random walks and encodes them with an LSTM; this procedure must be performed for every prediction. In contrast to prior works, our method operates on a fixed window of history to generate node embeddings. Additionally, we do not recompute embeddings for every prediction, which allows for efficient computation and memory usage.
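The attention-based aggregation used by methods such as TGAT can be sketched in miniature: time gaps are mapped to cosine features (a Bochner-style functional time encoding), fused with neighbor features, and used to weight a softmax average. The fixed frequencies, the additive fusion, and the `attend` helper are simplifying assumptions for illustration; real models learn the frequencies and use multi-head projections.

```python
import numpy as np

def time_encoding(delta_t, dim=8):
    """Cosine time encoding in the spirit of TGAT (Xu et al., 2020).
    Frequencies span several timescales; here they are fixed, not learned."""
    freqs = 1.0 / (10.0 ** np.linspace(0, 4, dim))
    return np.cos(np.outer(delta_t, freqs))

def attend(query, neighbor_feats, delta_t):
    """Toy temporal attention: fuse features with time encodings,
    score against the query, and softmax-average the neighbor features."""
    keys = neighbor_feats + time_encoding(delta_t, dim=neighbor_feats.shape[1])
    scores = keys @ query
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ neighbor_feats

rng = np.random.default_rng(1)
feats = rng.normal(size=(5, 8))  # features of 5 historical neighbors
out = attend(rng.normal(size=8), feats, delta_t=np.array([0.5, 1, 2, 4, 8]))
print(out.shape)  # (8,)
```

Performing this aggregation over a fixed window, rather than the full history or fresh random walks per prediction, is what keeps the per-prediction cost bounded.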



¹ We will open-source the code upon acceptance.

