ONLINE LEARNING OF GRAPH NEURAL NETWORKS: WHEN CAN DATA BE PERMANENTLY DELETED?

Abstract

Online learning of graph neural networks (GNNs) faces the challenges of distribution shift and ever-growing and changing training data as temporal graphs evolve over time. This makes it inefficient to train over the complete graph whenever new data arrives. Deleting old data at some point in time may be preferable to maintain good performance and to account for distribution shift. We systematically analyze these issues by incrementally training and evaluating GNNs in a sliding window over temporal graphs. We experiment with three representative GNN architectures and two scalable GNN techniques on three new datasets. In our experiments, the GNNs face the challenge that new vertices, edges, and even classes appear and disappear over time. Our results show that no more than 50% of the GNN's receptive field is necessary to retain at least 95% accuracy compared to training over the full graph. In most cases, i.e., in 14 out of 18 experiments, we even observe that a temporal window of size 1 is sufficient to retain at least 90%.

1. INTRODUCTION

Training of Graph Neural Networks (GNNs) on temporal graphs has attracted considerable attention. Recent works combine GNNs with recurrent modules (Seo et al., 2018; Manessi et al., 2020; Sankar et al., 2020; Pareja et al., 2020) or model vertex embeddings as a function of time to cope with continuous-time temporal graphs (da Xu et al., 2020; Rossi et al., 2020a). Concurrently, other approaches have been proposed to improve the scalability of GNNs. These include sampling-based techniques (Chiang et al., 2019; Zeng et al., 2020) and shifting the expensive neighborhood aggregation into pre-processing (Wu et al., 2019; Rossi et al., 2020b) or post-processing (Bojchevski et al., 2020). However, there are further fundamental issues with temporal graphs that have not yet been properly addressed. First, as new vertices and edges appear (and disappear) over time, so can new classes. This results in a distribution shift, which is particularly challenging in an online setting: there is no finite, a-priori known set of classes that can be used for training, and it is not known when a new class will appear. Second, scalable GNN techniques address the increased size of the graph, but always operate on the entire graph and thus on the entire temporal duration the graph spans. However, training on the entire history of a temporal graph (even in the context of scaling techniques like sampling (Chiang et al., 2019; Zeng et al., 2020)) may not actually be needed to perform tasks like vertex classification. Thus, it is important to investigate whether, at some point in time, one can "intentionally forget" old data and still retain the same predictive power for the given task. In fact, it has been observed in other tasks, such as stock-market prediction, that too much history can even be counterproductive (Ersan et al., 2020).
Proposed Solution and Research Questions While we do not suggest an entirely new GNN architecture, we propose to adapt existing GNN architectures and scalable GNN techniques to the problem of distribution shift in temporal graphs. In essence, we propose a new evaluation procedure for online learning based on the distribution of temporal differences, which assesses how vertices are connected in a temporal graph by enumerating the temporal differences of connected vertices along k-hop paths. This information is crucial for balancing between capturing the distribution shift and having sufficient vertices within the GNN's receptive field. In summary, the central question we aim to answer is whether we can intentionally forget old data without losing predictive power in an online learning scenario in the presence of distribution shift. We simulate this scenario by applying temporal windows of different sizes over the temporal graph, as illustrated in Figure 1. The window size c determines how much history of the temporal graph is used for training, or in other words, which information we forget. In this example, data older than t - 2 is ignored. We evaluate the accuracy of representative GNN architectures and scalable GNN techniques trained on the temporal window against training on the entire timeline of the graph (full history). We evaluate the models by classifying the vertices at time step t, before advancing to the next time step.

Figure 1: A temporal graph G_t where new vertices with potentially new classes appear over time. For example, class "c" emerged at t - 2 and was subsequently added to the class set C. Training is constrained to a temporal window to simulate intentional deletion of older data. The task is to label the new vertices marked with "?" at time step t, before advancing to the next time step.
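The sliding-window procedure described above can be sketched as a simple loop: train on the most recent c snapshots, evaluate on the vertices of the next time step, then advance the window. The sketch below is illustrative; `snapshots`, `train_fn`, and `eval_fn` are hypothetical placeholders for per-time-step graph data and for any concrete GNN training and evaluation routine.

```python
def sliding_window_evaluation(snapshots, window_size, train_fn, eval_fn):
    """Incrementally train on a temporal window of size `window_size`
    and evaluate on the following time step, advancing one step at a time.
    Data older than t - window_size is intentionally forgotten."""
    accuracies = []
    for t in range(window_size, len(snapshots)):
        window = snapshots[t - window_size:t]   # only the recent history
        model = train_fn(window)                # train on the window
        accuracies.append(eval_fn(model, snapshots[t]))  # classify vertices at t
    return accuracies
```

With `window_size = len(snapshots) - 1`, the same loop recovers the full-history baseline against which the windowed models are compared.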
To answer the research question, we break it down into four specific questions Q1 to Q4, each answered in a separate experiment.

New Datasets To enable an analysis with a controlled extent of distribution shift, we contribute three newly compiled temporal graph datasets based on scientific publications: two citation graphs based on DBLP and one co-authorship graph based on Web of Science. To determine candidate window sizes, we contribute a new measure that computes the distribution of temporal differences within the k-hop neighborhood of each vertex, where k corresponds to the number of GNN layers. We select the 25th, 50th, and 75th percentiles of this distribution as candidate window sizes. This results in window sizes of 1, 3, and 6 time steps for the two DBLP datasets, and 1, 4, and 8 for the Web of Science dataset.
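The temporal-difference measure can be sketched as a breadth-first traversal of each vertex's k-hop neighborhood, pooling the absolute time differences and reading off percentiles. The representation below (an adjacency dict `adj` and a per-vertex time map `time_of`) is an assumption for illustration, not the paper's implementation.

```python
import math

def temporal_difference_percentiles(adj, time_of, k, percentiles=(25, 50, 75)):
    """Collect |time(v) - time(w)| for every vertex w within the k-hop
    neighborhood of every vertex v, then return nearest-rank percentiles
    of the pooled distribution as candidate window sizes."""
    diffs = []
    for v in adj:
        seen = {v}
        frontier = [v]
        for _ in range(k):            # expand hop by hop, up to k hops
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in seen:
                        seen.add(w)
                        nxt.append(w)
                        diffs.append(abs(time_of[v] - time_of[w]))
            frontier = nxt
    diffs.sort()
    # nearest-rank percentile: the smallest value covering p% of the data
    return [diffs[min(len(diffs) - 1, math.ceil(p / 100 * len(diffs)) - 1)]
            for p in percentiles]
```

Setting k to the number of GNN layers ties the candidate window sizes directly to the model's receptive field, which is the balance the measure is designed to capture.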

Results

We select three representative GNN architectures: GraphSAGE-Mean (Hamilton et al., 2017), graph attention networks (Veličković et al., 2018), and jumping knowledge networks (Xu et al., 2018), along with graph-agnostic multi-layer perceptrons. As scalable GNN techniques, we consider GraphSAINT (Zeng et al., 2020) as well as simplified GCNs (Wu et al., 2019). The results of our experiments show that even with a small window size of 3 or 4 time steps, GNNs achieve at least 95% accuracy compared to using the full graph. With window sizes of 6 or 8, 99% accuracy can be retained. With a window size of 1, a relative accuracy of at least 90% is retained in almost all experiments, compared to models trained on the full graph. Furthermore, our experiments confirm that incremental training is necessary to account for distribution shift in temporal graphs, and we show that both reinitialization strategies are viable and differ only marginally when the learning rates are tuned accordingly. Surprisingly, simplified GCNs perform notably well on the most challenging dataset, DBLP-hard, and are outperformed only by GraphSAGE-Mean.



For Q1: Distribution Shift under Static vs Incremental Training, we verify that incremental training is necessary to account for distribution shift, compared to using a once-trained, static model. Extending Q1, we investigate in Q2: Training with Warm vs Cold Restarts whether it is preferable to reuse the model parameters from the previous time step (warm start) or to restart with newly initialized parameters at each time step (cold start). In Q3: Incremental Training on Different Window Sizes, we examine the influence of different window sizes, i.e., how far into the past we need to look such that a GNN trained on the window remains competitive with a model trained on the full graph. Q4: Incremental Training with Scalable GNN Methods extends Q3 by asking how scalable GNN approaches compare to using the full history of the temporal graph, and to which extent scaling techniques can be applied on top of the temporal window.
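The warm-vs-cold distinction from Q2 can be illustrated with a minimal training loop: a warm start carries the parameters forward between time steps, while a cold start reinitializes them before each training phase. `init_model` and `train_step` are hypothetical stand-ins for parameter initialization and one windowed training phase.

```python
def incremental_train(snapshots, window_size, init_model, train_step,
                      restart="warm"):
    """Sketch of incremental training with warm or cold restarts.
    'warm' reuses the previous time step's parameters; 'cold' draws
    fresh parameters before every training phase."""
    model = init_model()
    trained = []
    for t in range(window_size, len(snapshots)):
        if restart == "cold":
            model = init_model()    # discard previously learned parameters
        window = snapshots[t - window_size:t]
        model = train_step(model, window)
        trained.append(model)
    return trained
```

In practice, the two strategies typically call for different learning rates: a warm-started model starts close to a previous optimum, whereas a cold-started model must be trained from scratch on each window.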

