SUFFICIENT SUBGRAPH EMBEDDING MEMORY FOR CONTINUAL GRAPH REPRESENTATION LEARNING

Anonymous

Abstract

Memory replay, which constructs a buffer of representative samples and retrains the model on the buffer to maintain its performance on existing tasks, has shown great success in continual learning with Euclidean data. Directly applying it to graph data, however, can lead to the memory explosion problem because the explicit topological connections of the representative nodes must be considered. To this end, we present Parameter Decoupled Graph Neural Networks (PDGNNs) with Sufficient Subgraph Embedding Memory (SSEM), which fully utilize the explicit topological information for memory replay and reduce the memory space complexity from $O(nd^L)$ to $O(n)$, where $n$ is the memory buffer size, $d$ is the average node degree, and $L$ is the range of neighborhood aggregation. Specifically, PDGNNs decouple trainable parameters from the computation subgraphs via Sufficient Subgraph Embeddings (SSEs), which compress subgraphs into vectors to reduce memory consumption. In addition, we discover a pseudo-training effect in memory-based continual graph learning, which does not exist in continual learning on Euclidean data without topological connections (e.g., individual images). Based on this discovery, we develop a novel coverage maximization sampling strategy to enhance performance when the memory budget is tight. Thorough empirical studies demonstrate that PDGNNs with SSEM outperform state-of-the-art techniques in both class-incremental and task-incremental settings.

1. INTRODUCTION

Continual graph representation learning (Liu et al., 2021; Zhou & Cao, 2021; Zhang et al., 2021), which aims to accommodate new types of emerging nodes in a graph and their associated edges without interfering with the model performance over existing nodes, is an emerging area that has recently attracted increasing attention. It exhibits enormous value in various practical applications, especially when graphs are relatively large and retraining a new model over the entire graph is computationally infeasible. For instance, in a social network, a community detection model has to keep adapting its parameters based on nodes from newly emerged communities; in a citation network, a document classifier needs to continuously update its parameters to distinguish the documents of newly emerged research fields. Memory replay (Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019; Shin et al., 2017), which stores representative samples in a buffer for retraining the model to maintain its performance over existing tasks, exhibits great success in preventing catastrophic forgetting for various continual learning tasks, e.g., computer vision and reinforcement learning (Kirkpatrick et al., 2017; Li & Hoiem, 2017; Aljundi et al., 2018; Rusu et al., 2016). Directly applying memory replay to graph data with message passing based graph neural networks (GNNs) (Gilmer et al., 2017; Kipf & Welling, 2016; Veličković et al., 2017), however, could give rise to the memory explosion problem. Specifically, due to the message passing over the topological connections in graphs, retraining an $L$-layer GNN (Figure 1a) with $n$ buffered nodes would require storing $O(nd^L)$ nodes (Chiang et al., 2019; Chen et al., 2017) in the buffer, not counting the edges, where $d$ is the average node degree.
Take the Reddit dataset (Hamilton et al., 2017) as an example: its average node degree is 492, so the buffer size easily becomes intractable even with a 2-layer GNN. To overcome this issue, Experience Replay based GNN (ER-GNN) (Zhou & Cao, 2021) stores representative nodes in the buffer but completely ignores the topological information (Figure 1b). Feature graph network (FGN) (Wang et al., 2020a) implicitly encodes node proximity with the inner products between the features of the target node and its neighbors. However, the explicit topological connections are completely ignored and message passing is no longer feasible on the graph.

Figure 1: [...] (Zhou & Cao, 2021). (c) Our PDGNNs with SSEM. The incoming computation subgraphs are first embedded as SSEs and then fed into the trainable function. The SSEs are sampled and stored with the probability computed based on their coverage ratio, i.e., the ratio of nodes covered by their computation subgraphs (Section 3.6).

To this end, we propose Parameter Decoupled Graph Neural Networks (PDGNNs) with Sufficient Subgraph Embedding Memory (SSEM) for continual graph learning. Since the key challenge lies in the unbounded sizes of the computation subgraphs, we introduce the concept of the Sufficient Subgraph Embedding (SSE), which has a fixed size but contains all the information of a computation subgraph that is necessary for model optimization. Such SSEs can serve as surrogates of computation subgraphs in memory replay. Next, we find that it is infeasible to derive SSEs from MPNNs since their trainable parameters and individual nodes/edges are entangled. We therefore formulate the PDGNNs framework to decouple them and enable memory replay based only on buffered SSEs (without the computation subgraphs). Since the size of an SSE is fixed, the memory space complexity of a buffer with size $n$ can be dramatically reduced from $O(nd^L)$ to $O(n)$.
Moreover, different from traditional continual learning on data without topology (e.g., images), we discover that replaying an SSE incurs a pseudo-training effect on the neighbor nodes, which strengthens the prediction of the other nodes in the same computation subgraph. This effect is unique to continual graph learning and takes place due to the neighborhood aggregation in GNNs. We further show that in homophilous graphs (prevalent in real-world data), the pseudo-training effect makes the SSEs corresponding to larger computation subgraphs (quantitatively measured by coverage ratio) more beneficial to continual learning. Inspired by this, we develop a novel coverage maximization sampling strategy, which enlarges the coverage ratio of the selected SSEs and empirically enhances the performance without consuming additional memory. In experiments, we adopt both the class-incremental (class-IL) continual learning scenario (Rebuffi et al., 2017), which is rarely studied for node classification under the continual learning setting, and the task-incremental (task-IL) scenario (Liu et al., 2021; Zhou & Cao, 2021). Thorough empirical studies demonstrate that PDGNNs with SSEM outperform state-of-the-art continual graph representation learning techniques in both class-IL and task-IL settings. Our contributions are summarized below:

• We formulate the framework of PDGNNs-SSEM, which enables memory replay with topological information for continual graph representation learning and reduces the memory space complexity from $O(nd^L)$ to $O(n)$.
• PDGNNs-SSEM obtain superior performance, especially in the challenging class-IL scenario.
• We theoretically reveal a unique phenomenon in continual graph learning (i.e., the pseudo-training effect) when applying memory replay, and accordingly develop the coverage maximization sampling strategy to leverage this effect for improving the performance.

2. RELATED WORKS

Our proposed PDGNNs-SSEM is closely related to continual learning, continual graph learning, and decoupled graph neural networks.

2.1. CONTINUAL LEARNING & CONTINUAL GRAPH LEARNING

To alleviate the catastrophic forgetting problem encountered by machine learning models, i.e., drastic performance decrease on previous tasks after learning new tasks, existing approaches can be categorized into three types. Regularization based methods apply different constraints to prevent drastic modification of model parameters that are important for previous tasks (Farajtabar et al., 2020; Kirkpatrick et al., 2017; Li & Hoiem, 2017; Aljundi et al., 2018; Hayes & Kanan, 2020) . Parametric isolation methods adaptively allocate new parameters for the new tasks to protect the parameters for the previous tasks (Wortsman et al., 2020; Wu et al., 2019b; Yoon et al., 2020; 2017; Rusu et al., 2016) . Memory replay based methods alleviate forgetting by storing and replaying representative data examples from previous tasks when learning new tasks (Caccia et al., 2020; Chrysakis & Moens, 2020; Rebuffi et al., 2017; Lopez-Paz & Ranzato, 2017; Aljundi et al., 2019; Shin et al., 2017) . Recently, continual learning on graphs attracts increasingly more attention due to its practical importance (Zhou & Cao, 2021; Zhang et al., 2021; Liu et al., 2021; Wang et al., 2020b; Xu et al., 2020; Daruna et al., 2021) . Existing works include regularization methods like topology-aware weight preserving (TWP) (Liu et al., 2021) to preserve crucial parameters and topologies, parametric isolation approaches like HPNs (Zhang et al., 2021) that adaptively select different parameters for different tasks, and memory replay methods like ER-GNN (Zhou & Cao, 2021) that stores representative nodes. Our work is also based on memory replay and its key advantage lies in being capable of preserving complete topological information with reduced space complexity, which shows significant superiority in class-IL setting (Section 4.4). Note that we study the class-IL for node classification, which is essentially different from the class-IL for graph-level prediction (Carta et al., 2021) . 
Memory replay for graph-level tasks stores individual graphs and does not trigger the memory explosion problem (the same as traditional continual learning on Euclidean data). In this work, we focus on class-IL for node classification and aim to resolve the memory explosion problem.

Finally, it is worth highlighting the difference between continual graph learning and some relevant but different research areas. First, dynamic graph learning (Galke et al., 2020; Wang et al., 2020c; Han et al., 2020; Yu et al., 2018; Nguyen et al., 2018; Zhou et al., 2018; Ma et al., 2020; Feng et al., 2020) focuses on the temporal node dynamics with all previous data being accessible. In contrast, continual graph learning aims to alleviate forgetting in a setting where the previous data is inaccessible. Second, few-shot graph learning (Zhou et al., 2019; Guo et al., 2021; Yao et al., 2020; Tan et al., 2022) targets fast adaptation to new tasks. In training, few-shot learning models can access all previous tasks simultaneously (unavailable in continual learning). For evaluation, few-shot learning models need to be fine-tuned on the new test classes, while continual learning models are evaluated over existing tasks without fine-tuning.

2.2. DECOUPLED GRAPH NEURAL NETWORKS & RESERVOIR COMPUTING

Unlike the early works with interleaved neighborhood aggregation and node feature transformation (Kipf & Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Xu et al., 2018; Chen et al., 2018; Hamilton et al., 2017), recent works reveal that decoupling these two operations can reduce complexity and increase scalability, while maintaining equivalent or even achieving superior performance for GNNs (Zeng et al., 2021; Chen et al., 2020; 2019; Nt & Maehara, 2019; Frasca et al., 2020). For instance, Simple Graph Convolution (SGC) (Wu et al., 2019a) removes the non-linear activations from GCN and only keeps one neighborhood aggregation and one node transformation layer. Approximate Personalized Propagation of Neural Predictions (APPNP) (Klicpera et al., 2018) first performs node transformation and then conducts multiple neighborhood aggregations in one layer. Following these works, Dong et al. (2021) prove that the decoupling strategy of predicting then propagating is equivalent to training on the unlabelled nodes with pseudo labels aggregated from the labeled neighbors. To further explore decoupled GNNs, Chen et al. (2020) formulate the Graph-Augmented Multi-Layer Perceptrons (GA-MLPs) and theoretically analyze their expressive power. Instead of decoupling the structures, Zeng et al. (2021) propose SHADOW-GNN to decouple the depth and scope of GNNs by fixing the depth of the computation subgraph. Among these works, some can be viewed as instantiations of PDGNNs (Wu et al., 2019a; Zhu & Koniusz, 2020; Gallicchio & Micheli, 2020), while the others may not focus on decoupling the trainable parameters, so their space complexity is still $O(nd^L)$ when applying memory replay, e.g., APPNP (Klicpera et al., 2018) and Propagation then Training Adaptively (PTA) (Dong et al., 2021).
Besides works on decoupling GNNs, PDGNNs are also related to reservoir computing based GNNs (Gallicchio & Micheli, 2020; 2010) , which embed the graphs via a fixed, non-linear system followed by a trainable linear readout module. The reservoir computing modules can be adopted in PDGNNs as the SSE generation function (Equation 4), and the corresponding experimental results are in Appendix C.5.

3. PARAMETER DECOUPLED GNNS WITH SUFFICIENT SUBGRAPH EMBEDDING MEMORY

In this section, we first introduce the notations and then explain the technical challenge of applying memory replay techniques to GNNs. Targeting this challenge, we introduce PDGNNs with Sufficient Subgraph Embedding Memory (SSEM). Finally, inspired by theoretical findings on the pseudo-training effect, we develop the coverage maximization sampling strategy, which empirically improves the continual learning performance, especially when the memory budget is tight. All detailed proofs are provided in Appendix B.

3.1. PRELIMINARIES

In this paper, continual graph learning is formulated as learning node representations on a sequence of subgraphs (tasks): $\mathcal{S} = \{\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_T\}$. Each subgraph $\mathcal{G}_\tau$ contains several new emerging categories of nodes in the overall graph and is associated with a node set $\mathcal{V}_\tau$ and an edge set $\mathcal{E}_\tau$, which is represented as the adjacency matrix $A_\tau \in \mathbb{R}^{|\mathcal{V}_\tau| \times |\mathcal{V}_\tau|}$. Each entry of $A_\tau$ denotes an edge between a pair of nodes. The degree $d$ of a node refers to the number of edges connected to it. In practice, $A_\tau$ is often normalized as $\hat{A}_\tau = D_\tau^{-\frac{1}{2}} A_\tau D_\tau^{-\frac{1}{2}}$, where $D_\tau \in \mathbb{R}^{|\mathcal{V}_\tau| \times |\mathcal{V}_\tau|}$ is the degree matrix. Each node $v \in \mathcal{V}_\tau$ has a feature vector $x_v \in \mathbb{R}^b$. In classification tasks, each node $v$ has a label $y_v \in \{0, 1\}^C$, where $C$ is the total number of classes. When generating the representation for a target node $v$, GNNs typically take a subgraph within $\mathcal{G}_\tau$ as the input, which is denoted as the computation subgraph $\mathcal{G}^{sub}_{\tau,v}$. For simplicity, $\mathcal{G}^{sub}_v$ may be used in the following, without the graph index. We define the $L$-hop neighbors of a node $v$ as $\mathcal{N}^L(v)$, which contains all nodes within a distance of $L$ from $v$.
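The symmetric normalization above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a dense, undirected adjacency matrix; the function name `normalize_adjacency` and the choice to leave isolated nodes as zero rows are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2} for a dense, symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)                    # node degrees (diagonal of D)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0                      # assumption: isolated nodes keep zero rows
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    # Scale row v by deg(v)^{-1/2} and column w by deg(w)^{-1/2}.
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

# Toy graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
A_hat = normalize_adjacency(A)
```

Each entry of `A_hat` equals the original edge weight divided by the square root of the product of the two endpoint degrees, so the matrix stays symmetric.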

3.2. MEMORY REPLAY MEETS GNNS

In traditional continual learning, a model $f(\cdot; \theta)$ parameterized by $\theta$ is trained on a sequence of $T$ tasks. Each task $\tau$ ($\tau \in \{1, \dots, T\}$) corresponds to a dataset $\mathcal{D}_\tau = \{(x_i, y_i)\}_{i=1}^{n_\tau}$. To avoid forgetting, memory replay based methods store representative data from the old tasks in a buffer $\mathcal{B}$, which are replayed when learning new tasks. A common approach to utilize $\mathcal{B}$ is through an auxiliary loss:

$$\mathcal{L} = \underbrace{\sum_{x_i \in \mathcal{D}_\tau} l(f(x_i; \theta), y_i)}_{\mathcal{L}_\tau:\ \text{loss of the current task}} + \lambda \underbrace{\sum_{x_j \in \mathcal{B}} l(f(x_j; \theta), y_j)}_{\mathcal{L}_{aux}:\ \text{auxiliary loss}}, \quad (1)$$

where $l(\cdot, \cdot)$ denotes the loss function, and $\lambda \geq 0$ balances the contribution of the old data. The buffer $\mathcal{B}$ may also be used in other ways to prevent forgetting instead of directly minimizing $\mathcal{L}_{aux}$ (Lopez-Paz & Ranzato, 2017; Rebuffi et al., 2017). In these applications, the space complexity of a buffer containing $n$ examples is $O(n)$.

However, to capture the topological information, GNNs obtain the representation of a node $v$ based on a computation subgraph surrounding $v$. We exemplify this with the popular MPNN framework (Gilmer et al., 2017), which updates the hidden node representations at the $(l+1)$-th layer as:

$$m_v^{l+1} = \sum_{w \in \mathcal{N}^1(v)} M_l(h_v^l, h_w^l, x^e_{v,w}; \theta^M_l), \qquad h_v^{l+1} = U_l(h_v^l, m_v^{l+1}; \theta^U_l), \quad (2)$$

where $h_v^l$ and $h_w^l$ are hidden representations of nodes at layer $l$, $x^e_{v,w}$ is the edge feature, $M_l(\cdot, \cdot, \cdot; \theta^M_l)$ is the message function that integrates neighborhood information, and $U_l(\cdot, \cdot; \theta^U_l)$ is the update function that combines $h_v^l$ with the message $m_v^{l+1}$ to produce $h_v^{l+1}$. When $l = 0$, $h_v^0$ denotes the input node features. In an $L$-layer MPNN, the representation of a node $v$ can be simplified as

$$h_v^L = \mathrm{MPNN}(x_v, \mathcal{G}^{sub}_v; \Theta), \quad (3)$$

where $\mathcal{G}^{sub}_v$ is the computation subgraph containing the $L$-hop neighbors (i.e., $\mathcal{N}^L(v)$), and $\mathrm{MPNN}(\cdot, \cdot; \Theta)$ is the composition of all $M_l(\cdot, \cdot, \cdot; \theta^M_l)$ and $U_l(\cdot, \cdot; \theta^U_l)$ at different layers.
Since $\mathcal{N}^L(v)$ typically contains $O(d^L)$ nodes, replaying $n$ sampled nodes would require storing $O(nd^L)$ nodes (not counting the edges of $\mathcal{G}^{sub}_v$), where $d$ is the average node degree. Take the Reddit dataset (Hamilton et al., 2017) as a concrete example: its average degree is 492, so even with a 2-layer MPNN the buffer size easily becomes intractable. Therefore, directly storing the computation subgraphs for memory replay is infeasible for GNNs. Unsupervised learning models (Adhikari et al., 2018; Narayanan et al., 2016) also suffer from this problem: because the trainable parameters in the unsupervised learning part are also updated after learning each task, the original computation subgraphs are required for retraining the model.
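To make the scale of the problem concrete, the back-of-the-envelope estimate below plugs the Reddit average degree into the $O(nd^L)$ bound. It is only a rough sketch: the buffer size `n = 1000` is a hypothetical value chosen for illustration, and the count ignores overlap between neighborhoods as well as all edges.

```python
def raw_subgraph_nodes(n, d, L):
    """Rough n * d^L estimate of nodes stored when buffering raw computation subgraphs."""
    return n * d ** L

n = 1000      # hypothetical number of buffered target nodes
d = 492       # average node degree of Reddit, as quoted in the text
L = 2         # 2-layer MPNN

per_node = raw_subgraph_nodes(1, d, L)   # 492**2 neighbors per buffered node
total = raw_subgraph_nodes(n, d, L)      # compared with n entries for an O(n) buffer
print(per_node, total)
```

Under these assumptions a single buffered node already drags in roughly 242 thousand neighbors, which is why the buffer grows by a factor of about $d^L$ over a plain $O(n)$ buffer.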

3.3. PARAMETER DECOUPLED GNNS WITH SSEM

As discussed earlier, the key difficulty of applying memory replay to graph data is storing the computation subgraphs, whose sizes are potentially unbounded. Therefore, we would naturally like to preserve the necessary information (e.g., the topological information) of a computation subgraph in a vector of fixed length so that the memory consumption is manageable. Formally, the desired subgraph representation can be defined as a Sufficient Subgraph Embedding (SSE).

Definition 1 (Sufficient subgraph embedding). Given a model parameterized with $\theta$ and an input $\mathcal{G}^{sub}_v$, an embedding vector $e_v$ is a sufficient subgraph embedding for $\mathcal{G}^{sub}_v$ if optimizing $\theta$ with $\mathcal{G}^{sub}_v$ and optimizing $\theta$ with $e_v$ are equivalent.

Given the definition, we aim to derive SSEs from the computation subgraphs. As shown in Section 3.2, SSEs cannot be derived from MPNNs due to their interleaved neighborhood aggregation and feature transformations, i.e., whenever the trainable parameters get updated, recalculating the representation of $v$ requires all nodes and edges of $\mathcal{G}^{sub}_v$. To resolve this issue, we formulate the Parameter Decoupled Graph Neural Networks (PDGNNs) framework, which decouples the trainable parameters from the individual nodes/edges. PDGNNs may not be the only feasible framework to derive SSEs, but they are the first attempt in this direction and are empirically verified to be effective. Given a computation subgraph $\mathcal{G}^{sub}_v$, the prediction of node $v$ with PDGNNs consists of two steps. First, the topological information of $\mathcal{G}^{sub}_v$ is encoded into an embedding $e_v$ via the function $f_{topo}(\cdot)$ without trainable parameters (instantiations of $f_{topo}(\cdot)$ are detailed in Section 3.4):

$$e_v = f_{topo}(\mathcal{G}^{sub}_v). \quad (4)$$

Next, $e_v$ is passed into a trainable function $f_{out}(\cdot; \theta)$ parameterized by $\theta$ (instantiations of $f_{out}(\cdot; \theta)$ are detailed in Section 3.4) to get the output prediction $\hat{y}_v$:

$$\hat{y}_v = f_{out}(e_v; \theta). \quad (5)$$

With the formulations above, $e_v$ derived in Eq. (4) clearly satisfies the requirements of an SSE (Definition 1). Specifically, since the trainable parameters act on $e_v$ instead of directly on any individual node/edge, optimizing the model parameters $\theta$ with $e_v$ is equivalent to optimizing them with $\mathcal{G}^{sub}_v$. Since SSEs are equivalent to the computation subgraphs for optimizing PDGNNs, the memory buffer only needs to store SSEs, which reduces the space complexity from $O(nd^L)$ to $O(n)$. For convenience, we refer to $\mathcal{G}^{sub}_v$ as the computation subgraph of both $v$ and $e_v$. We name the buffer that stores the SSEs the Sufficient Subgraph Embedding Memory (SSEM). Given a new task $\tau$, the update of SSEM is:

$$SSEM = SSEM \cup \mathrm{sampler}(\{e_v \mid v \in \mathcal{V}_\tau\}, n), \quad (6)$$

where $\mathrm{sampler}(\cdot, \cdot)$ denotes the adopted sampling strategy to populate the buffer, $\cup$ denotes the set union, and $n$ is the budget size. As long as a memory buffer SSEM is maintained, our PDGNNs-SSEM perform well with different sampling strategies, including random sampling. In Section 3.6, based on the theoretical insights in Section 3.5, we propose a dedicated sampling strategy to better populate SSEM, which is empirically verified to be beneficial when the memory budget is tight. Equation (6) assumes a scenario where all data of the current task are presented concurrently. In practice, if the data of a task are presented in multiple batches (e.g., nodes come in batches on large graphs), the buffer update can be modified by adopting mechanisms to replace the existing data, as detailed in Appendix A. For task $\tau$ with graph $\mathcal{G}_\tau$, the loss with SSEM then becomes:

$$\mathcal{L} = \underbrace{\sum_{v \in \mathcal{V}_\tau} l(f_{out}(e_v; \theta), y_v)}_{\mathcal{L}_\tau:\ \text{loss of the current task } \tau} + \lambda \underbrace{\sum_{e_w \in SSEM} l(f_{out}(e_w; \theta), y_w)}_{\mathcal{L}_{aux}:\ \text{auxiliary loss}}, \quad (7)$$

where $e_v$ on the current task is calculated according to Equation (8).
Different from traditional continual learning works which choose λ manually, on graph data, we re-scale the losses according to the class sizes to counter the bias from the severe class imbalance, which cannot be handled on graphs by directly balancing the datasets (details are provided in Appendix C.2).
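The replay objective of Section 3.3 can be sketched as follows. This is an illustrative toy, not the authors' implementation: the single linear head standing in for $f_{out}$, the random matrix `Pi` standing in for a parameter-free $f_{topo}$, the uniform sampler, and the fixed `lam` (the paper re-scales losses by class size instead) are all assumptions of the sketch.

```python
import numpy as np

def softmax_xent(logits, labels):
    """Summed cross-entropy; labels are integer class ids."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].sum()

def f_out(E, W):
    """Hypothetical trainable head: one linear layer acting on embeddings only."""
    return E @ W

def replay_loss(E_cur, y_cur, E_buf, y_buf, W, lam=1.0):
    """Current-task loss plus lam times the auxiliary loss on buffered SSEs."""
    return (softmax_xent(f_out(E_cur, W), y_cur)
            + lam * softmax_xent(f_out(E_buf, W), y_buf))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # node features
Pi = rng.random(size=(6, 6))           # stand-in for a parameter-free aggregation
E = Pi @ X                             # one fixed-size embedding per node
y = np.array([0, 0, 1, 1, 2, 2])

# Old task: nodes 0-2. Keep a budget of 2 embeddings; the subgraphs are discarded.
buf_idx = rng.choice(3, size=2, replace=False)
E_buf, y_buf = E[buf_idx].copy(), y[buf_idx].copy()

# New task: nodes 3-5, trained together with the replayed embeddings.
W = np.zeros((4, 3))
loss = replay_loss(E[3:], y[3:], E_buf, y_buf, W, lam=1.0)
```

The point of the sketch is that `replay_loss` never touches `X` or `Pi` for the old task: only the stored rows of `E` are needed, so the buffer holds `n` vectors regardless of how large the original computation subgraphs were.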

3.4. INSTANTIATIONS OF PDGNNS

Although it has no trainable parameters, the function $f_{topo}(\cdot)$ for generating SSEs can be highly expressive under various formulations, including linear and non-linear ones, both of which are studied in this work. We mainly focus on the linear formulations, which are empirically comparable to the non-linear choices (Appendix C.3) but are much more efficient and more convenient for theoretical analysis (Sections 3.5 and 3.6). The linear instantiations of $f_{topo}(\cdot)$ can be generally formulated as

$$e_v = f_{topo}(\mathcal{G}^{sub}_v) = \sum_{w \in \mathcal{V}} x_w \cdot \pi(v, w; \hat{A}), \quad (8)$$

where $\pi(\cdot, \cdot; \hat{A})$ denotes the adopted strategy for computation subgraph construction based on the structure $\hat{A}$ (the normalized adjacency matrix defined in Section 3.1). Next, to instantiate $\pi(\cdot, \cdot; \hat{A})$, we first formulate the SSE generation for all nodes in $\mathcal{V}$ as a matrix multiplication:

$$E_\mathcal{V} = \Pi X_\mathcal{V},$$

where each entry $\Pi_{v,w} = \pi(v, w; \hat{A})$, $E_\mathcal{V} \in \mathbb{R}^{|\mathcal{V}| \times b}$ is the concatenation of all SSEs ($e_v \in \mathbb{R}^b$), and $X_\mathcal{V} \in \mathbb{R}^{|\mathcal{V}| \times b}$ is the concatenation of all node feature vectors $x_v \in \mathbb{R}^b$. The following three options are adopted as instantiations of $\Pi$ in our experiments:

1. SGC (Wu et al., 2019a): $\Pi = \hat{A}^L$
2. S$^2$GC (Zhu & Koniusz, 2020): $\Pi = \frac{1}{L} \sum_{l=1}^{L} \left( (1 - \alpha) \hat{A}^l + \alpha I \right)$
3. [...]

1. [...] Equation (8) is re-scaled by $\frac{\pi(v, w; \hat{A})}{\pi(w, w; \hat{A})}$. We term this property the pseudo-training effect on neighboring nodes, because it is equivalent to conducting the training on each neighboring node (in $\mathcal{G}^{sub}_v$) through the pseudo labels and the pseudo computation subgraphs.
2. When $f_{out}(\cdot; \theta)$ is linear, training PDGNNs on $e_v$ is also equivalent to training $f_{out}(\cdot; \theta)$ on pseudo-labeled nodes $(x_w, y_v)$ for each $w$ in $\mathcal{G}^{sub}_v$, where the contribution of $w$ in the loss is adaptively re-scaled with a weight $\frac{f_{out}(x_w; \theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w \in \mathcal{V}^{sub}_v} f_{out}(x_w \cdot \pi(v, w; \hat{A}); \theta)_k}$.

The pseudo-training effect essentially arises from the neighborhood aggregation operation.
Due to the prevalence of homophily (defined in Appendix B.2) in real-world graphs, neighborhood aggregation (i.e., message passing) is widely adopted in mainstream GNNs to enhance the performance by encouraging similar representations and predictions for neighbored nodes. Similarly, pseudo-training effect implies that replaying the SSE of a buffered node is encouraging a similar prediction for its neighbors (not buffered), which is also beneficial on homophilous graphs. In other words, the homophilous neighbors of a buffered node v do not need to be stored, but the forgetting problem on them can also be alleviated by replaying the SSE of v. Besides, when f out (•; θ) is linear, the re-scaling weight in Theorem 1.2 can adjust the pseudo-training on neighboring nodes according to their homophily to some extent. Specifically, larger f out (x w ; θ) k denotes a higher confidence to classify w into class k, and a higher π(v, w; Â) typically denotes a higher similarity between w and v. Therefore, the pseudo-training is stronger on the homophilous neighbors (with same labels) and weaker on the heterophilous neighbors (with different labels), according to the prediction confidence of f out (•; θ). Note that although homophily brings this extra benefit, it is not a prerequisite for our model to work. Despite the homophily, replaying e v still reinforces the prediction of node v itself, just like memory replay in traditional continual learning on independent data (e.g., images). However, real-world graphs often exhibit strong homophily, and pseudo-training effect is generally beneficial, which is also empirically justified (Section 4.3). The above analysis suggests that SSEs with larger computation graphs covering more nodes may be more effective. In the next subsection, we design the coverage maximization sampling strategy to leverage the benefit of the pseudo-training effect. 
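The linear case behind the pseudo-training discussion can be checked numerically. The sketch below only illustrates the underlying decomposition, namely that a bias-free linear $f_{out}$ applied to $e_v$ equals the $\pi$-weighted sum of its outputs on the individual nodes of the computation subgraph; it does not reproduce the exact re-scaling weight of the theorem. The random features, weights, and the absence of a bias term are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X_sub = rng.normal(size=(5, 3))   # features x_w of the 5 nodes in one computation subgraph
pi_v = rng.random(5)              # weights pi(v, w; A_hat) for a fixed target node v
W = rng.normal(size=(3, 4))       # bias-free linear f_out mapping features to 4 class scores

e_v = pi_v @ X_sub                # linear f_topo: sum over w of x_w * pi(v, w)

out_from_sse = e_v @ W                                      # f_out applied to the SSE
out_from_nodes = (pi_v[:, None] * (X_sub @ W)).sum(axis=0)  # weighted per-node outputs

print(np.allclose(out_from_sse, out_from_nodes))
```

Because the two outputs coincide, any loss gradient computed on the replayed SSE is a weighted combination of gradients that would be obtained from the neighbors themselves, which is the intuition for why replaying $e_v$ also reinforces predictions on nodes that were never buffered.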

3.6. COVERAGE MAXIMIZATION SAMPLING

Following the above subsection, to quantify the number of nodes covered by the selected SSEs versus the total number of nodes in the graph, we define the coverage ratio of the SSEs. In the following, since each SSE uniquely corresponds to a node, we may use 'node' and 'SSE' interchangeably.

Definition 2. Given a graph $\mathcal{G}$, node set $\mathcal{V}$, and function $\pi(\cdot, \cdot; \hat{A})$, the coverage ratio of a set of nodes $\mathcal{V}_s$ is:

$$R_c(\mathcal{V}_s) = \frac{\left| \bigcup_{v \in \mathcal{V}_s} \{ w \mid w \in \mathcal{G}^{sub}_v \} \right|}{|\mathcal{V}|},$$

i.e., the ratio of nodes of the entire (training) graph covered by the computation subgraphs of the selected nodes (SSEs).

To maximize $R_c(SSEM)$, a naive approach is to start from the SSE with the largest coverage ratio and iteratively incorporate the SSE that increases $R_c(SSEM)$ the most. However, this requires computing $R_c(SSEM)$ for all possible SSEs in each iteration, which is time consuming, especially on large graphs. Besides, certain randomness is also desired for the diversity of SSEM. Therefore, we propose to sample SSEs from a multinomial distribution based on the coverage ratio of each individual SSE. Specifically, in task $\tau$ with node set $\mathcal{V}_\tau$, the probability of sampling node $v \in \mathcal{V}_\tau$ is

$$p_v = \frac{R_c(\{v\})}{\sum_{w \in \mathcal{V}_\tau} R_c(\{w\})}.$$

Then the procedure is to sample from $\mathcal{V}_\tau$ according to $\{p_v \mid v \in \mathcal{V}_\tau\}$ without replacement, as shown in Algorithm 1. In experiments, we compare different sampling strategies to demonstrate the strong correlation between the coverage ratio and the performance, which also verifies the benefits revealed in Section 3.5.

[Algorithm 1: coverage maximization sampling procedure]
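The sampling procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than a reproduction of Algorithm 1: a node $w$ is treated as belonging to $\mathcal{G}^{sub}_v$ whenever the dense matrix entry `Pi[v, w]` is non-zero, and the names `coverage_ratio` and `coverage_max_sample` are hypothetical.

```python
import numpy as np

def coverage_ratio(Pi, nodes):
    """R_c: fraction of all nodes covered by the computation subgraphs of `nodes`.

    Assumption: w lies in the computation subgraph of v iff Pi[v, w] != 0.
    """
    covered = (Pi[list(nodes)] != 0).any(axis=0)
    return covered.sum() / Pi.shape[0]

def coverage_max_sample(Pi, candidates, budget, rng):
    """Sample up to `budget` candidates without replacement, with p_v proportional to R_c({v})."""
    candidates = np.asarray(candidates)
    r = np.array([coverage_ratio(Pi, [v]) for v in candidates])
    p = r / r.sum()
    k = min(budget, np.count_nonzero(p))   # cannot draw more items than have non-zero mass
    return rng.choice(candidates, size=k, replace=False, p=p)

# Toy star graph: node 0 is the hub, nodes 1-4 are leaves; Pi = A + I (1-hop with self-loops).
A = np.zeros((5, 5))
A[0, 1:] = 1
A[1:, 0] = 1
Pi = A + np.eye(5)

rng = np.random.default_rng(0)
picked = coverage_max_sample(Pi, range(5), budget=2, rng=rng)
```

In this toy graph the hub covers every node while each leaf covers only itself and the hub, so the hub receives the largest sampling probability; the draw is still random, which preserves the diversity that the text asks for.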

4. EXPERIMENTS

In this section, we aim to answer the following questions. Q1: Do PDGNNs-SSEM work well with a reasonable buffer size? Q2: Does coverage maximization sampling ensure a higher coverage ratio? Q3: Can our theoretical results be empirically justified? Q4: Does a higher coverage ratio lead to better performance? Q5: Can PDGNNs-SSEM outperform the state-of-the-art methods in both class-IL and task-IL scenarios? Due to space limitations, only the most prominent results are presented in the main content, and more details are available in the Appendix. For simplicity, PDGNNs-SSEM are denoted as PDGNNs in this section.

4.1. DATASETS

We adopt four public datasets, CoraFull, OGB-Arxiv, Reddit, and OGB-Products, with up to millions of nodes and 70 classes. Dataset statistics and task splittings (i.e., how we partition the node classes into different tasks) are summarized in Table 5. In the paper, we show the results under the splittings with the largest number of tasks. More details of the datasets, dataset splittings, and the results with other task splittings are provided in Appendix C.1, C.2, and C.4.

4.2. EXPERIMENTAL SETUP AND MODEL EVALUATION

Continual learning setting and model evaluation. During training, a model is trained on a task sequence with access only to the subgraphs of the current task. After that, the model is tested on all learned tasks. In the class-IL scenario, a model has to classify a given node by picking a class from all learned classes (which is more challenging), while the task-IL scenario only requires the model to distinguish the classes within each task. For model evaluation, the most thorough metric is the accuracy matrix $M^{acc} \in \mathbb{R}^{T \times T}$, where $M^{acc}_{i,j}$ denotes the accuracy on task $j$ after learning task $i$. The learning dynamics are shown with the curves of average accuracy (AA), $\left\{ \frac{\sum_{j=1}^{i} M^{acc}_{i,j}}{i} \,\middle|\, i = 1, \dots, T \right\}$, and average forgetting (AF), $\left\{ \frac{\sum_{j=1}^{i-1} \left( M^{acc}_{i,j} - M^{acc}_{j,j} \right)}{i-1} \,\middle|\, i = 2, \dots, T \right\}$, as the number of learned tasks varies. To use a single numeric value for evaluation, the AA and AF after learning all $T$ tasks are used. We repeat all experiments 5 times on one Nvidia Titan Xp GPU. All results are reported with average performance and standard deviations.

Baselines and model settings. Our baselines include methods designed for continual graph learning, namely Experience Replay based GNN (ER-GNN) (Zhou & Cao, 2021) and Topology-aware Weight Preserving (TWP) (Liu et al., 2021), and milestone works designed for Euclidean data but also applicable to GNNs, namely Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), Learning without Forgetting (LwF) (Li & Hoiem, 2017), Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017), and Memory Aware Synapses (MAS) (Aljundi et al., 2018). These baselines are implemented based on three popular backbone GNNs, i.e., Graph Convolutional Networks (GCNs) (Kipf & Welling, 2016), Graph Attention Networks (GATs) (Veličković et al., 2017), and Graph Isomorphism Network (GIN) (Xu et al., 2018).
Besides, joint training (without the forgetting problem) and fine-tuning (without any continual learning technique) are adopted as the upper and lower bounds for performance comparison. We instantiate $f_{out}(\cdot; \theta)$ as a multi-layer perceptron (MLP). To make a fair comparison, all methods, including $f_{out}(\cdot; \theta)$ of PDGNNs, are set as 2-layer with 256 hidden dimensions, and the neighborhood aggregation range of PDGNNs ($L$ in Section 3.3) is also set as 2 for consistency. As detailed in Section 4.3, $f_{topo}(\cdot)$ is chosen as the SGC strategy (Section 3.4), while the comparison among different choices is presented in Appendix C.4.

4.3. STUDIES ON THE BUFFER SIZE & PERFORMANCE VS. COVERAGE RATIO (Q1, 2, 3, 4)

In Table 2, based on PDGNNs, we compare the proposed coverage maximization sampling with uniform sampling and mean of feature (MoF) in terms of coverage ratios and performance when the buffer size (ratio of dataset) varies from 0.0002 to 0.4 on OGB-Arxiv. More complicated sampling methods (Hübler et al., 2008; Yang et al., 2016) can also be used. However, the sampling methods adopted in this work already obtain performance comparable to Joint and are highly efficient. The proposed coverage maximization sampling achieves a superior coverage ratio, especially when buffer sizes are relatively small. We also notice that the average accuracy for coverage maximization sampling is positively related to the coverage ratio in general, which verifies Theorem 1. Table 2 also shows the positive correlation between the buffer size and the performance. Besides, our SSEM appears to be highly efficient in terms of memory usage. No matter which sampling strategy is used, the performance can reach an average accuracy (AA) of ≈50 with only 5% of the data buffered. In Appendix C.5, we further evaluate how the performance changes when the buffer size varies with different variants of PDGNNs (i.e., the SSE generation strategies adopted from SGC, S$^2$GC, APPNP, and reservoir computing described in Section 3.3). The SGC strategy is more efficient than the other variants with comparable performance, and is therefore chosen in the following experiments.
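The AA and AF metrics defined in Section 4.2 can be computed from the accuracy matrix as sketched below. The 3-task matrix is made-up illustrative data, not a result from the paper, and the function names are assumptions of the sketch; tasks are 1-indexed in the arguments to match the notation in the text.

```python
import numpy as np

def average_accuracy(M, i):
    """AA after learning task i: mean of M[i, j] over the learned tasks j = 1..i."""
    return M[i - 1, :i].mean()

def average_forgetting(M, i):
    """AF after learning task i >= 2: mean of M[i, j] - M[j, j] over j = 1..i-1."""
    if i < 2:
        raise ValueError("average forgetting is defined for i >= 2")
    just_learned = np.diag(M)[: i - 1]       # accuracy on task j right after learning it
    return (M[i - 1, : i - 1] - just_learned).mean()

# Hypothetical accuracy matrix: row i holds accuracies after learning task i.
M = np.array([[0.9, 0.0, 0.0],
              [0.7, 0.8, 0.0],
              [0.6, 0.7, 0.9]])

aa_final = average_accuracy(M, 3)     # (0.6 + 0.7 + 0.9) / 3
af_final = average_forgetting(M, 3)   # ((0.6 - 0.9) + (0.7 - 0.8)) / 2
```

With this sign convention a more negative AF means more forgetting, so a method that retains old tasks well has an AF close to zero.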

4.4. RESULTS FOR CLASS-IL SCENARIO AND TASK-IL SCENARIO (Q5)

Class-IL Scenario. We compare PDGNNs with the baselines on 4 public datasets under the class-IL scenario. As shown in Table 3, PDGNNs significantly outperform the baselines and are even comparable to joint training (the performance upper bound) on 4 different datasets. The learning dynamics are also shown in Figure 3. Among the baselines, the techniques relying on regularization, as well as Fine-tune, exhibit severe forgetting problems. LwF performs slightly better than them since knowledge distillation is employed. ER-GNN outperforms LwF since it leverages memory replay to maintain performance over old tasks. For clarity, we omit the error bars on the CoraFull dataset. Full results with error bars are available in Appendix C.4. To further understand the dynamics of different methods under the class-IL scenario, we visualize the accuracy matrices of PDGNNs, ER-GNN, LwF, and Fine-tune in Figure 4. Each row of the matrix denotes the performance on all tasks after learning a new task, and each column denotes the performance change of a specific task when learning all tasks sequentially. Compared to the baselines, which exhibit more severe forgetting when learning new tasks, PDGNNs maintain relatively stable performance on each task even though new tasks are continuously learned. Besides, we also visualize the learned node representations after learning all tasks, which are shown in Appendix C.4.

Task-IL Scenario. In Table 4, we can observe that PDGNNs still outperform the baselines on all 4 datasets under the task-IL scenario, even though it is less challenging than the class-IL scenario, as discussed in Section 4.2. Due to space limitations, more detailed discussions about the results and the learning dynamics in the task-IL scenario are provided in Appendix C.4.

5. CONCLUSION

In this work, we propose the PDGNNs with SSEM for continual graph representation learning. Based on SSEs, we reduce the memory space complexity from O(nd L ) to O(n), which enables PDGNNs to fully utilize the explicit topological information sampled from previous tasks. We also discover and theoretically analyze the pseudo-training effect of SSEs. This inspires us to develop coverage maximization sampling which has been demonstrated to be highly efficient especially when the memory budget is tight. Finally, thorough empirical studies on both class-IL and task-IL continual learning scenarios demonstrate the effectiveness of PDGNNs-SSEM.

A ADDITIONAL DETAILS ON PARAMETER DECOUPLED GNNS WITH SSEM

As mentioned in Section 3.3 of the paper, in real-world applications, the data may come in batches instead of being presented simultaneously. Therefore, the updating of SSEM may need modification. The key issue is to determine how to update SSEM such that the newly sampled SSEs can be accommodated accordingly. We present two different approaches to handle this. 1. The most straightforward approach is to store the computation subgraph size s sub v of each e v and recalculate the multinomial distribution. Given the incoming new node set V τ , the probability of sampling each node is recalculated as  p v = s sub v

B THEORETICAL ANALYSIS

In this section, we give proofs and detailed analysis of the theoretical results in the paper.

B.1 PARAMETER DECOUPLED GNNS WITH SSEM

In Section 3.3, we mentioned that the embedding $e_v$ derived in PDGNNs is a sufficient subgraph embedding of $\mathcal{G}_v^{sub}$ with respect to the optimization of $\theta$. Although this is straightforward, we still provide a proof for completeness.

Proof. According to Definition 1, a sufficient condition for a vector $e_v$ to be a sufficient subgraph embedding of $\mathcal{G}_v^{sub}$ is that $e_v$ provides the same information as $\mathcal{G}_v^{sub}$ for optimizing the parameter $\theta$ of a model $f_{out}(\cdot;\theta)$. Therefore, it suffices to show that $\nabla_\theta L(e_v, \theta) = \nabla_\theta L(\mathcal{G}_v^{sub}, \theta)$, where $L$ is the adopted loss function. This is straightforward under the PDGNNs framework, since $\mathcal{G}_v^{sub}$ is first embedded in $e_v$ and only then participates in the computation with the trainable parameter $\theta$. Specifically, given an input computation subgraph $\mathcal{G}_v^{sub}$ with the label $y_v$, the corresponding prediction of PDGNNs is:

$$\hat{y}_v = f_{out}\Big(\sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A});\ \theta\Big),$$

the loss is:

$$L_v = l\Big(f_{out}\Big(\sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A});\ \theta\Big),\ y_v\Big),$$

and the gradient of the loss $L_v$ is:

$$\nabla_\theta L_v = \nabla_\theta\, l\Big(f_{out}\Big(\sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A});\ \theta\Big),\ y_v\Big).$$

When the input $\mathcal{G}_v^{sub}$ is replaced with $e_v$, the prediction becomes:

$$\hat{y}_v = f_{out}(e_v; \theta),$$

the corresponding loss becomes:

$$L'_v = l\big(f_{out}(e_v; \theta),\ y_v\big),$$

and the gradient of the loss becomes:

$$\nabla_\theta L'_v = \nabla_\theta\, l\big(f_{out}(e_v; \theta),\ y_v\big).$$

Since in PDGNNs $e_v$ is calculated as

$$e_v = \sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A}),$$

we have $\nabla_\theta L_v = \nabla_\theta L'_v$, i.e., optimizing the trainable parameters with $e_v$ is equivalent to optimizing the trainable parameters with $\mathcal{G}_v^{sub}$.
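To make the role of $e_v$ concrete, the sketch below computes SSEs for a toy graph. It assumes an SGC-style instantiation in which $\pi(v, w; \hat{A})$ is the $(v, w)$ entry of $\hat{A}^L$, with $\hat{A}$ the symmetrically normalized adjacency matrix with self-loops; this choice of $\pi$, the helper names, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import math


def normalized_adjacency(edges, n):
    """Dense D^{-1/2} (A + I) D^{-1/2} for an undirected graph on n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(p, q):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def sgc_sses(edges, features, hops):
    """Return (pi, sses) with pi = A_hat^hops and sses = pi @ features.
    Assumes hops >= 1."""
    a_hat = normalized_adjacency(edges, len(features))
    pi = a_hat
    for _ in range(hops - 1):
        pi = matmul(pi, a_hat)
    return pi, matmul(pi, features)


# Hypothetical path graph 0-1-2-3 with 2-dimensional node features.
edges = [(0, 1), (1, 2), (2, 3)]
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
pi, sses = sgc_sses(edges, features, hops=2)

# e_0 written out as the weighted sum over nodes; node 3 is three hops
# away from node 0, so its weight pi[0][3] is zero for hops=2.
e0_manual = [sum(pi[0][w] * features[w][d] for w in range(4)) for d in range(2)]
print(sses[0], e0_manual, pi[0][3])
```

Each stored SSE has the dimensionality of the node features regardless of how many nodes its computation subgraph contains, which is the property the buffer relies on.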

B.2 PSEUDO-TRAINING EFFECTS OF SSES

Theorem 1 (Pseudo-training). Given a node $v$, its computation subgraph $\mathcal{G}_v^{sub}$, the SSE $e_v$, and label $y_v$ (suppose $v$ belongs to class $k$, i.e., $y_{v,k} = 1$), training PDGNNs with $e_v$ has the following two properties:

1. It is equivalent to training PDGNNs with each node $w$ in $\mathcal{G}_v^{sub}$, with $\mathcal{G}_v^{sub}$ being a pseudo computation subgraph and $y_v$ being a pseudo label, where the contribution of $x_w$ (via Equation 4 in the paper) is re-scaled by $\frac{\pi(v, w; \hat{A})}{\pi(w, w; \hat{A})}$. We term this property the pseudo-training effect on neighboring nodes.

2. When $f_{out}(\cdot;\theta)$ is linear, training PDGNNs on $e_v$ is also equivalent to training $f_{out}(\cdot;\theta)$ on pseudo-labeled nodes $(x_w, y_v)$ for each $w$ in $\mathcal{G}_v^{sub}$, where the contribution of $w$ in the loss is adaptively re-scaled with a weight $\frac{f_{out}(x_w;\theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}(x_{w'} \cdot \pi(v, w'; \hat{A});\theta)_k}$.

Proof of Theorem 1.1. Although Theorem 1.1 is rather intuitive, we still provide a detailed proof for rigor. Given a node $v$, the prediction is:

$$\hat{y}_v = f_{out}(e_v; \theta).$$

Because

$$e_v = \sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A}),$$

where $\mathcal{V}_v^{sub}$ denotes the node set of the computation subgraph $\mathcal{G}_v^{sub}$ and $\hat{A}$ is the adjacency matrix of $\mathcal{G}_v^{sub}$, we have

$$\hat{y}_v = f_{out}\Big(\sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A});\ \theta\Big).$$

Given the target (ground-truth label) of node $v$ as $y_v$, the objective function of training the model with node $v$ is formulated as:

$$L_v = l\Big(f_{out}\Big(\sum_{w \in \mathcal{V}_v^{sub}} x_w \cdot \pi(v, w; \hat{A});\ \theta\Big),\ y_v\Big), \tag{20}$$

where $l$ could be any loss function that measures the distance between the prediction and the target.
Since $\mathcal{V}_v^{sub}$ contains both node $v$ and its neighbors, Equation 20 can be further expanded to separate the contribution of node $v$ from that of its neighbors:

$$L_v = l\Big(f_{out}\Big(\underbrace{x_v \cdot \pi(v, v; \hat{A})}_{\text{information from node } v} + \underbrace{\sum_{w \in \mathcal{V}_v^{sub} \setminus \{v\}} x_w \cdot \pi(v, w; \hat{A})}_{\text{neighborhood information}};\ \theta\Big),\ y_v\Big). \tag{21}$$

Given an arbitrary node $q \in \mathcal{V}_v^{sub}$ with $q \neq v$ (the adjacency matrix $\hat{A}$ stays the same), we can similarly obtain the loss of training the model with node $q$:

$$L_q = l\Big(f_{out}\Big(\underbrace{x_q \cdot \pi(q, q; \hat{A})}_{\text{information from node } q} + \underbrace{\sum_{w \in \mathcal{V}_q^{sub} \setminus \{q\}} x_w \cdot \pi(q, w; \hat{A})}_{\text{neighborhood information}};\ \theta\Big),\ y_q\Big). \tag{22}$$

Since $q \in \mathcal{V}_v^{sub} \setminus \{v\}$, we rewrite Equation 21 as:

$$L_v = l\Big(f_{out}\Big(\underbrace{x_q \cdot \pi(v, q; \hat{A})}_{\text{information from node } q} + \underbrace{\sum_{w \in \mathcal{V}_v^{sub} \setminus \{q\}} x_w \cdot \pi(v, w; \hat{A})}_{\text{neighborhood information}};\ \theta\Big),\ y_v\Big). \tag{23}$$

By comparing Equations 23 and 22, we can observe the similarity between the losses of nodes $v$ and $q$; the difference lies in the contribution (weight $\pi(\cdot, \cdot; \hat{A})$) of each node and in the neighboring nodes ($\mathcal{V}_q^{sub}$ versus $\mathcal{V}_v^{sub}$).

To explain the analysis in the paper that stronger homophily leads to more benefit from the pseudo-training effect, we give the formal definition of the graph homophily ratio. Given a graph $\mathcal{G}$, the homophily ratio is defined as the ratio of the number of edges connecting nodes with the same label to the total number of edges, i.e.,

$$h(\mathcal{G}) = \frac{1}{|\mathcal{E}|} \sum_{(j,k) \in \mathcal{E}} \mathbb{1}(y_j = y_k), \tag{24}$$

where $\mathcal{E}$ is the edge set containing all edges, $y_j$ is the label of node $j$, and $\mathbb{1}(\cdot)$ is the indicator function (Ma et al., 2021). For any graph, the homophily ratio is between 0 and 1. For each computation subgraph, when the homophily ratio is high, the neighboring nodes tend to share labels with the center node, and pseudo training is beneficial for the performance. Many real-world graphs, such as social networks and citation networks, tend to have high homophily ratios, so pseudo training brings considerable benefit, as shown in Section 4.3 of the paper.

Proof of Theorem 1.2.
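In this part, we choose the loss function $l$ as the cross entropy $\mathrm{CE}(\cdot, \cdot)$, which is the common choice for classification problems. In the following, we first derive the gradient of training PDGNNs with $(e_v, y_v)$. For cross entropy, we denote the label in one-hot vector form as $y_v$, whose $k$-th entry is one (i.e., $y_{v,k} = 1$) and whose other entries are zero. Given the loss of a node $v$ as shown in Equation 20, and using the linearity of $f_{out}$, the gradient is derived as:

$$\begin{aligned}
\nabla_\theta L_v &= \nabla_\theta\, \mathrm{CE}\Big(\sum_{w \in \mathcal{V}_v^{sub}} f_{out}\big(x_w \cdot \pi(v, w; \hat{A}); \theta\big),\ y_v\Big) && (25)\\
&= \nabla_\theta\, y_{v,k} \cdot \log \sum_{w \in \mathcal{V}_v^{sub}} f_{out}\big(x_w \cdot \pi(v, w; \hat{A}); \theta\big)_k && (26)\\
&= y_{v,k} \cdot \frac{\nabla_\theta \sum_{w \in \mathcal{V}_v^{sub}} f_{out}\big(x_w \cdot \pi(v, w; \hat{A}); \theta\big)_k}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}\big(x_{w'} \cdot \pi(v, w'; \hat{A}); \theta\big)_k} && (27)\\
&= y_{v,k} \cdot \frac{\sum_{w \in \mathcal{V}_v^{sub}} \nabla_\theta f_{out}\big(x_w \cdot \pi(v, w; \hat{A}); \theta\big)_k}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}\big(x_{w'} \cdot \pi(v, w'; \hat{A}); \theta\big)_k} && (28)\\
&= y_{v,k} \cdot \frac{\sum_{w \in \mathcal{V}_v^{sub}} \nabla_\theta f_{out}(x_w; \theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}\big(x_{w'} \cdot \pi(v, w'; \hat{A}); \theta\big)_k} && (29)\\
&= \sum_{w \in \mathcal{V}_v^{sub}} y_{v,k} \cdot \frac{\nabla_\theta f_{out}(x_w; \theta)_k}{f_{out}(x_w; \theta)_k} \cdot \frac{f_{out}(x_w; \theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}\big(x_{w'} \cdot \pi(v, w'; \hat{A}); \theta\big)_k} && (30)\\
&= \sum_{w \in \mathcal{V}_v^{sub}} \nabla_\theta\, \mathrm{CE}\big(f_{out}(x_w; \theta),\ y_v\big) \cdot \frac{f_{out}(x_w; \theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}\big(x_{w'} \cdot \pi(v, w'; \hat{A}); \theta\big)_k} && (31)\\
&= \sum_{w \in \mathcal{V}_v^{sub}} \frac{f_{out}(x_w; \theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}\big(x_{w'} \cdot \pi(v, w'; \hat{A}); \theta\big)_k} \cdot \nabla_\theta\, \mathrm{CE}\big(f_{out}(x_w; \theta),\ y_v\big). && (32)
\end{aligned}$$

The loss of training $f_{out}(x_w; \theta)$ with pairs of feature and pseudo-label $(x_w, y_v)$ of all nodes of $\mathcal{G}_v^{sub}$ is:

$$L_{\mathcal{G}_v^{sub}} = \sum_{w \in \mathcal{V}_v^{sub}} \mathrm{CE}\big(f_{out}(x_w; \theta),\ y_v\big). \tag{33}$$

Then, the corresponding gradient of $L_{\mathcal{G}_v^{sub}}$ is:

$$\nabla_\theta L_{\mathcal{G}_v^{sub}} = \sum_{w \in \mathcal{V}_v^{sub}} \nabla_\theta\, \mathrm{CE}\big(f_{out}(x_w; \theta),\ y_v\big).$$

In this subsection, we give further analysis of the pseudo-training effect when the SSE generation follows the formulation:

$$e_v = g(\{x_w \mid w \in \mathcal{V}\}, \hat{A}).$$

In this scenario, the pseudo-training effect depends on the specific form of $g(\cdot, \cdot)$. Despite this, we can still analyze the strength of the pseudo-training effect with respect to the smoothness of the function and the dataset properties. First of all, the pseudo-training effect exists because GNN models generate the prediction based on a local neighborhood.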
Therefore, the nodes with overlapping

In experiments, we also instantiated $g(\cdot, \cdot)$ with the reservoir computing module (Gallicchio & Micheli, 2020), which yields performance comparable to the other instantiations (Section C.5).
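As a numerical sanity check of the re-weighting identity in Theorem 1.2, the sketch below compares the gradient obtained from $e_v$ with the weighted sum of per-node gradients. It assumes a bias-free linear $f_{out}(x) = Wx$ with positive outputs used directly as class scores, and the standard cross entropy $-\log f_{out}(\cdot)_k$; the identity holds under either sign convention, since the sign appears on both sides. Only the gradient with respect to row $k$ of $W$ is checked (the gradient for the other rows is zero on both sides), and all values are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(p * q for p, q in zip(a, b))


def grad_ce_wrt_row_k(w_k, x):
    """Gradient of -log(w_k . x) with respect to w_k."""
    s = dot(w_k, x)
    return [-xi / s for xi in x]


# Hypothetical row k of W, node features, and aggregation weights pi(v, w).
w_k = [0.5, 1.0, 2.0]
xs = [[1.0, 2.0, 0.5], [0.3, 0.1, 0.9], [2.0, 1.0, 1.0]]
pis = [0.5, 0.3, 0.2]
dim = len(w_k)

# Left-hand side: gradient of the loss evaluated on the SSE e_v.
e_v = [sum(pi * x[d] for pi, x in zip(pis, xs)) for d in range(dim)]
lhs = grad_ce_wrt_row_k(w_k, e_v)

# Right-hand side: per-node gradients on (x_w, y_v), re-scaled by the
# adaptive weight f_out(x_w)_k * pi(v, w) / sum_w' f_out(x_w' * pi(v, w'))_k.
denom = sum(dot(w_k, [pi * xd for xd in x]) for pi, x in zip(pis, xs))
weights = [dot(w_k, x) * pi / denom for pi, x in zip(pis, xs)]
rhs = [sum(wt * grad_ce_wrt_row_k(w_k, x)[d] for wt, x in zip(weights, xs))
       for d in range(dim)]

print(lhs)
print(rhs)
```

The weights sum to one in this setting, so the re-scaling redistributes a single unit of gradient mass across the nodes of the computation subgraph.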

C ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide additional information on the datasets, experimental settings, and experimental results.

C.1 DATASET DESCRIPTIONS

The statistics of the datasets are summarized in Table 5 . Among these datasets, CoraFull and OGB-Arxiv are two citation graphs, Reddit is a graph constructed from Reddit posts, and OGB-Products is an Amazon product co-purchasing network. The usage of the datasets is granted for academic purposes, and full details on the licenses can be obtained from the official websites. The datasets contain no personally identifiable information or offensive content.

C.1.1 CITATION NETWORKS

CoraFull (McCallum et al., 2000) is a citation network labeled based on the paper topics. In total, it contains 19,793 nodes and 126,842 edges. The original dataset has 65,311 edges; we directly adopted the version in DGL with reverse edges added and duplicates removed. It contains 70 classes, and each node has an 8,710-dimensional feature vector.

The OGB-Arxiv dataset is collected in the Open Graph Benchmark (OGB). It is a directed citation network between all Computer Science (CS) arXiv papers indexed by MAG (Wang et al., 2020d). In total, it contains 169,343 nodes and 1,166,243 edges. The dataset contains 40 classes.

For each dataset, the tasks are constructed by dividing the classes into groups in the default order. Different group sizes are shown in Table 1 of the paper. For each task, the ratio for training, validation, and testing is 60%, 20%, 20%. The validation set was only used in baseline model selection, since the hyperparameters of our method are simply set to be consistent with the baselines (Section 4.2 in the paper).

Baselines and model settings. In this part, we give more details on the model configurations. The following setting applies to all datasets. All the backbone GNNs of the baselines are configured as 2-layer with 256 hidden dimensions, which exhibits better performance than other configurations. To ensure a fair comparison, we also set the MLP part of PDGNNs as 2-layer with 256 hidden dimensions (the SSE generation part does not contain trainable parameters), as shown in Table 6. The memory budget (number of nodes per class selected to store) is set to 400 for PDGNNs-SSEM on all datasets. For the memory based baselines, the budget was chosen with two criteria:
1. The buffer size should be larger than the size for PDGNNs-SSEM, to ensure that PDGNNs-SSEM does not succeed by storing more examples.
2. The budget should be large enough for the baseline methods to achieve reasonable performance.
The specific budgets on different datasets are listed in Table 7, which shows that PDGNNs-SSEM is highly efficient in using the buffered data and outperforms the memory based baselines with less memory usage. A brief introduction of the baseline continual learning techniques is given below:
1. Fine-tune directly trains a given backbone GNN on the task sequence without any technique to avoid forgetting, and can therefore be viewed as a lower bound on the continual learning performance.
2. Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) adds a quadratic penalty to prevent the model parameters that are important to previous tasks from shifting too much.
3. Memory Aware Synapses (MAS) (Aljundi et al., 2018) measures the importance of the parameters according to the sensitivity of the predictions to the parameters and slows down the update of the important parameters.
4. Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) stores representative data in an episodic memory and adds a constraint so that the loss on the episodic memory can decrease but not increase.
5. Topology-aware Weight Preserving (TWP) (Liu et al., 2021) adds a penalty on the model weights to preserve the topological information of previous graphs.
6. Learning without Forgetting (LwF) (Li & Hoiem, 2017) uses knowledge distillation to constrain the shift of parameters for old tasks.
7. Experience Replay GNN (ER-GNN) (Zhou & Cao, 2021) integrates memory replay into GNNs by storing experience nodes from previous tasks.
8. Joint Training does not follow the continual learning setting and trains the model on all tasks simultaneously. Therefore, Joint Training does not suffer from forgetting and its performance can be viewed as the upper bound for continual learning.

A widely adopted performance upper bound for continual learning models is joint training. Instead of being trained sequentially on a task sequence, a jointly trained model does not follow the continual learning setting but is trained on all tasks simultaneously. Therefore, jointly trained models do not suffer from the forgetting problem and can be viewed as an upper bound on the continual learning performance. Note that under the class-IL setting, the average accuracy of the jointly trained model will still decrease as the number of classes increases, because classification becomes harder as the number of classes grows.

Class imbalance in continual graph learning. According to Equation 37, the performance on different tasks contributes equally to the average accuracy. However, unlike traditional continual learning with balanced datasets, the class imbalance problem is usually severe in graphs, and its effect is entangled with the effect of forgetting. Directly balancing the data by choosing an equal number of nodes from each class may not be practical. For example, in the OGB-Products dataset, the largest class has 668,950 nodes, while the smallest contains only 1 node. Therefore, sampling an equal number of nodes from each class would result in either deleting many classes without enough nodes or sampling a very small number of nodes from each class so that all classes can provide the same number of nodes. Moreover, deleting nodes in a graph would also change the original topological structures of the remaining nodes, which is undesirable. To this end, we propose to re-scale the loss of nodes in each class according to the class sizes. Since the evaluation treats all classes equally and the loss on each class is balanced, λ is omitted in our implementation, as it would influence the balance of each class.

C.3 ADDITIONAL RESULTS OF STUDIES ON THE BUFFER SIZE

In this subsection, we show the performance of PDGNNs-SSEM with different buffer sizes on the other 3 datasets in Figures 5 and 6. We observe similar patterns in these results, i.e., the performance (both average accuracy and average forgetting) increases when the buffer size (in terms of the ratio of data) increases. Specifically, on the OGB-Products dataset, which is the largest dataset with millions of nodes, PDGNNs-SSEM can achieve reasonably good performance with a buffer size of only 0.01 of the size of the dataset, which further demonstrates the effectiveness and efficiency of PDGNNs-SSEM.

From Table 2 of the paper, we have the following findings: (1) our coverage maximization sampling does yield a superior coverage ratio compared to the other sampling strategies, especially when the buffer size is relatively small; (2) the performance does exhibit a strong correlation with the coverage ratio, especially when the buffer size is small. For a given buffer size, a higher coverage ratio tends to yield better performance. The performance gap between different sampling strategies is larger with smaller buffer sizes, which is also the situation in which the coverage ratio gap is larger. In this case (buffer size smaller than 1.0%), the number of stored SSEs is relatively small compared to the size of the dataset, so the effect of pseudo training on more nodes is more prominent. With larger buffer sizes, all sampling strategies can cover a large ratio of nodes and the performance gaps close. In real-world applications, a smaller buffer size is typically adopted, so the high memory efficiency of coverage maximization sampling would be preferred. The above analysis is consistent with Theorem 1 and indicates that a higher coverage ratio is beneficial to the performance.



https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv
https://ogb.stanford.edu/docs/nodeprop/#ogbn-products
https://ogb.stanford.edu/docs/nodeprop/ogbn-products
http://manikvarma.org/downloads/XC/XMLRepository.html



Figure 1: (a) Directly storing computation subgraphs for replay in a multi-layer MPNN. (b) The strategy to store single nodes proposed in ER-GNN (Zhou & Cao, 2021). (c) Our PDGNNs with SSEM. The incoming computation subgraphs are first embedded as SSEs and then fed into the trainable function. The SSEs are sampled and stored with the probability computed based on their coverage ratio, i.e., the ratio of nodes covered by their computation subgraphs (Section 3.6).

Figure 2: Illustration of the coverage ratio. Supposing the graph has $N$ nodes in total, $R_c(\{u\}) = \frac{13}{N}$, $R_c(\{v\}) = \frac{15}{N}$, $R_c(\{w\}) = \frac{14}{N}$, and $R_c(\{u, v, w\}) = \frac{42}{N}$.

In traditional continual learning on Euclidean data without topological connections, replaying an example $x_i$ (e.g., an image) only reinforces the prediction of $x_i$ itself. In this subsection, we introduce the pseudo-training effect of SSEs, which implies that training PDGNNs with $e_v$ of node $v$ also influences the predictions of the other nodes in $\mathcal{G}_v^{sub}$, based on which we develop a novel sampling strategy to further boost the continual learning performance on graphs.

Theorem 1 (Pseudo-training). Given a node $v$, its computation subgraph $\mathcal{G}_v^{sub}$, the SSE $e_v$, and label $y_v$ (suppose $v$ belongs to class $k$, i.e., $y_{v,k} = 1$), training PDGNNs with $e_v$ has the following two properties:

1. It is equivalent to training PDGNNs with each node $w$ in $\mathcal{G}_v^{sub}$, with $\mathcal{G}_v^{sub}$ being a pseudo computation subgraph and $y_v$ being a pseudo label, where the contribution of $x_w$ (via Equation 8) is re-scaled by $\frac{\pi(v, w; \hat{A})}{\pi(w, w; \hat{A})}$. We term this property the pseudo-training effect on neighboring nodes, because it is equivalent to conducting the training on each neighboring node (in $\mathcal{G}_v^{sub}$) through the pseudo labels and the pseudo computation subgraphs.
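The coverage ratio illustrated in Figure 2 can be sketched as follows, assuming that the computation subgraph of a node consists of all nodes within $L$ hops and that $R_c$ of a node set is the fraction of graph nodes lying in the union of the corresponding computation subgraphs. The function names and the toy graph are hypothetical.

```python
from collections import deque


def computation_subgraph_nodes(adj, v, hops):
    """Nodes within `hops` hops of v (including v), found by BFS."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == hops:
            continue  # do not expand beyond the aggregation range
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen


def coverage_ratio(adj, selected, hops):
    """Fraction of all nodes covered by the computation subgraphs of `selected`."""
    covered = set()
    for v in selected:
        covered |= computation_subgraph_nodes(adj, v, hops)
    return len(covered) / len(adj)


# Hypothetical path graph 0-1-2-3-4-5 stored as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(coverage_ratio(adj, [0], hops=2))     # node 0 covers {0, 1, 2}
print(coverage_ratio(adj, [0, 1], hops=2))  # overlapping subgraphs
print(coverage_ratio(adj, [0, 5], hops=2))  # disjoint subgraphs
```

In the toy graph, buffering nodes 0 and 5 covers every node while buffering nodes 0 and 1 does not, which is the kind of difference a coverage-aware sampling strategy is meant to exploit.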

Figure 3: Dynamics of average accuracy in class-IL scenario. From left to right: 1. OGB-Arxiv, 2 classes per task. 2. CoraFull, 2 classes per task. 3. Reddit, 2 classes per task. 4. OGB-Products, 2 classes per task.

Figure 4: From left to right: accuracy matrix of PDGNNs, ER-GNN, LwF, and Fine-tune on OGB-Arxiv dataset.

Table 4: Performance comparisons under task-IL on different datasets (↑ higher means better).

By comparing Equations 32 and 35, we can see that training PDGNNs with a sufficient subgraph embedding $e_v$ is equivalent to training the function $f_{out}(\cdot;\theta)$ on all nodes of the computation subgraph $\mathcal{G}_v^{sub}$, with a weight $\frac{f_{out}(x_w;\theta)_k \cdot \pi(v, w; \hat{A})}{\sum_{w' \in \mathcal{V}_v^{sub}} f_{out}(x_{w'} \cdot \pi(v, w'; \hat{A});\theta)_k}$ on each node to re-scale its contribution dynamically.

B.3 FURTHER DISCUSSION ON PSEUDO-TRAINING EFFECTS OF GENERALIZED SSE GENERATION FUNCTION

Denoting the set of the classes of our training data as $\mathcal{C}$, the number of examples of each class in $\mathcal{C}$ can be represented as $\{n_c \mid c \in \mathcal{C}\}$. Then, we calculate a scale for each class $c$ to balance their contribution in the loss function as

$$s_{y_v} = \frac{n_c}{\sum_{i \in \mathcal{C}} n_i},$$

where $y_{v,c} = 1$. Finally, our balanced loss is:

$$L = \sum_{v \in \mathcal{V}_\tau} l\big(f(e_v; \theta),\ y_v\big) \cdot s_{y_v} + \sum_{e_w \in SSEM} l\big(f(e_w; \theta),\ y_w\big) \cdot s_{y_w}.$$

Figure 5: Average accuracy (Red circles) and average forgetting (Black crosses) changes with buffer size on OGB-Products dataset (the left two) and Reddit dataset (the right two).

Figure 6: Average accuracy (Red circles) and average forgetting (Black crosses) changes with buffer size on CoraFull dataset.

Figure 7: Visualization of node representations of different classes on Reddit dataset. The node representations are taken after learning 1, 10, and 20 tasks. From the top to the bottom, we show the results of Fine-tune, ER-GNN, and PDGNNs-SSEM. Each color corresponds to a class.

Figure 8: Dynamics of average accuracy on CoraFull dataset with task sequence of length of 35 (left), 14 (middle), and 5 (right) in class-IL scenario.

Figure 9: Dynamics of average accuracy on OGB-Arxiv dataset with task sequence of length of 8 (left), 4 (middle), and 2 (right) in class-IL scenario.

The detailed statistics of datasets and task splittings

Algorithm 1 Coverage maximization sampling Input: Gτ , Vτ , Âτ , π(•, •; •), sample size n.
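The body of Algorithm 1 is not reproduced here. As a rough illustration only, the sketch below draws $n$ distinct nodes with probability proportional to the size of their computation subgraphs, so that nodes covering more of the graph are more likely to be buffered. The normalization $p_v \propto s_v^{sub}$, the sequential sampling without replacement, and all names are assumptions, not the authors' algorithm.

```python
import random


def coverage_max_sample(subgraph_sizes, n, rng):
    """Sample up to n distinct nodes, each draw weighted by the size of the
    node's computation subgraph (assumed sampling rule).

    subgraph_sizes: dict mapping node id -> number of nodes in its
    computation subgraph (at least 1, since the node covers itself).
    """
    candidates = dict(subgraph_sizes)
    chosen = []
    for _ in range(min(n, len(candidates))):
        nodes = list(candidates)
        weights = [candidates[v] for v in nodes]
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]  # sample without replacement
    return chosen


# Hypothetical computation subgraph sizes for five nodes.
sizes = {"a": 13, "b": 15, "c": 14, "d": 2, "e": 1}
rng = random.Random(0)
buffer_nodes = coverage_max_sample(sizes, n=3, rng=rng)
print(buffer_nodes)
```

A deterministic alternative would be to take the top-$n$ nodes by subgraph size; the randomized form is shown because the text describes the selection in terms of sampling probabilities.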

Performance & coverage ratios of different sampling strategies and buffer sizes on OGB-Arxiv dataset (↑ higher means better).

Performance comparisons under class-IL on different datasets (↑ higher means better).



For efficiency, we can also adopt a reservoir sampling based strategy to update existing SSEs in SSEM without recalculating the multinomial distribution. Specifically, given a new node set $\mathcal{V}_\tau$, we first sample $\min\{n, |\mathcal{V}_\tau|\}$ nodes (SSEs) $S$ from $\mathcal{V}_\tau$ with the coverage maximization sampling. Next, we arrange all SSEs in SSEM and $S$ in a sequence, i.e., the first $|SSEM|$ elements are from SSEM and the following elements are from $S$. Finally, for each SSE $e_v$ in $S$, supposing its position in the sequence is $o_v \in \{|SSEM| + 1, \dots, |SSEM| + |S|\}$, we draw a random integer $r$ uniformly from 1 to $|SSEM| + o_v$. If $r$ falls in the range from 1 to $|SSEM|$, then the $r$-th SSE in SSEM is replaced by $e_v$; otherwise $e_v$ is discarded. In this way, the SSEs in SSEM can be randomly updated with the newly sampled SSEs.
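The update rule described above can be sketched as follows. The range of the random draw follows the text (1 to $|SSEM| + o_v$); classical reservoir sampling would instead draw from 1 to $o_v$, so the upper bound here should be read as an interpretation of the description. The function name and the string placeholders standing in for SSE vectors are hypothetical.

```python
import random


def reservoir_update(ssem, new_sses, rng):
    """Update the buffer `ssem` in place with the newly sampled SSEs.

    ssem: list of stored SSEs; new_sses: list of SSEs sampled from the
    new task. The buffer size never changes.
    """
    m = len(ssem)
    for i, e_v in enumerate(new_sses, start=1):
        o_v = m + i                  # 1-based position in the joint sequence
        r = rng.randint(1, m + o_v)  # inclusive on both ends
        if r <= m:
            ssem[r - 1] = e_v        # replace the r-th stored SSE
        # otherwise e_v is discarded
    return ssem


# Placeholders stand in for embedding vectors.
old = ["old0", "old1", "old2", "old3"]
new = ["new0", "new1", "new2"]
rng = random.Random(0)
updated = reservoir_update(list(old), new, rng)
print(updated)
```

Because the draw range grows with the position $o_v$, later SSEs in $S$ are less likely to enter the buffer than earlier ones.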

The detailed statistics of datasets and task splittings

inputs to the model) share similar prediction results. If their labels are shared, then training on these nodes can be mutually reinforcing. Accordingly, given an arbitrary function $g(\cdot, \cdot)$, we can gain insight into the strength of the pseudo-training effect by analyzing the similarity of the inputs when generating representations of different nodes. Without loss of generality, we assume $g(\cdot, \cdot)$ is a continuous function (since $g(\cdot, \cdot)$ does not require training, it does not have to be differentiable). Then, given two nodes $v$ and $w$, we denote their corresponding inputs to the model as two vectors $I_v$ and $I_w$, which may contain different neighborhood information depending on the specific form of $g(\cdot, \cdot)$. By the continuity of $g(\cdot, \cdot)$, the closer $I_v$ and $I_w$ are, the closer $g(I_v, \hat{A})$ and $g(I_w, \hat{A})$ are. In other words, stronger homophily leads to a stronger pseudo-training effect, as analyzed in Theorem 1 in the paper.

Besides, the frequency components of $g(\cdot, \cdot)$ (in terms of the spectrum of the function, e.g., with Fourier analysis) also matter. If $g(\cdot, \cdot)$ is mainly composed of low frequencies, i.e., $g(\cdot, \cdot)$ changes slowly with respect to changes of the input, then the pseudo-training effect is stronger because more nodes obtain similar representations. But if $g(\cdot, \cdot)$ contains strong high-frequency components, i.e., $g(\cdot, \cdot)$ changes significantly with the input, then the pseudo-training effect is weaker, since only nodes with very similar inputs obtain similar outputs.
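The homophily ratio invoked in this argument, i.e., the fraction of edges whose two endpoints share a label, can be computed directly from an edge list. The sketch below uses a hypothetical function name and a toy graph.

```python
def homophily_ratio(edges, labels):
    """Fraction of edges whose endpoints carry the same label.

    edges: list of (j, k) node index pairs; labels: label per node index.
    """
    if not edges:
        raise ValueError("homophily ratio is undefined for a graph without edges")
    same = sum(1 for j, k in edges if labels[j] == labels[k])
    return same / len(edges)


# Hypothetical 4-cycle with two classes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = [0, 0, 1, 1]
h = homophily_ratio(edges, labels)
print(h)
```

A value close to 1 means that most neighbors share the center node's label, which is the regime in which pseudo labels are mostly correct.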

The configuration of the MLP part of PDGNNs.

The node labels are the community, or "subreddit", that the posts belong to. The authors sampled 50 large communities and built a post-to-post graph, connecting posts if the same user comments on both. In total, this dataset contains 232,965 nodes with an average degree of 492, 114,615,892 edges, and a 602-dimensional feature vector for each node. We directly used the version integrated in the DGL library.

C.1.3 PRODUCT CO-PURCHASING NETWORK

OGB-Products is collected in the Open Graph Benchmark, representing an Amazon product co-purchasing network. It contains 2,449,029 nodes and 61,859,140 edges. Nodes represent products sold on Amazon, and edges indicate that the connected products are purchased together. In our experiments, we select 46 classes and omit the last class containing only 1 example.

Memory budget of different methods on different datasets.


In this subsection, we provide additional results to compare PDGNNs-SSEM with the baselines. In Table 8, we provide numerical results to compare different models and complement the curves of average accuracy provided in the paper. We list both the final average accuracy and average forgetting of all models on the OGB-Arxiv dataset with different task splittings in the class-IL scenario. Besides, we also show the results of PDGNNs-SSEM with an extremely small buffer size (i.e., 0.001 of the size of the dataset), denoted as PDGNNs*. A ratio of 0.001 of the size of OGB-Arxiv corresponds to storing only 4 examples per class and a total of 160 for 40 different classes, which is orders of magnitude smaller than the buffer size of the memory based baselines, whose budgets are several hundred per class. From Table 8, we can observe that both PDGNNs and PDGNNs* significantly outperform the baselines. Even PDGNNs* outperforms the baselines by a large margin, which demonstrates the high efficiency of SSEM, especially considering that OGB-Arxiv contains 169,343 nodes.

Since the error bars of Figure 3 are omitted for clarity, the learning dynamics with error bars on datasets with different task splittings (in the class-IL scenario) are shown in Figures 8, 9, and 10. Note that the length of the task sequence is equivalent to the number of tasks to learn (as shown in Table 5) for each dataset.

Besides the class-IL scenario, we also provide additional results with complete error bars for the task-IL scenario in Figures 10 and 11.

To show the performance difference between PDGNNs-SSEM and the baselines more concretely, we visualize the node representations of different classes with t-SNE (Van der Maaten & Hinton, 2008) while learning on the task sequence (with a length of 20, i.e., 20 tasks) of the Reddit dataset. In Figure 7, besides PDGNNs-SSEM, we also show two representative baselines: ER-GNN, which is specially designed for continual graph learning, and Fine-tune, which uses no continual learning technique. According to Figure 7, PDGNNs-SSEM keeps the nodes from different classes well separated while continuously learning new tasks (each color corresponds to a class). In contrast, for ER-GNN and Fine-tune, the boundaries between different classes are less clear.

C.5 ADDITIONAL STUDIES ON THE BUFFER SIZE

In Figure 12, based on the class-IL scenario, we study the performance of PDGNNs-SSEM on the OGB-Arxiv dataset when the buffer size (i.e., the ratio of the dataset) varies from 0.0002 to 0.6. Figure 12 shows that the different SSE generation modules perform similarly. Besides, when the buffer size grows from 0.0002 to 0.01, both the average accuracy and average forgetting of PDGNNs increase. When the buffer size reaches 0.1, the performance of PDGNNs is comparable to the setting that stores the entire training set (when the ratio of the dataset is 0.6). These results demonstrate the efficiency of SSEM. Moreover, the results in Figure 12 also show that the performance difference among different SSE generation strategies is not significant.

D BROADER IMPACT

In this paper, we proposed a general technique that enables GNNs fitting into the PDGNNs framework to continually learn on expanding networks. The method can be applied to any scenario that requires generating node representations on networks. The results of this paper can help address existing challenges in continual graph representation learning and achieve state-of-the-art performance, thus positively impacting applications on social networks, recommender systems, dynamic systems, etc.

Potential negative social impact may arise depending on the application scenario. For example, privacy issues should be carefully considered when dealing with data containing user information.

