Lifelong Graph Learning

Abstract

Graph neural networks (GNNs) are powerful models for many graph-structured tasks. Existing models often assume that the complete structure of a graph is available during training. In practice, however, graph-structured data usually arrives in a streaming fashion, so learning a graph continuously is often necessary. In this paper, we aim to bridge GNNs to lifelong learning by converting a graph problem into a regular learning problem, so that GNNs can inherit the lifelong learning techniques developed for convolutional neural networks (CNNs). To this end, we propose a new graph topology based on feature cross-correlation, namely, the feature graph. It takes features as new nodes and turns nodes into independent graphs. This converts the original problem of node classification into graph classification, in which the newly arriving nodes become independent training samples. In the experiments, we demonstrate the efficiency and effectiveness of feature graph networks (FGN) by continuously learning a sequence of classical graph datasets. We also show that FGN achieves superior performance in two applications, i.e., lifelong human action recognition with wearable devices and feature matching. To the best of our knowledge, FGN is the first work to bridge graph learning to lifelong learning via a novel graph topology.

1. Introduction

Graph neural networks (GNNs) have received increasing attention and have proven useful for many tasks with graph-structured data, such as citation, social, and protein networks [52]. However, graph data is often formed in a streaming fashion and real-world datasets continuously evolve over time, so learning on a streaming graph is expected in many cases [46]. For example, in a social network, the number of users grows over time, and we expect the model to learn continuously as new users join. In this paper, we extend graph neural networks to lifelong learning, which is also known as continual or incremental learning [26]. Lifelong learning often suffers from "catastrophic forgetting" if the models are simply updated with new samples [35]. Although some strategies have been developed to alleviate the forgetting problem for convolutional neural networks (CNNs), they are difficult to apply to graph networks. This is because, in the lifelong learning setting, the graph size can increase over time and we have to drop old data or samples to learn new knowledge. However, existing graph models cannot directly overcome this difficulty. For example, graph convolutional networks (GCNs) require the entire graph for training [20]. SAINT [58] requires pre-processing of the entire dataset. Sampling strategies [7, 13, 58] easily forget old knowledge when learning new knowledge. Recall that regular CNNs are trained in a mini-batch manner, where the model takes samples as independent inputs [23]. Our question is: can we convert a graph task into a traditional CNN-like classification problem, so that (I) nodes can be predicted independently and (II) the lifelong learning techniques developed for CNNs can be easily adopted for GNNs? This is not straightforward, as node connections cannot be modeled by a regular CNN-like classification model.
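To make the full-graph dependence concrete, here is a minimal numpy sketch of the standard GCN propagation rule from [20]; note that one forward step consumes the adjacency matrix of the entire graph, which is why vanilla GCN cannot be trained on nodes arriving one at a time.

```python
import numpy as np

def gcn_layer(A, H, W):
    # One GCN propagation step: H' = relu(D^-1/2 (A + I) D^-1/2 H W).
    # A: (n, n) adjacency of the *entire* graph, H: (n, d) node features,
    # W: (d, d') trainable weights.
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

Because `A` covers all nodes at once, adding a new node changes the normalization of its neighbors as well, so the whole propagation must be recomputed.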
To solve this problem, we propose a new graph topology, the feature graph in Figure 1, to bridge GNNs to lifelong learning. It takes features as nodes and turns nodes into graphs. This converts node classification into graph classification, where node increments become independent training samples, enabling natural mini-batch training. The contributions of this paper include: (1) We introduce a novel graph topology, i.e., the feature graph, which converts the problem of a growing graph into an increasing number of training samples, making existing lifelong learning techniques developed for CNNs applicable to GNNs. (2) We take the cross-correlation of neighbor features as the feature adjacency matrix, which explicitly models feature "interaction" that is crucial for many graph-structured tasks. (3) The feature graph has constant computational complexity as the number of learning tasks increases. We demonstrate its efficiency and effectiveness by applying it to classical graph datasets. (4) We also demonstrate its superiority in two applications, i.e., distributed human action recognition based on subgraph classification and feature matching based on edge classification.
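The feature-graph construction above can be sketched as follows. This is an illustrative sketch only: it assumes the feature adjacency is built from the (symmetrized) sum of outer products between the center node's features and its neighbors' features; the exact cross-correlation used in the paper may differ, and the function name is hypothetical.

```python
import numpy as np

def feature_graph(x_a, neighbors):
    # x_a:       (d,) feature vector of node a; each feature becomes a node.
    # neighbors: (k, d) stacked feature vectors of N(a), including a itself.
    # Returns the d node attributes and a (d, d) feature adjacency.
    # Cross-correlate the center's features with its neighborhood:
    # entry (i, j) sums x_a[i] * x_n[j] over neighbors n.
    A = sum(np.outer(x_a, x_n) for x_n in neighbors)
    A = 0.5 * (A + A.T)  # symmetrize so edges are undirected
    return x_a, A
```

Each node of the original graph now yields its own small graph `(x_a, A)`, so a growing node set simply becomes a growing pool of independent training samples.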

2. Related Work

2.1. Lifelong Learning

Non-rehearsal Methods Lifelong learning methods in this category do not preserve any old data. To alleviate the forgetting problem, progressive neural networks [36] leveraged prior knowledge via lateral connections to previously learned features. Learning without forgetting (LwF) [24] introduced a knowledge distillation loss [15] to neural networks, which encouraged the network outputs for new classes to stay close to the original outputs. Distillation loss was also applied to learning object detectors incrementally [41]. Learning without memorizing (LwM) [10] extended LwF by adding an attention distillation term based on attention maps to retain information about the old classes. EWC [21] remembered old tasks by slowing down learning on important weights. RWalk [6] generalized EWC and improved weight consolidation by adding a KL-divergence-based regularization. Memory aware synapses (MAS) [1] computed an importance value for each parameter in an unsupervised manner, based on the sensitivity of the output function to parameter changes. [48] presented an embedding framework for dynamic attributed networks based on parameter regularization. A sparse writing protocol was introduced for memory in [43], ensuring that only a few memory slots are affected during training.

Rehearsal Methods Rehearsal lifelong learning methods can be roughly divided into rehearsal with synthetic data and rehearsal with exemplars from old data [33]. To ensure that the loss on exemplars does not increase, gradient episodic memory (GEM) [26] introduced orientation constraints during gradient updates. Inspired by GEM, [2] selected exemplars with maximal cosine similarity of the gradient orientation. iCaRL [32] preserved a subset of images with a herding algorithm [49] and included the subset when updating the network for new classes. EEIL [5] extended iCaRL by learning the classifier in an end-to-end manner. [51] further extended iCaRL by updating the model with class-balanced exemplars.
Similarly, [3, 16] further added constraints to the loss function to mitigate the effect of class imbalance. To reduce the memory consumption of exemplars, [18] applied the distillation loss in feature space without needing access to the corresponding images. Rehearsal approaches with synthetic data based on generative adversarial networks (GANs) were used to reduce the dependence on old data [14, 40, 50, 53].
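The weight-consolidation idea shared by EWC [21] and its variants can be made concrete with a short sketch. This assumes per-parameter importance values (e.g., diagonal Fisher estimates) have already been computed after the previous task; the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    # EWC-style regularizer: penalize movement of weights that were
    # important (high Fisher value) for previously learned tasks.
    # params, old_params, fisher: lists of matching numpy arrays.
    loss = 0.0
    for p, p_old, f in zip(params, old_params, fisher):
        loss += (f * (p - p_old) ** 2).sum()
    return 0.5 * lam * loss
```

During training on a new task, this penalty is added to the task loss, so weights with low importance remain free to adapt while important ones are anchored near their old values.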

2.2. Graph Neural Networks

Figure 1. (a) Regular graph G. (b) Feature graph G^F. We introduce the feature graph network (FGN) for lifelong graph learning. A feature graph takes the features as nodes and turns nodes into graphs, resulting in a graph predictor instead of a node predictor. This makes the lifelong learning techniques for CNNs applicable to GNNs, as the new nodes in a regular graph become individual training samples. Take the node a with label z_a in the regular graph G as an example: its features x_a = [1, 0, 0, 1] are nodes {a_1, a_2, a_3, a_4} in the feature graph G_a^F. The feature adjacency is established via feature cross-correlation between a and its neighbors N(a) = {a, b, c, d, e} to model feature "interaction."

Graph neural networks have been widely used to solve problems with graph-structured data [60]. The spectral network extended convolution to graph problems [4]. Graph convolutional network (GCN) [20] alleviated over-fitting on local neighborhoods via the Chebyshev expansion. To identify the importance of neighborhood features, graph attention network (GAT) [42] added an attention mechanism to GCN, further improving the performance on citation networks and the protein-protein interaction dataset. GCN and its variants require the entire graph during training, thus they cannot scale to large graphs. To solve this problem and train GNNs with mini-batches, a sampling method, SAGE [13], was introduced to learn a function that generates node embeddings by sampling and aggregating neighborhood features. JK-Net [54] followed the same sampling strategy and demonstrated a significant accuracy improvement over GCN with jumping connections. DiffPool [57] learned a differentiable soft cluster assignment to map nodes to a set of clusters, which then formed a coarsened input for the next layer. Ying et al. [56] designed a training strategy that relied on harder-and-harder training examples to improve the robustness and convergence speed of the model. FastGCN [7] applied importance sampling to reduce variance and performed node sampling for each layer independently, resulting in a constant sample size in all layers. [17] sampled the lower layer conditioned on the top one, ensuring higher accuracy and fixed-size sampling. Subgraph sampling techniques were also developed to reduce memory consumption. [8] sampled a block of nodes in a dense subgraph identified by a clustering algorithm and restricted the neighborhood search within the subgraph. SAINT [58] constructed mini-batches by sampling the training graph. Nevertheless, most of the sampling techniques still require pre-processing of the entire graph to determine the sampling process, or require a complete graph structure.
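The fixed-fan-out neighbor sampling idea behind SAGE [13] can be sketched as follows. The adjacency-list format, function name, and parameters are illustrative; the point is that each hop keeps at most a fixed number of neighbors, bounding the mini-batch size regardless of the full graph size.

```python
import random

def sample_neighborhood(adj, node, fan_out, depth, seed=0):
    # SAGE-style sampling (sketch): starting from `node`, expand for
    # `depth` hops, keeping at most `fan_out` neighbors per visited node.
    # adj: dict mapping node id -> list of neighbor ids.
    rng = random.Random(seed)
    frontier, sampled = [node], {node}
    for _ in range(depth):
        nxt = []
        for u in frontier:
            nbrs = adj.get(u, [])
            picked = nbrs if len(nbrs) <= fan_out else rng.sample(nbrs, fan_out)
            nxt.extend(picked)
        sampled.update(nxt)
        frontier = nxt
    return sampled
```

Even so, note that the sampler still needs the adjacency lists of the current graph, which is the pre-processing dependence criticized above.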

