Lifelong Graph Learning

Abstract

Graph neural networks (GNNs) are powerful models for many graph-structured tasks. Existing models often assume that the complete graph structure is available during training. In practice, however, graph-structured data is usually formed in a streaming fashion, so learning a graph continuously is often necessary. In this paper, we aim to bridge GNNs to lifelong learning by converting a graph problem into a regular learning problem, so that GNNs can inherit the lifelong learning techniques developed for convolutional neural networks (CNNs). To this end, we propose a new graph topology based on feature cross-correlation, namely, the feature graph. It takes features as new nodes and turns nodes into independent graphs. This successfully converts the original problem of node classification to graph classification, in which the increasing nodes become independent training samples. In the experiments, we demonstrate the efficiency and effectiveness of feature graph networks (FGN) by continuously learning a sequence of classical graph datasets. We also show that FGN achieves superior performance in two applications, i.e., lifelong human action recognition with wearable devices and feature matching. To the best of our knowledge, FGN is the first work to bridge graph learning to lifelong learning via a novel graph topology.

1. Introduction

Graph neural networks (GNNs) have received increasing attention and proved useful for many tasks with graph-structured data, such as citation, social, and protein networks [52]. However, graph data is sometimes formed in a streaming fashion and real-world datasets continuously evolve over time, so learning on a streaming graph is required in many cases [46]. For example, in a social network, the number of users often grows over time and we expect the model to learn continuously as new users arrive. In this paper, we extend graph neural networks to lifelong learning, which is also known as continual or incremental learning [26].

Figure 1. (a) Regular graph G. We introduce the feature graph network (FGN) for lifelong graph learning. A feature graph takes the features as nodes and turns nodes into graphs, resulting in a graph predictor instead of a node predictor. This makes the lifelong learning techniques for CNNs applicable to GNNs, as the new nodes in a regular graph become individual training samples. Take the node a with label $z_a$ in the regular graph G as an example: its features $x_a = [1, 0, 0, 1]$ are the nodes $\{a_1, a_2, a_3, a_4\}$ in the feature graph $G^F_a$. The feature adjacency is established via the feature cross-correlation between a and its neighbors $\mathcal{N}(a) = \{a, b, c, d, e\}$ to model feature "interaction."

Lifelong learning often suffers from "catastrophic forgetting" if the models are simply updated with new samples [35]. Although some strategies have been developed to alleviate the forgetting problem for convolutional neural networks (CNNs), they remain difficult to apply to graph networks. This is because, in the lifelong learning setting, the graph size can increase over time and we have to drop old data or samples to learn new knowledge. However, existing graph models cannot directly overcome this difficulty. For example, graph convolutional networks (GCNs) require the entire graph for training [20], and SAINT [58] requires pre-processing of the entire dataset.
Sampling strategies [7, 13, 58] easily forget old knowledge when learning new knowledge. Recall that regular CNNs are trained in a mini-batch manner, where the model takes samples as independent inputs [23]. Our question is: can we convert a graph task into a traditional CNN-like classification problem, so that (I) nodes can be predicted independently and (II) the lifelong learning techniques developed for CNNs can be easily adopted for GNNs? This is not straightforward, as node connections cannot be modeled by a regular CNN-like classification model. To solve this problem, we propose to construct a new graph topology, the feature graph in Figure 1, to bridge GNNs to lifelong learning. It takes features as nodes and turns nodes into graphs. This converts node classification into graph classification, where node increments become independent training samples, enabling natural mini-batch training. The contributions of this paper include: (1) We introduce a novel graph topology, the feature graph, which converts the problem of a growing graph into an increasing number of training samples, making existing lifelong learning techniques developed for CNNs applicable to GNNs. (2) We take the cross-correlation of neighbor features as the feature adjacency matrix, which explicitly models feature "interaction" that is crucial for many graph-structured tasks. (3) The feature graph has constant computational complexity as the number of learning tasks increases. We demonstrate its efficiency and effectiveness on classical graph datasets. (4) We also demonstrate its superiority in two applications: distributed human action recognition based on subgraph classification, and feature matching based on edge classification.

2. Related Work

2.1. Lifelong Learning

Non-rehearsal Methods Lifelong learning methods in this category do not preserve any old data. To alleviate the forgetting problem, progressive neural networks [36] leveraged prior knowledge via lateral connections to previously learned features. Learning without forgetting (LwF) [24] introduced a knowledge distillation loss [15] to neural networks, which encourages the network outputs for new classes to stay close to the original outputs. A distillation loss was also applied to learning object detectors incrementally [41]. Learning without memorizing (LwM) [10] extended LwF by adding an attention distillation term based on attention maps to retain information about the old classes. EWC [21] remembered old tasks by slowing down learning on important weights. RWalk [6] generalized EWC and improved weight consolidation by adding a KL-divergence-based regularization. Memory aware synapses (MAS) [1] computed an importance value for each parameter in an unsupervised manner, based on the sensitivity of the output function to parameter changes. [48] presented an embedding framework for dynamic attributed networks based on parameter regularization. A sparse writing protocol was introduced in a memory module [43], ensuring that only a few memory slots are affected during training.

Rehearsal Methods Rehearsal lifelong learning methods can be roughly divided into rehearsal with synthetic data and rehearsal with exemplars from old data [33]. To ensure that the loss on exemplars does not increase, gradient episodic memory (GEM) [26] introduced orientation constraints during gradient updates. Inspired by GEM, [2] selected exemplars with maximal cosine similarity of the gradient orientation. iCaRL [32] preserved a subset of images with a herding algorithm [49] and included the subset when updating the network for new classes. EEIL [5] extended iCaRL by learning the classifier in an end-to-end manner. [51] further extended iCaRL by updating the model with class-balanced exemplars.
Similarly, [3, 16] further added constraints to the loss function to mitigate the effect of class imbalance. To reduce the memory consumption of exemplars, [18] applied the distillation loss in feature space without having access to the corresponding images. Rehearsal approaches with synthetic data based on generative adversarial networks (GANs) were used to reduce the dependence on old data [14, 40, 50, 53].

2.2. Graph Neural Networks

Graph neural networks have been widely used to solve problems with graph-structured data [60]. The spectral network extended convolution to graph problems [4]. The graph convolutional network (GCN) [20] alleviated over-fitting on local neighborhoods via the Chebyshev expansion. To identify the importance of neighborhood features, the graph attention network (GAT) [42] added an attention mechanism to GCN, further improving performance on citation networks and the protein-protein interaction dataset. GCN and its variants require the entire graph during training, so they cannot scale to large graphs. To solve this problem and train GNNs with mini-batches, a sampling method, SAGE [13], was introduced to learn a function that generates node embeddings by sampling and aggregating neighborhood features. JK-Net [54] followed the same sampling strategy and demonstrated a significant accuracy improvement over GCN with jumping connections. DiffPool [57] learned a differentiable soft cluster assignment to map nodes to a set of clusters, which then formed a coarsened input for the next layer. Ying et al. [56] designed a training strategy that relied on harder-and-harder training examples to improve the robustness and convergence speed of the model. FastGCN [7] applied importance sampling to reduce variance and performed node sampling for each layer independently, resulting in a constant sample size across layers. [17] sampled the lower layer conditioned on the top one, ensuring higher accuracy and a fixed-size sample. Subgraph sampling techniques were also developed to reduce memory consumption. [8] sampled a block of nodes in a dense subgraph identified by a clustering algorithm and restricted the neighborhood search to within the subgraph. SAINT [58] constructed mini-batches by sampling the training graph.
Nevertheless, most of the sampling techniques still require pre-processing of the entire graph to determine the sampling process, or require a complete graph structure, which makes those algorithms not directly applicable to lifelong learning. In this paper, we hypothesize that a different graph structure is required for lifelong learning, and that the new structure need not maintain its original meaning. We note that there is some recent work on continuously learning a graph problem, but with different formulations. For example, several exemplar selection methods are tested in ER-GNN [59]. A weight preserving method is introduced for growing graphs [25]. A combined strategy of regularization and data rehearsal is introduced for streaming graphs in [46]. To cope with the incomplete structure, [12] learns a temporal graph in a sliding window.

3. Problem Formulation

We start by defining regular graph learning before lifelong graph learning for completeness. An attribute graph is defined as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E} \subseteq \{\{v_a, v_b\} \mid (v_a, v_b) \in \mathcal{V}^2\}$ is the set of edges. Each node $v \in \mathcal{V}$ is associated with a target $z_v \in \mathcal{Z}$ and a multi-channel feature vector $x_v \in \mathcal{X} \subset \mathbb{R}^{F \times C}$, and each edge $e \in \mathcal{E}$ is associated with a vector $w_e \in \mathcal{W} \subset \mathbb{R}^{W}$. In regular graph learning, we learn a predictor $f$ to associate a node $x_v, v \in \mathcal{V}'$ with a target $z_v$, given the graph $G$, node features $\mathcal{X}$, edge vectors $\mathcal{W}$, and part of the targets $z_v, v \in \mathcal{V} \setminus \mathcal{V}'$. In lifelong graph learning, we have the same objective, but can only obtain the graph-structured data from a data continuum $G_L = \{(x_i, t_i, z_i, \mathcal{N}_{k=1:K}(x_i), \mathcal{W}_{k=1:K}(x_i))\}_{i=1:N}$, where each item is formed by a node feature $x_i \in \mathcal{X}$, a task descriptor $t_i \in \mathcal{T}$, a target vector $z_i \in \mathcal{Z}_{t_i}$, a $k$-hop neighbor set $\mathcal{N}_{k=1:K}(x_i)$, and an edge vector set $\mathcal{W}_{k=1:K}(x_i)$ associated with the $k$-hop neighbors. For simplicity, we will use the symbol $\mathcal{N}(x_i)$ to denote the available neighbor set and their edges. We assume that every item $(x_i, \mathcal{N}(x_i), t_i, z_i)$ satisfies $(x_i, \mathcal{N}(x_i), z_i) \sim P_{t_i}(\mathcal{X}, \mathcal{N}(\mathcal{X}), \mathcal{Z})$, where $P_{t_i}$ is the probability distribution of a single learning task. In lifelong graph learning, we observe, item by item, the continuum of graph-structured data

$(x_1, \mathcal{N}(x_1), t_1, z_1), \ldots, (x_N, \mathcal{N}(x_N), t_N, z_N). \quad (1)$

While observing (1), our goal is to learn a predictor $f_L$ that associates a test sample $(x, \mathcal{N}(x), t)$ with a target $z$ such that $(x, \mathcal{N}(x), z) \sim P_t$. Such a test sample can belong to a task observed in the past, the current task, or a task observed (or not) in the future. The task descriptors $t_i$ are defined for compatibility with lifelong learning settings that require them [26], but are not used in the experiments. Note that samples are not drawn locally identically and independently distributed (i.i.d.) from a fixed probability distribution, since we do not know the task boundaries. In the continuum, we only know the label of $x_i$, but have no information about the labels of its neighbors $\mathcal{N}(x_i)$. The items in (1) become unavailable once they are observed and dropped. This is in contrast to the settings in [46], where all historical data are available during training. As shown in the experiments, lifelong graph learning in practice often requires that the number of GNN layers $L$ be larger than the availability of $K$-hop neighbors, i.e. $L > K$, which also renders many existing graph models inapplicable.

4. Feature Graph Network

To better show the relationship with a regular graph, we first review GCN [20]. Given a graph $G = (\mathcal{V}, \mathcal{E})$ described in Section 3, the stacked node features can be written as $X = [x_1, x_2, \cdots, x_N]^T \in \mathbb{R}^{N \times FC}$, where $x_i \in \mathbb{R}^{FC}$ is a vectorized node feature. GCN takes the feature channel as $C = 1$, so that $x_i \in \mathbb{R}^{F}$ and $X \in \mathbb{R}^{N \times F}$. The $l$-th graph convolutional layer is defined as

$X^{(l+1)} = \sigma(\hat{A}\, X^{(l)}\, W), \quad (2)$

where $\sigma(\cdot)$ is an activation function, $W \in \mathbb{R}^{F^{(l)} \times F^{(l+1)}}$ is a learnable parameter, and $\hat{A} \in \mathbb{R}^{N \times N}$ is the normalized adjacency matrix of $A$ (refer to [20] for details). A graph convolutional layer does not change the number of nodes (rows of $X$) but changes the feature dimension from $F^{(l)}$ to $F^{(l+1)}$. GCN has been applied to many graph-structured tasks due to its simplicity and good generalization ability. However, the problems of GCN are also obvious. Besides the forgetting problem in lifelong learning, its node features in the next layer are a linear combination of those in the current layer, thus GCN and its variants cannot directly model feature "interaction". To this end, we introduce feature graphs by defining feature nodes and the feature adjacency matrix.
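For reference, the GCN layer (2) can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the paper's code: ReLU is our choice of activation, and the symmetric normalization with self-loops follows [20].

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, X, W):
    """One GCN layer, Eq. (2): X' = sigma(A_hat X W), with ReLU as sigma."""
    return np.maximum(A_norm @ X @ W, 0.0)
```

Note that the output rows are weighted averages of the input rows followed by a shared linear map, which is exactly the element-wise propagation criticized above.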

4.1. Feature Nodes

Recall that each node $v$ in a regular graph $G = (\mathcal{V}, \mathcal{E})$ is associated with a multi-channel feature vector $x = [x_{[1,:]}^T, \cdots, x_{[F,:]}^T] \in \mathbb{R}^{F \times C}$, where $x_{[i,:]}$ is the $i$-th feature (row) of $x$. An attribute feature graph takes the features of a regular graph as nodes. It is defined as $G^F = (\mathcal{V}^F, \mathcal{E}^F)$, where each node $v^F \in \mathcal{V}^F$ is associated with a feature $x_{[i,:]}^T$ and will be denoted as $x^F_i$. Intuitively, the number of nodes in the feature graph is the feature dimension of the regular graph, i.e. $|\mathcal{V}^F| = F$, and the feature dimension in the feature graph is the number of feature channels in the regular graph, i.e. $x^F_i \in \mathbb{R}^{C}$. Therefore, we define the feature nodes of a feature graph as $\mathcal{V}^F = \{x^F_1, x^F_2, \cdots, x^F_i, \cdots, x^F_F\}$. In this way, for each node $v \in \mathcal{V}$, we have a feature graph $G^F$. We next establish their relationship by defining the feature adjacency matrices via feature cross-correlation.

4.2. Feature Adjacency Matrix

For each item in the continuum (1), the edges between $x$ and its neighbors $\mathcal{N}(x)$ imply the existence of correlations between their features. We model the feature adjacency as the correlation over the $k$-hop neighborhood $\mathcal{N}_k(x)$, independently for each of the $c = 1, \ldots, C$ channels:

$A^F_{k,c}(x) \triangleq \operatorname{sgnroot}\left(\mathbb{E}_{y \sim \mathcal{N}_k(x)}\left[w_{x,y}\, x_{[:,c]}\, y_{[:,c]}^T\right]\right), \quad (4)$

where $w_{x,y} \in \mathbb{R}$ is the associated edge weight and $\operatorname{sgnroot}(x) = \operatorname{sign}(x)\sqrt{|x|}$ retains the magnitude of node features and the sign of their inner products. Note that $A^F_{k,c}$ preserves the connectivity information by only encoding information from connected nodes. For each sample $x$, this produces $C$ matrices of size $F \times F$, where $F \ll N$ under the lifelong learning setting. For undirected graphs, we replace $x_{[:,c]}\, y_{[:,c]}^T$ with $(x_{[:,c]}\, y_{[:,c]}^T + y_{[:,c]}\, x_{[:,c]}^T)$ for symmetry. In practice, the expectation in (4) is approximated by averaging over the observed neighborhood:

$\mathbb{E}\left[w_{x,y}\, x_{[:,c]}\, y_{[:,c]}^T\right] \approx \frac{\sum_{y \in \mathcal{N}_k(x)} w_{x,y}\, x_{[:,c]}\, y_{[:,c]}^T}{|\mathcal{N}_k(x)|}. \quad (5)$

In this way, the feature adjacency matrix is constructed dynamically and independently from neighborhood samples via (5), so that the continuum (1) is converted to graphs

$(G^F_1, t_1, z_1), \ldots, (G^F_i, t_i, z_i), \ldots, (G^F_N, t_N, z_N), \quad (6)$

where $G^F_i = (\mathcal{V}^F_i, A^F(x_i))$. This means that our objective of learning a node predictor becomes learning a graph predictor $f^F: \mathcal{G}^F \times \mathcal{T} \to \mathcal{Z}$ that predicts the target $z$ for a test sample $(G^F, t)$ so that $(G^F, t, z) \sim P^F_t$. Note that the feature graphs in the new continuum (6) are still non-i.i.d. In this way, a growing adjacency matrix is converted into multiple small adjacency matrices. Hence, a lifelong graph learning problem becomes a regular lifelong learning problem similar to [26], and the problem of increasing nodes can be solved by applying lifelong learning to the graph continuum (6).
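The construction of one channel of the feature adjacency, i.e. Eq. (4) approximated by the neighborhood average (5), can be sketched as follows. The function names are ours; only the math follows the definitions above.

```python
import numpy as np

def sgnroot(M):
    """sign(M) * sqrt(|M|): keeps the sign of the inner products while
    compressing the magnitude, as defined for Eq. (4)."""
    return np.sign(M) * np.sqrt(np.abs(M))

def feature_adjacency(x_c, neighbors_c, weights):
    """One channel of Eq. (4), approximated via (5).
    x_c: (F,) channel-c feature column of the center node;
    neighbors_c: list of (F,) columns y_{[:,c]} for y in N_k(x);
    weights: matching scalar edge weights w_{x,y}."""
    F = x_c.shape[0]
    acc = np.zeros((F, F))
    for y_c, w in zip(neighbors_c, weights):
        acc += w * np.outer(x_c, y_c)  # weighted cross-correlation term
    return sgnroot(acc / len(neighbors_c))
```

Since the matrix depends only on the node and its observed neighborhood, it can be built on the fly for each streamed item, which is what makes the continuum (6) possible.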
To make this clearer, we list their detailed relationship in Table 1, where the arrow → refers to the conversion from a regular graph to multiple feature graphs.

4.3. Feature Graph Layers

Since the feature graph is a new topology given by the feature adjacency matrix, we are able to define many different types of layers. In this section, we present three of them.

4.3.1. Feature Broadcast Layer

Inspired by GCN, the $l$-th feature broadcast layer is defined as

$x^{F(l+1)} = \sigma(\hat{A}^F_k\, x^{F(l)}\, W^F), \quad (7)$

where $\sigma(\cdot)$ is a non-linear activation function, $\hat{A}^F_k$ is the associated normalized feature adjacency matrix, and $W^F \in \mathbb{R}^{C^{(l)} \times C^{(l+1)}}$ is a learnable parameter. For simplicity, the channel index $c$ is left out in (7) and each channel is broadcast independently. It is worth noting that, although the definition of (7) appears similar to (2), they have different dimensions and represent different meanings.
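A minimal sketch of Eq. (7) follows. The softsign activation matches the choice reported in Appendix A; everything else (names, shapes of the toy inputs) is illustrative.

```python
import numpy as np

def softsign(v):
    """sigma(x) = x / (1 + |x|), the activation used in the experiments."""
    return v / (1.0 + np.abs(v))

def feature_broadcast_layer(A_f_norm, Xf, Wf):
    """Eq. (7): x^{F(l+1)} = sigma(A^F x^{F(l)} W^F).
    A_f_norm: (F, F) normalized feature adjacency;
    Xf: (F, C_in) feature-node matrix of one feature graph;
    Wf: (C_in, C_out) learnable weights.
    The number of feature nodes F is unchanged; only the channels change."""
    return softsign(A_f_norm @ Xf @ Wf)
```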

4.3.2. Feature Transform Layer

Similar to the graph convolutional layer (2), the feature broadcast layer does not change the number of feature nodes. However, this is not always necessary, as the objective has been turned into graph classification. Therefore, we define a feature transform layer, which can change the number of feature nodes and is helpful for further reducing the number of learnable parameters. Different from the feature broadcast layer, we need to re-calculate the feature adjacency matrices from the transformed neighbors. Given the feature graph $G^F$, the $l$-th feature transform layer is defined as

$A^{F(l)}(x) \triangleq A^F_k(x^{(l)}, y^{(l)}), \quad \forall y \in \mathcal{N}_k(x), \quad (8a)$

$x^{F(l+1)} = \sigma(W^F\, \hat{A}^{F(l)}(x)\, x^{F(l)}), \quad (8b)$

$y^{F(l+1)} = \sigma(W^F\, \hat{A}^{F(l)}(x)\, y^{F(l)}), \quad (8c)$

where $W^F \in \mathbb{R}^{F^{(l+1)} \times F^{(l)}}$ is a learnable parameter, $\sigma(\cdot)$ is a non-linear activation function, and $\hat{A}^{F(l)}(x) \in \mathbb{R}^{F^{(l)} \times F^{(l)}}$ is the normalized feature adjacency $A^{F(l)}(x)$. The node features can sometimes be over-smoothed due to graph propagation, hence we can replace $\hat{A}^{F(l)}(x)\, y^{F(l)}$ in (8c) with $[\hat{A}^{F(l)}(x)\, y^{F(l)}, y^{F(l)}]$, concatenating the input features to prevent over-smoothing.
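The transform step (8b) with the optional concatenation variant can be sketched as below. This is our reading of the equations, not the released code: we assume the concatenation is along the channel axis, and tanh stands in for the unspecified activation.

```python
import numpy as np

def feature_transform_layer(A_f_norm, Xf, Wf, concat_input=False):
    """Eq. (8b): x^{F(l+1)} = sigma(W^F A^F x^{F(l)}).
    Wf: (F_out, F_in) changes the number of feature nodes F_in -> F_out.
    concat_input=True appends the raw input along the channel axis, our
    reading of the anti-over-smoothing variant of Eq. (8c)."""
    h = A_f_norm @ Xf                        # (F_in, C) propagated features
    if concat_input:
        h = np.concatenate([h, Xf], axis=1)  # (F_in, 2C)
    return np.tanh(Wf @ h)                   # (F_out, C) or (F_out, 2C)
```

Left-multiplying by $W^F$ is what allows the layer to shrink a large feature dimension, e.g. the 1433-to-10 reduction on Cora mentioned in Section 4.4.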

4.3.3. Feature Attention Layer

In cases where a graph is fully connected or the edges have no weights, $w_{x,y}$ in (4) is not well defined. Prior methods often rely on an attention mechanism, e.g. GAT [42], to focus on important neighbors. Inspired by this, we define the edge weights via attention:

$w_{x,y} = \frac{\exp(e_{x,y})}{\sum_{z \in \mathcal{N}(x)} \exp(e_{x,z})}, \quad e_{x,y} = \operatorname{LeakyReLU}(a_x^T x + a_y^T y + b), \quad (9)$

where $a_x, a_y \in \mathbb{R}^{F}$ and $b \in \mathbb{R}$ are learnable attention parameters. We can also construct other types of layers based on the topology $G^F$, e.g. extending convolution to kervolution [44] to improve model expressivity, or combining the PageRank algorithm [22] to further reduce feature over-smoothing. In this paper, we mainly demonstrate the effectiveness of the three types of layers introduced above.
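Eq. (9) can be sketched as follows. The negative slope of the LeakyReLU is not specified in the text, so the value below is an assumption; the softmax is computed in the numerically stable form.

```python
import numpy as np

def attention_edge_weights(x, neighbors, a_x, a_y, b):
    """Eq. (9): attention-based edge weights for graphs without edge weights.
    x: (F,) center-node feature; neighbors: list of (F,) neighbor features;
    a_x, a_y: (F,) learnable attention vectors; b: learnable scalar bias.
    Returns one weight per neighbor, summing to 1 over the neighborhood."""
    leaky = lambda v: v if v > 0 else 0.01 * v  # LeakyReLU, slope assumed 0.01
    e = np.array([leaky(a_x @ x + a_y @ y + b) for y in neighbors])
    e = np.exp(e - e.max())                     # numerically stable softmax
    return e / e.sum()
```

These weights can then be plugged into Eq. (4) as $w_{x,y}$, which is exactly how the feature-matching network in Section 7 is built.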

4.4. Analysis and Computational Complexity

We next provide an intuitive explanation for feature graphs. A feature graph does not retain the physical meaning of its regular graph. Take a social network as an example: in a regular graph, each user is a node and user connections are edges; in a feature graph, each user is a graph, while the user features, including age, gender, etc., are nodes. Therefore, user behavior prediction becomes graph classification based on node information, and new users simply become additional training samples. Since the number of user features is more stable than the number of users, the size of a feature graph is more stable than that of its regular graph. This dramatically simplifies learning on growing graphs by reducing the problem to sample-incremental learning. On the other hand, a regular graph usually assumes that useful information about a node is encoded in its neighbors, so graph propagation can improve model performance. A feature graph makes the same assumption. However, a feature graph does not directly propagate the neighbor features, but encodes the neighbors into the feature adjacency matrices (4). Ignoring the non-linearity, existing methods propagate neighbor features as $\hat{A}X$, where $X = [x_1, \cdots, x_n]$, meaning they can only propagate information element-wise, as features in the next layer are weighted averages of the current features [20, 22, 42, 58]. In contrast, each feature graph propagates features via $\hat{A}^F x^F$, which explicitly models the "interaction" between features. In this sense, a feature graph does not lose edge information but encodes it into the feature adjacency. This explains its superiority over regular graph models in some cases of conventional graph learning, as shown in Section 5. Take a citation graph as an example: a keyword in an article may influence another keyword in the article citing it. However, element-wise graph propagation like (2) cannot explicitly model this relationship [20, 22, 42, 58].
Although the feature graph is also suitable for regular graph learning, we mainly focus our discussion on lifelong graph learning. Feature graphs have a low computational complexity. Concretely, the complexity of calculating the feature adjacency is $O(n F^{(l)2})$, where $n$ is the expected number of node neighbors in the continuum. Therefore, regardless of the number of layers, which is a constant for a specific model, the complexity of graph propagation for a feature broadcast layer and a feature transform layer is $O(F^{(l)2} C^{(l)} C^{(l+1)})$ and $O(F^{(l)} F^{(l+1)} C^{(l)2})$, respectively. If we take $E^{(l)} = F^{(l)} C^{(l)}$ as the number of feature elements, the complexity for each sample is roughly $O(E^2 + nF^2)$, where we leave out the layer index for simplicity. Note that these costs do not depend on the number of tasks, thus feature graphs have constant complexity as the number of learning tasks increases. FGN is applicable even when the number of features $F$ is very large. In this case, we often use a feature transform layer (8) as a feature extractor to project the raw features onto a lower dimension. For example, we reduced the feature dimension on Cora from $F^{(1)} = 1433$ to $F^{(2)} = 10$, a practice also adopted by GCN, GAT, APPNP, SAGE, etc.

5. Experiments

Implementation Details We perform comprehensive tests on popular graph datasets, including the citation graphs Cora, Citeseer, Pubmed [38], and ogbn-arXiv [29]. For each dataset, we construct two different continuums: data-incremental and class-incremental. In data-incremental tasks, all samples are streamed randomly, while in class-incremental tasks, all samples from one class are streamed before switching to the next class. In both tasks, each node can be presented to the model only once, i.e., a sample cannot be used to update the model parameters again once dropped. For this experiment, we implement a two-layer feature graph network in PyTorch [31] and adopt the SGD [19] optimizer. See further details in Appendix A. We choose the most popular graph models, including GCN [20], GAT [42], SAGE [13], and APPNP [22], as our baselines. Comparison with other methods such as SAINT [58] is omitted, as they require pre-processing of the entire dataset, which is incompatible with the lifelong learning setting. The overall test accuracy after learning all tasks is adopted as the evaluation metric [2].

Sequence Invariant Sampling We adopt the lifelong learning method described in [2] with a minor improvement for all the baseline models. As reported in Section 4.2 of [2], it tends to select fewer earlier items in the continuum, which discourages the memorization of earlier knowledge. We find that this is because a uniform selection probability is applied to all items. To compensate for this effect, we set a customized selection probability for different items (see Appendix B).

Performance Lifelong learning has a performance upper bound given by the regular learning setting, thus performance in regular learning is also an important indicator of the effectiveness of the feature graph. We present the overall performance of regular learning in Table 2 and denote this task as "R", where the same dataset settings are adopted for all models.
To demonstrate the necessity of graph models, we also report the performance of an MLP, which neglects the graph edges. It can be seen that the feature graph achieves a comparable overall accuracy on all datasets. This means that the feature graph may also be useful for regular graph learning. We next show that the feature graph in lifelong learning approaches this upper bound and dramatically alleviates "catastrophic forgetting". The overall averaged performance on the data-incremental tasks is reported as task "D" in Table 2, where all results are averaged over three runs. We use a memory size of 500 for Cora, Citeseer, and Pubmed, and a memory size of 512 for ogbn-arXiv. It is also worth noting that we obtain a low standard deviation, which indicates that the results are consistent across runs. ‡ Tasks "R", "D", and "C" denote regular, data-incremental, and class-incremental learning, respectively.
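The two streaming protocols described above can be sketched as follows. This is an illustrative data loader, not the paper's code; the function name and seeding are ours.

```python
import random

def make_continuum(labels, mode, seed=0):
    """Order sample indices into a stream for lifelong learning.
    mode="data":  data-incremental -- all samples streamed in random order.
    mode="class": class-incremental -- one class is exhausted before the
    next; the stable sort keeps the shuffled order within each class.
    Each index appears exactly once, matching the drop-once-observed rule."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    if mode == "class":
        idx.sort(key=lambda i: labels[i])  # group into contiguous class blocks
    return idx
```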

6. Distributed Human Action Recognition

Implementation Details To demonstrate the flexibility of FGN, we apply it to distributed human action recognition using wearable motion sensor networks, shown in Figure 2a (data from [55]). Five sensors, each consisting of a triaxial accelerometer and a biaxial gyroscope, are located at the left and right forearms, the waist, and the left and right ankles, respectively. Each sensor produces 5 data streams, so in total 5 × 5 = 25 data streams are available. The streams are recorded at 30 Hz and comprise human subjects aged 19 to 75 performing 13 daily action categories, including rest at standing (ReSt), rest at sitting (ReSi), rest at lying (ReLi), walk forward (WaFo), walk forward left-circle (WaLe), walk forward right-circle (WaRi), turn left (TuLe), turn right (TuRi), go upstairs (Up), go downstairs (Down), jog (Jog), jump (Jump), and push wheelchair (Push). We take every 25 sequential data points from a single sensor as a node and perform recognition for every 50 sequential points (1.67 s), as shown in Figure 2b, a setting also adopted by [45]. This results in a temporally growing graph: each node is associated with a multi-channel ($C = 5$) 1-D signal $x^i_t \in \mathbb{R}^{25 \times 5}$, where $i = 1, 2, \ldots, 5$ is the sensor index and $t = 1, 2, \ldots, T$ is the time index. We assume all nodes at adjacent time indices are connected, i.e., for all $t$, the nodes $(x^1_t, \ldots, x^5_t, x^1_{t+1}, \ldots, x^5_{t+1})$ form a fully connected subgraph. Therefore, the problem of human action recognition becomes a problem of subgraph (10-node) classification.

Performance We list the overall performance of regular and lifelong learning in Table 6, averaged over three runs. It can be seen that FGN achieves the best overall performance in regular learning and a much higher performance in lifelong learning compared to the state-of-the-art methods. This means that FGN has a lower forgetting rate, demonstrating its effectiveness.
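The windowed subgraph construction in the implementation details can be sketched as below. The array layout and function name are our assumptions; the sensor count, window length, and fully connected 10-node topology follow the text.

```python
import numpy as np

def action_subgraph(streams, t, window=25):
    """Build one 10-node fully connected subgraph from time indices t, t+1.
    streams: (5, T*window, 5) array -- 5 sensors, each a 5-channel signal.
    Returns node features (10, window, 5) and a (10, 10) adjacency
    without self-loops."""
    nodes = []
    for step in (t, t + 1):          # two adjacent time indices
        for s in range(5):           # one node per sensor per time index
            nodes.append(streams[s, step * window:(step + 1) * window, :])
    X = np.stack(nodes)              # (10, 25, 5)
    A = np.ones((10, 10)) - np.eye(10)  # fully connected subgraph
    return X, A
```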
We present the overall test accuracy during the lifelong learning process in Figure 2c. It can be seen that FGN has much higher and more stable performance than all the other methods. We also show the final per-class precision in Figure 2d, which indicates that FGN achieves a much higher performance in nearly all categories. Note that all models have a relatively low performance on ReSt, ReSi, and ReLi. This is because their signals are relatively flat and similar due to the static postures. This phenomenon was also observed by the template matching method KCC [45]. Note that we do not directly compare against KCC, since KCC mainly performs matching for a single subject, while we aim for prediction across different subjects. We argue that the superior performance of FGN arises because FGN is able to explicitly model the relationships between different features. Take the walking action as an example: it is well known that the movement of the left arm is always related to the movement of the right leg. Such information can be explicitly modeled by the cross-correlation in the feature adjacency matrix. Moreover, FGN is also able to model feature relationships at different time steps, e.g., the movement of the left arm at time t is also related to the movement of the right arm at time t + 1. As aforementioned, FGN can explicitly learn such relationships with k = 2 in (4), while the other methods cannot directly do this. Besides, FGN also achieves the lowest forgetting rate over all classes in Table 5, where the forgetting rate is the difference between the final and best performance during the entire learning process.

7. Image Feature Matching

Implementation Details We next extend FGN to a more challenging application, image feature matching, which is crucial for many 3-D computer vision tasks including simultaneous localization and mapping (SLAM). As shown in Figure 4, the interest points and their descriptors form an infinitely growing temporal graph, in which the feature points are nodes and their descriptors are the node features as defined in Section 3. In this way, the problem of feature matching becomes edge prediction on a temporally growing graph. We next show that the performance of SuperGlue [37], which used a regular graph attention model for feature matching, can be improved simply by changing the graph-attention matcher to our proposed FGN. For simplicity, we adopt a framework similar to that of SuperGlue but remove the cross-attention layers in the matching network. In image feature matching, we normally have a layered temporal graph where edge weights are undefined. Therefore, we construct the FGN by concatenating two feature broadcast layers (7) with the attention edge weights $w_{x,y}$ defined in (9).

Performance We categorize the 80 test sequences from the TartanAir dataset [47] into groups based on their characteristics. Figure 4 shows several examples of consecutive matching in Indoors, Outdoors&Artificial, and Outdoors&Mixed environments. We report the mean matching error in Table 7, compared with the GAT-based matcher. It can be observed that our method outperforms GAT by a noticeable margin in all categories, especially the difficult ones (Outdoors, Natural, Hard). Specifically, FGN achieves an overall error reduction of 24.2% compared with GAT. We further report the difference in the error landscapes of FGN and GAT in Figure 3. Although there is no significant difference in matching precision at the single-pixel level, our method gains a greater boost in precision in the sub-pixel region as the error tolerance increases.
This effect is especially noticeable for the Artificial and Outdoors categories. This suggests that our method is more advantageous at identifying high-quality matches in mildly difficult environments, which is beneficial for downstream tasks such as simultaneous localization and mapping (SLAM).

8. Conclusion

In this paper, we focus on the problem of lifelong graph learning and propose the feature graph as a new graph topology, which solves the challenge of increasing nodes in a streaming graph. It takes the features of a regular graph as nodes and the nodes as independent graphs. To construct the feature adjacency matrices, we accumulate the cross-correlation matrices of connected feature vectors to model feature interaction. This successfully converts the original node classification to graph classification and turns lifelong graph learning into regular lifelong learning. Comprehensive experiments show that the feature graph achieves superior performance in both data-incremental and class-incremental tasks. The applications to action recognition and feature matching demonstrate its superiority in lifelong graph learning. To the best of our knowledge, the feature graph is the first work to bridge graph learning to lifelong learning via a novel graph topology.

A. Dataset Details

We perform comprehensive experiments on popular graph datasets, including the citation graphs Cora, Citeseer, Pubmed, and ogbn-arXiv [28, 29, 38]. Their statistics are listed in Table 8. Note that the dataset split differs slightly from the original settings: due to the requirement of lifelong learning, we can only test the final optimized model and thus do not need a validation set. For Cora, Citeseer, and Pubmed, the model consists of two feature broadcast layers (7) with C^(1) = 1 and C^(2) = 2 channels, respectively. For ogbn-arXiv, we found that node features are easily over-smoothed by repeated feature propagation, hence we use feature transform layers that concatenate their input features. We take the one-hot vector as the target vector z_i, adopt the cross-entropy loss, and use the softsign [11] function σ(x) = x / (1 + |x|) as the non-linear activation.
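For quick reference, the softsign activation above is a few lines of NumPy (a minimal sketch; the function name is ours):

```python
import numpy as np

def softsign(x):
    """Softsign activation sigma(x) = x / (1 + |x|) [11]: smooth,
    bounded in (-1, 1), with gentler saturation than tanh."""
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.abs(x))
```

Unlike tanh, softsign approaches its asymptotes polynomially rather than exponentially, which keeps gradients alive for larger inputs.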

B. Proof of Sequence Invariant Sampling

Let P be the probability that an observed item is selected at each time step; then the probability that the item is still kept in the memory after k selections is P^k. This shows that earlier items have a lower probability of remaining in the memory, a phenomenon reported in Section 4.2 of [2]. To compensate for this effect, we propose Proposition B.1.

Proposition B.1. To ensure that all items in the continuum have the same probability of being kept in the memory at any time t, we set the probability that the n-th item is selected at time t to

P_n(t) = 1, if t ≤ M;  M/n, if t > M and t = n;  (t-1)/t, if t > M and t > n,

where M denotes the memory size.

Proof. The case t ≤ M is trivial, as we simply keep all items in the continuum. For t > M, the probability that the n-th item is still kept in the memory at time t is

P_{n,t} = P_n(n) · P_n(n+1) ⋯ P_n(t-1) · P_n(t) = (M/n) · (n/(n+1)) ⋯ ((t-2)/(t-1)) · ((t-1)/t) = M/t.

This means the probability P_{n,t} is independent of n, so all items in the continuum share the same probability of remaining in the memory. In practice, we always keep M items and sample items balanced across classes.
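The selection rule of Proposition B.1 is essentially reservoir sampling. A minimal Python sketch (all names are ours) updates a memory of size M and verifies the retention probability by telescoping the per-step probabilities:

```python
import random

def reservoir_update(memory, item, t, M, rng=random):
    """One step of sequence-invariant sampling (Proposition B.1).
    t is the 1-indexed arrival time of `item`; memory holds at most M items."""
    if t <= M:
        memory.append(item)                  # keep everything while not full
    elif rng.random() < M / t:               # admit the new item w.p. M/t
        memory[rng.randrange(M)] = item      # evict a uniform old item
    return memory

def retention_probability(n, t, M):
    """P(item n is still in memory at time t), by telescoping the
    per-step probabilities of Proposition B.1; equals M/t for t > M."""
    if t <= M:
        return 1.0
    p = M / max(n, M)                        # selected on arrival
    for s in range(max(n, M) + 1, t + 1):
        p *= (s - 1) / s                     # survives step s w.p. (s-1)/s
    return p
```

Note that evicting a uniformly chosen slot when a new item is admitted gives each old item an eviction probability of (M/t)(1/M) = 1/t, i.e., a survival probability of (t-1)/t, matching the proposition.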

C. Distributed Human Action Recognition

Implementation In practice, the temporally growing graph can only be learned sequentially, thus we take the first 80% of each sequence for training and the remaining 20% for testing. Specifically, we define the radius of a neighborhood by temporal distance, so that all nodes at the same instant are 1-hop neighbors of each other. For each feature graph we have K = 2 in the continuum (1). We construct FGN using two feature transform layers (8) with attention weights (9) and one fully connected layer for sub-graph classification. For fairness, we use C^(1) = 5, F^(1) = 25, C^(2) = 32, and F^(2) = 12 for all models. In the experiments, we found that GCN and APPNP obtain the best overall performance with the SGD optimizer, while MLP, GAT, and FGN perform best with the Adam optimizer.

Running time We also report the average running time of the models in Table 9. Note that the efficiency of FGN is on par with the other methods; considering its much better performance, we believe FGN is the more promising option.
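One plausible construction of the temporal neighborhood described above can be sketched as follows, assuming integer node timestamps and connecting nodes at most one time step apart, so that all same-instant nodes are mutual 1-hop neighbors (the mapping format and function name are our own assumptions, not the paper's exact definition):

```python
from collections import defaultdict

def temporal_adjacency(node_times):
    """Sketch of a temporal neighborhood: nodes whose time steps differ
    by at most one are connected, so same-instant nodes are mutual 1-hop
    neighbors and the k-hop neighborhood spans roughly temporal distance k.
    node_times: hypothetical {node_id: integer time step} mapping."""
    by_time = defaultdict(set)
    for node, t in node_times.items():
        by_time[t].add(node)
    adj = {}
    for node, t in node_times.items():
        nbrs = (by_time[t] | by_time.get(t - 1, set())
                | by_time.get(t + 1, set())) - {node}
        adj[node] = sorted(nbrs)
    return adj
```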

D. Image Feature Matching

Although many hand-crafted feature descriptors such as SIFT [27] and ORB [34] were proposed decades ago, their performance remains unsatisfactory under large viewpoint changes. Owing to their good generalization ability, deep learning-based feature detectors have received increasing attention. For example, SuperPoint [9] introduced a self-supervised framework for training interest point detectors and descriptors, and SuperGlue [37] introduced a graph attention model on top of SuperPoint for feature matching.

Implementation In the experiments, we adopt C = 1, K = 1, and F = 256 in both models for fairness. The training loss is adapted from [30], which maximizes the likelihood of predicting similar node embeddings for nodes corresponding to the same spatial location. We refer readers to SuperPoint [9], SuperGlue [37], and [30] for more details of the loss functions.

Dataset We perform training and evaluation on the TartanAir dataset [47]. TartanAir is a large (about 3 TB) and very challenging visual SLAM dataset consisting of binocular RGB-D video sequences together with additional per-frame information such as camera poses, optical flow, and semantic annotations. The sequences are rendered in AirSim [39], a photo-realistic simulator featuring modeled environments with various themes including urban, rural, nature, domestic, public, and sci-fi. Figure 4 contains several example video frames from TartanAir. The dataset is collected to cover challenging viewpoints and diverse motion patterns, and it also includes other traditionally challenging factors in SLAM tasks such as moving objects, changing lighting conditions, and extreme weather. We randomly select 80% of the sequences for training and keep the remaining for testing. We refer readers to [47] for more details of the dataset.
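For context, the sketch below shows a classical mutual nearest-neighbor descriptor matcher together with the mean pixel-error metric of the kind reported in Table 7. This is a hypothetical baseline, not the FGN matcher; all names and data formats are our own assumptions.

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Classical mutual nearest-neighbor matching baseline (for context
    only; not the FGN matcher). desc_a: (Na, F), desc_b: (Nb, F)
    L2-normalized descriptor arrays."""
    sim = desc_a @ desc_b.T
    ab = sim.argmax(axis=1)              # best B-match for each A keypoint
    ba = sim.argmax(axis=0)              # best A-match for each B keypoint
    return [(i, int(j)) for i, j in enumerate(ab) if ba[j] == i]

def mean_match_error(kpts_b, matches, gt):
    """Mean pixel error of predicted matches. gt maps an A-keypoint index
    to its ground-truth pixel location in image B (hypothetical format)."""
    errs = [np.linalg.norm(kpts_b[j] - gt[i]) for i, j in matches if i in gt]
    return float(np.mean(errs)) if errs else float("nan")
```

Keeping only mutual nearest neighbors discards one-sided matches, which is a common way to reduce outliers before computing the error statistics.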

E. Limitation

Although we have shown that FGN outperforms the SOTA methods in node classification, sub-graph classification, and edge prediction, it has several limitations. First, our current implementation is not vectorized to handle a varying number of neighbors, which makes it less computationally efficient. Second, our experiments assumed scalar edge weights, while in the general case graph edge weights can be vectors, as defined in the feature graphs. Third, since our main contribution is the novel graph topology, i.e., the feature graph, we mainly compared it with SOTA graph models such as GAT by applying an off-the-shelf lifelong learning algorithm; however, the feature graph may also be combined with other lifelong learning algorithms. In the future, we plan to fully optimize the code, extend FGN to applications with vector edge weights, and apply more lifelong learning algorithms.



Feature graph G^F.

Per-class precision.

Figure 2. The wearable sensor networks, the streaming graph, precision during learning, and the final performance comparison.

Figure 3. The matching precision of our method (blue) and the performance gap between our method and GAT (orange) at different levels of tolerance. Our method outperforms GAT at sub-pixel precision in mildly difficult categories.

Figure 4. Matching examples on the TartanAir dataset. Feature matching is a problem of edge prediction on a temporally growing graph.

Reduction (%): -10.6, -24.9, -30.9, -19.9, -28.8, -20.8, -30.6, -24.2 (overall). † Scene categories include: (I)ndoor and (O)utdoor; (N)aturalistic (woods and sea floor); (A)rtificial (streets and buildings); (M)ixed (containing both natural and artificial objects); (D)ark (poor lighting conditions); (H)ard (violent motion and/or complex textures).

The relationship of a graph and feature graph.

Overall performance comparison on all the datasets in Section 5. † We denote the best performance in bold and the second best with underline for each task. Due to the setting of lifelong learning, the dataset split differs from the original one, as we only test the final optimized model and do not need a validation set (Appendix A).

Average class forgetting rate in task "C" (%).

The number of model parameters used in Section 5.

Backward maximum forgetting rate of all models on all actions (%).

Precision comparison on the action recognition.

Mean Error (pixels) in image feature matching. †

The statistics of the datasets used for lifelong learning.

Running time comparison on the action recognition.

Acknowledgment

This work was sponsored by ONR grant #N0014-19-1-2266 and ARL DCIST CRA award W911NF-17-2-0181.

