SUBSTRUCTURED GRAPH CONVOLUTION FOR NON-OVERLAPPING GRAPH DECOMPOSITION Anonymous

Abstract

Graph convolutional networks have been widely used to solve graph problems such as node classification, link prediction, and recommender systems. It is well known that training graph convolutional networks on large graphs requires a large amount of memory and time. To deal with large graphs, many methods have been proposed, such as graph sampling and graph decomposition. In particular, graph decomposition has the advantage of parallel computation, but it loses information at the interface between subgraphs. In this paper, we propose a novel substructured graph convolution that restores the interface information lost by graph decomposition. Numerical results indicate that the proposed method is more robust to the number of subgraphs than other methods.

1. INTRODUCTION

Graph convolutional networks (GCNs) (Kipf & Welling, 2017) are widely used in node classification (Xiao et al., 2022), link prediction (Zhang & Chen, 2018), and recommender systems (Wu et al., 2022). For a given graph, a GCN constructs a renormalized graph Laplacian from the graph's adjacency matrix and uses it for layer propagation. Therefore, as the dimension of the adjacency matrix increases, more memory and time are required to train the network. There are two main lines of research that address this memory problem. The first is graph sampling methods (Hamilton et al., 2017; Chen et al., 2018; Ye et al., 2019; Zeng et al., 2020). These methods create a subgraph at every iteration using an appropriate sampling algorithm such as DeepWalk (Perozzi et al., 2014), and the network is trained using this subgraph. GraphSAGE (Hamilton et al., 2017) uses the edge information corresponding to a fixed-size neighborhood of uniformly sampled nodes. FastGCN (Chen et al., 2018) proposed importance sampling and showed faster training than GraphSAGE. VR-GCN (Ye et al., 2019) used a variance reduction technique to reduce the number of sampled nodes. GraphSAINT (Zeng et al., 2020) improved performance by sampling subgraphs instead of nodes or edges. Because sampling methods use subgraphs to reduce memory usage, the number of samples is an important choice: more samples are expected to give higher performance, but at the cost of slower training and higher memory consumption.

The second approach is to decompose the graph (Chiang et al., 2019). The biggest advantage of decomposition methods is that, unlike sampling methods, the decomposition can be performed in advance, before network training. A lot of research has been done on how to decompose a graph (Karypis & Kumar, 1998; Avery, 2011; Gonzalez et al., 2012).
Among them, METIS (Karypis & Kumar, 1998), which quickly decomposes a graph using a multi-level structure, is widely used. From the viewpoint of linear algebra, METIS derives a block diagonal matrix by performing a non-overlapping decomposition of the adjacency matrix of a given graph. ClusterGCN (Chiang et al., 2019) trains the network with a mini-batch gradient descent algorithm by sampling blocks of the block diagonal matrix generated by METIS. That is, this method trains the network by alternating over block submatrices chosen by random sampling. Alternatively, the network can be trained with the gradient descent algorithm on all blocks at once, by processing each block of the block diagonal matrix in parallel. The main difference from the alternating method is that no inner iteration is required, because the network is evaluated on all subgraphs at once and the results are then merged. However, non-overlapping decomposition drops the off-diagonal blocks and does not compensate for the lost information. Therefore, as the number of blocks increases, more information is lost, which also affects the training of the network.

In the field of numerical analysis, there are substructuring methods (Bramble et al., 1986; Farhat & Roux, 1991) that additionally use information on the interface of a domain that has undergone non-overlapping decomposition. Provided that the interface is sparse under an appropriate non-overlapping decomposition, the added computation and communication costs are very small. Therefore, although the interface part requires sequential computation, it does not become a bottleneck in the overall parallel structure. Motivated by the substructuring method, we modify the graph convolution that uses the block diagonal adjacency matrix generated by non-overlapping decomposition. That is, a substructure using the interface adjacency matrix is added to the graph convolution.
We call a graph convolution with this added substructure a substructured graph convolution. A simple linear algebra calculation shows that the sum of the aggregation outputs obtained with the block diagonal adjacency matrix and with the interface adjacency matrix differs from the aggregation output obtained with the original adjacency matrix. To compensate for this difference, we take a weighted sum whose coefficients are computed in the manner of the attention modules that perform well in natural language processing (Vaswani et al., 2017) and image classification (Hu et al., 2018). The numerical results confirm that the proposed graph convolution adequately complements the interface part.

The rest of this paper is organized as follows. In Section 2, we introduce an abstract non-overlapping graph decomposition framework and two methods for training a given network with decomposed graphs. We present the substructured graph convolution in Section 3. In Section 4, we report the improved node classification accuracy or F1-score of the proposed graph convolution applied to GCN, GCNII, GAT, and SGC on various datasets. We conclude this paper with remarks in Section 5.

2. NON-OVERLAPPING GRAPH DECOMPOSITION

In this section, we briefly introduce an algebraic framework of non-overlapping graph decomposition. We then describe two methods for training the graph convolutional networks using the decomposed graphs.

2.1. ALGEBRAIC FRAMEWORK

Let $A \in \mathbb{R}^{n \times n}$ be the adjacency matrix of a given graph consisting of $n$ nodes. Without loss of generality, let the graph be uniformly decomposed so that each subgraph has $n/N$ nodes for a positive integer $N$. Let $R_k : \mathbb{R}^{n} \to \mathbb{R}^{n/N}$ be the restriction operator onto the $k$-th subgraph. We construct a non-overlapping decomposition of the given adjacency matrix $A$ under this node decomposition. A subgraph adjacency matrix $A_k \in \mathbb{R}^{n/N \times n/N}$ is defined by

$$A_k = R_k A R_k^T, \quad k = 1, \dots, N. \tag{2.1}$$

The non-overlapping decomposition $\bar{A}$ of $A$ with the subgraph adjacency matrices (2.1) is given by

$$\bar{A} = \sum_{k=1}^{N} R_k^T A_k R_k. \tag{2.2}$$

Then $\bar{A}$ is the adjacency matrix of the graph consisting of the subgraphs that have $A_1, \dots, A_N$ as adjacency matrices. We define this graph as a non-overlapping graph decomposition of the given graph. For the block matrix representation

$$A = [A_{ij}]_{1 \le i,j \le N} = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1N} \\ A_{21} & A_{22} & \cdots & A_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ A_{N1} & A_{N2} & \cdots & A_{NN} \end{bmatrix}, \tag{2.3}$$

the corresponding block matrix representation of the non-overlapping decomposition $\bar{A}$ is written as

$$\bar{A} = \operatorname{diag}[A_{ii}]_{i=1}^{N} = \begin{bmatrix} A_{11} & 0 & \cdots & 0 \\ 0 & A_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{NN} \end{bmatrix}, \tag{2.4}$$

where $A_{ij} \in \mathbb{R}^{n/N \times n/N}$ is defined by $A_{ij} = R_i A R_j^T$. That is, $\bar{A}$ is the block-diagonal part of $A$. Comparing (2.3) and (2.4), the number of off-diagonal blocks dropped in $\bar{A}$, namely $N(N-1)$, increases as $N$ increases. That is, as $N$ increases, the degree to which $\bar{A}$ approximates $A$ decreases significantly; see Toselli & Widlund (2005).

Figure 1: Schematic description of the alternating method and the additive method after non-overlapping graph decomposition. Note that we assume the case of $N = 2$ for simplicity.
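The construction in (2.1), (2.2), and (2.4) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names `restriction` and `block_diagonal_part` are hypothetical, the toy path graph is an assumption, and the nodes are assumed to be ordered so that the $k$-th subgraph occupies a contiguous index range.

```python
import numpy as np

def restriction(k, n, N):
    """Restriction operator R_k (shape n/N x n) onto the k-th block of nodes."""
    m = n // N
    R = np.zeros((m, n))
    R[np.arange(m), k * m + np.arange(m)] = 1.0
    return R

def block_diagonal_part(A, N):
    """Non-overlapping decomposition: sum over k of R_k^T (R_k A R_k^T) R_k."""
    n = A.shape[0]
    A_bar = np.zeros_like(A)
    for k in range(N):
        R = restriction(k, n, N)
        A_k = R @ A @ R.T          # subgraph adjacency matrix, eq. (2.1)
        A_bar += R.T @ A_k @ R     # contribution to eq. (2.2)
    return A_bar

# Toy graph (an assumption for illustration): the path 0-1-2-3-4-5.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

for N in (2, 3):
    A_bar = block_diagonal_part(A, N)
    edge_cuts = int((A - A_bar).sum() // 2)   # undirected edges dropped
    print(N, edge_cuts)                       # prints "2 1", then "3 2"
```

On this toy graph the number of dropped edges grows from 1 to 2 as $N$ goes from 2 to 3, which is the effect described above in miniature.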

2.2. TRAINING WITH NON-OVERLAPPING GRAPH DECOMPOSITION

Now, we introduce two methods for training a network with a graph generated by the non-overlapping graph decomposition. One is an alternating method known as ClusterGCN (Chiang et al., 2019), and the other is an additive method similar to data parallelism (Gonzalez et al., 2012).

First, we explain the basic training method of graph convolutional networks. Let $G = (V, A)$ be a graph consisting of the node vector $V = (v_1, \dots, v_n)$ with an adjacency matrix $A \in \mathbb{R}^{n \times n}$. Each node $v_i$ has an $F$-dimensional feature vector $x_i \in \mathbb{R}^{F}$ and belongs to one of $C$ classes, which is encoded as a $C$-dimensional one-hot vector $y_i$. The node feature matrix $X \in \mathbb{R}^{n \times F}$ has $x_1, \dots, x_n$ as row vectors. Similarly, the node label matrix $Y \in \mathbb{R}^{n \times C}$ has $y_1, \dots, y_n$ as row vectors. For convenience, it is assumed that the network $f_\Theta$ consists of one graph convolution layer with trainable parameter $\Theta$ followed by the softmax function. Let $W \in \mathbb{R}^{F \times C}$ and $b \in \mathbb{R}^{C}$ be the weight and bias of the layer, and let $\Theta = \{W, b\}$. Then the forward propagation of $f_\Theta$ is written as

$$f_\Theta(X, A) = \mathrm{softmax}(SXW + b),$$

where $S = (I + D)^{-1/2}(I + A)(I + D)^{-1/2}$ is the renormalized graph Laplacian (Kipf & Welling, 2017). Here $D$ is the degree matrix of $A$ and $I$ is the identity matrix. Let $L$ be the loss function used to train the network $f_\Theta$. Then, for each epoch, the network is trained by the gradient descent method with learning rate $\eta$:

$$\Theta_{j+1} = \Theta_j - \eta \nabla_{\Theta_j} L(f_{\Theta_j}(X, A), Y).$$

Now, we introduce the alternating method. Using (2.1) and (2.2), each subgraph $G_k$, feature matrix $X_k$, and label matrix $Y_k$ are defined as

$$G_k = (V_k, A_k), \quad V_k = R_k V, \quad X_k = R_k X, \quad Y_k = R_k Y.$$

Then, for each epoch, the network is trained by the mini-batch gradient descent method described in Algorithm 1.

Algorithm 1: The alternating method with learning rate $\eta$

    for $k = 1, \dots, N$ do
        $\Theta_{j+1} = \Theta_j - \eta \nabla_{\Theta_j} L(f_{\Theta_j}(X_k, A_k), Y_k)$
    end

Next, we introduce the additive method.
The additive method computes the total output of the layer by gathering the outputs derived from each subgraph in parallel. After that, the network is trained using the gradient descent method. Algorithm 2 shows the update process of the additive method for each epoch.

Algorithm 2: The additive method with learning rate $\eta$

    for $k = 1, \dots, N$ in parallel do
        $\hat{Y}_k = f_{\Theta_j}(X_k, A_k)$
    end
    $\hat{Y} = \sum_{k=1}^{N} R_k^T \hat{Y}_k$
    $\Theta_{j+1} = \Theta_j - \eta \nabla_{\Theta_j} L(\hat{Y}, Y)$

Figure 1 illustrates a schematic description of the additive method and the alternating method in the case of $N = 2$.
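To make the difference between Algorithm 1 and Algorithm 2 concrete, the following NumPy sketch implements one epoch of each for the one-layer network $\mathrm{softmax}(SXW + b)$. Several choices here are assumptions rather than part of the method description: the loss $L$ is taken to be the mean cross-entropy over all nodes, gradients are written by hand instead of using automatic differentiation, the toy graph and labels are random, and all function names are hypothetical.

```python
import numpy as np

def renorm(A):
    """Renormalized matrix (I + D)^{-1/2} (I + A) (I + D)^{-1/2}."""
    d = 1.0 + A.sum(axis=1)
    return (np.eye(len(A)) + A) / np.sqrt(np.outer(d, d))

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss_and_grads(X, A, Y, W, b):
    """Mean cross-entropy of softmax(S X W + b) and its gradients in W and b."""
    H = renorm(A) @ X
    P = softmax(H @ W + b)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    G = (P - Y) / len(Y)                      # d(loss) / d(logits)
    return loss, H.T @ G, G.sum(axis=0)

def alternating_epoch(X, A, Y, blocks, W, b, lr):
    """Algorithm 1: one parameter update per subgraph."""
    for idx in blocks:
        A_k = A[np.ix_(idx, idx)]             # R_k A R_k^T
        _, gW, gb = loss_and_grads(X[idx], A_k, Y[idx], W, b)
        W, b = W - lr * gW, b - lr * gb
    return W, b

def additive_epoch(X, A, Y, blocks, W, b, lr):
    """Algorithm 2: gather all subgraph contributions, then update once."""
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for idx in blocks:                        # independent; could run in parallel
        A_k = A[np.ix_(idx, idx)]
        _, gW_k, gb_k = loss_and_grads(X[idx], A_k, Y[idx], W, b)
        share = len(idx) / len(Y)             # weight of this block in the mean loss
        gW += share * gW_k
        gb += share * gb_k
    return W - lr * gW, b - lr * gb

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
n, F, C = 8, 5, 3
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T
X = rng.normal(size=(n, F))
Y = np.eye(C)[rng.integers(0, C, size=n)]
blocks = [np.arange(0, 4), np.arange(4, 8)]

W_add, b_add = np.zeros((F, C)), np.zeros(C)
W_alt, b_alt = np.zeros((F, C)), np.zeros(C)
for epoch in range(50):
    W_add, b_add = additive_epoch(X, A, Y, blocks, W_add, b_add, lr=0.1)
    W_alt, b_alt = alternating_epoch(X, A, Y, blocks, W_alt, b_alt, lr=0.1)
```

Because the renormalized matrix of a block diagonal adjacency matrix is itself block diagonal, gathering the subgraph outputs in the additive method gives the same result as one forward pass with the block diagonal adjacency matrix. The alternating method instead performs $N$ parameter updates per epoch.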

3. SUBSTRUCTURED GRAPH CONVOLUTION

As mentioned in Section 2, the non-overlapping graph decomposition has a disadvantage in that the loss of off-diagonal information of the adjacency matrix increases as N increases. Therefore, it can be expected that the performance of graph convolutional networks using graph decomposition depends heavily on the number of subgraphs N . The same phenomenon can be observed in both the alternating method and the additive method, and the experiment for this will be discussed in Section 4.

3.1. ALGEBRAIC FRAMEWORK

In numerical analysis, there is a substructuring method that increases performance by using the interface part without disturbing the parallel structure; see, e.g., Toselli & Widlund (2005); Dolean et al. (2015). Let the non-overlapping decomposition of $A$ defined in Section 2, denoted by $\bar{A}$, be the interior adjacency matrix. We consider the interface adjacency matrix $\tilde{A} = A - \bar{A}$. Then, we propose a novel graph convolution called substructured graph convolution, which improves performance by adding the interface part as in the substructuring method, while maintaining the parallel structure. We now define the renormalized graph Laplacians $\bar{S}$ and $\tilde{S}$ for $\bar{A}$ and $\tilde{A}$, respectively, as

$$\bar{S} = (I + \bar{D})^{-1/2}(I + \bar{A})(I + \bar{D})^{-1/2}, \quad \tilde{S} = (I + \tilde{D})^{-1/2}(I + \tilde{A})(I + \tilde{D})^{-1/2},$$

where $\bar{D}$ and $\tilde{D}$ are the degree matrices of $\bar{A}$ and $\tilde{A}$, respectively. Note that $S \neq \bar{S} + \tilde{S}$. In other words, simply adding the outputs of the graph convolutions using $\bar{S}$ and $\tilde{S}$ differs from the output of the graph convolution using $S$, the renormalized Laplacian matrix for $A$. Therefore, we consider a weighted sum of $\bar{S}$ and $\tilde{S}$ rather than a simple addition to get a better approximation to $S$. With appropriate coefficient vectors $\bar{\alpha}$ and $\tilde{\alpha}$, we consider the forward propagation of the substructured graph convolution $f_\Theta$ as


$$f_\Theta(X, \bar{A}, \tilde{A}) = \sigma\big(\{\operatorname{diag}(\bar{\alpha})\bar{S} + \operatorname{diag}(\tilde{\alpha})\tilde{S}\}XW + b\big),$$

where $\Theta = \{W, b\}$ is the parameter of the layer and $\sigma$ is a nonlinear activation function. For parallel computation of the substructured graph convolution, the design of $\bar{\alpha}$ and $\tilde{\alpha}$ must compress information so as to minimize communication between the interior part $\bar{S}$ and the interface part $\tilde{S}$. Motivated by the attention module, which is the core of the SE block used in CNNs (Hu et al., 2018) and of the transformer structure mainly used in NLP (Vaswani et al., 2017), we consider the coefficients $\bar{\alpha}$ and $\tilde{\alpha}$ given by

$$[\bar{\alpha}, \tilde{\alpha}] = \mathrm{softmax}\left(\left[\frac{1}{F}\sum_{i=1}^{F}(\bar{S}X)_i,\ \frac{1}{F}\sum_{i=1}^{F}(\tilde{S}X)_i\right]\right),$$

where $(\cdot)_i$ denotes the $i$-th column. Note that $F$ is the feature dimension of $X$. This operation compresses the information of the intermediate features generated by $\bar{S}$ and $\tilde{S}$, and then obtains the softmax value with minimal communication and computation cost. Computing the coefficients $\bar{\alpha}$ and $\tilde{\alpha}$ requires sequential computation, but it needs no additional parameters and keeps the cost low, so that it does not become a bottleneck in the overall parallel structure. A brief graphical description of the substructured graph convolution for the case of $N = 2$ is shown in Figure 2.
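The forward propagation above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the softmax is assumed to act on the pair of scores of each node (so the two coefficients of a node sum to one), ReLU is used as the activation $\sigma$, the path graph with $N = 2$ is a toy example, and the function names are hypothetical.

```python
import numpy as np

def renorm(A):
    """Renormalized matrix (I + D)^{-1/2} (I + A) (I + D)^{-1/2}."""
    d = 1.0 + A.sum(axis=1)
    return (np.eye(len(A)) + A) / np.sqrt(np.outer(d, d))

def substructured_conv(X, A_bar, A_til, W, b):
    """Weighted sum of interior and interface aggregates, then a linear layer."""
    H_bar = renorm(A_bar) @ X                 # interior aggregation
    H_til = renorm(A_til) @ X                 # interface aggregation
    # Compress each aggregate to one score per node (mean over the F columns).
    scores = np.stack([H_bar.mean(axis=1), H_til.mean(axis=1)], axis=1)
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)  # assumed: softmax over the pair
    # diag(alpha) S X is a row-wise scaling of the aggregate S X.
    H = alpha[:, :1] * H_bar + alpha[:, 1:] * H_til
    return np.maximum(H @ W + b, 0.0), alpha   # ReLU as the activation

# Toy example (an assumption): path graph on 6 nodes split into N = 2 blocks.
n, F, C = 6, 4, 3
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_bar = A.copy()
A_bar[:3, 3:] = 0.0
A_bar[3:, :3] = 0.0
A_til = A - A_bar                              # interface adjacency matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(n, F))
W = rng.normal(size=(F, C))
b = np.zeros(C)
out, alpha = substructured_conv(X, A_bar, A_til, W, b)

# The plain sum of the two renormalized matrices does not reproduce S.
print(np.allclose(renorm(A), renorm(A_bar) + renorm(A_til)))  # False
```

The final check illustrates why a weighted sum is used: each of the two renormalized matrices contributes its own self-loop term, so their plain sum always has a larger diagonal than $S$.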

3.2. IMPLEMENTATION ISSUES

In this section, we discuss several issues in the efficient implementation of the proposed substructured graph convolution.

The first is the selection of an algorithm that performs the non-overlapping decomposition of a given graph. If the graph is simply cut at random or divided in index order, the interface adjacency matrix can become dense, depending on the shape of the graph. The interior adjacency matrix then becomes sparser, which degrades the performance of the existing graph convolution. Conversely, the sparser the interface adjacency matrix, the less sequential computation is required, so the proposed graph convolution can be computed more efficiently. For this reason, we need a non-overlapping graph decomposition algorithm that minimizes edge-cuts. METIS (Karypis & Kumar, 1998) is one such algorithm; it uses a multi-level structure to perform the decomposition quickly while minimizing edge-cuts. Table 1 shows the number of edge-cuts for each decomposition method applied to Cora (McCallum et al., 2000). METIS yields far fewer edge-cuts than the simple random and ordered decomposition methods. Therefore, we use METIS for non-overlapping graph decomposition in the sequel.

Next, we discuss why the renormalized graph Laplacians $\bar{S}$ and $\tilde{S}$ are used. Since $A = \bar{A} + \tilde{A}$, where $\bar{A}$ and $\tilde{A}$ are the interior and interface adjacency matrices, the renormalized graph Laplacian $S$ can be decomposed as

$$S = (I + D)^{-1/2}(I + \bar{A})(I + D)^{-1/2} + (I + D)^{-1/2}\tilde{A}(I + D)^{-1/2}. \tag{3.1}$$

Therefore, when the decomposition (3.1) is used, $(I + D)^{-1/2}$, which involves the degree matrix $D$ of the whole graph, must be computed for the interface part. This reduces the computational efficiency of the interface part, which requires sequential computation, and may cause a bottleneck in the interior part, where parallel computation is possible. On the other hand, using $\tilde{S}$, the sequential computation for the interface part can be performed efficiently.
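The decomposition (3.1) can be checked numerically on a small example. The sketch below uses a toy path graph with $N = 2$ (an assumption) and a hypothetical helper `scale`; it also shows that the interior term of (3.1) differs from the purely local renormalized matrix built from subgraph degrees, because the former needs the degrees of the whole graph.

```python
import numpy as np

def scale(M, deg):
    """(I + D)^{-1/2} M (I + D)^{-1/2} for the degree vector deg."""
    s = 1.0 / np.sqrt(1.0 + deg)
    return s[:, None] * M * s[None, :]

# Toy example (an assumption): path graph on 6 nodes split into N = 2 blocks.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_bar = A.copy()
A_bar[:3, 3:] = 0.0
A_bar[3:, :3] = 0.0
A_til = A - A_bar
I = np.eye(n)

deg = A.sum(axis=1)                           # degrees of the whole graph
S = scale(I + A, deg)
interior_term = scale(I + A_bar, deg)         # first term of (3.1)
interface_term = scale(A_til, deg)            # second term of (3.1)
print(np.allclose(S, interior_term + interface_term))   # True

# With subgraph-local degrees, the interior matrix is a different matrix.
S_bar = scale(I + A_bar, A_bar.sum(axis=1))
print(np.allclose(interior_term, S_bar))                 # False
```

In this example, nodes 2 and 3 touch the cut edge, so their global degree differs from their degree inside the subgraph; this is exactly the coupling between interior and interface that (3.1) would require.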
Lastly, we note that the proposed substructured graph convolution was designed with reference to the renormalized graph Laplacian of GCN, but it is also applicable to GCNII (Chen et al., 2020), GAT (Veličković et al., 2018), and SGC (Wu et al., 2019). The key idea is to construct the interface adjacency matrix, perform the aggregation, and then determine each coefficient via the attention module. The substructured graph convolution is then performed using the renormalized graph Laplacian corresponding to the aggregation part of the given network instead of the whole renormalized graph Laplacian.

4. APPLICATIONS

In this section, we present numerical results of the proposed graph convolution embedded into several existing GCNs: GCN, GCNII, GAT, and SGC. We evaluate performance on transductive learning and inductive learning, which are the main benchmarks for graph node classification problems. For the transductive learning task, the standard citation network benchmark datasets Cora (McCallum et al., 2000), CiteSeer (Giles et al., 1998), and PubMed (Yang et al., 2016) were used. In these datasets, a node and an edge represent a document and a citation, respectively. In the transductive setting, only 20 training nodes were used per class, and 500 and 1,000 nodes were used for validation and testing, respectively. For the inductive learning task, we used the protein-protein interaction (PPI) dataset (Hamilton et al., 2017), which consists of graphs of different human tissues. The dataset has 20 training graphs, 2 validation graphs, and 2 test graphs. Also, the PPI dataset can have multiple class labels. Details of the number of nodes, edges, features, and classes in each dataset are given in Table 2. All networks were implemented in Python with PyTorch (Paszke et al., 2019) and PyG (Fey & Lenssen, 2019), and all computations were performed on a cluster equipped with Intel Xeon Gold 6240R (2.4GHz, 48C) CPUs and NVIDIA 3090 GPUs, running CentOS 7.8.

4.1. TRANSDUCTIVE LEARNING

Transductive learning is a type of semi-supervised learning, i.e., given the nodes and edges of the graph, the network is trained using the labeled nodes, and then the unlabeled nodes are labeled. In particular, the transductive learning uses the same graph for training and testing. Thus, the transductive task shows how much the adjacency matrix of a given graph affects the labeling performance of the network.

4.1.1. NETWORK AND HYPERPARAMETER SETUP

GCN is a two-layer model with 16 channels for the Cora and CiteSeer datasets and 64 channels for the PubMed dataset. GCNII uses 64 layers with 64 channels for the Cora and CiteSeer datasets and 16 layers with 256 channels for the PubMed dataset. The GCNII hyperparameters α and λ are 0.1 and 0.5, respectively. GAT uses two layers with 8 heads, each with 8 channels. The last layer of GAT averages the outputs of the heads, and all other layers concatenate them. Finally, SGC performs feature propagation twice. GAT uses ELU (Clevert et al., 2015) as the activation function, and the other networks use ReLU. All neural networks were trained for 200 epochs using the Adam optimizer (Kingma & Ba, 2014). The learning rate, weight decay, and dropout (Srivastava et al., 2014) were chosen to give the best performance by grid search over [0.001, 0.005, 0.01], [0, 0.0001, 0.0005], and [0, 0.4, 0.6, 0.8], respectively.
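The grid described above contains 3 × 3 × 4 = 36 configurations per network and dataset. A minimal sketch of enumerating it is given below; the selection criterion (a callable `validate` returning a score such as validation accuracy) and the helper `select_best` are assumptions for illustration, since the text only states that the best-performing values were chosen.

```python
from itertools import product

learning_rates = [0.001, 0.005, 0.01]
weight_decays = [0, 0.0001, 0.0005]
dropouts = [0, 0.4, 0.6, 0.8]

# Every combination of the three hyperparameters.
grid = [
    {"lr": lr, "weight_decay": wd, "dropout": p}
    for lr, wd, p in product(learning_rates, weight_decays, dropouts)
]

def select_best(grid, validate):
    """Return the configuration with the highest score.

    `validate` is a hypothetical callable that trains a network with the
    given configuration and returns a score to maximize.
    """
    return max(grid, key=validate)

print(len(grid))  # 36
```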

4.1.2. EXPERIMENT RESULTS

First, as a reference for each network, Table 3 reports the accuracy of the given networks trained using the standard graph convolution on the standard citation network benchmark datasets. Next, we compare the performance of the proposed substructured graph convolution with the additive method and the alternating method. Table 4 shows numerical results of all of these methods applied to GCN, GCNII, GAT, and SGC on the Cora, CiteSeer, and PubMed datasets. First of all, as N increases, the overall accuracy decreases regardless of the network and dataset. In particular, for N = 32 and 64, the accuracy of the additive and alternating methods drops considerably because the number of edge-cuts of the given graph is very large. In contrast, with the substructured graph convolution, the decrease in accuracy as N increases is small compared to the additive and alternating methods. Moreover, in certain cases, the substructured graph convolution performs better than the standard graph convolution. This indicates that the information of the interface part is properly reflected in the network and helps to improve accuracy.

4.2. INDUCTIVE LEARNING

The inductive learning task is a supervised learning task. The biggest difference from transductive learning is that the graphs used for testing are different from the training graphs. If the adjacency matrix of the training graph is block diagonal, the trained network learns the block diagonal shape, so it is difficult to expect good labeling performance on a general graph. Thus, the inductive learning task shows the effect of reflecting the interface information on the generalizability of the network.

Figure 2: Graphical description of the proposed substructured graph convolution after non-overlapping graph decomposition. Note that we assume the case of N = 2 for simplicity.

Table 1: The number of edge-cuts according to the number of subgraphs N with random, ordered, and METIS decomposition applied to Cora. The random decomposition method uses the randperm function in PyTorch, and the ordered decomposition method divides the node indices of Cora in order. Note that the total number of edges in Cora is 5429.

Table 2: Details of the datasets used. The Cora, CiteSeer, and PubMed datasets have a single class label, but the PPI dataset can have multiple class labels.

Table 3: The accuracy of GCN, GCNII, GAT, and SGC on the Cora, CiteSeer, and PubMed datasets.

Table 4: The accuracy of GCN, GCNII, GAT, and SGC equipped with additive (AD), alternating (AL), and substructuring (SS) for the Cora, CiteSeer, and PubMed datasets, where N denotes the number of subgraphs.

Table 5: F1-scores of GCN, GCNII, GAT, and SGC equipped with additive (AD), alternating (AL), and substructuring (SS) for the PPI dataset, where N is the number of subgraphs.

4.2.1. NETWORK AND HYPERPARAMETER SETUP

GCN is a three-layer model with 1,024-channel layers and skip-connections (He et al., 2016). GCNII uses 9 layers with 2,048 channels. The GCNII hyperparameters α and λ are set to 0.5 and 1.0, respectively. GAT uses three layers with 4 heads, each with 256 channels, and also uses skip-connections. Finally, SGC performs feature propagation three times. GCN and GAT use ELU as the activation function, and the rest use ReLU. The optimizer and hyperparameters for training were set in the same way as in Section 4.1. In addition, the network was trained using all 20 PPI training graphs, one at a time, and the order of the training graphs was shuffled every epoch.

4.2.2. EXPERIMENT RESULTS

As in the transductive learning task, we compare the proposed substructured graph convolution with the additive and alternating methods. Note that the F1-scores on the PPI dataset of GCN, GCNII, GAT, and SGC trained using the standard graph convolution are 99.06, 89.79, 99.33, and 76.10, respectively. The numerical results in Table 5 confirm that the proposed substructured graph convolution performs better than the other methods on inductive learning tasks as well. In particular, for the other two methods the F1-score drops sharply as N increases, whereas for the substructured graph convolution the F1-score changes little. In other words, the form of the adjacency matrix of the graph used for training is important for the generalizability of the network. The adjacency matrix used in the additive and alternating methods is block diagonal and excludes the interface part, so the results indicate that the interface adjacency matrix plays a large role in the generalizability of the network. In addition, the proposed substructured graph convolution improves the performance of the network by using the interface adjacency matrix appropriately in the inductive learning task, as in the transductive learning task.

5. CONCLUSION

In this paper, we proposed a novel substructured graph convolution that is suitable for parallel computation and robust with respect to large numbers of subgraphs. Since the additional structure requires little computation and has no parameters, the proposed graph convolution does not cause a large bottleneck in parallel computation. We have shown experimentally that this graph convolution trains networks more effectively than the additive and alternating methods. It also outperformed the standard graph convolution in certain cases. We expect that the proposed graph convolution can be used efficiently to train on large graph datasets with multiple GPUs.

