LINGUINE: LEARNING TO PRUNE ON SUBGRAPH CONVOLUTION NETWORKS

Abstract

Graph Convolutional Networks (GCNs) have become one of the most successful methods for graph representation learning. Training and evaluating GCNs on large graphs is challenging, since full-batch GCNs incur high memory and computation overhead. In recent years, the research community has developed stochastic sampling methods to handle large graphs when it is impractical to fit the whole graph into a single batch. Model performance depends largely on the quality and size of the subgraphs used in batch training. Existing sampling approaches mostly focus on approximating the full-graph structure but pay less attention to redundancy and randomness in the sampled subgraphs. To address these issues and explore a better mechanism for producing high-quality subgraphs to train GCNs, we propose the Linguine framework, in which a meta-model learns to prune the subgraph smartly. To obtain the meta-model efficiently, we design a joint training scheme based on the idea of hardness-based learning. Empirical study shows that our method improves the accuracy of the current state of the art and reduces the error incurred by redundancies in the subgraph structure. We also explore the reasoning behind smart pruning via visualization.

1. INTRODUCTION

Graph representation learning has attracted much attention from the research community in recent years, with new work emerging every year. Graph Convolutional Networks (GCNs) were proposed as an extension of Convolutional Neural Networks (CNNs) (LeCun et al., 1995) to geometric data. The first spectral-based GCN was built on spectral graph theory (Bruna et al., 2013) and was extended by many subsequent works (Henaff et al., 2015; Defferrard et al., 2016). In recent years, the spatial-based counterpart (Kipf & Welling, 2016a) has gained more attention and has facilitated many machine learning tasks (Wu et al., 2020; Cai et al., 2018), including semi-supervised node classification (Hamilton et al., 2017b), link prediction (Kipf & Welling, 2016b; Berg et al., 2017) and knowledge graphs (Schlichtkrull et al., 2018). In this work, we focus primarily on large-scale spatial-based GCNs (Hamilton et al., 2017a; Chen et al., 2018b; Gao et al., 2018; Huang et al., 2018; Zeng et al., 2019; Zou et al., 2019; Chiang et al., 2019), where a given node aggregates the hidden states of its neighbors in the previous layer, followed by a non-linear activation, to obtain a topological representation. However, as the graph grows larger, GNN models face the challenges imposed by limited physical memory and exponentially growing computation overhead. Recent work has adopted sampling methods to handle the large volume of data and facilitate batch training. Most of these methods fall into three types: layer-wise sampling (Hamilton et al., 2017a; Gao et al., 2018; Huang et al., 2018; Zou et al., 2019), node-wise sampling (Chen et al., 2018b) and subgraph sampling (Chiang et al., 2019; Zeng et al., 2019). In layer-wise sampling, samples are taken from the neighbors of a given node in each layer. The number of nodes grows exponentially as the GCN gets deeper, resulting in 'neighbor explosion'.
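As a concrete illustration of the aggregation step described above, the following is a minimal NumPy sketch of one mean-aggregation GCN layer. The function name `gcn_layer`, the toy path graph, and the weight shapes are our own illustrative choices, not taken from any of the cited methods.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One spatial GCN layer: each node averages the hidden states of
    its neighbors (and itself), applies a linear map, then a ReLU."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)   # per-node degree
    agg = (a_hat / deg) @ h                  # mean aggregation over neighbors
    return np.maximum(agg @ w, 0.0)          # non-linear activation

# toy graph: 4 nodes on a path 0-1-2-3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = np.random.randn(4, 8)    # input node features
w = np.random.randn(8, 16)   # layer weights
out = gcn_layer(adj, h, w)   # shape (4, 16)
```

Stacking several such layers makes each node's receptive field grow with depth, which is exactly what makes full-batch training on large graphs expensive.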
In node-wise sampling, the nodes in each layer are sampled independently to form the structure of the GCN, which avoids 'neighbor explosion'. However, the resulting GCN structure is unstable, which leads to inferior convergence. In subgraph sampling, the GCN is trained on a subgraph sampled from the original graph, and messages are passed only within the subgraph during training. This approach resolves the neighbor-explosion problem and can be applied to training deep GCNs. However, the subgraph's structure and connectivity have a great impact on the training phase: an overly sparse subgraph may result in suboptimal performance and slow convergence (Chen et al., 2018b). Different sampling methods can make a huge difference in final accuracy and convergence speed, as shown in (Zeng et al., 2019). In the context of large-scale GCN training, limited GPU memory restricts the maximum batch size. The research community is actively seeking efficient sampling methods to address the challenges of scalability, accuracy, and computational complexity on large-scale GCNs.

Subgraph GCNs originated as a stochastic approach to approximate their full-graph counterparts. However, experiments showed that a GCN trained with partial information can achieve even lower bias (Zou et al., 2019). This is even more pronounced when variance reduction is applied during training, which overcomes the negative effects induced by subgraph GCNs and improves both convergence speed and inference performance (Hamilton et al., 2017a; Chen et al., 2018a; Zeng et al., 2019). However, the inner mechanism behind this random sampling has yet to be studied. GCNs have also been made deeper and more complicated through architecture design. Such models overcome the gradient vanishing problem by adopting deep CNNs' residual/dense connections and dilated convolutions, achieving state-of-the-art performance on open graph benchmarks (Li et al., 2019; 2020; Hu et al., 2020). However, as these models are significantly larger (amounting to more than 100 layers on some occasions) than previous approaches, they suffer from high computation overhead and memory cost. With the model taking up much space, batch sizes are also strictly limited.

We aim to provide an easier way to train complex models while maintaining a relatively large receptive field in the graph and preserving training quality. We propose the Linguine framework: in each forward pass, we 'smartly prune' inferior nodes to extract a concentrated, smaller subgraph from the larger subgraph sampled randomly beforehand. We thereby reduce the batch size and the memory required to train the model, and can train complex GCNs with a larger receptive field and achieve better performance under the same budget. We parameterize the decision function in smart pruning with a lightweight meta-model, which is fed with meta-information obtained from training a lightweight proxy model. This keeps the extra cost of the algorithm under control. Our framework is built upon existing subgraph sampling methods and utilizes joint training to learn the meta-model. The meta-model improves the quality of the training subgraphs by actively dropping redundant nodes from the receptive field and concentrating the information. We summarize the contributions of this work as follows:

1. We designed a new training framework, Linguine, which aims to train high-performance GCNs on large graphs via model-parameterized smart pruning.

2. Linguine provides a joint-training algorithm called bootstrapping, originally designed to train the meta-model in smart pruning, which also augments existing models and helps them converge to better solutions with scalability and lower bias. It is inspired by ideas from the real-world learning process.

3. Empirical study shows that the Linguine framework works well on public benchmarks of different scales and compares favorably to previous methods.

4. We analyzed the mechanism behind smart pruning via graph visualization.
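The pruning step at the heart of this framework can be sketched as follows. This is a hypothetical minimal version, assuming the meta-model reduces to a single linear scoring function and that pruning keeps a fixed fraction of the highest-scoring nodes; `smart_prune`, `score_fn`, and `keep_ratio` are illustrative names, not the paper's actual implementation.

```python
import numpy as np

def smart_prune(adj, feats, score_fn, keep_ratio=0.5):
    """Prune a sampled subgraph: score each node with a lightweight
    meta-model and keep only the top-scoring fraction of nodes."""
    scores = score_fn(feats)                    # one score per node
    k = max(1, int(keep_ratio * len(scores)))
    keep = np.argsort(scores)[-k:]              # indices of kept nodes
    # restrict adjacency and features to the kept nodes
    return adj[np.ix_(keep, keep)], feats[keep], keep

# hypothetical meta-model: a single linear scoring layer
rng = np.random.default_rng(0)
w_meta = rng.standard_normal(8)
score_fn = lambda x: x @ w_meta

adj = (rng.random((10, 10)) < 0.3).astype(float)   # random 10-node subgraph
feats = rng.standard_normal((10, 8))
sub_adj, sub_feats, kept = smart_prune(adj, feats, score_fn, keep_ratio=0.5)
```

In the actual framework the scoring function would be learned jointly with the GCN, so that the pruned subgraph concentrates the informative nodes rather than a random subset.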

2. RELATED WORK

Our framework is inspired by and built upon two popular branches of machine learning.

Graph Neural Networks. Many GNNs have emerged over recent years. Spatial-based GCNs are the most popular among them and have gained broad interest from the research community (Atwood & Towsley, 2016; Niepert et al., 2016; Gilmer et al., 2017). They stack multiple graph convolutional layers to extract high-level representations; in each layer, every node in the graph aggregates the hidden states of its neighbors in the previous layer. The final output is an embedding of each node in the graph. Existing work on GCNs utilizes sampling techniques to perform efficient mini-batch training. Common approaches can be categorized as layer-wise sampling, node-wise sampling and subgraph sampling. In layer-wise and node-wise sampling, the layers are sampled recursively from top to bottom to form mini-batches; the major difference between the two lies in the sampling mechanism.
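The recursive top-to-bottom sampling shared by layer-wise and node-wise approaches can be sketched as follows. This is a simplified illustration with a fixed per-layer fanout (closer in spirit to GraphSAGE-style neighbor sampling) rather than the exact procedure of any cited method; `sample_layers` and `fanout_per_layer` are our own names.

```python
import random

def sample_layers(neighbors, batch, fanout_per_layer):
    """Recursive top-to-bottom sampling: starting from the output batch,
    sample at most `fanout` neighbors per node for each deeper layer."""
    layers = [set(batch)]          # layers[0] = output nodes of the batch
    frontier = set(batch)
    for fanout in fanout_per_layer:
        nxt = set()
        for v in frontier:
            nbrs = neighbors.get(v, [])
            nxt.update(random.sample(nbrs, min(fanout, len(nbrs))))
        layers.append(nxt)         # nodes needed one layer deeper
        frontier = nxt
    return layers

# toy neighbor lists for a 4-node graph
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
layers = sample_layers(neighbors, batch=[0], fanout_per_layer=[2, 2])
```

Bounding the fanout keeps each mini-batch small, at the cost of only approximating every node's true neighborhood.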

