FASG: FEATURE AGGREGATION SELF-TRAINING GCN FOR SEMI-SUPERVISED NODE CLASSIFICATION

Abstract

Recently, graph convolutional networks (GCNs) have achieved significant success in many graph-based learning tasks, especially node classification, owing to their excellent representation learning ability. Nevertheless, it remains challenging for GCN models to obtain satisfying predictions on graphs where few nodes have known labels. In this paper, we propose a novel self-training algorithm based on GCN to boost semi-supervised node classification on graphs with little supervised information. Inspired by the self-supervision strategy, the proposed method introduces a checking part that adds new nodes as supervision after each training epoch to enhance node prediction. In particular, the embedded checking part is designed based on aggregated features, which makes it more accurate than previous methods and significantly boosts node classification. The proposed algorithm is validated on three public benchmarks against several state-of-the-art baselines, and the results illustrate its excellent performance.

1. INTRODUCTION

Graph convolutional network (GCN) can be seen as the migration of convolutional neural network (CNN) to non-Euclidean structured data. Due to its excellent representation learning ability, GCN has achieved significant success in many graph-based learning tasks, including node clustering, graph classification and link prediction (Dwivedi et al., 2020). Kipf & Welling (2016) proposed a GCN model from the perspective of spectral graph theory and validated its effectiveness on the semi-supervised node classification task. Subsequent models such as GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2017), SGCN (Wu et al., 2019) and APPNP (Klicpera et al., 2018) designed more sophisticated neighborhood aggregation functions from spatial or spectral views. These methods obtain much more effective results on semi-supervised node classification than traditional methods such as MLP and DeepWalk (Perozzi et al., 2014). However, the prediction accuracy of such GCN models depends largely on the quantity and quality of supervised information, and it decreases significantly when the number of labeled nodes is quite small (Li et al., 2018). The main reason is that scarce supervised information is difficult to spread far across the graph, so unlabeled nodes can hardly make full use of it for prediction. To address this issue, many studies have been devoted to improving the representation ability by designing multi-layer GCN models (Li et al., 2019). However, as illustrated in Kipf & Welling (2016), the representation ability of GCN can hardly be improved by simply stacking layers as in an MLP. Moreover, stacking too many layers tends to cause over-smoothing (Xu et al., 2018), which makes all node embeddings indistinguishable. Alternatively, Zhou et al. (2019) proposed a dynamic self-training framework that continuously refreshes the training set by directly using the output of GCN without a checking part.
In general, these self-training algorithms generate pseudo-labels with relatively simple checking mechanisms, which may introduce false labels as supervision and prevent the improvement of prediction accuracy. In this paper, we propose a novel feature aggregation self-training GCN (FASG) algorithm for semi-supervised node classification. We first propose a lightweight classifier that applies a linear SVM to aggregated node features, and validate that it achieves performance comparable to popular GCN approaches. This classifier then serves as a checking part in the multi-round training process to generate pseudo-labels, which are used to filter out unreliable nodes when expanding the supervised information. By fully exploiting the structural information of graph nodes, the newly developed checking part improves the accuracy of the generated pseudo-labels and ultimately boosts node classification. Finally, we illustrate that the proposed self-training strategy can be integrated with various existing GCN models to improve their prediction performance. The proposed algorithm is validated against several state-of-the-art baselines on three public benchmarks, and the experimental results illustrate that it generally outperforms all compared algorithms on all benchmarks. We will release the source code upon publication of this paper.
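As a sketch of the checking idea (not the exact FASG implementation: the paper's linear SVM is replaced here by a hypothetical nearest-centroid classifier to keep the example dependency-free), the checking part classifies nodes on k-step aggregated features and keeps a candidate pseudo-label only when it agrees with the GCN prediction:

```python
import numpy as np

def aggregate_features(A, X, k=2):
    """k-step feature aggregation X_k = A_norm^k X, where A_norm is the
    symmetrically normalized adjacency matrix with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(k):
        X = A_norm @ X
    return X

def centroid_check(X_agg, labeled_idx, labels, candidate_idx, gcn_pred):
    """Keep only candidate nodes whose nearest-centroid label (computed on
    aggregated features) agrees with the GCN pseudo-label."""
    classes = np.unique(labels)
    centroids = np.stack([X_agg[labeled_idx][labels == c].mean(axis=0)
                          for c in classes])
    dists = np.linalg.norm(
        X_agg[candidate_idx][:, None, :] - centroids[None], axis=2)
    check_pred = classes[dists.argmin(axis=1)]
    keep = check_pred == gcn_pred[candidate_idx]
    return np.asarray(candidate_idx)[keep]

# Toy graph: two 2-node components with class-aligned features.
A = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
              [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X_agg = aggregate_features(A, X, k=2)
labeled, y = np.array([0, 2]), np.array([0, 1])
gcn_pred = np.array([0, 0, 1, 1])
kept = centroid_check(X_agg, labeled, y, np.array([1, 3]), gcn_pred)
print(kept)  # [1 3]
```

In this toy case both candidates pass the check, since the aggregated-feature classifier agrees with the GCN predictions; in practice the filter would discard nodes where the two disagree.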

2. RELATED WORK

In the past decade, CNN has achieved great success in many areas of machine learning (Krizhevsky et al., 2012; LeCun et al., 1998; Sermanet et al., 2012), but its applications are mainly restricted to Euclidean structured data (Bruna et al., 2013). Consequently, in recent years more and more studies have been devoted to representation learning on non-Euclidean structured data such as graphs. Graph neural network (GNN), which can learn representations of nodes or of the whole graph, plays an important role in the field of graph representation learning. Well-known GNN architectures include GCN (Kipf & Welling, 2016), graph recurrent neural networks (Hajiramezanali et al., 2019) and graph autoencoders (Pan et al., 2018). As one of the most important GNN architectures, GCN can be roughly categorized into spectral and spatial approaches. Spectral approaches (Bruna et al., 2013) define the convolution operation via the Laplacian eigendecomposition of the graph, thereby filtering the graph structure in the spectral domain. On the basis of the Chebyshev polynomial approximation (Defferrard et al., 2016) of the graph Laplacian matrix, Kipf & Welling (2016) proposed a much simpler GCN framework that limits the filter to the first-order neighborhood around each node. On the other hand, spatial approaches implement convolution in the spatial domain by defining aggregation and transform functions. Notable work includes GraphSAGE (Hamilton et al., 2017), which formalized representation learning as aggregation and combination and proposed several effective aggregation strategies such as the mean-aggregator and max-aggregator, and GAT (Veličković et al., 2017), which focuses on the diversity of connected nodes and leverages a self-attention mechanism to learn the important information in neighborhoods.
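Concretely, the first-order filter of Kipf & Welling (2016) propagates node features as H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where D is the degree matrix of A + I. A minimal NumPy sketch of one such layer (toy graph and random weights, for illustration only):

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W):
    """One first-order GCN layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy 4-node path graph, 3-dimensional features, 2 output channels.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
Z = gcn_layer(normalize_adj(A), H, W)
print(Z.shape)  # (4, 2)
```

Each layer mixes every node's features with those of its first-order neighbors, which is also why stacking many such layers smooths the embeddings toward indistinguishability.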
Although these models achieve far better performance on node classification than traditional methods, they still suffer from scarce supervised information, since the limited number of GCN layers makes it hard to propagate the supervised information across the entire graph. Self-training is a classic topic in NLP that predates the deep learning era (Hearst, 1991; Riloff et al., 1999; Rosenberg et al., 2005; Van Asch & Daelemans, 2016), and has recently been introduced into semi-supervised node classification. To make full use of supervised information and improve prediction accuracy, Li et al. (2018) proposed to improve GCN models with a self-training mechanism, which trains and applies a base model in rounds and adds nodes with high confidence as supervision after each round. The newly added nodes are expected to help predict the remaining nodes and thus enhance the final performance of the model. Following this line, the M3S training algorithm (Sun et al., 2019) pretrains a model over the labeled data, and then assigns pseudo-labels to highly confident unlabeled samples, which are treated as labeled data in the next round of training. Later, Zhou et al. (2019) proposed a dynamic self-training GCN that generalizes and simplifies previous approaches by directly using the output of GCN, without a checking part, to continuously refresh the training set. Similarly, Yang et al. (2020) proposed the self-enhanced GNN (SEG) to improve the quality of the input data using the outputs of existing GNN models. These self-training methods expand the labeled node set with relatively simple checking mechanisms, or even by directly using the output of GCN; as a result they may introduce noise as supervision and thus hurt the final prediction performance.
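The multi-round scheme shared by these methods can be sketched as follows. This is a generic skeleton, not any particular paper's implementation: `train_fn` and `predict_fn` stand in for an arbitrary base model such as a GCN, and the toy nearest-centroid model in the usage example is purely illustrative.

```python
import numpy as np

def self_train(train_fn, predict_fn, X, labeled_idx, labels,
               rounds=3, per_class=5):
    """Generic multi-round self-training: after each round, the top-`per_class`
    most confident unlabeled nodes per class are added as pseudo-labels."""
    labeled_idx, labels = list(labeled_idx), list(labels)
    for _ in range(rounds):
        model = train_fn(X, labeled_idx, labels)
        probs = predict_fn(model, X)  # shape (n_nodes, n_classes)
        unlabeled = [i for i in range(len(X)) if i not in labeled_idx]
        for c in range(probs.shape[1]):
            top = sorted(unlabeled, key=lambda i: -probs[i, c])[:per_class]
            for i in top:
                if i not in labeled_idx:  # a node may rank high for 2 classes
                    labeled_idx.append(i)
                    labels.append(c)
    return labeled_idx, labels

# Toy base model: nearest class centroid with softmax-like confidences.
def train_fn(X, idx, y):
    classes = sorted(set(y))
    return np.stack([X[[i for i, yi in zip(idx, y) if yi == c]].mean(axis=0)
                     for c in classes])

def predict_fn(centroids, X):
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
idx, y = self_train(train_fn, predict_fn, X, [0, 5], [0, 1],
                    rounds=2, per_class=2)
print(sorted(zip(idx, y)))  # [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
```

The checking part criticized above corresponds to the acceptance rule inside the loop: here every high-confidence node is accepted unconditionally, which is exactly where false labels can slip in as supervision.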


