GRAPHCGAN: CONVOLUTIONAL GRAPH NEURAL NETWORK WITH GENERATIVE ADVERSARIAL NETWORKS

Abstract

Graph convolutional networks (GCN) achieve superior performance in graph-based semi-supervised learning (SSL) tasks. Generative adversarial networks (GAN) have also been shown to improve performance in SSL. However, there is still no good way to combine GAN and GCN in graph-based SSL tasks. In this work, we present GraphCGAN, a novel framework that incorporates adversarial learning with convolution-based graph neural networks to operate on graph-structured data. In GraphCGAN, we show that the generator can generate the topology structure and attributes/features of fake nodes jointly, and thereby boost the performance of a convolution-based graph neural network classifier. In a number of experiments on benchmark datasets, we show that the proposed GraphCGAN outperforms the reference methods by a significant margin.

1. INTRODUCTION

Graph-based semi-supervised learning (SSL) aims to classify nodes in a graph where only a small number of nodes are labeled, since the label collection process is expensive and time-consuming. To solve this task, various graph neural networks (GNNs) have been proposed that use the idea of convolutional neural networks (CNNs) to implicitly propagate information from labeled nodes to unlabeled nodes through the links between nodes (Kipf & Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017). These convolution-based graph neural networks have achieved superior performance on multiple benchmark datasets in graph-based SSL tasks (Wu et al., 2019). Recently, generative adversarial networks (GANs) (Goodfellow et al., 2014) have been shown to improve the performance on image-based SSL problems (Odena, 2016; Salimans et al., 2016; Li et al., 2019b). In semi-GAN (Salimans et al., 2016), the authors converted the M-class classification task into an (M+1)-class problem, where the synthetic (M+1)-th class is generated by the GAN's generator. Later on, Dai et al.
provided a theoretical insight that the generated data are able to boost the performance of the classifier under certain assumptions. Our work is motivated by semi-GAN. GraphSGAN (Ding et al., 2018) first investigated adversarial learning over graphs: the graph is embedded into an embedding space, and synthetic data are generated in that space. A multi-layer perceptron (MLP) is trained as the classifier on the embedding vectors. However, to our knowledge, no existing method combines adversarial learning with convolution-based GNNs on graph-based SSL tasks. In this work, we explore the potential of incorporating convolution-based GNNs and GANs. The challenges of constructing a general framework are threefold. First, attributed graph data are non-Euclidean, and their distribution contains information about the graph topology structure as well as the attributes of the nodes; it is therefore not trivial to construct a generator that models this distribution. Second, even if the generator can model the graph's distribution, it must be trained properly to boost the performance of the classifier: a poor-quality generator would introduce noise into the existing graph and harm the classifier. Third, many variants of GCN continue to be proposed, so the framework should be built with enough flexibility to adapt to different convolution-based GNNs. We construct a novel approach called GraphCGAN to deal with the above challenges. First, to model the distribution of the graph, the generator is built sequentially from two sub-generators: one models



the attribute information (the node's attributes) and the other models the graph topology structure (the adjacency relations of the node); details can be found in Section 3.1. Second, in GraphCGAN, the generator is trained with the feature matching technique (Salimans et al., 2016), which minimizes the distance between generated nodes and real nodes in a constructed feature space; this technique has shown good performance on SSL tasks in practice. The details of the loss functions can be found in Section 3.3. In GCN, the attributes of nodes are aggregated convolutionally over multiple layers, and the representation in the last layer is usually taken as the prediction of the labels. Variants of GCN differ mainly in their layer-aggregation strategy (Hamilton et al., 2017). In our framework, we choose the second-to-last layer of the convolution-based GNN as the feature matching function, so the framework extends easily to variants of GCN. More discussion can be found in Section 3.2.

2. PRELIMINARY

We first introduce notation for graphs. Let G = (V, E) denote a graph, where V is the set of nodes with |V| = n and E ⊆ V × V is the set of edges with |E| = m. The adjacency matrix A ∈ R^{|V|×|V|} is defined by A_{ij} = 1 if nodes v_i and v_j are connected by an edge, and A_{ij} = 0 otherwise. Suppose each node v_i has a d-dimensional feature x_i ∈ R^d and a scalar label y_i ∈ {1, 2, ..., M}. In the semi-supervised learning setting, the nodes are partitioned into disjoint sets V = V_L ∪ V_U such that the label of v_i is known for v_i ∈ V_L and unknown for v_j ∈ V_U. The distributions of nodes in the labeled set V_L and the unlabeled set V_U are denoted p_{V_L} and p_{V_U}, respectively. Semi-supervised learning aims to learn the labels of the unlabeled set {y_j | v_j ∈ V_U}, given the adjacency matrix A, the feature matrix X = [x_i]_{v_i ∈ V}, and the labels of the labeled set {y_i | v_i ∈ V_L}.
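As a toy illustration of this notation (the graph, feature values, and labels below are made up for the example):

```python
import numpy as np

# Toy graph: n = 4 nodes, m = 3 undirected edges.
edges = [(0, 1), (1, 2), (2, 3)]
n, d, M = 4, 3, 2
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0          # A_ij = 1 iff v_i and v_j share an edge

X = np.arange(n * d, dtype=float).reshape(n, d)  # feature matrix, one row per node
y = np.array([1, 2, -1, -1])         # labels in {1, ..., M}; -1 marks unknown
V_L = np.flatnonzero(y > 0)          # labeled nodes
V_U = np.flatnonzero(y < 0)          # unlabeled nodes: the targets of SSL
```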

2.1. CONVOLUTION-BASED GRAPH NEURAL NETWORK CLASSIFIER

Based on Laplacian smoothing, convolution-based GNN models propagate node feature information across each node's neighbors in every layer. Specifically, in GCN, the layer-wise propagation rule can be defined as

H^(l+1) = σ( D^{-1} A H^(l) W^(l) + b^(l) ),  l = 0, 1, ..., L-1,   (1)

where W^(l) and b^(l) are the layer-specific trainable weight matrix and bias, respectively, σ(·) is an activation function, and D is the diagonal degree matrix with D_{ii} = Σ_j A_{ij}, so that D^{-1} A is the normalized adjacency matrix. The initial layer H^(0) is the feature matrix X. The final layer H^(L), followed by a softmax layer, can be viewed as the prediction of the one-hot representation of the true label y. Recently, many variants of the GCN layer-wise propagation rule have been proposed, including the graph attention network and Cluster-GCN (Veličković et al., 2017; Chiang et al., 2019), which achieve state-of-the-art performance on many benchmark datasets.
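The propagation rule in Equation 1 can be sketched in a few lines of NumPy; the graph, random weights, and ReLU activation below are illustrative assumptions, not the paper's trained model:

```python
import numpy as np

def gcn_layer(A, H, W, b):
    """One propagation step: H' = sigma(D^-1 A H W + b), as in Equation 1."""
    D_inv = np.diag(1.0 / A.sum(axis=1))           # D_ii = sum_j A_ij
    return np.maximum(0.0, D_inv @ A @ H @ W + b)  # ReLU as the activation sigma

# Toy graph with self-loops so every row degree is nonzero.
n, d, h = 4, 3, 5
A = np.eye(n)
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0
rng = np.random.default_rng(0)
H0 = rng.normal(size=(n, d))                       # H^(0) = X
W0, b0 = rng.normal(size=(d, h)), np.zeros(h)
H1 = gcn_layer(A, H0, W0, b0)                      # H^(1), shape (n, h)
```

Stacking L such layers and applying a softmax to H^(L) gives the classifier described above.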

2.2. GENERATIVE ADVERSARIAL NETWORK BASED SEMI-SUPERVISED LEARNING

In semi-GAN, the classifier C and the generator G play a non-cooperative game: the classifier aims to classify the unlabeled data and to distinguish generated data from real data, while the generator attempts to match the features of generated data to those of real data. The objective function for the classifier can therefore be divided into two parts (Salimans et al., 2016). The first part is the supervised loss L_sup = E_{v,y ∼ p_{V_L}} log P_C(y | v, y ≤ M), the log-probability of the node label given the real labeled nodes. The second part is the unsupervised loss L_unsup = E_{v ∼ p_{V_U}} log P_C(y ≤ M | v) + E_{v ∼ p_{V_G}} log P_C(y = M+1 | v), the sum of the log-probability of the first M classes for real nodes and the log-probability of the (M+1)-th class for generated nodes V_G. The classifier C is trained by maximizing the objective

L_C = L_sup + L_unsup.   (2)

For the generator objective, Salimans et al. (2016) found that minimizing the feature matching loss

L_G = || E_{v ∼ p_{V_U}} (f(v)) - E_{z ∼ p_z(z)} (f(G(z))) ||_2^2   (3)

achieves superior performance in practice, where the feature matching function f(·) maps the input into a feature space and z ∼ p_z(z) is drawn from a given distribution such as the uniform distribution. Furthermore, Dai et al. (2017) provided a theoretical justification that a complementary generator G is able to boost the performance of the classifier C in SSL tasks.
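A batch estimate of the feature matching loss in Equation 3 is simply the squared distance between the mean feature vectors of the two batches; a minimal sketch, with made-up feature batches standing in for f(v) and f(G(z)):

```python
import numpy as np

def feature_matching_loss(f_real, f_fake):
    """L_G = || mean_i f(v_i) - mean_j f(G(z_j)) ||_2^2 over two batches."""
    diff = f_real.mean(axis=0) - f_fake.mean(axis=0)
    return float(diff @ diff)

# Rows are feature-space outputs f(.); both batches have mean [2, 1].
f_real = np.array([[1.0, 0.0], [3.0, 2.0]])
f_fake = np.array([[2.0, 1.0], [2.0, 1.0]])
```

With the batches above the loss is zero; shifting `f_fake` moves the fake mean away from the real mean and the loss grows accordingly.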

3. FRAMEWORK OF GRAPHCGAN

To combine the aforementioned Laplacian smoothing on graphs and semi-GAN for SSL, we develop the GraphCGAN model, which uses generated nodes to boost the performance of convolution-based GNN models.

3.1. CONSTRUCTION OF GENERATOR FOR GRAPHCGAN

The generator G generates a fake node v_0 by generating a feature vector x_0 ∈ R^d and an adjacency relation a_0 ∈ R^n jointly, where a_{0,i} = 1 if the fake node is connected to real node v_i, and a_{0,i} = 0 otherwise. Therefore, the distribution of the generated node p_G(v_0) can be expressed as the joint distribution of the corresponding feature and adjacency relation, p_G(x_0, a_0). By the conditional-distribution formula, the joint distribution can be written as p_G(x_0, a_0) = p_{G_1}(x_0) p_{G_2}(a_0 | x_0). We use sub-generators G_1 and G_2 to generate the fake feature x_0 and a_0 | x_0, respectively. In practice, a_0 | x_0 can be modeled by G_2(z; x_0) = G_2(z; G_1(z)), where the adjacency relation a_0 is constructed by sub-generator G_2 given the input x_0. The distribution of the generated node can be denoted

p_G(v_0) = p_G(x_0, a_0) = p_{G_1}(x_0) p_{G_2}(a_0 | x_0) = p(G_1(z)) p(G_2(z; G_1(z))) =: p(G(z)).   (4)

If B nodes (v_{0,1}, v_{0,2}, ..., v_{0,B}) are generated, the generated feature matrix is X_0 = (x_{0,1}^T, x_{0,2}^T, ..., x_{0,B}^T)^T and the generated adjacency matrix is A_0 = (a_{0,1}^T, a_{0,2}^T, ..., a_{0,B}^T)^T. Hence, writing [X; Y] for vertical stacking, the combined adjacency matrix and combined feature matrix can be denoted

Ã = [ A, A_0^T ; A_0, I_B ] ∈ R^{(n+B)×(n+B)},   (5)

X̃ = [ X ; X_0 ] ∈ R^{(n+B)×d}.   (6)

The diagonal degree matrix D̃ ∈ R^{(n+B)×(n+B)} can be denoted diag(D_*, D_B), where D_* ∈ R^{n×n} with D_{*,ii} = Σ_j A_{ij} + Σ_b A_{0,bi} and D_B ∈ R^{B×B} with D_{B,bb} = Σ_j A_{0,bj} + 1.
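The block construction of Equations 5 and 6 can be sketched directly with NumPy block assembly; the toy graph and fake-node rows below are illustrative assumptions:

```python
import numpy as np

def combine_graph(A, X, A0, X0):
    """Enlarge the graph with B fake nodes:
    A_tilde = [[A, A0^T], [A0, I_B]]  (Equation 5),
    X_tilde = [X; X0]                 (Equation 6)."""
    B = A0.shape[0]
    A_tilde = np.block([[A, A0.T], [A0, np.eye(B)]])
    X_tilde = np.vstack([X, X0])
    return A_tilde, X_tilde

# n = 3 real nodes on a path, B = 2 fake nodes, d = 2 features.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.ones((3, 2))
A0 = np.array([[1., 0., 0.], [0., 0., 1.]])  # each row is one fake node's a_0
X0 = np.zeros((2, 2))
A_tilde, X_tilde = combine_graph(A, X, A0, X0)
```

Note that Ã stays symmetric because the fake-to-real links A_0 appear both as a block and as its transpose.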

3.2. ANALYSIS OF CLASSIFIER FOR GRAPHCGAN

In GraphCGAN, we adopt a convolution-based GNN, such as GCN, GraphSAGE (Hamilton et al., 2017) or GAT (Veličković et al., 2017), as the classifier. The classifier is applied to the enlarged graph G̃ = [X̃, Ã] to obtain the prediction ỹ for the nodes V ∪ V_G. Specifically, taking the layer-wise propagation of GCN (Equation 1) as the classifier in GraphCGAN, and writing [X; Y] for vertical stacking, the propagation rule can be denoted

H̃^(l+1) = σ( D̃^{-1} Ã H̃^(l) W^(l) + b̃^(l) )
         = σ( diag(D_*^{-1}, D_B^{-1}) [ A, A_0^T ; A_0, I_B ] [ H_*^(l) ; H_0^(l) ] W^(l) + [ b^(l) ; b_B^(l) ] )
         = σ( [ D_*^{-1} A H_*^(l) + D_*^{-1} A_0^T H_0^(l) ; D_B^{-1} A_0 H_*^(l) + D_B^{-1} H_0^(l) ] W^(l) + [ b^(l) ; b_B^(l) ] )
         = σ( [ D_*^{-1} A H_*^(l) W^(l) + b_*^(l) ; (D_B^{-1} A_0 H_*^(l) + D_B^{-1} H_0^(l)) W^(l) + b_B^(l) ] )
         =: [ H_*^(l+1) ; H_0^(l+1) ],   (7)

where the first layer is the enlarged feature matrix H̃^(0) = X̃, the weight matrix W^(l) is the same as in Equation 1, the bias vector b̃^(l) has dimension (n+B) and is denoted [b^(l)T, b_B^(l)T]^T, and we write b_*^(l) = D_*^{-1} A_0^T H_0^(l) W^(l) + b^(l) to make the format clear. From Equation 7, the layer propagation of the real nodes (the first n rows) follows the same format as the GCN layer propagation in Equation 1. As a special case, for the zero generator A_0 = 0 or X_0 = 0, the performance of the classifier on V ∪ V_G is the same as that of the original classifier on V. For the last layer H̃^(L) ∈ R^{(n+B)×M}, we adopt the strategy of Salimans et al. (2016) to obtain the (M+1)-class label ỹ by

ỹ = softmax( H̃^(L) || 0_{(n+B)×1} ),   (8)

where || denotes concatenation and 0_{(n+B)×1} ∈ R^{(n+B)×1} is a zero matrix. The loss function for the classifier in GraphCGAN follows the same format as Equation 2.
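The fixed-zero-logit trick of Equation 8 can be sketched as follows; the input logits below are illustrative, not the output of a trained network:

```python
import numpy as np

def m_plus_one_prediction(H_L):
    """y_tilde = softmax(H^(L) || 0): append a fixed zero logit for the fake
    (M+1)-th class, then apply a numerically stable row-wise softmax."""
    logits = np.hstack([H_L, np.zeros((H_L.shape[0], 1))])
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# (n + B) = 4 rows, M = 3 real classes; all-zero logits give uniform rows.
y_tilde = m_plus_one_prediction(np.zeros((4, 3)))
```

Because the fake-class logit is fixed at zero, the network only needs to learn the M real-class logits; the (M+1)-th probability is determined by normalization.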

3.3. LOSS FUNCTIONS

Let g(·, ·; θ_C) denote the map from a feature vector and an adjacency vector to the second-to-last layer of the convolution-based GNN, with trainable parameters θ_C. Specifically, in the case of GCN, for node v_i with feature vector x_i and adjacency vector a_i,

g(x_i, a_i; θ_C) = H̃_i^(L-1),   (9)

where H̃_i^(L-1) denotes the i-th row of H̃^(L-1) and θ_C = [W^(l); b̃^(l)]_{l=0,1,...,L-2}. According to Equation 4, the loss function of the generator G can be decomposed into two parts: the loss functions of the sub-generators G_1 and G_2. To construct G_1, the feature matching function f in Equation 3 should depend solely on the feature vector. Therefore, we mask the adjacency matrix Ã as the identity matrix Ĩ ∈ R^{(n+B)×(n+B)} in the layer propagation. Formally, the feature matching loss of G_1 is constructed as

L_{G_1} = || E_{x_i} ( g(x_i, I_i; θ_C) ) - E_{z ∼ p_z(z)} ( g(G_1(z), 0; θ_C) ) ||_2^2,

where I_i denotes the i-th row of the identity matrix I ∈ R^{n×n} and 0 is the zero vector. After x_0 = G_1(z) is built, the feature matching loss of G_2 can be constructed similarly as

L_{G_2} = || E_{a_i} ( g(x_i, a_i; θ_C) ) - E_{z ∼ p_z(z)} ( g(x_0, G_2(z); θ_C) ) ||_2^2.

Therefore, the loss function for G can be written as

L_G = L_{G_1} + L_{G_2}.   (10)

Furthermore, when multiple fake nodes are generated, Salimans et al. (2016) showed that adding a pull-away term to the loss function can increase the entropy of the generator, which leads to better performance in practice. The pull-away losses for the sub-generators G_1, G_2 can be denoted

L^pt_{G_1} = 1/(B(B-1)) Σ_i Σ_{j≠i} g(G_1(z_i), 0; θ_C)^T g(G_1(z_j), 0; θ_C) / ( ||g(G_1(z_i), 0; θ_C)|| ||g(G_1(z_j), 0; θ_C)|| ),

L^pt_{G_2} = 1/(B(B-1)) Σ_i Σ_{j≠i} g(x_{0,i}, G_2(z_i); θ_C)^T g(x_{0,j}, G_2(z_j); θ_C) / ( ||g(x_{0,i}, G_2(z_i); θ_C)|| ||g(x_{0,j}, G_2(z_j); θ_C)|| ).

The loss function for G with the pull-away term can be written as

L*_G = L_G + L^pt_{G_1} + L^pt_{G_2}.   (11)

Besides, Dai et al.
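As written above, each pull-away term averages the pairwise cosine similarity between the B generated feature rows; a minimal sketch (the feature matrices below are stand-ins for g(·, ·; θ_C) outputs):

```python
import numpy as np

def pull_away_term(F):
    """1/(B(B-1)) * sum_{i != j} <f_i, f_j> / (||f_i|| ||f_j||) over rows of F."""
    B = F.shape[0]
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)  # unit-normalize rows
    C = Fn @ Fn.T                                      # cosine similarity matrix
    return float((C.sum() - np.trace(C)) / (B * (B - 1)))
```

Orthogonal rows (maximally spread features) give a term of 0, while identical rows (a collapsed generator) give 1, so minimizing it pushes the generated nodes apart in feature space.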
(2017) constructed the complementary losses L^c_{G_1} = E_{x ∼ p_{G_1}} log(p(x)) I(p(x) > ε) and L^c_{G_2} = E_{a ∼ p_{G_2}} log(p(a)) I(p(a) > ε), which can also increase performance. Therefore, the loss function for G with the complementary loss can be written as

L**_G = L*_G + L^c_{G_1} + L^c_{G_2}.   (12)

The procedure is formally presented in Algorithm 1.

Algorithm 1: GraphCGAN Algorithm
Input: adjacency matrix A; node features X; initialized fake nodes V_G = [A_0, X_0]; hyper-parameters, including the dimension of the noise vector d_noise, the number of steps K_D, the number of fake nodes B, and the early-stop error.
Output: prediction Ỹ
1  while not early stop do
2    Combine the fake nodes V_G with the graph and obtain Ã and X̃ from Equations 5 and 6;
11   Obtain X_0 = G_1(Z) and A_0 = G_2(Z; G_1(Z)).
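The alternating loop of Algorithm 1 can be sketched as a skeleton; the classifier update, generator update, and fake-node sampler are passed in as placeholder callables, and all names and shapes here are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_graphcgan(A, X, B, d_noise, n_iters,
                    train_classifier, train_generator, sample_fakes):
    """Skeleton of Algorithm 1: alternate classifier and generator updates.
    train_classifier minimizes L_C (Equation 2) on the enlarged graph;
    train_generator minimizes one of Equations 10-12;
    sample_fakes returns X0 = G1(Z) and A0 = G2(Z; G1(Z))."""
    n, d = X.shape
    A0 = np.zeros((B, n))              # initialized fake adjacency rows
    X0 = rng.normal(size=(B, d))       # initialized fake features
    for _ in range(n_iters):
        # Equations 5-6: attach the current fake nodes to the graph.
        A_tilde = np.block([[A, A0.T], [A0, np.eye(B)]])
        X_tilde = np.vstack([X, X0])
        train_classifier(A_tilde, X_tilde)      # train C on the combined graph
        Z = rng.uniform(size=(B, d_noise))      # noise Z ~ U(0, I)
        train_generator(Z)                      # train G = [G1; G2]
        X0, A0 = sample_fakes(Z)                # refresh the fake nodes
    return A0, X0

# Dummy components just to exercise the loop and shapes.
A = np.eye(3)
X = np.ones((3, 2))
noop = lambda *args: None
fakes = lambda Z: (np.zeros((2, 2)), np.zeros((2, 3)))
A0, X0 = train_graphcgan(A, X, B=2, d_noise=4, n_iters=3,
                         train_classifier=noop, train_generator=noop,
                         sample_fakes=fakes)
```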

4. RELATED WORK

4.1. GRAPH-BASED SEMI-SUPERVISED LEARNING

The challenge of graph-based SSL is to leverage unlabeled data to improve classification performance. Graph-based semi-supervised learning methods fall into three categories. The first is Laplacian regularization-based methods (Xiaojin & Zoubin, 2002; Lu & Getoor, 2003; Belkin et al., 2006). The second is embedding-based methods, including DeepWalk (Perozzi et al., 2014), SemiEmb (Weston et al., 2012), and Planetoid (Yang et al., 2016). The third is convolution-based graph neural networks such as GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), Cluster-GCN (Chiang et al., 2019) and DeepGCN (Li et al., 2019a), which address semi-supervised learning in an end-to-end manner. Convolution-based methods perform graph convolution by taking a weighted average of a node's neighborhood information, and they have achieved state-of-the-art performance on many graph-based semi-supervised learning tasks (Wu et al., 2019).

4.2. GNN LEARNING WITH GAN

GANs are widely used to obtain generative graph models. GraphGAN (Wang et al., 2018) proposed a framework for the graph embedding task; specifically, GraphGAN can generate the link relations for a center node, but it cannot be applied to attributed graphs. MolGAN (De Cao & Kipf, 2018) proposed a framework for generating the attributed graphs of molecules by generating the adjacency matrix and feature matrix independently; it then used the score of the generated molecule as a reward function in an auxiliary reinforcement learning model to choose reasonable combinations of attributes and topology structure. In comparison, GraphCGAN generates the attributes and adjacency matrix of the attributed graph jointly, which captures the correlation between the attributes and the topology. DGI (Veličković et al., 2018) proposed a general approach for learning node representations within graph-structured data in an unsupervised manner. In DGI, the fake nodes are created by a pre-specified corruption function applied to the original nodes; in contrast, GraphCGAN generates fake nodes from a dynamic generator during GAN training. For the classifier, DGI uses only GCN, whereas GraphCGAN is flexible and adapts to other convolution-based GNN models.

4.3. GAN WITH SEMI-SUPERVISED LEARNING

SGAN (Odena, 2016) first introduced adversarial learning to semi-supervised learning for image classification. GAN-FM (Salimans et al., 2016) stabilized the training process of SGAN by introducing feature matching and minibatch techniques. Kumar et al. (2017) discussed the effects of adding fake samples and claimed that a moderate number of fake samples can improve performance on image classification tasks. GraphSGAN (Ding et al., 2018) proposed a framework that combines a graph Laplacian regularization-based classifier with a GAN to solve graph-based semi-supervised learning tasks. In GraphSGAN, fake samples are generated in the feature space of a hidden layer, so it cannot be applied to convolution-based classifiers. In contrast, our model generates fake nodes directly and is adaptable to convolution-based classifiers.

5. EXPERIMENTS

In this section, our primary goal is to show that adversarial learning can boost the performance of convolution-based GNNs in graph-based SSL under our framework. We evaluate GraphCGAN on established graph-based benchmark tasks against baseline convolution-based GNN models and other related methods. We first introduce the datasets, experimental setup and results; we then study the properties of the generated nodes during the training process. An ablation study is also provided in this section. The code GraphCGAN-ICLR.zip is provided as a supplementary file.

5.1. DATASETS

Three standard citation network benchmark datasets, Cora, Citeseer and Pubmed (Sen et al., 2008), are analyzed. We closely follow the settings of Kipf & Welling (2016) and Veličković et al. (2017), which allow only 20 nodes per class to be used for training. The predictive power of the trained models is evaluated on 1000 test nodes, and 500 additional nodes are used for validation.

5.2. EXPERIMENT SETUP AND RESULT

Two widely used convolution-based GNNs, GCN and GAT, are considered as classifiers in GraphCGAN, in order to show that the generated nodes can help improve their performance. We adopt the same model settings as the original papers (Kipf & Welling, 2016; Veličković et al., 2017). Specifically, for the classifier in GraphCGAN-GCN, the number of layers L is 2, the dimension of the hidden layer is 16, the dropout rate is 0.5, and the activation function in the hidden layer is ReLU. For GraphCGAN-GAT, the number of layers L is 2, the dimension of the hidden layer is 8, the number of attention heads is 8, the dropout rate is 0.6, and the activation function in the hidden layer is sigmoid. The hyper-parameter for the weight of L2 regularization is 5e-4. For the generator, we use the loss function in Equation 12 (an ablation study for the generator loss function can be found in Table 2 of Appendix A). For Cora and Citeseer, we generate B = 64 fake nodes; for Pubmed, B = 256. An ablation study on the number of fake nodes is provided in Figure 1.

Method                          Cora           Citeseer       Pubmed
                                46.5%          71.4%
GraphSGAN (Ding et al., 2018)   83.0 ± 1.3%    73.1 ± 1.8%    77.2 ± 2.6%
DGI (Veličković et al., 2018)   82.3 ± 0.6%    71.8 ± 0.7%    76.8 ± 0.6%
GCN (Kipf & Welling, 2017)      81.5%          70.3%          79.0%
GraphCGAN-GCN (ours)            82.4 ± 0.6%    72.6 ± 1.0%    79.9 ± 1.0%
Gain for GCN                    0.9%           2.3%           0.9%
GAT (Veličković et al., 2017)   83.0 ± 0.7%    72.5 ± 0.7%    79.0 ± 0.3%
GraphCGAN-GAT (ours)            84.0 ± 0.5%    73.2 ± 1.0%    80.7 ± 1.5%
Gain for GAT                    1.0%           0.7%           1.7%

Table 1: Summary of results in terms of classification accuracy under 100 repetitions. The best and second-best results are marked in bold font. The results show that GraphCGAN-GCN and GraphCGAN-GAT outperform GCN and GAT by a significant margin, respectively.

The results are presented in Table 1. We particularly note that both GraphCGAN-GCN and GraphCGAN-GAT outperform GCN and GAT by a significant margin, respectively.
More specifically, we improve upon GCN by margins of 0.9%, 2.3% and 0.9% on Cora, Citeseer and Pubmed, respectively. In addition, GraphCGAN-GAT improves upon GAT by margins of 1.0%, 0.7% and 1.7%, suggesting that the fake-node strategy in our GraphCGAN model can boost the performance of the reference convolution-based GNN model. Note that GraphCGAN can be easily extended to other convolution-based GNN models.

5.3. VISUALIZATION OF GAN PROCESS

In this subsection, we investigate the distribution of the generated nodes during the training process. We consider three datasets to illustrate the generated nodes from different perspectives. The Karate club graph (Zachary, 1977) contains 34 nodes without features, so the feature matrix X is set to the identity matrix during training. The plot of fake nodes (first row in Figure 2) therefore shows the distribution of G_2(z; I). We find that, after training, the fake nodes mainly connect to the boundary nodes[1], which is preferred, as discussed in GraphSGAN (Ding et al., 2018). The MNIST dataset (LeCun et al., 1998) contains images of handwritten digits; we can treat it as a graph with image features by constructing an identity adjacency matrix Ã = Ĩ. The plot of the fake feature (second row in Figure 2) therefore shows the distribution of G_1(z), which has a shape resembling the digit eight. Last, we generated B = 256 nodes for the Cora dataset, plotted in two dimensions by t-SNE (Van Der Maaten, 2014) on the feature space of g(·, ·; θ_C), shown in the third row of Figure 2; this can be considered the distribution of G(z). We find that the generated nodes form a complementary part to the existing nodes.

6. CONCLUSION

We propose GraphCGAN, a novel framework that improves convolution-based GNNs using a GAN. In GraphCGAN, we design a generator for attributed graphs that is able to generate the adjacency matrix and features jointly. We also provide new insight into semi-supervised learning with convolutional graph neural networks under the GAN structure. We propose a flexible algorithm that can be easily extended to other sophisticated architectures, such as GaAN (Zhang et al., 2018) and GIN (Xu et al., 2018). One potential future direction is to extend GraphCGAN to other relevant tasks, including community detection, co-embedding of attributed networks (Meng et al., 2019) and even graph classification. Extending the model to incorporate edge features by generating fake edges would allow us to tackle a larger set of problems. Finally, the stability of the GAN training process remains to be studied.



[1] Boundary nodes are nodes connected to different clusters.



Algorithm 1 (continued): use the convolution-based GNN as the classifier C, and extract the map g(·, ·) to the intermediate layer as in Equation 9; train C by minimizing L_C (Equation 2) on the combined graph and obtain the predicted result Ỹ; iter_D = iter_D + 1; sample a noise vector Z ∼ U(0, I) ∈ R^{B×d_noise}; 10: train the generator G = [G_1; G_2] by minimizing Equation 10, 11, or 12.

Figure 1: Ablation study of the number of fake nodes B. It shows that a moderate number of fake nodes can boost the classifier in graph-based SSL.

Figure 2: Representation of the generated nodes during the training process. Karate club: real nodes are shown as colored dots representing different manually assigned groups; three fake nodes (black) are generated with initial adjacency vectors set to all ones, and after training the fake nodes mainly connect to the boundary nodes. MNIST: one fake image is generated as a node feature; after training, the fake image has a shape resembling the digit eight. Cora: 256 fake nodes (black) with features are generated; the plots show the t-SNE embedding of the feature space of the graph, and the generated nodes form a complementary part to the existing nodes (colored dots).

