VEM-GCN: TOPOLOGY OPTIMIZATION WITH VARIATIONAL EM FOR GRAPH CONVOLUTIONAL NETWORKS

Abstract

Over-smoothing has emerged as a severe problem for node classification with graph convolutional networks (GCNs). From the message-passing view, over-smoothing is caused by the observed noisy graph topology, which propagates information along inter-class edges and consequently over-mixes the features of nodes in different classes. In this paper, we propose a novel architecture, namely VEM-GCN, to address this problem by employing the variational EM algorithm to jointly optimize the graph topology and learn desirable node representations for classification. Specifically, variational EM approximates a latent adjacency matrix parameterized by the assortative-constrained stochastic block model (SBM) to enhance intra-class connection and suppress inter-class interaction in the observed noisy graph. In the variational E-step, the graph topology is optimized by approximating the posterior probability distribution of the latent adjacency matrix with a neural network learned from node embeddings. In the M-step, node representations are learned using a graph convolutional network based on the refined graph topology for the downstream classification task. VEM-GCN is demonstrated to outperform existing strategies for tackling over-smoothing and optimizing graph topology in node classification on seven benchmark datasets.

1. INTRODUCTION

Complex graph-structured data are ubiquitous in the real world, ranging from social networks to chemical molecules. Inspired by the remarkable performance of convolutional neural networks (CNNs) in processing data with regular grid structures (e.g., images), a myriad of studies on GCNs have emerged to execute "convolution" in the graph domain (Niepert et al., 2016; Kipf & Welling, 2017; Gilmer et al., 2017; Hamilton et al., 2017; Monti et al., 2017; Gao et al., 2018). Many of these approaches follow a neighborhood aggregation mechanism (a.k.a., the message passing scheme) that updates the representation of each node by iteratively aggregating the transformed messages sent from its neighboring nodes. Commencing with the pioneering works (Kipf & Welling, 2017; Gilmer et al., 2017), numerous strategies have been developed to improve the vanilla message passing scheme, such as introducing the self-attention mechanism (Veličković et al., 2018; Zhang et al., 2020), incorporating local structural information (Zhang et al., 2020; Jin et al., 2019; Ye et al., 2020), and leveraging link attributes (Gong & Cheng, 2019; Li et al., 2019; Jiang et al., 2019). Despite significant success in many fundamental tasks of graph-based machine learning, message passing-based GCNs almost universally treat the observed graph structure as ground truth and may suffer from the over-smoothing problem (Li et al., 2018), which seriously affects node classification performance. Given the observed noisy graph topology (i.e., excessive inter-class edges are linked while many intra-class edges are missing), when multiple message passing layers are stacked to enlarge the receptive field (the maximum hop of neighborhoods), features of neighboring nodes in different classes come to dominate message passing. Node representations are thus corrupted by this harmful noise, impairing the discrimination of graph nodes.
The over-smoothing phenomenon in GCNs has already been studied from different aspects. Li et al. (2018) first interpreted over-smoothing from the perspective of Laplacian smoothing, while Xu et al. (2018) and Klicpera et al. (2019a) associated it with the limit distribution of random walks. Furthermore, Chen et al. (2020a) developed quantitative metrics to measure over-smoothness from the topological view. They argued that the key factor leading to over-smoothing is the noise passing between nodes of different categories, and that the classification performance of GCNs is positively correlated with the proportion of intra-class node pairs among all edges. In this paper, we propose VEM-GCN, a novel architecture that addresses the over-smoothing problem with topology optimization for uncertain graphs. Considering that a "clearer" graph with more intra-class edges and fewer inter-class edges would improve the node classification performance of GCNs (Yang et al., 2019; Chen et al., 2020a), VEM-GCN approximates a latent adjacency matrix parameterized by the assortative-constrained stochastic block model (SBM), in which nodes sharing the same label are linked and inter-class edges are cut off. To jointly refine the latent graph structure and learn desirable node representations for classification, the variational EM algorithm (Neal & Hinton, 1998) is adopted to optimize the evidence lower bound (ELBO) of the likelihood function. In the inference procedure (E-step), the graph topology is optimized by approximating the posterior probability distribution of the latent adjacency matrix with a neural network learned from node embeddings. In the learning procedure (M-step), a conventional GCN is trained to maximize the log-likelihood of the observed node labels based on the learned latent graph structure. The E-step and M-step optimize the graph topology and improve the classification of unlabeled nodes in an alternating fashion. The proposed VEM-GCN architecture is flexible and general.
In the E-step, the neural network can take arbitrary desirable node embeddings generated by algorithms such as node2vec (Grover & Leskovec, 2016), struc2vec (Ribeiro et al., 2017), and GCNs, or the raw node attributes. The GCN in the M-step can likewise be substituted with arbitrary graph models. Furthermore, recent strategies for relieving the over-smoothing issue, i.e., AdaEdge (Chen et al., 2020a) and DropEdge (Rong et al., 2020), are shown to be specific cases of VEM-GCN under certain conditions. For empirical evaluation, we conduct extensive experiments on seven benchmarks for node classification, including four citation networks, two Amazon co-purchase graphs, and one Microsoft Academic graph. Experimental results demonstrate the effectiveness of the proposed VEM-GCN architecture in optimizing graph topology and mitigating the over-smoothing problem for GCNs.

2. BACKGROUND AND RELATED WORKS

Problem Setting. This paper focuses on the task of graph-based transductive node classification. A simple attributed graph is defined as a tuple $G_{\text{obs}} = (\mathcal{V}, A_{\text{obs}}, X)$, where $\mathcal{V} = \{v_i\}_{i=1}^{N}$ is the node set, $A_{\text{obs}} = [a^{\text{obs}}_{ij}] \in \{0,1\}^{N \times N}$ is the observed adjacency matrix, and $X \in \mathbb{R}^{N \times f}$ represents the collection of attributes, with each row corresponding to the features of an individual node. Given the labels $Y_l = [y_{ic}] \in \{0,1\}^{|\mathcal{V}_l| \times C}$ for a subset of graph nodes $\mathcal{V}_l \subset \mathcal{V}$ assigned to $C$ classes, the task is to infer the classes $Y_u = [y_{jc}] \in \{0,1\}^{|\mathcal{V}_u| \times C}$ of the unlabeled nodes $\mathcal{V}_u = \mathcal{V} \setminus \mathcal{V}_l$ based on $G_{\text{obs}}$.

Graph Convolutional Networks (GCNs). The core of most GCNs is the message passing scheme, where each node updates its representation by iteratively aggregating features from its neighborhoods. Denote by $W^{(l)}$ the learnable weights in the $l$-th layer, $\mathcal{N}(i)$ the set of neighboring node indices for node $v_i$, and $\sigma(\cdot)$ the nonlinear activation function. A basic message passing layer takes the following form:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)}\Big). \quad (1)$$

Here, $h_j^{(l)}$ is the input feature of node $v_j$ in the $l$-th layer, $W^{(l)} h_j^{(l)}$ is the corresponding transformed message, and $\alpha_{ij}^{(l)}$ is the aggregation weight for the message passed from node $v_j$ to node $v_i$. Existing GCNs mainly differ in the mechanism for computing $\alpha_{ij}^{(l)}$ (Kipf & Welling, 2017; Veličković et al., 2018; Ye et al., 2020; Hamilton et al., 2017; Zhang et al., 2020).

Stochastic Block Model (SBM). The SBM (Holland et al., 1983) is a generative model for producing graphs with community structures. It parameterizes the edge probability between each node pair by

$$\bar{a}_{ij} \mid y_i, y_j \sim \begin{cases} \text{Bernoulli}(p_0), & \text{if } y_i = y_j \\ \text{Bernoulli}(p_1), & \text{if } y_i \neq y_j \end{cases} \quad (2)$$

where $\bar{a}_{ij}$ is an indicator variable for the edge linking nodes $v_i$ and $v_j$, $y_i$ and $y_j$ denote their corresponding communities (classes), and $p_0$ and $p_1$ are termed the intra-community and cross-community link probabilities, respectively.
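To make the basic message passing layer of Eq. 1 concrete, here is a minimal NumPy sketch; the toy three-node graph, uniform (row-normalized) aggregation weights, and tanh activation are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def message_passing_layer(H, alpha, W, act=np.tanh):
    """One layer of Eq. 1: h_i' = act(sum_j alpha_ij * W h_j).

    H:     (N, f_in) input node features
    alpha: (N, N) aggregation weights; alpha[i, j] > 0 only for
           j in N(i) ∪ {i} (self-loops included)
    W:     (f_in, f_out) learnable weight matrix
    """
    return act(alpha @ H @ W)

# Toy 3-node path graph with self-loops and uniform mean aggregation.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
alpha = A / A.sum(axis=1, keepdims=True)   # row-normalized weights
H = np.eye(3)                              # one-hot input features
W = np.full((3, 2), 0.5)                   # illustrative weight matrix
H_next = message_passing_layer(H, alpha, W)
```

Because each row of `alpha` sums to one, this particular choice is mean aggregation; attention-based GCNs would instead learn `alpha` per layer.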
The case where $p_0 > p_1$ is called an assortative model, while the case $p_0 < p_1$ is called disassortative. In this paper, we leverage an assortative-constrained SBM (Gribel et al., 2020) with $p_0 = 1$ and $p_1 = 0$ to model the latent graph for a clear topology.

Over-smoothing. Real-world graphs often possess high sparsity and are corrupted by noise that leads to inter-class misconnections and missing intra-class edges. Over-smoothing is mainly caused by the indistinguishable features of nodes in different classes produced by message passing along inter-class edges. Various strategies have been developed to alleviate this problem. JK-Net (Xu et al., 2018) utilizes skip connections for adaptive feature aggregation, and DNA (Fey, 2019) further improves upon it with the attention mechanism. PPNP and APPNP (Klicpera et al., 2019a) modify the message passing scheme with personalized PageRank (PPR) to avoid reaching the limit distribution of random walks. CGNN (Xhonneux et al., 2020) addresses over-smoothing in a similar manner to PPR. Zhao & Akoglu (2020) introduced a graph layer normalization scheme termed PairNorm to keep the total pairwise distance between nodes unchanged across layers. GCNII (Chen et al., 2020b) extends GCN with Initial residual and Identity mapping. However, these methods cannot fundamentally address the over-smoothing issue, as they all view the observed graph as ground truth, so the features of nodes in different classes are still over-mixed along inter-class edges. AdaEdge (Chen et al., 2020a) constantly refines the graph topology by adjusting edges in a self-training-like fashion. However, AdaEdge only adjusts the edges linking nodes classified with high confidence, which leads to limited improvement or even degraded classification performance due to incorrect operations on misclassified nodes. DropEdge (Rong et al., 2020) randomly removes a certain fraction of edges to reduce message passing.
Despite enhanced robustness, DropEdge does not essentially optimize the graph topology. BBGDC (Hasanzadeh et al., 2020) generalizes Dropout (Srivastava et al., 2014) and DropEdge through adaptive connection sampling.

Uncertain Graphs and Topology Optimization. Learning with uncertain graphs is another related research area, where the observed graph structure is assumed to be derived from noisy data rather than being ground truth. Bayesian approaches are typical methods that introduce uncertainty into network analysis. Zhang et al. (2019) developed BGCN, which considers the observed graph as a sample from a parametric family of random graphs and makes a maximum a posteriori (MAP) estimate of the graph parameters. Tiao et al. (2019) also viewed graph edges as Bernoulli random variables and used variational inference to optimize the posterior distribution of the adjacency matrix by approximating pre-defined graph priors. Other Bayesian methods have also been developed to combine GCNs with probabilistic models (Ng et al., 2018; Ma et al., 2019). However, without explicit optimization of the graph structure, they only improve robustness under certain conditions such as incomplete edges, active learning, and adversarial attacks. For explicit topology optimization, Franceschi et al. (2019) presented LDS, which parameterizes edges as independent Bernoulli random variables and learns discrete structures for GCNs by solving a bilevel program. However, LDS requires an extra validation set for training and suffers from limited scalability. TO-GCN (Yang et al., 2019) only adds the intra-class edges derived from the labeled nodes, which causes topology imbalance between $\mathcal{V}_u$ and $\mathcal{V}_l$. GDC (Klicpera et al., 2019b) refines the adjacency matrix with graph diffusion to account for links between high-order neighborhoods. However, the added edges might still be noisy and hamper classification.
GRCN (Yu et al., 2020) modifies the original adjacency matrix by adding a residual matrix whose elements measure the similarity between the corresponding node embeddings, and IDGL (Chen et al., 2020c) iteratively learns the graph structure in a similar manner. Pro-GNN (Jin et al., 2020) introduces low-rank and sparsity constraints to recover a clean graph when defending against adversarial attacks. NeuralSparse (Zheng et al., 2020) uses the Gumbel-Softmax trick (Jang et al., 2017) to sample k neighbors from the original neighborhood of each node but does not consider recovering missing intra-class edges. Different from the aforementioned methods, VEM-GCN aims at relieving the over-smoothing issue. We introduce a learned latent graph based on the assortative-constrained SBM to explicitly enhance intra-class connection and suppress inter-class interaction via the variational EM algorithm.

3. METHODOLOGY

In this section, we develop the VEM-GCN architecture for transductive node classification. VEM-GCN leverages the variational EM algorithm to achieve topology optimization and, consequently, address the over-smoothing issue by reducing noisy interactions between nodes in different classes. Specifically, the E-step approximates the posterior probability distribution of the latent adjacency matrix to optimize the graph structure, and the M-step maximizes the evidence lower bound of the log-likelihood function based on the refined graph. We first introduce our motivation and provide an overview of the proposed VEM-GCN architecture. Subsequently, we elaborate on the mechanisms of the variational E-step and M-step, respectively.

3.1. MOTIVATION AND OVERVIEW

Motivation. As mentioned above, a graph whose nodes are densely connected within their own communities (classes) has a lower risk of over-smoothing. Under this consideration, the optimal adjacency matrix for a GCN is $\tilde{A} = YY^\top$ (Yang et al., 2019; Chen et al., 2020a), where $Y \in \mathbb{R}^{N \times C}$ is the matrix of one-hot-encoded ground-truth labels. However, since we have to infer $Y_u$ for the unlabeled nodes $\mathcal{V}_u$, their true labels are not available for calculating $\tilde{A}$. Thus, we introduce a latent graph $A_{\text{latent}}$, learned from $G_{\text{obs}}$ through another neural network, to help generate a topology clearer than $A_{\text{obs}}$ for GCNs. Note that $\tilde{A}$ is equivalent to an SBM with $p_0 = 1$ and $p_1 = 0$, and we therefore base the posterior probability distribution of the latent graph on this assumption.

Overview. The basic principle behind our proposed VEM-GCN architecture is maximum likelihood estimation (MLE) in a latent variable model, i.e., maximizing the log-likelihood $\log p_\theta(Y_l \mid G_{\text{obs}})$ of the observed node labels, with the latent graph $A_{\text{latent}}$ marginalized out. Following variational inference, we instead optimize the evidence lower bound (ELBO):

$$\log p_\theta(Y_l \mid G_{\text{obs}}) \geq \mathcal{L}_{\text{ELBO}}(\theta, \phi; Y_l, G_{\text{obs}}) = \mathbb{E}_{q_\phi(A_{\text{latent}} \mid G_{\text{obs}})}\big[\log p_\theta(Y_l, A_{\text{latent}} \mid G_{\text{obs}}) - \log q_\phi(A_{\text{latent}} \mid G_{\text{obs}})\big], \quad (3)$$

where the equality holds when $q_\phi(A_{\text{latent}} \mid G_{\text{obs}}) = p_\theta(A_{\text{latent}} \mid Y_l, G_{\text{obs}})$. Note that $q_\phi$ can be an arbitrary desirable distribution over $A_{\text{latent}}$; we parameterize it with a neural network in this work. To jointly optimize the latent graph topology $A_{\text{latent}}$ and the ELBO $\mathcal{L}_{\text{ELBO}}(\theta, \phi; Y_l, G_{\text{obs}})$, we adopt the variational EM algorithm (refer to Appendix A for the full algorithm).
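As a small sanity check of this motivation, the optimal adjacency $\tilde{A} = YY^\top$ can be computed directly from one-hot labels, and it coincides with the edge-probability matrix of the assortative-constrained SBM with $p_0 = 1$ and $p_1 = 0$; the 4-node toy labels below are hypothetical.

```python
import numpy as np

# Hypothetical one-hot labels Y for 4 nodes in 2 classes.
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
A_tilde = Y @ Y.T   # Ã = Y Yᵀ: entry (i, j) is 1 iff nodes i and j share a class

# The same matrix is the edge-probability matrix of an assortative-constrained
# SBM with p0 = 1 (intra-class edges certain) and p1 = 0 (inter-class impossible).
labels = Y.argmax(axis=1)
p_sbm = np.where(labels[:, None] == labels[None, :], 1.0, 0.0)
```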

3.2. E-STEP

In the inference procedure (E-step), $\theta$ is fixed and the goal is to optimize $q_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ to approximate the true posterior distribution $p_\theta(A_{\text{latent}} \mid Y_l, G_{\text{obs}})$. Under the SBM assumption, we take each edge of the latent graph to be independent, so $q_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ factorizes as

$$q_\phi(A_{\text{latent}} \mid G_{\text{obs}}) = \prod_{i,j} q_\phi(a^{\text{latent}}_{ij} \mid G_{\text{obs}}). \quad (4)$$

Unlike LDS (Franceschi et al., 2019), which uses $O(N^2)$ Bernoulli random variables to characterize the optimized graph with $N$ nodes, we parameterize $q_\phi(a^{\text{latent}}_{ij} \mid G_{\text{obs}})$ through a neural network shared by all possible node pairs (i.e., amortized variational inference (Gershman & Goodman, 2014)), as shown in Eq. 5. Hence, our method scales to large graphs and is easier to train:

$$z_i = \text{NN}(e_i), \qquad q_\phi(a^{\text{latent}}_{ij} = 1 \mid G_{\text{obs}}) = \text{sigmoid}(z_i^\top z_j), \quad (5)$$

where $e_i$ is the embedding of node $v_i$, which can be derived from any desirable network embedding method (e.g., node2vec (Grover & Leskovec, 2016), struc2vec (Ribeiro et al., 2017), and GCNs) or the raw node attributes $x_i$ (the $i$-th row of $X$), and $z_i$ is the transformed feature of node $v_i$. $\text{NN}(\cdot)$ denotes a neural network; we use a multi-layer perceptron (MLP) in this work. The probability of linking a node pair is defined as the inner product of their transformed features passed through a sigmoid function. To approximate the posterior probability distribution of $A_{\text{latent}}$, we rewrite $p_\theta(A_{\text{latent}} \mid Y_l, G_{\text{obs}})$ as

$$p_\theta(A_{\text{latent}} \mid Y_l, G_{\text{obs}}) = \sum_{Y_u} p_\theta(A_{\text{latent}}, Y_u \mid Y_l, G_{\text{obs}}) = \mathbb{E}_{p_\theta(Y_u \mid Y_l, G_{\text{obs}})}\big[p_\theta(A_{\text{latent}} \mid Y_l, Y_u, G_{\text{obs}})\big]. \quad (6)$$

Here, $p_\theta(A_{\text{latent}} \mid Y_l, Y_u, G_{\text{obs}})$ is parameterized by the aforementioned assortative-constrained SBM (i.e., $p_\theta(a^{\text{latent}}_{ij} = 1 \mid y_i, y_j) = y_i^\top y_j$ for one-hot-encoded node labels $y$), and $p_\theta(Y_u \mid Y_l, G_{\text{obs}})$ is the categorical distribution over the unlabeled nodes predicted in the previous M-step.
Consequently, we can sample $\hat{Y}_u \sim p_\theta(Y_u \mid Y_l, G_{\text{obs}})$ to estimate the expectation in Eq. 6 and leverage stochastic gradient descent (SGD) to minimize the reverse KL-divergence between the approximate posterior $q_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ and the target $p_\theta(A_{\text{latent}} \mid Y_l, G_{\text{obs}})$. Under appropriate assumptions, $q_\phi$ converges to $p_\theta(A_{\text{latent}} \mid Y_l, G_{\text{obs}})$ as the SGD iteration step $t \to \infty$ (Bottou, 2010). We thus obtain the following objective function in the variational E-step for optimizing $\phi$:

$$\mathcal{L}_E = -\sum_{i,j} \sum_{a^{\text{latent}}_{ij} \in \{0,1\}} \lambda(a^{\text{latent}}_{ij}) \, p_\theta(a^{\text{latent}}_{ij} \mid y_i, y_j) \log q_\phi(a^{\text{latent}}_{ij} \mid G_{\text{obs}}), \quad (7)$$

where $y_i$ is the ground-truth label for a node in the labeled set $\mathcal{V}_l$ and is sampled from $p_\theta(Y_u \mid Y_l, G_{\text{obs}})$ for the unlabeled nodes at each training step, and $\lambda(a^{\text{latent}}_{ij})$ is a weighting hyperparameter that alleviates the class imbalance between the inter-class edges and the intra-class edges.
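The amortized posterior of Eq. 5 and the weighted objective of Eq. 7 can be sketched as follows; the two-layer MLP, its layer sizes, and the weighting values `lam_pos`/`lam_neg` are illustrative assumptions, and the SGD update itself is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_posterior(E, W1, W2):
    """Amortized q_phi (Eq. 5): one shared MLP maps every embedding e_i to z_i,
    and q_phi(a_ij = 1) = sigmoid(z_i . z_j) for all node pairs at once."""
    Z = np.tanh(E @ W1) @ W2        # illustrative two-layer MLP
    return sigmoid(Z @ Z.T)

def e_step_loss(Q, target, lam_pos=1.0, lam_neg=0.1, eps=1e-9):
    """Weighted cross-entropy of Eq. 7: target_ij = y_i . y_j under the
    assortative SBM; lam_pos/lam_neg counter the imbalance between the
    scarce intra-class edges and the many inter-class pairs."""
    pos = -lam_pos * target * np.log(Q + eps)
    neg = -lam_neg * (1.0 - target) * np.log(1.0 - Q + eps)
    return (pos + neg).sum()

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                               # node embeddings e_i
W1 = 0.1 * rng.normal(size=(8, 16))
W2 = 0.1 * rng.normal(size=(16, 4))
labels = np.array([0, 0, 1, 1])                           # given or sampled labels
target = (labels[:, None] == labels[None, :]).astype(float)
Q = edge_posterior(E, W1, W2)
loss = e_step_loss(Q, target)
```

Because the MLP is shared across all pairs, the number of parameters is independent of $N$, which is the scalability advantage over the $O(N^2)$ parameterization of LDS.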

3.3. M-STEP

In the learning procedure (M-step), $\phi$ is fixed and $\theta$ is updated to maximize the ELBO in Eq. 3. By factorizing $p_\theta(Y_l, A_{\text{latent}} \mid G_{\text{obs}}) = p_{\theta_1}(Y_l \mid A_{\text{latent}}, G_{\text{obs}}) \, p_{\theta_2}(A_{\text{latent}} \mid G_{\text{obs}})$ with $\theta = \{\theta_1, \theta_2\}$, we have

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(A_{\text{latent}} \mid G_{\text{obs}})}\big[\log p_{\theta_1}(Y_l \mid A_{\text{latent}}, G_{\text{obs}})\big] - \text{KL}\big[q_\phi(A_{\text{latent}} \mid G_{\text{obs}}) \,\|\, p_{\theta_2}(A_{\text{latent}} \mid G_{\text{obs}})\big]. \quad (8)$$

Here, $p_{\theta_1}(Y_l \mid A_{\text{latent}}, G_{\text{obs}})$ in the first term can be parameterized by an arbitrary GCN model of the form of Eq. 1 that infers the node labels from $A_{\text{latent}}$ and $X$; we use the vanilla GCN (Kipf & Welling, 2017) in this work (see Eq. 13 in Appendix A). The second term is the KL-divergence between $q_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ and the prior $p_{\theta_2}(A_{\text{latent}} \mid G_{\text{obs}})$, which can be optimized by setting $\theta_2 = \phi$ to force $\text{KL}[q_\phi \,\|\, p_{\theta_2}] = 0$. In fact, $p_{\theta_2}(A_{\text{latent}} \mid G_{\text{obs}})$ is of little interest to the final node classification task, and we only need to maximize $\mathbb{E}_{q_\phi(A_{\text{latent}} \mid G_{\text{obs}})}[\log p_{\theta_1}(Y_l \mid A_{\text{latent}}, G_{\text{obs}})]$ in the M-step. Considering that the observed graph structure $A_{\text{obs}}$ should not be fully discarded and that the approximation $q_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ derived in the previous E-step is sometimes inaccurate, we use $q_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ to refine $A_{\text{obs}}$ and substitute $q_\phi$ with the following $\tilde{q}_\phi$ in practice:

$$\tilde{q}_\phi(a^{\text{latent}}_{ij} = 1 \mid G_{\text{obs}}) = \begin{cases} p, & \text{if } q_\phi > \varepsilon_1 \\ 0, & \text{if } q_\phi < \varepsilon_2 \\ p \cdot a^{\text{obs}}_{ij}, & \text{otherwise} \end{cases} \quad (9)$$

where $p \in (0, 1]$, $\varepsilon_1$ is close to one (commonly 0.999), and $\varepsilon_2$ is close to zero (commonly 0.01). Eq. 9 implies that edges predicted by $q_\phi$ to be linked with high confidence (the value after sigmoid, or the maximum value after softmax) are added to the observed graph with probability $p$, edges predicted by $q_\phi$ to be cut off with high confidence are removed from the observed graph, and otherwise the original graph structure is maintained with probability $p$.
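The refinement rule of Eq. 9 is a simple element-wise case split, sketched below; the specific values of p, ε₁, and ε₂ and the 2-node toy matrices are illustrative assumptions.

```python
import numpy as np

def refine_edge_probs(Q, A_obs, p=0.05, eps1=0.999, eps2=0.01):
    """Eq. 9: q~_phi(a_ij = 1) is p for confident links (q_phi > eps1),
    0 for confident non-links (q_phi < eps2), and p * a_obs_ij otherwise."""
    out = p * A_obs.astype(float)      # default: keep observed edges w.p. p
    out[Q > eps1] = p                  # confidently linked -> add w.p. p
    out[Q < eps2] = 0.0                # confidently cut off -> remove
    return out

# Toy example: one confident new link (entry [0, 0]) and one confident
# removal (entry [1, 1]); the off-diagonal entries fall back on A_obs.
Q = np.array([[0.9999, 0.5],
              [0.5,    0.001]])
A_obs = np.array([[0, 1],
                  [1, 0]])
Q_tilde = refine_edge_probs(Q, A_obs, p=0.05)
```

Keeping `p` small preserves the sparsity of the sampled latent graphs even when many pairs cross the ε₁ threshold.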
Similar to the E-step, we can sample the latent adjacency matrix $\hat{A}_{\text{latent}} \sim \tilde{q}_\phi(A_{\text{latent}} \mid G_{\text{obs}})$ (note that we pre-train $p_{\theta_1}$ using $A_{\text{obs}}$) and leverage SGD to minimize the cross-entropy error between the GCN's predictions $p_{\theta_1}(Y_l \mid \hat{A}_{\text{latent}}, G_{\text{obs}})$ and the ground-truth labels $Y_l$ for optimizing $\theta$:

$$\mathcal{L}_M = -\sum_{v_i \in \mathcal{V}_l} \sum_{c=1}^{C} y_{ic} \log p_{\theta_1}(y_{ic} \mid \hat{A}_{\text{latent}}, G_{\text{obs}}). \quad (10)$$

In the test procedure, the final predictions for $Y_u$ are $\mathbb{E}_{\tilde{q}_\phi(A_{\text{latent}} \mid G_{\text{obs}})}[p_{\theta_1}(Y_u \mid A_{\text{latent}}, G_{\text{obs}})]$, which can be approximated by Monte-Carlo sampling:

$$p_\theta(Y_u \mid Y_l, G_{\text{obs}}) = \frac{1}{S} \sum_{i=1}^{S} p_{\theta_1}(Y_u \mid A^{i}_{\text{latent}}, G_{\text{obs}}), \quad \text{with } A^{i}_{\text{latent}} \sim \tilde{q}_\phi(A_{\text{latent}} \mid G_{\text{obs}}), \quad (11)$$

where the number of samples $S$ and the probability $p$ in $\tilde{q}_\phi$ are tuned hyperparameters. The two neural networks $q_\phi$ and $p_\theta$ are trained in an alternating fashion to reinforce each other: topology optimization in the E-step improves the performance of the GCN in the M-step, and as more unlabeled nodes are correctly classified, $q_\phi$ better approximates the optimal graph $\tilde{A}$.
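At test time, the Monte-Carlo estimate of Eq. 11 averages GCN predictions over S sampled latent graphs. The sketch below substitutes a stand-in one-step mean-aggregation "GCN" for the trained p_θ1, so only the edge-wise sampling and the averaging logic are being illustrated.

```python
import numpy as np

def mc_predict(q_tilde, predict_fn, S=10, rng=None):
    """Eq. 11: average predict_fn over S adjacency matrices sampled
    edge-wise from the refined distribution q~_phi."""
    rng = np.random.default_rng(rng)
    preds = []
    for _ in range(S):
        A = (rng.random(q_tilde.shape) < q_tilde).astype(float)
        A = np.triu(A, k=1)
        A = A + A.T                      # keep each sampled graph symmetric
        preds.append(predict_fn(A))
    return np.mean(preds, axis=0)

# Stand-in "GCN": one mean-aggregation step over features, then a row softmax.
X = np.eye(3)
def toy_gcn(A):
    A_hat = A + np.eye(len(A))                        # add self-loops
    H = (A_hat / A_hat.sum(axis=1, keepdims=True)) @ X
    expH = np.exp(H)
    return expH / expH.sum(axis=1, keepdims=True)

q_tilde = np.full((3, 3), 0.5)       # illustrative refined edge probabilities
P = mc_predict(q_tilde, toy_gcn, S=5, rng=0)
```

Each sampled graph yields a valid row-stochastic prediction matrix, so their average is one as well.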

3.4. DISCUSSIONS

In this subsection, we discuss the relationship between VEM-GCN and two recent works for tackling over-smoothing, i.e., DropEdge (Rong et al., 2020) and AdaEdge (Chen et al., 2020a), and show that both methods are specific cases of VEM-GCN under certain conditions. More detailed comparisons with other related works (e.g., SBM-related GCNs) are provided in Appendix B.

VEM-GCN vs. DropEdge. DropEdge randomly removes a certain fraction of edges in each training step. The authors proved that this strategy can retard the convergence speed of over-smoothing. However, it does not address the over-smoothing issue at its core, since the graph topology is not fundamentally optimized and noisy messages still pass along inter-class edges. Consider the scenario where a node has few interactions with its own community but many cross-community links: DropEdge cannot improve the discrimination of this stray node, since it does not recover the missing intra-class edges. VEM-GCN degenerates to DropEdge if we skip the E-step and simply maximize $\mathbb{E}_{\tilde{q}_\phi(A_{\text{latent}} \mid G_{\text{obs}})}[\log p_{\theta_1}(Y_l \mid A_{\text{latent}}, G_{\text{obs}})]$ with $\tilde{q}_\phi(a^{\text{latent}}_{ij} = 1 \mid G_{\text{obs}}) = p \cdot a^{\text{obs}}_{ij}$.

VEM-GCN vs. AdaEdge. AdaEdge also constantly adjusts the graph topology during training: it adds an edge between two nodes predicted by the GCN to be of the same class with high confidence, and removes edges in a similar manner. If we skip the E-step and set $\tilde{q}_\phi$ as in Eq. 12, VEM-GCN and AdaEdge are equivalent:

$$\tilde{q}_\phi(a^{\text{latent}}_{ij} = 1 \mid G_{\text{obs}}) = \begin{cases} 1, & \text{if } \hat{y}_i = \hat{y}_j \text{ and } \text{conf}(\hat{y}_i), \text{conf}(\hat{y}_j) > \tau_1 \\ 0, & \text{if } \hat{y}_i \neq \hat{y}_j \text{ and } \text{conf}(\hat{y}_i), \text{conf}(\hat{y}_j) > \tau_2 \\ a^{\text{obs}}_{ij}, & \text{otherwise} \end{cases} \quad (12)$$

where $\hat{y}$ is the prediction made by the GCN, $\text{conf}(\cdot)$ denotes the corresponding confidence, and $\tau_1$ and $\tau_2$ are two thresholds. Eq. 12 implies that this self-training-like scheme only adjusts the edges whose endpoint nodes have already been classified with high confidence.
Therefore, the performance improvement is limited and can even worsen for some misclassified nodes, as the scheme might wrongly add inter-class edges to the observed graph $A_{\text{obs}}$ and remove helpful intra-class connections.
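The DropEdge degeneration described above amounts to a single line: with the E-step skipped, q̃_φ keeps each observed edge independently with probability p and never adds new ones. A sketch (the toy graph and the value of p are arbitrary):

```python
import numpy as np

def dropedge_q(A_obs, p):
    """Degenerate q~_phi that recovers DropEdge: every observed edge is kept
    independently with probability p; missing intra-class edges are never added."""
    return p * A_obs.astype(float)

A_obs = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 0, 0]])
q = dropedge_q(A_obs, p=0.7)
rng = np.random.default_rng(0)
A_sample = (rng.random(q.shape) < q).astype(int)   # one training-step graph
```

Every sampled graph is a subgraph of the observed one, which is precisely why this degenerate case cannot help a stray node reconnect with its community.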

4. EXPERIMENTS

To evaluate our VEM-GCN architecture, we conduct extensive experiments on seven benchmark datasets. Under the same setting as DropEdge (Rong et al., 2020) as well as a label-scarce setting (i.e., a low label rate), we compare the performance of VEM-GCN against a variety of state-of-the-art methods for tackling over-smoothing, handling uncertain graphs, and optimizing topology in GCNs. We further provide visualization results of topology optimization and a quantitative analysis to verify the effectiveness of VEM-GCN in relieving the over-smoothing issue (complexity analysis is provided in Appendix E.3).

4.1. EXPERIMENTAL SETUP

Datasets and Baselines. We adopt seven well-known benchmark datasets to validate the proposed method. Cora (Sen et al., 2008), Cora-ML (McCallum et al., 2000; Bojchevski & Günnemann, 2018), Citeseer (Sen et al., 2008), and Pubmed (Namata et al., 2012) are four citation network benchmarks, where nodes represent documents and edges are citations between documents. Amazon Photo and Amazon Computers are two segments of the Amazon co-purchase graph (McAuley et al., 2015), in which nodes represent goods and edges indicate that two goods are frequently bought together. In the Microsoft Academic graph (Shchur et al., 2018), nodes are authors and edges represent their co-authorship. All graphs use bag-of-words encoded representations as node attributes. An overview of the dataset statistics is summarized in Appendix C. Since VEM-GCN aims at addressing the over-smoothing problem with topology optimization, we evaluate the node classification performance of our method against various strategies for tackling over-smoothing, uncertain graphs, and topology optimization in GCNs. For addressing the over-smoothing issue, five methods are considered: DropEdge (Rong et al., 2020), DropICE, AdaEdge (Chen et al., 2020a), PairNorm (Zhao & Akoglu, 2020), and BBGDC (Hasanzadeh et al., 2020), where DropICE is implemented by removing the inter-class edges derived from $\mathcal{V}_l$. For tackling uncertain graphs, we compare against several Bayesian approaches including BGCN (Zhang et al., 2019), VGCN (Tiao et al., 2019), and G3NN (Ma et al., 2019). For topology optimization, LDS (Franceschi et al., 2019), GDC (Klicpera et al., 2019b), TO-GCN (Yang et al., 2019), GRCN (Yu et al., 2020), and IDGL (Chen et al., 2020c) are the baselines. GMNN (Qu et al., 2019) is also taken as a baseline, as it likewise employs variational EM for transductive node classification. We conduct node classification under two experimental settings, i.e., the full-supervised and label-scarce settings.
The full-supervised setting follows DropEdge (Rong et al., 2020) , where each dataset is split into 500 nodes for validation, 1000 nodes for test and the rest for training. The label-scarce setting assigns labels to only a few nodes and selects 500 nodes for validation, while the rest are used for test. Under the label-scarce setting, we compare VEM-GCN with the baselines except for LDS, as LDS always uses the validation set for training, which is unfair for learning with limited training samples. DropICE is also omitted since the number of the removed inter-class edges derived from V l is very small in the label-scarce setting and thus DropICE only obtains similar performance as the vanilla GCN. Considering that the classification performance is highly influenced by the split of the dataset (Shchur et al., 2018) , we run all the models with the same 5 random data splits for each evaluation. To further ensure the credibility of the results, we perform 10 random weight initializations for each data split and report the average test accuracy for both experimental settings. Model Configurations. For a fair comparison, we evaluate all the methods under the same GCN backbone and the same training procedure. To be concrete, the graph model used in all baselines and our VEM-GCN (p θ1 in the M-step) is a vanilla GCN (Kipf & Welling, 2017) with the number of hidden units set as 32. Besides, we train the GCN backbone of all the methods for each dataset with the same dropout rate of 0.5, the same weight decay, the same learning rate of 0.01, the same optimizer (Adam (Kingma & Ba, 2015)), the same maximum training epoch of 1500, and the same early stopping strategy based on the validation loss with a patience of 50 epochs (for deeper models with more than 2 layers, we set the patience as 100 epochs). Note that IDGL empirically needs more training epochs to converge and we set its maximum training epoch as 10000 with a patience of 1000 epochs. 
As for q φ in the E-step, the input node embeddings are the attributes averaged over the neighborhood of each node and the network architecture is a four-layer MLP with hidden units of size 128, 64, 64, and 32, respectively. Please refer to Appendix D for more details of the implementations and hyperparameter settings for each dataset.

4.2. RESULTS AND ANALYSIS

Full-supervised Setting. Table 1 summarizes the classification results. The highest accuracy in each column is highlighted in bold. Note that the results of BBGDC and LDS on three large graphs (i.e., Pubmed, Amazon Computers, and MS Academic) and of IDGL on the MS Academic graph are missing due to out-of-memory errors. Table 1 demonstrates that none of the baselines outperforms the vanilla GCN in all cases, while VEM-GCN consistently improves the test accuracy of the GCN backbone by noticeable margins. Specifically, we observe the following under the full-supervised setting: (1) For tackling over-smoothing, AdaEdge, DropEdge, and PairNorm demonstrate limited improvement on several datasets, while BBGDC and DropICE almost collapse in all cases. (2) LDS, TO-GCN, GDC, GRCN, and IDGL cannot guarantee that their topology optimization achieves performance gains for the GCN backbone. (3) Only adding intra-class edges (TO-GCN) or removing inter-class edges (DropICE) derived from $\mathcal{V}_l$ might cause topology imbalance between $\mathcal{V}_u$ and $\mathcal{V}_l$: the GCN trained on $\mathcal{V}_l$ with enhanced graph topology would fail on $\mathcal{V}_u$ with the original graph topology. (4) Bayesian approaches and GMNN only achieve performance comparable to the vanilla GCN in almost all cases. Overall, these observations imply that VEM-GCN significantly benefits from the large labeled set to generate a clearer topology and achieve better performance.

Label-scarce Setting. We randomly select 10 labeled nodes per class as the training set and evaluate the performance of VEM-GCN with varying numbers of layers. Table 2 shows the test accuracy and the over-smoothness measurements of the learned node embeddings (the input node features of the last layer). The metric measuring over-smoothness is defined in Appendix D.3, and supplementary results of VEM-GCN on additional datasets are shown in Appendix E.1.
As can be seen in Table 2, the vanilla GCN severely suffers from the over-smoothing issue, while VEM-GCN can achieve performance gains even with deeper layers (e.g., on the Cora dataset). DropEdge and AdaEdge can relieve the over-smoothing issue to some extent, but their performance still decreases drastically when more GCN layers are stacked. The over-smoothness measurements indicate that VEM-GCN indeed produces more separable node embeddings across different classes to address the over-smoothing problem. We further take Amazon Photo as an example dataset to validate VEM-GCN under different label rates; a trend similar to Table 1 can be found in Table 3.

Convergence Analysis and Visualization Results. VEM-GCN leverages the variational EM algorithm for optimization; here we analyze its convergence. Figure 1 depicts the accuracy improvement curve of $p_{\theta_1}$ during the EM iterations under the full-supervised setting. We find that VEM-GCN requires only a few iterations to converge. We further take Cora-ML as an example to give the corresponding visualization results of graph topology optimization. Figure 2 shows that the observed graph is very sparse and contains only a few intra-class edges, while the optimized graph recovers many missing intra-class edges to relieve the over-smoothing problem. Note that, although the refined graph is much denser than the observed graph, the hyperparameter $p$ (0.05 here) in $\tilde{q}_\phi$ helps maintain the sparsity of the latent adjacency matrix in the training procedure. Thus, the M-step can still be implemented efficiently using sparse-dense matrix multiplications.

5. CONCLUSION

In this paper, we present a novel architecture termed VEM-GCN for addressing the over-smoothing problem in GCNs via graph topology optimization. By introducing a latent graph parameterized by the assortative-constrained stochastic block model and utilizing the variational EM algorithm to jointly optimize the graph structure and the likelihood function, VEM-GCN outperforms a variety of state-of-the-art methods for tackling over-smoothing, uncertain graphs, and topology optimization in GCNs. For future work, we expect further improvements to the VEM-GCN architecture to deal with more complex graphs such as hypergraphs and heterogeneous graphs.

Algorithm 1 VEM-GCN
Input: Observed graph G_obs and labels Y_l for the labeled nodes V_l.
Parameters: φ in the E-step and θ in the M-step.
Output: Predicted labels Y_u for the unlabeled nodes V_u.
1: Pre-train p_θ with A_obs and Y_l to obtain the initial p_θ(Y_u | Y_l, G_obs).
2: for EM iteration t = 1, ..., T do
3:   E-step:
4:   for training step s_1 = 1, ..., S_1 do
5:     Sample Ŷ_u ∼ p_θ(Y_u | Y_l, G_obs) for the unlabeled nodes V_u.
6:     Set p_θ(A_latent | Y_l, G_obs) = p_θ(A_latent | Y_l, Ŷ_u, G_obs) according to Eq. 6.
7:     Update q_φ to optimize the objective function in Eq. 7 with SGD.
8:   end for
9:   M-step:
10:  Obtain q_φ̂(A_latent | G_obs) according to Eq. 9.
11:  for training step s_2 = 1, ..., S_2 do
12:    Sample Â_latent ∼ q_φ̂(A_latent | G_obs) for the latent adjacency matrix.
13:    Update p_θ to maximize the log-likelihood log p_θ(Y_l | Â_latent, G_obs) with SGD.
14:  end for
15:  Predict the categorical distributions p_θ(Y_u | Y_l, G_obs) according to Eq. 11.
16: end for
17: return Final predicted labels for V_u based on p_θ(Y_u | Y_l, G_obs).

Then the adjacency matrices sampled from the inferred SBM are used to train the GCN. Different from VEM-GCN, BGCN neither explicitly promotes intra-class connection nor demotes inter-class interaction. It only achieves robustness under certain conditions such as adversarial attacks, benefiting from the uncertainty brought by the inferred SBM.
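The alternating structure of Algorithm 1 can be sketched as a generic training loop. This is a structural sketch only: the callback names (`sample_labels`, `e_step_update`, and so on) are our own placeholders for the paper's E-step and M-step updates, not part of any released implementation.

```python
def vem_train(T, S1, S2, sample_labels, e_step_update,
              sample_graph, m_step_update, predict):
    """Skeleton of Algorithm 1: alternate the variational E-step (fit
    q_phi to the SBM posterior over A_latent) and the M-step (train the
    GCN p_theta on graphs sampled from q_phi)."""
    for t in range(T):                 # EM iterations (line 2)
        for _ in range(S1):            # E-step (lines 3-8)
            y_hat = sample_labels()    # Y_hat_u ~ p_theta(Y_u | Y_l, G_obs)
            e_step_update(y_hat)       # SGD step on the Eq. 7 objective
        for _ in range(S2):            # M-step (lines 9-14)
            a_hat = sample_graph()     # A_hat_latent ~ q_phi(A_latent | G_obs)
            m_step_update(a_hat)       # maximize log p_theta(Y_l | A_hat_latent)
    return predict()                   # p_theta(Y_u | Y_l, G_obs), Eq. 11
```

The loop makes explicit that q_φ and p_θ never update simultaneously: each step trains one network while holding samples from the other fixed.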
G³NN is a flexible generative model in which the graph generated by the SBM is based on the predictions of an additional MLP learned from only X and Y_l. The predictions for the unlabeled nodes are still based on G_obs (i.e., the input adjacency matrix of the GCN is still A_obs). By contrast, VEM-GCN aims at addressing the over-smoothing issue with topology optimization. In VEM-GCN, the M-step trains a GCN to obtain the predictions of the unlabeled nodes based on A_latent, X, and Y_l. We then estimate the posterior distribution of A_latent based on Y_l and the predictions for the unlabeled nodes under the SBM assumption. Subsequently, the E-step optimizes the graph topology by training another auxiliary neural network with node embeddings as input to approximate the posterior distribution of A_latent. The E-step and the M-step are optimized in an alternating fashion to improve each other.

VEM-GCN vs. VGCN. VGCN (Tiao et al., 2019) also introduces a latent graph A_latent and optimizes L_ELBO in Eq. 3. However, it directly optimizes the ELBO in a VAE (Kingma & Welling, 2014) fashion, and the posterior distribution of A_latent is set to approximate the pre-defined graph prior p(a_ij^prior = 1) = ρ_1 a_ij^obs + ρ_2 (1 - a_ij^obs) with 0 < ρ_1, ρ_2 < 1 using the re-parameterization trick. VGCN aims to achieve robustness under fake-link attacks and only shows performance comparable to GCN under the standard transductive learning setting (i.e., inferring Y_u based on the original G_obs). By contrast, VEM-GCN does not introduce priors over graphs. We optimize the graph topology by explicitly enhancing intra-class connection and suppressing inter-class interaction using the SBM and variational EM to relieve the over-smoothing issue.

VEM-GCN vs. GMNN. Graph Markov Neural Network (GMNN) (Qu et al., 2019) also employs variational EM for node classification, but it is entirely different from our method in motivation and objective function.
GMNN focuses on modeling the joint distribution of object (node) labels. Therefore, GMNN views Y_u as latent variables and optimizes the following ELBO: log p_θ(Y_l | X) ≥ E_{q_φ(Y_u | X)}[log p_θ(Y_l, Y_u | X) - log q_φ(Y_u | X)]. In the E-step, GMNN parameterizes q_φ(Y_u | X) with a GCN, and q_φ(Y_u | X) is optimized to approximate the posterior distribution p_θ(Y_u | Y_l, X). In the M-step, GMNN utilizes another GCN to model the conditional distribution p_θ(y_i | y_NB(i), X) for each node v_i ∈ V (NB(i) is the neighbor set of node v_i) with a conditional random field and maximizes the corresponding likelihood. On the contrary, VEM-GCN is proposed to relieve the over-smoothing issue; it treats the adjacency matrix A_latent as the latent variable and optimizes L_ELBO in Eq. 3.

Under review as a conference paper at ICLR 2021

C DATASET STATISTICS

We utilize seven node classification benchmarks in this paper, including four citation networks (i.e., Citeseer, Pubmed, Cora, and Cora-ML), two Amazon co-purchase graphs (i.e., Amazon Photo and Amazon Computers), and the Microsoft Academic graph, as summarized below.

• Citation Networks. Cora, Citeseer, and Pubmed can be downloaded from the official source code of GCN (Kipf & Welling, 2017) publicly available at https://github.com/tkipf/gcn/tree/master/gcn/data, and Cora-ML can be downloaded from the source code of (A)PPNP (Klicpera et al., 2019a) publicly available at https://github.com/klicperajo/ppnp/tree/master/ppnp/data.

• Amazon Co-purchase Graphs. The Amazon Photo and Amazon Computers datasets from the Amazon co-purchase graph can be publicly downloaded from https://github.com/shchur/gnn-benchmark/tree/master/data (Shchur et al., 2018).

• Microsoft Academic Graph. The MS Academic graph can be downloaded from the source code of (A)PPNP (Klicpera et al., 2019a) publicly available at https://github.com/klicperajo/ppnp/tree/master/ppnp/data.

An overview of the dataset statistics is listed in Table 4. Note that among these open datasets, three (Cora, Citeseer, Pubmed) are given in the form of undirected graphs, while four (Cora-ML, Amazon Photo, Amazon Computers, MS Academic) are directed graphs. GCN treats all these datasets as undirected graphs (i.e., a_ij = [a_ij + a_ji > 0], where [•] denotes the Iverson bracket).
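As a concrete illustration, the Iverson-bracket symmetrization above can be applied to a directed adjacency matrix in a couple of lines. This is a NumPy sketch, not taken from the paper's codebase:

```python
import numpy as np

def to_undirected(a):
    """a_ij = [a_ij + a_ji > 0]: keep an undirected edge wherever an
    edge exists in either direction."""
    return ((a + a.T) > 0).astype(a.dtype)

# Toy directed graph with a single edge 0 -> 1.
a_dir = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [0, 0, 0]])
a_und = to_undirected(a_dir)  # now contains both 0-1 and 1-0
```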

D FURTHER EXPERIMENTAL DETAILS

D.1 IMPLEMENTATIONS

The implementation of VEM-GCN consists of two alternating steps in each iteration: a variational E-step and an M-step. In the variational E-step, a simple four-layer MLP is implemented for q_φ, where the numbers of neuron units of the layers are 128, 64, 64, and 32, respectively. We use tanh as the nonlinear activation function for the hidden layers. In the M-step, p_θ1 is a vanilla GCN with the number of hidden units set to 32, and we use the official source code from https://github.com/tkipf/gcn/tree/master/gcn. All the baselines and our VEM-GCN architecture are trained on a single NVIDIA GTX 1080 Ti GPU with 11GB memory.

We simply utilize the raw node attributes X as the input to q_φ. Note that we can also support any other desirable network embedding method; these experiments are left for future work. Considering that the bag-of-words representations in X are often noisy, we average the features of each node over its neighborhood to smooth the input signal. Let Â_row = (D + γI_N)^{-1}(A + γI_N) denote the "self-enhanced" adjacency matrix with row normalization (we use γ = 1.5 in this paper), and let [A B] be the concatenation of matrices A and B along the last dimension. Consequently, we summarize the specific input X̃ to q_φ for all the datasets as below.

• For Cora, Citeseer, and MS Academic, we use X̃ = Â_row X.
• For Pubmed and Amazon Photo, we use X̃ = [X Â_row X Â²_row X].
• For Cora-ML and Amazon Computers, we use X̃ = [X Â_row X].

For the sampling in the E-step, we find that it is not always stable to draw the sampled labels Ŷ_u ∼ p_θ(Y_u | Y_l, G_obs). To alleviate this problem, in practice we set Ŷ_u ← argmax_y p_θ(Y_u | Y_l, G_obs) with probability p_e and sample Ŷ_u ∼ p_θ(Y_u | Y_l, G_obs) with probability (1 - p_e), which improves performance. For training in the E-step, we utilize SGD with momentum (of 0.99).
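The feature-smoothing step above can be sketched as follows. The helper names are ours, and the paper's actual implementation may differ in details:

```python
import numpy as np

def self_enhanced_row_norm(a, gamma=1.5):
    """A_row = (D + gamma * I)^(-1) (A + gamma * I): row-normalized
    adjacency with gamma-weighted self-loops."""
    n = a.shape[0]
    a_hat = a + gamma * np.eye(n)
    return a_hat / a_hat.sum(axis=1, keepdims=True)

def smoothed_input(a, x, hops=1, gamma=1.5, keep_raw=False):
    """Build the input to q_phi, e.g. [X | A_row X | A_row^2 X]."""
    a_row = self_enhanced_row_norm(a, gamma)
    blocks = [x] if keep_raw else []
    h = x
    for _ in range(hops):
        h = a_row @ h          # one more hop of neighborhood averaging
        blocks.append(h)
    return np.concatenate(blocks, axis=1)
```

Under this sketch, `smoothed_input(a, x, hops=1)` would give Â_row X (the Cora/Citeseer/MS Academic variant), while `smoothed_input(a, x, hops=2, keep_raw=True)` would give [X Â_row X Â²_row X] (the Pubmed/Amazon Photo variant).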
In the test procedure, we perform inference in two ways: (1) keep p in Eq. 9 the same as in the training procedure (commonly p < 1), sample S (S > 1) adjacency matrices, and predict the classes of the unlabeled nodes according to Eq. 11; (2) set p = 1 (i.e., the latent adjacency matrix is now deterministic) and S = 1. We report the best test accuracy obtained with these two sampling methods on each dataset and find that (2) almost always performs better.
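The two inference modes can be sketched as below. Note the caveats: the classifier here is a stand-in one-layer averaging model rather than the paper's GCN, and the Bernoulli edge-sampling and p = 1 thresholding semantics are our reading of Eq. 9, not the exact published procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def infer(q_edge, x, w, S=1, p=1.0):
    """Average predictions over S adjacency matrices drawn from the
    per-edge posterior q_edge; p = 1 with S = 1 makes the latent graph
    deterministic (mode (2) in the text)."""
    n = q_edge.shape[0]
    avg = np.zeros((n, w.shape[1]))
    for _ in range(S):
        if p >= 1.0:
            a = (q_edge >= 0.5).astype(float)      # deterministic graph
        else:
            a = (rng.random((n, n)) < p * q_edge).astype(float)
        a_hat = a + np.eye(n)                      # add self-loops
        a_hat /= a_hat.sum(axis=1, keepdims=True)  # row-normalize
        avg += softmax(a_hat @ x @ w)              # one propagation step
    return avg / S
```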

D.2 HYPERPARAMETER SETTINGS

Table 5 summarizes the hyperparameters adopted for the full-supervised setting on the seven benchmark datasets. For the label-scarce setting, the hyperparameters are the same, except for ε_1 and ε_2, which need to be carefully tuned in the search spaces ε_1 ∈ {0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999} and ε_2 ∈ {0.001, 0.005, 0.01, 0.05, 0.1}.

Table 5: Hyperparameter settings for the results in Table 1. λ(a_ij^obs = 0) = 1 and λ(a_ij^obs = 1) = λ. lr denotes the learning rate. An exponential decay schedule is adopted for lr (E-step) with decay rate d_r and decay step d_s tuned in d_r ∈ {0.96, 0.97, 0.98, 0.99} and d_s ∈ {2500, 3000, 3500, 4000}. The batch size of the M-step is set to 96 (i.e., we randomly select 96 nodes for each step of optimizing Eq. 7). p is tuned in the search space p ∈ {0.001, 0.002, ..., 0.01, 0.02, ..., 0.1, 0.15, ..., 1}.

D.3 METRIC FOR MEASURING OVER-SMOOTHNESS

To address the over-smoothing issue, one would prefer to reduce the intra-class distance to make node features in the same class similar, and to increase the inter-class distance to produce distinguishable representations for nodes in different classes. Hence, we use the ratio of the average inter-class distance to the average intra-class distance (Euclidean distance between the input node features of the last layer) to measure the over-smoothness. Formally, given the learned node embeddings H = {h_i}_{i=1}^N (the input node features of the last layer), we first calculate the distance matrix D = [d_ij] ∈ R^{N×N} with each entry defined as

d_ij = || h_i / ||h_i||_2 - h_j / ||h_j||_2 ||_2,

where ||•||_2 denotes the Euclidean norm. Next, we define the inter-class mask matrix and the intra-class mask matrix as

M_inter = 1_{N×N} - Ã,  M_intra = Ã - I_N,

where Ã = YYᵀ is the optimal graph (Y is the one-hot label matrix) and 1_{N×N} is a matrix of size N × N with all entries set to 1. We then obtain the inter-class distance matrix D_inter and the intra-class distance matrix D_intra by element-wise multiplication of D with the mask matrices:

D_inter = D ◦ M_inter,  D_intra = D ◦ M_intra,

where ◦ denotes element-wise multiplication. Finally, we compute the average inter-class distance AD_inter and the average intra-class distance AD_intra, with which we measure the over-smoothness as their ratio R:

AD_inter = Σ_{i,j} d_ij^inter / Σ_{i,j} 1(d_ij^inter),  AD_intra = Σ_{i,j} d_ij^intra / Σ_{i,j} 1(d_ij^intra),  R = AD_inter / AD_intra,

where 1(x) = 1 if x > 0 and 0 otherwise.
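The metric can be computed directly from embeddings and labels. This NumPy sketch takes integer class labels instead of the one-hot matrix Y (equivalent for building Ã = YYᵀ), and for simplicity it averages over all masked pairs rather than only the pairs with nonzero distance:

```python
import numpy as np

def over_smoothness(h, y):
    """R = AD_inter / AD_intra over L2-normalized embeddings h,
    where y holds integer class labels."""
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    # Pairwise Euclidean distances between normalized embeddings.
    d = np.linalg.norm(hn[:, None, :] - hn[None, :, :], axis=-1)
    same = y[:, None] == y[None, :]               # tilde(A) = Y Y^T
    m_intra = same & ~np.eye(len(y), dtype=bool)  # tilde(A) - I
    m_inter = ~same                               # 1 - tilde(A)
    return d[m_inter].mean() / d[m_intra].mean()
```

Larger R means the classes are better separated, i.e., less over-smoothed.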

E.1 CLASSIFICATION RESULTS UNDER THE LABEL-SCARCE SETTING

Supplementary experiments for Table 2 are shown in Table 6 . 

E.2 VISUALIZATION RESULTS

Figure 2 demonstrates the topology optimization results on the Cora-ML dataset under the full-supervised setting. For the label-scarce setting, the visualization results are shown in Figure 3. We further take Cora-ML as an example dataset to provide t-SNE (Maaten & Hinton, 2008) visualizations of the learned node embeddings (the input node features of the last layer) extracted by the vanilla GCN and our proposed VEM-GCN with varying numbers of layers under the label-scarce setting (10 labeled nodes per class). The results are shown in Figures 4 and 5. As can be seen, VEM-GCN indeed generates more separable node embeddings for nodes in different classes (colors). In particular, a ten-layer vanilla GCN severely suffers from the over-smoothing issue, where node features in different classes are over-mixed and thus indistinguishable, while our VEM-GCN architecture with a ten-layer GCN backbone still achieves performance comparable to a two-layer model.

E.3 COMPLEXITY ANALYSIS

VEM-GCN is flexible and general: there is no constraint on the design of the two neural networks q_φ and p_θ. Therefore, the architecture can be combined with arbitrary GCN models and node embedding methods. Moreover, we generalize some existing state-of-the-art strategies for tackling the over-smoothing problem (i.e., DropEdge and AdaEdge). In comparison to the vanilla GCN, VEM-GCN achieves these benefits of topology optimization at the cost of efficiency. As illustrated in Sections 3.2 and 3.3, the M-step is a traditional training procedure for optimizing GCNs. Although A_latent recovers more intra-class edges than A_obs, the parameter p in Eq. 9 helps maintain the sparsity of the latent graph in each step of the training procedure; thus the M-step has almost the same complexity as GCN. The E-step introduces an extra procedure that trains an MLP for graph structure optimization. However, to address the over-smoothing issue at its core, we argue that topology optimization is necessary. In fact, the efficiency issue is a common problem for Bayesian approaches and topology optimization methods. Despite decreased efficiency compared with the vanilla GCN, VEM-GCN optimizes A_latent with amortized variational inference (i.e., training an MLP shared by all node pairs with mini-batch SGD in the E-step), which is faster than BGCN (Zhang et al., 2019) (a Bayesian method) and LDS (Franceschi et al., 2019) (topology optimization) and is scalable to large-scale graphs. For training on the Amazon Photo dataset with an NVIDIA GTX 1080 Ti GPU, VEM-GCN is about 3× faster than LDS and 4× faster than BGCN.
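The sparsity argument boils down to the cost of the propagation step: with the adjacency matrix stored as an edge list, computing AX costs O(|E|·d) rather than O(N²·d). A minimal COO-style sketch of such a sparse-dense product (our illustration; real implementations use the sparse ops of the underlying framework):

```python
import numpy as np

def spmm(n, rows, cols, vals, x):
    """y = A @ x with A given as COO triplets (rows, cols, vals);
    cost is O(|E| * d) instead of O(N^2 * d) for a dense A."""
    y = np.zeros((n, x.shape[1]))
    # Scatter-add one contribution per stored edge.
    np.add.at(y, rows, vals[:, None] * x[cols])
    return y
```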

E.4 VEM-GCNII

As mentioned above, our VEM-GCN architecture is flexible and general. In the E-step, the neural network can support arbitrary desirable node embeddings, and the GCN in the M-step can be substituted with any graph model. This subsection further verifies the effectiveness of our proposed method by trying different models for p_θ1(Y_l | A_latent, G_obs). We apply the same node embeddings as illustrated in Appendix D; trying more effective node embeddings is not the focus of this paper and is left for future work. Recently, Chen et al. (2020b) proposed a simple and deep GCN model termed GCNII to address the over-smoothing issue. GCNII improves the vanilla GCN via initial residual and identity mapping:

H^{(l+1)} = σ( ((1 - α_l) P H^{(l)} + α_l H^{(0)}) ((1 - β_l) I + β_l W^{(l)}) ),

where H^{(0)} = XW^{(0)} is the output of the first layer (a fully connected layer), P = Ḋ^{-1/2} Ȧ Ḋ^{-1/2} is the normalized adjacency matrix, and α_l and β_l are two hyperparameters. We utilize GCNII as the backbone, which results in the VEM-GCNII architecture. We use the official source code from https://github.com/chennnM/GCNII and employ the hyperparameter settings reported in (Chen et al., 2020b) that achieve the best performance on the three citation networks (i.e., Cora, Citeseer, and Pubmed) under the label-scarce setting. For the other four datasets, we roughly tuned the hyperparameters but found that GCNII does not outperform the vanilla GCN. Thus, we only evaluate GCNII and VEM-GCNII on Cora, Citeseer, and Pubmed with 10 labeled nodes per class as the training set. Experimental results are shown in Table 7. As can be seen, VEM-GCNII consistently boosts the performance of GCNII, further verifying the effectiveness and flexibility of our proposed architecture.
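A single GCNII propagation step, as written above, can be sketched in NumPy. This is a toy re-implementation for illustration only, not the official code; the weight matrix W must be square for the identity mapping to apply:

```python
import numpy as np

def gcnii_layer(p_norm, h, h0, w, alpha=0.1, beta=0.5):
    """H' = ReLU( ((1-alpha) P H + alpha H0) ((1-beta) I + beta W) )."""
    # Initial residual: mix the propagated features with the first layer's output.
    support = (1 - alpha) * (p_norm @ h) + alpha * h0
    # Identity mapping: interpolate the weight matrix toward the identity.
    mapping = (1 - beta) * np.eye(w.shape[0]) + beta * w
    return np.maximum(support @ mapping, 0.0)
```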



Figure 2: Visualization results of topology optimization on the Cora-ML dataset under the full-supervised setting. We plot an induced subgraph (node indices 450 to 850) for a better view. (a) The observed graph A_obs; (b) the optimal graph Ã; (c) the approximate posterior distribution q_φ on A_latent; (d) the refined graph q_φ̂.



Figure 3: Visualization results of topology optimization on the Cora-ML dataset under the label-scarce setting (10 labels per class). We plot an induced subgraph (node indices 450 to 850) for a better view. (a) The observed graph A_obs; (b) the optimal graph Ã; (c) the approximate posterior distribution q_φ on A_latent; (d) the refined graph q_φ̂.

Figure 4: t-SNE plots of learned node embeddings extracted by vanilla GCN with varying layers on the Cora-ML dataset. Different colors denote different classes.

Figure 5: t-SNE plots of learned node embeddings extracted by VEM-GCN with varying layers on the Cora-ML dataset. Different colors denote different classes.

Table 1: Average test accuracy (%) for all models (a two-layer vanilla GCN as the backbone) and all datasets under the full-supervised setting. OOM: out-of-memory error.

Table 2: Average test accuracy (%) and over-smoothness under the label-scarce setting (10 labels per class) with varying numbers of layers. For both metrics, larger is better. A: accuracy. S: over-smoothness.

Table 3: Average test accuracy (%) under different label rates on the Amazon Photo dataset.

Table 4: Summary of dataset statistics.

Table 6: Average test accuracy (%) and over-smoothness under the label-scarce setting (10 labels per class) with varying numbers of layers. For both metrics, larger is better. A: accuracy. S: over-smoothness.

Table 7: Average test accuracy (%) of GCNII and VEM-GCNII on the three citation networks under the label-scarce setting (10 labels per class). The number in parentheses denotes the number of layers.

A ALGORITHM

For a fair comparison, we adopt the vanilla GCN (Kipf & Welling, 2017) as the backbone for all baselines and our proposed VEM-GCN architecture. A two-layer GCN takes the following form:

Z = softmax( Ḋ^{-1/2} Ȧ Ḋ^{-1/2} ReLU( Ḋ^{-1/2} Ȧ Ḋ^{-1/2} X Θ^{(0)} ) Θ^{(1)} ),

where X is the input node attribute matrix, I_N is the identity matrix, Ȧ = A + I_N is the adjacency matrix with added self-loops, Ḋ is its corresponding diagonal degree matrix, and θ_1 = {Θ^{(0)}, Θ^{(1)}} are the learnable weight parameters. Algorithm 1 describes the proposed VEM-GCN architecture.
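The two-layer forward pass above can be sketched directly in NumPy (dense matrices for brevity, whereas practical implementations use sparse ops):

```python
import numpy as np

def normalize_adj(a):
    """P = D^(-1/2) (A + I) D^(-1/2), the renormalized adjacency."""
    a_dot = a + np.eye(a.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_dot.sum(axis=1))
    return d_inv_sqrt[:, None] * a_dot * d_inv_sqrt[None, :]

def gcn_forward(a, x, theta0, theta1):
    """Two-layer GCN: Z = softmax(P ReLU(P X Theta0) Theta1)."""
    p = normalize_adj(a)
    h = np.maximum(p @ x @ theta0, 0.0)      # hidden layer with ReLU
    z = p @ h @ theta1
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # row-wise softmax
```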

B FURTHER DISCUSSIONS

In addition to recent strategies for tackling the over-smoothing issue, we further distinguish VEM-GCN from the SBM-related GCNs and from VGCN and GMNN (which introduce variational inference).

Comparison with SBM-related GCNs. The stochastic block model (SBM) has been combined with GCNs in several recent works (i.e., BGCN (Zhang et al., 2019) and G³NN (Ma et al., 2019)). However, these architectures are entirely different from VEM-GCN in motivations, objective functions, and training methods. BGCN (Zhang et al., 2019) jointly infers the node labels and the parameters of the SBM using only A_obs, which ignores the dependence of the graph on X and Y_l.

