VECODER: VARIATIONAL EMBEDDINGS FOR COMMUNITY DETECTION AND NODE REPRESENTATION

Abstract

In this paper, we study how to simultaneously learn two highly correlated tasks of graph analysis, i.e., community detection and node representation learning. We propose an efficient generative model called VECODER for jointly learning Variational Embeddings for Community Detection and node Representation. VECODER assumes that every node can be a member of one or more communities. The node embeddings are learned in such a way that connected nodes are not only "closer" to each other but also share similar community assignments. A joint learning framework leverages community-aware node embeddings for better community detection. We demonstrate on several graph datasets that VECODER effectively outperforms many competitive baselines on all three tasks, i.e., node classification, overlapping community detection and non-overlapping community detection. We also show that VECODER is computationally efficient and that its performance is quite robust to varying hyperparameters.

1. INTRODUCTION

Graphs are flexible data structures that model complex relationships among entities, i.e. data points as nodes and the relations between nodes via edges. One important task in graph analysis is community detection, where the objective is to cluster nodes into multiple groups (communities). Each community is a set of densely connected nodes. The communities can be overlapping or non-overlapping, depending on whether they share some nodes or not. Several algorithmic (Ahn et al., 2010; Derényi et al., 2005) and probabilistic approaches (Gopalan & Blei, 2013; Leskovec & Mcauley, 2012; Wang et al., 2017; Yang et al., 2013) to community detection have been proposed. Another fundamental task in graph analysis is learning node embeddings. These embeddings can then be used for downstream tasks like graph visualization (Tang et al., 2016; Wang et al., 2016; Gao et al., 2011; Wang et al., 2017) and classification (Cao et al., 2015; Tang et al., 2015). In the literature, these tasks are usually treated separately. Although the standard graph embedding methods capture the basic connectivity, the learning of the node embeddings is independent of community detection. For instance, a simple approach is to obtain node embeddings via DeepWalk (Perozzi et al., 2014) and then get community assignments for each node by using k-means or a Gaussian mixture model. Looking from the other perspective, methods like Bigclam (Yang & Leskovec, 2013), which focus on finding the community structure in the dataset, perform poorly on node representation tasks, e.g. node classification. This motivates us to study approaches that jointly learn community-aware node embeddings. Recently, several approaches, like CNRL (Tu et al., 2018), ComE (Cavallari et al., 2017) and vGraph (Sun et al., 2019), have been proposed to learn the node embeddings and detect communities simultaneously in a unified framework.
Several studies have shown that community detection is improved by incorporating node representations in the learning process (Cao et al., 2015; Kozdoba & Mannor, 2015). The intuition is that the global structure of the graph learned during community detection provides useful context for the node embeddings and vice versa. The joint learning methods (CNRL, ComE and vGraph) learn two embeddings for each node. One embedding is used for the node representation task; the second is the "context" embedding of the node, which aids community detection. As CNRL and ComE are based on Skip-Gram (Mikolov et al., 2013) and DeepWalk (Perozzi et al., 2014), they inherit the "context" embedding for learning the neighbourhood information of the node. vGraph also requires two node embeddings for parameterizing two different distributions. In contrast, we propose learning a single community-aware node representation which is directly used for both tasks. In this way, we not only get rid of an extraneous node embedding but also reduce the computational cost. In this paper, we propose an efficient generative model called VECODER for jointly learning community detection and node representation. The underlying intuition behind VECODER is that every node can be a member of one or more communities. However, the node embeddings should be learned in such a way that connected nodes are "closer" to each other than unconnected nodes. Moreover, connected nodes should have similar community assignments. Formally, we assume that for the i-th node, the node embedding z_i is generated from a prior distribution p(z). Given z_i, the community assignment c_i is sampled from p(c_i|z_i), which is parameterized by node and community embeddings. In order to generate an edge (i, j), we sample another node embedding z_j from p(z) and the respective community assignment c_j from p(c_j|z_j).
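The generative story above can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions, not the authors' implementation: the community embeddings `G_emb`, the number of communities and the embedding dimension are toy placeholders, and the community distribution uses the softmax parameterization introduced later in Sec. 3.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_node(G_emb, rng):
    # z_i ~ p(z) = N(0, I): the node embedding comes from the prior.
    d = G_emb.shape[1]
    z = rng.standard_normal(d)
    # c_i ~ p(c_i | z_i): a categorical over K communities whose logits
    # are the dot products <z, g_k> (softmax parameterization).
    logits = G_emb @ z
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    c = rng.choice(len(G_emb), p=probs)
    return z, c

# Toy setup: K = 4 communities with 16-dimensional embeddings.
G_emb = rng.standard_normal((4, 16))
z_i, c_i = sample_node(G_emb, rng)
z_j, c_j = sample_node(G_emb, rng)  # second endpoint of a candidate edge (i, j)
```

Both endpoints of an edge are drawn independently from the prior; only the decoder (Sec. 3.3) ties them together.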
Afterwards, the node embeddings and the respective community assignments of node pairs are fed to a decoder. The decoder ensures that the embeddings of both the nodes and the communities of connected nodes share high similarity. This enables learning node embeddings that are useful for both community detection and node representation tasks. We validate the effectiveness of our approach on several real-world graph datasets. In Sec. 4, we show empirically that VECODER is able to outperform the baseline methods, including the direct competitors, on all three tasks, i.e. node classification, overlapping community detection and non-overlapping community detection. Furthermore, we compare the computational cost of training the different algorithms. VECODER is up to 40x more time-efficient than its competitors. We also conduct a hyperparameter sensitivity analysis which demonstrates the robustness of our approach. Our main contributions are summarized below:

• We propose an efficient generative model called VECODER for joint community detection and node representation learning.
• We adopt a novel approach and argue that a single node embedding is sufficient for learning both the representation of the node itself and its context.
• Training VECODER is extremely time-efficient in comparison to its competitors.

2. RELATED WORK

Community Detection. Early community detection algorithms are inspired by clustering algorithms (Xie et al., 2013). For instance, spectral clustering (Tang & Liu, 2011) is applied to the graph Laplacian matrix for extracting the communities. Similarly, several matrix-factorization-based methods have been proposed to tackle the community detection problem. For example, Bigclam (Yang & Leskovec, 2013) treats the problem as a non-negative matrix factorization (NMF) task. It aims to recover the node-community affiliation matrix and learns the latent factors which represent the community affiliations of nodes. Another method, CESNA (Yang et al., 2013), extends Bigclam by modelling the interaction between the network structure and the node attributes. The performance of matrix factorization methods is limited by the capacity of the bi-linear models. Some generative models, like vGraph (Sun et al., 2019) and Circles (Leskovec & Mcauley, 2012), have also been proposed to detect communities in a graph.

Node Representation Learning. Many successful algorithms which learn node representations in an unsupervised way are based on random walk objectives (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016; Hamilton et al., 2017). Some known issues with random-walk based methods (e.g. DeepWalk, node2vec) are: (1) they sacrifice the structural information of the graph by over-emphasizing the proximity information (Ribeiro et al., 2017) and (2) their performance depends greatly on hyperparameters (walk length, number of hops etc.) (Perozzi et al., 2014; Grover & Leskovec, 2016). Gilmer et al. (2017) recently showed that graph convolutional encoder models greatly reduce the need for random-walk based training objectives, because the graph convolutions enforce that neighboring nodes have similar representations. Some interesting GCN-based approaches include graph autoencoders, e.g.
GAE and VGAE (Kipf & Welling, 2016b), and DGI (Velickovic et al., 2019).

Joint community detection and node representation learning. In the literature, several attempts have been made to tackle both tasks in a single framework. Most of these methods propose an alternating optimization process, i.e. learn node embeddings, improve the community assignments with them, and vice versa (Cavallari et al., 2017; Tu et al., 2018). Some approaches, like CNRL (Tu et al., 2018) and ComE (Cavallari et al., 2017), are inspired by random walks and thus inherit their shortcomings. Others, like GEMSEC (Rozemberczki et al., 2019), are limited to the detection of non-overlapping communities. There also exist generative models, like CommunityGAN (Jia et al., 2019) and vGraph (Sun et al., 2019), that jointly learn community assignments and node embeddings. Some methods have high computational complexity, i.e. quadratic in the number of nodes in a graph, e.g. M-NMF (Wang et al., 2017) and DNR (Yang et al., 2016a). CNRL, ComE and vGraph require learning two embeddings per node for simultaneously tackling the two tasks. Unlike them, VECODER learns a single community-aware node representation which is directly used for both tasks. It is pertinent to highlight that although both vGraph and VECODER adopt a variational approach, the underlying models are quite different. vGraph assumes that each node can be represented as a mixture of multiple communities and is described by a multinomial distribution over communities, whereas VECODER models the node embedding by a single distribution. For a given node, vGraph first draws a community assignment and then generates a connected neighbor node based on the assignment. VECODER, in contrast, draws the node embedding from the prior distribution and then conditions the community assignment on the single node only.
In simple terms, vGraph also needs edge information in the generative process whereas VECODER does not require it. VECODER relies on the decoder to ensure that embeddings of the connected nodes and their communities share high similarity with each other.

3.1. PROBLEM FORMULATION

Consider an undirected graph G = (V, E) with adjacency matrix A ∈ R^{N×N} and a matrix X ∈ R^{N×F} of F-dimensional node features, N being the number of nodes. Given the number of communities K, we aim to jointly learn the node embeddings and the community embeddings following a variational approach such that: (1) one or more communities can be assigned to every node and (2) the node embeddings can be used for both community detection and node classification.

3.2. VARIATIONAL MODEL

Generative Model: Let us denote the latent node embedding and community assignment for the i-th node by the random variables $z_i \in \mathbb{R}^d$ and $c_i$ respectively. The generative model is given by

$$p(A) = \sum_{c} \int p(Z, c, A) \, dZ, \quad (1)$$

where $c = [c_1, c_2, \cdots, c_N]$ and the matrix $Z = [z_1, z_2, \cdots, z_N]$ stacks the node embeddings. The joint distribution in (1) is mathematically expressed as

$$p(Z, c, A) = p(Z) \, p_\theta(c|Z) \, p_\theta(A|c, Z), \quad (2)$$

where $\theta$ denotes the model parameters. Let us denote the elements of $A$ by $a_{ij}$. Following existing approaches (Kipf & Welling, 2016b; Khan et al., 2020), we consider $z_i$ to be i.i.d. random variables. Furthermore, assuming $c_i|z_i$ to be i.i.d. random variables, the joint distribution in (2) can be factorized as

$$p(Z) = \prod_{i=1}^{N} p(z_i), \quad (3)$$
$$p_\theta(c|Z) = \prod_{i=1}^{N} p_\theta(c_i|z_i), \quad (4)$$
$$p_\theta(A|c, Z) = \prod_{i,j} p_\theta(a_{ij}|c_i, c_j, z_i, z_j), \quad (5)$$

where Eq. (5) assumes that the edge decoder $p_\theta(a_{ij}|c_i, c_j, z_i, z_j)$ depends only on $c_i$, $c_j$, $z_i$ and $z_j$.

Inference Model: We aim to learn the model parameters $\theta$ such that $\log p_\theta(A)$ is maximized. In order to ensure computational tractability, we introduce the approximate posterior

$$q_\phi(Z, c|I) = \prod_i q_\phi(z_i, c_i|I) = \prod_i q_\phi(z_i|I) \, q_\phi(c_i|z_i, I), \quad (6)$$

where $I = (A, X)$ if node features are available, otherwise $I = A$. We maximize the corresponding ELBO (for the derivation, refer to the supplementary material), given by

$$\mathcal{L}_{ELBO} \approx -\sum_{i=1}^{N} D_{KL}\big(q_\phi(z_i|I) \,\|\, p(z_i)\big) - \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} D_{KL}\big(q_\phi(c_i|z_i^{(m)}, I) \,\|\, p_\theta(c_i|z_i^{(m)})\big) + \sum_{(i,j)\in E} \mathbb{E}_{(z_i, z_j, c_i, c_j) \sim q_\phi(z_i, z_j, c_i, c_j|I)}\big[\log p_\theta(a_{ij}|c_i, c_j, z_i, z_j)\big], \quad (7)$$

where $D_{KL}(\cdot\|\cdot)$ denotes the KL-divergence between two distributions. The distribution $q_\phi(z_i, z_j, c_i, c_j|I)$ in the third term of Eq. (7) is factorized into two conditionally independent distributions, i.e. $q_\phi(z_i, z_j, c_i, c_j|I) = q_\phi(z_i, c_i|I) \, q_\phi(z_j, c_j|I)$.
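Since both $q_\phi(z_i|I)$ and the prior are diagonal Gaussians (see Sec. 3.3), the first KL term of the ELBO has a well-known closed form. A minimal NumPy sketch, with function name and shapes of our own choosing:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) )
    # = 0.5 * sum_d (sigma_d^2 + mu_d^2 - 1 - log sigma_d^2), per node.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

# When the posterior equals the prior (mu = 0, sigma = 1) the KL is zero.
mu = np.zeros((3, 16))       # 3 nodes, 16-dimensional embeddings
logvar = np.zeros((3, 16))   # log sigma^2 = 0  =>  sigma = 1
kl = kl_to_standard_normal(mu, logvar)
```

In practice this term is summed over all nodes and added to the (negative) BCE reconstruction loss described in Sec. 3.4.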

3.3. DESIGN CHOICES

In Eq. (3), $p(z_i)$ is chosen to be the standard Gaussian distribution for all $i$. The corresponding approximate posterior $q_\phi(z_i|I)$ in Eq. (6), used as the node embedding encoder, is given by $q_\phi(z_i|I) = \mathcal{N}\big(\mu_i(I), \mathrm{diag}(\sigma_i^2(I))\big)$. The parameters of $q_\phi(z_i|I)$ can be learned by any encoder network, e.g. a graph convolutional network (Kipf & Welling, 2016a), a graph attention network (Veličković et al., 2017), GraphSAGE (Hamilton et al., 2017), or even two matrices to learn $\mu_i(I)$ and $\mathrm{diag}(\sigma_i^2(I))$. Samples are then generated using the reparametrization trick (Doersch, 2016). For parameterizing $p_\theta(c_i|z_i)$ in Eq. (4), we introduce community embeddings $\{g_1, \cdots, g_K\}$, $g_k \in \mathbb{R}^d$. The distribution $p_\theta(c_i|z_i)$ is then modelled as the softmax of the dot products of $z_i$ with $g_k$, i.e.

$$p_\theta(c_i = k|z_i) = \frac{\exp(\langle z_i, g_k \rangle)}{\sum_{\ell=1}^{K} \exp(\langle z_i, g_\ell \rangle)}. \quad (10)$$

The corresponding approximate posterior $q_\phi(c_i = k|z_i, I)$ in Eq. (6) is affected by the node embedding $z_i$ as well as the neighborhood. Our intuition is to consider the similarity of $g_k$ with the embedding $z_i$ as well as with the embeddings of the neighbors of the $i$-th node. The overall similarity with the neighbors is formulated as the average of the dot products of their embeddings. A hyperparameter $\alpha$ is then introduced to control the balance between the effect of $z_i$ and the set $\mathcal{N}_i$ of neighbors of the $i$-th node. Finally, a softmax is applied as follows:

$$q_\phi(c_i = k|z_i, I) = \frac{\exp\big(\alpha \langle z_i, g_k \rangle + (1-\alpha) \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \langle z_j, g_k \rangle\big)}{\sum_{\ell=1}^{K} \exp\big(\alpha \langle z_i, g_\ell \rangle + (1-\alpha) \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \langle z_j, g_\ell \rangle\big)}. \quad (11)$$

Hence, Eq. (11) ensures that graph structure information is employed to learn the community assignments instead of relying on an extraneous node embedding as done in (Sun et al., 2019; Cavallari et al., 2017). Finally, the choice of the edge decoder in Eq.
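Eqs. (10) and (11) can be sketched directly in NumPy. This is an illustrative sketch, not the paper's implementation; `Z` is the (N, d) matrix of sampled node embeddings, `G_emb` the (K, d) matrix of community embeddings, and `neighbors` an adjacency list, all our own toy constructions:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def community_posterior(Z, G_emb, neighbors, alpha=0.9):
    # Eq. (11): blend each node's own community logits <z_i, g_k> with
    # the mean logits of its neighbors, weighted by alpha.
    logits = Z @ G_emb.T                                    # shape (N, K)
    nbr = np.stack([logits[list(n)].mean(axis=0) for n in neighbors])
    return softmax(alpha * logits + (1.0 - alpha) * nbr)

# Toy graph: 3 nodes, 2 communities; node 0 is connected to nodes 1 and 2.
Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
G_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
neighbors = [{1, 2}, {0}, {0}]
q = community_posterior(Z, G_emb, neighbors, alpha=0.9)
```

Setting `alpha=1.0` ignores the neighborhood and recovers the prior of Eq. (10), `softmax(Z @ G_emb.T)`.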
(5) is motivated by the intuition that nodes connected by edges have a high probability of belonging to the same community and vice versa. Therefore we model the edge decoder as

$$p_\theta(a_{ij}|c_i = \ell, c_j = m, z_i, z_j) = \frac{\sigma(\langle z_i, g_m \rangle) + \sigma(\langle z_j, g_\ell \rangle)}{2}. \quad (12)$$

For better reconstruction of the edges, Eq. (12) makes use of the community embeddings, node embeddings and community assignment information simultaneously. This helps in learning better node representations by leveraging global information about the graph structure via community detection. On the other hand, it also forces the community assignments to exploit the local graph structure via node embeddings and edge information.
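A minimal sketch of the decoder in Eq. (12); the toy embeddings and community indices below are ours, for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_decoder(z_i, z_j, c_i, c_j, G_emb):
    # Eq. (12): average of the cross similarities between each node's
    # embedding and the *other* node's assigned community embedding.
    return 0.5 * (sigmoid(z_i @ G_emb[c_j]) + sigmoid(z_j @ G_emb[c_i]))

G_emb = np.array([[1.0, 0.0], [0.0, 1.0]])    # two toy community embeddings
z_i, z_j = np.array([2.0, 0.0]), np.array([1.5, 0.5])
p_same = edge_decoder(z_i, z_j, 0, 0, G_emb)  # both nodes in community 0
p_diff = edge_decoder(z_i, z_j, 0, 1, G_emb)  # nodes in different communities
```

For these embeddings, which align with community 0, the decoder assigns a higher edge probability when both endpoints are assigned to community 0 than when their assignments disagree, matching the intuition above.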

3.4. PRACTICAL ASPECTS

The third term in Eq. (7) is estimated in practice using the samples generated by the approximate posterior. This term is equivalent to the negative of the binary cross-entropy (BCE) loss between observed and reconstructed edges. Since the community assignment follows a categorical distribution, we use Gumbel-softmax (Jang et al., 2016) for backpropagation of the gradients. As for the second term of Eq. (7), it is enough to set $M = 1$, i.e. use only one sample per input node. For inference, the non-overlapping community assignment of the $i$-th node is obtained as

$$C_i = \arg\max_{k \in \{1, \cdots, K\}} q_\phi(c_i = k|z_i, I). \quad (13)$$

To get overlapping community assignments for the $i$-th node, we threshold its weighted probability vector at $\epsilon$, a hyperparameter, as follows:

$$C_i = \Big\{ k \;:\; \frac{q_\phi(c_i = k|z_i, I)}{\max_\ell q_\phi(c_i = \ell|z_i, I)} \geq \epsilon \Big\}, \quad \epsilon \in [0, 1]. \quad (14)$$

3.5. COMPLEXITY

Computation of the dot products for all combinations of node and community embeddings takes $O(NKd)$ time. Solving Eq. (11) further requires the mean of the dot products over the neighborhood of every node, which takes $O(|E|K)$ computations overall as we traverse every edge for every community. Finally, the softmax over all communities for every node in Eq. (10) and Eq. (11) takes $O(NK)$ time. Eq. (12) takes $O(|E|)$ time for all edges since the dot products are already computed. As a result, the overall complexity is $O(|E|K + NKd)$. This complexity is quite low compared to other algorithms designed to achieve similar goals (Cavallari et al., 2017; Wang et al., 2017; Yang et al., 2016a).
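The two inference rules, Eqs. (13) and (14), can be sketched as follows; `q` stands for the (N, K) matrix of posterior community probabilities from Eq. (11), and the toy values are our own:

```python
import numpy as np

def hard_assignment(q):
    # Eq. (13): the single most probable community per node.
    return q.argmax(axis=1)

def overlapping_assignment(q, eps=0.3):
    # Eq. (14): keep every community whose posterior probability is at
    # least eps times the node's maximum community probability.
    keep = q >= eps * q.max(axis=1, keepdims=True)
    return [set(np.flatnonzero(row)) for row in keep]

q = np.array([[0.70, 0.25, 0.05],   # node 0: peaked on community 0
              [0.34, 0.33, 0.33]])  # node 1: nearly uniform
hard = hard_assignment(q)
soft = overlapping_assignment(q, eps=0.3)
```

With `eps=0.3`, node 0 keeps communities {0, 1} (0.25 ≥ 0.3 · 0.70) while node 1 keeps all three, illustrating how the relative threshold yields multiple memberships for nodes with flat posteriors.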

4.1. DATASETS

We select 18 different datasets ranging from 270 to 126,842 edges. For non-overlapping community detection and node classification, we use the 5 citation datasets (Bojchevski & Günnemann, 2017; Yang et al., 2016b). The remaining datasets (Leskovec & Mcauley, 2012; Yang & Leskovec, 2015), used for overlapping community detection, are taken from the SNAP repository (Leskovec & Krevl, 2014). Following (Sun et al., 2019), we take the 5 biggest ground-truth communities for youtube, amazon and dblp. Moreover, we also analyse the case of a large number of communities. For this purpose, we prepare two subsets of the amazon dataset by randomly selecting 500 and 1000 communities from the 2000 smallest communities in the amazon dataset.

4.2. BASELINES

For overlapping community detection, we compare with the following competitive baselines: MNMF (Wang et al., 2017) learns the community membership distribution by using joint non-negative matrix factorization with modularity-based regularization. BIGCLAM (Yang & Leskovec, 2013) also formulates community detection as a non-negative matrix factorization (NMF) task. It simultaneously optimizes the model likelihood of observed links and learns the latent factors which represent the community affiliations of nodes. CESNA (Yang et al., 2013) extends BIGCLAM by statistically modelling the interaction between the network structure and the node attributes. Circles (Leskovec & Mcauley, 2012) introduces a generative model for community detection in ego-networks by learning node similarity metrics for every community. SVI (Gopalan & Blei, 2013) formulates the membership of nodes in multiple communities by a Bayesian model of networks. vGraph (Sun et al., 2019) simultaneously learns node embeddings and community assignments by modelling the nodes as being generated from a mixture of communities; vGraph+, a variant, further incorporates regularization to weigh local connectivity. ComE (Cavallari et al., 2017) jointly learns community and node embeddings by using a Gaussian mixture model formulation. CNRL (Tu et al., 2018) enhances the random walk sequences (generated by DeepWalk, node2vec etc.) to jointly learn community and node embeddings. CommunityGAN (ComGAN) is a generative adversarial model for learning node embeddings such that the entries of the embedding vector of each node represent the membership strength of the node to different communities. Lastly, we compare with the communities obtained by applying k-means to the learned embeddings of DGI (Velickovic et al., 2019). For non-overlapping community detection and node classification, in addition to MNMF, DGI, CNRL, CommunityGAN, vGraph and ComE, we compare VECODER with the following baselines: DeepWalk (Perozzi et al., 2014) makes use of SkipGram (Mikolov et al., 2013) and truncated random walks on the network to learn node embeddings. LINE (Tang et al., 2015) learns node embeddings while attempting to preserve first- and second-order proximities of nodes. Node2Vec (Grover & Leskovec, 2016) learns the embeddings using biased random walks while aiming to preserve network neighborhoods of nodes. Graph Autoencoder (GAE) (Kipf & Welling, 2016b) extends the idea of autoencoders to graph datasets; we also include its variational counterpart, VGAE. GEMSEC (Rozemberczki et al., 2019) is a sequence-sampling-based learning model which aims to jointly learn node embeddings and clustering assignments.

4.3. SETTINGS

For overlapping community detection, we learn the mean and log-variance matrices of 16-dimensional node embeddings. We set α = 0.9 and ε = 0.3 in all our experiments. Following Kipf & Welling (2016b), we first pre-train a variational graph autoencoder. We perform gradient descent with the Adam optimizer (Kingma & Ba, 2014) and a learning rate of 0.01. Community assignments are obtained using Eq. (14). For the baselines, we employ the results reported by Sun et al. (2019). For evaluating the performance, we use F1-score and Jaccard similarity. For non-overlapping community detection, since the default implementations of most of the baselines use 128-dimensional embeddings, we use d = 128 for a fair comparison. Eq. (13) is used for the community assignments. For vGraph, we use the code provided by the authors. We employ normalized mutual information (NMI) and adjusted Rand index (ARI) as evaluation metrics. For node classification, we follow the training split used in various previous works (Yang et al., 2016b; Kipf & Welling, 2016a; Velickovic et al., 2019), i.e. 20 nodes per class for training. We train logistic regression using the LIBLINEAR (Fan et al., 2008) solver as our classifier and report the evaluation results on the rest of the nodes. For the algorithms that do not use node features, we train the classifier by appending the raw node features to the learnt embeddings. For evaluation, we use F1-macro and F1-micro scores. All reported results are averages over five runs. Further implementation details can be found in the code: https://anonymous.4open.science/r/1d95bf8f-8ce3-4870-a454-07db463b419f.

4.4. DISCUSSION OF RESULTS

In the following, we discuss the results to gain some important insights into the problem. Tables 2 and 3 summarize the results of the performance comparison for the overlapping community detection task. First, we note that our proposed method VECODER outperforms the competitive methods on all datasets in terms of Jaccard similarity. VECODER also outperforms its competitors on 12 out of 13 datasets in terms of F1-score, and is the second best method on the 13th dataset (fb0). These results demonstrate the capability of VECODER to learn multiple community assignments quite well and hence reinforce our intuition behind the design of Eq. (11). Second, we observe that there is no consistently performing algorithm among the competitive methods. That is, excluding VECODER, the best performance is achieved by vGraph/vGraph+ on 5, ComGAN on 4 and ComE on 3 out of 13 datasets in terms of F1-score. A similar trend can be seen in Jaccard similarity. Third, we note that all the methods which achieve second best performance solve the tasks of community detection and node representation learning jointly. This supports our claim that treating the two tasks jointly results in better performance. Fourth, we observe that the vGraph+ results are generally better than those of vGraph. This is because vGraph+ incorporates a regularization term in the loss function which is based on the Jaccard coefficients of connected nodes as edge weights. However, it should be noted that this preprocessing step is computationally expensive for densely connected graphs. Tab. 4 shows the results on non-overlapping community detection. First, we observe that MNMF, DeepWalk, LINE and Node2Vec provide a good baseline for the task. However, these methods are not able to achieve comparable performance on any dataset relative to the frameworks that treat the two tasks jointly.
Second, VECODER consistently outperforms all the competitors in NMI and ARI metrics, except for CiteSeer where it achieves the second best ARI. Third, we observe that the GCN-based models, i.e. GAE, VGAE and DGI, show competitive performance. That is, they achieve second best performance on all the datasets except CiteSeer. In particular, DGI achieves second best results on 3 out of 5 datasets in terms of NMI and on 2 out of 5 datasets in terms of ARI. Nonetheless, DGI's results are not very competitive in Tab. 2 and Tab. 3, showing that while DGI can be a good choice for learning node embeddings for attributed graphs with non-overlapping communities, it is not the best option for non-attributed graphs or overlapping communities. The results for node classification are presented in Tab. 5. VECODER achieves the best F1-micro and F1-macro scores on 4 out of 5 datasets. We also observe that the GCN-based models, i.e. GAE, VGAE and DGI, show competitive performance, following the trend in the results of Tab. 4. Furthermore, we note that the node classification results of CommunityGAN (ComGAN) are quite poor. We think a potential reason is that its node embeddings are constrained to have the same dimension as the number of communities. Hence, the components of the learned node embeddings simply represent the membership strengths of nodes for different communities, and linear classifiers may find it difficult to separate such vectors.

4.5. HYPERPARAMETER SENSITIVITY

We study the dependence of VECODER on ε and α by evaluating on four datasets of different sizes: fb698 (N = 61), fb1912 (N = 747), amazon1000 (N = 1540) and youtube (N = 5346). We sweep ε over {0.1, 0.2, · · · , 0.9}. For demonstrating the effect of α, we fix ε = 0.3 and sweep α over {0.1, 0.2, · · · , 0.9}. The average results of five runs for ε and α are given in Fig. 1a and Fig. 1b respectively. Overall, VECODER is quite robust to changes in the values of ε and α. In the case of ε, we see a general decrease in performance when the threshold is set quite high, e.g. ε > 0.7. This is because the datasets contain overlapping communities and a very high ε causes the algorithm to output only the most probable community assignment instead of potentially providing multiple communities per node. However, for a large part of the sweep space, the results are almost consistent. When ε is fixed and α is changed, the results are mostly consistent except when α is set to a low value. Eq. (11) shows that in such a case the node itself is almost neglected and VECODER tends to assign communities based on the neighborhood only, which may decrease performance. This effect is most visible in the amazon1000 dataset because it has only 1.54 nodes on average per community, i.e. there is a good chance that the neighbors of a node belong to different communities. Therefore, depending only on the neighbors will most likely yield poor results.

4.6. TRAINING TIME

Now we compare the training times of different algorithms in Fig. 2. As some of the baselines are more resource-intensive than others, we select the AWS instance type g4dn.4xlarge for a fair comparison of training times. We train vGraph for 1000 iterations and VECODER for 1500 iterations. For all other algorithms we use the default parameters given in Section 4.3. We observe that the methods that simply output node embeddings take relatively less time than the algorithms that jointly learn node representations and community assignments, e.g. VECODER, vGraph and CNRL. Among these algorithms, VECODER is the most time-efficient. It consistently trains in less time than its direct competitors. For instance, it is about 12 times faster than ComE on CiteSeer-full and about 40 times faster than vGraph on the Cora-full dataset. This supports the low computational complexity of VECODER derived in Section 3.5.

5. CONCLUSION

We propose a scalable generative method, VECODER, to simultaneously perform community detection and node representation learning. Our novel approach learns a single community-aware node embedding for both the representation of the node and its context. VECODER is scalable due to its low complexity, i.e. O(|E|K + NKd). The experiments on several graph datasets show that VECODER consistently outperforms all the competitive baselines on node classification, overlapping community detection and non-overlapping community detection tasks. Moreover, training VECODER is far more time-efficient than training its competitors.



Figure 1: Effect of hyperparameters on the performance. F1 and Jaccard scores are in solid and dashed lines respectively.




