DEEP GRAPH-LEVEL CLUSTERING USING PSEUDO-LABEL-GUIDED MUTUAL INFORMATION MAXIMIZATION NETWORK

Abstract

In this work, we study the problem of partitioning a set of graphs into different groups such that the graphs in the same group are similar while the graphs in different groups are dissimilar. This problem has rarely been studied, although there has been much work on node clustering and graph classification. The problem is challenging because it is difficult to measure the similarity or distance between graphs. One feasible approach is to use graph kernels to compute a similarity matrix for the graphs and then perform spectral clustering, but the effectiveness of existing graph kernels in measuring the similarity between graphs is very limited. To solve the problem, we propose a novel method called Deep Graph-Level Clustering (DGLC). DGLC utilizes a graph isomorphism network to learn graph-level representations by maximizing the mutual information between the representations of entire graphs and substructures, under the regularization of a clustering module that ensures discriminative representations via pseudo labels. DGLC achieves graph-level representation learning and graph-level clustering in an end-to-end manner. The experimental results on six benchmark datasets of graphs show that our DGLC achieves state-of-the-art performance compared with many baselines.

1. INTRODUCTION

Graph-structured data widely exist in real-world scenarios, such as social networks (Newman, 2006) and molecular analysis (Gilmer et al., 2017). Compared to other data formats, graph data explicitly contain connections between data through the attributes of nodes and edges, which can provide rich structural information for many applications. In recent years, machine learning on graph-structured data has gained increasing attention. Many supervised and unsupervised learning methods have been proposed for graph-structured data in various applications. The machine learning problems on graph-structured data can be organized into two categories: node-level learning and graph-level learning. In node-level learning, the samples are the nodes in a single graph. Node-level learning mainly includes node classification (Li et al., 2017; Wu et al., 2021; Xu et al., 2021) and node clustering (Wang et al., 2017; Pan & Kang, 2021; Lin et al., 2021). Classical node classification methods are often based on graph embedding (Yan et al., 2006; Cai et al., 2018) and graph regularization (Subramanya & Bilmes, 2009; Bhagat et al., 2011), while recent advances are based on graph neural networks (GNNs) (Kipf & Welling, 2017; Xu et al., 2019; Wu et al., 2020). Owing to the success of GNNs in node classification, a few researchers have proposed GNN-based methods for node clustering (Wang et al., 2019; Bo et al., 2020; Zhu & Koniusz, 2021). Different from node-level learning, in graph-level learning, the samples are a set of graphs that can be organized into different groups. Classical methods for graph-level classification are often based on graph kernels (Vishwanathan et al., 2010; Yanardag & Vishwanathan, 2015), while recent advances are based on GNNs (Wu et al., 2020; Rong et al., 2020).
Researchers generally utilize various types of GNNs, e.g., graph convolutional networks (GCNs) (Kipf & Welling, 2017) and the graph isomorphism network (GIN) (Xu et al., 2019), to learn graph-level representations by aggregating inherent node information and structural neighbor information in graphs; they then train a classifier based on the learned graph-level representations (Zhang et al., 2018; Sun et al., 2020; Wang et al., 2021; Doshi & Chepuri, 2022). Nevertheless, collecting large amounts of labels for graph-level classification is costly in the real world, and clustering graph-level data is much more difficult than clustering nodes and still remains an open issue. This shows the importance of exploring graph-level clustering, namely partitioning a set of graphs into different groups such that the graphs in the same group are similar while the graphs in different groups are dissimilar. Previous research on graph-level clustering is very limited. The major reason is that it is difficult to represent graphs as feature vectors or quantify the similarity between graphs in an unsupervised manner. An intuitive approach to graph-level clustering is to perform spectral clustering (Ng et al., 2001) over the similarity matrix produced by a graph kernel (Kondor & Pan, 2016; Du et al., 2019; Togninalli et al., 2019). Although there have been a few graph kernels such as the random walk kernel (Gärtner et al., 2003) and the Weisfeiler-Lehman kernel (Shervashidze et al., 2011), most of them rely on manual design and thus fail to provide desirable generalization capability for various types of graphs or to produce satisfactory similarity matrices for spectral clustering, which will be demonstrated in Section 4.3. Another solution comes with the encouraging development of GNNs.
Some recent works such as GCNs (Kipf & Welling, 2017) and GIN (Xu et al., 2019) have been proven effective in learning node-level and graph-level representations for various downstream tasks, e.g., node clustering (Wang et al., 2017; Bo et al., 2020; Liu et al., 2022) and graph classification (Sun et al., 2020; Sato et al., 2021; You et al., 2021), thanks to the powerful generalization and representation learning capability of deep neural networks. Therefore, it may be possible to achieve graph-level clustering by performing classical clustering algorithms such as k-means (Hartigan & Wong, 1979) and spectral clustering over the graph-level representations produced by various unsupervised graph representation learning methods (Grover & Leskovec, 2016; Narayanan et al., 2017; Adhikari et al., 2018; Sun et al., 2020). Although the aforementioned GNN-based unsupervised graph-level representation learning methods have shown promising performance on some downstream tasks such as node clustering and graph classification, they are not guaranteed to generate effective features for clustering entire graphs. In contrast, graph-level clustering may benefit from an end-to-end framework that can learn clustering-oriented features during graph-level representation learning. We summarize our motivation here: 1) Graph-level clustering is an important problem but is rarely studied, although there has been a lot of work on graph-level classification and node-level clustering. 2) The performance of graph kernels followed by spectral clustering and of two-stage methods (deep graph-level feature learning followed by k-means or spectral clustering) has not been well explored. 3) An end-to-end deep-learning-based graph-level clustering method is expected to outperform graph kernels and the two-stage methods because its feature learning is clustering-oriented.
Therefore, we propose a novel graph clustering method called deep graph-level clustering (DGLC) in this paper. The proposed method is a fully unsupervised framework and yields clustering-oriented graph-level representations by jointly optimizing two objectives: representation learning and clustering. The main contributions of this paper are summarized as follows.
• We investigate the effectiveness of various graph kernels as well as unsupervised graph representation learning methods on the problem of graph-level clustering.
• We propose an end-to-end graph-level clustering method in which the clustering objective guides the representation learning for entire graphs, which is demonstrated to be much more effective than the two-stage models in this paper.
• We conduct extensive comparative experiments of graph-level clustering on six benchmark datasets. Our method is compared with five graph kernel methods and four cutting-edge GNN representation learning methods, under three quantitative metrics and one qualitative (visualization) assessment. Our method achieves state-of-the-art performance.

2. PRELIMINARIES

The notations used in this paper are shown in Table 1. In the next two subsections, we briefly introduce graph kernels and GNN-based graph-level representation learning methods. We also illustrate how to apply them to graph-level clustering.

2.1. GRAPH KERNELS

For a graph G, after its sub-graphs {G_i} are defined, the kernel is computed from the occurrences of the sub-graphs, namely K_g(G_m, G_n) := F_{G_m}^T F_{G_n}, where F_{G_i} denotes the frequency vector of sub-graphs in G_i. In recent years, much effort has been devoted to the identification of desirable sub-graphs, ranging from the Graphlet kernel (Shervashidze et al., 2009), Random walk kernel (Vishwanathan et al., 2010), and Shortest path kernel (Borgwardt & Kriegel, 2005) to the Subgraph matching kernel (Kriege & Mutzel, 2012), Pyramid match kernel (Nikolentzos et al., 2017), etc. For example, one of the most popular kernels is the Weisfeiler-Lehman kernel (Shervashidze et al., 2011). It belongs to the subtree kernel family and scales up to large, labeled graphs. The Weisfeiler-Lehman kernel is built upon other base kernels through the Weisfeiler-Lehman test of isomorphism on graphs. Its essential idea is to relabel the graph with not only the original label of each vertex but also the sorted set of labels of its neighbors (sub-tree structure). With runtime scaling only linearly in the number of edges of the graphs, the Weisfeiler-Lehman kernel is widely applied in computational biology and social network analysis. However, its hashing step is somewhat ad hoc, with performance varying from dataset to dataset (Kondor & Pan, 2016). Another state-of-the-art algorithm is the shortest-path kernel (Borgwardt & Kriegel, 2005), which is based on paths instead of conventional walks and cycles. By transforming the original graph into a shortest-paths graph, in which each entry G_{v,u,e} counts the occurrences of vertices v and u connected by a shortest path e, it avoids the high computational complexity of graph kernels based on walks, subtrees, and cycles. In this paper, several graph kernels are selected as comparative models to test their effectiveness on clustering.
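The Weisfeiler-Lehman relabeling step described above can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not the paper's implementation: graphs are adjacency lists, and a plain dictionary stands in for the perfect hashing of the original algorithm.

```python
def wl_iteration(adj, labels):
    """One WL refinement step: each node's new label encodes its old label
    plus the sorted multiset of its neighbors' labels."""
    signatures = {
        v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
        for v in adj
    }
    # Compress each distinct signature to a fresh integer label.
    table = {}
    new_labels = {}
    for v, sig in signatures.items():
        if sig not in table:
            table[sig] = len(table)
        new_labels[v] = table[sig]
    return new_labels

# Toy graph: a 3-node path 0-1-2 plus a disconnected edge 3-4.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = {v: 0 for v in adj}        # uniform initial labels
labels = wl_iteration(adj, labels)  # degree-1 and degree-2 nodes now differ
```

Iterating this step and counting label occurrences per graph yields the feature vectors that the WL kernel compares.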
More specifically, we perform spectral clustering with the similarity matrices computed by graph kernels. One limitation is that existing graph kernels are not effective enough to quantify the similarity between graphs. In addition, most of them cannot take advantage of the node features and labels of graphs. The related results and a time complexity comparison can be found in Tables 3-5 and Appendix A.5.
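The kernel-plus-spectral-clustering baseline described above can be sketched as follows. As a stand-in for a real graph kernel, we use node-degree histograms as the sub-graph frequency vectors F_G (a vertex-histogram-like kernel); the toy graphs, the `degree_histogram` helper, and the eigenvector-sign clustering step are all our own simplifying assumptions.

```python
import numpy as np

def degree_histogram(adj_list, max_degree=5):
    """Frequency vector F_G: counts of node degrees, standing in for
    sub-graph occurrence counts of a real kernel (WL, graphlet, ...)."""
    hist = np.zeros(max_degree + 1)
    for nbrs in adj_list.values():
        hist[len(nbrs)] += 1
    return hist

# Four toy graphs: two triangles and two 3-node paths.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
graphs = [triangle, dict(triangle), path, dict(path)]

# Kernel matrix K_g(G_m, G_n) = F_{G_m}^T F_{G_n}.
F = np.stack([degree_histogram(g) for g in graphs])
K = F @ F.T

# Spectral clustering on K: the sign of the eigenvector for the second-largest
# eigenvalue of the symmetrically normalized similarity matrix splits the groups.
d = K.sum(axis=1)
M = K / np.sqrt(np.outer(d, d))
eigvals, eigvecs = np.linalg.eigh(M)
labels = (eigvecs[:, -2] > 0).astype(int)
```

Here the two triangles and the two paths receive distinct cluster labels; a real pipeline would swap in a GraKeL kernel matrix and scikit-learn's spectral clustering with a precomputed affinity.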

2.2. UNSUPERVISED GRAPH-LEVEL REPRESENTATION LEARNING

In recent years, GNN-related models (Wu et al., 2020; Zhou et al., 2020) have shown state-of-the-art performance on many graph-related tasks such as node classification (Kipf & Welling, 2017; Zhang et al., 2019) and graph classification (Zhang et al., 2018; Xu et al., 2019; Sun et al., 2020). A number of graph representation learning methods have been proposed to handle graph/node classification and node clustering tasks. For example, Grover & Leskovec (2016) proposed to learn a low-dimensional mapping for nodes that maximally preserves their neighborhood information. Veličković et al. (2019) proposed to learn node representations for node classification by maximizing the mutual information between patch representations and summarized graph representations. Similarly, Sun et al. (2020) utilized the mutual information maximization strategy and GIN (Xu et al., 2019) to learn graph representations for graph-level classification. You et al. (2020; 2021) took inspiration from self-supervised learning and augmented the graph data to construct positive/negative pairs, thereby learning effective graph representations with a contrastive learning strategy (Chen et al., 2020). It should be pointed out that existing graph representation learning methods rarely investigate the graph-level clustering task, as it is far more difficult than graph classification or node clustering. An intuitive strategy is to perform k-means (Hartigan & Wong, 1979) or spectral clustering (Ng et al., 2001) on the graph-level representations learned by those methods. Nevertheless, the clustering performance is not desirable, as can be observed in Section 4.4, because the representations learned by those methods are not guaranteed to be suitable or effective for graph-level clustering. Therefore, we present our DGLC method to investigate how to learn clustering-oriented graph-level representations, where the learning is guided by an explicit clustering objective.
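The two-stage strategy mentioned above (pretrained graph-level representations followed by k-means) can be sketched as below. The `kmeans` implementation with farthest-point initialization and the synthetic "embeddings" are our own simplifications, not any cited method.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization;
    stands in for the k-means step applied to graph-level representations
    produced by a pretrained, frozen encoder."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])   # farthest point from current centers
    centers = np.stack(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers

# Hypothetical "graph-level embeddings": two well-separated groups.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 0.1, (20, 8)), rng.normal(3.0, 0.1, (20, 8))])
assign, _ = kmeans(Z, k=2)
```

The key limitation motivating DGLC is visible in the structure of this sketch: the embeddings Z are fixed before clustering, so nothing encourages them to be cluster-friendly.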

3. METHODOLOGY

3.1. PROBLEM FORMULATION

Given a set of n graphs, i.e., G := {G_1, G_2, . . . , G_n}, where the i-th graph G_i = (V_i, E_i) has node features X_i = {x_v^(i)}_{v∈V_i}, and X := {X_1, X_2, . . . , X_n}. Graph-level clustering aims to partition the set G into a few non-overlapping groups, i.e., G = G^(1) ∪ G^(2) ∪ · · · ∪ G^(c) with G^(i) ∩ G^(j) = ∅ for any i ≠ j, such that the graphs in the same group are similar while the graphs in different groups are dissimilar, without using any label information. Since the original graph data may not have graph-level feature vectors, and often contain redundant and distracting information, a more effective way is to perform clustering in a latent space given by some representation learning method. Therefore, we propose to learn latent representations and conduct clustering simultaneously, where representation learning and clustering facilitate each other. We formalize the objective function for graph-level clustering as

L(ϕ, θ) := L_r(g_ϕ(X, G), X, G) + L_{c|θ}(g_ϕ(X, G)).   (1)

In (1), L_r denotes the representation learning objective that maps the input data X, G into a latent space via a deep graph neural network with parameters ϕ. L_{c|θ} denotes the clustering objective on the representations g_ϕ(X, G) and is associated with a deep neural network with parameters θ that may also contain the cluster centers or assignments. Note that there could be a trade-off parameter between L_r and L_{c|θ}, but we omit it for convenience. The objective L(ϕ, θ) not only learns cluster-oriented representations but also directly produces clustering results, so there is no need to perform k-means or spectral clustering after pure representation learning, as in the two-stage models mentioned in Section 2.2.

3.2. LEARNING GRAPH-LEVEL REPRESENTATIONS

To learn effective representations of graphs, we take advantage of GNNs (Kipf & Welling, 2017; Xu et al., 2019; Wu et al., 2020). A GNN leverages node information and structural information to learn representations for nodes or graphs. It iteratively aggregates the neighboring information of each node into the node itself, so that the learned features capture both the inherent node information and the information of its neighbors. Specifically, the learned feature h_v^(k) for node v in the k-th layer can be formulated as

h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)( { h_u^(k-1) : u ∈ N(v) } ) ),

where AGGREGATE^(k) collects the neighbor features in the k-th layer and N(v) is the neighborhood set of node v. In particular, the initial representation h_v^(0) is set to the node features of v, i.e., x_v. It is worth noting that more global information is obtained as the layers deepen, while more generalized information is possessed in the earlier layers (Xu et al., 2019). Therefore, considering the information from various depths of the network helps us obtain more powerful representations for graph-level clustering. Following this idea, we concatenate the representations learned at all layers as h_ϕ^i = CONCAT( { h_i^(k) }_{k=1}^K ), where h_ϕ^i is the concatenated representation for node i and h_i^(k) is the representation learned in the k-th layer. After that, we utilize a READOUT function to obtain the graph-level representation, i.e., H_ϕ(G_j) = READOUT( { h_ϕ^i }_{i=1}^{|G_j|} ), where |G_j| denotes the number of nodes in G_j. Therefore, for a batch of graphs G := { G_j ∈ G }_{j=1}^{n_b}, H_ϕ(G) ∈ R^{n_b × K d_h} denotes the learned graph-level representations, where n_b is the number of graphs in the batch, d_h is the dimension of each hidden layer of the GNN, and K is the number of GNN layers. Note that we use the sum readout strategy in this work.
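The aggregate-combine, cross-depth concatenation, and sum-readout steps above can be sketched numerically. We assume a GIN-style update in which a single linear map plus ReLU stands in for the learned MLP; the toy adjacency matrix and random weights are our own illustration, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gin_layer(A, H, W, eps=0.0):
    """GIN-style update: h_v <- ReLU(W [(1+eps) h_v + sum_{u in N(v)} h_u]).
    A single linear map plus ReLU stands in for GIN's MLP."""
    agg = (1 + eps) * H + A @ H        # self term + summed neighbor features
    return np.maximum(agg @ W, 0.0)

# Toy graph: 4 nodes on a path, 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = rng.normal(size=(4, 3))

d_h, K = 8, 3                          # hidden width, number of layers
Ws = [rng.normal(size=(3, d_h))] + [rng.normal(size=(d_h, d_h)) for _ in range(K - 1)]

layers, H = [], H0
for W in Ws:
    H = gin_layer(A, H, W)
    layers.append(H)

h_nodes = np.concatenate(layers, axis=1)  # CONCAT across depths: (4, K*d_h)
H_graph = h_nodes.sum(axis=0)             # sum readout -> graph-level vector
```

The resulting per-node matrix has K*d_h columns, matching the R^{n_b × K d_h} shape of H_ϕ(G) stated above.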
As graph-level clustering is an unsupervised learning task, it is important to learn representative features in an unsupervised manner. We follow (Hjelm et al., 2019; Sun et al., 2020) and achieve this by maximizing the mutual information between the representations of entire graphs and substructures, since this has been demonstrated to be a powerful unsupervised graph representation learning technique. Specifically, for a given batch of graphs G that follows an empirical probability distribution P on the original data space, the estimator I_{ϕ,ψ} of the mutual information (MI) over the global-local pairs is defined as

ϕ̂, ψ̂ = argmax_{ϕ,ψ} Σ_{G∈G} (1/|G|) Σ_{i∈G} I_{ϕ,ψ}( h_ϕ^i ; H_ϕ(G) ) ≜ -L_{r|ϕ,ψ},

where |G| is the number of nodes in G, i denotes a single node in G, and I_{ϕ,ψ} is parameterized by a discriminator network T with parameters ψ. Using the Jensen-Shannon MI estimator (Nowozin et al., 2016), I_{ϕ,ψ} can be formulated as

I_{ϕ,ψ}( h_ϕ^i(s) ; H_ϕ(s) ) := E_P[ -sp( -T_{ϕ,ψ}( h_ϕ^i(s) ; H_ϕ(s) ) ) ] - E_{P×P̃}[ sp( T_{ϕ,ψ}( h_ϕ^i(s') ; H_ϕ(s) ) ) ],

where s denotes an input (positive) sample and s' denotes a negative sample from the distribution P̃ that is identical to the distribution P. In particular, the combinations of global (graph-level) and local (node-level) representations across a batch are used to produce negative samples. sp(y) = log(1 + e^y) denotes the softplus function. Note that we maximize the MI between graph-level and node-level representations, which encourages the graph-level representations to contain as much of the information shared among the node-level representations as possible. Performing k-means or spectral clustering directly on the learned graph-level representations seems applicable, but it often yields inferior solutions, because representations learned in this way alone are not guaranteed to suit the graph-level clustering task that we focus on in this work.
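The Jensen-Shannon MI estimator above can be sketched numerically as follows. The synthetic discriminator scores are our own stand-in for T_{ϕ,ψ} evaluated on positive (matched node/graph) and negative (mismatched, cross-batch) pairs.

```python
import numpy as np

def softplus(y):
    # Numerically stable sp(y) = log(1 + e^y).
    return np.logaddexp(0.0, y)

def jsd_mi(scores_pos, scores_neg):
    """Jensen-Shannon MI lower bound from discriminator scores:
    E_pos[-sp(-T)] - E_neg[sp(T)]."""
    return (-softplus(-scores_pos)).mean() - softplus(scores_neg).mean()

rng = np.random.default_rng(0)
# Hypothetical scores: a good discriminator rates matched pairs high
# and mismatched pairs low; a blind one outputs zero everywhere.
pos = rng.normal(loc=2.0, size=100)
neg = rng.normal(loc=-2.0, size=100)
mi_good = jsd_mi(pos, neg)
mi_blind = jsd_mi(np.zeros(100), np.zeros(100))  # equals -2*log(2)
```

Maximizing this estimate with respect to the encoder and discriminator parameters is what drives h_ϕ^i and H_ϕ(G) to share information; a discriminator that cannot tell pairs apart pins the bound at its minimum of -2 log 2.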

3.3. END-TO-END GRAPH-LEVEL CLUSTERING

To capture more suitable representations for graph-level clustering, we attempt to learn cluster-oriented representations by introducing an explicit clustering objective. Specifically, we attach a clustering network to the graph-level features produced by the representation learning network described above. The graph-level features are then projected to cluster embeddings in a low-dimensional latent space, which can be formalized as z_j = f_θ(H_ϕ(G_j)), where z_j denotes the learned cluster embedding for graph G_j and f_θ is the MLP-based clustering projector with network parameters θ. Let Z_{ϕ,θ}(G) ∈ R^{d_z × n_b} be the cluster embeddings in a batch, where d_z is the dimension of the cluster embedding layer. Subsequently, we take inspiration from (Van der Maaten & Hinton, 2008; Xie et al., 2016) to define the graph-level cluster assignment distribution Q based on Z_{ϕ,θ}(G) as

q_{jt|ϕ,θ} = (1 + ||z_j - μ_t||^2)^{-1} / Σ_{t'=1}^{c} (1 + ||z_j - μ_{t'}||^2)^{-1},

where z_j is the j-th column of Z_{ϕ,θ}(G), c is the number of clusters, μ_t is the t-th cluster center that can be initialized by k-means, and q_{jt|ϕ,θ} is the graph-level cluster assignment indicating the probability that graph G_j belongs to cluster t. Next, we define an auxiliary refined cluster assignment distribution P that emphasizes the assignments with high confidence in Q:

p_{jt} = ( q_{jt|ϕ,θ}^2 / Σ_{j=1}^{n_b} q_{jt|ϕ,θ} ) / Σ_{t'=1}^{c} ( q_{jt'|ϕ,θ}^2 / Σ_{j=1}^{n_b} q_{jt'|ϕ,θ} ),

where P encourages a more pronounced gap between assignments with high and low probability in Q and can be regarded as pseudo labels for guiding the optimization of Q. Therefore, we define the clustering objective by minimizing the KL divergence between P and Q:

L_{c|ϕ,θ} = KL(P||Q) = Σ_{j=1}^{n_b} Σ_{t=1}^{c} p_{jt} log( p_{jt} / q_{jt|ϕ,θ} ).
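The assignment distribution Q, the sharpened target P, and the KL clustering loss above can be sketched as follows; the toy embeddings and fixed centers are our own assumptions, standing in for the learned projector output and k-means-initialized centers.

```python
import numpy as np

def soft_assign(Z, centers):
    """Student-t cluster assignment q_{jt} (Xie et al., 2016 style)."""
    dist2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + dist2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened pseudo-label distribution p_{jt} proportional to
    q_{jt}^2 / f_t, with f_t the soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    """L_c = KL(P || Q), summed over graphs and clusters."""
    return (p * np.log(p / q)).sum()

rng = np.random.default_rng(0)
# Hypothetical cluster embeddings: two tight groups near the two centers.
Z = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(4.0, 0.3, (10, 2))])
centers = np.array([[0.0, 0.0], [4.0, 4.0]])

Q = soft_assign(Z, centers)
P = target_distribution(Q)
loss = kl_clustering_loss(P, Q)
```

In the full method, gradients of this loss flow back through the projector f_θ and the GNN encoder, which is what makes the representations clustering-oriented; here the quantities are simply evaluated once.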
L_{c|ϕ,θ} forces Q to approximate P, i.e., it lets P guide the optimization of Q so that high-confidence assignments are emphasized, which can also be regarded as a self-training strategy. By jointly optimizing the representation learning objective and the clustering objective, we construct an end-to-end deep graph-level clustering framework that simultaneously performs graph-level representation learning and clustering. The overall objective of DGLC in terms of mini-batch optimization is

L_batch(ϕ, ψ, θ) = - (1/|G|) Σ_{i∈G} I_{ϕ,ψ}( h_ϕ^i ; H_ϕ(G) ) + Σ_{j=1}^{n_b} Σ_{t=1}^{c} p_{jt} log( p_{jt} / q_{jt|ϕ,θ} ),

where the first term is L_{r|ϕ,ψ} and the second term is L_{c|ϕ,θ}.

4. EXPERIMENTS

In this section, we evaluate the proposed method in comparison with several state-of-the-art competitors on the graph-level clustering task. We first introduce the datasets and baseline methods used in the experiments and describe the detailed settings of the network and parameters. Then, we demonstrate the effectiveness of our method through comprehensive experimental analysis.

Dataset:

We use six well-known graph datasets in the experiments: MUTAG, PTC-MR, PTC-MM, BZR, ENZYMES, and COX2. We summarize the information of each dataset in Table 2. More detailed information on each dataset is given in Appendix A.1.

Baselines: We compare our method with five graph kernels, including the Random walk kernel (RW) (Vishwanathan et al., 2010), Weisfeiler-Lehman kernel (WL) (Shervashidze et al., 2011), Shortest path kernel (SP) (Borgwardt & Kriegel, 2005), Lovász theta kernel (LT) (Johansson et al., 2014), and Graphlet kernel (GK) (Shervashidze et al., 2009), and four unsupervised graph-level representation learning methods including InfoGraph (Sun et al., 2020), Gromov-Wasserstein factorization (GWF) (Xu et al., 2022), Graph contrastive learning (GraphCL) (You et al., 2020), and Joint augmentation optimization (JOAO) (You et al., 2021). Particularly, for GWF (Xu et al., 2022) we not only follow the original paper to perform k-means, but also perform spectral clustering to evaluate its clustering performance. To provide a fair comparison, we use exactly the same network architecture as our unsupervised graph representation learning competitors (Sun et al., 2020; You et al., 2020; 2021), i.e., we utilize the graph isomorphism network (GIN) (Xu et al., 2019) as the backbone GNN. The cluster projector is constructed as a two-layer MLP-based fully-connected network. We use Adam as the optimizer; the learning rate is chosen from [10^-5, 10^-3], the batch size is set to 128, and the total number of training epochs is set to 20. Moreover, there are three important hyper-parameters in our method: the number of GNN layers, the hidden dimension d_h of each GNN layer, and the dimension d_z of the clustering layer. We evaluate the influence of their values on the graph-level clustering performance in Appendix A.3 due to the limitation of the paper length. To evaluate the clustering performance, we consider three popular metrics: clustering accuracy (ACC), normalized mutual information (NMI), and adjusted rand index (ARI).
The detailed definitions of the three metrics are given in Appendix A.2. We utilize the PyTorch Geometric (Fey & Lenssen, 2019) and GraKeL (Siglidis et al., 2020) libraries to implement our method and the baseline methods. Note that we run all experiments 10 times on an NVIDIA Tesla A100 GPU and an AMD EPYC 7532 CPU, and report the means and standard deviations.
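For reference, clustering accuracy (ACC) requires matching predicted cluster ids to ground-truth classes. A minimal sketch that enumerates label permutations is given below; this brute force is feasible only for the small cluster counts of the benchmarks here (the Hungarian algorithm, e.g. scipy's `linear_sum_assignment`, is the scalable choice), and the helper name is our own.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy: fraction of samples correct under
    the best one-to-one relabeling of predicted cluster ids."""
    classes = sorted(set(y_true) | set(y_pred))
    best = 0
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, hits)
    return best / len(y_true)

# Cluster ids are arbitrary: a label-swapped perfect clustering scores 1.0.
acc_perfect = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])
acc_chance = clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI and ARI, by contrast, are permutation-invariant by construction and need no such matching step.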

4.3. EXPERIMENTAL RESULTS

We compare the proposed DGLC method with 13 baseline and state-of-the-art methods on the six popular benchmarks. The experimental results are shown in Tables 3-5, from which we have the following observations.

4.4. QUALITATIVE STUDY

In this section, we conduct a qualitative study to provide a visual comparison for graph-level clustering. Specifically, we compare our method with several state-of-the-art unsupervised graph representation learning methods, including InfoGraph, GWF, GraphCL and JOAO, by utilizing t-SNE (Van der Maaten & Hinton, 2008) to visualize their learned graph-level representations on MUTAG and ENZYMES. The visualization results are shown in Figure 1. We can observe that, compared with the other methods, DGLC reveals a more compact intra-class structure and a more distinct inter-class discrepancy. For example, the learned representations of the two classes in MUTAG are more separated in our method compared to the others. Besides, we find that InfoGraph, GraphCL and JOAO fail to capture a good clustering structure for ENZYMES, while GWF and our method do. In general, the visualization of the learned graph-level representations also supports the effectiveness of our method.

4.5. PARAMETER SENSITIVITY ANALYSIS AND ABLATION STUDY

To evaluate the robustness of DGLC and the effectiveness of each component, we conduct the parameter sensitivity analysis and ablation study. Please see Appendix A.3 and A.4 for the experimental results and discussions due to the limitation of the paper length.

4.6. COMPUTATIONAL TIME COMPARISON

We also demonstrate the time efficiency of DGLC by comparing the running time with several graph kernels and unsupervised graph representation learning baselines. Please see Appendix A.5 for the experimental results and discussions due to the limitation of the paper length.

5. CONCLUSION

This work has studied the problem of graph-level clustering and proposed an end-to-end deep graph-level clustering method based on a deep graph neural network. The proposed DGLC method leverages the powerful representation learning capability of GIN and defines an explicit clustering objective to help learn clustering-favorable representations for graph-level clustering. We compared the proposed method with two types of baselines: one based on graph kernels followed by spectral clustering, and the other based on graph-level representation learning followed by k-means or spectral clustering. The experiments on six graph datasets have shown that our method achieves much higher clustering accuracy than the baselines.

A.4 ABLATION STUDY

In this section, we conduct experiments to evaluate the influence of each proposed strategy on our method. Specifically, we construct three degraded variants of our method by removing some of its components. They are:
• DGLC_d1: We remove the clustering loss and joint training strategy of DGLC and evaluate the model by performing k-means on the learned graph-level representations, i.e., the model can be regarded as InfoGraph in this case.
• DGLC_d2: We keep the clustering loss and joint training strategy while directly using k-means to produce the clustering results instead of producing the clustering labels with the cluster assignment Q.
• DGLC_d3: We degrade DGLC to a two-stage model, i.e., we train the model by separately optimizing the graph representation learning objective and the clustering objective. The clustering results are still obtained from the graph-level cluster assignment Q in the second training stage.
We run experiments on MUTAG and BZR to evaluate their performance.
Table 6 summarizes the experimental results, from which we have the following observations:
• Both DGLC_d2 and DGLC_d3 significantly outperform DGLC_d1, which suggests that learning clustering-oriented representations benefits graph-level clustering.
• Producing clustering results from the graph-level cluster assignment Q is more reasonable, as the clustering performance degrades when directly performing k-means on the learned cluster embeddings.
• Joint training with the representation learning and clustering objectives yields better clustering performance. For example, DGLC outperforms DGLC_d3 by 3.20%, 4.86%, and 8.67% in terms of ACC, NMI and ARI on MUTAG.

A.5 COMPUTATIONAL TIME COMPARISON

In this section, we compare the proposed DGLC with some baseline methods to demonstrate its time efficiency. Specifically, for graph kernels we select RW (Vishwanathan et al., 2010), WL (Shervashidze et al., 2011), SP (Borgwardt & Kriegel, 2005) and LT (Johansson et al., 2014) as competitors. For unsupervised graph representation learning methods, we select GWF (Xu et al., 2022) and InfoGraph (Sun et al., 2020). Note that we run 20 epochs for GWF, InfoGraph and DGLC for a fair comparison. Table 7 shows the running time of each method on the six benchmark datasets used in this paper. We can see that RW, LT and GWF are quite time-consuming, especially on datasets such as ENZYMES and COX2 that contain numerous nodes and edges. In contrast, WL, SP, InfoGraph and DGLC are much more efficient and have comparable time efficiency. To validate the effectiveness of the proposed method on large-scale graph datasets, we supplement three more datasets in our experiments: NCI1, NCI109, and COLLAB. The detailed information of the three datasets is shown in Table 11, and the experimental results are shown in Table 12 and Table 13. We can see that most graph kernels show low efficiency and poor clustering performance when handling large-scale datasets, and some of them are too time-consuming to run. In contrast, the proposed DGLC method shows superiority over both graph kernels and graph representation learning methods and obtains the best clustering performance in most cases. Besides, the experiment on COLLAB, which contains 3 classes, also demonstrates the effectiveness of DGLC on datasets containing more than 2 classes.



Figure 1: t-SNE visualization of the learned graph-level representations of our method and other unsupervised graph representation learning methods. The first row is the visualization for MUTAG, while the second row is for ENZYMES.

Note that ACC and NMI range from [0, 1], while ARI ranges from [-1, 1]. Higher values of ACC, NMI and ARI represent better clustering performance.

A.3 PARAMETER SENSITIVITY ANALYSIS

We analyze the sensitivity of DGLC to the hyper-parameters, i.e., the hidden dimension d_h of the GNN layers, the embedding dimension d_z of the clustering layer, and the number of GNN layers. Here we take the MUTAG and PTC-MR datasets as examples to evaluate the influence of the d_h and d_z values. Specifically, we select d_h in [16, 32, . . . , 256] and d_z in [5, 10, . . . , 30]; the results are shown in Figure 2. We can observe that the accuracy on both datasets is relatively stable, showing little fluctuation when the parameters vary. In contrast, NMI and ARI are high when the parameter choices are moderate. In general, DGLC shows robust performance against the two parameters. Nevertheless, we recommend choosing d_z from 10 to 25 and d_h from 32 to 128 to obtain better clustering performance in practice. In addition, we conduct a sensitivity analysis on the number of GNN hidden layers on three datasets (MUTAG, PTC-MR and BZR), varying it in [2, 3, . . . , 10]. The experimental results are shown in Figure 3. It can be seen that PTC-MR is quite stable for all three metrics. For MUTAG and BZR, DGLC shows better performance when the number of GNN hidden layers is set to 4 or 5. In general, DGLC obtains relatively stable performance at different numbers of GNN layers, despite fluctuations at some specific values.

Figure 2: Sensitivity analysis of accuracy, NMI and ARI regarding the dimension d h of GNN hidden layers and the embedding dimension d z of clustering layer on MUTAG and PTC-MR datasets.

Figure 3: Sensitivity analysis of ACC, NMI and ARI regarding the number of GNN hidden layers on MUTAG, PTC-MR and BZR datasets.

Table 1: Notations for the main variables and parameters in this paper.

Table 2: Information of the six benchmark datasets.

EXPERIMENTAL SETTINGS

For the graph kernel methods, all similarity matrices are normalized, with the Vertex Histogram kernel as the base graph kernel if needed; we then directly perform spectral clustering (Ng et al., 2001) on the similarity matrices produced by them to obtain the clustering results. Note that we also include the k-means (Hartigan & Wong, 1979) performance of several graph kernels in Appendix A.6. For the unsupervised graph-level representation learning methods, we perform k-means (Hartigan & Wong, 1979) and spectral clustering on the learned graph-level representations.

Clustering performance (ACC, NMI, ARI) on MUTAG and PTC-MR. The best result is highlighted in bold.

Clustering performance (ACC, NMI, ARI) on PTC-MM and BZR. The best result is highlighted in bold.

Clustering performance (ACC, NMI, ARI) on ENZYMES and COX2. The best result is highlighted in bold.

Compared with the graph kernel based approaches, DGLC is more general for different types of graph data. Compared with the latest unsupervised graph representation learning approaches, DGLC has a clear clustering objective in its optimization; it thus tends to learn clustering-oriented graph-level representations and achieves state-of-the-art performance.

Clustering performance (ACC, NMI, ARI) on MUTAG and BZR. The best result is highlighted in bold.

Running time comparison (in seconds) on the six benchmark graph datasets.

Clustering performance (ACC, NMI, ARI) on MUTAG and PTC-MR. The best result is highlighted in bold.

Method                                          MUTAG                                   PTC-MR
                                   ACC         NMI         ARI          ACC         NMI        ARI
(Vishwanathan et al., 2010)+KM     77.66±0.00  30.82±0.00  30.26±0.00   51.16±0.00  0.19±0.00  -0.55±0.00
WL (Shervashidze et al., 2011)+KM  73.94±0.00  15.51±0.00  22.25±0.00   57.56±0.00  1.10±0.00   1.89±0.00
WL-OA (Kriege et al., 2016)+KM     73.94±0.00  16.92±0.00  22.42±0.00   55.81±0.00  0.59±0.00   0.99±0.00
SP (Borgwardt & Kriegel, 2005)+KM  76.06±0.00  15.38±0.00  25.11±0.00   59.30±0.00  1.87±0.00   2.73±0.00
DGLC (Ours)                        84.68±0.89  35.75±2.51  47.01±2.64   60.93±0.57  2.98±0.43   4.29±0.52

A.7 EXPERIMENT ON LARGE-SCALE DATASET

Clustering performance (ACC, NMI, ARI) on PTC-MM and BZR. The best result is highlighted in bold.

Method                                          PTC-MM                                  BZR
                                   ACC         NMI        ARI          ACC         NMI        ARI
(Vishwanathan et al., 2010)+KM     55.06±0.00  0.02±0.00  0.00±0.00    58.52±0.00  0.19±0.00  -1.55±0.00
WL (Shervashidze et al., 2011)+KM  58.63±0.00  0.82±0.00  2.15±0.00    68.15±0.00  0.98±0.00   5.17±0.00
WL-OA (Kriege et al., 2016)+KM     58.04±0.00  0.81±0.00  1.93±0.00    67.90±0.00  2.17±0.00  -6.78±0.00
SP (Borgwardt & Kriegel, 2005)+KM  61.01±0.00  0.85±0.00  2.67±0.00    65.43±0.00  0.27±0.00   2.36±0.00
DGLC (Ours)                        63.30±0.81  2.70±0.45  5.53±0.61    80.98±0.60  9.79±0.92  20.53±1.84

Clustering performance (ACC, NMI, ARI) on ENZYMES and COX2. The best result is highlighted in bold.

Method                                          ENZYMES                                 COX2
                                   ACC         NMI        ARI          ACC         NMI        ARI
(Vishwanathan et al., 2010)+KM     23.17±0.00  2.50±0.00  1.74±0.00    53.96±0.00  0.60±0.00  -1.68±0.00
WL (Shervashidze et al., 2011)+KM  21.50±0.00  2.18±0.00  0.96±0.00    50.96±0.00  0.54±0.00  -0.33±0.00
WL-OA (Kriege et al., 2016)+KM     20.83±0.00  1.68±0.00  0.55±0.00    50.75±0.00  0.51±0.00  -0.37±0.00
SP (Borgwardt & Kriegel, 2005)+KM  22.17±0.00  2.79±0.00  1.70±0.00    52.03±0.00  0.13±0.00   0.01±0.00
DGLC (Ours)                        27.08±1.49  6.39±1.09  2.86±0.80    78.28±0.17  2.38±0.99   6.79±3.37

Information of the three large-scale datasets.

Dataset name    Number of graphs    Average nodes    Average edges    Classes

Clustering performance (ACC, NMI, ARI) on NCI1 and NCI109. The best result is highlighted in bold. N/A denotes results that are unavailable (out of memory, or running time exceeding 24 hours).


Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57-81, 2020.

Hao Zhu and Piotr Koniusz. Simple spectral graph convolution. In Proceedings of the International Conference on Learning Representations, 2021.

A.1 DETAILED INFORMATION OF DATASET

We provide the detailed information of the six graph datasets used in our experiments here:

• MUTAG is a compound dataset that contains 188 compounds, grouped into 2 categories based on their mutagenic effect on a bacterium. Molecules possess a natural graph structure; each graph has on average 17.93 nodes (atoms) and 19.79 edges (chemical bonds).
• PTC-MR and PTC-MM are subsets of the PTC dataset, a compound dataset divided into 2 categories based on carcinogenicity to rodents. PTC-MR contains 344 compounds with on average 14.29 nodes and 14.69 edges, while PTC-MM contains 336 compounds with on average 13.97 nodes and 14.32 edges.
• BZR is a ligand dataset for the benzodiazepine receptor, divided into 2 classes according to whether the compounds are active or inactive. BZR contains 405 graphs in total, with on average 35.75 nodes and 38.36 edges per graph.
• ENZYMES contains 600 protein data for 6 classes of enzymes, with 100 proteins per class. Each protein can be represented as a graph with on average 32.63 nodes and 62.14 edges.
• COX2 consists of 467 inhibitors of cyclooxygenase-2, divided into 2 classes based on whether the compounds are active or inactive. Each graph in this dataset has on average 41.22 nodes and 43.45 edges.

A.2 DEFINITION OF THREE CLUSTERING METRICS

In this section, we introduce the three clustering metrics used in this paper, with y_j and \hat{y}_j denoting the true label and the predicted label for graph G_j, respectively.

Clustering accuracy (ACC): ACC compares the true labels and the predicted labels over the sample size n, and is defined as follows:

ACC = \max_m \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}\{y_j = m(\hat{y}_j)\},

where m ranges over all one-to-one mappings between cluster labels and true labels, and the best mapping can be found efficiently by the Hungarian algorithm.

Normalized mutual information (NMI): NMI scales the mutual information score by a generalized mean of the entropies of the true label set Ω and the cluster label set C. It can be formalized as follows:

NMI(\Omega, C) = \frac{I(\Omega; C)}{\frac{1}{2}\left(H(\Omega) + H(C)\right)},

where I(\Omega; C) is the mutual information between Ω and C, and H(\cdot) denotes entropy.
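These metrics can be computed as in the following minimal sketch: ACC uses the Hungarian algorithm from SciPy to find the best one-to-one label mapping, while NMI and ARI come directly from scikit-learn (the toy label arrays are illustrative only).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best agreement over one-to-one mappings of predicted cluster
    labels to true labels, found via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                      # co-occurrence of (pred, true)
    rows, cols = linear_sum_assignment(-count)  # maximize matched counts
    return count[rows, cols].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # a permutation of the true labels
acc = clustering_accuracy(y_true, y_pred)           # -> 1.0
nmi = normalized_mutual_info_score(y_true, y_pred)  # -> 1.0
ari = adjusted_rand_score(y_true, y_pred)           # -> 1.0
```

Because ACC, NMI, and ARI are all invariant to permutations of the cluster labels, the relabeled prediction above scores perfectly on all three metrics.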

