NODE NUMBER AWARENESS REPRESENTATION FOR GRAPH SIMILARITY LEARNING

Anonymous

Abstract

This work addresses two important issues in graph similarity computation: the Node Number Awareness Issue (N²AI), and the acceleration of graph similarity inference in downstream tasks. We found that existing Graph Neural Network based graph similarity models have a large error when predicting the similarity score of two graphs with a similar number of nodes. Our analysis shows that this is caused by the global pooling function in graph neural networks, which maps graphs with a similar number of nodes to similar embedding distributions and thereby reduces the separability of their embeddings; we refer to this as the N²AI. Our motivation is to enhance the difference between the two embeddings to improve their separability, so we leverage our proposed Different Attention (DiffAtt) to construct the Node Number Awareness Graph Similarity Model (N²AGim). In addition, we propose Graph Similarity Learning with Landmarks (GSL²) to accelerate similarity computation. GSL² uses the trained N²AGim to generate an individual embedding for each graph without any additional learning, and this individual embedding effectively improves the inference speed of GSL². Experiments demonstrate that N²AGim outperforms the second best approach on Mean Square Error by 24.3% (1.170 vs 1.546), 43.1% (0.066 vs 0.116), and 44.3% (0.308 vs 0.553) on the AIDS700nef, LINUX, and IMDBMulti datasets, respectively. GSL² is at most 47.7 and 1.36 times faster than N²AGim and the second fastest model, respectively. Our code is publicly available at https://github.com/iclr231312/N2AGim.

1. INTRODUCTION

Graph similarity computation is a fundamental problem for graph-based applications, e.g., graph data mining, graph retrieval, and graph clustering (Kriege et al., 2020; Ok & Korea, 2020). Graph Edit Distance (GED), defined as the least number of graph edit operations needed to transform graph G_i into graph G_j, is one of the most popular graph similarity metrics (Gao et al., 2010; Neuhaus et al., 2006; Bougleux et al., 2015). The graph edit operations are inserting or deleting a node/edge and relabeling a node/edge. Unfortunately, exact GED computation is NP-Hard in general (Zeng et al., 2009), which is too expensive for downstream tasks. Recently, many Graph Neural Network (GNN) based graph similarity computation algorithms have been proposed to compute the GED faster (Bai et al., 2019; 2020; Li et al., 2019; Ling et al., 2021; Bai & Zhao, 2021; Wang et al., 2021). The GNN-based algorithms transform the GED value into a similarity score and use an end-to-end framework to learn to map a given pair of graphs to their similarity score. As a general framework, a Siamese neural network aggregates information within each graph, a feature fusion module captures the similarity between the two graphs, and a Multi-Layer Perceptron (MLP) performs the regression. However, the existing popular graph similarity models become very inaccurate when predicting the similarity of two graphs with a similar number of nodes, as shown in Fig 1. It is clear that the MSE of all four models grows as the difference in the number of nodes of the two graphs becomes smaller. To better understand this issue, we present in Section 3 a theoretical analysis of the most widely used modules in graph similarity models from a statistical viewpoint.
As shown in Fig 2(a)-(e), our conclusion is that all global pooling functions, also called graph readout functions, map graphs with a similar number of nodes to similar embeddings, which reduces the separability between the embeddings and leads to a large MSE when the models predict the similarity of two graphs with a similar number of nodes. We refer to this issue of indistinguishable embeddings of graphs with a similar number of nodes as the Node Number Awareness Issue (N²AI). Our approach to the N²AI is to focus more on the differences between two similar embeddings during learning, and we propose the Different Attention (DiffAtt) to construct our Node Number Awareness Graph Similarity Model (N²AGim). DiffAtt has a simple architecture and can be added as a plug-and-play module to any global pooling method. Our evaluations on three datasets (Section 5) demonstrate that models with different pooling methods achieve a significant improvement after using DiffAtt. Moreover, N²AGim achieves state-of-the-art performance compared with popular GNN-based graph similarity models, e.g., it is on average 33.3% (0.515 vs 0.772) and 51.4% (0.515 vs 1.059) better on Mean Square Error (MSE) than EGSCT (Qin et al., 2021) and GraphSim (Bai et al., 2020), respectively.

Figure 1: SizeDiff is defined as f(G_1, G_2) = |N_1 − N_2| / max(N_1, N_2), where N_i is the number of nodes in G_i. All models have a larger MSE when SizeDiff is smaller, i.e., when the numbers of nodes in the graph pair are similar.

Figure 2: (a)-(d) Output distributions of different global pooling functions for graphs with N nodes, showing that all global pooling functions map graphs with a similar number of nodes to similar distributions; see Section 3 for details. (e) Illustration of the N²AI: the embedding distributions of two graphs with a similar number of nodes are indistinguishable. Region A is where the two distributions overlap, while B is the opposite.
Our aim is to enhance the information in region B to address the N²AI. (f)-(g) Illustration of the Early Fusion Model (EFM) and the Individual Embedding Model (IEM).

Another issue of interest in graph similarity learning is accelerating the inference of graph similarity models in downstream tasks. Qin et al. (2021) use a specially designed Knowledge Distillation (KD) paradigm that leverages an EFM teacher to improve the individual embeddings generated by the IEM student. Motivated by Balcan et al. (2008), we instead propose a faster and more accurate IEM called Graph Similarity Learning with Landmarks (GSL²). In GSL², a subset of graphs called landmarks, S, is selected, and each graph G is represented as a vector u_G = [GED(G, Ĝ_1), ..., GED(G, Ĝ_M)]^T, where Ĝ_i ∈ S. Finally, an MLP is learned to map the concatenation of the embeddings of two graphs to their GED target. Instead of learning embeddings from the graph data, GSL² uses an already trained graph similarity model to directly generate an individual embedding for each graph, and this individual embedding effectively improves the inference speed of GSL². To sum up, the contributions of this paper are as follows:
• We found that existing graph similarity models have a relatively large error when predicting the actual similarity of two graphs with a similar number of nodes, because the global pooling function maps such graphs to two similar distributions, which we refer to as the N²AI, thus reducing the performance of graph similarity learning.
• A novel GNN-based graph similarity model, named N²AGim, is proposed. N²AGim achieves excellent results on the graph similarity learning task by leveraging the proposed DiffAtt to effectively address the N²AI.
• To speed up the inference of graph similarity models, we propose GSL². GSL² directly represents each graph as a vector whose components are the GED values between the graph and the landmarks, and then learns the target GED values from these representations.
• Experimental results show that N²AGim achieves state-of-the-art performance, while GSL² achieves good accuracy and inference speed, efficiently handling downstream tasks.

2. RELATED WORK

2.1. GRAPH NEURAL NETWORKS

A graph G can be viewed as a pair of an adjacency matrix A ∈ {0, 1}^{N×N} and a node feature matrix X ∈ R^{N×C}, where N is the number of nodes in the graph and C is the dimension of the initial node features. Node i and node j are connected by an edge if and only if A_{i,j} = 1. Considering X = [x_1, x_2, ..., x_N]^T, a Message Passing Neural Network (MPNN) layer is defined as (Fey & Lenssen, 2019): x_i^{(k)} = γ^{(k)}(x_i^{(k-1)}, ⊕_{j∈N(i)} φ^{(k)}(x_i^{(k-1)}, x_j^{(k-1)}, e_{i,j})), where x_i^{(k)} ∈ R^{C_k} is the embedding of node i at the k-th layer, φ performs a differentiable transform on each node or edge, and ⊕ is an aggregation function that combines the transformed attributes of nodes and their neighbors. N(i) denotes the neighbors of node i, and e_{i,j} is the edge feature from node i to node j. γ is a differentiable function that updates the node embeddings. Following the idea of the MPNN, several GNNs and their variants have been proposed for graph mining tasks, e.g., Kipf & Welling (2016) and Velickovic et al. (2017). One of the most important is the Graph Isomorphism Network (GIN) (Xu et al., 2018), which is at most as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test (Leman & Weisfeiler, 1968) and is defined as: x'_i = h_Θ((1 + ε) · x_i + Σ_{j∈N(i)} x_j), where h_Θ is an MLP. We believe this representation ability is effective in addressing the N²AI; therefore, we use the GIN layer as the backbone of our N²AGim.
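The GIN update above can be sketched in a few lines. This is an illustrative NumPy version, not the paper's PyTorch Geometric implementation; `identity_mlp` is a placeholder for the learnable h_Θ, and `eps` corresponds to ε:

```python
import numpy as np

def gin_layer(A, X, mlp, eps=0.0):
    """One GIN update: x_i' = h_Theta((1 + eps) * x_i + sum_{j in N(i)} x_j).

    A: (N, N) adjacency matrix; X: (N, C) node features;
    mlp: callable applied to the aggregated features (stands in for h_Theta).
    """
    # A @ X sums the features of each node's neighbors in one matrix product.
    aggregated = (1.0 + eps) * X + A @ X
    return mlp(aggregated)

# Toy 3-node path graph 0-1-2 with 2-dimensional features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3, 2)                  # rows: [1,0], [0,1], [0,0]
identity_mlp = lambda Z: Z        # placeholder for the learnable MLP
H = gin_layer(A, X, identity_mlp, eps=0.0)
```

Stacking several such layers (the paper uses 3) and pooling after each one yields the multi-scale graph embeddings used later by DiffAtt.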

2.2. DEEP GRAPH SIMILARITY LEARNING

The graph similarity problem is defined as follows: given two graphs G_i and G_j together with their similarity metric, a graph similarity model learns a function that maps the two graphs to the value of that metric. The Graph Matching Network (GMN) (Li et al., 2019) is the first deep graph similarity model; it computes the similarity between two given graphs with a cross-graph attention mechanism. Bai et al. (2019) turned the graph similarity task into a regression task. They not only proposed widely used graph similarity datasets, but also leveraged GCN layers and self-attention-based fusion to design SimGNN. In later work (Bai et al., 2020), the proposed GraphSim directly learns the similarity from the node-level interactions of the two given graphs. By leveraging a trained SimGNN to guide the search space of the A* algorithm, GENN-A* (Wang et al., 2021) achieves best-in-class performance, but requires a prohibitively long inference time on the test data, i.e., 290.1 hours to solve the GED computation on the AIDS700nef dataset (Qin et al., 2021). Since GENN-A* is too time-consuming in practice, we do not compare our proposed methods against it in our evaluations. To achieve faster speed, Qin et al. (2021) proposed a Knowledge Distillation (KD) paradigm to improve the individual graph embeddings generated by the student model. However, we found that none of these existing graph similarity models are designed to address the N²AI. To address it, our N²AGim leverages GIN layers and the proposed DiffAtt to enhance the differences between the embeddings of two graphs, and therefore achieves state-of-the-art performance on benchmark datasets. Compared to EGSCS (Qin et al., 2021), our GSL² directly generates an individual representation for each graph, and the only learnable component of GSL² is a simple MLP. Moreover, our evaluation shows that GSL² achieves both higher accuracy and faster inference than EGSCS.

3. NODE NUMBER AWARENESS ISSUE (N 2 AI) ANALYSIS

Here, we provide a formal theoretical analysis of the N²AI and reveal the reasons for its existence. GNNs usually generate graph embeddings through multiple GNN aggregation layers followed by a global pooling method. GIN is known to be at most as powerful as the WL test, i.e., at distinguishing whether two graphs are isomorphic, which indicates that GIN is effective at distinguishing graphs with similar node numbers and at addressing the N²AI. Hence, we focus on the impact of the widely used global pooling methods on the N²AI, including the first-order statistical methods, i.e., Global Sum Pooling (GSP), Global Max Pooling (GMP), and Global Average Pooling (GAP), and a second-order statistical method, i.e., Second Order Pooling (SOP) (Wang & Ji, 2020). Assume the node feature matrix output by the graph neural network layers is X = [v_1, v_2, ..., v_C], where v_i ∈ R^N is the feature on the i-th channel. We model all variables in X as i.i.d. random variables that follow a Gaussian distribution N(µ, σ²), where µ > 0. A first-order statistical method F_i converts each v_i into a single value g = F_i(v_i), which yields a fixed-size vector, while a second-order statistical method converts each pair v_i and v_j into a single value g = F_i(v_i, v_j), which yields a fixed-size matrix. We study the N²AI by asking whether a pooling method F_i can appropriately distinguish between X with N nodes and X with N + δ nodes, i.e., by the differentiation between the two distributions p(g|N, F_i) and p(g|N + δ, F_i), where δ denotes the difference in the number of nodes. We first assume that X obeys N(1, 4) and show the output distributions of the different pooling functions for different numbers of nodes in Fig 2(a)-(d). Intuitively, all four global pooling methods have a large overlap in their output distributions when the numbers of nodes are similar, and less overlap when the numbers of nodes are very different.
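The overlap between the output distributions of a pooling function for N and N + δ nodes can be estimated numerically. The sketch below is a simplified stand-in for the paper's KDE experiments: it uses the closed-form result that GSP over N i.i.d. N(µ, σ²) features is distributed as N(Nµ, Nσ²) and approximates the min/max area ratio on a grid; the function names and grid bounds are our own choices:

```python
import numpy as np

def gaussian_pdf(g, mean, var):
    # Density of N(mean, var) evaluated on the grid g.
    return np.exp(-(g - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def overlap(m1, v1, m2, v2, lo, hi, steps=20001):
    """Ratio (area under min of the two densities) / (area under max),
    i.e., the overlap measure O(F, N, delta); the uniform grid spacing
    cancels in the ratio, so plain sums suffice."""
    g = np.linspace(lo, hi, steps)
    p1, p2 = gaussian_pdf(g, m1, v1), gaussian_pdf(g, m2, v2)
    return float(np.minimum(p1, p2).sum() / np.maximum(p1, p2).sum())

mu, sigma2 = 1.0, 4.0  # node features ~ N(1, 4), as in the paper's experiment
# GSP output for N nodes ~ N(N*mu, N*sigma2): compare N = 10 vs N = 11 ...
o_close = overlap(10 * mu, 10 * sigma2, 11 * mu, 11 * sigma2, -30, 60)
# ... and N = 10 vs N = 30, where the distributions are well separated.
o_far = overlap(10 * mu, 10 * sigma2, 30 * mu, 30 * sigma2, -40, 100)
```

For GMP and SOP no simple closed form is available, which is why the paper resorts to sampling plus KDE; the same `overlap` ratio can then be applied to the estimated densities.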
We further quantify this overlap with the following equation: O(F_i, N, δ) = ∫ min{p(g|N, F_i), p(g|N + δ, F_i)} dg / ∫ max{p(g|N, F_i), p(g|N + δ, F_i)} dg, where O(F_i, N, δ) denotes the proportion of the overlapping area relative to the total area for F_i with N and N + δ nodes. Obviously, the outputs of GSP and GAP obey the Gaussian distributions N(Nµ, Nσ²) and N(µ, σ²/N), respectively, but it is difficult to obtain the distributions that GMP and SOP satisfy. Therefore, we perform a large number of randomized experiments and use Kernel Density Estimation (KDE) to obtain approximate distributions for GMP and SOP. The overlap probabilities of the four global pooling methods are shown in Fig 3. From Fig 3(a), it is clear that for most global pooling methods the distributions with N and N + 1 nodes overlap in more than 80% of their area, which means that existing graph similarity networks have difficulty distinguishing graphs with a similar number of nodes in the output distribution, thus leading to the N²AI. According to Fig 3(b), the probability of overlap between embedding distributions decreases as the difference in node counts increases. A way to address the N²AI is to make the graph similarity model focus on the differences between the two embeddings. Inspired by this, we propose DiffAtt to enhance the difference between the two embeddings generated by the above four global pooling methods.

4. THE PROPOSED METHODS

4.1. NODE NUMBER AWARENESS GRAPH SIMILARITY MODEL (N²AGIM)

Multi-Scale GIN layers. Given a graph G = (A, X), where A and X are as defined in Section 2, the GIN layers, which can effectively address the N²AI because they are at most as powerful as the WL test at distinguishing whether two graphs are isomorphic, are leveraged as our backbone to update the node embeddings. All the MLPs in the GIN layers consist of one linear layer with Layer Normalization (Ba et al., 2016) and a ReLU activation function.
Besides, we apply residual connections (He et al., 2016) and an additional Feed-Forward Network (FFN) to enhance the node embeddings. We stack 3 GIN layers to aggregate multi-scale information from each node's neighbors. After each GIN layer, a first-order statistical pooling method is applied to generate the graph embeddings.

Different Attention based feature fusion. We propose Different Attention (DiffAtt) to enhance the difference between the embeddings, addressing the N²AI, and to obtain a joint embedding by fusing the two graph-level embeddings at each layer. Given the graph embeddings h_i^{(k)} and h_j^{(k)} at the k-th layer, DiffAtt is defined as:

Att^{(k)} = Softmax(MLP_s^{(k)}(|h_i^{(k)} − h_j^{(k)}|)),
u_{G_i}^{(k)} = flatten(Att^{(k)} ⊙ h_i^{(k)}), u_{G_j}^{(k)} = flatten(Att^{(k)} ⊙ h_j^{(k)}), (2)

where u_{G_i}^{(k)} ∈ R^C is the enhancement embedding of G_i, flatten(·) denotes the flatten operation, and ⊙ denotes the Hadamard product. It is evident that DiffAtt gives greater weight to large differences between the two embeddings and dynamically captures the differences that really matter via learnable parameters, which effectively increases the separability of two graph embeddings with a similar, or even the same, number of nodes, thus effectively addressing the N²AI. Next, we concatenate the two enhancement embeddings into their joint embedding u_{G_i,G_j}^{(k)} = concat([u_{G_i}^{(k)}, u_{G_j}^{(k)}]). Finally, we concatenate the joint embeddings u_{G_i,G_j}^{(k)} at the different layers to obtain a multi-scale joint embedding u_{G_i,G_j} = concat([u_{G_i,G_j}^{(0)}, ..., u_{G_i,G_j}^{(3)}]).

MLP regressor. A two-layer MLP is then applied to map u_{G_i,G_j} to the similarity score. In the graph similarity task, the normalized GED is defined as nGED(G_i, G_j) = GED(G_i, G_j) / ((N_i + N_j)/2), where N_i is the number of nodes of G_i, and the ground truth similarity score is defined as exp(−nGED(G_i, G_j)), which lies in the range (0, 1]. We adopt the Mean Square Error (MSE) as the loss function to train N²AGim.
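The DiffAtt fusion of Eq. (2) can be sketched as follows. This is a minimal NumPy illustration with an identity function in place of the learnable MLP_s, so the attention here reacts only to the raw channel differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def diff_att(h_i, h_j, mlp=lambda d: d):
    """DiffAtt: attend to the channels where the two graph-level
    embeddings differ most, reweight both embeddings, and return
    the joint embedding u_{Gi,Gj} = concat(u_Gi, u_Gj).

    h_i, h_j: (C,) graph embeddings; mlp stands in for MLP_s.
    """
    att = softmax(mlp(np.abs(h_i - h_j)))  # larger weight on larger differences
    u_i = (att * h_i).flatten()            # Hadamard product, then flatten
    u_j = (att * h_j).flatten()
    return np.concatenate([u_i, u_j])

h_i = np.array([1.0, 2.0, 3.0])
h_j = np.array([1.0, 2.0, 0.0])  # differs from h_i only in the last channel
joint = diff_att(h_i, h_j)
```

As expected, almost all attention mass lands on the last channel, the only one in which the two embeddings differ; in the full model this is repeated at every GIN layer and the per-layer joint embeddings are concatenated.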

4.2. GRAPH SIMILARITY LEARNING WITH LANDMARKS (GSL 2 )

The graph similarity task inherently requires a deep fusion of the features of the two graphs at an early stage, followed by learning from the joint embedding to predict the similarity score, as shown in Fig 2(f). This makes it difficult to extract an individual embedding for each graph, which leads to higher computational costs in practice (Qin et al., 2021). Qin et al. (2021) used a KD paradigm to improve the individual embeddings generated by the student IEM. In contrast, we provide a novel IEM framework, called Graph Similarity Learning with Landmarks (GSL²), which directly generates the individual embedding of each graph without additional learning.

Theorem 1. Let the landmark set S be an infinite set containing every graph, and let u_G be the embedding of graph G, defined as u_G = [GED(G, Ĝ_1), GED(G, Ĝ_2), ...]^T, where Ĝ_i ∈ S. The GED value between any G_1 and G_2 satisfies: GED(G_1, G_2) = min_i {u_i^{G_1} + u_i^{G_2}} = min_i {GED(G_1, Ĝ_i) + GED(G_2, Ĝ_i)}.

The proof of Theorem 1 is provided in Appendix A. Theorem 1 shows that the GED between two graphs can be calculated from their GEDs to the landmarks. However, Theorem 1 requires two conditions: first, an infinite set S, and second, a large number of exact GED computations between the graphs and the landmarks, both of which are impossible to satisfy in a practical scenario. The first requirement can be approximately met by randomly selecting M graphs from the training graph set to form S. For the second requirement, the approximate GED values between graphs and landmarks can be computed quickly by a graph similarity model, e.g., SimGNN, GraphSim, or N²AGim, which is equivalent to adding noise to the generated ũ_G. However, in practice, we find that directly using min_i {ũ_i^{G_1} + ũ_i^{G_2}} to approximate the target GED(G_1, G_2) has a relatively large error due to the limited number of landmarks and the noise in ũ_G.
Therefore, we propose to use an MLP to learn to map the two generated embeddings to their GED target. An illustration of GSL² is shown in Fig 4(b), and the details are as follows. First, a subset of graphs S = {Ĝ_1, ..., Ĝ_M}, named landmarks, is randomly selected from the training graph set. Second, any graph similarity model can be leveraged to efficiently obtain an individual embedding for each graph by computing its GEDs to the landmarks. From the analysis above, reducing the noise in ũ_G improves the prediction accuracy; therefore, we leverage our N²AGim, which achieves state-of-the-art performance, to calculate the GED values between all graphs and the landmarks. However, we found that directly converting exponential similarity values to GED values caused significant errors, so we used the ATS² similarity metric to retrain N²AGim; see Appendix B for details. Third, we concatenate the two individual graph embeddings into a joint embedding and learn an MLP that maps the joint embedding to the GED target.
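The embedding construction and the min-sum baseline that the learned MLP replaces can be sketched as follows. The GED values to the landmarks are toy numbers standing in for N²AGim predictions:

```python
import numpy as np

def landmark_embedding(ged_to_landmarks):
    """u_G = [GED(G, L_1), ..., GED(G, L_M)]^T for M landmarks.
    In GSL^2 these entries come from a trained similarity model
    (e.g., N^2AGim); here they are given directly."""
    return np.asarray(ged_to_landmarks, dtype=float)

def min_sum_estimate(u_1, u_2):
    """Landmark-based estimate from Theorem 1: min_i (u1_i + u2_i).
    With finite, noisy embeddings this is only an approximation,
    which is why GSL^2 learns an MLP instead."""
    return float(np.min(u_1 + u_2))

def joint_embedding(u_1, u_2):
    """The 2M-dimensional input fed to the GSL^2 MLP regressor."""
    return np.concatenate([u_1, u_2])

# Toy example with M = 4 landmarks.
u_g1 = landmark_embedding([3, 1, 4, 6])
u_g2 = landmark_embedding([2, 5, 1, 0])
est = min_sum_estimate(u_g1, u_g2)   # triangle-inequality bound on GED(G1, G2)
x = joint_embedding(u_g1, u_g2)      # what the MLP actually consumes
```

The MLP can, in principle, learn corrections the fixed min-sum rule cannot, e.g., discounting landmarks whose predicted GEDs are systematically noisy.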

4.3. COMPARISON OF OUR N 2 AGIM AND GSL 2

Accuracy. N²AGim can effectively address the N²AI by fusing the features of two graphs with DiffAtt at multiple scales, and thus achieves better performance. GSL², however, uses N²AGim to quickly generate a noisy individual embedding for each graph and learns from these noisy embeddings, so its performance is lower than that of N²AGim.

Inference speed. Given q query graphs, the aim is to compute the similarity between all query graphs and the p graphs already stored in the database. Assume the time to compute the similarity of a pair of graphs is T_N for N²AGim and the time to compute the similarity of a pair of embeddings is T_MLP for the MLP in GSL². Since N²AGim, as an EFM, requires fusing each graph pair to obtain joint embeddings at every layer, its computation time is p × q × T_N. GSL², as an IEM, first generates the individual embedding of each graph using N²AGim and then predicts the similarity between two embeddings using the MLP, requiring a computation time of (p + q) × M × T_N + p × q × T_MLP, where M is the number of landmarks. Since GSL² reduces the cost of computing joint embeddings and the MLP is much faster than N²AGim, i.e., T_MLP ≪ T_N, query tasks can be addressed more efficiently by GSL². Especially in industrial scenarios, graph data is usually preprocessed offline into embeddings; if all graph embeddings are stored offline, the inference time of GSL² is just p × q × T_MLP. In summary, as an IEM, GSL² can be up to T_N / T_MLP times faster than N²AGim.

Figure 5: Visualisation of the MSE of different models on test pairs with a similar number of nodes. We split the test graph pairs with no more than a 37.5% difference in the number of nodes into 5 bins for validation. Note that on datasets with small graphs, such as AIDS700nef and LINUX, a 7.5% SizeDiff corresponds to a difference of approximately one node.
The experimental results in Section 5.3 also demonstrate that GSL 2 can be up to 47.7 times faster than N 2 AGim, which shows that GSL 2 can effectively address the similarity computation tasks.
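The cost model above can be made concrete with a small calculation. The timing constants below are illustrative, not measured:

```python
def efm_time(p, q, t_n):
    """Early fusion (N^2AGim-style): every query-database pair
    runs through the full model."""
    return p * q * t_n

def iem_time(p, q, m, t_n, t_mlp, offline=False):
    """GSL^2-style: embed each graph against M landmarks, then one
    MLP call per pair. With offline=True the database and query
    embeddings are assumed precomputed and stored."""
    embed = 0 if offline else (p + q) * m * t_n
    return embed + p * q * t_mlp

# Illustrative (hypothetical) numbers: 1000 database graphs, 100 queries,
# 60 landmarks, per-pair model time 1 ms, per-pair MLP time 0.02 ms.
p, q, m, t_n, t_mlp = 1000, 100, 60, 1.0, 0.02
speedup_offline = efm_time(p, q, t_n) / iem_time(p, q, m, t_n, t_mlp, offline=True)
```

With these made-up constants, the offline-embedding setting is T_N / T_MLP = 50 times faster, matching the analysis that the speedup is bounded by T_N / T_MLP.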

5. EXPERIMENTS

In this section, we evaluate our proposals on the AIDS700nef, LINUX, and IMDBMulti datasets provided by Bai et al. (2019) for graph similarity learning and compare our methods with other state-of-the-art methods. The statistics of these datasets and the data processing details are provided in Appendix B. All experiments are performed on a Linux server with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz and 8 NVIDIA GeForce RTX 2080Ti GPUs. The evaluation metrics we adopt are Mean Square Error (MSE, reported in units of 10^-3), Spearman's Rank Correlation Coefficient (ρ), and Precision at 10 (p@10). All metrics and their meanings are listed in the Appendix. N²AGim is implemented with PyTorch Geometric (Fey & Lenssen, 2019). We use the Adam optimizer with a learning rate of 0.001, a batch size of 2000, and a hidden channel size of 64. We run 200 epochs on the three datasets and, after 150 epochs, perform validation at the end of every epoch. Ultimately, the parameters that yield the lowest validation loss are used for the evaluation on the test data. We implement GSL² using PyTorch (Paszke et al., 2019); the details can be found in our source code.

5.1. ABLATION STUDY

We perform ablation studies to show the influence of DiffAtt in N²AGim and of different GED computation algorithms in GSL². For N²AGim, we compare the performance with and without DiffAtt across the four global pooling functions mentioned above. Besides, we compare DiffAtt with other popular attention-based global pooling methods, i.e., the Neural Tensor Network (NTN) (Bai et al., 2019), the Embedding Fusion Network (EFN) (Qin et al., 2021), Global Soft Attention (GSA) (Li et al., 2015), Set2Set (Vinyals et al., 2015), Context Based Attention (CBA) (Bai et al., 2019), and Cross Context Based Attention (C2BA), under the same architecture. It is worth noting that most existing graph similarity models leverage CBA to generate graph embeddings, e.g., Li et al. (2019); Bai et al. (2019); Qin et al. (2021); Zhang et al. (2021), which is defined as F_CBA(X) = Σ_{n=1}^{N} sigmoid(x_n^T c) x_n, where c denotes the context information of the graph. C2BA differs from CBA only in that the global context information c comes from the other graph in the graph pair. For GSL², we experiment with different graph similarity models. We also provide additional ablation experiments on the hyperparameter choices of our methods, including different numbers of landmarks in GSL², in Appendix G.

Table 1: Results of the ablation study with and without our DiffAtt. Bold means the best. Average denotes the average value over the three datasets. ↑ denotes that larger is better, while ↓ indicates the opposite.

Table 1 shows that DiffAtt alleviates the N²AI for all pooling methods. In terms of the average results, SOP and GSP show better results than GMP and GAP after adding DiffAtt, which is consistent with their smaller overlap ratios and greater distribution differences. Because of the higher performance and lower computational cost of GSP, we finally chose it as the global pooling function in N²AGim.
Compared to the other attention mechanisms, as shown in Table 2, DiffAtt achieves the best results on all metrics under the same experimental setup and architecture; in particular, it beats the EFN on the three averaged metrics by 5.8% (0.515 vs 0.547), 0.7% (0.943 vs 0.936), and 1.3% (0.853 vs 0.842), respectively. As can be seen from Table 3, the accuracy of GSL² increases as the generated GEDs get closer to the true GED values, which validates our analysis in Section 4.2 and shows that the performance of GSL² can be improved by using our N²AGim.

5.2. GRAPH SIMILARITY LEARNING

We compare our N²AGim and GSL² with a number of state-of-the-art methods for graph similarity learning: GMN (Li et al., 2019), SimGNN (Bai et al., 2019), H2MN (Zhang et al., 2021), GraphSim (Bai et al., 2020), and EGSC (Qin et al., 2021). To provide a consistent comparison, we report the MSE computed as the mean of (s − ŝ)² over all test pairs, and the results are shown in Table 4. Our N²AGim achieves the best performance in most cases. On AIDS700nef, the performance is improved by about 24.3% (1.170 vs 1.546 on MSE), 2.0% (0.916 vs 0.898 on ρ), and 3.5% (0.672 vs 0.649 on p@10) compared to EGSCS. On LINUX, N²AGim achieves the best performance on all three metrics; in particular, its MSE is 43.1% (0.066 vs 0.116) better than the second best model, GraphSim. On IMDBMulti, N²AGim achieves the best MSE and p@10 performance, but does not perform as well on ρ (0.918 vs 0.938) as EGSCT. Although GSL² does not learn embeddings directly from the graph data, it achieves state-of-the-art performance on three of the nine metrics across the three datasets. Compared to EGSCS, GSL² achieves better performance on eight of the nine metrics, demonstrating the powerful expressive ability of the embeddings generated in GSL². Compared to GSL², which learns on noisy embeddings, N²AGim achieves better performance, by about 20.4% (1.170 vs 1.470), 1.2% (0.916 vs 0.905), and 11.3% (0.672 vs 0.604) on AIDS700nef. In addition, we visualise the MSE on test pairs with a similar number of nodes for the different models in Fig 5.

Table 3: Results of the ablation study comparing different graph similarity models used in GSL². All graph similarity models were trained using ATS², and we transform the results back to exponential similarity scores for reporting. The brackets indicate the GED algorithm used in GSL², and GT denotes the ground truth GED values. M is set to 60, 30, and 70 on the three datasets, respectively.
Compared to the other models, N 2 AGim shows a significant improvement on graph pairs with similar number of nodes, demonstrating its effectiveness in addressing N 2 AI.

5.3. INFERENCE TIME

In this section, we compare the inference times of GSL² and the other graph similarity models on the test data. Our evaluation reflects real-world graph queries: we treat the training graph set as the graphs that already exist in the database and can be preprocessed, and the test graph set as the query graphs. We compute the similarity of each query graph to all graphs in the database at once to obtain the total query time, and all times are averaged over five runs. The results are shown in Table 5. By obtaining the individual graph embeddings offline, GSL²-F is 12.9, 11.3, and 47.7 times faster than N²AGim on the three datasets, respectively. Compared to EGSCS-F, GSL²-F is 1.36, 1.22, and 1.24 times faster, respectively. This shows the potential of GSL² for efficient graph similarity computation in realistic scenarios.

6. CONCLUSION

This paper addresses two issues in graph similarity tasks: the N²AI and the need to improve the inference speed of graph similarity models for downstream tasks. By analysing the performance of popular graph similarity models, we show that they have difficulty distinguishing the embeddings of two graphs with a similar number of nodes, because the global pooling function maps such graphs to similar embedding distributions, reducing the separability between embeddings. Therefore, DiffAtt is proposed to enhance the difference between two similar embeddings, and the resulting N²AGim achieves state-of-the-art performance. To speed up graph similarity computation, GSL² is proposed. Instead of learning embeddings from the graph data, GSL² generates individual embeddings directly with a trained graph similarity model. Our analysis and experiments both demonstrate that such individual embeddings have powerful expressive ability and can efficiently handle downstream tasks.

7. ETHICS STATEMENT

This work proposes two methods to address real-time graph similarity tasks. Our proposed methods have great potential for practical graph-based applications due to their high precision and high speed. They can also be applied to other similarity problems between graph data, e.g., the binary function similarity problem, which can be helpful for software copyright protection. Therefore, we believe that our methods have no negative societal impact and instead make a positive one.

8. REPRODUCIBILITY STATEMENT

Our code is publicly available at https://github.com/iclr231312/N2AGim. We provide trained models and test code in our anonymous repository to help researchers quickly reproduce the test results. Besides, we provide the source code for training, including the hyperparameter settings and the fixed random seed we use, to ensure our work is reproducible. Please see our repository for more details.

A PROOF OF THEOREM 1

Here, we provide a proof of Theorem 1. The GED satisfies the triangle inequality, i.e., for any G_1, G_2, and G_3: GED(G_1, G_2) ≤ GED(G_1, G_3) + GED(G_2, G_3). In particular, when G_3 is isomorphic to a graph on the least-cost edit path between G_1 and G_2, equality holds: GED(G_1, G_2) = GED(G_1, G_3) + GED(G_2, G_3). Thus, assuming there exists an infinite set S containing every graph, any graph can be encoded as an infinitely long vector u_G = [GED(G, Ĝ_1), GED(G, Ĝ_2), ...]^T, where Ĝ_i ∈ S, so that for any two graphs G_1 and G_2: GED(G_1, G_2) = min_i (u_i^{G_1} + u_i^{G_2}).
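The argument can be checked on a toy metric: the absolute difference on integers satisfies the same triangle inequality, with equality whenever a landmark lies on a shortest "edit path" between the two points. A dense landmark set then recovers the exact distance, while a sparse one only gives an upper bound, mirroring the finite-landmark approximation discussed in Section 4.2. This is an illustrative stand-in, not real GED:

```python
def toy_ged(a, b):
    # Stand-in edit distance on integers; obeys the triangle inequality.
    return abs(a - b)

def embed(g, landmarks):
    # u_G = [d(G, L_1), ..., d(G, L_M)] as in Theorem 1.
    return [toy_ged(g, l) for l in landmarks]

def min_sum(u1, u2):
    # min_i (u1_i + u2_i): exact when some landmark sits "between" g1 and g2.
    return min(x + y for x, y in zip(u1, u2))

g1, g2 = 4, 13
exact = toy_ged(g1, g2)

dense = list(range(21))          # landmarks cover the whole space
recovered = min_sum(embed(g1, dense), embed(g2, dense))

sparse_lms = [0, 20]             # too few landmarks: only an upper bound
bound = min_sum(embed(g1, sparse_lms), embed(g2, sparse_lms))
```

With the dense landmark set the min-sum recovers the exact distance (9), while the two-landmark set yields the looser bound 17, which is exactly the degradation the learned MLP in GSL² is meant to compensate for.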

B DATASETS AND PRE-PROCESSING

We evaluate our methods on the AIDS700nef, LINUX and IMDBMulti datasets provided by Bai et al. (2019). The following is a brief overview of the benchmark datasets: 1. The AIDS700nef dataset contains 700 graphs from the AIDS dataset, each representing an antivirus-screen chemical compound, and all of them have 10 or fewer nodes. 2. The LINUX dataset contains 1000 graphs selected from Wang et al. (2012), which represent Program Dependence Graphs (PDGs) generated from the Linux kernel. 3. The IMDBMulti dataset (Yanardag & Vishwanathan, 2015) contains ego-networks of actors/actresses, where nodes represent actors/actresses and edges indicate that two of them participated in the same movie. For the AIDS700nef and LINUX datasets, Bai et al. (2019) compute the GED of every graph pair using the A* algorithm, and for the IMDB dataset, the minimum of the GEDs computed by three algorithms, Beam (Neuhaus et al., 2006), Hungarian (Riesen & Bunke, 2009) and VJ (Fankhauser et al., 2011), is taken as the ground truth. To enhance the node features of the graphs, we concatenate the one-hot encoding of the node degree into the node features on all three datasets. Note that the GED metric is first normalized as nGED(G i , G j ) = GED(G i , G j ) / (0.5 • (|G i | + |G j |)), where |G i | denotes the number of nodes in G i , and then the function λ(x) = e −x is applied to map it to the range (0, 1]. We randomly split each dataset into 60%, 20%, 20% as the training graph set T r, validation graph set V , and testing graph set T e, respectively. We take the Cartesian product of T r with itself, labeled with the similarity scores, as the training set. The validation set (testing set) is defined as the Cartesian product of T r and V (T e), labeled with the ground truth.
The training set is defined as {(G i , G j , s Gi,Gj ) | G i ∈ T r, G j ∈ T r}, where s Gi,Gj denotes the similarity score of G i and G j , and the validation and testing sets are {(G i , G j , s Gi,Gj ) | G i ∈ T r, G j ∈ V } and {(G i , G j , s Gi,Gj ) | G i ∈ T r, G j ∈ T e}, respectively. However, we found in GSL 2 that directly converting the exponential similarity scores predicted by the graph similarity models back to GED values can cause significant errors. Therefore, we train the graph similarity models with a new similarity score, called the Adaptive Transform Similarity Score (ATS 2 ), and transform the results back to the exponential similarity score at test time for comparison with other models. The ATS 2 is defined as:

AT S 2 (G 1 , G 2 ) = 1 − lg(nGED(G 1 , G 2 ) + 1) / lg(max i,j {nGED(G i , G j )} + 1). (8)
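The two score transforms above can be sketched directly from their definitions. A minimal version (function names are ours; the paper does not prescribe an implementation):

```python
import math

def nged(ged, n1, n2):
    """Normalized GED: GED(G_i, G_j) / (0.5 * (|G_i| + |G_j|))."""
    return ged / (0.5 * (n1 + n2))

def exp_score(ng):
    """Exponential similarity score lambda(x) = e^{-x}, in (0, 1]."""
    return math.exp(-ng)

def ats2(ng, ng_max):
    """Adaptive Transform Similarity Score (Eq. 8).

    ng_max is the maximum normalized GED over all training pairs.
    ATS^2 is 1 at nGED = 0 and 0 at nGED = ng_max.
    """
    return 1.0 - math.log10(ng + 1.0) / math.log10(ng_max + 1.0)
```

Because lg is applied to nGED + 1, ATS 2 spreads small GED values over a wider score range than e −x , which is why converting ATS 2 back to GED is less error-prone.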

C EVALUATION METRICS

The evaluation metrics that we adopt are the Mean Square Error (MSE), Spearman's Rank Correlation Coefficient (ρ) (Spearman, 1961), and Precision at 10 (P@10) (Bai et al., 2019). Moreover, we provide results for the τ (Kendall, 1938) and P@20 metrics in the Appendix. The MSE metric measures the distance between the model's predictions and the ground truth; ρ and τ evaluate how well the global ranking of the predictions matches that of the ground truth; and P@k is the proportion of overlap between the top-k results of the prediction and the ground truth.

D N 2 AI

Here, we provide a more detailed description of the N 2 AI in the popular graph similarity models. We group the testing set by the number of nodes in the graphs, compute the MSEs, normalize them with Min-Max Normalization, and visualize the results in Fig 6. It is clear that the large MSEs are all concentrated at locations with similar numbers of nodes, which reflects the prevalence of the N 2 AI.

E N 2 AGIM WITH SECOND ORDER POOLING

Given the feature map X (k) at the kth layer, Second Order Pooling (SOP) is defined as H (k) = F(X (k) ) = (X (k) ) T • (X (k) ), where H (k) ∈ R C×C is a fixed-size matrix. For the SOP, we define DiffAtt as:

Dif f (k) = |H (k) i − H (k) j |, Att (k) = Sof tmax2D(Dif f (k) ⊗ Θ (k) + bias (k) ), U (k) Gi,Gj = Att (k) ⊙ (H (k) i (H (k) j ) T ), (10)

where U (k) Gi,Gj ∈ R C×C is the joint embedding, and Θ (k) ∈ R C×C and bias (k) ∈ R C×C are two learnable parameters. If U (k) Gi,Gj were directly flattened into a vector for regression, it would result in a larger number of parameters for the model. Therefore, motivated by Bai et al. (2020), we apply four convolution layers with residual connections to learn from U (k) Gi,Gj , and flatten the feature maps into vectors u (k) Gi,Gj at different layers to obtain a multi-scale joint embedding u Gi,Gj = concat([u (0) Gi,Gj , • • • , u (3) Gi,Gj ]).
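A NumPy sketch of the SOP variant of DiffAtt above. This is our reading of the (garbled in extraction) fused term as an attention-weighted product of the two SOP matrices; shapes and function names are assumptions, not the authors' implementation:

```python
import numpy as np

def softmax2d(m):
    """Softmax over all entries of a matrix (Softmax2D)."""
    e = np.exp(m - m.max())
    return e / e.sum()

def sop(x):
    """Second Order Pooling: maps an (N, C) node-feature map X to the
    fixed-size (C, C) matrix X^T X, independent of the node count N."""
    return x.T @ x

def diffatt_sop(x_i, x_j, theta, bias):
    """Sketch of DiffAtt on SOP embeddings.

    theta and bias are (C, C) learnable parameters (random or trained).
    Returns the (C, C) joint embedding U^(k).
    """
    h_i, h_j = sop(x_i), sop(x_j)
    diff = np.abs(h_i - h_j)               # Diff^(k): absolute difference
    att = softmax2d(diff @ theta + bias)   # Att^(k): 2D attention map
    return att * (h_i @ h_j.T)             # U^(k): attention-weighted fusion
```

Note that the two graphs may have different numbers of nodes (different N), yet both SOP matrices, and hence the joint embedding, have the same (C, C) shape.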

F JOINT EMBEDDING VISUALISATION WITH DIFFATT AND WITHOUT DIFFATT

We also visualise the joint embeddings generated with and without DiffAtt on each of the three datasets using t-SNE, as shown in Fig 7. It is clear that the joint embeddings generated with DiffAtt are more separable than those without DiffAtt.

We experimented with the effect of selecting different numbers of landmarks on the performance of GSL 2 , and the results are shown in Tables 12, 13, and 14. From the experimental results, we find that increasing the value of M improves the accuracy of GSL 2 , but also increases the inference time. Considering the balance between inference speed and accuracy, we finally chose M as 60, 30, and 70 for the three datasets, respectively.

G.4 EXPERIMENTAL RESULTS FOR GSL 2 WITH DIFFERENT RANDOMLY SELECTED LANDMARKS

We next provide experimental results of GSL 2 under different random seeds to test the sensitivity of GSL 2 to the selected landmarks, shown in Tables 15, 16, and 17, respectively. From these results, we can see that the selection of different landmarks affects the performance of GSL 2 , but the effect is not significant, which shows the robustness of our GSL 2 .

We also provide the inference speed of GSL 2 based on the other graph similarity models; the results are shown in Table 18. It is clear that GSL 2 -F speeds up SimGNN by 7.7, 8.5, and 73 times on the three datasets, respectively, and speeds up GraphSim by 6.1, 10.2 and 59 times, respectively.

G.6 EXPERIMENTAL RESULTS ON GSL 2 WITHOUT USING MLPS

We provide experiments directly using min i { ũG1 i + ũG2 i }, and the results are shown in Table 19. Due to the limited number of landmarks and the noise in the generated u G , direct use of min i { ũG1 i + ũG2 i } is very ineffective.

G.7 EXPERIMENTAL RESULTS ON GSL 2 WITH DIFFERENT REGRESSION ALGORITHMS

We provide experiments using different regression algorithms in GSL 2 , and the results are shown in Tables 20, 21, and 22, respectively.
The experimental setup is the same as in Section 5, and we use the default parameters in Pycaret (Ali, 2020) to train each model. In practice, the parameters of each learning algorithm can be tuned to obtain better results. It can be seen that the decision-tree-based regression algorithms achieve good performance on these noisy embeddings. This also illustrates the strong expressive ability of our generated embeddings.
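The overall GSL 2 pipeline, embed each graph by its predicted GEDs to the landmarks, then regress on the concatenated pair embeddings, can be sketched as follows. Here `predict_ged` stands in for the trained N 2 AGim, and linear least squares stands in for the paper's MLP; both substitutions are ours, for illustration only:

```python
import numpy as np

def landmark_embedding(g, landmarks, predict_ged):
    """u_G = [GED(G, L_1), ..., GED(G, L_M)]^T, where predict_ged is
    any callable approximating GED (a trained model in the paper)."""
    return np.array([predict_ged(g, l) for l in landmarks], dtype=float)

def fit_regressor(pairs, targets):
    """Hedged stand-in for the paper's MLP: linear least squares on the
    concatenated landmark embeddings [u_G1 ; u_G2] plus a bias term."""
    X = np.array([np.concatenate([u1, u2]) for u1, u2 in pairs])
    X = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(X, np.asarray(targets, dtype=float), rcond=None)
    return w

def predict(w, u1, u2):
    """Predict the target for a graph pair from its two embeddings."""
    x = np.concatenate([u1, u2, [1.0]])
    return float(x @ w)
```

At inference time only M model calls per query graph are needed (and none if the embeddings are stored offline, as in GSL 2 -F), which is where the speed-up over early-fusion models comes from.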

H LIMITATIONS AND FUTURE WORKS

GSL 2 represents each graph by the GED values between it and the landmarks, and learns on these embeddings. However, the restricted number of landmarks and the noise in the embeddings limit the performance of GSL 2 . Moreover, different randomly chosen landmarks can have some impact on the performance of GSL 2 , which calls for a better landmark selection strategy. We leave these issues for future work. Besides, this paper also discovers the N 2 AI, a common problem in graph similarity learning, which could inspire future work.



Figure 1: Histogram of the Mean Square Error (MSE) of the existing graph similarity models on three datasets at different levels of SizeDiff. The SizeDiff represents the percentage difference in the number of nodes and is defined as SizeDiff(G 1 , G 2 ) = |N 1 − N 2 |/max(N 1 , N 2 ), where N i is the number of nodes in G i . It is clear that all models have a larger MSE when the SizeDiff is smaller, i.e., when the numbers of nodes in the graph pair are similar.

This reduces the separability between embeddings and leads to a large MSE when the models predict the similarity of two graphs with a similar number of nodes. We refer to this issue of indistinguishable embeddings of graphs with similar numbers of nodes as the Node Number Awareness Issue (N 2 AI).
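The SizeDiff measure used to bin the graph pairs in Figure 1 is a one-liner (the function name is ours):

```python
def size_diff(n1, n2):
    """SizeDiff(G1, G2) = |N1 - N2| / max(N1, N2), in [0, 1).

    0 means the two graphs have the same number of nodes; values close
    to 1 mean one graph is much larger than the other.
    """
    return abs(n1 - n2) / max(n1, n2)
```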

Figure 2: (a)-(d) Distributions of the output from different global pooling functions with N nodes, which show that all global pooling functions map graphs with similar numbers of nodes to similar distributions. See Section 3 for details. (e) Illustration of the N 2 AI, i.e., the distributions of the embeddings of two graphs with similar numbers of nodes are indistinguishable. Region A represents where the two distributions overlap, while B is the opposite. Our aim is to enhance the information in B to address the N 2 AI. (f)-(g) Illustration of the Early Fusion Model (EFM) and the Individual Embedding Model (IEM).

Another issue of interest in the field of graph similarity learning is accelerating the inference speed of graph similarity models in downstream tasks. Qin et al. (2021) divided graph similarity models into two categories: the Early Fusion Model (EFM), shown in Fig 2(f), which performs feature fusion at an early stage to achieve high accuracy but slow inference, and the Individual Embedding Model (IEM), shown in Fig 2(g), which generates an individual embedding for each graph and then performs fusion; this model is fast but achieves low accuracy. The existing solution (Qin et al., 2021) uses a specially designed Knowledge Distillation (KD) paradigm to leverage an EFM teacher to improve the individual embeddings generated by the IEM student. However, motivated by Balcan et al. (2008), we propose a faster and more accurate IEM called Graph Similarity Learning with Landmarks (GSL 2 ). In GSL 2 , a subset of graphs, called the landmarks S, is selected, and each graph G is represented as a vector u G = [GED(G, Ĝ 1 ), • • • , GED(G, Ĝ m )] T , where Ĝ i ∈ S. Finally, an MLP is learned to map the concatenation of the embeddings of the two graphs to their GED target.
Instead of learning the embeddings from the graph data, our GSL 2 uses an already trained graph similarity model to directly generate an individual embedding for each graph, and this individual embedding can effectively improve the inference speed of GSL 2 . The contributions of this paper can be summarized as follows:

Figure 3: Overlapping probability of graph embedding with N and N + δ number of nodes. (a) shows the overlap probabilities with different N when the δ is 1; (b) shows the overlap probabilities for different δ when the N is 5.

Figure 4: (a) N 2 AGim first uses the multi-scale GIN layers to aggregate the information in the graph, then DiffAtt for feature fusion, and finally an MLP to predict the similarity scores. (b) GSL 2 generates an individual embedding for each graph by calculating the GED values between it and the landmarks, and then uses an MLP to map the individual embeddings of the two graphs to their GED value.

pooling methods. The evaluation on benchmark datasets demonstrates that our DiffAtt brings a huge performance improvement to the graph similarity models.

to reduce the parameters of the model. Finally, we concatenate all the joint embeddings u (k)

Figure 6: Heatmap of the normalized MSE for different graph similarity models with different numbers of nodes.

Figure 7: t-SNE visualisation of the joint embeddings generated by models using DiffAtt and those not using DiffAtt. The colours represent the similarity ground truth of the joint embeddings. It is clear that the joint embeddings with DiffAtt are more separable than those without DiffAtt; for example, the similarity scores of the joint embeddings along the arrows gradually increase, and the joint embeddings with high similarity are concentrated in the elliptical region.

demonstrates that our DiffAtt effectively improves 44 metrics out of 48 metrics of the four global pooling methods, especially giving a huge boost to 12 metrics on the IMDBMulti dataset,

Results of the ablation study of comparing our DiffAtt with other attention methods. Bold means the best, and † means the next best.

Results of the graph similarity learning task. Bold means the best.

Results of inference time for each model. The suffix of '-R' means that the input is the raw query graph, while the suffix of '-F' means that the embeddings of the query graph are stored offline. All times reported below are in seconds.

Statistics of all the datasets used in our experiments. Columns: Datasets, Graphs, Avg nodes, Avg edges, Pairs of testing graphs, Node attr.

Results of MSE for different models on test data with a small difference in the number of nodes on the AIDS700nef dataset. N 2 AGim achieves better performance than the other models when the number of nodes is similar, showing that N 2 AGim can address the N 2 AI effectively.

Here we provide more detailed numerical comparison results in Tables 7, 8 and 9. Note that the graphs in AIDS700nef and LINUX all have a relatively small number of nodes, and a 7.5% difference in node number can be seen as a difference of one node. The results demonstrate that N 2 AGim achieves better performance than the other models at all levels of SizeDiff, in particular about 26.7% (1.661 vs 2.265), 19.4% (0.191 vs 0.237) and 35.4% (1.103 vs 1.707) better than the second-best performance when the difference in the number of nodes is less than 7.5% on the three datasets, respectively. This is strong evidence that N 2 AGim practically mitigates the N 2 AI.

G.2 MORE ABLATION EXPERIMENTS ON N 2 AGIM

We experimented with different settings of the backbone of N 2 AGim on the AIDS700nef dataset. First we experimented with different types of GNNs and whether to use residual connections and FFNs to enhance the node embeddings; the results are shown in Table 10. It is shown that all GNNs show a significant improvement with the enhancement, especially in the MSE metric.

Results of MSE for different models on test data with small differences in the number of nodes on the LINUX dataset.

Results of MSE for different models on test data with small differences in the number of nodes on the IMDBMulti dataset.

Experimental results on N 2 AGim with different GNNs and whether to use residual connections and FFNs on the AIDS700nef dataset. Compared to other GNNs, GIN achieves better results in most cases, especially on MSE, by about 1.8% (1.170 vs 1.191) and 4.3% (1.170 vs 1.223), making it more suitable as a backbone for N 2 AGim. We further experimented with the performance of N 2 AGim using different numbers of GIN layers, as shown in Table 11, and found that the number of layers had little effect on performance, so we chose to use 3 layers of GIN.

Experimental results on N 2 AGim with different number of GIN layers on the AIDS700nef dataset.

Experimental results for different numbers of landmarks selected on the AIDS700nef dataset. M denotes the number of landmarks. Considering the balance of inference speed and accuracy, we finally chose M as 60.

Experimental results for different numbers of landmarks selected on the LINUX dataset. Considering the balance of inference speed and accuracy, we finally chose M as 30.

Experimental results for different numbers of landmarks selected on the IMDBMulti dataset. Considering the balance of inference speed and accuracy, we finally chose M as 70. Columns: M, GSL 2 -R (s), GSL 2 -F (s), MSE ↓.

Experimental results on GSL 2 using different random seeds on the AIDS700nef dataset.

Experimental results on GSL 2 using different random seeds on the LINUX dataset.

Experimental results on GSL 2 using different random seeds on the IMDBMulti dataset.

Experimental results on how faster GSL 2 can improve other graph similarity models.

Experimental results on direct use of min i { ũG1 i + ũG2 i }.

