AN EMPIRICAL STUDY OF THE EXPRESSIVENESS OF GRAPH KERNELS AND GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Graph neural networks and graph kernels have achieved great success in solving machine learning problems on graphs. Recently, there has been considerable interest in determining the expressive power of graph neural networks and, to a lesser extent, of graph kernels. Most studies have focused on the ability of these approaches to distinguish non-isomorphic graphs or to identify specific graph properties. However, there is often a need for algorithms whose graph representations accurately capture the similarity/distance between graphs. This paper studies the expressive power of graph neural networks and graph kernels from an empirical perspective. Specifically, we compare the graph representations and similarities produced by these algorithms against those generated by a well-accepted, but intractable, graph similarity function. We also investigate the impact of node attributes on the performance of the different models and kernels. Our results reveal interesting findings. For instance, we find that theoretically more powerful models do not necessarily yield higher-quality representations, while graph kernels are shown to be very competitive with graph neural networks.

1. INTRODUCTION

In recent years, graph-structured data has experienced an enormous growth in many domains, ranging from chemo- and bio-informatics to social network analysis. Several problems of increasing interest require applying machine learning techniques to graph-structured data. Examples of such problems include predicting the quantum mechanical properties of molecules (Gilmer et al., 2017) and modeling physical systems (Battaglia et al., 2016). To develop successful machine learning models in the domain of graphs, we need techniques that can both extract the information that is hidden in the graph structure and exploit the information contained within node and edge attributes. In the past years, the problem of machine learning on graphs has been dominated by two major families of approaches, namely graph kernels (Nikolentzos et al., 2019) and graph neural networks (GNNs) (Wu et al., 2020). Recently, much research has focused on measuring the expressive power of GNNs (Xu et al., 2019; Morris et al., 2019; Murphy et al., 2019; Maron et al., 2019a;b; Sato et al., 2019; Keriven & Peyré, 2019; Chen et al., 2019; Dasoulas et al., 2020; Nikolentzos et al., 2020; Barceló et al., 2020). In the case of graph kernels, on the other hand, there have been only a limited number of similar studies (Kriege et al., 2018). This is mainly due to the fact that the landscape of graph kernels is much more diverse than that of GNNs. Indeed, although numerous GNN variants have been recently proposed, most of them share the same basic idea, and can be reformulated into a single common framework, so-called message passing neural networks (MPNNs) (Gilmer et al., 2017). These models employ a message passing procedure to aggregate local information of vertices and are closely related to the Weisfeiler-Lehman (WL) test of graph isomorphism, a powerful heuristic which can successfully test isomorphism for a broad class of graphs (Arvind et al., 2015).
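To make the connection concrete, the following minimal sketch (not part of the original experiments) implements one-dimensional WL colour refinement, the neighbourhood-aggregation step that both the WL subtree kernel and MPNNs build upon; the function name and graph encoding are illustrative choices.

```python
from collections import Counter

def wl_refine(adj, colors, iterations=3):
    """1-dimensional Weisfeiler-Lehman colour refinement.

    adj: dict mapping each node to a list of its neighbours.
    colors: dict mapping each node to an initial label.
    Returns the multiset (Counter) of final colours. Two graphs with
    different colour multisets are certainly non-isomorphic; equal
    multisets are inconclusive (the converse does not hold).
    """
    for _ in range(iterations):
        new_colors = {}
        for v in adj:
            # Aggregate the sorted multiset of neighbour colours, much
            # as an MPNN aggregates neighbour messages, then compress
            # the signature into a new colour via hashing.
            signature = (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            new_colors[v] = hash(signature)
        colors = new_colors
    return Counter(colors.values())

# A triangle and a 3-node path are distinguished by their colour multisets.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
init = {v: 1 for v in range(3)}
print(wl_refine(triangle, init) != wl_refine(path, init))  # True
```

The WL subtree kernel counts matching colours between two graphs, whereas an MPNN replaces the hash with a learnable aggregation; both inherit the distinguishing power (and the blind spots) of this refinement loop.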
When dealing with learning problems on graphs, a practitioner needs to choose one GNN or one graph kernel for her particular application. The practitioner is then faced with the following question: Does this GNN variant or graph kernel capture graph similarity better than others? Unfortunately, this question is far from being answered. Most of the above studies investigate the power of GNNs in terms of distinguishing between non-isomorphic graphs or in terms of how well they can approximate combinatorial problems. However, in graph classification/regression problems, we are not that much interested in testing whether two (sub)graphs are isomorphic to each other, but mainly in classifying graphs or in predicting real values associated with these graphs. In such tasks, it has been observed that stronger GNNs do not necessarily outperform weaker GNNs. Therefore, it seems that the design of GNNs is driven by theoretical considerations which are not realistic in practical settings. Ideally, we would like to learn representations which accurately capture the similarities or distances between graphs. A practitioner can then choose an algorithm based on its empirical performance. Indeed, GNNs and graph kernels are usually evaluated on standard datasets derived from bio-/chemo-informatics and from social media (Morris et al., 2020). However, several concerns have been raised recently with regard to the reliability of those datasets, mainly due to their small size and to inherent isomorphism bias problems (Ivanov et al., 2019). More importantly, it has been observed that the adopted experimental settings are in many cases ambiguous or not reproducible (Errica et al., 2020). The experimental setup is not standardized across different works, and there are often many issues related to hyperparameter tuning and to how model selection and model assessment are performed. These issues easily generate doubts and confusion among practitioners who need a fully transparent and reproducible experimental setting.
Present work. In this paper, we empirically evaluate the expressive power of GNNs and graph kernels. Specifically, we build a dataset that contains instances of different families of graphs. Then, we compare the graph representations and similarities produced by GNNs and graph kernels against those generated by an intractable graph similarity function, which we consider to be an oracle function that outputs the true similarity between graphs. First, we perform a large number of experiments where we compare several different kernels, architectures, and pooling functions. Second, we study the impact of node attributes on the performance of the different models and kernels. We show that annotating the nodes with their degree and/or triangle participation can be beneficial in terms of performance in the case of GNNs, while it is not very useful in the case of graph kernels. Finally, we investigate which pairs of graphs (from our dataset) lead GNNs and kernels to the highest error in the estimated similarity. Surprisingly, we find that several GNNs and kernels assign identical or similar representations to very dissimilar graphs. We publicly release code and dataset to reproduce our results, in order to allow other researchers to conduct similar studies¹.
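The evaluation described above can be sketched as follows; the similarity matrices and the correlation-based agreement score are hypothetical placeholders standing in for the paper's oracle function and its actual comparison metric.

```python
import numpy as np

def similarity_agreement(oracle_sim, model_sim):
    """Compare a model's pairwise graph similarities to an oracle's.

    Both arguments are symmetric n x n similarity matrices (placeholders
    for the intractable oracle and for a GNN or kernel). Returns the
    Pearson correlation over the strict upper-triangular entries: values
    near 1 mean the model orders graph pairs like the oracle does.
    """
    iu = np.triu_indices_from(oracle_sim, k=1)
    return np.corrcoef(oracle_sim[iu], model_sim[iu])[0, 1]

# Toy example with 3 graphs: a linearly rescaled similarity matrix
# agrees perfectly with the oracle under correlation.
oracle = np.array([[1.0, 0.8, 0.1],
                   [0.8, 1.0, 0.2],
                   [0.1, 0.2, 1.0]])
model = 0.5 * oracle
print(round(similarity_agreement(oracle, model), 3))  # 1.0
```

Only the off-diagonal entries are compared, since every method trivially assigns maximal self-similarity; a rank-based statistic could be substituted when only the ordering of pairs matters.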

2. RELATED WORK

Over the past years, the expressiveness of graph kernels was assessed almost exclusively through experimental studies. Therefore, there is still no theoretical justification for why certain graph kernels perform better than others. From the early days of the field, it was clear though that the mapping induced by kernels that are computable in polynomial time is not injective (and thus these kernels cannot solve the graph isomorphism problem) (Gärtner et al., 2003). Recently, Kriege et al. (2018) proposed a framework to measure the expressiveness of graph kernels based on ideas from property testing. It was shown that some well-established graph kernels such as the shortest path kernel, the graphlet kernel, and the Weisfeiler-Lehman subtree kernel cannot identify basic graph properties such as planarity or bipartiteness. With the exception of the work of Scarselli et al. (2008), until recently, there had been little attempt to understand the expressive power of GNNs. Several recent studies have investigated the connections between GNNs and the Weisfeiler-Lehman (WL) test of isomorphism and its higher-order variants. For instance, it was shown that standard GNNs do not have more power in terms of distinguishing between non-isomorphic graphs than the WL algorithm (Morris et al., 2019; Xu et al., 2018). Morris et al. (2019) proposed a family of GNNs which rely on a message passing scheme between subgraphs of cardinality k, and which have exactly the same power in terms of distinguishing non-isomorphic graphs as the set-based variant of the k-WL algorithm. In a similar spirit, Maron et al. (2019a) introduced k-order graph networks which are at least as powerful as the folklore variant of the k-WL graph isomorphism test in terms of distinguishing non-isomorphic graphs. These models were also shown to be universal (Maron et al., 2019c; Keriven & Peyré, 2019), but require using high order tensors and are therefore not practical. Chen et al. (2019) show that the two main approaches for studying the expressive power of GNNs, namely graph isomorphism testing and invariant function



¹Code available at: https://github.com/xxxxx/xxxxx/

