A CRITICAL LOOK AT THE EVALUATION OF GNNS UNDER HETEROPHILY: ARE WE REALLY MAKING PROGRESS?

Abstract

Node classification is a classical graph representation learning task on which Graph Neural Networks (GNNs) have recently achieved strong results. However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs. In this work, we challenge this assumption. First, we show that the standard datasets used for evaluating heterophily-specific models have serious drawbacks, making results obtained by using them unreliable. The most significant of these drawbacks is the presence of a large number of duplicate nodes in the datasets squirrel and chameleon, which leads to train-test data leakage. We show that removing duplicate nodes strongly affects GNN performance on these datasets. Then, we propose a set of heterophilous graphs of varying properties that we believe can serve as a better benchmark for evaluating the performance of GNNs under heterophily. We show that standard GNNs achieve strong results on these heterophilous graphs, almost always outperforming specialized models. Our datasets and the code for reproducing our experiments are available at https://github.com/yandex-research/heterophilous-graphs.

1. INTRODUCTION

The field of machine learning on graph-structured data has recently attracted a lot of attention, with Graph Neural Networks (GNNs) achieving particularly strong results on most graph tasks. Thus, using GNNs has become a de facto standard approach to graph machine learning, and many versions of GNNs have been proposed in the literature (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Xu et al., 2019), most of them falling under a general Message Passing Neural Networks (MPNNs) framework (Gilmer et al., 2017). MPNNs learn node representations by an iterative neighborhood-aggregation process, where each layer updates each node's representation by combining previous-layer representations of the node itself and its neighbors. The node feature vector is used as the initial node representation. Thus, MPNNs combine node features with graph topology, allowing them to learn complex dependencies between nodes.

In many real-world networks, edges tend to connect similar nodes. This property is called homophily. Typical examples of homophilous networks are social networks, where users tend to connect to users with similar interests, and citation networks, where papers mostly cite works from the same research area. The opposite of homophily is called heterophily: this property describes the preference of network nodes to connect to nodes not similar to them. For example, in financial transaction networks, fraudsters often perform transactions with non-fraudulent users, and in dating networks, most connections are between people of opposite genders.

Early works on GNNs mostly evaluated their models on homophilous graphs. This has led to claims that GNNs implicitly rely on the homophily of a graph and are thus not suitable for heterophilous datasets (Zhu et al., 2020; 2021; He et al., 2022; Wang et al., 2022). Recently, many works have proposed new GNN models specifically designed for heterophilous graphs that are claimed to outperform standard GNNs.
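To make the neighborhood-aggregation process concrete, here is a minimal sketch of one message-passing step in plain Python. The function name, the dictionary-based graph representation, and the mean aggregator are illustrative choices, not part of any specific model from the literature; real MPNNs use learned weight matrices and nonlinearities.

```python
def mpnn_layer(features, adj, combine):
    """One message-passing step: every node aggregates the previous-layer
    representations of its neighbors (mean aggregation here) and combines
    the result with its own previous-layer representation."""
    new_features = {}
    for v, h_v in features.items():
        neighbors = [features[u] for u in adj[v]]
        if neighbors:
            # element-wise mean of the neighbors' representations
            agg = [sum(xs) / len(neighbors) for xs in zip(*neighbors)]
        else:
            agg = [0.0] * len(h_v)
        new_features[v] = combine(h_v, agg)
    return new_features

# Example combine function: element-wise sum of ego and aggregated messages.
combine_sum = lambda h, m: [a + b for a, b in zip(h, m)]
```

Stacking several such layers lets information propagate over multi-hop neighborhoods, which is how MPNNs couple node features with graph topology.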
However, these models are typically evaluated on the same six heterophilous graphs first used in the context of learning under heterophily by Pei et al. (2020). In this work, we challenge this evaluation setting. We highlight several downsides of the standard heterophilous datasets, such as low diversity, small size, extreme class imbalance of some datasets, and, most importantly, the presence of a large number of duplicate nodes in the squirrel and chameleon datasets. We show that models rely on the train-test data leakage introduced by duplicated nodes to achieve strong results, and that removing these nodes significantly affects the performance of the models.

Motivated by the shortcomings of the currently used heterophilous benchmarks, we collect a set of diverse heterophilous graphs and propose to use them as a better benchmark. The proposed datasets come from different domains and exhibit a variety of structural properties. We evaluate a wide range of GNNs, both standard and heterophily-specific, on the proposed benchmark, which, to the best of our knowledge, constitutes the most extensive empirical study of heterophily-specific models. In doing so, we uncover that the standard baselines almost always outperform heterophily-specific models. Thus, the progress in learning under heterophily might have been limited to the standard datasets used for evaluation. Our results also show that there is, however, a trick that is useful for learning on heterophilous graphs: separating ego- and neighbor-embeddings, which was proposed in Zhu et al. (2020). This trick consistently improves the baselines (such as GAT and Graph Transformer) and allows one to achieve the best results. We hope that the proposed benchmark will be helpful for further progress in learning under heterophily.
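The ego-/neighbor-embedding separation mentioned above can be sketched as follows. This is a simplified illustration in the spirit of Zhu et al. (2020), not their exact layer: the function and weight names are hypothetical, and the key point is only that the node's own embedding and the aggregated neighbor embedding pass through different transformations and are concatenated rather than summed into a single term.

```python
def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sep_embed_layer(features, adj, W_ego, W_neigh):
    """A layer that keeps ego- and neighbor-embeddings separate: the node's
    own representation and the mean of its neighbors' representations are
    transformed by different weight matrices and concatenated."""
    out = {}
    for v, h_v in features.items():
        neighbors = [features[u] for u in adj[v]] or [[0.0] * len(h_v)]
        mean_n = [sum(xs) / len(neighbors) for xs in zip(*neighbors)]
        out[v] = matvec(W_ego, h_v) + matvec(W_neigh, mean_n)  # concatenation
    return out
```

Under heterophily this separation matters because a node's own features and its neighbors' features may carry very different (even opposite) label signals, and concatenation lets the model weight them independently instead of blending them.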

2. RELATED WORK

Measuring homophily. While much effort has been put into developing graph representation learning methods for heterophilous graphs, there is no universally agreed-upon measure of homophily. Homophily measures typically used in the literature are edge homophily (Abu-El-Haija et al., 2019; Zhu et al., 2020), which is simply the fraction of edges that connect nodes of the same class, and node homophily (Pei et al., 2020), which computes the proportion of neighbors that have the same class for each node and then averages these values across all nodes. These two measures are simple and intuitive; however, as shown in Lim et al. (2021) and Platonov et al. (2022), they are sensitive to the number of classes and their balance, which makes these measures hard to interpret and incomparable across different datasets. To fix these issues, Lim et al. (2021) propose another homophily measure. However, Platonov et al. (2022) show that it can also provide unreliable results. To solve the issues with existing measures, Platonov et al. (2022) propose to use adjusted homophily, which corrects the number of intra-class edges by their expected value. Thus, adjusted homophily becomes insensitive to the number of classes and their balance. Platonov et al. (2022) show that adjusted homophily satisfies a number of desirable properties, which makes it appropriate for comparing homophily levels between different datasets. Thus, in our work, we will use adjusted homophily for measuring homophily of graphs.

Graph datasets. Early works on GNNs mostly evaluated their models on highly homophilous graphs. The most popular of them are three citation networks: cora, citeseer, and pubmed (Giles et al., 1998; McCallum et al., 2000; Namata et al., 2012; Sen et al., 2008; Yang et al., 2016).
Examples of other graph datasets for node classification that appear in the literature include the co-authorship networks coauthor-cs and coauthor-physics and the co-purchasing networks amazon-computers and amazon-photo from Shchur et al. (2018), as well as the discussion network reddit from Hamilton et al. (2017). These datasets also have high levels of homophily. Recently, the Open Graph Benchmark (Hu et al., 2020) was created to provide challenging large-scale graphs for evaluating GNN performance. The proposed datasets, such as ogbn-arxiv, ogbn-products, and ogbn-papers100M, are also highly homophilous (Zhu et al., 2020; Platonov et al., 2022).
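The homophily measures discussed above can be sketched in a few lines. Edge and node homophily follow directly from their definitions; for adjusted homophily, the formula below (edge homophily minus its expected value under a degree-preserving null model, rescaled to at most 1) reflects our reading of Platonov et al. (2022) and should be checked against that paper before serious use. Function names and the edge-list/adjacency-dict representations are illustrative choices.

```python
from collections import Counter

def edge_homophily(edges, labels):
    """Fraction of (undirected, each listed once) edges connecting
    same-class nodes."""
    return sum(labels[u] == labels[v] for u, v in edges) / len(edges)

def node_homophily(adj, labels):
    """Per-node fraction of same-class neighbors, averaged over all
    nodes that have at least one neighbor."""
    fractions = [sum(labels[u] == labels[v] for u in adj[v]) / len(adj[v])
                 for v in adj if adj[v]]
    return sum(fractions) / len(fractions)

def adjusted_homophily(edges, labels):
    """Edge homophily corrected by its expected value under a null model
    that preserves node degrees, making values comparable across datasets
    with different numbers and balances of classes."""
    class_degree = Counter()  # D_k: total degree of nodes in class k
    for u, v in edges:
        class_degree[labels[u]] += 1
        class_degree[labels[v]] += 1
    two_e = 2 * len(edges)
    expected = sum((d / two_e) ** 2 for d in class_degree.values())
    return (edge_homophily(edges, labels) - expected) / (1 - expected)
```

Note that adjusted homophily can be negative (a heterophilous graph has fewer intra-class edges than expected by chance), whereas edge and node homophily are bounded below by zero, which is one reason the raw measures are hard to compare across datasets.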

