COMBINING LABEL PROPAGATION AND SIMPLE MODELS OUTPERFORMS GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs) are a predominant technique for learning over graphs. However, there is relatively little understanding of why GNNs are successful in practice and whether they are necessary for good performance. Here, we show that for many standard transductive node classification benchmarks, we can exceed or match the performance of state-of-the-art GNNs by combining shallow models that ignore the graph structure with two simple post-processing steps that exploit correlation in the label structure: (i) an "error correlation" that spreads residual errors in training data to correct errors in test data and (ii) a "prediction correlation" that smooths the predictions on the test data. We call this overall procedure Correct and Smooth (C&S), and the post-processing steps are implemented via simple modifications to standard label propagation techniques that have long been used in graph-based semi-supervised learning. Our approach exceeds or nearly matches the performance of state-of-the-art GNNs on a wide variety of benchmarks, with just a small fraction of the parameters and orders of magnitude faster runtime. For instance, we exceed the best-known GNN performance on the OGB-Products dataset with 137 times fewer parameters and greater than 100 times less training time. The performance of our methods highlights how directly incorporating label information into the learning algorithm (as is common in traditional methods) yields easy and substantial performance gains. We can also incorporate our techniques into big GNN models, providing modest gains in some cases.

1. INTRODUCTION

Following the success of neural networks in computer vision and natural language processing, there are now a wide range of graph neural networks (GNNs) for making predictions involving relational data (Battaglia et al., 2018; Wu et al., 2020). These models have had much success and sit atop leaderboards such as the Open Graph Benchmark (Hu et al., 2020). Often, the methodological developments for GNNs revolve around creating strictly more expressive architectures than basic variants such as the Graph Convolutional Network (GCN) (Kipf & Welling, 2017) or GraphSAGE (Hamilton et al., 2017a); examples include Graph Attention Networks (Veličković et al., 2018), Graph Isomorphism Networks (Xu et al., 2018), and various deep models (Li et al., 2019; Rong et al., 2019; Chen et al., 2020). Many ideas for new GNN architectures are adapted from new architectures in models for language (e.g., attention) or vision (e.g., deep CNNs), with the hope that success will translate to graphs. However, as these models become more complex, understanding their performance gains is a major challenge, and scaling them to large datasets is difficult.

Here, we see how far we can get by combining much simpler models, with an emphasis on understanding where there are easy opportunities for performance improvements in graph learning, particularly transductive node classification. We propose a simple pipeline with three main parts (Figure 1): (i) a base prediction made with node features that ignores the graph structure (e.g., a shallow multi-layer perceptron or just a linear model); (ii) a correction step, which propagates uncertainties from the training data across the graph to correct the base prediction; and (iii) a smoothing of the predictions over the graph.
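These three steps can be sketched in a few lines of NumPy. This is a simplified illustration under assumed conventions (a dense symmetric adjacency matrix, one-hot training labels, a fixed number of propagation iterations, and no scaling of the propagated residual); the full method differs in such details.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization S = D^{-1/2} A D^{-1/2} of an adjacency matrix."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def propagate(S, X0, alpha=0.8, iters=50):
    """Iterate X <- alpha * S X + (1 - alpha) * X0 (label-propagation-style diffusion)."""
    X = X0.copy()
    for _ in range(iters):
        X = alpha * (S @ X) + (1 - alpha) * X0
    return X

def correct_and_smooth(A, base_pred, Y_train, train_mask, alpha1=0.8, alpha2=0.8):
    """(i) take base predictions, (ii) correct with propagated training residuals,
    (iii) smooth the corrected predictions with training labels clamped."""
    S = normalize_adj(A)
    # (ii) Correct: residual errors are known only on training nodes; diffuse them.
    E = np.zeros_like(base_pred)
    E[train_mask] = Y_train[train_mask] - base_pred[train_mask]
    Z = base_pred + propagate(S, E, alpha1)
    # (iii) Smooth: clamp training labels, then smooth over the graph.
    G = Z.copy()
    G[train_mask] = Y_train[train_mask]
    return propagate(S, G, alpha2)
```

On a toy graph of two cliques with one labeled node each, and a base prediction that is identical on every node (as in Figure 1), the correction and smoothing steps alone recover the cluster labels.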
Steps (ii) and (iii) are post-processing steps implemented with classical methods for graph-based semi-supervised learning, namely label propagation techniques (Zhu, 2005). With a few modifications and a new deployment of these classic ideas, we achieve state-of-the-art performance on several node classification tasks, outperforming big GNN models. In our framework, the graph structure is not used to learn parameters (which happens only in step (i)) but instead as a post-processing mechanism. This simplicity leads to models with orders of magnitude fewer parameters that take orders of magnitude less time to train and can easily scale to large graphs. We can also combine our ideas with state-of-the-art GNNs, although the performance gains are modest.

A major source of our performance improvements is directly using labels for predictions. This idea is not new: early diffusion-based semi-supervised learning algorithms on graphs such as the spectral graph transducer (Joachims, 2003), Gaussian random field models (Zhu et al., 2003), and label spreading (Zhou et al., 2004) all use this idea. However, the motivation for these methods was semi-supervised learning on point cloud data, so the "node features" were used to construct the graph itself. Since then, these techniques have been used for learning on relational data consisting of a graph and some labels but no node features (Koutra et al., 2011; Gleich & Mahoney, 2015; Peel, 2017; Chin et al., 2019); however, they have largely been ignored in the context of GNNs. (That being said, we still find that even simple label propagation, which ignores features, does surprisingly well on a number of benchmarks.) This motivates combining two orthogonal sources of prediction power: one coming from the node features (ignoring graph structure) and one coming from using the known labels directly in predictions.
Recent research connects GNNs to label propagation (Wang & Leskovec, 2020; Jia & Benson, 2020; 2021) as well as to Markov random fields (Qu et al., 2019; Gao et al., 2019), and some techniques incorporate label information into the features in an ad hoc manner (Shi et al., 2020). However, these approaches are usually still expensive to train, while we use label propagation in two understandable and low-cost ways. We start with a cheap "base prediction" from a model that uses only node features and ignores the graph structure. Afterwards, we use label propagation for error correction and then to smooth the final predictions. These post-processing steps are based on the fact that errors and labels on connected nodes tend to be positively correlated. Assuming similarity between connected nodes is at the center of much network analysis and corresponds to homophily or assortative mixing (McPherson et al., 2001; Newman, 2003; Easley & Kleinberg, 2010). In the semi-supervised learning literature, the analog is the smoothness or cluster assumption (Chapelle et al., 2003; Zhu, 2005). The good performance of label propagation that we see across a wide variety of datasets suggests that these correlations hold on common benchmarks.
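For concreteness, the label spreading update of Zhou et al. (2004), F ← αSF + (1 − α)Y with S = D^{−1/2}AD^{−1/2}, can be written in a few lines of NumPy. This is a minimal sketch under assumed conventions (a dense symmetric adjacency matrix, one-hot rows in Y for labeled nodes and zero rows elsewhere), not the exact setup of our experiments.

```python
import numpy as np

def label_spreading(A, Y, alpha=0.9, iters=50):
    """Label spreading (Zhou et al., 2004) on a symmetric adjacency matrix.

    A: (n, n) adjacency matrix.
    Y: (n, c) labels; one-hot rows for labeled nodes, zero rows otherwise.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # D^{-1/2} A D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(iters):
        # Diffuse current beliefs over edges, then pull back toward the known labels.
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F
```

Predictions for unlabeled nodes are read off as the argmax of each row of F; no node features and no learned parameters are involved.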



One of the main methods that we use (Zhou et al., 2004) is often called label spreading. The term "label propagation" is used in a variety of contexts (Zhu, 2005; Wang & Zhang, 2007; Raghavan et al., 2007; Gleich & Mahoney, 2015). The salient point for this paper is that we assume positive correlations on neighboring nodes and that the algorithms work by "propagating" information from one node to another.



Figure 1: Illustration of our GNN-free model, Correct and Smooth (C&S), with a toy example. Nodes in the left and right clusters have different labels, marked by color (orange or blue). We use a multilayer perceptron (MLP) for base predictions, ignoring the graph structure. We assume this gives the same prediction on all nodes in this example (which could happen if, e.g., all nodes had the same features). After, base predictions are corrected by propagating errors from the training data. Finally, corrected predictions are smoothed with label propagation.

