THE SURPRISING POWER OF GRAPH NEURAL NETWORKS WITH RANDOM NODE INITIALIZATION

Abstract

Graph neural networks (GNNs) are effective models for representation learning on graph-structured data. However, standard GNNs are limited in their expressive power, as they cannot distinguish graphs beyond the capability of the Weisfeiler-Leman (1-WL) graph isomorphism heuristic. This limitation motivated a large body of work, including higher-order GNNs, which are provably more powerful models. To date, higher-order invariant and equivariant networks are the only models with known universality results, but these results are practically hindered by prohibitive computational complexity. Thus, despite their limitations, standard GNNs are commonly used, due to their strong practical performance. In practice, GNNs have shown promising performance when enhanced with random node initialization (RNI), where the idea is to train and run the models with randomized initial node features. In this paper, we analyze the expressive power of GNNs with RNI, and pose the following question: are GNNs with RNI more expressive than GNNs? We prove that this is indeed the case, by showing that GNNs with RNI are universal, a first such result for GNNs not relying on computationally demanding higher-order properties. We then empirically analyze the effect of RNI on GNNs, based on carefully constructed datasets. Our empirical findings support the superior performance of GNNs with RNI over standard GNNs. In fact, we demonstrate that the performance of GNNs with RNI is often comparable with or better than that of higher-order GNNs, while keeping the much lower memory requirements of standard GNNs. However, this improvement typically comes at the cost of slower model convergence. Somewhat surprisingly, we find that the convergence rate and the accuracy of the models can be improved by using only a partial random initialization regime.

1. INTRODUCTION

Graph neural networks (GNNs) (Scarselli et al., 2009; Gori et al., 2005) are neural architectures designed for learning functions over graph-structured data, and naturally encode desirable properties such as permutation invariance (resp., equivariance) relative to graph nodes, and node-level computation based on message passing between these nodes. These properties provide GNNs with a strong inductive bias, enabling them to effectively learn and combine both local and global graph features (Battaglia et al., 2018). As a result, GNNs have been applied to a multitude of tasks, ranging from protein classification (Gilmer et al., 2017) and synthesis (You et al., 2018), through protein-protein interaction (Fout et al., 2017) and social network analysis (Hamilton et al., 2017), to recommender systems (Ying et al., 2018) and combinatorial optimization (Bengio et al., 2018; Selsam et al., 2019). However, popular GNN architectures, primarily based on message passing (MPNNs), are limited in their expressive power. In particular, MPNNs are at most as powerful as the Weisfeiler-Leman (1-WL) graph isomorphism heuristic (Morris et al., 2019; Xu et al., 2019), and thus cannot distinguish several families of non-isomorphic graphs, e.g., sets of regular graphs (Cai et al., 1992). To address this limitation, alternative GNN architectures with provably higher expressive power than MPNNs have been proposed. These models, which we refer to as higher-order GNNs, are inspired by the more powerful generalization of 1-WL to k-tuples of nodes, known as k-WL (Grohe, 2017). Higher-order GNNs are the only GNNs with established universality results, but they are computationally very demanding. As a result, MPNNs, despite their limited expressiveness, remain the standard GNN model for graph learning applications.
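The 1-WL limitation on regular graphs can be seen concretely with a minimal colour-refinement sketch (not from the paper; the helper name `wl_refine` and the two example graphs are our own illustration). On any two regular graphs of the same degree and order, 1-WL assigns identical colour histograms, even when the graphs are non-isomorphic:

```python
from collections import Counter

def wl_refine(adj, rounds=3):
    """1-WL colour refinement: repeatedly hash each node's colour
    together with the sorted multiset of its neighbours' colours."""
    colours = {v: 0 for v in adj}  # uniform initial colouring
    for _ in range(rounds):
        colours = {v: hash((colours[v], tuple(sorted(colours[u] for u in adj[v]))))
                   for v in adj}
    return Counter(colours.values())  # colour histogram of the graph

# Two non-isomorphic 3-regular graphs on 6 nodes:
# K_{3,3} (triangle-free) vs. the triangular prism (contains triangles).
k33   = {v: list(range(3, 6)) if v < 3 else list(range(3)) for v in range(6)}
prism = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5],
         3: [0, 4, 5], 4: [1, 3, 5], 5: [2, 3, 4]}

# Every node keeps the same colour in both graphs, so 1-WL (and hence
# any standard MPNN) cannot tell them apart.
print(wl_refine(k33) == wl_refine(prism))  # True
```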
In a parallel development, MPNNs have recently achieved significant empirical improvements using random node initialization (RNI), through which initial graph node embeddings are randomly set. Indeed, RNI has enabled MPNNs to distinguish instances that 1-WL cannot, and is proven to enable better approximation of a class of combinatorial problems (Sato et al., 2020). However, the effect of RNI on the expressive power of GNNs has not yet been comprehensively studied, and its impact on the inductive capacity and learning ability of GNNs remains unclear. In this paper, we thoroughly study the impact of RNI on MPNNs. First, we prove that MPNNs enhanced with RNI are universal, in the sense that they can approximate every function defined on graphs of any fixed order. This follows from a logical characterization of the expressiveness of MPNNs (Barceló et al., 2020) combined with an argument on order-invariant definability. Our result contrasts sharply with the 1-WL limitations of deterministic MPNNs, and provides a foundation for developing highly expressive and memory-efficient MPNN models. To empirically verify our theoretical findings, we carry out a careful empirical study to quantify the practical impact of RNI. To this end, we design EXP, a synthetic dataset requiring 2-WL expressive power for models to achieve above-random performance, and run MPNNs with RNI on it, to observe how well and how easily this model can learn and generalize on this dataset. Then, we propose CEXP, a modification of EXP with partially 1-WL-distinguishable data, and evaluate the same questions in this more varied setting. Overall, the contributions of this paper are as follows:
- We prove that MPNNs with RNI are universal, a significant improvement over the 1-WL limit of standard MPNNs and, to our knowledge, a first universality result for memory-efficient GNNs.
- We introduce two carefully designed datasets, EXP and CEXP, based on graph pairs only distinguishable by 2-WL or higher, to rigorously evaluate the impact of RNI.
- Using these datasets, we thoroughly analyze the effects of RNI on MPNNs, and observe that (i) MPNNs with RNI can closely match the performance of higher-order GNNs, (ii) the improved performance of MPNNs with RNI comes at the cost of slower convergence (compared to higher-order GNNs), and (iii) using a partial random initialization regime over node features typically improves the convergence rate and the accuracy of the models.
- We additionally perform the same experiments with analogous, sparser datasets and longer training, and observe similar behavior, but more volatility.
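The RNI scheme discussed above can be sketched in a few lines. This is a minimal illustration under our own assumptions (the helper name `with_rni`, Gaussian noise, and the number of random dimensions are not prescribed by the paper): each forward pass, at training and at test time, appends freshly sampled random features to every node's deterministic features.

```python
import numpy as np

def with_rni(features, num_random_dims=4, rng=None):
    """Augment an (n_nodes, d) feature matrix with num_random_dims
    freshly sampled random dimensions per node. Resampling on every
    forward pass is what makes the model invariant in expectation."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=(features.shape[0], num_random_dims))
    return np.concatenate([features, noise], axis=1)

x = np.ones((5, 3))   # 5 nodes, 3 deterministic features each
x_rni = with_rni(x)   # shape (5, 7): original features + 4 random dims
```

A partial-RNI regime, as in contribution (iii), would randomize only a subset of nodes or dimensions rather than all of them.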

2. GRAPH NEURAL NETWORKS

Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2009) are neural architectures dedicated to learning functions over graph-structured data. In a GNN, nodes in the input graph are assigned vector representations, which are updated iteratively through a series of invariant or equivariant computational layers. We recall message passing neural networks (MPNNs) (Gilmer et al., 2017), a popular family of GNN models, and their expressive power in relation to the Weisfeiler-Leman graph isomorphism heuristic. We discuss alternative GNN models in Section 3; for a broader coverage, we refer the reader to the literature (Hamilton, 2020). In MPNNs, nodes aggregate messages from their neighboring nodes, and use this information to iteratively update their representations. Formally, given a node x, its vector representation v_{x,t} at time t, and its neighborhood N(x), a message passing update can be written as:

v_{x,t+1} = combine(v_{x,t}, aggregate({{v_{y,t} : y ∈ N(x)}})),

where combine and aggregate are functions, and aggregate is typically permutation-invariant. Once message passing is complete, the final node representations are used to compute target outputs. Prominent message passing GNN architectures include graph convolutional networks (GCNs) (Kipf & Welling, 2017) and gated graph neural networks (GGNNs) (Li et al., 2016). It is well-known that standard MPNNs have the same power as the 1-dimensional Weisfeiler-Leman algorithm (1-WL) (Xu et al., 2019; Morris et al., 2019). This entails that two nodes in a graph cannot be distinguished by an MPNN if 1-WL does not distinguish them, and neither can two graphs be distinguished if 1-WL cannot distinguish them.
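The update rule above can be instantiated as a minimal sketch (our own illustration: sum aggregation and a single linear-plus-ReLU combine are one concrete choice among many; the function name and weight matrices are hypothetical):

```python
import numpy as np

def message_passing_step(h, adj, W_self, W_agg):
    """One MPNN update: h[x] <- combine(h[x], aggregate({{h[y] : y in N(x)}})),
    with permutation-invariant sum aggregation and a linear + ReLU combine."""
    agg = np.zeros_like(h)
    for x, neighbours in adj.items():
        for y in neighbours:
            agg[x] += h[y]  # sum over the multiset of neighbour messages
    return np.maximum(0.0, h @ W_self + agg @ W_agg)  # combine, then ReLU

# Toy usage: a triangle (nodes 0, 1, 2, all adjacent), 2-d node features.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
h = np.eye(3, 2)                                  # initial representations
h_next = message_passing_step(h, adj, np.eye(2), 0.5 * np.eye(2))
```

Because the sum over N(x) is insensitive to the order of neighbours, the whole update is permutation-invariant, matching the requirement on aggregate stated above.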

