BREAKING THE EXPRESSIVE BOTTLENECKS OF GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Recently, the Weisfeiler-Lehman (WL) graph isomorphism test has been used to measure the expressiveness of graph neural networks (GNNs), showing that neighborhood-aggregation GNNs are at most as powerful as the 1-WL test in distinguishing graph structures. Improvements in analogy to the k-WL test (k > 1) have also been proposed. However, the aggregators in these GNNs are far from injective as required by the WL test, and their weak distinguishing strength makes them expressive bottlenecks. In this paper, we improve expressiveness by exploring powerful aggregators. We reformulate aggregation with the corresponding aggregation coefficient matrix, and then systematically analyze the requirements this matrix must meet to build more powerful, and even injective, aggregators. This analysis can also be viewed as a strategy for preserving the rank of hidden features, and it implies that the basic aggregators correspond to a special case of low-rank transformations. We also show the necessity of applying nonlinear units ahead of aggregation, which differs from most aggregation-based GNNs. Based on our theoretical analysis, we develop two GNN layers, ExpandingConv and CombConv. Experimental results show that our models significantly boost performance, especially on large and densely connected graphs.

1. INTRODUCTION

Graphs are ubiquitous in the real world. Social networks, traffic networks, knowledge graphs, and molecular structures are typical graph-structured data. Graph Neural Networks (GNNs) (Scarselli et al., 2008; Gori et al., 2005), which bring the power of neural networks to graph-structured data, have developed rapidly in recent years (Kipf & Welling, 2016; Hamilton et al., 2017; Bronstein et al., 2017; Gilmer et al., 2017; Duvenaud et al., 2015). The expressive power of a GNN measures its ability to represent different graph structures (Sato, 2020). It determines the performance of GNNs wherever awareness of graph structure is required, especially on large graphs with complex topologies. The neighborhood aggregation scheme (or message passing) follows the same pattern as the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) to encode graph structures: node representations are computed iteratively by aggregating the transformed representations of each node's neighbors, with structural information learned implicitly. Therefore, the WL test is used to measure the expressiveness of GNNs. Unfortunately, general GNNs are at most as powerful as the 1-WL test (Morris et al., 2019; Xu et al., 2019). There is also work that tries to go beyond the 1-WL test (Maron et al., 2019; Morris et al., 2019; Chen et al., 2019; Li et al., 2020b; Vignac et al., 2020). However, the weak distinguishing strength of aggregators remains the fundamental limitation: the expressiveness analysis based on the WL test assumes that aggregators are injective, which is rarely achieved in practice. This motivates us to investigate the following questions: What are the key factors that limit the expressiveness of GNNs, and how can these limitations be broken? Aggregators are permutation-invariant functions that operate on (multi)sets.
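The 1-WL test referred to above can be sketched as a simple color-refinement procedure. The following is a minimal illustrative implementation (not the paper's code); graph representation and round count are our own choices:

```python
from collections import Counter

def wl_refinement(adj, rounds=3):
    """One-dimensional Weisfeiler-Lehman (color refinement) sketch.

    adj: dict mapping each node to a list of its neighbors.
    Returns the multiset (Counter) of final node colors; two graphs
    with different color histograms are certainly non-isomorphic.
    """
    colors = {v: 0 for v in adj}  # uniform initial coloring
    for _ in range(rounds):
        # New signature = own color plus the sorted multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Relabel distinct signatures with small integers (injective relabeling).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return Counter(colors.values())

# A triangle and a path on three nodes get different color histograms.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
assert wl_refinement(triangle) != wl_refinement(path)
```

The key step a GNN layer mimics is the signature computation: each node combines its own state with the multiset of its neighbors' states, and the relabeling is injective on signatures, which is exactly the property that practical aggregators fail to guarantee.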
Zaheer et al. (2017) first theoretically studied permutation-invariant functions and provided a family of functions to which any permutation-invariant function must belong. Xu et al. (2019) extended this result to multisets, but only over countable spaces, and Corso et al. (2020) further extended it to uncountable spaces. Murphy et al. (2018; 2019) expressed a permutation-invariant function by approximating an average over permutation-sensitive functions with tractability strategies. Dehmamy et al. (2019) showed that the single propagation rule applied in general GNNs is rather restrictive in learning graph moments (Lin & Skiena, 1995). They, together with Corso et al. (2020), improved the distinguishing strength of aggregation by combining multiple basic aggregators (SUM, MEAN, NORMALIZED MEAN, MAX/MIN, and STD); this strategy proved effective on tasks taken from classical graph theory. In contrast to existing studies of aggregators in GNNs, we provide a new GNN formulation in which aggregation is represented as the product of the hidden feature matrix of a node's neighbors and an aggregation coefficient matrix. This formulation enables us to answer the following questions: (i) when a GNN loses its expressive power; (ii) how to build aggregators with higher distinguishing strength, and even injective aggregators. Based on our theoretical analysis, we propose two GNN layers, ExpandingConv and CombConv, and evaluate them on general graph classification and graph regression tasks. Our key contributions are summarized as follows:

• We formalize the distinguishing strength of aggregators as a partial order, and theoretically show that the choice of aggregator can be a bottleneck of expressiveness. We also propose applying nonlinear units ahead of aggregation to break the distinguishing-strength limitations of aggregators as well as to achieve an implicit sampling mechanism.
• We reformulate neighborhood aggregation with the aggregation coefficient matrix and provide a theoretical point of view on building powerful, and even injective, aggregators.
• We propose the ExpandingConv and CombConv layers, which achieve state-of-the-art performance on a variety of graph tasks. We also show that multi-head GAT is one implementation of ExpandingConv, which brings a theoretical explanation for its effectiveness.
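The coefficient-matrix view of aggregation can be made concrete: stacking the neighbors' hidden features into a matrix H, SUM and MEAN correspond to fixed coefficient vectors, while MAX corresponds to an input-dependent (one-hot) coefficient matrix. A minimal numpy sketch of this observation, with illustrative shapes of our own choosing:

```python
import numpy as np

# Hidden features of |N(v)| = 3 neighbors, each of dimension 2.
H = np.array([[1.0, 4.0],
              [2.0, 6.0],
              [3.0, 2.0]])
n = H.shape[0]

# SUM and MEAN are the same fixed linear map for every input:
# a constant coefficient vector multiplied against the stacked features.
c_sum = np.ones((1, n))            # [1, 1, 1]
c_mean = np.full((1, n), 1.0 / n)  # [1/3, 1/3, 1/3]
assert np.allclose(c_sum @ H, H.sum(axis=0))
assert np.allclose(c_mean @ H, H.mean(axis=0))

# MAX is also expressible as coefficients times H, but the coefficient
# matrix depends on the input: one-hot rows selecting the argmax per channel.
C_max = np.zeros((H.shape[1], n))
C_max[np.arange(H.shape[1]), H.argmax(axis=0)] = 1.0
assert np.allclose((C_max @ H).diagonal(), H.max(axis=0))
```

For SUM and MEAN the coefficient matrix has rank one regardless of the neighborhood, which is a concrete instance of the low-rank behavior the contributions describe.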

2. PRELIMINARIES

2.1. NOTATIONS

For a graph $G(V, E)$, we denote the sets of edges, nodes, and node feature vectors by $E_G$, $V_G$, and $X_G$, respectively. $N(v)$ represents the set of neighbors of $v$ including itself, i.e., $N(v) = \{u \in V_G \mid (u, v) \in E_G\} \cup \{v\}$. We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. $\{\{\ldots\}\}$ represents a multiset, i.e., a set with possibly repeating elements. $\Pi_n$ represents the set of all permutations of the integers $1$ to $n$; $h_\pi$, where $\pi \in \Pi_{|h|}$, is a reordering of the elements of a sequence $h$ according to $\pi$. Given a matrix $X \in \mathbb{R}^{a \times b}$, $X^T$ represents the transpose of $X$, and $\mathrm{vec}(X) \in \mathbb{R}^{ab \times 1}$ represents the column stack of $X$.

2.2. GRAPH NEURAL NETWORKS

Most GNNs adopt the neighborhood aggregation scheme (Gilmer et al., 2017) to learn node representations, utilizing both node features and graph structure. In the $k$-th layer, the representation of node $v$ is computed as
$$h_v^{(k)} = \mathrm{Update}\big(h_v^{(k-1)}, \mathrm{Aggregate}\big(\{\{h_u^{(k-1)} \mid u \in N(v)\}\}\big)\big).$$
Aggregators in GNNs. An aggregator is a permutation-invariant function (Zaheer et al., 2017) with bounded-size inputs. It satisfies: (i) size insensitivity: an aggregator can take an arbitrary but finite number of inputs; (ii) permutation invariance: an aggregator is invariant to the permutation of its input. There are a limited number of basic aggregators, such as SUM, MEAN, NORMALIZED MEAN, MAX/MIN, and STD, and most proposed GNNs apply one of them. The sum-of-power mapping (Zaheer et al., 2017) and normalized moments (Corso et al., 2020) can also be used as aggregators, and they allow for a variable number of aggregators.
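The update rule above can be sketched as a single neighborhood-aggregation layer. The following is a hypothetical minimal implementation with a SUM aggregator and a one-layer Update; function and parameter names are illustrative, not the paper's models:

```python
import numpy as np

def gnn_layer(H, neighbors, W_self, W_agg):
    """One neighborhood-aggregation step:
    h_v^(k) = ReLU(W_self^T h_v^(k-1) + W_agg^T SUM_{u in N(v)} h_u^(k-1)).

    H: (num_nodes, d_in) node features at layer k-1.
    neighbors: dict v -> list of neighbors N(v) (including v itself).
    W_self, W_agg: (d_in, d_out) weight matrices of the Update step.
    """
    out = np.empty((H.shape[0], W_self.shape[1]))
    for v, nbrs in neighbors.items():
        agg = H[nbrs].sum(axis=0)             # SUM aggregator over N(v)
        out[v] = H[v] @ W_self + agg @ W_agg  # combine self and aggregate
    return np.maximum(out, 0.0)               # ReLU nonlinearity

# Toy graph: a path 0-1-2 with self-loops, one-hot features, identity weights.
H = np.eye(3)
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
I = np.eye(3)
H1 = gnn_layer(H, neighbors, I, I)
assert np.allclose(H1[1], H[1] + H[[0, 1, 2]].sum(axis=0))
```

Note that the nonlinearity here is applied after aggregation, as in most aggregation-based GNNs; the paper's proposal is precisely to move nonlinear units ahead of the aggregation step.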

3. PROPOSED MODEL

In this section, we first formalize the distinguishing strength of aggregators as a partial order, and show why basic aggregators used in popular GNNs become bottlenecks of expressiveness. Then, we analyze the requirements for building powerful aggregators and even injective aggregators. Finally, we introduce two GNN layers based on our theoretical analysis.

