BREAKING THE EXPRESSIVE BOTTLENECKS OF GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Recently, the Weisfeiler-Lehman (WL) graph isomorphism test has been used to measure the expressiveness of graph neural networks (GNNs), showing that neighborhood-aggregation GNNs are at most as powerful as the 1-WL test in distinguishing graph structures. Improvements in analogy to the k-WL test (k > 1) have also been proposed. However, the aggregators in these GNNs are far from injective as required by the WL test and suffer from weak distinguishing strength, making them expressive bottlenecks. In this paper, we improve expressiveness by exploring powerful aggregators. We reformulate aggregation with the corresponding aggregation coefficient matrix and then systematically analyze the requirements on this matrix for building more powerful and even injective aggregators. Our analysis can also be viewed as a strategy for preserving the rank of hidden features, and it implies that basic aggregators correspond to a special case of low-rank transformations. We also show the necessity of applying nonlinear units ahead of aggregation, which differs from most aggregation-based GNNs. Based on our theoretical analysis, we develop two GNN layers, ExpandingConv and CombConv. Experimental results show that our models significantly boost performance, especially on large and densely connected graphs.

1. INTRODUCTION

Graphs are ubiquitous in the real world. Social networks, traffic networks, knowledge graphs, and molecular structures are typical graph-structured data. Graph Neural Networks (GNNs) (Scarselli et al., 2008; Gori et al., 2005), which bring the power of neural networks to graph-structured data, have developed rapidly in recent years (Kipf & Welling, 2016; Hamilton et al., 2017; Bronstein et al., 2017; Gilmer et al., 2017; Duvenaud et al., 2015). The expressive power of a GNN measures its ability to represent different graph structures (Sato, 2020). It determines performance wherever awareness of graph structure is required, especially on large graphs with complex topologies. The neighborhood aggregation scheme (or message passing) follows the same pattern as the Weisfeiler-Lehman (WL) graph isomorphism test (Weisfeiler & Leman, 1968) to encode graph structures: node representations are computed iteratively by aggregating the transformed representations of their neighbors, with structural information learned implicitly. The WL test is therefore used to measure the expressiveness of GNNs. Unfortunately, general GNNs are at most as powerful as the 1-order WL test (Morris et al., 2019; Xu et al., 2019). There is also work on improving expressiveness beyond the 1-order WL test (Maron et al., 2019; Morris et al., 2019; Chen et al., 2019; Li et al., 2020b; Vignac et al., 2020). However, the weak distinguishing strength of aggregators remains the fundamental limitation: the expressiveness analysis based on the WL test assumes that aggregators are injective, which is usually unattainable. This motivates us to investigate the following questions: what are the key factors limiting the expressiveness of GNNs, and how can these limitations be broken? Aggregators are permutation invariant functions operating on sets.
(Zaheer et al., 2017) first theoretically studied permutation invariant functions and provided a family of functions to which any permutation invariant function must belong. (Xu et al., 2019) extended this to multisets, but only for countable spaces, and (Corso et al., 2020) further extended it to uncountable spaces. (Murphy et al., 2018) and (Murphy et al., 2019) expressed a permutation invariant function by approximating an average over permutation-sensitive functions with tractability strategies. (Dehmamy et al., 2019) showed that a single propagation rule, as applied in general GNNs, is rather restrictive in learning graph moments (Lin & Skiena, 1995). They and (Corso et al., 2020) improved the distinguishing strength of aggregation by leveraging multiple basic aggregators (SUM, MEAN, NORMALIZED MEAN, MAX/MIN, and STD), a strategy that proved effective on tasks taken from classical graph theory.

In contrast to existing studies of aggregators in GNNs, we provide a new GNN formulation in which aggregation is represented as the multiplication of the hidden feature matrix of neighbors and an aggregation coefficient matrix. This formulation enables us to answer the following questions: (i) when does a GNN lose its expressive power, and (ii) how can we build aggregators with higher distinguishing strength, even injective ones? Based on our theoretical analysis, we propose two GNN layers, ExpandingConv and CombConv, and evaluate them on general graph classification and graph regression tasks. Our key contributions are summarized as follows:

• We formalize the distinguishing strength of aggregators as a partial order and theoretically show that the choice of aggregators can be a bottleneck of expressiveness. We also propose applying nonlinear units ahead of aggregation to break the distinguishing strength limitations of aggregators, which additionally yields an implicit sampling mechanism.
• We reformulate neighborhood aggregation with the aggregation coefficient matrix and provide a theoretical point of view on building powerful and even injective aggregators.

• We propose the ExpandingConv and CombConv layers, which achieve state-of-the-art performance on a variety of graph tasks. We also show that multi-head GAT is one implementation of ExpandingConv, which provides a theoretical explanation for its effectiveness.

2. PRELIMINARIES

2.1. NOTATIONS

For a graph G(V, E), we denote the sets of edges, nodes and node feature vectors by E_G, V_G and X_G, respectively. N(v) denotes the set of neighbors of v including itself, i.e., N(v) = {u ∈ V_G | (u, v) ∈ E_G} ∪ {v}. We use [n] to denote the set {1, 2, ..., n}. {{...}} denotes a multiset, i.e., a set with possibly repeating elements. Π_n denotes the set of all permutations of the integers 1 to n. h_π, where π ∈ Π_{|h|}, is a reordering of the elements of a sequence h according to π. Given a matrix X ∈ R^{a×b}, X^T denotes the transpose of X, and vec(X) ∈ R^{ab×1} denotes the column stack of X.

2.2. GRAPH NEURAL NETWORKS

Most GNNs adopt the neighborhood aggregation scheme (Gilmer et al., 2017) to learn node representations, utilizing both node features and graph structure. In the k-th layer, the representation of node v is

h_v^(k) = Update(h_v^(k−1), Aggregate({{h_u^(k−1) | u ∈ N(v)}})).

Aggregators in GNNs. An aggregator is a permutation invariant function (Zaheer et al., 2017) with bounded-size inputs. It satisfies: (i) size insensitivity: an aggregator can take an arbitrary but finite number of inputs; (ii) permutation invariance: an aggregator is invariant to the permutation of its inputs. There are a limited number of basic aggregators, such as SUM, MEAN, NORMALIZED MEAN, MAX/MIN, and STD. Most proposed GNNs apply one of these aggregators. Sum-of-power mappings (Zaheer et al., 2017) and normalized moments (Corso et al., 2020) can also be used as aggregators, and they allow for a variable number of aggregators.
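As a concrete illustration, one neighborhood-aggregation step can be sketched in a few lines of Python. The scalar features, the toy graph, and the particular Aggregate/Update functions below are illustrative choices, not the paper's models:

```python
def message_passing_layer(h, neighbors, aggregate, update):
    # h: list of node features; neighbors[v]: indices of N(v), including v itself
    # h_v^(k) = update(h_v^(k-1), aggregate({{h_u^(k-1) | u in N(v)}}))
    return [update(h[v], aggregate([h[u] for u in neighbors[v]]))
            for v in range(len(h))]

# Toy path graph 0-1-2 with scalar features, SUM aggregator,
# and update(h_v, r) = h_v + 2 * r
h = [1.0, 2.0, 3.0]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
out = message_passing_layer(h, neighbors, sum, lambda hv, r: hv + 2 * r)
# out == [7.0, 14.0, 13.0]
```

Any permutation invariant `aggregate` (SUM, MEAN, MAX, ...) can be slotted in, which is exactly the design point the following analysis targets.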

3. PROPOSED MODEL

In this section, we first formalize the distinguishing strength of aggregators as a partial order, and show why basic aggregators used in popular GNNs become bottlenecks of expressiveness. Then, we analyze the requirements for building powerful aggregators and even injective aggregators. Finally, we introduce two GNN layers based on our theoretical analysis.

3.1. DISTINGUISHING STRENGTH OF AGGREGATORS

To ensure generality, our analysis of aggregators always considers multisets and the uncountable case, where the inputs are continuous and may contain repeating elements. We first introduce distinguishing strength under the concept of a partial order (Schmidt, 2011).

Distinguishing strength. The distinguishing strength of aggregator f_aggr1 is stronger than that of f_aggr2, denoted f_aggr1 ⪰ f_aggr2, if and only if for any two multisets x_1 and x_2 (possibly of different sizes), f_aggr2(x_1) ≠ f_aggr2(x_2) ⇒ f_aggr1(x_1) ≠ f_aggr1(x_2). If, in addition, there exist x'_1 and x'_2 such that f_aggr1(x'_1) ≠ f_aggr1(x'_2) but f_aggr2(x'_1) = f_aggr2(x'_2), then f_aggr1 is strictly stronger than f_aggr2, denoted f_aggr1 ≻ f_aggr2. If f_aggr1 ⪰ f_aggr2 ⪰ f_aggr1, we say the two aggregators have the same distinguishing strength, denoted f_aggr1 ≃ f_aggr2. If there exist multisets x_1 and x_2 such that f_aggr1(x_1) ≠ f_aggr1(x_2) but f_aggr2(x_1) = f_aggr2(x_2), and also x'_1 and x'_2 such that f_aggr1(x'_1) = f_aggr1(x'_2) but f_aggr2(x'_1) ≠ f_aggr2(x'_2), we say f_aggr1 and f_aggr2 are incomparable.

Distinguishing strength is a partial order, and the set of all aggregators forms a poset. In this poset, the aggregators with the greatest distinguishing strength are the injective ones. With this definition, we can compare any two aggregators. The widely used aggregators SUM, MEAN and MAX/MIN are pairwise incomparable: one can easily give two multisets that are distinguished by one aggregator but not by the others, as shown in (Corso et al., 2020).

Equivariant aggregator. f_aggr: {{R^d}} → R^d is an equivariant aggregator if and only if f_aggr({{T·x_1, T·x_2, ..., T·x_n}}) = T·f_aggr({{x_1, x_2, ..., x_n}}) for any T ∈ R^{m×d} and {{x_i ∈ R^d | i ∈ [n]}}. The widely used SUM and MEAN are equivariant aggregators, but MAX/MIN is not.
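The incomparability of the basic aggregators is easy to witness with concrete multisets (the particular multisets below are illustrative; any pair with the stated sums/maxima works):

```python
from statistics import mean

# SUM separates these multisets, MAX does not:
a, b = [1, 2], [0, 2]
sum_separates = (sum(a) != sum(b)) and (max(a) == max(b))

# MAX separates these, SUM does not -- hence SUM and MAX are incomparable:
c, d = [1, 3], [2, 2]
max_separates = (max(c) != max(d)) and (sum(c) == sum(d))

# MEAN is blind to multiplicity that SUM sees:
e, f = [1, 1], [1]
mean_blind = (mean(e) == mean(f)) and (sum(e) != sum(f))
```

Each pair exhibits one direction of the failure, so no single basic aggregator dominates the others in the partial order.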
We denote by f_aggr1 ⊗ f_aggr2 the new aggregator obtained by combining f_aggr1 and f_aggr2 as f_aggr1 ⊗ f_aggr2(X) = [f_aggr1(X) || f_aggr2(X)], where || denotes concatenation.

Lemma 1. (i) For any continuous function g, we have g ∘ f_aggr ⪯ f_aggr, and when g is injective, f_aggr ≃ g ∘ f_aggr; (ii) f_aggr1 ⊗ f_aggr2 ⪰ f_aggr1 and f_aggr1 ⊗ f_aggr2 ⪰ f_aggr2; if f_aggr1 and f_aggr2 are incomparable, then f_aggr1 ⊗ f_aggr2 ≻ f_aggr1 and f_aggr1 ⊗ f_aggr2 ≻ f_aggr2; (iii) if f_aggr is an equivariant aggregator, then f_aggr(T·x_1, T·x_2, ..., T·x_n) ⪯ f_aggr(x_1, x_2, ..., x_n) for any T ∈ R^{m×d} and {{x_i ∈ R^d | i ∈ [n]}}.

We prove Lemma 1 in Appendix B. Lemma 1 indicates that aggregators become bottlenecks of distinguishing strength. For an equivariant aggregator, any linear transformation before aggregation and any transformation after aggregation contribute nothing to the distinguishing strength. For SUM and MEAN, we have g(SUM(T·x_1, T·x_2, ..., T·x_n)) ⪯ SUM(x_1, x_2, ..., x_n) and g(MEAN(T·x_1, T·x_2, ..., T·x_n)) ⪯ MEAN(x_1, x_2, ..., x_n), where T ∈ R^{m×d} and g can be any continuous function.

Based on Lemma 1, we can now compare the distinguishing strength of the aggregations in some popular GNNs. GIN-0 first sums all hidden features of neighbors and then passes the result to a 2-layer MLP; hence, over a continuous input feature space, the distinguishing strength of GIN-0 is at most as powerful as the SUM aggregator. GCN uses a NORMALIZED MEAN (denoted nMEAN) aggregator. Given a node v and its neighbors,

nMEAN(v, u_1, ..., u_{|N(v)|−1}) = (1/√|N(v)|) · (h_v/√|N(v)| + h_{u_1}/√|N(u_1)| + ... + h_{u_{|N(v)|−1}}/√|N(u_{|N(v)|−1})|).

nMEAN is also an equivariant aggregator, so the distinguishing strength of the aggregation in GCN is at most as powerful as nMEAN. GAT corresponds to a weighted SUM aggregation whose weight coefficients are functions of the hidden features, which makes the distinguishing strength of GAT and SUM incomparable. Based on these observations, a potential approach to breaking the distinguishing strength limitation is to apply a nonlinear processing to the inputs before aggregation.

3.2. BUILDING POWERFUL AGGREGATORS

In this section, we analyze the requirements for building more powerful aggregators and, further, injective aggregators. We first introduce a new representation of GNN layers which unifies several popular GNN layers. Given a node v and its neighbors N(v), our formulation represents the GNN operation in three stages:

m_v = f_local(v),
r_v^(t) = m_vπ h_vπ^(t−1)T,        (2)
h_v^(t) = f_NN(r_v^(t)).

Here, m_v ∈ R^{|N(v)|} is the aggregation coefficient vector of node v. Note that m_v should be a mapping of local structure (such as node degrees, or the node or edge features of the k-hop neighborhood assigned to node v) to ensure the same encoding of isomorphic graphs. h_vπ^(t−1)T = (h_v^(t−1), h_{u_1}^(t−1), ..., h_{u_{|N(v)|−1}}^(t−1))^T ∈ R^{|N(v)|×d} is the matrix representation of v's neighbors according to a permutation π, and m_vπ is m_v reordered accordingly. f_NN: R^d → R^{d'} is a neural network that extracts task-relevant information from the aggregated representation r_v^(t) and is used to update the hidden feature h_v^(t) of node v.

According to Equation 2, the aggregation should have high distinguishing strength to avoid indistinguishability among neighborhoods, and the extraction should be powerful enough to efficiently extract task-relevant structural patterns from the aggregated representation of neighbors. Based on these observations, we reformulate GCN, GIN-0 and GAT with their corresponding three-stage representations as follows:

GCN:
m_v = (1/√|N(v)|) (1/√|N(v)|, 1/√|N(u_1)|, ..., 1/√|N(u_{|N(v)|−1})|),
r_v^(t) = m_vπ h_vπ^(t−1)T,
h_v^(t) = σ(W r_v^(t) + b);

GIN-0:
m_v = 1_{1×|N(v)|},
r_v^(t) = m_vπ h_vπ^(t−1)T,
h_v^(t) = MLP(r_v^(t));

GAT:
m_v^(t) = (att(h_v^(t−1), h_v^(t−1)), att(h_v^(t−1), h_{u_1}^(t−1)), ..., att(h_v^(t−1), h_{u_{|N(v)|−1}}^(t−1))),
r_v^(t) = m_vπ^(t) h_vπ^(t−1)T,
h_v^(t) = σ(W r_v^(t) + b).

Their default formulations are given in Appendix A.
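To make the three-stage reading concrete, here is GIN-0 in this form, in plain Python; the toy features and the stand-in "MLP" below are illustrative:

```python
def gin0_three_stage(h_neighbors, mlp):
    # Stage 1: aggregation coefficients -- for GIN-0, an all-ones vector
    # (a mapping of |N(v)| alone), equivalent to the SUM aggregator.
    m = [1.0] * len(h_neighbors)
    # Stage 2: r_v = m_v . h_v^T  (coefficient-weighted sum of neighbor features)
    d = len(h_neighbors[0])
    r = [sum(m[j] * h_neighbors[j][k] for j in range(len(m))) for k in range(d)]
    # Stage 3: extraction f_NN applied to the aggregated representation
    return mlp(r)

# Neighborhood features (1, 2) and (3, 4); a toy "MLP" that doubles its input
out = gin0_three_stage([[1.0, 2.0], [3.0, 4.0]], lambda r: [2 * x for x in r])
# out == [8.0, 12.0]
```

Swapping the stage-1 vector for GCN's degree-normalized coefficients, or for attention scores as in GAT, reproduces the other two rows of the table above.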
In the aggregation step, GCN's m_v is a mapping of the neighbors' degrees; GIN-0's m_v is a mapping of node v's degree, which is equivalent to the SUM aggregator; GAT's m_v is a mapping of the neighbors' features. All of them are mappings of local structure, as required by Equation 2. In this three-stage representation, aggregation is reformulated as the multiplication of the aggregation coefficient vector and the feature matrix of neighbors, which provides insight into improving the distinguishing strength of aggregations.

First, we show how to characterize permutation invariance in this formulation. Let M ∈ R^{s×n} denote an aggregation coefficient matrix with s ≥ 1 (note that in GCN, GIN and GAT, s is restricted to 1), and let h_π ∈ R^{n×d} be the matrix representation of the n input elements according to π. The aggregation in the second stage of Equation 2 is f_aggr(M, h_π) = M P_π h_π = M_π h_π, where P_π is the permutation matrix corresponding to π. P_π h_π ensures the same output for all h_π, π ∈ Π_{|h|}, and M_π = M P_π is the reordering of the columns of M according to π. For any π_1, π_2 ∈ Π_n, f_aggr(M, h_{π_1}) = f_aggr(M, h_{π_2}), so permutation invariance holds. Once M is fixed, we obtain a unique aggregator, denoted f_M: for any sequence of input elements h, f_M(h) = f_aggr(M, h_π), where π ∈ Π_n can be any ordering of the neighbors. Next, we analyze the distinguishing strength of f_M.

Proposition 1. For any two matrices M ∈ R^{s×n} and M' ∈ R^{s'×n} with s, s' ≤ n, we have (i) f_(M; M') ⪰ f_M, where (M; M') denotes stacking the two matrices; (ii) f_(M; M') ≻ f_M if and only if rank((M; M')) > rank(M); (iii) any multiset of size n is distinguishable by f_M if and only if rank(M) = n.

We prove Proposition 1 in Appendix C. Proposition 1 shows that the distinguishing strength of f_M is determined by the rank of the corresponding M. However, the analysis in Proposition 1 only applies to multisets aggregated with a shared f_M.
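The role of rank(M) in Proposition 1 can be checked numerically. In this sketch, permutation invariance is realized by presenting each multiset in a canonical (sorted) order; the particular matrices are illustrative. A rank-1 M (the SUM aggregator) collapses two different size-2 multisets, while a rank-2 M distinguishes them, matching Proposition 1(iii):

```python
def f_M(M, multiset):
    # Present the multiset in a canonical (sorted) order, then aggregate r = M h.
    h = sorted(multiset)
    return tuple(sum(M[i][j] * h[j] for j in range(len(h))) for i in range(len(M)))

x1, x2 = [1, 3], [2, 2]

M_rank1 = [[1, 1]]           # rank 1: the SUM aggregator
M_rank2 = [[1, 1], [1, 2]]   # rank 2 = n: injective on size-2 multisets

collapsed = f_M(M_rank1, x1) == f_M(M_rank1, x2)   # SUM cannot separate them
separated = f_M(M_rank2, x1) != f_M(M_rank2, x2)   # the full-rank M can
```

Stacking the extra row (1, 2) onto the all-ones row raises the rank from 1 to 2, which is exactly the condition in Proposition 1(ii) for a strict gain in distinguishing strength.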
Next, we extend the analysis to the case of different aggregators. Let Res(f_M) denote the set of all outputs of f_M. The three-stage representation also provides useful insight into the constraints among different aggregators: to fully distinguish different local structures, any two different aggregators f_{M_1} and f_{M_2} must satisfy Res(f_{M_1}) ∩ Res(f_{M_2}) = ∅. The reason is that fully distinguishing different local structures requires their aggregated representations to differ: since M is restricted to be a mapping of local structure (such as the k-hop neighborhood), different M means the corresponding local structures differ, so the aggregation results of different f_M must also differ. However, this condition is not satisfied by existing GNNs, and there are few studies on distinguishing multisets aggregated by different aggregators. Proposition 2 presents a detailed analysis.

Proposition 2. For any M_1, M_2 ∈ R^{s×n_1} and M'_1, M'_2 ∈ R^{s'×n_2}: (i) Res(f_(M_1; M'_1)) ∩ Res(f_(M_2; M'_2)) ⊆ Res(f_{M_1}) ∩ Res(f_{M_2}); (ii) if Res(f_(M_1; M'_1)) ∩ Res(f_(M_2; M'_2)) ⊂ Res(f_{M_1}) ∩ Res(f_{M_2}), then rank((M_1 M_2; M'_1 M'_2)) > rank((M_1 M_2)).

We prove Proposition 2 in Appendix D. Proposition 2 shows the necessity of preserving the rank of the aggregation coefficient matrix when considering the distinguishing strength among different aggregators. Next, we provide a sufficient condition for building multiple injective aggregators whose outputs have no intersection.

Proposition 3. For any two aggregators f_{M_1} and f_{M_2} with M_1 ∈ R^{s×n_1} and M_2 ∈ R^{s×n_2}, if rank((M_1 M_2)) = n_1 + n_2, then f_{M_1} and f_{M_2} are injective and Res(f_{M_1}) ∩ Res(f_{M_2}) = ∅.

We prove Proposition 3 in Appendix E. Propositions 1, 2 and 3 provide a new perspective on building powerful and even injective aggregators.
Compared with the distinguishing strength studies in (Xu et al., 2019) and (Corso et al., 2020), as well as existing strategies for building injective aggregators, e.g., sum-of-power mappings (Zaheer et al., 2017) and normalized moments (Corso et al., 2020), we reformulate aggregation with the aggregation coefficient matrix and relate the distinguishing strength of aggregators to the rank of the corresponding matrices. Moreover, the aggregation in this method is controlled by aggregation coefficients that can be learned from graph data to better leverage structural information. In this paper, to simplify the analysis, we only consider aggregation within one-hop neighborhoods; the results can easily be extended to more sophisticated aggregators with the overall framework unchanged.

From the perspective of preserving the rank of hidden features among neighbors, r = M h implies rank(r) ≤ min(rank(M), rank(h)). To preserve the rank of hidden features during aggregation, i.e., rank(r) = rank(h), we need rank(M) ≥ rank(h). This builds a connection between improving the distinguishing strength of aggregators and preserving the rank of hidden features among neighbors: both place requirements on the rank of M. Typical aggregators, such as those in GCN, GIN-0 and GAT, have rank(M) = 1; thus rank(r) is always fixed to 1, no matter what the rank of the input features is, and correspondingly they have weak distinguishing strength.

Equation 2 splits aggregation and feature/structure extraction into two independent steps, which reveals that the loss of expressive power happens in the aggregation step, after which the model extracts feature/structure information from the distorted encodings of neighbors. From Equation 2, aggregation can be viewed as a representation regularization step that unifies different multisets of neighbors into the same representation style while maintaining permutation invariance. The model can then extract structural information from this regularized data with a shared trainable matrix, as in the third stage of Equation 2. Based on this observation, we propose two novel GNN layers: ExpandingConv and CombConv.

3.3. EXPANDINGCONV

In this section, we first present the ExpandingConv framework and then provide one of its implementations:

m_uv^(t) = Tanh(W [h_v^(t−1) || h_u^(t−1)] + b),
h_v^(t) = MLP( Σ_{u∈N(v)} ReLU(vec(m_uv^(t) h_u^(t−1)T)) ).        (3)

In Equation 3, we implement f_local(u, v) as a function of the hidden features of nodes u and v; other implementations are possible, and we leave them for future work. W ∈ R^{s×2d} and b ∈ R^{s×1} are trainable parameters. (Luan et al., 2019) empirically showed that different nonlinear activation functions contribute differently to preserving the rank of matrices; we use the recommended Tanh as the activation function in the computation of m_uv^(t) to better preserve the rank of the aggregation coefficient matrices. MLP denotes a 2-layer perceptron.

Next, we rewrite Equation 3 in the three-stage representation of Section 3.2 to obtain its aggregation coefficient matrix and analyze its distinguishing strength. To simplify this process, we consider a 1-layer MLP with W' ∈ R^{d×sd} and b' ∈ R^{d×1}, so that

h_v^(t) = Σ_{u∈N(v)} ReLU(W' vec(m_uv^(t) h_u^(t−1)T) + b').

Since ReLU retains, for each output dimension i, only the neighbors whose pre-activation is positive, the i-th entry of h_v^(t) equals

Σ_{u∈N_i(v)} ( W'_[i,:] vec(m_uv^(t) h_u^(t−1)T) + b'_[i] ) = W'_[i,:] vec(M_vi^(t)π_i h_vi^(t−1)T π_i) + |N_i(v)| · b'_[i],        (4)

where N_i(v) ⊆ N(v), i ∈ [d], are the subsets of neighbors retained in each dimension. M_vi^(t) = (m_{u_1 v}^(t), m_{u_2 v}^(t), ..., m_{u_{|N_i(v)|} v}^(t)) ∈ R^{s×|N_i(v)|} and h_vi^(t−1) = (h_{u_1}^(t−1), h_{u_2}^(t−1), ..., h_{u_{|N_i(v)|}}^(t−1)) ∈ R^{d×|N_i(v)|} are the aggregation coefficient matrix and hidden feature matrix corresponding to the subset N_i(v) ordered by π_i. Denoting [h_v^(t−1) || h_vi^(t−1)] = (h_v^(t−1)||h_{u_1}^(t−1), h_v^(t−1)||h_{u_2}^(t−1), ..., h_v^(t−1)||h_{u_{|N_i(v)|}}^(t−1)) ∈ R^{2d×|N_i(v)|}, we have M_vi^(t) = Tanh(W [h_v^(t−1) || h_vi^(t−1)] + b) ∈ R^{s×|N_i(v)|}. According to Equation 4, we finally obtain the three-stage representation equivalent to Equation 3:

M_vi^(t) = Tanh(W [h_v^(t−1) || h_vi^(t−1)] + b),
r_vi^(t) = M_vi^(t)π_i h_vi^(t−1)T π_i,  i ∈ [d],        (5)
h_v^(t) = diag(W'_[1,:], W'_[2,:], ..., W'_[d,:]) (vec(r_v1^(t)), vec(r_v2^(t)), ..., vec(r_vd^(t)))^T + (|N_1(v)|·b'_[1], |N_2(v)|·b'_[2], ..., |N_d(v)|·b'_[d])^T.

According to the computation of r_vi^(t), rank(r_vi^(t)) ≤ min(s, d, |N_i(v)|). By configuring a larger s, we obtain rank(r_vi^(t)) > 1 with high probability, in contrast to general GNNs with a rank of 1. As analyzed in Section 3.2, this yields more powerful aggregators and preserves the rank of hidden features among neighbors. The r_vi^(t) obtained after aggregation are the unified representations of the neighbors. We then use the trainable matrix W' ∈ R^{sd×d} to extract feature/structure information. Unlike the aggregation step, the dimension reduction here (from sd to d) does not cause information loss: only task-relevant structural information needs to be preserved and passed to the next layer, and it can be embedded in lower dimensions.

Comparisons with multi-head GAT.

Proposition 4. Multi-head GAT is an implementation of ExpandingConv as follows:

α_vu = softmax(LeakyReLU([diag(ã^{1T}, ã^{2T}, ..., ã^{KT}) || diag(ã'^{1T}, ã'^{2T}, ..., ã'^{KT})] [W h_v^(t−1) || W h_u^(t−1)])),
h_v^(t) = σ( (1/K) W Σ_{u∈N(v)} vec(α_vu h_u^(t−1)T) ),

where W = ||_{k=1}^{K} W_k ∈ R^{Kd×d} is the concatenation of the trainable matrices of all K heads.

We prove Proposition 4 in Appendix F. Although multi-head GAT is based on the attention mechanism, ExpandingConv provides a new perspective on its effectiveness: applying the multi-head attention mechanism helps preserve the rank of hidden features and achieves more powerful aggregators.
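A plain-Python sketch of the ExpandingConv aggregation of Equation 3 may help fix ideas. To stay self-contained it stops before the final 2-layer MLP (identity extraction), and the parameter values below are illustrative, not trained:

```python
import math

def expanding_conv_aggregate(h_v, h_neighbors, W, b):
    # Sketch of Eq. 3 up to (but not including) the final MLP:
    # m_uv = Tanh(W [h_v || h_u] + b) in R^s, then Re-SUM of vec(m_uv h_u^T).
    s, d = len(W), len(h_v)
    r = [0.0] * (s * d)                 # aggregated representation, vec of an s x d matrix
    for h_u in h_neighbors:
        x = h_v + h_u                   # concatenation [h_v || h_u], length 2d
        m = [math.tanh(sum(W[i][j] * x[j] for j in range(2 * d)) + b[i])
             for i in range(s)]         # aggregation coefficients m_uv in R^s
        outer = [m[i] * h_u[k] for i in range(s) for k in range(d)]  # vec(m_uv h_u^T)
        r = [r[p] + max(outer[p], 0.0) for p in range(s * d)]        # ReLU before SUM
    return r

# s = 2 expands each d = 2 neighbor encoding to an s x d block before summing,
# so the aggregated representation can have rank > 1
r = expanding_conv_aggregate([1.0, 2.0], [[1.0, -1.0], [0.5, 0.5]],
                             W=[[0.1] * 4, [-0.2] * 4], b=[0.0, 0.0])
```

With s = 1 the expansion degenerates to a single weighted sum, which is the rank-1 regime of GCN/GIN/GAT discussed above.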
However, the use of LeakyReLU may be harmful to preserving the rank of the aggregation coefficient matrix (Luan et al., 2019). GAT, like most other GNNs (GCN, GIN, etc.), follows the pattern of applying nonlinear units after aggregation. Following the analysis in Section 3.1, Equation 3 instead applies an MLP to vec(m_uv^(t) h_u^(t−1)T) before the SUM to break the distinguishing strength limitation of SUM. This also produces another interesting result: as shown by the three-stage representation in Equation 5, each dimension of the hidden features aggregates over a subset of neighbors independently, which corresponds to a kind of dimension-wise neighbor sampling. We call the modification of applying ReLU ahead of the SUM aggregator the Re-SUM mechanism. (Mishra et al., 2020) and (Rong et al., 2019) studied dropedge and node masking mechanisms for node-level predictions; both can be considered neighbor sampling strategies, which have been shown to improve the generalization ability of aggregation-based GNNs and are also used as unbiased data augmentation techniques for training. Compared with dropedge and node masking, Re-SUM realizes a dimension-wise neighbor sampling and does not require manually setting a sampling ratio, since the mechanism takes effect implicitly. Re-SUM shows that a neural network can itself perform sampling by properly combining nonlinear units and aggregators, without explicitly modifying the network architecture. Our experimental results verify the effectiveness of Re-SUM on a variety of graph tasks.

3.4. COMBCONV

CombConv uses trainable parameters W ∈ R^{d×2d} and b ∈ R^{d×1}. Similar to ExpandingConv, CombConv applies Re-SUM aggregation; the difference is that each dimension of the hidden features is aggregated by an independent weighted aggregator. ExpandingConv with s = 1 corresponds to a special case of CombConv in which all dimensions share the same aggregator.
Therefore, the distinguishing strength of CombConv is stronger than that of ExpandingConv with s = 1. Meanwhile, CombConv does not expand the hidden features of nodes during aggregation and hence requires fewer parameters.
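The implicit, dimension-wise sampling performed by Re-SUM is easy to see directly; the message vectors below are illustrative:

```python
def re_sum(messages):
    # messages: one transformed vector per neighbor. Applying ReLU per neighbor
    # *before* the SUM means each output dimension effectively sums over only
    # the subset N_i(v) of neighbors whose pre-activation is positive.
    dim = len(messages[0])
    return [sum(max(m[i], 0.0) for m in messages) for i in range(dim)]

msgs = [[1.0, -2.0], [-0.5, 3.0], [2.0, 1.0]]
# Dimension 0 implicitly "samples" neighbors {0, 2}; dimension 1 samples {1, 2}.
out = re_sum(msgs)
# out == [3.0, 4.0]
```

No sampling ratio is set anywhere: which neighbors each dimension drops is decided by the signs of the (learned) messages themselves.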

4. EXPERIMENTS

In this section, we evaluate ExpandingConv and CombConv on graph-level prediction tasks on OGB (Hu et al., 2020), TU (Kersting et al., 2016; Yanardag & Vishwanathan, 2015) and QM9 (Ramakrishnan et al., 2014; Wu et al., 2018; Ruddigkeit et al., 2012). The code is available at https://github.com/qslim/epcb-gnns.

Configurations. We use the default dataset splits for OGB. The QM9 dataset is randomly split into 80% train, 10% validation and 10% test as in (Morris et al., 2019; Maron et al., 2019). For the TU datasets, we follow the standard 10-fold cross-validation protocol and splits from (Zhang et al., 2018) and report our results following the protocol described in (Xu et al., 2019; Ying et al., 2018). We use the concatenation of hidden features from all layers to compute entire-graph representations (Xu et al., 2018). In our tests, all models are equipped with batch normalization (Ioffe & Szegedy, 2015) on each hidden layer when evaluating on OGB and TU, but not on QM9. Dataset descriptions and detailed hyperparameter settings are given in Appendix H. We first conduct comprehensive ablation studies to evaluate the effectiveness of powerful aggregators and the Re-SUM mechanism on OGB and QM9, as given in Table 1 and Table 2. Then, we compare the performance of ExpandingConv and CombConv with competitive baselines on all three benchmarks, as given in Table 3 and Table 4.

4.1. ABLATION STUDIES

Effect of powerful aggregators. Complex graph structures with dense connections or abundant node/edge features benefit from a more expressive model that can maximally distinguish different structures and extract relevant structural patterns as the model goes deeper and leverages larger receptive fields. This is validated on both QM9 and OGB. We configure s = 1, 4, 8, 16, 32 for ExpC-s on all 12 targets of QM9. As we apply a larger s, the model continuously achieves better performance on most targets. We select s = 1, 4 for ogbg-ppa and ogbg-molhiv, and s = 1, 5 for ogbg-code. The results show that applying a larger s yields performance improvements, especially on ogbg-ppa, which involves large graphs with dense connections.

Effect of the Re-SUM mechanism. In Table 1 and Table 2, the performance differences between ExpC*-1 (CombC*) and ExpC-1 (CombC) show the effectiveness of Re-SUM. In our tests, Re-SUM can be extremely powerful on graphs with dense connections such as ogbg-ppa, as validated on both ExpandingConv (6.85% improvement) and CombConv (4% improvement). On most targets of QM9, this mechanism also yields improvements. For small graphs with sparse connections, such as ogbg-molhiv and ogbg-molpcba, the improvements are less significant.

4.2. COMPARISONS WITH BASELINES

Table 3 and Table 4 show the performance comparisons of our models with baselines on QM9, TU and OGB, respectively. All datasets in the QM9 and OGB graph-level prediction suites are used for evaluation. For TU, we use 3 widely used datasets: COLLAB, which includes graphs with dense connections, and REDDIT-BINARY (RDT-B) and REDDIT-MULTI-12K (RDT-M12), which include large, sparse graphs in which one center node is densely connected to the other nodes. All baseline results are taken from the original papers, except for GraphSAGE on TU, multi-head GAT on OGB and GIN0* on QM9, which were not reported by the original papers; we report the GraphSAGE results provided by (Ying et al., 2018) and evaluate multi-head GAT and GIN0* ourselves. To ensure a fair comparison, for OGB and TU we set the number of heads in multi-head GAT and s in ExpC-s to the same value, selected from {3, 4, 5}. For QM9, the number of heads is 8 and s ∈ {4, 8, 16, 32}. GIN0* on QM9 denotes GIN0 without batch normalization. Compared with the baselines, our models achieve the best performance on 7 of the 12 targets of QM9, 3 of the 4 graph-level prediction datasets of OGB, and all 3 selected TU datasets. Our models obtain 1.9% improvement on COLLAB and 2.83% improvement on REDDIT-MULTI-12K over the SOTA baselines. On ogbg-ppa, our models achieve 2.6% higher classification accuracy than the SOTA baselines, and on ogbg-code they achieve a 1.5% improvement. Multi-head GAT can also be considered an implementation of ExpandingConv; however, its performance on graph-level predictions is not competitive. According to its three-stage representation, the use of LeakyReLU in the aggregation step is harmful to preserving rank, and the use of softmax makes the rank harder to analyze. In the extraction step, the 1-layer MLP may have limited power to represent the desired extraction functions.

A GCN, GAT AND GIN

Here, we present the formulations of GCN, GAT and GIN used in our analysis.

Graph Convolution Networks (GCN) (Kipf & Welling, 2016): H^(t) = σ(D^{−1/2} Â D^{−1/2} H^(t−1) W + b).

Graph Attention Networks (GAT) (Veličković et al., 2017): h_v^(t) = σ( Σ_{u∈N(v)} att(h_v^(t−1), h_u^(t−1)) W h_u^(t−1) ).

Graph Isomorphism Networks (GIN-0) (Xu et al., 2019): h_v^(t) = MLP( Σ_{u∈N(v)} h_u^(t−1) ).

B PROOF OF LEMMA 1

Proof. (i) For any two multisets x_1 and x_2, if g(f_aggr(x_1)) ≠ g(f_aggr(x_2)), then f_aggr(x_1) ≠ f_aggr(x_2). Therefore, f_aggr ⪰ g ∘ f_aggr. If g is injective, then f_aggr(x_1) ≠ f_aggr(x_2) ⇒ g(f_aggr(x_1)) ≠ g(f_aggr(x_2)), so g ∘ f_aggr ⪰ f_aggr as well, and hence f_aggr ≃ g ∘ f_aggr.

(ii) For any two multisets x_1 and x_2, f_aggr1(x_1) ≠ f_aggr1(x_2) ⇒ [f_aggr1(x_1)||f_aggr2(x_1)] ≠ [f_aggr1(x_2)||f_aggr2(x_2)]. Therefore, f_aggr1 ⊗ f_aggr2 ⪰ f_aggr1. If f_aggr1 and f_aggr2 are incomparable, there exist x_3 and x_4 such that f_aggr2(x_3) ≠ f_aggr2(x_4) but f_aggr1(x_3) = f_aggr1(x_4). Then [f_aggr1(x_3)||f_aggr2(x_3)] ≠ [f_aggr1(x_4)||f_aggr2(x_4)] while f_aggr1(x_3) = f_aggr1(x_4), so f_aggr1 ⊗ f_aggr2 ≻ f_aggr1. The claims for f_aggr2 follow symmetrically.

(iii) Since f_aggr is an equivariant aggregator, f_aggr(T·x_1, T·x_2, ..., T·x_n) = T·f_aggr(x_1, x_2, ..., x_n) ⪯ f_aggr(x_1, x_2, ..., x_n).

C PROOF OF PROPOSITION 1

Proof. (i) $f_{\binom{M}{M'}}(x) = \binom{M}{M'}_\pi x_\pi = \binom{M_\pi x_\pi}{M'_\pi x_\pi} = \binom{f_M(x)}{f_{M'}(x)}$. Then for any $x_1$ and $x_2$, we have $f_M(x_1) \neq f_M(x_2) \Rightarrow f_{\binom{M}{M'}}(x_1) \neq f_{\binom{M}{M'}}(x_2)$, and therefore we conclude that $f_{\binom{M}{M'}} \succeq f_M$. (ii) "$f_{\binom{M}{M'}} \succ f_M \Leftarrow \operatorname{rank}\binom{M}{M'} > \operatorname{rank}(M)$": we prove the claim by contradiction. Assume that $\operatorname{rank}\binom{M}{M'} > \operatorname{rank}(M)$ and $f_{\binom{M}{M'}} = f_M$. The latter means that for any $x_1$ and $x_2$, $M_\pi x_{1\pi} = M_\pi x_{2\pi} \Leftrightarrow \binom{M}{M'}_\pi x_{1\pi} = \binom{M}{M'}_\pi x_{2\pi}$, where $\pi$ is the ordering of input elements. Let $s = x_{1\pi} - x_{2\pi}$; for any $i \in [n]$, $s[i] = x_{1\pi}[i] - x_{2\pi}[i] \in \mathbb{R}$. Then for any $s \in \mathbb{R}^n$, $M_\pi s = 0 \Leftrightarrow \binom{M}{M'}_\pi s = 0$: the systems of linear equations $M_\pi x = 0$ and $\binom{M}{M'}_\pi x = 0$ share the same solution space. Let $R_S$ denote the dimension of this solution space; then $\operatorname{rank}(M_\pi) + R_S = \operatorname{rank}\big(\binom{M}{M'}_\pi\big) + R_S = n$. Therefore $\operatorname{rank}(M_\pi) = \operatorname{rank}\big(\binom{M}{M'}_\pi\big)$, and hence $\operatorname{rank}(M) = \operatorname{rank}\binom{M}{M'}$. Since we assumed that $\operatorname{rank}\binom{M}{M'} > \operatorname{rank}(M)$, we reach a contradiction. "$f_{\binom{M}{M'}} \succ f_M \Rightarrow \operatorname{rank}\binom{M}{M'} > \operatorname{rank}(M)$": we prove the equivalent proposition "$\operatorname{rank}\binom{M}{M'} \le \operatorname{rank}(M) \Rightarrow f_{\binom{M}{M'}} \preceq f_M$". Note that $\operatorname{rank}\binom{M}{M'} \ge \operatorname{rank}(M)$ and $f_{\binom{M}{M'}} \succeq f_M$ as given in Proposition 1(i). We only need to prove "$\operatorname{rank}\binom{M}{M'} = \operatorname{rank}(M) \Rightarrow f_{\binom{M}{M'}} = f_M$". $\operatorname{rank}\binom{M}{M'} = \operatorname{rank}(M)$ means that every row of $M'$ is linearly dependent on the rows of $M$. Therefore, there exists $L \in \mathbb{R}^{s'\times s}$ such that $\binom{M}{M'} = \binom{I}{L} M$. For any $x_1$ and $x_2$ with $M P_\pi x_{1\pi} = M P_\pi x_{2\pi}$, we have $\binom{I}{L} M P_\pi x_{1\pi} = \binom{I}{L} M P_\pi x_{2\pi}$, and therefore $\binom{M}{M'} P_\pi x_{1\pi} = \binom{M}{M'} P_\pi x_{2\pi}$, where $\pi$ is the ordering of input elements. That is, for any $x_1$ and $x_2$, $f_M(x_1) = f_M(x_2) \Rightarrow f_{\binom{M}{M'}}(x_1) = f_{\binom{M}{M'}}(x_2)$, thus $f_{\binom{M}{M'}} \preceq f_M$. Finally, we have $f_{\binom{M}{M'}} = f_M$. (iii) "Any multiset of size $n$ is distinguishable with $f_M \Rightarrow \operatorname{rank}(M) = n$": since $\operatorname{rank}(M) \le n$, we prove the equivalent proposition "$\operatorname{rank}(M) < n \Rightarrow$ there exist at least two multisets which are indistinguishable".
Consider the system of linear equations $y = Mx$ with $x \in \mathbb{R}^n$. If $\operatorname{rank}(M) < n$, then there exists $y_1$ such that $\operatorname{rank}(M) = \operatorname{rank}([M, y_1]) < n$. According to the Rouché–Capelli theorem, there are infinitely many solutions $x'_i$ such that $y_1 = M x'_1 = M x'_2 = \cdots$. Each $x'_i$ comes from a multiset with a particular order. Next, we need to show that these $x'_i$ come from more than one multiset. Since a multiset of bounded size $n$ admits at most $n!$ different orders, the infinitely many $x'_i$ corresponding to $y_1$ must come from more than one multiset, making these multisets indistinguishable. "Any multiset of size $n$ is distinguishable with $f_M \Leftarrow \operatorname{rank}(M) = n$": since $\operatorname{rank}(M) = n$ and $s = n$, $M$ is invertible, so for any $x \in \mathbb{R}^n$ the image $y = Mx \in \mathbb{R}^n$ determines $x$ uniquely. Correspondingly, for any $P_\pi x_\pi$, $M(P_\pi x_\pi)$ is unique.
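Proposition 1(iii) can be illustrated numerically: SUM over multisets of size $n = 3$ has the coefficient matrix $M = (1\ 1\ 1)$ with rank $1 < 3$, so distinct multisets must collide, while appending linearly independent rows until the rank reaches $n$ separates them. The particular rows chosen below are illustrative, not from the paper:

```python
import numpy as np

# SUM over multisets of size 3: rank(M) = 1 < 3, so by Proposition 1(iii)
# there must exist distinct multisets it cannot distinguish.
M = np.array([[1.0, 1.0, 1.0]])
x1 = np.array([0.0, 1.0, 2.0])   # multiset {0, 1, 2}
x2 = np.array([1.0, 1.0, 1.0])   # multiset {1, 1, 1}
print(M @ x1, M @ x2)            # both [3.] -- indistinguishable

# Appending linearly independent rows raises the rank; once rank(M) = n,
# the map x -> Mx is injective, so the two inputs separate.
M_full = np.array([[1.0, 1.0, 1.0],
                   [1.0, 2.0, 4.0],
                   [1.0, 3.0, 9.0]])
assert np.linalg.matrix_rank(M_full) == 3
print(M_full @ x1)   # [ 3. 10. 21.]
print(M_full @ x2)   # [ 3.  7. 13.]
```

Strictly, a multiset aggregator must also be invariant to the ordering of its input; the full-rank matrix here only shows that the two representative vectors become separable, which is the rank argument the proof relies on.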

D PROOF OF PROPOSITION 2

Proof. (i) According to the proof of Proposition 1(i), $f_{\binom{M}{M'}}(x) = \binom{f_M(x)}{f_{M'}(x)}$. For any $x_1$ and $x_2$, $f_M(x_1) \neq f_M(x_2) \Rightarrow f_{\binom{M}{M'}}(x_1) \neq f_{\binom{M}{M'}}(x_2)$. For each head $k$, split the attention vector $a^k$ into its two halves $\tilde{a}^k = (a^k_1, \dots, a^k_{d'})$ and $\tilde{a}'^k = (a^k_{d'+1}, \dots, a^k_{2d'})$. Then
$$
\begin{pmatrix} a^{1T}\,[W^1 h^{(t-1)}_v \,\|\, W^1 h^{(t-1)}_u] \\ a^{2T}\,[W^2 h^{(t-1)}_v \,\|\, W^2 h^{(t-1)}_u] \\ \vdots \\ a^{KT}\,[W^K h^{(t-1)}_v \,\|\, W^K h^{(t-1)}_u] \end{pmatrix}
= \begin{pmatrix} [\tilde{a}^1 \| \tilde{a}'^1]^T\,[W^1 h^{(t-1)}_v \,\|\, W^1 h^{(t-1)}_u] \\ [\tilde{a}^2 \| \tilde{a}'^2]^T\,[W^2 h^{(t-1)}_v \,\|\, W^2 h^{(t-1)}_u] \\ \vdots \\ [\tilde{a}^K \| \tilde{a}'^K]^T\,[W^K h^{(t-1)}_v \,\|\, W^K h^{(t-1)}_u] \end{pmatrix}
= \begin{pmatrix} \tilde{a}^{1T} W^1 h^{(t-1)}_v + \tilde{a}'^{1T} W^1 h^{(t-1)}_u \\ \tilde{a}^{2T} W^2 h^{(t-1)}_v + \tilde{a}'^{2T} W^2 h^{(t-1)}_u \\ \vdots \\ \tilde{a}^{KT} W^K h^{(t-1)}_v + \tilde{a}'^{KT} W^K h^{(t-1)}_u \end{pmatrix}
$$
$$
= \begin{pmatrix} \tilde{a}^{1T} W^1 h^{(t-1)}_v \\ \vdots \\ \tilde{a}^{KT} W^K h^{(t-1)}_v \end{pmatrix} + \begin{pmatrix} \tilde{a}'^{1T} W^1 h^{(t-1)}_u \\ \vdots \\ \tilde{a}'^{KT} W^K h^{(t-1)}_u \end{pmatrix}
= \operatorname{diag}(\tilde{a}^{1T}, \tilde{a}^{2T}, \dots, \tilde{a}^{KT})\, W h^{(t-1)}_v + \operatorname{diag}(\tilde{a}'^{1T}, \tilde{a}'^{2T}, \dots, \tilde{a}'^{KT})\, W h^{(t-1)}_u,
$$
where $W$ denotes the vertical concatenation of $W^1, W^2, \dots, W^K$.
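The head-splitting identity above can be checked numerically: the stacked per-head scores $a^{kT}[W^k h_v \| W^k h_u]$ equal the block-diagonal form acting on the vertically stacked $W$. Dimensions and random values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dp, K = 4, 3, 2                      # input dim d, per-head dim d', K heads
h_v, h_u = rng.normal(size=d), rng.normal(size=d)
Ws = [rng.normal(size=(dp, d)) for _ in range(K)]      # per-head W^k
atts = [rng.normal(size=2 * dp) for _ in range(K)]     # per-head a^k

# Left-hand side: per-head scores a^{kT} [W^k h_v || W^k h_u].
lhs = np.array([atts[k] @ np.concatenate([Ws[k] @ h_v, Ws[k] @ h_u])
                for k in range(K)])

# Right-hand side: diag(a~^{1T},...,a~^{KT}) W h_v + diag(a~'^{1T},...) W h_u,
# with W the vertical stack of the W^k and each a^k split into halves.
W = np.vstack(Ws)                       # (K*dp, d)
D_v = np.zeros((K, K * dp))
D_u = np.zeros((K, K * dp))
for k in range(K):
    D_v[k, k * dp:(k + 1) * dp] = atts[k][:dp]   # a~^k
    D_u[k, k * dp:(k + 1) * dp] = atts[k][dp:]   # a~'^k
rhs = D_v @ W @ h_v + D_u @ W @ h_u
assert np.allclose(lhs, rhs)
```

The block-diagonal form makes explicit that the attention scores are linear in $W h_v$ and $W h_u$ separately, which is the structure the proposition exploits.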

G COMPARISONS WITH MULTI-AGGREGATOR IMPLEMENTATIONS

ExpandingConv can also be considered as a kind of multi-aggregator scheme. In Equation 5, each row of $M_{vi}$ can be viewed as a weighted aggregator whose weight coefficients are learned from data. Proposition 1 shows that to obtain higher distinguishing strength by utilizing more aggregators, the weight coefficients of newly added aggregators should be linearly independent of all existing aggregators. The distinguishing strength of weighted aggregators is incomparable with that of basic aggregators. However, since each row of $M_{vi}$ is equivalent to an independent aggregator, one can simply modify the implementation of $f_{local}(u, v)$ to obtain a variant whose distinguishing strength is strictly stronger than basic aggregators, as follows: $\mathrm{ExpandingConv}|_{m'^{(t)}_{uv} = [m^{(t)}_{uv} \,\|\, 1]} \succ \mathrm{SUM}$. Compared with leveraging multiple basic aggregators as in (Corso et al., 2020) and (Dehmamy et al., 2019), leveraging weighted aggregators allows for variable numbers of aggregators. Meanwhile, the weight coefficients are learned from data, which can better capture relevant structural patterns.
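The $[m^{(t)}_{uv} \| 1]$ construction can be sketched directly: appending a constant row to the learned per-edge coefficients guarantees that one coordinate of the aggregated output reproduces plain SUM, so the variant is at least as strong as SUM. Shapes and random values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 4
neighbors = rng.normal(size=(n, d))    # features h_u of the n neighbors of v

# Learned per-edge coefficients m_uv (here a single random row, s = 1);
# appending a constant 1 adds a row that reproduces plain SUM.
m = rng.normal(size=(n, 1))
m_aug = np.concatenate([m, np.ones((n, 1))], axis=1)   # m'_uv = [m_uv || 1]

# Aggregate: sum over u of the outer product m'_uv h_u^T
# (row 0: learned weighted sum, row 1: plain SUM).
out = sum(np.outer(m_aug[i], neighbors[i]) for i in range(n))
assert np.allclose(out[1], neighbors.sum(axis=0))
```

Because the SUM coordinate is always present, any multisets that SUM distinguishes are also distinguished by the variant, matching the $\succ \mathrm{SUM}$ claim.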



CONCLUSION

We show how the basic aggregators used in general GNNs become expressive bottlenecks. To address this limitation, we develop theoretical foundations for building powerful aggregators. We also propose the Re-SUM mechanism, which achieves dimension-wise sampling. To evaluate their effectiveness, we develop two novel GNN layers and conduct extensive experiments on public graph benchmarks. The results are consistent with our analysis, and our proposed models achieve SOTA performance on a variety of graph-level prediction benchmarks.



tions and analyze how ExpandingConv achieves more powerful aggregations. The ExpandingConv framework is
$$m^{(t)}_{uv} = f_{local}(u, v)\big|_{u\in\mathcal{N}(v)}, \qquad h^{(t)}_v = f_{aggr}\big(\{\!\!\{\mathrm{vec}(m^{(t)}_{uv} h^{(t-1)T}_u) \mid u \in \mathcal{N}(v)\}\!\!\}\big),$$
where $m^{(t)}_{uv} \in \mathbb{R}^{s\times 1}$ with $s > 1$, and $f_{local}(u, v)$ is the mapping of local structures between nodes $u$ and $v$. The implementation of $f_{local}(u, v)$ is very flexible, with the only restriction of ensuring the same encoding for isomorphic graphs. $\mathrm{vec}(m^{(t)}_{uv} h^{(t-1)T}_u) \in \mathbb{R}^{sd\times 1}$ is the expanded representation of the hidden feature $h^{(t-1)}_u \in \mathbb{R}^{d\times 1}$. A GNN layer $f_{aggr}$ then learns structural information on these expanded representations. We introduce an implementation as follows:
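The expanded representation $\mathrm{vec}(m^{(t)}_{uv} h^{(t-1)T}_u)$ can be sketched as an outer product flattened into an $sd$-vector; summing these over the neighborhood keeps the $s$ weighted copies of each neighbor feature in separate blocks, unlike plain SUM. Dimensions and random values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
s, d = 3, 4
h_u = rng.normal(size=d)      # hidden feature of neighbor u, in R^d
m_uv = rng.normal(size=s)     # f_local(u, v), in R^s with s > 1

# vec(m_uv h_u^T): the outer product flattened row by row, in R^{s*d}.
expanded = np.outer(m_uv, h_u).reshape(-1)
assert expanded.shape == (s * d,)

# Summing expanded representations over a neighborhood of 5 nodes:
# each of the s blocks holds a differently weighted sum of neighbor features.
H_N = rng.normal(size=(5, d))
M = rng.normal(size=(5, s))
h_v = sum(np.outer(M[i], H_N[i]).reshape(-1) for i in range(5))
assert h_v.shape == (s * d,)
```

Here SUM over the expanded representations amounts to $s$ distinct weighted aggregators applied in parallel, which is the rank-preserving effect the framework targets.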

uv " f local pu, vq| uPN pvq h ptq v " f aggr pttvecpm ptq uv d h pt´1q u q|u P N pvquuq, where m ptq uv P R dˆ1 and d denotes element-wise product. An implementation of CombConv is given as follows:

$\mathrm{ExpandingConv}|_{m'^{(t)}_{uv} = [m^{(t)}_{uv} \,\|\, 1]} \succ \mathrm{SUM}$, and $\mathrm{ExpandingConv}|_{m'^{(t)}_{uv} = [m^{(t)}_{uv} \,\|\, 1 \,\|\, \frac{1}{|\mathcal{N}(v)|}]} \succ \mathrm{SUM} \otimes \mathrm{MEAN}$.

Figure 2: Effectiveness of Re-SUM on QM9.

Table 4 to show their improvements. ExpC-$s$ denotes ExpandingConv with $W \in \mathbb{R}^{s\times *}$. We use ExpC* and CombC* to denote ExpandingConv and CombConv without Re-SUM.

Ablation studies on OGB and QM9. Higher is better.

Ablation studies on QM9. Lower is better.

Comparisons with baselines on OGB and TU. Higher is better.

Comparisons with baselines on QM9. Lower is better.


E PROOF OF PROPOSITION 3

Proof. Since $M_1 \in \mathbb{R}^{s\times n_1}$, $M_2 \in \mathbb{R}^{s\times n_2}$ and $\operatorname{rank}\big((M_1\ M_2)\big) = n_1 + n_2$, we have $\operatorname{rank}(M_1) = n_1$ and $\operatorname{rank}(M_2) = n_2$. According to Proposition 1, $f_{M_1}$ and $f_{M_2}$ are injective. We build the system of linear equations $y = Ax$, where $x \in \mathbb{R}^{n_1+n_2}$ and $A = (M_1\ M_2)\begin{pmatrix} I_{n_1} & 0 \\ 0 & -I_{n_2} \end{pmatrix} \in \mathbb{R}^{s\times(n_1+n_2)}$. Then $\operatorname{rank}(A) = \operatorname{rank}\Big((M_1\ M_2)\begin{pmatrix} I_{n_1} & 0 \\ 0 & -I_{n_2} \end{pmatrix}\Big) = \operatorname{rank}\big((M_1\ M_2)\big) = n_1 + n_2$, which means $Ax = 0$ has no non-zero solutions. Let $x_1 = (x[1], x[2], \dots, x[n_1])$ and $x_2 = (x[n_1{+}1], x[n_1{+}2], \dots, x[n_1{+}n_2])$ such that $x = \binom{x_1}{x_2}$. For any $x \neq 0$, $Ax = M_1 x_1 - M_2 x_2 \neq 0$. Therefore, for any $x_1 \in \mathbb{R}^{n_1}$ and $x_2 \in \mathbb{R}^{n_2}$ that are not both zero, $M_1(P_\pi x_{1\pi}) \neq M_2(P_\pi x_{2\pi})$ for any orderings $P_\pi x_{1\pi}$ and $P_\pi x_{2\pi}$. As a result, $\operatorname{Res}(f_{M_1}) \cap \operatorname{Res}(f_{M_2}) = \emptyset$.
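The full-column-rank condition can be checked on a small example: when $(M_1\ M_2)$ has rank $n_1 + n_2$, the matrix $A = (M_1\ {-}M_2)$ has a trivial null space, so $M_1 x_1 = M_2 x_2$ forces $x_1 = x_2 = 0$ and the non-zero ranges are disjoint. The particular block matrices below are illustrative assumptions:

```python
import numpy as np

# M1, M2 whose horizontal concatenation has full column rank n1 + n2 = 4.
M1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0],
               [0.0, 0.0]])     # s = 4, n1 = 2
M2 = np.array([[0.0, 0.0],
               [0.0, 0.0],
               [1.0, 0.0],
               [0.0, 1.0]])     # n2 = 2
assert np.linalg.matrix_rank(np.hstack([M1, M2])) == 4

# A = (M1  -M2) has rank n1 + n2, hence a trivial null space:
# M1 x1 = M2 x2 only when x1 = 0 and x2 = 0.
A = np.hstack([M1, -M2])
null_dim = A.shape[1] - np.linalg.matrix_rank(A)
assert null_dim == 0
```

With a non-trivial null space the two ranges would intersect at some non-zero point, which is exactly what the rank condition rules out.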

F PROOF OF PROPOSITION 4

Proof. For multi-head GAT, there are two implementations for aggregating the heads: concatenation and average. Here, we only consider the average implementation, where $W^k \in \mathbb{R}^{d\times d}$ is the trainable matrix of the $k$-th head, and $W = \|_{k=1}^{K} W^k \in \mathbb{R}^{Kd\times d}$ is the concatenation of the trainable matrices of all $K$ heads.

H DETAILS OF EXPERIMENTAL SETUP

Datasets. The benchmark datasets for graph kernels provided by TU (Kersting et al., 2016) suffer from their small data scales, making them insufficient for evaluating the performance of models (Dwivedi et al., 2020). Our evaluations are conducted on the graph property prediction datasets ogbg-ppa, ogbg-code, and ogbg-molhiv in OGB (Hu et al., 2020), and on QM9 (Ramakrishnan et al., 2014; Wu et al., 2018; Ruddigkeit et al., 2012), which are large-scale graph datasets covering both graph classification and graph regression tasks. ogbg-ppa is extracted from protein-protein association networks and contains large, densely connected graphs. ogbg-code is a collection of Abstract Syntax Trees (ASTs) obtained from Python method definitions, with large and sparse graphs. ogbg-molhiv is a molecular property prediction dataset with relatively small graphs. QM9 consists of 134K small organic molecules, with the task of predicting 12 targets for each molecule. All data is obtained from the pytorch-geometric library (Fey & Lenssen, 2019).

