RECURSIVE NEIGHBORHOOD POOLING FOR GRAPH REPRESENTATION LEARNING

Abstract

While message passing based Graph Neural Networks (GNNs) have become increasingly popular architectures for learning with graphs, recent works have revealed important shortcomings in their expressive power. In response, several higher-order GNNs have been proposed that substantially increase the expressive power, albeit at a large computational cost. Motivated by this gap, we introduce and analyze a new recursive pooling technique of local neighborhoods, Recursive Neighborhood Pooling GNNs (RNP-GNNs), that allows different tradeoffs of computational cost and expressive power. First, we show that this model can count subgraphs of size k, and thereby overcomes a known limitation of low-order GNNs. Second, we prove that, in several cases, RNP-GNNs can greatly reduce computational complexity compared to the existing higher-order k-GNN and Local Relational Pooling (LRP) networks.

1. INTRODUCTION

Graph Neural Networks (GNNs) are powerful tools for graph representation learning (Scarselli et al., 2008; Kipf & Welling, 2017; Hamilton et al., 2017), and have been successfully used in applications such as encoding molecules, simulating physics, social network analysis, and knowledge graphs, among many others (Duvenaud et al., 2015; Defferrard et al., 2016; Battaglia et al., 2016; Jin et al., 2018). An important class of GNNs is the set of Message Passing Graph Neural Networks (MPNNs) (Gilmer et al., 2017; Kipf & Welling, 2017; Hamilton et al., 2017; Xu et al., 2019; Scarselli et al., 2008), which follow an iterative message passing scheme to compute a graph representation. Despite the empirical success of MPNNs, their expressive power has been shown to be limited. For example, their discriminative power corresponds, at best, to the one-dimensional Weisfeiler-Leman (1-WL) graph isomorphism test (Xu et al., 2019; Morris et al., 2019), so they cannot, e.g., distinguish regular graphs. Moreover, they also cannot count any induced subgraph with at least three vertices (Chen et al., 2020), or learn graph parameters such as clique information, diameter, or shortest cycle (Garg et al., 2020). Still, in several applications, e.g., in computational chemistry, materials design, or pharmacy (Elton et al., 2019; Sun et al., 2020; Jin et al., 2018), we aim to learn functions that depend on the presence or count of specific substructures. To strengthen the expressive power of GNNs, higher-order representations such as k-GNNs (Morris et al., 2019) and k-Invariant Graph Networks (k-IGNs) (Maron et al., 2019) have been proposed. k-GNNs are inspired by the k-dimensional WL (k-WL) graph isomorphism test, a message passing algorithm on k-tuples of vertices; k-IGNs are based on equivariant linear layers of a feed-forward neural network applied to the input graph as a matrix, and are at least as powerful as k-GNNs.
These models are provably more powerful than MPNNs and can, e.g., count any induced substructure with at most k vertices. But this power comes at a computational cost of at least Ω(n^k) operations for n vertices. The necessary tradeoffs between expressive power and computational complexity are still an open question. The expressive power of a GNN is often measured in terms of a hierarchy of graph isomorphism tests, i.e., by comparing it to a k-WL test. Yet, there is limited knowledge about how the expressive power of higher-order graph isomorphism tests relates to various functions of interest (Arvind et al., 2020). A different approach is to take the perspective of specific functions that are of practical interest, and quantify a GNN's expressive power via those. Here, we focus on counting induced substructures to measure the power of a GNN, as proposed in (Chen et al., 2020). In particular, we study whether it is possible to count given substructures with a GNN whose complexity is between that of MPNNs and the existing higher-order GNNs. To this end, we study the scheme of many higher-order GNNs (Morris et al., 2019; Chen et al., 2020): select a collection of subgraphs of the input graph, encode these, and (possibly iteratively) compute a learned function on this collection. First, we propose a new such class of GNNs, Recursive Neighborhood Pooling Graph Neural Networks (RNP-GNNs). Specifically, RNP-GNNs represent each vertex by a representation of its neighborhood of a specific radius. Importantly, this neighborhood representation is computed recursively from its subgraphs. As we show, RNP-GNNs can count any induced substructure with at most k vertices. Moreover, for any set of substructures with at most k vertices, there is a specifiable RNP-GNN that can count them. This flexibility allows designing a GNN that is adapted to the power needed for the task of interest, in terms of counting (induced) substructures.
The Local Relational Pooling (LRP) architecture, too, has been introduced with the goal of counting substructures (Chen et al., 2020). While it can do so, it runs in polynomial time only if the encoded neighborhoods are of size o(log(n)). In contrast, RNP-GNNs use an almost linear number of operations, i.e., n^{1+o(1)}, if the size of each encoded neighborhood is n^{o(1)}. This is an exponential theoretical improvement in the tolerable size of neighborhoods, and a significant improvement over the complexity of O(n^k) in k-GNNs and k-IGNs. Finally, we take a broader perspective and provide an information-theoretic lower bound on the complexity of a general class of GNNs that can provably count substructures with at most k vertices. This class includes GNNs that represent a given graph by aggregating a number of encoded graphs, where the encoded graphs are related to the given graph by an arbitrary function. In short, in this paper, we make the following contributions:
• We introduce Recursive Neighborhood Pooling Graph Neural Networks (RNP-GNNs), a flexible class of higher-order graph neural networks that provably allow the design of graph representation networks with any expressive power of interest, in terms of counting (induced) substructures.
• We show that RNP-GNNs offer computational gains over existing models that count substructures: an exponential improvement in terms of the "tolerable" size of the encoded neighborhoods compared to LRP networks, and much lower complexity on sparse graphs compared to k-GNNs and k-IGNs.
• We provide an information-theoretic lower bound on the complexity of a general class of GNNs that can count (induced) substructures with at most k vertices.

2. BACKGROUND

Message Passing Graph Neural Networks. Let G = (V, E, X) be a labeled graph with |V| = n vertices. Here, X_v ∈ X denotes the initial label of v ∈ V, where X ⊆ N is a (countable) domain. A typical Message Passing Graph Neural Network (MPNN) first computes a representation of each vertex, and then aggregates the vertex representations via a readout function into a representation of the entire graph G. The representation h_v^{(i)} of each vertex v ∈ V is computed iteratively by aggregating the representations h_u^{(i-1)} of the neighboring vertices u:

m_v^{(i)} = AGGREGATE^{(i)}({{ h_u^{(i-1)} : u ∈ N(v) }}),
h_v^{(i)} = COMBINE^{(i)}(h_v^{(i-1)}, m_v^{(i)}),

for any v ∈ V, for k iterations, and with h_v^{(0)} = X_v. The AGGREGATE/COMBINE functions are parametrized, learnable functions, and {{.}} denotes a multi-set, i.e., a set with (possibly) repeating elements. A graph-level representation can be computed as h_G = READOUT({{ h_v^{(k)} : v ∈ V }}), where READOUT is a learnable function. For representational power, it is important that the learnable functions above are injective, which can be achieved, e.g., if the AGGREGATE function is a summation and COMBINE is a weighted sum concatenated with an MLP (Xu et al., 2019).

Higher-Order GNNs. To increase the representational power of GNNs, several higher-order GNNs have been proposed. In a k-GNN, a message passing algorithm is applied to the k-tuples of vertices, in a similar fashion as GNNs operate on vertices (Morris et al., 2019). At initialization, each k-tuple is labeled with its type; that is, two k-tuples are labeled differently if their induced subgraphs are not isomorphic. As a result, k-GNNs can count (induced) substructures with at most k vertices even at initialization. Another class of higher-order networks are k-IGNs, which are constructed from linear invariant/equivariant feed-forward layers whose inputs represent graphs via adjacency matrices (Maron et al., 2019).
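As a toy illustration of the scheme above, the following sketch emulates the AGGREGATE/COMBINE/READOUT pipeline with non-learnable, injective multiset encodings (sorted tuples), essentially reproducing 1-WL color refinement. The function names and dictionary-based graph encoding are our own illustrative choices, not the paper's parametrized networks.

```python
# Toy, non-learnable stand-in for an MPNN: AGGREGATE is an injective multiset
# encoding (a sorted tuple), COMBINE pairs the old state with the message.
# This mirrors 1-WL color refinement rather than a trained network.

def mpnn_layers(adj, labels, num_iters):
    """adj: dict vertex -> set of neighbors; labels: dict vertex -> hashable."""
    h = dict(labels)
    for _ in range(num_iters):
        h = {
            v: (h[v], tuple(sorted(h[u] for u in adj[v])))  # COMBINE(h_v, m_v)
            for v in adj
        }
    return h

def readout(h):
    # READOUT: injective, permutation-invariant multiset encoding
    return tuple(sorted(map(repr, h.values())))
```

For instance, two disjoint triangles and a 6-cycle receive identical readouts under this scheme (every vertex looks alike to message passing), matching the 1-WL limitation discussed in the introduction.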
k-IGNs are at least as powerful as k-GNNs, and hence they too can count substructures with at most k vertices. However, both methods need O(n^k) operations. Specifically for counting substructures, Chen et al. (2020) proposed the Local Relational Pooling (LRP) architecture, which encodes local neighborhoods around vertices.

Subgraphs and GNNs. The idea of considering local neighborhoods to obtain better representations than MPNNs is explored in several works (Liu et al., 2019; Monti et al., 2018; Liu et al., 2020; Yu et al., 2020; Meng et al., 2018; Cotta et al., 2020; Alsentzer et al., 2020; Huang & Zitnik, 2020). For example, in link prediction, one can use local neighborhoods around links and apply GNNs, as suggested in (Zhang & Chen, 2018). A novel method based on combining GNNs and a clustering algorithm is proposed in (Ying et al., 2018). For graph comparison (i.e., testing whether a given, possibly large, subgraph exists in the given graph), Ying et al. (2020) compare the outputs of GNNs for small subgraphs of the two graphs. To improve the expressive power of GNNs, Bouritsas et al. (2020) use features that are counts of specific subgraphs of interest. Another related work is (Vignac et al., 2020), where an MPNN is strengthened by learning local context matrices around vertices.

4. RECURSIVE NEIGHBORHOOD POOLING

Next, we construct Recursive Neighborhood Pooling Graph Neural Networks (RNP-GNNs), GNNs that can count any set of induced substructures of interest, with lower complexity than previous models. We represent each vertex by a representation of its radius-r_1 neighborhood, and then combine these representations. The key question is how to encode these local neighborhoods in a vector representation. To do so, we introduce a new idea: we view local neighborhoods as small subgraphs, and recursively apply our model to encode these neighborhood subgraphs. When encoding the local subgraphs, we use a different radius r_2, and, recursively, a sequence of radii (r_1, r_2, . . . , r_t) ∈ N^t to obtain the final representation h_v^{(t)} of vertices after t recursion steps. While MPNNs also encode a representation of a local neighborhood of a certain radius, the recursive representations differ as they essentially take into account intersections of neighborhoods. As a result, as we will see in Section 5.1, they retain more structural information and are more expressive. Models such as k-GNN and LRP also compute encodings of subgraphs, and then update the resulting representations via message passing. We can do the same with the neighborhood representations computed by RNP-GNNs to encode more global information, although our representation results in Section 5.1 hold even without that. In Section 6, we compare the computational complexity of RNP-GNNs and these other models.

Figure 1: An RNP-GNN with recursion parameters (2, 2, 1). To compute the representation of v, we find the representation of each vertex u ∈ G(N_2(v) \ {v}). For instance, to compute the representation of u_1, we apply an RNP-GNN with recursion parameters (2, 1) and aggregate over G((N_2(v) \ {v}) ∩ (N_2(u_1) \ {u_1})), which is shown in the bottom left of the figure. To do so, we recursively apply an RNP-GNN with recursion parameter (1) on G((N_2(v) \ {v}) ∩ (N_2(u_1) \ {u_1}) ∩ (N_1(u_{11}) \ {u_{11}})).

Formally, let G_v^{(t-1)}(N_{r_1}(v) \ {v}) denote the induced subgraph of G on the set of vertices N_{r_1}(v) \ {v}, with augmented vertex labels X_u^{(t-1)} = (h_u^{(t-1)}, 1[(u, v) ∈ E]) for any u ∈ N_{r_1}(v) \ {v}. This means we add information about whether vertices are direct neighbors (with distance one) of v. Given a recursion sequence (r_1, r_2, . . . , r_t) of radii, the representations are updated as

m_v^{(t)} = RNP-GNN^{(t-1)}( G_v^{(t-1)}(N_{r_1}(v) \ {v}) ),
h_v^{(t)} = COMBINE^{(t)}( h_v^{(t-1)}, m_v^{(t)} ),

for any v ∈ V, and

h_G = READOUT({{ h_v^{(t)} : v ∈ V }}).

Different from MPNNs, the recursive update above is in general applied to a subgraph, and not to a multi-set of vertex representations. RNP-GNN^{(t-1)} is an RNP-GNN with recursion parameters (r_2, . . . , r_t) ∈ N^{t-1}. The final READOUT is an injective, permutation-invariant, learnable multi-set function. If t = 1, then

m_v^{(t)} = AGGREGATE^{(t)}({{ (h_u^{(t-1)}, 1[(u, v) ∈ E]) : u ∈ N_{r_1}(v) }})

is a permutation-invariant aggregation function as used in MPNNs, only over a potentially larger neighborhood. For r_1 = 1 and t = 1, an RNP-GNN reduces to an MPNN. In Figure 1, we illustrate an RNP-GNN with recursion parameters (2, 2, 1) as an example. We also provide pseudocode for RNP-GNNs in Appendix C.

Figure 2: MPNNs cannot count substructures with three vertices or more (Chen et al., 2020). For example, the graph with the black center vertex on the left cannot be counted, since the two graphs on the left result in the same vertex representations as the graph on the right.
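To make the recursion concrete, here is a small, non-learnable sketch in which injective aggregation is emulated by sorted tuples of `repr` strings. The names `ball`, `induced`, `rnp_vertex`, and `rnp_graph` are our own illustrative choices, and the code follows the update rule only in spirit (no trained COMBINE/READOUT).

```python
def ball(adj, v, r):
    """N_r(v): vertices within shortest-path distance r of v (BFS), incl. v."""
    seen, frontier = {v}, {v}
    for _ in range(r):
        frontier = {w for u in frontier for w in adj[u]} - seen
        seen |= frontier
    return seen

def induced(adj, S):
    """Induced subgraph on vertex set S, as an adjacency dict."""
    return {v: adj[v] & S for v in S}

def rnp_vertex(adj, labels, v, radii):
    """Representation of v: recursively encode the punctured ball around v."""
    r1, rest = radii[0], radii[1:]
    S = ball(adj, v, r1) - {v}
    sub = induced(adj, S)
    # augment labels with a flag marking direct neighbors of v, as in the text
    sub_labels = {u: (labels[u], u in adj[v]) for u in S}
    if rest:
        m = rnp_graph(sub, sub_labels, rest)       # recursive RNP encoding
    else:
        # t = 1: plain multiset aggregation over the (flagged) neighborhood
        m = tuple(sorted(map(repr, sub_labels.values())))
    return (labels[v], m)                          # COMBINE(h_v, m_v)

def rnp_graph(adj, labels, radii):
    """READOUT: injective multiset encoding of all vertex representations."""
    return tuple(sorted(repr(rnp_vertex(adj, labels, v, radii)) for v in adj))
```

Unlike the MPNN sketch, this recursion separates two disjoint triangles from a 6-cycle with recursion parameters (2, 1), because the punctured neighborhood of a triangle vertex is structurally different from that of a cycle vertex.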

5. EXPRESSIVE POWER

In this section, we analyze the expressive power of RNP-GNNs.

5.1. COUNTING (INDUCED) SUBSTRUCTURES

In contrast to MPNNs, which, in general, cannot count substructures of three vertices or more, in this section we prove that for any set of substructures, there is an RNP-GNN that provably counts them. We begin with a few definitions.

Definition 1. Let G, H be arbitrary labeled simple graphs, where V is the set of vertices in G. Also, for any S ⊆ V, let G(S) denote the subgraph of G induced by S. The induced subgraph count function is defined as

C(G; H) := Σ_{S⊆V} 1{G(S) ≅ H},

i.e., the number of induced subgraphs of G isomorphic to H. For unlabeled H, the function is defined analogously.

We also need to define a notion of covering for graphs. Our definition uses distances on graphs.

Definition 2. Let H = (V_H, E_H) be a (possibly labeled) simple connected graph. For any S ⊆ V_H and v ∈ V_H, define d_H(v; S) := max_{u∈S} d(u, v), where d(., .) is the shortest-path distance in H.

Definition 3. Let H be a (possibly labeled) simple connected graph on t + 1 vertices. A permutation of vertices, such as (v_1, v_2, . . . , v_{t+1}), is called a vertex covering sequence with respect to a sequence r = (r_1, r_2, . . . , r_t) ∈ N^t, called a covering sequence, if and only if

d_{H_i}(v_i; S_i) ≤ r_i

for any i ∈ [t] = {1, 2, . . . , t}, where S_i = {v_i, v_{i+1}, . . . , v_{t+1}} and H_i = H(S_i) is the subgraph of H induced by the set of vertices S_i. We also say that H admits the covering sequence r = (r_1, r_2, . . . , r_t) ∈ N^t if there is a vertex covering sequence for H with respect to r.

In particular, in a covering sequence we first consider the whole graph as a local neighborhood of one of its vertices with radius r_1. Then, we remove that vertex and compute the covering sequence of the remaining graph. Figure 3 shows an example of a covering sequence computation. An important property, which holds by definition, is that if r is a covering sequence for H, then any r' ≥ r (in a point-wise sense) is also a covering sequence for H.
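The condition in Definition 3 can be checked mechanically. The sketch below (our own illustrative helper names, plain BFS distances) verifies whether a given vertex ordering is a vertex covering sequence with respect to r.

```python
def dist(adj, u, v):
    """Shortest-path distance between u and v (BFS); inf if disconnected."""
    if u == v:
        return 0
    seen, frontier, d = {u}, {u}, 0
    while frontier:
        d += 1
        frontier = {w for x in frontier for w in adj[x]} - seen
        if v in frontier:
            return d
        seen |= frontier
    return float("inf")

def induced(adj, S):
    """Induced subgraph on vertex set S, as an adjacency dict."""
    return {x: adj[x] & S for x in S}

def is_vertex_covering_sequence(adj, order, r):
    """Check Definition 3 for order = (v_1, ..., v_{t+1}), r = (r_1, ..., r_t)."""
    t = len(order) - 1
    assert len(r) == t
    for i in range(t):
        S = set(order[i:])            # S_i = {v_i, ..., v_{t+1}}
        H = induced(adj, S)           # H_i = H(S_i)
        # d_{H_i}(v_i; S_i) = max_{u in S_i} d(u, v_i) must be <= r_i
        if max(dist(H, u, order[i]) for u in S) > r[i]:
            return False
    return True
```

For a path on four vertices, for example, the ordering that peels the path from one end admits the covering sequence (3, 2, 1) but not (2, 2, 1), since the first removed endpoint has eccentricity 3.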
Note that any connected graph on k vertices admits at least one covering sequence, namely (k-1, k-2, . . . , 1). To see this, note that in a connected graph, there is at least one vertex that can be removed such that the remaining graph is still connected. Therefore, we may take this vertex as the first element of a vertex covering sequence, and inductively find the other elements. Since the diameter of a connected graph with k vertices is always bounded by k-1, we achieve the desired result.

Figure 3: Example of a covering sequence computed for the graph on the left. For this graph, (v_6, v_1, v_4, v_5, v_3, v_2) is a vertex covering sequence with respect to the covering sequence (3, 3, 3, 2, 1). The first two computations to obtain this covering sequence are depicted in the middle and on the right.

However, we will see in the next section that, when using covering sequences to identify sufficiently powerful RNP-GNNs, it is desirable to have covering sequences with low r_1, since the complexity of the resulting RNP-GNN depends on r_1. We provide an algorithm in Appendix D to find such covering sequences in polynomial time. More generally, if H_1 and H_2 are (possibly labeled) simple graphs on k vertices and H_1 ⊆ H_2, i.e., H_1 is a subgraph of H_2 (not necessarily an induced subgraph), then it follows from the definition that any covering sequence for H_1 is also a covering sequence for H_2. As a side remark, as illustrated in Figure 4, covering sequences need not always be decreasing. Using covering sequences, we can show the following result.

Theorem 1. Consider a set of (labeled or unlabeled) graphs H on t + 1 vertices, such that any H ∈ H admits the covering sequence (r_1, r_2, . . . , r_t). Then, there is an RNP-GNN with recursion parameters (r_1, r_2, . . . , r_t) that can count any H ∈ H.
In other words, if there exists H ∈ H such that C(G_1; H) ≠ C(G_2; H), then f(G_1; θ) ≠ f(G_2; θ). The same result also holds for the non-induced subgraph count function. Theorem 1 states that, with appropriate recursion parameters, any set of (labeled or unlabeled) substructures can be counted by an RNP-GNN. Interestingly, induced and non-induced subgraphs can both be counted by RNP-GNNs. The theorem holds for any covering sequence that is valid for all graphs in H. For any graph, one can compute a covering sequence by computing a spanning tree and sequentially pruning the leaves of the tree. The resulting sequence of nodes is a vertex covering sequence, and the corresponding covering sequence can be obtained from the tree, too (Appendix D). A valid covering sequence for all the graphs in H is the coordinate-wise maximum of all these sequences. For large substructures, the sequence (r_1, r_2, . . . , r_t) can be long or include large numbers, and this will affect the computational complexity of RNP-GNNs. For small, e.g., constant-size substructures, the recursion parameters are also small (i.e., r_i = O(1) for all i), raising the hope to count these structures efficiently. In particular, r_1 is an important parameter. In Section 6, we analyze the complexity of RNP-GNNs in more detail.
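The pruning idea above can be sketched greedily: repeatedly remove a vertex whose removal keeps the remaining graph connected, and record the removed vertex's eccentricity in the current graph as r_i. This is our own simplified variant of the construction, not the Appendix D algorithm; by Definition 3 the output is always a valid (vertex) covering sequence for a connected input.

```python
def _ecc_and_reach(adj, v):
    """BFS from v: (eccentricity of v, set of reached vertices)."""
    seen, frontier, d = {v}, {v}, 0
    while True:
        nxt = {w for u in frontier for w in adj[u]} - seen
        if not nxt:
            return d, seen
        d += 1
        seen |= nxt
        frontier = nxt

def covering_sequence(adj):
    """Greedy sketch for a connected graph: repeatedly remove a vertex whose
    removal keeps the rest connected, recording the removed vertex's
    eccentricity in the current graph as r_i (valid by Definition 3)."""
    adj = {v: set(adj[v]) for v in adj}
    order, r = [], []
    while len(adj) > 1:
        for v in sorted(adj):
            rest = {u: adj[u] - {v} for u in adj if u != v}
            _, reach = _ecc_and_reach(rest, next(iter(rest)))
            if len(reach) == len(rest):              # rest is still connected
                order.append(v)
                r.append(_ecc_and_reach(adj, v)[0])  # r_i = ecc of v in H_i
                adj = rest
                break
    order.append(next(iter(adj)))
    return order, r
```

Note that this greedy variant does not try to minimize r_1; the algorithm in Appendix D is the one designed for that goal.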

5.2. A UNIVERSAL APPROXIMATION RESULT FOR LOCAL FUNCTIONS

Theorem 1 shows that RNP-GNNs can count substructures if their recursion parameters are chosen carefully. Next, we provide a universal approximation result, which shows that they can learn any function related to local neighborhoods or small subgraphs in a graph. First, we recall that for a graph G, G(S) denotes the subgraph of G induced by the set of vertices S.

Definition 4. A function ℓ : G_n → R^d is called an r-local graph function if ℓ(G) = φ({{ ψ(G(S)) : S ⊆ V, |S| ≤ r }}), where ψ : G_r → R^d is a function on graphs and φ is a multi-set function.

In other words, a local function only depends on small substructures.

Theorem 2. For any r-local graph function ℓ(.), there exists an RNP-GNN f(.; θ) with recursion parameters (r-1, r-2, . . . , 1) such that f(G; θ) = ℓ(G) for any G ∈ G_n.

As a result, we can provably learn all the local information in a graph with an appropriate RNP-GNN. Note that we still need recursions, because the function ψ(.) may be an arbitrarily difficult graph function. However, to achieve the full generality of such a universal approximation result, we need to consider large recursion parameters (r_1 = r-1) and injective aggregations in the RNP-GNN network. For universal approximation, we may also need high dimensions if feedforward network layers are used for aggregation (see the proof of the theorem for more details). As a remark, for r = n, achieving universal approximation on graphs implies solving the graph isomorphism problem. But, in this extreme case, the computational complexity of the model is in general not polynomial in n.
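As a concrete instance of Definition 4, the number of triangles in a graph is a 3-local function: ψ indicates whether an induced 3-vertex subgraph is a triangle, and φ sums the resulting multiset. The brute-force sketch below (our own illustrative code, not an RNP-GNN) evaluates it directly.

```python
from itertools import combinations

def triangle_count_local(adj):
    """Triangle count written in the r-local form of Definition 4, with r = 3:
    psi(G(S)) = 1{G(S) is a triangle}, phi = sum over the multiset."""
    def psi(S):
        # G(S) is a triangle iff all three vertex pairs in S are edges
        return int(all(v in adj[u] for u, v in combinations(S, 2)))
    return sum(psi(S) for S in combinations(list(adj), 3))
```

By Theorem 2, an RNP-GNN with recursion parameters (2, 1) can represent this function, while an MPNN cannot.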

6. COMPUTATIONAL COMPLEXITY

The computational complexity of RNP-GNNs is graph-dependent. For instance, we need to compute the set of local neighborhoods, which is cheaper for sparse graphs. A complexity measure existing in the literature is the tensor order. For higher-order networks, e.g., k-IGNs, we need to consider tensors in R^{n^k}. The space complexity is then O(n^k), and the time complexity can be even larger, depending on the algorithm used to process the tensors. In general, for a message passing algorithm on graphs, the complexity of the model depends linearly on the number of vertices (if the graph is sparse). Therefore, to bound the complexity of a method, we need to bound the number of node representation updates, which we do in the following theorem.

Theorem 3. Let f(.; θ) : G_n → R^d be an RNP-GNN with recursion parameters (r_1, r_2, . . . , r_t). Assume that the observed graphs G_1, G_2, . . ., whose representations we compute, satisfy

max_{v∈[n]} |N_{r_1}(v)| ≤ c,

where c is a graph-independent constant. Then, the number of node updates in the RNP-GNN is O(nc^t).

In other words, if c = n^{o(1)} and t = O(1), then an RNP-GNN requires relatively few updates (namely n^{1+o(1)}), compared to the higher-order networks (O(n^{t+1})). Also, in this case, finding neighborhoods is not difficult, since neighborhoods are small (n^{o(1)}). Note that if the maximum degree of the given graphs is ∆, then c = O(r_1 ∆^{r_1}). Therefore, similarly, if ∆ = n^{o(1)}, then we can count with at most n^{1+o(1)} updates. These results show that when using RNP-GNNs with sparse graphs, we can learn functions of substructures with k vertices without requiring k-order tensors. LRP networks also encode neighborhoods of distance r_1 around nodes. In particular, all c! permutations of the nodes in a neighborhood of size c are considered to obtain the representation. As a result, LRP networks only have polynomial complexity if c = o(log(n)).
Thus, RNP-GNNs can provide an exponential improvement in terms of the tolerable size c of neighborhoods of distance r_1 in the graph. Moreover, Theorem 3 suggests aiming for a small r_1. The other r_i's may be larger than r_1, as shown in Figure 4, but they do not affect the upper bound on the complexity.
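The quantities in Theorem 3 are easy to compute for a given graph. The sketch below (illustrative names `max_ball` and `update_bound`, our own) evaluates c = max_v |N_{r_1}(v)| by BFS (here including v itself) and the resulting O(n·c^t) update bound.

```python
def ball(adj, v, r):
    """N_r(v): vertices at shortest-path distance at most r from v (incl. v)."""
    seen, frontier = {v}, {v}
    for _ in range(r):
        frontier = {w for u in frontier for w in adj[u]} - seen
        seen |= frontier
    return seen

def max_ball(adj, r):
    # c in Theorem 3: the size of the largest radius-r neighborhood
    return max(len(ball(adj, v, r)) for v in adj)

def update_bound(adj, radii):
    # n * c^t bound on the number of node updates, per Theorem 3
    c = max_ball(adj, radii[0])
    return len(adj) * c ** len(radii)
```

On a 6-cycle with r_1 = 2, every radius-2 neighborhood has 5 vertices, so the bound for recursion parameters (2, 1) is 6 · 5^2 = 150 updates.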

7. AN INFORMATION-THEORETIC LOWER BOUND

In this section, we provide a general information-theoretic lower bound for graph representations that encode a given graph G by first encoding a number of (possibly small) graphs G_1, G_2, . . . , G_t and then aggregating the resulting representations. The sequence of graphs G_1, G_2, . . . , G_t may be obtained in an arbitrary way from G. For example, in an MPNN, G_i can be the computation tree (rooted tree) at node i. As another example, in LRP, G_i is the local neighborhood around node i.

Figure 4: For the above graph, (v_1, v_2, . . . , v_6) is a vertex covering sequence. The corresponding covering sequence (1, 4, 3, 2, 1) is not decreasing.

Formally, consider a graph representation f(.; θ) : G_n → R^d of the form f(G; θ) = Φ({{ ψ(G_i) : i ∈ [t] }}), [t] = {1, . . . , t}, for any G ∈ G_n, where Φ is a multi-set function, (G_1, G_2, . . . , G_t) = Ξ(G) for an arbitrary function Ξ(.) : G_n → (∪_{m=1}^∞ G_m)^t, and the encoding function ψ takes at most s distinct values. We call such a representation (s, t)-good.

Theorem 4. Consider an (s, t)-good graph representation f(.; θ) such that, for any substructure H on at most k vertices, there is a parametrization f(.; θ) with the property that C(G_1; H) ≠ C(G_2; H) implies f(G_1; θ) ≠ f(G_2; θ). Then t = Ω(n^{k/(s-1)}).

In particular, for any (s, t)-good graph representation with s = 2, i.e., binary encoding functions, we need Ω(n^k) encoded graphs. This implies that, for s = 2, enumerating all subgraphs and deciding for each whether it equals H is near optimal. Moreover, if s ≤ k, then t = Ω(n) small graphs would not suffice to enable counting. More interestingly, if k, s = O(1), then it is impossible to perform the substructure counting task with t = O(log(n)). As a result, in this case, considering n encoded graphs (as is done in GNNs or LRP networks) cannot be exponentially improved. The lower bound in this section is information-theoretic and hence applies to any algorithm. It may be possible to strengthen it by considering computational complexity, too. For binary encodings, i.e., s = 2, however, we know that the bound cannot be improved, since manual counting of subgraphs matches the lower bound.
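The counting argument behind the bound can be made concrete: a multiset of t values drawn from an alphabet of size s has C(t+s-1, s-1) ≤ (t+1)^{s-1} possibilities, so an (s, t)-good representation can distinguish at most that many graphs. A small illustration of our own:

```python
from math import comb

def num_multisets(s, t):
    """Number of multisets of size t over an alphabet of size s:
    C(t + s - 1, s - 1). An (s, t)-good representation takes at most
    this many distinct values, which is at most (t + 1) ** (s - 1)."""
    return comb(t + s - 1, s - 1)
```

With s = 2 there are only t + 1 possible multisets, so distinguishing the Ω(n^k) possible values of the clique count C(G; K_k) forces t = Ω(n^k), as stated above.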

8. TIME COMPLEXITY LOWER BOUNDS FOR COUNTING SUBGRAPHS

In this section, we put our results in the context of known hardness results for subgraph counting. In general, the subgraph isomorphism problem is NP-complete. Going further, the Exponential Time Hypothesis (ETH) is a conjecture in complexity theory (Impagliazzo & Paturi, 2001) stating that several NP-complete problems cannot be solved in sub-exponential time. ETH, as a strengthening of P ≠ NP, is widely believed to hold. Assuming that ETH holds, the k-clique detection problem requires at least n^{Ω(k)} time (Chen et al., 2005). This means that if a graph representation can count any subgraph H of size k, then computing it requires at least n^{Ω(k)} time.

Corollary 1. Assuming the ETH conjecture holds, any graph representation that can count any substructure H on k vertices (with appropriate parametrization) needs n^{Ω(k)} time to compute.

The above bound matches the O(n^k) complexity of the higher-order GNNs. Compared with Theorem 4 above, Corollary 1 is more general, while Theorem 4 has fewer assumptions and offers a refined result for aggregation-based graph representations. Given that Corollary 1 is a worst-case bound, a natural question is whether we can do better for subclasses of graphs. Regarding H, even if H is a random Erdös-Rényi graph, it can only be counted in n^{Ω(k/log k)} time (Dalirrooyfard et al., 2019). Regarding the input graph in which we count, consider two classes of sparse graphs: strongly sparse graphs have maximum degree ∆ = O(1), and weakly sparse graphs have average degree ∆ = O(1). We argued in Theorem 3 that RNP-GNNs achieve almost linear complexity for the class of strongly sparse graphs. For weakly sparse graphs, in contrast, the complexity of RNP-GNNs is generally not linear, but still polynomial, and can be much better than O(n^k). One may ask whether it is possible to achieve a learnable graph representation whose complexity for weakly sparse graphs is still linear.
Recent results in complexity theory imply that this is impossible:

Corollary 2 (Gishboliner et al. (2020); Bera et al. (2019)). There is no graph representation algorithm that runs in linear time on weakly sparse graphs and is able to count any substructure H on k vertices (with appropriate parametrization).

Hence, RNP-GNNs are close to optimal in several cases of counting substructures with parametrized learnable functions.

9. CONCLUSION

In this paper, we studied the theoretical possibility of counting substructures (induced subgraphs) with a graph representation network. We proposed an architecture, called RNP-GNN, and proved that for reasonably sparse graphs it can count substructures efficiently. Characterizing the expressive power of GNNs via the set of functions they can learn on substructures may be useful for developing new architectures. Finally, we proved a general lower bound for any graph representation that counts subgraphs and works by aggregating representations of a collection of graphs derived from the input graph.

REFERENCES

Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165-5175, 2018.

A PROOFS

A.1 PROOF OF THEOREM 1

A.1.1 PRELIMINARIES

Let us first state a few definitions about graph functions. Note that for any graph function f : G_n → R^d, we have f(G) = f(H) for any G ≅ H.

Definition 5. Given two graph functions f, g : G_n → R^d, we write f ⪰ g if and only if

∀G_1, G_2 ∈ G_n : f(G_1) = f(G_2) =⇒ g(G_1) = g(G_2),

or, equivalently,

∀G_1, G_2 ∈ G_n : g(G_1) ≠ g(G_2) =⇒ f(G_1) ≠ f(G_2).

Proposition 1. Consider graph functions f, g, h : G_n → R^d such that f ⪰ g and g ⪰ h. Then, f ⪰ h. In other words, ⪰ is transitive.

Proof. The proposition holds by definition.

Proposition 2. Consider graph functions f, g : G_n → R^d such that f ⪰ g. Then, there is a function ξ : R^d → R^d such that ξ ∘ f = g.

Proof. Let G_n = ∪_{i∈N} F_i be the partitioning induced by the equality relation with respect to the function f on G_n. Similarly, define G_i, i ∈ N, for g. Note that, by the definition of ⪰, {F_i : i ∈ N} is a refinement of {G_i : i ∈ N}. Define ξ to be the unique mapping from {F_i : i ∈ N} to {G_i : i ∈ N} that respects the equality relation. One can observe that such a ξ satisfies the requirement of the proposition.

Definition 6. An RNP-GNN is called maximally expressive if and only if
• all the aggregate functions are injective as mappings from a multi-set on a countable ground set to their codomain, and
• all the combine functions are injective mappings.

Proposition 3. Consider two RNP-GNNs f, g with the same recursion parameters r = (r_1, r_2, . . . , r_t), where f is maximally expressive. Then, f ⪰ g.

Proof.

The proposition holds by definition.

Proposition 4. Consider a sequence of graph functions f, g_1, . . . , g_k. If f ⪰ g_i for all i ∈ [k], then f ⪰ Σ_{i=1}^k c_i g_i, for any c_i ∈ R, i ∈ [k].

Proof. Since f ⪰ g_i, we have

∀G_1, G_2 ∈ G_n : f(G_1) = f(G_2) =⇒ g_i(G_1) = g_i(G_2),

for all i ∈ [k]. This means that for any G_1, G_2 ∈ G_n, if f(G_1) = f(G_2), then g_i(G_1) = g_i(G_2) for all i ∈ [k], and consequently Σ_{i=1}^k c_i g_i(G_1) = Σ_{i=1}^k c_i g_i(G_2). Therefore, from the definition we conclude f ⪰ Σ_{i=1}^k c_i g_i. Note that the same proof also holds in the case of countable summations, as long as the summation is bounded.

For n = k, the non-induced subgraph count can be expanded over supergraphs of H as

C̄(G; H) := Σ_{H'∈H̄(H)} c_{H',H} × 1{G ≅ H'}, where H̄(H) := {H' ∈ G_k : H ⊆ H'},

for appropriate constants c_{H',H}. The following proposition shows that there is no difference between counting induced labeled graphs and counting induced unlabeled graphs in RNP-GNNs.

Proposition 6. Let H_0 be an unlabeled connected graph. Assume that for any labeled graph H, constructed by adding arbitrary labels to H_0, there exists an RNP-GNN f_H(.; θ_H) such that f_H ⪰ C(G; H). Then there exists an RNP-GNN f(.; θ) with the same recursion parameters as f_H(.; θ_H) such that f ⪰ C(G; H_0).

Proof. If there exists an RNP-GNN f_H(.; θ_H) such that f_H ⪰ C(G; H), then for a maximally expressive RNP-GNN f(.; θ) with the same recursion parameters as f_H we also have f ⪰ C(G; H). Let H be the set of all labeled graphs H = (V, E, X) ∈ G_k, up to graph isomorphism, where X ∈ X^k for a countable set X. Note that H = {H_1, H_2, . . .} is a countable set. Now we write

C(G; H_0) = Σ_{S⊆[n], |S|=k} 1{G(S) ≅ H_0}
          = Σ_{S⊆[n], |S|=k} Σ_{i∈N} 1{G(S) ≅ H_i}
          = Σ_{i∈N} Σ_{S⊆[n], |S|=k} 1{G(S) ≅ H_i}
          = Σ_{i∈N} C(G; H_i).

Proposition 7. Let C_H(r) denote the set of vertex covering sequences of H with respect to r. If H ⊆ G are graphs on the same vertex set, then C_H(r) ⊆ C_G(r) for any sequence r.

Proof. The proposition follows from the fact that the distance function d decreases when new edges are introduced.

Proposition 8.
Assume that Theorem 1 holds for the induced subgraph count function. Then, it also holds for the non-induced subgraph count function.

Proof. Assume that for a connected (labeled or unlabeled) graph H, there exists an RNP-GNN with appropriate recursion parameters f_H(.; θ_H) such that f_H ⪰ C(G; H); we prove that there exists an RNP-GNN f(.; θ) with the same recursion parameters as f_H such that f ⪰ C̄(G; H). If such an f_H(.; θ_H) exists, then for a maximally expressive RNP-GNN f(.; θ) with the same recursion parameters as f_H we also have f ⪰ C(G; H). Note that

C̄(G; H) = Σ_{S⊆[n], |S|=k} C̄(G(S); H)
        = Σ_{S⊆[n], |S|=k} Σ_{H'∈H̄(H)} c_{H',H} × 1{G(S) ≅ H'}
        = Σ_{H'∈H̄(H)} c_{H',H} Σ_{S⊆[n], |S|=k} 1{G(S) ≅ H'}
        = Σ_{i∈N} c_{H_i,H} × C(G; H_i),

where H̄(H) = {H_1, H_2, . . .}.

Claim 1. f ⪰ C(G; H_i) for any i.

Using Proposition 4 and Claim 1, we conclude that f ⪰ C̄(G; H), since C̄(G; H) is finite and f ⪰ C(G; H_i) for any i, and the proof is complete. The missing part we must show is that for any H_i, the sequence (r_1, r_2, . . . , r_t) that covers H also covers H_i. This follows from Proposition 7. We are done.

At the end of this part, let us introduce an important notation. For any labeled connected simple graph on k vertices G = (V, E, X), let G*_v be the induced graph obtained after removing v ∈ V from G, with the new labels defined as X*_u := (X_u, 1{(u, v) ∈ E}) for each u ∈ V \ {v}. We may also write X*^v_u for clarity.

Now assume that the claim holds for graphs with c_H = c - 1 ≥ 1 connected components. We show that it also holds for c_H = c. Let H_1, H_2, . . . , H_c denote the connected components of H. Also assume that H_i ≇ H_j for all i ≠ j; we will relax this assumption later. Let us define

A_G := {(S_1, S_2, . . . , S_c) : ∀i ∈ [c] : S_i ⊆ [n], G(S_i) ≅ H_i}.

Note that we can write

|A_G| = Π_{i=1}^c C(G; H_i) = C(G; H) + Σ_{j=1}^∞ c_j C(G; H^j),

for appropriate graphs H^j and constants c_j.

A.3 PROOF OF THEOREM 3

To prove Theorem 3, we need to bound the number of node updates required for an RNP-GNN with recursion parameters (r_1, r_2, . . . , r_t). First of all, we have n variables used for the final representations of the vertices. For each vertex v_1 ∈ V, we explore the local neighborhood N_{r_1}(v_1) and apply a new RNP-GNN network to that neighborhood; in other words, for the second step we need to update |N_{r_1}(v_1)| nodes. Similarly, for the i-th step of the algorithm we have at most

λ_i := max_{v_1 ∈ [n]} max_{v_{j+1} ∈ N_{r_j}(v_j), ∀ j ∈ [i−1]} |N_{r_1}(v_1) ∩ N_{r_2}(v_2) ∩ · · · ∩ N_{r_i}(v_i)|

updates. Therefore, we can bound the number of node updates by n × Π_{i=1}^t λ_i. Since λ_i is decreasing in i, we conclude the desired result.

A.4 PROOF OF THEOREM 4

Let K_k denote the complete graph on k vertices.

Claim 3. For any k, n ∈ ℕ with n sufficiently large,

|{C(G; K_k) : G ∈ G_n}| ≥ (cn/(k log(n/k)) − k)^k / k! = Ω̃(n^k),  (*)

where c is a constant which does not depend on k, n. In particular, we claim that the number of different values that C(G; K_k) can take is n^k, up to poly-logarithmic factors.

To prove the theorem, we use the above claim. As a result, (t + 1)^{s−1} = Ω̃(n^k), or t = Ω̃(n^{k/(s−1)}). To complete the proof, we only need to prove the claim.

Proof of Claim 3. Let p_1, p_2, . . . , p_m be the distinct prime numbers less than n/k. Using the prime number theorem, we know that

lim_{n→∞} m / ((n/k) / log(n/k)) = 1.

In particular, we can choose n large enough to ensure cn/(k log(n/k)) < m, for any constant c < 1. The edges are defined by

e = (u, v) ∈ G_B ⟺ ∃ i, j ∈ [k], i ≠ j : u ∈ V_i and v ∈ V_j.

The graph G_B is well-defined since, as shown below, Σ_{i=1}^k p_{b_i} ≤ n.
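The node-update count in the argument above can be checked by direct enumeration on a toy graph. The sketch below is ours, not the paper's: `neighborhood` and `count_updates` are hypothetical helper names, graphs are plain adjacency-set dictionaries, and the function brute-forces the total number of node updates performed by the recursion of an RNP-GNN with the given recursion parameters.

```python
from collections import deque

def neighborhood(adj, v, r, allowed):
    # N_r(v): vertices within distance r of v inside the subgraph
    # induced on `allowed` (v itself is excluded)
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        if dist[u] == r:
            continue
        for w in adj[u]:
            if w in allowed and w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return set(dist) - {v}

def count_updates(adj, verts, radii):
    # total node updates of an RNP-GNN recursion with recursion
    # parameters `radii`, run on the induced subgraph G(verts)
    total = len(verts)  # one representation update per vertex at this level
    if not radii:
        return total
    for v in verts:
        sub = neighborhood(adj, v, radii[0], verts)
        total += count_updates(adj, sub, radii[1:])
    return total

# toy example: the path 1-2-3 with recursion parameters (1,)
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(count_updates(path, set(path), (1,)))  # prints 7 = 3 + 1 + 2 + 1
```

Each level contributes one update per vertex of the current subgraph, matching the n × Π λ_i bound in the proof.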

B RELATIONSHIP TO THE RECONSTRUCTION CONJECTURE

Theorem 2 provides a universality result for RNP-GNNs. Here, we note that the proposed method is closely related to the reconstruction conjecture, a long-standing open problem in graph theory, which motivates us to explain their relationship and differences. First, we need a definition for unlabeled graphs.

Definition 10. Let F_n ⊆ G_n be a set of graphs, and let G_v = G(V \ {v}) for any finite simple graph G = (V, E) and any v ∈ V. Then, we say the set F_n is reconstructible if and only if there is a bijection Φ : {{G_v : v ∈ V}} ⟷ G, for any G ∈ F_n. In other words, F_n is reconstructible if and only if the multiset {{G_v : v ∈ V}} fully identifies G, for any G ∈ F_n.

It is known that the classes of disconnected graphs, trees, and regular graphs are reconstructible (Kelly et al., 1957; McKay, 1997). The general case is still open; however, it is widely believed to be true.

Conjecture 1 (Kelly et al. (1957)). G_n is reconstructible.
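To make Definition 10 concrete, the following sketch (ours, not the paper's; `canon` and `deck` are hypothetical names, feasible only for tiny graphs) computes the deck {{G_v : v ∈ V}} by brute-force canonization. It also illustrates why the conjecture is posed for graphs on at least three vertices: on two vertices, K_2 and the empty graph have identical decks.

```python
from itertools import permutations

def canon(edges, verts):
    # canonical form of a small graph: the lexicographically smallest
    # edge set over all relabelings (brute force; tiny graphs only)
    verts = sorted(verts)
    best = None
    for perm in permutations(range(len(verts))):
        relabel = {v: perm[i] for i, v in enumerate(verts)}
        e = tuple(sorted(tuple(sorted((relabel[u], relabel[w]))) for u, w in edges))
        if best is None or e < best:
            best = e
    return best

def deck(edges, verts):
    # multiset of vertex-deleted subgraphs G_v, up to isomorphism
    cards = []
    for v in verts:
        rem_v = [e for e in edges if v not in e]
        cards.append(canon(rem_v, verts - {v}))
    return tuple(sorted(cards))

p4 = [(1, 2), (2, 3), (3, 4)]    # path on 4 vertices
star = [(1, 2), (1, 3), (1, 4)]  # star K_{1,3}
print(deck(p4, {1, 2, 3, 4}) != deck(star, {1, 2, 3, 4}))  # True: decks differ
print(deck([(1, 2)], {1, 2}) == deck([], {1, 2}))          # True: K_2 vs empty graph
```

The last line shows the two-vertex degenerate case: both decks consist of two single-vertex graphs, so the deck cannot identify the graph there.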



For simplicity, we assume that H only contains (t + 1)-vertex graphs. If H includes graphs with strictly fewer than t + 1 vertices, we can simply add a sufficient number of zeros to the right-hand side of their covering sequences. The theorem also holds for induced subgraphs, with or without vertex labels. Ω̃(m) denotes Ω(m) up to poly-logarithmic factors.



Figure 1: Illustration of a Recursive Neighborhood Pooling GNN (RNP-GNN) with recursion parameters (2, 2, 1). To compute the representation of vertex v in the given input graph (depicted in the top left of the figure), we first recurse on G(N_2(v) \ {v}), depicted in the top right of the figure. To do so, we find the representation of each vertex u of G(N_2(v) \ {v}). For instance, to compute the representation of u_1, we apply an RNP-GNN with recursion parameters (2, 1) and aggregate G((N_2(v) \ {v}) ∩ (N_2(u_1) \ {u_1})), shown in the bottom left of the figure. To do so, we recursively apply an RNP-GNN with recursion parameter (1) on G((N_2(v) \ {v}) ∩ (N_2(u_1) \ {u_1}) ∩ (N_1(u_11) \ {u_11})), shown in the bottom right of the figure.

Formally, an RNP-GNN is a parametrized learnable function f(·; θ) : G_n → R^d, where G_n is the set of all labeled graphs on n vertices. Let G = (V, E, X) be a labeled graph with |V| = n vertices, and let h^(0)_v = X_v be the initial representation of each vertex v. Let N_r(v) denote the neighborhood of radius r of vertex v, and let G(S) denote the subgraph of G induced by a vertex subset S ⊆ V.

Let H = (V_H, E_H, X_H) be a labeled connected simple graph on k vertices. For any labeled graph G = (V_G, E_G, X_G) ∈ G_n, the induced-subgraph count function C(G; H) is defined as

C(G; H) := Σ_{S ⊆ [n], |S| = k} 1{G(S) ≅ H}.  (18)

Also, let C̄(G; H) denote the number of non-induced subgraphs of G which are isomorphic to H; it can equivalently be defined via the homomorphisms from H to G. Formally, if n > k, define

C̄(G; H) := Σ_{S ⊆ [n], |S| = k} C̄(G(S); H).
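For intuition, the induced-subgraph count C(G; H) in (18) can be computed by brute force when H is tiny. The sketch below is ours (the name `induced_count` is hypothetical): graphs are adjacency-set dictionaries with one label per vertex, and each subset S is counted once if some label-preserving isomorphism maps H onto the induced subgraph G(S).

```python
from itertools import combinations, permutations

def induced_count(adj_g, labels_g, adj_h, labels_h):
    # C(G; H): number of vertex subsets S with G(S) isomorphic to H,
    # respecting vertex labels (brute force; small H only)
    vg, vh = sorted(adj_g), sorted(adj_h)
    k = len(vh)
    count = 0
    for S in combinations(vg, k):
        for perm in permutations(S):
            m = dict(zip(vh, perm))  # candidate isomorphism H -> G(S)
            if all(labels_g[m[v]] == labels_h[v] for v in vh) and all(
                (m[u] in adj_g[m[v]]) == (u in adj_h[v])
                for u in vh for v in vh if u != v
            ):
                count += 1
                break  # count each subset S once
    return count

tri = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
p3 = {1: {2}, 2: {1, 3}, 3: {2}}
lab = {1: 0, 2: 0, 3: 0}
k2, lab2 = {'a': {'b'}, 'b': {'a'}}, {'a': 0, 'b': 0}
print(induced_count(tri, lab, k2, lab2))  # 3: every pair spans an edge
print(induced_count(p3, lab, k2, lab2))   # 2: {1, 3} induces no edge
```

The "iff" adjacency check is what makes the count induced; dropping the `== (u in adj_h[v])` equality and requiring only implication would count non-induced copies instead.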

Consider a class of (s, t)-good graph representations f(·; θ) which can count any substructure on k vertices. As a result, f ⪰ C(G; K_k) for an appropriate parametrization θ. By definition, f(·) must take at least |{C(G; K_k) : G ∈ G_n}| different values, i.e.,

|{f(G; θ) : G ∈ G_n}| ≥ |{C(G; K_k) : G ∈ G_n}|.  (66)

Also,

|{f(G; θ) : G ∈ G_n}| ≤ |{ {{ψ(G_i) : i ∈ [t]}} : G ∈ G_n }|,  (67)

where (G_1, G_2, . . . , G_t) = Ξ(G). But ψ can take only s values. Therefore, we have

|{C(G; K_k) : G ∈ G_n}| ≤ |{f(G; θ) : G ∈ G_n}|  (68)
≤ |{ {{ψ(G_i) : i ∈ [t]}} : G ∈ G_n }|  (69)
≤ |{ {{α_i : i ∈ [t]}} : ∀ i ∈ [t], α_i ∈ [s] }| ≤ (t + 1)^{s−1}.

For any B = {b_1, b_2, . . . , b_k} ⊆ [m], define G_B as a graph on n vertices such that V_{G_B} = V_0 ∪ (∪_{i ∈ [k]} V_i) and |V_i| = p_{b_i}. Also,

Σ_{i=1}^k p_{b_i} ≤ k × n/k = n. Note that C(G_B; K_k) = Π_{i=1}^k p_{b_i}. Also, since the p_i, i ∈ [m], are prime numbers, there is a unique bijection B ⟷ C(G_B; K_k) (denoted ϕ). Therefore,

|{C(G; K_k) : G ∈ G_n}| ≥ |{C(G_B; K_k) : B ⊆ [m], |B| = k}| = C(m, k) ≥ (m − k)^k / k! ≥ (cn/(k log(n/k)) − k)^k / k!.
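The construction is easy to check numerically for tiny primes. The sketch below is ours (`complete_multipartite` and `clique_count` are hypothetical names): it builds G_B as a complete multipartite graph with parts sized by the chosen primes, padded with isolated vertices, and verifies C(G_B; K_k) = Π p_{b_i} by brute force.

```python
from itertools import combinations

def complete_multipartite(part_sizes, isolated):
    # G_B: complete multipartite graph with the given part sizes,
    # padded with `isolated` extra vertices (the set V_0)
    parts, adj, nxt = [], {}, 0
    for s in part_sizes:
        parts.append(list(range(nxt, nxt + s)))
        nxt += s
    for i, P in enumerate(parts):
        for v in P:
            adj[v] = set()
            for j, Q in enumerate(parts):
                if i != j:
                    adj[v] |= set(Q)
    for _ in range(isolated):
        adj[nxt] = set()
        nxt += 1
    return adj

def clique_count(adj, k):
    # number of k-cliques, by brute force over all k-subsets
    return sum(
        all(u in adj[v] for u, v in combinations(S, 2))
        for S in combinations(sorted(adj), k)
    )

g = complete_multipartite([2, 3, 5], isolated=2)  # parts sized by primes 2, 3, 5
print(len(g), clique_count(g, 3))  # 12 30: exactly 2*3*5 triangles
```

Every k-clique must take one vertex per part (vertices sharing a part are non-adjacent), so the clique count is the product of the part sizes, and distinct prime products identify B uniquely.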

is a function on graphs taking s values. In short, we encode t graphs, and each encoding takes one of s values. We call such a graph representation function an (s, t)-good graph representation.

Theorem 4. Consider a parametrized class of (s, t)-good representations f(·; θ) : G_n → R^d that is able to count any (not necessarily induced) substructure with k vertices; more precisely, for any graph H with k vertices, there exists a parametrization θ such that f(·; θ) ⪰ C̄(G; H). Then, t = Ω̃(n^{k/(s−1)}).

Here, the relation ⊑ in the definition of H̄(H) denotes being a (not necessarily induced) subgraph, H̄(H) is defined with respect to graph isomorphism, and c_{H̃,H} ∈ ℕ denotes the number of subgraphs of H̃ identical to H. Note that H̄(H) is a finite set.

Proposition 5. Let H be a family of graphs. If for any H ∈ H there is an RNP-GNN f_H(·; θ) with recursion parameters (r_1, r_2, . . . , r_t) such that f_H ⪰ C(G; H), then there exists an RNP-GNN f(·; θ) with recursion parameters (r_1, r_2, . . . , r_t) such that f ⪰ Σ_{H ∈ H} C(G; H).

Definition 8. Let H be a (possibly labeled) simple connected graph. For any S ⊆ V_H and v ∈ V_H, define

Definition 9. Let H be a (possibly labeled) connected simple graph on k = t + 1 vertices. A permutation of the vertices, such as (v_1, v_2, . . . , v_{t+1}), is called a vertex covering sequence with respect to a sequence r = (r_1, r_2, . . . , r_t) ∈ ℕ^t, called a covering sequence, if and only if

d_{H_i}(v_i; S_i) ≤ r_i,  (28)

for i ∈ [t + 1], where H_i = H(S_i) and S_i = {v_i, v_{i+1}, . . . , v_{t+1}}. Let C_H(r) denote the set of all vertex covering sequences with respect to the covering sequence r for H.

Proposition 7. For any G, H ∈ G_k, if H ⊑ G (non-induced subgraph), then C_H(r) ⊆ C_G(r), for any sequence r.
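Definition 9 can be checked mechanically once d_H(v; S) is fixed. In the sketch below (ours; the helper names are hypothetical) we assume d_H(v; S) is the maximum shortest-path distance from v to the vertices of S inside the induced subgraph H(S), which is our reading of Definition 8; the final singleton condition i = t + 1 is trivially satisfied and therefore skipped.

```python
from collections import deque

def dists_within(adj, src, allowed):
    # BFS distances from `src` inside the subgraph induced on `allowed`
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w in allowed and w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def is_vertex_covering_sequence(adj, order, r):
    # order = (v_1, ..., v_{t+1}); r = (r_1, ..., r_t)
    # checks d_{H_i}(v_i; S_i) <= r_i with S_i = {v_i, ..., v_{t+1}}
    t = len(order) - 1
    for i in range(t):
        S = set(order[i:])
        dist = dists_within(adj, order[i], S)
        if any(u not in dist or dist[u] > r[i] for u in S):
            return False  # some vertex of S_i is unreachable or too far
    return True

# a path 1-2-3: (1, 2, 3) is a vertex covering sequence w.r.t. r = (2, 1),
# while (2, 1, 3) is not, even w.r.t. r = (2, 2), since {1, 3} is disconnected
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(is_vertex_covering_sequence(path, (1, 2, 3), (2, 1)))  # True
print(is_vertex_covering_sequence(path, (2, 1, 3), (2, 2)))  # False
```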

H̃_1, H̃_2, . . . are all the non-isomorphic graphs obtained by adding edges (at least one edge) between the c graphs H^1, H^2, . . . , H^c, or by contracting a number of their vertices. The constants c_j are used only to remove the effect of multiple counting due to symmetry. Now, since for any H^i, H̃_j the number of connected components is strictly less than c, using the induction hypothesis, we have f ⪰ C(G; H̃_j) for all j and f ⪰ C(G; H^i) for all i ∈ [c]. According to Proposition 4, we conclude that f ⪰ C(G; H), and this completes the proof. Also, if the H^i, i ∈ [c], are not pairwise non-isomorphic, then we can use αC(G; H) in the above equation instead of C(G; H), where α > 0 removes the effect of multiple counting by symmetry. The proof is thus complete.

A.1.2 PROOF OF THEOREM 1

We use an inductive proof on t, the length of the covering sequence of H; equivalently, by definition, t = k − 1, where k is the number of vertices of H. First, we note that due to Proposition 8, without loss of generality, we can assume that H is a simple connected labeled graph, and the goal is to achieve the induced-subgraph count function via an RNP-GNN with appropriate recursion parameters. We also consider only maximally expressive networks here to prove the desired result.

Induction base. For the induction base, i.e., t = 1, H is a two-vertex graph. This means that we only need to count the number of occurrences of a specific (labeled) edge in the given graph G. Note that in this case we apply an RNP-GNN with recursion parameter r_1 ≥ 1. Denote the two labels of the vertices in

where we assume that f(·; θ) is maximally expressive. The goal is to show that f ⪰ C(G; H). Using the transitivity of ⪰, we only need to choose appropriate φ, ψ, ϕ to achieve f̃ = C(G; H) as the final representation. Let

Then, a simple computation shows that

Since f̃(·; θ̃) is an RNP-GNN with recursion parameter r_1, and for any maximally expressive RNP-GNN f(·; θ) with the same recursion parameter as f̃ we have f ⪰ f̃ and f̃ ⪰ C(G; H), we conclude that f ⪰ C(G; H), and this completes the proof of the base case.

Induction step. Assume that the desired result holds for t − 1 (t ≥ 2). We show that it also holds for t. Let us first define

Note that H* ≠ ∅ by the assumption. Let

For all H* ∈ H*, using the induction hypothesis, there is a (universal) RNP-GNN f(·; θ) with recursion parameters (r_2, r_3, . . . , r_t) such that f ⪰ C(G; H*). Using Proposition 4, we conclude

Define a maximally expressive RNP-GNN with the recursion parameters (r_1, r_2, . . . , r_t) as follows:

Similar to the proof for t = 1, here we only need to propose a (not necessarily maximally expressive) RNP-GNN which achieves the function C(G; H).

Let us define

where  (49)

Note that the existence of such a function ξ is guaranteed due to Proposition 2. Now we write

which means that

However, for a maximally expressive RNP-GNN f(·; θ), we know that f ⪰ f_{H*_u} for all H*_u ∈ H*, and this means that f ⪰ C(G; H). The proof is thus complete.

A.2 PROOF OF THEOREM 2

For any labeled graph H on r vertices (not necessarily connected), we claim that RNP-GNNs can count it.

Claim 2. Let f(·; θ) : G_n → R^d be a maximally expressive RNP-GNN with recursion parameters (r − 1, r − 2, . . . , 1). Then, f ⪰ C(G; H).

Now consider the function

We claim that f ⪰ ℓ (where f is defined in the previous claim and ℓ denotes the function above), and this completes the proof according to Proposition 2. To prove the claim, assume that f(G_1) = f(G_2). Then, we conclude that C(G_1; H) = C(G_2; H) for any labeled H (not necessarily connected) with r vertices. Now, we have

which shows that ℓ(G_1) = ℓ(G_2).

Proof of Claim 2. To prove the claim, we use an induction on the number of connected components c_H of the graph H. If H is connected, i.e., c_H = 1, then according to Theorem 1, we know that f ⪰ C(G; H).

For RNP-GNNs, the reconstruction from the subgraphs G*_v, v ∈ [n], is possible, since we relabel every subgraph (in the definition of X*), and this preserves the critical information needed to recurse back to the original graph. In the reconstruction conjecture, this part of the information is missing, and this is what makes the problem difficult. Nonetheless, since RNP-GNNs preserve the original node's information in the subgraphs via relabeling, the reconstruction conjecture is not required to hold in order to show the universality results for RNP-GNNs, although the conjecture is a motivation for this paper. Moreover, if the reconstruction conjecture can be shown to be true, it may also be possible to find a simple encoding of subgraphs into an original graph, which may lead to more powerful but less complex new GNNs.

C THE RNP-GNN ALGORITHM

In this section, we provide pseudocode for RNP-GNNs. The algorithm below computes node representations; for a graph representation, we can aggregate them with a common readout. Following Xu et al. (2019), we use sum pooling here, to ensure that we can represent injective aggregation functions.

Algorithm 1: Recursive Neighborhood Pooling-GNN (RNP-GNN); the recursion applies RNP-GNN with parameters ((r_2, r_3, . . . , r_t), (θ^(2), . . . , θ^(t))) to each pooled neighborhood.

With this algorithm, one can achieve the expressive power of RNP-GNNs if high-dimensional MLPs are allowed (Xu et al., 2019; Hornik et al., 1989; Hornik, 1991). That said, in practice, smaller MLPs may be acceptable (Xu et al., 2019).
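The recursion of Algorithm 1 can be sketched in plain Python. This is our illustrative re-implementation, not the paper's code: instead of learnable MLPs we use hashable tuples with a sorted multiset as an injective-style readout, `rnp_embed` and `neighborhood` are hypothetical names, and graphs are adjacency-set dictionaries whose values are subsets of the keys.

```python
from collections import deque

def neighborhood(adj, v, r):
    # N_r(v): vertices at distance 1..r from v
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        if dist[u] == r:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return set(dist) - {v}

def rnp_embed(adj, feats, radii):
    # one representation per vertex of the (sub)graph `adj`
    if not radii:
        return {v: (feats[v],) for v in adj}
    r = radii[0]
    out = {}
    for v in adj:
        nbrs = neighborhood(adj, v, r)
        # recurse on G(N_r(v) \ {v}); relabel as in X*: keep the old
        # feature plus a flag marking adjacency to v
        sub_adj = {u: adj[u] & nbrs for u in nbrs}
        sub_feats = {u: (feats[u], int(u in adj[v])) for u in nbrs}
        inner = rnp_embed(sub_adj, sub_feats, radii[1:])
        # injective-style readout: the sorted multiset of inner values
        out[v] = (feats[v], tuple(sorted(map(repr, inner.values()))))
    return out

path = {1: {2}, 2: {1, 3}, 3: {2}}
h = rnp_embed(path, {1: 0, 2: 0, 3: 0}, (1,))
print(h[1] == h[3], h[1] != h[2])  # True True: endpoints match, middle differs
```

Replacing each tuple construction with an MLP applied to pooled representations recovers the learnable form described in the text.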

D COMPUTING A COVERING SEQUENCE

As we explained in the context of Theorem 1, we need a covering sequence (or an upper bound on one) to design an RNP-GNN network that can count a given substructure. A covering sequence can be constructed from a spanning tree of the graph.

To reduce complexity, it is desirable to have a covering sequence with minimum r_1 (Theorem 3). Here, we suggest an algorithm for obtaining such a covering sequence, shown in Algorithm 2. For obtaining merely an arbitrary covering sequence, one can compute any minimum spanning tree (MST), and then proceed as with the MST in Algorithm 2.

Given an MST, we build a vertex covering sequence by iteratively removing a leaf v_i from the tree and adding the respective node v_i to the sequence. This ensures that, at any point, the remaining graph is connected. At position i, corresponding to v_i, the covering sequence contains the maximum distance r_i of v_i to any node in the remaining graph, or an upper bound on it. For efficiency, an upper bound on the distance can be computed in the tree.

To minimize r_1 = max_{u∈V} d(u, v_1), we need to ensure that a node in argmin_{v∈V} max_{u∈V} d(u, v) is a leaf in the spanning tree. Hence, we first compute max_{u∈V} d(u, v) for all nodes v, e.g., by running All-Pairs-Shortest-Paths (APSP) (Kleinberg & Tardos, 2006), and sort the nodes in increasing order of this distance. Going down this list, we test whether it is possible to use the respective node as v_1, and stop when we find one. Say v* is the current node in the list. To compute a spanning tree where v* is a leaf, we assign a large weight to all the edges adjacent to v* and a very low weight to all other edges. If such a tree exists, running an MST computation with the assigned weights will find one. Then, we use v* as v_1 in the vertex covering sequence. This algorithm runs in polynomial time.

Algorithm 2: Computing a covering sequence with minimum r_1
Input: H = (V, E, X), where V = [t + 1]
Output: A minimal covering sequence (r_1, r_2, . . . , r_t) and its corresponding vertex covering sequence (v_1, v_2, . . . , v_{t+1})

  For any u, v ∈ V, compute d(u, v) using APSP
  (u_1, u_2, . . . , u_{t+1}) ← all vertices, sorted increasingly by s(v) := max_{u∈V} d(u, v)
  for i = 1 to t + 1 do
    Set edge weights w(u, v) = 1 + t × 1{u = u_i ∨ v = u_i} for all (u, v) ∈ E_H
    H_T ← the MST of H with weights w
    if u_i is a leaf in H_T then
      v_1 ← u_i
      r_1 ← s(u_i)
      break
    end if
  end for
  for i = 2 to t + 1 do
    v_i ← one of the leaves of H_T
    r_i ← max_{u ∈ V_{H_T}} d(u, v_i)
    H_T ← H_T after removing v_i
  end for
  return (r_1, r_2, . . . , r_t) and (v_1, v_2, . . . , v_{t+1})
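The leaf-peeling core of Algorithm 2 can be sketched as follows. This is our simplified version, not the full algorithm: it peels a BFS spanning tree instead of the weighted MST (so it does not minimize r_1), and it records the exact eccentricity of each peeled vertex in the remaining graph rather than a tree-based upper bound; `covering_sequence` and `ecc` are hypothetical names.

```python
from collections import deque

def ecc(adj, v, allowed):
    # eccentricity of v inside the subgraph induced on `allowed`
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w in allowed and w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return max(dist.values())

def covering_sequence(adj):
    # build a BFS spanning tree, then peel vertices in reverse BFS
    # order: each peeled vertex is a leaf of the remaining tree, so
    # the remaining graph always stays connected
    root = next(iter(adj))
    parent, order = {root: None}, [root]
    q = deque([root])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)
                q.append(w)
    remaining, seq, vorder = set(adj), [], []
    for v in reversed(order):
        vorder.append(v)
        if len(remaining) == 1:
            break
        seq.append(ecc(adj, v, remaining))
        remaining.discard(v)
    return seq, vorder

tri = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(covering_sequence(tri)[0])  # [1, 1]: a triangle is covered by r = (1, 1)
```

Swapping the BFS tree for the weighted MST described above, and using tree distances as upper bounds for each r_i, recovers the behavior of Algorithm 2.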

