EXPRESSIVE POWER OF INVARIANT AND EQUIVARIANT GRAPH NEURAL NETWORKS

Abstract

Various classes of Graph Neural Networks (GNN) have been proposed and shown to be successful in a wide range of applications with graph structured data. In this paper, we propose a theoretical framework able to compare the expressive power of these GNN architectures. The current universality theorems only apply to intractable classes of GNNs. Here, we prove the first approximation guarantees for practical GNNs, paving the way for a better understanding of their generalization. Our theoretical results are proved for invariant GNNs computing a graph embedding (permutation of the nodes of the input graph does not affect the output) and equivariant GNNs computing an embedding of the nodes (permutation of the input permutes the output). We show that Folklore Graph Neural Networks (FGNN), which are tensor-based GNNs augmented with matrix multiplication, are the most expressive architectures proposed so far for a given tensor order. We illustrate our results on the Quadratic Assignment Problem (an NP-hard combinatorial problem) by showing that FGNNs are able to learn how to solve the problem, leading to much better average performances than existing algorithms (based on spectral, SDP or other GNN architectures). On the practical side, we also implement masked tensors to handle batches of graphs of varying sizes.

1. INTRODUCTION

Graph Neural Networks (GNN) are designed to deal with graph structured data. Since a graph is not changed by a permutation of its nodes, GNNs should be either invariant, if they return a result that must not depend on the representation of the input (typically when building a graph embedding), or equivariant, if the output must be permuted when the input is permuted (typically when building an embedding of the nodes). More generally, incorporating symmetries in machine learning is a fundamental problem, as it reduces the number of degrees of freedom to be learned. Deep learning on graphs. This paper focuses on learning deep representations of graphs with network architectures, namely GNNs, designed to be invariant or equivariant to permutations of the nodes. From a practical perspective, various message passing GNNs have been proposed; see Dwivedi et al. (2020) for a recent survey and benchmarking on learning tasks. In this paper, we study 3 architectures: message passing GNN (MGNN), which is probably the most popular architecture used in practice, order-k linear GNN (k-LGNN) proposed in Maron et al. (2018), and order-k folklore GNN (k-FGNN) first introduced by Maron et al. (2019a). MGNN layers are local, hence highly parallelizable on GPUs, which makes them scalable to large sparse graphs. k-LGNN and k-FGNN deal with representations of graphs as tensors of order k, which makes them of little practical use for k ≥ 3. In order to compare these architectures, the separating power of these networks has been compared to a hierarchy of graph invariants developed for the graph isomorphism problem. Namely, for k ≥ 2, k-WL(G) are invariants based on the Weisfeiler-Lehman tests (described in Section 4.1). For each k ≥ 2, (k+1)-WL has strictly more separating power than k-WL (in the sense that there is a pair of non-isomorphic graphs distinguishable by (k+1)-WL and not by k-WL). GIN (which are invariant MGNN), introduced in Xu et al.
(2018), are shown to be as powerful as 2-WL. In Maron et al. (2019a), Geerts (2020b) and Geerts (2020a), k-LGNN are shown to be as powerful as k-WL and 2-FGNN is shown to be as powerful as 3-WL. In this paper, we extend this last result about k-FGNN to general values of k. So, in terms of separating power, when restricted to tensors of order k, k-FGNN is the most powerful architecture among the ones considered in this work. This means that for a given pair of graphs G and G′, if (k+1)-WL(G) ≠ (k+1)-WL(G′), then there exists a k-FGNN, say GNN_{G,G′}, such that GNN_{G,G′}(G) ≠ GNN_{G,G′}(G′). Approximation results for GNNs. Results on the separating power of GNNs only deal with pairwise comparisons of graphs: we need a priori a different GNN for each pair of graphs in order to distinguish them. Such results are of little help in a practical learning scenario. Our main contribution in this paper overcomes this issue: we show that a single GNN can give a meaningful representation for all graphs. More precisely, we characterize the set of functions that can be approximated by MGNNs, k-LGNNs and k-FGNNs respectively. The standard Stone-Weierstrass theorem shows that if an algebra A of real continuous functions separates points, then A is dense in the set of continuous functions on a compact set. Here we extend such a theorem to general functions with symmetries and apply it to invariant and equivariant functions to get our main result for GNNs. As a consequence, we show that k-FGNNs have the best approximation power among architectures dealing with tensors of order k. Universality results for GNNs. Universal approximation theorems (similar to Cybenko (1989) for multi-layer perceptrons) have been proved for linear GNNs in Maron et al. (2019b); Keriven & Peyré (2019); Chen et al. (2019). They show that some classes of GNNs can approximate any function defined on graphs.
To be able to approximate any invariant function, they require the use of very complex networks, namely k-LGNN where k tends to infinity with the number of nodes n. Since we prove that any invariant function less powerful than (k+1)-WL can be approximated by a k-FGNN, letting k tend to infinity directly implies universality. Universality results for k-FGNNs are another contribution of our work. Equivariant GNNs. Our second set of results extends the previous analysis from invariant functions to equivariant functions. There are far fewer results about equivariant GNNs: Keriven & Peyré (2019) prove the universality of linear equivariant GNNs, and Maehara & Hoang (2019) show the universality of a new class of networks they introduced. Here, we consider a natural equivariant extension of k-WL and prove that equivariant (k+1)-LGNNs and k-FGNNs can approximate any equivariant function less powerful than this equivariant (k+1)-WL, for k ≥ 1. At this stage, we should note that all universality results for GNNs by Maron et al. (2019b); Keriven & Peyré (2019); Chen et al. (2019) are easily recovered from our main results. Also, our analysis is valid for graphs of varying sizes. Empirical results for the Quadratic Assignment Problem (QAP). To validate our theoretical contributions, we empirically show that 2-FGNN outperforms classical MGNN. Indeed, Maron et al. (2019a) already demonstrated state-of-the-art results for the invariant version of 2-FGNNs (for graph classification and graph regression). Here we consider the graph alignment problem and show that the equivariant 2-FGNN is able to learn a node embedding which beats other algorithms (based on spectral methods, SDP or GNNs) by a large margin. Outline and contribution. After reviewing more previous works and notations in the next section, we define the various classes of GNNs studied in this paper in Section 3: message passing GNN, linear GNN and folklore GNN. Section 4 contains our main theoretical results for GNNs.
First, in Section 4.2 we describe the separating power of each GNN architecture with respect to the Weisfeiler-Lehman test. In Section 4.3, we give approximation guarantees for MGNNs, LGNNs and FGNNs at fixed tensor order. They cover both the invariant and equivariant cases and are our main theoretical contributions. For these, we develop in Section D a fine-grained Stone-Weierstrass approximation theorem for vector-valued functions with symmetries. Our theorem handles both the invariant and equivariant cases and is inspired by recent works in approximation theory. In Section 6, we illustrate our theoretical results on a practical application: the graph alignment problem, a well-known NP-hard problem. We highlight a previously overlooked implementation question: the handling of batches of graphs of varying sizes. A PyTorch implementation of the code necessary to reproduce the results is available at https://github.com/mlelarge/graph_neural_net

2. RELATED WORK

The pioneering works that applied neural networks to graphs are Gori et al. (2005) and Scarselli et al. (2009), which learn node representations with recurrent neural networks. More recent message passing architectures make use of non-linear functions of the adjacency matrix (Kipf & Welling, 2016), for example polynomials (Defferrard et al., 2016). For regular-grid graphs, they match classical convolutional networks, which by design can only approximate translation-invariant functions and hence have limited expressive power. In this paper, we focus instead on more expressive architectures. Following the recent surge of interest in graph neural networks, some works have tried to extend the pioneering work of Cybenko (1989); Hornik et al. (1989) to various GNN architectures. Among the first ones is Scarselli et al. (2009), which studied invariant message-passing GNNs. They showed that such networks can approximate, in a weak sense, all functions whose discriminatory power is weaker than 1-WL. Yarotsky (2018) described universal architectures which are invariant or equivariant to some group action. These models rely on polynomial intermediate layers of arbitrary degrees, which would be prohibitive in practice. Maron et al.
(2019b) leveraged classical results about polynomials invariant to a group action to show that k-LGNN are universal as k tends to infinity with the number of nodes. Keriven & Peyré (2019) derived a similar result in the more complicated equivariant case by introducing a new Stone-Weierstrass theorem. Similarly to Maron et al. (2019b), they require the order of the tensors to go to infinity. Another route towards universality is the one of Chen et al. (2019). In the invariant setting, they show for a class of GNN that universality is equivalent to being able to discriminate between (non-isomorphic) graphs. However, the only way to achieve such discriminatory power is to use tensors of arbitrarily high order, see also Ravanbakhsh (2020). Our work encompasses and sharpens these results using high-order tensors, as it yields approximation guarantees even at fixed tensor order. CPNGNN in Sato et al. (2019) and DimeNet in Klicpera et al. (2020) are message passing GNNs incorporating more information than those studied here. Partial results about their separating power follow from Garg et al. (2020), which provides impossibility results for deciding graph properties including girth, circumference, diameter, radius, conjoint cycle, total number of cycles, and k-cliques. Chen et al. (2020) studies the ability of GNNs to count graph substructures. Though our theorems are much more general, note that their results are improved by the present work. Note also that if the nodes are given distinct features, MGNNs become much more expressive (Loukas, 2019) but lose their invariance or equivariance properties. Averaging, i.e. relational pooling (RP), has been proposed to recover these properties (Murphy et al., 2019a). However, the ideal RP, leading to a universal approximation, cannot be used for large graphs due to its complexity of O(|V|!). Regarding the other classes of RPGNN, i.e. the k-ary pooling (Murphy et al., 2019b), we will show how our general theorems in the invariant case can be applied to characterize their approximation power (see Section 5).

Note that for neural networks on sets, the situation is a bit simpler. Efficient architectures such as DeepSets (Zaheer et al., 2017) or PointNet (Qi et al., 2017) have been shown to be invariant universal. Similar results exist in the equivariant case (Segol & Lipman, 2020; Maron et al., 2020), whose proofs rely on polynomial arguments. Though this is not our main motivation, our approximation theorems can also be applied in this context; see Sections D.3 and D.4.

2.1. NOTATIONS: GRAPHS AS TENSORS

We denote by F, F_0, F_{1/2}, F_1, … arbitrary finite-dimensional spaces of the form R^p (for various values of p), typically representing spaces of features. Products of vectors in R^p always refer to the component-wise product. There are two ways to see graphs with features. First, graphs can be seen as tensors of order k: G ∈ F^{n^k}. The classical representation of a graph by its (weighted) adjacency matrix corresponds to k = 2, i.e. a tensor of order 2 in R^{n^2}. This case allows for features on edges by replacing R^{n^2} with F^{n^2}, where F is some R^p. Second, graphs can also be represented by their discrete structure with an additional feature vector. More exactly, denote by G_n the set of discrete graphs G = (V, E) with n nodes V = [n] and edges E ⊆ V^2 (with no weights on edges). Such a G ∈ G_n together with a vector h^0 ∈ F^n represents a graph with features on the vertices.
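As a toy illustration of the two representations above (a hypothetical snippet, not taken from the paper's code): the same 3-node path graph, first as an order-2 tensor and then as a discrete graph with a node-feature vector.

```python
import numpy as np

n = 3
# Representation 1: an order-2 tensor (weighted adjacency matrix) in R^{n^2}.
A = np.zeros((n, n))
for (i, j) in [(0, 1), (1, 2)]:
    A[i, j] = A[j, i] = 1.0

# Representation 2: a discrete graph (V, E) plus a node-feature vector h0 in F^n.
V = list(range(n))
E = {(0, 1), (1, 0), (1, 2), (2, 1)}      # edges stored in both directions
h0 = np.array([[1.0], [2.0], [3.0]])      # one feature per node, here F = R

assert A.shape == (n, n) and h0.shape == (n, 1)
```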

2.2. DEFINITIONS: INVARIANT AND EQUIVARIANT OPERATORS

Let [n] = {1, …, n}. The set of permutations on [n] is denoted by S_n. For G ∈ F^{n^k} and σ ∈ S_n, we define (σ ⋆ G)_{σ(i_1),…,σ(i_k)} = G_{i_1,…,i_k}. Note that the operation ⋆ is valid between a permutation in S_n and a graph G as soon as the number of nodes of G is n, i.e. it is valid for any order-k tensor representation of the graph. Two graphs G_1, G_2 are said to be isomorphic if they have the same number of nodes and there exists a permutation σ such that G_1 = σ ⋆ G_2. Definition 1. A function f : F_0^{n^k} → F_1 is said to be invariant if f(σ ⋆ G) = f(G) for every permutation σ ∈ S_n and every G ∈ F_0^{n^k}. A function f : F_0^{n^k} → F_1^n is said to be equivariant if f(σ ⋆ G) = σ ⋆ f(G) for every permutation σ ∈ S_n and every G ∈ F_0^{n^k}. Note that composing an equivariant function with an invariant function gives an invariant function. For k ≥ 1, we define the invariant summation layer S^k : F^{n^k} → F by S^k(G) = Σ_{i∈[n]^k} G_i for G ∈ F^{n^k}. We also define the equivariant reduction layer S_1^k : F^{n^k} → F^n as follows: S_1^k(G)_i = Σ_{1≤i_2,…,i_k≤n} G_{i,i_2,…,i_k}. For message passing GNN, we will use the equivariant layer Id + λS^1 : F^n → F^n defined by (Id + λS^1)(G)_i = G_i + λS^1(G), where λ ∈ R is a learnable parameter. In the sequel, we will need a mapping I_k lifting the input graph to a higher-order tensor. We denote by I_k : F_0^{n^2} → F_1^{n^k} the initialization function mapping, for a given graph, each k-tuple to its isomorphism type. We refer to the appendix Section C.3 for a precise description of this linear equivariant function. Note at this stage that I_2 is given by, for G ∈ F^{n^2}, I_2(G)_{i,j} = (G_{i,j}, δ_{i,j}), where δ_{i,j} is 0 if i = j and 1 otherwise. Indeed, for a pair of nodes i, j in a graph (without features), there are only three isomorphism types: i = j; i ≠ j and (i, j) is an edge; i ≠ j and (i, j) is not an edge.
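The permutation action and the two reduction layers above can be sketched as follows for k = 2; this is a minimal numpy sketch, and `act`, `S2`, `S2_1` are hypothetical helper names, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
G = rng.standard_normal((n, n))          # an order-2 tensor G in R^{n^2}

def act(sigma, G):
    """(sigma . G)_{sigma(i), sigma(j)} = G_{i, j}."""
    out = np.empty_like(G)
    for i in range(n):
        for j in range(n):
            out[sigma[i], sigma[j]] = G[i, j]
    return out

def S2(G):                               # invariant summation layer S^2
    return G.sum()

def S2_1(G):                             # equivariant reduction layer S^2_1
    return G.sum(axis=1)                 # (S^2_1 G)_i = sum_j G_{i, j}

sigma = rng.permutation(n)
assert np.isclose(S2(act(sigma, G)), S2(G))           # invariance of S^2
perm_out = np.empty(n)
perm_out[sigma] = S2_1(G)                              # sigma . S2_1(G)
assert np.allclose(S2_1(act(sigma, G)), perm_out)      # equivariance of S^2_1
```

The two asserts check Definition 1 numerically: summing all entries is invariant, while the row-sum layer commutes with the relabelling of the nodes.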

3. GNN DEFINITIONS

In this section, we define the various GNN architectures studied in this paper. In all architectures, there is a main building block or layer mapping F_t^{n^k} to F_{t+1}^{n^k}, where F_t^{n^k} can be seen as the space for the representation of the graph at layer t. We will define three different types of layers, for message passing GNN, linear GNN and folklore GNN. The case k = 2 is probably the most interesting case from a practical point of view and corresponds to a case where a layer takes as input a graph (with features on nodes and edges) and produces as output a graph (with new features on nodes and edges). For each type of GNN, there will be an invariant and an equivariant version. All architectures will share the last function: m_I : F_{T+1} → F for the invariant case and m_E : F_{T+1}^n → F^n for the equivariant case, which are continuous functions. They are typically modeled by a Multi Layer Perceptron, which is applied on each component in the equivariant case. In words, each network takes as input a graph G ∈ F_0^{n^2}, produces in the invariant case a graph embedding in F_{T+1} and in the equivariant case a node embedding in F_{T+1}^n, and these embeddings are then passed through the function m_I or m_E respectively to get a feature in F or F^n for the learning task.

3.1. MESSAGE PASSING GNN

Message passing GNNs (MGNN) are defined for classical graphs G with features on the nodes. More exactly, they take as input a discrete graph G = (V, E) ∈ G_n and features on the nodes h^0 ∈ F^n. MGNN are then defined inductively as follows: let h_i^ℓ ∈ F_ℓ denote the feature at layer ℓ associated with node i; the updated features h_i^{ℓ+1} are obtained as h_i^{ℓ+1} = f(h_i^ℓ, {{h_j^ℓ}}_{j∼i}), where j ∼ i means that nodes j and i are neighbors in the graph G, i.e. (i, j) ∈ E, and the function f is a learnable function taking as input the feature vector of the center vertex h_i^ℓ and the multiset of features of the neighboring vertices {{h_j^ℓ}}_{j∼i}. Indeed, it follows from Lem. 33 in the Appendix that any such function f can be approximated by a layer of the form h_i^{ℓ+1} = f_0(h_i^ℓ, Σ_{j∼i} f_1(h_i^ℓ, h_j^ℓ)), where f_0 : F_ℓ × F_{ℓ+1/2} → F_{ℓ+1} and f_1 : F_ℓ × F_ℓ → F_{ℓ+1/2}, so that F_ℓ is the space of features at the ℓ-th layer. We call such a function a message passing layer and denote it by F_ℓ : F_ℓ^n → F_{ℓ+1}^n (note that F_ℓ depends implicitly on the graph). Then an equivariant message passing GNN is simply obtained by the composition of message passing layers: F_T ∘ … ∘ F_2 ∘ F_1, where each F_i is a message passing layer. Clearly, since each F_i is equivariant, this message passing GNN is also equivariant and produces features on each node in the space F_T. In order to obtain an invariant GNN, we apply an invariant function from F_T^n → F_{T+1} to the output of an equivariant message passing GNN. In practice, a symmetric function is applied on the vectors of features indexed by the nodes; typically the sum of the features Σ_i (F_T ∘ … ∘ F_2 ∘ F_1(G))_i is taken as an invariant feature for the graph G. With our notation, S^1 ∘ F_T ∘ … ∘ F_2 ∘ F_1 (where S^1 was defined in Section 2.2) defines an invariant message passing GNN. Hence, we define the sets of message passing GNNs as follows:
MGNN_I = {m_I ∘ S^1 ∘ F_T ∘ … ∘ F_2 ∘ F_1, ∀T}
MGNN_E = {m_E ∘ (Id + λS^1) ∘ F_T ∘ … ∘ F_2 ∘ F_1, ∀T}
where F_t : F_t^n → F_{t+1}^n are message passing layers.
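A message passing layer of the form h_i^{ℓ+1} = f_0(h_i^ℓ, Σ_{j∼i} f_1(h_i^ℓ, h_j^ℓ)) can be sketched as follows; f_0 and f_1 are stand-ins (tanh of fixed random linear maps) for learnable MLPs, and all names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_half, d_out = 5, 3, 8, 4
W1 = rng.standard_normal((2 * d, d_half))      # parameters of f1
W0 = rng.standard_normal((d + d_half, d_out))  # parameters of f0

def mp_layer(adj, h):
    """adj: (n, n) 0/1 adjacency, h: (n, d) node features -> (n, d_out)."""
    new_h = np.empty((n, d_out))
    for i in range(n):
        msgs = np.zeros(d_half)
        for j in range(n):
            if adj[i, j]:                      # j ~ i
                msgs += np.tanh(np.concatenate([h[i], h[j]]) @ W1)   # f1
        new_h[i] = np.tanh(np.concatenate([h[i], msgs]) @ W0)        # f0
    return new_h

adj = rng.integers(0, 2, size=(n, n))
adj = np.triu(adj, 1)
adj = adj + adj.T                               # random symmetric 0/1 graph
h = rng.standard_normal((n, d))
out = mp_layer(adj, h)

# Equivariance: relabelling the nodes permutes the output features.
sigma = rng.permutation(n)
assert np.allclose(mp_layer(adj[np.ix_(sigma, sigma)], h[sigma]), out[sigma], atol=1e-8)
```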

3.2. LINEAR GNN

We define the linear graph layer of order k as F_ℓ : F_ℓ^{n^k} → F_{ℓ+1}^{n^k}, where for all G ∈ F_ℓ^{n^k}, F_ℓ(G) = f(L[G]), where L : F_ℓ^{n^k} → F_ℓ^{n^k} is a linear equivariant function and f : F_ℓ → F_{ℓ+1} is a learnable function applied on each of the n^k features, F_ℓ being the space of features at the ℓ-th layer. We then define the sets of linear GNNs as follows:
k-LGNN_I = {m_I ∘ S^k ∘ F_T ∘ … ∘ F_2 ∘ F_1 ∘ I_k, ∀T}
k-LGNN_E = {m_E ∘ S_1^k ∘ F_T ∘ … ∘ F_2 ∘ F_1 ∘ I_k, ∀T}
where I_k : F_0^{n^2} → F_1^{n^k} is defined in §2.2 and, for t ≥ 1, F_t : F_t^{n^k} → F_{t+1}^{n^k} are linear equivariant layers.
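A linear graph layer for k = 2 can be sketched with a few of the linear equivariant basis operations of Maron et al. (2018); the full basis has 15 operations, so the subset and the scalar coefficients below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(5)                   # hypothetical learnable coefficients

def L(G):
    """A linear equivariant map built from 5 of the 15 basis operations."""
    ones = np.ones((n, n))
    return (a[0] * G                                                  # identity
            + a[1] * G.T                                              # transpose
            + a[2] * G.sum(axis=1, keepdims=True) * np.ones((1, n))   # row sums
            + a[3] * np.ones((n, 1)) * G.sum(axis=0, keepdims=True)   # col sums
            + a[4] * G.sum() / n**2 * ones)                           # total sum

def linear_layer(G):
    return np.tanh(L(G))                     # f applied entrywise

G = rng.standard_normal((n, n))
sigma = rng.permutation(n)
P = np.eye(n)[sigma]                         # permutation matrix
# Equivariance of the layer: F(P G P^T) = P F(G) P^T.
assert np.allclose(linear_layer(P @ G @ P.T), P @ linear_layer(G) @ P.T)
```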

3.3. FOLKLORE GNN

The main building block of Folklore GNNs (FGNN) is what we call the folklore graph layer (FGL) of order k, defined as follows: for k ≥ 1, F_ℓ : F_ℓ^{n^k} → F_{ℓ+1}^{n^k} where, for all G ∈ F_ℓ^{n^k} and all i ∈ [n]^k,
F_ℓ(G)_i = f_0(G_i, Σ_{j=1}^n Π_{w=1}^k f_w(G_{i_1,…,i_{w−1},j,i_{w+1},…,i_k})),
where f_0 : F_ℓ × F_{ℓ+1/2} → F_{ℓ+1} and f_1, …, f_k : F_ℓ → F_{ℓ+1/2} are learnable functions. As shown in Lem. 33 in the Appendix, a FGL is an equivariant function which is indeed very expressive. For classical graphs G ∈ F_0^{n^2}, we can now define 2-FGNN by composing folklore graph layers F_t : F_t^{n^2} → F_{t+1}^{n^2}, so that F_T ∘ … ∘ F_1 ∘ F_0 is an equivariant GNN producing a graph in F_{T+1}^{n^2}. To obtain an invariant feature of the graph, we use the summation layer S^2 defined in Section 2.2, so that S^2 ∘ F_T ∘ … ∘ F_1 ∘ F_0 is now an invariant 2-FGNN. In order to define general k-FGNN, we first need to lift the classical graph to a tensor in F^{n^k}, then we apply folklore graph layers of order k, and finally we need to project the tensor in F^{n^k} to a tensor in F^n for the equivariant version and to a feature in F for the invariant version. The first step is done with the linear equivariant function I_k : F_0^{n^2} → F_1^{n^k} defined in Section 2.2. The last step is done with the reduction layer S_1^k for the equivariant case and the summation layer S^k for the invariant case, both defined in Section 2.2. We define the sets of folklore GNNs as follows:
k-FGNN_I = {m_I ∘ S^k ∘ F_T ∘ … ∘ F_2 ∘ F_1 ∘ I_k, ∀T}
k-FGNN_E = {m_E ∘ S_1^k ∘ F_T ∘ … ∘ F_2 ∘ F_1 ∘ I_k, ∀T}
where F_t : F_t^{n^k} → F_{t+1}^{n^k} are FGLs.
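For k = 2, the inner sum of the FGL is exactly a channel-wise matrix product, which is the matrix multiplication referred to throughout the paper. A hedged sketch of F(G)_{i,j} = f_0(G_{i,j}, Σ_l f_1(G_{i,l}) ⊙ f_2(G_{l,j})), with the learnable f_0, f_1, f_2 replaced by tanh of fixed random linear maps:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_half, d_out = 4, 3, 6, 5
Wa = rng.standard_normal((d, d_half))
Wb = rng.standard_normal((d, d_half))
W0 = rng.standard_normal((d + d_half, d_out))

def fgl(G):
    """G: (n, n, d) pair features -> (n, n, d_out)."""
    A = np.tanh(G @ Wa)                       # f1, shape (n, n, d_half)
    B = np.tanh(G @ Wb)                       # f2, shape (n, n, d_half)
    # Channel-wise matrix product: M[i, j, c] = sum_l A[i, l, c] * B[l, j, c]
    M = np.einsum('ilc,ljc->ijc', A, B)
    return np.tanh(np.concatenate([G, M], axis=-1) @ W0)  # f0

G = rng.standard_normal((n, n, d))
out = fgl(G)
# Equivariance check: relabelling the nodes permutes both axes of the output.
sigma = rng.permutation(n)
assert np.allclose(fgl(G[np.ix_(sigma, sigma)]), out[np.ix_(sigma, sigma)], atol=1e-8)
```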

4.1. WEISFEILER-LEHMAN INVARIANT AND EQUIVARIANT VERSIONS

We introduce a family of functions on graphs, parametrized by an integer k ≥ 2, developed for the graph isomorphism problem and working with tuples of k vertices. Each k-tuple i ∈ V^k = [n]^k is given a color c^0(i) corresponding to its isomorphism type (see Section B.2). The k-WL test relies on the following notion of neighborhood, defined, for any w ∈ [k] and i = (i_1, …, i_k) ∈ V^k, by N_w(i) = {(i_1, …, i_{w−1}, j, i_{w+1}, …, i_k) : j ∈ V}. Then the colors of the k-tuples are refined as follows: c^{t+1}(i) = Lex(c^t(i), (C_1^t(i), …, C_k^t(i))), where, for w ∈ [k], C_w^t(i) = {{c^t(ĩ) : ĩ ∈ N_w(i)}} and the function Lex means that all occurring colors are lexicographically ordered and replaced by an initial segment of the natural numbers. For a graph G, let k-WL_I^T(G) denote the multiset of colors of the k-WL algorithm at the T-th iteration. After a finite number of steps (which depends on the number of vertices in the graph), the algorithm stops because a stable coloring is reached (no color class of k-tuples is further divided). We denote by k-WL_I(G) the multiset of colors in the stable coloring. This is a graph invariant that is usually used to test if graphs are isomorphic. The power of this invariant increases with k (Cai et al., 1989). We now define an equivariant version of the k-WL test to express the discriminatory power of equivariant architectures. For this, we construct a coloring of the vertices from the coloring of the k-tuples given by the standard k-WL algorithm. Formally, define k-WL_E^T : F_0^{n^2} → F^n by, for i ∈ V, k-WL_E^T(G)_i = {{c^T(j) : j ∈ V^k, j_1 = i}}. Similarly, define k-WL_E(G)_i = {{c(j) : j ∈ V^k, j_1 = i}}, where c(j) is the stable coloring obtained by the algorithm.
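The refinement above can be sketched compactly for k = 2; `wl2` is a hypothetical helper (real implementations hash colors and test for stabilization rather than running a fixed number of rounds), and the initial colors here only encode the pair isomorphism type.

```python
import itertools

def wl2(adj, iters=3):
    """2-WL on an adjacency matrix given as a list of lists; returns the
    (canonically relabelled) multiset of colors of the 2-tuples."""
    n = len(adj)
    # c^0: isomorphism type of the pair (i1, i2).
    c = {(i1, i2): (i1 == i2, adj[i1][i2])
         for i1, i2 in itertools.product(range(n), repeat=2)}
    for _ in range(iters):
        new_c = {}
        for (i1, i2) in c:
            C1 = tuple(sorted(c[(j, i2)] for j in range(n)))   # multiset over N_1
            C2 = tuple(sorted(c[(i1, j)] for j in range(n)))   # multiset over N_2
            new_c[(i1, i2)] = (c[(i1, i2)], C1, C2)
        # 'Lex': relabel colors by an initial segment of the natural numbers.
        palette = {col: m for m, col in enumerate(sorted(set(new_c.values())))}
        c = {t: palette[col] for t, col in new_c.items()}
    return sorted(c.values())                                  # invariant multiset

# A triangle and a path on 3 nodes are distinguished by 2-WL.
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
assert wl2(tri) != wl2(path)
```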

4.2. SEPARATING POWER OF GNNS

We formulate our results using the equivalence relation introduced by Timofte (2005), which characterizes the separating power of a set of functions. Definition 2. Let F be a set of functions f defined on a set X, where each f takes its values in some Y_f. The equivalence relation ρ(F) defined by F on X is: for any x, x′ ∈ X, (x, x′) ∈ ρ(F) ⟺ ∀f ∈ F, f(x) = f(x′). Given two sets of functions F and E, we say that F is more separating (resp. strictly more separating) than E if ρ(F) ⊆ ρ(E) (resp. ρ(F) ⊊ ρ(E)). Note that all the functions in F and E need to be defined on the same set but can take values in different sets. For example, we can easily see that for the k-WL algorithm defined above, the equivariant version is more separating than the invariant one. Some properties of the WL hierarchy of tests can be rephrased with the notion of separating power. In particular, Cai et al. (1989) showed that (k+1)-WL_I distinguishes strictly more than k-WL_I, which can be rewritten simply as (for a function f, we write ρ(f) for ρ({f}))
ρ((k+1)-WL_I) ⊊ ρ(k-WL_I). (3)
This notion of separating power enables us to concisely summarize the current knowledge about the discriminatory power of classes of GNN. Proposition 3. We have, for k ≥ 2,
ρ(MGNN_I) = ρ(2-WL_I), ρ(MGNN_E) = ρ(2-WL_E) (4)
ρ(k-LGNN_I) = ρ(k-WL_I), ρ(k-LGNN_E) ⊆ ρ(k-WL_E) (5)
ρ(k-FGNN_I) = ρ((k+1)-WL_I), ρ(k-FGNN_E) = ρ((k+1)-WL_E) (6)
Only the results about the invariant cases were previously known: (4) comes from Xu et al. (2018) (Maron et al., 2018). Note that the folklore layer involves a (dense) matrix multiplication of shape n × n: though 2-FGNN is the most complex architecture, it has the best separating power among all architectures proposed so far dealing with tensors of order 2.

4.3. APPROXIMATION RESULTS FOR GNNS

For X, Y finite-dimensional spaces, let us denote by C_I(X, Y), resp. C_E(X, Y), the set of invariant, resp. equivariant, continuous functions from X to Y. The closure of a class of functions F for the uniform norm is denoted by cl(F). Our results extend easily to graphs of varying sizes, but this is deferred to Section F.2 for clarity. The theorem below states in particular that the class k-FGNN can approximate any continuous function that is less separating than (k+1)-WL, in the invariant and in the equivariant cases. Theorem 4. Let K_discr ⊆ G_n × F_0^n and K ⊆ F_0^{n^2} be compact sets. For the invariant case, we have:
cl(MGNN_I) = {f ∈ C_I(K_discr, F) : ρ(2-WL_I) ⊆ ρ(f)}
cl(k-LGNN_I) = {f ∈ C_I(K, F) : ρ(k-WL_I) ⊆ ρ(f)}
cl(k-FGNN_I) = {f ∈ C_I(K, F) : ρ((k+1)-WL_I) ⊆ ρ(f)}
For the equivariant case, we have:
cl(MGNN_E) = {f ∈ C_E(K_discr, F^n) : ρ(2-WL_E) ⊆ ρ(f)}
cl(k-LGNN_E) = {f ∈ C_E(K, F^n) : ρ(k-LGNN_E) ⊆ ρ(f)} ⊃ {f ∈ C_E(K, F^n) : ρ(k-WL_E) ⊆ ρ(f)}
cl(k-FGNN_E) = {f ∈ C_E(K, F^n) : ρ((k+1)-WL_E) ⊆ ρ(f)}
In the invariant case for k = 2, we have cl(MGNN_I) = cl(2-LGNN_I) ⊊ cl(2-FGNN_I), where the strictness of the last inclusion comes from (3). In other words, 2-FGNN_I has a better power of approximation than the other architectures working with tensors of order 2. We already knew by Proposition 3 that 2-FGNN_I is the best separating architecture among those studied in this paper dealing with tensors of order 2, and our theorem implies that this is also the case for the approximation power. To clarify the meaning of these statements, we explain why the inclusions "⊆" are actually straightforward. For concreteness, we focus on cl(k-FGNN_I) ⊆ {f ∈ C_I(K, F) : ρ((k+1)-WL_I) ⊆ ρ(f)}. Take h ∈ cl(k-FGNN_I); this means that there is a sequence GNN_j ∈ k-FGNN_I such that sup_{G∈K} ‖h(G) − GNN_j(G)‖ goes to zero when j goes to infinity. Therefore, h is continuous and constant on each ρ(k-FGNN_I)-class.
Indeed, for any (G, G′) ∈ ρ(k-FGNN_I), GNN_j(G) = GNN_j(G′), so that h(G) = lim_j GNN_j(G) = lim_j GNN_j(G′) = h(G′). Hence we have ρ(k-FGNN_I) ⊆ ρ(h) and, by Prop. 3, ρ(k-FGNN_I) = ρ((k+1)-WL_I), allowing us to get the inclusion above. On the contrary, the reverse inclusions "⊃" are much more intricate, but they are also the most valuable. For instance, consider the inclusion cl(k-FGNN_I) ⊃ {f ∈ C_I(K, F) : ρ((k+1)-WL_I) ⊆ ρ(f)}. If one wishes to learn a function h ∈ C_I(K, F) with k-FGNN_I, this function must at least be approximable by the class k-FGNN_I. Our theorem precisely guarantees that if h is less separating than (k+1)-WL_I, it can be approximated by k-FGNN_I: ∀ε > 0, ∃GNN ∈ k-FGNN_I, sup_{G∈K} ‖h(G) − GNN(G)‖ ≤ ε. For this, we show a much more general version of the famous Stone-Weierstrass theorem (see Section D) which relates the separating power to the approximation power. Following the elegant idea of Maehara & Hoang (2019), we augment the input space to transform vector-valued equivariant functions into scalar invariant maps. Then, we apply a fine-grained approximation theorem from Timofte (2005). We also provide specialized versions of our abstract theorem in Section 5, which can be easily used to determine the approximation capabilities of any deep learning architecture. Our theorem also has implications for universality results such as Maron et al. (2019b); Keriven & Peyré (2019). A class of GNN is said to be universal if its closure on a compact set K is the whole C_I(K, F) (or C_E(K, F^n)). In particular, Thm. 4 implies that n-LGNN and n-FGNN are universal, as n-WL distinguishes non-isomorphic graphs of size n. This recovers a result of Ravanbakhsh (2020) for LGNN. Moreover, we can leverage the extensive literature on the WL tests to give more subtle results. For instance, Cai et al. (1989, §8.2) show that, for planar graphs, O(√n)-WL can distinguish non-isomorphic instances. Therefore, O(√n)-LGNN or O(√n)-FGNN achieve universality in the particular, yet common, case of planar graphs. On a more practical side, Fürer (2010, Thm. 4.5) shows that the spectrum of a graph is less separating than 3-WL, so that functions of the spectrum can actually be well approximated by 2-FGNN.

5. EXPRESSIVENESS OF GNNS

We now state the general theorems which are our main tools in proving our approximation guarantees for GNNs. Their proofs are deferred to Section D.9, which contains our generalization of the Stone-Weierstrass theorem with symmetries. We first need to introduce more general definitions. If G is a finite group acting on some topological space X, we say that G acts continuously on X if, for all g ∈ G, x ↦ g • x is continuous. If G is a finite group acting on some compact set X and some topological space Y, we define the sets of equivariant and invariant continuous functions by
C_E(X, Y) = {f ∈ C(X, Y) : ∀x ∈ X, ∀g ∈ G, f(g • x) = g • f(x)}
C_I(X, Y) = {f ∈ C(X, Y) : ∀x ∈ X, ∀g ∈ G, f(g • x) = f(x)}
Note that these definitions extend Definition 1 to a general group. Theorem 5. Let X be a compact space, F = R^p be some finite-dimensional vector space, and G be a finite group acting (continuously) on X. Let F_0 ⊆ ∪_{h=1}^∞ C_I(X, R^h) be a non-empty set of invariant functions, stable by concatenation, and consider
F = {m ∘ f : f ∈ F_0 ∩ C(X, R^h), m : R^h → F MLP, h ≥ 1} ⊆ C(X, F).
Then the closure of F is cl(F) = {f ∈ C_I(X, F) : ρ(F_0) ⊆ ρ(f)}.
We can apply Theorem 5 to the class of k-ary relational pooling GNNs introduced in Murphy et al. (2019a). As a result, we get that this class of invariant k-RPGNN can approximate any continuous function f with ρ(k-RPGNN) ⊆ ρ(f); but, to the best of our knowledge, ρ(k-RPGNN) is not known and only ρ(k-RPGNN) ⊂ ρ(2-WL_I) is proved in Murphy et al. (2019a). We now state our general theorem for the equivariant case: Theorem 6. Let X be a compact space, F = R^p and G = S_n the permutation group, acting (continuously) on X and acting on F^n by, for σ ∈ S_n, x ∈ F^n and i ∈ [n], (σ • x)_i = x_{σ^{-1}(i)}. Let F_0 ⊆ ∪_{h=1}^∞ C_E(X, (R^h)^n) be a non-empty set of equivariant functions, stable by concatenation, and consider
F = {x ↦ (m(f(x)_1), …, m(f(x)_n)) : f ∈ F_0 ∩ C(X, (R^h)^n), m : R^h → F MLP, h ≥ 1}.
Assume that, if f ∈ F_0, then x ↦ (Σ_{i=1}^n f(x)_i, …, Σ_{i=1}^n f(x)_i) ∈ F_0 (the function whose n coordinates all equal the sum Σ_{i=1}^n f(x)_i). Then the closure of F is cl(F) = {f ∈ C_E(X, F^n) : ρ(F_0) ⊆ ρ(f)}.
Applications of these theorems to the case of PointNet (Qi et al., 2017) are provided in Section D.9.

6. QUADRATIC ASSIGNMENT PROBLEM

To empirically evaluate our results, we study the Quadratic Assignment Problem (QAP), a classical problem in combinatorial optimization. For A, B two n × n symmetric matrices, it consists in solving: maximize trace(AXBX^T) subject to X ∈ Π, where Π is the set of n × n permutation matrices. Many optimization problems can be formulated as a QAP. An example is the network alignment problem, which consists in finding the best matching between two graphs, represented by their adjacency matrices A and B. Though QAP is known to be NP-hard, recent works such as Nowak et al. (2018) have investigated whether it can be solved efficiently w.r.t. a fixed input distribution. More precisely, Nowak et al. (2018) studied whether one can learn to solve this problem using a MGNN trained on a dataset of already solved instances. However, as shown below, both the baselines and their approach fail on regular graphs, a class of graphs considered as particularly hard for isomorphism testing. To remedy this weakness, we consider 2-FGNN_E. We then follow the siamese method of Nowak et al. (2018): given two graphs, our system produces an embedding in F^n for each graph, where n is the number of nodes, and the two embeddings are multiplied together to obtain an n × n similarity matrix on nodes. A permutation is finally computed by solving a Linear Assignment Problem (LAP) with this n × n similarity matrix as cost matrix. We tested our architecture on two distributions: the Erdős-Rényi model and random regular graphs. The accuracy in matching the graphs is much improved compared to previous works. The experimental setup is described more precisely in Section A.1.

Figure 1: Fraction of matched nodes for pairs of correlated graphs (with edge density 0.2) as a function of the noise, for our method and the baselines of Feizi et al. (2016) and GNN (Nowak et al., 2018); see Section A.1 for details.
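The matching step of the siamese pipeline can be sketched end-to-end with stand-in node embeddings; `match` is a hypothetical helper, and the LAP is solved by brute force here for small n (in practice the Hungarian algorithm, e.g. scipy's `linear_sum_assignment`, would be used).

```python
import itertools
import numpy as np

def match(E1, E2):
    """E1, E2: (n, k) node embeddings of the two graphs.
    Returns the assignment maximizing the total similarity E1 @ E2.T."""
    n = E1.shape[0]
    sim = E1 @ E2.T                          # (n, n) node-similarity matrix
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        score = sum(sim[i, perm[i]] for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return best                              # node i of graph 1 -> perm[i] of graph 2

# Toy check: if graph 2's embeddings are a shuffled copy of graph 1's,
# the LAP recovers the shuffle exactly.
rng = np.random.default_rng(0)
E1 = np.eye(5)                               # well-separated toy embeddings
sigma = rng.permutation(5)
E2 = E1[sigma]                               # E2[j] = E1[sigma[j]]
perm = match(E1, E2)
assert [int(sigma[p]) for p in perm] == [0, 1, 2, 3, 4]
```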

7. CONCLUSION

We derived the expressive power of various practical GNN architectures: message passing GNNs, linear GNNs and folklore GNNs, both in their invariant and equivariant versions. Our results unify and extend the recent works in this direction; in particular, we are able to recover all the universality results proved for GNNs so far. Similarly to existing results in the literature, we do not deal here with the sizes of the embeddings constructed at the different layers, i.e. the dimensions of the spaces F, which are supposed to grow to infinity with the number of nodes n in the graph. Obtaining bounds on the scaling of the feature sizes under which the results presented here remain valid is an interesting open question. We show that folklore GNNs have the best approximation power among all GNNs studied here dealing with tensors of order 2. From a practical perspective, we demonstrate their improved performance on the QAP, with a significant gap compared to other approaches.

A EXPERIMENTAL RESULTS


A.1 DETAILS ON THE EXPERIMENTAL SETUP

We consider a 2-FGNN_E and train it to solve random planted problem instances of the QAP. Given a pair of graphs G_1, G_2 with n nodes each, the siamese 2-FGNN_E encoder produces embeddings E_1, E_2 ∈ R^{n×k}. These embeddings are used to predict a matching as follows: we first compute the product E_1 E_2^T, then take a softmax along each row and use the standard cross-entropy loss to predict the corresponding permutation index. We used a 2-FGNN_E with 2 layers, each MLP having depth 3 and hidden states of size 64. We trained for 25 epochs with batches of size 32, a learning rate of 1e-4 and the Adam optimizer. The PyTorch code is available in the supplementary material. For each experiment, the dataset was made of 20000 graphs for the train set, 1000 for the validation set and 1000 for the test set.

For the experiment with Erdős–Rényi random graphs, we take G_1 to be a random Erdős–Rényi graph with edge density p_e = 0.2 and n = 50 vertices. The graph G_2 is a small perturbation of G_1 according to the following error model, considered in Feizi et al. (2016):

G_2 = G_1 ⊙ (1 − Q) + (1 − G_1) ⊙ Q',

where Q and Q' are Erdős–Rényi random graphs with edge densities p_1 and p_2 = p_1 p_e / (1 − p_e) respectively, so that G_2 has the same expected degree as G_1. The noise level is the parameter p_1. For regular graphs, we followed the same experimental setup but G_1 is now a random regular graph with degree d = 10. Regular graphs are an interesting example as they tend to be harder to align due to their more symmetric structure.

A.2 EXPERIMENTAL RESULTS ON GRAPHS OF VARYING SIZE

We tested our models on datasets of graphs of varying size, as this setting is also encompassed by our theory. However, contrary to message-passing GNNs, GNNs based on tensors do not work well with batches of graphs of varying size. Previous implementations, such as the one of Maron et al. (2019a), group the graphs in the dataset by size, enabling the GNN to only deal with batches of graphs of the same size. We implemented this functionality as a class MaskedTensors. Thanks to the newest improvements of PyTorch (Paszke et al., 2019), MaskedTensors act as a subclass of the fundamental Tensor class; they thus integrate almost seamlessly into standard PyTorch code. We refer the reader to the code for more details: https://github.com/mlelarge/graph_neural_net

Results of our architecture and implementation with graphs of varying size are shown below in Figure 2. The only difference with the setting described above is that the number of nodes is now random: the number of vertices of a graph is chosen according to a binomial distribution with parameters n = 50 and p_n = 0.9.
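As a rough illustration of the idea behind MaskedTensors (the actual class lives in the repository linked above; the helpers below are hypothetical and use plain lists rather than PyTorch tensors), one can pad a batch of graphs to a common size and let every aggregation consult a boolean node mask so that padded entries never contribute:

```python
def pad_batch(adjs):
    """Pad adjacency matrices (lists of lists) to a common size and build node masks.
    Computations then run on the uniform padded batch, and the mask records which
    entries correspond to real nodes."""
    N = max(len(a) for a in adjs)
    padded, masks = [], []
    for a in adjs:
        n = len(a)
        padded.append([[a[i][j] if i < n and j < n else 0 for j in range(N)]
                       for i in range(N)])
        masks.append([i < n for i in range(N)])
    return padded, masks

def masked_degrees(adj, mask):
    """Node degrees computed only over real (unmasked) nodes; padded nodes get 0."""
    return [sum(adj[i][j] for j in range(len(mask)) if mask[j]) if mask[i] else 0
            for i in range(len(mask))]

adjs = [[[0, 1], [1, 0]],                     # 2-node graph: a single edge
        [[0, 1, 1], [1, 0, 0], [1, 0, 0]]]    # 3-node star
padded, masks = pad_batch(adjs)
print([masked_degrees(a, m) for a, m in zip(padded, masks)])  # -> [[1, 1, 0], [2, 1, 1]]
```

The MaskedTensor class automates this bookkeeping behind the standard Tensor interface, so that sums, softmaxes and matrix products ignore padded nodes without changes to the model code.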

A.3 GENERALIZATION FOR REGULAR GRAPHS

We performed the following experiment in the same general setting as in Section A.1 with regular graphs. We trained different models for all noise levels between 0 and 0.22, but in Figure 3 we plot the accuracy of each model across all noise levels. We observe that a majority of the models actually generalize to noise levels on which they were not trained. Indeed, the model trained with noise level ≈ 0.1 performs best among all models across all noise levels!

B WEISFEILER-LEHMAN TESTS

Here we describe more precisely this hierarchy of tests, which will be used extensively to characterize the discriminatory power of classes of GNN. See Douglas (2011); Grohe (2017); Fürer (2017) for graph-theoretic introductions to these algorithms.

B.1 WEISFEILER-LEHMAN TEST ON VERTICES

We now present the initial vertex coloring algorithm.

Input. This algorithm takes as input a discrete graph structure G = (V, E) ∈ G_n with V = [n], E ⊆ V^2, and features h ∈ F_0^n on the vertices.

Initialization. Each vertex s ∈ V is given a color c_WL^0(G, h)_s = h_s corresponding to its feature vector.

Refining the coloring. The colors of the vertices are updated as follows,

c_WL^{t+1}(G, h)_s = Lex(c_WL^t(G, h)_s, {{c_WL^t(G, h)_{s'} : s' ∼ s}}),

where the function Lex means that all occurring colors are lexicographically ordered. For each graph G ∈ G_n and each vector of features h ∈ F_0^n, there exists a time T(G, h) from which the sequence of colorings (c_WL^t(G, h))_{t≥0} is stationary. More exactly, the colorings are not refined anymore: for any t ≥ T(G, h) and s, s' ∈ V,

c_WL^t(G, h)_s = c_WL^t(G, h)_{s'} ⟺ c_WL^{T(G,h)}(G, h)_s = c_WL^{T(G,h)}(G, h)_{s'}.

Denote the resulting coloring c_WL^{T(G,h)}(G, h) simply by c_WL(G, h); c_WL is then a mapping from G_n × F_0^n to Z^n for some space of colors Z.
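The refinement above can be implemented directly; the following is our own sketch, in which dictionary-based relabeling plays the role of the lexicographic relabeling Lex:

```python
def wl_colors(adj, features):
    """Vertex Weisfeiler-Lehman refinement: each vertex s is recolored with
    (c^t_s, {{c^t_{s'} : s' ~ s}}) until the color partition stops refining."""
    n = len(adj)
    colors = list(features)
    while True:
        # Signature of s: its color plus the sorted multiset of neighbor colors.
        sigs = [(colors[s], tuple(sorted(colors[t] for t in range(n) if adj[s][t])))
                for s in range(n)]
        relabel = {}
        new = [relabel.setdefault(sig, len(relabel)) for sig in sigs]
        if len(set(new)) == len(set(colors)):  # partition stationary: done
            return new
        colors = new

# Path on 4 vertices with constant features:
# the two endpoints end up with one color, the two inner vertices with another.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(wl_colors(path, [0, 0, 0, 0]))
```

A run on the 4-vertex path separates degree-1 from degree-2 vertices after one round and then stabilizes.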

Invariant tests

The proper Weisfeiler-Lehman test is invariant and is defined by, for t ≥ 0 and (G, h) ∈ G_n × F_0^n a graph,

WL_I^t(G, h) = {{c_WL^t(G, h)_s : s ∈ V}},   WL_I(G, h) = {{c_WL(G, h)_s : s ∈ V}}.

Equivariant tests. For the vertex coloring algorithm, c_WL is already an equivariant mapping, so we define, for t ≥ 0,

WL_E^t = c_WL^t,   WL_E = c_WL.

B.2 ISOMORPHISM TYPE

The initialization of the higher-order variants of the Weisfeiler-Lehman test is slightly more intricate. For this we need to define the isomorphism type of a k-tuple w.r.t. a graph described by a tensor in F_0^{n^2}. A k-tuple (i_1, . . . , i_k) ∈ [n]^k in a graph G ∈ F_0^{n^2} and a k-tuple (j_1, . . . , j_k) ∈ [n]^k in a graph H ∈ F_0^{n^2} are said to have the same isomorphism type if the mapping i_w ↦ j_w is a well-defined partial isomorphism. Explicitly, this means that,

• ∀w, w' ∈ [k], i_w = i_{w'} ⟺ j_w = j_{w'};
• ∀w, w' ∈ [k], G_{i_w, i_{w'}} = H_{j_w, j_{w'}}.

Denote by iso(G)_{i_1,...,i_k} the isomorphism type of the k-tuple (i_1, . . . , i_k) ∈ [n]^k in the graph G ∈ F_0^{n^2}.
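The two conditions defining the isomorphism type can be checked mechanically. The sketch below is ours (`same_iso_type` is a hypothetical helper), with graphs represented as plain adjacency matrices, i.e. a single edge channel:

```python
def same_iso_type(G, s, H, t):
    """k-tuples s (in G) and t (in H) have the same isomorphism type iff
    (i) s[w] == s[v] exactly when t[w] == t[v], and
    (ii) matching pairs carry the same value: G[s[w]][s[v]] == H[t[w]][t[v]]."""
    if len(s) != len(t):
        return False
    k = len(s)
    return all((s[w] == s[v]) == (t[w] == t[v])
               and G[s[w]][s[v]] == H[t[w]][t[v]]
               for w in range(k) for v in range(k))

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(same_iso_type(triangle, (0, 1), path3, (0, 1)))  # -> True: both pairs are edges
print(same_iso_type(triangle, (0, 2), path3, (0, 2)))  # -> False: edge vs non-edge
```

With vertex colors, condition (ii) would compare full F_0-valued entries instead of 0/1 adjacency values.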

B.3 WEISFEILER-LEHMAN AND FOLKLORE WEISFEILER-LEHMAN TESTS OF ORDER k ≥ 2

We now present the folklore version of the Weisfeiler-Lehman test of order k (k-FWL), for k ≥ 2, along with k-WL for clarity. For both, we follow the presentation of Maron et al. (2019a) (except for the equivariant tests).

Input. These algorithms take as input a graph G ∈ F_0^{n^2}, which can be seen as a coloring of the pairs of nodes.

Initialization. Each k-tuple s ∈ V^k is given a color c_{k-WL}^0(G)_s = c_{k-FWL}^0(G)_s corresponding to its isomorphism type.

k-WL. The k-WL test relies on the following notion of neighborhood, defined by, for any w ∈ [k] and s = (i_1, . . . , i_k) ∈ V^k,

N_w(s) = {(i_1, . . . , i_{w−1}, j, i_{w+1}, . . . , i_k) : j ∈ V}.   (WL)

Then, the colors of the k-tuples s ∈ V^k are refined as follows,

c_{k-WL}^{t+1}(G)_s = Lex(c_{k-WL}^t(G)_s, (C_1^t(s), . . . , C_k^t(s))),

where, for w ∈ [k], C_w^t(s) = {{c_{k-WL}^t(G)_{s'} : s' ∈ N_w(s)}}. For each graph G ∈ F_0^{n^2}, there exists a time T(G) from which the sequence of colorings (c_{k-WL}^t(G))_{t≥0} is stationary. More exactly, the colorings are not refined anymore: for any t ≥ T(G) and s, s' ∈ V^k,

c_{k-WL}^t(G)_s = c_{k-WL}^t(G)_{s'} ⟺ c_{k-WL}^{T(G)}(G)_s = c_{k-WL}^{T(G)}(G)_{s'}.

Denote the resulting coloring c_{k-WL}^{T(G)}(G) simply by c_{k-WL}(G).

k-FWL. For k-FWL, the corresponding notion of neighborhood is defined by, for any j ∈ V and s = (i_1, . . . , i_k) ∈ V^k,

N_j^F(s) = ((j, i_2, . . . , i_k), (i_1, j, i_3, . . . , i_k), . . . , (i_1, i_2, . . . , i_{k−1}, j)).   (FWL)

Then, the colors of the k-tuples s ∈ V^k are refined as follows,

c_{k-FWL}^{t+1}(G)_s = Lex(c_{k-FWL}^t(G)_s, {{C_j^t(s) : j ∈ V}}),

where, for j ∈ V, C_j^t(s) = (c_{k-FWL}^t(G)_{s'} : s' ∈ N_j^F(s)). Like for k-WL, for each graph G ∈ F_0^{n^2}, there exists a time T(G) from which the sequence of colorings (c_{k-FWL}^t(G))_{t≥0} is stationary, and we denote the resulting coloring c_{k-FWL}^{T(G)}(G) simply by c_{k-FWL}(G).
The colors c_{k-WL}^t and c_{k-FWL}^t at iteration t define mappings from F_0^{n^2} to the space of colorings of k-tuples, Z^{n^k}, for some space Z.
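To make the k-FWL update concrete, here is a toy 2-FWL refinement. This is our own sketch, not the paper's code: the pair initialization is a crude stand-in for full isomorphism types, and dictionary relabeling again plays the role of Lex. It separates the 6-cycle from two disjoint triangles, a pair of graphs that the vertex coloring algorithm of Section B.1 cannot distinguish:

```python
def fwl2_colors(adj):
    """2-FWL sketch: the color of a pair s = (i1, i2) is refined with its old color
    together with the multiset, over j, of the tuples (c[(j, i2)], c[(i1, j)])
    coming from the neighborhood N^F_j(s)."""
    n = len(adj)
    # Crude initialization by pair type: (on the diagonal?, edge value).
    c = {(i, j): (i == j, adj[i][j]) for i in range(n) for j in range(n)}
    while True:
        sigs = {s: (c[s], tuple(sorted((c[(j, s[1])], c[(s[0], j)]) for j in range(n))))
                for s in c}
        relabel = {}
        new = {s: relabel.setdefault(sigs[s], len(relabel)) for s in c}
        if len(set(new.values())) == len(set(c.values())):  # stationary partition
            return new
        c = new

cycle6 = [[1 if abs(i - j) % 6 in (1, 5) else 0 for j in range(6)] for i in range(6)]
two_triangles = [[1 if i != j and i // 3 == j // 3 else 0 for j in range(6)]
                 for i in range(6)]
# The stable pair partitions have different numbers of classes,
# so 2-FWL separates the two graphs.
print(len(set(fwl2_colors(cycle6).values())),
      len(set(fwl2_colors(two_triangles).values())))
```

On the 6-cycle the stable classes are the pairs at distance 0, 1, 2 and 3, while on the two triangles only the diagonal, edge and cross-triangle pairs appear, which detects the triangles that vertex WL misses.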

Invariant tests

The standard versions of the Weisfeiler-Lehman tests are invariant and can be defined by, for t ≥ 0 and G ∈ F_0^{n^2} a graph,

k-WL_I^t(G) = {{c_{k-WL}^t(G)_s : s ∈ V^k}},   k-WL_I(G) = {{c_{k-WL}(G)_s : s ∈ V^k}},
k-FWL_I^t(G) = {{c_{k-FWL}^t(G)_s : s ∈ V^k}},   k-FWL_I(G) = {{c_{k-FWL}(G)_s : s ∈ V^k}}.

Equivariant tests. We now introduce the equivariant versions of these tests. Many extensions are possible; we chose this one for its simplicity. For t ≥ 0, G ∈ F_0^{n^2} a graph and i ∈ V,

k-WL_E^t(G)_i = {{c_{k-WL}^t(G)_s : s ∈ V^k, s_1 = i}},   k-WL_E(G)_i = {{c_{k-WL}(G)_s : s ∈ V^k, s_1 = i}},
k-FWL_E^t(G)_i = {{c_{k-FWL}^t(G)_s : s ∈ V^k, s_1 = i}},   k-FWL_E(G)_i = {{c_{k-FWL}(G)_s : s ∈ V^k, s_1 = i}}.

C SEPARATING POWER OF GNN

The goal of this section is to prove,

Proposition 3. We have, for k ≥ 2,

ρ(MGNN_I) = ρ(2-WL_I)   ρ(MGNN_E) = ρ(2-WL_E)   (4)
ρ(k-LGNN_I) = ρ(k-WL_I)   ρ(k-LGNN_E) ⊆ ρ(k-WL_E)   (5)
ρ(k-FGNN_I) = ρ((k + 1)-WL_I)   ρ(k-FGNN_E) = ρ((k + 1)-WL_E)   (6)

C.1 MULTI-LAYER PERCEPTRONS

In the following we will use extensively multi-layer perceptrons (MLP) and their universality properties. Yet, for the sake of simplicity, we do not define precisely what we mean by MLP. Given two finite-dimensional feature spaces F_0 and F_1, we only assume we are given a class of (continuous) MLPs from F_0 to F_1 which is large enough to be dense in C(F_0, F_1). See for instance Hornik et al. (1989); Cybenko (1989) for precise conditions for MLPs to be universal.

C.2 AUGMENTED SEPARATING POWER

To factor the proof, we introduce another notion of separating power.

Definition 7. For n, m ≥ 1 fixed, let X be a set, F be some space and F be a set of functions from X to Y = F^{n×m}. Then, the augmented separating power of F is,

ρ_{n,m}^{augm}(F) = ρ({(x, i, j) ∈ X × {1, . . . , n} × {1, . . . , m} ↦ f(x)_{i,j} : f ∈ F}).

Explicitly, for x, y ∈ X, i, k ∈ {1, . . . , n} and j, l ∈ {1, . . . , m},

(x, i, j, y, k, l) ∈ ρ_{n,m}^{augm}(F) ⟺ ∀f ∈ F, f(x)_{i,j} = f(y)_{k,l}.

Note that when n = m = 1, the augmented separating power is exactly the same as the original separating power, so we identify ρ(.) with ρ_{1,1}^{augm}(.). We also identify ρ_{n,1}^{augm}(.) with ρ_{1,n}^{augm}(.), which we denote by ρ_n^{augm}(.). First, it is easy to see that this notion is more precise than the separating power.

Lemma 8. If F and G are sets of functions from X to F^{n×m},

ρ_{n,m}^{augm}(F) ⊆ ρ_{n,m}^{augm}(G) ⟹ ρ(F) ⊆ ρ(G),

and, in particular, ρ_{n,m}^{augm}(F) = ρ_{n,m}^{augm}(G) ⟹ ρ(F) = ρ(G).

The interest of this notion is justified by the following lemma, which shows that it behaves well under composition with "reduction layers".

Lemma 9. For n, m ≥ 1 fixed, let X be a compact topological space, Y = F^{n×m} with F some finite-dimensional space, F ⊆ C(X, Y), and τ : X → Z^{n×m} a function for some space Z. Define F' ⊆ C(X, F^n) by,

F' = {x ∈ X ↦ (Σ_{j=1}^m h(f(x)_{1,j}), Σ_{j=1}^m h(f(x)_{2,j}), . . . , Σ_{j=1}^m h(f(x)_{n,j})) : h : F → F MLP, f ∈ F},

and τ' : X → Z'^n by, for x ∈ X,

∀i ∈ {1, . . . , n}, τ'(x)_i = {{τ(x)_{i,j} : 1 ≤ j ≤ m}}.

Then,

ρ_{n,m}^{augm}(F) ⊆ ρ_{n,m}^{augm}(τ) ⟹ ρ_n^{augm}(F') ⊆ ρ_n^{augm}(τ'),
ρ_{n,m}^{augm}(F) ⊇ ρ_{n,m}^{augm}(τ) ⟹ ρ_n^{augm}(F') ⊇ ρ_n^{augm}(τ').

Note that, in the statement, we implicitly see F' as a set of functions from X to F^{n×1} to fit the definition of augmented separating power.

Proof. We show the two inclusions independently.

(⊆) We first show that ρ_{n,m}^{augm}(F) ⊆ ρ_{n,m}^{augm}(τ) ⟹ ρ_n^{augm}(F') ⊆ ρ_n^{augm}(τ'). Take (x, i, y, k) ∈ ρ_n^{augm}(F').
This means that, for any MLP h : F → F and any f ∈ F,

Σ_{j=1}^m h(f(x)_{i,j}) = Σ_{j=1}^m h(f(y)_{k,j}).

By Lem. 31 and the universality of MLPs, there exists a permutation σ ∈ S_m such that (f(x)_{i,σ(1)}, . . . , f(x)_{i,σ(m)}) = (f(y)_{k,1}, . . . , f(y)_{k,m}). By definition of the augmented separating power, this means that, for any j ∈ {1, . . . , m}, (x, i, σ(j), y, k, j) ∈ ρ_{n,m}^{augm}(F). Hence, by assumption, for any j ∈ {1, . . . , m}, (x, i, σ(j), y, k, j) ∈ ρ_{n,m}^{augm}(τ), i.e. τ(x)_{i,σ(j)} = τ(y)_{k,j}. But this exactly means that τ'(x)_i = τ'(y)_k, so that (x, i, y, k) ∈ ρ_n^{augm}(τ') as required.

(⊇) We now show the other inclusion, ρ_{n,m}^{augm}(F) ⊇ ρ_{n,m}^{augm}(τ) ⟹ ρ_n^{augm}(F') ⊇ ρ_n^{augm}(τ'). Take (x, i, y, k) ∈ ρ_n^{augm}(τ'). By definition of τ', this means that there exists σ ∈ S_m such that, for any j ∈ {1, . . . , m}, τ(x)_{i,σ(j)} = τ(y)_{k,j}, so that (x, i, σ(j), y, k, j) ∈ ρ_{n,m}^{augm}(τ) ⊆ ρ_{n,m}^{augm}(F). Hence, for any f ∈ F, (f(x)_{i,σ(1)}, . . . , f(x)_{i,σ(m)}) = (f(y)_{k,1}, . . . , f(y)_{k,m}), and so, for any h : F → F,

Σ_{j=1}^m h(f(x)_{i,j}) = Σ_{j=1}^m h(f(y)_{k,j}).

Therefore, (x, i, y, k) ∈ ρ_n^{augm}(F'), which concludes the proof.

C.3 INITIALIZATION LAYER

We use the same initialization layer as Maron et al. (2019a); Chen et al. (2020) and recall it below. The initial graph is a tensor of the form G ∈ F_0^{n^2} with F_0 = R^{e+1}; the last channel G_{:,:,e+1} encodes the adjacency matrix of the graph, and the first e channels G_{:,:,1:e} are zero outside the diagonal, with G_{i,i,1:e} ∈ R^e the color of vertex v_i ∈ V. We then define I_k : F_0^{n^2} → F_1^{n^k} with F_1 = R^{k^2×(e+2)} as follows:

I_k(G)_{i,r,s,w} = G_{i_r,i_s,w} for w ∈ [e + 1],   I_k(G)_{i,r,s,e+2} = 1(i_r = i_s),

for i ∈ [n]^k and r, s ∈ [k]. This linear equivariant layer has the same separating power as the isomorphism type, which is defined in Section B.2.
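For k = 2 the layer reads as follows. This is a hypothetical sketch of ours, representing each F_0-valued entry of G as a plain list of channels rather than as a tensor slice:

```python
def init_layer_k2(G):
    """I_2 sketch: the input G is an n x n grid of channel lists (e color channels
    followed by the adjacency channel). The output feature of a pair i = (i1, i2)
    concatenates, for each (r, s) in {1, 2}^2, the channels of G at (i_r, i_s)
    plus the indicator channel 1(i_r = i_s)."""
    n = len(G)
    out = {}
    for i1 in range(n):
        for i2 in range(n):
            feats = []
            for r in (i1, i2):        # i_r ranges over the entries of the pair
                for s in (i1, i2):    # i_s likewise: k^2 = 4 blocks in total
                    feats.extend(list(G[r][s]) + [1.0 if r == s else 0.0])
            out[(i1, i2)] = feats
    return out

# n = 2, e = 1: channel 0 is the vertex color (diagonal only), channel 1 the adjacency.
G = [[[5.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [7.0, 0.0]]]
feats = init_layer_k2(G)  # each pair gets k^2 * (e + 2) = 4 * 3 = 12 channels
```

Each output entry thus sees the colors of both endpoints, the edge between them, and whether the two indices coincide, which is exactly the information defining the isomorphism type of the pair.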

Lemma 10 ((Maron et al., 2019a, C.1)). For k ≥ 2, p_0 ≥ 1, F_0 = R^{p_0} and F_1 = R^{k^2×(p_0+1)}, there exists I_k : F_0^{n^2} → F_1^{n^k} such that,

ρ_{n,n^{k−1}}^{augm}(I_k) = ρ_{n,n^{k−1}}^{augm}(iso).

C.4 KNOWN RESULTS ABOUT THE SEPARATING POWER OF SOME GNN CLASSES

First, we need to define some classes of GNN from which both the invariant and equivariant GNNs we consider are built. See Section 3 for details about the different layers.

MGNN_emb = {F_T ∘ . . . ∘ F_2 ∘ F_1 : F_t : F_t^n → F_{t+1}^n message passing layer, t = 1, . . . , T, T ≥ 1}
k-LGNN_emb = {F_T ∘ . . . ∘ F_2 ∘ F_1 ∘ I_k : F_t : F_t^{n^k} → F_{t+1}^{n^k} linear equivariant layer, t = 1, . . . , T, T ≥ 1}
k-FGNN_emb = {F_T ∘ . . . ∘ F_2 ∘ F_1 ∘ I_k : F_t : F_t^{n^k} → F_{t+1}^{n^k} FGL, t = 1, . . . , T, T ≥ 1}.

Then, the precise results from the literature can be rephrased as,

Lemma 11 (Xu et al. (2018); Maron et al. (2019a)). We have,

ρ_n^{augm}(MGNN_emb) = ρ_n^{augm}(c_WL)   (8)
ρ_{n,n^{k−1}}^{augm}(k-LGNN_emb) ⊆ ρ_{n,n^{k−1}}^{augm}(c_{k-WL})   (9)
ρ_{n,n^{k−1}}^{augm}(k-FGNN_emb) ⊆ ρ_{n,n^{k−1}}^{augm}(c_{k-FWL})   (10)

C.5 BOUNDING THE SEPARATING POWER OF k-FGNN

We complete the results of the literature with a bound on the separating power of k-FGNN. Note that the particular case k = 2 is already proven in Geerts (2020a).

Lemma 12. For any k ≥ 2,

ρ_{n,n^{k−1}}^{augm}(k-FGNN_emb) ⊇ ρ_{n,n^{k−1}}^{augm}(c_{k-FWL}),

so that ρ_{n,n^{k−1}}^{augm}(k-FGNN_emb) = ρ_{n,n^{k−1}}^{augm}(c_{k-FWL}).

Proof. Define,

k-FGNN_emb^T = {F_T ∘ . . . ∘ F_2 ∘ F_1 ∘ I_k : F_t : F_t^{n^k} → F_{t+1}^{n^k} FGL, t = 1, . . . , T},

the set of functions defined by exactly T FGL layers. We show by induction that, for any T ≥ 0,

ρ_{n,n^{k−1}}^{augm}(k-FGNN_emb^T) ⊇ ρ_{n,n^{k−1}}^{augm}(c_{k-FWL}^T).

For T = 0, this is immediate by the definition of I_k in Section C.3. Assume now that this inclusion holds at T − 1 ≥ 0. We show that it also holds at T. Take G, G' ∈ F_0^{n^2} and s, s' ∈ [n]^k such that c_{k-FWL}^T(G)_s = c_{k-FWL}^T(G')_{s'}. We need to show that, for any f ∈ k-FGNN_emb^T, f(G)_s = f(G')_{s'}. By definition of the update rule of k-FWL, the equality of the colors of s and s' above implies that,

c_{k-FWL}^{T−1}(G)_s = c_{k-FWL}^{T−1}(G')_{s'},   (11)

and that there exists σ ∈ S_n such that, for any j ∈ [n],

(c_{k-FWL}^{T−1}(G)_{s''} : s'' ∈ N_j^F(s)) = (c_{k-FWL}^{T−1}(G')_{s''} : s'' ∈ N_{σ(j)}^F(s')).

Let s = (i_1, . . . , i_k) and s' = (j_1, . . . , j_k). Then this implies that, for any w ∈ [k] and j ∈ [n],

c_{k-FWL}^{T−1}(G)_{i_1,...,i_{w−1},j,i_{w+1},...,i_k} = c_{k-FWL}^{T−1}(G')_{j_1,...,j_{w−1},σ(j),j_{w+1},...,j_k}.   (12)

We now use the induction hypothesis, i.e. that ρ_{n,n^{k−1}}^{augm}(k-FGNN_emb^{T−1}) ⊇ ρ_{n,n^{k−1}}^{augm}(c_{k-FWL}^{T−1}). Take any f_{T−1} ∈ k-FGNN_emb^{T−1}. By (11) and the induction hypothesis, f_{T−1}(G)_s = f_{T−1}(G')_{s'}. By (12), for any w ∈ [k] and j ∈ [n],

f_{T−1}(G)_{i_1,...,i_{w−1},j,i_{w+1},...,i_k} = f_{T−1}(G')_{j_1,...,j_{w−1},σ(j),j_{w+1},...,j_k}.

By the definition of FGL in Section 3.3, for any FGL F_T : F_T^{n^k} → F_{T+1}^{n^k},

F_T ∘ f_{T−1}(G)_s = F_T ∘ f_{T−1}(G')_{s'}.

Therefore, for any f ∈ k-FGNN_emb^T, f(G)_s = f(G')_{s'}, which concludes the proof.

C.6 CONCLUSION

Proposition 13.
We have, for k ≥ 2,

ρ(MGNN_I) = ρ(2-WL_I)   ρ(MGNN_E) = ρ(2-WL_E)   (13)
ρ(2-LGNN_I) = ρ(2-WL_I)   (14)
ρ(k-LGNN_I) ⊆ ρ(k-WL_I)   ρ(k-LGNN_E) ⊆ ρ(k-WL_E)   (15)
ρ(k-FGNN_I) = ρ(k-FWL_I)   ρ(k-FGNN_E) = ρ(k-FWL_E)   (16)

Proof. Most of the statements come from the literature or are direct consequences of the lemmas above.

• Proof of (13). The invariant case is proven in Xu et al. (2018, Lem. 2, Thm. 3). The equivariant case comes from Lem. 11, Lem. 8, the fact that the layers m_E ∘ (Id + λS_1) do not change the separating power, and the fact that WL_E = c_WL.
• (14) is exactly Chen et al. (2020, Thm. 6).
• Proof of (15). The invariant case is exactly Maron et al. (2019a, Thm. 1). The equivariant case comes from Lem. 11, Lem. 9, Lem. 8 and the fact that the layer m_E does not change the separating power.
• Proof of (16). The direct inclusion of the invariant case corresponds to Maron et al. (2019a, Thm. 2). The other cases are a consequence of Lem. 11, Lem. 12, Lem. 9, Lem. 8 and the fact that the layers m_I and m_E do not change the separating power.

D STONE-WEIERSTRASS THEOREM WITH SYMMETRIES

This section presents our extension of the Stone-Weierstrass theorem dealing with functions with symmetries. The scope of this section is not restricted to graphs or even tensors: we deal with general spaces and general symmetries. To illustrate it, we present applications to the PointNet architecture of Qi et al. (2017). Our approximation results for GNNs (Thm. 5 and Thm. 6) are then obtained from these theoretical results applied to tensors and the symmetric group in Section D.9.

D.1 GENERAL NOTATIONS

As explained above, we are dealing in this section with a much larger scope than graphs and permutations. We first need to extend the notations introduced above; the notations introduced below make this section self-contained.

If X is some topological space and F ⊆ X, denote by F̄ its closure. If X is a topological space and Y = R^p some finite-dimensional space, denote by C(X, Y) the set of continuous functions from X to Y. Moreover, if X is compact, we endow C(X, Y) with the topology of uniform convergence, which is defined by the norm f ↦ sup_{x∈X} ‖f(x)‖ for some norm ‖.‖ on Y.

If G is a finite group acting on some topological space X, we say that G acts continuously on X if, for all g ∈ G, x ↦ g • x is continuous. If G is a finite group acting on some compact set X and some topological space Y, we define the sets of equivariant and invariant continuous functions by,

C_E(X, Y) = {f ∈ C(X, Y) : ∀x ∈ X, ∀g ∈ G, f(g • x) = g • f(x)},
C_I(X, Y) = {f ∈ C(X, Y) : ∀x ∈ X, ∀g ∈ G, f(g • x) = f(x)}.

Note that these definitions extend Definition 1 to a general group. If Y = R^p, we denote the coordinate-wise multiplication, or Hadamard product, of y, y' ∈ Y simply by yy' = (y_1 y'_1, . . . , y_p y'_p) ∈ Y. We say that a subset A ⊆ R^p is a subalgebra of R^p if it is both a linear space and stable by multiplication. This product in turn defines a product on C(X, Y) with Y = R^p by, for f, g ∈ C(X, Y), fg : x ↦ f(x)g(x). In addition, we also extend the scalar-vector product of Y = R^p to functions: if g ∈ C(X, R) and f ∈ C(X, Y), their product gf is the function gf : x ↦ g(x)f(x). Given a set of scalar functions S ⊆ C(X, R) and a set of vector-valued functions F ⊆ C(X, Y), the set of products of functions of these two sets is denoted by,

S • F = {gf : g ∈ S, f ∈ F}.

Moreover, we denote by 1 the constant function from X to R^p defined by x ↦ (1, . . . , 1). In particular, if f is a function from X to R, f1 denotes the function x ↦ (f(x), . . . , f(x)), which goes from X to Y = R^p. Finally, we say that F ⊆ C(X, Y), with Y some R^p, is a subalgebra if it is a linear space which is also stable by multiplication.

D.2 SEPARATING POWER

We recall the definition of separating power that we introduced above:

Definition 14. Let F be a set of functions f defined on a set X, where each f takes its values in some Y_f. The equivalence relation ρ(F) defined by F on X is: for any x, x' ∈ X,

(x, x') ∈ ρ(F) ⟺ ∀f ∈ F, f(x) = f(x').

For a function f, we write ρ(f) for ρ({f}). Separating power is stable by closure:

Lemma 15. Let X be a compact topological space, Y some finite-dimensional space and F ⊆ C(X, Y). Then,

ρ(F) = ρ(F̄).

Proof. As F ⊆ F̄, F̄ is more separating than F, i.e. ρ(F̄) ⊆ ρ(F). Conversely, take (x, y) ∉ ρ(F̄). By definition, there exists h ∈ F̄ such that h(x) ≠ h(y), so that ε = ‖h(x) − h(y)‖ > 0 (for some norm ‖.‖ on Y). As h ∈ F̄, there is some f ∈ F such that sup_X ‖h − f‖ ≤ ε/3. Therefore, by the triangle inequality,

ε = ‖h(x) − h(y)‖ ≤ ‖f(x) − h(x)‖ + ‖f(x) − f(y)‖ + ‖f(y) − h(y)‖ ≤ 2ε/3 + ‖f(x) − f(y)‖.

It follows that ‖f(x) − f(y)‖ ≥ ε/3 > 0, so that f(x) ≠ f(y) and (x, y) ∉ ρ(F).

D.3 APPROXIMATION THEOREMS FOR REAL-VALUED FUNCTIONS

We start by recalling the Stone-Weierstrass theorem, see Rudin (1991, Thm. 5.7).

Theorem 16 (Stone-Weierstrass). Let X be a compact space and F a subalgebra of C(X, R), the space of real-valued continuous functions on X, which contains the constant function 1. If F separates points, i.e. ρ(F) = {(x, x) : x ∈ X}, then F is dense in C(X, R).

We now prove an extension of this classical result, due to Timofte (2005), allowing us to deal with much smaller F by dropping the requirement that F separates points.

Corollary 17. Let X be a compact space and F a subalgebra of C(X, R), the space of real-valued continuous functions on X, which contains the constant function 1. Then,

F̄ = {f ∈ C(X, R) : ρ(F) ⊆ ρ(f)}.

Note that if F separates points, we get back the classical result, as every function f satisfies {(x, x) : x ∈ X} ⊆ ρ(f).

Example 18. The invariant version of PointNet is able to learn functions of the form Σ_i f(x_i) for f ∈ C(R^p, R). We can apply Corollary 17 to this setting. Consider the case where X is a compact subset of (R^p)^n and

F = {x ↦ g(Σ_{i=1}^n f(x_i)) : f ∈ C(R^p, R^h), g ∈ C(R^h, R)}.

F is a subalgebra of C(X, R) (indeed of C_I(X, R)) which contains the constant function 1. Then, it is easy to see that ρ(F) = {(x, σ • x) : x ∈ X, σ ∈ S_n}, where σ • x is defined by (σ • x)_{σ(i)} = x_i for all i (see Lem. 31 for a formal statement). Now note that, for a function f ∈ C(X, R), the condition {(x, σ • x) : x ∈ X, σ ∈ S_n} ⊆ ρ(f) is equivalent to f ∈ C_I(X, R). Corollary 17 thus implies that F̄ = C_I(X, R), which means that PointNet is universal for approximating invariant functions. This was already proved in Qi et al. (2017).

We now provide a proof of Corollary 17 for completeness.

Proof. The first inclusion, F̄ ⊆ {f ∈ C(X, R) : ρ(F) ⊆ ρ(f)}, follows from the same argument as the one given below Theorem 4, so we focus on the other one. For every x ∈ X, let x_F denote its ρ(F)-class.
The quotient set and the canonical surjection are X_F = X/ρ(F) = {x_F : x ∈ X} and π_F : X → X_F, π_F(x) = x_F. A function g : X → R factorizes as g = ĝ ∘ π_F for some ĝ : X_F → R if and only if ρ(F) ⊆ ρ(g). In this case ĝ is unique, since π_F is a surjection. In particular, every f ∈ F factorizes uniquely as f = f̂ ∘ π_F with f̂ : X_F → R, and F̂ = {f̂ : f ∈ F} clearly separates points on X_F. We refer to Munkres (2000, §22) for the properties of the quotient topology. In particular, by the properties of the quotient topology on X_F, F̂ is a subalgebra of C(X_F, R) and X_F is compact. Hence, we can apply Theorem 16 to F̂: F̂ is dense in C(X_F, R). Now take f ∈ C(X, R) with ρ(F) ⊆ ρ(f); we show that f ∈ F̄. Again, ρ(F) ⊆ ρ(f) implies that f = f̂ ∘ π_F. Let ε > 0. By density of F̂, there is some ĥ ∈ F̂ such that sup_{x_F} |ĥ(x_F) − f̂(x_F)| ≤ ε. But, by construction of F̂, there exists h ∈ F such that ĥ ∘ π_F = h. Thus,

sup_{x∈X} |h(x) − f(x)| = sup_{x∈X} |ĥ(π_F(x)) − f̂(π_F(x))| = sup_{x_F∈X_F} |ĥ(x_F) − f̂(x_F)| ≤ ε.

As this holds for any ε > 0, we have proven that f ∈ F̄.
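The invariance at the heart of Example 18 can be checked numerically: the invariant PointNet form g(Σ_i f(x_i)) is unchanged under any reordering of the points. This is a toy sketch of ours, with f and g chosen arbitrarily (h = 1):

```python
def pointnet_invariant(points, f, g):
    """Invariant PointNet map x -> g(sum_i f(x_i)): summing the per-point features
    makes the output independent of the ordering of the points."""
    return g(sum(f(p) for p in points))

f = lambda t: t * t + 1.0      # arbitrary continuous feature map R -> R^h with h = 1
g = lambda s: 2.0 * s - 3.0    # arbitrary continuous readout R^h -> R
x = [0.5, -1.0, 2.0]
print(pointnet_invariant(x, f, g) == pointnet_invariant([2.0, 0.5, -1.0], f, g))  # -> True
```

Any permutation of the input list yields the same value, which is exactly the statement ρ(F) = {(x, σ • x) : x ∈ X, σ ∈ S_n} specialized to one choice of f and g.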

D.4 THE EQUIVARIANT APPROXIMATION THEOREM

We first need to extend Corollary 17 to vector-valued functions. For this, we need a vector-valued version of the Stone-Weierstrass theorem, and, as shown by the example below, additional assumptions have to be made.

Example 19. We now consider the equivariant version of PointNet, corresponding to the particular case where X is a compact subset of (R^p)^n, Y = R^n and F = {x ↦ (f(x_1), . . . , f(x_n)) : f ∈ C(R, R)}. Then clearly F is a subalgebra of C_E(X, Y) containing the constant function 1, and ρ(F) = {(x, x) : x ∈ X}. Hence, if Corollary 17 were true with vector-valued functions instead of real-valued functions, we would conclude that F is dense in C(X, Y). But this can clearly not be true, as F ⊆ C_E(X, Y), which is not dense in C(X, Y).

We now present an extension of Corollary 17, also due to Timofte (2005):

Proposition 20. Let X be a compact space, Y = R^p for some p ≥ 1 and F ⊆ C(X, Y). Assume there exists a nonempty subset S ⊆ C(X, R) such that S • F ⊆ F and,

ρ(S) ⊆ ρ(F).   (17)

Then we have,

F̄ = {f ∈ C(X, Y) : ρ(F) ⊆ ρ(f), ∀x ∈ X, f(x) ∈ F̄(x)},   (18)

where F(x) = {f(x) : f ∈ F} and F̄(x) denotes its closure in Y. Moreover, in (18), we can replace ρ(F) by ρ(S).

Note that in the particular case Y = R, if F is a subalgebra of C(X, R), then we can take S = F in (17), and if the constant function 1 is in F, then F̄(x) = R, so that we recover Corollary 17. Now consider the case where F is a subalgebra of C(X, R^p). We need to find a set S ⊆ C(X, R) satisfying (17), i.e. with a better separating power than F, but containing real-valued functions such that sf ∈ F for all s ∈ S and f ∈ F. In the sequel, we will consider the set F_scal = {f ∈ C(X, R) : f1 ∈ F}. We clearly have F_scal • F ⊆ F since F is a subalgebra. Hence, in this setting, Prop. 20 can be rewritten as follows:

Corollary 21. Let X be a compact space, Y = R^p for some p, G a finite group acting (continuously) on X, and F ⊆ C_I(X, Y) a (non-empty) set of invariant functions associated to G.
Consider the following assumptions:

1. F is a subalgebra of C(X, Y) and the constant function 1 is in F.
2. The set of functions F_scal = {f ∈ C(X, R) : f1 ∈ F} satisfies ρ(F_scal) ⊆ ρ(F).
3. For any x ∈ X, there exists f ∈ F such that f(x) has pairwise distinct coordinates, i.e., for any indices i, j ∈ {1, . . . , p} with i ≠ j, f(x)_i ≠ f(x)_j.

Then the closure of F (for the topology of uniform convergence) is,

F̄ = {f ∈ C_I(X, Y) : ρ(F) ⊆ ρ(f)}.

Note that Assumptions 1 and 2 ensure that (17) is valid, while Assumption 3 ensures that F̄(x) = R^p. Unfortunately, in the equivariant case, the condition ρ(F_scal) ⊆ ρ(F) is too strong, and we now explain how we relax it. For the sake of simplicity, we consider here the particular setting adapted to graphs: let n ≥ 1 be a fixed number (corresponding to the number of nodes), X a compact set of graphs in R^{n^2} and Y = F^n with F = R^p for some p ≥ 1. We define the action of the symmetric group S_n on X by (σ • x)_{σ(i),σ(j)} = x_{i,j} and on Y by (σ • y)_{σ(i)} = y_i ∈ R^p. Hence the set of continuous equivariant functions C_E(X, Y) agrees with Definition 1.

Now consider the case where F ⊆ C_E(X, Y) is a subalgebra of equivariant functions. Then f ∈ F_scal needs to be invariant in order for f1 to be equivariant and hence in F. As a result, F_scal will not separate points of X in the same orbit, i.e. x and σ • x. But these points will typically be separated by F, since for any f ∈ F, we have f(σ • x) = σ • f(x), which is not equal to f(x) unless f is invariant. We see that we need somehow to require a weaker separating power for F. More formally, two isomorphic graphs will have permuted outputs through an equivariant function, but should not be considered as separated. Let Orb(x) = {σ • x : σ ∈ S_n} and Orb(y) = {σ • y : σ ∈ S_n}. For any equivariant function f ∈ C_E(X, Y) and any z ∈ Orb(x), we have f(z) ∈ Orb(f(x)).
Then let π : Y → Y/S_n be the canonical projection π(y) = Orb(y). We define,

(x, x') ∈ ρ(π ∘ F) ⟺ ∀f ∈ F, Orb(f(x)) = Orb(f(x')) ⟺ ∀f ∈ F, ∃σ ∈ S_n, f(σ • x) = f(x').

In particular, we see that if x' ∈ Orb(x) then (x, x') ∈ ρ(π ∘ F) for any F ⊆ C_E(X, Y). Moreover, two graphs x and x' are ρ(π ∘ F)-distinct if there exists a function f ∈ F such that ∀σ, f(σ • x) ≠ f(x'), i.e. the function f discriminates Orb(x) from Orb(x') in the sense that, for any z ∈ Orb(x) and z' ∈ Orb(x'), we have f(z) ≠ f(z'). To obtain an equivalent of Proposition 20 with C_E(X, Y) replacing C(X, Y), we are able to relax assumption (17) to ρ(F_scal) ⊆ ρ(π ∘ F). Our main general result in this direction is the following theorem (proved in Section D.7), which might be of independent interest:

Theorem 22. Let X be a compact space, Y = R^p for some p, G a finite group acting (continuously) on X and Y, and F ⊆ C_E(X, Y) a (non-empty) set of equivariant functions. Denote by π : Y → Y/G the canonical projection on the quotient space Y/G. Consider the following assumptions:

1. F is a subalgebra of C(X, Y) and the constant function 1 is in F.
2. The set of functions F_scal = {f ∈ C(X, R) : f1 ∈ F} satisfies ρ(F_scal) ⊆ ρ(π ∘ F).

Then the closure of F (for the topology of uniform convergence) is,

F̄ = {f ∈ C_E(X, Y) : ρ(F) ⊆ ρ(f), ∀x ∈ X, f(x) ∈ F̄(x)},

where F(x) = {f(x) : f ∈ F}. Moreover, if I(x) = {(i, j) ∈ [p]^2 : ∀y ∈ F(x), y_i = y_j}, then we have,

F̄(x) = {y ∈ R^p : ∀(i, j) ∈ I(x), y_i = y_j}.

Example 23. We now demonstrate how Theorem 22 can be used to recover the universality results of Segol & Lipman (2020). In that paper, the authors study equivariant neural network architectures working with unordered sets, corresponding in our case to X = Y = R^n and the group being the symmetric group S_n.
They show that the PointNet architecture cannot approximate every (continuous) equivariant function, and that adding a single so-called transmission layer is enough to make this architecture universal. Indeed, PointNet can only learn maps of the form x ∈ R^n ↦ (f(x_1), . . . , f(x_n)), which are not universal in the class of equivariant functions, as shown by Segol & Lipman (2020, Lem. 3). Their transmission layer is a map of the form x ∈ R^n ↦ (1^T x)1. Therefore, in PointNetST, adding such a layer precisely yields the larger class of functions

F = {x ↦ (f(x_1, Σ_i g(x_i)), . . . , f(x_n, Σ_i g(x_i))) : f ∈ C(R × R^h, R), g ∈ C(R, R^h), h ≥ 1}.

F is still an algebra and, as shown in Example 19, we have ρ(F) = {(x, x) : x ∈ X}. Moreover, F_scal = C_I(X, R) by Lem. 33; in particular, ρ(F_scal) = {(x, σ • x) : x ∈ X, σ ∈ S_n}, so that we obviously have ρ(F_scal) ⊆ ρ(π ∘ F). In summary, Theorem 22 implies the universality of PointNetST in C_E(X, Y).

D.5 A PRELIMINARY VERSION OF THE EQUIVARIANT APPROXIMATION THEOREM

We start by proving a version of Theorem 22 with a slightly weaker condition:

Proposition 24. Let X be a compact space, Y = R^p for some p, G a finite group acting (continuously) on X and Y, and F ⊆ C_E(X, Y) a (non-empty) set of equivariant functions. Consider the following assumptions:

1. F is a subalgebra of C_E(X, Y).
2. The set of real-valued functions F_scal = {f ∈ C(X, R) : f1 ∈ F} satisfies,

ρ(F_scal) ⊆ {(x, x') ∈ X × X : ∃g ∈ G, (g • x, x') ∈ ρ(F)}.

Then the closure of F (for the topology of uniform convergence) is,

F̄ = {f ∈ C_E(X, Y) : ρ(F) ⊆ ρ(f), ∀x ∈ X, f(x) ∈ F̄(x)},   (19)

where F(x) = {f(x) : f ∈ F}.

The proof of this theorem relies on two main ingredients. First, following the elegant idea of Maehara & Hoang (2019), we augment the input space to transform the vector-valued equivariant functions into scalar maps. Second, we apply the fine-grained approximation result Cor. 17.

Proof. As uniform convergence implies point-wise convergence, the first inclusion is immediate,

F̄ ⊆ {f ∈ C_E(X, Y) : ρ(F) ⊆ ρ(f), ∀x ∈ X, f(x) ∈ F̄(x)}.

The rest of the proof is devoted to the other direction. For convenience, denote by Φ the family of linear forms associated to the canonical basis of R^p, i.e.,

Φ = {y ↦ y_i : 1 ≤ i ≤ p} ⊆ C(Y, R).

Define our augmented input space as X̃ = X × Φ. As Φ is finite and X is compact, X̃ is still a compact space. We now transform F, a class of equivariant functions from X to Y, into F̃, a class of maps from X̃ to R. Define,

F̃ = {(x, ϕ) ↦ ϕ(f(x)) : f ∈ F}.

We check that F̃ is indeed a subset of C(X̃, R). Indeed, as Φ is finite, it is equipped with the discrete topology. Hence, each singleton {ϕ} for ϕ ∈ Φ is open in Φ and it suffices to check the continuity in the first variable with ϕ fixed. But, if f ∈ F, x ↦ ϕ(f(x)) is continuous as a composition of continuous maps. We can now apply Cor. 17 to F̃ ⊆ C(X̃, R).
Therefore, the closure of F̃ in C(X̃, R) is F̃̄ = {v ∈ C(X̃, R) : ρ(F̃) ⊆ ρ(v), ∀(x, φ) ∈ X̃, v(x, φ) ∈ F̃̄(x, φ)}. We now show the equality of (19). Take h in the right-hand side of (19), i.e. h ∈ C_E(X, Y) such that ρ(F) ⊆ ρ(h) and h(x) ∈ F̄(x) for all x ∈ X. We show that h̃, defined by h̃ : (x, φ) → φ(h(x)), belongs to F̃̄ using the result above.

• As h is continuous, by the same argument as above, (x, φ) → φ(h(x)) is continuous on X̃.
• We check that ρ(F̃) ⊆ ρ(h̃). Take (x, φ), (y, ψ) ∈ X̃ such that, for all f ∈ F, φ(f(x)) = ψ(f(y)),   (20) and we aim at showing that φ(h(x)) = ψ(h(y)). To gain more information from (20), we apply it to functions of the form f1 with f ∈ F_scal. By definition of Φ, this translates to f(x) = f(y) for any f ∈ F_scal. Therefore (x, y) ∈ ρ(F_scal), and so there exists g ∈ G such that (g • x, y) ∈ ρ(F). Plugging this into (20) and using the equivariance of f: ∀f ∈ F, φ(f(x)) = ψ(g • f(x)). As G acts continuously on Y, both φ and z → ψ(g • z) are continuous and, as a consequence of the equality above, coincide on F̄(x). But we assumed that h(x) ∈ F̄(x), and therefore the equality also holds for h, i.e. φ(h(x)) = ψ(g • h(x)). Finally, recall that, by assumption, ρ(F) ⊆ ρ(h). Therefore (g • x, y) ∈ ρ(F) implies that h(g • x) = h(y) and, combined with the result above, φ(h(x)) = ψ(h(y)).
• We verify that, for x ∈ X and φ ∈ Φ, φ(h(x)) belongs to F̃̄(x, φ). Indeed, recall that h(x) ∈ F̄(x). Therefore, as φ is continuous, φ(h(x)) is in the closure of φ(F(x)), which is included in F̃̄(x, φ).

This shows that h̃ : (x, φ) → φ(h(x)) is in F̃̄. Consequently, for any ε > 0, there exists f ∈ F such that ∀x ∈ X, ∀φ ∈ Φ, |φ(h(x)) − φ(f(x))| ≤ ε. If Y = R^p is endowed with the infinity norm on coordinates, by definition of Φ, this means that ∀x ∈ X, ‖h(x) − f(x)‖ ≤ ε.

Remark 25. In the particular case of G ⊆ S_n being a group of permutations acting on R^p by, for g ∈ G, x ∈ R^p, ∀i ∈ {1, ..., p}, (g • x)_i = x_{g^{-1}(i)}, the functions of F̃ are indeed invariant, as shown by Maehara & Hoang (2019). For this, a left action on Φ is defined by, for g ∈ G, φ ∈ Φ, ∀x ∈ R^p, (g • φ)(x) = φ(g^{-1} • x). In other words, the action of g on the linear form associated with the i-th coordinate yields the linear form associated with the g(i)-th coordinate. One can now check that the functions of F̃ are invariant.

D.6 CHARACTERIZING THE SUBALGEBRAS OF R p

Published as a conference paper at ICLR 2021

Before moving to our general result, we need to study the structure of the subalgebras of R^p. For this, we will use the following simple lemma, in which R[X_1, ..., X_p] denotes the set of multivariate polynomials with p indeterminates (and real coefficients).

Lemma 26. Let C ⊆ R^p be a finite subset of R^p. There exists P ∈ R[X_1, ..., X_p] such that P|_C, the restriction of P to C, is an injective map.

Proof. Let x^1, ..., x^m ∈ R^p be distinct vectors such that {x^1, ..., x^m} = C. Similarly to Lagrange polynomials, define

P(X_1, ..., X_p) = Σ_{i=1}^m i Π_{j≠i} [ Σ_{l=1}^p (X_l − x^j_l)² / ‖x^i − x^j‖²₂ ],

which is a well-defined multivariate polynomial. Note that, seeing X = (X_1, ..., X_p) as a vector in R^p, it can also be written as

P(X_1, ..., X_p) = Σ_{i=1}^m i Π_{j≠i} ‖X − x^j‖²₂ / ‖x^i − x^j‖²₂.   (22)

By construction, P(x^i) = i, and therefore P is an injective map on C.

Lemma 27. For a subalgebra A of R^p, define

J = {j ∈ {1, ..., p} : ∀x ∈ A, x_j = 0},
I = {(i, j) ∈ (J^c)² : ∀x ∈ A, x_i = x_j}.

Then we have

A = {x ∈ R^p : ∀(i, j) ∈ I, x_i = x_j and ∀j ∈ J, x_j = 0}.

Proof. Before proving the general case, we focus on the situation where A will turn out to be the whole R^p. Assume that the two following conditions hold:

∀i ≠ j, ∃x ∈ A, x_i ≠ x_j,   (23)
∀i, ∃x ∈ A, x_i ≠ 0.   (24)

Our goal is to show that, under these additional assumptions, A = R^p. We divide the proof into three parts: first we show that 1 ∈ A using (24), giving us that A is closed under polynomials; then that there is x ∈ A with pairwise distinct coordinates thanks to (23); and finally that this implies that A is the whole space. Note that if p = 1, (24) and the linear-space property of A immediately give the result.

• Here we prove that (24) implies that 1 ∈ A.
– First, we construct by induction x ∈ A such that x_i ≠ 0 for every index i. More precisely, our induction hypothesis at step j ∈ {1, ..., p} is: ∃x ∈ A, ∀1 ≤ i ≤ j, x_i ≠ 0. By (24), this holds for j = 1. Now, assume that it holds at j−1 for some p ≥ j ≥ 2, take x ∈ A such that x_i ≠ 0 for all 1 ≤ i ≤ j−1, and take y ∈ A such that y_j ≠ 0 by (24). By definition of x, the set {λ ∈ R : ∃1 ≤ i ≤ j−1, λx_i + y_i = 0} is finite. Thus, there exists λ ∈ R such that λx_i + y_i ≠ 0 for all 1 ≤ i ≤ j−1, and such that λx_j + y_j ≠ 0 as well (this holds for every λ when x_j = 0, since y_j ≠ 0, and excludes at most one more value of λ otherwise). As A is a subalgebra, λx + y ∈ A, which concludes the induction step.
– Let x ∈ A be the vector just constructed, i.e. such that x_i ≠ 0 for every index i. We prove that 1 ∈ A by constructing 1 from x. Indeed, using Lagrange interpolation, take P ∈ R[X] such that P(x_i) = 1/x_i for every i. (Note that it does not matter if some x_i are equal, as the corresponding values 1/x_i would also be the same.) Finally, as A is a subalgebra, xP(x), which is to be understood coordinate-wise, is in A, and so is 1 = xP(x).
• We show that the previous point and (23) imply that there exists a vector in A with pairwise distinct coordinates, i.e. that there exists x ∈ A such that, for any i ≠ j, x_i ≠ x_j. Using (23), for any i < j, there exists x^{ij} ∈ A such that x^{ij}_i ≠ x^{ij}_j. We wish to combine the family (x^{ij})_{i<j} into a single vector. For this we use Lem. 26. Seeing each collection (x^{ij}_k)_{i<j} as a vector of R^{p(p−1)/2}, we define C = {(x^{ij}_k)_{i<j} : 1 ≤ k ≤ p}, which is a finite subset (of cardinal at most p) of R^{p(p−1)/2}. By Lem. 26, there exists P ∈ R[X_1, ..., X_{p(p−1)/2}] such that P is an injective map on C. As A is a subalgebra and 1 ∈ A, the vector

P((x^{ij})_{i<j}) = (P((x^{ij}_1)_{i<j}), ..., P((x^{ij}_p)_{i<j}))

is in A too. We now check that this vector has pairwise distinct coordinates. Let l < k; then x^{lk}_l ≠ x^{lk}_k, and therefore (x^{ij}_l)_{i<j} ≠ (x^{ij}_k)_{i<j}. By injectivity of P on C, P((x^{ij}_l)_{i<j}) ≠ P((x^{ij}_k)_{i<j}). Thus P((x^{ij})_{i<j}) ∈ A has pairwise distinct coordinates, as required.
• Finally, we show that A = R^p.
This is a direct consequence of the point above and of Lagrange interpolation. Indeed, take any y ∈ R^p and denote by x ∈ A the vector just constructed with pairwise distinct coordinates. By Lagrange interpolation, there exists P ∈ R[X] such that P(x_i) = y_i for every i ∈ {1, ..., p}. As A is a subalgebra and 1 ∈ A, P(x) ∈ A and hence y = P(x) ∈ A.

Finally, we return to the general case. We introduce the sets of indices I and J which appear in the statement and use them to reduce the situation to the previous case. Define

J = {j ∈ {1, ..., p} : ∀x ∈ A, x_j = 0},
I = {(i, j) ∈ (J^c)² : ∀x ∈ A, x_i = x_j},

and denote by A' = {x ∈ R^p : ∀(i, j) ∈ I, x_i = x_j and ∀j ∈ J, x_j = 0}, which is also a subalgebra. By definition, it holds that A ⊆ A'. By construction, I is an equivalence relation on J^c; denote by J^c/I its equivalence classes. Let p' = |J^c/I| and choose representatives i_1, ..., i_{p'} of the equivalence classes. Consider the map φ : R^p → R^{p'}, x → (x_{i_1}, ..., x_{i_{p'}}). φ is an algebra homomorphism, so that φ(A) is a subalgebra of R^{p'}. But, by construction of I and J, φ(A) satisfies (23) and (24). Whence, by our result in this particular case, φ(A) = R^{p'}. However, A ⊆ A' implies that φ(A) ⊆ φ(A') ⊆ R^{p'}. Therefore, R^{p'} = φ(A) = φ(A'). But, by construction of φ, φ is actually an injective map on A'. Therefore, we deduce from φ(A) = φ(A') and A ⊆ A' that A = A', concluding the proof.
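As a numeric sanity check on the construction in the proof of Lem. 26, the following sketch (a hypothetical helper, not from the paper) builds the Lagrange-style polynomial P with P(x^i) = i and verifies injectivity on a small finite set C:

```python
import numpy as np

def separating_polynomial(C):
    """Return the map X -> sum_i i * prod_{j != i} ||X - x^j||^2 / ||x^i - x^j||^2
    for the finite set C = [x^1, ..., x^m]; by construction P(x^i) = i."""
    C = [np.asarray(x, dtype=float) for x in C]
    def P(X):
        X = np.asarray(X, dtype=float)
        total = 0.0
        for i, xi in enumerate(C, start=1):
            prod = 1.0
            for j, xj in enumerate(C, start=1):
                if j != i:
                    # factor vanishes at x^j and equals 1 at x^i
                    prod *= np.sum((X - xj) ** 2) / np.sum((xi - xj) ** 2)
            total += i * prod
        return total
    return P
```

Evaluating P on the points of C returns 1, 2, ..., m, so P is injective on C exactly as the lemma states.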

D.7 PROOF OF THE MAIN EQUIVARIANT APPROXIMATION THEOREM

We can now fully exploit the subalgebra structure of F thanks to the results above, and in particular relax the second assumption of Prop. 24 to obtain our main theorem, Thm. 22. We first prove the following lemma.

Lemma 28. Under the assumptions of Thm. 22, for any subgroup H ⊆ G and for any x, y ∈ X,

∀f ∈ F, ∃g ∈ H, f(g • x) = f(y) ⟺ ∃g ∈ H, ∀f ∈ F, f(g • x) = f(y).

In particular, (x, y) ∈ ρ(π • F) ⟺ ∃g ∈ G, (g • x, y) ∈ ρ(F).

Proof. The reverse implication is immediate, so we focus on the direct one and prove its contraposition, i.e.,

∀g ∈ H, ∃f ∈ F, f(g • x) ≠ f(y) ⟹ ∃f ∈ F, ∀g ∈ H, f(g • x) ≠ f(y).

To prove this, we take advantage of F being a subalgebra and H being finite. Let H = {g_1, ..., g_h} and define A = {(f(g_1 • x), ..., f(g_h • x), f(y)) : f ∈ F}. As F is a subalgebra of C(X, R^p), A is a subalgebra of R^{p'} with p' = p(h + 1). By Lem. 27,

A = {z ∈ R^{p'} : ∀(i, j) ∈ I, z_i = z_j and ∀j ∈ J, z_j = 0},

where I and J can be chosen to be¹ J = {j ∈ {1, ..., p'} : ∀z ∈ A, z_j = 0} and I = {(i, j) ∈ {1, ..., p'}² : ∀z ∈ A, z_i = z_j}. As I is an equivalence relation, an element of A is uniquely defined by its coordinates on the equivalence classes of I. Therefore, one can choose z ∈ A such that z_i = z_j ⟺ (i, j) ∈ I. By definition of A, there exists f* ∈ F such that (f*(g_1 • x), ..., f*(g_h • x), f*(y)) = z. We now check that f* is indeed appropriate. Take l ∈ {1, ..., h}; we want to show that f*(g_l • x) ≠ f*(y). By assumption, there exists f ∈ F such that f(g_l • x) ≠ f(y), i.e. there exists i ∈ {1, ..., p} such that f(g_l • x)_i ≠ f(y)_i. Therefore ((l−1)p + i, hp + i) cannot be in I, so that z_{(l−1)p+i} ≠ z_{hp+i}, i.e. f*(g_l • x)_i ≠ f*(y)_i.

We now prove our main abstract theorem.

Theorem 22. Let X be a compact space, Y = R^p for some p, G be a finite group acting (continuously) on X and Y, and F ⊆ C_E(X, Y) a (non-empty) set of equivariant functions.
Denote by π : Y → Y/G the canonical projection on the quotient space Y/G. Consider the following assumptions:

1. F is a subalgebra of C(X, Y), and the constant function 1 is in F.
2. The set of functions F_scal ⊆ C(X, R) defined by F_scal = {f ∈ C(X, R) : f1 ∈ F} satisfies ρ(F_scal) ⊆ ρ(π • F).

Then the closure of F (for the topology of uniform convergence) is

F̄ = {f ∈ C_E(X, Y) : ρ(F) ⊆ ρ(f), ∀x ∈ X, f(x) ∈ F̄(x)},

where F(x) = {f(x) : f ∈ F}. Moreover, if I(x) = {(i, j) ∈ [p]² : ∀y ∈ F̄(x), y_i = y_j}, then we have F̄(x) = {y ∈ R^p : ∀(i, j) ∈ I(x), y_i = y_j}.

Proof of Thm. 22. By Lem. 28, the second assumption of Prop. 24 is also satisfied. To get the conclusion of Thm. 22, note that F̄(x) is a linear subspace of a finite-dimensional vector space and is therefore closed; thus F̄(x) = F(x)'s closure is itself a subalgebra. Applying Lem. 27 to F̄(x), and noting that necessarily J = ∅ since 1 ∈ F(x) by assumption, gives the result of Thm. 22.

D.8 PRACTICAL REDUCTIONS

Though the results we proved above were formulated using classical hypotheses, such as requiring F to be a subalgebra, we can give much more compact versions for our setting. We also reduce the assumption ρ(F_scal) ⊆ ρ(π • F) to a more practical one. We start with the invariant case.

Corollary 29. Let X be a compact space, Y = F = R^p be some finite-dimensional vector space, G be a finite group acting (continuously) on X, and F ⊆ C_I(X, Y) a (non-empty) set of invariant functions. Assume that, for any h ∈ C(F², F) and f, g ∈ F, x → h(f(x), g(x)) ∈ F. Then the closure of F is F̄ = {f ∈ C_I(X, Y) : ρ(F) ⊆ ρ(f)}.

5. If F is equivariant w.r.t. the action described in Cor. 30, so is E(F).

Proof (of Lem. 32). Define E(F) = {x → (m(f(x)_1), ..., m(f(x)_n)) : f ∈ F_0 ∩ C(X, R^h), m ∈ C(R^h, F), h ≥ 1}. As MLPs are continuous and by the universality of MLPs on a compact set (see Section C.1), F ⊆ E(F) ⊆ F̄. This already implies 1. and that ρ(F̄) ⊆ ρ(E(F)) ⊆ ρ(F). Using Lem. 15 yields 2. We now show 3. Take h ∈ C(F², F), elements of E(F) built from f ∈ F_0 ∩ C(X, R^{h_f}) and g ∈ F_0 ∩ C(X, R^{h_g}), and m ∈ C(R^{h_f}, F), l ∈ C(R^{h_g}, F). All we have to show is that x → (h(m(f(x)_1), l(g(x)_1)), ..., h(m(f(x)_n), l(g(x)_n))) ∈ E(F). But, as F_0 is stable by concatenation, x → (f(x), g(x)) ∈ R^{h_f+h_g} is still in F_0. Moreover, y ∈ R^{h_f+h_g} → h(m(y_1, ..., y_{h_f}), l(y_{h_f+1}, ..., y_{h_f+h_g})) is also in C(R^{h_f+h_g}, F), which shows that the map above is indeed in E(F). The last two points are immediate consequences of the definition of E(F). Thm. 5 and Thm. 6 are now obtained by combining Lem. 32 with Cor. 29 and Cor. 30.

E PROOFS FOR EXPRESSIVENESS OF GNNS

Note that in Theorem 6 the additional stability assumption is "almost" necessary to obtain the result. Indeed, if the result holds, i.e. F̄ = {f ∈ C_E(X, F^n) : ρ(F_0) ⊆ ρ(f)}, then one can show that, if f ∈ F̄, then

f̄ : x → (Σ_{i=1}^n f(x)_i, Σ_{i=1}^n f(x)_i, ..., Σ_{i=1}^n f(x)_i) ∈ F̄.

Indeed, f ∈ F̄, so that it has weaker discriminating power than F, i.e. ρ(F) = ρ(F_0) ⊆ ρ(f). But, by construction of f̄, ρ(f) ⊆ ρ(f̄), so that ρ(F_0) ⊆ ρ(f̄). As f̄ is also in C_E(X, F^n), f̄ ∈ F̄.

E.1 EXPRESSIVITY OF GNN LAYERS

Lemma 33. Fix F_0, F_1 (non-trivial) finite-dimensional vector spaces. Consider the action of G = S_n on F_0 × F_0^{n×k} defined by, ∀σ ∈ S_n, ∀x_0 ∈ F_0, ∀x ∈ F_0^{n×k}, σ • (x_0, x) = (x_0, x_{σ^{-1}(1)}, ..., x_{σ^{-1}(n)}). Let K ⊆ F_0 × F_0^{n×k} be a compact set. Then the set of functions from K to F_1 of the form

(x_0, x) → f_0(x_0, Σ_{j=1}^n Σ_{w=1}^k f_w(x_{j,w})),

where f_0 : F_0 × R^h → F_1 and f_w : F_0 → R^h, w = 1, ..., k, are multi-layer perceptrons and h ≥ 1, is dense in C_I(K, F_1).

Proof. Denote by F ⊆ C_I(K, F_1) the set of such functions. To prove that F̄ = C_I(K, F_1), we first apply Thm. 5. We get that F̄ = {f ∈ C_I(K, F_1) : ρ(F) ⊆ ρ(f)}. We now characterize ρ(F). Actually, it is equal to ρ(inv) = {((x_0, x), (y_0, y)) ∈ (F_0 × F_0^{n×k})² : ∃σ ∈ S_n, (x_0, x) = σ • (y_0, y)}. As the functions of F are invariant, ρ(inv) ⊆ ρ(F). We now show the reverse. Take (x_0, x), (y_0, y) ∈ F_0 × F_0^{n×k} such that there does not exist σ ∈ S_n with (x_0, x) = σ • (y_0, y). If x_0 ≠ y_0, there exists an MLP f_0 : F_0 → F_1 such that f_0(x_0) ≠ f_0(y_0), so that ((x_0, x), (y_0, y)) ∉ ρ(F). Otherwise, there does not exist σ ∈ S_n such that (x_{σ^{-1}(1)}, ..., x_{σ^{-1}(n)}) = (y_1, ..., y_n), by definition of the action of G = S_n. Now, apply Lem. 31 with F ← F_0^k, the universality of MLPs and the decomposition given there to get MLPs f_w : F_0 → R^h, w = 1, ..., k, such that Σ_{j=1}^n Σ_{w=1}^k f_w(x_{j,w}) ≠ Σ_{j=1}^n Σ_{w=1}^k f_w(y_{j,w}). Choosing an appropriate MLP f_0 : R^h → F_1 yields that ((x_0, x), (y_0, y)) ∉ ρ(F). Hence, we have shown that F̄ = {f ∈ C_I(K, F_1) : ρ(inv) ⊆ ρ(f)} = C_I(K, F_1).
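The layer shape of Lem. 33 can be made concrete. Below is a minimal sketch where `f0` and the `fs[w]` are arbitrary continuous stand-ins for the MLPs of the lemma; the symmetric double sum is what makes the map invariant under permutations of the n rows of x:

```python
import numpy as np

def invariant_layer(x0, x, f0, fs):
    """Compute f0(x0, sum_{j=1}^n sum_{w=1}^k fs[w](x[j, w])) for x of shape (n, k).
    The sum over rows j is symmetric, hence the result is invariant under
    the S_n action that permutes the rows of x."""
    n, k = x.shape
    s = sum(fs[w](x[j, w]) for j in range(n) for w in range(k))
    return f0(x0, s)
```

Permuting the rows of x leaves the output unchanged, matching the invariance claimed in the lemma.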

E.2 APPROXIMATION THEOREMS FOR GNNS

We now have all the tools to finally prove our main result.

Theorem 34. Let K_discr ⊆ G_n × F_0^n and K ⊆ F_0^{n²} be compact sets. For the invariant case (with a slight abuse of notation, each left-hand side below denotes the closure of the corresponding class), we have:

MGNN_I = {f ∈ C_I(K_discr, F) : ρ(2-WL_I) ⊆ ρ(f)}
2-LGNN_I = {f ∈ C_I(K, F) : ρ(2-WL_I) ⊆ ρ(f)}
k-LGNN_I = {f ∈ C_I(K, F) : ρ(k-LGNN_I) ⊆ ρ(f)} ⊃ {f ∈ C_I(K, F) : ρ(k-WL_I) ⊆ ρ(f)}
k-FGNN_I = {f ∈ C_I(K, F) : ρ(k-FWL_I) ⊆ ρ(f)}

For the equivariant case, we have:

MGNN_E = {f ∈ C_E(K_discr, F^n) : ρ(2-WL_E) ⊆ ρ(f)}
k-LGNN_E = {f ∈ C_E(K, F^n) : ρ(k-LGNN_E) ⊆ ρ(f)} ⊃ {f ∈ C_E(K, F^n) : ρ(k-WL_E) ⊆ ρ(f)}
k-FGNN_E = {f ∈ C_E(K, F^n) : ρ(k-FWL_E) ⊆ ρ(f)}

We decompose the proof with an additional lemma.

Lemma 35. Let K_discr ⊆ G_n × F_0^n and K ⊆ F_0^{n²} be compact sets. For the invariant case, and any k ≥ 2, we have:

MGNN_I = {f ∈ C_I(K_discr, F) : ρ(MGNN_I) ⊆ ρ(f)}
k-LGNN_I = {f ∈ C_I(K, F) : ρ(k-LGNN_I) ⊆ ρ(f)}
k-FGNN_I = {f ∈ C_I(K, F) : ρ(k-FGNN_I) ⊆ ρ(f)}

For the equivariant case, and any k ≥ 2, we have:

MGNN_E = {f ∈ C_E(K_discr, F^n) : ρ(MGNN_E) ⊆ ρ(f)}
k-LGNN_E = {f ∈ C_E(K, F^n) : ρ(k-LGNN_E) ⊆ ρ(f)}
k-FGNN_E = {f ∈ C_E(K, F^n) : ρ(k-FGNN_E) ⊆ ρ(f)}

Proof of Thm. 34. The theorem is now a direct consequence of Prop. 13 and Lem. 35.

We now move to the proof of Lem. 35.

Proof of Lem. 35. First, focus on the invariant case. Let F denote MGNN_I, k-LGNN_I or k-FGNN_I and X be either K_discr or K, so that X is compact and F ⊆ C_I(X, F). Applying Thm. 5 directly gives F̄ = {f ∈ C_I(X, F) : ρ(F) ⊆ ρ(f)}, which is the desired result. We now move to the equivariant case. First, let us replace MGNN_E by another class MGNN'_E, which is slightly simpler to analyze. Define

MGNN'_E = {m_E • ((1−λ)Id + λS_1) • F_T • ⋯ • F_2 • F_1 : F_t : F_t^n → F_{t+1}^n message passing layer, t = 1, ..., T, T ≥ 1, λ ∈ {0, 1}}.
It holds that ρ(MGNN'_E) = ρ(MGNN_E) and

MGNN_E ⊆ MGNN'_E ⊆ {f ∈ C_E(K_discr, F^n) : ρ(MGNN'_E) ⊆ ρ(f)} = {f ∈ C_E(K_discr, F^n) : ρ(MGNN_E) ⊆ ρ(f)}.

Therefore, if we show that the closure of MGNN'_E is {f ∈ C_E(K_discr, F^n) : ρ(MGNN'_E) ⊆ ρ(f)}, we will have the desired result. Let F denote MGNN'_E, k-LGNN_E or k-FGNN_E and X be either K_discr or K, so that X is compact and F ⊆ C_E(X, F^n). We now wish to apply Thm. 6, but we need to verify the stability assumption first. Thus, we show that, for any f ∈ F,

x → (Σ_{i=1}^n f(x)_i, Σ_{i=1}^n f(x)_i, ..., Σ_{i=1}^n f(x)_i) ∈ F̄.   (26)

• If F = MGNN'_E: take f ∈ MGNN'_E. f is of the form m_E • ((1−λ)Id + λS_1) • F_T • ⋯ • F_2 • F_1, where the F_t : F_t^n → F_{t+1}^n are message passing layers, F_{T+1} = F and λ ∈ {0, 1}. We need to show that there is a network of MGNN'_E which implements the map of (26). If λ = 1, f is exactly the function of (26). Otherwise, if λ = 0, we build another network of MGNN'_E implementing this function. Denote by F'_{T+1} : F^n → F^n the (simple) message passing layer defined by F'_{T+1}(h)_i = m_E(h_i) for i ∈ [n] and any h ∈ F^n. Then the network ((1−λ')Id + λ'S_1) • F'_{T+1} • F_T • ⋯ • F_2 • F_1 of MGNN'_E, with λ' = 1 and the identity as final MLP, exactly implements (26).
• If F = k-LGNN_E: a function f of this class is of the form m_E • S_1^k • F_T • ⋯ • F_2 • F_1 • I_k, where the F_t : F_t^{n^k} → F_{t+1}^{n^k} are linear graph layers and F_{T+1} = F. Our goal is to show that there is a GNN of k-LGNN_E which implements

x → (Σ_{i=1}^n f(x)_i, ..., Σ_{i=1}^n f(x)_i);   (27)

the required linear graph layers F'_{T+1} and F'_{T+2} are constructed at the end of this appendix.
• For F = k-FGNN_E, with the FGL H_1, ..., H_k and F'_{T+1} constructed at the end of this appendix, the k-FGNN H_1 • H_2 • ⋯ • H_k • F'_{T+1} • H_2 • ⋯ • H_k • F_T • ⋯ • F_2 • F_1 • I_k exactly implements (28).
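The matrix-multiplication structure that distinguishes FGNN layers can be illustrated concretely. Below is a minimal sketch of an order-2 folklore graph layer, assuming the usual form H'_{ij} = f_0(H_{ij}, Σ_k f_1(H_{ik}) ⊙ f_2(H_{kj})); the functions `f0`, `f1`, `f2` stand in for the learned MLPs and are hypothetical choices for illustration.

```python
import numpy as np

def folklore_layer_2(H, f0, f1, f2):
    """Order-2 folklore graph layer sketch on a tensor H of shape (n, n, d):
    H'_{ij} = f0(H_{ij}, sum_k f1(H_{ik}) * f2(H_{kj})).
    The inner sum over the node index k is a feature-wise matrix product,
    the operation that sets FGL apart from purely linear graph layers."""
    A, B = f1(H), f2(H)
    M = np.einsum('ikd,kjd->ijd', A, B)  # matrix multiplication per feature channel
    return f0(H, M)
```

Relabeling the nodes (permuting both tensor axes simultaneously) permutes the output in the same way, i.e. the layer is equivariant.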

F EXTENSION TO GRAPHS OF VARYING SIZES

F.1 EXTENSION TO DISCONNECTED INPUT SPACES

Here we show that our results can be extended to graphs of varying sizes, similarly to Keriven & Peyré (2019). There are two ways to do it. The first would be to directly adapt all the proofs, but this would make them more cumbersome. Instead, we extend them with a simple argument presented below. As a side benefit, this general lemma makes it possible to extend almost all the approximation results for graph neural networks from the literature.

The abstract setting is the following. Given a compact input space X, assume that there is some finite set A and a family X_α, α ∈ A, of pairwise disjoint compact sets such that X = ∪_{α∈A} X_α. Crucially, we will assume that the X_α are in distinct connected components: intuitively, they do not "touch" each other. Similarly, assume that the output space Y can be written as Y = ∪_{α∈A} Y_α, with Y_α, α ∈ A, a family of (pairwise disjoint) real vector spaces. Before moving to the results, we need a last definition. Informally, we require that the functions f we consider do not change the number of nodes of their inputs. Formally, we say that f : X → Y is adapted if f(X_α) ⊆ Y_α for each α ∈ A, and we denote by C_ad(X, Y) the set of continuous and adapted functions from X to Y. We can now state our lemma to adapt our results to graphs of varying sizes.

Lemma 36. Let X be a compact space, Y be a topological space and A a finite set. Assume that there exists a family X_α, α ∈ A, of pairwise disjoint compact sets such that X = ∪_{α∈A} X_α and the X_α's are in distinct connected components. Also assume that there is a family Y_α, α ∈ A, of (pairwise disjoint) real vector spaces such that Y = ∪_{α∈A} Y_α. Consider F ⊆ C_ad(X, Y) a (non-empty) set of adapted functions and, for each α ∈ A, define F|_{X_α} = {f|_{X_α} : f ∈ F} ⊆ C(X_α, Y_α), the restriction of the functions of F to X_α. Assume that the following holds:

1. Each F|_{X_α} is a subalgebra of C(X_α, Y_α) which contains the constant function 1_{Y_α}.
2. There is a set of functions F_scal ⊆ C(X, R) such that F_scal · F ⊆ F and F_scal discriminates between the X_α's, i.e. ρ(F_scal) ⊆ ∪_{α∈A} X_α².

Then the closure of F, for the distance given in the displayed statement below, consists of the adapted functions whose restriction to each X_α lies in the closure of F|_{X_α}.

Proof. Define, as in the displayed statement below, the set S = {f ∈ C(X, R) : f · F ⊆ F}. By definition, F_scal ⊆ S. Moreover, as each F|_{X_α} is a subalgebra, S is also a subalgebra of C(X, R). As X is compact, we can apply Cor. 17 to S to get that, in C(X, R), S̄ = {f ∈ C(X, R) : ρ(S) ⊆ ρ(f)}.

In the first part of the proof, we check that, for each α, the function f_α : X → [0, 1] defined by f_α|_{X_α} = 1 and f_α|_{X_β} = 0 for all β ≠ α is continuous and satisfies ρ(S) ⊆ ρ(f_α); this will mean that such functions belong to S̄. As the X_β's are in different connected components, it is enough to check continuity on each X_β, and each f_α|_{X_β} is constant, hence continuous. The second fact comes from the second assumption. Indeed, take (x, y) ∈ ρ(S). As F_scal ⊆ S, in particular (x, y) ∈ ρ(F_scal). But, by assumption, ρ(F_scal) ⊆ ∪_{β∈A} X_β². Thus x and y necessarily belong to the same X_β, so that f_α(x) = f_α(y). Therefore, we conclude that f_α ∈ S̄.

We now prove the announced equality. By the definition of the distance, the first inclusion, F̄ ⊆ {f ∈ C_ad(X, Y) : ∀α ∈ A, f|_{X_α} ∈ F̄_α} (where F̄_α denotes the closure of F|_{X_α}), is immediate. We focus on the other direction. Take h ∈ C_ad(X, Y) such that h|_{X_α} ∈ F̄_α for every α ∈ A, and take ε > 0. By definition of h, there exists, for each α ∈ A, g_α ∈ F such that sup_{X_α} ‖h − g_α‖_{Y_α} ≤ ε. As the X_α's are compact and the g_α's are continuous, max_{α∈A} sup_{X_α} ‖g_α‖ < +∞; denote by M > 0 a bound on this quantity. We have shown above that each f_α is in S̄, so there exists, for each α ∈ A, l_α ∈ S such that sup_X |f_α − l_α| ≤ ε/M. By definition of S, and as each F|_{X_β} is a subalgebra, Σ_{α∈A} l_α g_α ∈ F and, for each β ∈ A,

sup_{X_β} ‖h − Σ_{α∈A} l_α g_α‖ ≤ sup_{X_β} ‖h − g_β‖ + sup_{X_β} ‖g_β − l_β g_β‖ + Σ_{α≠β} sup_{X_β} ‖l_α g_α‖ ≤ ε + (ε/M) M + (|A|−1)(ε/M) M = (|A|+1) ε,

which concludes the proof.
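The gluing step at the heart of this proof, combining per-component approximants g_α with near-indicator functions l_α into Σ_α l_α g_α, can be sketched on a toy instance with two separated components. All functions below are hypothetical stand-ins (here the indicators are exact, since the components are separated):

```python
def glue(indicators, approximants, x):
    """Evaluate the combined map x -> sum_alpha l_alpha(x) * g_alpha(x):
    on each component, the matching indicator selects the right approximant
    and the others contribute (near) zero."""
    return sum(l(x) * g(x) for l, g in zip(indicators, approximants))
```

With approximate indicators (as in the proof, where l_α is only ε/M-close to f_α), the same combination incurs the (|A|+1)ε error bound derived above.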

F.2 APPROXIMATION THEOREM WITH VARYING GRAPH SIZE

We now state our theorem in the case of varying graph size, as in Keriven & Peyré (2019). With Lem. 36, it is indeed straightforward to extend any approximation result initially proven for a class of graphs of fixed size. However, as the complete proof would require new notations again, we only give a sketch of proof. Fix N ≥ 1 and consider the space of graphs (described by tensors) of size at most N, F_0^{≤N²} = ∪_{n=1}^N F_0^{n²}. Equip this space with the final topology or, equivalently, the graph edit distance. Then, the last thing to check in order to apply Lem. 36 is that the classes of GNN we consider indeed discriminate between graphs of different sizes, which is immediate. For the equivariant case, we have:

MGNN_E = {f ∈ C_E(K_discr, F^{≤N}) : ρ(2-WL_E) ⊆ ρ(f)}
k-LGNN_E = {f ∈ C_E(K, F^{≤N}) : ρ(k-LGNN_E) ⊆ ρ(f)} ⊃ {f ∈ C_E(K, F^{≤N}) : ρ(k-WL_E) ⊆ ρ(f)}
k-FGNN_E = {f ∈ C_E(K, F^{≤N}) : ρ(k-FWL_E) ⊆ ρ(f)}

Proof. This corollary is a direct consequence of Thm. 34 and Lem. 36, using Lem. 32 to satisfy the subalgebra assumption of Lem. 36.
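On the practical side, handling such unions of graph sizes in a single batch is exactly what the masked tensors mentioned in the introduction do. Below is a minimal sketch of the idea; the names and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def pad_and_mask(adjs, N):
    """Pack adjacency matrices of sizes n <= N into one (b, N, N) tensor
    together with a boolean node mask marking the real (non-padded) nodes."""
    batch = np.zeros((len(adjs), N, N))
    mask = np.zeros((len(adjs), N), dtype=bool)
    for i, A in enumerate(adjs):
        n = A.shape[0]
        batch[i, :n, :n] = A
        mask[i, :n] = True
    return batch, mask

def masked_readout(batch, mask):
    """Invariant readout over real entries only: padded rows and columns are
    zeroed out by the pairwise mask before summing."""
    pair_mask = mask[:, :, None] & mask[:, None, :]
    return (batch * pair_mask).sum(axis=(1, 2))
```

The mask guarantees that padded entries never contribute, so the batched computation agrees with running each graph separately.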



¹ We slightly change the definition of I compared to the statement of the lemma to account for J, which does not change the result.



Figure 2: Fraction of matched nodes for pairs of correlated graphs (with edge density 0.2) as a function of the noise, see Section A.1 for details.

Figure 3: Each line corresponds to a model trained at a given noise level and shows its accuracy across all noise levels.

comes from Xu et al. (2018, §A, §B), (9) and (10) from Maron et al. (2019a, §C, §D).

By definition of linear graph layers, the map F'_{T+1} : F^{n^k} → F^{n^k} defined by, for G ∈ F^{n^k}, ∀(i_1, ..., i_k) ∈ [n]^k, F'_{T+1}(G)_{i_1,...,i_k} = m_E(S_1^k(G)_{i_1}), is a linear graph layer as defined in Section 3. Now, consider the linear graph layer F'_{T+2} : F^{n^k} → F^{n^k} defined by, for G ∈ F^{n^k}, ∀(i_1, ..., i_k) ∈ [n]^k, F'_{T+2}(G)_{i_1,...,i_k} = Σ_{i=1}^n G_{i,i_2,...,i_k}. Then, the k-LGNN F'_{T+2} • F'_{T+1} • F_T • ⋯ • F_2 • F_1 • I_k exactly implements (27).

F̄ = {f ∈ C_ad(X, Y) : ∀α ∈ A, f|_{X_α} ∈ F̄|_{X_α}}, for the distance on C_ad(X, Y) defined by, for f, g ∈ C_ad(X, Y), d(f, g) = max_{α∈A} sup_{X_α} ‖f − g‖_{Y_α}.

Proof. Define the set of functions S ⊆ C(X, R) by S = {f ∈ C(X, R) : f · F ⊆ F}.

Likewise, for message passing GNNs, consider G^{≤N} × F_0^{≤N} = ∪_{n=1}^N G_n × F_0^n, with a similar topology or the graph edit distance.

Corollary 37. Let K_discr ⊆ G^{≤N} × F_0^{≤N} and K ⊆ F_0^{≤N²} be compact sets. For the invariant case, we have:

MGNN_I = {f ∈ C_I(K_discr, F) : ρ(2-WL_I) ⊆ ρ(f)}
2-LGNN_I = {f ∈ C_I(K, F) : ρ(2-WL_I) ⊆ ρ(f)}
k-LGNN_I = {f ∈ C_I(K, F) : ρ(k-LGNN_I) ⊆ ρ(f)} ⊃ {f ∈ C_I(K, F) : ρ(k-WL_I) ⊆ ρ(f)}
k-FGNN_I = {f ∈ C_I(K, F) : ρ(k-FWL_I) ⊆ ρ(f)}


• If F = k-FGNN_E: a function f of this class is of the form m_E • S_1^k • F_T • ⋯ • F_2 • F_1 • I_k, where the F_t : F_t^{n^k} → F_{t+1}^{n^k} are FGL (see Section 3) and F_{T+1} = F. We build a GNN of k-FGNN_E which implements (28). For w ∈ [k], define the FGL H_w : F^{n^k} → F^{n^k} by, ∀G ∈ F^{n^k}, ∀(i_1, ..., i_k) ∈ [n]^k, H_w(G)_{i_1,...,i_k} = Σ_{j=1}^n G_{i_1,...,i_{w−1},j,i_{w+1},...,i_k}. Then H_2 • ⋯ • H_k : F^{n^k} → F^{n^k} computes the sum of the elements of the input tensor over the last k−1 dimensions, like S_1^k : F^{n^k} → F^n, and H_1 • H_2 • ⋯ • H_k : F^{n^k} → F^{n^k} computes the full sum of the elements of the input tensor, like S^k : F^{n^k} → F. Finally, consider the FGL F'_{T+1} : F^{n^k} → F^{n^k} associated to m_E, i.e. such that, for any G ∈ F^{n^k}, ∀(i_1, ..., i_k) ∈ [n]^k, F'_{T+1}(G)_{i_1,...,i_k} = m_E(G_{i_1,...,i_k}).

ACKNOWLEDGMENTS

This work was supported in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR19-P3IA-0001 (PRAIRIE 3IA Institute). M.L. thanks Google for Google Cloud Platform research credits and NVIDIA for a NVIDIA GPU Grant.


Proof. We wish to apply Thm. 22, but for this we need G to act on Y. Define a (trivial) action of G on Y by ∀g ∈ G, ∀y ∈ Y, g • y = y. With this action on Y, C_E(X, Y) = C_I(X, Y). Moreover, Y/G = Y and π : Y → Y/G is the identity, so that ρ(π • F) = ρ(F). Our assumption clearly ensures that F is indeed a subalgebra and contains the constant function 1. All that is left to show in order to apply Thm. 22 is the condition on the set of functions F_scal ⊆ C(X, R) defined by F_scal = {f ∈ C(X, R) : f1 ∈ F}. Take (x, y) ∉ ρ(F); we show that (x, y) ∉ ρ(F_scal). Indeed, by definition there exist f ∈ F and i ∈ {1, ..., p} such that f(x)_i ≠ f(y)_i. Applying the assumption with h(u, v) = (u_i, ..., u_i) shows that x → f(x)_i 1 ∈ F, i.e. f(·)_i ∈ F_scal, while f(x)_i ≠ f(y)_i. Therefore, (x, y) ∉ ρ(F_scal), and we can apply Thm. 22. We get that F̄ = {f ∈ C_I(X, Y) : ρ(F) ⊆ ρ(f), ∀x ∈ X, f(x) ∈ F̄(x)}, with I(x) given by Thm. 22. To conclude the proof, we now show that I(x) = {(i, i) : i ∈ {1, ..., p}}, which will imply that F̄(x) = R^p. Indeed, the constant function z → (1, 2, ..., p) is in F by assumption. Therefore, I(x) is reduced to {(i, i) : i ∈ {1, ..., p}}.

In our previous version of our approximation result for node embeddings, we did not allow features in the output, as it would have made the statement and the proof somewhat convoluted. With this new assumption, this is much easier.

Corollary 30. Let X be a compact space, Y = F^n with F = R^p, and G = S_n the permutation group, acting (continuously) on X and acting on F^n by, for σ ∈ S_n, x ∈ F^n, ∀i ∈ {1, ..., n}, (σ • x)_i = x_{σ^{-1}(i)}. Consider the following assumptions: … Then the closure of F (for the topology of uniform convergence) is: …

For this we need a handy lemma, whose proof relies on a result about multi-symmetric polynomials from Maron et al. (2019a).

Lemma 31. Let F = R^p be some finite-dimensional space. Take x_1, ..., x_n, y_1, ..., y_n ∈ F such that, for any σ ∈ S_n, (x_1, ..., x_n) ≠ (y_{σ(1)}, ..., y_{σ(n)}). Then there exists h ∈ C(F, R) such that Σ_{i=1}^n h(x_i) ≠ Σ_{i=1}^n h(y_i). Moreover, h can be written as h = …

Proof of Cor. 30. … We now focus on the second assumption. As in the statement of Thm. 22, define F_scal ⊆ C(X, R) by F_scal = {f ∈ C(X, R) : f1 ∈ F}. We have to show that ρ(F_scal) ⊆ ρ(π • F).
For this, take (x, y) ∉ ρ(π • F). There exists f ∈ F such that, for any σ ∈ S_n, σ • f(x) ≠ f(y). We have to find l ∈ F_scal such that l(x) ≠ l(y); in other words, from a function in F^n which discriminates between x and y, we have to build a scalar function.

First, we exhibit a function which discriminates between x and y. Apply Lem. 31 to the vectors f(x) and f(y): there exists h_0 ∈ C(F, R) such that Σ_{i=1}^n h_0(f(x)_i) ≠ Σ_{i=1}^n h_0(f(y)_i). To fit the assumptions, we build h ∈ C(F, F) from h_0 by h : z ∈ F → (h_0(z), ..., h_0(z)) ∈ F. Take g ∈ C(F, F) such that g(z) = (z_1, ..., z_1) for any z ∈ F, and l ∈ C(X, R) defined by, for w ∈ X, l(w) = Σ_{i=1}^n h_0(f(w)_i). Then l(x) ≠ l(y). All we have to do is show that l ∈ F_scal, i.e. l1 ∈ F. This is where the two assumptions we made come into play. Indeed, the first one implies that h ∘ f ∈ F and the second gives … Finally, the first assumption ensures that … But this last function is none other than l1, which shows that l ∈ F_scal, as required.

We have successfully verified the hypotheses of Thm. 22. Therefore, the closure of F is …, with F̄(x) = {y ∈ F^n : ∀(i, i', j, j') ∈ I(x), y_{i,i'} = y_{j,j'}} and I(x) = {(i, i', j, j') ∈ ({1, ..., n} × {1, ..., p})² : ∀y ∈ F̄(x), y_{i,i'} = y_{j,j'}}.

To get the desired result, we need to get rid of the condition "f(x) ∈ F̄(x)" in the description of F̄. Fix x ∈ X. First, we show that F̄(x) = {y ∈ F^n : ∀(i, j) ∈ J(x), y_i = y_j}, with J(x) = {(i, j) ∈ {1, ..., n}² : ∀y ∈ F̄(x), y_i = y_j} (note that the equalities here are not in R anymore but in F). The direct inclusion "⊆" is immediate by construction of J(x), so we focus on the reverse direction. For this, we show that the 4-tuples (i, i', j, j') of I(x) necessarily satisfy i' = j' and (i, j) ∈ J(x). First note that, by the first assumption, the vector y⁰ ∈ F^n such that y⁰_i = (1, 2, ..., p) for i ∈ {1, ..., n} is in F̄(x): indeed, take as h the constant function always equal to (1, 2, ..., p).
Now, consider a 4-tuple (i, i', j, j') of I(x); we show that, actually, (i, j) ∈ J(x) and i' = j'. As y⁰ ∈ F̄(x), with y⁰_{i,i'} = i' and y⁰_{j,j'} = j', (i, i', j, j') ∈ I(x) implies that j' = i'. Consider k ∈ {1, ..., p}. We show that, for any y ∈ F̄(x), y_{i,k} = y_{j,k}. But such a y can be written as y = f(x) for some f ∈ F. Consider the function h ∈ C(F, F) associated to the transposition (i' k), defined by h(z)_{i'} = z_k, h(z)_k = z_{i'} and h(z)_l = z_l otherwise. By our first assumption, z → (h(f(z)_1), ..., h(f(z)_n)) ∈ F, so that (h(y_1), ..., h(y_n)) ∈ F̄(x). In particular, as (i, i', j, i') ∈ I(x), h(y_i)_{i'} = h(y_j)_{i'}, i.e. y_{i,k} = y_{j,k}. Therefore, (i, j) ∈ J(x).

Finally, we can conclude that F̄(x) ⊃ {y ∈ F^n : ∀(i, j) ∈ J(x), y_i = y_j}. Indeed, take y ∈ F^n such that ∀(i, j) ∈ J(x), y_i = y_j. We show that all the constraints of I(x) are satisfied. Take (i, i', j, j') ∈ I(x); we have shown that (i, j) ∈ J(x) and i' = j', so that y_i = y_j and in particular y_{i,i'} = y_{j,j'}. Therefore, this finishes the proof of F̄(x) ⊃ {y ∈ F^n : ∀(i, j) ∈ J(x), y_i = y_j}.

Thus,

F̄(x) = {y ∈ F^n : ∀(i, j) ∈ J(x), y_i = y_j}, with J(x) = {(i, j) ∈ {1, ..., n}² : ∀y ∈ F̄(x), y_i = y_j}. We have proven so far that … Our goal is to show that, for any …, where (i j) denotes the permutation which exchanges i and j. Moreover, as (i j) ∈ S_n, by equivariance, this means that f((i j) • x) = f(x) for every f ∈ F, and therefore that ((i j) • x, x) ∈ ρ(F). By assumption, we infer that ((i j) • x, x) ∈ ρ(h) too, i.e. that h((i j) • x) = h(x), and so that h(x)_i = h(x)_j by equivariance, which concludes our proof.

D.9 REDUCTIONS FOR GNNS

We now present a lemma which explains how to instantiate the two corollaries above in the case of GNNs, by replacing continuous functions with MLPs.

Lemma 32. Fix X some compact space, n ≥ 1 and F a finite-dimensional feature space. Let F_0 ⊆ ∪_{h=1}^∞ C(X, R^h) be stable by concatenation and consider

F = {x → (m(f(x)_1), ..., m(f(x)_n)) : f ∈ F_0 ∩ C(X, R^h), m : R^h → F MLP, h ≥ 1} ⊆ C(X, F^n).

3. For any h ∈ C(F², F) and f, g ∈ E(F), x → (h(f(x)_1, g(x)_1), ..., h(f(x)_n, g(x)_n)) ∈ E(F).

