DIRECTIONAL GRAPH NETWORKS

Abstract

In order to overcome the expressive limitations of graph neural networks (GNNs), we propose the first method that exploits vector flows over graphs to develop globally consistent directional and asymmetric aggregation functions. We show that our directional graph networks (DGNs) generalize convolutional neural networks (CNNs) when applied on a grid. Whereas recent theoretical works focus on understanding local neighbourhoods, local structures and local isomorphism with no global information flow, our novel theoretical framework allows directional convolutional kernels in any graph. First, by defining a vector field in the graph, we develop a method of applying directional derivatives and smoothing by projecting node-specific messages into the field. Then we propose the use of the Laplacian eigenvectors as such a vector field, and we show that the method generalizes CNNs on an n-dimensional grid and is provably more discriminative than standard GNNs with respect to the Weisfeiler-Lehman 1-WL test. Finally, we bring the power of CNN data augmentation to graphs by providing a means of doing reflection, rotation and distortion on the underlying directional field. We evaluate our method on different standard benchmarks and see a relative error reduction of 8% on the CIFAR10 graph dataset and of 11% to 32% on the molecular ZINC dataset. An important outcome of this work is that it enables the translation of any physical or biological problem with intrinsic directional axes into a graph-network formalism with an embedded directional field.

1. INTRODUCTION

One of the most important distinctions between convolutional neural networks (CNNs) and graph neural networks (GNNs) is that CNNs allow for any convolutional kernel, while most GNN methods are limited to symmetric kernels (also called isotropic kernels in the literature) (Kipf & Welling, 2016; Xu et al., 2018a; Gilmer et al., 2017). There are some implementations of asymmetric kernels using gated mechanisms (Bresson & Laurent, 2017; Veličković et al., 2017), motif attention (Peng et al., 2019), edge features (Gilmer et al., 2017), or the 3D structure of molecules for message passing (Klicpera et al., 2019). However, to the best of our knowledge, there are currently no methods that allow asymmetric graph kernels dependent on the full graph structure or on directional flows: they all depend on local structures or local features. This is in contrast to images, which exhibit canonical directions: the horizontal and vertical axes. The absence of an analogous concept in graphs makes it difficult to define directional message passing and to produce an analogue of the directional frequency filters (or Gabor filters) widely present in image processing (Olah et al., 2020). We propose a novel idea for GNNs: use vector fields in the graph to define directions for the propagation of information, with an overview of the paper presented in figure 1. The aggregation or message passing is projected onto these directions, so that the contribution of each neighbouring node n_v is weighted by its alignment with the vector field at the receiving node n_u. This enables our method to propagate information via directional derivatives or smoothing of the features. We also explore using the gradients of the low-frequency eigenvectors φ_k of the Laplacian of the graph, since they exhibit interesting properties (Bronstein et al., 2017; Chung et al., 1997).
In particular, they can be used to define optimal partitions of the nodes in a graph, to give a natural ordering (Levy, 2006), and to find the dominant directions of the graph diffusion process (Chung & Yau, 2000). Further, we show that they generalize the horizontal and vertical directional flows in a grid (see figure 2), allowing them to guide the aggregation and mimic the asymmetric and directional kernels present in computer vision. In fact, we demonstrate mathematically that our work generalizes CNNs by reproducing all convolutional kernels of radius R in an n-dimensional grid, while also bringing the powerful data augmentation capabilities of reflection, rotation or distortion of the directions.

Figure 1: Overview of the proposed method. Pre-computed steps, O(kE): the (non-directional) adjacency matrix A is given as input and the Laplacian matrix L is computed; both are of size N × N, where N is the number of nodes, and are often sparse with E edges. The eigenvectors φ of L are computed and sorted such that φ_1 has the lowest non-zero eigenvalue and φ_k the k-th lowest; this is the most expensive step, and there are methods to compute the k first eigenvectors with a complexity of O(kE). The gradient of φ is a function of the edges (a matrix) such that ∇φ_{ij} = φ_j − φ_i if the nodes i, j are connected, and ∇φ_{ij} = 0 otherwise. If the graph has a known direction, it can instead be encoded as a field F (an anti-symmetric matrix). Each row i of the field is normalized by its L1-norm, F̂_{i,:} = F_{i,:} / (||F_{i,:}||_{L1} + ε), to create the aggregation matrices: the directional smoothing matrix B_av = |F̂| and the directional derivative matrix (B_dx)_{i,:} = F̂_{i,:} − diag(Σ_j F̂_{:,j})_{i,:}. Graph neural network steps, O(kE + kN): the aggregation matrices B^{1,...,k}_{av,dx} are used to aggregate the feature matrix X^0 of the graph at the 0-th GNN layer (N rows, n_0 input-feature columns); for B_dx we take the absolute value due to the sign ambiguity of φ. Since B is similar to a weighted adjacency matrix (with possibly negative weights), the aggregation is simply the matrix product with the feature matrix. Other non-directional aggregators are used, such as the mean aggregation D^{-1} A X^0. The resulting matrix Y^0 is the column-concatenation of all the aggregations, with (2k + 1) n_0 columns and complexity O(kE), or O(E) if the aggregations are parallelized. Each aggregation is followed by a multi-layer perceptron (MLP) applied to the columns of Y^0 (the only step with learned parameters), producing X^1 with n_1 columns at a complexity of O(kN).
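The pre-computed steps described in the figure above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the authors' released code; the helper name `precompute_dgn` is our own, and the dense eigendecomposition stands in for the O(kE) sparse methods mentioned above.

```python
import numpy as np

def precompute_dgn(A, k=1, eps=1e-8):
    """Sketch of the pre-computation: Laplacian eigenvectors, their edge-wise
    gradients, and the aggregation matrices B_av and B_dx."""
    L = np.diag(A.sum(axis=1)) - A               # combinatorial Laplacian L = D - A
    _, eigvec = np.linalg.eigh(L)                # eigenvectors, ascending eigenvalues
    B_av, B_dx = [], []
    for i in range(1, k + 1):                    # skip the constant phi_0
        phi = eigvec[:, i]
        grad = (phi[None, :] - phi[:, None]) * A          # (grad phi)_ij = phi_j - phi_i on edges
        F = grad / (np.abs(grad).sum(axis=1, keepdims=True) + eps)  # row-wise L1 normalization
        B_av.append(np.abs(F))                   # directional smoothing matrix
        B_dx.append(F - np.diag(F.sum(axis=1)))  # directional derivative matrix
    return B_av, B_dx

# 4-node path graph: phi_1 flows from one end of the path to the other
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
B_av, B_dx = precompute_dgn(A, k=1)
```

Note that `np.linalg.eigh` here is O(N^3); for large sparse graphs, iterative solvers compute only the k first eigenvectors, which is what the O(kE) complexity above refers to.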
We further show that our directional graph network (DGN) model theoretically and empirically allows for efficient message passing across distant communities, which reduces the well-known problem of over-smoothing, and aligns well with the need of independent aggregation rules (Corso et al., 2020) . Alternative methods reduce the impact of over-smoothing by using skip connections (Luan et al., 2019) , global pooling (Alon & Yahav, 2020) , or randomly dropping edges during training time (Rong et al., 2020) , but without solving the underlying problem. In fact, we also prove that DGN is more discriminative than standard GNNs in regards to the Weisfeiler-Lehman 1-WL test, showing that the reduction of over-smoothing is accompanied by an increase of expressiveness. Our method distinguishes itself from other spectral GNNs since the literature usually uses the low frequencies to estimate local Fourier transforms in the graph (Levie et al., 2018; Xu et al., 2019) . Instead, we do not try to approximate the Fourier transform, but only to define a directional flow at each node and guide the aggregation.

2.1. INTUITIVE OVERVIEW

One of the biggest limitations of current GNN methods compared to CNNs is the inability to do message passing in a specific direction, such as the horizontal one in a grid graph. In fact, it is difficult to define directions or coordinates based solely on the shape of the graph. The lack of directions strongly limits the discriminative abilities of GNNs to understand local structures and simple feature transformations. Most GNNs are invariant to the permutation of the neighbours' features, so the nodes' received signal is not influenced by swapping the features of two neighbours. Therefore, several layers in a deep network will be employed to understand these simple changes instead of being used for higher-level features, thus over-squashing the message sent between two distant nodes (Alon & Yahav, 2020). In this work, one of the main contributions is the realisation that low-frequency eigenvectors of the Laplacian can overcome this limitation by providing a variety of intuitive directional flows. As a first example, taking a grid-shaped graph of size N × M with N/2 < M < N, we find that the eigenvector associated to the smallest non-zero eigenvalue increases in the direction of the width N, and the second one increases in the direction of the height M. This property generalizes to n-dimensional grids and motivated the use of gradients of eigenvectors as preferred directions for general graphs. We validated this intuition by looking at the flow of the gradient of the eigenvectors for a variety of graphs, as shown in figure 2. For example, in the Minnesota map, the first three non-constant eigenvectors produce logical directions, namely South/North, suburb/city, and West/East. Another important contribution also noted in figure 2 is the ability to define any kind of direction based on prior knowledge of the problem. Hence, instead of relying on eigenvectors to find directions in a map, we can simply use the cardinal directions or the rush-hour traffic flow.
We ignore φ 0 since it is constant and has no direction.
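This grid intuition is easy to verify numerically. The sketch below (our own construction, not the paper's code) builds a 6 × 4 grid graph and extracts the Laplacian eigenvector with smallest non-zero eigenvalue:

```python
import numpy as np

# Adjacency matrix of an N x M grid graph (N > M), non-diagonal edges only.
N, M = 6, 4
idx = lambda i, j: i * M + j
A = np.zeros((N * M, N * M))
for i in range(N):
    for j in range(M):
        if i + 1 < N:
            A[idx(i, j), idx(i + 1, j)] = A[idx(i + 1, j), idx(i, j)] = 1
        if j + 1 < M:
            A[idx(i, j), idx(i, j + 1)] = A[idx(i, j + 1), idx(i, j)] = 1

L = np.diag(A.sum(axis=1)) - A
_, vecs = np.linalg.eigh(L)                 # ascending eigenvalues
phi1 = vecs[:, 1].reshape(N, M)             # first non-constant eigenvector
```

Within numerical tolerance, `phi1` is constant along the short axis and monotone along the long one, so its gradient field points along the length-N direction, exactly like the horizontal axis of an image.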

2.2. VECTOR FIELDS IN A GRAPH

Based on a recent review by Bronstein et al. (2017), this section presents the ideas of differential geometry applied to graphs, with the goal of finding proper definitions of scalar products, gradients and directional derivatives. Let G = (V, E) be a graph with V the set of vertices and E ⊂ V × V the set of edges. The graph is undirected, meaning that (i, j) ∈ E iff (j, i) ∈ E. Define the vector spaces L²(V) and L²(E) as the sets of maps V → R and E → R, with x, y ∈ L²(V) and F, H ∈ L²(E), and scalar products

⟨x, y⟩_{L²(V)} := Σ_{i∈V} x_i y_i ,  ⟨F, H⟩_{L²(E)} := Σ_{(i,j)∈E} F_{(i,j)} H_{(i,j)}  (1)

Think of E as the "tangent space" to V and of L²(E) as the set of "vector fields" on the space V, with each row F_{i,:} representing a vector at the i-th node. Define the pointwise scalar product as the map L²(E) × L²(E) → L²(V) taking two vector fields and returning their inner product at each point of V; at the node i it is defined by equation 2.

⟨F, H⟩_i := Σ_{j:(i,j)∈E} F_{i,j} H_{i,j}  (2)

In equation 3, we define the gradient ∇ as a mapping L²(V) → L²(E) and the divergence div as a mapping L²(E) → L²(V), thus leading to an analogue of the directional derivative in equation 4.

(∇x)_{(i,j)} := x(j) − x(i) ,  (div F)_i := Σ_{j:(i,j)∈E} F_{(i,j)}  (3)

Definition 1. The directional derivative of the function x on the graph G in the direction of the vector field F̂, where each vector is of unit norm, is

D_{F̂} x(i) := ⟨∇x, F̂⟩_i = Σ_{j:(i,j)∈E} (x(j) − x(i)) F̂_{i,j}  (4)

|F| will denote the absolute value of F and ||F_{i,:}||_{L^p} the L^p-norm of the i-th row of F. We also define the forward/backward directions as the positive/negative parts of the field, F^±.
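Equations 3 and 4 translate directly into code. The short sketch below uses our own helper names (`gradient`, `directional_derivative`), not anything from the paper, and evaluates D_F x on a 3-node path with a hand-picked unit-L1 field:

```python
import numpy as np

def gradient(x, A):
    """(grad x)_(i,j) = x(j) - x(i) on edges, 0 elsewhere (equation 3)."""
    return (x[None, :] - x[:, None]) * A

def directional_derivative(x, F_hat, A):
    """D_F x(i) = sum_j (x(j) - x(i)) * F_hat[i, j] over neighbours j (equation 4)."""
    return (gradient(x, A) * F_hat).sum(axis=1)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)   # 3-node path graph
x = np.array([0.0, 1.0, 3.0])                            # scalar signal on the nodes
F = np.array([[0.0, 1.0, 0.0],                           # forward field, unit-L1 rows
              [-0.5, 0.0, 0.5],
              [0.0, -1.0, 0.0]])
dfx = directional_derivative(x, F, A)   # dfx -> [1.0, 1.5, 2.0]
```

The result is a forward finite difference at the endpoints and a centered difference at the middle node, matching the intuition of a derivative along the field.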

2.3. DIRECTIONAL SMOOTHING AND DERIVATIVES

Next, we show how the vector field F is used to guide the graph aggregation by projecting the incoming messages. Specifically, we define the weighted aggregation matrices B_av and B_dx that compute the directional smoothing and directional derivative of the node features. The directional average matrix B_av is the weighted aggregation matrix such that all weights are positive and all rows have an L1-norm equal to 1, as shown in equation 5 and theorem 2.1, with a proof in appendix C.1.

B_av(F)_{i,:} = |F_{i,:}| / (||F_{i,:}||_{L1} + ε)  (5)

The variable ε is an arbitrarily small positive number used to avoid floating-point errors. The L1-norm denominator is a local row-wise normalization. The aggregator works by assigning a large weight to the elements in the forward or backward direction of the field, while assigning a small weight to the other elements, with a total weight of 1.

Theorem 2.1 (Directional smoothing). The operation y = B_av x is the directional average of x, in the sense that y_u is the mean of x_v, weighted by the direction and amplitude of F.

The directional derivative matrix B_dx is defined in equation 6 and theorem 2.2, with the proof in appendix C.2. Again, the denominator is a local row-wise normalization, but it can be replaced by a global normalization. diag(a) is a square diagonal matrix with diagonal entries given by a. The aggregator works by subtracting the projected forward message from the backward message (similar to a centered derivative), with an additional diagonal term to balance both directions.

B_dx(F)_{i,:} = F̂_{i,:} − diag(Σ_j F̂_{:,j})_{i,:} ,  F̂_{i,:} = F_{i,:} / (||F_{i,:}||_{L1} + ε)  (6)

Theorem 2.2 (Directional derivative). Suppose F̂ has rows of unit L1-norm. The operation y = B_dx(F̂)x is the centered directional derivative of x in the direction of F̂, in the sense of equation 4, i.e.

y = D_{F̂} x = (F̂ − diag(Σ_j F̂_{:,j})) x

These aggregators are directional, interpretable and complementary, making them ideal choices for GNNs.
We discuss the choice of aggregators in more detail in appendix A, while also providing alternative aggregation matrices such as the center-balanced smoothing, the forward-copy, the phantom zero-padding, and the hardening of the aggregators using softmax/argmax on the field. We further provide a visual interpretation of the B_av and B_dx aggregators in figure 3. Interestingly, we also note in appendix A.1 that B_av and B_dx yield respectively the mean and Laplacian aggregations when F is a vector field whose entries are all constant, F_ij = ±C.
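Equations 5 and 6 and the two theorems can be checked on a toy field. The following sketch is ours (the helper name `build_aggregators` is an assumption, not the paper's API):

```python
import numpy as np

def build_aggregators(F, eps=1e-8):
    """B_av and B_dx from a raw field F, following equations 5 and 6 (a sketch)."""
    F_hat = F / (np.abs(F).sum(axis=1, keepdims=True) + eps)   # row-wise L1 normalization
    B_av = np.abs(F_hat)                                       # equation 5
    B_dx = F_hat - np.diag(F_hat.sum(axis=1))                  # equation 6
    return B_av, B_dx

# Field on a 3-node path, pointing from node 0 towards node 2
F = np.array([[0.0, 2.0, 0.0],
              [-1.0, 0.0, 3.0],
              [0.0, -2.0, 0.0]])
B_av, B_dx = build_aggregators(F)
x = np.array([0.0, 1.0, 3.0])
y_av, y_dx = B_av @ x, B_dx @ x
```

At node 1 the field weights the forward neighbour 3× more than the backward one, so `y_av[1]` is 0.25·x0 + 0.75·x2 = 2.25 (theorem 2.1) and `y_dx[1]` equals the directional derivative 1.75 (theorem 2.2).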

2.4. GRADIENT OF THE EIGENVECTORS AS INTERPRETABLE VECTOR FIELDS

In this section we give theoretical support for the choice of gradients of the eigenfunctions of the Laplacian as sensible vectors along which to do directional message passing, since they are interpretable and reduce over-smoothing. As usual, the combinatorial, degree-normalized and symmetric-normalized Laplacians are defined as

L = D − A ,  L_norm = D^{-1} L ,  L_sym = D^{-1/2} L D^{-1/2}  (7)

The problems of over-smoothing and over-squashing are critical issues in GNNs (Alon & Yahav, 2020; Hamilton, 2020). In most GNN models, node representations become over-smoothed after several rounds of message passing (i.e., convolutions), as the representations tend to reach a mean-field equilibrium equivalent to the stationary distribution of a random walk (Hamilton, 2020). Over-smoothing is also related to the problem of over-squashing, which reflects the inability of GNNs to propagate informative signals between distant nodes (Alon & Yahav, 2020) and is a major bottleneck to training deep GNN models (Xu et al., 2019). Both problems are related to the fact that the influence of one node's input on the final representation of another node in a GNN is determined by the likelihood of the two nodes co-occurring on a truncated random walk (Xu et al., 2018b). We show in theorem 2.3 (proved in appendix C.3) that by passing information in the direction of φ_1, the eigenvector associated to the lowest non-trivial frequency of L_norm, DGNs can efficiently share information between the farthest nodes of the graph, when using the K-walk distance to measure the difficulty of passing information. Thus, DGNs provide a natural way to address both the over-smoothing and over-squashing problems: they can efficiently propagate messages between distant nodes and in a direction that counteracts over-smoothing.

Definition 2 (K-walk distance). The K-walk distance d_K(v_i, v_j) on a graph is the average number of times v_i is hit in a K-step random walk starting from v_j.
Theorem 2.3 (Gradient of the low-frequency eigenvectors). Let λ_i and φ_i be the eigenvalues and eigenvectors of the normalized Laplacian L_norm of a connected graph, and let a, b = arg max_{1≤i,j≤n} {d_K(v_i, v_j)} be the nodes with the highest K-walk distance. Let m = arg min_{1≤i≤n} (φ_1)_i and M = arg max_{1≤i≤n} (φ_1)_i; then d_K(v_m, v_M) − d_K(v_a, v_b) is of order O(1 − λ_2).

As another point of view on the problem of over-smoothing, consider the hitting time Q(x, y), defined as the expected number of steps in a random walk starting from node x and ending in node y, with transition probability P(x, y) = 1/d_x. In appendix C.4 we give an informal argument supporting the following conjecture.

Definition 3 (Gradient step). Suppose the two neighbouring nodes x and z are such that φ(z) − φ(x) is maximal among the neighbours of x; then we say that z is obtained from x by taking a step in the direction of the gradient ∇φ.

Conjecture 2.4 (Gradient steps reduce expected hitting time). Suppose that x, y are uniformly distributed random nodes such that φ_i(x) < φ_i(y). Let z be the node obtained from x by taking one step in the direction of ∇φ_i; then the expected hitting time is decreased proportionally to λ_i^{-1}, and

E_{x,y}[Q(z, y)] ≤ E_{x,y}[Q(x, y)]

The next two corollaries follow from theorem 2.3 (and also from conjecture 2.4 if it is true).

Corollary 2.5 (Reduces over-squashing). Following the direction of ∇φ_1 is an efficient way of passing information between the farthest nodes of the graph (in terms of the K-walk distance).

Corollary 2.6 (Reduces over-smoothing). Following the direction of ∇φ_1 allows the influence distribution between node representations to be decorrelated from random-walk hitting times (assuming the definition of influence introduced in Xu et al. (2018b)).
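The K-walk distance of definition 2 can be computed directly from powers of the random-walk transition matrix. The sketch below uses our own function name `k_walk_distance`, purely for illustration:

```python
import numpy as np

def k_walk_distance(A, K):
    """d_K(v_i, v_j): expected number of visits to v_i during a K-step random
    walk started at v_j, i.e. the sum over t = 1..K of (P^t)[j, i]."""
    P = A / A.sum(axis=1, keepdims=True)   # transition matrix P(x, y) = 1/d_x
    Pt = np.eye(len(A))
    visits = np.zeros_like(P)
    for _ in range(K):
        Pt = Pt @ P
        visits += Pt
    return visits.T                        # entry (i, j) = d_K(v_i, v_j)

# 4-node path: the far end is visited far less often than a direct neighbour
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
d = k_walk_distance(A, K=3)
```

Starting from node 0, the adjacent node 1 accumulates 1.75 expected visits over 3 steps while the far end node 3 accumulates only 0.25, illustrating how the measure separates near from distant nodes.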
Our method also aligns well with a recent proof that multiple independent aggregators are needed to distinguish neighbourhoods of nodes with continuous features (Corso et al., 2020). When using eigenvectors of the Laplacian φ_i to define directions in a graph, we need to keep in mind that there is never a single eigenvector associated to an eigenvalue, but a whole eigenspace. For instance, an eigenvalue can have a multiplicity of 2, meaning its eigenspace can be generated by different pairs of orthogonal eigenvectors. For an eigenvalue of multiplicity 1, there are always two unit-norm eigenvectors of opposite sign, which poses a problem during the directional aggregation. We can make a choice of sign and later take the absolute value (as B_av does in equation 5). An alternative is to sample orthonormal bases of the eigenspace and use each choice to augment the training (see section 2.8). Although multiplicities higher than one do happen for low frequencies (square grids have a multiplicity of 2 for λ_1), this is not common in real-world graphs; we found no λ_1 multiplicity greater than 1 on the ZINC and PATTERN datasets (see appendix B.4). Further, although all φ are orthogonal, their gradients, used to define directions, are not always locally orthogonal (e.g. there are many horizontal flows in a grid). This last concern is left to be addressed in future work.

2.5. GENERALIZATION OF THE CONVOLUTION ON A GRID

In this section we show that our method generalizes CNNs by allowing the definition of any radius-R convolutional kernel in grid-shaped graphs. The radius-R kernel at node u is a convolutional kernel that takes the weighted sum of all nodes v at a distance d(u, v) ≤ R. Consider the lattice graph Γ of size N_1 × N_2 × ... × N_n where each vertex is connected to its direct non-diagonal neighbours. We know from Lemma C.1 that, for each dimension, there is an eigenvector that is only a function of this specific dimension. For example, the lowest-frequency eigenvector φ_1 always flows in the direction of the longest length. Hence, the Laplacian eigenvectors of the grid can play a role analogous to the axes in Euclidean space, as shown in figure 2. With this knowledge, we show in theorem 2.7 (proven in appendix C.7) that we can generalize all convolutional kernels in an n-dimensional grid. This is a strong result since it demonstrates that our DGN framework generalizes CNNs when applied on a grid, thus closing the gap between GNNs and the highly successful CNNs on image tasks.

Theorem 2.7 (Generalization of the radius-R convolutional kernel in a lattice). For an n-dimensional lattice, any convolutional kernel of radius R can be realized by a linear combination of directional aggregation matrices and their compositions.

As an example, figure 4 shows how a linear combination of the first and m-th aggregators B(∇φ_{1,m}) realizes a kernel on an N × M grid, where m = N/M and N > M.

Figure 4: Realization of a radius-1 convolution on an image I of size N × M (N > M, N % M ≠ 0) using the proposed aggregators: the identity weight w_1 combines with the pairs w_2 ± w_3 (from 2B^1_av x and 2B^1_dx x) and w_4 ± w_5 (from 2B^m_av x and 2B^m_dx x). I_x is the input feature map, * the convolutional operator, I_y the convolution result, and B^i = B(∇φ_i).

2.6. EXTENDING THE RADIUS OF THE AGGREGATION KERNEL

Having aggregation kernels for neighbours at distance 2 or 3 is important to improve the expressiveness of GNNs, their ability to understand patterns, and to reduce the number of layers required. However, the lack of directions in GNNs strongly limits the radius of the kernels since, in a graph of regular degree d, a mean/sum aggregation at radius R results in heavy over-squashing of d^R messages. Using the directional fields, we can enumerate different paths, thus assigning a different weight to different R-distant neighbours. This method, proposed in appendix A.7, avoids the over-squashing, but empirical results are left for future work.

2.7. COMPARISON WITH WEISFEILER-LEHMAN (WL) TEST

We also compare the expressiveness of the Directional Graph Networks with the classical WL graph isomorphism test which is often used to classify the expressivity of graph neural networks (Xu et al., 2018a) . In theorem 2.8 (proven in appendix C.8) we prove that DGNs are capable of distinguishing pairs of graphs that the 1-WL test (and so ordinary GNNs) cannot differentiate. Theorem 2.8 (Comparison with 1-WL test). DGNs using the mean aggregator, any directional aggregator of the first eigenvector and injective degree-scalers are strictly more powerful than the 1-WL test.

2.8. DATA AUGMENTATION

Another important result is that the directions in the graph allow us to replicate some of the most common data augmentation techniques used in computer vision, namely reflection, rotation and distortion. The main difference is that, instead of modifying the image (such as a 5° rotation), the proposed transformation is applied to the vector field defining the aggregation kernel (thus rotating the kernel by −5° without changing the image). This offers the advantage of avoiding pre-processing of the data, since the augmentation is done directly on the kernel at each iteration of the training. The simplest augmentation is the vector field flipping, done by changing the sign of the field F, as stated in definition 4. This changes the sign of B_dx but leaves B_av unchanged.

Definition 4 (Reflection of the vector field). For a vector field F, the reflected field is −F.

Let F_1, F_2 be vector fields in a graph, with F̂_1 and F̂_2 the fields normalized such that each row has a unitary L2-norm. Define the angle vector α by ⟨(F̂_1)_{i,:}, (F̂_2)_{i,:}⟩ = cos(α_i). The vector field F̂_2^⊥ is the normalized component of F̂_2 perpendicular to F̂_1, defined by

(F̂_2^⊥)_{i,:} = (F̂_2 − ⟨F̂_1, F̂_2⟩ F̂_1)_{i,:} / ||(F̂_2 − ⟨F̂_1, F̂_2⟩ F̂_1)_{i,:}||

Notice that we then have the decomposition (F̂_2)_{i,:} = cos(α_i)(F̂_1)_{i,:} + sin(α_i)(F̂_2^⊥)_{i,:}.

Definition 5 (Rotation of the vector fields). For F̂_1 and F̂_2 non-colinear vector fields with each vector of unitary length, their rotation by the angle θ in the plane formed by {F̂_1, F̂_2} is

F_1^θ = F̂_1 diag(cos θ) + F̂_2^⊥ diag(sin θ) ,  F_2^θ = F̂_1 diag(cos(θ + α)) + F̂_2^⊥ diag(sin(θ + α))  (8)

Finally, the following augmentation has a similar effect to a wave distortion applied to images.

Definition 6 (Random distortion of the vector field). For a vector field F and an anti-symmetric random noise matrix R, the randomly distorted field is F' = F + R ∘ A.
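Definitions 4 and 5 amount to a per-row Gram-Schmidt step followed by a plane rotation. The sketch below uses our own helper name `rotate_fields` and toy two-column fields; it is an illustration of the definitions, not the paper's implementation:

```python
import numpy as np

def rotate_fields(F1, F2, theta, eps=1e-12):
    """Rotate the pair of fields by angle theta in the plane they span (definition 5)."""
    n1 = F1 / (np.linalg.norm(F1, axis=1, keepdims=True) + eps)  # unit-L2 rows
    n2 = F2 / (np.linalg.norm(F2, axis=1, keepdims=True) + eps)
    cos_a = (n1 * n2).sum(axis=1, keepdims=True)         # cos(alpha_i) per row
    perp = n2 - cos_a * n1                               # component of F2 orthogonal to F1
    perp = perp / (np.linalg.norm(perp, axis=1, keepdims=True) + eps)
    alpha = np.arccos(np.clip(cos_a, -1.0, 1.0))
    F1_rot = np.cos(theta) * n1 + np.sin(theta) * perp
    F2_rot = np.cos(theta + alpha) * n1 + np.sin(theta + alpha) * perp
    return F1_rot, F2_rot

# Two orthogonal toy fields; a 90-degree rotation maps F1 onto F2 and F2 onto -F1
F1 = np.array([[1.0, 0.0], [1.0, 0.0]])
F2 = np.array([[0.0, 1.0], [0.0, 1.0]])
R1, R2 = rotate_fields(F1, F2, np.pi / 2)
```

Reflection (definition 4) is simply `-F1`, and a wave distortion (definition 6) adds anti-symmetric noise masked by the adjacency matrix.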

3. IMPLEMENTATION

We implemented the models using the DGL and PyTorch libraries and we provide the code at the address https://anonymous.4open.science/r/a752e2b1-22e3-40ce-851c-a564073e1fca/. We test our method on standard benchmarks from Dwivedi et al. (2020) and Hu et al. (2020), namely ZINC, CIFAR10, PATTERN and MolHIV, with more details on the datasets and how we enforce a fair comparison in appendix B.1. For the empirical experiments we inserted our proposed aggregation method in two different types of message passing architectures used in the literature: a simple one similar to GCN (equation 9a) (Kipf & Welling, 2016) and a more complex and general one typical of MPNNs (equation 9b) (Gilmer et al., 2017), with or without edge features e_ji. Hence, the time complexity O(Em) is identical to PNA (Corso et al., 2020), where E is the number of edges and m the number of aggregators, with an additional O(Ek) to pre-compute the k first eigenvectors, as explained in appendix B.2.

X_i^{(t+1)} = U( ⊕_{(j,i)∈E} X_j^{(t)} )  (9a)

X_i^{(t+1)} = U( X_i^{(t)} , ⊕_{(j,i)∈E} M(X_i^{(t)}, X_j^{(t)}, e_ji) )  (9b)

where ⊕ is an operator that concatenates the results of multiple aggregators, X are the node features, M is a linear transformation, U a multi-layer perceptron, and the edge features e_ji are optional. We tested the directional aggregators across the datasets using the gradients of the first k eigenvectors ∇φ_{1,...,k} as the underlying vector fields. Here, k is a hyperparameter, usually 1 or 2, but it could be bigger for high-dimensional graphs. To deal with the arbitrary sign of the eigenvectors, we take the absolute value of the result of equation 6, making it invariant to a reflection of the field. In case of a disconnected graph, φ_i is the i-th eigenvector of each connected component. Despite the numerous aggregators proposed in appendix A, only B_dx and B_av are tested empirically.
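A single layer of the simple architecture (9a) can be sketched as follows. All names here (`dgn_layer`, the toy cycle graph and field) are our own assumptions for illustration, not the released DGL/PyTorch code:

```python
import numpy as np

rng = np.random.default_rng(0)

def dgn_layer(X, A, B_av, B_dx, W1, W2):
    """One DGN layer in the style of (9a): concatenate the mean, B_av and |B_dx|
    aggregations, then apply a shared two-layer MLP U (a sketch)."""
    mean_agg = (A / A.sum(axis=1, keepdims=True)) @ X          # D^{-1} A X
    Y = np.concatenate([mean_agg, B_av @ X, np.abs(B_dx @ X)], axis=1)
    return np.maximum(Y @ W1, 0) @ W2                          # MLP with one ReLU layer

# Toy graph: 5-node cycle, with a directional field from an arbitrary node signal
N, n0, n1 = 5, 3, 4
A = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)
phi = rng.normal(size=N)
grad = (phi[None, :] - phi[:, None]) * A
F = grad / (np.abs(grad).sum(axis=1, keepdims=True) + 1e-8)
B_av, B_dx = np.abs(F), F - np.diag(F.sum(axis=1))

X = rng.normal(size=(N, n0))
W1 = rng.normal(size=(3 * n0, 8))   # 3 aggregators -> 3 * n0 concatenated columns
W2 = rng.normal(size=(8, n1))
H = dgn_layer(X, A, B_av, B_dx, W1, W2)
```

The absolute value on the `B_dx` channel implements the sign-invariance described above, and the concatenation before the MLP mirrors the ⊕ operator of equations 9a and 9b.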

4. RESULTS AND DISCUSSION

Directional aggregation Using the benchmarks introduced in section 3, we present in figure 5 a fair comparison of various aggregation strategies using the same parameter budget and hyperparameters. We see a consistent boost in performance for the simple, complex, and complex-with-edges models using directional aggregators compared to the mean-aggregator baseline. The low-frequency Laplacian eigenvectors are used to define the directions, except for CIFAR10, which uses the coordinates of the image. For brevity, we denote dx_i and av_i as the directional derivative B^i_dx and smoothing B^i_av aggregators of the i-th direction. We also denote pos_i as the i-th eigenvector used as positional encoding for the mean aggregator. In particular, we see a significant improvement on ZINC and MolHIV when using the directional aggregators. We believe this is due to the capacity to efficiently move messages across opposite parts of the molecule and to better understand the role of atom pairs. Further, the thesis that DGNs can bridge the gap between CNNs and GNNs is supported by the clear improvements on CIFAR10 over the baselines. This contrasts with the positional encoding, which showed no clear improvement. With our theoretical analysis in mind, we expected to perform well on PATTERN, since the flows of the first eigenvectors are meaningful directions in a stochastic block model and passing messages along those directions allows the network to efficiently detect the two communities. The results match our expectations, outperforming all the previous models. Comparison to the literature In order to compare our model with the literature, we fine-tuned it on the various datasets and we report its performance in figure 6. We observe that DGN provides significant improvement across all benchmarks, highlighting the importance of anisotropic kernels. In the work by Dwivedi et al.
(2020), the use of eigenvectors as positional encodings in the node features was proposed, but these bring significant improvement only when many eigenvectors and high network depths are used. Our results outperform them with fewer parameters, less depth, and only 1-2 eigenvectors, further motivating their use as directional flows instead of positional encodings.

Data augmentation

To evaluate the effectiveness of the proposed augmentation, we trained the models on a reduced version of the CIFAR10 dataset. The results in figure 7 clearly show the higher expressive power of the dx aggregator, enabling it to fit the training data well. For a small dataset, this comes at the cost of overfitting and a reduced test-set performance, but we observe that randomly rotating or distorting the kernels counteracts the overfitting and improves the generalization. As expected, the performance decreases when the rotation or distortion is too strong, since the augmented graph changes too much. In computer vision, images similar to CIFAR10 are usually rotated by less than 30° (Shorten & Khoshgoftaar, 2019; O'Gara & McGuinness, 2019). Further, due to the constant number of parameters across models, fewer parameters are attributed to the mean aggregation.

Figure 7: Accuracy of the various models using data augmentation, with a complex architecture of ∼100k parameters, trained on 10% of the CIFAR10 training set (4.5k images). An angle of x corresponds to a rotation of the kernel by a random angle sampled uniformly in (−x°, x°) using definition 5, with F_{1,2} being the gradients of the horizontal/vertical coordinates. A noise of 100x% corresponds to a distortion of each eigenvector with random noise uniformly sampled in (−x·m, x·m), where m is the average absolute value of the eigenvector's components. The mean baseline model is not affected by the augmentation since it does not use the underlying vector field.

5. CONCLUSION

The proposed DGN method addresses many problems of GNNs, including the lack of anisotropy, the low expressiveness, and over-smoothing and over-squashing. For the first time in graph networks, we generalize the directional properties of CNNs and their data augmentation capabilities. Based on an intuitive idea and backed by a set of strong theoretical and empirical results, we believe this work will give rise to a new family of directional GNNs. Future work can focus on the implementation of radius-R kernels and on improving the choice of multiple orthogonal directions. Broader Impact This work extends the usability of graph networks to all problems with physically defined directions, thus making GNNs a new laboratory for physics, materials science and biology. In fact, the anisotropy present in a wide variety of systems could be expressed as vector fields (spinor, tensor) compatible with the DGN framework, without the need for eigenvectors. One example is magnetic anisotropy in metals and alloys, and also in molecules containing benzene rings, alkenes, carbonyls or alkynes, which are easier or harder to magnetise depending on the direction or which way the object is rotated. Other examples are the response of materials to high electromagnetic fields (e.g. to study material responses at terahertz frequencies); all kinds of field propagation in crystal lattices (vibrations, heat, shear and frictional forces, Young's modulus, light refraction, birefringence); multi-body or liquid motion; traffic modelling; and the design of novel materials and constrained structures. This also enables GNNs to be used for virtual prototyping systems, since the added directional constraints could improve the analysis of a product's functionality, manufacturing and behavior.
A simple alternative to the directional smoothing and directional derivative operators is to simply take the forward/backward values according to the positive/negative parts of the underlying field F, since these can effectively replicate them. However, there are many advantages to using B_av,dx. First, one can decide to use either of them and still have an interpretable aggregation with half the parameters. Second, B_av,dx regularize the parameters by forcing the network to take both forward and backward neighbours into account each time, preventing one of the neighbours from becoming too important. Lastly, they are robust to a change of sign of the eigenvectors, since B_av is sign-invariant and B_dx only changes the sign of its result, which is not the case for forward/backward aggregations.

A.1 RETRIEVING THE MEAN AND LAPLACIAN AGGREGATIONS

It is interesting to note that we can recover simple aggregators from the aggregation matrices B_av(F) and B_dx(F). Let F be a vector field such that all edges are equally weighted, i.e. F_ij = ±C for all edges (i, j). Then the aggregator B_av is equivalent to a mean aggregation:

B_av(F)x = D^{-1}Ax

Under the condition F_ij = C, the differential aggregator is equivalent to the Laplacian operator L normalized by the degree D:

B_dx(CA)x = D^{-1}(A - D)x = -D^{-1}Lx

A.2 GLOBAL FIELD NORMALIZATION

The proposed aggregators are defined with a row-wise normalized field \hat{F}_{i,:} = F_{i,:} / ||F_{i,:}||_{L^p}, meaning that all vectors are of unit norm and the aggregation/message passing is done only according to the direction of the vectors, not their amplitude. However, it is also possible to normalize the field F globally by taking a matrix norm instead of a vector norm. Doing so modulates the aggregation by the amplitude of the field at each node. One needs to be careful, since a global normalization can be very sensitive to the number of nodes in the graph.
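Both identities above can be verified on a small example. The sketch below (our own construction, using the row-wise L1 normalization of appendix A.2) checks them on a 4-node path graph with a constant field F = CA:

```python
import numpy as np

# Path graph P4: adjacency, degree matrix, combinatorial Laplacian.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

def row_l1_normalize(F, eps=1e-8):
    return F / (np.abs(F).sum(axis=1, keepdims=True) + eps)

def B_av(F):
    return np.abs(row_l1_normalize(F))

def B_dx(F):
    F_hat = row_l1_normalize(F)
    return F_hat - np.diag(F_hat.sum(axis=1))

C = 3.0
F = C * A  # constant positive field: F_ij = C on every edge

D_inv = np.linalg.inv(D)
assert np.allclose(B_av(F), D_inv @ A, atol=1e-6)   # mean aggregation D^{-1}A
assert np.allclose(B_dx(F), -D_inv @ L, atol=1e-6)  # -D^{-1}L
```

The row normalization is what makes the constant C drop out, so any C > 0 recovers the same operators.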

A.3 CENTER-BALANCED AGGREGATORS

A problem arises in the aggregators B_dx and B_av proposed in equations 5 and 6 when there is an imbalance between the positive and negative parts F^± of the field. In that case, one of the directions overtakes the other in terms of associated weights. An alternative is to normalize the forward and backward directions separately, to avoid having either the backward or forward direction dominate the message:

B_{av-center}(F)_{i,:} = (\hat{F}^+_{i,:} + \hat{F}^-_{i,:}) / ||\hat{F}^+_{i,:} + \hat{F}^-_{i,:}||_{L^1},   \hat{F}^±_{i,:} = |F^±_{i,:}| / (||F^±_{i,:}||_{L^1} + ε)   (10)

The same idea can be applied to the derivative aggregator in equation 11, where the positive and negative parts F^± of the field are normalized separately, so that both the forward and backward messages are projected into a vector field of unit norm. F^+ is the out-going field at each node and is used for the forward direction, while F^- is the in-going field used for the backward direction. By averaging the forward (first term) and backward (second term) derivatives, the proposed matrix B_{dx-center} represents a centered derivative matrix:

B_{dx-center}(F)_{i,:} = \hat{F}_{i,:} - diag(Σ_j \hat{F}_{:,j})_{i,:},   \hat{F}_{i,:} = (1/2) ( F^+_{i,:} / (||F^+_{i,:}||_{L^1} + ε) + F^-_{i,:} / (||F^-_{i,:}||_{L^1} + ε) )   (11)

A.4 HARDENING THE AGGREGATORS

The aggregation matrices that we proposed, mainly B_dx and B_av, depend on a smooth vector field F. At any given node, the aggregation takes a weighted sum of the neighbours according to the direction of F. Hence, if the field F_v at a node v is spread out, in the sense that it gives non-zero weight to many neighbours, then the aggregator computes a weighted average of those neighbours. Although there are often good reasons for this weighted-average behaviour, it is not desirable in every problem. For example, if we want to move a single node's features across the graph, this behaviour will smooth them at every step.
Instead, we propose below to soften or harden the aggregation by forcing the field to make a decision on the direction it takes. Softly hardening the aggregation is possible by applying a softmax with temperature T to each row, giving the field F_softhard:

(F_softhard)_{i,:} = sign(F_{i,:}) ⊙ softmax(T |F_{i,:}|)   (12)

Hardening the aggregation is possible by using an infinite temperature, which turns the softmax into an argmax. In this case, the neighbour with the highest field component is copied, while all other neighbours are ignored:

(F_hard)_{i,:} = sign(F_{i,:}) ⊙ argmax(|F_{i,:}|)   (13)

An alternative to the aggregators above is to take the softmin/argmin of the negative part and the softmax/argmax of the positive part.
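The limiting behaviour of equation 12 can be sketched as follows. This is our own minimal implementation (the softmax is taken over the non-zero neighbour entries only, an assumption, since including the zero entries would leak weight to non-neighbours):

```python
import numpy as np

def soft_harden(F, T):
    """Row-wise sign(F) * softmax(T * |F|), restricted to non-zero entries."""
    out = np.zeros_like(F)
    for i in range(F.shape[0]):
        mask = F[i] != 0
        if not mask.any():
            continue
        w = np.exp(T * np.abs(F[i, mask]))
        w = w / w.sum()                      # softmax over neighbours only
        out[i, mask] = np.sign(F[i, mask]) * w
    return out

F = np.array([[0.0, 0.9, -0.1, 0.2]])        # one node with three neighbours

# Moderate temperature: a soft, still-distributed field of unit L1 norm.
soft = soft_harden(F, T=1.0)
assert np.isclose(np.abs(soft).sum(), 1.0)

# High temperature approaches the hard field: one-hot on the largest |F|.
hard = soft_harden(F, T=100.0)
assert np.allclose(hard, [[0.0, 1.0, 0.0, 0.0]], atol=1e-6)
```

As T grows, the row converges to the argmax behaviour of equation 13, so a single temperature parameter interpolates between the smooth and hard aggregators.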

A.5 FORWARD AND BACKWARD COPY

The aggregation matrices B_av and B_dx have the nice property that if the field is flipped (change of sign), the aggregation gives the same result, except for the sign of B_dx. However, there are cases where we want to propagate information in the forward direction of the field without smoothing it with the backward direction. In this case, we can define the strictly forward and strictly backward fields below and use them directly with the aggregation matrices:

F_forward = F^+,   F_backward = F^-

Further, we can use the hardened fields to define a forward copy and a backward copy, which simply copy the node in the direction of the highest field component:

F_{forward copy} = F^+_hard,   F_{backward copy} = F^-_hard

A.6 PHANTOM ZERO-PADDING

Some recent work in computer vision has shown the importance of zero-padding in CNNs, which allows the network to understand its position relative to the border (Islam et al., 2020). In contrast, using boundary conditions or reflection padding makes the network completely blind to positional information. In this section, we show how to mimic zero-padding in the direction of the field F for both aggregation matrices B_av and B_dx. Starting with the B_av matrix: in the case of a missing neighbour in the forward/backward direction, the matrix compensates by adding more weight to the other direction, due to the denominator which performs a normalization. Instead, we need the matrix to consider both directions separately, so that a missing direction results in zero-padding. Hence, we define B_{av,0pad} below, where either F^+ or F^- is 0 on a boundary with strictly in-going/out-going field:

(B_{av,0pad})_{i,:} = (1/2) ( |F^+_{i,:}| / (||F^+_{i,:}||_{L^1} + ε) + |F^-_{i,:}| / (||F^-_{i,:}||_{L^1} + ε) )   (16)

Following the same argument, we define B_{dx,0pad} below, where either the forward or backward term is ignored.
The diagonal term is also removed at the boundary, so that the result is a centered derivative equal to the subtraction of the forward term with the 0-term on the back (or vice-versa), instead of a forward derivative:

B_{dx-0pad}(F)_{i,:} =
  \hat{F}^+_{i,:}                                                              if Σ_j F^-_{i,j} = 0
  \hat{F}^-_{i,:}                                                              if Σ_j F^+_{i,j} = 0
  (1/2) ( \hat{F}^+_{i,:} + \hat{F}^-_{i,:} - diag(Σ_j (\hat{F}^+_{:,j} + \hat{F}^-_{:,j}))_{i,:} )   otherwise

with \hat{F}^±_{i,:} = F^±_{i,:} / (||F^±_{i,:}||_{L^1} + ε).
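The zero-padding effect of equation 16 can be seen on a path graph: a boundary node missing one direction keeps only half the total weight instead of re-normalizing. A minimal numpy sketch (our own construction):

```python
import numpy as np

def B_av_0pad(F, eps=1e-8):
    """Equation 16: forward and backward halves normalized separately, so a
    missing direction contributes 0 instead of re-weighting the other side."""
    F_pos = np.maximum(F, 0)    # out-going (forward) part F+
    F_neg = np.maximum(-F, 0)   # in-going (backward) part |F-|
    pos = F_pos / (F_pos.sum(axis=1, keepdims=True) + eps)
    neg = F_neg / (F_neg.sum(axis=1, keepdims=True) + eps)
    return 0.5 * (pos + neg)

# Antisymmetric field on the path 0-1-2-3: node 0 has no backward neighbour.
n = 4
F = np.zeros((n, n))
for i in range(n - 1):
    F[i, i + 1] = 1.0
    F[i + 1, i] = -1.0

B = B_av_0pad(F)
# Interior rows keep total weight 1; boundary rows keep only 1/2,
# mimicking a zero-padded neighbour beyond the border.
assert np.isclose(B[1].sum(), 1.0, atol=1e-6)
assert np.isclose(B[0].sum(), 0.5, atol=1e-6)
assert np.isclose(B[n - 1].sum(), 0.5, atol=1e-6)
```

The ε in the denominator is what makes the missing direction contribute exactly 0 rather than dividing by zero.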

A.7 EXTENDING THE RADIUS OF THE AGGREGATION KERNEL

We aim at providing a general radius-R kernel B_R that assigns different weights to different subsets of nodes n_u at a distance of at most R from the center node n_v. First, we decompose the matrix B(F) into positive and negative parts B^±(F), representing the forward and backward aggregation steps in the field F:

B(F) = B^+(F) - B^-(F)

Thus, defining B^±_{fb}(F)_{i,:} = F^±_{i,:} / ||F_{i,:}||_{L^p}, we can obtain different aggregation matrices by using different combinations of walks of radius R. First demonstrated for a grid in theorem 2.7, we generalize it in equation 19 for any graph G.

Definition 7 (General radius-R n-directional kernel). Let S_n be the group of permutations over n elements and {F_i} a set of directional fields. Then

B_R := Σ_{V = {v_1, v_2, ..., v_n} ∈ Z^n, ||V||_{L^1} ≤ R, -R ≤ v_i ≤ R}   Σ_{σ ∈ S_n}   a_V   Π_{j=1}^{n} ( B^{sgn(v_{σ(j)})}_{fb}(F_{σ(j)}) )^{|v_{σ(j)}|}   (19)

where the outer sum runs over every choice of walk V with at most R steps using all combinations of v_1, v_2, ..., v_n, the inner sum runs over the optional permutations, and each product is the aggregator following the steps V, permuted by S_n. In this equation, n is the number of directional fields and R is the desired radius. V represents a choice of walk {v_1, v_2, ..., v_n} in the directions of the fields {F_1, F_2, ..., F_n}. For example, V = {3, 1, 0, -2} has radius R = 6, with 3 steps forward of F_1, 1 step forward of F_2, and 2 steps backward of F_4. The sign of each B^±_{fb} depends on the sign of v_{σ(j)}, and the power |v_{σ(j)}| is the number of aggregation steps in the directional field F_{σ(j)}. The full equation is thus the combination of all possible choices of paths across the set of fields F_i, with all possible permutations. Note that we restrict the sum to each v_i having a single sign; although the matrices do not commute, we avoid mixing signs within one direction since such a walk would likely self-intersect a lower-radius walk. The permutations σ are required since, for example, the path up → left is different (in a general graph) from the path left → up.
This matrix B_R has a total of Σ_{r=0}^{R} (2n)^r = ((2n)^{R+1} - 1) / (2n - 1) parameters, with high redundancy since some permutations can be very similar; e.g. on a grid graph, up → left is identical to left → up. Hence, we can replace the permutations S_n by a single reverse ordering, meaning that Π_{j=1}^{N} B_j = B_N ... B_2 B_1. Doing so does not perfectly replicate the radius-R kernel for all graphs, but it generalizes it on a grid and significantly reduces the number of parameters to Σ_{r=0}^{R} Σ_{l=1}^{min(n,r)} 2^r (n choose l) ((r-1) choose (l-1)).
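The first parameter count is a plain geometric sum and can be sanity-checked by brute force (a small sketch of our own, counting one parameter per ordered sequence of up to R signed direction choices):

```python
def brute_force_count(n, R):
    """Count walks of length 0..R where each step picks one of n directions
    with one of 2 signs: sum_{r=0}^{R} (2n)^r terms in total."""
    return sum((2 * n) ** r for r in range(R + 1))

def closed_form(n, R):
    """Geometric-sum closed form ((2n)^(R+1) - 1) / (2n - 1)."""
    return ((2 * n) ** (R + 1) - 1) // (2 * n - 1)

# The two counts agree for a range of field numbers n and radii R.
for n in (1, 2, 3, 4):
    for R in (0, 1, 2, 3):
        assert brute_force_count(n, R) == closed_form(n, R)
```

For n = 2 directions and radius R = 2 this already gives 21 parameters, which illustrates why the reduced ordering above matters in practice.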

B APPENDIX -IMPLEMENTATION DETAILS B.1 BENCHMARKS AND DATASETS

We use a variety of benchmarks proposed by Dwivedi et al. (2020) and Hu et al. (2020) to test the empirical performance of our proposed methods. In particular, to cover a wide variety of graphs and tasks, we chose:

1. ZINC, a graph regression dataset from molecular chemistry. The task is to predict a score defined as the subtraction of computed properties logP - SA, with logP being the computed octanol-water partition coefficient and SA the synthetic accessibility score (Jin et al., 2018).

2. CIFAR10, a graph classification dataset from computer vision (Krizhevsky, 2009). The task is to classify the images into 10 classes, with a total of 5000 training images per class and 1000 test images per class. Each image has 32 × 32 pixels, but the pixels have been clustered into a graph of ∼100 super-pixels. Each super-pixel becomes a node in an almost grid-shaped graph with 8 edges per node. The clustering uses the code from Knyazev et al. (2019) and results in a different number of super-pixels per graph.

3. PATTERN, a synthetic node classification benchmark generated with Stochastic Block Models, which are widely used to model communities in social networks. The task is to classify the nodes into 2 communities, and it tests the fundamental ability to recognize specific predetermined subgraphs.

4. MolHIV, a graph classification benchmark from molecular chemistry. The task is to predict whether a molecule inhibits HIV replication. The molecules in the training, validation and test sets are divided using a scaffold splitting procedure that splits the molecules based on their two-dimensional structural frameworks.

Our goal is to provide a fair comparison that demonstrates the capacity of our proposed aggregators. Therefore, we compare the various methods on both types of architectures using the same hyperparameters tuned in previous works (Corso et al., 2020) for similar networks.
The models vary exclusively in the aggregation method and in the width of the architectures, to keep a fixed parameter budget. In CIFAR10 it is impossible to numerically compute a deterministic vector field from the eigenvectors, since the multiplicity of λ_1 is greater than 1. This is caused by the symmetry of the square image and is extremely rare in real-world graphs. Therefore, we used the gradient of the image coordinates as the underlying vector field. Note that these directions are provided in the nodes' features in the dataset and are available to all models, that they are co-linear with the eigenvectors of the grid as per lemma C.1, and that they mimic the inductive bias of CNNs.

B.2 IMPLEMENTATION AND COMPUTATIONAL COMPLEXITY

Unlike several more expressive graph networks (Kondor et al., 2018; Maron et al., 2018), our method does not require a computational complexity superlinear in the size of the graph. The calculation of the first k eigenvectors during pretraining, done using the Lanczos method (Lanczos, 1950) and the sparse module of Scipy, has a time complexity of O(Ek), where E is the number of edges. During training, the complexity is equivalent to that of an m-aggregator GNN (Corso et al., 2020): O(Em) for the aggregation and O(Nm) for the MLP. To all the architectures we added residual connections (He et al., 2016), batch normalization (Ioffe & Szegedy, 2015) and graph size normalization (Dwivedi et al., 2020). For all the datasets with non-regular graphs, we combine the various aggregators with logarithmic degree-scalers as in Corso et al. (2020). An important point is that, for dynamic graphs, the eigenvectors need to be re-computed dynamically as the edges change. Fortunately, there are random-walk-based algorithms that can estimate φ_1 quickly, especially for small changes to the graph (Doshi & Eun, 2000). In the current empirical results, we do not work with dynamic graphs.
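The eigenvector precomputation can be sketched with Scipy's sparse Lanczos solver. This is a minimal example of our own on a 10-node path graph (not the paper's pretraining code; the combinatorial Laplacian is used here for simplicity):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh  # Lanczos-based sparse eigensolver

# Sparse adjacency of a path graph on 10 nodes.
N = 10
A = sp.diags([np.ones(N - 1), np.ones(N - 1)], offsets=[-1, 1], format="csr")
L = laplacian(A)

# First k eigenpairs (smallest eigenvalues) via the Lanczos method.
k = 4
vals, vecs = eigsh(L, k=k, which="SM")
order = np.argsort(vals)
vals, vecs = vals[order], vecs[:, order]

assert np.isclose(vals[0], 0.0, atol=1e-8)  # lambda_0 = 0 for a connected graph
assert np.allclose(vecs[:, 0], vecs[0, 0])  # its eigenvector is constant
```

Since the trivial eigenvector φ_0 is constant, only the remaining k - 1 vectors carry directional information, matching the paper's use of the low-frequency non-trivial eigenvectors.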

B.3 RUNNING TIME

The precomputation of the first four eigenvectors for all the graphs in the datasets takes 38s for ZINC, 96s for PATTERN and 120s for MolHIV on CPU. Table 1 shows the average running time on GPU for all the various models from figure 5. On average, the epoch running time is 16% slower for DGN compared to the mean aggregation, but the faster convergence of DGN means that the total training time is on average 8% faster for DGN.

B.4 UNIQUENESS OF THE EIGENVECTOR DIRECTIONS

The possibility of defining equivariant directions using the low-frequency Laplacian eigenvectors is subject to the uniqueness of those vectors. When the dimension of the eigenspaces associated with the lowest eigenvalues is 1, the eigenvectors are defined up to a constant factor. In section 2.4, we propose the use of unit-vector normalization and an absolute value to eliminate the scale and sign ambiguity. When the dimension of those eigenspaces is greater than 1, it is not possible to define equivariant directions using the eigenvectors. Fortunately, it is very rare for the Laplacian matrix to have repeated eigenvalues in real-world datasets. We validate this claim on the ZINC and PATTERN datasets, where we found no graphs with a repeated Fiedler vector and only one graph out of 26k where the multiplicity of the second eigenvector is greater than 1. When facing a graph with repeated Laplacian eigenvalues, we propose to shuffle, during training, different eigenvectors randomly sampled in the eigenspace. This technique acts as a data augmentation of the graph, allowing the network to train with multiple directions at the same time.

C APPENDIX - MATHEMATICAL PROOFS

C.1 PROOF FOR THEOREM 2.1 (DIRECTIONAL SMOOTHING)

The operation y = B_av x is the directional average of x, in the sense that y_u is the mean of x_v weighted by the direction and amplitude of F.

Proof.
The proof is straightforward: to take a weighted average of the neighbours, we multiply each neighbour by its weight and divide by the sum of the weights; since the weights |F̂_{i,:}| are positive, this is a proper weighted mean.

C.2 PROOF FOR THEOREM 2.2 (DIRECTIONAL DERIVATIVE)

Suppose F̂ has rows of unit L^1 norm. The operation y = B_dx(F̂)x is the centered directional derivative of x in the direction of F, in the sense of equation 4, i.e.

y = D_{F̂} x = ( F̂ - diag(Σ_j F̂_{:,j}) ) x

Proof. Since the rows of F̂ have unit L^1 norm, the i-th coordinate of the vector ( F̂ - diag(Σ_j F̂_{:,j}) ) x is

( F̂x - diag(Σ_j F̂_{:,j}) x )_i = Σ_j F̂_{i,j} x(j) - ( Σ_j F̂_{i,j} ) x(i) = Σ_{j:(i,j)∈E} ( x(j) - x(i) ) F̂_{i,j} = ( D_{F̂} x )(i)

C.3 PROOF FOR THEOREM 2.3 (K-GRADIENT OF THE LOW-FREQUENCY EIGENVECTORS)

Let λ_i and φ_i be the eigenvalues and eigenvectors of the normalized Laplacian L_norm of a connected graph, and let a, b = argmax_{1≤i,j≤n} { d_K(v_i, v_j) } be the nodes at the highest K-walk distance. Let m = argmin_{1≤i≤n} (φ_1)_i and M = argmax_{1≤i≤n} (φ_1)_i. Then d_K(v_m, v_M) - d_K(v_a, v_b) is of order O(1 - λ_2).

Proof. For this theorem, we use the indices i = 0, ..., N-1, sorted such that λ_i ≤ λ_{i+1}. Hence λ_0 = 0 and λ_1 is the first non-trivial eigenvalue. We first need Proposition 1 (the K-walk distance matrix, stated with its proof at the end of this appendix). Since φ_0 is constant and orthogonal to φ_1, we have

Σ_{i=0}^{n-1} φ_1(i) = 0  ⟺  φ_1(0) = - Σ_{i=1}^{n-1} φ_1(i)

We claim there exist p, q such that φ_1(p) < 0 and φ_1(q) > 0. If no such p, q existed, then φ_1(i) · φ_1(j) ≥ 0 for all i, j; multiplying both sides of the previous equation by φ_1(0) gives

φ_1(0)^2 = - Σ_{i=1}^{n-1} φ_1(i) · φ_1(0)  ⟹  φ_1(0)^2 ≤ 0

which is a contradiction, since by assumption φ_1(0) > 0. Hence there exist p, q such that φ_1(p) < 0 and φ_1(q) > 0.
Since φ_1 attains both positive and negative values, the quantity φ_1(i) φ_1(j) is minimised when it has negative sign and the highest absolute value, hence when i, j are associated with the negative and positive values of highest absolute value: the lowest and the highest values of φ_1. Hence

d_K(v_M, v_m) - d_K(v_a, v_b) = O(1 - λ_2)

C.4 INFORMAL ARGUMENT IN SUPPORT OF CONJECTURE 2.4 (GRADIENT STEPS REDUCE EXPECTED HITTING TIME)

Suppose that x, y are uniformly distributed random nodes such that φ_i(x) < φ_i(y). Let z be the node obtained from x by taking one step in the direction of ∇φ_i. Then the expected hitting time is decreased proportionally to λ_i^{-1} and

E_{x,y}[Q(z, y)] ≤ E_{x,y}[Q(x, y)]

As a reminder, the definition of a gradient step is given in definition 3, copied below: suppose the two neighbouring nodes x and z are such that φ(z) - φ(x) is maximal among the neighbours of x; then we say z is obtained from x by taking a step in the direction of the gradient ∇φ.

In (Chung & S.T. Yau, 2000), it is shown that the hitting time Q(x, y) is given by the equation

Q(x, y) = vol ( G(y, y)/d_y - G(x, y)/d_x )

with λ_k and φ_k the k-th eigenvalues and eigenvectors of the symmetric normalized Laplacian L_sym, vol the sum of the degrees of all nodes, d_x the degree of node x, and G Green's function of the graph:

G(x, y) = d_x^{1/2} d_y^{-1/2} Σ_{k>0} (1/λ_k) φ_k(x) φ_k(y)

Since the sign of the eigenvector is not deterministic, the choice φ_i(x) < φ_i(y) is used to simplify the argument without having to consider the change in sign. Supposing λ_1 ≪ λ_2, the first term of the sum in G has much more weight than the following terms.
With z obtained from x by taking a step in the direction of the gradient of φ_1, we have φ_1(z) - φ_1(x) > 0. We want to show that the following inequality holds:

E_{x,y}[Q(z, y)] < E_{x,y}[Q(x, y)]

which is equivalent to

E_{x,y}[G(z, y)] > E_{x,y}[G(x, y)]

By the hypothesis λ_1 ≪ λ_2, we can approximate G(x, y) ≈ d_x^{1/2} d_y^{-1/2} (1/λ_1) φ_1(x) φ_1(y), so the last inequality is equivalent to

E_{x,y}[ d_z^{1/2} d_y^{-1/2} (1/λ_1) φ_1(z) φ_1(y) ] > E_{x,y}[ d_x^{1/2} d_y^{-1/2} (1/λ_1) φ_1(x) φ_1(y) ]

Removing all equal terms from both sides, the inequality is equivalent to

E_{x,y}[ d_z^{1/2} φ_1(z) ] > E_{x,y}[ d_x^{1/2} φ_1(x) ]

Showing this last inequality is not easy. We know that φ_1(z) > φ_1(x), and since z is obtained by a step in the direction of ∇φ_1, it is less likely to be on the border of the graph, so we believe E[d_z] ≥ E[d_x]. Thus we believe the conjecture should hold in general, even without the assumption on λ_1 and λ_2 and for more eigenvectors than φ_1.

C.5 PROOF FOR LEMMA C.1 (COSINE EIGENVECTORS)

Consider the lattice graph Γ of size N_1 × N_2 × ... × N_n, with vertex set Π_{i=1,...,n} {1, ..., N_i}, where the vertices (x_i)_{i=1,...,n} and (y_i)_{i=1,...,n} are connected by an edge iff |x_i - y_i| = 1 for exactly one index i and 0 for all other indices. Note that there are no diagonal edges in the lattice. The eigenvectors of the Laplacian L(Γ) of the grid are given by φ_j.

Lemma C.1 (Cosine eigenvectors). The Laplacian of Γ has an eigenvalue 2 - 2cos(π/N_i) with an associated eigenvector φ_j that depends only on the variable in the i-th dimension and is constant in all others, with φ_j = 1_{N_1} ⊗ 1_{N_2} ⊗ ... ⊗ x_{1,N_i} ⊗ ... ⊗ 1_{N_n} and x_{1,N_i}(j) = cos( πj/N_i + π/(2N_i) ).

Proof. First, recall the well-known result that the path graph on N vertices, P_N, has eigenvalues λ_k = 2 - 2cos(πk/N) with associated eigenvectors x_k given coordinate-wise by x_k(i) = cos( πki/N + πk/(2N) ).

The Cartesian product of two graphs G = (V_G, E_G) and H = (V_H, E_H) is defined as G × H = (V_{G×H}, E_{G×H}) with V_{G×H} = V_G × V_H and ((u_1, u_2), (v_1, v_2)) ∈ E_{G×H} iff either u_1 = v_1 and (u_2, v_2) ∈ E_H, or (u_1, v_1) ∈ E_G and u_2 = v_2. It is shown in (Fiedler, 1973) that if (μ_i)_{i=1,...,m} and (λ_j)_{j=1,...,n} are the eigenvalues of G and H respectively, then the eigenvalues of the Cartesian product graph G × H are μ_i + λ_j for all possible eigenvalues μ_i and λ_j. Also, the eigenvectors associated with the eigenvalue μ_i + λ_j are u_i ⊗ v_j, with u_i an eigenvector of the Laplacian of G associated with μ_i and v_j an eigenvector of the Laplacian of H associated with λ_j. Finally, noticing that a lattice of shape N_1 × N_2 × ... × N_n is exactly the Cartesian product of path graphs of lengths N_1 up to N_n, we conclude that there are eigenvalues 2 - 2cos(π/N_i). Denoting by 1_{N_j} the all-ones vector in R^{N_j}, the eigenvector associated with the eigenvalue 2 - 2cos(π/N_i) is 1_{N_1} ⊗ 1_{N_2} ⊗ ... ⊗ x_{1,N_i} ⊗ ... ⊗ 1_{N_n}, where x_{1,N_i} is the eigenvector of the Laplacian of P_{N_i} associated with its first non-zero eigenvalue 2 - 2cos(π/N_i).
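The closed-form eigenpairs of the path graph used in the proof can be checked numerically. A small numpy sketch of our own, verifying L x_k = λ_k x_k for every k:

```python
import numpy as np

# Combinatorial Laplacian of the path graph P_N.
N = 8
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Closed-form eigenpairs: lambda_k = 2 - 2 cos(pi k / N),
# x_k(i) = cos(pi k i / N + pi k / (2 N))  (the DCT-II basis).
i = np.arange(N)
for k in range(N):
    lam = 2 - 2 * np.cos(np.pi * k / N)
    x = np.cos(np.pi * k * i / N + np.pi * k / (2 * N))
    assert np.allclose(L @ x, lam * x, atol=1e-10)
```

The k = 1 vector is the monotone cosine whose gradient provides the canonical axis direction on the path, and by the Cartesian-product argument above the same vectors give the axes of any lattice.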

C.6 RADIUS-1 CONVOLUTION IN A GRID

In this section we show that any radius-1 kernel can be written as a linear combination of the B_dx(∇φ_i) and B_av(∇φ_i) matrices for a suitable choice of eigenvectors φ_i. First, we show this can be done for 1-dimensional kernels.

Theorem C.2. On a path graph, any 1D convolutional kernel of size 3 is a linear combination of the aggregators B_av, B_dx and the identity I.

Proof. Recall from the previous section that the eigenvector associated with the first non-zero eigenvalue of the path graph P_N is φ_1(i) = cos( πi/N + π/(2N) ). Since this is a monotone decreasing function of i, the i-th row of ∇φ_1 is (0, ..., 0, s_{i-1}, 0, -s_{i+1}, 0, ..., 0) with s_{i-1}, s_{i+1} > 0. We are trying to solve

( a B_av + b B_dx + c I )_{i,:} = (0, ..., 0, x, y, z, 0, ..., 0)

with x, y, z in positions i-1, i and i+1. This simplifies to solving

a |s| / ||s||_{L^1} + b s / ||s||_{L^1} + c (0, 1, 0) = (x, y, z)

with s = (s_{i-1}, 0, -s_{i+1}), which always has a solution because s_{i-1}, s_{i+1} > 0.

Theorem C.3 (Generalization of radius-1 convolutional kernels in a grid). Let Γ be the n-dimensional lattice as above and let φ_j be the eigenvectors of the Laplacian of the lattice as in lemma C.1. Then any radius-1 kernel k on Γ is a linear combination of the aggregators B_av(φ_i), B_dx(φ_i) and I.

Proof. This is a direct consequence of theorem C.2.

Theorem 2.7 states that, for an n-dimensional lattice, any convolutional kernel of radius R can be realized by a linear combination of directional aggregation matrices and their compositions.

Proof. For clarity, we first do the 2-dimensional case with radius 2, then extend to the general case. Let k be the radius-2 kernel on a grid represented by the matrix

a_{5×5} =
( 0         0          a_{-2,0}   0         0        )
( 0         a_{-1,-1}  a_{-1,0}   a_{-1,1}  0        )
( a_{0,-2}  a_{0,-1}   a_{0,0}    a_{0,1}   a_{0,2}  )
( 0         a_{1,-1}   a_{1,0}    a_{1,1}   0        )
( 0         0          a_{2,0}    0         0        )

Since we supposed the N_1 × N_2 grid is such that N_1 > N_2, by lemma C.1 the eigenvector φ_1 depends only on the first variable x_1 and is monotone in x_1. Recall from lemma C.1 that φ_1(i) = cos( πi/N_1 + π/(2N_1) ). The vector (N_1/π) ∇arccos(φ_1) will be denoted F_1 in the rest.
Notice that all entries of F_1 are 0 or ±1. Denote by F_2 the gradient vector (N_2/π) ∇arccos(φ_k), where φ_k is the eigenvector given by lemma C.1 that depends only on the second variable x_2 and is monotone in x_2.

Then, to show that DGNs are strictly more powerful than the 1-WL test, it suffices to provide an example of a pair of graphs that DGNs can differentiate and 1-WL cannot. Such a pair of graphs is illustrated in figure 8. The 1-WL test (as any MPNN with, for example, a sum aggregator) will always produce the same features for all the nodes labelled a and for all the nodes labelled b and will therefore classify the graphs as isomorphic. DGNs, via the directional smoothing or directional derivative aggregators based on the first eigenvector of the Laplacian matrix, will update the features of the a nodes differently in the two graphs (figure 8 also presents the aggregation functions) and are therefore capable of distinguishing them.
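The 1-WL side of this argument can be sketched with a few lines of colour refinement. The pair below (a 6-cycle versus two disjoint triangles) is our own choice of a classic 1-WL-indistinguishable pair, not the pair from figure 8:

```python
from collections import Counter

def wl_colors(n, edges, rounds=5):
    """1-WL colour refinement: returns the final multiset of node colours."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    colors = {v: len(adj[v]) for v in range(n)}  # initialise with degrees
    for _ in range(rounds):
        # New colour = old colour plus the sorted multiset of neighbour colours.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in range(n)}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in range(n)}
    return Counter(colors.values())

cycle6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]

# Both graphs are 2-regular, so 1-WL sees identical colour histograms
# and wrongly reports them as possibly isomorphic.
assert wl_colors(6, cycle6) == wl_colors(6, two_triangles)
```

Any MPNN whose expressiveness is bounded by 1-WL inherits this failure, which is the gap that the eigenvector-based directional aggregators exploit.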



Figure 1: Overview of the steps required to aggregate messages in the direction of the eigenvectors.

Figure 2: Possible directional flows in different types of graphs. The node coloring is a potential map and the edges represent the gradient of the potential, with the arrows in the direction of the flow. The first 3 columns present the arccosine of the normalized eigenvectors (arccos φ) as node coloring, and their gradients represented as edge intensity. The last column presents examples of inductive bias introduced in the choice of direction. (a) The eigenvectors 1 and 2 are the horizontal and vertical flows of the grid. (b) The eigenvectors 1 and 2 are the flows in the longest and second-longest directions. (c) The eigenvectors 1, 2 and 3 flow respectively in the South-North, suburbs-to-city-center and West-East directions. We ignore φ_0 since it is constant and has no direction.

Figure 3: Illustration of how the directional aggregation works at a node n v , with the arrows representing the direction and intensity of the field F .

C.2, obtained by adding n 1-dimensional kernels, each along a different axis of the grid, as per Lemma C.1. See figure 4 for a visual example in 2D.

C.7 PROOF FOR THEOREM 2.7 (GENERALIZATION RADIUS-R CONVOLUTIONAL KERNEL IN A LATTICE)

Figure 8: Illustration of an example pair of graphs which the 1-WL test cannot distinguish but DGNs can. The table shows the node feature updates done at every layer. MPNNs with mean/sum aggregators and the 1-WL test only use the updates in the first row and therefore cannot distinguish between the nodes in the two graphs. DGNs also use directional aggregators that, with the vector field given by the first eigenvector of the Laplacian matrix, provide different updates to the nodes in the two graphs.

Directional vector field between the nodes v and u

(Corso et al., 2020). All the models use ∼100k parameters, except those marked with *, which use 300k to 1.9M. In ZINC the DGN aggregators are {mean, dx_1, max, min}; in PATTERN {mean, dx_1, av_1}; in CIFAR10 {mean, dx_1, dx_2, max}; in MolHIV {mean, dx_1, av_1, max, min}.
the directional models, thus it cannot fit the data well when the rotation/distortion is too strong, since the directions are less informative. We expect large models to perform better at high angles.

Average running time for the non-fine-tuned models from figure 5. Each entry represents average time per epoch / average total training time. Each of these models has a parameter budget of ∼100k and was run on a Tesla T4 (15GB GPU). The "avg increase" row is the average relative running time of all rows compared to the mean row, with a negative value meaning a faster running time.

AUTHOR CONTRIBUTIONS Anonymous

Proposition 1 (K-walk distance matrix). The K-walk distance matrix P associated with a graph, i.e. the matrix such that (P)_{i,j} = d_K(v_i, v_j), can be written as Σ_{p=1}^{K} W^p, where W = D^{-1}A is the random walk matrix.

Let W = D^{-1}A be the random walk matrix of the graph. First, we show that W is jointly diagonalizable with L_norm, and we relate its eigenvectors φ'_i and eigenvalues λ'_i to those of L_norm. Indeed, L_sym is a real symmetric positive semi-definite matrix, hence diagonalizable by the spectral theorem. Since L_norm is similar to L_sym with similarity matrix D^{1/2}, a positive definite matrix, L_norm is also diagonalizable and positive semi-definite. Since W = I - L_norm, the random walk matrix is jointly diagonalizable with the random walk Laplacian, and their eigenvalues and eigenvectors are related by φ'_i = φ_{n-1-i} and λ'_i = 1 - λ_{n-1-i}. Moreover, the constant eigenvector associated with eigenvalue 0 of the random walk Laplacian is the eigenvector associated with the highest eigenvalue of the random walk matrix, and by the formula above, λ'_{n-1} = 1 - λ_0 = 1.

Now we approximate the K-walk distance matrix P using the two eigenvectors of the random walk matrix associated with the highest eigenvalues. By Proposition 1 we have P = Σ_{p=1}^{K} W^p, which can be expanded by eigen-decomposition. Since λ'_{n-1-i} = 1 - λ_i and λ_1 ≪ λ_2, we have λ'_{n-2} ≫ λ'_{n-3}, hence we can approximate

(P)_{i,j} ≈ c + c_p φ_1(i) φ_1(j)

where c comes from the constant eigenvector φ'_{n-1} and c_p = Σ_{p=1}^{K} (1 - λ_1)^p is a positive constant. Now we show that the farthest nodes with respect to the K-walk distance are the ones associated with the highest and lowest values of φ_1. Indeed, to choose i, j at the farthest distance, we need to minimise c_p φ_1(i) φ_1(j), which is minimal when φ_1(i) φ_1(j) is minimal. We then show that there exist p, q such that φ_1(p) < 0 and φ_1(q) > 0. Since the eigenvector is non-zero, without loss of generality assume φ_1(0) > 0.
Since φ_0 and φ_1 are eigenvectors associated with different eigenvalues of a real symmetric matrix, they are orthogonal: ⟨φ_0, φ_1⟩ = 0.

Given a matrix B, denote by B^± the positive and negative parts of B, i.e. the matrices with positive entries such that B = B^+ - B^-. Let B_{r1} be a matrix representing the radius-1 kernel with the weights of the inner 3×3 block of a_{5×5}; such a B_{r1} can be obtained by theorem C.3. The radius-2 kernel k is then realized by adding to B_{r1} all possible combinations of 2 positive/negative steps B^{sgn(v)}_{fb} and all possible single steps, with sgn the sign function, sgn(i) = + if i ≥ 0 and - if i < 0. The matrix B_{r2} obtained in this way realises the kernel a_{5×5}. We can further extend the above construction to n-dimensional grids and radius-R kernels k:

B_R = Σ_{V, ||V||_{L^1} ≤ R} a_V Π_{j=1}^{n} ( B^{sgn(v_j)}_{fb}(F_j) )^{|v_j|}

where the sum runs over every choice of walk V with at most R steps, with F_j = (N_j/π) ∇arccos(φ_j), φ_j the eigenvector with the lowest eigenvalue depending only on the j-th variable as given in lemma C.1, and Π denoting matrix multiplication. V represents all the choices of walk {v_1, v_2, ..., v_n} in the direction of the fields {F_1, F_2, ..., F_n}. For example, V = {3, 1, 0, -2} has radius R = 6, with 3 steps forward of F_1, 1 step forward of F_2, and 2 steps backward of F_4.

C.8 PROOF FOR THEOREM 2.8 (COMPARISON WITH 1-WL TEST)

DGNs using the mean aggregator, any directional aggregator of the first eigenvector, and injective degree-scalers are strictly more powerful than the 1-WL test.

Proof. We show that (1) DGNs are at least as powerful as the 1-WL test and (2) there is a pair of graphs which is not distinguishable by the 1-WL test but which DGNs can discriminate. Since DGNs include the mean aggregator combined with at least one injective degree-scaler, Corso et al. (2020) show that the resulting architecture is at least as powerful as the 1-WL test.

