DIRECTIONAL GRAPH NETWORKS

Abstract

In order to overcome the expressive limitations of graph neural networks (GNNs), we propose the first method that exploits vector flows over graphs to develop globally consistent directional and asymmetric aggregation functions. We show that our directional graph networks (DGNs) generalize convolutional neural networks (CNNs) when applied on a grid. Whereas recent theoretical works focus on understanding local neighbourhoods, local structures and local isomorphism with no global information flow, our novel theoretical framework allows directional convolutional kernels in any graph. First, by defining a vector field in the graph, we develop a method of applying directional derivatives and smoothing by projecting node-specific messages into the field. Then we propose the use of the Laplacian eigenvectors as such a vector field, and we show that the method generalizes CNNs on an n-dimensional grid and is provably more discriminative than standard GNNs with respect to the Weisfeiler-Lehman 1-WL test. Finally, we bring the power of CNN data augmentation to graphs by providing a means of doing reflection, rotation and distortion on the underlying directional field. We evaluate our method on different standard benchmarks and see a relative error reduction of 8% on the CIFAR10 graph dataset and of 11% to 32% on the molecular ZINC dataset. An important outcome of this work is that it enables any physical or biological problem with intrinsic directional axes to be translated into a graph network formalism with an embedded directional field.

1. INTRODUCTION

One of the most important distinctions between convolutional neural networks (CNNs) and graph neural networks (GNNs) is that CNNs allow for any convolutional kernel, while most GNN methods are limited to symmetric kernels (also called isotropic kernels in the literature) (Kipf & Welling, 2016; Xu et al., 2018a; Gilmer et al., 2017). There are some implementations of asymmetric kernels using gated mechanisms (Bresson & Laurent, 2017; Veličković et al., 2017), motif attention (Peng et al., 2019), edge features (Gilmer et al., 2017), or the 3D structure of molecules for message passing (Klicpera et al., 2019). However, to the best of our knowledge, there are currently no methods that allow asymmetric graph kernels that depend on the full graph structure or on directional flows; they all depend on local structures or local features. This is in opposition to images, which exhibit canonical directions: the horizontal and vertical axes. The absence of an analogous concept in graphs makes it difficult to define directional message passing and to produce an analogue of the directional frequency filters (or Gabor filters) widely used in image processing (Olah et al., 2020). We propose a novel idea for GNNs: use vector fields in the graph to define directions for the propagation of information, with an overview of the paper presented in Figure 1. Hence, the aggregation or message passing will be projected onto these directions, so that the contribution of each neighbouring node n_v is weighted by its alignment with the vector fields at the receiving node n_u. This enables our method to propagate information via directional derivatives or smoothing of the features. We also explore using the gradients of the low-frequency eigenvectors φ_k of the Laplacian of the graph, since they exhibit interesting properties (Bronstein et al., 2017; Chung et al., 1997).
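As a toy illustration of this projection idea, the sketch below (our own minimal NumPy example with hypothetical names, not the implementation used in the paper) weights each neighbour's message by the value of an anti-symmetric edge field F at the receiving node, yielding an inherently asymmetric aggregation:

```python
import numpy as np

def directional_message_passing(A, F, X, eps=1e-8):
    """Weight each neighbour's message by the edge field F (anti-symmetric),
    instead of the uniform weights of an isotropic GNN aggregation.
    A: adjacency (N x N), F: edge field (N x N), X: node features (N x d)."""
    W = np.where(A > 0, F, 0.0)                       # field restricted to existing edges
    W = W / (np.abs(W).sum(1, keepdims=True) + eps)   # L1 normalization per receiving node
    return W @ X                                      # field-weighted sum of neighbour features

# A 3-node path with a field flowing "left to right": the centre node receives
# a centred difference of its neighbours' features, a direction-aware signal
# that a permutation-invariant (symmetric) aggregator cannot produce.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
F = np.array([[0, 1, 0], [-1, 0, 1], [0, -1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])
print(directional_message_passing(A, F, X).ravel())  # ≈ [2., 1., -2.]
```

Swapping the features of the two end nodes flips the sign of the centre node's message, which is exactly the sensitivity to neighbour ordering that symmetric kernels lack.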
In particular, they can be used to define optimal partitions of the nodes in a graph, to give a natural ordering (Levy, 2006), and to find the dominant directions of the graph diffusion process (Chung & Yau, 2000). Further, we show that they generalize the horizontal and vertical directional flows in a grid (see figure 2), allowing them to guide the aggregation and mimic the asymmetric and directional kernels present in computer vision. In fact, we demonstrate mathematically that our work generalizes CNNs by reproducing all convolutional kernels of radius R in an n-dimensional grid, while also bringing the powerful data augmentation capabilities of reflection, rotation or distortion of the directions. We further show that our directional graph network (DGN) model theoretically and empirically allows for efficient message passing across distant communities, which reduces the well-known problem of over-smoothing and aligns well with the need for independent aggregation rules (Corso et al., 2020). Alternative methods reduce the impact of over-smoothing by using skip connections (Luan et al., 2019), global pooling (Alon & Yahav, 2020), or by randomly dropping edges during training (Rong et al., 2020), but without solving the underlying problem. In fact, we also prove that DGN is more discriminative than standard GNNs with respect to the Weisfeiler-Lehman 1-WL test, showing that the reduction of over-smoothing is accompanied by an increase in expressiveness. Our method distinguishes itself from other spectral GNNs, since the literature usually uses the low frequencies to estimate local Fourier transforms in the graph (Levie et al., 2018; Xu et al., 2019). Instead, we do not try to approximate the Fourier transform, but only to define a directional flow at each node and to guide the aggregation.

[Figure 1 panel text. Pipeline: input graph A; compute the first k non-trivial eigenvectors; compute the gradient; create the aggregation matrices B; aggregation of neighbouring features; MLP.]

Pre-computed steps, complexity O(kE). The a-directional adjacency matrix A is given as an input, and the Laplacian matrix L is computed from it. Both A and L are of size N × N, where N is the number of nodes; the matrices are often sparse, with E being the number of edges. The eigenvectors φ of L are computed and sorted such that φ_1 has the lowest non-zero eigenvalue and φ_k the k-th lowest. This step is the most computationally expensive, although there are methods that compute the first k eigenvectors with a complexity of O(kE). The gradient of φ is a function of the edges (a matrix) such that ∇φ_ij = φ_i − φ_j if the nodes i and j are connected, and ∇φ_ij = 0 otherwise. If the graph has a known direction, it can be encoded as a field F (an anti-symmetric matrix) instead of using ∇φ. Each row F_i,: of the field is normalized by its L1 norm to create the aggregation matrices: F̂_i,: = F_i,: / (‖F_i,:‖_L1 + ε). Then B_av = |F̂| is the directional smoothing matrix, and B_dx = F̂ − diag(Σ_j F̂_:,j) is the directional derivative matrix.

Graph neural network steps, complexity O(kE + kN). A graph with node features is given: X^0 is the feature matrix of the graph at the 0-th GNN layer, with N rows (the number of nodes) and n_0 columns (the number of input features). The aggregation matrices B^{1,...,k}_{av,dx} are taken from the pre-computed steps and used to aggregate the features X^0; for B_dx we take the absolute value of the result due to the sign ambiguity of φ. Since B is similar to a weighted adjacency matrix (with possibly negative weights), the aggregation is simply the matrix product with the feature matrix. Other, non-directional aggregators are also used, such as the mean aggregation D^{-1} A X^0. The resulting matrix Y^0 is the column-wise concatenation of all the aggregations, with complexity O(kE), or O(E) if the aggregations are parallelized. The MLP is the only step with learned parameters: as in the GCN method, each aggregation is followed by a multi-layer perceptron (MLP) on all the features. In our case the MLP is applied to the columns of Y^0, giving a complexity of O(kN). Letting n_t be the number of features at the t-th layer, X^0 has n_0 columns, Y^0 has (2k+1)·n_0 columns, and X^1 has n_1 columns.
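The aggregation pipeline described above (Laplacian eigenvectors, gradient field, B_av and B_dx matrices, column-wise concatenation) can be sketched in a few lines of dense NumPy. This is a simplified illustration under the stated definitions, not the authors' reference implementation; the helper name is ours, and we assume a connected graph with no isolated nodes:

```python
import numpy as np

def dgn_aggregate(A, X, k=2, eps=1e-8):
    """One DGN aggregation: eigenvectors -> gradient fields -> B_av/B_dx -> Y^0."""
    deg = A.sum(1)
    L = np.diag(deg) - A                       # graph Laplacian
    _, V = np.linalg.eigh(L)                   # eigenvalues in ascending order
    outputs = [(A / deg[:, None]) @ X]         # mean aggregation D^{-1} A X^0
    for i in range(1, k + 1):                  # first k non-trivial eigenvectors
        phi = V[:, i]
        F = np.where(A > 0, phi[:, None] - phi[None, :], 0.0)  # grad: F_ij = phi_i - phi_j
        Fhat = F / (np.abs(F).sum(1, keepdims=True) + eps)     # row-wise L1 normalization
        B_av = np.abs(Fhat)                                    # directional smoothing
        B_dx = Fhat - np.diag(Fhat.sum(1))                     # directional derivative
        outputs += [B_av @ X, np.abs(B_dx @ X)]  # |.| handles the sign ambiguity of phi
    return np.concatenate(outputs, axis=1)       # Y^0: (2k+1) * n_0 columns
```

For a graph with N nodes and n_0 input features, dgn_aggregate(A, X, k) returns a matrix with (2k+1)·n_0 columns, matching the column count given above; each layer's MLP is then applied to those columns.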

2.1. INTUITIVE OVERVIEW

One of the biggest limitations of current GNN methods compared to CNNs is the inability to do message passing in a specific direction, such as the horizontal one in a grid graph. In fact, it is difficult to define directions or coordinates based solely on the shape of the graph. The lack of directions strongly limits the ability of GNNs to discriminate local structures and simple feature transformations. Most GNNs are invariant to the permutation of the neighbours' features, so the signal a node receives is not influenced by swapping the features of two neighbours. Therefore, several layers in a deep network will be employed to understand these simple changes instead of being used for higher-level features, thus over-squashing the messages sent between two distant nodes (Alon & Yahav, 2020). In this work, one of the main contributions is the realisation that low-frequency eigenvectors of the Laplacian can overcome this limitation by providing a variety of intuitive directional flows. As a first example, taking a grid-shaped graph of size N × M with N/2 < M < N, we find that the eigenvector



Figure 1: Overview of the steps required to aggregate messages in the direction of the eigenvectors.
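The grid-graph behaviour described above is easy to verify numerically. The sketch below (our own check, not code from the paper) builds the Laplacian of an N × M grid with N/2 < M < N and shows that the first non-trivial eigenvector is constant along the short axis and varies along the long one, i.e. it defines a "horizontal" flow across the grid:

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of a path graph with n nodes."""
    A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.diag(A.sum(1)) - A

N, M = 8, 5  # grid with N/2 < M < N
# Laplacian of the N x M grid as a Cartesian product of two paths.
L = np.kron(path_laplacian(N), np.eye(M)) + np.kron(np.eye(N), path_laplacian(M))
eigvals, V = np.linalg.eigh(L)                # eigenvalues in ascending order
phi1 = V[:, 1].reshape(N, M)                  # first non-trivial (Fiedler) eigenvector

# Constant within each row (short axis), nonzero variation across rows (long axis):
print(np.abs(phi1 - phi1[:, :1]).max())       # ~ 0
print(np.abs(np.diff(phi1[:, 0])).min())      # > 0
```

Because the smallest non-zero eigenvalue of the product graph comes from the longer path, its eigenvector depends only on the long-axis coordinate, which is what makes it usable as a global horizontal direction.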

