

Abstract

Graph convolutional networks (GCNs) enable end-to-end learning on graph-structured data. However, many works begin by assuming a given graph structure. As the ideal graph structure is often unknown, this limits applicability. To address this, we present a novel end-to-end differentiable graph-generator which builds the graph topology on the fly. Our module can be readily integrated into existing pipelines involving graph convolution operations, replacing the predetermined or existing adjacency matrix with one that is learned and optimized as part of the general objective. As such, it is applicable to any GCN. We show that integrating our module into both node classification and trajectory prediction pipelines improves accuracy across a range of datasets and backbones.

1. Introduction

The success of Graph Neural Networks (GNNs) (Duvenaud et al., 2015; Bronstein et al., 2017; Monti et al., 2017) has led to a surge in the use of graph-based representation learning. GNNs provide an efficient framework to learn from graph-structured data, making them widely applicable in any domain where data can be represented as a relation or interaction system. They have been successfully applied in a wide range of tasks including particle physics (Choma et al., 2018), protein science (Gainza et al., 2020) and many others (Monti et al., 2019; Stokes et al., 2020). In a GNN, each node iteratively updates its state by interacting with its neighbors, typically through message passing.

However, a fundamental limitation of such architectures is the assumption that the underlying graph is provided. While node or edge features may be updated during message passing, the graph topology remains fixed, and its choice may be suboptimal for various reasons. For instance, when classifying nodes in a citation network, an edge connecting nodes of different classes can diminish classification accuracy: such edges degrade performance by propagating irrelevant information across the graph. When no graph is explicitly provided, one common practice is to generate a k-nearest neighbor (k-NN) graph. In such cases, k is a hyperparameter tuned to find the model with the best performance. For many applications, fixing k is overly restrictive, as the optimal choice of k may vary for each node in the graph. While approaches that learn the graph structure for use in downstream GNNs have emerged (Zheng et al., 2020; Kazi et al., 2020; Kipf et al., 2018), all of them treat the node degree k as a fixed hyperparameter. We propose a general differentiable graph-generator (DGG) module for learning graph topology with or without an initial edge structure.
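The fixed-k heuristic critiqued above can be sketched as follows. This is a minimal NumPy illustration of the standard k-NN graph construction, not part of the proposed method; `knn_adjacency` is a hypothetical helper name.

```python
import numpy as np

def knn_adjacency(x: np.ndarray, k: int) -> np.ndarray:
    """Build a directed k-NN adjacency matrix from node features.

    x: (N, D) node feature matrix; k: fixed neighbourhood size.
    Returns a binary (N, N) matrix with a[i, j] = 1 if j is one of
    the k nearest neighbours of i (self-loops excluded).
    """
    n = x.shape[0]
    # Pairwise squared Euclidean distances between all nodes.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]  # k closest nodes per row
    a = np.zeros((n, n))
    a[np.arange(n)[:, None], nbrs] = 1.0
    return a

x = np.random.default_rng(0).normal(size=(5, 3))
a = knn_adjacency(x, k=2)
# Every node gets exactly k outgoing edges -- the same k for all,
# which is precisely the restriction the DGG module removes.
assert (a.sum(axis=1) == 2).all()
```

Note that the node degree k is applied uniformly: no node can have a larger or smaller neighbourhood, regardless of how informative its neighbours are.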
This module can be placed within any graph convolutional network and jointly optimized with the rest of the network's parameters, learning topologies which favor the downstream task without hyperparameter selection or indeed any additional training signal. The primary contributions of this paper are as follows:

1. We propose a novel, differentiable graph-generator (DGG) module which jointly optimizes both the neighbourhood size and the edges that should belong to each neighbourhood. Note that existing approaches (Zheng et al., 2020; Kipf et al., 2018; Kazi et al., 2020) do not allow for learnable neighbourhood sizes.

2. Our DGG module is directly integrable into any pipeline involving graph convolutions, where the given adjacency matrix is either noisy, or not explicitly provided and must be determined heuristically. In both cases, our DGG generates the adjacency matrix as part of the GNN training and can be trained end-to-end to optimize performance on the downstream task. Should a good graph structure be known, the generated adjacency matrix can be learned to remain close to it while optimizing performance.

3. To demonstrate the power of the approach, we integrate our DGG within a range of state-of-the-art pipelines, without modification, across different datasets in trajectory prediction and node classification, and demonstrate improvements in model accuracy.

2. Related Work

Graph Representation Learning: GNNs (Bronstein et al., 2017) provide a powerful class of neural architectures for modelling data which can be represented as a set of nodes and relations (edges). Most use message passing to build node representations by aggregating neighborhood information. A common formulation is the graph convolutional network (GCN), which generalizes the convolution operation to graphs (Kipf & Welling, 2017; Defferrard et al., 2016; Wu et al., 2018; Hamilton et al., 2017). More recently, the Graph Attention Network (GAT) (Veličković et al., 2018) utilizes a self-attention mechanism to aggregate neighborhood information. However, these works assume that the underlying graph structure is predetermined, with the graph convolutions learning features that describe preexisting nodes and edges. In contrast, we learn the graph structure while simultaneously using our generated adjacency matrix in downstream graph convolutions. The generated graph topology of our module is jointly optimized alongside other network parameters with feedback signals from the downstream task.

Graph Structure Learning: In many applications, the optimal graph is unknown, and a graph is constructed before training a GNN. One question to ask is: "Why isn't a fully-connected graph suitable?" Constructing adjacency matrices weighted by distance, or even by an attention mechanism (Veličković et al., 2018) over a fully-connected graph, incorporates many task-irrelevant edges, even if their weights are small. While an attention mechanism can zero these out, i.e., discover a subgraph within the complete graph, discovering this subgraph is challenging given the combinatorial complexity of graphs. A common remedy is to sparsify a complete graph by selecting the k-nearest neighbors (k-NN). Although this can prevent the propagation of irrelevant information between nodes, the topology of the constructed graph may have no relation to the downstream task.
Not only can irrelevant edges still exist, but pairs of relevant nodes may remain unconnected, which can lead GCNs to learn representations with poor generalization (Zheng et al., 2020). This limitation has led to works which learn a graph's structure within a deep learning framework. Some methods (Shi et al., 2019; Liu et al., 2020) take a fixed adjacency matrix as input and then learn a residual mask over it. Since these methods directly optimize the residual adjacency by treating each element as a learnable parameter, the learned adjacency matrix is not linked to the representation space and only works in tasks where the training nodes are the same as those at test time. To overcome this, recent approaches (Zheng et al., 2020; Kipf et al., 2018; Luo et al., 2021; Kazi et al., 2020) generate a graph structure by sampling from discrete distributions. As discrete sampling is not directly optimizable using gradient descent, these methods use the Gumbel-Softmax reparameterization trick (Jang et al., 2016) to generate differentiable graph samples. The Gumbel-Softmax approximates an argmax over the edges for each node, and sampling in these approaches is typically performed k times to obtain the top-k edges. Here, k is a specified hyperparameter that controls the node degree for the entire graph/dataset. Unlike these works, we generate edge samples by selecting the top-k in a differentiable manner, where we learn a distribution over the edges and over the node degree k. This allows the neighborhood and its size to be individually selected for each node. Additionally, a known 'ideal' graph structure can be used as intermediate supervision to further constrain the latent space.
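The Gumbel-Softmax relaxation used by these prior approaches can be sketched as follows. This is a minimal NumPy illustration of the general technique (Jang et al., 2016), not any specific method's implementation; in practice the computation runs on autograd tensors so gradients flow through the relaxed samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """Draw one relaxed one-hot sample over a node's candidate edges.

    Adding Gumbel(0, 1) noise to the logits and applying a
    temperature-tau softmax gives a differentiable approximation
    of sampling an argmax from the categorical distribution.
    """
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())              # numerically stable softmax
    return y / y.sum()

# Edge logits for one node over 6 candidate neighbours.
logits = np.array([2.0, 0.1, -1.0, 0.5, 1.5, -0.5])
# Prior approaches draw k such samples to select the top-k edges,
# with k fixed globally as a hyperparameter.
samples = [gumbel_softmax(logits) for _ in range(3)]
assert all(np.isclose(s.sum(), 1.0) for s in samples)
```

Lowering tau sharpens each sample toward a hard one-hot edge selection, at the cost of higher-variance gradients.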

3. Method

In this section, we provide details of our differentiable graph-generator (DGG) module. We begin with notation and the statistical learning framework guiding its design, before describing the module and how it is combined with graph convolutional backbone architectures.

Notation. We represent a graph of N nodes as G = (V, E), where V is the set of nodes or vertices and E the edge set. A graph's structure can be described by its adjacency matrix A, with a_ij = 1 if an edge connects nodes i and j and a_ij = 0 otherwise. This binary adjacency matrix A is directed, and potentially asymmetric.

Problem definition. We reformulate the baseline prediction task, based on a fixed graph, as an adaptive variant where the graph is learned. Typically, such baseline tasks make learned predictions Y given a set of input features X and a graph structure A of node degree k: Y = Q_ϕ(X, A),
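As a concrete instance of the baseline formulation, Q_ϕ might be a single graph-convolution layer applied to X under a fixed A. The following is a minimal sketch, with a hypothetical weight matrix w standing in for the parameters ϕ; since A here is directed, we use simple row-wise degree normalisation rather than the symmetric normalisation of Kipf & Welling (2017).

```python
import numpy as np

def gcn_layer(x, a, w):
    """One graph-convolution step: aggregate neighbours, then project.

    x: (N, D) node features, a: (N, N) binary directed adjacency,
    w: (D, F) weights (a stand-in for the learnable parameters phi).
    Self-loops are added and rows are degree-normalised, so each node
    averages features over its out-neighbourhood before projection.
    """
    a_hat = a + np.eye(a.shape[0])               # add self-loops
    a_hat = a_hat / a_hat.sum(1, keepdims=True)  # row-normalise by degree
    return np.maximum(a_hat @ x @ w, 0.0)        # linear map + ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
a = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = gcn_layer(x, a, rng.normal(size=(3, 2)))
assert y.shape == (4, 2)
```

In the adaptive variant, the fixed `a` passed to such a layer is replaced by an adjacency matrix produced by the DGG module and optimised jointly with `w`.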

