SLAPS: SELF-SUPERVISION IMPROVES STRUCTURE LEARNING FOR GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) work well when the graph structure is provided. However, this structure may not always be available in real-world applications. One solution to this problem is to infer the latent structure and then apply a GNN to the inferred graph. Unfortunately, the space of possible graph structures grows super-exponentially with the number of nodes and so the available node labels may be insufficient for learning both the structure and the GNN parameters. In this work, we propose the Simultaneous Learning of Adjacency and GNN Parameters with Self-supervision, or SLAPS, a method that provides more supervision for inferring a graph structure. This approach consists of training a denoising autoencoder GNN in parallel with the task-specific GNN. The autoencoder is trained to reconstruct the initial node features given noisy node features as well as a structure provided by a learnable graph generator. We explore the design space of SLAPS by comparing different graph generation and symmetrization approaches. A comprehensive experimental study demonstrates that SLAPS scales to large graphs with hundreds of thousands of nodes and outperforms several models that have been proposed to learn a task-specific graph structure on established benchmarks.

1. INTRODUCTION

Graph representation learning has grown rapidly and found applications in domains where data points define a graph (Chami et al., 2020; Kazemi et al., 2020). Graph neural networks (GNNs) (Scarselli et al., 2008) have been a key component to the success of research in this area. Following the success of graph convolutional networks (GCNs) (Kipf & Welling, 2017) on semi-supervised node classification, several other GNN variants have been proposed for different prediction tasks on graphs (Hamilton et al., 2017; Veličković et al., 2018; Gilmer et al., 2017; Battaglia et al., 2018), and the power of these models has been studied theoretically (Xu et al., 2019; Sato, 2020). GNNs take as input a set of node features and an adjacency matrix corresponding to the graph structure and, for each node, output an embedding that captures not only the initial features of the node but also the features and embeddings of its neighbors. The performance of GNNs depends heavily on the quality of the input graph structure and deteriorates substantially when the structure is noisy (see Zügner et al., 2018; Dai et al., 2018; Fox & Rajamanickam, 2019). The need for both node features and a clean graph structure impedes the applicability of GNNs to domains where one has access to a set of nodes and their features but not to their underlying graph structure, or only has access to a noisy structure. Examples of such domains include brain signal classification (Jang et al., 2019), computer-aided diagnosis (Cosmo et al., 2020), analysis of computer programs (Johnson et al., 2020), and particle reconstruction (Qasim et al., 2019). In this paper, we address this limitation by developing a model that learns both the GNN parameters and an adjacency matrix simultaneously.
Since the number of possible graph structures grows super-exponentially with the number of nodes (Stanley, 1973) and obtaining node labels is typically costly, the number of available labels may not be enough for learning both the GNN parameters and an adjacency matrix, especially in semi-supervised node classification. Our main contribution is to supplement the classification task with a self-supervised task that helps learn a high-quality adjacency matrix. Our self-supervision approach masks some input features (or adds noise to them) and trains a separate GNN that updates the adjacency matrix in such a way that the masked (or noisy) features can be recovered. Introducing this self-supervision adds the inductive bias that a graph structure suitable for predicting the node features is also suitable for predicting the node labels. We experiment with several classification datasets. For datasets with a graph structure, we feed only the node features to our model; the model operates on the node features and an adjacency matrix that is learned simultaneously from data. We compare our model with different classes of methods: some that do not use a graph structure for predicting labels, some that use a fixed k-nearest-neighbors (kNN) graph built from a chosen similarity metric, and some that initialize the graph with kNN but then revise it throughout training. We show that our model consistently outperforms these methods, and that the self-supervised task is key to its high performance. As an additional contribution, we provide an implementation for simultaneous structure and parameter learning that scales to graphs with hundreds of thousands of nodes.
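The masking step and the resulting denoising objective can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the function names, the zero-masking choice, and the use of mean squared error on the corrupted entries are our assumptions for exposition.

```python
# Illustrative sketch of feature masking for the denoising self-supervised
# task: corrupt a random subset of feature entries, then score a
# reconstruction only on the entries that were corrupted.
# (Names and the MSE loss are illustrative assumptions, not the paper's code.)
import numpy as np

def mask_features(x, mask_ratio=0.2, rng=None):
    """Zero out a random subset of entries of the feature matrix x.
    Returns the corrupted features and a boolean mask of corrupted entries."""
    rng = np.random.default_rng(rng)
    mask = rng.random(x.shape) < mask_ratio
    x_noisy = np.where(mask, 0.0, x)
    return x_noisy, mask

def denoising_loss(x_recon, x, mask):
    """Mean squared error restricted to the masked entries: the denoising
    autoencoder is judged on recovering values it never observed."""
    return float(((x_recon - x) ** 2)[mask].mean())
```

In the full method, `x_noisy` would be fed (together with the generated adjacency) to the denoising autoencoder GNN, and this loss would be added to the classification loss of the task-specific GNN.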

2. RELATED WORK

Existing methods that relate to this work can be grouped into the following categories.

Similarity Graph: One approach for inferring a graph structure is to select a similarity metric and set the edge weight between two nodes to their similarity (Roweis & Saul, 2000; Tenenbaum et al., 2000). To obtain a sparse structure, one may create a kNN similarity graph, connect only pairs of nodes whose similarity surpasses a predefined threshold, or sample edges. As an example, Gidaris & Komodakis (2019) create a (fixed) kNN graph using the cosine similarity of the node features. Wang et al. (2019b) extend this idea by creating a fresh graph in each layer of the GNN based on the node embedding similarities in that layer, as opposed to fixing a graph solely based on the initial features. Instead of choosing a single similarity metric, Halcrow et al. (2020) fuse several (potentially weak) measures of similarity. The quality of the predictions of these methods depends heavily on the choice of the similarity metric(s) and on the value of k for the kNN graph or the similarity threshold. Furthermore, designing an appropriate similarity metric may not be straightforward in some applications.

Fully-connected Graph: Another approach is to assume a fully-connected graph and employ GNN variants such as graph attention networks (Veličković et al., 2018; Zhang et al., 2018) or the transformer (Vaswani et al., 2017), which infer the graph structure via an attention mechanism or using additional information. This approach has been used in computer vision (e.g., Suhail & Sigal, 2019), natural language processing (e.g., Zhu et al., 2019), and few-shot learning (e.g., Garcia & Bruna, 2017), where the graphs involved are small. The complexity of this approach, however, grows quadratically with the number of nodes, making it applicable only to small graphs with a few thousand nodes and not scalable to the datasets we use in our experiments.
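The similarity-graph construction described above can be sketched in a few lines. This is a generic illustration of a cosine-similarity kNN graph, not any cited paper's code; the symmetrization by edge union is one common choice among several.

```python
# Illustrative sketch: build a fixed, binary kNN graph from the cosine
# similarity of node features (in the spirit of Gidaris & Komodakis, 2019).
import numpy as np

def knn_cosine_graph(x, k):
    """Connect each node to its k most cosine-similar neighbors
    (excluding itself), then symmetrize by taking the union of edges."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    xn = x / np.clip(norms, 1e-12, None)     # unit-normalize rows
    sim = xn @ xn.T                          # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)           # forbid self-loops
    idx = np.argsort(-sim, axis=1)[:, :k]    # top-k neighbors per node
    adj = np.zeros_like(sim)
    adj[np.arange(len(x))[:, None], idx] = 1.0
    return np.maximum(adj, adj.T)            # symmetrize: A or A^T
```

The choice of k, like the choice of metric, is a hyperparameter that strongly affects downstream accuracy, which is precisely the sensitivity noted above.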
Learnable Graph: Instead of computing a similarity graph on the initial features, one may use a graph generator with learnable parameters. Li et al. (2018b) create a fully-connected graph based on a bilinear similarity function with learnable parameters. A common approach is to learn to project the nodes to a latent space where node similarities correspond to edge weights. Wu et al. (2018) project the nodes to a latent space by learning weights for each of the input features. Cosmo et al. (2020) and Qasim et al. (2019) use a multi-layer perceptron for the projection. Yu et al. (2020) use a GNN that projects the nodes into a latent space using the initial node features as well as an initial graph structure, aiming at providing a revised graph structure to the task-specific GNN. Franceschi et al. (2019) propose a model named LDS with a bi-level optimization setup for simultaneously learning the GNN parameters and a full adjacency matrix. Yang et al. (2019) update the input adjacency matrix based on the inductive bias that nodes belonging to the same class should be connected to each other and nodes belonging to different classes should be disconnected. Chen et al. (2020) propose an iterative approach that alternates multiple times between projecting the nodes to a latent space and constructing an adjacency matrix from the latent representations. In our experiments, we compare with several approaches from this category.

Leveraging Domain Knowledge: In applications where specific domain knowledge is available, one may leverage it to guide the model toward learning specific structures. For example, Johnson et al. (2020) leverage abstract syntax trees and regular languages in learning graph structures of Python programs that aid reasoning for downstream tasks. Jin et al. (2020b) train GNNs that are robust to adversarial attacks by learning a cleaned version of a poisoned input adjacency matrix.
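The latent-projection idea shared by several of the learnable-graph methods above can be sketched as follows. This is a minimal illustration under our own assumptions: a single linear projection `W` standing in for whatever learnable generator a given method uses, ReLU for non-negative edge weights, and averaging with the transpose as one possible symmetrization.

```python
# Hedged sketch of a learnable graph generator: project nodes to a latent
# space with a learnable matrix W, take pairwise inner products of the
# projections as edge scores, and symmetrize. In a real model, W would be
# trained end-to-end with the task; here it is fixed for illustration.
import numpy as np

def generate_adjacency(x, W):
    """Dense adjacency from latent-space similarities of projected nodes."""
    z = x @ W                       # latent node representations
    scores = z @ z.T                # pairwise similarity in latent space
    a = np.maximum(scores, 0.0)     # ReLU: keep non-negative edge weights
    return (a + a.T) / 2.0          # symmetrize (a no-op here, but needed
                                    # for generators with asymmetric scores)
```

The resulting adjacency is dense; methods in this category typically add sparsification (e.g., top-k per row) or normalization before feeding it to the task-specific GNN.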

