SCALABLE AND PRIVACY-ENHANCED GRAPH GENERATIVE MODEL FOR GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

As the field of Graph Neural Networks (GNNs) continues to grow, so does the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets often come from online, highly privacy-restricted ecosystems, which makes research and development on them hard, if not impossible. This greatly reduces the number of benchmark graphs available to researchers, causing the field to rely on only a handful of publicly available datasets. To address this dilemma, we introduce a novel graph generative model, the Computation Graph Transformer (CGT), that can learn and reproduce the distribution of real-world graphs in a privacy-enhanced way. Our proposed model (1) generates effective benchmark graphs on which GNNs show task performance similar to that on the source graphs, (2) scales to process large-scale real-world graphs, and (3) guarantees privacy for end users. Extensive experiments against a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to evaluate GNN models.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Kipf & Welling, 2016a; Chami et al., 2022) are machine learning models that learn the dependencies in graphs via message passing between nodes. GNN models have been widely applied in a variety of industrial domains such as misinformation detection (Benamira et al., 2019), financial fraud detection (Wang et al., 2019a), traffic prediction (Zhao et al., 2019), and social recommendation (Ying et al., 2018). However, datasets from these industrial tasks are overwhelmingly proprietary and privacy-restricted, and thus almost always unavailable for researchers to study or to use when evaluating new GNN architectures. This state of affairs means that, in many cases, GNN models cannot be trained or evaluated on graphs that are appropriate for the actual tasks they need to execute. This scarcity of real-world benchmark graphs also leaves GNN researchers with only a handful of public datasets, which could cause new GNN architectures to optimize performance only on these public datasets rather than to generalize (Palowitch et al., 2022).

In this paper, we introduce a novel graph generative model to overcome the unavailability of critical real-world graph datasets. While there is already a vast body of work on graph generation (You et al., 2018; Liao et al., 2019; Simonovsky & Komodakis, 2018; Grover et al., 2019), including differentially-private generation (Qin et al., 2017; Proserpio et al., 2012), we found that no single study has addressed all aspects of the modern GNN problem setting, such as handling large-scale graphs and node attributes/labels. We thus propose a novel, modern graph generation problem definition:

Problem Definition 1. Let A, X, and Y denote adjacency, node attribute, and node label matrices; given an original graph G = (A, X, Y), generate a synthetic graph G′ = (A′, X′, Y′) satisfying:
• Benchmark effectiveness: performance rankings among m GNN models on G′ should be similar to the rankings among the same m GNN models on G.
• Scalability: the computational complexity of graph generation should be linearly proportional to the size |G| of the original graph (e.g., its number of nodes or edges).
• Privacy guarantee: syntactic privacy notions (e.g., k-anonymity) are provided to end users.

To address this problem, we introduce the Computation Graph Transformer (CGT) as the core of a graph generation approach with two novel components. First, CGT operates on minibatches rather than on the whole graph, avoiding the scalability issues encountered by nearly all existing graph generative models. Note that each minibatch is in fact a GNN computation graph (Hamilton et al., 2017) with its own adjacency and feature submatrices, and the set of all minibatches comprises a graph minibatch distribution that can be learned by an appropriate generative model. Second, instead of attempting to learn the joint distribution of adjacency and feature matrices, we derive a novel duplicate encoding scheme that transforms an adjacency and feature matrix pair (A, X) into a single, dense feature matrix that is isomorphic to the original pair. In this way we reduce the task of learning graph distributions to learning distributions of feature vector sequences, which we approach with a novel Transformer architecture (Vaswani et al., 2017). This reduction is the key innovation that allows CGT to be an effective generator of realistic datasets for GNN research. In addition, after the reduction, our model can easily be extended to provide k-anonymity or differential privacy guarantees on node attributes and edge distributions.
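As an illustration of the minibatch reframing above, the following sketch extracts one sampled k-hop computation graph (adjacency and feature submatrices) rooted at a single node. The sampling scheme and its `num_hops`/`fanout` parameters are a hypothetical simplification for exposition, not the paper's exact procedure:

```python
import numpy as np

def computation_graph(A, X, root, num_hops=2, fanout=3, rng=None):
    """Sample the computation graph a message-passing GNN would consume
    for `root`: its (sampled) k-hop neighborhood with the corresponding
    adjacency and feature submatrices.  Illustrative sketch only."""
    if rng is None:
        rng = np.random.default_rng(0)
    nodes = [root]
    frontier = [root]
    for _ in range(num_hops):
        nxt = []
        for u in frontier:
            nbrs = np.flatnonzero(A[u])          # neighbors of u
            if len(nbrs) > fanout:               # cap the fanout per node
                nbrs = rng.choice(nbrs, size=fanout, replace=False)
            nxt.extend(int(v) for v in nbrs)
        frontier = nxt
        nodes.extend(nxt)
    idx = sorted(set(nodes))
    # Submatrices restricted to the sampled neighborhood
    return A[np.ix_(idx, idx)], X[idx]
```

Collecting one such computation graph per node yields the minibatch distribution that a generative model can then learn, in place of the whole graph.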
To show the effectiveness of CGT, we design three experiments that examine its scalability, its benchmark effectiveness as a generator of substitutes for source graphs, and its privacy-performance trade-off. Specifically, to examine the benchmark aspect, we perturb various aspects of the GNN models and datasets and check that these perturbations have the same empirical effect on GNN performance on both the original and generated graphs. In total, our contributions are: 1) we propose a novel graph generation problem featuring three requirements of state-of-the-art graph learning settings; 2) we reframe the problem of learning the distribution of a whole graph as learning the distribution of the minibatches consumed by GNN models; 3) we propose the Computation Graph Transformer, an architecture that casts the problem of computation graph generation as conditional sequence modeling; and finally 4) we show that the test performance of 9 GNN models in 14 different task scenarios is consistent across 7 real-world graphs and their corresponding synthetic graphs.

2. RELATED WORK

Traditional graph generative models extract common patterns among real-world graphs (e.g., node/edge/triangle counts, degree distribution, graph diameter, clustering coefficient) (Chakrabarti & Faloutsos, 2006) and generate synthetic graphs following a few heuristic rules (Erdős et al., 1960; Leskovec et al., 2010; Leskovec & Faloutsos, 2007; Albert & Barabási, 2002). However, they cannot generate unseen patterns on synthetic graphs (You et al., 2018). More importantly, most of them generate only graph structures, sometimes with low-dimensional boolean node attributes (Eswaran et al., 2018).

General-purpose deep graph generative models exploit GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2013), and RNNs (Zaremba et al., 2014) to learn graph distributions (Guo & Zhao, 2020). Most of them focus on learning graph structures (You et al., 2018; Liao et al., 2019; Simonovsky & Komodakis, 2018; Grover et al., 2019); thus their evaluation metrics are graph statistics, such as orbit counts, degree coefficients, and clustering coefficients, which do not consider the quality of generated node attributes and labels.

Molecule graph generative models are actively studied for generating promising candidate molecules using VAEs (Jin et al., 2018), GANs (De Cao & Kipf, 2018), RNNs (Popova et al., 2019), and, recently, invertible flow models (Shi et al., 2020; Luo et al., 2021). However, most of their architectures are specialized to small-scale molecule graphs (e.g., 38 nodes per graph in the ZINC datasets) with a low-dimensional attribute space (e.g., 9 boolean node attributes indicating atom types) and distinct molecule-related information (e.g., SMILES representations or chemical structures such as bonds and rings) (Suhail et al., 2021).

3. FROM GRAPH GENERATION TO SEQUENCE GENERATION

To develop a scalable and privacy-enhanced benchmark graph generative model for GNNs, we first look into how GNNs process a given graph G. With n nodes and d-dimensional node attribute vectors, G is commonly given as a triad of an adjacency matrix A ∈ R^{n×n}, a node attribute matrix X ∈ R^{n×d}, and a node label matrix Y ∈ R^n. In this section, we illustrate how to convert the whole-graph generation problem into a discrete-valued sequence generation problem.
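As a toy illustration of the discrete-valued sequence view, the sketch below quantizes a computation graph's node attribute vectors against a small codebook, turning each node into a discrete token in traversal order. The codebook and nearest-neighbor quantization here are hypothetical; the encoding actually used by CGT is not shown in this snippet:

```python
import numpy as np

def to_discrete_sequence(X_sub, codebook):
    """Map each node's attribute vector to the index of its nearest
    codebook entry, yielding one discrete token per node.  Illustrative
    sketch of attribute quantization, not CGT's exact encoding."""
    # Pairwise squared distances between node vectors and codebook rows
    d = ((X_sub[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)  # token id per node, in traversal order
```

Once every computation graph is reduced to such a token sequence, learning the graph distribution amounts to learning a distribution over discrete sequences, which is precisely the setting where Transformer-style sequence models apply.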

