SCALABLE AND PRIVACY-ENHANCED GRAPH GENERATIVE MODEL FOR GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

As the field of Graph Neural Networks (GNNs) continues to grow, so does the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets are often generated from online, highly privacy-restricted ecosystems, which makes research and development on these datasets hard, if not impossible. This greatly reduces the number of benchmark graphs available to researchers, causing the field to rely on only a handful of publicly available datasets. To address this dilemma, we introduce a novel graph generative model, Computation Graph Transformer (CGT), that can learn and reproduce the distribution of real-world graphs in a privacy-enhanced way. Our proposed model (1) generates effective benchmark graphs on which GNNs show similar task performance as on the source graphs, (2) scales to process large-scale real-world graphs, and (3) guarantees privacy for end users. Extensive experiments across a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to evaluate GNN models.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Kipf & Welling, 2016a; Chami et al., 2022) are machine learning models that learn the dependencies in graphs via message passing between nodes. GNN models have been widely applied to a variety of industrial domains such as misinformation detection (Benamira et al., 2019), financial fraud detection (Wang et al., 2019a), traffic prediction (Zhao et al., 2019), and social recommendation (Ying et al., 2018). However, datasets from these industrial tasks are overwhelmingly proprietary and privacy-restricted, and thus almost always unavailable to researchers who want to study or evaluate new GNN architectures. As a result, in many cases GNN models cannot be trained or evaluated on graphs that are appropriate for the actual tasks they need to execute. This scarcity of real-world benchmark graphs also leaves GNN researchers with only a handful of public datasets, which could cause new GNN architectures to optimize performance on these public datasets alone rather than to generalize (Palowitch et al., 2022).

In this paper, we introduce a novel graph generative model to overcome the unavailability of critical real-world graph datasets. While there is already a vast body of work on graph generation (You et al., 2018; Liao et al., 2019; Simonovsky & Komodakis, 2018; Grover et al., 2019), including differentially private generation (Qin et al., 2017; Proserpio et al., 2012), we found that no single study has addressed all aspects of the modern GNN problem setting, such as handling large-scale graphs and node attributes/labels. We thus propose a novel, modern graph generation problem definition:

Problem Definition 1.
Let A, X, and Y denote the adjacency, node attribute, and node label matrices. Given an original graph G = (A, X, Y), generate a synthetic graph G′ = (A′, X′, Y′) satisfying:
• Benchmark effectiveness: performance rankings among m GNN models on G′ should be similar to the rankings among the same m GNN models on G.
• Scalability: the computational complexity of graph generation should be linear in the size of the original graph, O(|G|) (e.g., number of nodes or edges).
• Privacy guarantee: some syntactic privacy notion (e.g., k-anonymity) is guaranteed to end users.
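To make the benchmark-effectiveness criterion concrete, the sketch below shows one possible way to quantify how similar two rankings are. It assumes Spearman rank correlation as the similarity measure, which the problem definition itself does not prescribe. The model names and accuracy values are hypothetical placeholders, not results from this paper.

```python
from typing import List


def ranks(scores: List[float]) -> List[float]:
    """Return 1-based ranks (1 = highest score), averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all scores tie")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies of m = 4 GNN models (illustrative values only).
models = ["GCN", "GraphSAGE", "GAT", "SGC"]
acc_on_G = [0.81, 0.79, 0.83, 0.76]        # trained/evaluated on the original G
acc_on_G_prime = [0.64, 0.61, 0.66, 0.58]  # trained/evaluated on the synthetic G'

rho = spearman(acc_on_G, acc_on_G_prime)
print(f"Spearman rank correlation between G and G': {rho:.3f}")
```

Under this assumed measure, a value near 1 means the synthetic graph preserves the ordering of the models even when absolute accuracies differ, which is what the criterion asks for; a value near 0 or below means the synthetic graph would mislead model selection.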

