SCALABLE AND PRIVACY-ENHANCED GRAPH GENERATIVE MODEL FOR GRAPH NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

As the field of Graph Neural Networks (GNNs) continues to grow, it experiences a corresponding increase in the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets are often generated from online, highly privacy-restricted ecosystems, which makes research and development on these datasets hard, if not impossible. This greatly reduces the number of benchmark graphs available to researchers, causing the field to rely on only a handful of publicly available datasets. To address this dilemma, we introduce a novel graph generative model, the Computation Graph Transformer (CGT), that can learn and reproduce the distribution of real-world graphs in a privacy-enhanced way. Our proposed model (1) generates effective benchmark graphs on which GNNs show task performance similar to that on the source graphs, (2) scales to process large-scale real-world graphs, and (3) guarantees privacy for end users. Extensive experiments against a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to evaluate GNN models.

1. INTRODUCTION

Graph Neural Networks (GNNs) (Kipf & Welling, 2016a; Chami et al., 2022) are machine learning models that learn dependencies in graphs via message passing between nodes. GNN models have been widely applied in a variety of industrial domains such as misinformation detection (Benamira et al., 2019), financial fraud detection (Wang et al., 2019a), traffic prediction (Zhao et al., 2019), and social recommendation (Ying et al., 2018). However, datasets from these industrial tasks are overwhelmingly proprietary and privacy-restricted, and thus almost always unavailable for researchers to study or to use to evaluate new GNN architectures. This state of affairs means that in many cases, GNN models cannot be trained or evaluated on graphs that are appropriate for the actual tasks they need to execute. This scarcity of real-world benchmark graphs also leaves GNN researchers with only a handful of public datasets, which could cause new GNN architectures to optimize performance on these public datasets rather than generalizing (Palowitch et al., 2022). In this paper, we introduce a novel graph generative model to overcome the unavailability of critical real-world graph datasets. While there is already a vast body of work on graph generation (You et al., 2018; Liao et al., 2019; Simonovsky & Komodakis, 2018; Grover et al., 2019), including differentially private generation (Qin et al., 2017; Proserpio et al., 2012), we found that no single study has addressed all aspects of the modern GNN problem setting, such as handling large-scale graphs and node attributes/labels. We thus propose a novel, modern graph generation problem definition: Problem Definition 1.
Let A, X, and Y denote adjacency, node attribute, and node label matrices; given an original graph G = (A, X, Y), generate a synthetic graph G′ = (A′, X′, Y′) satisfying:
• Benchmark effectiveness: performance rankings among m GNN models on G′ should be similar to the rankings among the same m GNN models on G.
• Scalability: the computational complexity of graph generation should be linearly proportional to the size O(|G|) of the original graph (e.g., its number of nodes or edges).
• Privacy guarantee: syntactic privacy notions (e.g., k-anonymity) are given to end users.
To address this problem statement, we introduce the Computation Graph Transformer (CGT) as the core of a graph generation approach with two novel components. First, CGT operates on minibatches rather than the whole graph, avoiding the scalability issues encountered by nearly all existing graph generative models. Note that each minibatch is in fact a GNN computation graph (Hamilton et al., 2017) with its own adjacency and feature submatrices, and the set of all minibatches comprises a graph minibatch distribution that can be learned by an appropriate generative model. Second, instead of attempting to learn the joint distribution of adjacency matrices and feature matrices, we derive a novel duplicate encoding scheme that transforms an (A, X) adjacency-feature matrix pair into a single, dense feature matrix that is isomorphic to the original pair. In this way we reduce the task of learning graph distributions to learning feature vector sequence distributions, which we approach with a novel Transformer architecture (Vaswani et al., 2017). This reduction is the key innovation allowing CGT to be an effective generator of realistic datasets for GNN research. In addition, after the reduction, our model can be easily extended to provide k-anonymity or differential privacy guarantees on node attributes and edge distributions.
To show the effectiveness of CGT, we design three experiments that examine its scalability, its benchmark effectiveness as a generator of substitutes for source graphs, and its privacy-performance trade-off. Specifically, to examine the benchmark aspect, we perturb various aspects of the GNN models and datasets and check that these perturbations have the same empirical effect on GNN performance on both the original and generated graphs. In total, our contributions are: 1) we propose a novel graph generation problem featuring three requirements of state-of-the-art graph learning settings; 2) we reframe the problem of learning the distribution of a whole graph as learning the distribution of the minibatches consumed by GNN models; 3) we propose the Computation Graph Transformer, an architecture that casts computation graph generation as conditional sequence modeling; and finally 4) we show that the test performance of 9 GNN models in 14 different task scenarios is consistent across 7 real-world graphs and their corresponding synthetic graphs.

2. RELATED WORK

Traditional graph generative models extract common patterns from real-world graphs (e.g., node/edge/triangle counts, degree distributions, graph diameter, clustering coefficients) (Chakrabarti & Faloutsos, 2006) and generate synthetic graphs following a few heuristic rules (Erdős et al., 1960; Leskovec et al., 2010; Leskovec & Faloutsos, 2007; Albert & Barabási, 2002). However, they cannot generate unseen patterns in synthetic graphs (You et al., 2018). More importantly, most of them generate only graph structures, sometimes with low-dimensional boolean node attributes (Eswaran et al., 2018). General-purpose deep graph generative models exploit GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2013), and RNNs (Zaremba et al., 2014) to learn graph distributions (Guo & Zhao, 2020). Most of them focus on learning graph structures (You et al., 2018; Liao et al., 2019; Simonovsky & Komodakis, 2018; Grover et al., 2019); thus their evaluation metrics are graph statistics such as orbit counts, degree coefficients, and clustering coefficients, which do not consider the quality of generated node attributes and labels. Molecule graph generative models are actively studied for generating promising candidate molecules using VAEs (Jin et al., 2018), GANs (De Cao & Kipf, 2018), RNNs (Popova et al., 2019), and, more recently, invertible flow models (Shi et al., 2020; Luo et al., 2021). However, most of their architectures are specialized to small-scale molecule graphs (e.g., 38 nodes per graph in the ZINC dataset) with low-dimensional attribute spaces (e.g., 9 boolean node attributes indicating atom types) and distinct molecule-related information (e.g., SMILES representations or chemical structures such as bonds and rings) (Suhail et al., 2021).

3. FROM GRAPH GENERATION TO SEQUENCE GENERATION

To develop a scalable and privacy-enhanced benchmark graph generative model for GNNs, we first look into how GNNs process a given graph G. With n nodes and d-dimensional node attribute vectors, G is commonly given as a triad of an adjacency matrix A ∈ R^{n×n}, a node attribute matrix X ∈ R^{n×d}, and a node label matrix Y ∈ R^n. In this section, we illustrate how to convert the whole-graph generation problem into a discrete-valued sequence generation problem.
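As a concrete illustration, the triad can be held in three NumPy arrays (toy sizes and values, chosen here only for illustration):

```python
import numpy as np

# Toy graph with n = 4 nodes and d = 3 attribute dimensions, in the
# (A, X, Y) representation used throughout: adjacency, attributes, labels.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])        # A in R^{n x n}, symmetric (undirected)
X = np.arange(12.0).reshape(4, 3)   # X in R^{n x d}
Y = np.array([0, 1, 0, 1])          # Y in R^{n}: one class label per node

# GNNs consume exactly this triple; a generative model must emit all three.
assert A.shape == (len(Y), len(Y)) and X.shape[0] == len(Y)
```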

3.1. COMPUTATION GRAPHS IN MINIBATCH-BASED GNN TRAINING

To compute the embedding of a node v, an L-layered GNN extracts the node's L-hop egonet G_v, namely the computation graph. Specifically, as with the global graph, G_v is composed of a sub-adjacency matrix A_v ∈ R^{n_v×n_v}, a sub-feature matrix X_v ∈ R^{n_v×d}, and a sub-label matrix Y_v, where n_v denotes the number of nodes in the egonet. We thus reframe the generation problem as follows: given a set of computation graphs {G_v = (A_v, X_v, Y_v) : v ∈ G} sampled from an original graph, we generate a set of computation graphs {G′_v = (A′_v, X′_v, Y′_v)}. This reframing shares intuition with mini-batch stochastic gradient descent: the distribution of randomly chosen subsets approximates the distribution of the original set (Bottou, 2010).

3.2. ENCODING SCHEME FOR COMPUTATION GRAPHS

In this work, we sample a fixed-size set of neighbors to generate computation graphs instead of using the full neighborhood, as proposed by GraphSage (Hamilton et al., 2017), a technique also widely adopted in popular GNN libraries (Fey & Lenssen, 2019; Ferludin et al., 2022; Wang et al., 2019b) to fix the minibatch computational footprint. To train an L-layered GNN model with a user-specified neighbor sampling number s, a computation graph is generated for each node in a top-down manner (l : L → 1): a target node v is located at the L-th layer; the target node samples s neighbors, and the sampled s nodes are located at the (L−1)-th layer; each of these nodes samples s neighbors, and the sampled s^2 nodes are located at the (L−2)-th layer; and so on until the 1-st layer. When a neighborhood is smaller than s, we sample all existing neighbors of the node. Generating a computation graph is thus similar to generating a balanced s-nary tree. For example, a balanced binary tree-shaped computation graph is generated for node A in Figure 1(b) with neighbor sampling number s = 2. In practice, however, computation graphs are almost always unbalanced s-nary trees due to one of two cases: (1) lack of neighbors, and (2) neighbor sharing. In Figure 1(b), B's computation graph is an unbalanced tree because node C has no neighbors (case 1). In D's computation graph, nodes D and G share node H as a neighbor, creating a cycle in the computation graph (case 2). These two cases result in the variably shaped adjacency and node attribute matrices of computation graphs shown as blue and yellow boxes in Figure 1(b).
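The top-down sampling procedure above can be sketched as follows (a minimal illustration with a hypothetical dict-of-lists adjacency; GNN libraries implement this far more efficiently):

```python
import random

def sample_computation_graph(adj, target, L, s):
    """Sample an L-layer computation graph for `target` top-down.

    adj: dict mapping each node to a list of neighbors.
    Returns a list of layers; layer 0 holds the target node and
    layer l holds the (up to s^l) nodes sampled at depth l.
    """
    layers = [[target]]
    for _ in range(L):
        nxt = []
        for node in layers[-1]:
            neigh = adj.get(node, [])
            if len(neigh) <= s:        # case 1: too few neighbors,
                nxt.extend(neigh)      # so take all existing ones
            else:
                nxt.extend(random.sample(neigh, s))
            # case 2: shared neighbors may appear multiple times
        layers.append(nxt)
    return layers

# Toy adjacency in the spirit of Figure 1 (names hypothetical):
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": [],
       "D": ["G", "H"], "G": ["H"], "H": []}
cg = sample_computation_graph(adj, "B", L=2, s=2)
```

Here node C contributes no children (case 1), so B's tree is unbalanced, exactly as described above.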

3.3. DUPLICATE ENCODING SCHEME FOR COMPUTATION GRAPHS

We introduce a duplicate encoding scheme for computation graphs that is conceptually simple but has a significant consequence: it fixes the adjacency matrix of all computation graphs, allowing us to model it as a constant. To circumvent case 1 from the previous section, the duplicate encoding scheme defines a null node with a zero attribute vector (node '-' in Figure 1(c)) and samples it as a padding neighbor for any node with fewer than s neighbors. To circumvent case 2, the duplicate encoding scheme copies shared neighbors and provides each copy to its parent node (node H in node D's computation graph is copied in Figure 1(c)); each copied node's attribute vector is also copied and added to the feature matrix. As shown in Figure 1(c), the duplicate encoding scheme ensures that all computation graphs have an identical adjacency matrix (representing a balanced s-nary tree) and feature matrices of identical shape. Note that in order to fix the adjacency matrix, we need to fix the order of nodes in the adjacency and attribute matrices (e.g., breadth-first ordering in Figure 1(c)). Because our duplicate encoding scheme fixes the adjacency structure over all computation graphs, our problem reduces to learning the distribution of (duplicate-encoded) feature matrices of computation graphs, formalized as: given a set of feature matrix-label pairs {(X̃_v, Y_v) : v ∈ G} of duplicate-encoded computation graphs, we generate a set of feature matrix-label pairs {(X̃′_v, Y′_v)}.
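A minimal sketch of the duplicate encoding scheme (hypothetical helper names; null-node padding for case 1 and per-parent copying for case 2 as described above):

```python
import random

NULL = "-"  # null node, carrying a zero attribute vector

def duplicate_encode(adj, target, L, s):
    """Flatten `target`'s computation graph into a BFS-ordered node
    sequence of a balanced s-nary tree. Shared neighbors are copied
    once per parent; missing neighbors are padded with the null node,
    so every encoding has the same fixed length 1 + s + ... + s^L and
    an identical (constant) adjacency structure."""
    seq, frontier = [target], [target]
    for _ in range(L):
        nxt = []
        for node in frontier:
            neigh = adj.get(node, []) if node != NULL else []
            picked = neigh if len(neigh) <= s else random.sample(neigh, s)
            picked = picked + [NULL] * (s - len(picked))  # pad to exactly s
            nxt.extend(picked)
        seq.extend(nxt)
        frontier = nxt
    return seq

adj = {"A": ["B", "C"], "B": ["A", "C"], "C": []}
seq = duplicate_encode(adj, "B", L=2, s=2)
assert len(seq) == 1 + 2 + 4   # fixed shape regardless of the input graph
```

Node C has no neighbors, so its children are two null nodes; node B appears twice because it is both the root and a sampled neighbor of A.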

3.4. QUANTIZATION

To learn distributions of feature matrices of computation graphs, we first quantize feature vectors into discrete bins; specifically, we cluster the feature vectors of the original graph using k-means and map each feature vector to its (discrete) cluster id. Quantization is motivated by 1) privacy benefits and 2) ease of modeling. By mapping different feature vectors (which are clustered together) to the same cluster id, we can guarantee k-anonymity among them (more details in Section 4.2). Ultimately, quantization further reduces our problem to learning distributions over sequences of discrete values, namely the sequences of cluster ids of the feature vectors in each computation graph. Such a problem is naturally addressed by Transformers, state-of-the-art sequence generative models (Vaswani et al., 2017). In Section 4, we introduce the Computation Graph Transformer (CGT), a novel architecture which (at inference time) generates a new sequence of cluster ids, each of which is then de-quantized as the mean feature vector of its cluster.
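The quantize/de-quantize round trip can be illustrated with a plain k-means sketch (the paper's pipeline may use a different k-means variant, e.g., one with a minimum cluster size; this is only a self-contained illustration):

```python
import numpy as np

def quantize(X, num_clusters, iters=20, seed=0):
    """Plain k-means: map each feature vector to a discrete cluster id
    and return (ids, centers); centers are kept for de-quantization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest center
        ids = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(num_clusters):
            if (ids == c).any():
                centers[c] = X[ids == c].mean(0)
    return ids, centers

def dequantize(ids, centers):
    """Replace each cluster id with its cluster's mean feature vector."""
    return centers[ids]

# Two well-separated toy clusters of feature vectors:
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
ids, centers = quantize(X, num_clusters=2)
X_hat = dequantize(ids, centers)
```

After de-quantization, all vectors in a cluster share the cluster's mean vector, which is exactly what makes the k-anonymity argument of Section 4.2 possible.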

3.5. END-TO-END FRAMEWORK FOR A BENCHMARK GRAPH GENERATION PROBLEM

Figure 2 summarizes the entire process of mapping a graph generation problem into a discrete sequence generation problem. In the training phase, we 1) sample a set of computation graphs from the input graph, 2) encode each computation graph using the duplicate encoding scheme to fix adjacency matrices, 3) quantize feature vectors to the cluster ids they belong to, and finally 4) hand over a set of (sequence of cluster ids, node label) pairs to our new Transformer architecture to learn their distribution. In the generation phase, we follow the same process in the opposite direction: 1) the trained Transformer outputs a set of (sequence of cluster ids, node label) pairs, 2) we de-quantize cluster ids back into the feature vector space by replacing them with the mean feature vector of the cluster, 3) we regenerate a computation graph from each sequence of feature vectors with the adjacency matrix fixed by the duplicate encoding scheme, and finally 4) we feed the set of generated computation graphs into the GNN model we want to train or evaluate.

4. COMPUTATION GRAPH TRANSFORMER

4.1. MODEL ARCHITECTURE

[Figure 3: (a) an input computation graph is flattened into an input sequence; (b) x_t, q_t^(l), and h_t^(l) denote the token, query, and context embeddings of the t-th token at the l-th layer; p_l(t) and y_s1 denote the position embedding of the t-th token and the label embedding of the whole sequence, respectively; (c) the cost-efficient version of CGT divides the input sequence into shorter ones composed only of direct ancestor nodes.]

Position embeddings: In the original architecture, each token receives a position embedding to let the Transformer recognize the token's position in the sequence. In our model, however, sequences are flattened computation graphs (e.g., the input computation graph in Figure 3(a) is flattened into the input sequence in Figure 3(b)). To encode the original computation graph structure, we provide different position embeddings to different layers in the computation graph, while nodes at the same layer share the same position embedding.
When l(t) denotes the layer number at which the t-th token (node) is located in the original computation graph, the position embedding p_l(t), indexed by the layer number, is assigned to the t-th token. In Figure 3(b), nodes C, D, F, and H, located at the 1-st layer of the computation graph, share the same position embedding p_1.

Attention masks: In the original architecture, the query and context embeddings q_t^(l) and h_t^(l) attend to all context embeddings h_{1:t-1}^(l-1) before t. In a computation graph, however, each node is sampled based on its parent node (which is sampled based on its own parent node) and is not directly affected by its sibling nodes. To encode this relationship more effectively, we mask all nodes except the direct ancestors in the computation graph, i.e., the root node and any nodes between the root node and the leaf node. In Figure 3(b), node C's context/query embeddings attend only to its direct ancestors, nodes A and B. Note that the number of unmasked tokens is fixed to L in our architecture, because there are always L − 1 direct ancestors in an L-layered computation graph. Based on this observation, we provide a cost-efficient version of CGT that has shorter sequence lengths and preserves XLNet's auto-regressive masking, as shown in Figure 3(c).

Label conditioning: Distributions of neighboring nodes are affected not only by each node's feature information but also by its label. It is well known that GNNs improve over MLP performance by adding convolutional operations that augment each node's features with neighboring node features. This improvement is commonly attributed to nodes whose feature vectors are noisy (outliers among nodes with the same label) but that are connected to "good" neighbors (whose features are well aligned with the label). In this case, without label information, we cannot learn whether a node has feature-wise homogeneous neighbors or feature-wise heterogeneous neighbors with the same labels.
In our Transformer model, the query embeddings q_t^(0) are initialized with the label embedding y_s1, which encodes the label of the root node s_1.
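The direct-ancestor masking described above can be sketched as follows (an illustrative reconstruction assuming 0-based BFS indexing of the duplicate-encoded balanced s-nary tree; not the paper's implementation):

```python
import numpy as np

def ancestor_mask(L, s):
    """Boolean attention mask over a BFS-flattened balanced s-nary
    tree: mask[t, u] is True iff token u is a direct ancestor of
    token t (or t itself), so each token attends to at most L + 1
    tokens (itself plus its chain of ancestors up to the root)."""
    n = sum(s ** l for l in range(L + 1))  # 1 + s + ... + s^L tokens
    mask = np.zeros((n, n), dtype=bool)
    for t in range(n):
        u = t
        while True:
            mask[t, u] = True
            if u == 0:
                break
            u = (u - 1) // s  # parent index under 0-based BFS order
    return mask

m = ancestor_mask(L=2, s=2)   # 7 tokens: root, 2 children, 4 grandchildren
```

For token 3 (a leaf), only itself, its parent (token 1), and the root (token 0) are unmasked; all sibling and cousin tokens are masked out, matching the intuition that siblings do not directly influence each other.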

4.2. THEORETICAL ANALYSIS

First, our framework can be easily extended to provide k-anonymity for node attributes and edge distributions by using k-means clustering with a minimum cluster size k (Bradley et al., 2000) during the quantization phase. Note that we define the edge distribution of a node as the distribution of its neighboring nodes. Full proofs of the following claims can be found in Appendix A.1.

Claim 1 (k-anonymity for node attributes and edge distributions). In the generated computation graphs, each node attribute and each edge distribution appears at least k times.

Next, we can provide differential privacy (DP) for node attributes and edge distributions by exploiting DP k-means clustering (Chang et al., 2021) during the quantization phase and DP stochastic gradient descent (DP-SGD) (Song et al., 2013) to train the Transformer. Unfortunately, DP-SGD is not yet practical for Transformer training, so we cannot guarantee rigorous DP for edge distributions in practice (experimental results in Section 5.4 and further analysis in Appendix A.1). Thus, here we claim DP only for node attributes.

Claim 2 ((ϵ, δ)-Differential Privacy for node attributes). With probability at least 1 − δ, our generative model A gives ϵ-differential privacy for any graph G, any neighboring graph G_{−v} obtained by removing a node v ∈ G, and any new computation graph G_cg generated from our model:

e^{−ϵ} ≤ Pr[A(G) = G_cg] / Pr[A(G_{−v}) = G_cg] ≤ e^{ϵ}

Finally, we show that CGT satisfies the scalability requirement in Problem Definition 1:

Claim 3 (Scalability). When we generate L-layered computation graphs with neighbor sampling number s on a graph with n nodes, the computational complexity of CGT training is O(s^{2L} n), and that of the cost-efficient version is O(L^2 s^L n).
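The k-anonymity property of Claim 1 is straightforward to verify on quantized outputs; a toy sketch (hypothetical helper, checking only that every distinct cluster id, i.e., every de-quantized attribute vector, occurs at least k times):

```python
from collections import Counter

def satisfies_k_anonymity(cluster_ids, k):
    """Return True iff every cluster id occurs at least k times, so
    each generated attribute vector is shared by >= k nodes."""
    return all(count >= k for count in Counter(cluster_ids).values())

# Cluster ids produced by a minimum-cluster-size k-means (made up):
assert satisfies_k_anonymity([0, 0, 1, 1, 1], k=2)
assert not satisfies_k_anonymity([0, 0, 1], k=2)
```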

5. EXPERIMENTS

We aim to show that (1) CGT scales to learn distributions of large-scale real-world graphs; (2) CGT generates synthetic graphs on which GNNs perform similarly to the original graphs; (3) synthetic graphs generated by CGT are sufficiently private; and (4) CGT preserves distributions of graph statistics defined on the original set of computation graphs.

5.1. EXPERIMENTAL SETTING

We evaluate on seven public datasets: three citation networks (Cora, Citeseer, and Pubmed) (Sen et al., 2008), two co-purchase graphs (Amazon Computer and Amazon Photo) (Shchur et al., 2018), and two co-authorship graphs (MS CS and MS Physics) (Shchur et al., 2018). To measure GNN performance similarity, we run popular GNN architectures on a pair of original and synthetic graphs, then measure the Pearson and Spearman correlations (Myers et al., 2013) between the resulting performance metrics on each type of graph.
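The two correlation measures can be computed as follows (a self-contained sketch with made-up accuracy numbers; in practice one would typically use scipy.stats, which also handles ties):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation via NumPy's correlation-coefficient matrix."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks
    (no tie handling in this sketch)."""
    rank = lambda x: np.argsort(np.argsort(np.asarray(x))).astype(float)
    return pearson(rank(a), rank(b))

# Hypothetical accuracies of 4 GNN models on an original vs. a generated
# graph; high correlations mean the model rankings are preserved.
acc_original  = [0.81, 0.78, 0.74, 0.70]
acc_generated = [0.79, 0.77, 0.72, 0.71]
r_pearson  = pearson(acc_original, acc_generated)
r_spearman = spearman(acc_original, acc_generated)
```

Spearman captures rank agreement only (here the rankings match exactly), while Pearson is also sensitive to how closely the accuracy values track each other.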

5.2. SCALABILITY

To the best of our knowledge, no other generic graph generative model was designed to output a triad of adjacency matrix, node attribute matrix, and node labels. We extend two VAE-based graph generative models, GVAE (Kipf & Welling, 2016b) and Graphite (Grover et al., 2019), to generate node attributes and labels in addition to adjacency matrices from their latent variables. We also choose three molecule graph generative models, GraphAF (Shi et al., 2020), GraphDF (Luo et al., 2021), and GraphEBM (Suhail et al., 2021), that do not rely on any molecule-specific traits (e.g., SMILES representations). GraphAF, GraphDF, and Graphite run out of memory on even the smallest dataset, Cora (Table 5 in Appendix A.2). This is not surprising, given that they were originally designed for small-size molecule graphs. The remaining baselines (GVAE and GraphEBM), however, fail to learn any meaningful node attribute/label distributions from the original graphs. For instance, the predicted distribution sometimes collapses to the same node features/labels across all nodes, which is obviously not an effective benchmark (100% accuracy for all GNN models). We show these results and our analysis in Appendix A.2. Only our method successfully generates benchmark graphs across all datasets with meaningful node attribute/label distributions (Tables 1 and 2).

5.3. BENCHMARK EFFECTIVENESS

To examine the benchmark effectiveness of our generative model, we design 4 different scenarios where the performance of different GNN architectures varies widely. In each scenario, we provide 3 variations of the graphs and run 4 GNN models on each variation. For each scenario-variation pair, we report Pearson and Spearman correlations of the GNN performance metrics on the original graph against those on the generated graph. Due to space limitations, we present detailed GNN accuracies only for a few datasets/scenarios in Table 1; results on other datasets can be found in Appendix A.3. Note that Table 2 presents the correlation coefficients across all datasets and scenarios. Descriptions of each GNN model can be found in Appendix A.7.1.

SCENARIO 1: noisy edges on aggregation strategies. We choose four GNN models with different aggregation strategies: GCN (Kipf & Welling, 2016a) with a mean aggregator, GIN (Xu et al., 2018) with a sum aggregator, SGC (Wu et al., 2019) with a linear aggregator, and GAT (Veličković et al., 2017) with an attention aggregator. We then modify the graph by adding different numbers of noisy edges (randomly connected to any node in the graph) to each node and check how GNN performance changes. In Table 1, the first three columns show results on the Citeseer dataset. When more noisy edges are added, accuracy drops across all GNN models on the original graphs. These trends are captured nearly exactly by GNN performance on the generated graphs (both Pearson and Spearman correlations reach 0.964). This shows that the synthetic graphs generated by our method successfully capture the noisy edge distributions introduced in the original graphs.

SCENARIO 2: noisy edges on neighbor sampling.
We choose four GNN models with different neighbor sampling strategies: GraphSage (abbreviated as GSage in Table 1) (Hamilton et al., 2017) with random sampling, FastGCN (Chen et al., 2018) with heuristic layer-wise sampling, AS-GCN (Huang et al., 2018) with trainable layer-wise sampling, and PASS (Yoon et al., 2021) with trainable node-wise sampling. We then add noisy edges as described in Scenario 1 and check how the different sampling policies deal with noisy neighbors. In Table 1, on the original Amazon Photo graphs, FastGCN shows the highest accuracy across different numbers of noisy edges, followed by PASS, GraphSage, and AS-GCN; this trend is well preserved on the generated graphs, showing 0.958 Pearson correlation.

SCENARIO 3: biased training sets. Following (Zhu et al., 2021), we compute personalized PageRank scores (Page et al., 1999) π_ppr = (I − (1 − α)Ã)^{−1} with decaying coefficient α, then use them to compose a biased training set. We choose the same GNN models, GCN (Kipf & Welling, 2016a), SGC (Wu et al., 2019), GAT (Veličković et al., 2017), and PPNP (Klicpera et al., 2018), as the original paper (Zhu et al., 2021) chose for its baselines. We vary α and check how each GNN model deals with the biased training set. In the last three columns of Table 1, the performance of the GNN models drops as α increases, and the generated graphs successfully capture these trends.

Link prediction. As nodes are the minimum unit in graphs that compose edges or subgraphs, we can generate subgraphs for edges by merging the computation graphs of their component nodes. Here we show that link prediction results on the original graphs are also preserved successfully on our generated graphs. We run GCN, SGC, GIN, and GAT on the graphs, followed by a dot product or an MLP to predict link probabilities. Table 3 shows Pearson and Spearman correlations across 8 different combinations of link prediction models (4 GNN models × 2 predictors) on each dataset and across the whole set of datasets. The lower table shows detailed link prediction accuracies on the Citeseer dataset.
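The personalized PageRank matrix used in Scenario 3 can be computed directly from its closed form (a sketch; the row normalization of Ã and the α value are assumptions here, and the dense inverse would not scale to large graphs):

```python
import numpy as np

def ppr_matrix(A, alpha):
    """pi_ppr = (I - (1 - alpha) * A_tilde)^{-1} with decaying
    coefficient alpha, where A_tilde is the row-normalized adjacency."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1, keepdims=True)
    # row-normalize; isolated nodes (degree 0) keep an all-zero row
    A_tilde = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    return np.linalg.inv(np.eye(len(A)) - (1.0 - alpha) * A_tilde)

# Toy 3-node path graph; row v of pi holds node v's PPR scores over all
# nodes, which can then be used to compose a biased training set.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
pi = ppr_matrix(A, alpha=0.2)
```

Because Ã is row-stochastic here, each row of π sums to 1/α, and smaller α concentrates more mass away from the seed node, which is what makes the sampled training set increasingly biased.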
Our model generates graphs that successfully substitute for the original graphs, preserving the ranking of GNN link prediction performance with 0.754 Spearman correlation across the datasets.

5.4. PRIVACY GUARANTEES

To provide k-anonymity, we use k-means clustering with a minimum cluster size k during the quantization phase. As k increases (i.e., stronger privacy), the generative model's ability to learn the exact distributions of the original graphs is hindered, and the GNN performance gaps between original and generated graphs increase (lower Pearson and Spearman coefficients). Detailed GNN accuracies can be found in Table 12 in Appendix A.4. To provide DP for edge distributions, we use DP stochastic gradient descent (Song et al., 2013) to train the Transformer, varying the privacy cost ϵ while setting δ = 0.1. In Table 4, even with an astronomically large privacy budget (ϵ = 10^6, i.e., an almost vacuous guarantee), the performance of our generative model degrades significantly; only when we set ϵ = 10^9 (which is impractical) do we see reasonable performance. This shows the limited performance of DP-SGD on the Transformer architecture.

5.5. GRAPH STATISTICS

Given a source graph, our method generates a set of computation graphs without any node ids; in other words, attackers cannot merge the generated computation graphs to restore the original graph and re-identify node information. In Figure 4(a), green lines (Pubmed) are lower than purple and red lines (Amazon Computer and Amazon Photo) at the beginning and become higher in both plots. In Figure 4(b), purple lines (Amazon Computer) are slightly higher than red lines (Amazon Photo) until x = 300, then become lower in both plots. In the same figure, blue, yellow, and green lines (Cora, Citeseer, and Pubmed) decrease sharply compared to purple and red lines (Amazon Computer and Amazon Photo) in both plots. This shows that our generative model successfully preserves the graph structures encoded in the feature matrices.

6. CONCLUSION

We propose a new graph generation problem to enable generating benchmark graphs for GNNs that follow distributions of (possibly proprietary) source graphs with three requirements: 1) benchmark effectiveness, 2) privacy guarantee, and 3) scalability. With a novel graph encoding scheme, we reframe a large-scale graph generation problem into a medium-length sequence generation problem and apply the strong generation power of the Transformer architecture to the graph domain. Limitation of the study: This paper shows that clustering-based solutions can achieve k-anonymity privacy guarantees. We stress, however, that implementing a real-world system with strong privacy guarantees will need to consider many other aspects beyond the scope of this paper. We leave as future work the study of whether we can combine stronger privacy guarantees with those of k-anonymity to enhance privacy protection.

A APPENDIX

A.1 PROOF OF PRIVACY CLAIMS

Claim 1 (k-anonymity for node attributes and edge distributions). In the generated computation graphs, each node attribute and each edge distribution appears at least k times.

Proof. During quantization, nodes are grouped into m clusters of size at least k. The resulting (m × m) cluster-level hypergraph is mapped back to an (n × n) graph by letting the k (or more) nodes in each cluster follow their cluster's node attributes/edge distributions as follows: the nodes in the same cluster share the same feature vector, namely the average feature vector of the original nodes belonging to the cluster. When s denotes the number of sampled neighbor nodes, each node samples s clusters (with replacement) following its cluster's edge distribution over the m clusters. When a node samples cluster i, it is connected to one of the nodes in cluster i uniformly at random. In the end, each node has s neighbor nodes randomly drawn from the s clusters it sampled with its cluster's edge distribution. Likewise, all k nodes belonging to the same cluster sample neighbors following the same edge distribution. Thus each node attribute and each edge distribution appear at least k times in a generated graph. ■

Claim 2 ((ϵ, δ)-Differential Privacy for node attributes). With probability at least 1 − δ, our generative model A gives ϵ-differential privacy for any graph G, any neighboring graph G_{−v} obtained by removing a node v ∈ G, and any new computation graph G_cg generated from our model:

e^{−ϵ} ≤ Pr[A(G) = G_cg] / Pr[A(G_{−v}) = G_cg] ≤ e^{ϵ}

Proof. G_{−v} denotes a graph neighboring the original graph G but without a specific node v. During the quantization phase, we use an (ϵ, δ)-differentially private k-means clustering algorithm on node features (Chang et al., 2021), so the clustering results are differentially private with regard to each node's features. In the generated graphs, each node feature is decided by the clustering results (i.e., the average feature vector of the nodes belonging to the same cluster).
Then, by looking at the generated node features, one cannot tell whether any individual node's feature was included in the original dataset or not. ■

Remark 1 ((ϵ, δ)-Differential Privacy for edge distributions). In our model, individual nodes' edge distributions are learned and generated by the Transformer. When we use (ϵ, δ)-differentially private stochastic gradient descent (DP-SGD) (Song et al., 2013) to train the Transformer, the Transformer becomes differentially private in the sense that, by looking at the output (generated edge distributions), one cannot tell whether any individual node's edge distribution (an input to the Transformer) was included in the original dataset or not. If DP-SGD could train Transformers successfully with reasonably small ϵ and δ, we could guarantee (ϵ, δ)-differential privacy for the edge distributions of any graph generated by our model. However, as we show in Section 5.4, current DP-SGD is not yet stable for Transformer training, leading to very coarse or impractical privacy guarantees.

Claim 3 (Scalability). When we generate L-layered computation graphs with neighbor sampling number s on a graph with n nodes, the computational complexity of CGT training is O(s^{2L} n), and that of the cost-efficient version is O(L^2 s^L n).

Proof. During k-means, we randomly sample n_k node features to compute the cluster centers, then map each feature vector to the closest cluster center. By sampling n_k nodes, we limit the k-means computation cost to O(n_k^2). The sequence flattened from each computation graph has length O(1 + s + ··· + s^L), and the number of sequences (computation graphs) is O(n). The training time of the Transformer is therefore proportional to O(s^{2L} n). In total, the complexity is O(s^{2L} n + n_k^2); as s^{2L} n ≫ n_k^2, the final computational complexity is O(s^{2L} n). In the cost-efficient version, the length of each sequence (composed only of direct ancestor nodes) is reduced to L.
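The accounting behind Claim 3 can be written out step by step (restating the proof's quantities in display form):

```latex
% Length of the sequence flattened from one duplicate-encoded,
% L-layered computation graph with neighbor sampling number s:
T \;=\; 1 + s + s^2 + \cdots + s^L \;=\; \frac{s^{L+1} - 1}{s - 1} \;=\; O(s^L)

% Self-attention over a length-T sequence costs O(T^2); with O(n)
% computation graphs (one per node), Transformer training costs
O(T^2 \cdot n) \;=\; O(s^{2L}\, n)

% Cost-efficient version: each root-to-leaf sequence has length O(L),
% and each computation graph has O(s^L) leaves, hence
O(L^2 \cdot s^L \cdot n) \;=\; O(L^2 s^L\, n)
```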
However, the number of sequences increases to O(s^L) per computation graph (one root-to-leaf sequence per leaf node), so the training time of the cost-efficient version is O(L^2 s^L n). ■

A.2 DETAILED RESULTS OF THE SCALABILITY EXPERIMENT IN SECTION 5.2

In Table 5, GraphAF (Shi et al., 2020), GraphDF (Luo et al., 2021), and Graphite (Grover et al., 2019) run out of memory on both the Cora and Citeseer datasets. This is because they were originally designed for small-size molecule graphs. The remaining baselines, GVAE (Kipf & Welling, 2016b) and GraphEBM (Suhail et al., 2021), successfully generate graphs but fail to learn any meaningful node attribute/label distributions from the original graphs. In particular, GraphEBM generates graphs whose distribution collapses to the same node features/labels across all nodes, showing 100% accuracy for all GNN models, which is obviously not an effective benchmark. Note that none of the existing graph generative models is designed for GNN benchmarking, i.e., the simultaneous generation of adjacency, node feature, and node label matrices; rather, they all focus only on the generation of adjacency matrices. This result illustrates the tricky relations among graph structure, node attributes, and labels, and shows that there is large room for improvement in the graph generation field.

A.3 DETAILED GNN PERFORMANCE IN THE BENCHMARK EFFECTIVENESS EXPERIMENT IN SECTION 5.3

Tables 6, 7, 8, and 9 show GNN performance on node classification tasks across the original/quantized/generated graphs. Quantized graphs are graphs after the quantization process: each feature vector is replaced by the mean feature vector of the cluster it belongs to, and adjacency matrices are fixed to a constant by the duplicate encoding scheme. Quantized graphs are the input to CGT, and generated graphs are its output, as presented in Figure 2. Likewise, Table 10 shows GNN performance on link prediction tasks across the original/quantized/generated graphs. As presented across all five tables, our proposed generative model CGT successfully generates synthetic substitutes of large-scale real-world graphs that show task performance similar to that on the original graphs. Table 11 shows benchmark effectiveness across all 9 GNN models used in the experiments on the original graphs without any variations.
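The quantization step described above (replacing each feature vector by the mean feature vector of its cluster) can be sketched as follows, assuming the k-means cluster centers have already been computed; the function names are illustrative, not the paper's actual API:

```python
import numpy as np

def quantize(X, centers):
    """Map each feature vector to the id of its nearest cluster center.
    X: (n, d) node features; centers: (m, d) k-means centers (assumed given).
    """
    # squared distances between every node feature and every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)   # (n,) cluster ids

def dequantize(ids, centers):
    """Replace each cluster id with its center (the cluster-mean feature)."""
    return centers[ids]        # (n, d) quantized features
```

The cluster-id sequences produced by `quantize` are what the transformer consumes; `dequantize` recovers feature matrices for the generated sequences.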

A.4 DETAILED GNN PERFORMANCE IN THE PRIVACY EXPERIMENT IN SECTION 5.4

Table 12 shows the detailed privacy-GNN performance trade-off on the Cora dataset. In k-anonymity, a higher k (i.e., more nodes in the same cluster, thus stronger privacy) hinders the generative model's ability to learn the exact distributions of the original graphs, and the GNN performance gaps between original and generated graphs increase, showing lower Pearson and Spearman coefficients. In DP k-means, a smaller ϵ (i.e., a higher privacy cost) results in lower GNN performance but, surprisingly, higher Pearson and Spearman coefficients. This is because DP k-means can remove noise in graphs (while hiding outliers for privacy) and thus capture representative distributions of the original graph more easily. These results show that our Claims 1 and 2 on privacy hold in real-world experiments. As discussed in Section 4.2, DP-SGD fails to train the transformer, showing low GNN performance even with an astronomically low privacy cost (ϵ = 10^6).

A.5 DISTRIBUTIONS OF GRAPH STATISTICS IN SECTION 5.5

Figure 5 shows distributions of graph statistics on computation graphs sampled from the original/quantized/generated graphs. Quantized graphs are graphs after the quantization process: each feature vector is replaced by the mean feature vector of the cluster it belongs to, and adjacency matrices are fixed to a constant by the duplicate encoding scheme. Quantized graphs are the input to CGT, and generated graphs are its output, as presented in Figure 2. While converting original graphs into quantized graphs, CGT trades some graph-statistics information for the k-anonymity privacy benefit; accordingly, in Figure 5 the distributions of graph statistics change slightly from original to quantized graphs. CGT then learns the distributions of graph statistics on the quantized graphs and generates synthetic graphs. The variation introduced by CGT appears as the difference between the quantized and generated distributions in Figure 5.
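The Pearson and Spearman coefficients reported in Table 12 compare GNN accuracies on the original versus generated graphs. A minimal pure-Python sketch of this benchmark-effectiveness metric (assuming no tied accuracies, so the Spearman ranking is unambiguous) is:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors
    (ties ignored for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```

A score of 1 means the generated graph ranks the GNN models exactly as the original graph does.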

A.6 ABLATION STUDY

To show the importance of each component of our Computation Graph Transformer, we run three ablation studies. Table 13 shows CGT without label conditioning (conditioning on the label of the root node of the computation graph), without the positional embedding trick (giving the same positional embedding to nodes at the same layer of the computation graph), and without the masked attention trick (attending only to direct ancestor nodes in the computation graph), respectively. When we remove the positional embedding trick, we give different positional embeddings to all nodes in a computation graph, following the original transformer architecture. When we remove the attention masks, the transformer attends to all other nodes in the computation graph to compute the context embeddings. As shown in Table 13, removing any component hurts model performance. This shows not only the importance of label conditioning and of our designed positional embeddings and attention masks, but also the intricate relations among graph structure, node attributes, and labels.
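A minimal sketch of the two architectural tricks ablated here, assuming a BFS-flattened computation graph with branching factor s (illustrative helpers, not the authors' implementation):

```python
def layer_position_ids(s, L):
    """Positional-embedding trick: every node at the same depth of the
    computation graph shares one position id (its layer index)."""
    return [l for l in range(L + 1) for _ in range(s ** l)]

def ancestor_mask(s, L):
    """Masked-attention trick: token i may attend only to itself and its
    direct ancestors in the BFS-flattened complete s-ary tree, where the
    parent of token i (i >= 1) sits at index (i - 1) // s."""
    n = sum(s ** l for l in range(L + 1))
    parent = [None] + [(i - 1) // s for i in range(1, n)]
    allowed = []
    for i in range(n):
        chain, j = {i}, i
        while parent[j] is not None:
            j = parent[j]
            chain.add(j)
        allowed.append(chain)
    return [[j in allowed[i] for j in range(n)] for i in range(n)]
```

Removing the first trick corresponds to giving each token its sequence index as position id; removing the second corresponds to an all-True mask over previous tokens.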

A.7 GRAPH NEURAL NETWORKS

We briefly review graph neural networks (GNNs) and then describe how neighbor sampling operations can be applied to them.

Notations. Let G = (V, E) denote a graph with n nodes v_i ∈ V and edges (v_i, v_j) ∈ E. Denote the adjacency matrix by A = (a(v_i, v_j)) ∈ R^{n×n} and the feature matrix by X ∈ R^{n×d}, where x_i denotes the d-dimensional feature vector of node v_i.

GCN (Kipf & Welling, 2016a). GCN models stack layers of first-order spectral filters followed by nonlinear activation functions to learn node embeddings. When h_i^{(l)} denotes the hidden embedding of node v_i in the l-th layer, the simple and general form of GCNs is as follows:

h_i^{(l+1)} = α( (1 / n(i)) Σ_{j=1}^{n} a(v_i, v_j) h_j^{(l)} W^{(l)} ),  l = 0, …, L − 1    (1)

where a(v_i, v_j) is set to 1 when there is an edge from v_i to v_j, and 0 otherwise; n(i) = Σ_{j=1}^{n} a(v_i, v_j) is the degree of node v_i; α(·) is a nonlinear function; and W^{(l)} ∈ R^{d^{(l)} × d^{(l+1)}} is the learnable transformation matrix of the l-th layer, with d^{(l)} denoting the hidden dimension at the l-th layer. h_i^{(0)} is set to the input node attribute x_i.

GraphSage (Hamilton et al., 2017). GCNs require the full expansion of neighborhoods across layers, leading to high computation and memory costs. To circumvent this issue, GraphSage adds sampling operations to GCNs to regulate the neighborhood size. We first recast Equation 1 as

h_i^{(l+1)} = α_{W^{(l)}}( E_{j∼p(j|i)}[h_j^{(l)}] ),  l = 0, …, L − 1

where we fold the transformation matrix W^{(l)} into the activation function α_{W^{(l)}}(·) for concision, and p(j|i) = a(v_i, v_j) / n(i) defines the probability of sampling v_j given v_i.

Figure 5: CGT preserves distributions of graph statistics in generated graphs for each dataset. While converting original graphs into quantized graphs, CGT loses some graph-statistics information for the k-anonymity privacy benefit. The variation introduced by CGT appears as the difference between the quantized and generated distributions.
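The mean-aggregation GCN update in Equation 1 can be sketched with NumPy as follows (a toy version using a ReLU activation and no self-loops, assuming every node has at least one neighbor):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One mean-aggregation GCN layer:
    h_i^(l+1) = ReLU( (1 / n(i)) * sum_j a(v_i, v_j) h_j^(l) W^(l) ).
    A: (n, n) adjacency matrix, H: (n, d) node embeddings, W: (d, d') weights.
    """
    deg = A.sum(axis=1, keepdims=True)   # degrees n(i), assumed nonzero
    Z = (A @ H) / deg                    # mean over each node's neighbors
    return np.maximum(Z @ W, 0.0)        # ReLU nonlinearity
```

Stacking L such layers propagates information across L-hop neighborhoods, which is exactly why full neighborhood expansion becomes expensive and motivates the sampling recast above.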
Then we approximate the expectation by Monte-Carlo sampling:

h_i^{(l+1)} = α_{W^{(l)}}( (1/s) Σ_{j∼p(j|i)} h_j^{(l)} ),  l = 0, …, L − 1

where s is the number of sampled neighbors per node. We can now regulate the neighborhood size, in other words the computational footprint of each minibatch, through s.

We choose four GNN models with different aggregation strategies to examine the effect of noisy edges on aggregation: GCN (Kipf & Welling, 2016a) with a mean aggregator, GIN (Xu et al., 2018) with a sum aggregator, SGC (Wu et al., 2019) with a linear aggregator, and GAT (Veličković et al., 2017) with an attention aggregator. We choose four GNN models with different neighbor sampling strategies to examine the effect of noisy edges and of the number of sampled neighbors on GNN performance: GraphSage (Hamilton et al., 2017) with random sampling, FastGCN (Chen et al., 2018) with heuristic layer-wise sampling, AS-GCN (Huang et al., 2018) with trainable layer-wise sampling, and PASS (Yoon et al., 2021) with trainable node-wise sampling. Finally, we choose four GNN models to check their robustness to distribution shifts between training and test time, following the baselines chosen by the authors of the original paper (Zhu et al., 2021): GCN (Kipf & Welling, 2016a), SGC (Wu et al., 2019), GAT (Veličković et al., 2017), and PPNP (Klicpera et al., 2018). We implement GCN, SGC, GIN, and GAT from scratch for SCENARIO 1 (noisy edges on aggregation strategies). For SCENARIOS 2 and 3 (noisy edges and different sampling numbers on neighbor sampling), we use the open-source implementations of AS-GCN, FastGCN, and PASS uploaded by the original authors.
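The Monte-Carlo neighbor-sampling approximation above (GraphSage-style uniform sampling with replacement) can be sketched as follows, using hypothetical helpers over an adjacency-list graph:

```python
import random

def sample_neighbors(adj, node, s, rng=random):
    """Draw s neighbors of `node` uniformly with replacement, approximating
    the sampling distribution p(j | i) for an unweighted graph."""
    nbrs = adj[node]
    return [rng.choice(nbrs) for _ in range(s)]

def sampled_mean(adj, node, h, s, rng=random):
    """Average the embeddings of the s sampled neighbors: the Monte-Carlo
    estimate of E_{j~p(j|i)}[h_j] fed to the transformation/activation."""
    sampled = sample_neighbors(adj, node, s, rng)
    d = len(h[sampled[0]])
    return [sum(h[j][k] for j in sampled) / s for k in range(d)]
```

As s grows, the estimate approaches the exact neighborhood mean of Equation 1, at a linearly growing per-node cost.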
Finally, for SCENARIO 4 (distribution shift), we use the GCN, SGC, GAT, and PPNP implementations of Zhu et al. (2021).

The transformer models each conditional p_θ(s_t | s_{<t}) via softmax:

p_θ(s_t | s_{<t}) = exp( q_θ^{(L)}(s_{1:t−1})^⊤ e(s_t) ) / Σ_{s′} exp( q_θ^{(L)}(s_{1:t−1})^⊤ e(s′) )

where the node embedding e(s_t) maps the discrete input id s_t to a randomly initialized trainable vector, and the query embedding q_θ^{(L)}(s_{1:t−1}) encodes the information up to the (t−1)-th token in the sequence. The query embedding q_t^{(l)} is computed from the context embeddings h_{1:t−1}^{(l−1)} of the previous t−1 tokens and the query embedding q_t^{(l−1)} from the previous layer. The context embedding h_t^{(l)} is computed from h_{1:t}^{(l−1)}, the context embeddings of the previous t−1 tokens and of the t-th token from the previous layer. Note that, while the query embeddings have access only to the previous context embeddings h_{1:t−1}^{(l−1)}, the context embeddings attend to all tokens h_{1:t}^{(l−1)}. h_t^{(0)} is initially encoded by the node embedding e(s_t) and the position embedding p_{l(t)} that encodes the location of each token in the sequence. The query embedding is initialized with a trainable vector and the label embedding y_{s_1}, as shown in Figure 3. These two streams (query and context) of self-attention layers are stacked M times to predict the next tokens auto-regressively.

A.9 DIFFERENTIALLY PRIVATE K-MEANS AND SGD ALGORITHMS

Given a set of data points, k-means clustering identifies k points, called cluster centers, that minimize the sum of distances of the data points from their closest cluster center. However, releasing the set of cluster centers could leak information about particular users: if a data point lies significantly far from the rest of the points, the k-means clustering algorithm may return this single point as a cluster center, revealing sensitive information about it. To address this, the DP k-means clustering algorithm (Chang et al., 2021) is designed within the framework of differential privacy.
To generate a private core-set, DP k-means partitions the points into buckets of similar points, then replaces each bucket with a single weighted point while adding noise to both the counts and the averages of points within each bucket. Training a model is done through access to its parameter gradients, i.e., the gradients of the loss with respect to each parameter of the model. If this access preserves differential privacy of the training data, so does the resulting model, by the post-processing property of differential privacy. To achieve this, DP stochastic gradient descent (DP-SGD) (Song et al., 2013) modifies the minibatch stochastic optimization process to make it differentially private. We use the open-source implementation of DP k-means provided by Google's differential privacy libraries, and we extend the implementation of DP-SGD provided by the public differential privacy library Opacus.
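The core DP-SGD recipe that libraries like Opacus implement (per-example gradient clipping followed by Gaussian noise on the summed gradient) can be sketched in pure Python; this is an illustrative single optimization step, not Opacus's API:

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, lr, params,
                rng=random):
    """One DP-SGD step: clip each example's gradient to L2 norm `clip_norm`,
    sum, add Gaussian noise with std noise_multiplier * clip_norm, average
    over the batch, then take a plain gradient-descent step."""
    n = len(per_example_grads)
    d = len(params)
    summed = [0.0] * d
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(d):
            summed[k] += g[k] * scale      # clipped per-example gradient
    sigma = noise_multiplier * clip_norm   # noise scale
    noisy = [(summed[k] + rng.gauss(0.0, sigma)) / n for k in range(d)]
    return [params[k] - lr * noisy[k] for k in range(d)]
```

The clipping bounds each example's influence (sensitivity), and the noise converts the bounded-sensitivity update into an (ϵ, δ)-differentially private one; the accumulated (ϵ, δ) over training steps is tracked by a separate privacy accountant.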

A.10 EXPERIMENTAL SETTINGS

All experiments were conducted on the same p3.2xlarge Amazon EC2 instance. We run each experiment three times and report the mean and standard deviation. We evaluate on seven public datasets: three citation networks (Cora, Citeseer, and Pubmed) (Sen et al., 2008), two co-purchase graphs (Amazon Computer and Amazon Photo) (Shchur et al., 2018), and two co-authorship graphs (MS CS and MS Physics) (Shchur et al., 2018). We use all nodes when training CGT. For GNN training, we split each dataset into 50%/10%/40% training/validation/test sets. We report dataset statistics in Table 14. For the molecule graph generative models GraphAF, GraphDF, and GraphEBM, we extend the implementations in the public graph deep learning library DIG (Liu et al., 2021). We extend the implementations of GVAE and Graphite from the code uploaded by the original authors. For our Computation Graph Transformer, we use 3-layered transformers for Cora, Citeseer, Pubmed, and Amazon Computer; 4-layered transformers for Amazon Photo and MS CS; and 5-layered transformers for MS Physics, considering each graph's size. For all experiments examining the benchmark effectiveness of our model in Section 5.3, we sample s = 5 neighbors per node. For the graph statistics shown in Section 5.5, we sample s = 20 neighbors per node. Our code is publicly available (anonymized).



https://github.com/huangwb/AS-GCN
https://github.com/matenure/FastGCN
https://github.com/linkedin/PASS-GNN
https://github.com/GentleZhu/Shift-Robust-GNNs
https://github.com/google/differential-privacy/tree/main/python/dp_accounting
https://github.com/pytorch/opacus
https://github.com/tkipf/gae
https://github.com/ermongroup/graphite
https://www.dropbox.com/sh/e2ukf2djimjs4ud/AACgn0oZ0oWl0N2jILK_JEy3a?dl=0



Figure 1: 2-layered computation graphs with s = 2 neighbor samples: (a) input graph; (b) original encoding scheme results in differently-shaped adjacency (blue) and attribute (yellow) matrices per computation graph; (c) duplicate encoding scheme outputs the same adjacency matrix and identically-shaped attribute matrices.

, and node v's label Y_v ∈ R, where each of the n_v rows corresponds to a node sampled into the computation graph. Minibatch-based GNN training samples one computation graph per node in the minibatch and runs GNN models on those computation graphs, which are much smaller than the whole graph. Based on this observation, our problem reduces to: given a set of computation graphs {G

Figure 2: Overview of our benchmark graph generation framework: (1) We sample a set of computation graphs of variable shapes from the original graph, then (2) duplicate-encode them to fix adjacency matrices to a constant. (3) Duplicate-encoded feature matrices are quantized into cluster id sequences and fed into our Computation Graph Transformer. (4) Generated cluster id sequences are de-quantized back into duplicate-encoded feature matrices and fed into GNN models with the constant adjacency matrix.

Figure 3: Computation Graph Transformer (CGT): (a,b) Given a sequence flattened from the input computation graph, CGT generates context in the forward direction. e(s_t), q_t^{(l)}, and h_t^{(l)}

Figure 4: CGT preserves distributions of graph statistics in generated graphs for each dataset: Duplicate encoding infuses graph structure into feature matrices of computation graphs. In each computation graph, # zero vectors is inversely proportional to edge density, while # redundant vectors is proportional to # cycles.

In the quantization phase, we use the k-means clustering algorithm (Bradley et al., 2000) with a minimum cluster size k. Each node id is then replaced with the id of the cluster it belongs to, reducing the original (n × n) graph to an (m × m) hypergraph, where m = n/k is the number of clusters. The Computation Graph Transformer then learns the edge distributions among the m hyper-nodes (i.e., clusters) and generates a new (m × m) hypergraph, which contains at most m distinct node attributes and m distinct edge distributions. During the de-quantization phase, an (m × m)
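The reduction from the (n × n) graph to the (m × m) hypergraph can be sketched as follows; this illustrative helper counts inter-cluster edges, standing in for the edge distributions among hyper-nodes that the transformer learns:

```python
def to_hypergraph(edges, cluster_of, m):
    """Collapse an n-node graph into an m-node hypergraph of clusters.
    edges: list of directed (u, v) node-id pairs;
    cluster_of: cluster id for every node id (from the quantization phase);
    m: number of clusters. Returns an (m x m) inter-cluster edge-count matrix.
    """
    H = [[0] * m for _ in range(m)]
    for u, v in edges:
        H[cluster_of[u]][cluster_of[v]] += 1
    return H
```

Normalizing each row of the returned matrix gives an empirical edge distribution per hyper-node, which is the kind of quantity a generative model over clusters can learn and resample.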

A.1 GNN MODELS USED IN THE BENCHMARK EFFECTIVENESS EXPERIMENT IN SECTION 5.3

using the DGL library.

A.8 ARCHITECTURE OF COMPUTATION GRAPH TRANSFORMER

Given a sequence s = [s_1, ⋯, s_T], the M-layered transformer maximizes the likelihood under the forward auto-regressive factorization:

max_θ log p_θ(s) = Σ_{t=1}^{T} log p_θ(s_t | s_{<t})
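This factorization can be sketched as follows, where `next_token_logits` is a hypothetical stand-in for the transformer's output scores over all candidate tokens s′:

```python
import math

def sequence_log_likelihood(seq, next_token_logits):
    """Forward auto-regressive factorization:
    log p(s) = sum_t log p(s_t | s_<t), with each conditional given by a
    softmax over the vocabulary (logits supplied by the model)."""
    total = 0.0
    for t, token in enumerate(seq):
        logits = next_token_logits(seq[:t])            # scores for every s'
        z = math.log(sum(math.exp(x) for x in logits)) # log partition
        total += logits[token] - z                     # log-softmax at s_t
    return total
```

Training maximizes this quantity over the sequences flattened from the computation graphs; generation samples each s_t from the same conditionals in order.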


GNN performance on original and generated graphs in three different scenarios with three variations. # NE denotes number of noise edges and α denotes the PPR coefficient used for distribution shift.

Benchmark effectiveness on node classification. Pearson and Spearman scores measure the correlation in ranking of GNN models on original and generated graphs; a score of 1 denotes perfect correlation.

Benchmark effectiveness and GNN performance on link prediction.

Privacy-Performance trade-off in graph generation on the Cora dataset

GNN performance on graphs generated by baseline generative models. Except for our method, no existing graph generative model can generate a set of adjacency, node feature, and node label matrices that reproduces reasonable GNN performance.

GNN performance on SCENARIO 1: noisy edges on aggregation strategies.

GNN performance on SCENARIO 2: noisy edges on neighbor sampling.

GNN performance on SCENARIO 3: different sampling numbers on neighbor sampling.

GNN performance on SCENARIO 4: distribution shift.

GNN performance on link prediction.

Benchmark effectiveness across 9 GNN models without any graph variations

Privacy-Performance trade-off in graph generation on the Cora dataset

Ablation study. Label, Position, and Attention denote ablation of label conditioning, positional embeddings, and masked attention proposed in Section 4.1, respectively. Pearson scores measure the correlation in ranking of GNN models on original and generated graphs; a score of 1 denotes perfect correlation.

Dataset statistics.

