GENERATED GRAPH DETECTION

Abstract

Graph generative models have become increasingly effective at approximating data distributions and augmenting data. Although still confined to research sandboxes, they have raised public concerns about malicious misuse and misinformation, just as Deepfake visual and auditory media has done to society. It is never too early to regulate the prevalence of generated graphs. As a preventive response, we pioneer the formulation of the generated graph detection problem: distinguishing generated graphs from real ones. We propose the first framework to systematically investigate a set of sophisticated models and their performance in four classification scenarios. Each scenario switches between seen and unseen datasets/generators during testing to approach real-world settings and progressively challenge the classifiers. Extensive experiments show that all the models are qualified for generated graph detection, with specific models having advantages in specific scenarios. Given the validated generality of the classifiers and their obliviousness to unseen datasets/generators, we draw a safe conclusion that our solution can remain effective for a decent while to curb the misuse of generated graphs.

1. INTRODUCTION

Graph generative models aim to learn the distributions of real graphs and generate synthetic ones Xie et al. (2022); Liu et al. (2021); Wu et al. (2021b). Generated graphs have found applications in numerous domains, such as social networks Qiu et al. (2018), e-commerce Li et al. (2020), chemoinformatics Kearnes et al. (2016), etc. In particular, with the development of deep learning, graph generative models have witnessed significant advancement in the past five years Stoyanovich et al. (2020); Liao et al. (2019); Kipf & Welling (2016); You et al. (2018a).

However, every coin has two sides: there is a concern that synthetic graphs can be misused. For example, molecular graphs are used to design new drugs Simonovsky & Komodakis (2018); You et al. (2018a). Generated graphs can be misused in this process, and it is important for pharmaceutical companies to vet the authenticity of molecular graphs. Moreover, synthetic graphs make deep graph learning models more vulnerable to well-designed attacks. Existing graph-level backdoor attacks Xi et al. (2021) and membership inference attacks Wu et al. (2021a) require the attackers to train their local models on data from the same or a similar distribution as that of the target models. Adversarial graph generation enables attackers to generate graphs that are close to real graphs; this helps the attackers build better attack models locally and thus keeps those attacks stealthier (since the attackers can minimize their interaction with the target models). This advantage also applies to the latest graph attacks, such as the property inference attack Zhang et al. (2022) and the GNN model stealing attack Shen et al. (2022). As a result, it is essential to regulate the prevalence of generated graphs. In this paper, we proactively target the generated graph detection problem, i.e., we study whether generated graphs can be differentiated from real graphs with machine learning classifiers.

To detect generated graphs, we train graph neural network (GNN)-based classifiers, building on their demonstrated effectiveness in encoding and classifying graphs Zhang et al. (2020); Kipf & Welling (2017); Hamilton et al. (2017). Figure 2 illustrates the general pipeline of generated graph detection. To evaluate the classifiers' accuracy and generalizability, we test graphs from varying datasets and/or varying generators that are progressively extended towards those unseen during training. The seen setting for a dataset or generator means that the graphs used in the training and testing stages come from the same dataset or are generated by the same generator, respectively; that is, they share the same or a similar distribution. The unseen setting represents the opposite. To enrich our solution space, we study three representative classification models. The first model is a direct application of GNN-based end-to-end classifiers Kipf & Welling (2017); Hamilton et al. (2017); Chen et al. (2018); Xu et al. (2019b). The second model shares the spirit of contrastive learning for images Chen et al. (2020); Wu et al. (2018); Hénaff (2020) and graphs Zhu et al. (2021a); You et al. (2020); Hassani & Ahmadi (2020); Zhu et al. (2021b), which, as one of the cutting-edge self-supervised representation learning approaches, learns similar representations for the same data under different augmentations. We adapt graph contrastive learning to learn similar representations of graphs from the same source under different augmentations. The third model is based on deep metric learning Xing et al. (2002); Schroff et al. (2015); Song et al. (2016), which learns close/distant representations for data from the same/different classes. We adopt metric learning to learn close/distant representations for graphs from the same/different sources.

We systematically conduct experiments under different settings for all the classification models to demonstrate the effectiveness of our framework. Moreover, we conduct a dataset-oblivious study that mixes various datasets in order to evaluate the influence along the dataset dimension. The evidenced dataset-oblivious property makes the classifiers independent of any specific dataset and practical in real-world situations.

2. PRELIMINARIES

Notations. We define an undirected and unweighted homogeneous graph as G = (V, E, A), where V = {v_1, v_2, ..., v_n} represents the set of nodes, E ⊆ {(v, u) | v, u ∈ V} is the set of edges, and A ∈ {0, 1}^{n×n} denotes G's adjacency matrix. We denote the embedding of a node u ∈ V as h_u and the embedding of the whole graph G as h_G.

Graph Neural Networks. Graph neural networks (GNNs) have shown great effectiveness in fusing information from both the graph topology and node features Zhang et al. (2020); Hamilton et al. (2017); Kipf & Welling (2017). In recent years, they have become the state-of-the-art technique serving as essential building blocks in graph generators and graph classification algorithms. A GNN normally takes the graph structure as the input for message passing, during which the neighborhood information of each node u is aggregated to obtain a more comprehensive representation h_u. Details of GNNs are described in Appendix A.1.

Graph Generators. Graph generators aim to produce graph-structured data resembling observed graphs regardless of the domain, which is fundamental in graph generative models. The study of graph generators dates back at least to the work of Erdös & Rényi (1959). These traditional graph generators focus on various random models Erdös & Rényi (1959); Albert & Barabási (2002), which typically use simple stochastic generation methods such as random or preferential attachment mechanisms. However, the traditional models require prior knowledge to obtain/tune their parameters and tie them to specific properties (e.g., the probability of connecting to other nodes), hence their limited capacity for handling complex dependencies among properties. Recently, graph generators using GNNs as the foundation have attracted considerable attention Liao et al. (2019); You et al. (2018b); Grover et al. (2019); Simonovsky & Komodakis (2018). GNN-based graph generators can be further grouped into two categories: autoencoder-based generators and autoregressive-based generators. An autoencoder-based generator Kipf & Welling (2016); Grover et al. (2019); Mehta et al. (2019); Simonovsky & Komodakis (2018) is a type of neural network that learns representations of unlabeled data and reconstructs the input graphs from those representations. An autoregressive-based generator Liao et al. (2019); You et al. (2018b) uses sophisticated models to better capture the properties of observed graphs; by generating graphs sequentially, these models can leverage the complex dependencies in the generated graphs. In this paper, we selectively focus on eight graph generators that span the space of commonly used architectures, including ER Erdös & Rényi (1959), BA Albert & Barabási (2002), GRAN Liao et al. (2019), VGAE Kipf & Welling (2016), Graphite Grover et al. (2019), GraphRNN You et al. (2018b), SBMGNN Mehta et al. (2019), and GraphVAE Simonovsky & Komodakis (2018) (see more detailed information about graph generators in Appendix A.2).
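As a minimal illustration of the two traditional generators above, ER and BA graphs can be sampled with the networkx library; the node counts and parameters here are arbitrary examples, not the settings used in our experiments:

```python
import networkx as nx

# Erdos-Renyi (ER): each of the n*(n-1)/2 possible edges appears
# independently with probability p.
er = nx.erdos_renyi_graph(n=50, p=0.1, seed=0)

# Barabasi-Albert (BA): nodes arrive one at a time and attach m edges
# preferentially to existing high-degree nodes.
ba = nx.barabasi_albert_graph(n=50, m=2, seed=0)

print(er.number_of_nodes(), ba.number_of_edges())
```

Both models are governed by a couple of scalar parameters (p, or m), which is exactly the prior knowledge requirement and limited expressiveness noted above.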

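The neighborhood aggregation that GNNs perform, as sketched in the preliminaries, can be illustrated with a generic mean-aggregation layer in plain NumPy. This is an illustrative sketch only, not the exact architecture of any model cited above; the self-loop trick, the weight matrix W, and the ReLU non-linearity are common choices we assume for concreteness:

```python
import numpy as np

def message_passing_layer(A, H, W):
    """One mean-aggregation step: each node u averages the features of its
    neighbors (plus itself) and applies a linear map followed by ReLU."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                  # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    H_agg = (A_hat @ H) / deg              # mean over the closed neighborhood
    return np.maximum(H_agg @ W, 0.0)      # ReLU non-linearity

# Toy 3-node path graph with 2-dimensional node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3, 2)   # one-hot-like initial features
W = np.eye(2)      # identity weights for readability
H1 = message_passing_layer(A, H, W)
print(H1.shape)  # (3, 2)
```

Stacking several such layers and pooling the node representations h_u (e.g., by mean or sum) yields the whole-graph embedding h_G used for classification.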

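The metric-learning model mentioned in the introduction relies on a triplet-style objective that pulls graphs from the same source together and pushes graphs from different sources apart. The following is a minimal sketch of such a loss on precomputed embeddings; the margin value, the Euclidean distance, and the toy embeddings are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: the anchor should be closer to the positive (same source)
    than to the negative (different source) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])   # embedding of an anchor graph
p = np.array([0.1, 0.0])   # graph from the same source (real or generated)
n = np.array([2.0, 0.0])   # graph from a different source
print(triplet_loss(a, p, n))  # 0.0 -> this triplet already satisfies the margin
```

During training, the embedding network is updated so that such triplets incur zero loss, after which source membership can be decided by distance in the learned space.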