SYNTHETIC DATA GENERATION OF MANY-TO-MANY DATASETS VIA RANDOM GRAPH GENERATION

Abstract

Synthetic data generation (SDG) has become a popular approach to release private datasets. In SDG, a generative model is fitted on the private real data, and samples drawn from the model are released as the protected synthetic data. While real-world datasets usually consist of multiple tables with potential many-to-many relationships (i.e. many-to-many datasets), recent research in SDG mostly focuses on modeling tables independently or only considers generating datasets with special cases of many-to-many relationships such as one-to-many. In this paper, we first study challenges of building faithful generative models for many-to-many datasets, identifying limitations of existing methods. We then present a novel factorization for many-to-many generative models, which leads to a scalable generation framework by combining recent results from random graph theory and representation learning. Finally, we extend the framework to establish the notion of (ϵ, δ)-differential privacy. Through a real-world dataset, we demonstrate that our method can generate synthetic datasets while preserving information within and across tables better than its closest competitor.

1. INTRODUCTION

Private data release has gained much attention in recent years due to new privacy regulations such as the General Data Protection Regulation (GDPR). To comply with such regulations and to protect privacy, a popular approach from the machine learning community called synthetic data generation (SDG) has been developed in various domains (Nowok et al., 2016; Montanez et al., 2018; Xu & Veeramachaneni, 2018; Xu et al., 2019; Lin et al., 2020; Tucker et al., 2020; Xu et al., 2021; Ziller et al., 2021). In SDG, synthetic data that is statistically similar but not identical to the real data is released in place of the real data to be protected. In addition, special attention has been paid to making these methods work well on tabular data, as most real-world datasets are stored as tables in databases (Montanez et al., 2018; Xu & Veeramachaneni, 2018; Xu et al., 2019; Nazabal et al., 2020; Ma et al., 2020).

At the core of most SDG methods are generative models: probabilistic models from which one can draw samples. Usually a generative model is first trained to capture the distribution of the private real data, and samples are then drawn from the model and released as protected synthetic data. However, without extra care this procedure does not guarantee privacy. To tackle this, research in this direction has also focused on differentially private generative models (Zhang et al., 2017; Xie et al., 2018; Jordon et al., 2018), i.e. generative models that satisfy a notion of privacy called differential privacy (Dwork et al., 2014), which controls the amount of information an individual datum can reveal. Appendix A provides an illustration of how generative models are used for SDG.

While real-world datasets usually consist of multiple tables with potential many-to-many relationships, SDG for this type of data is not well studied.
Most recent work focuses on modeling tables independently (Xu & Veeramachaneni, 2018; Xu et al., 2019; Nazabal et al., 2020; Ma et al., 2020) or considers generating many-to-many relationships only for special cases such as one-to-many (Getoor et al., 2007; Montanez et al., 2018). As a side effect, privacy for multi-table SDG with many-to-many relationships is also under-studied. In fact, relationships in real data can also reveal private information (Sala et al., 2011; Proserpio et al., 2012). For example, in a customer-merchant dataset, the number of links to a merchant could reveal its identity because of its uncommon popularity.

[Figure 1: An illustration of bipartite graphs and their related representations and statistics. (a) Bipartite graph $\mathcal{B}$ with annotations for node degrees. (b) Adjacency matrix $\mathbf{G}$ (left) and biadjacency matrix $\mathbf{B}$ (right). (c) BJDD matrix $P_J$ with supports $d_u = \deg(n_u)$, $d_v = \deg(n_v)$.]

This paper studies the challenge of building faithful¹ generative models (Webb et al., 2018) to synthesize data together with their many-to-many relationships, and proposes a novel factorization for many-to-many generative models, which leads to a scalable approach by combining results from random graph theory and representation learning. In short, our paper makes the following contributions:

1. We study possible factorizations of faithful generative models for many-to-many data and identify limitations of those taken by existing approaches.
2. We propose a novel factorization for modeling distributions over many-to-many data and use it to develop a synthetic data generation framework based on methods from random graph generation, node representation learning and set representation learning.
3. We extend the proposed framework to establish the notion of (ϵ, δ)-differential privacy.
4. We evaluate two model instances from our framework, BayesM2M and NeuralM2M, on the MOVIELENS dataset, demonstrating their superior performance over the closest competitor, SDV, especially in capturing information in many-to-many relationships.

2. BACKGROUND AND NOTATIONS

Bipartite and multipartite graphs. A bipartite graph $\mathcal{B} := (U, V, L)$ is a tuple of two disjoint node sets $U, V$ (called upper and lower nodes respectively) together with a set of edges $L$ between them. Each node is represented by a tuple of its index and attribute, e.g. $n_u^k := (i_u^k, x_u^k) \in U$ for $k = 1, \dots, |U|$. Each edge is represented by a tuple of node indices, i.e. $l^k := (i_u^k, i_v^k) \in L$ for $k = 1, \dots, |L|$. Such edge sets can be represented as adjacency matrices $\mathbf{G}$ or biadjacency matrices $\mathbf{B}$; see Figure 1b for an illustration. A generalised notion of bipartite graphs, called multipartite graphs and denoted as $\mathcal{M} := (\{T_{k_1}\}_{k_1=1}^{N}, \{L_{k_2}\}_{k_2=1}^{M})$, are graphs with $N$ disjoint node sets and $M \le \binom{N}{2}$ edge sets.

Modeling datasets with many-to-many relationships.² Datasets with many-to-many relationships, such as relational databases, can be viewed as multipartite graphs. Denote the generating distribution of a given multipartite graph $\mathcal{M}_{\text{data}}$ as $p_{\text{data}}(\mathcal{M})$. The goal of this paper is to build a generative model $p_\theta(\mathcal{M})$ with learnable parameter $\theta$ and to develop a learning algorithm $A_1$ that takes $\mathcal{M}_{\text{data}}$ as input and outputs the optimal model $p_{\theta^*}$ with parameter $\theta^*$, such that $p_{\theta^*}(\mathcal{M})$ is close to $p_{\text{data}}(\mathcal{M})$. After learning, a sample $\mathcal{M}$ is drawn from the model and used as synthetic data; we denote this sampling process/algorithm as $A_2$.

* Now at Amazon; work done prior to joining Amazon.
† Also a Ph.D. candidate at University College London.
¹ The term "faithful" as in faithful generative models refers to the fact that the model does not introduce any conditional independence that is not true in general (Webb et al., 2018). This holds for how our proposed model factorizes individual tables as well as their relationships.
² Note that we intentionally avoid terms such as primary/foreign keys and parents/children from the relational database literature. This is because, in order to model the joint distribution for a given dataset, parent-child directions can be rearranged for modeling convenience in favor of a particular factorization of the joint.
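To make the bipartite-graph representations of this section concrete, the following is a minimal sketch that rebuilds the quantities shown in Figure 1 from its biadjacency matrix. The function names are hypothetical and chosen only for illustration. The BJDD matrix is read here as a count of edges by the degree pair of their endpoints; that reading is an assumption, although it reproduces the values shown in panel (c).

```python
import numpy as np

# Biadjacency matrix of the Figure 1 example:
# rows index upper nodes (U), columns index lower nodes (V).
B = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
])


def adjacency_from_biadjacency(B):
    """Embed the |U| x |V| biadjacency matrix into the full symmetric adjacency matrix."""
    n_u, n_v = B.shape
    G = np.zeros((n_u + n_v, n_u + n_v), dtype=B.dtype)
    G[:n_u, n_u:] = B    # edges from upper to lower nodes
    G[n_u:, :n_u] = B.T  # the same edges seen from the lower nodes
    return G


def degrees(B):
    """Return (upper-node degrees, lower-node degrees)."""
    return B.sum(axis=1), B.sum(axis=0)


def bjdd_counts(B):
    """Count edges by (degree of upper endpoint, degree of lower endpoint).

    Assumed reading of the BJDD matrix: entry (d_u - 1, d_v - 1) is the number
    of edges joining an upper node of degree d_u to a lower node of degree d_v.
    """
    deg_u, deg_v = degrees(B)
    P = np.zeros((deg_u.max(), deg_v.max()), dtype=int)
    for k_u, k_v in zip(*np.nonzero(B)):
        P[deg_u[k_u] - 1, deg_v[k_v] - 1] += 1
    return P


deg_u, deg_v = degrees(B)
G = adjacency_from_biadjacency(B)
P_J = bjdd_counts(B)
```

On this example the upper-node degrees are 3, 1, 2, 1 and the lower-node degrees are 3, 3, 1, matching the annotations in panel (a), and the entries of `P_J` sum to the number of edges.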

In this work, we are interested in establishing the standard (ϵ, δ)-differential privacy, a.k.a. (ϵ, δ)-DP or approximate DP. Denote $A = A_2 \circ A_1$. We say a model $p_\theta$ with $A$ is (ϵ, δ)-DP according to the following definition (Dwork et al., 2014):

Definition 1 ((ϵ, δ)-DP for $p_\theta$ with $A$). A generative model $p_\theta$ with the synthesis algorithm $A$ is (ϵ, δ)-DP if for all $S \subseteq \mathrm{Range}(A)$ and for all $\mathcal{B}, \mathcal{B}'$ that differ on a single element:

$$P(A(\mathcal{B}) \in S) \le \exp(\epsilon)\, P(A(\mathcal{B}') \in S) + \delta.$$
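As an illustration of how the inequality in Definition 1 can be checked, the sketch below verifies it exhaustively for a toy mechanism with a finite output range. The mechanism is randomized response on a single private bit, which is a standard textbook example and not the synthesis algorithm of this paper. All function names are hypothetical.

```python
import math
from itertools import chain, combinations


def randomized_response_dist(bit, epsilon):
    """Output distribution of randomized response applied to one private bit.

    The true bit is reported with probability e^eps / (1 + e^eps),
    and the flipped bit otherwise.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}


def all_subsets(outcomes):
    """Enumerate every subset S of a finite output range."""
    items = list(outcomes)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def satisfies_dp(dist_a, dist_b, epsilon, delta, tol=1e-12):
    """Check P(A(B) in S) <= exp(eps) * P(A(B') in S) + delta for all S, in both directions.

    dist_a and dist_b are the output distributions of the mechanism on two
    neighbouring inputs; tol absorbs floating-point rounding.
    """
    bound = math.exp(epsilon)
    for S in all_subsets(set(dist_a) | set(dist_b)):
        p_a = sum(dist_a.get(o, 0.0) for o in S)
        p_b = sum(dist_b.get(o, 0.0) for o in S)
        if p_a > bound * p_b + delta + tol or p_b > bound * p_a + delta + tol:
            return False
    return True


# Neighbouring inputs: the single private bit is 0 or 1.
dist_0 = randomized_response_dist(0, epsilon=1.0)
dist_1 = randomized_response_dist(1, epsilon=1.0)

holds_at_eps_1 = satisfies_dp(dist_0, dist_1, epsilon=1.0, delta=0.0)
holds_at_eps_half = satisfies_dp(dist_0, dist_1, epsilon=0.5, delta=0.0)
holds_with_slack = satisfies_dp(dist_0, dist_1, epsilon=0.5, delta=0.3)
```

The mechanism calibrated for ϵ = 1 meets the bound at ϵ = 1 with δ = 0, fails the tighter ϵ = 0.5 with δ = 0, and passes again once δ = 0.3 absorbs the excess, which shows how δ relaxes the pure ϵ guarantee.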

