GOGGLE: GENERATIVE MODELLING FOR TABULAR DATA BY LEARNING RELATIONAL STRUCTURE

Abstract

Deep generative models learn highly complex and non-linear representations to generate realistic synthetic data. While they have achieved notable success in computer vision and natural language processing, similar advances have been less demonstrable in the tabular domain. This is partially because generative modelling of tabular data entails a particular set of challenges, including heterogeneous relationships, a limited number of samples, and difficulties in incorporating prior knowledge. Additionally, unlike their counterparts in the image and sequence domains, deep generative models for tabular data almost exclusively employ fully-connected layers, which encode weak inductive biases about relationships between inputs. Real-world data generating processes can often be represented using relational structures that encode sparse, heterogeneous relationships between variables. In this work, we learn and exploit the relational structure underlying tabular data (where typical dimensionality d < 100) to better model variable dependence, and as a natural means to introduce regularization on relationships and to include prior knowledge. Specifically, we introduce GOGGLE, an end-to-end message passing scheme that jointly learns the relational structure and corresponding functional relationships as the basis of generating synthetic samples. Using real-world datasets, we provide empirical evidence that the proposed method is effective in generating realistic synthetic data and exploiting domain knowledge for downstream tasks.

1. INTRODUCTION

Learning generative models for synthetic data is an important area in machine learning with many applications. For example, synthetic data can be used to simulate settings where real data is scarce or unavailable [7, 10], support better supervised learning by increasing dataset quality [8], improve robustness and predictive performance [69, 60], and promote fairness [72]. Additionally, synthetic data is increasingly being used to overcome usage restrictions while preserving privacy [32, 78, 54]. Deep generative models have achieved notable success in approximating complicated, high-dimensional distributions as encountered in computer vision, natural language processing, and more [46]. A key contributor to this success is that learning architectures can easily exploit relational inductive biases that enhance the learning of joint distributions. Informally, relational inductive biases encode assumptions about the relational structure, which describes variables and their relationships [3]. For example, image variables (pixels) have high covariance within a local region and relational rules that are invariant across regions, properties exploited by kernels in convolutional neural networks (CNNs) to better model image distributions [42]. Similarly, sequence variables depend strongly on sequentiality, and relational rules are invariant across time-steps, recurrence relations leveraged by recurrent neural networks (RNNs) to capture distributions over time [27].

In contrast to these homogeneous data formats with known relational structure, tabular data commonly contain more heterogeneous relationships (e.g. variables are only correlated with a small subset of other variables), where the exact relational structure is obscured by domain-specific knowledge [62]. Without an obvious relational structure, deep generative models almost exclusively employ multilayer perceptrons (MLPs) to learn representations on tabular data.
This is less than ideal, as MLPs encode virtually no relational information: every variable can interact to determine any other variable's value. However, this all-to-all relational structure is often unnecessary; the data generating process (DGP) of tabular data is better described using sparse relational structures [39, 4]. Variable dependencies can be more accurately captured by considering them as edges and learning representations over the resulting relational structure. We hypothesize that generative models exploiting the relational structure can more adequately address certain challenges that arise when modelling distributions for tabular data, including
▶ heterogeneous relationships between variables,
▶ smaller datasets, which are more prone to overfitting, and
▶ the lack of a mechanism to incorporate prior knowledge that can improve modelling performance.

Contributions. We introduce Generative MOdellinG with Graph LEarning (GOGGLE), an end-to-end framework that learns an approximate relational structure as the foundation of generative modelling. More specifically, we devise a general message passing scheme that models tabular data by jointly learning (1) the relational structure and (2) the corresponding functional relationships (dependencies) in the learned structure. To the best of our knowledge, this is the first work to jointly learn the relational structure and the parameters of a generative model for tabular data. Additionally, we propose regularization on variable dependencies to reduce model overfitting on smaller tabular datasets, as well as a simple mechanism to include prior knowledge in the generative process. We demonstrate the advantages of our approach in a series of experiments on multiple real-world datasets.
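The idea of jointly learning the relational structure and the functional relationships over it can be sketched as message passing over a learned soft adjacency matrix: edge logits parameterize which variables exchange messages, while a shared transform parameterizes how neighbor information is combined. The sketch below is purely illustrative, not the paper's implementation; the names (`A_logits`, `W_msg`, `message_passing_step`), the sigmoid edge parameterization, and the residual tanh update are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 8, 4  # number of variables (nodes), hidden size per node

# Learnable edge logits: sigmoid(A_logits) gives a soft adjacency matrix,
# i.e. the (approximate) relational structure. Trained jointly with W_msg
# in the full model; here both are just randomly initialized.
A_logits = rng.normal(size=(d, d))
W_msg = rng.normal(size=(h, h)) * 0.1  # shared functional relationship

def message_passing_step(X, A_logits, W_msg):
    """One round of message passing over the learned soft graph.

    X: (d, h) array of node embeddings, one row per variable.
    Each variable aggregates transformed messages from its soft
    neighbors, weighted by the learned edge probabilities.
    """
    A = 1.0 / (1.0 + np.exp(-A_logits))  # edge probabilities in (0, 1)
    np.fill_diagonal(A, 0.0)             # no self-loops
    messages = A @ (X @ W_msg)           # weighted sum of neighbor messages
    return np.tanh(X + messages)         # residual, bounded update

X = rng.normal(size=(d, h))
X_new = message_passing_step(X, A_logits, W_msg)
```

Because the adjacency is a differentiable function of `A_logits`, gradients from a generative objective can update the structure and the message function together, which is the joint-learning property described above.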
We employ both qualitative and quantitative approaches to demonstrate that GOGGLE achieves consistent improvements over state-of-the-art benchmarks in generating synthetic data and exploiting prior knowledge for better downstream performance.

2. CHALLENGES IN TABULAR DATA GENERATION

While deep generative models have seen notable success in image and sequence domains, tabular data is ubiquitous in many salient applications, including medicine, finance, and economics. Generative modelling for tabular data presents a distinct set of challenges, which are largely open research questions. Here, we highlight them in turn:

1. Complicated relational structure. Tabular data commonly contain heterogeneous relational structures, including sparse dependencies (variables only dependent on a small subset of other variables) and heterogeneous functional relationships (dependencies) between variables [62]. Unlike in images and sequences, where the relational structures (locality and sequentiality, respectively) are better understood (and arguably generalizable), variable dependencies in tabular datasets are domain specific and rarely known.

2. Overfitting and memorization. Modern deep generative models are over-parameterized, thus requiring large datasets to learn the underlying distribution without overfitting [1]. This is especially demanding for tabular datasets, which are often small and difficult or expensive to collect. Real-world DGPs are sparse in structure, and we address overfitting concerns by enforcing sparsity in variable dependencies, achieving a regularization effect by restricting the hypothesis space. Additionally, as variables can only be generated using their neighborhoods, the model is incentivized to find informative neighbors.

3. Domain knowledge. In many fields, such as medicine or the social sciences, we have rich domain knowledge on dependence between variables, sparsity, or the importance of specific variables (i.e. degree of connectivity [74]). Incorporating prior knowledge is especially useful in practical settings where we may not have large datasets but can obtain expert knowledge to aid model learning. To the best of our knowledge, this is a capability currently lacking in tabular deep generative models.
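Challenges 2 and 3 above admit simple mechanisms once the relational structure is an explicit, learnable object: an L1 penalty on the soft adjacency encourages sparse dependencies, and a binary mask from domain experts can remove edges known to be absent. The snippet below is a minimal sketch under our own naming assumptions (`A_logits`, `prior_mask`), not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # number of variables

A_logits = rng.normal(size=(d, d))
A = 1.0 / (1.0 + np.exp(-A_logits))  # soft adjacency in (0, 1)

# Challenge 2 (overfitting): an L1 penalty on the soft adjacency,
# added to the generative loss, pushes edge weights toward zero and
# restricts the hypothesis space to sparse dependency structures.
sparsity_penalty = np.abs(A).sum()

# Challenge 3 (domain knowledge): a binary expert mask forbids edges
# known to be absent (0) while leaving the rest learnable (1).
prior_mask = np.ones((d, d))
prior_mask[0, 3] = prior_mask[3, 0] = 0.0  # expert: vars 0 and 3 unrelated
A_constrained = A * prior_mask  # masked structure used for message passing
```

Both mechanisms act only on the structure, so they compose with any choice of message function or generative objective.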
Our generative process takes the relational structure into account, allowing a diverse range of (partial) domain knowledge to be incorporated.

A distinction. We emphasize that the goal of our work is not probabilistic structure discovery, which aims to discover the unique probabilistic graph from observed data [15, 85]. As there is

