GOGGLE: GENERATIVE MODELLING FOR TABULAR DATA BY LEARNING RELATIONAL STRUCTURE

Abstract

Deep generative models learn highly complex and non-linear representations to generate realistic synthetic data. While they have achieved notable success in computer vision and natural language processing, similar advances have been less demonstrable in the tabular domain. This is partially because generative modelling of tabular data entails a particular set of challenges, including heterogeneous relationships, limited sample sizes, and difficulties in incorporating prior knowledge. Additionally, unlike their counterparts in the image and sequence domains, deep generative models for tabular data almost exclusively employ fully-connected layers, which encode weak inductive biases about relationships between inputs. Real-world data generating processes can often be represented using relational structures, which encode sparse, heterogeneous relationships between variables. In this work, we learn and exploit the relational structure underlying tabular data (where typical dimensionality d < 100) to better model variable dependence, and as a natural means to regularize relationships and incorporate prior knowledge. Specifically, we introduce GOGGLE, an end-to-end message passing scheme that jointly learns the relational structure and the corresponding functional relationships as the basis for generating synthetic samples. Using real-world datasets, we provide empirical evidence that the proposed method is effective in generating realistic synthetic data and exploiting domain knowledge for downstream tasks.

1. INTRODUCTION

Learning generative models for synthetic data is an important area in machine learning with many applications. For example, synthetic data can be used to simulate settings where real data is scarce or unavailable [7, 10], support better supervised learning by improving dataset quality [8], improve robustness and predictive performance [69, 60], and promote fairness [72]. Additionally, synthetic data is increasingly being used to overcome usage restrictions while preserving privacy [32, 78, 54].

Deep generative models have achieved notable success in approximating complicated, high-dimensional distributions as encountered in computer vision, natural language processing, and more [46]. A key contributor to this success is that learning architectures can easily exploit relational inductive biases that enhance the learning of joint distributions. Informally, relational inductive biases encode assumptions about the relational structure, which describes variables and their relationships [3]. For example, image variables (pixels) have high covariance within a local region and follow relational rules that are invariant across regions, properties which are exploited by kernels in convolutional neural networks (CNNs) to better model image distributions [42]. Similarly, sequence variables depend strongly on their ordering, and the relational rules are invariant across time steps, recurrence relations leveraged by recurrent neural networks (RNNs) to capture distributions over time [27]. In this work, we hope to exploit similar relational inductive biases to better model real-world tabular datasets (where typical dimensionality d < 100). However, while images and sequences are
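To make the core idea concrete, the following is a minimal, self-contained sketch (not GOGGLE's actual architecture) of one message-passing step over a relational structure between tabular variables: each variable carries an embedding, and messages are aggregated from neighbours defined by an adjacency matrix. All names, shapes, and the fixed random adjacency are illustrative assumptions; in GOGGLE the structure itself would be learned jointly with the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h = 8, 4  # number of tabular variables, embedding size (illustrative)

# Hypothetical relational structure: a sparse adjacency over the d variables.
# Here it is sampled at random; in the paper's setting it would be learned.
A = (rng.random((d, d)) < 0.3).astype(float)
np.fill_diagonal(A, 1.0)  # self-loops so each variable retains its own state

# Row-normalise so each variable averages the messages from its neighbours.
A_norm = A / A.sum(axis=1, keepdims=True)

W = rng.standard_normal((h, h)) * 0.1  # shared message/update weights
H = rng.standard_normal((d, h))        # one embedding per variable

def message_passing_step(H, A_norm, W):
    """Aggregate neighbour embeddings along A, then apply a shared linear map + ReLU."""
    return np.maximum(A_norm @ H @ W, 0.0)

H_next = message_passing_step(H, A_norm, W)
print(H_next.shape)  # one updated embedding per variable: (8, 4)
```

Because the same weights `W` are shared across all variables while the adjacency is sparse and variable-specific, this layer encodes exactly the kind of relational inductive bias the paragraph above describes for CNNs and RNNs, but adapted to heterogeneous tabular dependencies.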

