STASY: SCORE-BASED TABULAR DATA SYNTHESIS

Abstract

Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, these methods have not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy based on the paradigm of score-based generative modeling. Although score-based generative models have resolved many issues in generative modeling, there still exists room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increase the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependent evaluations and diversity.

1. INTRODUCTION

Tabular data synthesis is of non-trivial importance in real-world applications for various reasons: protecting the privacy of original tabular data by releasing fake tabular data (Park et al., 2018; Lee et al., 2021), augmenting the original tabular data with fake data to better train machine learning models (Chawla et al., 2002; Han et al., 2005; He et al., 2008; Kim et al., 2022), and so on. To this end, we first build a basic score-based generative model for tabular data, denoted Naïve-STaSy. In order to alleviate its training difficulty, we design i) a self-paced learning method and ii) a fine-tuning approach. Our self-paced learning technique modifies the objective function so that the model is trained on records from easy to hard, based on their loss values; the model learns records selectively at first and covers all of them eventually, which leads to better training. In addition, our fine-tuning method, which modestly adjusts the model parameters, further improves the sampling quality and diversity. In Table 1, we summarize our experimental results, comparing STaSy with other existing tabular data synthesis methods in terms of sampling quality, diversity, and time. As shown, even our basic model without the proposed self-paced learning and fine-tuning, denoted Naïve-STaSy, significantly outperforms all baselines except in runtime. In summary, our contributions are as follows:
i) We design a score-based generative model for tabular data synthesis.
ii) We alleviate the training difficulty of the denoising score matching loss by designing a self-paced learning strategy, and further enhance the sampling quality and diversity using a proposed fine-tuning method. STaSy thus clearly balances the generative learning trilemma: sampling quality, diversity, and time.
iii) Our proposed method outperforms other deep learning methods in all cases by large margins, which we consider a significant advance in the field of tabular data synthesis.
iv) We evaluate various methods in terms of the generative learning trilemma in a rigorous manner.
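As a concrete illustration, the easy-to-hard idea behind self-paced learning can be sketched as follows. This is a minimal hard-threshold variant written in plain NumPy; the function names and the threshold `lam` (raised over the course of training) are illustrative assumptions, not STaSy's exact weighting scheme:

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weights: records whose loss is below the threshold
    `lam` are kept (weight 1), harder records are skipped (weight 0).
    Raising `lam` during training lets the model see easy records first
    and all records eventually."""
    return (losses <= lam).astype(float)

def self_paced_loss(losses, lam):
    """Weighted average of per-record losses under the current threshold."""
    v = self_paced_weights(losses, lam)
    if v.sum() == 0:
        return 0.0
    return float((v * losses).sum() / v.sum())

# Toy per-record denoising score matching losses: the hardest record
# (loss 2.0) is excluded until the threshold grows past it.
losses = np.array([0.1, 0.5, 2.0, 0.3])
print(self_paced_loss(losses, lam=0.6))  # averages only the three easy records
```

In practice the threshold schedule (and whether the weights are hard or soft) is a design choice; the point of the sketch is only that the objective is reweighted per record based on its current loss.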

2.1. SCORE-BASED GENERATIVE MODELS

Score-based generative models (SGMs) use a diffusion process defined by the following Itô stochastic differential equation (SDE):

dx = f(x, t)dt + g(t)dw,

where f(x, t) = f(t)x, f and g are the drift and diffusion coefficients of x(t), and w is the standard Wiener process. Depending on the types of f and g, SGMs can be divided into variance exploding (VE), variance preserving (VP), and sub-variance preserving (sub-VP) models (Song et al., 2021). The definitions of f and g are in Appendix A. The reverse of the diffusion process is a denoising process given by the following reverse-time SDE:

dx = [f(x, t) - g²(t)∇_x log p_t(x)]dt + g(t)dw̄,

where w̄ is a reverse-time Wiener process; solving this reverse SDE generates samples.
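As a sketch, the forward (noising) process can be simulated numerically with Euler–Maruyama. The snippet below assumes the standard VP SDE from Song et al. (2021), dx = -½β(t)x dt + √β(t) dw, with a linear schedule β(t) = β_min + t(β_max - β_min); the function name and hyperparameter defaults are illustrative:

```python
import numpy as np

def vp_sde_forward(x0, n_steps=1000, beta_min=0.1, beta_max=20.0, seed=0):
    """Euler-Maruyama simulation of the VP SDE
    dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw on t in [0, 1]."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        beta_t = beta_min + t * (beta_max - beta_min)  # linear noise schedule
        drift = -0.5 * beta_t * x                      # f(x, t) = f(t) x
        diffusion = np.sqrt(beta_t)                    # g(t)
        x = x + drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# By t = 1 the data is destroyed: the sample is close to pure Gaussian noise,
# essentially independent of x0.
x_T = vp_sde_forward(np.ones(5))
```

Sampling reverses this loop using the reverse-time SDE, with the learned score network standing in for ∇_x log p_t(x).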



Figure 1: Distributions of denoising score matching loss in Shoppers

Table 1: Summary of experimental results. We report the average sampling quality, diversity, and time.

STaSy additionally uses our proposed self-paced learning and fine-tuning methods. Figure 1 shows the uneven and long-tailed loss distribution of Naïve-STaSy at the end of its training process. The figure implies that training Naïve-STaSy with the denoising score matching loss failed to learn the score values of some records, which may leave the model (partially) underfitted to the training data. In contrast, STaSy, trained with our two proposed methods, yields many loss values close to zero.

The score function ∇_x log p_t(x) is approximated by a time-dependent score-based model S_θ(x, t), called the score network. In general, following the diffusion process in Equation 1, we can derive x(t) at time t ∈ [0, T], where x(0) and x(T) denote a real and a noisy sample, respectively. The transition probability p(x(t)|x(0)) at time t is easily derived from this process, and it always follows a Gaussian distribution. This allows us to collect the gradient of the log transition probability, ∇_{x(t)} log p(x(t)|x(0)), during the diffusion process. Therefore, we can train a score network S_θ(x, t) to match these gradients, i.e., by denoising score matching.
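To make the Gaussian transition concrete, the snippet below computes the closed-form VP-SDE perturbation kernel p(x(t)|x(0)) = N(e^{-½∫β} x(0), (1 - e^{-∫β}) I) and the exact conditional score that serves as the regression target for the score network. The linear β schedule and the function name are assumptions for illustration:

```python
import numpy as np

def vp_perturb(x0, t, beta_min=0.1, beta_max=20.0, rng=None):
    """Sample x(t) ~ p(x(t)|x(0)) for the VP SDE in closed form and return
    the exact conditional score grad_{x(t)} log p(x(t)|x(0)),
    which a score network S_theta(x(t), t) is trained to match."""
    rng = rng or np.random.default_rng(0)
    # Integral of beta(s) on [0, t] for beta(s) = beta_min + s * (beta_max - beta_min).
    int_beta = beta_min * t + 0.5 * t**2 * (beta_max - beta_min)
    mean = np.exp(-0.5 * int_beta) * x0          # mean of the Gaussian kernel
    std = np.sqrt(1.0 - np.exp(-int_beta))       # std of the Gaussian kernel
    eps = rng.standard_normal(np.shape(x0))
    x_t = mean + std * eps                       # perturbed sample x(t)
    score = -(x_t - mean) / std**2               # gradient of log N(x_t; mean, std^2 I)
    return x_t, score

x_t, score = vp_perturb(np.zeros(4), t=0.5)
```

Because the kernel is Gaussian, the conditional score reduces to -(x(t) - mean)/std², so no density estimation is needed to obtain the training target.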

