STASY: SCORE-BASED TABULAR DATA SYNTHESIS

Abstract

Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, they have not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy based on the paradigm of score-based generative modeling. Although score-based generative models have resolved many issues in generative modeling, there remains room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increase the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependent evaluations and diversity.

Table 1: Summary of experimental results. We report the average sampling quality, diversity, and time.



1. INTRODUCTION

Tabular data synthesis is of non-trivial importance in real-world applications for various reasons: protecting the privacy of original tabular data by releasing fake tabular data (Park et al., 2018; Lee et al., 2021), augmenting the original tabular data with fake data to better train machine learning models (Chawla et al., 2002; Han et al., 2005; He et al., 2008; Kim et al., 2022), and so on. However, it is well-known that tabular data frequently has such peculiar characteristics that deep generative models are not able to synthesize all possible details of the original tabular data (Park et al., 2018; Xu et al., 2019): given a set of columns in tabular data, columns typically follow unpredictable (multi-modal) distributions, and therefore it is hard to model their joint probability. A couple of recent methods, however, showed remarkable successes (with some failure cases) in synthesizing fake tabular data, such as CTGAN (Xu et al., 2019), TVAE (Xu et al., 2019), IT-GAN (Lee et al., 2021), and OCT-GAN (Kim et al., 2021). In addition, a recent generative paradigm, score-based generative models (SGMs), successfully resolves two parts of the generative learning trilemma (Xiao et al., 2021), i.e., SGMs provide high sampling quality and diversity, although their training/sampling time is relatively longer than that of other deep generative models. In this paper, we adopt the score-based generative modeling paradigm and design a Score-based Tabular data Synthesis (STaSy) method. Our model designs significantly outperform all existing baselines in terms of the sampling quality and diversity (cf. Naïve-STaSy and STaSy in Table 1): Naïve-STaSy is a naive conversion of SGMs to tabular data, and STaSy additionally uses our proposed self-paced learning and fine-tuning methods. Figure 1 shows the uneven and long-tailed loss distribution of Naïve-STaSy at the end of its training process.
The figure implies that training Naïve-STaSy with the denoising score matching loss failed to learn the score values of some records, which may leave the model (partially) underfitted to the training data. In contrast, STaSy with our two proposed training methods yields many loss values near the left corner (i.e., close to 0).

Figure 1: Loss distributions (x-axis: loss; y-axis: frequency) of Naïve-STaSy and STaSy.

In order to alleviate the training difficulty of Naïve-STaSy, we design i) a self-paced learning method and ii) a fine-tuning approach. Our proposed self-paced learning technique trains our model from easy to hard records, based on their loss values, by modifying the objective function. The technique lets the model learn records selectively at first and the entire dataset eventually, so the model can be better trained. In addition, our proposed fine-tuning method, which modestly adjusts the model parameters, further improves the sampling quality and diversity. In Table 1, we summarize our experimental results, comparing STaSy with other existing tabular data synthesis methods in terms of the sampling quality, diversity, and time. As shown, our basic model even without the proposed self-paced learning and fine-tuning, denoted Naïve-STaSy, significantly outperforms all baselines except for runtime. In summary, our contributions are as follows: i) We design a score-based generative model for tabular data synthesis. ii) We alleviate the training difficulty of the denoising score matching loss by designing a self-paced learning strategy, and we further enhance the sampling quality and diversity with a proposed fine-tuning method. STaSy thus clearly balances the generative learning trilemma: sampling quality, diversity, and time. iii) Our proposed method outperforms other deep learning methods in all cases by large margins, which we consider a significant advance in the field of tabular data synthesis. iv) We evaluate various methods in terms of the generative learning trilemma in a rigorous manner.

2. RELATED WORK
2.1. SCORE-BASED GENERATIVE MODELS

Score-based generative models (SGMs) use a diffusion process defined by the following Itô stochastic differential equation (SDE):

dx = f(x, t)dt + g(t)dw,  (1)

where f(x, t) = f(t)x, f and g are the drift and diffusion coefficients of x(t), and w is the standard Wiener process. Depending on the types of f and g, SGMs can be divided into variance exploding (VE), variance preserving (VP), and sub-variance preserving (sub-VP) models (Song et al., 2021). The definitions of f and g are in Appendix A. The reverse of the diffusion process is a denoising process:

dx = [f(x, t) − g²(t)∇_x log p_t(x)]dt + g(t)dw̄,  (2)

where this reverse SDE is the process that generates samples. The score function ∇_x log p_t(x) is approximated by a time-dependent score-based model S_θ(x, t), called the score network. In general, following the diffusion process in Equation 1, we can derive x(t) at time t ∈ [0, T], where x(0) and x(T) denote a real and a noisy sample, respectively. The transition probability p(x(t)|x(0)) at time t is easily approximated by this process, and it always follows a Gaussian distribution. This allows us to collect the gradient of the log transition probability, ∇_{x(t)} log p(x(t)|x(0)), during the diffusion process. Therefore, we can train a score network S_θ(x, t) as follows:

argmin_θ E_t E_{x(t)} E_{x(0)} [λ(t)∥S_θ(x(t), t) − ∇_{x(t)} log p(x(t)|x(0))∥₂²],  (3)

where λ(t) controls the trade-off between the sampling quality and the likelihood. This is called denoising score matching, and the θ* solving Equation 3 can accurately solve the reverse SDE in Equation 2 (Vincent, 2011). After the training process, we can synthesize fake data records with i) the predictor-corrector framework or ii) the probability flow method, a deterministic method based on an ordinary differential equation (ODE) whose marginal distribution equals that of Equation 1 (Song et al., 2021). In particular, the latter enables fast sampling and exact log-probability computation.
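To make Equation 3 concrete, the following NumPy sketch perturbs data with a VP-style Gaussian transition kernel and evaluates a Monte-Carlo denoising score matching loss. The function names (`perturb_vp`, `dsm_loss`) and the choice λ(t) = std(t)² are ours; the linear γ(t) schedule follows Appendix A. This is an illustrative sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def perturb_vp(x0, t, gamma_min=0.1, gamma_max=5.0, rng=None):
    """Sample x(t) ~ p(x(t)|x(0)) for a VP SDE; the kernel is Gaussian:
    N(x(0) * exp(-0.5 * G(t)), (1 - exp(-G(t))) I), with G(t) = int_0^t gamma(s) ds."""
    rng = rng or np.random.default_rng(0)
    big_g = gamma_min * t + 0.5 * (gamma_max - gamma_min) * t ** 2
    mean = np.exp(-0.5 * big_g) * x0
    std = np.sqrt(1.0 - np.exp(-big_g))
    xt = mean + std * rng.standard_normal(x0.shape)
    # gradient of the log transition probability: the regression target in Eq. 3
    target = -(xt - mean) / std ** 2
    return xt, target, std

def dsm_loss(score_net, x0, t):
    """Monte-Carlo estimate of the Eq. 3 objective with lambda(t) = std(t)^2."""
    xt, target, std = perturb_vp(x0, t)
    residual = score_net(xt, t) - target
    return float(np.mean(std ** 2 * np.sum(residual ** 2, axis=-1)))
```

A trivial zero score network gives a strictly positive loss, while a network matching the target would drive it to zero.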

2.2. TABULAR DATA SYNTHESIS

Many distinct methods exist for tabular data synthesis, creating realistic synthetic tables for various data types. For example, recursive table modeling using a Gaussian copula is employed to synthesize continuous variables (Patki et al., 2016). Discrete variables can be generated by Bayesian networks (Zhang et al., 2017; Aviñó et al., 2018) and decision trees (Reiter, 2005). Several data synthesis methods based on GANs have been presented to generate tabular data in recent years. RGAN (Esteban et al., 2017) creates continuous time-series healthcare records, whereas MedGAN (Choi et al., 2017) and corrGAN (Patel et al., 2018) generate discrete records. EhrGAN (Che et al., 2017) utilizes semi-supervised learning to generate plausible labeled records to supplement limited training data. PATE-GAN (Jordon et al., 2019) generates synthetic data without jeopardizing the privacy of real data. TableGAN (Park et al., 2018) employs convolutional neural networks to enhance tabular data synthesis and maximize label column prediction accuracy. CTGAN and TVAE (Xu et al., 2019) adopt column-type-specific preprocessing steps to deal with multi-modality in the original dataset distribution. OCT-GAN (Kim et al., 2021) is a generative model design based on neural ODEs. SOS (Kim et al., 2022) proposed a style-transfer-based oversampling method for imbalanced tabular data using SGMs, whose main strategy is converting a major sample into a minor sample. Since its task is not compatible with our task of generating tabular data from scratch, direct comparisons are not possible. However, we convert our method into an oversampling method following their design guidance and compare it with SOS in Appendix B.

2.3. SELF-PACED LEARNING

Self-paced learning (SPL) is a training strategy related to curriculum learning that selects training records in a meaningful order, inspired by the learning process of humans (Kumar et al., 2010b; Jiang et al., 2014). It refers to training a model only with a subset of data that has low training losses and gradually expanding to the entire training data. We denote the training set as D = {x_i}_{i=1}^{N}, where x_i is the i-th record. The model M with parameters θ has a loss l_i = L(M(x_i, θ)), where L is the loss function. A vector v = [v_i]_{i=1}^{N}, v_i ∈ {0, 1}, indicates whether x_i is easy or not. SPL aims to learn the model parameters θ and the selection importance v by minimizing:

min_{θ,v} E(θ, v) = Σ_{i=1}^{N} v_i L(M(x_i, θ)) − (1/K) Σ_{i=1}^{N} v_i,  (4)

where K is a parameter to control the learning pace. In general, the second term in Equation 4, called a self-paced regularizer, can be customized for a downstream task. The alternative convex search (ACS) (Bazaraa et al., 1993) is typically used to solve Equation 4 (Kumar et al., 2010a; Tang et al., 2012): by alternately optimizing one set of variables while fixing the others, i.e., updating v after fixing θ and vice versa. With fixed θ, the global optimum v* = [v*_i]_{i=1}^{N} is:

v*_i = 1 if l_i < 1/K, and 0 if l_i ≥ 1/K.  (5)

When updating v with fixed θ, a record x_i with l_i < 1/K is regarded as easy and chosen for training; only easy records are used to train the model. Otherwise, x_i is regarded as a hard record and is not selected. To involve more records in the training process, K is gradually decreased.
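The ACS loop above can be sketched on a toy 1-D mean-estimation "model" whose per-record loss is the squared error; `spl_select` implements the closed form in Equation 5, while the decay schedule for K and the function names are ours for illustration.

```python
import numpy as np

def spl_select(losses, K):
    """Binary SPL selection (Eq. 5): v_i = 1 iff l_i < 1/K."""
    return (losses < 1.0 / K).astype(float)

def spl_train(x, K0=2.0, decay=0.5, steps=5):
    """ACS sketch: alternate the v-step (theta fixed) and theta-step (v fixed)."""
    theta, K = 0.0, K0
    for _ in range(steps):
        losses = (x - theta) ** 2
        v = spl_select(losses, K)            # v-step: pick easy records
        if v.sum() > 0:
            theta = (v * x).sum() / v.sum()  # theta-step: weighted mean
        K *= decay                           # decrease K to admit more records
    return theta
```

On data with an outlier, the easy records dominate early training, so the estimate stays near the inlier mean.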

3. PROPOSED METHOD

STaSy is an SGM-based method for tabular data synthesis. STaSy uses SPL to ensure its training stability. The proposed fine-tuning method takes advantage of a favorable property of SGMs: the log-probability of a record can be measured exactly.

3.1. SCORE NETWORK ARCHITECTURE & MISCELLANEOUS DESIGNS

It is known that each column in tabular data typically has a complicated distribution, whereas pixel values in image datasets typically follow Gaussian distributions (Xu et al., 2019). Moreover, tabular synthesis models should learn the joint probability of multiple columns to generate a record, which is one main reason why tabular data synthesis is difficult. However, one good design point is that the dimensionality of tabular data is typically far less than that of image data, e.g., 784 pixels even in MNIST, one of the simplest image datasets, vs. 30 columns in Credit. We found through our preliminary experiments that the SDE in Equations 1 and 2 can model the joint probability well if and only if its score network, which approximates ∇_x log p_t(x), is well trained. We carefully design our score network for tabular data synthesis considering these points. Our proposed score network architecture is in Appendix C; the network consists of residual blocks of FC layers. Since SGMs were theoretically designed from the idea of perturbing data with an infinite number of noise scales, they typically require large-scale computation as an approximation of the infinite number, e.g., T = 1,000 for images in (Song et al., 2021). With a large number of steps, the denoising process takes a long time to complete, which is one part of the generative learning trilemma. However, we found that T = 50 steps in Equation 1 are enough to train a network to approximate the gradient of the log-likelihood, which means that our STaSy naturally has a shorter sampling time than SGMs for images with T = 1,000 steps.

Pre/post-processing of tabular data. To handle mixed types of data, which is a challenge in tabular data generation, we pre/post-process columns. We use the min-max scaler to pre-process numerical columns, and its inverse scaler for post-processing after generation. We also apply one-hot encoding to pre-process categorical columns, and use the softmax function, followed by the rounding function, when generating.

How to generate. After sampling a noisy vector z ∼ N(μ, σ²I), the reverse SDE can convert z into a fake record. The prior distribution varies depending on the type of SDE: N(0, σ²_max I) for VE, and N(0, I) for VP and sub-VP, where σ_max is a hyperparameter. In particular, we adopt the probability flow method to solve the reverse SDE, which will shortly be described in Equation 10.
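A minimal NumPy sketch of this pre/post-processing, assuming the numerical columns are non-constant so min-max scaling is well defined; the function names and array layout are ours for illustration.

```python
import numpy as np

def preprocess(num_cols, cat_cols):
    """Min-max scale numerical columns; one-hot encode categorical ones.
    Assumes each numerical column has max > min."""
    mins, maxs = num_cols.min(axis=0), num_cols.max(axis=0)
    scaled = (num_cols - mins) / (maxs - mins)
    categories = [np.unique(c) for c in cat_cols.T]
    onehots = [(c[:, None] == cats[None, :]).astype(float)
               for c, cats in zip(cat_cols.T, categories)]
    return np.concatenate([scaled] + onehots, axis=1), (mins, maxs, categories)

def postprocess(fake, stats, n_num):
    """Invert the min-max scaler; apply softmax + argmax ('rounding') per
    categorical block to recover category labels."""
    mins, maxs, categories = stats
    num = fake[:, :n_num] * (maxs - mins) + mins
    out_cats, start = [], n_num
    for cats in categories:
        block = fake[:, start:start + len(cats)]
        probs = np.exp(block - block.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)    # softmax over the block
        out_cats.append(cats[probs.argmax(axis=1)])  # rounding to a category
        start += len(cats)
    return num, np.stack(out_cats, axis=1)
```

A round trip through `preprocess` and `postprocess` recovers the original table exactly when the encoding is unambiguous.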

3.2. SELF-PACED LEARNING APPROACH

In order to alleviate the training difficulty, we apply a curriculum learning technique, more specifically self-paced learning, to STaSy. Instead of letting v_i ∈ {0, 1}, we use a "soft" record sampling method, i.e., v_i ∈ [0, 1]. If l_i, the denoising score matching objective on the i-th record, is less than a threshold, we set v_i to 1 to ensure that the record is fully involved in training. At the end of the training, v_i must be set to 1 for all i to train the model with the entire data. The denoising score matching loss for the i-th training record x_i is defined as follows:

l_i = E_t E_{x_i(t)} [λ(t)∥S_θ(x_i(t), t) − ∇_{x_i(t)} log p(x_i(t)|x_i(0))∥₂²].  (6)

Then, we have the following STaSy objective:

min_{θ,v} Σ_{i=1}^{N} v_i l_i + r(v; α, β),  (7)

where 0 ≤ v_i ≤ 1 for all i, and r(v; α, β) is the self-paced regularizer defined in Theorem 1 below.

Algorithm 1: Training algorithm of STaSy
1: Initialize θ
2: for each SPL training epoch do
3:    Update v after fixing θ with Equation 9
4:    Update θ after fixing v with Equation 7
5:    Update α and β with the control method in Appendix D
/* Fine-tune the trained model using log-probability */
6: τ_i ← log p(x_i)
7: F ← {x_i | log p(x_i), where x_i ∈ D, is smaller than the average (or median) log-probability}
8: for each fine-tune epoch do
9:    for each x_i ∈ F do
10:      Update θ with Equation 6
11:   F ← {x_i | log p(x_i) < τ_i}
12: return θ

Theorem 1. Let the self-paced regularizer r(v; α, β) be defined as follows:

r(v; α, β) = −((Q(α) − Q(β))/2) Σ_{i=1}^{N} v_i² − Q(β) Σ_{i=1}^{N} v_i,  (8)

where Q(·) denotes a quantile of the loss values. The closed-form optimal solution for v* = [v*_1, v*_2, ..., v*_N], given fixed θ, is defined as follows; its proof is in Appendix E:

v*_i = 1 if l_i ≤ Q(α); 0 if l_i ≥ Q(β); (l_i − Q(β))/(Q(α) − Q(β)) otherwise.  (9)

Specifically, records with l_i ≤ Q(α) are considered easy records and will be selected for training, whereas records with l_i ≥ Q(β) are considered complicated (or potentially noisy) and will not be selected. In the remaining cases, records are partially selected during training, i.e., v_i ∈ [0, 1].
α and β are gradually increased to 1 from their initial values α_0 and β_0, proportionally to training progress, to ensure that all data records are eventually involved in training. As α and β increase, difficult records are gradually involved in training, and the model also becomes more robust to those difficult cases. We set α_0 and β_0 such that more than 80% of the training records are included in the learning process from the beginning.
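The closed form in Equation 9 is straightforward to implement; below is a hedged NumPy sketch where `soft_spl_weights` is our name, Q(·) is taken as the empirical loss quantile, and the fallback for degenerate quantiles is our addition.

```python
import numpy as np

def soft_spl_weights(losses, alpha, beta):
    """Closed-form v* from Theorem 1 with Q(.) as the empirical quantile."""
    q_a, q_b = np.quantile(losses, [alpha, beta])
    if q_a == q_b:                       # degenerate case: hard selection fallback
        return (losses <= q_a).astype(float)
    v = (losses - q_b) / (q_a - q_b)     # note q_a <= q_b, so v decreases in l_i
    return np.clip(v, 0.0, 1.0)          # 1 below Q(alpha), 0 above Q(beta)
```

Records below the α-quantile get weight 1, those above the β-quantile get 0, and losses in between are weighted linearly.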

3.3. FINE-TUNING APPROACH

For solving the reverse SDE, score-based generative models rely on various numerical approaches. One of the techniques is the probability flow method (Song et al., 2021), which uses a deterministic process whose marginal probability is the same as that of the SDE. With the approximated score function S_θ(·), the probability flow method uses the following neural ordinary differential equation (NODE) based model (Chen et al., 2018):

dx = [f(x, t) − (1/2)g(t)²∇_x log p_t(x)]dt.  (10)

In our experiments, the probability flow shows better quality than other methods for solving the original reverse SDE, and our default solver is the probability flow (see Section 4.3). In addition, NODEs facilitate computing the log-probability of Equation 10 through the instantaneous change of variables theorem. Consequently, we can calculate the exact log-probability efficiently with the unbiased Hutchinson's estimator (Hutchinson, 1989; Grathwohl et al., 2018). Thus, we propose to fine-tune based on the exact log-probability. After learning the model parameter θ as described in Section 3.2, we set the sample-wise threshold τ_i to log p(x_i) (cf. Line 6 of Algorithm 1). We then prepare the fine-tuning candidate set F (cf. Line 7 of Algorithm 1). After fine-tuning on the samples in F, we update the candidate set (cf. Line 11 of Algorithm 1). Our goal is to achieve a better log-probability than the initial τ_i before the fine-tuning process.
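Hutchinson's estimator itself can be sketched independently of the score network. In the sketch below, `jac_vec_prod` stands in for the Jacobian-vector product one would normally obtain via automatic differentiation; the function name and Rademacher probe choice are ours.

```python
import numpy as np

def hutchinson_trace(jac_vec_prod, dim, n_probes=64, rng=None):
    """Unbiased trace estimate: E[eps^T (J eps)] = tr(J) for Rademacher eps."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probes):
        eps = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        total += float(eps @ jac_vec_prod(eps))
    return total / n_probes
```

For a diagonal Jacobian, every Rademacher probe yields the trace exactly, so the estimate has zero variance in that case.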

3.4. TRAINING ALGORITHM

Algorithm 1 shows the overall training process of our STaSy. First, we initialize the parameters θ of the score network. Utilizing ACS, we then train STaSy with the SPL training strategy, iteratively optimizing θ and v. We can obtain an optimal θ by optimizing Equation 7 with fixed v, and the global optimum v is calculated by Equation 9. We also update α and β in proportion to the training progress. After finishing the main SPL training step, we can generate fake records from the model. To further improve the trained score network, we retrain our model for every record x_i whose log-probability is less than its threshold τ_i. At Line 10 of Algorithm 1, one can use the log-probability instead of Equation 6 as a fine-tuning objective. However, we found that Equation 6 is more effective (see Appendix F).

4. EXPERIMENTS

We analyze methods in terms of the generative learning trilemma. We list only aggregated results over datasets in the main paper; details are in Appendix G. We repeat all experiments 5 times.

4.1. EXPERIMENTAL ENVIRONMENTS

A brief description of our experimental environments is as follows: i) We use 15 real-world tabular datasets for classification and regression and 7 baseline methods. Regarding our models, Naïve-STaSy is a naive conversion of SGMs without our proposed training strategies, and STaSy is trained with the self-paced learning and fine-tuning methods. ii) In general, we follow the "train on synthetic, test on real" (TSTR) framework (Esteban et al., 2017; Jordon et al., 2019), a widely used evaluation method for tabular data (Xu et al., 2019; Kim et al., 2021; Lee et al., 2021), to evaluate the sampling quality. In other words, we train various models, including DecisionTree, AdaBoost, Logistic/Linear Regression, MLP classifier/regressor, RandomForest, and XGBoost, with fake data, validate them with the original training data, and test them with the real test data. For Identity, we train with real training data, choose the best performing model using cross-validation, and test with real test data, whose score serves as a criterion to evaluate the sampling quality of the various generative methods. iii) We use various metrics to evaluate various aspects. For the sampling quality, we mainly use the average F1 for classification, and also report AUROC and Weighted-F1. We use R² and RMSE for regression. For the sampling diversity, we use coverage (Naeem et al., 2020), which was proposed to measure the diversity of generated records. Full results are in Appendix G. Detailed environments and hyperparameter settings are in Appendix H and I, respectively.

4.2. EXPERIMENTAL RESULTS
4.2.1. SAMPLING QUALITY

Table 2 summarizes the key results on the sampling quality. We use task-oriented metrics, such as F1, R², and so on, under the TSTR evaluation framework. MedGAN and VEEGAN, two early GAN-based methods, show relatively lower test scores than the other GAN-based methods, i.e., CTGAN, TableGAN, and OCT-GAN. In general, CTGAN and OCT-GAN show reliable quality among the baselines. However, our two score-based models always mark the best and the second best quality: Naïve-STaSy and STaSy significantly outperform all the baselines by large margins. In particular, our methods perform well on small datasets, e.g., Crowdsource, Obesity, and Robot, where other methods show poor quality.

As flow-based generative models and SGMs with the probability flow method can calculate the exact log-probability of records, we present the log-probability as another metric for the sampling quality. Table 3 shows the median of the log-probabilities of testing records, averaged over all datasets. Since the log-probability is not bounded, we take the median to handle outliers. Our methods, even without the fine-tuning, show a much better log-probability than RNODE, which optimizes the log-probability as its objective. Moreover, the median log-probability of testing records improves further after the proposed fine-tuning. Putting it all together, our proposed score-based generative models, i.e., Naïve-STaSy and STaSy, show reasonable performance in all cases regarding the machine learning efficacy and log-probability. Furthermore, as shown in Tables 2 and 3, the sampling quality always improves with the proposed training strategies, i.e., the self-paced learning and the fine-tuning, which justifies their efficacy. For the quantitative evaluation of the sampling diversity between existing methods and our proposed method, we use the coverage score (Naeem et al., 2020), which is bounded between 0 and 1.
Coverage is the ratio of real records that have at least one fake record within their manifold, where the manifold is a sphere around a real record with radius r, the distance between the record and its k-th nearest neighbor. Some baselines show relatively inferior coverage to others; specifically, in Robot, STaSy shows a coverage of 0.94, while the three top-performing baselines, CTGAN, TableGAN, and OCT-GAN, show coverage scores less than 0.26 in Table 20 of Appendix G.2. Figure 2 also presents the diversity of the fake data by each method qualitatively, which reflects the coverage results. In general, STaSy shows stable performance across the sampling quality and the sampling diversity, outperforming others by large margins.

4.2.2. SAMPLING DIVERSITY

In Figure 3 (Left and Middle), the fake data by STaSy shows an almost identical distribution to that of the real data. In contrast, OCT-GAN, which was proposed to address the multi-modality issue of tabular data, fails to do so. This means STaSy is able to capture every mode in the columns, while OCT-GAN is not. In Figure 3 (Right), CTGAN generates some out-of-distribution records, highlighted in red.

We summarize runtime in Table 5. To compare the runtime of all methods, we measure the wall-clock time taken to sample N records, where N is the training set size, 5 times, and average them. In general, simple GAN-based methods, especially TableGAN and TVAE, show faster runtime. On the other hand, SGMs, OCT-GAN, and RNODE take a relatively long time for sampling. Our proposed methods, Naïve-STaSy and STaSy, take a long sampling time compared to simple GAN-based methods but are faster than OCT-GAN and RNODE, which means a well-balanced trade-off between the sampling quality, diversity, and time.

We summarize our ablation study in Table 6, showing the effectiveness of SPL and fine-tuning. In particular, SPL improves the sampling diversity, as shown in Figure 4: Naïve-STaSy suffers from mild mode collapses, highlighted in red.

4.3. ABLATION & SENSITIVITY STUDIES

Table 7 shows sensitivity analyses w.r.t. some important hyperparameters. In general, all settings show reasonable results, outperforming the baselines. We recommend 0.2 or 0.25 for α_0 and 0.9 or 0.95 for β_0. We can adopt a variety of methods to solve the reverse SDE in Equation 2. Our method can generate fake records with the predictor-corrector framework (Pred.-Corr.) or the probability flow (PF) method (Song et al., 2021). The former uses the ancestral sampling (AS), reverse diffusion (RD), or Euler-Maruyama (EM) method for solving the reverse SDE, and the Langevin corrector for the correction process. In Table 8, the probability flow method in Equation 10 mostly leads to successful results, and other datasets also show similar results.
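As an illustration of the default solver, the probability flow ODE in Equation 10 can be integrated with a simple Euler scheme from t = 1 to t = 0. The sketch below uses toy closures for f, g, and the score; `probability_flow_sample` is our name, and T = 50 mirrors the step count used for STaSy, not the paper's actual solver.

```python
import numpy as np

def probability_flow_sample(score_fn, f, g, z, T=50):
    """Euler solve of dx = [f(x,t) - 0.5 * g(t)^2 * score(x,t)] dt, t: 1 -> 0."""
    x = z.copy()
    ts = np.linspace(1.0, 0.0, T + 1)
    for i in range(T):
        t, dt = ts[i], ts[i + 1] - ts[i]   # dt < 0: reverse time
        drift = f(x, t) - 0.5 * g(t) ** 2 * score_fn(x, t)
        x = x + drift * dt
    return x
```

A quick sanity check: for a VP SDE with data distributed as N(0, I), the marginal stays N(0, I) at every t, the true score is −x, and the probability-flow drift cancels exactly, so the sampler returns its input unchanged.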

5. CONCLUSIONS AND DISCUSSIONS

Synthesizing tabular data is an important yet non-trivial task, as it requires modeling the joint probability of multi-modal columns. To this end, we presented our detailed designs and experimental results with thorough analyses. Our proposed method, STaSy, is a score-based model equipped with our proposed self-paced learning and fine-tuning methods. In our experiments with 15 benchmark datasets and 7 baselines, STaSy outperforms other deep learning methods in terms of the sampling quality and diversity (with an acceptable sampling time). Based on these considerations, we believe that STaSy shows significant advancements in tabular data synthesis. We expect much follow-up work in utilizing SGMs for tabular data synthesis. Limitations. Although our model shows the best balance for the deep generative task trilemma, we think that there exists room to further improve runtime: existing simple GAN-based methods are faster than our method at sampling fake records. In addition, SGMs are known to be sometimes unstable for high-dimensional data, e.g., high-resolution images, but in general stable for low-dimensional data, e.g., tabular data. Therefore, we think that SGMs have much potential for tabular data synthesis in the future.

6. ETHICS STATEMENT

Indeed, people do not always use artificial intelligence technology for righteous purposes. One could use our method for wrongful goals, e.g., selling high-quality fake data generated by our method or retrieving private original records from synthetic data. However, we believe that our research offers far more benefits. One can use our method to generate and share fake data (while withholding the original data) to prevent potential privacy leakages. We, of course, need more studies to achieve the privacy protection goal based on our model. Nevertheless, a research trend exists in which researchers use deep generative models to protect privacy (Park et al., 2018; Lee et al., 2021).

7. REPRODUCIBILITY STATEMENT

To reproduce the experimental results, we have made the following efforts: 1) Source codes used in the experiments are available in the supplementary material. By following the README guidance, the main results are easily reproducible. 2) All the experiments are repeated five times, and their mean and standard deviation values are reported in Appendix. 3) We provide extensive experimental details in Appendix H.

A VE, VP, AND SUB-VP SDES

We introduce the definitions of f and g as follows:

f(x, t) = 0 if VE; −(1/2)γ(t)x if VP; −(1/2)γ(t)x if sub-VP,

g(t) = √(d[σ²(t)]/dt) if VE; √(γ(t)) if VP; √(γ(t)(1 − e^{−2∫₀ᵗ γ(s)ds})) if sub-VP,

where σ(t) and γ(t) are noise functions w.r.t. time t. σ(t) = σ_min(σ_max/σ_min)^t for t ∈ [0, 1], where σ_min and σ_max are hyperparameters, and we use σ_min ∈ {0.01, 0.1} and σ_max ∈ {5.0, 10.0}. γ(t) = γ_min + t(γ_max − γ_min) for t ∈ [0, 1], where γ_min and γ_max are hyperparameters, and we use γ_min ∈ {0.01, 0.1} and γ_max ∈ {5.0, 10.0}.

We also discuss the difference between SOS and STaSy; the corresponding experiments are in Appendix B. They are both based on SGMs, but they are optimized toward different goals with different objective functions and training strategies. SOS has many design points specialized to augment minor classes only (rather than synthesizing entire tabular data); for instance, SOS adopts a style-transfer-based idea to convert a major-class sample into a minor one via its own SGM without any consideration of the training difficulty of the denoising score matching. In contrast, our STaSy focuses on synthesizing entire tabular data by proposing the self-paced training and fine-tuning methods.

B COMPARISON BETWEEN SOS AND STASY FOR THE OVERSAMPLING TASK

We use five imbalanced datasets for oversampling, as many datasets in our main experiments are not imbalanced. We conduct the oversampling experiments with STaSy to compare with SOS (Kim et al., 2022). In this experiment, STaSy is converted into an oversampling method following the design guidance of SOS, i.e., each minority class has its own score network and is separately trained. For a fair comparison, we train STaSy w/o fine-tuning for each minor class and generate minority samples until each class has the same size as the majority class. We compare the two models in terms of the sampling quality using Weighted-F1, which is specialized for evaluating imbalanced data. We note that Identity means that we do not use any oversampling method, which is therefore a minimum quality requirement for oversampling. As shown in Table 9, STaSy w/o fine-tuning outperforms SOS. The result shows that our proposed training strategy, i.e., the self-paced learning, improves the model training regardless of the task.

C NETWORK ARCHITECTURE

We propose the following score network S_θ(x(t), t):

h_0 = x(t),
h_i = ω(H_i(h_{i−1}, t) ⊕ h_{i−1}), 1 ≤ i ≤ d_N,
S_θ(x(t), t) = FC(h_{d_N}),

where x(t) is a record (or row) of tabular data at time t, h_i is the i-th hidden vector, ω is an activation function, and d_N is the number of hidden layers. For the layer type of H_i(h_{i−1}, t), we provide the following options:

H_i(h_{i−1}, t) = FC_i(h_{i−1}) ⊙ ψ(FC_i^t(t)) if Squash; FC_i(t ⊕ h_{i−1}) if Concat; FC_i(h_{i−1}) ⊙ ψ(FC_i^gate(t)) + FC_i^bias(t) if Concatsquash,

where we choose one of the three layer types as a hyperparameter, ⊙ denotes element-wise multiplication, ⊕ denotes the concatenation operator, ψ is the sigmoid function, and FC is a fully connected layer. We modify the architecture of (Song et al., 2021) by using these layer types, inspired by (Grathwohl et al., 2018).
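A shape-level NumPy sketch of the Concat variant follows; the parameter layout, ω = tanh, and the function names are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fc(x, w, b):
    """A fully connected layer."""
    return x @ w + b

def score_network(x_t, t, params, omega=np.tanh):
    """Forward pass with 'Concat' blocks: h_i = omega(FC_i(t ++ h_{i-1}) ++ h_{i-1})."""
    h = x_t
    for w, b in params["blocks"]:
        t_col = np.full((h.shape[0], 1), t)
        block_out = fc(np.concatenate([t_col, h], axis=1), w, b)  # H_i(h, t) = FC_i(t ++ h)
        h = omega(np.concatenate([block_out, h], axis=1))         # residual concatenation
    return fc(h, *params["out"])
```

Because each block concatenates its output with its input, the hidden width grows with depth, and the final FC layer projects back to the record dimension.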

D THRESHOLD CONTROLLING MECHANISM

Our threshold controlling mechanism is designed to meet the following three requirements: i) it can control when the entire dataset is used for training, starting from a subset; ii) it should gradually increase the number of used training records while logarithmically decreasing the number of hard records (that are not involved in training); and iii) it should be a monotonically increasing function to guarantee that the training gets more challenging as the training process goes on. The threshold controlling variables α and β, where 0 ≤ α ≤ β ≤ 1, are gradually increased to 1 to involve the entire data records in training. We increase them proportionally to the training step:

α = α_0 + log(1 + (e − 1)c/S)(1 − α_0),  β = β_0 + log(1 + (e − 1)c/S)(1 − β_0),

where e is the base of the natural logarithm, α_0 and β_0 are the initial values of α and β, c is the current training step, and S determines when the entire data records are utilized. We use 10,000 for S. We set β_0 to at least 0.8 to ensure that 80% of the data records are involved in training at the start of the training. We set α = α_0 and β = β_0 at the beginning of the training, since c is 0. We note that a training sample x_i whose loss-value quantile is greater than β is regarded as a hard sample, and its v_i is set to 0. If we set β_0 to 0.8, the top 20% of difficult samples will not be used at the first training step. In this way, one can control the proportion of training samples involved in the training from the beginning.
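The schedule above can be written as a one-line NumPy function; `threshold_schedule` is our name, and the same formula serves both α and β with their respective initial values. Note that it equals the initial value at c = 0 and reaches exactly 1 at c = S, since log(1 + (e − 1)) = 1.

```python
import numpy as np

def threshold_schedule(c, S, v0):
    """alpha/beta control: v0 + log(1 + (e - 1) * c / S) * (1 - v0).
    Monotonically increases from v0 (at c = 0) to 1 (at c = S)."""
    return v0 + np.log1p((np.e - 1.0) * c / S) * (1.0 - v0)
```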

E PROOF OF THEOREM 1

As defined in Section 3.2, the STaSy objective is as follows:

min_{θ,v} Σ_{i=1}^{N} v_i l_i − ((Q(α) − Q(β))/2) Σ_{i=1}^{N} v_i² − Q(β) Σ_{i=1}^{N} v_i,

where l_i is the score matching loss for the i-th training record as in Equation 6. We can rewrite the optimal solution for each training record's v_i with respect to fixed θ in vertex form. Let L(v_i) be the objective with fixed θ, which is a quadratic function of v_i. Then,

L(v_i) = v_i l_i − ((Q(α) − Q(β))/2) v_i² − Q(β) v_i
= −((Q(α) − Q(β))/2) v_i² + (l_i − Q(β)) v_i
= −((Q(α) − Q(β))/2) [v_i² − (2(l_i − Q(β))/(Q(α) − Q(β))) v_i]
= −((Q(α) − Q(β))/2) [v_i² − (2(l_i − Q(β))/(Q(α) − Q(β))) v_i + ((l_i − Q(β))/(Q(α) − Q(β)))² − ((l_i − Q(β))/(Q(α) − Q(β)))²]
= −((Q(α) − Q(β))/2) [v_i − (l_i − Q(β))/(Q(α) − Q(β))]² + ((Q(α) − Q(β))/2) ((l_i − Q(β))/(Q(α) − Q(β)))².  (14)

Because (Q(α) − Q(β))/2 is less than or equal to 0 and ((Q(α) − Q(β))/2)((l_i − Q(β))/(Q(α) − Q(β)))² is a constant, the v_i minimizing Equation 14 is v_i = (l_i − Q(β))/(Q(α) − Q(β)). Considering v_i ∈ [0, 1], we obtain the optimal v_i as follows:

v*_i = 1 if l_i ≤ Q(α); 0 if l_i ≥ Q(β); (l_i − Q(β))/(Q(α) − Q(β)) otherwise.  (15)
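As a quick numerical sanity check of this derivation, one can compare the closed form in Equation 15 against a brute-force grid search over v_i ∈ [0, 1]; the constants Q(α) = 0.1 and Q(β) = 0.8 below are arbitrary choices satisfying Q(α) ≤ Q(β).

```python
import numpy as np

def objective(v, l, q_a, q_b):
    """Per-record objective L(v_i) with theta fixed (the quadratic in Eq. 14)."""
    return v * l - (q_a - q_b) / 2.0 * v ** 2 - q_b * v

def v_star(l, q_a, q_b):
    """Closed-form minimizer from Equation 15, clipped to [0, 1]."""
    if l <= q_a:
        return 1.0
    if l >= q_b:
        return 0.0
    return (l - q_b) / (q_a - q_b)

# Brute-force check: the grid minimizer agrees with the closed form.
grid = np.linspace(0.0, 1.0, 1001)
for l in (0.05, 0.3, 0.9):
    brute = grid[np.argmin(objective(grid, l, 0.1, 0.8))]
    assert abs(brute - v_star(l, 0.1, 0.8)) < 1e-2
```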

F THE HUTCHINSON'S ESTIMATION AS A FINE-TUNING OBJECTIVE

We use the denoising score matching loss in Line 10 of Algorithm 1. In this section, we describe the results of an additional experiment in which the Hutchinson's log-probability estimation is used as the fine-tuning objective. Table 10 summarizes the F1 score and the median of the log-probabilities when we use the denoising score matching loss and the Hutchinson's estimation as the fine-tuning objective. In Contraceptive, there is no clear winner between the two fine-tuning objectives, and similar results are observed in other datasets. However, in some datasets, e.g., Shoppers and Crowdsource, the former shows better F1 scores and better medians of the log-probabilities than the latter by large margins. In addition, when we update θ using the Hutchinson's estimation, in Default, the sampling quality is lower than before fine-tuning. Considering these results, we use the denoising score matching loss, which generalizes better across datasets, as our default fine-tuning objective.

We use the coverage score as a metric for the sampling diversity. Full results for all datasets are in Tables 19, 20, and 21. We measure the coverage score 5 times with different fake records and report their mean and standard deviation. Coverage is bounded between 0 and 1, and higher coverage means more diverse samples. This k-NN-based measurement is expected to achieve a score of 1 when the real and fake records are identical, but in practice, this is not always the case. For any dataset whose coverage score is not 1 when the two sets of records are identical, we choose the hyperparameter k so that the score is at least 0.95. In our experiments, k is 7 for Phishing and 5 for all other datasets. As shown in Tables 19, 20, and 21, in 12 out of 15 datasets, our methods outperform the others by large margins.

Tables 22, 23, and 24 show runtime evaluation results of each method. We measure the wall-clock time taken to sample fake records 5 times, and report their mean and standard deviation.
In almost all datasets, Naïve-STaSy and STaSy show faster runtime than OCT-GAN and RNODE. TableGAN and TVAE take a short sampling time, but considering their inferior sampling quality and diversity, only our proposed model resolves the problems of the generative learning trilemma. 
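For reference, the k-NN-based coverage score used for sampling diversity (Naeem et al., 2020) can be sketched as follows; this brute-force NumPy illustration uses our own function and variable names:

```python
import numpy as np

def coverage(real, fake, k=5):
    """Sketch of the k-NN coverage metric for sampling diversity.

    For each real record, the neighbourhood radius is the distance to its
    k-th nearest real neighbour; the record is "covered" if at least one
    fake record falls inside that radius. Coverage is the covered fraction,
    so identical real and fake sets yield a score of 1.
    """
    # pairwise Euclidean distances: real-to-real and real-to-fake
    rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    # k-th nearest real neighbour (index k skips the point itself at distance 0)
    radii = np.sort(rr, axis=1)[:, k]
    covered = rf.min(axis=1) <= radii
    return covered.mean()
```

Brute-force pairwise distances are fine for the table sizes here; a k-d tree would be preferable for much larger datasets.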

H.1 BASELINES

We utilize a set of baselines that includes various generative models.

• Identity is a case where we do not synthesize but use the original data.
• MedGAN (Choi et al., 2017) is a GAN that incorporates non-adversarial losses to generate discrete medical records.
• VEEGAN (Srivastava et al., 2017) is a GAN for tabular data that avoids mode collapse by adding a reconstructor network.
• CTGAN (Xu et al., 2019) and TVAE (Xu et al., 2019) are a conditional GAN and a VAE for tabular data with mixed types of variables.
• TableGAN (Park et al., 2018) is a GAN for tabular data using convolutional neural networks.
• OCT-GAN (Kim et al., 2021) is a GAN whose generator and discriminator are based on neural ordinary differential equations.
• RNODE (Finlay et al., 2020) is an advanced flow-based model with two regularization terms added to the training objective of FFJORD (Grathwohl et al., 2018).

H.2 DATASETS

In this section, we describe the 15 real-world tabular datasets used in our experiments. We select the datasets with two criteria: 1) how many times a dataset has been cited/used in previous papers, and 2) how many times a dataset has been viewed/downloaded in well-known repositories, such as the UCI Machine Learning Repository and Kaggle. Among them, we choose datasets that can be used for classification and regression tasks, with more than 5 columns and 1,000 rows.

• Credit is a binary classification dataset collected from European cardholders for credit card fraud detection.
• Shuttle (shu) is a multi-class classification dataset for extracting conditions in which automatic landing is preferred over manual control of the spacecraft.
• Beijing (Liang et al., 2015) is a regression dataset about PM2.5 air quality in the city of Beijing.
• News (Fernandes, 2015) is a regression dataset about online news articles to predict the number of shares in social networks.

The statistical information of the datasets used in our experiments is in Table 25. #train, #test, #continuous, #categorical, and #class denote the number of training records, testing records, continuous columns, categorical columns, and classes, respectively. The raw data of the 15 datasets are available online:

• Credit: https://www.kaggle.com/mlg-ulb/creditcardfraud (DbCL 1.0)
• Default: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients (CC BY 4.0)
• HTRU: https://archive.ics.uci.edu/ml/datasets/HTRU2 (CC BY 4.0)
• Magic: https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope (CC BY 4.0)

J ADDITIONAL VISUALIZATIONS

We show several visualizations that are missing in the main paper. In each subsection, we show column-wise histograms and t-SNE visualizations on HTRU, Robot, and News, respectively.

J.1 ADDITIONAL VISUALIZATIONS IN HTRU

As shown in Figure 5 , the fake data distributions of TVAE, TableGAN, and RNODE are dissimilar to the real data distributions, and these baselines fail to sample high-quality fake records. In Figure 6 , CTGAN, TableGAN, OCT-GAN, and RNODE suffer from mode collapses as highlighted in red. In STaSy, however, the mode collapse problem is clearly alleviated, which means that STaSy is effective in enhancing the diversity. 



We use the following implementations: https://github.com/sdv-dev/SDGym (MIT License) for MedGAN, VEEGAN, CTGAN, TVAE, and TableGAN; https://github.com/bigdyl-yonsei/OCTGAN for OCT-GAN; and https://github.com/cfinlay/ffjord-rnode (MIT License) for RNODE.



Figure 1: Distributions of denoising score matching loss in Shoppers

Figure 2: t-SNE visualizations of fake and the original records in Robot.

Figure 3: (Left and Middle) Histograms of values in Roundness and Compactness columns of Bean, respectively. (Right) t-SNE (van der Maaten & Hinton, 2008) visualizations of the fake and original records in Obesity. More visualizations are in Appendix J.

Figure 4: t-SNE visualizations of the fake and original records in Beijing

• Default (Lichman, 2013) is a binary classification dataset describing the information on credit card clients in Taiwan regarding default payments.
• HTRU (Lyon, 2017) is a binary classification dataset that describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.
• Magic (Bock, 2007) is a binary classification dataset that simulates the registration of high-energy gamma particles in the atmospheric telescope.
• Phishing (Mohammad, 2015) is a binary classification dataset used to distinguish between phishing and legitimate web pages.
• Shoppers (C. Okan Sakar, 2019) is a binary classification dataset about online shoppers' purchasing intention.
• Spambase (Hopkins, 1999) is a binary classification dataset that indicates whether an email is spam or non-spam.
• Bean (Koklu & Ozkan, 2020) is a multi-class classification dataset that includes types of beans with their characteristics.
• Contraceptive (Lim, 1997) is a multi-class classification dataset about contraceptive prevalence in Indonesia.
• Crowdsource (Johnson & Iizuka, 2016) is a multi-class classification dataset used to classify satellite images into different land cover classes.
• Obesity (Palechor & de la Hoz Manotas, 2019) is a multi-class classification dataset describing obesity levels based on eating habits and physical condition.
• Robot (Freire, 2010) is a multi-class classification dataset collected as the robot moves around the room, following the wall using ultrasound sensors.

Figure 5: Histograms of values in the excess kurtosis column of HTRU

Classification/regression with real data. We report the average F1 (resp. macro F1), AUROC, and Weighted-F1 for binary (resp. multi-class) classification, and R 2 and RMSE for regression. The best (resp. the second best) results are highlighted in bold face (resp. with underline).

The median of the log-probabilities of testing records, averaged over all datasets

Sampling diversity in terms of coverage averaged over all datasets



Runtime evaluation results, averaged over all datasets

Ablation study. We report F1 (resp. R 2 ) for classification (resp. regression).

Sensitivity analyses. We report F1 (resp. R 2 ) for classification (resp. regression).



Comparison between SOS and STaSy in terms of Weighted-F1

We report the F1 score and the median of the log-probabilities of testing records according to the fine-tuning objective.

We report F1 (resp. R 2) for the classification (resp. regression) TSTR evaluation, and also report AUROC and Weighted-F1 (resp. RMSE) results. Full results for all datasets are in Tables 11, 12, and 13 for binary classification, Tables 14, 15, and 16 for multi-class classification, and Table 17 for regression. We train and test various base classifiers/regressors and report their mean and standard deviation. Moreover, we use the log-probability as another metric for the sampling quality. Full results are in Table 18. The best results are highlighted in bold face and the second best results with underline. As shown, Naïve-STaSy and STaSy show the best and the second best performances in almost all cases.
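The Train-on-Synthetic, Test-on-Real (TSTR) protocol can be sketched as follows; as an illustrative stand-in for the various base classifiers above, we use a trivial 1-NN classifier implemented in NumPy (a simplifying assumption, not the paper's actual model set):

```python
import numpy as np

def tstr_f1(fake_X, fake_y, real_X, real_y):
    """TSTR sketch: train only on fake records, evaluate F1 on real ones.

    The classifier never sees real training data, so the score reflects
    how faithfully the fake data captures the real decision boundary.
    Here the "classifier" is 1-NN: each real test record takes the label
    of its nearest fake record. Labels are assumed binary {0, 1}.
    """
    d = np.linalg.norm(real_X[:, None, :] - fake_X[None, :, :], axis=-1)
    pred = fake_y[d.argmin(axis=1)]           # label of the nearest fake record
    tp = np.sum((pred == 1) & (real_y == 1))
    fp = np.sum((pred == 1) & (real_y == 0))
    fn = np.sum((pred == 0) & (real_y == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

In practice one would substitute stronger classifiers (e.g., gradient-boosted trees or MLPs) and average their scores, as done in the experiments.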

Classification with real data. We report macro F1 for multi-class classification.

Classification with real data. We report AUROC for multi-class classification.

Classification with real data. We report Weighted-F1, which is inversely weighted to its class size, for multi-class classification.

Regression with real data. We report R 2 and RMSE for regression.

The median of the log-probabilities of testing records are reported.

Sampling diversity in terms of coverage for binary classification datasets

Sampling diversity in terms of coverage for multi-class classification datasets

Sampling diversity in terms of coverage for regression datasets

Wall-clock runtime for binary classification datasets

Wall-clock runtime for multi-class classification datasets

Datasets used for our experiments

• Phishing: https://archive.ics.uci.edu/ml/datasets/phishing+websites (CC BY 4.0)
• Shoppers: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset (CC BY 4.0)
• Spambase: https://archive.ics.uci.edu/ml/datasets/spambase (CC BY 4.0)
• Crowdsource: https://archive.ics.uci.edu/ml/datasets/Crowdsourced+Mapping# (CC BY 4.0)
• Obesity: https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+ (CC BY 4.0)

Hyperparameters of the base classifiers/regressors

The best hyperparameters used in Table 2. For each dataset, we list the hyperparameters for SPL of STaSy (SDE Type, Layer Type, Activation, Learn. Rate, α 0, β 0) and the hyperparameters for fine-tuning (Hutchinson Type, Learn. Rate).

ACKNOWLEDGMENTS

Jayoung Kim and Chaejeong Lee contributed equally. Noseong Park is the corresponding author. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (90% from No. 2021-0-00231, Development of Approximate DBMS Query Technology to Facilitate Fast Query Processing for Exploratory Data Analysis, and 10% from No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)).


Published as a conference paper at ICLR 2023 

