STUNT: FEW-SHOT TABULAR LEARNING WITH SELF-GENERATED TASKS FROM UNLABELED TABLES

Abstract

Learning with few labeled tabular samples is often an essential requirement for industrial machine learning applications, as many varieties of tabular data suffer from high annotation costs or difficulties in collecting new samples for novel tasks. Despite its importance, this problem is quite under-explored in the field of tabular learning, and existing few-shot learning schemes from other domains are not straightforward to apply, mainly due to the heterogeneous characteristics of tabular data. In this paper, we propose a simple yet effective framework for few-shot semi-supervised tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT). Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label. We then employ a meta-learning scheme to learn generalizable knowledge from the constructed tasks. Moreover, we introduce an unsupervised validation scheme for hyperparameter search (and early stopping) by generating a pseudo-validation set from unlabeled data using STUNT. Our experimental results demonstrate that our simple framework brings significant performance gains on various tabular few-shot learning benchmarks, compared to prior semi- and self-supervised baselines.

1. INTRODUCTION

Learning with few labeled samples is often an essential ingredient of machine learning applications for practical deployment. However, while various few-shot learning schemes have been actively developed across several domains, including images (Chen et al., 2019) and languages (Min et al., 2022), such research has been under-explored in the tabular domain despite its practical importance in industry (Guo et al., 2017; Zhang et al., 2020; Ulmer et al., 2020). In particular, few-shot tabular learning is a crucial application, as many varieties of tabular datasets (i) suffer from high labeling costs, e.g., credit risk in financial datasets (Clements et al., 2020), and (ii) even show difficulties in collecting new samples for novel tasks, e.g., a patient with a rare or new disease (Peplow, 2016) such as an early infected patient of COVID-19 (Zhou et al., 2020). To tackle such limited-label issues, a common consensus across various domains is to utilize unlabeled datasets for learning a generalizable and transferable representation, e.g., in images (Chen et al., 2020a) and languages (Radford et al., 2019). In particular, prior works have shown that representations learned with self-supervised learning are notably effective when fine-tuned or jointly learned with few labeled samples (Tian et al., 2020; Perez et al., 2021; Lee et al., 2021b; Lee & Shin, 2022). However, contrary to conventional belief, we find this may not hold for tabular domains. For instance, recent state-of-the-art self-supervised tabular learning methods (Yoon et al., 2020; Ucar et al., 2021) do not bring meaningful performance gains over even a simple k-nearest neighbor (kNN) classifier for few-shot tabular learning in our experiments (see Table 1 for more details). We hypothesize that this is because the gap between the trained self-supervised tasks and the applied few-shot task is large due to the heterogeneous characteristics of tabular data.
Instead, we ask whether one can utilize the power of meta-learning to reduce the gap via fast adaptation to unseen few-shot tasks; meta-learning is indeed one of the most effective few-shot learning strategies across domains (Finn et al., 2017; Gu et al., 2018; Xie et al., 2018). We draw inspiration from the recent success of unsupervised meta-learning, which meta-learns over self-generated tasks from unlabeled data to train an effective few-shot learner (Khodadadeh et al., 2019; Lee et al., 2021a). It turns out that such an approach is quite a promising direction for few-shot tabular learning: a recent unsupervised meta-learning scheme (Hsu et al., 2018) outperforms the self-supervised tabular learning methods in few-shot tabular classification in our experiments (see Table 1). In this paper, we suggest further exploiting the benefits of unsupervised meta-learning for few-shot tabular learning by generating more diverse and effective tasks than prior works, using the distinct characteristics of the tabular dataset's column features.

Contribution. We propose a simple yet effective framework for few-shot semi-supervised tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT); see the overview in Figure 1. Our key idea is to generate a diverse set of tasks from the unlabeled tabular data by treating the table's column features as useful targets, e.g., the 'blood sugar' value can be used as a substitute label for 'diabetes'. Specifically, we generate pseudo-labels for the given unlabeled inputs by running k-means clustering on randomly chosen subsets of columns.
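The pseudo-labeling step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names are ours, and a tiny NumPy k-means stands in for a library routine such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans_labels(Z, k, n_iter=50, seed=0):
    """Tiny k-means (illustrative stand-in for e.g. sklearn's KMeans)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((Z[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(0)
    return labels

def generate_pseudo_task(X, n_cols=4, n_classes=3, seed=0):
    """STUNT-style pseudo-labels (a sketch): cluster a randomly chosen
    subset of columns and treat the cluster ids as the task labels."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    return cols, kmeans_labels(X[:, cols], n_classes, seed=seed)
```

Each call with a different seed yields a different column subset and hence a different pseudo-task, which is what provides the diversity of self-generated tasks.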
Moreover, to prevent generating a trivial task (where the task label could be directly inferred from the input columns), we randomly replace the chosen column features with values sampled from the columns' respective empirical marginal distributions. We then apply a meta-learning scheme, i.e., Prototypical Networks (Snell et al., 2017), to learn generalizable knowledge from the self-generated tasks. We also find that a major difficulty of the proposed meta-learning with unlabeled tabular datasets is the absence of a labeled validation set; training is quite sensitive to hyperparameter selection and can even suffer from overfitting. To this end, we propose an unsupervised validation scheme that applies STUNT to the unlabeled set. We find that the proposed technique is highly effective for hyperparameter search (and early stopping), as the accuracies on the pseudo-validation set and the test set show a high correlation. We verify the effectiveness of STUNT through extensive evaluations on various datasets from the OpenML-CC18 benchmark (Vanschoren et al., 2014; Bischl et al., 2021). Overall, our experimental results demonstrate that STUNT consistently and significantly outperforms prior methods, including unsupervised meta-learning (Hsu et al., 2018) and semi- and self-supervised learning schemes (Tarvainen & Valpola, 2017; Yoon et al., 2020; Ucar et al., 2021), under few-shot semi-supervised learning scenarios. In particular, our method improves the average test accuracy over the best baseline from 59.89% to 63.88% for 1-shot and from 72.19% to 74.77% for 5-shot. Furthermore, we show that STUNT is effective in multi-task learning scenarios, where it can adapt to new tasks without retraining the network.
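The two remaining ingredients, marginal-distribution perturbation of the label-defining columns and Prototypical-Network-style classification, can be sketched as below. Again a minimal illustration under our own naming, not the paper's code; the embedding network is omitted and raw features stand in for learned embeddings.

```python
import numpy as np

def perturb_columns(X, cols, rng):
    """Replace the label-defining columns with draws from their empirical
    marginals, so the pseudo-label cannot be read off the inputs."""
    Xp = X.copy()
    for c in cols:
        Xp[:, c] = rng.choice(X[:, c], size=len(X))
    return Xp

def proto_classify(z_support, y_support, z_query):
    """Prototypical-network inference (Snell et al., 2017): a class
    prototype is the mean support embedding; each query is assigned
    to the class of its nearest prototype."""
    classes = np.unique(y_support)
    protos = np.stack([z_support[y_support == c].mean(0) for c in classes])
    dists = ((z_query[:, None] - protos[None]) ** 2).sum(-1)
    return classes[dists.argmin(1)]
```

During meta-training, each self-generated task supplies a support/query split of perturbed rows with pseudo-labels; at test time the same prototype rule is applied with the few real labeled samples as the support set.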

2. RELATED WORK

Learning with few labeled samples. To learn an effective representation with few labeled samples, prior works suggest leveraging the unlabeled samples. Such works can be roughly categorized as (i) semi-supervised (Kim et al., 2020; Assran et al., 2021) and (ii) self-supervised (Chen et al., 2020a; b) 



Figure 1: An overview of the proposed Self-generated Tasks from UNlabeled Tables (STUNT): we generate the task label by running k-means clustering over randomly selected column features of the table, then perturb the selected columns to prevent generating a trivial task.

Availability: https://github.com/jaehyun513

