PRETRAIN THEN FINETUNE: GETTING THE MOST FROM YOUR LABELED DATASET IN TABULAR DEEP LEARNING

Abstract

Similarly to computer vision and NLP, deep learning models for tabular data can benefit from pretraining, which is not the case for traditional ML models such as gradient boosted decision trees (GBDT). Although pretraining techniques for tabular data are actively studied, the existing works mostly focus on unsupervised pretraining, which implies access to a large amount of unlabeled data in addition to the labeled target dataset. By contrast, pretraining in the fully supervised setting, when the available data is fully labeled and directly represents the downstream tabular task, has received significantly less attention. Moreover, the existing works on pretraining typically consider only simple MLP architectures and do not cover recently proposed tabular models. In this work, we aim to identify best practices for pretraining in tabular DL that can be applied universally across datasets and architectures in the fully supervised setting. Among our findings, we show that using the target labels of objects during the pretraining stage is beneficial for downstream performance, and we advocate several target-aware pretraining objectives. Overall, our experiments demonstrate that properly performed pretraining significantly increases the performance of tabular DL models on fully supervised problems.

1. INTRODUCTION

Tabular problems, which involve data described by a set of heterogeneous features, are ubiquitous in industrial ML applications such as learning-to-rank, click-through rate prediction, credit scoring, and many others. Despite the current dominance of deep learning models in the ML literature, for tabular problems the "old-school" decision tree ensembles (e.g., GBDT) often remain the top choice for practitioners. Only recently have several works proposed deep models that challenge the supremacy of GBDT in the tabular domain (Arik & Pfister, 2020; Gorishniy et al., 2021; 2022) and suggested that the question "tabular DL or GBDT" is yet to be answered. An important advantage of deep models over GBDT is that they can potentially achieve higher performance by pretraining their parameters with a properly designed objective. The pretrained parameters then serve as a better-than-random initialization for subsequent finetuning on downstream tasks.

For the computer vision and NLP domains, pretraining is a de facto standard and has been shown to be necessary for state-of-the-art performance (He et al., 2021; Devlin et al., 2019). For tabular problems, such a consensus has yet to be reached, and the best practices of tabular pretraining remain to be established. While a large number of prior works address the pretraining of tabular DL models (Yoon et al., 2020; Bahri et al., 2022; Ucar et al., 2021; Darabi et al., 2021), it is challenging to draw reliable conclusions about pretraining efficacy in tabular DL from the literature, since experimental setups vary significantly. Moreover, most evaluation protocols assume that unlabeled data is abundant and use only a small subset of labels from each dataset during finetuning for evaluation, which demonstrates pretraining efficacy but somewhat limits the performance of supervised baselines on these datasets.
Such protocols of "unsupervised pretraining" are common in the vision and NLP domains, where huge amounts of "extra" data are available on the Internet. In contrast, in our work, we focus on the setup with fully labeled tabular datasets to understand whether pretraining helps tabular DL in a fully supervised setting and compare pretraining methods to
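To make the pretrain-then-finetune workflow concrete, the following is a minimal numpy sketch of a target-aware supervised pipeline on a synthetic tabular dataset. All names, the linear encoder, and the zero-out corruption scheme are illustrative assumptions for this sketch, not the objectives studied in the paper: the pretraining stage learns from both corrupted features and the target labels, and the resulting weights initialize the downstream predictor instead of a random (here, zero) initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fully labeled tabular dataset (hypothetical stand-in for a real task).
n, d = 512, 8
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def pretrain(X, y, mask_rate=0.3, lr=0.01, steps=300):
    """Target-aware pretraining sketch: a linear encoder is trained to
    reconstruct randomly corrupted features AND to predict the target,
    so the learned head weights already reflect label information."""
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(d, d))   # encoder: reconstructs features
    v = np.zeros(d)                          # head: predicts the target
    for _ in range(steps):
        mask = rng.random(X.shape) < mask_rate
        X_corr = np.where(mask, 0.0, X)      # zero-out corruption
        rec_err = X_corr @ W - X             # reconstruction residual
        y_err = X_corr @ v - y               # target-prediction residual
        W -= lr * (X_corr.T @ rec_err) / len(X)
        v -= lr * (X_corr.T @ y_err) / len(X)
    return W, v

def finetune(X, y, v_init, lr=0.05, steps=300):
    """Finetuning: the downstream predictor starts from the pretrained
    head weights rather than from scratch."""
    v = v_init.copy()
    for _ in range(steps):
        v -= lr * (X.T @ (X @ v - y)) / len(X)
    return v

W, v0 = pretrain(X, y)       # stage 1: target-aware pretraining
v = finetune(X, y, v0)       # stage 2: supervised finetuning
mse = np.mean((X @ v - y) ** 2)
```

In a real tabular DL setting, the linear maps above would be replaced by a deep network, and the pretraining loss by one of the objectives compared in the paper; the two-stage structure is the same.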

