PRETRAIN THEN FINETUNE: GETTING THE MOST FROM YOUR LABELED DATASET IN TABULAR DEEP LEARNING

Abstract

Similarly to computer vision and NLP, deep learning models for tabular data can benefit from pretraining, which is not the case for traditional ML models such as gradient boosted decision trees (GBDT). Although pretraining techniques for tabular data are actively studied, existing works mostly focus on unsupervised pretraining, which implies access to a large amount of unlabeled data in addition to the labeled target dataset. By contrast, pretraining in the fully supervised setting, when the available data is fully labeled and directly represents the downstream tabular task, has received significantly less attention. Moreover, the existing works on pretraining typically consider only the simplest MLP architectures and do not cover recently proposed tabular models. In this work, we aim to identify the best practices for pretraining in tabular DL that can be universally applied to different datasets and architectures in the fully supervised setting. Among our findings, we show that using the target labels of the training objects during the pretraining stage is beneficial for downstream performance, and we advocate several target-aware pretraining objectives. Overall, our experiments demonstrate that properly performed pretraining significantly increases the performance of tabular DL models on fully supervised problems.

1. INTRODUCTION

Tabular problems, in which data is described by a set of heterogeneous features, are ubiquitous in industrial ML applications, including learning-to-rank, click-through rate prediction, credit scoring, and many others. Despite the current dominance of deep learning models in the ML literature, for tabular problems the "old-school" decision tree ensembles (e.g., GBDT) are often the top choice for practitioners. Only recently have several works proposed deep models that challenge the supremacy of GBDT in the tabular domain (Arik & Pfister, 2020; Gorishniy et al., 2021; 2022) and suggest that the question "tabular DL or GBDT" is yet to be answered.

An important advantage of deep models over GBDT is that they can potentially achieve higher performance by pretraining their parameters with a properly designed objective. These pretrained parameters then serve as a better-than-random initialization for subsequent finetuning on downstream tasks. For the computer vision and NLP domains, pretraining is a de facto standard and is shown to be necessary for state-of-the-art performance (He et al., 2021; Devlin et al., 2019). For tabular problems, such a consensus is yet to be achieved, and the best practices of tabular pretraining are yet to be established. While a large number of prior works address the pretraining of tabular DL models (Yoon et al., 2020; Bahri et al., 2022; Ucar et al., 2021; Darabi et al., 2021), it is challenging to make reliable conclusions about pretraining efficacy in tabular DL from the literature, since experimental setups vary significantly. Moreover, most evaluation protocols assume that unlabeled data is abundant and use only a small subset of labels from each dataset during finetuning for evaluation, which demonstrates pretraining efficacy but somewhat limits the performance of supervised baselines on these datasets.
Such protocols of "unsupervised pretraining" are common in the vision and NLP domains, where huge amounts of "extra" data are available on the Internet. In contrast, in our work, we focus on the setup with fully labeled tabular datasets to understand whether pretraining helps tabular DL in a fully supervised setting, and we compare pretraining methods to strong supervised baselines. We argue that this setup is reasonable in practice, since for many tabular problems there is no publicly available relevant data to pretrain on. In this setup, we perform a systematic experimental evaluation of several pretraining objectives, identify the superior ones, and describe the practical details of how to perform tabular pretraining optimally. Our main findings, which are important for practitioners, are summarized below:

• Pretraining provides substantial gains over well-tuned supervised baselines in the fully supervised setup.

• Simple self-prediction based pretraining objectives are comparable to the objective based on contrastive learning. To the best of our knowledge, this behavior was not reported before in tabular DL.

• The object labels can be exploited for more effective pretraining. In particular, we describe several "target-aware" objectives and demonstrate that they often outperform their "unsupervised" counterparts.

• Pretraining provides the most noticeable improvements for the vanilla MLP architecture. In particular, its performance after pretraining becomes comparable to that of state-of-the-art models trained from scratch, which is important for practitioners interested in simple and efficient solutions.

• Ensembling of pretrained models is beneficial. This indicates that the pretraining stage does not significantly decrease the diversity of the models, despite the fact that all the models are initialized with the same set of parameters.
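To make the self-prediction idea from the findings above concrete: in its simplest form, the model is trained to reconstruct randomly hidden features of an object from the remaining visible ones. The sketch below is a minimal numpy illustration of such an objective (the linear model, masking rate, and all names are our own illustrative choices, not the implementation used in this paper); a target-aware variant would additionally include a term that predicts the object's label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully-labeled tabular dataset: n objects, d numeric features.
n, d = 256, 8
X = rng.normal(size=(n, d))

# Self-prediction pretraining: hide a random subset of feature values and
# train the model to reconstruct them from the visible ones.
mask = rng.random((n, d)) < 0.3          # True = feature value is hidden
X_in = np.where(mask, 0.0, X)            # corrupted input

W = rng.normal(scale=0.1, size=(d, d))   # a linear "encoder-decoder" for brevity

def recon_loss(W):
    X_hat = X_in @ W
    err = (X_hat - X) * mask             # score only the hidden entries
    return (err ** 2).sum() / mask.sum()

# One step of plain gradient descent on the reconstruction objective.
lr = 0.1
X_hat = X_in @ W
grad = 2.0 * X_in.T @ ((X_hat - X) * mask) / mask.sum()
loss_before = recon_loss(W)
W -= lr * grad
loss_after = recon_loss(W)
print(loss_before, loss_after)  # the objective decreases after the step
```

In a realistic pipeline, the linear map is replaced by the deep backbone, the reconstruction head is discarded after pretraining, and the backbone weights serve as the initialization for supervised finetuning.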
Overall, our work provides a set of recipes for practitioners interested in tabular pretraining, which result in higher performance on most of the tasks. The code of our experiments is available online.

2. RELATED WORK

Here we briefly review the lines of research relevant to our study.

Status quo in tabular DL. A plethora of recent works have proposed a large number of deep models for tabular data (Klambauer et al., 2017; Popov et al., 2020; Arik & Pfister, 2020; Song et al., 2019; Wang et al., 2017; Badirli et al., 2020; Hazimeh et al., 2020; Huang et al., 2020; Gorishniy et al., 2021; Kossen et al., 2021). Several systematic studies, however, reveal that these models typically do not consistently outperform decision tree ensembles, such as GBDT (Gradient Boosting Decision Tree) (Chen & Guestrin, 2016; Prokhorenkova et al., 2018; Ke et al., 2017), which are typically the top choice in various ML competitions (Gorishniy et al., 2021; Shwartz-Ziv & Armon, 2021). Additionally, several works have shown that the existing sophisticated architectures are not consistently superior to properly tuned simple models, such as MLP and ResNet (Gorishniy et al., 2021; Kadra et al., 2021). Finally, the recent work of Gorishniy et al. (2022) has highlighted that appropriate embeddings of numerical features into a high-dimensional space are universally beneficial for different architectures. In our work, we experiment with pretraining of both traditional MLP-like models and the advanced embedding-based models proposed in Gorishniy et al. (2022).

Pretraining in deep learning. For domains with structured data, like natural images or texts, pretraining is currently an established stage in typical pipelines, leading to higher general performance and better model robustness (He et al., 2021; Devlin et al., 2019). Pretraining with the auto-encoding objective was also previously studied as a regularization strategy that helps the optimization process without large-scale pretraining datasets (Erhan et al., 2010; El-Nouby et al., 2021; Krishna et al., 2022). Over the last years, several families of successful pretraining methods have been developed.
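To illustrate the kind of numerical-feature embedding discussed above: one of its variants maps each scalar feature to a vector of periodic activations before feeding it to the backbone. The numpy sketch below conveys the idea only; the function name, dimensions, and frequency initialization are our own illustrative choices, not the exact parameterization of Gorishniy et al. (2022), where the frequencies are learned.

```python
import numpy as np

def periodic_embedding(x, freqs):
    """Embed each scalar feature as [sin(2*pi*c*x), cos(2*pi*c*x)] per frequency c.

    x:     (n, d) matrix of numeric features
    freqs: (d, k) per-feature frequencies (trainable in a real model)
    returns an (n, d, 2 * k) tensor of embeddings
    """
    angles = 2 * np.pi * x[:, :, None] * freqs[None, :, :]   # (n, d, k)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # 4 objects, 3 numeric features
freqs = rng.normal(size=(3, 8))             # 8 frequencies per feature
emb = periodic_embedding(x, freqs)
print(emb.shape)  # (4, 3, 16)
```

The embedded features are typically flattened or processed per-feature by the subsequent layers, replacing the raw scalar inputs of a plain MLP.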
An impactful line of research on pretraining is based on the paradigm of contrastive learning, which effectively enforces the invariance of the learned representations to human-specified augmentations (Chen et al., 2020; He et al., 2020). Another line of methods exploits the idea of self-prediction, i.e., these methods require the model to predict certain parts of the input given the remaining parts (He et al., 2021; Devlin et al., 2019). In the vision community, self-prediction based methods have been shown to be superior to methods that use contrastive learning.
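The contrastive paradigm described above is commonly instantiated with the InfoNCE loss: two corrupted views of the same object should have more similar representations than views of different objects. The following is our own minimal numpy sketch of that loss applied to raw tabular rows with random feature dropout as the augmentation (in practice, the loss is computed on learned representations, not raw features).

```python
import numpy as np

def info_nce(z1, z2, temperature=0.2):
    """InfoNCE: each row of z1 should match the same-index row of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (n, n) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy, identity targets

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))
# Two "views": the same rows under independent random feature dropout.
view1 = X * (rng.random(X.shape) > 0.2)
view2 = X * (rng.random(X.shape) > 0.2)
loss_aligned = info_nce(view1, view2)
loss_shuffled = info_nce(view1, view2[rng.permutation(32)])
print(loss_aligned < loss_shuffled)  # aligned pairs score better
```

Minimizing this loss pulls together representations of the two views of each object while pushing apart representations of different objects within the batch.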



Code: https://anonymous.4open.science/r/pretrains-DD02

