MET: MASKED ENCODING FOR TABULAR DATA

Abstract

We propose Masked Encoding for Tabular Data (MET) for learning self-supervised representations from tabular data. Tabular self-supervised learning (tabular-SSL) is more challenging than SSL in structured domains like images, audio, and text, since each tabular dataset can have a completely different structure among its features (or coordinates), and this structure is hard to identify a priori. MET attempts to circumvent this problem through the following hypothesis: the observed tabular features are generated by a latent graphical model, and the downstream tasks are significantly easier to solve in the latent space. Based on this hypothesis, MET uses random-masking-based encoders to learn a positional embedding for each coordinate, which in turn captures the latent structure between coordinates. Through experiments on a toy dataset generated from a linear graphical model, we show that MET is indeed able to capture the latent graphical model. Practically, through extensive experiments on multiple benchmarks for tabular data, we demonstrate that MET significantly outperforms all the baselines. For example, on Criteo, a large-scale click-prediction dataset, MET achieves as much as a 5% improvement over the current state of the art (SOTA), whereas purely supervised learning approaches have advanced SOTA by at most 2% in the last six years. Furthermore, averaged over nine datasets, MET is around 3.9% more accurate than the next best method, gradient-boosted decision trees, which are widely considered SOTA for the tabular setting.

1. INTRODUCTION

Self-supervised pre-training (SSL) followed by supervised fine-tuning has emerged as the state-of-the-art approach in multiple domains such as natural language processing (NLP) (Devlin et al., 2019), computer vision (Chen et al., 2020b), and speech/audio processing (Baevski et al., 2020). However, despite the presence of extensive raw, unlabeled data in a variety of critical tabular-heavy domains like finance and marketing, it has been challenging to extend SSL-based pre-training approaches to tabular data. Broadly speaking, there are two dominant approaches to SSL: (i) reconstruction of masked inputs, and (ii) invariance to certain augmentations/transformations, also known as contrastive learning. Most existing tabular-SSL methods (Verma et al., 2020; Ucar et al., 2021) have adopted the second approach, contrastive learning. The underlying structure and semantics of specific domains such as images remain somewhat static irrespective of the dataset, so one can design generalizable, domain-specific augmentations like cropping, rotating, and resizing. However, tabular data has no such fixed input vocabulary (analogous to pixels in images) or shared semantic structure, and thus lacks augmentations that generalize across datasets. Consequently, only a limited number of augmentations have been proposed for the tabular setting, such as mix-up, adding random (Gaussian) noise, and selecting subsets of features (Verma et al., 2020; Ucar et al., 2021). In this paper, we hypothesize the following: for any tabular dataset, (i) there is a latent (i.e., unknown/unobserved) graphical model that captures the relations between different coordinates/features, and (ii) classification is easier in the latent space. For example, in the CovType dataset, where the task is to predict the type of forest (e.g., deciduous, alpine)
given features such as elevation and soil type, extensive research in mountain and forest science has established that there are very specific relations among the features (Catry et al., 2009; Badía et al., 2016), and learning and leveraging these relations could yield significant improvements in the classification accuracy of machine learning models.
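To make the contrast concrete, the handful of tabular augmentations mentioned above can be sketched as follows. This is an illustrative sketch only, not the implementation from any of the cited works; the function names and parameter values (`alpha`, `sigma`, `keep_frac`) are arbitrary choices for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, x2, alpha=0.2, rng=rng):
    """Mix-up: a random convex combination of two rows."""
    lam = rng.beta(alpha, alpha)  # lam in [0, 1]
    return lam * x1 + (1 - lam) * x2

def gaussian_noise(x, sigma=0.1, rng=rng):
    """Corrupt a row with additive Gaussian noise."""
    return x + rng.normal(scale=sigma, size=x.shape)

def feature_subset(x, keep_frac=0.5, rng=rng):
    """Keep a random subset of features, zeroing out the rest."""
    keep = rng.random(x.shape) < keep_frac
    return x * keep
```

Note that none of these uses any dataset-specific structure, which is precisely why they transfer so poorly compared to, say, image crops that exploit spatial locality.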
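The masked-reconstruction alternative that MET builds on can be sketched as the following data-preparation step: randomly hide a large fraction of a row's coordinates, feed the encoder only the visible (value, coordinate-index) pairs, and task the model with reconstructing the hidden values. This is a minimal sketch of the masking step only, under assumed details: the masking fraction and helper names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_row(row, mask_frac=0.7, rng=rng):
    """Randomly partition a row's coordinates into visible and masked sets.

    Returns the visible values, their coordinate indices (whose learned
    positional embeddings would be the encoder's input alongside the
    values), and the masked indices whose values must be reconstructed.
    """
    d = row.shape[0]
    n_mask = int(mask_frac * d)
    perm = rng.permutation(d)
    masked_idx = np.sort(perm[:n_mask])
    visible_idx = np.sort(perm[n_mask:])
    return row[visible_idx], visible_idx, masked_idx

row = rng.normal(size=10)
vis_vals, vis_idx, msk_idx = mask_row(row)
# Reconstruction target for the decoder: row[msk_idx]
```

Because the coordinate indices enter through learned positional embeddings, reconstructing masked values forces those embeddings to encode how coordinates relate to one another, which is how MET aims to recover the latent structure between features.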

