HAVE MISSING DATA? MAKE IT MISS MORE! IMPUTING TABULAR DATA WITH MASKED AUTOENCODING

Abstract

We present REMASKER, a novel method for imputing missing values in tabular data by extending the masked autoencoding framework. In contrast to prior work, REMASKER is both simple and effective. It is simple: besides the missing values in the given dataset (i.e., naturally masked), we randomly "re-mask" another set of values, optimize the autoencoder by reconstructing this re-masked set, and apply the trained model to predict the missing values. It is effective: through extensive evaluation on benchmark datasets, we show that REMASKER performs on par with or outperforms state-of-the-art methods in terms of both imputation fidelity and utility under various missingness settings, and its performance advantage often increases with the ratio of missing data. We further explore a theoretical justification for its effectiveness, showing that REMASKER tends to learn missingness-invariant representations of tabular data. Our findings indicate that masked modeling represents a promising direction for further research on tabular data imputation.

1. INTRODUCTION

Missing values are ubiquitous in real-world tabular data, arising during data collection, processing, storage, or transmission. It is often desirable to estimate the most likely values of missing data before performing downstream tasks (e.g., classification or synthesis). To this end, intensive research has been dedicated to developing imputation methods ("imputers") that estimate missing values based on observed data (Yoon et al., 2019; Jarrett et al., 2022; Kyono et al., 2021; Stekhoven & Buhlmann, 2012; Mattei & Frellsen, 2018). Yet, imputing missing values in tabular data with high fidelity and utility remains an open problem, due to challenges including the intricate correlations across features, the variety of missingness scenarios, and the scarcity of available data relative to the number of missing values.

State-of-the-art imputers can be categorized as either discriminative or generative. Discriminative imputers, such as MissForest (Stekhoven & Buhlmann, 2012), MICE (van Buuren & Groothuis-Oudshoorn, 2011), and MIRACLE (Kyono et al., 2021), impute missing values by modeling their distributions conditional on the other values. In practice, these methods are often hindered by the need to specify proper functional forms for the conditional distributions and to add appropriate regularizers. Generative imputers, such as GAIN (Yoon et al., 2019), MIWAE (Mattei & Frellsen, 2018), GAMIN (Yoon & Sull, 2020), and HI-VAE (Nazabal et al., 2020), estimate the joint distribution of all features by leveraging the capacity of deep generative models and impute missing values by querying the trained models. Empirically, GAN-based methods often require a large amount of training data and suffer from the difficulties of adversarial training (Goodfellow et al., 2014), while VAE-based methods often face the limitations of training through variational bounds (Zhao et al., 2022).
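To make the discriminative (MICE-style) approach concrete, the following is a minimal numpy-only sketch of cyclic regression imputation: each feature with missing entries is regressed on all the others, and the cycle repeats for a fixed number of rounds. The function name `mice_numpy` and its parameters are our own illustrative choices, not drawn from any of the cited implementations.

```python
import numpy as np

def mice_numpy(X, n_iter=10):
    """MICE-style cyclic regression imputation (illustrative sketch).

    X: (n, d) float array with NaN marking missing entries.
    Each incomplete feature is regressed (linear least squares, with
    intercept) on all other features; missing entries are refreshed
    from the fitted model, cycling for n_iter rounds.
    """
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initialize missing entries with column means
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            rows = ~miss[:, j]                  # rows where feature j is observed
            others = np.delete(X, j, axis=1)    # all features except j
            A = np.column_stack([others, np.ones(len(X))])  # add intercept
            coef, *_ = np.linalg.lstsq(A[rows], X[rows, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef  # refresh missing entries
    return X
```

This simplified variant uses plain linear regression for every feature; practical implementations swap in per-feature models (e.g., random forests in MissForest) and handle categorical features separately.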
Further, some of these methods either require complete data during training or rely on assumptions about specific missingness patterns.

In this paper, we present REMASKER, a novel method that extends the masked autoencoding (MAE) framework (Devlin et al., 2018; He et al., 2022) to imputing missing values in tabular data. The idea of REMASKER is simple: besides the missing values in the given dataset (i.e., naturally masked), we randomly select and "re-mask" another set of values, optimize the autoencoder with the objective of reconstructing this re-masked set, and then apply the trained autoencoder to predict the missing values. Compared with prior work, REMASKER enjoys the following desiderata: (i) it is instantiated with a Transformer (Vaswani et al., 2017) backbone, whose self-attention mechanism captures the intricate inter-feature correlations (Huang et al., 2020); (ii) making no specific assumptions about the underlying missingness mechanisms, it is applicable to a variety of scenarios even when complete data is unavailable; and (iii) as the re-masking approach naturally accounts for missing values and encourages learning high-level representations beyond low-level statistics, REMASKER remains effective even under a high ratio of missing data (e.g., 0.7). Through extensive evaluation on 12 benchmark datasets under various missingness scenarios, we show that REMASKER performs on par with or outperforms 13 popular methods in terms of both imputation fidelity and utility, and its performance advantage often increases with the ratio of missing data. We further explore a theoretical explanation for its effectiveness: REMASKER encourages learning missingness-invariant representations of tabular data, which are insensitive to missing values. Our findings indicate that, beyond its success in the language and vision domains, masked modeling also represents a promising direction for future research on tabular data imputation.
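The re-masking step can be sketched as follows. This is a simplified numpy illustration of the masking logic only (no encoder/decoder), and the names `remask_step` and `reconstruction_loss` are hypothetical, not taken from the REMASKER implementation.

```python
import numpy as np

def remask_step(x, nat_mask, remask_ratio=0.25, rng=None):
    """One re-masking step (illustrative sketch).

    x: (n, d) data with NaN at naturally missing entries.
    nat_mask: boolean (n, d), True where a value is observed.
    Returns the model input (observed values minus the re-masked ones)
    and the boolean re-mask selecting the reconstruction targets.
    """
    if rng is None:
        rng = np.random.default_rng()
    # re-mask a random fraction of the *observed* entries only;
    # their ground truth is known, so they can serve as training targets
    remask = nat_mask & (rng.random(x.shape) < remask_ratio)
    x_in = x.copy()
    x_in[remask] = np.nan  # hidden from the model, like naturally missing values
    return x_in, remask

def reconstruction_loss(pred, x, remask):
    # the loss is computed only on the re-masked entries,
    # where ground-truth values are available
    return float(np.mean((pred[remask] - x[remask]) ** 2))
```

Training minimizes this loss over re-masked entries; at inference time, the fitted model is queried once more to fill in the naturally missing entries.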

2. RELATED WORK

Here, we survey the relevant literature in three categories.

Tabular data imputation. Existing imputation methods can be roughly categorized as either discriminative or generative. Discriminative methods (Stekhoven & Buhlmann, 2012; van Buuren & Groothuis-Oudshoorn, 2011; Kyono et al., 2021) often specify a univariate model for each feature conditional on all the others and perform cyclic regression over each target variable until convergence. Recent work has also explored adaptively selecting and configuring multiple discriminative imputers (Jarrett et al., 2022). Generative methods either implicitly train imputers as generators within the GAN framework (Yoon et al., 2019; Yoon & Sull, 2020) or explicitly train deep latent-variable models to approximate the joint distribution of all features (Mattei & Frellsen, 2018; Nazabal et al., 2020). There are also imputers based on representative-value (e.g., mean, median, or most frequent value) substitution (Hawthorne & Elliott, 2005), EM optimization (García-Laencina et al., 2010), matrix completion (Hastie et al., 2015), or optimal transport (Muzellec et al., 2020).

Transformer. The Transformer has emerged as a dominant architecture (Vaswani et al., 2017) in the language domain, stacking multi-head self-attention and MLP layers to capture both short- and long-range correlations between words. Recent work has explored Transformers in the vision domain by treating each image as a grid of visual words (Dosovitskiy et al., 2020). For instance, Transformers have been integrated into image generation models (Jiang et al., 2021; Zhang et al., 2021; Hudson & Zitnick, 2021), achieving performance comparable to CNN-based models.

Masked autoencoding. Autoencoding is a classical method for learning representations in a self-supervised manner (Vincent et al., 2008; Pathak et al., 2016): an encoder maps an input to its representation and a decoder reconstructs the original input.
Meanwhile, masked modeling was originally proposed as a pre-training method in the language domain: by holding out a proportion of a word sequence, it trains the model to predict the masked words (Devlin et al., 2018; Radford & Narasimhan, 2018). Recent work has combined autoencoding and masked modeling for vision tasks (Dosovitskiy et al., 2020; Bao et al., 2022). In particular, the seminal MAE (He et al., 2022) represents the state of the art in self-supervised pre-training on the ImageNet-1K benchmark. Our work is also related to approaches that model missing data by adapting existing model architectures (Przewięźlikowski et al., 2021). To the best of our knowledge, this is the first work to explore masked autoencoding with Transformers for the task of tabular data imputation.

3. REMASKER

Next, we present REMASKER, an extremely simple yet effective method for imputing missing values of tabular data. We begin by formalizing the imputation problem.

3.1. PROBLEM FORMALIZATION

Incomplete data. To model tabular data with d features, we consider a d-dimensional random variable x ≜ (x_1, . . . , x_d) ∈ X_1 × . . . × X_d, where X_i is either continuous or categorical for i ∈ {1, . . . , d}.
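Incomplete data of this form is commonly represented by pairing each record with a binary mask indicating which entries are observed. The following minimal numpy sketch, with hypothetical values, shows one such encoding for a single record.

```python
import numpy as np

# A hypothetical 4-feature record x ∈ X_1 × . . . × X_4,
# with NaN marking entries that are missing (naturally masked).
x = np.array([3.2, np.nan, 1.0, np.nan])

m = ~np.isnan(x)   # mask vector: m_i is True iff x_i is observed
x_obs = x[m]       # the observed sub-vector of x
```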

Availability: https://anonymous.4open.science/

