DEALING WITH MISSING DATA USING ATTENTION AND LATENT SPACE REGULARIZATION

Abstract

Most practical data science problems encounter missing data. A wide variety of solutions exist, each with strengths and weaknesses that depend upon the missingness-generating process. Here we develop a theoretical framework for training and inference using only observed variables, enabling modeling of incomplete datasets without imputation. Using an information- and measure-theoretic argument, we construct models with latent space representations that regularize against the potential bias introduced by missing data. The theoretical properties of this approach are demonstrated empirically using a synthetic dataset. The performance of this approach is tested on 11 benchmarking datasets with missingness and 18 datasets corrupted across three missingness patterns, with comparison against a state-of-the-art model and industry-standard imputation. We show that our proposed method overcomes the weaknesses of imputation methods and outperforms the current state of the art with statistical significance.

1. INTRODUCTION

Missing data is a common problem encountered by the data scientist. The consequences of missing data are often not straightforward and depend on the missingness-generating process (Little and Rubin 2002). Choosing the best strategy to deal with missingness is critical when designing statistical or machine learning models that rely on incomplete datasets. A frequent complication is that the missingness-generating process is often unknown, leading to assumptions about the missingness and potentially the introduction of bias into model training and inference (Davey and Dai 2020). The two most commonly employed strategies for dealing with missing data are to either drop data points where missing data exists or impute the values (Bertsimas et al. 2021). Both strategies can potentially introduce bias into a model if applied incorrectly (Little and Rubin 2002). The 'impute and regress' strategy has recently been critiqued in the setting of machine learning predictors, with the finding that an imputation method leads to a consistent prediction model if that model can almost surely undo the imputation, which questions the need for sophisticated imputation strategies (Bertsimas et al. 2018, 2021). The best case for an imputation method is to replace unobserved variables with values that do not shift the model's predictive distribution away from the true predictive distribution given only the observed variables. The worst case is introducing a significant bias during training that increases the divergence of the model's predictive distribution from the true distribution. Indeed, Jeong et al. (2022) proved with an information-theoretic argument that there is no universally fair imputation method for different downstream learning tasks.
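To make the 'impute and regress' pipeline and its bias risk concrete, the following minimal sketch (illustrative only; the toy array and mean-imputation choice are our own, not drawn from the paper) shows how a filled-in value silently enters the downstream fit:

```python
import numpy as np

# Toy design matrix with one missing entry (np.nan marks missingness).
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Mean imputation: replace each missing value with its column mean,
# computed over observed entries only.
col_means = np.nanmean(X, axis=0)               # column means ignoring NaN
X_imputed = np.where(np.isnan(X), col_means, X)

# The downstream regressor now trains on X_imputed as if it were fully
# observed; any distortion introduced by the imputation propagates into
# the fitted model, which is the bias risk discussed above.
```

If the true conditional distribution of the missing feature differs from its marginal mean, the imputed value corrupts exactly the quantity the regressor learns from, which is why the consistency result of Bertsimas et al. (2021) hinges on the model being able to undo the imputation.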
While practitioners continue to develop novel machine learning and statistical methods for imputation, comparatively little effort has been made to create a framework for reasoning about models that can fit incomplete data, with the notable exception of decision tree models (Gavankar and Sawarkar 2015; Jeong et al. 2022). In this paper, we consider the question of designing a model that trains and performs inference on only the observed variables, without imputation or data deletion.
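One way to train and infer on only the observed variables is to mask unobserved positions out of an attention computation, so missing entries never enter the model at all. The following sketch is a hypothetical illustration of that general idea (the fixed scores and single-vector setup are our own simplifications, not the architecture proposed in this paper):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax; exp(-inf) evaluates to 0.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x = np.array([1.0, np.nan, 3.0])   # one feature unobserved
observed = ~np.isnan(x)            # boolean missingness mask

# Illustrative per-feature attention scores (fixed here for simplicity;
# a real model would compute these from the data).
scores = np.array([0.5, 2.0, 1.0])

# Set unobserved positions to -inf before the softmax so they receive
# exactly zero attention weight.
masked = np.where(observed, scores, -np.inf)
weights = softmax(masked)

# The weighted summary therefore depends on observed values only;
# no imputed placeholder is ever consumed.
summary = float(np.sum(weights[observed] * x[observed]))
```

The key property is that the missing entry contributes neither to the normalization of the attention weights nor to the summary, so no assumption about its value is baked into training or inference.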

1.1. RELATED WORK

Most recent work has focused on developing novel imputation techniques, such as auto-encoders, in order to build better imputations based on the structure of the data (Abiri et al. 2019). Learning and inference without imputation have recently been considered by Bertsimas et al. (2021).

