DEALING WITH MISSING DATA USING ATTENTION AND LATENT SPACE REGULARIZATION

Abstract

Most practical data science problems encounter missing data. A wide variety of solutions exist, each with strengths and weaknesses that depend upon the missingness-generating process. Here we develop a theoretical framework for training and inference using only the observed variables, enabling the modeling of incomplete datasets without imputation. Using an information- and measure-theoretic argument, we construct models with latent space representations that regularize against the potential bias introduced by missing data. The theoretical properties of this approach are demonstrated empirically using a synthetic dataset. The performance of this approach is tested on 11 benchmarking datasets with missingness and 18 datasets corrupted across three missingness patterns, with comparison against a state-of-the-art model and industry-standard imputation. We show that our proposed method overcomes the weaknesses of imputation methods and outperforms the current state-of-the-art with statistical significance.

1. INTRODUCTION

Missing data is a common problem encountered by the data scientist. The consequences of missing data are often not straightforward and depend on the missingness generating process (Little and Rubin 2002). Choosing the best strategy to deal with missingness is critical when designing statistical or machine learning models that rely on incomplete datasets. A frequent complication is that the missingness generating process is often unknown, forcing assumptions about the missingness and potentially introducing bias into model training and inference (Davey and Dai 2020). The two most commonly employed strategies for dealing with missing data are to drop data points where missing data exist or to impute the missing values (Bertsimas et al. 2021). Both strategies can introduce bias into a model if applied incorrectly (Little and Rubin 2002). The 'impute and regress' strategy has recently been critiqued in the setting of machine learning predictors, with the finding that an imputation method leads to a consistent prediction model if that model can almost surely undo the imputation, which questions the need for sophisticated imputation strategies (Bertsimas et al. 2018, 2021). In the best case, an imputation method replaces missing, unobserved variables with values that do not shift the model's predictive distribution away from the true predictive distribution given only the observed variables. In the worst case, imputation introduces a significant bias during training that increases the divergence of the model's predictive distribution from the true distribution. Indeed, Jeong et al. (2022) proved with an information-theoretic argument that there is no universally fair imputation method for different downstream learning tasks.
While practitioners continue to develop novel machine learning and statistical methods for imputation, comparatively little effort has been made to create a framework for reasoning about models that can fit incomplete data, with the notable exception of decision tree models (Gavankar and Sawarkar 2015; Jeong et al. 2022). In this paper, we consider the question of designing a model that trains and performs inference on only the observed variables, without imputation or data deletion.

1.1. RELATED WORK

Most recent work has focused on developing novel imputation techniques, such as auto-encoders, in order to build better imputations based on the structure of the data (Abiri et al. 2019). Learning and inference without imputation was recently considered by Bertsimas et al. (2021), who provided a theoretical treatment of the deficiencies of imputation in predictive modeling, drawing an important distinction between statistical inference and prediction. They build on this approach to develop an adaptive hierarchical linear regression model that is capable of performing prediction in the presence of any missingness pattern. Similarly, Jeong et al. (2022) provide a method of inference without imputation using a decision tree approach with a fairness-regularized loss function. Our work differs substantially from these approaches: we show that training and inference using only the observed data are feasible using latent space representations and an entropy-based objective that regularizes against the potential bias of missingness.

1.2. SUMMARY OF CONTRIBUTIONS

1. Introduce a novel method of dealing with missingness by establishing a simple framework for reasoning about a model that fits and infers from the observed variables only.
2. Interpret the latent space representations in this framework with a measure- and information-theoretic argument.
3. Empirically validate the theoretical properties of the latent space on a synthetic dataset with a latent space attention model.
4. Demonstrate the effectiveness of this model on benchmark datasets with corrupted data and real-world datasets with missingness.

2. INTERPRETING MISSINGNESS IN A MEASURABLE SPACE

We first describe a sample space Ω defined by an unknown data generating process and an unknown missingness generating process. We then define three random variables:

X_d : Ω → R,  Y : Ω → R,  M_d : Ω → {0, 1},

where d ∈ {1, ..., D} indexes the D possibly observed variables. We can further specify X = {X_1, ..., X_D}, the random variable X : Ω → R^D, and M = {M_1, ..., M_D}, the random variable M : Ω → {0, 1}^D. A realization of X is a vector of possibly observed variables and Y is the outcome variable of interest. M is a missingness vector in which a value of 1 indicates a missing value and 0 an observed value.

If we consider the power set U = P(X), then we can define a measurable space (X, U). The impact of missingness results in a smaller σ-algebra U′ = {i : i ∈ U ∧ j ∈ M ∧ 1 ∉ j}, where M = P(M). Finally, we can define a probability space (X, U′, µ), where the measure maps each combination of possibly observed variables in X, defined in U′, to a probability: µ : u → [0, 1] for u ∈ U′. An implication of our definition of the measure µ is that where no missingness exists, each subset is equally likely and the distribution is uniform. In the presence of a missingness generating process, some subsets are more likely than others and the distribution diverges from uniform. How the probability distribution diverges from uniform can be considered with reference to the definitions of missingness originally described by Little and Rubin (2002). If

X_obs = {x_d : x_d ∈ X ∧ m_d = 0}  and  X_mis = {x_d : x_d ∈ X ∧ m_d = 1},

and we consider M to be defined by some unknown parameter β with conditional distribution f(M | X, β), then we can define three types of missingness:

Definition 1. If f(M | X, β) = f(M | β), then M ⊥ X | β and the data are considered to be missing completely at random (MCAR).

Definition 2. If f(M | X, β) = f(M | X_obs, β), then M ⊥ X_mis | X_obs, β and the data are considered to be missing at random (MAR).

Definition 3. If f(M | X, β) ≠ f(M | β) and f(M | X, β) ≠ f(M | X_obs, β), then the data are considered to be missing not at random (MNAR).
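Definitions 1-3 can be illustrated with a short simulation. The sketch below is a minimal example, assuming NumPy; the logistic form chosen for f(M | X, β), the column indices, and all variable names are illustrative only and not part of the proposed method:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3
X = rng.normal(size=(n, d))  # fully observed "ground truth" data

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR (Definition 1): f(M | X, beta) = f(M | beta); the mask ignores X.
beta = 0.3
M_mcar = rng.random((n, d)) < beta

# MAR (Definition 2): the mask for column 1 depends only on column 0,
# which is always observed, so f(M | X, beta) = f(M | X_obs, beta).
M_mar = np.zeros((n, d), dtype=bool)
M_mar[:, 1] = rng.random(n) < sigmoid(X[:, 0])

# MNAR (Definition 3): the mask for column 1 depends on the unobserved
# value itself, so large values are preferentially masked.
M_mnar = np.zeros((n, d), dtype=bool)
M_mnar[:, 1] = rng.random(n) < sigmoid(X[:, 1])

# Under MCAR the mean of the observed values is unbiased; under MNAR it
# is biased low, because large values of column 1 are more often missing.
obs_mean_mcar = X[~M_mcar[:, 1], 1].mean()
obs_mean_mnar = X[~M_mnar[:, 1], 1].mean()
```

The biased observed mean in the MNAR case is precisely the corruption that a naive 'impute and regress' pipeline propagates into a downstream model.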


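The measurable-space construction above, namely the collection U = P(X) of observed-variable subsets and the divergence of the measure µ from uniform under missingness, can also be checked empirically. The following sketch is a minimal illustration under assumed mechanisms (D = 3, a fully independent mask versus one with a dependence between variables 0 and 2); all names are illustrative:

```python
from collections import Counter
from itertools import combinations
import math
import random

random.seed(0)
D = 3

def powerset(indices):
    """All subsets of variable indices: the collection U = P(X)."""
    return [frozenset(c) for r in range(len(indices) + 1)
            for c in combinations(indices, r)]

U = powerset(range(D))  # 2**D = 8 possible observed-variable combinations

def empirical_measure(masks):
    """Estimate mu: the probability of each observed-variable subset,
    from realized missingness masks m (1 = missing, 0 = observed)."""
    counts = Counter(frozenset(d for d, m_d in enumerate(m) if m_d == 0)
                     for m in masks)
    n = len(masks)
    return {u: counts.get(u, 0) / n for u in U}

# MCAR-style masks: each entry missing independently with probability 0.5,
# so every subset in U is (approximately) equally likely and mu is uniform.
masks_mcar = [[int(random.random() < 0.5) for _ in range(D)]
              for _ in range(20_000)]
mu_mcar = empirical_measure(masks_mcar)

# A structured mechanism: variable 2 is missing whenever variable 0 is,
# so half the subsets in U never occur and mu diverges from uniform.
masks_dep = []
for _ in range(20_000):
    m0, m1 = int(random.random() < 0.5), int(random.random() < 0.5)
    masks_dep.append([m0, m1, m0])
mu_dep = empirical_measure(masks_dep)

def entropy(mu):
    """Shannon entropy of mu in bits; maximal (log2 8 = 3) when uniform."""
    return -sum(p * math.log2(p) for p in mu.values() if p > 0)
```

Here entropy(mu_mcar) is close to the maximal 3 bits, while entropy(mu_dep) drops to roughly 2 bits because only four of the eight subsets in U remain reachable, which is the sense in which a missingness generating process pushes µ away from uniform.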