IMPROVING THE IMPUTATION OF MISSING DATA WITH MARKOV BLANKET DISCOVERY

Abstract

The imputation of missing data typically relies on generative and regression models. These approaches often operate on the unrealistic assumption that all of the data features are directly related to one another, and use all of the available features to impute missing values. In this paper, we propose a novel Markov Blanket discovery approach to determine the optimal feature set for a given variable by considering both observed variables and missingness of partially observed variables to account for systematic missingness. We then incorporate this method into the learning process of the state-of-the-art MissForest imputation algorithm, such that it informs MissForest which features to consider when imputing missing values, depending on the variable the missing value belongs to. Experiments across different case studies and multiple imputation algorithms show that the proposed solution improves imputation accuracy, both under random and systematic missingness.

1. INTRODUCTION

Dealing with missing data values is a common task across different scientific domains, especially in clinical (Little et al., 2012; Austin et al., 2021), genomics (Petrazzini et al., 2021) and ecological studies (Alsaber et al., 2021; Zhang & Thorburn, 2022). The problem can be difficult to address accurately because missingness can be caused by various known and unknown factors, including machine fault, privacy restriction, data corruption, inconsistencies in the way data are recorded, and human error. Rubin (1976) categorised the problem of missing data into three classes known as Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). We say data is MCAR when the missingness is purely random, i.e., the missing mechanism is independent of both the observed and unobserved values. On the other hand, data is MAR when the missingness is dependent on the observed values but independent of the unobserved values given the observed values, implying that MAR data can be effectively imputed by relying on observed data alone. Lastly, data is said to be MNAR when it is neither MCAR nor MAR, in which case missingness depends on the unobserved values as well as, possibly, the observed ones. While it is tempting to simply remove data rows that contain empty data cells, a process often referred to as list-wise deletion or complete case analysis, past studies have shown that such an approach is ineffective since it tends to lead to poorly trained models (Wilkinson, 1999; Baraldi & Enders, 2010). On this basis, the problem of missingness is typically handled by imputation approaches which estimate the missing values, often using regression or generative models, and return a complete data set. Imputation algorithms are often classified as either statistical or machine learning methods (Lin & Tsai, 2020).
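The three missingness classes defined above can be illustrated with a small simulation. The following is a minimal sketch and not part of the proposed method: the function name `missingness_masks`, the synthetic variables `x_obs` and `y`, and the two-level drop probabilities are all assumptions made for illustration. A fully observed variable `x_obs` and a partially observed variable `y` are generated, and `y` is masked under each mechanism.

```python
import numpy as np


def missingness_masks(x_obs, y, rate=0.3, seed=0):
    """Return boolean masks (True = missing) for y under MCAR, MAR and MNAR.

    Hypothetical helper for illustration; assumes rate <= 2/3 so that
    the higher drop probability stays within [0, 1].
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    low, high = 0.5 * rate, 1.5 * rate  # averages to `rate` over the two halves

    # MCAR: every entry of y is dropped with the same probability.
    mcar = rng.random(n) < rate

    # MAR: the drop probability depends only on the fully observed x_obs.
    p_mar = np.where(x_obs > np.median(x_obs), high, low)
    mar = rng.random(n) < p_mar

    # MNAR: the drop probability depends on the value of y that goes missing.
    p_mnar = np.where(y > np.median(y), high, low)
    mnar = rng.random(n) < p_mnar

    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}


# Synthetic data: y is correlated with the fully observed x_obs.
rng = np.random.default_rng(42)
x_obs = rng.normal(size=5000)
y = 0.8 * x_obs + rng.normal(scale=0.6, size=5000)

for name, mask in missingness_masks(x_obs, y).items():
    print(f"{name}: missing rate = {mask.mean():.2f}, "
          f"mean of observed y = {y[~mask].mean():.3f}")
```

In this sketch the MAR mask can be explained entirely by `x_obs`, which remains available to an imputation model, whereas the MNAR mask is driven by the very values of `y` that are removed.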
Statistical imputation methods include Mean/Mode, one of the simplest methods, where each missing value is imputed with the mean or mode of the observed values found in the same data column. A more advanced statistical method is the Expectation-Maximization (EM) algorithm (Honaker et al., 2011). EM computes the expectation of the sufficient statistics given the observed data at the E-step (Expectation), and then maximizes the likelihood at the M-step (Maximization). It iterates over these two steps until convergence, at which point the converged parameters are used along with the observed data to impute missing values. Another statistical al-

