NOT-MIWAE: DEEP GENERATIVE MODELLING WITH MISSING NOT AT RANDOM DATA

Abstract

When a missing process depends on the missing values themselves, it needs to be explicitly modelled and taken into account while doing likelihood-based inference. We present an approach for building and fitting deep latent variable models (DLVMs) in cases where the missing process is dependent on the missing data. Specifically, a deep neural network enables us to flexibly model the conditional distribution of the missingness pattern given the data. This allows for incorporating prior information about the type of missingness (e.g. self-censoring) into the model. Our inference technique, based on importance-weighted variational inference, involves maximising a lower bound of the joint likelihood. Stochastic gradients of the bound are obtained by using the reparameterisation trick both in latent space and data space. We show on various kinds of data sets and missingness patterns that explicitly modelling the missing process can be invaluable.

1. INTRODUCTION

Missing data often constitute systemic issues in real-world data analysis, and can be an integral part of some fields, e.g. recommender systems. This requires the analyst to take action, either by using methods and models that are applicable to incomplete data or by imputing the missing data before applying models that require complete data. The expected model performance (often measured in terms of imputation error or the innocuity of missingness on the inference results) depends on the assumptions made about the missing mechanism and how well those assumptions match the true missing mechanism. In a seminal paper, Rubin (1976) introduced a formal probabilistic framework to assess missing mechanism assumptions and their consequences. The most common assumption, made either implicitly or explicitly, is that part of the data is missing at random (MAR). Essentially, the MAR assumption means that the missing pattern does not depend on the missing values. This makes it possible to ignore the missing data mechanism in likelihood-based inference by marginalising over the missing data. The often implicit assumption made in non-probabilistic models and ad hoc methods is that the data are missing completely at random (MCAR). MCAR is a stronger assumption than MAR; informally, it means that the missing pattern depends on neither the observed nor the missing data. More details on these assumptions can be found in the monograph of Little & Rubin (2002); of particular interest are also the recent revisits of Seaman et al. (2013) and Doretti et al. (2018). In this paper, our goal is to posit statistical models that leverage deep learning in order to break away from these assumptions. Specifically, we propose a general recipe for dealing with cases where there is prior information about the distribution of the missing pattern given the data (e.g. self-censoring).
The MAR and MCAR assumptions are violated when the missing data mechanism depends on the missing data themselves. This setting is called missing not at random (MNAR). Here the missing mechanism cannot be ignored; doing so leads to biased parameter estimates. This setting generally requires a joint model for the data and the missing mechanism. Deep latent variable models (DLVMs, Kingma & Welling, 2013; Rezende et al., 2014) have recently been used for inference and imputation in missing data problems (Nazabal et al., 2020; Ma et al., 2018; 2019; Ivanov et al., 2019; Mattei & Frellsen, 2019). This has led to impressive empirical results in the MAR and MCAR cases, in particular for high-dimensional data.
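The bias induced by ignoring an MNAR mechanism can be seen in a minimal simulation. The sketch below (a toy example; the self-censoring threshold and sample size are illustrative, not from the paper) draws standard Gaussian data, censors every value above a threshold, and shows that the mean of the observed values alone is systematically biased downwards:

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete data: 100,000 draws from N(0, 1), so the true mean is 0.
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Self-masking MNAR mechanism: a value is missing whenever it exceeds 1,
# i.e. the missingness depends on the (unobserved) value itself.
s = (x <= 1.0).astype(int)   # s = 1 means observed, s = 0 means missing
x_obs = x[s == 1]

# Treating the data as MAR/MCAR and simply averaging the observed values
# gives a biased estimate, since large values were preferentially censored
# (the truncated-normal mean is -phi(1)/Phi(1), roughly -0.29).
print(f"true mean     : 0.000")
print(f"observed mean : {x_obs.mean():.3f}")
```

Any inference that marginalises only over the observed part inherits this bias, which is why the missing model must enter the likelihood under MNAR.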

1.1. CONTRIBUTIONS

We introduce the not-missing-at-random importance-weighted autoencoder (not-MIWAE), which allows for the application of DLVMs to missing data problems where the missing mechanism is MNAR. It is inspired by the missing data importance-weighted autoencoder (MIWAE, Mattei & Frellsen, 2019), a framework for training DLVMs in MAR scenarios, itself based on the importance-weighted autoencoder (IWAE) of Burda et al. (2016). The general graphical model for the not-MIWAE is shown in figure 1a. The first part of the model is simply a latent variable model: there is a stochastic mapping parameterised by θ from a latent variable z ∼ p(z) to the data x ∼ p_θ(x|z), and the data may be partially observed. The second part of the model, which we call the missing model, is a stochastic mapping from the data to the missing mask s ∼ p_φ(s|x). Explicit specification of the missing model p_φ(s|x) makes it possible to address MNAR issues. The model can be trained efficiently by maximising a lower bound of the joint likelihood (of the observed features and the missing pattern) obtained via importance-weighted variational inference (Burda et al., 2016). A key difference from the MIWAE is that we use the reparameterisation trick in the data space, as well as in the code space, in order to obtain stochastic gradients of the lower bound. Missing processes affect data analysis in a wide range of domains, and often the MAR assumption does not hold. We apply our method to censoring in datasets from the UCI database, clipping in images, and selection bias in recommender systems.
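The bound described above can be sketched numerically for a single data point. The toy model below is purely illustrative (a linear-Gaussian "decoder", a logistic self-masking missing model, and the prior used as the proposal for brevity; in the not-MIWAE the proposal is an amortised encoder of the observed part, and all parameter values here are assumptions). It shows the two reparameterised sampling steps, in latent space and in data space, and the resulting importance-weighted estimate of a lower bound on log p(x^o, s):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(1)

d_z, d_x, K = 2, 4, 50                 # latent dim, data dim, importance samples
W = rng.normal(size=(d_x, d_z))        # toy linear "decoder" weights
sigma = 0.5                            # decoder noise scale
phi_w, phi_c = 3.0, 1.0                # missing-model parameters (illustrative)

x = rng.normal(size=d_x)               # one complete data point
s = (x <= 1.0).astype(float)           # mask: 1 = observed (self-masking)
x_obs = np.where(s == 1, x, 0.0)

# Reparameterised draws in latent space; proposal = prior N(0, I) for brevity.
z = rng.normal(size=(K, d_z))
mean_x = z @ W.T                       # decoder means, shape (K, d_x)

# Reparameterisation in data space: sample the missing entries from the decoder
# and mix them with the observed entries.
x_mis = mean_x + sigma * rng.normal(size=(K, d_x))
x_full = np.where(s == 1, x_obs, x_mis)

# Missing model p_phi(s|x): independent Bernoullis with p(s=1|x) = sigmoid(logits).
logits = phi_w * (phi_c - x_full)
log_ps = (-np.logaddexp(0.0, -logits) * s
          - np.logaddexp(0.0, logits) * (1 - s)).sum(axis=1)

# Importance weights: log p(x^o|z) + log p(s|x) + log p(z) - log q(z|x^o).
log_px_obs = (norm.logpdf(x_obs, loc=mean_x, scale=sigma) * s).sum(axis=1)
log_prior = norm.logpdf(z).sum(axis=1)
log_q = log_prior                      # proposal equals prior in this sketch

# K-sample importance-weighted lower bound on log p(x^o, s).
bound = logsumexp(log_px_obs + log_ps + log_prior - log_q) - np.log(K)
print(f"bound estimate: {bound:.2f}")
```

Because the missing entries are sampled with a differentiable transformation of noise, gradients with respect to the decoder and missing-model parameters flow through `x_full` into `log_ps` when the same computation is written in an autodiff framework.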

2. BACKGROUND

Assume that the complete data are stored in a data matrix X = (x_1, . . . , x_n) ∈ X^n that contains n i.i.d. copies of the random variable x ∈ X, where X = X_1 × · · · × X_p is a p-dimensional feature space. For simplicity, x_ij refers to the j'th feature of x_i, and x_i refers to the i'th sample in the data matrix. Throughout the text, we will make statements about the random variable x, and only consider samples x_i when necessary. In a missing data context, each sample can be split into an observed part and a missing part, x_i = (x_i^o, x_i^m). The pattern of missingness is individual to each copy of x and described by a corresponding mask random variable s ∈ {0, 1}^p. This leads to a mask matrix S = (s_1, . . . , s_n) ∈ {0, 1}^{n×p} such that s_ij = 1 if x_ij is observed and s_ij = 0 if x_ij is missing. We wish to construct a parametric model p_{θ,φ}(x, s) for the joint distribution of a single data point x and its mask s, which can be factorised as

p_{θ,φ}(x, s) = p_θ(x) p_φ(s|x).   (1)

Here p_φ(s|x) = p_φ(s|x^o, x^m) is the conditional distribution of the mask, which may depend on both the observed and missing data, through its own parameters φ. The three assumptions from the framework of Little & Rubin (2002) (see also Ghahramani & Jordan, 1995) pertain to the specific form of this conditional distribution:

• MCAR: p_φ(s|x) = p_φ(s),
• MAR: p_φ(s|x) = p_φ(s|x^o),
• MNAR: p_φ(s|x) may depend on both x^o and x^m.
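The three assumptions can be made concrete by simulating one mask-generating mechanism of each kind on the same complete data matrix. In this sketch (the mechanisms and parameter values are illustrative choices, not from the paper), feature 0 is always observed and feature 1 may go missing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 2
X = rng.normal(size=(n, p))            # complete data matrix, features independent

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# MCAR: p(s|x) = p(s) -- the mask is independent of the data.
S_mcar = rng.random((n, p)) < 0.8

# MAR: p(s|x) = p(s|x^o) -- feature 1 goes missing based only on the
# always-observed feature 0.
S_mar = np.ones((n, p), dtype=bool)
S_mar[:, 1] = rng.random(n) < sigmoid(-X[:, 0])

# MNAR: p(s|x) depends on x^m -- feature 1 goes missing based on its own
# (unobserved) value, a self-masking mechanism.
S_mnar = np.ones((n, p), dtype=bool)
S_mnar[:, 1] = rng.random(n) < sigmoid(-X[:, 1])

# Under MAR (with independent features) the observed values of feature 1
# remain representative; under MNAR they are systematically biased.
print(f"MAR  observed mean of feature 1: {X[S_mar[:, 1], 1].mean():+.3f}")
print(f"MNAR observed mean of feature 1: {X[S_mnar[:, 1], 1].mean():+.3f}")
```

Only in the MNAR case does ignoring the mask distort the distribution of the observed values, which is what motivates modelling p_φ(s|x) explicitly.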



Figure 1: (a) Graphical model of the not-MIWAE. (b) Gaussian data with MNAR values. Dots are fully observed, partially observed data are displayed as black crosses. A contour of the true distribution is shown together with directions found by PPCA and not-MIWAE with a PPCA decoder.

