NOT-MIWAE: DEEP GENERATIVE MODELLING WITH MISSING NOT AT RANDOM DATA

Abstract

When a missing process depends on the missing values themselves, it needs to be explicitly modelled and taken into account while doing likelihood-based inference. We present an approach for building and fitting deep latent variable models (DLVMs) in cases where the missing process is dependent on the missing data. Specifically, a deep neural network enables us to flexibly model the conditional distribution of the missingness pattern given the data. This allows for incorporating prior information about the type of missingness (e.g. self-censoring) into the model. Our inference technique, based on importance-weighted variational inference, involves maximising a lower bound of the joint likelihood. Stochastic gradients of the bound are obtained by using the reparameterisation trick both in latent space and data space. We show on various kinds of data sets and missingness patterns that explicitly modelling the missing process can be invaluable.

1. INTRODUCTION

Missing data often constitute systemic issues in real-world data analysis, and can be an integral part of some fields, e.g. recommender systems. This requires the analyst to take action by either using methods and models that are applicable to incomplete data or by performing imputations of the missing data before applying models requiring complete data. The expected model performance (often measured in terms of imputation error or innocuity of missingness on the inference results) depends on the assumptions made about the missing mechanism and how well those assumptions match the true missing mechanism. In a seminal paper, Rubin (1976) introduced a formal probabilistic framework to assess missing mechanism assumptions and their consequences. The most commonly used assumption, either implicitly or explicitly, is that a part of the data is missing at random (MAR). Essentially, the MAR assumption means that the missing pattern does not depend on the missing values. This makes it possible to ignore the missing data mechanism in likelihood-based inference by marginalizing over the missing data. The often implicit assumption made in nonprobabilistic models and ad-hoc methods is that the data are missing completely at random (MCAR). MCAR is a stronger assumption than MAR; informally, it means that neither the observed nor the missing data depend on the missing pattern. More details on these assumptions can be found in the monograph of Little & Rubin (2002); of particular interest are also the recent revisits of Seaman et al. (2013) and Doretti et al. (2018). In this paper, our goal is to posit statistical models that leverage deep learning in order to break away from these assumptions. Specifically, we propose a general recipe for dealing with cases where there is prior information about the distribution of the missing pattern given the data (e.g. self-censoring).
The MAR and MCAR assumptions are violated when the missing data mechanism depends on the missing data themselves. This setting is called missing not at random (MNAR). Here the missing mechanism cannot be ignored; doing so will lead to biased parameter estimates. This setting generally requires a joint model for the data and the missing mechanism. Deep latent variable models (DLVMs, Kingma & Welling, 2013; Rezende et al., 2014) have recently been used for inference and imputation in missing data problems (Nazabal et al., 2020; Ma et al., 2018; 2019; Ivanov et al., 2019; Mattei & Frellsen, 2019). This has led to impressive empirical results in the MAR and MCAR cases, in particular for high-dimensional data.

1.1. CONTRIBUTIONS

We introduce the not-missing-at-random importance-weighted autoencoder (not-MIWAE), which allows for the application of DLVMs to missing data problems where the missing mechanism is MNAR. It is inspired by the missing data importance-weighted autoencoder (MIWAE, Mattei & Frellsen, 2019), a framework to train DLVMs in MAR scenarios, itself based on the importance-weighted autoencoder (IWAE) of Burda et al. (2016). The general graphical model for the not-MIWAE is shown in figure 1a. The first part of the model is simply a latent variable model: there is a stochastic mapping parameterized by θ from a latent variable z ∼ p(z) to the data x ∼ p_θ(x|z), and the data may be partially observed. The second part of the model, which we call the missing model, is a stochastic mapping from the data to the missing mask s ∼ p_φ(s|x). Explicit specification of the missing model p_φ(s|x) makes it possible to address MNAR issues. The model can be trained efficiently by maximising a lower bound of the joint likelihood (of the observed features and the missing pattern) obtained via importance-weighted variational inference (Burda et al., 2016). A key difference with the MIWAE is that we use the reparameterization trick in the data space, as well as in the code space, in order to get stochastic gradients of the lower bound. Missing processes affect data analysis in a wide range of domains, and often the MAR assumption does not hold. We apply our method to censoring in datasets from the UCI database, clipping in images, and the issue of selection bias in recommender systems.

2. BACKGROUND

Assume that the complete data are stored within a data matrix X = (x_1, . . . , x_n) ∈ X^n that contains n i.i.d. copies of the random variable x ∈ X, where X = X_1 × · · · × X_p is a p-dimensional feature space. For simplicity, x_ij refers to the j'th feature of x_i, and x_i refers to the i'th sample in the data matrix. Throughout the text, we will make statements about the random variable x, and only consider samples x_i when necessary. In a missing data context, each sample can be split into an observed part and a missing part, x_i = (x_i^o, x_i^m). The pattern of missingness is individual to each copy of x and described by a corresponding mask random variable s ∈ {0, 1}^p. This leads to a mask matrix S = (s_1, . . . , s_n) ∈ {0, 1}^{n×p} such that s_ij = 1 if x_ij is observed and s_ij = 0 if x_ij is missing. We wish to construct a parametric model p_{θ,φ}(x, s) for the joint distribution of a single data point x and its mask s, which can be factored as

p_{θ,φ}(x, s) = p_θ(x) p_φ(s|x). (1)

Here p_φ(s|x) = p_φ(s|x^o, x^m) is the conditional distribution of the mask, which may depend on both the observed and missing data, through its own parameters φ. The three assumptions from the framework of Little & Rubin (2002) (see also Ghahramani & Jordan, 1995) pertain to the specific form of this conditional distribution:

• MCAR: p_φ(s|x) = p_φ(s),
• MAR: p_φ(s|x) = p_φ(s|x^o),
• MNAR: p_φ(s|x) may depend on both x^o and x^m.

To maximize the likelihood of the parameters (θ, φ) based only on observed quantities, the missing data are integrated out of the joint distribution:

p_{θ,φ}(x^o, s) = ∫ p_θ(x^o, x^m) p_φ(s|x^o, x^m) dx^m. (2)

In both the MCAR and MAR cases, the full likelihood becomes proportional to the likelihood of the observed data, p_{θ,φ}(x^o, s) ∝ p_θ(x^o), and the missing mechanism can be ignored while focusing only on p_θ(x^o).
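As a toy illustration of this notation (hypothetical data, not from the paper's experiments), a mask matrix S for a self-masking MNAR mechanism on one feature can be built as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 4
X = rng.normal(size=(n, p))              # complete data matrix, n i.i.d. samples

# Self-masking (MNAR) mechanism on feature 0: a value goes missing
# because of the value it would have had, had it been observed.
S = np.ones((n, p), dtype=int)           # s_ij = 1 means x_ij is observed
S[:, 0] = (X[:, 0] <= X[:, 0].mean()).astype(int)

X_obs = np.where(S == 1, X, np.nan)      # observed part x^o; missing entries are NaN
print(S.mean(axis=0))                    # fraction of observed entries per feature
```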
In the MNAR case, the missing mechanism can depend on both observed and missing data, so the likelihood in equation (2) does not factorize. The parameters of the data generating process and the parameters of the missing data mechanism are tied together by the missing data.

2.1. PPCA EXAMPLE

A linear DLVM with isotropic noise variance can be used to recover a model similar to probabilistic principal component analysis (PPCA, Roweis, 1998; Tipping & Bishop, 1999). In figure 1b, a dataset affected by an MNAR missing process is shown together with two fitted PPCA models: regular PPCA and the not-MIWAE formulated as a PPCA-like model. Data are generated from a multivariate normal distribution, and an MNAR missing process is imposed by setting the horizontal coordinate to missing when it is larger than its mean, i.e. it becomes missing because of the value it would have had, had it been observed. Regular PPCA for missing data assumes that the missing mechanism is MAR, so that the missing process is ignorable. This introduces a bias, both in the estimated mean and in the estimated principal signal direction of the data. The not-MIWAE PPCA assumes the missing mechanism is MNAR, so the data generating process and the missing data mechanism are modelled jointly as described in equation (2).
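The bias incurred by ignoring the mechanism can be reproduced in a few lines (a toy sketch with assumed parameters, not the paper's actual PPCA experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Toy "true" data: a correlated bivariate normal with zero mean.
x = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.8], [0.8, 1.0]], size=n)

# MNAR self-masking: the horizontal coordinate is missing whenever it
# exceeds its (true) mean of 0, i.e. missingness depends on the missing
# value itself.
observed = x[:, 0] <= 0.0

# Ignoring the mechanism and simply averaging the observed values gives
# a clearly biased estimate of the mean of the first coordinate.
biased_mean = x[observed, 0].mean()
print(biased_mean)                 # well below the true mean of 0
```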

2.2. PREVIOUS WORK

Rubin (1976) introduced and formalized the conditions under which the missing process can appropriately be ignored when doing likelihood-based or Bayesian inference. The introduction of the EM algorithm (Dempster et al., 1977) made it feasible to obtain maximum likelihood estimates in many missing data settings, see e.g. Ghahramani & Jordan (1994; 1995); Little & Rubin (2002). Sampling methods such as Markov chain Monte Carlo have made it possible to sample a target posterior in Bayesian models, including the missing data, so that parameter marginal distributions and missing data marginal distributions are available directly (Gelman et al., 2013). This is also the starting point of the multiple imputation framework of Rubin (1977; 1996), in which samples of the missing data are used to provide several realisations of complete datasets, to which complete-data methods can be applied to get combined mean and variability estimates. The framework of Little & Rubin (2002) is instructive in how to handle MNAR problems, and a recent review of MNAR methods can be found in (Tang & Ju, 2018). Low-rank models were used for estimation and imputation in MNAR settings by Sportisse et al. (2020a). Two approaches were taken to fitting the models: 1) maximising the joint distribution of data and missing mask using an EM algorithm, and 2) implicitly modelling the joint distribution by concatenating the data matrix and the missing mask and working with this new matrix, which implies a latent representation giving rise to both the data and the mask. An overview of estimation methods for PCA and PPCA with missing data was given by Ilin & Raiko (2010), while PPCA in the presence of an MNAR missing mechanism has been addressed by Sportisse et al. (2020b).
There has been some focus on MNAR issues in the form of selection bias within the recommender system community (Marlin et al., 2007; Marlin & Zemel, 2009; Steck, 2013; Hernández-Lobato et al., 2014; Schnabel et al., 2016; Wang et al., 2019), where the methods applied range from joint modelling of the data and the missing model using multinomial mixtures and matrix factorization to debiasing existing methods using propensity-based techniques from causality. Deep latent variable models are intuitively appealing in a missing data context: the generative part of the model can be used to sample the missing part of an observation. This was already utilized by Rezende et al. (2014) to do imputation and denoising by sampling from a Markov chain whose stationary distribution is approximately the conditional distribution of the missing data given the observed data. This procedure has been enhanced by Mattei & Frellsen (2018a) using Metropolis-within-Gibbs. In both cases the experiments assumed MAR, and a fitted model, based on complete data, was already available. Approaches to fitting DLVMs in the presence of missing data have recently been suggested, such as the HI-VAE by Nazabal et al. (2020), using an extension of the variational autoencoder (VAE) lower bound; the p-VAE by Ma et al. (2018; 2019), using the VAE lower bound and a permutation-invariant encoder; the MIWAE by Mattei & Frellsen (2019), extending the IWAE lower bound (Burda et al., 2016); and GAIN (Yoon et al., 2018), using GANs for missing data imputation. All of these approaches assume that the missing process is MAR or MCAR. In (Gong et al., 2020), the data and the missing mask are modelled together, as both being generated by a mapping from the same latent space, thereby tying the data model and the missing process together. This gives more flexibility in terms of missing process assumptions, akin to the matrix factorization approach of Sportisse et al. (2020a). In concurrent work, Collier et al. (2020) also model the missing mask jointly with the data in a deep generative model.

3. INFERENCE IN DLVMS AFFECTED BY MNAR

In an MNAR setting, the parameters of the data generating process and of the missing data mechanism need to be optimized jointly using all observed quantities. The relevant quantity to maximize is therefore the log-(joint) likelihood

ℓ(θ, φ) = Σ_{i=1}^n log p_{θ,φ}(x_i^o, s_i),

where the general contribution of a data point can be rewritten as

log p_{θ,φ}(x^o, s) = log ∫∫ p_φ(s|x^o, x^m) p_θ(x^o|z) p_θ(x^m|z) p(z) dz dx^m,

using the assumption that the observation model is fully factorized, p_θ(x|z) = ∏_j p_θ(x_j|z), which implies p_θ(x|z) = p_θ(x^o|z) p_θ(x^m|z). The integrals over missing and latent variables make direct maximum likelihood intractable. However, the approach of Burda et al. (2016), using an inference network and importance sampling to derive a more tractable lower bound of ℓ(θ, φ), can be used here as well. The key idea is to posit a conditional distribution q_γ(z|x^o), called the variational distribution, that will play the role of a learnable proposal in an importance sampling scheme. As in VAEs (Kingma & Welling, 2013; Rezende et al., 2014) and IWAEs (Burda et al., 2016), the distribution q_γ(z|x^o) comes from a simple family (e.g. the Gaussian or Student's t family) and its parameters are given by the output of a neural network (called inference network or encoder) that takes x^o as input. The issue is that a neural network cannot readily deal with variable-length inputs (which x^o is). This was tackled by several works: Nazabal et al. (2020) and Mattei & Frellsen (2019) advocated simply zero-imputing x^o to get inputs of constant length, and Ma et al. (2018; 2019) used a permutation-invariant network able to deal with inputs of variable length. Introducing the variational distribution, the contribution of a single observation is equal to

log p_{θ,φ}(x^o, s) = log ∫∫ [p_φ(s|x^o, x^m) p_θ(x^o|z) p(z) / q_γ(z|x^o)] q_γ(z|x^o) p_θ(x^m|z) dx^m dz (5)
                    = log E_{z∼q_γ(z|x^o), x^m∼p_θ(x^m|z)} [p_φ(s|x^o, x^m) p_θ(x^o|z) p(z) / q_γ(z|x^o)].
The main idea of importance-weighted variational inference and of the IWAE is to replace the expectation inside the logarithm by a Monte Carlo estimate of it (Burda et al., 2016). This leads to the objective function

L_K(θ, φ, γ) = Σ_{i=1}^n E[log (1/K) Σ_{k=1}^K w_ki], (7)

where, for all k ≤ K and i ≤ n,

w_ki = p_φ(s_i|x_i^o, x_ki^m) p_θ(x_i^o|z_ki) p(z_ki) / q_γ(z_ki|x_i^o),

and (z_1i, x_1i^m), . . . , (z_Ki, x_Ki^m) are K i.i.d. samples from q_γ(z|x_i^o) p_θ(x^m|z), over which the expectation in equation (7) is taken. The unbiasedness of the Monte Carlo estimates ensures (via Jensen's inequality) that the objective is indeed a lower bound of the likelihood. Actually, under the moment conditions of Domke & Sheldon (2018, Theorem 3), which we detail in Appendix D, it is possible to show that the sequence (L_K(θ, φ, γ))_{K≥1} converges monotonically (Burda et al., 2016, Theorem 1) to the likelihood:

L_1(θ, φ, γ) ≤ . . . ≤ L_K(θ, φ, γ) → ℓ(θ, φ) as K → ∞.

Properties of the not-MIWAE objective. The bound L_K(θ, φ, γ) has essentially the same properties as the (M)IWAE bounds (see Mattei & Frellsen, 2019, Section 2.4, for more details). The key difference is that we are integrating over both the latent space and part of the data space. This means that, to obtain unbiased estimates of gradients of the bound, we need to backpropagate through samples from q_γ(z|x_i^o) p_θ(x^m|z). A simple way to do this is to use the reparameterization trick both for q_γ(z|x_i^o) and p_θ(x^m|z); this is the approach we chose in our experiments. The main limitation is that p_θ(x|z) has to belong to a reparameterizable family, like Gaussians or Student's t distributions (see Figurnov et al., 2018, for a list of available distributions). If the distribution is not readily reparameterizable (e.g. if the data are discrete), several other options are available, see e.g. the review of Mohamed et al.
(2020), and, in the discrete case, the continuous relaxations of Jang et al. (2017) and Maddison et al. (2017).

Imputation. When the model has been trained, it can be used to impute missing values. If our performance metric is a loss function L(x^m, x̂^m), optimal imputations x̂^m minimise E_{x^m}[L(x^m, x̂^m)|x^o, s]. When L is the squared error, the optimal imputation is the conditional mean, which can be estimated via self-normalised importance sampling (Mattei & Frellsen, 2019); see appendix B for more details.
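As a rough illustration, the single-datapoint not-MIWAE bound can be computed for a toy two-feature linear Gaussian model with a self-masking missing model; all parameter values below are stand-ins, not fitted quantities:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_normal(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

rng = np.random.default_rng(0)

# One toy observation with two features: x_0 observed, x_1 missing.
x_obs = 0.3
K = 50                                    # number of importance samples

# Hypothetical parameters of a linear ("PPCA-like") decoder, a Gaussian
# encoder, and a self-masking logistic missing model.
w, sig2 = np.array([1.0, 1.0]), 0.1       # decoder: x_j | z ~ N(w_j z, sig2)
q_mu, q_var = 0.3, 0.2                    # encoder: q(z | x^o)
a, b = -4.0, 0.0                          # missing model: Pr(s_j=1|x) = sigmoid(a x_j + b)

# Reparameterised sampling in BOTH latent space and data space.
z = q_mu + np.sqrt(q_var) * rng.normal(size=K)           # z ~ q(z | x^o)
x_mis = w[1] * z + np.sqrt(sig2) * rng.normal(size=K)    # x^m ~ p(x^m | z)

# log w_k = log p(s|x^o,x^m) + log p(x^o|z) + log p(z) - log q(z|x^o),
# with s = (1, 0): feature 0 observed, feature 1 missing.
log_p_s = (np.log(sigmoid(a * x_obs + b))
           + np.log(1.0 - sigmoid(a * x_mis + b)))
log_w = (log_p_s
         + log_normal(x_obs, w[0] * z, sig2)
         + log_normal(z, 0.0, 1.0)
         - log_normal(z, q_mu, q_var))

# Single-datapoint bound: log (1/K) sum_k w_k, computed stably.
L_K = np.logaddexp.reduce(log_w) - np.log(K)
print(L_K)
```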

3.1. USING PRIOR INFORMATION VIA THE MISSING DATA MODEL

The missing data mechanism can be known/decided upon in advance (so that the full relationship p_φ(s|x) is fixed and no parameters need to be learned), or the type of missing mechanism can be known (but its parameters need to be learnt), or it can be unknown both in terms of parameters and model. The more we know about the nature of the missing mechanism, the more information we can put into designing the missing model. This in turn helps inform the data model how its parameters should be modified so as to accommodate the missing model. This is in line with the findings of Molenberghs et al. (2008), who showed that, for MNAR modelling to work, one has to leverage prior knowledge about the missing process. A crucial issue is under what model assumptions the full data distribution can be recovered from an incomplete sample. Indeed, some general missing models may lead to inconsistent statistical estimation (see e.g. Mohan & Pearl, 2021; Nabi et al., 2020). The missing model is essentially solving a classification problem: based on the observed data and the output from the data model filling in the missing data, it needs to improve its "accuracy" in predicting the mask. A Bernoulli distribution is used for the probability of the mask given both observed and missing data,

p_φ(s|x^o, x^m) = p_φ(s|x) = Bern(s|π_φ(x)) = ∏_{j=1}^p π_{φ,j}(x)^{s_j} (1 - π_{φ,j}(x))^{1-s_j}.

Here π_{φ,j}(x) is the estimated probability that the j'th feature of that particular observation is observed. The mapping π_{φ,j}(x) from the data to the probability of being observed for the j'th feature can be as general or specific as needed. A simple example is that of self-masking or self-censoring, where the probability of the j'th feature being observed depends only on the feature value x_j. Here the mapping can be a sigmoid of a linear mapping of the feature value, π_{φ,j}(x) = σ(a x_j + b). The missing model can also be based on a group theoretic approach, see appendix C.
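A minimal sketch of the factorized Bernoulli missing model with self-masking logistic mappings (toy parameter values, assumed purely for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Self-masking missing model: the probability that feature j is observed
# depends only on x_j via a per-feature logistic regression.
def log_p_mask(s, x, a, b):
    pi = sigmoid(a * x + b)               # pi_{phi,j}(x), one entry per feature
    return np.sum(s * np.log(pi) + (1 - s) * np.log(1 - pi))

x = np.array([0.5, -1.2, 2.0])            # a (completed) data point
s = np.array([1, 1, 0])                   # third feature is missing
a = np.array([-2.0, -2.0, -2.0])          # negative weights: high values censored
b = np.array([1.0, 1.0, 1.0])
lp = log_p_mask(s, x, a, b)               # log p_phi(s | x)
print(lp)
```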

4. EXPERIMENTS

In this section we apply the not-MIWAE to problems where values are MNAR: censoring in multivariate datasets, clipping in images, and selection bias in recommender systems. Implementation details and a link to source code can be found in appendix A.

4.1. EVALUATION METRICS

Model performance can be assessed using different metrics. A first metric is how well the marginal distribution of the data has been inferred. This can be assessed if we happen to have a fully observed test-set available: the test log-likelihood of this fully observed test-set measures how close p_θ(x) and the true distribution of x are. In the case of a DLVM, it can be estimated using importance sampling with the variational distribution as proposal (Rezende et al., 2014). Since the encoder is tuned to observations with missing data, it should be retrained (while keeping the decoder fixed) as suggested by Mattei & Frellsen (2018b). Another metric of interest is the imputation error. In experimental settings where the missing mechanism is under our control, we have access to the actual values of the missing data, and the imputation error can be computed directly as an error measure between these and the reconstructions from the model. In real-world datasets affected by MNAR processes, we cannot use the usual approach of doing a train-test split of the observed data: as the test-set is biased by the same missing mechanism as the training-set, it is not representative of the full population. Here we need a MAR data sample to evaluate model performance (Marlin et al., 2007).
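The imputation-error metric is computed only over the entries that were actually missing; a sketch with toy numbers:

```python
import numpy as np

# Imputation RMSE over missing entries only. Assumes (as in a controlled
# experiment) that the true values of the missing entries are known.
def imputation_rmse(X_true, X_imp, S):
    mis = (S == 0)                        # boolean mask of missing entries
    return np.sqrt(np.mean((X_true[mis] - X_imp[mis]) ** 2))

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imp  = np.array([[1.0, 2.5], [3.0, 3.0]])   # only missing entries matter
S      = np.array([[1, 0], [1, 0]])
print(imputation_rmse(X_true, X_imp, S))      # sqrt((0.25 + 1.0) / 2)
```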

4.2. SINGLE IMPUTATION IN UCI DATA SETS AFFECTED BY MNAR

We compare different imputation techniques on datasets from the UCI database (Dua & Graff, 2017), where in an MCAR setting the MIWAE has shown state-of-the-art performance (Mattei & Frellsen, 2019). An MNAR missing process is introduced by self-masking in half of the features: when the feature value is higher than the feature mean, it is set to missing. The MIWAE and not-MIWAE, as well as their linear PPCA-like versions, are fitted to the data with missing values. For the not-MIWAE, three different approaches to the missing model are used: 1) agnostic, where the data model output is mapped to logits for the missing process via a single dense linear layer; 2) self-masking, where logistic regression is used for each feature; and 3) self-masking known, where the signs of the weights in the logistic regression are known. We also compare to the low-rank approximation of the concatenation of data and mask by Sportisse et al. (2020a). A flexible data model is often led astray by an incorrectly learned missing model. This speaks to the trade-off between data model flexibility and missing model flexibility. The not-MIWAE PPCA has a huge inductive bias in the data model, and so we can employ a more flexible missing model and still get good results. For the not-MIWAE, having both a flexible data model and a flexible missing model can be detrimental to performance. One way to assess the learnt missing processes is the mask classification accuracy on fully observed data. These accuracies are reported in table A1 and show that the accuracy increases as more information is put into the missing model.

4.3. CLIPPING IN SVHN IMAGES

We emulate the clipping phenomenon in images on the street view house numbers dataset (SVHN, Netzer et al., 2011). Here we introduce a self-masking missing mechanism that is identical for all pixels: the mask is Bernoulli sampled with probability

Pr(s_ij = 1 | x_ij) = 1 / (1 + e^{-W(x_ij - b)}),

where W = -50 and b = 0.75. This mimics a clipping process where 0.75 is the clipping point (the data are converted to gray scale in the [0, 1] range). For this experiment we use the true missing process as the missing model in the not-MIWAE. Table 2 shows model performance in terms of imputation RMSE and test-set log likelihood as estimated with 10k importance samples. The not-MIWAE outperforms the MIWAE both in terms of test-set log likelihood and imputation RMSE. This is further illustrated in the imputations shown in figure 3. Since the MIWAE only fits the observed data, the range of pixel values in the imputations is limited compared to the true range. The not-MIWAE is forced to push some of the data distribution towards higher pixel values in order to get a higher likelihood in the logistic regression of the missing model. In figures 2a-2c, histograms over the imputation values are shown together with the true pixel values of the missing data. Here we see that the not-MIWAE puts a considerable amount of probability mass above the clipping value.
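The clipping mechanism can be sketched directly from the stated parameters W = -50 and b = 0.75:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Clipping-style self-masking: identical for all pixels, with the
# parameters used in the experiment (gray-scale pixels in [0, 1]).
W, b = -50.0, 0.75

def p_observed(x):
    return sigmoid(W * (x - b))           # Pr(s_ij = 1 | x_ij)

# Pixels well below the clipping point are almost surely observed,
# pixels above it almost surely missing.
print(p_observed(0.5), p_observed(0.9))
```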

4.4. SELECTION BIAS IN THE YAHOO! R3 DATASET

The Yahoo! R3 dataset (webscope.sandbox.yahoo.com) contains ratings on a scale from 1-5 of songs in the database of the Yahoo! LaunchCast internet radio service and was first presented in (Marlin et al., 2007). It consists of two datasets with the same 1,000 songs selected randomly from the LaunchCast database. The first dataset is considered an MNAR training set and contains self-selected ratings from 15,400 users. In the second dataset, considered an MCAR test-set, 5,400 of these users were asked to rate exactly 10 randomly selected songs. This gives a unique opportunity to train a model on a real-world MNAR-affected dataset while being able to get an unbiased estimate of the imputation error, due to the availability of MCAR ratings. The plausibility that the set of self-selected ratings was subject to an MNAR missing process was explored and substantiated by Marlin et al. (2007). The marginal distributions of samples from the self-selected dataset and the randomly selected dataset can be seen in figures 4a and 4b. We train the MIWAE and the not-MIWAE on the MNAR ratings and evaluate the imputation error on the MCAR ratings. Both a Gaussian and a categorical observation model are explored. In order to get reparameterized samples in the data space for the categorical observation model, we use the Gumbel-Softmax trick (Jang et al., 2017) with a temperature of 0.5. The missing model is a logistic regression for each item/feature, with a shared weight across features and individual biases. A description of competitors can be found in appendix A.3 and follows the setup in (Wang et al., 2019). The results are grouped in table 3, from top to bottom, according to models not including the missing process (MAR approaches), models using propensity scoring techniques to debias training losses, and finally models learning a data model and a missing model jointly, without the use of propensity estimates.
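A minimal sketch of the Gumbel-Softmax relaxation used for the categorical observation model (toy logits; the actual model learns its logits and runs in TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gumbel-Softmax relaxation (Jang et al., 2017): a differentiable,
# reparameterised approximation to sampling from a categorical
# distribution, here over the 5 rating levels.
def gumbel_softmax(logits, temperature, rng):
    g = rng.gumbel(size=logits.shape)     # i.i.d. Gumbel(0, 1) noise
    y = (logits + g) / temperature
    y = np.exp(y - y.max())               # numerically stable softmax
    return y / y.sum()

logits = np.array([0.1, 0.5, 1.5, 0.8, 0.2])   # toy logits over ratings 1-5
sample = gumbel_softmax(logits, temperature=0.5, rng=rng)
print(sample)    # a point on the simplex; low temperatures push it toward one-hot
```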
The not-MIWAE shows state-of-the-art performance, also compared to models based on propensity scores. The propensity-based techniques need access to a small sample of MCAR data, i.e. a part of the test-set, to estimate the propensities using Naive Bayes, though the propensities can also be estimated using logistic regression if covariates are available (Schnabel et al., 2016) or using a nuclear-norm-constrained matrix factorization of the missing mask itself (Ma & Chen, 2019). We stress that the not-MIWAE does not need access to similar unbiased data in order to learn the missing model. However, the missing model in the not-MIWAE can take available information into account; e.g. we could fit a continuous mapping to the propensities and use this as the missing model, if propensities were available. Histograms over imputations for the missing data in the MCAR test-set can be seen for the MIWAE and not-MIWAE in figures 4c and 4d. The marginal distribution of the not-MIWAE imputations matches that of the MCAR test-set better than the marginal distribution of the MIWAE imputations does.

5. CONCLUSION

The proposed not-MIWAE is versatile both in terms of defining missing mechanisms and in terms of application area. There is a trade-off between data model complexity and missing model complexity: with a parsimonious data model a very general missing process can be used, while with a flexible data model the missing model needs to be more informative. Specifically, any knowledge about the missing process should be incorporated in the missing model to improve model performance. Doing so using recent advances in equivariant/invariant neural networks is an interesting avenue for future research (see appendix C). Recent developments on the subject of recoverability/identifiability of MNAR models (Sadinle & Reiter, 2018; Mohan & Pearl, 2021; Nabi et al., 2020; Sportisse et al., 2020b) could also be leveraged to design provably identifiable not-MIWAE models. Several extensions of the graphical models used here could be explored. For example, one could relax the conditional independence assumptions, in particular that of the mask given the data. This could, for example, be done by using an additional latent variable pointing directly to the mask. Combined with a discriminative classifier, the not-MIWAE could also be used in supervised learning with input values missing not at random, following the techniques of Ipsen et al. (2020).
                         Banknote      Concrete      Red           White         Yeast         Breast
not-MIWAE - PPCA
  agnostic               0.80 ± 0.03   0.75 ± 0.05   0.88 ± 0.01   0.83 ± 0.00   0.78 ± 0.02   0.96 ± 0.00
  self-masking           0.92 ± 0.05   0.95 ± 0.00   0.96 ± 0.00   0.97 ± 0.00   0.99 ± 0.00   0.98 ± 0.00
  self-masking known     0.98 ± 0.00   0.95 ± 0.00   0.96 ± 0.00   0.97 ± 0.00   1.00 ± 0.00   0.97 ± 0.00
not-MIWAE
  agnostic               0.92 ± 0.01   0.54 ± 0.04   0.91 ± 0.00   0.88 ± 0.00   0.80 ± 0.00   0.93 ± 0.00
  self-masking           0.99 ± 0.00   0.93 ± 0.02   0.95 ± 0.01   0.90 ± 0.02   0.71 ± 0.02   0.98 ± 0.00
  self-masking known     0.99 ± 0.00   0.97 ± 0.00   0.97 ± 0.00   0.95 ± 0.00   0.78 ± 0.00   0.98 ± 0.00

Table A1: Mask prediction accuracies on UCI datasets using fully observed data.

A IMPLEMENTATION DETAILS

In all experiments we used TensorFlow Probability (Dillon et al., 2017) and the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001. Gaussian distributions were used both as the variational distribution in latent space and as the observation model in data space. No regularization was used. Similar settings were used for the MIWAE and the not-MIWAE, except for the missing model, which is exclusive to the not-MIWAE. Source code is available at: https://github.com/nbip/notMIWAE

A.1 UCI

The encoder and decoder consist of two hidden layers with 128 units and tanh activation functions. In the PPCA-like models, the decoder is a linear mapping from latent space to data space, with a learnt variance shared across features. The size of the latent space is set to p - 1, K = 20 importance samples were used during training, and a batch size of 16 was used for 100k iterations. Data are standardized before missingness is introduced. The imputation RMSE is estimated using 10k importance samples, and means and standard errors are computed over 5 runs. Since the imputation error in a real-world setting cannot be monitored during training, neither on a training set nor on a validation set, early stopping cannot be based on it. Both the MIWAE and not-MIWAE are trained for a fixed number of iterations. In the low-rank joint model of Sportisse et al. (2020a), model selection needs to be done for the penalization parameter λ. In order to do this, we add 5% missing values (MCAR) to the concatenated matrix of data and mask and use the imputation error on this added missing data to select the optimal λ. The model is then trained on the original data using the optimal λ to get the imputation error. For evaluating the learnt missing model, we report mask classification accuracies when feeding fully observed data as input to the missing model, see table A1. As the missing model contains more prior information, the classification accuracy improves.

A.2 SVHN

For the encoder and decoder a convolutional structure was used (see tables A2 and A3) together with ReLU activations and a latent space of dimension 20. K = 5 importance samples were used during training and a batch size of 64 was used for 1M iterations. The variance in the observation model was lower bounded at ∼ 0.02.

A.3 YAHOO!

The MIWAE and the not-MIWAE were trained on the MNAR ratings, and the imputation error was evaluated on the MCAR ratings (when encoding the MNAR ratings). We used the permutation invariant encoder by Ma et al. (2018) with an embedding size of 20 and a code size of 50, along with a linear mapping to a latent space of size 30. In the Gaussian observation model, the decoder is a linear mapping and there is a sigmoid activation of the mean in data space, scaled to match the scale of the ratings. The categorical observation model also has a linear mapping to its logits. In both latent space and data space, we learn shared variance parameters in each dimension. The missing model is a logistic regression for each feature, with a shared weight across features and individual biases for each feature. We use K = 20 importance samples during training, ReLU activations, a batch size of 100, and train for 10k iterations. We follow the setup of Wang et al. (2019) and compare to the following approaches: CPT-v: Marlin et al. (2007) show that a multinomial mixture model with a Conditional Probability Table missing model gives better performance than the multinomial mixture model without a missing model. The approach is further expanded by Marlin & Zemel (2009), where a logistic model, Logit-vd, is also tried as the missing model. Doubly robust: Wang et al. (2019) combine the propensity scoring approach of Schnabel et al. (2016) with an error-imputation approach by Steck (2013) to obtain a doubly robust estimator. This is used both with matrix factorization and in neural factorization machines (He & Chua, 2017). As for Schnabel et al. (2016), 5% of the MCAR test-set is used to learn the propensities. Results are from the paper. In addition to these debiasing approaches, we compare to the following methods, which do not take the missing process into account: MF (Koren et al., 2009), PMF (Mnih & Salakhutdinov, 2008), AutoRec (Sedhain et al., 2015) and Gaussian VAE (Liang et al., 2018). The presented results for these methods are from (Wang et al., 2019).

B IMPUTATION

Once the model has been trained, it is possible to use it to impute the missing values. If our performance metric is a loss function $L(x^m, \hat{x}^m)$, optimal imputations $\hat{x}^m$ minimise $E_{x^m}[L(x^m, \hat{x}^m) \mid x^o, s]$. Many loss functions can be minimised using moments of the conditional distribution of the missing values given the observed ones. Similarly to Mattei & Frellsen (2019, equations 10-11), these moments can be estimated via self-normalised importance sampling. For any function $h(x^m)$ of the missing data,

$$E[h(x^m) \mid x^o, s] = \int h(x^m)\, p(x^m \mid x^o, s)\, dx^m.$$

Using Bayes's theorem, we get

$$E[h(x^m) \mid x^o, s] = \int h(x^m)\, \frac{p(s \mid x^o, x^m)\, p(x^m, x^o)}{p(s, x^o)}\, dx^m,$$

and we can now introduce the latent variable:

$$E[h(x^m) \mid x^o, s] = \iint h(x^m)\, \frac{p(s \mid x^o, x^m)\, p(x^m \mid z)\, p(x^o \mid z)\, p(z)}{p(s, x^o)}\, dz\, dx^m.$$

Using self-normalised importance sampling on this last integral with proposal $q_\gamma(z \mid x^o)\, p_\theta(x^m \mid z)$ leads to the estimate

$$E[h(x^m) \mid x^o, s] \approx \sum_{k=1}^K \alpha_k\, h(x^m_k), \quad \text{with} \quad \alpha_k = \frac{w_k}{w_1 + \ldots + w_K}, \tag{15}$$

where the weights $w_1, \ldots, w_K$ are incidentally identical to the ones used for training,

$$w_k = \frac{p_\phi(s \mid x^o, x^m_k)\, p_\theta(x^o \mid z_k)\, p(z_k)}{q_\gamma(z_k \mid x^o)} \quad \text{for all } k \le K,$$

and $(z_1, x^m_1), \ldots, (z_K, x^m_K)$ are $K$ i.i.d. samples from $q_\gamma(z \mid x^o)\, p_\theta(x^m \mid z)$. If the quantity $E[h(x^m) \mid z]$ is easy to compute, then a Rao-Blackwellised version of equation (15) should be preferred:

$$E[h(x^m) \mid x^o, s] \approx \sum_{k=1}^K \alpha_k\, E[h(x^m) \mid z_k].$$

Squared loss. When $L$ is the squared error, the optimal imputation is the conditional mean, which can be estimated using the method above (in that case, $h$ is the identity function):

$$\hat{x}^m = E[x^m \mid x^o, s] \approx \sum_{k=1}^K \alpha_k\, E[x^m \mid z_k], \quad \text{with} \quad \alpha_k = \frac{w_k}{w_1 + \ldots + w_K}.$$

Absolute loss. When $L$ is the absolute error, the optimal imputation is the conditional median, which can be estimated using the same technique and at little additional cost compared to the mean.
Indeed, we can estimate the cumulative distribution function of each missing feature $j \in \{1, \ldots, p\}$:

$$\hat{F}_j(x_j) = E[\mathbf{1}_{x^m_j \le x_j} \mid x^o, s] \approx \sum_{k=1}^K \alpha_k\, F_{x_j \mid z_k}(x_j), \tag{19}$$

where $F_{x_j \mid z_k}$ is the cumulative distribution function of $x_j \mid z_k$, which will often be available in closed form (e.g. in the case of a Gaussian, Bernoulli or Student's $t$ observation model). We can then use this estimate to approximately solve $\hat{F}_j(x_j) = 1/2$. More generally, if $L$ is a multilinear loss, optimal imputations will be quantiles (see e.g. Robert, 2007, section 2.5.2), which can be estimated using equation (19). The consistency of similar quantile estimates was studied by Glynn (1996). Multiple imputation. It is also possible to perform multiple imputation with the same computations: one can obtain approximate samples from $p(x^m \mid x^o, s)$ using sampling importance resampling with the same set of weights. This allows us to do both single and multiple imputation with the same computations.
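The estimates above lend themselves to a direct implementation. Below is a minimal numpy sketch of single imputation under the squared loss (the SNIS mean) and the absolute loss (the mixture-CDF median, found by bisection, assuming a Gaussian observation model); all names and the bisection bracket are our own illustrative choices:

```python
import numpy as np
from math import erf, sqrt

def snis_weights(log_p_s, log_p_xo, log_p_z, log_q_z):
    """Normalised SNIS weights alpha_k from the K per-sample log-densities:
    log w_k = log p(s|x^o,x^m_k) + log p(x^o|z_k) + log p(z_k) - log q(z_k|x^o)."""
    log_w = log_p_s + log_p_xo + log_p_z - log_q_z
    log_w = log_w - log_w.max()          # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def impute_mean(alpha, x_mis_samples):
    """Squared loss: conditional mean, a weighted average of the K sampled
    missing-value vectors (shape (K, n_missing))."""
    return alpha @ x_mis_samples

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def impute_median(alpha, mus, sigmas, lo=-100.0, hi=100.0, tol=1e-9):
    """Absolute loss: conditional median of one missing feature, found by
    bisection on the monotone mixture CDF F(x) = sum_k alpha_k Phi_k(x)."""
    def F(x):
        return sum(a * normal_cdf(x, m, s) for a, m, s in zip(alpha, mus, sigmas))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) < 0.5 else (lo, mid)
    return 0.5 * (lo + hi)
```

Working with log-weights and subtracting the maximum before exponentiating avoids overflow when the K importance weights span many orders of magnitude.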

C MISSING MODEL, GROUP-THEORETIC APPROACH

A more complex form of prior information that can be used to choose the form of $\pi_\phi(x)$ is group-theoretic. For example, we may know a priori that $p_\phi(s \mid x)$ is invariant to a certain group action $g \cdot x$ on the data space: $\forall g,\; p_\phi(s \mid x) = p_\phi(s \mid g \cdot x)$. This would for example be the case if the data sets were made of images whose class is invariant to translations (which is the case of most image data sets, like MNIST or SVHN), with a missing model that depends only on the class. Similarly, one may know that the missing process is equivariant: $\forall g,\; p_\phi(g \cdot s \mid x) = p_\phi(s \mid g^{-1} \cdot x)$. Again, such a setting can appear when there is strong geometric structure in the data (e.g. with images or proteins). Invariance or equivariance can be built into the architecture of $\pi_\phi(x)$ by leveraging the quite large body of work on invariant/equivariant convolutional neural networks, see e.g. Bietti & Mairal (2017); Cohen et al. (2019); Zaheer et al. (2017); Wiqvist et al. (2019); Bloem-Reddy & Teh (2020), and references therein.
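As a toy illustration of building invariance in, consider a missing model that sees the image only through a translation-invariant summary statistic (here simply the global mean; a realistic architecture would use an invariant CNN, and the parameter names are ours):

```python
import numpy as np

def invariant_missing_prob(img, w=1.0, b=0.0):
    """Missingness probability that depends on the image only through its
    global average, which is unchanged by circular translations (np.roll),
    so the model is translation-invariant by construction."""
    pooled = img.mean()                  # invariant summary statistic
    return 1.0 / (1.0 + np.exp(-(w * pooled + b)))
```

By construction, `invariant_missing_prob(np.roll(img, 3, axis=1))` agrees with `invariant_missing_prob(img)` for any shift.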

D THEORETICAL PROPERTIES OF THE NOT-MIWAE BOUND

The properties of the not-MIWAE bound are directly inherited from those of the usual IWAE bound. Indeed, as we will see, the not-MIWAE bound is a particular instance of the IWAE bound with an extended latent space composed of both the code and the missing values. More specifically, recall the definition of the not-MIWAE bound

$$\mathcal{L}_K(\theta, \phi, \gamma) = \sum_{i=1}^n E\left[\log \frac{1}{K} \sum_{k=1}^K w_{ki}\right], \quad \text{with} \quad w_{ki} = \frac{p_\theta(x^o_i \mid z_{ki})\, p_\phi(s_i \mid x^o_i, x^m_{ki})\, p(z_{ki})}{q_\gamma(z_{ki} \mid x^o_i)}.$$

Each $i$th term of the sum can be seen as an IWAE bound with extended latent variable $(z_{ki}, x^m_{ki})$, whose prior is $p_\theta(x^m_{ki} \mid z_{ki})\, p(z_{ki})$. The related importance sampling proposal of the $i$th term is $p_\theta(x^m_{ki} \mid z_{ki})\, q_\gamma(z_{ki} \mid x^o_i)$, and the observation model is $p_\phi(s_i \mid x^o_i, x^m_{ki})\, p_\theta(x^o_i \mid z_{ki})$. Since all $n$ terms of the sum are IWAE bounds, Theorem 1 of Burda et al. (2016) directly gives the monotonicity property

$$\mathcal{L}_1(\theta, \phi, \gamma) \le \ldots \le \mathcal{L}_K(\theta, \phi, \gamma).$$

Regarding convergence of the bound to the true likelihood, we can use Theorem 3 of Domke & Sheldon (2018) for each term of the sum to get the following result.

Theorem. Assume that, for all $i \in \{1, \ldots, n\}$,
• there exists $\alpha_i > 0$ such that $E\left[|w_{1i} - p_{\theta,\phi}(x^o_i, s_i)|^{2+\alpha_i}\right] < \infty$,
• $\limsup_{K \to \infty} E\left[K/(w_{1i} + \ldots + w_{Ki})\right] < \infty$.
Then the not-MIWAE bound converges to the true likelihood at rate $1/K$:

$$\ell(\theta, \phi) - \mathcal{L}_K(\theta, \phi, \gamma) \underset{K \to \infty}{\sim} \frac{1}{K} \sum_{i=1}^n \frac{\mathrm{Var}[w_{1i}]}{2\, p_{\theta,\phi}(x^o_i, s_i)^2}.$$

E VARYING MISSING RATE (UCI)

The UCI experiments use a self-masking missing process in half of the features: when the feature value is higher than the feature mean, it is set to missing. In order to investigate varying missing rates, we change the cutoff point from the mean to the mean plus an offset. The offsets used are {0, 0.25, 0.5, 0.75, 1.0} standard deviations, so that the largest cutoff point is the mean plus one standard deviation. Increasing the cutoff point further results in mainly imputing outliers.
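The self-masking mechanism with an offset can be sketched as follows; which half of the features is masked is our assumption here (the first ⌊p/2⌋), as the text does not specify it:

```python
import numpy as np

def self_mask(X, offset=0.0):
    """MNAR self-masking: in the first half of the features, a value is
    missing (s = 0) when it exceeds that feature's mean plus `offset`
    standard deviations; all other entries are observed (s = 1)."""
    n, p = X.shape
    cutoff = X.mean(axis=0) + offset * X.std(axis=0)
    s = np.ones((n, p), dtype=int)
    half = p // 2
    s[:, :half] = (X[:, :half] <= cutoff[:half]).astype(int)
    return s
```

Sweeping `offset` over {0, 0.25, 0.5, 0.75, 1.0} reproduces the varying missing rates; larger offsets mask fewer (and more extreme) values.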
Results for PPCA and not-MIWAE PPCA using the agnostic missing model are shown in figure 5, and using the self-masking model with known sign of the weights in figure 6. Figure 7 shows the results for MIWAE and not-MIWAE using self-masking with known sign of the weights.

Figure 5: PPCA agnostic: Imputation RMSE at varying missing rates on UCI datasets. The variation in missing rate is obtained by changing the cutoff point using an offset, so that an offset of 0 corresponds to using the mean as the cutoff point while an offset of 1 corresponds to using the mean plus one standard deviation as the cutoff point. Results are averages over 2 runs.

Figure 7: Self-masking known: Imputation RMSE at varying missing rates on UCI datasets. The variation in missing rate is obtained by changing the cutoff point using an offset, so that an offset of 0 corresponds to using the mean as the cutoff point while an offset of 1 corresponds to using the mean plus one standard deviation as the cutoff point. Results are averages over 2 runs.



We used the original code from the authors, found at https://github.com/AudeSportisse/stat



Figure 1: (a) Graphical model of the not-MIWAE. (b) Gaussian data with MNAR values. Dots are fully observed, partially observed data are displayed as black crosses. A contour of the true distribution is shown together with directions found by PPCA and not-MIWAE with a PPCA decoder.

Figure 2: SVHN: Histograms over imputed values for (a) the MIWAE and (b) the not-MIWAE, and (c) the pixel values of the missing data.

Figure 3: Rows from top: original images, images with missing values, not-MIWAE imputations, MIWAE imputations.

Figure 4: Histograms over rating values for the Yahoo! R3 dataset from (a) the MNAR training set and (b) the MCAR test set. (c) and (d) show histograms over imputations of missing values in the test set, when encoding the corresponding training set. The not-MIWAE imputations (d) are much more faithful to the shape of the test set (b) than the MIWAE imputations (c).

The results for the CPT-v and Logit-vd models are taken from the supplementary material of Hernández-Lobato et al. (2014). MF-MNAR: Hernández-Lobato et al. (2014) extended probabilistic matrix factorization to include a missing data model for data missing not at random in a collaborative filtering setting. Results are from the supplementary material of the paper. MF-IPS: Schnabel et al. (2016) applied propensity-based methods from causal inference, specifically inverse propensity scoring (IPS), to matrix factorization. The propensities used to debias the matrix factorization are the probabilities of a rating being observed for each (user, item) pair, and are estimated using 5% of the MCAR test set. Results are from the paper.


Figure 6: PPCA self-masking known: Imputation RMSE at varying missing rates on UCI datasets. The variation in missing rate is obtained by changing the cutoff point using an offset, so that an offset of 0 corresponds to using the mean as the cutoff point while an offset of 1 corresponds to using the mean plus one standard deviation as the cutoff point. Results are averages over 2 runs.

Imputation RMSE on UCI datasets affected by MNAR.

We compare to the low-rank model of Sportisse et al. (2020a), which implicitly models the data and mask jointly. Furthermore, we compare to mean imputation, missForest (Stekhoven & Bühlmann, 2012) and MICE (Buuren & Groothuis-Oudshoorn, 2010) using Bayesian ridge regression. Similar settings are used for the MIWAE and not-MIWAE, see appendix A. Results over 5 runs are seen in table 1; results for varying missing rates are in appendix E. The low-rank joint model is almost always better than PPCA, missForest, MICE and mean imputation.


Imputation MSEs for the Yahoo! MCAR test-set. Models are trained on the MNAR training set.



ACKNOWLEDGMENTS

The Danish Innovation Foundation supported this work through the Danish Center for Big Data Analytics driven Innovation (DABAI). JF acknowledges funding from the Independent Research Fund Denmark (grant number 9131-00082B) and the Novo Nordisk Foundation (grant numbers NNF20OC0062606 and NNF20OC0065611).

