TOMOGRAPHIC AUTO-ENCODER: UNSUPERVISED BAYESIAN RECOVERY OF CORRUPTED DATA

Abstract

We propose a new probabilistic method for unsupervised recovery of corrupted data. Given a large ensemble of degraded samples, our method recovers accurate posteriors over the clean values, allowing the exploration of the manifold of possible reconstructed data and hence characterising the underlying uncertainty. In this setting, direct application of classical variational methods often gives rise to collapsed densities that do not adequately explore the solution space. Instead, we derive a novel reduced-entropy-condition approximate inference method that results in rich posteriors. We test our model on a data recovery task under the common setting of missing values and noise, demonstrating superior performance to existing variational methods for imputation and de-noising on several real data sets. We further show higher classification accuracy after imputation, demonstrating the advantage of propagating uncertainty to downstream tasks with our model.

1. INTRODUCTION

Data sets are rarely clean and ready to use when first collected. More often than not, they need to undergo some form of pre-processing before analysis, involving expert human supervision and manual adjustments (Zhou et al., 2017; Chu et al., 2016). Filling missing entries, correcting noisy samples, filtering collection artefacts and other similar tasks are some of the most costly and time-consuming stages in the data modeling process and pose an enormous obstacle to machine learning at scale (Munson, 2012). Traditional data cleaning methods rely on some degree of supervision, in the form of either a clean dataset or knowledge collected from domain experts. However, the exponential increase in data collection and storage rates in recent years makes any supervised algorithm impractical for modern applications that consume millions or billions of datapoints.

In this paper, we introduce a novel variational framework to perform automated data cleaning and recovery without any example of clean data or prior signal assumptions. The Tomographic Auto-Encoder (TAE) is named in analogy with standard tomography. Tomographic techniques for signal recovery aim to reconstruct a target signal, such as a 3D image, by algorithmically combining different incomplete measurements, such as 2D images from different viewpoints, subsets of image pixels or other projections (Geyer et al., 2015). The TAE extends this concept to the reconstruction of data manifolds: our target signal is a clean data set, and corrupted data is interpreted as incomplete measurements. Our aim is to combine these measurements to reconstruct the clean data. More specifically, we are interested in performing Bayesian recovery, where we do not simply transform degraded samples into clean ones, but recover probabilistic functions with which we can generate diverse clean signals and capture uncertainty. Uncertainty is particularly important when cleaning data.
If we are over-confident about specific solutions, errors are easily ignored and passed on to downstream tasks. For instance, in the example of figure 1(a), some corrupted observations are consistent with multiple digits. If we were to impute a single possibility for each sample, the true underlying solution could be discarded early in the modeling pipeline and the digit consistently mis-classified. If we are instead able to recover accurate probability densities, we can remain adequately uncertain in any subsequent processing task. Several variational auto-encoder (VAE) models have been proposed for applications that can be considered special cases of this problem (Im et al., 2017; Nazabal et al., 2018; Ainsworth et al., 2018) and, in principle, they are capable of performing Bayesian reconstruction. However, we show that performing variational inference (VI) in a surrogate latent space with VAEs results in collapsed distributions that do not explore the different possibilities for clean samples, but only return single estimates. The TAE instead performs approximate VI in the space of recovered data, through our reduced-entropy-condition method. The resulting posteriors adequately explore the manifold of possible clean samples for each corrupted observation and therefore capture the uncertainty of the task. In our experiments we focus on data recovery from noisy samples and missing entries. This is one of the most common data corruption settings, encountered across a wide range of domains and data types (White et al., 2011; Kwak & Kim, 2017). By testing our approach in this prevalent scenario, we can closely compare with recently proposed VAE approaches (Nazabal et al., 2018; Dalca et al., 2019; Mattei & Frellsen, 2019). We show that the existing VAE models exhibit the posterior collapse problem, while the TAE produces rich posteriors that capture the underlying uncertainty.
We further test TAEs on classification subsequent to imputation, demonstrating superior performance to existing methods in these downstream tasks. Finally, we use a TAE to perform automated missing-value imputation on raw depth maps from the NYU rooms data set.

2. METHOD

In order to frame the problem and understand the issues with standard variational methods in this context, we view the task from a signal reconstruction perspective. The ultimate goal of a Bayesian data recovery method is to build and train a parametric probability density function (PDF) q(x|y), which takes as input a corrupted sample y and generates different possible corresponding clean data x ∼ q(x|y) through sampling. There are two aspects we need to design: i) the structure of this conditional PDF and ii) the way it will be trained to perform the recovery task. Regarding the former, since natural data often lies on highly non-linear manifolds, the conditional PDF must capture complex, multi-modal distributions, e.g. the distribution of plausible images consistent with one of the corrupted observations in figure 1(a). A suitable recovery PDF q(x|y) needs to be able to capture such complexity. A natural choice to achieve high capacity and tractability is to
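To make the two design aspects concrete, the following sketch shows one simple instantiation of a multi-modal recovery PDF: a diagonal-Gaussian mixture whose parameters are produced by a small network conditioned on the corrupted sample y. This is an illustrative stand-in, not the paper's architecture; the dimensions, the single tanh layer, and the randomly initialised (untrained) weights are all assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: corrupted observation y and clean signal x in R^D,
# K mixture components, hidden width H. Weights are untrained stand-ins.
D, K, H = 4, 3, 16
W1 = rng.normal(0.0, 0.5, (H, D))
b1 = np.zeros(H)
W_pi = rng.normal(0.0, 0.5, (K, H))      # head for mixture weights
W_mu = rng.normal(0.0, 0.5, (K * D, H))  # head for component means
W_ls = rng.normal(0.0, 0.5, (K * D, H))  # head for component log-stds

def q_params(y):
    """Map a corrupted sample y to the parameters of q(x | y),
    here a mixture of K diagonal Gaussians (one simple way to
    represent a multi-modal conditional PDF)."""
    h = np.tanh(W1 @ y + b1)
    logits = W_pi @ h
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                               # mixture weights
    mu = (W_mu @ h).reshape(K, D)                # component means
    sigma = np.exp((W_ls @ h).reshape(K, D).clip(-5.0, 2.0))
    return pi, mu, sigma

def sample_q(y, n):
    """Draw n candidate clean signals x ~ q(x | y)."""
    pi, mu, sigma = q_params(y)
    ks = rng.choice(K, size=n, p=pi)             # pick components
    return mu[ks] + sigma[ks] * rng.normal(size=(n, D))

y = rng.normal(size=D)    # a stand-in corrupted observation
xs = sample_q(y, 1000)    # diverse clean candidates for this single y
```

With a trained version of such a model, the spread of the samples `xs` is what lets downstream tasks remain uncertain about ambiguous observations, rather than committing to a single imputed value.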



Figure 1: (a) Example of Bayesian recovery from corrupted data with a Tomographic Auto-Encoder (TAE) on corrupted MNIST. The TAE recovers posterior probability densities q(x|y_i) for each corrupted sample y_i. We can draw from these to explore different possible clean solutions. (b) Two-dimensional Bayesian recovery experiment. (i) Observed set of corrupted data Y, with the point y_i we are inferring from highlighted. (ii) Ground-truth hidden clean data with the target point x_i highlighted, along with the posterior q(x|y_i) reconstructed by a VAE. (iii) Posterior q(x|y_i) recovered with our TAE. While the VAE posterior collapses to a single point, the TAE reconstructs a rich posterior that adjusts to the data manifold.

