TOMOGRAPHIC AUTO-ENCODER: UNSUPERVISED BAYESIAN RECOVERY OF CORRUPTED DATA

Abstract

We propose a new probabilistic method for unsupervised recovery of corrupted data. Given a large ensemble of degraded samples, our method recovers accurate posteriors of clean values, allowing exploration of the manifold of possible reconstructed data and hence characterisation of the underlying uncertainty. In this setting, direct application of classical variational methods often gives rise to collapsed densities that do not adequately explore the solution space. Instead, we derive a novel approximate inference method, based on a reduced entropy condition, that results in rich posteriors. We test our model on a data recovery task under the common setting of missing values and noise, demonstrating superior performance to existing variational methods for imputation and de-noising on different real data sets. We further show higher classification accuracy after imputation, indicating the advantage of propagating uncertainty to downstream tasks with our model.

1. INTRODUCTION

Data sets are rarely clean and ready to use when first collected. More often than not, they need to undergo some form of pre-processing before analysis, involving expert human supervision and manual adjustments (Zhou et al., 2017; Chu et al., 2016). Filling missing entries, correcting noisy samples, filtering collection artefacts and other similar tasks are some of the most costly and time-consuming stages in the data modeling process and pose an enormous obstacle to machine learning at scale (Munson, 2012).

Traditional data cleaning methods rely on some degree of supervision, in the form of a clean data set or knowledge collected from domain experts. However, the exponential increase in data collection and storage rates in recent years makes any supervised algorithm impractical in the context of modern applications that consume millions or billions of data points. In this paper, we introduce a novel variational framework to perform automated data cleaning and recovery without any example of clean data or prior signal assumptions.

The Tomographic auto-encoder (TAE) is named in analogy with standard tomography. Tomographic techniques for signal recovery aim at reconstructing a target signal, such as a 3D image, by algorithmically combining different incomplete measurements, such as 2D images from different viewpoints, subsets of image pixels or other projections (Geyer et al., 2015). The TAE extends this concept to the reconstruction of data manifolds: our target signal is a clean data set, and the corrupted data are interpreted as incomplete measurements, which we aim to combine to reconstruct the clean data. More specifically, we are interested in performing Bayesian recovery, where we do not simply transform degraded samples into clean ones, but recover probabilistic functions with which we can generate diverse clean signals and capture uncertainty. Capturing uncertainty is particularly important when cleaning data.
If we are over-confident about specific solutions, errors are easily ignored and passed

