PARAMETRIC DENSITY ESTIMATION WITH UNCERTAINTY USING DEEP ENSEMBLES

Anonymous

Abstract

In parametric density estimation, the parameters of a known probability density are typically recovered from measurements by maximizing the log-likelihood. Prior knowledge of measurement uncertainties is not included in this method, potentially producing degraded or even biased parameter estimates. We propose an efficient two-step, general-purpose approach for parametric density estimation using deep ensembles. Feature predictions and their uncertainties are returned by a deep ensemble and then combined in an importance-weighted maximum likelihood estimation to recover parameters representing a known density along with their respective errors. To compare the bias-variance tradeoff of different approaches, we define an appropriate figure of merit. We illustrate a number of use cases for our method in the physical sciences and demonstrate state-of-the-art results for X-ray polarimetry that outperform current classical and deep learning methods.

1. INTRODUCTION

The majority of state-of-the-art NN performances are single (high-dimensional) input, multiple-output tasks, for instance classifying images (Krizhevsky et al., 2012), scene understanding (Redmon et al., 2015) and voice recognition (Graves et al., 2006). These tasks typically involve one input vector or image and a single output vector of predictions.


In parametric density estimation, there is a known probability density that the data (or latent features of the data) are expected to follow. The goal is to find representative distribution parameters for a given dataset. In simple cases where the likelihood is calculable, maximum likelihood estimation can be used effectively. In cases where latent features of the data follow a known distribution (e.g., heights of people in a dataset of photographs), NNs can potentially be used to directly estimate the distribution parameters. For clarity, we define this direct/end-to-end approach as parametric feature density estimation (PFDE). Such an approach requires employing entire datasets (with potentially thousands to millions of high-dimensional examples) as inputs in order to output a vector of density parameters. Furthermore, to be useful these NNs would need to generalize to arbitrarily sized dataset-inputs.
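As a concrete illustration of the tractable case, the Gaussian MLE has a closed form: the sample mean and the (1/N) sample variance. A minimal sketch using the heights example from above (the numbers and the simulated data are purely illustrative, not from the paper's experiments):

```python
import numpy as np

# When the likelihood is tractable, MLE recovers density parameters directly.
# For a Gaussian, the MLE is the sample mean and the (biased, 1/N) variance.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=10_000)  # simulated heights (cm)

mu_hat = heights.mean()     # MLE of the mean
sigma_hat = heights.std()   # MLE of the standard deviation (1/N convention)
```

With enough samples the estimates converge on the generating parameters; the difficulty the paper addresses arises when the features (here, heights) must themselves be predicted from high-dimensional inputs with varying accuracy.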


One example of NNs making sense of large dataset-inputs is found in natural language processing. Here large text corpora, converted to word vectors (Pennington et al., 2014; Devlin et al., 2019), can be input and summarized by single output vectors using recurrent neural networks (RNNs), for instance in sentiment analysis (Can et al., 2018). However, these problems and RNNs themselves contain inductive bias: there is inherent structure in text. Not all information need be given at once, and a concept of memory or attention is sufficient (Vaswani et al., 2017). The same can be said about time-domain problems, such as audio processing or voice recognition. Memory is inherently imperfect; for PFDE, one ideally wants to know all elements of the ensemble at once to make the best prediction, so sequential inductive bias is undesirable. Ultimately, memory and architectural constraints make training NNs for direct PFDE computationally intractable.


On the other hand, density estimation on data directly (not on its latent features) is computationally tractable. Density estimation lets us find a complete statistical model of the data-generating process. Applying deep learning to density estimation has advanced the field significantly (Papamakarios, 2019). Most of the work so far focuses on density estimation where the density is unknown a priori. This can be achieved with non-parametric methods such as neural density estimation (Papamakarios et al., 2018), or with parametric methods such as mixture density networks (Bishop, 1994). In PFDE, however, we have a known probability density over some features of the whole dataset. The features may be more difficult to predict accurately in some datapoints than others.

Under review as a conference paper at ICLR 2021


Typical parametric density estimation does not make use of data uncertainties where some elements in the dataset may be more noisy than others. Not including uncertainty information can lead to biased or even degraded parameter estimates. The simplest example of parametric density estimation using uncertainties is a weighted mean, which is the result of a maximum likelihood estimate for a multi-dimensional Gaussian. For density estimation on predicted data features, PFDE, we would like a way to quantify the predictive uncertainty. A general solution is offered by deep ensembles (Lakshminarayanan et al., 2017). While these are not strictly equivalent to a Bayesian approach, they can be made such using appropriate regularization (Pearce et al., 2018). Returning to the earlier example, the height of people in a dataset of photographs approximately follows a Gaussian distribution. We develop a method that can estimate the density parameters (mean and variance) and generalize to any dataset of photographs.
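The weighted-mean case above can be sketched in a few lines. The MLE of a shared mean $\mu$ when each measurement $y_n \sim \mathcal{N}(\mu, \sigma_n^2)$ has a known uncertainty $\sigma_n$ is the inverse-variance weighted mean (the measurement values below are made up for illustration):

```python
import numpy as np

# Inverse-variance weighted mean: MLE of a common mean mu when each
# measurement y_n ~ N(mu, sigma_n^2) has a known uncertainty sigma_n.
y = np.array([10.2, 9.8, 10.5, 12.0])
sigma = np.array([0.1, 0.1, 0.2, 2.0])   # the last measurement is very noisy

w = 1.0 / sigma**2
mu_hat = np.sum(w * y) / np.sum(w)   # weighted mean
mu_err = np.sqrt(1.0 / np.sum(w))    # its standard error

# The unweighted mean (10.625) is pulled toward the noisy point;
# the weighted estimate stays near the precise measurements.
```

This is exactly the behaviour lost when uncertainties are ignored: without the weights, the single noisy measurement biases the estimate.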


In general, the function $f$ mapping the high-dimensional data points to the desired density parameters is unknown, since the high-dimensional data is abstracted from its features. Learning $f$ directly is typically infeasible because an entire ensemble of inputs $\{x_n\}_{n=1}^{N}$ must be processed simultaneously to estimate density parameters, and this approach would have to generalize to arbitrary $N$ and density parameter values. We discuss some special cases where this is possible in §1. However, the function $g$ mapping data features $y_n$ to the density parameters is known.


We cast this as a supervised learning problem where we have a dataset $\mathcal{D}$ consisting of $N_{\text{train}}$ data points, $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N_{\text{train}}}$, with labels $y \in \mathbb{R}^K$ and inputs $x \in \mathbb{R}^D$. We want to estimate the density parameters $\psi_1, \psi_2, \ldots, \psi_k$ as $g(\{y_n\}_{n=1}^{N_{\text{test}}})$ for an unseen test set of arbitrary size $N_{\text{test}}$.
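The setup can be sketched end to end. All names and data here are hypothetical stand-ins (the paper uses NNs, not least squares): a predictor is trained on labelled pairs $(x_n, y_n)$, and the known map $g$ converts predicted features into density parameters, for the Gaussian case $g(\{y_n\}) = (\text{mean}, \text{variance})$:

```python
import numpy as np

# Known map g from features to density parameters (Gaussian case).
def g(y_pred):
    return y_pred.mean(), y_pred.var()

rng = np.random.default_rng(1)
D, N_train, N_test = 5, 500, 200
w_true = rng.normal(size=D)
x_train = rng.normal(size=(N_train, D))
y_train = x_train @ w_true + rng.normal(scale=0.1, size=N_train)

# Stand-in for a neural network: an ordinary least-squares regressor.
w_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

x_test = rng.normal(size=(N_test, D))
psi = g(x_test @ w_hat)  # density parameters for an arbitrarily sized test set
```

Note that $g$ applies to a test set of any size, which is what makes the two-step approach generalize where a direct dataset-to-parameters NN would not.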


The basic recipe that comes to mind is training a single NN to predict the output labels $\{y_n\}_{n=1}^{N}$ and then evaluating $g$ directly. This ignores the high variance in single-NN predictions (dependent on train-

