UNCERTAINTY ESTIMATION VIA DISCRETE LATENT REPRESENTATION

Abstract

Many important problems in the real world do not have unique solutions. It is therefore important for machine learning models to be capable of proposing different plausible solutions with meaningful probability measures. In this work we propose a novel deep learning based framework, named modal uncertainty estimation (MUE), to learn the one-to-many mappings between inputs and outputs, together with faithful uncertainty estimation. Motivated by the multi-modal posterior collapse problem in current conditional generative models, MUE uses a set of discrete latent variables, each representing a latent mode hypothesis that explains one type of input-output relationship, to generate the one-to-many mappings. Benefiting from the discrete nature of the latent representations, MUE can effectively estimate, for any input, the conditional probability distribution over the outputs. Moreover, MUE is efficient during training, since the discrete latent space and its uncertainty estimation are jointly learned. We also develop the theoretical background of MUE and extensively validate it on both synthetic and realistic tasks. MUE demonstrates (1) significantly more accurate uncertainty estimation than the current state of the art, and (2) its informativeness for practical use.

1. INTRODUCTION

Making predictions in the real world has to face various uncertainties. One of the arguably most common uncertainties is due to partial or corrupted observations, which are often insufficient for making a unique and deterministic prediction. For example, when inspecting whether a single CT scan of a patient contains a lesion, without more information it is possible for radiologists to reach different conclusions, as a result of the different hypotheses they have about the image. In such an ambiguous scenario, the question is thus: given the observable, which one(s) out of the many possibilities would be more reasonable than others? Mathematically, this is a one-to-many mapping problem and can be formulated as follows. Suppose the observed information is x ∈ X in the input space; we are asked to estimate the conditional distribution p(y|x) for y ∈ Y in the prediction space, based on the training sample pairs (x, y). There are immediate challenges that prevent p(y|x) from being estimated directly in practical situations. First, both X and Y, e.g. as spaces of images, can be embedded in very high dimensional spaces with very complex structures. Second, only the unorganized pairs (x, y), not the one-to-many mappings x → {y_i}_i, are explicitly available. Fortunately, recent advances in conditional generative models based on the Variational Auto-Encoder (VAE) framework of Kingma & Welling (2014) shed light on how to tackle our problem. By modelling through latent variables c = c(x), one aims to explain the underlying mechanism of how y is assigned to x. Hopefully, variation of c will then result in variation in the output ŷ(x, c), which will approximate the true one-to-many mapping distributionally.

Many current conditional generative models, including cVAE in Sohn et al. (2015), BicycleGAN in Zhu et al. (2017b), Probabilistic U-Net in Kohl et al. (2018), etc., are developed upon the VAE framework, with the Gaussian distribution with diagonal covariance as the de facto parametrization of the latent variables. However, in the following we will show that such a parametrization creates a dilemma between model training and actual inference, as a form of what is known as the posterior collapse problem in the VAE literature Alemi et al. (2018); Razavi et al. (2019). This issue is particularly easy to understand in our setting, where we assume there are multiple y's for a given x.

Figure 1: Comparison between Gaussian latent representations and discrete latent representations in a multi-modal situation. Gaussian latents are structurally limited in such a setting. (a) The ideal situation when there is no posterior collapse as multiple modes appear, but the prior distribution is a poor approximation of the posterior. (b) Posterior collapse happens, and no multi-modal information is conveyed by the learned prior. (c) Discrete latent representations can ameliorate the posterior collapse problem, while the prior can approximate the posterior more accurately when both are restricted to be discrete.

Let us recall that one key ingredient of the VAE framework is to minimize the KL-divergence between the latent prior distribution p(c|x) and the latent variational approximation p_φ(c|x, y) of the posterior, where φ denotes the parameters of the "recognition model" in VAE. It does not matter whether the prior is fixed, p(c|x) = p(c) as in Kingma & Welling (2014), or learned, p(c|x) = p_θ(c|x) as in Sohn et al. (2015), as long as both prior and variational posterior are parameterized by Gaussians.

Now suppose that for a particular x there are two modes y_1, y_2 for the corresponding predictions. Since the minimization is performed over the entire training set, p(c|x) is forced to approximate a posterior mixture p(c|x, y_(·)) of two Gaussians, one from mode y_1 and one from y_2. If the minimization is successful, meaning the KL divergence is small, the mixture of the variational posteriors must be close to a single Gaussian, i.e. the posterior has collapsed as in Fig. 1(b), and hence the multi-modal information is lost. Put contrapositively, if multi-modal information is to be conveyed by the variational posterior, then the minimization cannot be successful, meaning a higher KL divergence. This may partly explain why training a conditional VAE can be a delicate matter. The situation is schematically illustrated in Figure 1 in one dimension. Note that the case in Figure 1(a) is usually preferable; however, the density values of the prior used during testing cannot reflect the uncertainty level of the outputs. We quantitatively demonstrate this in Section 4 and Fig. 2.

One direction to solve the above problem is to modify the strength of the KL-divergence or the variational lower bound while keeping the Gaussian parametrization, which has been explored extensively in the literature, as in Higgins et al. (2017); Alemi et al. (2018); Rezende & Viola (2018). However, besides requiring extensive parameter tuning, these approaches are not tailored to the multi-modal posterior collapse problem described above, and thus do not solve the inaccurate uncertainty estimation problem. Mixtures or compositions of Gaussian priors have also been proposed in Nalisnick et al. (2016); Tomczak & Welling (2018), but the number of Gaussians in the mixture is usually fixed a priori. Making such a model conditional further complicates the matter, since the number of components should depend on the input.

We therefore adopt another direction, which is to use a latent distribution parametrization other than Gaussians, one that can naturally exhibit multiple modes. The simplest choice is to constrain the latent space to be a finite set, as proposed in van den Oord et al. (2017), so that we can learn the conditional distribution as a categorical distribution. We argue that a discrete latent space may be particularly beneficial in our setting. First, different from unconditional or weakly conditional generative modelling tasks where diversity is the main consideration, making accurate predictions based on partial information often leads to a significantly restricted output space. Second, there is no longer noise injection during training, so the decoder can utilize the information from the latent variable more effectively. This makes the model less prone to ignoring the latent variable completely, in contrast to many conditional generation methods that use noise inputs. Third, the density values learned on the latent space are more interpretable, since the learned prior can approximate the variational posterior better. In our case, the latent variables can now represent latent mode hypotheses for making the corresponding most likely predictions. We call our approach modal uncertainty estimation (MUE). The main contributions of this work are: (1) We solve the MUE problem by using a cVAE and justify the use of a discrete latent space from the perspective of the multi-modal posterior collapse problem. (2) Our uncertainty estimation improves significantly over the existing state of the art. (3)
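The dilemma above can be checked numerically. The sketch below is our own illustration, not code from the paper; the two mode locations and their widths are arbitrary choices. It builds a two-mode posterior mixture and the best single-Gaussian approximation to it in the moment-matching sense: the KL divergence stays large, and the fitted Gaussian puts its highest density between the two modes, exactly where the true posterior has essentially no mass.

```python
import numpy as np

# Two well-separated posterior modes p(c|x, y1), p(c|x, y2), mixed 50/50.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -2.0, 0.3) + 0.5 * gauss(x, 2.0, 0.3)  # bimodal posterior mixture

# The best single-Gaussian fit in the KL(p || q) sense is moment matching.
mean = np.sum(x * p) * dx               # = 0 by symmetry
var = np.sum((x - mean) ** 2 * p) * dx  # = 0.3^2 + 2^2 = 4.09
q = gauss(x, mean, np.sqrt(var))        # the collapsed, unimodal "prior"

# KL(p || q) remains large: one Gaussian cannot represent both modes.
kl = np.sum(p * np.log(np.maximum(p, 1e-300) / q)) * dx
print(f"KL(p||q) = {kl:.3f} nats")      # roughly 1.2 nats

# Worse, q peaks at c = 0, where the true posterior has almost no mass.
i0, i2 = np.argmin(np.abs(x)), np.argmin(np.abs(x - 2.0))
print(q[i0] > q[i2], p[i0] < 1e-6)      # True True
```

Raising the mode separation or shrinking the mode widths only increases the KL gap, which mirrors the contrapositive argument above: conveying multi-modal information through a Gaussian variational posterior forces the KL term to stay high.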

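By contrast, a discrete latent space removes this structural obstruction. The toy sketch below is our own illustration: the codebook size, the nearest-code quantization rule in the style of van den Oord et al. (2017), and the mode frequencies are all assumed for the example. It shows that an aggregate posterior over a finite set of codes is itself categorical, so a learned categorical prior can match it with zero KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Codebook of K candidate latent codes (in the spirit of van den Oord et al., 2017).
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def quantize(z_e):
    """Map a continuous encoder output to the index of its nearest code."""
    return int(np.argmin(np.sum((codebook - z_e) ** 2, axis=1)))

# Suppose two output modes y1, y2 select codes 1 and 5 with frequencies 0.7 / 0.3.
# The aggregate posterior over codes for this x is then itself categorical ...
posterior = np.zeros(K)
posterior[quantize(codebook[1] + 0.01)] = 0.7  # mode y1 -> code 1
posterior[quantize(codebook[5] - 0.01)] = 0.3  # mode y2 -> code 5

# ... so a learned categorical prior can match it exactly: KL = 0.
prior = posterior.copy()
support = posterior > 0
kl = np.sum(posterior[support] * np.log(posterior[support] / prior[support]))
print(kl)  # 0.0 -- no structural obstruction, unlike the Gaussian case
```

The point of the contrast with the Gaussian example is that here the prior family contains the multi-modal aggregate posterior, so minimizing the KL term and preserving the modes are no longer in conflict.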

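A discrete latent space also makes the test-time uncertainty report straightforward: enumerate the codes, decode one hypothesis per code, and attach the learned prior probability to each. The helper below is a hypothetical sketch of this procedure; `rank_hypotheses`, the probability vector, and the stub `decoder` are our names and placeholder values, not the paper's.

```python
import numpy as np

def rank_hypotheses(prior_probs, decoder, x, top=3):
    """Return the `top` most probable (probability, prediction) hypotheses."""
    order = np.argsort(prior_probs)[::-1][:top]
    return [(float(prior_probs[k]), decoder(x, k)) for k in order]

prior_probs = np.array([0.05, 0.60, 0.02, 0.25, 0.08])  # learned p(c|x), sums to 1
decoder = lambda x, k: f"prediction-{k}"                # stub for (x, code) -> y_hat

for prob, y_hat in rank_hypotheses(prior_probs, decoder, x=None):
    print(f"{prob:.2f}  {y_hat}")
# 0.60  prediction-1
# 0.25  prediction-3
# 0.08  prediction-4
```

Because each code carries an explicit probability, the ranked list is a direct readout of the conditional distribution over mode hypotheses, which is what makes the learned density values interpretable in this setting.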