REGRESSION PRIOR NETWORKS

Abstract

Prior Networks are a class of models which yield interpretable measures of uncertainty and have been shown to outperform state-of-the-art ensemble approaches on a range of tasks. They can also be used to distill an ensemble of models via Ensemble Distribution Distillation (EnD²), such that its accuracy, calibration, and uncertainty estimates are retained within a single model. However, Prior Networks have so far been developed only for classification tasks. This work extends Prior Networks and EnD² to regression tasks by considering the Normal-Wishart distribution. The properties of Regression Prior Networks are demonstrated on synthetic data, selected UCI datasets, and two monocular depth estimation tasks. They yield performance competitive with ensemble approaches.

1. INTRODUCTION

Neural Networks have become the standard approach to addressing a wide range of machine learning tasks (Girshick, 2015; Simonyan & Zisserman, 2015; Villegas et al., 2017; Mikolov et al., 2013b;a; 2010; Hinton et al., 2012; Hannun et al., 2014; Caruana et al., 2015; Alipanahi et al., 2015). However, in order to improve the safety of AI systems (Amodei et al., 2016) and avoid costly mistakes in high-risk applications, such as self-driving cars, it is desirable for models to yield estimates of uncertainty in their predictions. Ensemble methods are known to yield both improved predictive performance and robust uncertainty estimates (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Maddox et al., 2019). Importantly, ensemble approaches allow interpretable measures of uncertainty to be derived via a mathematically consistent probabilistic framework. Specifically, the overall total uncertainty can be decomposed into data uncertainty, which is uncertainty due to inherent noise in the data, and knowledge uncertainty, which is due to the model having limited knowledge of the test data (Malinin, 2019). Uncertainty estimates derived from ensembles have been applied to the detection of misclassifications, out-of-domain inputs and adversarial attacks (Carlini & Wagner, 2017; Smith & Gal, 2018), and to active learning (Kirsch et al., 2019). Unfortunately, ensemble methods may be computationally expensive to train and are always expensive during inference. A class of models called Prior Networks (Malinin & Gales, 2018; 2019; Malinin, 2019; Sensoy et al., 2018) was proposed as an approach to modelling uncertainty in classification tasks by emulating an ensemble with a single model. Prior Networks parameterize a higher-order conditional distribution over output distributions, such as the Dirichlet distribution. This enables Prior Networks to efficiently yield the same interpretable measures of total, data and knowledge uncertainty as an ensemble.
Unlike ensembles, the behaviour of a Prior Network's higher-order distribution is specified via a loss function, such as the reverse KL-divergence (Malinin & Gales, 2019), and training data. However, such Prior Networks yield predictive performance consistent with that of a single model trained via Maximum Likelihood, which is typically worse than that of an ensemble. This can be overcome via Ensemble Distribution Distillation (EnD²) (Malinin et al., 2020), an approach which distills an ensemble into a Prior Network such that measures of ensemble diversity are preserved. This makes it possible to retain both the predictive performance and uncertainty estimates of an ensemble at low computational and memory cost. Finally, it is important to point out that a related class of evidential methods has appeared concurrently (Sensoy et al., 2018; Amini et al., 2020). Structurally they yield models similar to Prior Networks, but are trained in a different fashion. While Prior Networks have many attractive properties, they have only been applied to classification tasks. In this work we develop Prior Networks for regression tasks by considering the Normal-Wishart distribution, a higher-order distribution over the parameters of multivariate normal distributions. Specifically, we extend theoretical work from (Malinin, 2019), where such models are considered, but never evaluated. We derive all measures of uncertainty, the reverse KL-divergence training objective, and the Ensemble Distribution Distillation objective in closed form. Regression Prior Networks are then evaluated on synthetic data, selected UCI datasets and the NYUv2 and KITTI monocular depth estimation tasks, where they are shown to yield performance comparable to or better than state-of-the-art single-model and ensemble approaches. Crucially, via EnD², they retain the predictive performance and uncertainty estimates of an ensemble within a single model.

2. REGRESSION PRIOR NETWORKS

In this section we develop Prior Network models for regression tasks. While typical regression models yield point-estimate predictions, we consider probabilistic regression models which parameterize a distribution p(y|x, θ) over the target y ∈ R^K. Typically, this is a normal distribution:

$$p(y|x, \theta) = \mathcal{N}(y|\mu, \Lambda), \quad \{\mu, \Lambda\} = f(x; \theta) \quad (1)$$

where µ is the mean and Λ the precision matrix, a positive-definite symmetric matrix. While normal distributions are usually defined in terms of the covariance matrix Σ = Λ⁻¹, parameterization using the precision is more numerically stable during optimization (Bishop, 2006; Goodfellow et al., 2016). While a range of distributions over continuous random variables can be considered, we use the normal as it makes the fewest assumptions about the nature of y and is mathematically simple. As in the classification case, we can consider an ensemble of networks which parameterize multivariate normal (MVN) distributions {p(y|x, θ^(m))}_{m=1}^M. This ensemble can be interpreted as a set of draws from a higher-order implicit distribution over normal distributions. A Prior Network for regression would, therefore, emulate this ensemble by explicitly parameterizing a higher-order distribution over the parameters µ and Λ of a normal distribution. One sensible choice is the Normal-Wishart distribution (Murphy, 2012; Bishop, 2006), which is the conjugate prior to the MVN. This parallels how the Dirichlet distribution, the conjugate prior to the categorical, was used in classification Prior Networks. The Normal-Wishart distribution is defined as follows:

$$\mathcal{NW}(\mu, \Lambda|m, L, \kappa, \nu) = \mathcal{N}(\mu|m, \kappa\Lambda)\,\mathcal{W}(\Lambda|L, \nu) \quad (2)$$

where m and L are the prior mean and the inverse of the positive-definite prior scatter matrix, while κ and ν are the strengths of belief in each prior, respectively. The parameters κ and ν are conceptually similar to the precision α₀ of the Dirichlet distribution.
The Normal-Wishart is a compound distribution which decomposes into a product of a conditional normal distribution over the mean and a Wishart distribution over the precision. Thus, a Regression Prior Network (RPN) parameterizes a Normal-Wishart distribution over the mean and precision of normal output distributions as follows:

$$p(\mu, \Lambda|x, \theta) = \mathcal{NW}(\mu, \Lambda|m, L, \kappa, \nu), \quad \{m, L, \kappa, \nu\} = \Omega = f(x; \theta) \quad (3)$$

where Ω = {m, L, κ, ν} is the set of parameters of the Normal-Wishart predicted by the neural network. The posterior predictive of this model is the multivariate Student's T distribution (Murphy, 2012), the heavy-tailed generalization of the multivariate normal distribution:

$$p(y|x, \theta) = \mathbb{E}_{p(\mu,\Lambda|x,\theta)}[p(y|\mu, \Lambda)] = \mathcal{T}\Big(y\,\Big|\,m,\ \frac{\kappa+1}{\kappa(\nu-K+1)}L^{-1},\ \nu-K+1\Big) \quad (4)$$

In the limit as ν → ∞, the T distribution converges to a normal distribution. The predictive posterior of the Prior Network given in equation (4) only has a defined mean and variance when ν > K + 1. Figure 1 depicts the desired behaviour of an ensemble of normal distributions sampled from a Normal-Wishart distribution. Specifically, the ensemble should be consistent for in-domain inputs in regions of low/high data uncertainty, as in figures 1a-b, and highly diverse both in the location of the mean and in the structure of the covariance for out-of-distribution inputs, as in figure 1c. Samples of continuous output distributions from a regression Prior Network should yield the same behaviour.
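To make the compound structure concrete, the following sketch draws an "ensemble" of normal output distributions from a single Normal-Wishart, mirroring the generative view above. NumPy and SciPy are illustrative choices, not part of the paper's implementation; note that SciPy's Wishart uses the same scale-matrix convention as L here.

```python
import numpy as np
from scipy.stats import wishart

def sample_normal_wishart(m, L, kappa, nu, rng):
    """Draw (mu, Lambda) ~ NW(m, L, kappa, nu).

    Lambda ~ Wishart(L, nu), then mu | Lambda ~ N(m, (kappa * Lambda)^{-1}),
    i.e. kappa * Lambda is the precision of mu, as in the definition above.
    """
    Lam = np.atleast_2d(wishart.rvs(df=nu, scale=L, random_state=rng))
    mu = rng.multivariate_normal(m, np.linalg.inv(kappa * Lam))
    return mu, Lam

# One Normal-Wishart implies a whole "ensemble" of normal output distributions:
rng = np.random.default_rng(0)
m, L, kappa, nu = np.zeros(2), np.eye(2), 10.0, 10.0
ensemble = [sample_normal_wishart(m, L, kappa, nu, rng) for _ in range(5)]
```

Large κ and ν concentrate the sampled normals around (m, νL), emulating a consistent ensemble; small values yield the diverse behaviour desired out-of-distribution.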

Measures of Uncertainty

Given an RPN which displays these behaviours, we can compute closed-form expressions for all uncertainty measures previously discussed for ensembles and Dirichlet Prior Networks (Malinin, 2019). We can obtain measures of knowledge, total and data uncertainty by considering the mutual information between y and the parameters of the output distribution {µ, Λ}:

$$\underbrace{\mathcal{I}[y, \{\mu,\Lambda\}]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\big[\mathbb{E}_{p(\mu,\Lambda|x,\theta)}[p(y|\mu,\Lambda)]\big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[\mathcal{H}[p(y|\mu,\Lambda)]\big]}_{\text{Expected Data Uncertainty}} \quad (5)$$

This expression is the difference between the differential entropy of the posterior predictive and the expected differential entropy of draws from the Normal-Wishart prior. We can also consider the expected pairwise KL-divergence (EPKL) between draws from the Normal-Wishart prior:

$$\mathcal{K}[y, \{\mu,\Lambda\}] = -\mathbb{E}_{p(y|x,\theta)}\big[\mathbb{E}_{p(\mu,\Lambda|x,\theta)}[\ln p(y|\mu,\Lambda)]\big] - \mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[\mathcal{H}[p(y|\mu,\Lambda)]\big] \quad (6)$$

This is an upper bound on the mutual information (Malinin, 2019); notably, the estimate of data uncertainty is unchanged. One practical use of EPKL is comparison with ensembles, as it is not possible to obtain a tractable expression for the mutual information of a regression ensemble (Malinin, 2019). Alternatively, we can consider measures of uncertainty derived via the law of total variance:

$$\underbrace{\mathcal{V}_{p(\mu,\Lambda|x,\theta)}[\mu]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{V}_{p(y|x,\theta)}[y]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\mu,\Lambda|x,\theta)}[\Lambda^{-1}]}_{\text{Expected Data Uncertainty}} \quad (7)$$

This yields a decomposition similar to the mutual information, but considers only the first and second moments. Note, however, that these variance-based measures are not scale-invariant, and are therefore sensitive to the scale of the model's predictions. This is relevant to applications such as depth estimation, where images span a wide range of possible depths, and therefore a varying scale of predictions. We omit the closed-form expressions for all terms here and instead provide them in appendix A.
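The law-of-total-variance decomposition in equation (7) can be checked numerically by Monte Carlo: sample (µ, Λ) from a Normal-Wishart, sample y from the corresponding normals, and compare the empirical variance of y with the variance of µ plus the mean of Λ⁻¹. This is an illustrative sketch (NumPy/SciPy, our own parameter choices), not the paper's code.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
K, kappa, nu = 2, 5.0, 10.0
m, L = np.zeros(K), np.eye(K) / 10.0

# Sample (mu, Lambda) ~ NW, then y ~ N(mu, Lambda^{-1}).
N = 200_000
Lams = wishart.rvs(df=nu, scale=L, size=N, random_state=rng)   # (N, K, K)
covs = np.linalg.inv(Lams)
mus = m + np.einsum('nij,nj->ni', np.linalg.cholesky(covs / kappa),
                    rng.standard_normal((N, K)))
ys = mus + np.einsum('nij,nj->ni', np.linalg.cholesky(covs),
                     rng.standard_normal((N, K)))

total = np.cov(ys.T)        # V[y]            (total uncertainty)
knowledge = np.cov(mus.T)   # V[mu]           (knowledge uncertainty)
data = covs.mean(axis=0)    # E[Lambda^{-1}]  (expected data uncertainty)
# Law of total variance: total ~= knowledge + data
```

For the Normal-Wishart these terms are also available analytically, e.g. E[Λ⁻¹] = L⁻¹/(ν − K − 1), which the Monte-Carlo estimate recovers.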
RKL training objective Having discussed how to construct Prior Networks for regression, we now discuss how they can be trained. Prior Networks are trained using a multi-task loss, where an in-domain loss L^in and an out-of-distribution (OOD) loss L^out are jointly minimized:

$$\mathcal{L}(\theta, \mathcal{D}_{tr}, \mathcal{D}_{out}) = \mathbb{E}_{p_{tr}(y,x)}\big[\mathcal{L}^{in}(y, x, \theta)\big] + \gamma \cdot \mathbb{E}_{p_{out}(x)}\big[\mathcal{L}^{out}(x, \theta)\big] \quad (8)$$

The OOD loss is necessary to teach the model the limits of its knowledge (Malinin, 2019) and to define a decision boundary between the in-domain and out-of-domain regions; in some applications the choice of OOD data can be particularly difficult. Normal distributions sampled from a Prior Network should be consistent and reflect the correct level of data uncertainty in-domain, and diverse in both mean and precision out-of-domain. Achieving the former is challenging: the training data consists only of samples of inputs x and targets y, with no access to the underlying distribution, whose data uncertainty is represented by the precision Λ. Effectively, we are attempting to train a Normal-Wishart distribution from targets drawn from normal distributions that are themselves drawn from the Normal-Wishart, rather than from the normal distributions directly. However, it was shown that for Dirichlet Prior Networks, minimizing the reverse KL-divergence between the model and an appropriate target Dirichlet induces the correct estimate of data uncertainty in expectation (Malinin & Gales, 2019). As the Normal-Wishart, like the Dirichlet, is a conjugate prior and a member of the exponential family, the precision can be induced in expectation by considering the reverse KL-divergence between the model p(µ, Λ|x, θ) and a target Normal-Wishart p(µ, Λ|Ω̂^(i)) corresponding to each x^(i).
The appropriate target Normal-Wishart is specified via Bayes' rule:

$$p(\mu, \Lambda|\hat\Omega^{(i)}) \propto p(y^{(i)}|\mu, \Lambda)^{\beta}\, p(\mu, \Lambda|\Omega_0) \quad (9)$$

where p(y^(i)|µ, Λ) is a normal distribution and Ω₀ = {m₀, L₀, κ₀, ν₀} are the parameters of the prior p(µ, Λ|Ω₀), defined as follows:

$$m_0 = \frac{1}{N}\sum_{i=1}^N y^{(i)}, \quad L_0^{-1} = \frac{\nu_0}{N}\sum_{i=1}^N (y^{(i)} - m_0)(y^{(i)} - m_0)^{\mathrm{T}}, \quad \kappa_0 = \epsilon, \quad \nu_0 = K + 1 + \epsilon \quad (10)$$

In other words, we consider a semi-informative prior which corresponds to the mean and scatter matrix of the marginal distribution p(y), and we see each sample of the training data β times. The hyper-parameter β allows us to weigh the effect of the prior against the data, and ε is a small value, such as 10⁻², so that κ₀ and ν₀ yield a maximally uninformative, but proper, predictive posterior. The reason to use a semi-informative prior is that in regression tasks, unlike classification tasks, uninformative priors are improper and lead to infinite differential entropy. Furthermore, we do know something about the data purely from its marginal distribution, and it is sensible to use that as the prior. The reverse KL-divergence loss can then be expressed as:

$$\mathcal{L}(y, x, \theta; \beta, \Omega_0) = \mathrm{KL}\big[p(\mu, \Lambda|x, \theta)\,\|\,p(\mu, \Lambda|\hat\Omega^{(i)})\big] = \beta \cdot \mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[-\ln p(y|\mu, \Lambda)\big] + \mathrm{KL}\big[p(\mu, \Lambda|x, \theta)\,\|\,p(\mu, \Lambda|\Omega_0)\big] + Z \quad (11)$$

where Z is a normalization constant independent of the parameters θ. For in-domain data, β can be set to a large value, and for out-of-domain training data β = 0, so that the model regresses to the prior. In-domain, the prior adds a degree of smoothing, which may prevent over-fitting and improve performance on small datasets. A large value of β means the model will yield larger κ, ν for in-domain data. This results in the first term of (11) being very close to the expected negative log-likelihood of the predictive posterior p(y|x, θ), thereby avoiding degradation of predictive performance. The derivation and closed-form expression for this loss are provided in appendix A.
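As a sketch, the semi-informative prior parameters of equation (10) can be computed directly from the training targets. The function below is an illustrative NumPy implementation; the name `prior_parameters` and the default ε = 10⁻² are our choices, not the paper's code.

```python
import numpy as np

def prior_parameters(Y, eps=1e-2):
    """Semi-informative Normal-Wishart prior from the marginal of the
    training targets, as in equation (10).

    Y: (N, K) array of targets. Returns (m0, L0, kappa0, nu0), where
    L0^{-1} = nu0/N * sum_i (y_i - m0)(y_i - m0)^T.
    """
    N, K = Y.shape
    m0 = Y.mean(axis=0)
    nu0 = K + 1 + eps
    kappa0 = eps
    scatter = (Y - m0).T @ (Y - m0)
    L0 = np.linalg.inv(nu0 / N * scatter)
    return m0, L0, kappa0, nu0
```

Under this prior the expected precision is ν₀L₀ = N·S⁻¹, with S the scatter matrix, i.e. the inverse sample covariance of the targets, so the prior indeed matches the marginal distribution p(y).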
Ensemble Distribution Distillation An exciting task which Prior Networks can solve is Ensemble Distribution Distillation (EnD²) (Malinin et al., 2020), where the distribution of an ensemble's predictions is distilled into a single model. EnD² makes it possible to retain an ensemble's improved predictive performance and uncertainty estimates within a single model at low cost. In contrast, standard Ensemble Distillation (EnD) minimizes the KL-divergence between a model and the ensemble:

$$\mathcal{L}^{EnD}(\phi, \mathcal{D}_{trn}) = \frac{1}{NM}\sum_{i=1}^N\sum_{m=1}^M \mathrm{KL}\big[p(y|x^{(i)}, \theta^{(m)})\,\|\,p(y|x^{(i)}, \phi)\big] \quad (12)$$

This loses information about ensemble diversity. A complication that occurs in probabilistic regression models is that the student model will try to fit a single normal distribution on top of a mixture distribution, spreading itself across the components. This may result in both poor predictive performance and poor estimates of uncertainty. An approach that both overcomes this and retains information about diversity was considered in (Tran et al., 2020; Wu et al., 2020), where the student parameterizes a mixture distribution in which each component models a particular member of the ensemble. We refer to this as Mixture-Density Ensemble Distillation (MD-EnD):

$$p(y|x, \phi) = \frac{1}{M}\sum_{m=1}^M \mathcal{N}(y|\mu^{(m)}, \Lambda^{(m)}), \quad \{\mu^{(m)}, \Lambda^{(m)}\}_{m=1}^M = f(x; \phi)$$
$$\mathcal{L}^{MD\text{-}EnD}(\phi, \mathcal{D}_{trn}) = \frac{1}{NM}\sum_{i=1}^N\sum_{m=1}^M \mathrm{KL}\big[p(y|x^{(i)}, \theta^{(m)})\,\|\,\mathcal{N}(y|\mu^{(m)}, \Lambda^{(m)})\big] \quad (13)$$

This clearly overcomes the issue of distributional mismatch, resolving any issues with poor predictive performance. It also, in theory, allows the model to retain information about ensemble diversity. However, it may be challenging to fully replicate the behaviour of each ensemble member in detail using only multiple output heads; splitting the model at an earlier point in the network may be necessary, which increases computational cost. EnD² avoids the problem by directly modelling the bulk behaviour of the ensemble, rather than the behaviour of each individual model.
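Both the EnD and MD-EnD objectives reduce to KL-divergences between precision-parameterized Gaussians, which are available in closed form. A minimal sketch (NumPy; the helper name `gauss_kl` is ours):

```python
import numpy as np

def gauss_kl(mu1, Lam1, mu2, Lam2):
    """KL[N(mu1, Lam1^{-1}) || N(mu2, Lam2^{-1})] in closed form, with both
    Gaussians parameterized by their precision matrices."""
    K = mu1.shape[0]
    d = mu2 - mu1
    _, ld1 = np.linalg.slogdet(Lam1)
    _, ld2 = np.linalg.slogdet(Lam2)
    return 0.5 * (np.trace(Lam2 @ np.linalg.inv(Lam1))
                  + d @ Lam2 @ d - K + ld1 - ld2)
```

For EnD this term is averaged over ensemble members against a single student Gaussian; for MD-EnD each member m is matched to its own student component.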
EnD² can be implemented for regression via RPNs as follows. Consider an ensemble {p(y|x, θ^(m))}_{m=1}^M, where each model yields the mean and precision of a normal distribution. We can define an empirical distribution over the mean and precision as follows:

$$\hat p(\mu, \Lambda, x): \quad \Big\{\{\mu^{(mi)}, \Lambda^{(mi)}\}_{m=1}^M,\ x^{(i)}\Big\}_{i=1}^N = \mathcal{D}_{trn} \quad (14)$$

EnD² can then be accomplished by minimizing the negative log-likelihood of the ensemble's means and precisions under the Normal-Wishart prior:

$$\mathcal{L}^{EnD^2}(\phi, \mathcal{D}_{trn}) = \mathbb{E}_{\hat p(\mu,\Lambda,x)}\big[-\ln p(\mu, \Lambda|x; \phi)\big] = \mathbb{E}_{\hat p(x)}\big[\mathrm{KL}[\hat p(\mu, \Lambda|x)\,\|\,p(\mu, \Lambda|x; \phi)]\big] + Z \quad (15)$$

This is equivalent to minimizing the KL-divergence between the model and the empirical distribution of the ensemble. Note that here, unlike in the previous section, the parameters of a normal distribution are available for every input x, making the forward KL-divergence the appropriate loss function. However, while this is a theoretically sound approach, the optimization can be numerically challenging. Similarly to (Malinin et al., 2020), we propose a temperature-annealing trick to make the optimization process easier. First, the ensemble members are interpolated towards the ensemble mean:

$$\mu_T^{(mi)} = \frac{2}{T+1}\mu^{(mi)} + \frac{T-1}{T+1}\bar\mu^{(i)}, \quad \bar\mu^{(i)} = \frac{1}{M}\sum_{m=1}^M \mu^{(mi)}$$
$$\Lambda_T^{-1(mi)} = \frac{2}{T+1}\Lambda^{-1(mi)} + \frac{T-1}{T+1}\bar\Lambda^{-1(i)}, \quad \bar\Lambda^{-1(i)} = \frac{1}{M}\sum_{m=1}^M \Lambda^{-1(mi)} \quad (16)$$

We use the inverses of the precision matrices Λ because we are interpolating the covariance matrices Σ. Secondly, the predicted κ and ν are multiplied by T in order to make the Normal-Wishart sharp around the mean. The loss is divided by T to avoid scaling the gradients by T, yielding:

$$p_T(\mu, \Lambda|x, \phi) = \mathcal{NW}(\mu, \Lambda|m, L, T\kappa, T\nu), \quad \{m, L, \kappa, \nu\} = f(x; \phi)$$
$$\mathcal{L}^{EnD^2}(\phi, \mathcal{D}_{trn}; T) = \frac{1}{T}\,\mathbb{E}_{\hat p_T(\mu,\Lambda,x)}\big[-\ln p_T(\mu, \Lambda|x; T, \phi)\big] \quad (17)$$

This splits learning into two phases. First, when the temperature is high, the model learns to match the ensemble's mean (first moment). Second, as the temperature is annealed down to 1, the model gradually focuses on learning the higher moments of the ensemble's distribution.
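The interpolation step of equation (16) is a simple convex combination of each member's mean and covariance with the ensemble averages. A minimal sketch, assuming per-member means and covariances are already available (the function name is ours):

```python
import numpy as np

def anneal_member(mu_m, cov_m, mu_bar, cov_bar, T):
    """Interpolate one ensemble member's mean and covariance towards the
    ensemble averages, as in equation (16). T = 1 leaves the member
    unchanged; large T collapses all members onto the ensemble mean."""
    a = 2.0 / (T + 1.0)
    b = (T - 1.0) / (T + 1.0)   # a + b = 1, so this is a convex combination
    return a * mu_m + b * mu_bar, a * cov_m + b * cov_bar
```

At T = 1 each member is recovered exactly; as T grows the targets collapse onto the ensemble mean, which is what makes the early, high-temperature phase of training easy to fit.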
This trick may be necessary, as the ensemble may have a highly non-Normal-Wishart distribution, which can be challenging to learn. Note that for EnD² it may be better to parameterize the Normal-inverse-Wishart distribution over the mean and covariance due to numerical stability concerns; however, for consistency, we describe EnD² in terms of the Normal-Wishart. Finally, we emphasize that EnD² does not require OOD training data, unlike the RKL objective above. This eliminates the non-trivial challenge of finding appropriate OOD data. Related Approaches It is necessary to mention evidential approaches, specifically Deep Evidential Regression (DER) (Amini et al., 2020). Structurally, it considers models which are similar to RPNs, and is therefore compared against in section 5. However, it uses no OOD training data and does not attempt to emulate and generalize an ensemble's behaviour to enforce high-uncertainty behaviour in OOD regions. Rather, the model is trained by maximizing the likelihood of the T distribution (equation (4)) together with an evidence regularizer which forces the model to yield high ν, κ in regions of low absolute error, and low evidence in regions of high L1 error. However, this seems susceptible to pathologies, such as making the model overconfident if it over-fits the training data and achieves zero MAE, or introducing an inverse correlation between ν, κ and the scale of predictions in datasets where the scale of predictions ranges widely. Thus, while it yields encouraging results, its principle of action remains unclear. Another related recent research direction is the development of computationally efficient ensembles (Wen et al., 2020). Here, rather than emulating an ensemble with a single model, the goal is to generate high diversity in predictions while re-using large portions of a neural network, thereby making for compact ensembles.
While such approaches are more computationally efficient than using multiple independent models and use only a little more memory than a single model on disk, they still require M times as much GPU memory at run time, which can be a significant limitation in resource-constrained applications, such as mobile devices or autonomous vehicles.

3. EXPERIMENTS ON SYNTHETIC DATA

The results presented in Figure 2 show several trends. Firstly, the total uncertainty of all models is high in the region of high heteroscedastic noise as well as out-of-domain. Secondly, total uncertainty decomposes into data uncertainty and knowledge uncertainty: the former is high in the region of high heteroscedastic noise and has undefined behaviour out-of-domain, while the latter is low in-domain and large out-of-distribution. Third, EnD² successfully replicates the ensemble's estimates of uncertainty, though they are consistently larger, especially the estimates of data uncertainty out-of-domain. This is a consequence of the ensemble not being Normal-Wishart distributed when it is diverse, leading the EnD² Prior Network to over-estimate support. Thus, these results validate the principal claims: that Regression Prior Networks can emulate an ensemble's behaviour via multi-task training using the RKL objective or via EnD², and that they yield interpretable measures of uncertainty.

4. EXPERIMENTS ON UCI DATA

In this section, we evaluate Normal-Wishart Prior Networks trained via the reverse KL-divergence (11) (NWPN) and Ensemble Distribution Distillation (EnD²) relative to a Deep Ensemble (ENSM) baseline on selected UCI datasets. Other ensemble methods are not considered, as Deep Ensembles have been shown to consistently outperform them using fewer ensemble members (Ashukha et al., 2020; Ovadia et al., 2019; Fort et al., 2019). We follow the experimental setup of (Lakshminarayanan et al., 2017) with several changes, detailed in appendix C. Out-of-distribution training data for NWPN is generated using a factor analysis model, a linear generative model that learns to approximate in-domain data with x ∼ N(µ, WW^T + Ψ), where {W, µ, Ψ} are model parameters. The out-of-domain training examples are then sampled from N(µ, 3WW^T + 3Ψ), such that they lie further from the in-domain region. Table 1 shows a comparison of all models in terms of NLL and RMSE. Single is a single Gaussian model and ENSM is an ensemble of Gaussian models. Unsurprisingly, ensembles yield the best RMSE, though both NWPN and EnD² generally give comparable NLL scores. Furthermore, EnD² comes close to or matches the performance of the ensemble and outperforms NWPN. In Table 2 we compare uncertainty measures derived from all models on the tasks of error detection and OOD detection. To evaluate error detection, the Prediction Rejection Ratio (PRR) is used; it shows what fraction of the best possible error-detection performance an algorithm achieves and is defined in appendix C. To evaluate OOD-detection performance, we took parts of other UCI datasets as OOD test data. The results show that all models achieve comparable error-detection performance using measures of total uncertainty. In terms of OOD detection, EnD² generally reproduces the ensemble's behaviour, while NWPN usually performs worse. However, on the MSD dataset, NWPN yields the best performance.
This may be due to the nature of the OOD training data, which may simply be better suited to MSD OOD detection. Furthermore, the UCI datasets are generally small and have low input dimensionality; MSD, the largest, has 95 features. Therefore, it is difficult to assess the superiority of any particular model on these simple datasets; all we can say is that they generally perform comparably. In the next section, we validate Regression Prior Networks on a more complex, larger-scale task.
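Before moving on, the factor-analysis OOD generation used for NWPN above can be sketched as follows; scikit-learn's `FactorAnalysis` is our illustrative choice, as the paper does not specify an implementation:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def sample_ood(X, n_samples, scale=3.0, n_components=5, seed=0):
    """Fit a factor-analysis model x ~ N(mu, W W^T + Psi) to in-domain
    inputs X, then sample OOD training points from
    N(mu, scale * (W W^T + Psi)): the same linear model with an inflated
    covariance, pushing samples away from the in-domain region."""
    fa = FactorAnalysis(n_components=n_components, random_state=seed).fit(X)
    W = fa.components_.T                      # (n_features, n_components)
    cov = scale * (W @ W.T + np.diag(fa.noise_variance_))
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(fa.mean_, cov, size=n_samples)
```

With scale = 3 this matches the N(µ, 3WW^T + 3Ψ) sampler described above; the number of latent components is a hyperparameter we chose arbitrarily here.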

5. MONOCULAR DEPTH ESTIMATION EXPERIMENTS

Having established that the proposed methods work on par with or better than ensemble methods on the UCI datasets, we now examine them on the large-scale NYU Depth v2 (Nathan Silberman & Fergus, 2012) and KITTI (Menze & Geiger, 2015) depth-estimation tasks. In this section, the base model is DenseDepth (DD), which defines a U-Net-like architecture on top of DenseNet-169 features (Alhashim & Wonka, 2018). The original approach trains it on inverted targets using a combination of L1, SSIM, and image-gradient losses. We replace this with NLL training of a Gaussian model, which yields a mean and precision for each pixel (Single), and use the original targets from the dataset. The rest of the data pre-processing, augmentation, optimization, and evaluation protocol is kept unchanged. On the challenging KITTI benchmark (Geiger et al., 2013), all models are evaluated on the split proposed by (Eigen et al., 2014). We consider the following baselines: a single Gaussian model (Single), a Deep Ensemble of 5 Gaussian models (ENSM), ensemble distillation (EnD) (Hinton et al., 2015), mixture-density ensemble distillation (MD-EnD) (Tran et al., 2020; Wu et al., 2020), and Deep Evidential Regression (DER) (Amini et al., 2020). We distribution-distill the ensemble into a Regression Prior Network (EnD²) with a per-pixel Normal-Wishart distribution. We also examine training an RPN with the RKL loss function (NWPN). On the NYU dataset, we take KITTI as OOD training data, and on KITTI, we use NYU as OOD training data. We retrain all models 4 times with different random seeds and report the mean. Distribution-distillation is done with and without temperature annealing (T = 10.0 vs. T = 1.0). For T = 10.0, we train with the initial temperature for the first 20% of epochs, linearly decay it to 1.0 over the next 60%, and fine-tune with T = 1.0 for the remaining epochs. We found that this greatly stabilized training.
We report standard predictive performance metrics for depth estimation (Eigen et al., 2014); a detailed description can be found in appendix D. Results in table 3 show that all probabilistic models either outperform or work on par with the original DenseDepth on both datasets. EnD² and MD-EnD achieve performance closest to the ensemble, with EnD² doing marginally better than MD-EnD on KITTI. At the same time, both DER and NWPN achieve performance comparable to a single model, though NWPN does marginally worse on NYU. In terms of test-set negative log-likelihood, a metric of calibration, DER, NWPN and EnD² outperform all other approaches, including the ensemble. For EnD², this may be due to the ensemble being poorly modeled by a Normal-Wishart distribution, leading the Prior Network to overestimate the distribution's support and therefore yield less overconfident predictions. NWPN and DER may simply be well-regularized and smoothed. Finally, both EnD and MD-EnD are significantly worse in terms of calibration (NLL) on both datasets. EnD achieves both poor predictive and calibration performance, which is likely an artifact of trying to model a mixture of Gaussians with a single Gaussian. In table 4 we assess all models on the task of out-of-domain input detection. Two OOD test datasets are considered: LSUN-church (LSN-C) and LSUN-bed (LSN-B) (Yu et al., 2015), which consist of images of churches and bedrooms. The latter is most similar to NYU Depth v2 and more challenging to detect. OOD images are center-cropped and re-scaled to match the in-domain data. With regards to KITTI, which consists of outdoor images of roads and displays a large range of depth in each image, the OOD data appears far closer to the camera than the in-domain data as a result of the crop-and-scale preprocessing. Examples of this are provided in appendix D.4. The results show several trends.
EnD² consistently outperforms the original ensemble using measures of knowledge uncertainty (I, K and V[µ]). However, when considering measures of total uncertainty (H[E], V[y]), the ensemble tends to yield superior performance on NYU. This is likely due to EnD²'s over-estimation of the support of the ensemble's distribution. In contrast, EnD and MD-EnD perform worse than a single model, likely due to either failing to match the ensemble because of distributional mismatch (EnD), or having limited capacity to model the individual behaviour of each ensemble member (MD-EnD). EnD², which models the 'bulk' behaviour of the ensemble, suffers from neither issue, which highlights its importance. At the same time, while NWPN yields the best OOD-detection performance on KITTI, it fails to be robust on NYU. It is necessary to point out that we found training RPNs with RKL challenging in this setting, as it is non-trivial to define what OOD is, especially for depth estimation; in this paper, we consider a different dataset to be OOD. An ablation study with varying OOD weight (appendix D) shows a trade-off between predictive quality and OOD-detection quality: appropriate tuning yields the best OOD-detection performance (KITTI), and if the balance is incorrect, the performance is poor (NYU). We speculate that this is because depth estimation is a task sensitive to local features, while discriminating between datasets requires global features, so the two tasks interfere. This highlights the value of EnD², which does not require OOD training data or additional hyperparameters, and yields good predictive and OOD-detection performance. Interestingly, variance-based measures outperform information-theoretic measures of total uncertainty, and sometimes knowledge uncertainty, on NYU, but fail on KITTI. This is a result of their sensitivity to scale: the models predict very low depth values for OOD data, which therefore has lower entropy and variance.
In contrast, MI and EPKL, which are scale-invariant, are not affected. The issue of scale sensitivity is also the reason why Deep Evidential Regression (DER) completely fails on KITTI. As discussed in section 2, the evidence regularizer forces the model to yield low uncertainty in regions of low error. Therefore, when it observes data too close to the camera, DER tends to detect it as in-domain. Additional OOD-detection results are presented in appendix D.4. Lastly, figure 3 shows the error and the estimates of total and knowledge uncertainty of the ensemble and an EnD² model for the same input image. Both the ensemble and our model effectively decompose uncertainty. Total uncertainty is correlated with error and is large at object boundaries and distant points, while knowledge uncertainty concentrates on the interiors of unusual objects. EnD² yields both errors and uncertainty measures that are very similar to those of the original ensemble. This demonstrates that EnD² can emulate not only the predictive performance of the ensemble but also the behavior of the ensemble's measures of uncertainty. Further comparisons are provided in appendix D.

6. CONCLUSION

This work proposed Regression Prior Networks, yielding a set of general, efficient, and interpretable uncertainty estimation approaches for regression. A Regression Prior Network (RPN) predicts the parameters of a Normal-Wishart distribution, enabling it to efficiently represent ensembles of regression models and allowing interpretable measures of uncertainty to be obtained at low computational cost. In this work, closed-form measures of total, data and knowledge uncertainty are obtained for Normal-Wishart RPNs. Two RPN training approaches are proposed. First, the reverse KL-divergence between the model and a target Normal-Wishart distribution is described, allowing the behaviour of an RPN to be explicitly controlled, but requiring an OOD training dataset. Second, Ensemble Distribution Distillation (EnD²) is used, where an ensemble of regression models is distilled into an RPN such that it retains the improved predictive performance and uncertainty estimates of the original ensemble. This approach is particularly useful when it is challenging to define an appropriate out-of-domain training dataset, as in depth estimation. The properties of RPNs were evaluated on selected UCI datasets and two large-scale monocular depth-estimation tasks. Here, Ensemble Distribution Distilled RPNs, which do not need OOD training data, were shown to outperform other single-model and distillation approaches in terms of predictive performance, and all models in overall OOD-detection quality. This demonstrates its value as a computationally cheap, general-purpose uncertainty estimation approach for regression tasks.

A DERIVATIONS FOR NORMAL-WISHART PRIOR NETWORKS

This appendix provides the mathematical details of the Normal-Wishart distribution and the derivations of the reverse-KL divergence loss, Ensemble Distribution Distillation, and all uncertainty measures.

A.1 NORMAL-WISHART DISTRIBUTION

The Normal-Wishart distribution is a conjugate prior over the mean $\mu$ and precision $\Lambda$ of a multivariate normal distribution, defined as follows:

$$p(\mu, \Lambda|\Omega) = \mathcal{NW}(\mu, \Lambda|m, L, \kappa, \nu) = \mathcal{N}(\mu|m, \kappa\Lambda)\,\mathcal{W}(\Lambda|L, \nu),$$

where $\Omega = \{m, L, \kappa, \nu\}$ are the parameters predicted by the neural network, $\mathcal{N}$ is the density of the normal distribution and $\mathcal{W}$ is the density of the Wishart distribution. Here, $m$ and $L$ are the prior mean and the inverse of the positive-definite prior scatter matrix, while $\kappa$ and $\nu$ are the strengths of belief in each prior, respectively. The parameters $\kappa$ and $\nu$ are conceptually similar to the precision $\alpha_0$ of the Dirichlet distribution. The Normal-Wishart is a compound distribution which decomposes into the product of a conditional normal distribution over the mean and a Wishart distribution over the precision:

$$\mathcal{N}(\mu|m, \kappa\Lambda) = \frac{\kappa^{K/2}|\Lambda|^{1/2}}{(2\pi)^{K/2}} \exp\Big(\!-\frac{\kappa}{2}(\mu - m)^T \Lambda (\mu - m)\Big),$$

$$\mathcal{W}(\Lambda|L, \nu) = \frac{|\Lambda|^{\frac{\nu-K-1}{2}} \exp\big(\!-\frac{1}{2}\mathrm{Tr}(\Lambda L^{-1})\big)}{2^{\frac{\nu K}{2}}\, \Gamma_K(\frac{\nu}{2})\, |L|^{\frac{\nu}{2}}}, \qquad \Lambda, L \succ 0,\ \nu > K - 1,$$

where $\Gamma_K(\cdot)$ is the multivariate gamma function and $K$ is the dimensionality of $y$. From (Murphy, 2012), the posterior predictive of this model is the multivariate T-distribution:

$$p(y|x, \theta) = \mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[p(y|\mu, \Lambda)\big] = \mathcal{T}\Big(y\,\Big|\,m,\ \frac{\kappa+1}{\kappa(\nu - K + 1)} L^{-1},\ \nu - K + 1\Big).$$

The T-distribution is a heavy-tailed generalization of the multivariate normal distribution, defined as:

$$\mathcal{T}(y|\mu, \Sigma, \nu) = \frac{\Gamma(\frac{\nu+K}{2})}{\Gamma(\frac{\nu}{2})\,\nu^{K/2}\pi^{K/2}\,|\Sigma|^{1/2}} \Big(1 + \frac{1}{\nu}(y-\mu)^T\Sigma^{-1}(y-\mu)\Big)^{-\frac{\nu+K}{2}}, \qquad \nu > 0,$$

where $\nu$ is the number of degrees of freedom. However, the mean is defined only when $\nu > 1$ and the variance only when $\nu > 2$.
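As a numerical sanity check, the compound-distribution identity above can be verified by Monte Carlo: averaging normal densities over Normal-Wishart samples recovers the multivariate Student-T predictive. The sketch below uses SciPy with illustrative parameter values (`m`, `L`, `kappa`, `nu` are arbitrary choices, not values from the paper):

```python
import numpy as np
from scipy.stats import wishart, multivariate_t

rng = np.random.default_rng(0)
K = 2
# Illustrative Normal-Wishart parameters Omega = {m, L, kappa, nu}
m = np.zeros(K)
L = np.eye(K)                # Wishart scale (inverse prior scatter matrix)
kappa, nu = 3.0, 8.0

# Closed-form posterior predictive: a multivariate Student-T
nu_t = nu - K + 1
pred = multivariate_t(loc=m,
                      shape=(kappa + 1) / (kappa * nu_t) * np.linalg.inv(L),
                      df=nu_t)

# Monte Carlo: E_{NW}[ N(y | mu, Lambda^{-1}) ] at a fixed query point y
y = np.array([0.5, -0.3])
N = 100_000
Lams = wishart(df=nu, scale=L).rvs(size=N, random_state=rng)   # (N, K, K)
chol = np.linalg.cholesky(np.linalg.inv(kappa * Lams))
mus = m + np.einsum('nij,nj->ni', chol, rng.standard_normal((N, K)))
diff = y - mus
quad = np.einsum('ni,nij,nj->n', diff, Lams, diff)
logdet = np.linalg.slogdet(Lams)[1]
mc = np.mean(np.exp(0.5 * logdet - 0.5 * quad) / (2 * np.pi) ** (K / 2))
print(pred.pdf(y), mc)   # the two estimates should agree closely
```

The batched `einsum` and `cholesky` calls keep the sampling fully vectorized, which matters once the number of draws is large.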

A.2 REVERSE KL-DIVERGENCE TRAINING OBJECTIVE

Now let us consider in greater detail the reverse KL-divergence training objective (11):

$$\mathcal{L}(y, x, \theta; \beta, \Omega_0) = \beta \cdot \mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[-\ln p(y|\mu, \Lambda)\big] + \mathrm{KL}\big[p(\mu, \Lambda|x, \theta)\,\|\,p(\mu, \Lambda|\Omega_0)\big] + Z, \qquad (22)$$

where $\Omega_0 = \{m_0, L_0, \kappa_0, \nu_0\}$ are prior parameters that we set manually, as discussed in section 2. It is necessary to show why the reverse KL-divergence objective yields the correct level of data uncertainty. Let us consider taking the expectation of the first term in (11) with respect to the true distribution of targets $p_{tr}(y|x)$. By exchanging the order of expectation, we can trivially show that we are optimizing the expected cross-entropy between samples from the Normal-Wishart and the true distribution:

$$\mathbb{E}_{p_{tr}(y|x)}\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[-\ln p(y|\mu, \Lambda)\big] = \mathbb{E}_{p(\mu,\Lambda|x,\theta)}\mathbb{E}_{p_{tr}(y|x)}\big[-\ln p(y|\mu, \Lambda)\big].$$

This yields an upper bound on the cross-entropy between the predictive posterior and the true distribution. Had we instead considered the forward KL-divergence between Normal-Wishart distributions, we would not obtain such an expression and would not correctly estimate data uncertainty. Interestingly, the reverse KL-divergence training objective has the same form as an ELBO: a predictive term plus a reverse KL-divergence to the prior. Having established this important property of the RKL objective, we now derive its closed-form expression. Note that in these derivations we make extensive use of the properties for taking expectations of traces and log-determinants of matrices with respect to the Wishart distribution detailed in (Gupta & Srivastava, 2010). For the first term in (22), we use the following property of the multivariate normal: if $x \sim \mathcal{N}(\mu, \Sigma)$, then $\mathbb{E}[x^T A x] = \mathrm{Tr}(A\Sigma) + \mu^T A \mu$, which allows us to obtain:

$$\begin{aligned}
\mathbb{E}_{p(\mu,\Lambda|x;\theta)}\big[-\ln p(y|\mu, \Lambda)\big]
&= \frac{1}{2}\,\mathbb{E}_{\mathcal{N}(\mu|m,\kappa\Lambda)\mathcal{W}(\Lambda|L,\nu)}\big[(y-\mu)^T\Lambda(y-\mu) + K\ln(2\pi) - \ln|\Lambda|\big] \\
&= \frac{1}{2}\,\mathbb{E}_{\mathcal{W}(\Lambda|L,\nu)}\big[(y-m)^T\Lambda(y-m) + K\kappa^{-1} + K\ln(2\pi) - \ln|\Lambda|\big] \\
&= \frac{\nu}{2}(y-m)^T L\,(y-m) + \frac{K}{2\kappa} - \frac{1}{2}\ln|L| - \frac{1}{2}\psi_K\Big(\frac{\nu}{2}\Big) + \frac{K}{2}\ln\pi.
\end{aligned}$$
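The closed-form expression for the expected negative log-likelihood can be checked by Monte Carlo in the same way. A minimal sketch with illustrative parameter values, where the multivariate digamma $\psi_K$ is assembled from scalar digammas:

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import psi

def multi_digamma(a, K):
    # multivariate digamma: psi_K(a) = sum_{j=1}^{K} psi(a + (1 - j) / 2)
    return sum(psi(a + (1 - j) / 2) for j in range(1, K + 1))

rng = np.random.default_rng(1)
K = 2
m, L = np.zeros(K), 0.7 * np.eye(K)    # illustrative parameters
kappa, nu = 2.0, 6.0
y = np.array([1.0, -0.5])

# Closed form for E_{p(mu, Lambda | x, theta)}[-ln p(y | mu, Lambda)]
d = y - m
closed = (nu / 2 * d @ L @ d + K / (2 * kappa)
          - 0.5 * np.linalg.slogdet(L)[1]
          - 0.5 * multi_digamma(nu / 2, K)
          + K / 2 * np.log(np.pi))

# Monte Carlo estimate of the same expectation
N = 100_000
Lams = wishart(df=nu, scale=L).rvs(size=N, random_state=rng)
chol = np.linalg.cholesky(np.linalg.inv(kappa * Lams))
mus = m + np.einsum('nij,nj->ni', chol, rng.standard_normal((N, K)))
diff = y - mus
quad = np.einsum('ni,nij,nj->n', diff, Lams, diff)
mc = np.mean(0.5 * (quad + K * np.log(2 * np.pi) - np.linalg.slogdet(Lams)[1]))
print(closed, mc)   # should agree up to Monte Carlo error
```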
The second term in (22) may be expressed as follows via the chain rule of relative entropy (Cover & Thomas, 2006):

$$\mathrm{KL}\big[p(\mu, \Lambda|\Omega)\,\|\,p(\mu, \Lambda|\Omega_0)\big] = \mathbb{E}_{p(\Lambda|\Omega)}\,\mathrm{KL}\big[p(\mu|\Lambda, \Omega)\,\|\,p(\mu|\Lambda, \Omega_0)\big] + \mathrm{KL}\big[p(\Lambda|\Omega)\,\|\,p(\Lambda|\Omega_0)\big]. \qquad (26)$$

The first term in (26) can be computed as:

$$\mathbb{E}_{p(\Lambda|\Omega)}\,\mathrm{KL}\big[p(\mu|\Lambda, \Omega)\,\|\,p(\mu|\Lambda, \Omega_0)\big] = \mathbb{E}_{\mathcal{W}(\Lambda|L,\nu)}\,\mathrm{KL}\big[\mathcal{N}(\mu|m, \kappa\Lambda)\,\|\,\mathcal{N}(\mu|m_0, \kappa_0\Lambda)\big] = \frac{\kappa_0}{2}(m - m_0)^T \nu L\,(m - m_0) + \frac{K}{2}\Big(\frac{\kappa_0}{\kappa} - \ln\frac{\kappa_0}{\kappa} - 1\Big),$$

while the second term is:

$$\mathrm{KL}\big[\mathcal{W}(\Lambda|L, \nu)\,\|\,\mathcal{W}(\Lambda|L_0, \nu_0)\big] = \frac{\nu}{2}\big(\mathrm{tr}(L_0^{-1}L) - K\big) - \frac{\nu_0}{2}\ln\big|L_0^{-1}L\big| + \ln\frac{\Gamma_K(\frac{\nu_0}{2})}{\Gamma_K(\frac{\nu}{2})} + \frac{\nu - \nu_0}{2}\psi_K\Big(\frac{\nu}{2}\Big).$$

A.3 UNCERTAINTY MEASURES

Given a Normal-Wishart Prior Network which displays the desired set of behaviours detailed in section 2, it is possible to compute closed-form expressions for all measures of uncertainty previously discussed for Dirichlet Prior Networks (Malinin, 2019). The current section details the derivations of the uncertainty measures introduced in section 2 for the Normal-Wishart distribution. We make extensive use of (Gupta & Srivastava, 2010) for taking expectations of log-determinants and traces of matrices.

A.3.1 DIFFERENTIAL ENTROPY OF PREDICTIVE POSTERIOR

As discussed in section 2, the predictive posterior of a Prior Network which parameterizes a Normal-Wishart distribution is a multivariate T-distribution:

$$\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[p(y|\mu, \Lambda)\big] = \mathcal{T}\Big(y\,\Big|\,m,\ \frac{\kappa+1}{\kappa(\nu - K + 1)} L^{-1},\ \nu - K + 1\Big).$$

The differential entropy of the predictive posterior is a measure of total uncertainty. The differential entropy of a standard multivariate Student's T-distribution with an identity scatter matrix $\Sigma = I$ is given by:

$$\mathcal{H}\big[\mathcal{T}(x|\mu, I, \nu)\big] = -\ln\frac{\Gamma(\frac{\nu+K}{2})}{\Gamma(\frac{\nu}{2})(\nu\pi)^{K/2}} + \frac{\nu + K}{2}\Big(\psi\Big(\frac{\nu+K}{2}\Big) - \psi\Big(\frac{\nu}{2}\Big)\Big),$$

a result obtained from (Arellano-Valle et al., 2013). Using the property of differential entropy (Cover & Thomas, 2006) that if $x \sim p(x)$ and $y = \mu + Ax$, then $\mathcal{H}[p(y)] = \mathcal{H}[p(x)] + \ln|A|$, we can show that the differential entropy of a general multivariate Student's T-distribution is given by:

$$\mathcal{H}\big[\mathcal{T}(x|\mu, \Sigma, \nu)\big] = \frac{1}{2}\ln|\Sigma| - \ln\frac{\Gamma(\frac{\nu+K}{2})}{\Gamma(\frac{\nu}{2})(\nu\pi)^{K/2}} + \frac{\nu + K}{2}\Big(\psi\Big(\frac{\nu+K}{2}\Big) - \psi\Big(\frac{\nu}{2}\Big)\Big).$$

Using this expression, the differential entropy of the predictive posterior of a Normal-Wishart Prior Network is given by:

$$\begin{aligned}
\mathcal{H}\Big[\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[p(y|\mu, \Lambda)\big]\Big] &= \mathcal{H}\Big[\mathcal{T}\Big(y\,\Big|\,m,\ \frac{\kappa+1}{\kappa(\nu - K + 1)} L^{-1},\ \nu - K + 1\Big)\Big] \\
&= \frac{\nu + 1}{2}\Big(\psi\Big(\frac{\nu+1}{2}\Big) - \psi\Big(\frac{\nu-K+1}{2}\Big)\Big) - \ln\frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu-K+1}{2})\big((\nu-K+1)\pi\big)^{K/2}} \\
&\quad - \frac{1}{2}\ln|L| + \frac{K}{2}\ln\frac{\kappa+1}{\kappa(\nu - K + 1)}. \qquad (33)
\end{aligned}$$
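As a sketch with illustrative parameters, the closed-form entropy (33) can be compared against a Monte Carlo estimate of $-\mathbb{E}[\ln p(y)]$ under the predictive itself:

```python
import numpy as np
from scipy.stats import multivariate_t
from scipy.special import gammaln, psi

K = 2
m, L = np.zeros(K), np.eye(K)   # illustrative parameters
kappa, nu = 3.0, 8.0
nu_t = nu - K + 1
coef = (kappa + 1) / (kappa * nu_t)

# Closed-form differential entropy of the predictive posterior, eq. (33)
H = ((nu + 1) / 2 * (psi((nu + 1) / 2) - psi(nu_t / 2))
     - (gammaln((nu + 1) / 2) - gammaln(nu_t / 2) - K / 2 * np.log(nu_t * np.pi))
     - 0.5 * np.linalg.slogdet(L)[1]
     + K / 2 * np.log(coef))

# Monte Carlo: entropy = mean of -log p(y) over samples from the predictive
pred = multivariate_t(loc=m, shape=coef * np.linalg.inv(L), df=nu_t)
samples = pred.rvs(size=200_000, random_state=0)
mc = -pred.logpdf(samples).mean()
print(H, mc)   # should agree up to Monte Carlo error
```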

A.3.2 MUTUAL INFORMATION

The mutual information between the target $y$ and the parameters of the output distribution $\{\mu, \Lambda\}$ is a measure of knowledge uncertainty; it is the difference between the (differential) entropy of the predictive posterior and the expected differential entropy of each normal distribution sampled from the Normal-Wishart:

$$\underbrace{\mathcal{I}\big[y, \{\mu, \Lambda\}\big]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\Big[\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[p(y|\mu, \Lambda)\big]\Big]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\Big[\mathcal{H}\big[p(y|\mu, \Lambda)\big]\Big]}_{\text{Expected Data Uncertainty}}.$$

The first term, the differential entropy of the predictive posterior, was derived above in (33). We derive the expected differential entropy as follows:

$$\mathbb{E}_{\mathcal{NW}(\mu,\Lambda|\Omega)}\Big[\mathcal{H}\big[\mathcal{N}(y|\mu, \Lambda)\big]\Big] = \frac{1}{2}\,\mathbb{E}_{\mathcal{NW}(\mu,\Lambda|\Omega)}\big[K\ln(2\pi e) - \ln|\Lambda|\big] = \frac{1}{2}\Big(K\ln(\pi e) - \ln|L| - \psi_K\Big(\frac{\nu}{2}\Big)\Big).$$

Thus, the final expression for the mutual information is:

$$\mathcal{I}\big[y, \{\mu, \Lambda\}\big] = \frac{\nu + 1}{2}\Big(\psi\Big(\frac{\nu+1}{2}\Big) - \psi\Big(\frac{\nu-K+1}{2}\Big)\Big) - \ln\frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu-K+1}{2})\big((\nu-K+1)\pi\big)^{K/2}} + \frac{K}{2}\ln\frac{\kappa+1}{\kappa(\nu - K + 1)} - \frac{1}{2}\Big(K\ln(\pi e) - \psi_K\Big(\frac{\nu}{2}\Big)\Big). \qquad (36)$$

Note that this expression is no longer a function of $L$, which was important for representing data uncertainty.
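A small sketch makes the $L$-independence of (36) concrete; the helper below implements the closed form with illustrative parameter values:

```python
import numpy as np
from scipy.special import gammaln, psi

def multi_digamma(a, K):
    # multivariate digamma: psi_K(a) = sum_{j=1}^{K} psi(a + (1 - j) / 2)
    return sum(psi(a + (1 - j) / 2) for j in range(1, K + 1))

def nw_mutual_information(L, kappa, nu):
    """I[y, {mu, Lambda}] for a Normal-Wishart with parameters (L, kappa, nu)."""
    K = L.shape[0]
    nu_t = nu - K + 1
    # Total uncertainty: differential entropy of the Student-T predictive
    total = ((nu + 1) / 2 * (psi((nu + 1) / 2) - psi(nu_t / 2))
             - (gammaln((nu + 1) / 2) - gammaln(nu_t / 2)
                - K / 2 * np.log(nu_t * np.pi))
             - 0.5 * np.linalg.slogdet(L)[1]
             + K / 2 * np.log((kappa + 1) / (kappa * nu_t)))
    # Expected data uncertainty: expected entropy of the sampled Normals
    exp_data = 0.5 * (K * np.log(np.pi * np.e)
                      - np.linalg.slogdet(L)[1]
                      - multi_digamma(nu / 2, K))
    return total - exp_data

mi_a = nw_mutual_information(np.eye(2), kappa=2.0, nu=8.0)
mi_b = nw_mutual_information(5.0 * np.eye(2), kappa=2.0, nu=8.0)
print(mi_a, mi_b)   # identical: the L-dependence cancels
```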

A.3.3 EXPECTED PAIRWISE KL-DIVERGENCE

An alternative measure of knowledge uncertainty which can be considered is the expected pairwise KL-divergence (EPKL), which upper bounds the mutual information (Malinin, 2019). In this section we derive its closed-form expression for the Normal-Wishart distribution:

$$\mathcal{K}\big[y, \{\mu, \Lambda\}\big] = \mathbb{E}_{p(\mu_0,\Lambda_0)}\mathbb{E}_{p(\mu_1,\Lambda_1)}\,\mathrm{KL}\big[\mathcal{N}(y|\mu_1, \Lambda_1)\,\|\,\mathcal{N}(y|\mu_0, \Lambda_0)\big] = \frac{1}{2}\,\mathbb{E}_{p(\mu_0,\Lambda_0)}\mathbb{E}_{p(\mu_1,\Lambda_1)}\Big[(\mu_1 - \mu_0)^T\Lambda_0(\mu_1 - \mu_0) + \ln\frac{|\Lambda_1|}{|\Lambda_0|} + \mathrm{Tr}(\Lambda_0\Lambda_1^{-1}) - K\Big], \qquad (37)$$

where $p(\mu_0, \Lambda_0) = p(\mu_1, \Lambda_1) = p(\mu, \Lambda|x; \theta)$. The first term in (37) is:

$$\begin{aligned}
\mathbb{E}_{p(\mu_0,\Lambda_0)}\mathbb{E}_{p(\mu_1,\Lambda_1)}\big[(\mu_1 - \mu_0)^T\Lambda_0(\mu_1 - \mu_0)\big]
&= \mathbb{E}_{p(\mu_0,\Lambda_0)}\mathbb{E}_{p(\Lambda_1)}\Big[(m - \mu_0)^T\Lambda_0(m - \mu_0) + \mathrm{Tr}\Big(\Lambda_0\frac{1}{\kappa}\Lambda_1^{-1}\Big)\Big] \\
&= \mathbb{E}_{p(\mu_0,\Lambda_0)}\Big[(m - \mu_0)^T\Lambda_0(m - \mu_0) + \frac{1}{\kappa(\nu - K - 1)}\mathrm{Tr}(\Lambda_0 L^{-1})\Big] \\
&= \frac{K}{\kappa} + \frac{\nu K}{\kappa(\nu - K - 1)}.
\end{aligned}$$

The second term in (37) is zero, and the third term is:

$$\mathbb{E}_{p(\mu_0,\Lambda_0)}\mathbb{E}_{p(\mu_1,\Lambda_1)}\big[\mathrm{Tr}(\Lambda_0\Lambda_1^{-1})\big] = \mathbb{E}_{p(\mu_0,\Lambda_0)}\Big[\mathrm{Tr}\Big(\Lambda_0\frac{1}{\nu - K - 1}L^{-1}\Big)\Big] = \frac{\nu K}{\nu - K - 1},$$

which in sum gives us:

$$\mathcal{K}\big[y, \{\mu, \Lambda\}\big] = \frac{1}{2}\cdot\frac{\nu K(\kappa^{-1} + 1)}{\nu - K - 1} - \frac{K}{2} + \frac{K}{2\kappa}.$$

Note that, like the mutual information, this is not a function of $L$; rather, it is a function only of the pseudo-counts $\kappa$ and $\nu$.
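As before, the closed form can be sanity-checked by Monte Carlo, sampling two independent copies of $(\mu, \Lambda)$ and averaging the pairwise Gaussian KL. The parameter values below are illustrative; note that $\nu > K + 1$ is required for the expectations to exist:

```python
import numpy as np
from scipy.stats import wishart

rng_a, rng_b = np.random.default_rng(0), np.random.default_rng(1)
K = 2
m, L = np.zeros(K), np.eye(K)
kappa, nu = 2.0, 8.0          # nu > K + 1 so the inverse-Wishart mean exists

# Closed-form EPKL
closed = (0.5 * nu * K * (1 / kappa + 1) / (nu - K - 1)
          - K / 2 + K / (2 * kappa))

def sample_nw(n, rng):
    Lams = wishart(df=nu, scale=L).rvs(size=n, random_state=rng)
    chol = np.linalg.cholesky(np.linalg.inv(kappa * Lams))
    mus = m + np.einsum('nij,nj->ni', chol, rng.standard_normal((n, K)))
    return mus, Lams

# Monte Carlo: average pairwise KL between sampled Normals
N = 100_000
mu1, Lam1 = sample_nw(N, rng_a)
mu0, Lam0 = sample_nw(N, rng_b)
dm = mu1 - mu0
quad = np.einsum('ni,nij,nj->n', dm, Lam0, dm)
logdet = np.linalg.slogdet(Lam1)[1] - np.linalg.slogdet(Lam0)[1]
tr = np.einsum('nij,nji->n', Lam0, np.linalg.inv(Lam1))   # Tr(Lam0 @ Lam1^-1)
mc = 0.5 * np.mean(quad + logdet + tr - K)
print(closed, mc)   # should agree up to Monte Carlo error
```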

A.3.4 LAW OF TOTAL VARIATION

Finally, in order to compare with ensembles, we can also derive variance-based measures of total, data and knowledge uncertainty via the law of total variance, as follows:

$$\underbrace{\mathbb{V}_{p(\mu,\Lambda|x,\theta)}[\mu]}_{\text{Knowledge Uncertainty}} = \underbrace{\mathbb{V}_{p(y|x,\theta)}[y]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[\Lambda^{-1}\big]}_{\text{Expected Data Uncertainty}}.$$

This has a decomposition similar to that of the mutual information. In this section we derive its closed-form expression. We can compute the expected variance by using a probabilistic change of variables:

$$\mathbb{E}_{p(\mu,\Lambda|x,\theta)}\big[\Lambda^{-1}\big] = \mathbb{E}_{\mathcal{W}(\Lambda|L,\nu)}\big[\Lambda^{-1}\big] = \mathbb{E}_{\mathcal{W}^{-1}(\Lambda^{-1})}\big[\Lambda^{-1}\big] = \frac{1}{\nu - K - 1}L^{-1}, \qquad (42)$$

and the variance of the expected mean as:

$$\mathbb{V}_{p(\mu,\Lambda|x,\theta)}[\mu] = \mathbb{E}_{\mathcal{NW}(\mu,\Lambda)}\big[(\mu - m)(\mu - m)^T\big] = \frac{1}{\kappa}\mathbb{E}_{\mathcal{W}(\Lambda|L,\nu)}\big[\Lambda^{-1}\big] = \frac{1}{\kappa(\nu - K - 1)}L^{-1}.$$

Thus, the total variance is expressed as:

$$\mathbb{V}_{p(y|x,\theta)}[y] = \frac{1 + \kappa}{\kappa(\nu - K - 1)}L^{-1}.$$

Note that this yields a measure which considers only the first and second moments. In addition, in order to obtain a scalar estimate of uncertainty, it is necessary to consider the log-determinant of each measure.
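The decomposition can be verified numerically: the empirical covariance of samples drawn from the generative process should match the closed-form total variance. A sketch with illustrative parameters:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(2)
K = 2
m, L = np.zeros(K), np.eye(K)   # illustrative parameters
kappa, nu = 2.0, 10.0

exp_data = np.linalg.inv(L) / (nu - K - 1)              # E[Lambda^{-1}]
knowledge = np.linalg.inv(L) / (kappa * (nu - K - 1))   # V[mu]
total = (1 + kappa) / (kappa * (nu - K - 1)) * np.linalg.inv(L)

# Monte Carlo check: sample (mu, Lambda) ~ NW, then y ~ N(mu, Lambda^{-1})
N = 200_000
Lams = wishart(df=nu, scale=L).rvs(size=N, random_state=rng)
chol_mu = np.linalg.cholesky(np.linalg.inv(kappa * Lams))
mus = m + np.einsum('nij,nj->ni', chol_mu, rng.standard_normal((N, K)))
chol_y = np.linalg.cholesky(np.linalg.inv(Lams))
ys = mus + np.einsum('nij,nj->ni', chol_y, rng.standard_normal((N, K)))
print(total)
print(np.cov(ys.T))   # empirical covariance should match `total`
```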

B EXPERIMENT ON SYNTHETIC DATA

The training data consists of 2048 inputs x uniformly sampled from [-10, 10], with targets y ∼ N(sin x + x/10, 1/(|x|+1) + 0.01). We use a ReLU network with 2 hidden layers of 30 units each to predict the parameters of either a Gaussian or a Normal-Wishart distribution on this data. In all cases, we use the Adam optimizer with learning rate 10^-2 and weight decay 10^-4 for 800 epochs with batch size 128. The Gaussian models in the ensemble are trained via negative log-likelihood, starting from different random initializations. To train a Regression Prior Network with the reverse-KL divergence, 512 points were uniformly sampled from [-25, -20] ∪ [20, 25] as training OOD data. We use objective (8) with coefficient γ = 0.5. The prior belief is κ_0 = 10^-2 and the in-domain β is 10^2. For EnD 2 training, we set T = 1 and add Gaussian noise with standard deviation 3 to the inputs.
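For concreteness, the sketch below generates the synthetic data described above and shows one possible way to map unconstrained network outputs to valid Normal-Wishart parameters. The softplus transforms and offsets are illustrative assumptions, not necessarily the exact parameterization used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: x ~ U[-10, 10], y ~ N(sin x + x/10, 1/(|x|+1) + 0.01)
x = rng.uniform(-10, 10, size=2048)
y = rng.normal(np.sin(x) + x / 10, np.sqrt(1 / (np.abs(x) + 1) + 0.01))

def softplus(z):
    # numerically stable softplus
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0)

def nw_head(raw, K=1):
    """Map raw final-layer outputs to valid NW parameters (m, diag L, kappa, nu)."""
    m = raw[..., :K]                                      # unconstrained mean
    L_diag = softplus(raw[..., K:2 * K]) + 1e-6           # diagonal L positive definite
    kappa = softplus(raw[..., 2 * K]) + 1e-6              # kappa > 0
    nu = softplus(raw[..., 2 * K + 1]) + (K - 1) + 1e-6   # nu > K - 1
    return m, L_diag, kappa, nu

m_, L_, k_, n_ = nw_head(rng.standard_normal((5, 4)))
print(L_.min() > 0, k_.min() > 0, n_.min() > 0)   # all constraints hold
```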

C UCI EXPERIMENTS

The current appendix provides additional details of the experiments on the UCI regression datasets. Note that we leave out the Yacht Hydrodynamics dataset, as it is the smallest and has the fewest features. The remaining datasets are described in the table below. The current section provides a full set of predictive performance, error detection, and OOD detection results on the UCI datasets in tables 6-8, respectively. The results in table 6 show that all models achieve comparable performance and that EnD 2 tends to come close to the ensemble. Table 7 shows the error detection performance of all models in terms of the prediction-rejection ratio (PRR). The results clearly show that measures of total uncertainty are useful for detecting errors, though this is more challenging on some datasets. At the same time, measures of knowledge uncertainty do significantly worse. Finally, table 8 shows the OOD detection performance in terms of % AUC-ROC. Here measures of knowledge uncertainty do far better. Notably, on the larger datasets, EnD 2 comes closer to the performance of the ensemble.

D.1 DEPTH ESTIMATION METRICS

In this section we give a description of the metrics in table 3, which are commonly used in prior work on monocular depth estimation (Eigen et al., 2014; Fu et al., 2018; Alhashim & Wonka, 2018). Let $y$ be the target depth map for a particular image, $\hat{y}$ the predicted depth map, $\hat{y}_i$ the prediction for the $i$-th pixel, and $I$ the set of all pixels. Then the metrics for an individual image are defined as follows:

$$\delta_k:\ \%\ \text{of}\ \hat{y}_i\ \text{s.t.}\ \max\Big(\frac{\hat{y}_i}{y_i}, \frac{y_i}{\hat{y}_i}\Big) < 1.25^k; \qquad rel = \frac{1}{|I|}\sum_{i \in I}\frac{|\hat{y}_i - y_i|}{y_i}; \qquad \log_{10} = \frac{1}{|I|}\sum_{i \in I}\big|\log_{10}\hat{y}_i - \log_{10}y_i\big|. \qquad (45)$$

Metrics for the whole dataset are obtained by averaging the individual metrics. They allow us to assess different properties of the model: the $\delta_k$ thresholds give the fraction of predictions that fall within a multiplicative factor of the target, $rel$ shows the ratio between the prediction error and the target, and $\log_{10}$ measures the error in log-space, which is less sensitive to outliers.
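The per-image metrics in (45) can be sketched directly; `depth_metrics` is an illustrative helper, and the masking of invalid pixels is omitted:

```python
import numpy as np

def depth_metrics(y, y_hat):
    """Standard monocular-depth metrics over one image's pixels."""
    ratio = np.maximum(y_hat / y, y / y_hat)
    deltas = [float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)]
    rel = float(np.mean(np.abs(y_hat - y) / y))
    log10 = float(np.mean(np.abs(np.log10(y_hat) - np.log10(y))))
    return deltas, rel, log10

y = np.array([1.0, 2.0, 4.0, 8.0])
print(depth_metrics(y, y))         # perfect prediction
print(depth_metrics(y, y * 1.3))   # 30% over-prediction falls outside delta_1
```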

D.2 CALIBRATION

KDE estimates of the histograms of NLLs are provided in the figure below. We can see that the EnD 2 , DER, and NWPN models all yield far more consistent NLLs than their Gaussian counterparts, which have a long tail of outliers. This is likely because the Student's T-distribution has heavier tails than the Gaussian, allocating greater probability mass farther from the mean and allowing the model to be less confident about its predictions, especially on outliers. We also see that the DER likelihoods are shifted slightly to the right, which highlights the bias introduced by its evidence regularizer.

D.3 ADDITIONAL DEPTH ESTIMATION EXPERIMENTS

When training RPNs with the RKL objective, we observed the optimization trajectory to be very unstable and sensitive to initialization. To combat this, we linearly increase γ from 0 to a predefined value during



Figure 1: Desired behaviors of an ensemble of regression models. The bottom row displays the desired Normal-Wishart distribution and the top row depicts Normal distributions sampled from it.

Figure 2: Comparison of different models on synthetic data y ∼ N(sin x + x/10, 1/(|x|+1) + 0.01). The gray area indicates the training data region.

Figure 3: Comparison of uncertainty measures between ensembles and EnD 2 . For two input images, we show the difference between prediction and ground truth (Error), and the measures of Total Variance (Total) and EPKL (Knowledge) obtained from the ensembles and from our model. The left and right images correspond to models trained on the KITTI and NYU datasets, respectively.

Figure 4: Prediction-rejection curves.

Figure 7: Examples of test inputs for the KITTI model. Images are, in order: KITTI, LSUN-bed, LSUN-church, NYU. OOD images are center-cropped and rescaled to match the in-domain data, preserving the aspect ratio.

Figure 8: Uncurated comparison of ENSM vs EnD 2 behaviour on the NYUv2 dataset (best viewed in color).

Figure 9: Uncurated comparison of ENSM vs EnD 2 models trained on NYUv2, evaluated on the KITTI and LSUN-bed datasets (best viewed in color).

Figure 10: Uncurated comparison of ENSM vs EnD 2 behaviour on the KITTI dataset (best viewed in color).

Figure 11: Uncurated comparison of ENSM vs EnD 2 models trained on KITTI, evaluated on the NYU and LSUN-bed datasets (best viewed in color).

RMSE and NLL of models on UCI datasets. Datasets are listed in order of increasing size. Results on the remaining UCI datasets are available in appendix C.

We made sure that the OOD data comes from different domains and that the feature distributions differ. The columns of each OOD dataset are normalized using statistics derived from the in-domain training dataset. Details are available in appendix C.

PRR and OOD detection scores

Predictive Performance comparison

OOD detection % AUC-ROC (↑) comparison

Description of UCI datasets.

Following (Lakshminarayanan et al., 2017), in all experiments except MSD we use a 1-layer ReLU neural network with 50 hidden units; for MSD we use 100 hidden units. We optimize the weights with Adam for 100 epochs with batch size 32. All hyper-parameters, including the learning rate, weight decay, RKL prior belief in the training data κ 0 , RKL OOD coefficient γ, EnD 2 initial temperature T, and noise level ε, are set based on a search, where we use an equal computational budget for all models to ensure a fair comparison. Additionally, we use 10-fold cross-validation and report the standard deviation based on it.

To estimate the quality of out-of-domain detection, we additionally create evaluation out-of-domain data from external UCI datasets: "Relative location of CT slices on axial axis Data Set" for MSD and "Condition Based Maintenance of Naval Propulsion Plants Data Set" for the other datasets. We drop all constant columns and keep the first K columns and first N rows, where K is the number of features and N is the number of test examples in the corresponding dataset. For each comparison, the out-of-domain datasets are normalized by the per-column mean and variance obtained on the in-domain training data, in order to make the out-of-domain detection task more difficult.

The Uncertainty curve corresponds to rejection in order of decreasing uncertainty under a given model, while the Oracle curve corresponds to the best possible rejection order. The Prediction Rejection Ratio is the ratio of the area between the Uncertainty and Random curves, AR uncertainty (orange in Figure 4), to the area between the Oracle and Random curves, AR oracle (blue in Figure 4):

$$PRR = \frac{AR_{uncertainty}}{AR_{oracle}}.$$
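Under one common convention, where the retained error mass is normalized by the total number of points so that the Random curve decays linearly to zero, PRR can be sketched as follows. This is an illustrative implementation, not necessarily the paper's exact evaluation code:

```python
import numpy as np

def rejection_curve(errors, order):
    """Retained error mass after rejecting the first j points of `order`, j = 0..N."""
    e = errors[np.asarray(order)]
    tail = np.concatenate([np.cumsum(e[::-1])[::-1], [0.0]])
    return tail / len(e)

def prediction_rejection_ratio(errors, uncertainty):
    N = len(errors)
    unc = rejection_curve(errors, np.argsort(-uncertainty))  # reject most uncertain first
    orc = rejection_curve(errors, np.argsort(-errors))       # best possible order
    rnd = errors.mean() * (N - np.arange(N + 1)) / N         # random rejection baseline
    return np.sum(rnd - unc) / np.sum(rnd - orc)

rng = np.random.default_rng(0)
errors = rng.gamma(2.0, size=1000)
print(prediction_rejection_ratio(errors, errors))                     # oracle ordering -> 1.0
print(prediction_rejection_ratio(errors, rng.standard_normal(1000)))  # uninformative -> near 0
```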

Prediction performance metrics of models on six UCI datasets.

PRR scores on all six UCI datasets.

OOD Detection (ROC-AUC) of models on UCI datasets.


the first five epochs, which allows our models to concentrate initially on predictive performance and then gradually capture the properties of "in-domain" samples. Additionally, we performed an ablation study across different coefficients γ, with the results provided in tables 9 and 10. On the NYU dataset, we see that models with lower γ improve predictive performance at the cost of decreased OOD detection quality. This may indicate that the task of accurate prediction does not align well with the model's ability to detect unfamiliar samples. Based on this, we chose the coefficient γ = 0.05 as achieving the best trade-off, and then fine-tuned the respective model for 10 additional epochs until convergence.

D.4 ADDITIONAL OOD DETECTION EXPERIMENTS

We also provide additional OOD detection results, where models trained on NYU detect KITTI as OOD data, and vice versa. Note that we do not evaluate NWPN in this scenario, as these two datasets constitute its training data. The results generally follow the same trends as those outlined in the main paper. Additionally, in the following two figures we provide examples of the in-domain and OOD data as they appear after pre-processing. The figures clearly show that, relative to NYU Depth, all images are either at a comparable depth or slightly farther. However, relative to KITTI, all OOD images are much closer to the camera, both naturally and as exacerbated by the crop-and-scale operation.

