SHOULD ENSEMBLE MEMBERS BE CALIBRATED?

Abstract

Underlying the use of statistical approaches for a wide range of applications is the assumption that the probabilities obtained from a statistical model are representative of the "true" probability that an event, or outcome, will occur. Unfortunately, for modern deep neural networks this is often not the case: they are frequently observed to be poorly calibrated. Additionally, these deep learning approaches make use of large numbers of model parameters, motivating the use of Bayesian, or ensemble approximation, approaches to handle issues with parameter estimation. This paper explores the application of calibration schemes to deep ensembles, both from a theoretical perspective and empirically on a standard image classification task, CIFAR-100. The underlying theoretical requirements for calibration, and the associated calibration criteria, are first described. It is shown that well calibrated ensemble members will not necessarily yield a well calibrated ensemble prediction, and that if the ensemble prediction is well calibrated its performance cannot exceed the average performance of the calibrated ensemble members. On CIFAR-100 the impact of calibration on the ensemble prediction is evaluated. Additionally, the situation where multiple different topologies are combined together is discussed.

1. INTRODUCTION

Deep learning approaches achieve state-of-the-art performance in a wide range of applications, including image classification. However, these networks tend to be overconfident in their predictions: they often exhibit poor calibration. A system is well calibrated if, when it makes a prediction with probability 0.6, that prediction is correct 60% of the time. Calibration is very important when deploying systems, especially in risk-sensitive tasks such as medicine (Jiang et al., 2012), auto-driving (Bojarski et al., 2016), and economics (Gneiting et al., 2007). It was shown by Niculescu-Mizil & Caruana (2005) that shallow neural networks are well calibrated. However, Guo et al. (2017) found that more complex neural network models with deep structures do not exhibit the same behaviour. This work motivated recent research into calibration for general deep learning systems. Previous research has mainly examined calibration based on samples from the true data distribution, {x^(i), y^(i)}_{i=1}^N ∼ p(x, ω), y^(i) ∈ {ω_1, ..., ω_K} (Zadrozny & Elkan, 2002; Vaicenavicius et al., 2019). This analysis relies on the limiting behaviour as N → +∞ to define a well calibrated system:

P(y = ŷ | P(ŷ|x; θ) = p) = p  ⟺  lim_{N→+∞} ( Σ_{i∈S_j^p} δ(y^(i), ŷ^(i)) ) / |S_j^p| = p    (1)

where S_j^p = {i | P(ŷ^(i) = ω_j | x^(i); θ) = p, i = 1, ..., N}, ŷ^(i) is the model prediction for x^(i), and δ(s, t) = 1 if s = t and 0 otherwise. However, Eq. (1) does not explicitly reflect the relation between P(y = ŷ | P(ŷ|x; θ) = p) and the underlying data distribution p(x, y). In this work we examine this explicit relationship and use it to define a range of calibration evaluation criteria, including the standard sample-based criteria. One issue with deep-learning approaches is the large number of model parameters associated with the networks. Deep ensembles (Lakshminarayanan et al., 2017) are a simple, effective approach for handling this problem.
It has been found to improve performance, as well as allowing measures of uncertainty. In recent literature there have been "contradictory" empirical observations about the relationship between the calibration of the members of an ensemble and the calibration of the final ensemble prediction (Rahaman & Thiery, 2020; Wen et al., 2020). In this paper, we examine the underlying theory and empirical results relating to calibration with ensemble methods. We find, both theoretically and empirically, that ensembling multiple calibrated models decreases the confidence of the final prediction, resulting in a poorly calibrated ensemble prediction. To address this, strategies to calibrate the final ensemble prediction, rather than the individual members, are required. Additionally, we empirically examine the situation where the ensemble is comprised of models with different topologies, and resulting complexity/performance, requiring non-uniform ensemble averaging. In this study, we focus on post-hoc calibration of ensembles, based on temperature annealing. Guo et al. (2017) conducted a thorough comparison of various existing post-hoc calibration methods and found that temperature scaling was a simple, fast, and often highly effective approach to calibration. However, standard temperature scaling acts globally for all regions of the input samples, i.e. all logits are scaled in one single direction, either increasing or decreasing the distribution entropy. To address this constraint, which may hurt some legitimately confident predictions, we investigate the effect of region-specific temperatures. Empirical results demonstrate the effectiveness of this approach, with a minimal increase in the number of calibration parameters.
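The global temperature-scaling baseline discussed above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: a single temperature T is fitted on held-out validation logits by grid search (a stand-in for the usual gradient-based fit), and all function names here are illustrative.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    """Average negative log-likelihood of the labels under logits scaled by 1/T."""
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 376)):
    """Pick the single global temperature minimising validation NLL.
    T > 1 softens overconfident predictions; T < 1 sharpens underconfident ones."""
    return float(grid[np.argmin([nll(logits, labels, T) for T in grid])])
```

Because one T is shared by all samples, every predictive distribution moves in the same direction; the region-specific variant investigated here would instead fit a separate temperature per region of the input (or confidence) space, while still producing normalised distributions.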

2. RELATED WORK

Calibration is inherently related to uncertainty modeling. Two central aspects of calibration research are calibration evaluation and the construction of calibrated systems. One method for assessing calibration is the reliability diagram (Vaicenavicius et al., 2019; Bröcker, 2012). Though informative, it is still desirable to have an overall metric. Widmann et al. (2019) investigate different distances in the probability simplex for estimating calibration error. Nixon et al. (2019) point out a problem with fixed-width binning schemes: bins with few predictions may yield low-bias but high-variance measurements. Calibration error measures adaptive to densely populated regions have also been proposed (Nixon et al., 2019). Vaicenavicius et al. (2019) treated calibration evaluation as hypothesis testing. All these approaches examine calibration criteria from a sample-based perspective, rather than as a function of the underlying data distribution, which is used in the theoretical analysis in this work. There are two main approaches to calibrating systems. The first is to recalibrate uncalibrated systems with a post-hoc calibration mapping, e.g. Platt scaling (Platt et al., 1999), isotonic regression (Zadrozny & Elkan, 2002), or Dirichlet calibration (Kull et al., 2017; 2019). The second is to directly build calibrated systems, via: (i) improved model structures, e.g. deep convolutional Gaussian processes (Tran et al., 2019); (ii) data augmentation, e.g. adversarial samples (Hendrycks & Dietterich, 2019; Stutz et al., 2020) or Mixup (Zhang et al., 2018); (iii) minimising calibration error during training (Kumar et al., 2018). Calibration based on histogram binning (Zadrozny & Elkan, 2001), Bayesian binning (Naeini et al., 2015) and scaling binning (Kumar et al., 2019) are related to our proposed dynamic temperature scaling, in the sense that the samples are divided into regions and separate calibration mappings are applied.
However, our method preserves the property that the predicted class probabilities for each sample sum to 1. The region-based classifier of Kuleshov & Liang (2015) is also related to our approach. Ensemble diversity has been proposed for improved calibration (Raftery et al., 2005; Stickland & Murray, 2020). In Zhong & Kwok (2013), ensembles of SVMs, logistic regression, and boosted decision trees are investigated, where the combination weights for the calibrated probabilities are based on the AUC of the ROC. However, AUC is not comparable between different models, as discussed in Ashukha et al. (2020). In this work we investigate the combination of different deep neural network structures. The weights assigned to the probabilities are optimised using a likelihood-based metric.
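The likelihood-based weight optimisation mentioned above can be sketched as follows. This is a hypothetical NumPy illustration rather than the paper's actual procedure: simplex-constrained combination weights over M members are updated with EM-style multiplicative steps, which monotonically increase the held-out log-likelihood of the mixed prediction.

```python
import numpy as np

def fit_combination_weights(member_probs, labels, steps=100):
    """member_probs: (M, N, K) predictive distributions from M ensemble members
    on N validation samples; labels: (N,) integer targets.
    Returns simplex weights w maximising the validation log-likelihood of the
    mixture sum_m w_m P_m(y|x), via EM updates for fixed mixture components."""
    M, N, _ = member_probs.shape
    idx = np.arange(N)
    p_true = member_probs[:, idx, labels]          # (M, N): each member's prob. of the label
    w = np.full(M, 1.0 / M)                        # start from uniform averaging
    for _ in range(steps):
        mix = w @ p_true                           # (N,): mixture prob. of each label
        w = (w[:, None] * p_true / (mix + 1e-12)).mean(axis=1)
        w /= w.sum()                               # guard against round-off drift
    return w
```

For members of similar quality the weights stay near uniform, while a clearly weaker member (e.g. a near-uniform predictor) is driven towards zero, giving the non-uniform averaging needed when topologies of different capacity are combined.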

3. CALIBRATION FRAMEWORK

Let X ⊆ R^d be the d-dimensional input space and Y = {ω_1, ..., ω_K} be the discrete output space consisting of K classes. The true underlying joint distribution for the data is p(x, ω) = P(ω|x)p(x), x ∈ X, ω ∈ Y. Given some training data D ∼ p(x, ω), a model θ is trained to predict the distribution P(ω|x; θ) given the observed features. For a calibrated system the average predicted posterior probability should equate to the average posterior of the underlying distribution for a specific probability region. Two extreme cases will always yield perfect calibration. First, when the predictions are the same for all inputs and equal to the class prior, P(ω_j|x; θ) = P(ω_j). Sec-

