SHOULD ENSEMBLE MEMBERS BE CALIBRATED?

Abstract

Underlying the use of statistical approaches for a wide range of applications is the assumption that the probabilities obtained from a statistical model are representative of the "true" probability that an event, or outcome, will occur. Unfortunately, for modern deep neural networks this is not the case: they are often observed to be poorly calibrated. Additionally, these deep learning approaches make use of large numbers of model parameters, motivating the use of Bayesian, or ensemble approximation, approaches to handle issues with parameter estimation. This paper explores the application of calibration schemes to deep ensembles, both from a theoretical perspective and empirically on a standard image classification task, CIFAR-100. The underlying theoretical requirements for calibration, and the associated calibration criteria, are first described. It is shown that well calibrated ensemble members will not necessarily yield a well calibrated ensemble prediction, and that if the ensemble prediction is well calibrated its performance cannot exceed the average performance of the calibrated ensemble members. On CIFAR-100 the impact of calibration on the ensemble prediction is evaluated. Additionally, the situation where multiple different topologies are combined together is discussed.

1. INTRODUCTION

Deep learning approaches achieve state-of-the-art performance in a wide range of applications, including image classification. However, these networks tend to be overconfident in their predictions: they often exhibit poor calibration. A system is well calibrated if, when it makes a prediction with probability 0.6, that prediction is correct 60% of the time. Calibration is very important when deploying systems, especially in risk-sensitive tasks such as medicine (Jiang et al., 2012), autonomous driving (Bojarski et al., 2016), and economics (Gneiting et al., 2007). It was shown by Niculescu-Mizil & Caruana (2005) that shallow neural networks are well calibrated. However, Guo et al. (2017) found that more complex neural network models with deep structures do not exhibit the same behaviour. This work motivated recent research into calibration for general deep learning systems. Previous research has mainly examined calibration based on samples from the true data distribution, $\{x^{(i)}, y^{(i)}\}_{i=1}^N \sim p(x, \omega)$, $y^{(i)} \in \{\omega_1, \ldots, \omega_K\}$ (Zadrozny & Elkan, 2002; Vaicenavicius et al., 2019). This analysis relies on the limiting behaviour as $N \to +\infty$ to define a well calibrated system:

$$\mathrm{P}(y = \hat{y} \mid \mathrm{P}(\hat{y}|x;\theta) = p) = p \iff \lim_{N \to +\infty} \frac{\sum_{i \in S_j^p} \delta(y^{(i)}, \hat{y}^{(i)})}{|S_j^p|} = p \qquad (1)$$

where $S_j^p = \{i \mid \mathrm{P}(\hat{y}^{(i)} = j \mid x^{(i)}; \theta) = p,\ i = 1, \ldots, N\}$, $\hat{y}^{(i)}$ is the model prediction for $x^{(i)}$, and $\delta(s, t) = 1$ if $s = t$ and 0 otherwise. However, Eq. (1) does not explicitly reflect the relation between $\mathrm{P}(y = \hat{y} \mid \mathrm{P}(\hat{y}|x;\theta) = p)$ and the underlying data distribution $p(x, y)$. In this work we examine this explicit relationship and use it to define a range of calibration evaluation criteria, including the standard sample-based criteria. One issue with deep-learning approaches is the large number of model parameters associated with the networks. Deep ensembles (Lakshminarayanan et al., 2017) are a simple, effective approach for handling this problem.
They have been found to improve performance, as well as providing measures of uncertainty. In the recent literature there have been "contradictory" empirical observations about the relationship between the calibration of the members of an ensemble and the calibration of the final ensemble prediction (Rahaman & Thiery, 2020; Wen et al., 2020). In this paper, we examine the underlying theory and empirical results relating to calibration with ensemble methods. We find, both theoretically and empirically, that ensembling multiple calibrated models decreases the confidence of the final prediction, resulting in an ill-calibrated ensemble prediction. To address this, strategies to calibrate the final ensemble prediction, rather than the individual members, are required. Additionally, we empirically examine the situation where the ensemble comprises models with different topologies, and hence different complexity and performance, requiring non-uniform ensemble averaging. In this study, we focus on post-hoc calibration of the ensemble, based on temperature annealing. Guo et al. (2017) conducted a thorough comparison of various existing post-hoc calibration methods and found that temperature scaling is a simple, fast, and often highly effective approach to calibration. However, standard temperature scaling acts globally for all regions of the input samples, i.e. all logits are scaled in one single direction, either increasing or decreasing the distribution entropy. To address this constraint, which may hurt some legitimately confident predictions, we investigate the effect of region-specific temperatures. Empirical results demonstrate the effectiveness of this approach, with minimal increase in the number of calibration parameters.
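The sample-based calibration condition in Eq. (1) can be checked numerically. The sketch below is purely illustrative (synthetic data and names, not from the paper): labels are drawn from the model's own predictive distribution, so the model is well calibrated by construction, and within each confidence bin the accuracy should approach the mean confidence as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20_000, 4

# A synthetic "model": softmax of random logits over K classes.
logits = rng.normal(size=(N, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Draw each label from the model's own posterior, so that
# P(y = y_hat | P(y_hat|x; theta) = p) = p holds by construction.
labels = np.array([rng.choice(K, p=p) for p in probs])

pred = probs.argmax(axis=1)
conf = probs.max(axis=1)

# Empirical version of Eq. (1): bin the top-label confidences and
# compare the mean confidence with the accuracy in each bin.
for lo in np.arange(0.25, 1.0, 0.15):
    idx = (conf >= lo) & (conf < lo + 0.15)
    if idx.sum() > 100:
        print(f"[{lo:.2f}, {lo + 0.15:.2f}): "
              f"conf={conf[idx].mean():.3f} acc={(pred[idx] == labels[idx]).mean():.3f}")
```

With finite N the per-bin accuracies fluctuate around the mean confidences, which is exactly the estimator randomness discussed for sample-based criteria later in the paper.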

2. RELATED WORK

Calibration is inherently related to uncertainty modelling. Two of the most important aspects of calibration are calibration evaluation and calibrated system construction. One method for assessing calibration is the reliability diagram (Vaicenavicius et al., 2019; Bröcker, 2012). Though informative, it is still desirable to have an overall metric. Widmann et al. (2019) investigate different distances in the probability simplex for estimating calibration error. Nixon et al. (2019) point out the problem with fixed-spacing binning schemes: bins with few predictions may have low-bias but high-variance measurements. Calibration error measures adaptive to densely populated regions have also been proposed (Nixon et al., 2019). Vaicenavicius et al. (2019) treated calibration evaluation as hypothesis testing. All these approaches examine calibration criteria from a sample-based perspective, rather than as a function of the underlying data distribution, which is used in the theoretical analysis in this work. There are two main approaches to calibrating systems. The first is to recalibrate an uncalibrated system with a post-hoc calibration mapping, e.g. Platt scaling (Platt et al., 1999), isotonic regression (Zadrozny & Elkan, 2002), or Dirichlet calibration (Kull et al., 2017; 2019). The second is to directly build calibrated systems, via: (i) improving model structures, e.g. deep convolutional Gaussian processes (Tran et al., 2019); (ii) data augmentation, e.g. adversarial samples (Hendrycks & Dietterich, 2019; Stutz et al., 2020) or Mixup (Zhang et al., 2018); (iii) minimising calibration error during training (Kumar et al., 2018). Calibration based on histogram binning (Zadrozny & Elkan, 2001), Bayesian binning (Naeini et al., 2015) and scaling binning (Kumar et al., 2019) is related to our proposed dynamic temperature scaling, in the sense that the samples are divided into regions and separate calibration mappings are applied.
However, our method preserves the property that all class probabilities for one sample sum to 1. The region-based classifier of Kuleshov & Liang (2015) is also related to our approach. Ensemble diversity has been proposed for improved calibration (Raftery et al., 2005; Stickland & Murray, 2020). In Zhong & Kwok (2013), ensembles of SVMs, logistic regressors and boosted decision trees are investigated, where the combination weights of the calibrated probabilities are based on the AUC of the ROC. However, AUC is not comparable between different models, as discussed in Ashukha et al. (2020). In this work we investigate the combination of different deep neural network structures. The weights assigned to the probabilities are optimised using a likelihood-based metric.

3. CALIBRATION FRAMEWORK

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the $d$-dimensional input space and $\mathcal{Y} = \{\omega_1, \ldots, \omega_K\}$ be the discrete output space consisting of $K$ classes. The true underlying joint distribution for the data is $p(x, \omega) = \mathrm{P}(\omega|x)p(x)$, $x \in \mathcal{X}$, $\omega \in \mathcal{Y}$. Given some training data $\mathcal{D} \sim p(x, \omega)$, a model $\theta$ is trained to predict the distribution $\mathrm{P}(\omega|x;\theta)$ given the observed features. For a calibrated system the average predicted posterior probability should equate to the average posterior of the underlying distribution for a specific probability region. Two extreme cases will always yield perfect calibration. The first is when the predictions are the same, and equal to the class prior, for all inputs: $\mathrm{P}(\omega_j|x;\theta) = \mathrm{P}(\omega_j)$. The second is when the minimum Bayes' risk classifier is obtained:

$$\mathrm{P}(\omega_j|x;\theta) = \frac{p(x, \omega_j)}{\sum_{k=1}^{K} p(x, \omega_k)}$$

Note that perfect calibration does not imply high accuracy, as shown by the system predicting the prior distribution.
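The first extreme case is easy to illustrate numerically. In the sketch below (synthetic data, not from the paper), a system that always predicts the class prior is globally calibrated yet uninformative:

```python
import numpy as np

rng = np.random.default_rng(1)
prior = np.array([0.5, 0.3, 0.2])
labels = rng.choice(3, size=50_000, p=prior)

# Every input gets the same prediction: the class prior itself.
pred_probs = np.tile(prior, (labels.size, 1))

# The average predicted probability per class matches the empirical class
# frequency, so the calibration requirement holds for every class...
for j in range(3):
    print(j, pred_probs[:, j].mean().round(3), (labels == j).mean().round(3))

# ...but the accuracy is only that of always predicting the most common class.
print((pred_probs.argmax(axis=1) == labels).mean())  # about 0.5
```

This makes concrete why perfect calibration says nothing about accuracy: the prior-predicting system passes every calibration check while ignoring the input entirely.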

3.1. DISTRIBUTION CALIBRATION

A system is calibrated if the predictive probability values accurately indicate the proportion of correct predictions. Perfect calibration for a system that yields $\mathrm{P}(\omega|x;\theta)$, when the training and test data are obtained from the joint distribution $p(x, \omega)$, can be defined as:

$$\int_{x \in R_j^p(\theta,\epsilon)} \mathrm{P}(\omega_j|x;\theta)\, p(x)\,dx = \int_{x \in R_j^p(\theta,\epsilon)} \mathrm{P}(\omega_j|x)\, p(x)\,dx \qquad \forall p, \omega_j,\ \epsilon \to 0 \qquad (2)$$

$$R_j^p(\theta,\epsilon) = \big\{x \,\big|\, |\mathrm{P}(\omega_j|x;\theta) - p| \le \epsilon,\ x \in \mathcal{X}\big\} \qquad (3)$$

$R_j^p(\theta,\epsilon)$ denotes the region of the input space where the system's predictive probability for class $\omega_j$ is sufficiently close, within error $\epsilon$, to the probability $p$. A perfectly calibrated system will satisfy this expression for all regions: the expected predictive probability (left side of Eq. (2)) is identical to the expected correctness, i.e. the expected true probability (right side of Eq. (2)). $R_j^p(\theta,\epsilon)$ defines the region in which calibration is assessed. For top-label calibration, only the most probable class is considered and the region defined in Eq. (3) is modified to reflect this:

$$\tilde{R}_j^p(\theta,\epsilon) = R_j^p(\theta,\epsilon) \cap \big\{x \,\big|\, \omega_j = \arg\max_\omega \mathrm{P}(\omega|x;\theta),\ x \in \mathcal{X}\big\} \qquad (4)$$

Eq. (4) is a strict subset of Eq. (3). As the calibration regions differ between calibration and top-label calibration, perfect calibration does not imply top-label calibration, and vice versa. A simple illustrative example of this property is given in A.3. Binary classification, $K = 2$, is an exception to this general rule, as the regions for top-label calibration are equivalent to those for perfect calibration, i.e. $\tilde{R}_j^p(\theta,\epsilon) = R_j^p(\theta,\epsilon)$. Hence, perfect calibration is equivalent to top-label calibration for binary classification (Nguyen & O'Connor, 2015). Eq. (2) defines the requirements for a perfectly calibrated system. It is useful to define metrics that allow how close a system is to perfect calibration to be assessed.
Let the region calibration error be:

$$C_j^p(\theta,\epsilon) = \int_{x \in R_j^p(\theta,\epsilon)} \big(\mathrm{P}(\omega_j|x;\theta) - \mathrm{P}(\omega_j|x)\big)\, p(x)\,dx \qquad (5)$$

This then allows two forms of expected calibration loss to be defined:

$$\mathrm{ACE}(\theta) = \frac{1}{K} \int_0^1 \Big|\sum_{j=1}^K C_j^p(\theta,\epsilon)\Big|\,dp; \qquad \mathrm{ACCE}(\theta) = \frac{1}{K} \sum_{j=1}^K \int_0^1 \big|C_j^p(\theta,\epsilon)\big|\,dp \qquad (6)$$

All Calibration Error (ACE) only considers the expected calibration error for a particular probability, irrespective of the class associated with the data¹ (Hendrycks et al., 2019). Hence All Class Calibration Error (ACCE), which requires that all classes minimise the calibration error for all probabilities, is advocated by Kull et al. (2019) and Kumar et al. (2019). Nixon et al. (2019) propose the Thresholded Adaptive Calibration Error (TACE), which considers only predictions larger than a threshold, and which can be described as a special case of ACCE obtained by restricting the integration range. Naeini et al. (2015) also propose to consider only the region with maximum error. Though measures such as ACE and ACCE require consistency of the expected posteriors with the true distribution, for tasks with multiple classes, particularly large numbers of classes, the same weight is given to the ability of the model to assign low probabilities to highly unlikely classes as to assign high probabilities to the "correct" class. For systems with large numbers of classes this can yield artificially low scores. To address this problem it is more common to replace the regions in Eq. (5) with the top-label regions in Eq. (4), giving a top-label calibration error $\tilde{C}_j^p(\theta,\epsilon)$. This then yields the top-label equivalents of ACCE and ACE: the Expected Class Calibration Error (ECCE) and the Expected Calibration Error (ECE). Here, for example, the ECE of Guo et al.
(2017) is expressed as:

$$\mathrm{ECE}(\theta) = \int_0^1 \Big|\sum_{j=1}^K \int_{x \in \tilde{R}_j^p(\theta,\epsilon)} \big(\mathrm{P}(\omega_j|x;\theta) - \mathrm{P}(\omega_j|x)\big)\, p(x)\,dx\Big|\,dp \qquad (7)$$
$$= \int_0^1 O(\theta, p)\,\big|\mathrm{Conf}(\theta, p) - \mathrm{Acc}(\theta, p)\big|\,dp$$

where $O(\theta, p) = \sum_{j=1}^K \int_{x \in \tilde{R}_j^p(\theta,\epsilon)} p(x)\,dx$ is the fraction of observations assigned that particular probability, and $\mathrm{Conf}(\theta, p)$ and $\mathrm{Acc}(\theta, p)$ are, respectively, the model confidence and the true-distribution accuracy for that probability. For more details see the appendix.

3.2. SAMPLE-BASED CALIBRATION

Usually only samples from the true joint distribution are available. A particular training set drawn from the distribution yields $\mathcal{D} = \{x^{(i)}, y^{(i)}\}_{i=1}^N$, $\{x^{(i)}, y^{(i)}\} \sim p(x,\omega)$, $y^{(i)} \in \{\omega_1, \ldots, \omega_K\}$. The region defined in Eq. (3) now becomes a set of sample indices:

$$S_j^p(\theta,\epsilon) = \big\{i \,\big|\, |\mathrm{P}(\omega_j|x^{(i)};\theta) - p| \le \epsilon,\ x^{(i)} \in \mathcal{D}\big\} \qquad (9)$$

The sample-based version of "perfect" calibration in Eq. (2) can then be expressed as:

$$\frac{1}{|S_j^p(\theta,\epsilon)|} \sum_{i \in S_j^p(\theta,\epsilon)} \mathrm{P}(\omega_j|x^{(i)};\theta) = \frac{1}{|S_j^p(\theta,\epsilon)|} \sum_{i \in S_j^p(\theta,\epsilon)} \delta(y^{(i)}, \omega_j), \qquad \forall p, \omega_j,\ \epsilon \to 0 \qquad (10)$$

as $N \to \infty$. When considering finite data, in this case $N$ samples, it is important to set $\epsilon$ appropriately. Different values of $\epsilon$ yield different regions and lead to different calibration results (Kumar et al., 2019). Thus it is important to specify $\epsilon$ when defining calibration for a system. Similarly, the distributional form of top-label calibration can be written in terms of samples as in Eq. (4), with different regions considered:

$$\tilde{S}_j^p(\theta,\epsilon) = S_j^p(\theta,\epsilon) \cap \big\{i \,\big|\, \omega_j = \arg\max_\omega \mathrm{P}(\omega|x^{(i)};\theta),\ x^{(i)} \in \mathcal{D}\big\} \qquad (11)$$

The sample-based calibration losses in region $S_j^p(\theta,\epsilon)$ can be defined based on Eq. (10). For example, the ACE in Eq. (6) can be expressed in its sample-based form (Hendrycks et al., 2019):

$$\mathrm{ACE}(\theta,\epsilon) = \frac{1}{NK} \sum_{p \in \mathcal{P}(\epsilon)} \Big|\sum_{j=1}^K \sum_{i \in S_j^p(\theta,\epsilon)} \big(\mathrm{P}(\omega_j|x^{(i)};\theta) - \delta(y^{(i)}, \omega_j)\big)\Big| \qquad (12)$$

where $\mathcal{P}(\epsilon) = \{p \mid p = \min\{1, (2z-1)\epsilon\},\ z \in \mathbb{Z}^+\}$ and $\mathbb{Z}^+$ is the set of positive integers. The measure of ECE relating to Eq. (7), which considers only the top-label regions in Eq. (11), can be defined following Guo et al. (2017):

$$\mathrm{ECE}(\theta,\epsilon) = \frac{1}{N} \sum_{p \in \mathcal{P}(\epsilon)} \Big|\sum_{j=1}^K \sum_{i \in \tilde{S}_j^p(\theta,\epsilon)} \big(\mathrm{P}(\omega_j|x^{(i)};\theta) - \delta(y^{(i)}, \omega_j)\big)\Big| \qquad (13)$$
$$= \sum_{p \in \mathcal{P}(\epsilon)} \frac{\sum_{j=1}^K |\tilde{S}_j^p(\theta,\epsilon)|}{N} \big|\mathrm{Conf}(\theta, p) - \mathrm{Acc}(\theta, p)\big|$$

It should be noted that for a finite number of samples, the regions $S_j^p(\theta,\epsilon)$ and $\tilde{S}_j^p(\theta,\epsilon)$ derived from the samples can differ from the theoretical regions, leading to differences between the theoretical calibration error measures and the values estimated from the finite samples.
This is also referred to as "estimator randomness" by Vaicenavicius et al. (2019). An example is given in A.3 to illustrate this mismatch. The simplest region specification for calibration is to set $\epsilon = 1$. In this case $|S_j^p(\theta, 1)| = N$, and the "minimum" perfect calibration requirement for a system with parameters $\theta$ becomes:

$$\frac{1}{N} \sum_{i=1}^N \mathrm{P}(\omega_j|x^{(i)};\theta) = \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \omega_j), \qquad \forall \omega_j \qquad (15)$$

This is also referred to as global calibration in this paper. Similarly, global top-label calibration can be defined as:

$$\frac{1}{N} \sum_{i=1}^N \mathrm{P}(\hat{y}^{(i)}|x^{(i)};\theta) = \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}^{(i)}), \qquad \hat{y}^{(i)} = \arg\max_\omega \mathrm{P}(\omega|x^{(i)};\theta) \qquad (16)$$
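A direct implementation of the sample-based top-label ECE is sketched below. The function and the binning loop are our own illustrative reading of the definitions above (bin centres $p = (2z-1)\epsilon$ with half-width $\epsilon$), not reference code from the paper:

```python
import numpy as np

def top_label_ece(probs, labels, eps=0.05):
    """Sample-based top-label ECE: weighted |confidence - accuracy| over the
    regions S~_j^p, pooled across classes within each probability bin."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    total = 0.0
    z = 1
    while True:
        p = min(1.0, (2 * z - 1) * eps)  # bin centres from P(eps)
        idx = (conf > p - eps) & (conf <= p + eps)
        if idx.any():
            total += idx.sum() / len(conf) * abs(conf[idx].mean() - correct[idx].mean())
        if p == 1.0:
            break
        z += 1
    return total

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=20_000)
labels = np.array([rng.choice(5, p=p) for p in probs])  # calibrated by construction
over = probs ** 3 / (probs ** 3).sum(axis=1, keepdims=True)  # artificially sharpened

print(top_label_ece(probs, labels))  # small: the base predictions are calibrated
print(top_label_ece(over, labels))   # larger: sharpening makes them overconfident
```

Sharpening the posteriors raises every top-label confidence without changing which predictions are correct, so the confidence-accuracy gap, and hence the ECE, grows.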

4. ENSEMBLE CALIBRATION

An interesting question when using ensembles is whether calibrating the ensemble members is sufficient to ensure calibrated predictions. Initially the ensemble model will be viewed as an approximation to Bayesian parameter estimation. Given training data $\mathcal{D}$, the prediction for class $\omega_j$ is:

$$\mathrm{P}(\omega_j|x^*, \mathcal{D}) = \mathbb{E}_{\theta \sim p(\theta|\mathcal{D})}\big[\mathrm{P}(\omega_j|x^*;\theta)\big] = \int \mathrm{P}(\omega_j|x^*;\theta)\, p(\theta|\mathcal{D})\, d\theta$$
$$\approx \mathrm{P}(\omega_j|x^*;\Theta) = \frac{1}{M} \sum_{m=1}^M \mathrm{P}(\omega_j|x^*;\theta^{(m)}); \qquad \theta^{(m)} \sim p(\theta|\mathcal{D}) \qquad (17)$$

where Eq. (17) is an ensemble, Monte-Carlo, approximation to the full Bayesian integration, with $\theta^{(m)}$ the parameters of the $m$-th member of the ensemble $\Theta$. The predictions of the members and of the ensemble are:

$$\hat{y}_m^* = \arg\max_\omega \mathrm{P}(\omega|x^*;\theta^{(m)}), \qquad \hat{y}_E^* = \arg\max_\omega \frac{1}{M} \sum_{m=1}^M \mathrm{P}(\omega|x^*;\theta^{(m)})$$
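The Monte-Carlo approximation in Eq. (17) is simply a uniform average of member posteriors. A minimal sketch with synthetic member outputs (illustrative only, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 5, 1000, 10  # members, samples, classes

# Hypothetical per-member predictive distributions P(omega | x; theta_m).
member_probs = rng.dirichlet(np.ones(K), size=(M, N))

ensemble_probs = member_probs.mean(axis=0)     # P(omega | x; Theta), Eq. (17)
member_preds = member_probs.argmax(axis=2)     # y_hat_m for each member
ensemble_pred = ensemble_probs.argmax(axis=1)  # y_hat_E

# The average of valid distributions is itself a valid distribution.
print(np.allclose(ensemble_probs.sum(axis=1), 1.0))  # True
```

Because the combination is a convex average, the ensemble posterior stays on the probability simplex, which is what makes the region-based calibration analysis of the next section applicable to the ensemble itself.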

4.1. THEORETICAL ANALYSIS

For ensemble methods it is only important that the final ensemble prediction, $\hat{y}_E$, is well calibrated, rather than the individual ensemble members. It is useful to examine the relationship between this ensemble prediction and the predictions from the individual models when the ensemble members are calibrated. Consider a particular top-label calibration region for the ensemble prediction, $\tilde{R}^p(\Theta,\epsilon)$, related to Eq. (4). The following expression is true:

$$\int_{x \in \tilde{R}^p(\Theta,\epsilon)} \frac{1}{M} \sum_{m=1}^M \mathrm{P}(\hat{y}_E|x;\theta^{(m)})\, p(x)\,dx \le \int_{x \in \tilde{R}^p(\Theta,\epsilon)} \frac{1}{M} \sum_{m=1}^M \mathrm{P}(\hat{y}_m|x;\theta^{(m)})\, p(x)\,dx \qquad (18)$$

where the ensemble region is defined as $\tilde{R}^p(\Theta,\epsilon) = \{x \mid |\mathrm{P}(\hat{y}_E|x;\Theta) - p| \le \epsilon,\ x \in \mathcal{X}\}$. For all regions $\tilde{R}^p(\Theta,\epsilon)$ the ensemble is no more confident than the average confidence of the individual member predictions. This puts bounds on the ensemble prediction performance if the resulting ensemble prediction is top-label calibrated and all ensemble members yield the same region $\tilde{R}^p(\Theta,\epsilon)$. Here:

$$\int_{x \in \tilde{R}^p(\Theta,\epsilon)} \mathrm{P}(\hat{y}_E|x;\Theta)\, p(x)\,dx = \int_{x \in \tilde{R}^p(\Theta,\epsilon)} \mathrm{P}(\hat{y}_E|x)\, p(x)\,dx \qquad (19)$$

From Eq. (18), the left-hand side of this expression, the ensemble prediction confidence, cannot be greater than the average ensemble member confidence. If the regions associated with the ensemble prediction and the members are the same, then for top-label calibrated members this average confidence is the same as the average ensemble member accuracy. Furthermore, if the ensemble prediction is top-label calibrated, then this average ensemble member accuracy bounds the ensemble prediction accuracy. Under these conditions ensembling the members yields no performance gains. The above bound holds under the assumption that the members are calibrated on the same regions. Proposition 3 in the Appendix describes one trivial case where all members are calibrated on the same regions. Another case is calibration on global regions. As shown in Proposition 1, at the global level the ensemble accuracy is still bounded.

Proposition 1.
If all members and the corresponding ensemble are globally top-label calibrated, the ensemble performance is no better than the average performance of the members:

$$\frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}_E^{(i)}) \le \frac{1}{M} \sum_{m=1}^M \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}_m^{(i)}) \qquad (20)$$

Proof. If all members and the ensemble are globally top-label calibrated:

$$\frac{1}{N} \sum_{i=1}^N \mathrm{P}(\hat{y}_m^{(i)}|x^{(i)};\theta^{(m)}) = \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}_m^{(i)}), \qquad m = 1, \ldots, M \qquad (21)$$
$$\frac{1}{N} \sum_{i=1}^N \frac{1}{M} \sum_{m=1}^M \mathrm{P}(\hat{y}_E^{(i)}|x^{(i)};\theta^{(m)}) = \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}_E^{(i)}) \qquad (22)$$

By definition:

$$\mathrm{P}(\hat{y}_E^{(i)}|x^{(i)};\theta^{(m)}) \le \mathrm{P}(\hat{y}_m^{(i)}|x^{(i)};\theta^{(m)}) \qquad (23)$$

Hence:

$$\frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}_E^{(i)}) \le \frac{1}{M} \sum_{m=1}^M \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \hat{y}_m^{(i)})$$

However, this is not true for all-label calibration: globally all-label calibrated members always yield a globally all-label calibrated ensemble, whether or not the ensemble accuracy exceeds the mean accuracy of the members (Example 2 in the Appendix gives an illustration on a synthetic dataset).

Proposition 2. If all members are globally all-label calibrated, then the overall ensemble is globally all-label calibrated.

Proof. If all members are globally all-label calibrated, then:

$$\frac{1}{N} \sum_{i=1}^N \mathrm{P}(\omega_j|x^{(i)};\theta^{(m)}) = \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \omega_j), \qquad \forall \omega_j,\ m = 1, \ldots, M$$

Hence:

$$\frac{1}{N} \sum_{i=1}^N \mathrm{P}(\omega_j|x^{(i)};\Theta) = \frac{1}{M} \sum_{m=1}^M \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \omega_j) = \frac{1}{N} \sum_{i=1}^N \delta(y^{(i)}, \omega_j)$$

In general the regions are not the same, and the ensemble accuracy is not bounded in the above way. However, note that global-level calibration is the minimum requirement of calibration. The above region-based discussion still sheds light on the question of whether the members should be calibrated, though a final theoretical answer is still absent. It should also be noted that global all-label calibration does not imply global top-label calibration, because the regions considered are different (as illustrated by Example 1 in the Appendix). For the discussion so far, the ensemble members have been combined with uniform weights, motivated from a Bayesian approximation perspective.
When, for example, multiple different topologies are used as members of the ensemble, a non-uniform averaging of the members, reflecting the model complexities and performance, may be useful. Propositions 1 and 2 still apply.
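The confidence bound that drives this analysis can be checked numerically. Since each member assigns the ensemble's label at most the probability it assigns its own top label, the ensemble confidence never exceeds the average member confidence, pointwise in x. The check below uses synthetic member posteriors (illustrative only, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 7, 10_000, 10
member_probs = rng.dirichlet(np.ones(K), size=(M, N))

# Ensemble top-label confidence: max_j (1/M) sum_m P(omega_j | x; theta_m).
ens_conf = member_probs.mean(axis=0).max(axis=1)
# Average member top-label confidence: (1/M) sum_m max_j P(omega_j | x; theta_m).
avg_member_conf = member_probs.max(axis=2).mean(axis=0)

# The max of an average never exceeds the average of the maxima.
print((ens_conf <= avg_member_conf + 1e-12).all())  # True
```

The inequality is exact for every sample, which is the mechanism behind the under-confidence of ensembles of calibrated members observed in the experiments.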

4.2. TEMPERATURE ANNEALING FOR ENSEMBLE CALIBRATION

Calibrating ensembles can be performed using a function $f \in \mathcal{F}$ with some parameters $t$, $f: [0,1] \to [0,1]$, for scaling probabilities. There are two modes for calibrating an ensemble.

Pre-combination mode: the function is applied to the probabilities predicted by the members, prior to combining them into the ensemble prediction, using a set of calibration parameters $\mathcal{T}$:

$$\mathrm{P}_{\text{pre}}(\hat{y}_E|x;\Theta,\mathcal{T}) = \frac{1}{M} \sum_{m=1}^M f\big(\mathrm{P}(\hat{y}_E|x;\theta^{(m)}),\ t^{(m)}\big)$$

Post-combination mode: the function is applied to the ensemble predicted probability after combining the members' predictions:

$$\mathrm{P}_{\text{post}}(\hat{y}_E|x;\Theta,t) = f\Big(\frac{1}{M} \sum_{m=1}^M \mathrm{P}(\hat{y}_E|x;\theta^{(m)}),\ t\Big)$$

There are many functions for transforming predicted probabilities in the calibration literature, e.g. histogram binning, Platt scaling and temperature annealing. However, histogram binning should not be adopted as the scaling function $f$ in the pre-combination mode for calibrating a multi-class ensemble, as the transformed values may not yield a valid PMF. As shown in Guo et al. (2017), temperature scaling is a simple, effective option for the mapping function, which scales the logit values $z$ associated with the posterior by a temperature $t$:

$$f(z; t) = \frac{\exp\{z/t\}}{\sum_j \exp\{z_j/t\}}$$

Here a single temperature is used for scaling the logits of all samples. This leads to the problem that the entropy of the predictions in all regions is either increased or decreased together. Following Eq. (2), the temperature can instead be made region specific:

$$f_{\text{dyn}}(z; t) = \frac{\exp\{z/t_r\}}{\sum_j \exp\{z_j/t_r\}}, \qquad \text{if } \max_i \frac{\exp\{z_i\}}{\sum_j \exp\{z_j\}} \in R_r$$

To determine the optimal set of temperatures, the samples in the validation set are divided into $R$ regions based on the ensemble predictions (e.g. $R_1 = [0, 0.3)$, $R_2 = [0.3, 0.6)$, $R_3 = [0.6, 1]$). Each region has an individual temperature for scaling, $\{R_r, t_r\}_{r=1}^R$.
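A sketch of the two scaling operations is given below. This is our own illustrative implementation: the region boundaries and temperature values are assumptions, and in practice the temperatures would be fitted on the validation set to minimise ECE.

```python
import numpy as np

def temp_scale(logits, t):
    """Temperature scaling: softmax of z / t."""
    z = logits / t
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dynamic_temp_scale(logits, regions, temps):
    """Region-specific (dynamic) temperature scaling: the temperature applied
    to a sample is chosen by its un-scaled top-label confidence."""
    conf = temp_scale(logits, 1.0).max(axis=1)
    out = np.empty_like(logits, dtype=float)
    for (lo, hi), t in zip(regions, temps):
        idx = (conf >= lo) & (conf < hi)
        out[idx] = temp_scale(logits[idx], t)
    return out

# Example regions as in the text, with assumed per-region temperatures.
regions = [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0 + 1e-9)]
temps = [0.9, 1.2, 1.5]

logits = np.random.default_rng(0).normal(size=(8, 100))  # e.g. 100 classes
probs = dynamic_temp_scale(logits, regions, temps)
print(np.allclose(probs.sum(axis=1), 1.0))  # True: each row is a valid PMF
```

Unlike histogram binning, each scaled prediction remains a softmax over all classes, so the sum-to-one property noted in Section 2 is preserved; with all temperatures equal to 1 the dynamic variant reduces exactly to standard temperature scaling.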

4.3. EMPIRICAL RESULTS

Experiments were conducted on CIFAR-100 (and CIFAR-10 in A.4). The data partition was 45,000/5,000/10,000 images for train/validation/test. We train LeNet (LEN) (LeCun et al., 1998), DenseNet 100 and 121 (DSN100, DSN121) (Huang et al., 2017) and Wide ResNet 28 (RSN) (Zagoruyko & Komodakis, 2016), following the original training recipes in each paper (more details in A.4). The results presented are slightly lower than those in the original papers, as 5,000 images were held out to enable calibration parameter optimisation. Figure 1 examines the empirical performance of ensemble calibration on the CIFAR-100 test set using the three trained networks. The top row shows that, with appropriate temperature scaling, the members are calibrated on different regions (because otherwise the accuracy values would be the same). The middle row shows the ECE of the ensemble members and the ensemble prediction at different temperatures. The optimal calibration temperatures for the ensemble prediction are consistently smaller than those associated with the ensemble members. This indicates that the ensemble predictions are less confident than those of the members, as stated in Eq. (23). The bottom row shows the reliability curves when the ensemble members are calibrated with optimal temperature values, and the resulting combination. It is clear that calibrating the ensemble members, using temperature, does not yield a calibrated ensemble prediction. Furthermore, for all models the ensemble prediction is less confident than it should be: the line is above the diagonal. As discussed in Proposition 1, this is necessary, as otherwise the ensemble prediction would be no better than the average member, which is clearly not the case for the performance plots in the top row. The ensemble performance is relatively robust to poorly calibrated ensemble members, with consistent performance over a wide range of temperatures.
Table 1 shows the calibration performance using three temperature scaling methods: pre-, post- and dynamic post-combination. The temperatures are optimized to minimize ECE (Liang et al.; Widmann et al., 2019). All three methods effectively improve the ensemble prediction calibration, with the dynamic approach yielding the best performance. We further investigate the impact of the number of regions on the dynamic approach, as shown in Figure 3. Increasing the number of regions tends to improve the calibration performance, while requiring more parameters. Finally, for the topology ensemble, weights were optimised using either maximum likelihood (Max LL) or the area under the curve (AUC) (Zhong & Kwok, 2013) (results in A.4). In Figure 2, the ensemble of calibrated structures is shown to be uncalibrated, with reliability curves typically slightly above the diagonal line. When the ensemble prediction is calibrated, the calibration error for the ensemble prediction is lower than the individual calibration errors in Table 1 ("post" lines).

Table 2: Topology ensembles for CIFAR-100, optimal weights using ML estimation. Calibration of each topology and the ensemble uses the post-combination mode ("post" in Table 1).

5. CONCLUSIONS

State-of-the-art deep learning models often exhibit poor calibration performance. In this paper two aspects of calibration for these models are investigated: the theoretical definition of calibration, and its associated attributes, for both general and top-label calibration; and the application of calibration to the ensemble methods that are often used in deep learning for improved performance and uncertainty estimation. It is shown that calibrating the members of an ensemble is not sufficient to ensure that the ensemble prediction is itself calibrated. The resulting ensemble predictions will be under-confident, requiring calibration functions to be optimised for the ensemble prediction, rather than for the ensemble members. These theoretical results are backed up by empirical analysis of deep-learning models on CIFAR-100, with ensemble performance being robust to poorly calibrated ensemble members but requiring calibration even with well calibrated members.

A APPENDIX

A.1 THEORETICAL PROOF

Proposition 3. If all members are calibrated and the regions are the same, i.e. for different members $\theta^{(m)}$ and $\theta^{(m')}$:

$$R_j^p(\theta^{(m)},\epsilon) = R_j^p(\theta^{(m')},\epsilon) \qquad \forall p, \omega_j,\ \epsilon \to 0 \qquad (28)$$

then the ensemble is also calibrated on the same regions:

$$\int_{x \in R_j^p(\Theta,\epsilon)} \mathrm{P}(\omega_j|x;\Theta)\, p(x)\,dx = \int_{x \in R_j^p(\Theta,\epsilon)} \mathrm{P}(\omega_j|x)\, p(x)\,dx, \qquad \forall p, \omega_j,\ \epsilon \to 0 \qquad (29)$$

Proof. If $R_j^p(\theta^{(m)},\epsilon) = R_j^p(\theta^{(m')},\epsilon)$ for all $p, \omega_j$ as $\epsilon \to 0$, then the ensemble region coincides with the shared member regions:

$$R_j^p(\Theta,\epsilon) = \Big\{x \,\Big|\, \Big|\frac{1}{M}\sum_{m=1}^M \mathrm{P}(\omega_j|x;\theta^{(m)}) - p\Big| \le \epsilon\Big\} = R_j^p(\theta^{(m)},\epsilon) \qquad \forall p, \omega_j,\ \epsilon \to 0 \qquad (30)$$

$$\int_{x \in R_j^p(\Theta,\epsilon)} \frac{1}{M}\sum_{m=1}^M \mathrm{P}(\omega_j|x;\theta^{(m)})\, p(x)\,dx = \frac{1}{M}\sum_{m=1}^M \int_{x \in R_j^p(\theta^{(m)},\epsilon)} \mathrm{P}(\omega_j|x;\theta^{(m)})\, p(x)\,dx$$
$$= \frac{1}{M}\sum_{m=1}^M \int_{x \in R_j^p(\theta^{(m)},\epsilon)} \mathrm{P}(\omega_j|x)\, p(x)\,dx = \int_{x \in R_j^p(\Theta,\epsilon)} \mathrm{P}(\omega_j|x)\, p(x)\,dx$$

Proof (of Proposition 4). Assume that globally top-label calibrated members imply a globally top-label calibrated ensemble, that is, given:

$$\frac{1}{N}\sum_{i=1}^N \mathrm{P}(\hat{y}_m^{(i)}|x^{(i)};\theta^{(m)}) = \frac{1}{N}\sum_{i=1}^N \delta(y^{(i)}, \hat{y}_m^{(i)}), \qquad m = 1, \ldots, M \qquad (31)$$

the following would be true:

$$\frac{1}{N}\sum_{i=1}^N \mathrm{P}(\hat{y}_E^{(i)}|x^{(i)};\Theta) = \frac{1}{N}\sum_{i=1}^N \delta(y^{(i)}, \hat{y}_E^{(i)}) \qquad (32)$$

If $\exists n, \tilde{m}, \tau > 0$ such that $\hat{y}_E^{(n)} \ne \hat{y}_{\tilde{m}}^{(n)}$, then it is possible to write:

$$\mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\Theta) = \Big(\frac{1}{M}\sum_{m \ne \tilde{m}} \mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\theta^{(m)})\Big) + \frac{1}{M}\mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\theta^{(\tilde{m})}) \qquad (33)$$

For top-label calibration there are no constraints on the second term in Eq. (33), as $\hat{y}_E^{(n)}$ is not the top label for model $\theta^{(\tilde{m})}$. Thus there is a set of models that satisfy the top-label calibration constraints for member $\tilde{m}$ that only need to satisfy the following constraint:

$$0 \le \mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\theta^{(\tilde{m})}) < \mathrm{P}(\hat{y}_{\tilde{m}}^{(n)}|x^{(n)};\theta^{(\tilde{m})}) \le 1 \qquad (34)$$

together with the standard sum-to-one constraint over all classes. Consider replacing member $\tilde{m}$ of the ensemble with a member having parameters $\tilde{\theta}^{(\tilde{m})}$, to yield $\tilde{\Theta}$, that satisfies:

$$\arg\max_\omega \mathrm{P}(\omega|x^{(n)};\tilde{\theta}^{(\tilde{m})}) = \arg\max_\omega \mathrm{P}(\omega|x^{(n)};\theta^{(\tilde{m})}) = \hat{y}_{\tilde{m}}^{(n)} \qquad (35)$$
$$\mathrm{P}(\hat{y}_{\tilde{m}}^{(n)}|x^{(n)};\tilde{\theta}^{(\tilde{m})}) = \mathrm{P}(\hat{y}_{\tilde{m}}^{(n)}|x^{(n)};\theta^{(\tilde{m})})$$
$$\mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\tilde{\theta}^{(\tilde{m})}) = \mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\theta^{(\tilde{m})}) + \tau$$

where $\tau > 0$, the standard sum-to-one constraint is satisfied, and all other predictions are unaltered.
This results in the following constraints:

$$\arg\max_\omega \mathrm{P}(\omega|x^{(n)};\tilde{\Theta}) = \arg\max_\omega \mathrm{P}(\omega|x^{(n)};\Theta) = \hat{y}_E^{(n)} \qquad (38)$$
$$\mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\Theta) < \mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\tilde{\Theta}) \qquad (39)$$

The accuracies of the two ensembles $\Theta$ and $\tilde{\Theta}$ are the same from Eq. (38), but the probabilities associated with those predictions cannot be the same from Eq. (39), so both ensemble predictions cannot be calibrated. Assuming that the ensemble prediction for $\Theta$ is calibrated:

$$\frac{1}{N}\sum_{i=1}^N \mathrm{P}(\hat{y}_E^{(i)}|x^{(i)};\tilde{\Theta}) > \frac{1}{N}\sum_{i=1}^N \mathrm{P}(\hat{y}_E^{(i)}|x^{(i)};\Theta) = \frac{1}{N}\sum_{i=1}^N \delta(y^{(i)}, \hat{y}_E^{(i)}) \qquad (40)$$

Hence there are multiple values of $\mathrm{P}(\hat{y}_E^{(n)}|x^{(n)};\Theta)$ for which all the models satisfy the top-label calibration constraints, but these cannot all be consistent with Eq. (40). For the situation where there is no sample or model with $\hat{y}_E^{(n)} \ne \hat{y}_m^{(n)}$, the predictions of all models for all samples are the same as the ensemble prediction, so by definition there can be no performance gain.

A.2 GLOBAL GENERAL CALIBRATION AND TOP-LABEL CALIBRATION

To demonstrate the differences between global top-label calibration and global calibration, a set of ensemble member predictions was generated using Algorithm 1; this ensures that the predictions are perfectly calibrated. Since the member predictions are perfectly calibrated, the ensemble members will be globally calibrated. Figure 4(a) shows the performance in terms of ACE of the ensemble prediction as the value of $\epsilon$ increases; note that when $\epsilon = 1$ this is a global-calibration version of ACE. It can be seen that as $\epsilon$ increases ACE decreases, and in the global case it reduces to zero for the ensemble predictions, as the theory states. In terms of top-label calibration, as the ensemble members are perfectly calibrated, they will again be globally top-label calibrated. This is illustrated in Figure 4(b), where ECE is zero for all ensemble members. For top-label calibration the value of ECE does not decrease to zero as $\epsilon \to 1$, again as the theory states. This is because the underlying probability regions associated with each of the members of the ensemble are different. Hence, even for perfectly calibrated ensemble members, the ensemble prediction is not globally top-label calibrated.

A.3 TOY DATASETS

Example 1. In this example we show the difference between all-label calibration and top-label calibration, which consider the different regions in Eq. (3) and Eq. (4). Assume $p(x) \propto 1$ and that the input space $\mathcal{X}$ consists of three regions $R_1$, $R_2$ and $R_3$ with $\int_{x \in R_1} p(x)\,dx = \int_{x \in R_2} p(x)\,dx = \int_{x \in R_3} p(x)\,dx$. The system prediction and the true distribution are:

$$\mathrm{P}(\omega_j|x \in R_r;\theta) = \begin{array}{c|cccc} & \omega_1 & \omega_2 & \omega_3 & \omega_4 \\ \hline R_1 & 0.5 & 0.4 & 0.05 & 0.05 \\ R_2 & 0.3 & 0.4 & 0.2 & 0.1 \\ R_3 & 0.3 & 0.3 & 0.35 & 0.05 \end{array} \qquad (42)$$

$$\mathrm{P}(\omega_j|x \in R_r) = \begin{array}{c|cccc} & \omega_1 & \omega_2 & \omega_3 & \omega_4 \\ \hline R_1 & 0.5 & 0.4-\tau & 0.05 & 0.05+\tau \\ R_2 & 0.3-\tau & 0.4+\tau & 0.2 & 0.1 \\ R_3 & 0.3+\tau & 0.3 & 0.35 & 0.05-\tau \end{array}, \qquad \tau > 0 \qquad (43)$$

It can be verified that the system is all-label calibrated but not top-label calibrated.

Example 2. This example shows that the combination of calibrated members can yield an uncalibrated ensemble. Algorithm 1 generates the true data distribution by sampling from a Dirichlet distribution with equal concentration parameters of 1. To generate the member predictions, the N samples are randomly assigned to N/b bins of size b. In each bin, the member predictions are set equal to the average of the true data distributions of the associated samples. This ensures that Eq. (2) holds for each member on each of its regions. However, the regions of different members differ, due to the random assignment. Therefore the corresponding ensemble is not automatically calibrated. Figure 4 shows that the ensemble is uncalibrated, with an ACE of 0.0697.

Example 3. In this example we show that, for a finite number of samples, the regions $S_j^p(\theta,\epsilon)$ and $\tilde{S}_j^p(\theta,\epsilon)$ derived from the samples can differ from the theoretical regions, leading to a difference between the theoretical calibration error measures and the values estimated from the finite samples. Algorithm 2 generates data for which the finite-sample calibration error differs from the theoretical error: the theoretical ACE of the predicted probabilities in Algorithm 2 is 9/32, but the finite-sample ACE is 0. The true data distribution is $\mathrm{P}(\omega_j|x) = \frac{1}{4}$, $p(x) \propto 1$. The samples in $\mathcal{D}$ are assigned to bins of size 2.
The type of bin that a sample is assigned to determines the predicted probability of that sample. Considering each class ω_j, there are three types of bins: B_{p=1}, where both samples belong to class ω_j; B_{p=0.5}, where only one sample belongs to class ω_j; and B_{p=0}, where neither sample belongs to class ω_j.
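The theoretical ACE of 9/32 for Example 3 can be checked by enumerating the three bin types and their probabilities under P(ω_j|x) = 1/4. A minimal sketch, assuming the theoretical ACE weights each region by its probability mass:

```python
from fractions import Fraction

# True per-class probability is 1/4 for each of K = 4 classes; bins hold b = 2 samples.
q = Fraction(1, 4)

# Probability that a size-2 bin contains 2, 1 or 0 samples of a given class,
# together with the predicted probability that bin type induces.
bin_types = [
    (q * q,             Fraction(1)),      # B_{p=1}:   both samples in the class
    (2 * q * (1 - q),   Fraction(1, 2)),   # B_{p=0.5}: exactly one sample in the class
    ((1 - q) * (1 - q), Fraction(0)),      # B_{p=0}:   neither sample in the class
]

# Theoretical ACE: probability-weighted gap between predicted p and true 1/4.
ace = sum(w * abs(p - q) for w, p in bin_types)
print(ace)  # 9/32
```

The sample-based ACE is 0 because, within each group of samples sharing a predicted probability p, exactly a fraction p of the labels belong to the class by construction.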



In this section, the references given refer to sample-based equivalents of the distributional calibration expressions in this paper, i.e. versions built on the same concepts rather than identical expressions.



Figure 1: Top-label calibration error and accuracy of members (mem) and the whole ensemble (ens) on CIFAR-100 (test set) using LeNet, DenseNet and ResNet. "pre" denotes calibration where a shared temperature is applied to the members before combination. The reliability curves show the calibrated members and calibrated ensembles with optimal temperature values.

Figure 2: Reliability curves of the weighted combination of 4 calibrated structures, LEN, DSN100, DSN121 and RSN, on CIFAR-100. The weights are estimated by Max LL. Each structure is an ensemble of 10 models.

Figure 3: Impact of different region numbers on dynamic temperature annealing in calibration of ensembles of DSN100, DSN121 and RSN.

Proposition 4. When the class number K > 2, if all members are globally top-label calibrated, then the ensemble is not necessarily globally top-label calibrated.

Figure 4: ACE (calibration) and ECE (top-label calibration) of a perfectly calibrated set of ensemble members, as the value of ε varies.

For the system in Example 1, all-label calibration holds:

    ∫_{x∈S^p_j(θ,ε)} P(ω_j|x; θ)p(x)dx = ∫_{x∈S^p_j(θ,ε)} P(ω_j|x)p(x)dx,  ∀ω_j, p,  ε → 0    (44)

However, for the top-label regions, when p = 0.4, j = 2,

    ∫_{x∈S̃^p_j(θ,ε)} P(ω_j|x; θ)p(x)dx ≠ ∫_{x∈S̃^p_j(θ,ε)} P(ω_j|x)p(x)dx,  ε → 0
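The verification for Example 1 can also be done numerically. The following sketch pools the equal-mass regions by predicted probability for each class and compares against the true distribution; the concrete value τ = 0.05 is an arbitrary illustrative choice:

```python
import numpy as np

tau = 0.05  # any tau > 0
# System prediction P and true distribution P~ over regions R1..R3 (rows), classes w1..w4 (cols).
P = np.array([[0.5, 0.4, 0.05, 0.05],
              [0.3, 0.4, 0.2,  0.1 ],
              [0.3, 0.3, 0.35, 0.05]])
T = np.array([[0.5,       0.4 - tau, 0.05, 0.05 + tau],
              [0.3 - tau, 0.4 + tau, 0.2,  0.1       ],
              [0.3 + tau, 0.3,       0.35, 0.05 - tau]])

# All-label calibration: for every class j and predicted value p, the true
# probability averaged over the (equal-mass) regions predicting p equals p.
for j in range(4):
    for p in np.unique(P[:, j]):
        mask = P[:, j] == p
        assert np.isclose(T[mask, j].mean(), p)

# Top-label calibration fails: the top prediction of R2 is w2 with p = 0.4,
# but the true probability for that region is 0.4 + tau.
top = P.argmax(axis=1)   # top class per region
print(T[1, top[1]])      # 0.4 + tau, not 0.4
```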

Temperature calibration techniques on CIFAR-100; calibration parameters are optimized to minimize ECE on the validation set. In the "pre" mode, each member is scaled with a separate temperature. "dyn." denotes dynamic temperature scaling in post-combination mode using 6 region-based temperatures. Ranges indicate ±2σ.

Individual model performance on CIFAR-10. Full data training (100%Train) and training with 5,000 images held out (90%Train) are presented. * denotes results from the original papers.

Calibration using temperature annealing on CIFAR-10. The temperatures are optimized to minimize ECE on the validation set. In the 'pre' mode, each member is scaled with a separate temperature. 'dyn.' denotes dynamic temperature scaling in post-combination mode using 6 region-based temperatures. The four structures investigated are LeNet 5, DenseNet 121, DenseNet 100 and Wide ResNet 28. Ranges indicate ±2σ.

Topology ensembles for CIFAR-100, with optimal weights based on AUC. Each topology and the ensemble are calibrated using the post-combination mode ("post" in Table 1).

ANNEX

Algorithm 1: Algorithm for generating calibrated members that yield an uncalibrated ensemble
Result: {p^(i)}_{i=1}^N, {p̂^(i), ŷ^(i)}_{i=1}^N
b = 2;        // bin size, number of samples in one bin
K = 4;        // number of classes
M = 10;       // number of members
N = 1000000;  // number of samples
p^(i) ~ Dirichlet(α = 1), i = 1, ..., N;  // true data distribution, sampled from a Dirichlet distribution with equal concentration parameters of 1
I = [1, 2, ..., N];  // index vector
for m in [1, ..., M] do
    Ĩ ← shuffle(I);
    for j in [1, ..., N/b] do
        B_j ← {Ĩ[(j-1)b+1], ..., Ĩ[jb]};   // samples assigned to bin j
        p̂^(i) = p̄_{B_j}, ∀i ∈ B_j;        // member prediction = average true distribution over the bin
    end
end

A.4 ADDITIONAL EXPERIMENTAL RESULTS

In this section, we show some comparison experiments to the empirical results in Section 4.3. We conducted experiments on the CIFAR-100 and CIFAR-10 datasets, with 10 runs of each experiment to obtain the deviations.

In Section 4.3, we presented ensemble calibration on CIFAR-100. The counterpart on CIFAR-10 is given in Table 5. Unless otherwise specified, all sample-based evaluation criteria in this paper use 15 bins, following previous literature (Guo et al., 2017). The temperatures in the pre-, post- and dynamic post-combination modes are optimized on the validation set by minimizing ECE (Liang et al., 2020), using SGD with a learning rate of 0.1 for 400 iterations. It can be observed that the combination of DenseNet and ResNet improves the calibration performance, while the combination with LeNet does not. This is because LeNet is not as over-confident as DenseNet and ResNet (as shown in Figure 1); the simple ensemble combination therefore does not help, but instead aggravates the miscalibration. The three temperature-based calibration methods effectively improve the system calibration performance on CIFAR-10 as well.

Table 6 gives the ensemble combination based on AUC weights (Zhong & Kwok, 2013). The AUC weights are much more even than the Max LL weights in Table 2. The structures combined are first calibrated; nevertheless, applying post-combination calibration to the ensemble yields further gains.
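Algorithm 1 can be sketched in NumPy as follows. This uses a smaller N than the algorithm above for speed, and the ACE implementation (count-weighted over 15 bins per class) is an assumption about the exact estimator; the qualitative outcome is what matters: each member is calibrated by construction, while the averaged ensemble is not.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M, b = 20000, 4, 10, 2   # smaller N than Algorithm 1, for speed

p_true = rng.dirichlet(np.ones(K), size=N)   # true distribution per sample

# Each member: shuffle, group samples into bins of size b, and predict the
# bin average of the true distributions, so each member region is calibrated.
preds = np.empty((M, N, K))
for m in range(M):
    idx = rng.permutation(N)
    bin_avg = p_true[idx].reshape(N // b, b, K).mean(axis=1)
    preds[m, idx] = np.repeat(bin_avg, b, axis=0)

ens = preds.mean(axis=0)   # equal-weight ensemble prediction

def ace(pred, true, n_bins=15):
    """Count-weighted |mean predicted - mean true| over probability bins and classes."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for k in range(pred.shape[1]):
        bins = np.clip(np.digitize(pred[:, k], edges) - 1, 0, n_bins - 1)
        for i in range(n_bins):
            mask = bins == i
            if mask.any():
                err += mask.sum() * abs(pred[mask, k].mean() - true[mask, k].mean())
    return err / pred.size

print(ace(preds[0], p_true))   # ~0: the member is calibrated by construction
print(ace(ens, p_true))        # clearly > 0: the ensemble is not
```

The member error is (numerically) zero because every histogram bin contains whole assignment bins, whose predicted value equals their average true probability; the ensemble average destroys this alignment.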
We evaluated post-combination calibration for topology combination in Table 7. The post- and dynamic post-combination methods are applied to calibrate the topologies and the topology ensemble. The dynamic temperature method shows a clear advantage in obtaining a calibrated ensemble of multiple topologies.
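The dynamic (region-based) temperature scaling referred to above can be sketched as follows. The region definition (equal-width intervals over the top predicted probability) and the probability-space scaling p^(1/T) are assumptions for illustration, since the exact parameterisation is not reproduced in this section:

```python
import numpy as np

def dynamic_temperature(probs, temps, edges):
    """Scale each prediction with the temperature of the confidence region it falls in."""
    conf = probs.max(axis=1)                                         # top-label confidence
    region = np.clip(np.digitize(conf, edges) - 1, 0, len(temps) - 1)
    T = np.asarray(temps)[region][:, None]                           # one T per sample
    scaled = probs ** (1.0 / T)                                      # temperature in probability space
    return scaled / scaled.sum(axis=1, keepdims=True)                # renormalize

# 6 equal-width confidence regions; T > 1 softens the over-confident regions.
edges = np.linspace(0.0, 1.0, 7)
probs = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25]])
out = dynamic_temperature(probs, [1.0, 1.0, 1.0, 1.0, 2.0, 2.0], edges)
```

In this toy call, only the high-confidence first row falls in a region with T = 2 and is softened; the second row is left unchanged.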

