SHOULD ENSEMBLE MEMBERS BE CALIBRATED?

Abstract

Underlying the use of statistical approaches for a wide range of applications is the assumption that the probabilities obtained from a statistical model are representative of the "true" probability that an event, or outcome, will occur. Unfortunately, this is not the case for modern deep neural networks, which are often observed to be poorly calibrated. Additionally, these deep learning approaches make use of large numbers of model parameters, motivating Bayesian, or ensemble approximation, approaches to handle issues with parameter estimation. This paper explores the application of calibration schemes to deep ensembles, both from a theoretical perspective and empirically on a standard image classification task, CIFAR-100. The underlying theoretical requirements for calibration, and the associated calibration criteria, are first described. It is shown that well-calibrated ensemble members will not necessarily yield a well-calibrated ensemble prediction, and that if the ensemble prediction is well calibrated, its performance cannot exceed the average performance of the calibrated ensemble members. The impact of calibration on ensemble prediction, and the associated calibration criteria, is then evaluated on CIFAR-100. Additionally, the situation where multiple different topologies are combined together is discussed.

1. INTRODUCTION

Deep learning approaches achieve state-of-the-art performance in a wide range of applications, including image classification. However, these networks tend to be overconfident in their predictions: they often exhibit poor calibration. A system is well calibrated if, when it makes a prediction with probability 0.6, that prediction is correct 60% of the time. Calibration is very important when deploying systems, especially in risk-sensitive tasks such as medicine (Jiang et al., 2012), autonomous driving (Bojarski et al., 2016), and economics (Gneiting et al., 2007). It was shown by Niculescu-Mizil & Caruana (2005) that shallow neural networks are well calibrated. However, Guo et al. (2017) found that more complex neural network models with deep structures do not exhibit the same behaviour. This work motivated recent research into calibration for general deep learning systems.

Previous research has mainly examined calibration based on samples from the true data distribution $\{x^{(i)}, y^{(i)}\}_{i=1}^{N} \sim p(x, \omega)$, $y^{(i)} \in \{\omega_1, \ldots, \omega_K\}$ (Zadrozny & Elkan, 2002; Vaicenavicius et al., 2019). This analysis relies on the limiting behaviour as $N \to +\infty$ to define a well calibrated system:

$$\mathrm{P}(y = \hat{y} \mid \mathrm{P}(\hat{y} \mid x; \theta) = p) = p \;\Longleftrightarrow\; \lim_{N \to +\infty} \frac{\sum_{i \in S_j^p} \delta(y^{(i)}, \hat{y}^{(i)})}{|S_j^p|} = p \qquad (1)$$

where $S_j^p = \{i \mid \mathrm{P}(\hat{y}^{(i)} = j \mid x^{(i)}; \theta) = p,\ i = 1, \ldots, N\}$, $\hat{y}^{(i)}$ is the model prediction for $x^{(i)}$, and $\delta(s, t) = 1$ if $s = t$, 0 otherwise. However, Eq. (1) does not explicitly reflect the relation between $\mathrm{P}(y = \hat{y} \mid \mathrm{P}(\hat{y} \mid x; \theta) = p)$ and the underlying data distribution $p(x, y)$. In this work we examine this explicit relationship and use it to define a range of calibration evaluation criteria, including the standard sample-based criteria.

One issue with deep learning approaches is the large number of model parameters associated with the networks. Deep ensembles (Lakshminarayanan et al., 2017) is a simple, effective approach for handling this problem.
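As an illustration, the limiting criterion in Eq. (1) can be approximated on a finite sample by grouping predictions into confidence bins (since exact equality of confidences rarely occurs with finitely many samples) and comparing each bin's average confidence with its empirical accuracy. The sketch below is not from the paper; the function and variable names are illustrative:

```python
import numpy as np

def empirical_calibration(confidences, correct, n_bins=10):
    """Finite-sample approximation of Eq. (1): partition predictions
    into confidence bins (playing the role of the sets S_j^p) and
    compare the mean confidence p with the empirical accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # (average confidence, empirical accuracy, bin count |S_j^p|)
            rows.append((confidences[mask].mean(),
                         correct[mask].mean(),
                         int(mask.sum())))
    return rows

# Toy check: a perfectly calibrated predictor is correct with
# probability equal to its stated confidence, so accuracy ~ confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
correct = (rng.uniform(size=10_000) < conf).astype(float)
for p, acc, n in empirical_calibration(conf, correct):
    print(f"confidence ~ {p:.2f}, accuracy = {acc:.2f}, n = {n}")
```

For a well calibrated system the two columns agree up to sampling noise; a systematic gap (accuracy below confidence) is exactly the overconfidence reported for deep networks.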
It has been found to improve performance, as well as providing measures of uncertainty. In the recent literature there have been "contradictory" empirical observations about the relationship between the calibration of the members of the ensemble and the calibration of the final

