BEYOND CALIBRATION: ESTIMATING THE GROUPING LOSS OF MODERN NEURAL NETWORKS

Abstract

The ability to ensure that a classifier gives reliable confidence scores is essential for informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that, given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.

1. INTRODUCTION

Validating the compliance of a model with a predefined set of specifications is important to control operational risks related not only to performance but also to trustworthiness, fairness, and robustness to varying operating conditions. It often requires that probability estimates capture the actual uncertainty of the prediction, i.e., that they are close to the true posterior probabilities. Indeed, many situations call for probability estimates rather than just a discriminant classifier. Probability estimates are needed when the decision is left to a human decision maker, when the model needs to abstain from decisions that are too uncertain, when the context of model deployment is unknown at training time, etc.

To evaluate probabilistic predictions, statistics and decision theory have put forward proper scoring rules (Dawid, 1986; Gneiting et al., 2007), such as the Brier score or the log-loss. Strictly proper scoring rules are minimized when a model produces the true posterior probabilities, which makes them a valuable tool for comparing models and selecting those with the best estimated probabilities (e.g., Dawid & Musio, 2014). What they do not provide, though, is a means of validating whether the best estimated probabilities are good enough to be put into production, or whether further effort is needed to improve the model. Indeed, proper scores compound the irreducible loss (due to the inherent randomness of the problem, i.e., the aleatoric uncertainty) and the epistemic loss (which measures how far a model is from the best possible one). For example, a classifier with a Brier score of 0.15 could have optimal estimated probabilities (irreducible loss close to 0.15) or poor ones (irreducible loss close to 0). Calibration errors are another tool to evaluate probabilistic predictions, and measuring them is an active research topic in the machine learning community (Kumar et al., 2019; Minderer et al., 2021; Roelofs et al., 2022).
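The decomposition of a proper score into irreducible and epistemic parts can be illustrated with a small simulation (a sketch, not from the paper; the two-valued posterior and the constant-score model are hypothetical choices). For the Brier score, the identity E[(S - Y)^2] = E[(S - Q)^2] + E[Q(1 - Q)] splits the loss into an epistemic term and an irreducible term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary problem: two equiprobable regions with known true
# posterior probabilities Q (hypothetical values for illustration).
Q = rng.choice([0.1, 0.8], size=200_000)          # true P(Y=1|X)
Y = (rng.random(Q.size) < Q).astype(float)        # labels drawn from Q

def brier(S, Y):
    return np.mean((S - Y) ** 2)

S_oracle = Q                     # model outputting the true posteriors
S_poor = np.full_like(Q, 0.45)   # constant-score model (the base rate)

# Both Brier scores compound the irreducible part E[Q(1-Q)] and the
# epistemic part E[(S-Q)^2]; only the oracle has zero epistemic loss.
irreducible = np.mean(Q * (1 - Q))
print(brier(S_oracle, Y), irreducible)                              # ~0.125 each
print(brier(S_poor, Y), irreducible + np.mean((S_poor - Q) ** 2))   # ~0.2475 each
```

The two models differ widely in quality, yet the score alone does not say how much of each loss is irreducible, which is the validation gap discussed above.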
The calibration error is in fact a component of proper scoring rules (Bröcker, 2009; Kull & Flach, 2015): it measures whether, among all samples to which a calibrated classifier gave the same confidence score, the fraction of positives is on average equal to that score. Importantly, the calibration error can be evaluated efficiently, as it does not require access to the ground-truth probabilities but only to their calibrated version. Calibration is, however, an incomplete characterization of predictive uncertainty. It measures an aggregated error that is blind to individual errors compensating each other. For example, among a group of individuals to which a calibrated cancer-risk classifier assigns a probability of 0.6, a fraction of 60% actually has cancer. But one subgroup could be composed of 100% cancer patients while another contains only 20%.

In general, estimating the true posterior probabilities or obtaining individual guarantees is impossible (Vovk et al., 2005; Barber, 2020). Recent works have thus attempted to refine guarantees on uncertainty estimates at an intermediate subgroup level. In particular, Hébert-Johnson et al. (2018) introduced the notion of multicalibration, generalizing the notion of calibration within groups studied in fairness (Kleinberg et al., 2016) to every efficiently-identifiable subgroup. Barber et al. (2019) and Barber (2020) define subgroup-based coverage guarantees which lie in between the coarse marginal coverage and the impossible conditional coverage guarantees. In a similar vein, we study the remaining term measuring the discrepancy between the calibrated probabilities and the unknown true posterior probabilities (Kull & Flach, 2015), i.e., the grouping loss, for which no estimation procedure exists to date.
In particular:
• We provide a new decomposition of the grouping loss into explained and residual components, together with a debiased estimator of the explained component as a lower bound (Section 4).
• We demonstrate on simulations that the proposed estimator can provide tight lower bounds on the grouping loss (Section 5.1).
• We evidence for the first time the presence of grouping loss in pre-trained vision and language architectures, notably in distribution-shift settings (Section 5.2).

2. CALIBRATION IS NOT ENOUGH

Calibration can be understood with a broad conceptual meaning of alignment of measures and statistical estimates (Osborne, 1991). In the context of decision theory and classifiers, however, the definitions recalled at the end of this section are used (Foster & Vohra, 1998; Gneiting et al., 2007; Kull & Flach, 2015).

Confusion about calibration  A common confusion is to mistake the confidence scores of a calibrated classifier for the true posterior probabilities, i.e., to think that a calibrated classifier outputs the true posterior probabilities, which is false. We identified three main sources of confusion in the literature (see Appendix A for specific quotes). First, the vocabulary used sometimes leaves room for ambiguity: e.g., "posterior probabilities" may refer to confidence scores or to the true posterior probabilities without further specification. Second, plain-English definitions of calibration are sometimes incorrect, defining calibrated scores as the true posterior probabilities. Lastly, even when everything is correctly defined, it is sometimes implicitly assumed that the true posterior probabilities are close to the calibrated scores. While this may hold in some cases, equating the two induces misconceptions.

Calibration with good accuracy does not imply good individual confidences  It is tempting to think that a calibrated classifier with optimal accuracy should provide confidence scores close to the true posterior probabilities. However, caution is necessary: Figure 1 shows a simple counterexample. The classifier presented achieves optimal accuracy, as its confidence scores are always on the same side of the decision threshold as the true posterior probabilities. It is moreover calibrated: for a given score s (either 0.2 or 0.7 here), the expectation of Q over the region where the confidence score equals s is indeed equal to s. Yet the confidence scores are not equal to Q, as Q varies over regions of constant score. This variance can be made as large as desired, as long as both Q and S stay on the same side of the decision threshold to preserve accuracy. The flaws of a perfectly calibrated classifier that always predicts the same score are typically due to variations of the true posterior probabilities over constant confidence scores. As we formalize below, such variations are captured by the grouping loss (Kull & Flach, 2015). Appendix B provides a more realistic variant of this example based on the output of a neural network.

Figure 1: A calibrated binary classifier with optimal accuracy and confidence scores S(X) everywhere different from the true posterior probabilities Q(X).
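The Figure 1 counterexample can be checked numerically. Below is a minimal sketch with four hypothetical points whose Q values are chosen to average to each score while staying on the same side of the 0.5 threshold:

```python
import numpy as np

# Two score groups (0.2 and 0.7) as in Figure 1. Within each group the
# true posterior Q varies but averages to the score (hypothetical values).
S = np.array([0.2, 0.2, 0.7, 0.7])
Q = np.array([0.05, 0.35, 0.55, 0.85])

for s in np.unique(S):
    mask = S == s
    assert np.isclose(Q[mask].mean(), s)   # calibrated: E[Q|S=s] = s
assert np.all((S > 0.5) == (Q > 0.5))      # same decisions: optimal accuracy
assert np.all(S != Q)                      # yet no score equals Q
print("calibrated, accuracy-optimal, but S(X) != Q(X) everywhere")
```

Increasing the spread of Q within each group (e.g., 0.55/0.85 toward 0.51/0.89) leaves calibration and accuracy untouched while moving S further from Q, mirroring the "as large as desired" variance argument above.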



True posterior probabilities: Q := P(Y = 1 | X).
Confidence scores: S := f(X), the score output by the classifier.
Calibrated scores: C := P(Y = 1 | S) = E[Q | S], the average of the true posterior probabilities for a given score S.
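In practice C is not observed directly; a common way to approximate it is to bin the scores and average the labels within each bin, since E[Y | S] = E[Q | S]. A minimal sketch, where the data-generating process and the equal-width binning scheme are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores and labels: the model is miscalibrated by a
# constant +0.1 shift (hypothetical data-generating process).
S = rng.random(100_000)
Y = (rng.random(S.size) < np.clip(S + 0.1, 0, 1)).astype(float)

# Histogram estimate of the calibrated scores C = E[Y | S] = E[Q | S],
# using 10 equal-width bins over [0, 1].
bins = np.linspace(0, 1, 11)
idx = np.clip(np.digitize(S, bins) - 1, 0, len(bins) - 2)
C_hat = np.array([Y[idx == b].mean() for b in range(len(bins) - 1)])

print(np.round(C_hat, 2))   # roughly bin center + 0.1, capped at 1
```

Note that this estimator only recovers C, the within-bin average of Q; by construction it says nothing about how Q varies inside a bin, which is exactly the grouping loss.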

