BEYOND CALIBRATION: ESTIMATING THE GROUPING LOSS OF MODERN NEURAL NETWORKS

Abstract

Ensuring that a classifier gives reliable confidence scores is essential for informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.

1. INTRODUCTION

Validating the compliance of a model to a predefined set of specifications is important to control operational risks related to performance but also trustworthiness, fairness or robustness to varying operating conditions. It often requires that probability estimates capture the actual uncertainty of the prediction, i.e. are close to the true posterior probabilities. Indeed, many situations call for probability estimates rather than just a discriminant classifier. Probability estimates are needed when the decision is left to a human decision maker, when the model needs to avoid making decisions if they are too uncertain, when the context of model deployment is unknown at training time, etc. To evaluate probabilistic predictions, statistics and decision theory have put forward proper scoring rules (Dawid, 1986; Gneiting et al., 2007), such as the Brier score or the log-loss. Strictly proper scoring rules are minimized when a model produces the true posterior probabilities, which makes them a valuable tool for comparing models and selecting those with the best estimated probabilities (e.g. Dawid & Musio, 2014). What they do not provide, though, is a means of validating whether the best estimated probabilities are good enough to be put into production, or whether further effort is needed to improve the model. Indeed, proper scores compound the irreducible loss (due to the inherent randomness of a problem, i.e. the aleatoric uncertainty) and the epistemic loss (which measures how far a model is from the best possible one). For example, a classifier with a Brier score of 0.15 could have optimal estimated probabilities (irreducible loss close to 0.15) or poor ones (irreducible loss close to 0). Calibration errors are another tool to evaluate probabilistic predictions, and measuring them is an active research topic in the machine learning community (Kumar et al., 2019; Minderer et al., 2021; Roelofs et al., 2022).
The calibration error is in fact a component of proper scoring rules (Bröcker, 2009; Kull & Flach, 2015): it measures whether among all samples to which a classifier gave the same confidence score, on average, a fraction equal to the confidence score is positive. Importantly, the calibration error can be evaluated efficiently as it does not require access to the ground truth probabilities, but solely to their calibrated version. Calibration is however an incomplete characterization of predictive uncertainty. It measures an aggregated error that is blind to potential individual errors compensating each other. For example, among a group of individuals to which a calibrated cancer-risk classifier assigns a probability of 0.6, a fraction of 60% actually has cancer. But one subgroup of them could be composed of 100% cancer patients while another would contain only 20% cancer patients. In general, estimating the true posterior probabilities or obtaining individual guarantees is impossible (Vovk et al., 2005; Barber, 2020). Recent works have thus attempted to refine guarantees on uncertainty estimates at an intermediary subgroup level. In particular, Hébert-Johnson et al. (2018) have introduced the notion of multicalibration, generalizing the notion of calibration within groups studied in fairness (Kleinberg et al., 2016) to every efficiently-identifiable subgroup. Barber et al. (2019) and Barber (2020) define subgroup-based coverage guarantees which lie in between the coarse marginal coverage and the impossible conditional coverage guarantees. In a similar vein, we study the remaining term measuring the discrepancy between the calibrated probabilities and the unknown true posterior probabilities (Kull & Flach, 2015), i.e. the grouping loss, for which no estimation procedure exists to date.
In particular:
• We provide a new decomposition of the grouping loss into explained and residual components, together with a debiased estimator of the explained component as a lower bound (Section 4).
• We demonstrate on simulations that the proposed estimator can provide tight lower bounds on the grouping loss (Section 5.1).
• We evidence for the first time the presence of grouping loss in pre-trained vision and language architectures, notably in distribution-shift settings (Section 5.2).

2. CALIBRATION IS NOT ENOUGH

Calibration can be understood with a broad conceptual meaning of alignment of measures and statistical estimates (Osborne, 1991). However, in the context of decision theory or classifiers, the following definitions are used (Foster & Vohra, 1998; Gneiting et al., 2007; Kull & Flach, 2015):

Confusion about calibration. A common confusion is to mistake the confidence scores of a calibrated classifier for true posterior probabilities, i.e. to think that a calibrated classifier outputs true posterior probabilities, which is false. We identified three main sources of confusion in the literature (see Appendix A for specific quotes). First, the vocabulary used sometimes leaves room for ambiguity, e.g., posterior probabilities may refer to confidence scores or to the true posterior probabilities without further specification. Second, plain-English definitions of calibration are sometimes incorrect, defining calibrated scores as the true posterior probabilities. Lastly, even when everything is correctly defined, it is sometimes implicitly supposed that true posterior probabilities are close to the calibrated scores. While this may be true in some cases, equating the two induces misconceptions.

Calibration with good accuracy does not imply good individual confidences. It is tempting to think that a calibrated classifier with optimal accuracy should provide confidence scores close to the true posterior probabilities. However, caution is necessary: Figure 1 shows a simple counterexample. The classifier presented gives an optimal accuracy as its confidence scores are always on the same side of the decision threshold as the true posterior probabilities. It is moreover calibrated, as for a given score s (either 0.2 or 0.7 here), the expectation of Q over the region where the confidence score is s is actually equal to s. Yet, the confidence scores are not equal to Q, as Q displays variance over regions of constant scores. This variance can be made as large as desired as long as both Q and S stay on the same side of the decision threshold to preserve accuracy. The flaws of a perfectly calibrated classifier that always predicts the same score are typically due to variations of the true posterior probabilities over constant confidence scores. As we formalize below, such variations are captured by the grouping loss (Kull & Flach, 2015). Appendix B provides a more realistic variant of this example based on the output of a neural network.

Figure 1: A calibrated binary classifier with optimal accuracy and confidence scores S(X) everywhere different from the true posterior probabilities Q(X).
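The counterexample can be checked numerically. The following is a minimal sketch under stated assumptions: the four equally likely regions and the values of Q on them are hypothetical (only the two score levels 0.2 and 0.7 come from the text), and exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction as F

# Hypothetical toy problem: four equally likely regions of the feature space.
# Each entry is (confidence score S, true posterior Q) on that region.
regions = [
    (F(2, 10), F(5, 100)),
    (F(2, 10), F(35, 100)),
    (F(7, 10), F(55, 100)),
    (F(7, 10), F(85, 100)),
]
weight = F(1, len(regions))


def cond_mean_q(score):
    """E[Q | S = score] under uniform region weights."""
    qs = [q for s, q in regions if s == score]
    return sum(qs) / len(qs)


# Calibration: the average of Q on each level set equals the score.
calibrated = all(cond_mean_q(s) == s for s, _ in regions)

# Optimal accuracy: S and Q are always on the same side of the 1/2 threshold.
same_side = all((s > F(1, 2)) == (q > F(1, 2)) for s, q in regions)

# Since S is calibrated here (C = S), the mean squared gap between Q and S
# on the positive class measures the variation of Q within level sets.
squared_gap = sum(weight * (q - s) ** 2 for s, q in regions)

print(calibrated, same_side, float(squared_gap))
```

Widening the spread of Q around each score (while staying on the same side of 1/2) increases the squared gap without affecting calibration or accuracy.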

3. THEORETICAL BACKGROUND

Notations. Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be jointly distributed random variables describing the features and labels of a K-class classification task. Let $e_k$ be the one-hot vector of size K with its $k$-th entry equal to one. The label space $\mathcal{Y} = \{e_1, \dots, e_K\}$ is the set of all one-hot vectors of size K. We assume that labels are drawn according to the true posterior distribution $Q = (Q_1, \dots, Q_K) \in \Delta_K$ where $Q_k := P(Y = e_k \mid X)$ and $\Delta_K$ is the probability simplex $\Delta_K = \{(p_1, \dots, p_K) \in [0, 1]^K : \sum_k p_k = 1\}$. We consider a probabilistic classifier f giving scores $S = f(X)$ with $S = (S_1, \dots, S_K) \in \Delta_K$. Note that S and Q are random vectors since they depend on X. This section introduces the formal definition of the grouping loss, which uses the concepts of calibrated scores as well as scoring rules.

3.1. CALIBRATION IN A MULTI-CLASS SETTING

In multi-class settings, various definitions of calibration give different trade-offs between control stringency and practical utility (Vaicenavicius et al., 2019; Kull et al., 2019). The strongest definition controls the proportion of positives for groups of samples with the same vector of scores S.

Definition 3.1. A probabilistic classifier giving scores $s = (s_1, \dots, s_K)$ is jointly calibrated if among all instances getting score s, the class probabilities are actually equal to s:

Calibration: $P(Y = e_k \mid S = s) = s_k$ for $k = 1, \dots, K$.

The score S being a vector of size K, the number of classes, estimating the probability of Y conditioned on S is a difficult task that requires many samples. A weaker definition conditions on each class score separately.

Definition 3.2. A probabilistic classifier giving scores $s = (s_1, \dots, s_K)$ is classwise-calibrated if among all instances getting score $s_k$ for class k, the probability of class k is actually equal to $s_k$:

Classwise calibration: $P(Y = e_k \mid S_k = s_k) = s_k$ for $k = 1, \dots, K$.

As classwise calibration can still be challenging to estimate when the number of samples per class is too small, an even weaker definition is used in the machine learning community (Guo et al., 2017).

Definition 3.3. A probabilistic classifier giving scores $s = (s_1, \dots, s_K)$ is top-label-calibrated if among all instances for which the confidence score of the predicted class is s, the probability that the predicted class is the correct one is s:

Top-label calibration: $P(Y = e_{\arg\max(S)} \mid \max(S) = s) = s$.

Top-label calibration simplifies the problem by reducing it to a binary problem. However, it has an important limitation (Vaicenavicius et al., 2019): as it only accounts for the confidence of the predicted class, it does not tell whether smaller probabilities are also calibrated.
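The reduction of top-label calibration to a binary problem can be sketched as follows. This is an illustrative helper, not code from the paper: the function name `top_label_reduce` and the toy scores and labels are assumptions. Each sample is mapped to the confidence of its predicted class and to a binary outcome indicating whether the prediction is correct.

```python
def top_label_reduce(scores, labels):
    """Map multi-class scores to (confidence, correct) pairs.

    confidence: score of the predicted (arg-max) class.
    correct:    1 if the predicted class equals the label, else 0.
    """
    pairs = []
    for s, y in zip(scores, labels):
        predicted = max(range(len(s)), key=s.__getitem__)
        pairs.append((s[predicted], int(predicted == y)))
    return pairs


# Hypothetical 3-class scores and integer labels.
scores = [(0.7, 0.2, 0.1), (0.1, 0.6, 0.3), (0.5, 0.3, 0.2), (0.2, 0.2, 0.6)]
labels = [0, 1, 2, 2]

pairs = top_label_reduce(scores, labels)
print(pairs)
```

The scores of the non-predicted classes are discarded by this reduction, which is the limitation noted above: nothing is checked about the smaller probabilities.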

3.2. PROPER SCORING RULES AND THEIR DECOMPOSITION

Scoring rules. Scoring rules measure how well an estimated probability vector S explains the observed labels Y. The two most widely used scoring rules are the log-loss and the Brier score:

Log-loss: $\phi_{LL}(S, Y) := -\sum_{k=1}^{K} Y_k \log S_k$    Brier score: $\phi_{BS}(S, Y) := \sum_{k=1}^{K} (S_k - Y_k)^2$    (4)

Scoring rules are defined per sample, and the score over a dataset is obtained by averaging over samples. More generally, the expected score for rule $\phi$ of the estimated probability vector S with regard to the class label Y drawn according to Q is given by $s_\phi(S, Q) := \mathbb{E}_{Y \sim Q}[\phi(S, Y)]$. Proper scoring rule decompositions have been introduced in terms of their divergences rather than their scores. The divergence between probability vectors S and Q is then defined as $d_\phi(S, Q) := s_\phi(S, Q) - s_\phi(Q, Q)$. The divergences for the Brier score and the log-loss read:

Log-loss: $d_{LL}(S, Q) := \sum_{k=1}^{K} Q_k \log \frac{Q_k}{S_k}$    Brier score: $d_{BS}(S, Q) := \sum_{k=1}^{K} (S_k - Q_k)^2$    (5)

Minimizing the Brier score in expectation thus amounts to minimizing the mean squared error between S and the unknown Q. A scoring rule is said to be strictly proper if its divergence is non-negative and $d_\phi(S, Q) = 0$ implies $S = Q$. Both the log-loss and the Brier score are strictly proper.

Scoring rules decomposition. Let C be the calibrated scores in the sense of Definition 3.1, the strongest one, i.e., $C_k = P(Y = e_k \mid S)$ for $k = 1, \dots, K$. The divergence of strictly proper scoring rules can be decomposed as (Kull & Flach, 2015):

$\mathbb{E}[d_\phi(S, Y)] = \underbrace{\mathbb{E}[d_\phi(S, C)]}_{\text{Calibration: CL}} + \underbrace{\mathbb{E}[d_\phi(C, Q)]}_{\text{Grouping: GL}} + \underbrace{\mathbb{E}[d_\phi(Q, Y)]}_{\text{Irreducible: IL}}$    (6)

where the expectation is taken over $Y \sim Q$ and X. CL is the calibration loss. IL is the irreducible loss, which stems from the fact that one point may not have a deterministic label, making perfect predictions impossible. GL is the grouping loss.
Intuitively, while the calibration loss captures the deviation of the expected score in a bin from the expected posterior probabilities, the grouping loss captures variations of the true posterior probabilities around their expectation. Together, calibration and grouping form the epistemic loss, which captures how far the classifier is from the best possible predictor. The scoring rule decomposition (6) holds for top-label calibration (Definition 3.3) as it can be reduced to a binary problem. In the case of classwise calibration, the extension is not straightforward in the general case, but we prove in Proposition C.3 that it holds for the Brier score and the log-loss.
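As an illustration, the decomposition (6) can be verified exactly for the Brier score on a small discrete problem. The sketch below rests on assumptions: the four regions, their weights and the values of S and Q are hypothetical, and the calibrated scores are computed as $C = \mathbb{E}[Q \mid S]$ by averaging Q over regions sharing the same score vector.

```python
from collections import defaultdict
from fractions import Fraction as F


def brier_div(p, q):
    """Brier divergence d_BS(p, q) = sum_k (p_k - q_k)^2."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


def one_hot(k, K):
    return tuple(F(int(j == k)) for j in range(K))


def expected_div_to_label(p, q):
    """E_{Y ~ q}[d_BS(p, Y)] for a one-hot label Y."""
    K = len(q)
    return sum(q[k] * brier_div(p, one_hot(k, K)) for k in range(K))


def calibrated_scores(regions):
    """C = E[Q | S]: weighted mean of Q over regions with the same S."""
    mass, qsum = defaultdict(F), {}
    for w, s, q in regions:
        mass[s] += w
        prev = qsum.get(s, (F(0),) * len(q))
        qsum[s] = tuple(a + w * b for a, b in zip(prev, q))
    return {s: tuple(v / mass[s] for v in qsum[s]) for s in qsum}


# Hypothetical binary problem: (weight, scores S, true posterior Q) per region.
regions = [
    (F(1, 4), (F(3, 10), F(7, 10)), (F(1, 10), F(9, 10))),
    (F(1, 4), (F(3, 10), F(7, 10)), (F(3, 10), F(7, 10))),
    (F(1, 4), (F(8, 10), F(2, 10)), (F(9, 10), F(1, 10))),
    (F(1, 4), (F(8, 10), F(2, 10)), (F(6, 10), F(4, 10))),
]
C = calibrated_scores(regions)

total = sum(w * expected_div_to_label(s, q) for w, s, q in regions)
cl = sum(w * brier_div(s, C[s]) for w, s, q in regions)       # calibration loss
gl = sum(w * brier_div(C[s], q) for w, s, q in regions)       # grouping loss
il = sum(w * expected_div_to_label(q, q) for w, s, q in regions)  # irreducible

print(float(total), float(cl), float(gl), float(il))
```

In this toy problem the irreducible loss dominates the total score, which is why the total alone says little about how good the estimated probabilities are.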

4. CHARACTERIZATION OF THE GROUPING LOSS

In this section, we focus for simplicity on all settings where the calibrated scores can be expressed as $C_k = \mathbb{E}[Y_k \mid S]$, which includes binary classification as well as the multi-class setting with joint or top-label calibration. For classwise calibration, Appendix C.9 shows that all the results presented in this section also hold for the Brier score and log-loss.

4.1. REWRITING THE GROUPING LOSS AS A FORM OF VARIANCE

To shed light on the grouping loss, we rewrite it using f-variances:

Definition 4.1 (f-variance). Let $U, V : \Omega \to \mathbb{R}^d$ be two random variables defined on the same probability space, and let $f : \mathbb{R}^d \to \mathbb{R}$ be a function. Assuming the required expectations exist, the f-variance of U given V is:

$\mathbb{V}_f[U \mid V] := \mathbb{E}[f(U) \mid V] - f(\mathbb{E}[U \mid V])$.

The f-variance corresponds to the Jensen gap. It is non-negative by Jensen's inequality when f is convex. Beyond non-negativity, it can be seen as an extension of the variance, as using the square function for f recovers the traditional notion of variance.

Lemma 4.1 (The grouping loss as an h-variance). Let h be the negative entropy of the scoring rule $\phi$, i.e. $h : p \mapsto -s_\phi(p, p)$. The grouping loss GL of the classifier S with calibrated scores $C = \mathbb{E}[Q \mid S]$ and scoring rule $\phi$ writes:

$\underbrace{\mathbb{E}[d_\phi(C, Q)]}_{GL(S)} = \mathbb{E}[\mathbb{V}_h[Q \mid S]]$    (7)

The proof is given in Appendix C.1. In other words, the grouping loss associated with a scoring rule $\phi$ is an h-variance of the true posterior probability Q around the average scores C on groups of same confidence level S (Equation 7). In particular, for the Brier score, the h-variance is a classical variance. It measures the discrepancy between Q and C with a squared norm: $\mathbb{V}_h[Q \mid S] = \mathbb{E}[\|Q - C\|^2 \mid S]$. For the log-loss, it is a Kullback-Leibler divergence: $\mathbb{V}_h[Q \mid S] = \mathbb{E}[D_{KL}(Q \,\|\, C) \mid S]$. These expressions highlight two challenges in estimating the grouping loss. First, it relies on the true posterior probabilities Q, which we do not have access to. Second, it involves a conditioning on the confidence scores S, which is difficult to estimate for continuous scores.
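The two closed forms of the h-variance can be checked on a single level set. The sketch below is illustrative: the helper names and the two values of Q are assumptions. It uses $h(p) = \sum_k p_k^2 - 1$ for the Brier score and $h(p) = \sum_k p_k \log p_k$ for the log-loss, both obtained from $h(p) = -s_\phi(p, p)$.

```python
import math


def h_brier(p):
    """Negative entropy of the Brier score: -s(p, p) = sum_k p_k^2 - 1."""
    return sum(pk * pk for pk in p) - 1.0


def h_log(p):
    """Negative entropy of the log-loss: sum_k p_k log p_k."""
    return sum(pk * math.log(pk) for pk in p if pk > 0)


def h_variance(h, points, weights):
    """E[h(Q)] - h(E[Q]) on one level set, plus the mean C = E[Q]."""
    K = len(points[0])
    mean = [sum(w * p[k] for p, w in zip(points, weights)) for k in range(K)]
    return sum(w * h(p) for p, w in zip(points, weights)) - h(mean), mean


# Hypothetical level set: Q takes two values with equal probability.
q_values = [(0.5, 0.5), (0.9, 0.1)]
weights = [0.5, 0.5]

v_brier, c = h_variance(h_brier, q_values, weights)
v_log, _ = h_variance(h_log, q_values, weights)

# Direct computations of the closed forms stated in the text.
sq_norm = sum(
    w * sum((qk - ck) ** 2 for qk, ck in zip(q, c))
    for q, w in zip(q_values, weights)
)
kl = sum(
    w * sum(qk * math.log(qk / ck) for qk, ck in zip(q, c))
    for q, w in zip(q_values, weights)
)

print(v_brier, sq_norm, v_log, kl)
```

Both h-variances are Jensen gaps of a convex function, so they are non-negative and vanish only when Q is constant on the level set.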

4.2. GROUPING LOSS DECOMPOSITION AND LOWER-BOUND

As an h-variance of Q given S, evaluating the grouping loss requires access to Q(X) for any point X. Unfortunately Q(X) is difficult to estimate, except in special settings, e.g. multiple labels per sample as in Mimori et al. (2021). In fact, the scores S of a classifier are generally one's best estimate of Q, and the whole point of the grouping loss is to quantify how far this best estimate is from the unknown oracle Q. We show that it is nevertheless possible to estimate a lower bound on the grouping loss. On the level set where a classifier score is S, it is indeed possible to estimate the average of Q on regions of the feature space. Since by definition Q is non-constant on the level set of a classifier with non-zero grouping loss, this makes it possible to capture part of the grouping loss (Figure 2). Intra-region variance remains uncaptured but can be reduced by choosing smarter and more numerous regions in the partition of the feature space. Theorem 4.1 formalizes this intuition:

Figure 2: Level set S = 0.7 of the feature space X, split into two regions $R_1$ and $R_2$, with $\mathbb{E}[Q \mid S] = 0.7$, $\mathbb{E}[Q \mid S, R_1] = 0.6$ and $\mathbb{E}[Q \mid S, R_2] = 0.8$. The intra-region variances $\mathbb{V}_h[Q \mid S, R_1]$ and $\mathbb{V}_h[Q \mid S, R_2]$ remain uncaptured.

Theorem 4.1 (Grouping loss decomposition). Let $R : \mathcal{X} \to \mathbb{N}$ be a partition of the feature space. It holds that:

$GL(S) = \underbrace{\mathbb{E}[\mathbb{V}_h[\mathbb{E}[Q \mid S, R] \mid S]]}_{GL_{explained}(S)} + \underbrace{\mathbb{E}[\mathbb{V}_h[Q \mid S, R]]}_{GL_{residual}(S)}$

Moreover, if the scoring rule is proper, then: $GL(S) \geq GL_{explained}(S) \geq 0$.

Appendix C.2 gives the proof by showing that the law of total variance is also valid for the h-variance, which makes it possible to decompose the grouping loss into explained and residual terms. $GL_{explained}$ quantifies the h-variance captured through the partition R, i.e. the coarse-grained h-variance reflecting between-region variations of Q, while $GL_{residual}$ captures the remaining intra-region h-variance. Due to the non-negativity of $GL_{residual}$, $GL_{explained}$ is a lower bound of the grouping loss that ranges between 0 and GL depending on how much h-variance the partition captures. Importantly, while $\mathbb{V}_h[Q \mid S, R]$ cannot be estimated because the oracle Q is unknown, it is possible to estimate $\mathbb{E}[Q \mid S, R]$ and thus $GL_{explained}$.

4.3. CONTROLLING THE GROUPING LOSS INDUCED BY BINNING CLASSIFIER SCORES S

The grouping loss as well as $GL_{explained}$ involve a conditioning on the confidence scores S, which cannot be estimated by mere counting when the scores are continuous. To overcome this difficulty, standard practice in calibration approximates the conditional expectation using a binning strategy: the classifier scores are binned into a finite number of values (Definition 4.2).

Definition 4.2 (Binned classifier). Let $S : \mathcal{X} \to \Delta_K$ be a classifier. Let $\mathcal{B} := \{B_j\}_{1 \leq j \leq J}$ be a partition of $\Delta_K$. The binned version of S outputs the average of S on each bin:

$S_B : \mathcal{X} \to \mathcal{S}, \quad x \mapsto \mathbb{E}[S \mid S \in B_j]$ where $B_j$ is the bin S(x) falls into.

The binned calibrated scores are defined by: $C_B := P(Y = 1 \mid S_B) = \mathbb{E}[Q \mid S_B] = \mathbb{E}[C \mid S_B]$.

This is the approach taken by the popular Expected Calibration Error (ECE) (Naeini et al., 2015). However, the loss estimated for a binned classifier deviates from that of the original one. In particular, binning biases the calibration loss downwards (Kumar et al., 2019). Here we show that, on the contrary, it creates an upwards bias for the grouping loss. Binning a classifier S into $S_B$ boils down to merging the level sets of S into a finite number of larger level sets of confidence score $S_B$. For example in Figure 3, all level sets with $S \in [0.5, 1]$ are merged into one level set of confidence $S_B = 0.75$, which artificially inflates the variance of Q in each bin. This intuition is formalized in Proposition 4.1.

Proposition 4.1 (Binning-induced grouping loss). The grouping loss of the binned classifier $GL(S_B)$ deviates from that of the original classifier $GL(S)$ by an induced grouping loss $GL_{induced}(S, S_B)$:

$\underbrace{\mathbb{E}[\mathbb{V}_h[Q \mid S_B]]}_{GL(S_B)} = \underbrace{\mathbb{E}[\mathbb{V}_h[Q \mid S]]}_{GL(S)} + \underbrace{\mathbb{E}[\mathbb{V}_h[C \mid S_B]]}_{GL_{induced}(S, S_B)}$    (11)

Moreover, if the scoring rule is proper: $GL_{induced}(S, S_B) \geq 0$. Appendix C.3 gives the proof.
Proposition 4.1 shows that the difference between the grouping loss of the binned and original classifier is given by the h-variance of the original calibrated scores in a bin. This result provides an expression for $GL_{induced}$ which can then be estimated as shown in Section 4.4.

Remark 1. Interestingly, the binning-induced grouping and calibration losses partly compensate each other (Corollary C.1 in Appendix C.8).

Applying the decomposition of Theorem 4.1 to the binned classifier $S_B$ and accounting for binning using Proposition 4.1, we obtain a new decomposition of the grouping loss:

Proposition 4.2 (Explained grouping loss accounting for binning).

$GL(S) = GL_{explained}(S_B) - GL_{induced}(S, S_B) + GL_{residual}(S_B)$    (12)

If the scoring rule is proper, then:

$GL(S) \geq \underbrace{GL_{explained}(S_B) - GL_{induced}(S, S_B)}_{GL_{LB}(S, S_B)}$.    (13)

The proof is given in Appendix C.4. Importantly, contrary to the grouping loss, both terms in the lower bound (Equation 13) can be estimated. In the remainder of this paper, we will be interested in the estimation and optimization of the lower bound $GL_{LB}(S, S_B)$.
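The identities of Theorem 4.1 and Propositions 4.1 and 4.2 can be checked exactly on a toy problem in the Brier case, where the h-variance on the positive class is a classical variance. The sketch below is a population-level illustration with known Q, not an estimator; the four equally likely points, their regions and the single bin are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F


def cond_mean(values, keys):
    """Replace each value by the mean of values sharing its key (uniform weights)."""
    totals = defaultdict(lambda: [0, 0])
    for v, k in zip(values, keys):
        totals[k][0] += v
        totals[k][1] += 1
    return [totals[k][0] / totals[k][1] for k in keys]


def expected_cond_var(values, keys):
    """E[Var(values | keys)] under uniform weights."""
    means = cond_mean(values, keys)
    return sum((v - m) ** 2 for v, m in zip(values, means)) / len(values)


# Hypothetical points: (score s, region r, true posterior q), equally likely.
points = [
    (F(6, 10), 0, F(4, 10)),
    (F(6, 10), 1, F(8, 10)),
    (F(8, 10), 0, F(7, 10)),
    (F(8, 10), 1, F(9, 10)),
]
s = [p[0] for p in points]
r = [p[1] for p in points]
q = [p[2] for p in points]
b = [0] * len(points)        # both score levels fall into a single bin
br = list(zip(b, r))         # regions within the bin

gl = expected_cond_var(q, s)                           # GL(S)
gl_binned = expected_cond_var(q, b)                    # GL(S_B)
c = cond_mean(q, s)                                    # calibrated scores C
gl_induced = expected_cond_var(c, b)                   # GL_induced(S, S_B)
gl_explained = expected_cond_var(cond_mean(q, br), b)  # GL_explained(S_B)
gl_residual = expected_cond_var(q, br)                 # GL_residual(S_B)
gl_lb = gl_explained - gl_induced                      # GL_LB(S, S_B)

print(float(gl), float(gl_binned), float(gl_induced), float(gl_lb))
```

Here the bin merges two level sets, so part of the variance explained by the regions is an artifact of binning; subtracting the induced term restores a valid lower bound on GL(S).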

4.4. GROUPING LOSS ESTIMATION

We now derive a grouping-loss estimation procedure by focusing on each of its components in turn: $GL_{explained}(S_B)$ and $GL_{induced}(S, S_B)$.

A debiased estimator for the explained grouping loss $GL_{explained}(S_B)$. The most natural estimator for the explained grouping loss is a plugin estimator, replacing conditional expectations such as $\mathbb{E}[Q \mid S, R]$ by empirical averages. This plugin estimate is biased upwards in finite samples, which the debiased estimator below corrects. Let $n^{(s,k)}$ (resp. $n^{(s,k)}_j$) be the number of samples belonging to level set $R^{(s)}$ (resp. region $R^{(s)}_j$). We define the empirical average of Y over these regions as:

$\hat{\mu}^{(s,k)}_j := \frac{1}{n^{(s,k)}_j} \sum_{i : X^{(i)} \in R^{(s)}_j} Y^{(i)}_k$ and $\hat{c}^{(s,k)} := \frac{1}{n^{(s,k)}} \sum_{i : X^{(i)} \in R^{(s)}} Y^{(i)}_k$

The debiased estimator of $GL_{explained}$ (Proposition 4.3) is:

$\widehat{GL}_{explained}(S_B) = \sum_{k=1}^{K} \sum_{s \in \mathcal{S}} \frac{n^{(s,k)}}{n} \widehat{GL}^{(s,k)}_{explained}(S_B)$

with:

$\widehat{GL}^{(s,k)}_{explained}(S_B) = \underbrace{\sum_{j=1}^{J} \frac{n^{(s,k)}_j}{n^{(s,k)}} \left(\hat{\mu}^{(s,k)}_j - \hat{c}^{(s,k)}\right)^2}_{\text{plugin estimator } \widehat{GL}_{plugin}} - \underbrace{\left( \sum_{j=1}^{J} \frac{n^{(s,k)}_j}{n^{(s,k)}} \frac{\hat{\mu}^{(s,k)}_j (1 - \hat{\mu}^{(s,k)}_j)}{n^{(s,k)}_j - 1} - \frac{\hat{c}^{(s,k)} (1 - \hat{c}^{(s,k)})}{n^{(s,k)} - 1} \right)}_{\text{bias estimation } \widehat{GL}_{bias}}$

Appendix C.5 gives the proof, with a debiasing logic similar to Bröcker (2012). The leftmost term corresponds to the plugin estimate: the estimator of the explained grouping loss (Theorem 4.1) with sample estimators for the quantities of interest. The two rightmost terms represent the finite-sample variance in estimating expectations over regions. They correct the upwards bias of the plugin estimate.

Estimation of the grouping loss induced by binning classifier scores. $GL_{induced}(S, S_B)$ involves the h-variance of the calibrated scores C inside each bin, thus its estimation requires C. A solution is to estimate a continuous calibration curve $\hat{C}$, which amounts to a one-dimensional problem for which various methods are available. In our experiments, we use a kernel-based method (e.g. LOWESS). It is then easy to compute the h-variance of $\hat{C}$ inside each bin by evaluating $\hat{C}$ for all available samples. The resulting expression of the estimator $\widehat{GL}_{induced}$ is given in Appendix C.7.
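For concreteness, the per-level-set term of the debiased estimator can be sketched in a few lines for the binary case, with the indices s and k dropped. This is an illustrative reading of the formula above, not the authors' implementation: the function name `explained_gl_level_set` and the toy labels are assumptions, and the bias term is grouped as the weighted region variances minus the level-set variance.

```python
def explained_gl_level_set(labels_by_region):
    """Plugin, bias and debiased estimates of the explained grouping loss
    on one level set, from binary labels grouped by region.

    Every region needs at least two samples, since the variance estimates
    divide by (n_j - 1).
    """
    n = sum(len(region) for region in labels_by_region)
    c_hat = sum(sum(region) for region in labels_by_region) / n

    plugin = 0.0
    region_variance = 0.0
    for region in labels_by_region:
        n_j = len(region)
        if n_j < 2:
            raise ValueError("each region needs at least two samples")
        mu_j = sum(region) / n_j
        w_j = n_j / n
        plugin += w_j * (mu_j - c_hat) ** 2
        region_variance += w_j * mu_j * (1 - mu_j) / (n_j - 1)

    bias = region_variance - c_hat * (1 - c_hat) / (n - 1)
    return plugin, bias, plugin - bias


# Hypothetical labels in two regions of one level set.
plugin, bias, debiased = explained_gl_level_set([[1, 1, 1, 0], [1, 0, 0, 0]])
print(plugin, bias, debiased)
```

With only four samples per region, a large share of the plugin value is attributed to sampling noise, which illustrates why the correction matters for fine partitions.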
A partition to minimize $GL_{residual}$. In order to achieve the best possible lower bound, we choose partitions in Theorem 4.1 to minimize $GL_{residual}$. We use a decision tree with a loss corresponding to the scoring rule (squared loss for the Brier score) on the labels Y to define regions that minimize the loss on a given level set of S. As this approach relies on Y, a train-test split is used to control for overfitting: a partitioning of the feature space is defined using the leaves of the tree fitted on one part, then the empirical means used in $GL_{explained}$ are estimated on the other part given this partitioning. In the experiments of Section 5.2, we work in the output space of the penultimate layer of the networks.
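The train-test logic can be sketched with a simplified stand-in for the tree: a single balanced split (decision stump) at the median of one feature, chosen to minimize the squared loss on the labels. This is a hypothetical minimal version, not the tree implementation used in the experiments; the function names, the synthetic level set and its true posterior (0.9 or 0.5 depending on the sign of the second feature) are assumptions.

```python
import random
from statistics import median


def sse(values):
    """Sum of squared errors around the mean (squared loss of a leaf)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_balanced_stump(X, y):
    """Choose the feature whose median split minimizes the squared loss of y."""
    best = None
    for d in range(len(X[0])):
        thr = median(x[d] for x in X)
        left = [yi for x, yi in zip(X, y) if x[d] <= thr]
        right = [yi for x, yi in zip(X, y) if x[d] > thr]
        if not left or not right:
            continue
        loss = sse(left) + sse(right)
        if best is None or loss < best[0]:
            best = (loss, d, thr)
    if best is None:
        raise ValueError("no valid split found")
    return best[1], best[2]


def region_means(X, y, feature, thr):
    """Fraction of positives in the two regions defined by the stump."""
    left = [yi for x, yi in zip(X, y) if x[feature] <= thr]
    right = [yi for x, yi in zip(X, y) if x[feature] > thr]
    return sum(left) / len(left), sum(right) / len(right)


# Synthetic level set (assumption): Q depends only on the second feature.
rng = random.Random(0)
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(2000)]
y = [int(rng.random() < (0.9 if x[1] > 0 else 0.5)) for x in X]

half = len(X) // 2
feature, thr = fit_balanced_stump(X[:half], y[:half])               # fit on one part
mu_left, mu_right = region_means(X[half:], y[half:], feature, thr)  # estimate on the other
print(feature, round(thr, 3), round(mu_left, 3), round(mu_right, 3))
```

Because the region means are computed on samples that were not used to choose the split, a split that merely fits label noise would yield similar means in both regions rather than a spurious gap.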

5.1. SIMULATIONS: FINER PARTITIONS GIVE A TIGHT GROUPING-LOSS LOWER BOUND

Here we investigate the behavior of our estimation procedure with respect to the number of bins and number of regions on simulated data with known grouping loss. The importance of both corrections, the binning-induced grouping loss (Proposition 4.1) and the debiasing (Proposition 4.3), is also evaluated. For this, data $Y \in \{0, 1\}$ is drawn according to a known true posterior probability Q and we consider a calibrated logistic regression classifier for the scores S (details in Appendix B.2 and Figure 11). The estimation procedures are then applied according to two different scenarios. First we vary the number of samples per region (i.e. the region ratio) while the number of bins is fixed (Figure 4a). Then we vary the number of bins while the region ratio is fixed (Figure 4b). For a fine-enough partition (a large number of regions, and hence a small region ratio), $GL_{LB}$ provides a tight lower bound to the true grouping loss GL. If the average number of samples per region becomes too small, some regions have fewer than two samples, which breaks the estimate (grayed out area in Figure 4a). Conversely, the naive plugin estimate $GL_{plugin}$ substantially overestimates the true grouping loss as it does not include the corrections $GL_{induced}$ and $GL_{bias}$. Figure 4b shows that to control the $GL_{induced}$ due to binning, a reasonably large number of bins is needed, e.g. 15 as is typical to compute ECE. Given these bins, we suggest using a tree to divide them into as many regions as possible while controlling the probability of regions ending up with fewer than two samples, typically targeting a region ratio of a dozen, to obtain the best possible lower bound $GL_{LB}$.

5.2. MODERN NEURAL NETWORKS DISPLAY GROUPING LOSS

The grouping diagram: visualization of the grouping loss. In a binary setting, calibration curves display the calibrated scores C versus the confidence scores S of the positive class. To visualize the heterogeneity among region scores in a level set, we add to this representation the estimated region scores $\hat{\mu}_j$, i.e. the fraction of positives in each region obtained from the partitioning of level sets (Figure 5). The further apart the region scores are, the greater the grouping loss.

Vision. We evaluate 15 vision models (listed in Figure 7) from PyTorch pre-trained on ImageNet-1K (Deng et al., 2009). Here we report evaluation on ImageNet-R (Hendrycks et al., 2021), an ImageNet variant with 15 different renditions: paintings, toys, tattoos, origami... The dataset contains 30 000 images and 200 ImageNet classes. Appendix D reports evaluation on the validation set of ImageNet-1K and on ImageNet-C (Hendrycks & Dietterich, 2019), an ImageNet variant with corrupted versions of ImageNet images. As often with many classes, the small number of samples per class (50) does not allow studying the classwise calibration and grouping loss. Hence, following common practice, we consider top-label versions (Definition 3.3). Appendix D gives experimental details. We find substantial grouping loss inside level sets for most networks on ImageNet-R (ConvNeXt Tiny and ViT L-16 in Figure 6, others in Appendix D.1), even after post-hoc recalibration (Figure 6 right). For instance, while ConvNeXt + Isotonic is calibrated (third graph), it is strongly over-confident in one part of the feature space and under-confident in the other, creating a high grouping loss. Figure 7 shows that the grouping loss varies across architectures, even with comparable accuracy. For example, ViT has a slightly better accuracy than ConvNeXt, but a lower estimated grouping loss. Post-hoc recalibration does not affect the grouping loss (Figure 6).

Figure 6: ConvNeXt Tiny and ViT L-16 on ImageNet-R, without recalibration and with isotonic recalibration (c. and d.). In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

Figure 7: Evaluating vision models: a debiased estimate of the grouping loss lower bound $GL_{LB}$ (Equation 13) and an estimate of the calibration loss CL, both accounting for binning, evaluated on ImageNet-R and sorted by model accuracy. Partitions R are obtained from a decision tree partitioning constrained to create at most (# samples in bin)/30 regions in each bin. Isotonic regression is used for post-hoc recalibration of the models (right).

Figure 8: BART Large on Yahoo Answers Topics, without recalibration and with isotonic recalibration (c. and d.). The test set is either restricted to the 5 topics on which the network was trained (in-distribution) or to 5 unseen topics (out-of-distribution). In each level set, clusters are built with a balanced decision stump and a 50-50 train-test split strategy.

NLP. We evaluate the grouping loss on BART Large (Lewis et al., 2019) from HuggingFace pre-trained on the Multi-Genre Natural Language Inference dataset (Williams et al., 2018). We consider zero-shot topic classification on the Yahoo Answers Topics dataset, composed of questions and topic labels. There are 60 000 test samples and 10 topics. The model is fine-tuned on 5 out of the 10 topics of the training set, totaling 700 000 samples. Given a question title and a hypothesis (e.g. "This text is about Science & Mathematics"), the model outputs its confidence that the hypothesis is true. The classification being zero-shot, the hypothesis can be about an unseen topic. We evaluate the model separately on the 5 unseen topics and the 5 seen topics of the test set. Both result in a binary classification task on whether the hypothesis is correct or not. Appendix E gives experimental details. The partitioning reveals grouping loss in the out-of-distribution setting both before and after recalibration (Figure 8b and d). However, we found no evidence of grouping loss in the in-distribution setting. As in vision, this suggests that out-of-distribution settings lead to stronger grouping loss.

6. DISCUSSION AND CONCLUSION

A working estimator of grouping loss. While calibrated scores can be estimated by solving a one-dimensional problem, the grouping loss is much harder to estimate: it measures the discrepancy to the true posterior probabilities, which are unknown. We show that combining debiased partition-based estimators with an optimized partition captures the grouping loss well. This procedure allows us to characterize the grouping loss of popular neural networks for the first time. We find that in vision and NLP, models can be calibrated (if needed via post-hoc recalibration) but significant heterogeneity of errors remains, e.g. ConvNeXt has larger grouping loss than calibration loss. Several avenues could be explored to better capture the grouping loss. Complex level sets may not be approximated well with the partitioning defined by a tree, leaving a large residual in Theorem 4.1. In this case, the estimated grouping loss may only be a rather loose lower bound. Such a lower bound is nevertheless useful to reject models with high grouping loss. In addition, we apply the tree on the penultimate layer of neural networks, where class boundaries are simplified. Finally, complementing the proposed lower bound with an upper bound would also make it possible to identify models without grouping loss.

We need to talk about grouping loss. Models should be evaluated not only on aggregate measures, but also on their individual predictions, using the grouping loss. The presence of grouping loss means that the model is systematically under-confident for certain groups of individuals and over-confident for others, questioning the use of such models for individual decision making. The presence of grouping loss also means that downstream tasks relying on confidence scores can be hindered, such as causal inference with propensity scores or simulation-based inference. Finally, this heterogeneity raises fairness concerns.
In fact, the grouping loss and our lower bound are fundamentally related to fairness: see sufficiency and group calibration (Barocas et al., 2017, chap. 3; Kleinberg et al., 2016), and multicalibration (Hébert-Johnson et al., 2018). We hope that our measure of grouping loss will spur new research in this area.

A EXAMPLES OF CONFUSING STATEMENTS ON CALIBRATION

Here we detail specific examples of confusing statements on calibration in the literature. We choose most of these examples from well-cited and well-regarded works.

• Kuhn & Johnson (2013): "We desire that the estimated class probabilities are reflective of the true underlying probability of the sample. That is, the predicted class probability (or probability-like value) needs to be well-calibrated. To be well-calibrated, the probabilities must effectively reflect the true likelihood of the event of interest."
The authors write that it is desirable to have confidence scores S reflective of the true posterior probabilities Q, which is indeed desirable as discussed in Section 1. However, they write that this is obtained through calibration. Although post-hoc recalibration makes the confidence scores closer to Q in some sense, there is an implicit shortcut. As pointed out in Section 2 and Appendix B, calibration, even with optimal accuracy, does not guarantee confidence scores S to be close to the true posterior probabilities Q.

• Gupta et al. (2020): "A classifier is said to be calibrated if the probability values it associates with the class labels match the true probabilities of correct class assignments."
The authors write that calibration is matching the confidence scores S of a classifier to the true posterior probabilities Q. In fact, calibration is matching the confidence scores S to the calibrated scores C, which can be far from the true posterior probabilities Q as pointed out in Section 2 and Appendix B.

• Garcin & Stéphan (2021): "Ideally, we would like machine learning models to output accurate probabilities in the sense that they reflect the real unobserved probabilities. This is exactly the purpose of calibration techniques, which aim to map the predicted probabilities to the true ones in order to reduce the probability distribution error of the model."
The authors write that calibration is outputting confidence scores S that are true posterior probabilities Q. As in the previous citations, calibration is outputting calibrated scores C, which can be far from Q (Section 2 and Appendix B).

• Flach (2016): "A probabilistic classifier is well calibrated if, among the instances receiving a predicted probability vector p, the class distribution is approximately distributed as p. Hence, the classifier approximates, in some sense, the class posterior." "The main point is that knowing the true class posterior allows the classifier to make optimal decisions. It therefore makes sense for a classifier to (approximately) learn the true class posterior."
Here, calibration is rightly defined as outputting confidence scores S that are equal to the calibrated scores C. However, by writing that the confidence scores S of a calibrated classifier approximate the true class posterior Q, the author makes an implicit assumption that the calibrated scores C are close to the true posterior probabilities Q, which is not guaranteed in theory as pointed out in Section 2 and Appendix B.

B EXAMPLES OF ACCURATE AND CALIBRATED CLASSIFIERS WITH HIGH GROUPING LOSS

Here we build simple binary classification examples of calibrated classifiers with optimal accuracy having their confidence scores far from the true posterior probabilities. In Appendix B.1 we build examples with an arbitrary link between true posterior probabilities Q and confidence scores S (up to a limit to keep the classifier's accuracy optimal). In Appendix B.2 we build a more realistic example based on the output of a neural network.

B.1 ARBITRARY LINK BETWEEN TRUE POSTERIOR PROBABILITIES Q AND CONFIDENCE SCORES S

To show that calibration, even combined with optimal accuracy, does not impose strong constraints on how close the true posterior probabilities $Q$ must be to the classifier's confidence scores $S$, we build examples in which $Q$ and $S$ have an arbitrary link. For simplicity we consider binary examples with a one-dimensional feature space $\mathcal{X}$. These can be extended to multiple dimensions by projecting onto a vector $\omega$ (via $x \mapsto \omega^T x$).

[Figure 9 panels: links $h : s \mapsto -s^2 + 2s$ and $h : s \mapsto \min(2s, 1)$. For each link, one plot of $S(X)$ and $Q(X)$ against $X$ and one plot of the true probability $Q$ against the confidence score $S$ (legend: Negative, Positive).]

Figure 9: Calibrated but not accurate. Examples of calibrated classifiers $S$ constructed from links $h$ following the procedure described in Appendix B.1. The accuracy of these two classifiers is not optimal as $Q$ and $S$ are not on the same side of the decision threshold ($\tfrac{1}{2}$) wherever $Q \neq \tfrac{1}{2}$. Refer to Figure 10 for an example with optimal accuracy. Calibration curves (in black on the 2nd and 4th plots) are obtained from 1 million samples.

The idea is to build a classifier that outputs confidence scores having at most two antecedents each. One antecedent should have its true posterior probability $Q$ at an arbitrary distance $+\Delta$ from the associated confidence score $S$, while the other has a distance $-\Delta$. Scores with only one antecedent should have $Q = S$. Combined with an equal density weight of $X$ on the two antecedents, this guarantees calibration: $\mathbb{E}[Q \mid S] = S$. To keep the classifier's accuracy optimal, the offset $\Delta$ is constrained to keep $Q$ and $S$ on the same side of the decision threshold.

To achieve this, we cut the one-dimensional feature space $\mathcal{X}$ into three parts: $\mathbb{R}^\star_+$, $\mathbb{R}^\star_-$ and $\{0\}$. As a classifier, we take an even function $S(X)$ with $S^{-1}(\{0\})$ reduced to a singleton, so that each confidence score has either two antecedents (one in $\mathbb{R}^\star_+$ and one in $\mathbb{R}^\star_-$) or one antecedent in $\{0\}$. To assign an equal weight to each antecedent, we choose a symmetric distribution for $X$, e.g. a normal distribution centered on 0. We build the true posterior probabilities $Q$ from deviations $h : [0, 1] \to [0, 1]$ and $g : [0, 1] \to [0, 1]$ of the confidence scores $S$ in $\mathbb{R}^\star_+$ and $\mathbb{R}^\star_-$:

$$Q : x \mapsto \mathbf{1}_{x>0}\, h(S(x)) + \mathbf{1}_{x<0}\, g(S(x)) + \mathbf{1}_{x=0}\, S(0)$$

For $S$ to be calibrated, the deviations must average to the identity, i.e. $\forall s \in S(\mathbb{R}),\ \tfrac{1}{2}(h(s) + g(s)) = s$. A proof of this statement is given below.

Proof.

\begin{align*}
\mathbb{E}[Q(X) \mid S(X)] &= \mathbb{E}[\mathbf{1}_{X>0} \mid S(X)]\, h(S(X)) + \mathbb{E}[\mathbf{1}_{X<0} \mid S(X)]\, g(S(X)) + \mathbb{E}[\mathbf{1}_{X=0} \mid S(X)]\, S(0) \\
&= P(X > 0 \mid S(X))\, h(S(X)) + P(X < 0 \mid S(X))\, g(S(X)) + P(X = 0 \mid S(X))\, S(0) \\
&= \tfrac{1}{2} \mathbf{1}_{S(X) \neq S(0)} \big(h(S(X)) + g(S(X))\big) + \mathbf{1}_{S(X) = S(0)}\, S(0)
\end{align*}

since $P(X > 0 \mid S(X)) = P(X < 0 \mid S(X)) = \tfrac{1}{2} \mathbf{1}_{S(X) \neq S(0)}$. Hence, $S(X)$ calibrated $\Leftrightarrow \mathbb{E}[Q(X) \mid S(X)] = S(X) \Leftrightarrow \tfrac{1}{2}\big(h(S(X)) + g(S(X))\big) = S(X)$.

From here, we choose $h : [0, 1] \to [0, 1]$ and define $g : s \mapsto 2s - h(s)$. Note that to keep $g(s) \in [0, 1]$, $h$ is constrained by: $\forall s \in S(\mathbb{R}),\ 2s - 1 \le h(s) \le 2s$. At this point of the procedure, the classifiers $S$ may not have optimal accuracy. Figure 9 shows two examples of links $h$, one of which saturates the constraint $h(s) \le \min(2s, 1)$. To make the classifiers accurate, the deviations $h(s) - s$ should be small enough to keep $S$ and $Q$ on the same side of the decision threshold. This adds two constraints on $h$: $\forall s \in S(\mathbb{R}) \cap [0, \tfrac{1}{2}[,\ h(s) < \tfrac{1}{2}$ and $\forall s \in S(\mathbb{R}) \cap [\tfrac{1}{2}, 1],\ h(s) \ge \tfrac{1}{2}$ (with the convention that a score of exactly $\tfrac{1}{2}$ predicts the positive class). Figure 10 (left) shows a classifier built following the above procedure. Figure 10 (right) shows that we can release the constraint $\tfrac{1}{2}(h(s) + g(s)) = s$ if we tweak the distribution of $X$ to adapt the weights between the two antecedents accordingly (and take e.g. $g(s) = \mathbf{1}_{h(s) < s}$).

[Figure 10 panels: links $h : s \mapsto \max(\min(2s, \tfrac{1}{2}), 2s - 1)$ (left) and $h : s \mapsto \tfrac{1}{2} \mathbf{1}_{0<s<1} + \mathbf{1}_{s=1}$ (right). For each link, one plot of $S(X)$ and $Q(X)$ against $X$ and one plot of the true probability $Q$ against the confidence score $S$ (legend: Negative, Positive).]

Figure 10: Calibrated and optimal accuracy. Examples of calibrated classifiers $S$ constructed from links $h$ following the procedure described in Appendix B.1. The accuracy of these two classifiers is optimal as $Q$ and $S$ are on the same side of the decision threshold ($\tfrac{1}{2}$) wherever $Q \neq \tfrac{1}{2}$. However, the confidence scores $S$ are almost everywhere different from the true posterior probabilities $Q$. Calibration curves (in black on the 2nd and 4th plots) are obtained from 1 million samples.
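The construction of Appendix B.1 can be checked numerically. The sketch below is only an illustration under stated assumptions: the even classifier `score` (here $1 - e^{-|x|}$) and the evaluation grid are hypothetical choices that do not come from the paper, while the link `h` is the one shown in the left panel of Figure 10. Because the grid is symmetric, the two antecedents $x$ and $-x$ of each score carry equal weight, as they would under a symmetric distribution of $X$.

```python
import numpy as np

def h(s):
    # Link of Figure 10 (left): h(s) = max(min(2s, 1/2), 2s - 1).
    return np.maximum(np.minimum(2 * s, 0.5), 2 * s - 1)

def g(s):
    # Mirror deviation g(s) = 2s - h(s), so that (h(s) + g(s)) / 2 = s.
    return 2 * s - h(s)

def score(x):
    # Hypothetical even classifier, strictly increasing in |x|:
    # every score except S(0) has exactly two antecedents, x and -x.
    return 1 - np.exp(-np.abs(x))

def posterior(x):
    # Q(x) = h(S(x)) for x > 0, g(S(x)) for x < 0, S(0) at x = 0.
    s = score(x)
    return np.where(x > 0, h(s), np.where(x < 0, g(s), s))

# Symmetric grid: x and -x carry equal weight.
x_pos = np.linspace(0.001, 4.0, 4000)
s = score(x_pos)
q_pos, q_neg = posterior(x_pos), posterior(-x_pos)

# Calibration: the two antecedents of each score average back to the score.
calibration_gap = np.max(np.abs(0.5 * (q_pos + q_neg) - s))

# Optimal accuracy: Q and S fall on the same side of 1/2 wherever Q != 1/2.
q_all = np.concatenate([q_pos, q_neg])
s_all = np.concatenate([s, s])
decided = q_all != 0.5
same_side = bool(np.all((q_all[decided] >= 0.5) == (s_all[decided] >= 0.5)))

# Yet the scores remain far from the true posteriors on average.
mean_distance = float(np.mean(np.abs(q_all - s_all)))
print(calibration_gap, same_side, mean_distance)
```

The calibration gap should vanish up to floating-point error, while the average distance between $Q$ and $S$ stays clearly positive, which is the point made by Figure 10.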

B.2 REALISTIC EXAMPLE BASED ON NEURAL NETWORK'S OUTPUT

The examples of Appendix B.1, while proving our point, are quite unusual in practice, especially in the choice of classifier $S$. In this section we build a more realistic example based on the output of a neural network. We focus on a binary classification setting with a feature space $\mathcal{X}$ that is at least two-dimensional. The classifier is taken as a sigmoid of $\omega^T X$ (akin to the last layer of a neural network predicting the confidence score of the positive class). Based on this choice of model, we build a class of calibrated and accurate classifiers with confidence scores $S$ far from the true posterior probabilities $Q$. The idea is to create heterogeneity in the blind spot of calibration, i.e. orthogonally to $\omega$. The perturbations creating heterogeneity must balance each other out to keep the classifier calibrated. To achieve this, we define:

• $d \ge 2$ the dimension of the feature space $\mathcal{X}$.
• $\omega \in \mathbb{R}^d$, the last layer's weights.
• $\varphi : \mathbb{R} \to [0, 1]$ the link function mapping $\omega^T x$ to confidence scores, e.g. a sigmoid.
• $S : x \in \mathbb{R}^d \mapsto \varphi(\omega^T x) \in [0, 1]$ the classifier's confidence scores of the positive class.
• $\omega_\perp \in \mathbb{R}^d$ such that $\omega^T \omega_\perp = 0$, the direction in which heterogeneity will be introduced.
• $\psi : \mathbb{R} \to [-1, 1]$ an odd perturbation introducing balanced heterogeneity along $\omega_\perp$.
• $\Delta_{\max} : x \mapsto \min(1 - S(x), S(x))$ modulating the range of the perturbation to keep $Q \in [0, 1]$.
• $Q : x \in \mathbb{R}^d \mapsto S(x) + \psi(\omega_\perp^T x)\, \Delta_{\max}(x) \in [0, 1]$ the constructed true posterior probabilities.
• $X \sim \mathcal{N}(0, \Sigma)$ the data distribution, with $\Sigma \in \mathbb{R}^{d \times d}$ having $\omega$ and $\omega_\perp$ among its eigenvectors.

With the above construction, the classifier $S$ is calibrated. Indeed,

\begin{align}
\mathbb{E}[Q(X) \mid S(X)] &= S(X) + \mathbb{E}\big[\psi(\omega_\perp^T X)\, \Delta_{\max}(X) \mid S(X)\big] \tag{15} \\
&= S(X) + \mathbb{E}\big[\psi(\omega_\perp^T X) \mid S(X)\big]\, \Delta_{\max}(X) \tag{16}
\end{align}

since $\Delta_{\max}(X)$ is a function of $S(X)$. We have $\mathbb{E}[\psi(\omega_\perp^T X) \mid S(X)] = 0$ by construction: $\psi$ is odd and the distribution of $X$ has a symmetric weight along $\omega_\perp$ since $\Sigma$ is aligned on $\omega$ and $\omega_\perp$. Hence $\mathbb{E}[Q(X) \mid S(X)] = S(X)$.
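The calibration argument above can be illustrated with a short simulation. This is a sketch under assumptions rather than the authors' code: the values of `omega` and `omega_perp`, the identity covariance, the sample size and the 15-bin calibration curve are illustrative choices, and the perturbation is the sign function $\psi(z) = \mathbf{1}_{z>0} - \mathbf{1}_{z<0}$. Each draw is paired with its mirror image along $\omega_\perp$ so that the sample is exactly symmetric in that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
omega = np.array([1.0, 0.0])       # last-layer weights (illustrative choice)
omega_perp = np.array([0.0, 1.0])  # orthogonal direction carrying the heterogeneity

# X ~ N(0, I). Every draw is paired with its reflection along omega_perp, so the
# sample has the same symmetry as the distribution it is drawn from.
half = rng.standard_normal((50_000, 2))
X = np.concatenate([half, half * np.array([1.0, -1.0])])

S = 1.0 / (1.0 + np.exp(-X @ omega))   # confidence scores with a sigmoid link
psi = np.sign(X @ omega_perp)          # odd perturbation along omega_perp
delta_max = np.minimum(1.0 - S, S)     # keeps Q inside [0, 1]
Q = S + psi * delta_max                # constructed true posterior probabilities

# Empirical calibration curve on 15 equal-width bins of S.
n_bins = 15
bins = np.minimum((S * n_bins).astype(int), n_bins - 1)
counts = np.bincount(bins, minlength=n_bins)
filled = counts > 0
mean_S = np.bincount(bins, weights=S, minlength=n_bins)[filled] / counts[filled]
mean_Q = np.bincount(bins, weights=Q, minlength=n_bins)[filled] / counts[filled]

calibration_gap = float(np.max(np.abs(mean_Q - mean_S)))  # S is calibrated
mean_distance = float(np.mean(np.abs(Q - S)))             # yet Q is far from S
print(calibration_gap, mean_distance)
```

The average of $Q$ in each bin matches the average of $S$, although individual posteriors sit at distance $\Delta_{\max}(x)$ from the score.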
Figure 11 shows two examples generated with this procedure. However, the classifier is not yet accurate. As in Appendix B.1, the perturbation should be constrained to keep $Q$ and $S$ on the same side of the decision threshold to keep the accuracy optimal. This is simply achieved by defining $\Delta_{\max} : x \mapsto \min(1 - S(x), S(x), |\tfrac{1}{2} - S(x)|)$.

[Figure 11 panels: a. Classifier $S(X) = (1 + \exp(-\omega^T X))^{-1}$, shown over $(X_1, X_2)$ with the level line $S = \tfrac{1}{2}$. b. Perturbation $\psi(z) = 2(1 + \exp(-z))^{-1} - 1$. c. Perturbation $\psi(z) = \mathbf{1}_{z>0} - \mathbf{1}_{z<0}$. For each perturbation: $\Delta(X) = \psi(\omega_\perp^T X)\, \Delta_{\max}(X)$ and $Q(X) = S(X) + \Delta(X)$ over $(X_1, X_2)$, and the true probability $Q$ against the confidence score $S$ (legend: Negative, Positive).]

Figure 11: Example of a calibrated classifier $S$ constructed following the procedure described in Appendix B.2. Its accuracy is not optimal as $Q$ and $S$ are not on the same side of the decision threshold ($\tfrac{1}{2}$) wherever $Q \neq \tfrac{1}{2}$. Refer to Figure 12 for an example with optimal accuracy. Calibration curves (in black on the last column) are obtained from 1 million samples.

[Figure 12 panels: same layout as Figure 11, with the perturbation $\Delta(X)$ ranging between $-\tfrac{1}{4}$ and $\tfrac{1}{4}$.]

$$\underbrace{\mathbb{E}[d_\phi(C, Q)]}_{GL(S)} = \mathbb{E}[V_h[Q \mid S]]$$

Proof of Lemma 4.1. Let $\phi$ be a scoring rule, $h : p \mapsto -s_\phi(p, p)$ and $C = \mathbb{E}[Q \mid S]$.
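\begin{align}
\mathbb{E}[d_\phi(C, Q)] &= \mathbb{E}[s_\phi(C, Q) - s_\phi(Q, Q)] && \text{Definition of divergence} \tag{17} \\
&= \mathbb{E}[s_\phi(C, Q) + h(Q)] && \text{Definition of } h \tag{18} \\
&= \mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \phi(C, e_k)\, Q_k + h(Q)\Big] && \text{Definition of expected score} \tag{19} \\
&= \mathbb{E}\Big[\mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \phi(C, e_k)\, Q_k + h(Q) \,\Big|\, S\Big]\Big] && \text{Law of total expectation} \tag{20} \\
&= \mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \mathbb{E}[\phi(C, e_k)\, Q_k \mid S] + \mathbb{E}[h(Q) \mid S]\Big] && \text{Linearity of expectation} \tag{21} \\
&= \mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \phi(C, e_k)\, \mathbb{E}[Q_k \mid S] + \mathbb{E}[h(Q) \mid S]\Big] && \phi(C, e_k) \text{ is a function of } S \tag{22} \\
&= \mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \phi(C, e_k)\, C_k + \mathbb{E}[h(Q) \mid S]\Big] && C_k = \mathbb{E}[Q_k \mid S] \tag{23} \\
&= \mathbb{E}\big[-h(C) + \mathbb{E}[h(Q) \mid S]\big] && \text{Definition of } h \tag{24} \\
&= \mathbb{E}\big[\mathbb{E}[h(Q) \mid S] - h(\mathbb{E}[Q \mid S])\big] && C = \mathbb{E}[Q \mid S] \tag{25} \\
&= \mathbb{E}[V_h[Q \mid S]] && \text{Definition of } V_h[Q \mid S] \tag{26}
\end{align}

C.2 THEOREM 4.1: GROUPING LOSS DECOMPOSITION

Lemma C.1 (Law of total h-variance). Let $X, Y, Z : \Omega \to \mathbb{R}^d$ be random variables defined on the same probability space and let $f : \mathbb{R}^d \to \mathbb{R}$ be a function. The law of total variance holds for the $f$-variance:

$$V_f[Y \mid Z] = \mathbb{E}[V_f[Y \mid X, Z] \mid Z] + V_f[\mathbb{E}[Y \mid X, Z] \mid Z]$$

Proof.

\begin{align*}
\mathbb{E}[f(Y)] &= \mathbb{E}[\mathbb{E}[f(Y) \mid X]] && \text{Law of total expectation} \\
&= \mathbb{E}[V_f[Y \mid X]] + \mathbb{E}[f(\mathbb{E}[Y \mid X])] && \text{Definition of } V_f[Y \mid X] \\
\mathbb{E}[f(Y)] - f(\mathbb{E}[Y]) &= \mathbb{E}[V_f[Y \mid X]] + \mathbb{E}[f(\mathbb{E}[Y \mid X])] - f(\mathbb{E}[\mathbb{E}[Y \mid X]]) && \text{Law of total expectation} \\
&= \mathbb{E}[V_f[Y \mid X]] + V_f[\mathbb{E}[Y \mid X]] && \text{Definition of } V_f[\mathbb{E}[Y \mid X]]
\end{align*}

The same proof holds when the expectations and h-variances are conditioned on $Z$.

Theorem 4.1 (Grouping loss decomposition). Let $R : \mathcal{X} \to \mathbb{N}$ be a partition of the feature space. It holds that:

$$GL(S) = \underbrace{\mathbb{E}\big[V_h[\mathbb{E}[Q \mid S, R] \mid S]\big]}_{GL_{explained}(S)} + \underbrace{\mathbb{E}\big[V_h[Q \mid S, R]\big]}_{GL_{residual}(S)}$$

Moreover, if the scoring rule is proper, then: $GL(S) \ge GL_{explained}(S) \ge 0$.

Proof of Theorem 4.1. Applying Lemma C.1 with $(R, Q, S)$ as $(X, Y, Z)$ gives the decomposition. Proper scoring rules have a convex negative entropy $h$ (see Gneiting & Raftery, 2007, th. 1). Note that depending on the convention (maximization or minimization of scoring rules), one may find in the literature that the entropy is either convex or concave. In the convention taken by this article (minimization of scoring rules), the entropy is concave and the negative entropy is convex. Using Jensen's inequality, we thus have $V_h[Q \mid S, R] \ge 0$ and $V_h[\mathbb{E}[Q \mid S, R] \mid S] \ge 0$, hence $GL(S) \ge GL_{explained}(S) \ge 0$.

$$\underbrace{\mathbb{E}[V_h[Q \mid S_B]]}_{GL(S_B)} = \underbrace{\mathbb{E}[V_h[Q \mid S]]}_{GL(S)} + \underbrace{\mathbb{E}[V_h[C \mid S_B]]}_{GL_{induced}(S, S_B)}$$

Moreover, if the scoring rule is proper: $GL_{induced}(S, S_B) \ge 0$.

Proof of Proposition 4.1.

\begin{align*}
V_h[Q \mid S_B] &= \mathbb{E}[V_h[Q \mid S, S_B] \mid S_B] + V_h[\mathbb{E}[Q \mid S, S_B] \mid S_B] && \text{Law of total h-variance (Lemma C.1)} \\
&= \mathbb{E}[V_h[Q \mid S] \mid S_B] + V_h[\mathbb{E}[Q \mid S] \mid S_B] && S_B \text{ is a function of } S \\
&= \mathbb{E}[V_h[Q \mid S] \mid S_B] + V_h[C \mid S_B] && C = \mathbb{E}[Q \mid S] \\
\mathbb{E}[V_h[Q \mid S_B]] &= \mathbb{E}[V_h[Q \mid S]] + \mathbb{E}[V_h[C \mid S_B]] && \text{Law of total expectation}
\end{align*}

Let $n^{(s,k)}$ (resp. $n_j^{(s,k)}$) be the number of samples belonging to level set $R^{(s)}$ (resp. region $R_j^{(s)}$). We define the empirical average of $Y$ over these regions as:

$$\hat{\mu}_j^{(s,k)} := \frac{1}{n_j^{(s,k)}} \sum_{i : X^{(i)} \in R_j^{(s)}} Y_k^{(i)} \quad \text{and} \quad \hat{c}^{(s,k)} := \frac{1}{n^{(s,k)}} \sum_{i : X^{(i)} \in R^{(s)}} Y_k^{(i)}$$

The debiased estimator of $GL_{explained}$ is:

$$\widehat{GL}_{explained}(S_B) = \sum_{k=1}^{K} \sum_{s \in \mathcal{S}} \frac{n^{(s,k)}}{n}\, \widehat{GL}^{(s,k)}_{explained}(S_B)$$

with:

$$\widehat{GL}^{(s,k)}_{explained}(S_B) = \underbrace{\sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \Big(\hat{\mu}_j^{(s,k)} - \hat{c}^{(s,k)}\Big)^2}_{\text{plugin estimator } \widehat{GL}_{plugin}} - \underbrace{\Bigg(\sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}}\, \frac{\hat{\mu}_j^{(s,k)} \big(1 - \hat{\mu}_j^{(s,k)}\big)}{n_j^{(s,k)} - 1} - \frac{\hat{c}^{(s,k)} \big(1 - \hat{c}^{(s,k)}\big)}{n^{(s,k)} - 1}\Bigg)}_{\text{bias estimation } \widehat{GL}_{bias}}$$

Proof. Let $s \in \mathcal{S}_k$ and $k \in \{1, \dots, K\}$, and define $\hat{p}_j^{(s,k)} := n_j^{(s,k)} / n^{(s,k)}$. We now compute the bias of the plugin estimator for $GL^{(s,k)}_{explained}$. To ease calculations, we start by rewriting the plugin estimate:

\begin{align}
\widehat{GL}^{(s,k)}_{plugin} &= \sum_{j=1}^{J} \hat{p}_j^{(s,k)} \Big(\hat{\mu}_j^{(s,k)} - \hat{c}^{(s,k)}\Big)^2 \tag{28} \\
&= \sum_{j=1}^{J} \hat{p}_j^{(s,k)} \big(\hat{\mu}_j^{(s,k)}\big)^2 - 2\hat{c}^{(s,k)} \Bigg(\sum_{j=1}^{J} \hat{p}_j^{(s,k)} \hat{\mu}_j^{(s,k)}\Bigg) + \big(\hat{c}^{(s,k)}\big)^2 \tag{29} \\
&= \sum_{j=1}^{J} \hat{p}_j^{(s,k)} \big(\hat{\mu}_j^{(s,k)}\big)^2 - \big(\hat{c}^{(s,k)}\big)^2 \tag{30}
\end{align}

From now on, we omit the exponent $(s, k)$ to lighten notations. We now take the expectation of both terms in the lower bound.

\begin{align}
\mathbb{E}[\hat{c}^2] &= \mathbb{E}[\hat{c}]^2 + \mathrm{Var}(\hat{c}) \tag{31} \\
&= c^2 + \frac{c(1 - c)}{n} \tag{32}
\end{align}

where we made use of Lemma C.2 for equation 32.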
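Similarly,

\begin{align}
\mathbb{E}[\hat{\mu}_j^2 \mid \hat{p}_j] &= \mathbb{E}[\hat{\mu}_j]^2 + \mathrm{Var}(\hat{\mu}_j) \tag{33} \\
&= \mu_j^2 + \frac{\mu_j(1 - \mu_j)}{n_j} \tag{34}
\end{align}

When $n_j = 0$ (or equivalently $\hat{p}_j = 0$), which happens with probability $\nu_j = (1 - p_j)^n$, $\hat{\mu}_j$ as well as the right term in equation 34 are undefined. The problem disappears when multiplying by $\hat{p}_j$, and agreeing that $\hat{\mu}_j = 0$ whenever $n_j = 0$.

\begin{align}
\mathbb{E}\Bigg[\sum_{j=1}^{J} \hat{p}_j \hat{\mu}_j^2\Bigg] &= \sum_{j=1}^{J} \mathbb{E}\Big[\mathbb{E}\big[\hat{p}_j \hat{\mu}_j^2\, \mathbf{1}_{\hat{p}_j > 0} \mid \hat{p}_j\big]\Big] \tag{35} \\
&= \sum_{j=1}^{J} \mathbb{E}\Bigg[\hat{p}_j\, \mathbf{1}_{\hat{p}_j > 0} \Bigg(\mu_j^2 + \frac{\mu_j(1 - \mu_j)}{n_j}\Bigg)\Bigg] \tag{36} \\
&= \sum_{j=1}^{J} \Bigg(p_j \mu_j^2 + \mathbb{E}\big[\mathbf{1}_{\hat{p}_j > 0}\big] \frac{\mu_j(1 - \mu_j)}{n}\Bigg) \tag{37} \\
&= \sum_{j=1}^{J} \Bigg(p_j \mu_j^2 + (1 - \nu_j) \frac{\mu_j(1 - \mu_j)}{n}\Bigg) \tag{38}
\end{align}

Putting together equations 32 and 38, we get:

\begin{align}
\mathbb{E}\big[\widehat{GL}_{plugin}\big] &= \sum_{j=1}^{J} p_j \mu_j^2 - c^2 + \sum_{j=1}^{J} (1 - \nu_j) \frac{\mu_j(1 - \mu_j)}{n} - \frac{c(1 - c)}{n} \tag{39} \\
&= \underbrace{\sum_{j=1}^{J} p_j (\mu_j - c)^2}_{GL_{explained}} + \underbrace{\sum_{j=1}^{J} (1 - \nu_j) \frac{\mu_j(1 - \mu_j)}{n} - \frac{c(1 - c)}{n}}_{GL_{bias}} \tag{40}
\end{align}

In practice $\nu_j$, which gives the probability that no sample falls in component $j$, is very close to 0 unless $p_j$ and $n$ are very small. Hence, we will approximate $\nu_j \approx 0$. More importantly, the expression of the bias given in equation 40 depends on the oracle quantities $\mu_j$ and $c$, which are unavailable. Therefore, we resort to debiasing the plugin estimate of the lower bound using sample estimates of the bias, which gives:

\begin{equation}
\widehat{GL}^{(s,k)}_{explained} = \underbrace{\sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \Big(\hat{\mu}_j^{(s,k)} - \hat{c}^{(s,k)}\Big)^2}_{\text{plugin estimator } \widehat{GL}_{plugin}} - \sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}}\, \frac{\hat{\mu}_j^{(s,k)} \big(1 - \hat{\mu}_j^{(s,k)}\big)}{n_j^{(s,k)} - 1} + \frac{\hat{c}^{(s,k)} \big(1 - \hat{c}^{(s,k)}\big)}{n^{(s,k)} - 1} \tag{41}
\end{equation}

where we used a Bessel correction for the estimation of population variances. Finally, a debiased estimator of $GL_{explained}$ is obtained by summing over the debiased estimators for all $k \in \{1, \dots, K\}$ and all $s \in \mathcal{S}_k$.

Lemma C.2. Define $\hat{\mu}_j^{(s,k)}$ and $\hat{c}^{(s,k)}$ as in Proposition 4.3. Then:

\begin{equation}
\mathbb{E}\big[\hat{\mu}_j^{(s,k)}\big] = \mu_j^{(s,k)} \quad \text{and} \quad \mathrm{Var}\big[\hat{\mu}_j^{(s,k)}\big] = \frac{\mu_j^{(s,k)} \big(1 - \mu_j^{(s,k)}\big)}{n_j^{(s,k)}}. \tag{42}
\end{equation}

Similarly,

\begin{equation}
\mathbb{E}\big[\hat{c}^{(s,k)}\big] = c^{(s,k)} \quad \text{and} \quad \mathrm{Var}\big[\hat{c}^{(s,k)}\big] = \frac{c^{(s,k)} \big(1 - c^{(s,k)}\big)}{n^{(s,k)}}. \tag{43}
\end{equation}

The labels $Y_k^{(i)}$ are by definition drawn from a Bernoulli distribution with probability $P(Y_k^{(i)} \mid X^{(i)}) = Q_k^{(i)}$, i.e., for each sample $i$, the probability of the Bernoulli changes.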
This lemma shows that despite these varying Bernoulli probabilities, the empirical average of labels $Y_k$ over a given subspace has the same expectation and variance as a binomial variable that would be drawn with a probability equal to the expectation of $Q_k$ over this subspace.

Proof of Lemma C.2. Below we write the proof for the case of $\hat{\mu}_j^{(s,k)}$ (equation 42), as the one for $\hat{c}^{(s,k)}$ (equation 43) follows exactly the same lines. Let $I_j^{(s)} = \{i : X^{(i)} \in R_j^{(s)}\}$ be the subset of samples such that $X^{(i)}$ belongs to bin $R_j^{(s)}$.

\begin{align}
\mathbb{E}\big[\hat{\mu}_j^{(s,k)}\big] &= \frac{1}{n_j^{(s,k)}} \sum_{i \in I_j^{(s)}} \mathbb{E}\big[Y_k^{(i)} \mid S_k = s, R(X^{(i)}) = j\big] \tag{44} \\
&= \frac{1}{n_j^{(s,k)}} \sum_{i \in I_j^{(s)}} \mathbb{E}\big[\mathbb{E}[Y_k^{(i)} \mid X^{(i)}] \mid S_k = s, R(X^{(i)}) = j\big] \tag{45} \\
&= \frac{1}{n_j^{(s,k)}} \sum_{i \in I_j^{(s)}} \mathbb{E}\big[Q_k^{(i)} \mid S_k = s, R(X^{(i)}) = j\big] \tag{46} \\
&= \frac{1}{n_j^{(s,k)}} \sum_{i \in I_j^{(s)}} \mu_j^{(s,k)} \tag{47} \\
&= \mu_j^{(s,k)} \tag{48}
\end{align}

where we used the law of total expectation in eq 45, the definition of $Q_k$ in eq 46, and the definition of $\mu_j^{(s,k)}$ in eq 47.

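\begin{align}
\mathrm{Var}\big[\hat{\mu}_j^{(s,k)}\big] &= \mathbb{E}\Big[\big(\hat{\mu}_j^{(s,k)} - \mu_j^{(s,k)}\big)^2 \,\Big|\, S_k = s, R(X^{(i)}) = j\Big] \tag{49} \\
&= \mathbb{E}\Big[\big(\hat{\mu}_j^{(s,k)}\big)^2 \,\Big|\, S_k = s, R(X^{(i)}) = j\Big] - \big(\mu_j^{(s,k)}\big)^2 \tag{50} \\
&= \frac{1}{\big(n_j^{(s,k)}\big)^2}\, \mathbb{E}\Bigg[\sum_{i \in I_j^{(s)}} Y_k^{(i)} \sum_{l \in I_j^{(s)}} Y_k^{(l)} \,\Bigg|\, S_k = s, R(X^{(i)}) = j\Bigg] - \big(\mu_j^{(s,k)}\big)^2 \tag{51} \\
&= \frac{1}{\big(n_j^{(s,k)}\big)^2}\, \mathbb{E}\Bigg[\sum_{i \in I_j^{(s)}} Y_k^{(i)} + \sum_{\substack{i \neq l \\ i, l \in I_j^{(s)}}} Y_k^{(i)} Y_k^{(l)} \,\Bigg|\, S_k = s, R(X^{(i)}) = j\Bigg] - \big(\mu_j^{(s,k)}\big)^2 \tag{52} \\
&= \frac{1}{\big(n_j^{(s,k)}\big)^2} \Bigg(\sum_{i \in I_j^{(s)}} \mu_j^{(s,k)} + \sum_{\substack{i \neq l \\ i, l \in I_j^{(s)}}} \big(\mu_j^{(s,k)}\big)^2\Bigg) - \big(\mu_j^{(s,k)}\big)^2 \tag{53} \\
&= \frac{1}{\big(n_j^{(s,k)}\big)^2} \Big(n_j^{(s,k)} \mu_j^{(s,k)} + n_j^{(s,k)} \big(n_j^{(s,k)} - 1\big) \big(\mu_j^{(s,k)}\big)^2\Big) - \big(\mu_j^{(s,k)}\big)^2 \tag{54} \\
&= \frac{\mu_j^{(s,k)} \big(1 - \mu_j^{(s,k)}\big)}{n_j^{(s,k)}} \tag{55}
\end{align}

where we used the fact that $Y$ [...]

The bias of the plugin estimate of $GL^{(s,k)}_{explained}(S_B)$ is:

\begin{equation}
\sum_{j=1}^{J} \big(1 - \nu_j^{(s,k)}\big) \frac{\mu_j^{(s,k)} \big(1 - \mu_j^{(s,k)}\big)}{n^{(s,k)}} - \frac{c^{(s,k)} \big(1 - c^{(s,k)}\big)}{n^{(s,k)}} \tag{56}
\end{equation}

By convexity of the function $x \mapsto (x - \mathbb{E}[x])^2$, we have:

\begin{equation}
\Bigg(\sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \hat{\mu}_j^{(s,k)} - \mathbb{E}\Bigg[\sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \hat{\mu}_j^{(s,k)}\Bigg]\Bigg)^2 \le \sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \Big(\hat{\mu}_j^{(s,k)} - \mathbb{E}\big[\hat{\mu}_j^{(s,k)}\big]\Big)^2 \tag{57}
\end{equation}

Using the fact that $\hat{c}^{(s,k)} = \sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \hat{\mu}_j^{(s,k)}$, and taking the expectation of both sides, we get:

\begin{equation}
\mathrm{Var}\big(\hat{c}^{(s,k)}\big) \le \sum_{j=1}^{J} \frac{n_j^{(s,k)}}{n^{(s,k)}} \mathrm{Var}\big(\hat{\mu}_j^{(s,k)}\big) \tag{58}
\end{equation}

Finally, using Lemma C.2, we get:

\begin{equation}
\frac{c^{(s,k)} \big(1 - c^{(s,k)}\big)}{n^{(s,k)}} \le \sum_{j=1}^{J} \frac{\mu_j^{(s,k)} \big(1 - \mu_j^{(s,k)}\big)}{n^{(s,k)}} \tag{59}
\end{equation}

Hence, we have:

\begin{equation}
\mathrm{bias}\big(\hat{L}^{(s,k)}_{GL}\big) = \underbrace{\sum_{j=1}^{J} \frac{\mu_j^{(s,k)} \big(1 - \mu_j^{(s,k)}\big)}{n^{(s,k)}} - \frac{c^{(s,k)} \big(1 - c^{(s,k)}\big)}{n^{(s,k)}}}_{\ge 0} - \sum_{j=1}^{J} \nu_j^{(s,k)} \frac{\mu_j^{(s,k)} \big(1 - \mu_j^{(s,k)}\big)}{n^{(s,k)}} \tag{60}
\end{equation}

Because of the term involving $\nu_j^{(s,k)}$, this does not prove that the bias is always positive. However, in practice $\nu_j^{(s,k)} = \big(1 - p_j^{(s,k)}\big)^{n^{(s,k)}}$, which represents the probability that no point belongs to region $j$, is very close to 0 unless $p_j^{(s,k)}$ is very small or the total number of points $n^{(s,k)}$ is small. Hence, equality 60 shows that the bias can only be 'slightly' negative. In the simulations below, the upward bias of the plugin estimate appears clearly.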
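The upward bias of the plugin estimate and the effect of the correction of equation 41 can be sketched for a single level set and a single class. The function name `gl_explained_level_set` and the simulation settings (two fixed regions of 30 samples, identical success probability 0.7, 2000 repetitions) are assumptions made for illustration, not the paper's experimental protocol. Since both regions share the same $\mu_j$, the true $GL_{explained}$ in this level set is 0.

```python
import numpy as np

def gl_explained_level_set(y, region, debiased=True):
    """Estimate GL_explained in one level set from binary labels and region ids."""
    y = np.asarray(y, dtype=float)
    region = np.asarray(region)
    n = y.size
    c_hat = y.mean()                      # empirical average over the level set
    plugin, bias = 0.0, 0.0
    for j in np.unique(region):
        y_j = y[region == j]
        n_j, mu_j = y_j.size, y_j.mean()  # region size and region average
        plugin += n_j / n * (mu_j - c_hat) ** 2
        if n_j > 1:                       # Bessel-corrected variance of mu_j
            bias += n_j / n * mu_j * (1 - mu_j) / (n_j - 1)
    if n > 1:                             # Bessel-corrected variance of c_hat
        bias -= c_hat * (1 - c_hat) / (n - 1)
    return plugin - bias if debiased else plugin

# Two regions with the same success probability: the true GL_explained is 0.
rng = np.random.default_rng(0)
region = np.repeat([0, 1], 30)
plugin_values, debiased_values = [], []
for _ in range(2000):
    y = rng.binomial(1, 0.7, size=region.size)
    plugin_values.append(gl_explained_level_set(y, region, debiased=False))
    debiased_values.append(gl_explained_level_set(y, region, debiased=True))
plugin_mean = float(np.mean(plugin_values))
debiased_mean = float(np.mean(debiased_values))
print(plugin_mean, debiased_mean)
```

Averaged over repetitions, the plugin estimate should stay clearly above 0 while the debiased estimate should be close to 0. The full estimator then weights each level set and class by $n^{(s,k)}/n$.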

C.7 ESTIMATOR FOR THE INDUCED GROUPING LOSS

Proposition C.1 (Estimator for the induced grouping loss). Let $\hat{C}$ be an estimator of $C$. An estimator of $C_B$ is

$$\hat{C}_B(s) = \frac{1}{n^{(s)}} \sum_{i : S_B(X^{(i)}) = s} \hat{C}(S(X^{(i)}))$$

with $n^{(s)}$ the number of samples in the level set $s$. An estimator of the grouping loss induced by the binning of $S$ into $S_B$ is:

$$\widehat{GL}_{induced}(S, S_B) = \sum_{s \in \mathcal{S}} \frac{n^{(s)}}{n} \Bigg[\frac{1}{n^{(s)}} \sum_{i : S_B(X^{(i)}) = s} e\big(\hat{C}(S(X^{(i)}))\big) - e\big(\hat{C}_B(s)\big)\Bigg]$$
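The estimator of Proposition C.1 can be sketched as follows. Everything named here is an assumption for illustration: `gl_induced` is a hypothetical helper, `c_hat` stands for any fitted estimator of the calibration curve passed as a callable, the binning uses equal-width bins on $[0, 1]$, and the function `e` defaults to the square, which corresponds to a scalar (one-class) Brier setting.

```python
import numpy as np

def gl_induced(scores, c_hat, n_bins=15, e=np.square):
    """Plug-in estimate of the grouping loss induced by binning the scores."""
    scores = np.asarray(scores, dtype=float)
    c = np.asarray(c_hat(scores), dtype=float)   # estimated calibrated scores
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in np.unique(bins):
        c_b = c[bins == b]
        # Average of e(C) in the bin minus e of the bin average of C,
        # weighted by the fraction of samples in the bin.
        total += c_b.size / scores.size * (np.mean(e(c_b)) - e(np.mean(c_b)))
    return float(total)

# Calibrated classifier (C = S) with scores spread evenly over [0, 1].
scores = (np.arange(15_000) + 0.5) / 15_000
value = gl_induced(scores, c_hat=lambda s: s)
print(value, 1 / (12 * 15 ** 2))
```

For a calibrated classifier with scores uniform on 15 equal-width bins, the value should be close to $\tfrac{1}{12 N^2}$ with $N = 15$, which is the case discussed in Appendix C.8.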

C.8 ANALYSIS OF BINNING-INDUCED ERRORS FOR THE BRIER SCORE

It is well known that binning can induce errors in estimating the calibration loss, leading to underestimating it (Bröcker, 2012; Kumar et al., 2019; Roelofs et al., 2022). Proposition 4.1 shows that it also leads to errors on the grouping loss, overestimating it. Here we characterize the errors on the calibration and grouping losses for the Brier score and show that they partly compensate each other, so that the error on the sum of both can be bounded.

Proposition C.2.

$$\underbrace{\mathbb{E}\big[\|S_B - C_B\|^2\big]}_{CL(S_B)} = \underbrace{\mathbb{E}\big[\|S - C\|^2\big]}_{CL(S)} - \underbrace{\mathbb{E}\big[V_h[S - C \mid S_B]\big]}_{-CL_{induced}(S, S_B)}$$

The calibration loss induced by the binning, $CL_{induced}(S, S_B)$, is always negative. $CL(S_B)$ is thus biased downward, which is already known from Kumar et al. (2019) and Roelofs et al. (2022). Conversely, the grouping loss induced by the binning, $GL_{induced}(S, S_B)$, is always positive. $GL(S_B)$ is thus biased upward. The mere effect of binning artificially creates grouping loss and artificially reduces calibration error.

For calibrated continuous classifiers, $CL_{induced} = 0$ and the induced grouping loss is small: with $N$ equal-width bins, $GL_{induced} \le \frac{1}{4N^2}$. If in addition the scores are uniform on the bins, $GL_{induced} = \frac{1}{12N^2}$ (Lemma C.3). In general, both induced calibration and grouping losses can be large since $V[C \mid S_B]$ can be large. A high $GL_{induced}$ expresses strong miscalibrations within the bin. Interestingly, however, both induced losses compensate. In a binary setting, the sum of induced calibration and grouping losses is contained, as shown by Theorem C.1, and can be bounded by estimable quantities (Corollary C.1). While measuring $CL(S_B)$ and $GL(S_B)$ separately can lead to high binning-induced bias, measuring $CL(S_B) + GL(S_B)$ through $CL(S_B) + GL_{explained}(S_B)$ reduces binning-induced errors and gives a lower bound on $MSE(S, Q)$ (Corollary C.2).

Theorem C.1 (Bounds on induced calibration and grouping losses). In a binary setting, the calibration and grouping losses induced by the binning of classifier $S$ into $S_B$ sum to:

$$CL_{induced} + GL_{induced} = \mathbb{E}\big[2\,\mathrm{Cov}[S, C \mid S_B] - V[S \mid S_B]\big]$$

which is bounded by:

$$-\mathbb{E}\Big[\sqrt{V[S \mid S_B]} \Big(2\sqrt{V[C \mid S_B]} + \sqrt{V[S \mid S_B]}\Big)\Big] \le CL_{induced} + GL_{induced} \le \mathbb{E}\Big[\sqrt{V[S \mid S_B]} \Big(2\sqrt{V[C \mid S_B]} - \sqrt{V[S \mid S_B]}\Big)\Big]$$

Suppose that $[0, 1]$ is divided into $N$ equal-width bins. Then:

$$-\frac{1}{N} \mathbb{E}\Big[\sqrt{C_B(1 - C_B)}\Big] - \frac{1}{4N^2} \le CL_{induced} + GL_{induced} \le \frac{1}{N} \mathbb{E}\Big[\sqrt{C_B(1 - C_B)}\Big]$$

Corollary C.1.

$$-\mathbb{E}\Big[\sqrt{V[S \mid S_B]} \Big(2\sqrt{C_B(1 - C_B)} + \sqrt{V[S \mid S_B]}\Big)\Big] \le CL_{induced} + GL_{induced} \le \mathbb{E}\Big[\sqrt{V[S \mid S_B]} \Big(2\sqrt{C_B(1 - C_B)} - \sqrt{V[S \mid S_B]}\Big)\Big]$$

With $N$ equal-width bins:

$$-\frac{1}{N} \mathbb{E}\Big[\sqrt{C_B(1 - C_B)}\Big] - \frac{1}{4N^2} \le CL_{induced} + GL_{induced} \le \frac{1}{N} \mathbb{E}\Big[\sqrt{C_B(1 - C_B)}\Big]$$

Corollary C.2. The mean squared error (MSE) between continuous $S$ and $Q$ is lower bounded by:

\begin{align*}
MSE(S, Q) = CL + GL &\ge \ell_2\text{-}ECE_B + L_{GL_B} - \mathbb{E}\Big[\sqrt{V[S \mid S_B]} \Big(2\sqrt{V[C \mid S_B]} - \sqrt{V[S \mid S_B]}\Big)\Big] \\
&\ge \ell_2\text{-}ECE_B + L_{GL_B} - \mathbb{E}\Big[\sqrt{V[S \mid S_B]} \Big(2\sqrt{C_B(1 - C_B)} - \sqrt{V[S \mid S_B]}\Big)\Big]
\end{align*}

With $N$ equal-width bins:

$$MSE(S, Q) \ge \ell_2\text{-}ECE_B + L_{GL_B} - \frac{1}{N} \mathbb{E}\Big[\sqrt{C_B(1 - C_B)}\Big]$$

where $\ell_2\text{-}ECE_B$ is the $\ell_2$ Expected Calibration Error of the binned classifier $S_B$ and $L_{GL_B}$ is the grouping loss lower bound of $S_B$.
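The identity and bounds of Theorem C.1 can be checked numerically in a scalar binary setting. The sketch below rests on assumptions that are not from the paper: scores drawn uniformly on $[0, 1]$, a hypothetical calibration curve $C = S^2$, $N = 10$ equal-width bins, and within-bin moments computed empirically. The bounds are written with square roots of the conditional variances, as obtained from the Cauchy-Schwarz inequality on $\mathrm{Cov}[S, C \mid S_B]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_bins = 100_000, 10
s = rng.uniform(0.0, 1.0, n)   # continuous confidence scores S
c = s ** 2                     # hypothetical calibration curve C = E[Q | S]
bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
counts = np.bincount(bins, minlength=n_bins)

def cond_mean(v):
    """Empirical E[v | S_B], broadcast back to every sample."""
    return (np.bincount(bins, weights=v, minlength=n_bins) / counts)[bins]

s_b, c_b = cond_mean(s), cond_mean(c)    # binned scores S_B and C_B
var_s = cond_mean(s * s) - s_b ** 2      # V[S | S_B]
var_c = cond_mean(c * c) - c_b ** 2      # V[C | S_B]
cov_sc = cond_mean(s * c) - s_b * c_b    # Cov[S, C | S_B]

cl_induced = np.mean((s_b - c_b) ** 2) - np.mean((s - c) ** 2)  # CL(S_B) - CL(S)
gl_induced = np.mean(var_c)                                     # E[V[C | S_B]]
total = float(cl_induced + gl_induced)
identity_rhs = float(np.mean(2 * cov_sc - var_s))

sd_s, sd_c = np.sqrt(var_s), np.sqrt(var_c)
lower = float(-np.mean(sd_s * (2 * sd_c + sd_s)))
upper = float(np.mean(sd_s * (2 * sd_c - sd_s)))
simple_upper = float(np.mean(np.sqrt(c_b * (1 - c_b))) / n_bins)
print(total, identity_rhs, lower, upper, simple_upper)
```

In this sketch the induced calibration loss is negative and the induced grouping loss is positive, and their sum stays well inside the bound that only involves $C_B$ and $N$.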

PROOFS

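Proof of Proposition C.2. Let $h$ be the negative entropy of the Brier scoring rule.

\begin{align*}
\|S_B - C_B\|^2 &= \|\mathbb{E}[S \mid S_B] - \mathbb{E}[C \mid S_B]\|^2 && S_B = \mathbb{E}[S \mid S_B],\ C_B = \mathbb{E}[C \mid S_B] \\
&= \|\mathbb{E}[S - C \mid S_B]\|^2 && \text{Linearity of expectation} \\
&= \mathbb{E}\big[\|S - C\|^2 \mid S_B\big] - V_h[S - C \mid S_B] && \text{Definition of } V_h[S - C \mid S_B] \\
\mathbb{E}\big[\|S_B - C_B\|^2\big] &= \mathbb{E}\big[\|S - C\|^2\big] - \mathbb{E}\big[V_h[S - C \mid S_B]\big] && \text{Law of total expectation}
\end{align*}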

Brier score. Given any two probability vectors $P$ and $Q$, the divergence associated to the Brier score reads:

\begin{equation}
d(P, Q) = \sum_{k=1}^{K} (P_k - Q_k)^2 \tag{65}
\end{equation}

For all $k \in \{1, \dots, K\}$, let $d_k : P_k, Q_k \mapsto (P_k - Q_k)^2$.

\begin{align}
d_k(S_k, Y_k) &= (S_k - Y_k)^2 \tag{66} \\
&= (S_k - C_k + C_k - Q_k + Q_k - Y_k)^2 \tag{67} \\
&= (S_k - C_k)^2 + (C_k - Q_k)^2 + (Q_k - Y_k)^2 + 2(S_k - C_k)(C_k - Q_k) \nonumber \\
&\quad + 2(S_k - C_k)(Q_k - Y_k) + 2(C_k - Q_k)(Q_k - Y_k) \tag{68}
\end{align}

Taking the expectation on both sides conditional on $X$:

\begin{equation}
\mathbb{E}[d_k(S_k, Y_k) \mid X] = (S_k - C_k)^2 + (C_k - Q_k)^2 + \mathbb{E}\big[(Q_k - Y_k)^2 \mid X\big] + 2(S_k - C_k)(C_k - Q_k) \tag{69}
\end{equation}

since $S_k$ and $Q_k$ are functions of $X$, $C_k$ is a function of $S_k$ and thus of $X$, and $\mathbb{E}[Y_k \mid X] = Q_k$. Then taking the expectation conditional on $S_k$:

\begin{equation}
\mathbb{E}[d_k(S_k, Y_k) \mid S_k] = (S_k - C_k)^2 + \mathbb{E}\big[(C_k - Q_k)^2 \mid S_k\big] + \mathbb{E}\big[(Q_k - Y_k)^2 \mid S_k\big] \tag{70}
\end{equation}

where we use the fact that $C_k$ is a function of $S_k$, that $\mathbb{E}[Q_k \mid S_k] = C_k$, and the property according to which for two random variables $U$ and $V$ and a function $h$, $\mathbb{E}[\mathbb{E}[V \mid U] \mid h(U)] = \mathbb{E}[V \mid h(U)]$. Finally, taking the expectation over $S_k$ we get:

\begin{equation}
\mathbb{E}[d_k(S_k, Y_k)] = \mathbb{E}\big[(S_k - C_k)^2\big] + \mathbb{E}\big[(C_k - Q_k)^2\big] + \mathbb{E}\big[(Q_k - Y_k)^2\big] \tag{71}
\end{equation}

The desired decomposition is then obtained by summing over the $K$ classes on both sides.

Log-loss. Given any two probability vectors $P$ and $Q$, the divergence associated to the log-loss reads:

\begin{equation}
d(P, Q) = \sum_{k=1}^{K} Q_k \log \frac{Q_k}{P_k} \tag{72}
\end{equation}

For all $k \in \{1, \dots, K\}$, let $d_k : P_k, Q_k \mapsto Q_k \log \frac{Q_k}{P_k}$.

\begin{align}
d_k(S_k, Y_k) &= Y_k \log \frac{Y_k}{S_k} \tag{73} \\
&= Y_k \log \frac{Y_k}{Q_k} + Y_k \log \frac{Q_k}{C_k} + Y_k \log \frac{C_k}{S_k} \tag{74} \\
\mathbb{E}[d_k(S_k, Y_k) \mid X] &= \mathbb{E}\Big[Y_k \log \frac{Y_k}{Q_k} \,\Big|\, X\Big] + Q_k \log \frac{Q_k}{C_k} + Q_k \log \frac{C_k}{S_k} \tag{75} \\
\mathbb{E}[d_k(S_k, Y_k) \mid S_k] &= \mathbb{E}\Big[Y_k \log \frac{Y_k}{Q_k} \,\Big|\, S_k\Big] + \mathbb{E}\Big[Q_k \log \frac{Q_k}{C_k} \,\Big|\, S_k\Big] + C_k \log \frac{C_k}{S_k} \tag{76} \\
&= \mathbb{E}[d_k(Q_k, Y_k) \mid S_k] + \mathbb{E}[d_k(C_k, Q_k) \mid S_k] + d_k(S_k, C_k) \tag{77}
\end{align}

where we have used the same properties as those described for the proof of the Brier score classwise decomposition above.
The desired decomposition is then obtained by taking the expectation over $S_k$ and summing over the $K$ classes.

The proper scoring rule decomposition holds for top-label calibration. Unlike classwise calibration, top-label calibration does not define a vector $C \in \mathbb{R}^K$ of calibrated probabilities. Instead, it defines a notion of calibration for a simpler binary problem in which labels indicate whether the classifier predicts the correct class for a given $X$. More precisely, the labels for this binary [...]

Moreover, if $h_k$ is convex, then:

\begin{align}
GL(S) &\ge GL_{explained}(S) \ge 0 \tag{90} \\
GL_{induced}(S, S_B) &\ge 0 \tag{91} \\
GL(S) &\ge \underbrace{GL_{explained}(S_B) - GL_{induced}(S, S_B)}_{GL_{LB}(S, S_B)} \tag{92}
\end{align}

Proof of Theorem C.2. Applying the law of total variance (Lemma C.1) on each of the $V_{h_k}[Q_k \mid S_k]$ with $R_k$ as conditioning variable proves Equation 87. Similarly, applying the law of total variance on each of the $V_{h_k}[Q_k \mid S_{B_k}]$ with $S_k$ as conditioning variable proves Equation 88. The proof for Equation 89 is the same as for Proposition 4.2. Using Jensen's inequality, if $h_k$ is convex, then $V_{h_k} \ge 0$, which proves Equations 90, 91 and 92.

For the log-loss scoring rule, we have $\phi_{LL}(p, e_k) = -\log(p_k)$ and $h_k(p) = p_k \log(p_k)$, which is convex. Thus, Theorem C.2 holds for the log-loss. Unfortunately the Brier score does not satisfy the assumptions of Lemma C.4 since $\phi_{BS}(p, e_k)$ is not a function of $p_k$.
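But a formulation similar to Equation 78 holds for the Brier score:

\begin{align}
\mathbb{E}\big[d_{\phi_{BS}}(C, Q)\big] &= \mathbb{E}\big[s_{\phi_{BS}}(C, Q) - s_{\phi_{BS}}(Q, Q)\big] && \text{Definition of } d_{\phi_{BS}} \tag{93} \\
&= \mathbb{E}[(C - Q) \cdot (C - Q)] && \text{Definition of } s_{\phi_{BS}} \tag{94} \\
&= \mathbb{E}[C \cdot C - 2\, C \cdot Q + Q \cdot Q] \tag{95} \\
&= \mathbb{E}[Q \cdot Q - C \cdot C] && \mathbb{E}[C \cdot Q] = \mathbb{E}[C \cdot C] \tag{96} \\
&= \sum_{k=1}^{K} \mathbb{E}\big[Q_k^2 - C_k^2\big] && \text{Linearity of expectation} \tag{97} \\
&= \sum_{k=1}^{K} \mathbb{E}\big[\mathbb{E}[Q_k^2 \mid S_k] - C_k^2\big] && \text{Law of total expectation} \tag{98} \\
&= \sum_{k=1}^{K} \mathbb{E}\big[V[Q_k \mid S_k]\big] && \text{Definition of the variance} \tag{99}
\end{align}

with:

\begin{align}
\mathbb{E}[C \cdot Q] &= \sum_{k=1}^{K} \mathbb{E}[C_k Q_k] \tag{101} \\
&= \sum_{k=1}^{K} \mathbb{E}\big[\mathbb{E}[C_k Q_k \mid S_k]\big] && \text{Law of total expectation} \tag{102} \\
&= \sum_{k=1}^{K} \mathbb{E}\big[C_k\, \mathbb{E}[Q_k \mid S_k]\big] && C_k \text{ is a function of } S_k \tag{103} \\
&= \sum_{k=1}^{K} \mathbb{E}\big[C_k^2\big] && \text{Definition of } C_k \tag{104} \\
&= \mathbb{E}[C \cdot C] \nonumber
\end{align}

Since $V = V_f$ with $f : x \mapsto x^2$, Equation 78 is satisfied for the Brier score. Since $f$ is convex, Theorem C.2 holds for the Brier score. To conclude, Theorem C.2 holds for the Brier score and the log-loss in a classwise setting. It is likely that some other proper scoring rules satisfy Equation 78 and Theorem C.2.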
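The classwise Brier decomposition and the expression of the grouping loss as an expected conditional variance can be verified exactly on a finite population, for a single class $k$. The sketch below makes assumptions that are not from the paper: a population of 2000 points with uniform weights, hypothetical posteriors `q` drawn uniformly, and scores `s` rounded to one decimal so that level sets are well defined. Expectations over $Y$ are computed in closed form from $Y \mid X \sim \mathrm{Bernoulli}(Q)$ rather than by sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2_000
q = rng.uniform(0.0, 1.0, m)   # true posterior Q(x) of one class on a finite population
# Hypothetical confidence scores taking 11 discrete levels.
s = np.round(np.clip(q + rng.normal(0.0, 0.2, m), 0.0, 1.0), 1)

# C = E[Q | S]: average of Q over each level set of S.
_, level = np.unique(s, return_inverse=True)
c = (np.bincount(level, weights=q) / np.bincount(level))[level]

# Expected Brier score: E[(S - Y)^2 | X] = Q (1 - S)^2 + (1 - Q) S^2.
brier = float(np.mean(q * (1 - s) ** 2 + (1 - q) * s ** 2))
calibration_loss = float(np.mean((s - c) ** 2))   # E[(S - C)^2]
grouping_loss = float(np.mean((c - q) ** 2))      # E[(C - Q)^2]
irreducible_loss = float(np.mean(q * (1 - q)))    # E[(Q - Y)^2]

# Grouping loss as an expected conditional variance: E[V[Q | S]] = E[Q^2] - E[C^2].
grouping_as_variance = float(np.mean(q ** 2) - np.mean(c ** 2))
print(brier, calibration_loss + grouping_loss + irreducible_loss)
print(grouping_loss, grouping_as_variance)
```

Both printed pairs should agree up to floating-point error, which mirrors equation 71 and the chain of equalities ending in equation 99.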

C.10 IMPACT OF RECALIBRATION ON THE GROUPING LOSS

Lemma C.5. Let $\hat{c}$ be a recalibration mapping and $S' = \hat{c}(S)$ the classifier recalibrated with that mapping. The grouping loss of the recalibrated classifier $GL(S')$ deviates from that of the original classifier $GL(S)$ as follows:

$$GL(S') = GL(S) + \mathbb{E}[V_h[C \mid S']]$$

If the mapping is perfect (i.e. $S' = C$) or invertible, then $GL(S') = GL(S)$.

Proof of Lemma C.5.

\begin{align*}
GL(S') &= \mathbb{E}[V_h[Q \mid S']] && \text{Definition of } GL \\
&= \mathbb{E}[V_h[Q \mid S', S]] + \mathbb{E}\big[V_h[\mathbb{E}[Q \mid S', S] \mid S']\big] && \text{Law of total h-variance on } S \text{ (Lemma C.1)} \\
&= \mathbb{E}[V_h[Q \mid S]] + \mathbb{E}\big[V_h[\mathbb{E}[Q \mid S] \mid S']\big] && S' \text{ is a function of } S \\
&= GL(S) + \mathbb{E}[V_h[C \mid S']] && \text{Definition of } GL(S) \text{ and } C
\end{align*}

If $S' = C$, then $V_h[C \mid S'] = 0$, hence $GL(S') = GL(S)$. If the mapping $\hat{c}$ is invertible, then knowing $S'$ is knowing $S$. Hence $V_h[C \mid S'] = V_h[C \mid S] = 0$ since $C$ is a function of $S$. Hence $GL(S') = GL(S)$.
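Lemma C.5 can be illustrated on a finite population with $h : x \mapsto x^2$ (one class, Brier score). The sketch below uses hypothetical data and helper names (`cond_mean`, `grouping_loss`), which are assumptions for illustration only. It compares the grouping loss before recalibration, after a perfect recalibration ($S' = C$), after an invertible mapping, and after a non-invertible mapping that merges several score levels.

```python
import numpy as np

def cond_mean(values, given):
    """Empirical E[values | given] for a discrete conditioning variable."""
    _, level = np.unique(given, return_inverse=True)
    return (np.bincount(level, weights=values) / np.bincount(level))[level]

def grouping_loss(q, s):
    """GL(S) = E[V[Q | S]] for h: x -> x^2 (Brier score, one class)."""
    return float(np.mean((q - cond_mean(q, s)) ** 2))

rng = np.random.default_rng(0)
m = 5_000
s = rng.integers(1, 10, m) / 10.0                          # scores in {0.1, ..., 0.9}
q = np.clip(s ** 2 + rng.normal(0.0, 0.15, m), 0.0, 1.0)   # hypothetical true posteriors
c = cond_mean(q, s)                                        # calibrated scores C = E[Q | S]

gl_original = grouping_loss(q, s)
gl_perfect = grouping_loss(q, c)            # perfect recalibration, S' = C
gl_invertible = grouping_loss(q, s ** 3)    # invertible recalibration mapping
s_merged = np.minimum(s, 0.5)               # non-invertible: merges levels 0.5 to 0.9
gl_merged = grouping_loss(q, s_merged)
extra = float(np.mean((c - cond_mean(c, s_merged)) ** 2))  # E[V[C | S']]
print(gl_original, gl_perfect, gl_invertible, gl_merged, gl_original + extra)
```

The first three values should coincide, and the grouping loss after the merging map should equal the original one plus the term $\mathbb{E}[V_h[C \mid S']]$.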

D IMAGENET

ImageNet-1K (ILSVRC2012) (Deng et al., 2009) is a classification dataset for computer vision with 1 000 classes. The networks studied in this article are pre-trained on the training set of ImageNet-1K, comprising 1.2 million samples. Models' architectures and weights are available through PyTorch's torchvision v0.12 (Paszke et al., 2019). We evaluated the networks on the ImageNet variants ImageNet-R and ImageNet-C (Appendix D.1, D.2) as well as on the validation set of ImageNet-1K (Appendix D.3). We work in the high-level feature space of the networks, i.e. the output space of the penultimate layer (embedding space). For each of ImageNet-R, ImageNet-C and the validation set of ImageNet, we plot the grouping diagrams of each network with and without post-hoc recalibration, obtained with a balanced decision stump. For each network, if several versions are available, we study both the smallest one and the best performing one on the validation set of ImageNet-1K (usually the largest version). For ImageNet-R we also provide the grouping diagrams obtained with a 2-cluster k-means. Each experiment is detailed in Appendix D.1, D.2 and D.3.

Detailed experimental method. First, we forward each sample of the evaluation dataset (ImageNet-R, ImageNet-C or the validation set of ImageNet-1K) through the studied network. We build confidence scores by applying a softmax to the output logits. We extract a representation of the input images in the high-level feature space of the network (i.e. the input space of the last linear layer). Since there are not enough samples per class (50), we restrict our study to the top-label problem (Definition 3.3). For each sample, the class with the highest confidence is predicted. The label is 1 if the network predicted the correct class (0 otherwise) and the associated confidence score is that of the predicted class. We divide the samples of the evaluation set in half, making sure that the confidence score distribution is the same in both resulting subsets. On one set, we train the isotonic regression for calibration, and we calibrate the confidence scores of both sets. If no post-hoc recalibration is used, we skip this step. Then, we create groups of same-level confidences by binning the confidence scores with 15 equal-width bins in [0, 1]. We partition each of the 15 level sets independently. For each of them, we create the partition by training the partitioning method on the training samples of the isotonic regression. We then evaluate region scores on the remaining samples to avoid overfitting.

For the grouping diagrams, we mainly use a balanced decision stump with 2 clusters (e.g. using scikit-learn's DecisionTreeRegressor with min_samples_leaf taken as half the samples in the bin), resulting in one split along one of the axes of the high-level feature space. For comparison, we also used k-means with 2 clusters. Constraining the partitioning methods to 2 regions is a choice made to provide visually informative grouping diagrams rather than to maximize the lower bound $GL_{explained}$. When optimizing the lower bound (Figure 7 and Figure 14), we increase the number of allowed regions in the partition by setting a region ratio: the number of training samples in the bin over the number of allowed regions in the bin. Fixing a region ratio prevents regions from having too few samples. In our experiments, we fix the region ratio to 30.
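The procedure above can be sketched with scikit-learn on synthetic data. This is a simplified illustration, not the experimental code: the function name `region_scores`, the synthetic embeddings, confidences and labels are hypothetical, and the split into two halves is a plain random split rather than one that matches the confidence score distributions. It uses `IsotonicRegression` for recalibration and `DecisionTreeRegressor` with `max_depth=1` and `min_samples_leaf` set to half the bin's training samples as a balanced decision stump.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.tree import DecisionTreeRegressor

def region_scores(conf, correct, emb, n_bins=15, recalibrate=True, seed=0):
    """Per-bin region scores: {bin: [(fraction correct, n samples), ...]}."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(conf.size)
    fit, ev = idx[: conf.size // 2], idx[conf.size // 2:]   # random halves (simplified)
    if recalibrate:
        # Isotonic regression trained on one half, applied to both halves.
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(conf[fit], correct[fit])
        conf = iso.predict(conf)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    diagram = {}
    for b in range(n_bins):
        f, e = fit[bins[fit] == b], ev[bins[ev] == b]
        if f.size < 4 or e.size == 0:
            continue
        # Balanced decision stump: one split, each leaf holds half the training samples.
        stump = DecisionTreeRegressor(max_depth=1, min_samples_leaf=f.size // 2)
        stump.fit(emb[f], correct[f])
        leaves = stump.apply(emb[e])   # region of each held-out sample
        diagram[b] = [(float(correct[e][leaves == l].mean()), int((leaves == l).sum()))
                      for l in np.unique(leaves)]
    return diagram

# Synthetic top-label data: the confidence depends on the first embedding axis,
# the true probability of being correct also depends on the second one.
rng = np.random.default_rng(0)
n = 20_000
emb = rng.standard_normal((n, 2))
conf = 1.0 / (1.0 + np.exp(-emb[:, 0]))
q = conf + 0.6 * np.sign(emb[:, 1]) * np.minimum(conf, 1 - conf)
correct = rng.binomial(1, q).astype(float)
diagram = region_scores(conf, correct, emb)
for b, regions in sorted(diagram.items()):
    print(b, regions)
```

Each entry of `diagram` gives, for one confidence bin, the fraction of correct predictions in each region evaluated on held-out samples. A large gap between the two regions of a bin indicates heterogeneity that the calibration curve alone does not reveal.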

D.1 IMAGENET-R

ImageNet-R (Hendrycks et al., 2021) is a variant of ImageNet containing renditions of the ImageNet classes. Examples of renditions are: paintings, toys, tattoos and origami. There are 15 rendition types in total, listed in Figure 13. The dataset contains 30 000 images and is limited to 200 of the 1 000 ImageNet classes. Figure 14a compares the estimated grouping loss lower bound and calibration errors of all networks (small and best versions) on ImageNet-R. Overall, we observe a strong grouping loss [...]

We also investigate whether there is heterogeneity among renditions. In Figure 13 we observe that some renditions are better predicted than average (e.g. deviant art, sketch or art) while others are predicted worse than average (e.g. embroidery, cartoon, tattoo). These considerations would be useful in a fairness setting. Also, Figure 13 highlights that if we could build regions out of renditions (i.e. if renditions were well separated in the feature space), this would result in a high grouping loss lower bound.

D.2 IMAGENET-C

ImageNet-C is a variant of ImageNet containing corrupted versions of ImageNet images. Examples of corruptions are: blur, noise, saturate, contrast, brightness and compression. There are 19 corruption types in total. Each corruption has a severity ranging from 1 to 5. The dataset contains the 50 000 images of the validation set of ImageNet, each of them corrupted with the 19 corruption types at 5 severity levels each. We built a merged version of ImageNet-C by randomly sampling one corruption for each image. We also study one corruption only (snow). For both the merged version and the snow version, we study the maximum severity of the corruption (5). Figures 14b and 14c compare the estimated grouping loss lower bound and the calibration errors of all networks (small and best versions) on ImageNet-C merged and snow. Overall, we observe effects similar to those on ImageNet-R. However, when all samples have the same corruption (snow), the networks exhibit more grouping loss than when the 19 corruptions are randomly applied on the dataset (merged) (Figure 14c). An intuition is that the heterogeneity created by one corruption is canceled out by another one having heterogeneity in the opposite direction, leading to region scores closer to the average. Grouping diagrams of all networks are available at: 

D.3 IMAGENET-1K VALIDATION SET

The validation set of ImageNet-1K comprises 50 000 samples for 1 000 classes. Figure 14d compares the estimated grouping loss lower bound and the calibration errors of all networks (small and best versions) on the validation set of ImageNet-1K. In contrast to ImageNet-R and ImageNet-C, we cannot exhibit substantial grouping loss on any of the networks. The grouping diagrams (Figure 24) show, however, that ConvNeXt Tiny displays more heterogeneity than the other networks on this dataset. Grouping diagrams are available in: • Section D. 

E NLP

We use BART Large (Lewis et al., 2019) pre-trained on the Multi-Genre Natural Language Inference dataset (Williams et al., 2018) and fine-tuned on the Yahoo Answers Topics dataset for zero-shot topic classification. The fine-tuned model is available on HuggingFace at https://huggingface.co/joeddav/bart-large-mnli-yahoo-answers. Yahoo Answers Topics is composed of question titles and bodies and topic labels. There are 1 400 000 training samples, 60 000 test samples and 10 topics. The dataset is available at https://huggingface.co/datasets/yahoo_answers_topics. The model is fine-tuned on 5 out of the 10 topics of the training set, totaling 700 000 samples. Given a question title and a hypothesis (e.g. "This text is about Science & Mathematics"), the model outputs its confidence that the hypothesis is true for the given question. The classification being zero-shot, the hypothesis can be about an unseen topic. We evaluate the model separately on the 5 unseen topics and the 5 seen topics of the test set (i.e. seen topics but unseen samples). This results in a binary classification task in which each sample is composed of a question title and a hypothesis, and each label is 1 or 0 depending on whether the hypothesis is correct or not. As for the clustering and calibration procedure, we used a balanced decision stump in the same way as described in Section D: "Detailed experimental method". We work in the high-level feature space of the network, i.e. the output space of the penultimate layer (embedding space).

Figure 27: Vision: Fraction of correct predictions versus confidence score of predicted class (max_k S_k) on ImageNet-1K (validation set) for best versions of pre-trained networks, with isotonic recalibration. In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.
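The binary task described in Section E (one sample per question title and hypothesis, labeled 1 when the hypothesis is correct) can be sketched as below. The hypothesis template follows the example quoted in the text. Pairing every question with every candidate topic, the function name `make_binary_samples` and the toy questions are assumptions made for illustration; the text does not specify how the pairs were enumerated.

```python
# Template taken from the example hypothesis quoted in the text.
TEMPLATE = "This text is about {}"


def make_binary_samples(questions, candidate_topics):
    """Turn (title, true_topic) pairs into (title, hypothesis, label) triples.

    Assumption: each question is paired with every candidate topic, and the
    label is 1 only when the hypothesis names the question's true topic.
    """
    samples = []
    for title, true_topic in questions:
        for topic in candidate_topics:
            hypothesis = TEMPLATE.format(topic)
            samples.append((title, hypothesis, int(topic == true_topic)))
    return samples


# Hypothetical toy questions and topics, for illustration only.
toy_questions = [
    ("Why is the sky blue?", "Science & Mathematics"),
    ("Who won the cup final?", "Sports"),
]
toy_topics = ["Science & Mathematics", "Sports"]
toy_samples = make_binary_samples(toy_questions, toy_topics)
```

Restricting `candidate_topics` to the seen or to the unseen topics would give the in-distribution and out-of-distribution evaluations, respectively.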



True posterior probabilities: Q := P(Y = 1 | X).
Confidence scores: S := f(X), the score output by a classifier.
Calibrated scores: C := P(Y = 1 | S) = E[Q | S], the average true posterior probability for a score S.

Figure 2: Intuition. In the feature space X, the level set of confidence S = 0.7 displays E[Q | S] = 0.7, which we expect from a calibrated classifier. However, a partition of the level set into 2 regions R_1 and R_2 reveals that E[Q | S, R_1] = 0.6 while E[Q | S, R_2] = 0.8, suggesting a high grouping loss. Intra-region variances V_h[Q | S, R_1] and V_h[Q | S, R_2] remain uncaptured.
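As a worked illustration of the numbers in Figure 2, the between-region variance on the level set S = 0.7 can be computed by hand. The sketch assumes the squared-error (Brier) form on the positive class, without any multiclass normalization constant. The region weights are not stated in the caption; they follow from the three conditional means.

```latex
% Region weights, from the law of total expectation on the level set:
0.7 = 0.6\,p + 0.8\,(1 - p) \;\Longrightarrow\; p = P(R_1 \mid S) = 0.5
% Between-region (explained) variance on the level set:
\sum_{j=1}^{2} P(R_j \mid S)\,\bigl(E[Q \mid S, R_j] - E[Q \mid S]\bigr)^2
  = 0.5\,(0.6 - 0.7)^2 + 0.5\,(0.8 - 0.7)^2 = 0.01
```

This value only bounds the grouping loss on the level set from below, since the intra-region variances are not captured by the partition.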

Figure 3: Binning inflates the grouping loss.

Figure 4: Simulation: estimating the grouping loss lower bound GL_LB on a simulated problem (Appendix B.2, Figure 11). Right has a fixed ratio (# samples) / (# regions) = 30 per bin. Bins are equal-width. Averaged curves are plotted with a ±1 standard deviation envelope.

Figure 5: Grouping diagram. Calibration curve of a binned binary classifier augmented with the estimated region scores μ_j for a partitioning of each level set into 2 regions. Region sample sizes are plotted as a gradient color. The classifier is binned into 10 equal-width bins whose sample sizes n^(s) are given as a histogram. A Clopper-Pearson 95% confidence interval is plotted on the region scores. Regions for which the calibrated score ĉ(s) lies within this interval are grayed out.
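The graying-out rule in the caption of Figure 5 can be sketched as follows. The Clopper-Pearson interval is computed from Beta quantiles, which is the standard construction. The function names `clopper_pearson` and `is_grayed_out` and the example counts are hypothetical, and the sketch is not taken from the authors' code.

```python
from scipy.stats import beta


def clopper_pearson(k, n, level=0.95):
    """Exact two-sided binomial confidence interval for k successes out of n."""
    alpha = 1.0 - level
    lower = 0.0 if k == 0 else float(beta.ppf(alpha / 2, k, n - k + 1))
    upper = 1.0 if k == n else float(beta.ppf(1 - alpha / 2, k + 1, n - k))
    return lower, upper


def is_grayed_out(c_hat, k, n, level=0.95):
    """True when the calibrated score of the bin lies inside the region's interval,
    i.e. the region score is not distinguishable from the bin average."""
    lower, upper = clopper_pearson(k, n, level)
    return lower <= c_hat <= upper


# Hypothetical region: 45 positives out of 50 held-out samples.
low, high = clopper_pearson(45, 50)
```

With these counts, a region whose bin has calibrated score 0.9 would be grayed out, whereas a bin score of 0.5 would fall outside the interval and the region would stay highlighted.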

Figure 7 shows that the grouping loss varies across architectures, even with comparable accuracy. For example, ViT has a slightly better accuracy than ConvNeXt, but a lower estimated grouping loss. Post-hoc recalibration does not affect the grouping loss (Figure 6 right and Figure 7 right), leading to the same conclusions (see Appendix C.10 for the analytical impact of recalibration on the grouping loss). We observe the same effects on ImageNet-C (Appendix D.2), but little or none on ImageNet-1K (Appendix D.3). This suggests that stronger grouping loss arises in out-of-distribution settings. Visual inspection of the images suggests that the partitions capture heterogeneity coming from how realistic an image is, or from the different rendition types (Appendix D.1).

Figure 8: NLP: Fraction of positives versus confidence score of the positive class of fine-tuned BART for zero-shot classification on the test set of Yahoo Answers Topics without post-hoc recalibration (a. and b.) and with isotonic recalibration (c. and d.). The test set is either restricted to the 5 topics on which the network was trained (in-distribution) or to 5 unseen topics (out-of-distribution). In each level set, clusters are built with a balanced decision stump and a 50-50 train-test split strategy.

Figure 11: Calibrated but not accurate. Example of a calibrated classifier S constructed following the procedure described in Appendix B.2. Its accuracy is not optimal as Q and S are not on the same side of the decision threshold (1/2) wherever Q ≠ 1/2. Refer to Figure 12 for an example with optimal accuracy. Calibration curves (in black on last column) are obtained from 1 million samples.

Figure 12: Calibrated and optimal accuracy. Example of a calibrated classifier S constructed following the procedure described in Appendix B.2. Its accuracy is optimal as Q and S are on the same side of the decision threshold wherever Q ≠ 1/2. However, confidence scores S are almost everywhere different from the true posterior probabilities Q. Calibration curves (in black on last column) are obtained from 1 million samples.

total expectation
GL(S_B) = GL(S) + GL_induced(S, S_B)    (Lemma 4.1 and definition of GL_induced)

Remark: this proposition does not require S_B to be the average scores on the bins E[S | S ∈ B_j].

C.4 PROPOSITION 4.2: EXPLAINED GROUPING LOSS ACCOUNTING FOR BINNING

Proposition 4.2 (Explained grouping loss accounting for binning).
GL(S) = GL_explained(S_B) - GL_induced(S, S_B) + GL_residual(S_B)    (12)
If the scoring rule is proper, then:
GL(S) ≥ GL_explained(S_B) - GL_induced(S, S_B)

Proof.
GL(S) = GL(S_B) - GL_induced(S, S_B)    (Proposition 4.1)
      = GL_explained(S_B) + GL_residual(S_B) - GL_induced(S, S_B)    (Theorem 4.1 on GL(S_B))
For proper scoring rules, Theorem 4.1 gives GL_residual(S_B) ≥ 0, which completes the proof.

C.5 PROPOSITION 4.3: DEBIASED ESTIMATOR FOR THE BRIER SCORE

Proposition 4.3 (Debiased estimator for the Brier score). For all classes k ∈ {1, ..., K} and bins s ∈ S, let n^(s,k) (resp. n_j^(s,k)

are independent when i ≠ l in Eq. 52.

C.6 THE PLUGIN ESTIMATOR FOR THE GROUPING LOSS LOWER BOUND IS BIASED UPWARDS

Analytical evaluation of the sign of the bias

Let k ∈ {1, ..., K} and s ∈ S. The bias of the plugin estimate GL_explained^(s,k)(S_B) is given by (40): bias GL (s,k)

Proposition C.2 gives the deviation term induced by the binning for the calibration loss with the Brier scoring rule.

Proposition C.2 (Calibration loss decomposition). Let h be the negative entropy of the Brier scoring rule and C = E[Q | S]. The binned calibration loss CL(S_B) deviates from the calibration loss CL(S) by a negative induced calibration loss CL_induced(S, S_B):

Figure 13: Comparison of renditions on ImageNet-R. Differences between the calibrated scores of samples of one rendition C_rendition and the calibrated scores of all samples C_all, weighted by the number of samples of this rendition in the level set and summed over the 15 bins on confidence scores. Interpretation: should renditions define regions of the feature space, they would exhibit a high grouping loss lower bound.
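The quantity plotted in Figure 13 can be sketched as below, under stated assumptions. The calibrated scores in a bin are approximated by the empirical fraction of correct predictions. The weighted sum is normalized by the total number of samples of the rendition, which the caption does not specify. The function name `rendition_gap` and the toy arrays are hypothetical.

```python
import numpy as np


def rendition_gap(bin_ids, correct, is_rendition, n_bins=15):
    """Weighted sum over bins of (C_rendition - C_all), where each calibrated
    score is estimated by the fraction of correct predictions in the bin.

    Assumption: the sum is divided by the rendition's total sample count.
    """
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        in_both = in_bin & is_rendition
        n_r = int(in_both.sum())
        if n_r == 0:
            continue  # the rendition is absent from this level set
        total += n_r * (correct[in_both].mean() - correct[in_bin].mean())
    n = int(is_rendition.sum())
    return total / n if n else 0.0


# Hypothetical toy example with two occupied bins.
toy_bins = np.array([0, 0, 0, 0, 1, 1])
toy_correct = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
toy_mask = np.array([True, True, False, False, False, True])
gap = rendition_gap(toy_bins, toy_correct, toy_mask)
```

A positive value indicates a rendition predicted better than the average sample with the same confidence, and a negative value a rendition predicted worse.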

• Section D.2.1: ImageNet-C - No post-hoc recalibration, small versions.
• Section D.2.2: ImageNet-C - No post-hoc recalibration, best versions.
• Section D.2.3: ImageNet-C - Isotonic recalibration, small versions.
• Section D.2.4: ImageNet-C - Isotonic recalibration, best versions.

• Section D.3.1: ImageNet-1K - No post-hoc recalibration, small versions.
• Section D.3.2: ImageNet-1K - No post-hoc recalibration, best versions.
• Section D.3.3: ImageNet-1K - Isotonic recalibration, small versions.
• Section D.3.4: ImageNet-1K - Isotonic recalibration, best versions.

R] by the empirical means of Y over each region. It is nonetheless generally biased. We show below that in the case of the Brier scoring rule, a direct empirical estimation of GL explained on the partition is biased upwards (cf

Table 1 gives the raw values.

REPRODUCIBILITY STATEMENT

All datasets are publicly available (ImageNet-R, ImageNet-C, ImageNet-1K, Yahoo Answers Topics) and all models involved are pre-trained and publicly available on PyTorch and HuggingFace. Simulated examples are described in Appendix B. Detailed experimental methods are given in Appendix D and E. Proofs of all theoretical results are listed in Appendix C. The source code for the implementation of the algorithm, experiments, simulations and figures is available on GitHub: https://github.com/aperezlebel/beyond_calibration.

R] ≥ 0. Hence both GL_explained and GL_residual are non-negative, which gives GL ≥ GL_explained. The grouping loss of the binned classifier GL(S_B) deviates from that of the original classifier GL(S) by an induced grouping loss GL_induced(S, S_B):

Raw values of the estimators in the vision (Figure 7) and NLP experiments of Section 5.2, before (CL and GL_LB) and after (CL

Comparing vision models: a debiased estimate of the grouping loss lower bound GL_LB

Vision: Fraction of correct predictions versus confidence score of predicted class (max_k S_k) on ImageNet-C for small versions of pre-trained networks, without post-hoc recalibration. In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

D.2.3 IMAGENET-C - ISOTONIC RECALIBRATION, SMALL VERSIONS

Vision: Fraction of correct predictions versus confidence score of predicted class (max_k S_k) on ImageNet-C for small versions of pre-trained networks, with isotonic recalibration. In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

D.3.1 IMAGENET-1K - NO POST-HOC RECALIBRATION, SMALL VERSIONS

Vision: Fraction of correct predictions versus confidence score of predicted class (max_k S_k) on ImageNet-1K (validation set) for small versions of pre-trained networks, without post-hoc recalibration. In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

Vision: Fraction of correct predictions versus confidence score of predicted class (max_k S_k) on ImageNet-1K (validation set) for best versions of pre-trained networks, without post-hoc recalibration. In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

D.3.3 IMAGENET-1K - ISOTONIC RECALIBRATION, SMALL VERSIONS

Vision: Fraction of correct predictions versus confidence score of predicted class (max_k S_k) on ImageNet-1K (validation set) for small versions of pre-trained networks, with isotonic recalibration. In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

ACKNOWLEDGMENTS

We acknowledge support in part by the French Agence Nationale de la Recherche under Grant ANR-20-CHIA-0026 (LearnI).


Lemma C.3. In a binary setting, suppose that [0, 1] is divided into N equal-width bins. Then:

If in addition, scores S are uniform:

Proof of Theorem C.1. In a binary setting for the Brier scoring rule, we have V_h = V. Hence:

With N equal-width bins:

Positivity of the variance

C.9 EXTENSION TO CLASSWISE CALIBRATION

C.9.1 PROPER SCORING RULES DECOMPOSITION

We show below that the proper scoring rules decomposition of Kull & Flach (2015) holds for classwise-calibration (Definition 3.2) for the Brier score and the log-loss.

Proposition C.3 (Brier and log-loss classwise decomposition). For the Brier score as well as the log-loss, the decomposition into calibration, grouping, and irreducible losses (Equation 6) holds when replacing the calibrated scores by the classwise-calibrated scores (Definition 3.2).

Proof of Proposition C.3. For all k ∈ {1, ..., K}, let

problem are given by Y′ := 1_{Y = e_{arg max(S)}}. Since S is a function of X, the random variable Y′ is a function of Y and X. Define now the scores associated to this binary problem as S′ := max(S) ∈ R.

Reformulated in terms of these notations, top-label calibration states that S′ is well calibrated if for all s, P(Y′ = 1 | S′ = s) = s. Thus, as for a classical binary problem, we can define

gives the probability that the classifier predicts the correct class for a given score S′ (resp. a given input X). As the quantities S′, C′, Q′ and Y′ define a classical binary problem, the decomposition (6) into calibration, grouping, and irreducible loss holds for this problem.

Compared to the classwise definition of calibration and grouping, here the calibration loss measures whether, on average over all points scored S across all classes, the proportion of correctly predicted points is actually S. In this setting, the grouping loss also measures to what extent there exist over-confident scores for certain classes that compensate for under-confident scores for other classes.
and scoring rule ϕ writes:

Theorem C.2 (Results in classwise setting). Suppose Equation 78 is satisfied for the scoring rule ϕ. For all k ∈ {1, ..., K}, let R_k : X → N be a partition of the feature space. It holds that:

In each bin on confidence scores, the level set is partitioned into 2 regions with a decision stump constrained to one balanced split, with a 50-50 train-test split strategy.

