A CALL TO REFLECT ON EVALUATION PRACTICES FOR FAILURE DETECTION IN IMAGE CLASSIFICATION

Abstract

Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-world application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions w.r.t. all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation given the abundance of published research on confidence scoring. Code and trained models are available at https://github.com/IML-DKFZ/fd-shifts.

1. INTRODUCTION

"Neural network-based classifiers may silently fail when the test data distribution differs from the training data. For critical tasks such as medical diagnosis or autonomous driving, it is thus essential to detect incorrect predictions based on an indication of whether the classifier is likely to fail." Such or similar mission statements prelude numerous publications in the fields of misclassification detection (MisD) (Corbière et al., 2019; Hendrycks and Gimpel, 2017; Malinin and Gales, 2018), Out-of-Distribution detection (OoD-D) (Fort et al., 2021; Winkens et al., 2020; Lee et al., 2018; Hendrycks and Gimpel, 2017; DeVries and Taylor, 2018; Liang et al., 2018), selective classification (SC) (Liu et al., 2019; Geifman and El-Yaniv, 2019; 2017), and predictive uncertainty quantification (PUQ) (Ovadia et al., 2019; Kendall and Gal, 2017), hinting at the fact that all these approaches aim towards the same eventual goal: enabling safe deployment of classification systems by means of failure detection, i.e. the detection or filtering of erroneous predictions based on a ranking of associated confidence scores. In this context, any function whose continuous output aims to separate a classifier's failures from correct predictions can be interpreted as a confidence scoring function (CSF) and represents a valid approach to the stated goal. This holistic perspective on failure detection reveals extensive shortcomings in current evaluation protocols, which constitute major bottlenecks in progress toward the goal of making classifiers suitable for application in real-world scenarios. Our work is an appeal to the corresponding communities to reflect on current practices; it provides a technical deduction of a unified evaluation protocol, a list of empirical insights based on a large-scale study, as well as hands-on recommendations for researchers to catalyze progress in the field.

Figure 1: Holistic perspective on failure detection.
Detecting failures should be seen in the context of the overarching goal of preventing silent failures of a classifier, which includes two tasks: preventing failures in the first place as measured by the "robustness" of a classifier (Task 1), and detecting the non-prevented failures by means of CSFs (Task 2, focus of this work). For failure prevention across distribution shifts, a consistent task formulation exists (featuring accuracy as the primary evaluation metric) and various benchmarks have been released covering a large variety of realistic shifts (e.g. image corruption shifts, sub-class shifts, or domain shifts). In contrast, progress in the subsequent task of detecting the non-prevented failures by means of CSFs is currently obstructed by three pitfalls: 1) A diverse and inconsistent set of evaluation protocols for CSFs exists (MisD, SC, PUQ, OoD-D) impeding comprehensive competition. 2) Only a fraction of the spectrum of realistic distribution shifts and thus potential failure sources is covered diminishing the practical relevance of evaluation. 3) The task formulation in OoD-D fundamentally deviates from the stated purpose of detecting classification failures. Overall, the holistic perspective on failure detection reveals an obvious need for a unified and comprehensive evaluation protocol, in analogy to current robustness benchmarks, to make classifiers fit for safety-critical applications. Abbreviations: CSF: Confidence Scoring Function, OoD-D: Out-of-Distribution Detection, MisD: Misclassification Detection, PUQ: Predictive Uncertainty Quantification, SC: Selective Classification.

2. PITFALLS OF CURRENT EVALUATION PRACTICES

Figure 1 gives an overview of the current state of failure detection research and its relationship to the preceding failure prevention task, which is measured by classifier robustness. This perspective reveals three main pitfalls, from which we derive three requirements R1-R3 for a comprehensive and realistic evaluation in failure detection:

Pitfall 1: Heterogeneous and inconsistent task definitions. To achieve a meaningful evaluation, all relevant solutions toward the stated goal must be part of the competition. In research on failure detection, four separate fields currently exist, each evaluating proposed methods with their individual metrics and baselines. Incomplete competition is first and foremost an issue of historically evolved delimitations between research fields, which go so far that employed metrics are by design restricted to certain methods.

MisD: Evaluation in MisD (see Section B.2.1 for a formal task definition) exclusively measures discrimination of a classifier's success versus failure cases by means of ranking metrics such as AUROC_f (Hendrycks and Gimpel, 2017; Jiang et al., 2018; Corbière et al., 2019; Bernhardt et al., 2022). This protocol excludes a substantial part of relevant CSFs from comparison, because any CSF that affects the underlying classifier (e.g. by introducing dropout or alternative loss functions) alters the set of classifier failures, i.e. the ground truth labels, and thus creates its own test set (for a visualization of this pitfall see Figure 4). As an example, a CSF that negatively affects the accuracy of a classifier might add easy-to-detect failures to its test set and benefit in the form of high AUROC_f scores. As depicted in Figure 1, we argue that the task of detecting failures is not an end in itself; rather, preventing and detecting failures are two sides of the same coin when striving to avoid silent classification failures.
Thus, CSFs should be evaluated as part of a symbiotic system with the associated classifier. While additionally reporting the classifier accuracy associated with each CSF renders these effects transparent, it requires nontrivial weighting of the two metrics when aiming to rank CSFs based on a single score.

PUQ: Research in PUQ often remains vague about the concrete application of extracted uncertainties, stating the purpose of providing "meaningful confidence values" (Ovadia et al., 2019; Lakshminarayanan et al., 2017) (see Appendix B.2.3 for a formal task definition), which conflates the related but independent use cases of failure detection and confidence calibration. This (arguably vague) goal is reflected in the evaluation, where typically strictly proper scoring rules (Gneiting and Raftery, 2007) such as the negative log-likelihood assess a combination of ranking and calibration of scores. However, for failure detection use cases, an explicit assessment of failure detection performance is desired (see Appendix C for a discussion of how calibration relates to failure detection). Furthermore, these metrics are specifically tailored towards probabilistic predictive outputs such as softmax classifiers and exclude all other CSFs from comparison.

→ Requirement 1 (R1): Comprehensive evaluation requires a single standardized score that applies to arbitrary CSFs while taking into account their effects on the classifier.

Pitfall 2: Ignoring the major part of relevant failure sources. As stated in the introductory quote, research on failure detection typically expects classification failures to occur when inputs upon application differ from the training data distribution. As shown in Figure 1, we distinguish "covariate shifts" (label-preserving shifts) versus "new-class shifts" (label-altering shifts). For a detailed formulation of different failure sources, see Appendix A.
The fact that, in the related task of preventing failures, a myriad of nuanced covariate shifts have been released on various data sets and domains (Koh et al., 2021; Santurkar et al., 2021; Hendrycks and Dietterich, 2019; Liang and Zou, 2022; Wiles et al., 2022) to catalyze real-world progress of classifier robustness begs the question: If simulating realistic classification failures is such a delicate and extensive effort, why are there no analogous benchmarking efforts in the research on detecting failures? Instead, CSFs are currently almost exclusively evaluated on i.i.d. test sets (MisD, PUQ, SC). Exceptions (see hatched areas in Figure 1) are a PUQ study that features corruption shifts (Ovadia et al., 2019), SC evaluated under sub-class shift (comparing a fixed CSF under varying classifiers) (Tran et al., 2022), and SC applied to question answering under domain shift (Kamath et al., 2020). Further, research in OoD-D (see Section B.2.2 for a formal task definition) exclusively evaluates methods under one limited fraction of failure sources: new-class shifts (see images 7 and 8 in Figure 2, right panel). A recent trend in this area is to focus on "near OoD" scenarios, i.e. shifts affecting semantic image features but leaving the context unchanged (Winkens et al., 2020; Fort et al., 2021; Ren et al., 2021). While the notion that nuanced shifts might bear more practical relevance compared to vast context switches seems reasonable, the term "near" is misleading, as it ignores the whole spectrum of even "nearer" and thus potentially more relevant covariate shifts, which OoD-D methods are not tested against. We argue that for most applications it is not realistic to assume classification failures exclusively from label-altering shifts and no failures caused by label-preserving shifts.

→ Requirement 2 (R2): Analogously to robustness benchmarks, progress in failure detection requires evaluation on a nuanced and diverse set of failure sources.
Pitfall 3: Discrepancy between the stated purpose and evaluation. The described limitations of OoD-D evaluation are only symptoms of a deeper-rooted problem: Methods are not tested on predicting failures of a classifier, but instead on predicting an external, i.e. classifier-agnostic, "outlier" label. In some cases, this formulation reflects the inherent nature of the given problem, such as in anomaly detection, where no underlying task is defined and the data sets are potentially unlabeled (Ruff et al., 2021). However, the majority of work on OoD-D comes with a defined classification task, including training labels, and states detecting failures of the classifier as its main purpose (Fort et al., 2021; Winkens et al., 2020; Lee et al., 2018; Hendrycks and Gimpel, 2017; DeVries and Taylor, 2018; Liang et al., 2018). Yet, this line of work falls short of justifying why associated methods are subsequently not shown to detect the said failures but are instead tested on the surrogate task of detecting distribution shifts in the data. Figure 2 shows that the outlier label constitutes a poor tool to define which cases we wish to filter, because the question "what is an outlier?" is highly subjective for covariate shifts (see purple question marks). The ambiguity of the label extends to the concept of "inliers" (what extent of data variation is still considered i.i.d.?), which the protocol rewards to retain irrespective of whether they cause the classifier to fail (see purple lightning).

→ Requirement 3 (R3): If there is a defined classifier whose incorrect predictions are to be detected, its respective failure information should be used to assess CSFs w.r.t. the stated purpose instead of a surrogate task such as distribution shift detection.

Figure 2: Left: The discrepancy between the commonly stated purpose and evaluation in OoD-Detection.
The stated eventual purpose of detecting incorrect predictions of a classifier is represented by the binary "failure label" and its associated event space (top). However, in practice this goal is merely approximated by instead evaluating the detection of distribution shift, i.e. separating cases according to a binary "outlier label" irrespective of the classifier's correctness (bottom). Right: Exemplary failure detection study under different types of failure sources. A hypothetical classifier trained to distinguish "ape" from "bear" is evaluated on 8 images under a whole range of relevant distribution shifts: For instance, images 5 and 6 depict apes, but these were not among the breeds in the training data and thus constitute sub-class shifts. Images 7 and 8 depict entirely unseen categories, but while "meerkat" stays within the task context ("semantic", "near OoD"), "house number" represents a vast context switch ("non-semantic", "far OoD"). See Appendix A for a detailed formulation of shifts.
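The mismatch between the two event spaces in Figure 2 (left) can be made concrete with a toy example. The cases below are invented for illustration (they are not taken from the study's data); they mirror the four qualitatively different outcomes a shifted evaluation set can contain.

```python
# Toy cases in the spirit of Figure 2: (description, under_shift, classifier_correct).
# All entries are illustrative, not from the paper's experiments.
cases = [
    ("i.i.d., correctly classified",     False, True),
    ("i.i.d., misclassified",            False, False),  # silent failure on an "inlier"
    ("covariate shift, still correct",   True,  True),   # "outlier" the classifier handles
    ("new class, necessarily a failure", True,  False),
]

# OoD-D protocol: the positive class is "outlier", irrespective of correctness.
outlier_label = [int(shifted) for _, shifted, _ in cases]

# Failure-detection protocol: the positive class is a classifier failure.
failure_label = [int(not correct) for _, _, correct in cases]

disagreements = sum(o != f for o, f in zip(outlier_label, failure_label))
```

The two labelings disagree on half of the cases: the OoD-D protocol ignores the inlier misclassification (case 2) and rewards retaining the correctly handled shifted case (case 3) as an "outlier" to be flagged.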

3. UNIFIED TASK FORMULATION

Parsing the quoted purpose statement at the start of Section 1 results in the following task formulation: Given a data set {(x_i, y_cl,i)}_{i=1}^N of size N, with (x_i, y_cl,i) independent samples from X × Y and y_cl the class label, and given a pair of functions (m, g), where g: X → ℝ is a CSF and m(·, w): X → Y is the classifier with model parameters w, the classification output after failure detection is defined as:

(m, g)(x) := P_m(y_cl | x, w) if g(x) ≥ τ; filter otherwise.    (1)

Filtering ("detection") is triggered when g(x) falls below a threshold τ. In order to perform meaningful failure detection, a CSF g(x) is required to output high confidence scores for correct predictions and low confidence scores for incorrect predictions according to the binary failure label

y_f(x, w, y_cl) = I(y_cl ≠ ŷ_m(x, w)), where ŷ_m = argmax_{c∈Y} P_m(y_cl = c | x, w)

and I is the indicator function (1 for true events and 0 for false events). Despite accurately formalizing the stated purpose of numerous methods from MisD, OoD-D, SC, and PUQ, and allowing for the evaluation of arbitrary CSFs g(x), this generic task formulation is currently only stated in SC research (see Appendix B for a detailed technical description of all protocols considered in this work). To derive an appropriate evaluation metric for the formulated task, we start with the ranking requirement on g(x), which is assessed e.g. by AUROC_f in MisD, leading to the pitfalls described in Section 2. Following R1 and modifying AUROC_f to take classifier performance into account lets us naturally converge (see Appendix B.2.5 for the technical process) on a metric that has previously been proposed in SC as a byproduct, but is not widely employed for evaluation (Geifman et al., 2019): the Area under the Risk-Coverage Curve (AURC, see Equation 31). We propose AURC as the primary metric for all methods with the stated purpose of failure detection, as it fulfills all three requirements R1-R3 in a single score.
The inherently joint assessment of classifier accuracy and CSF ranking performance comes with a meaningful weighting between the two aspects, eliminating the need for manual (and potentially arbitrary) score aggregation. AURC measures the risk or error rate (1 -Accuracy) on the non-filtered cases averaged over all filtering thresholds (score range: [0,1], lower is better) and can be interpreted as directly assessing the rate of silent failures occurring in a classifier. While this metric enables a general evaluation of CSFs, depending on the use case, a more specific assessment of certain coverage regions (i.e. the ratio of remaining cases after filtering) or even single risk-coverage working points might be appropriate. In Appendix F we provide an open source implementation of AURC fixing several shortcomings of previous versions.
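The task formulation and the AURC metric above can be sketched in a few lines of NumPy. This is a simplified reading that sweeps coverage one case at a time; confidence ties and interpolation are handled more carefully in the implementation referenced in Appendix F.

```python
import numpy as np

def selective_classify(probs, g_x, tau):
    """(m, g)(x): return the argmax prediction if g(x) >= tau, else filter (None)."""
    return int(np.argmax(probs)) if g_x >= tau else None

def aurc(confidence, correct):
    """Area under the risk-coverage curve (lower is better).
    Risk = error rate on the retained cases, averaged over all filtering
    thresholds, approximated here by growing coverage one case at a time."""
    order = np.argsort(-np.asarray(confidence, float))   # most confident first
    errors = 1.0 - np.asarray(correct, float)[order]
    coverages = np.arange(1, len(errors) + 1)
    risks = np.cumsum(errors) / coverages                # risk at coverage k/N
    return float(risks.mean())

# A CSF that ranks the failure below all correct predictions gets a low AURC;
# giving the failure the highest confidence makes every threshold retain it.
good = aurc([0.9, 0.8, 0.7, 0.2], [1, 1, 1, 0])
bad  = aurc([0.2, 0.8, 0.7, 0.9], [1, 1, 1, 0])
```

Because the failure label is recomputed from each classifier's own predictions, a CSF that degrades accuracy raises the cumulative error counts and therefore the AURC, which is exactly the joint weighting discussed above.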

3.1. REQUIRED MODIFICATIONS FOR CURRENT PROTOCOLS

The general modifications necessary to shift from current protocols to a comprehensive and realistic evaluation of failure detection, i.e. to fulfill requirements R1-R3, are straightforward for considered fields (SC, MisD, PUQ, OoD-D): Researchers may simply consider reporting performance in terms of AURC and benchmark proposed methods against relevant baselines from all previously separated fields as well as on a realistic variety of failure sources (i.e. distribution shifts). An additional aspect needs to be considered for SC, where the task is to solve failure prevention and failure detection at the same time (see Task 1 and Task 2 in Figure 1 ), i.e. the goal is to minimize absolute AURC scores. This setting includes studies that compare different classifiers while fixing the CSF (Tran et al., 2022) . On the contrary, the evaluation of failure detection implies a focus on the performance of CSFs (Task 2 in Figure 1 ) while monitoring the classifier performance as a requirement (R1) to ensure a fair comparison of arbitrary CSFs. This shift of focus is reflected in the fact that the classifier architecture, as well as training procedure, are to be fixed (with some exceptions as described in Appendix E.4) across all compared CSFs. This way, external variations in classifier configuration are removed as a nuisance factor from CSF evaluation and the direct effect of the CSF on the classifier training is isolated to enable a relative comparison of AURC scores. For evaluation of new-class shifts (as currently performed in OoD-D), a further modification is required: The current OoD-D protocol rewards CSFs for not detecting inlier misclassifications (see Figure 2 ). On the other hand, penalizing CSFs for not detecting these cases (as handled by AURC) would dilute the desired evaluation focus on new-class shifts. Thus, we propose to remove inlier misclassifications from evaluation when reporting a CSF's performance under new-class shift. 
Figure 5 visualizes the proposed modification. Notably, the proposed protocol does still consider the CSF's effect on classifier performance (i.e. it does not break with R1), since the CSF's effect on classifier accuracy is still reflected in the AURC scores (lower accuracy causes higher AURC; see Equations 29-31).
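Our reading of this modification as code (function and variable names are ours, not from the paper's framework): inlier misclassifications are dropped, and every retained new-class case is a failure by definition, since its true class lies outside the label space.

```python
import numpy as np

def newclass_shift_eval_set(confidence, correct, is_new_class):
    """Build the evaluation set for a new-class shift as described in Section 3.1:
    inlier misclassifications are removed from evaluation; every retained
    new-class case carries failure label 1 by definition."""
    confidence = np.asarray(confidence, float)
    correct = np.asarray(correct, bool)
    is_new_class = np.asarray(is_new_class, bool)
    keep = is_new_class | correct            # drop inlier misclassifications only
    y_f = is_new_class[keep].astype(int)     # failure label on the retained cases
    return confidence[keep], y_f

conf, y_f = newclass_shift_eval_set(
    confidence=[0.9, 0.5, 0.3],
    correct=[True, False, False],
    is_new_class=[False, False, True],
)
# The inlier misclassification (confidence 0.5) is excluded from evaluation.
```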

3.2. OWN CONTRIBUTIONS IN THE PRESENCE OF SELECTIVE CLASSIFICATION

Given the fact that the task definition in Equation 1, as well as AURC, the metric advocated for in this paper, have been formulated before in SC literature (see Appendix B.2.4 for technical details on current evaluation in SC), it is important to highlight that the relevance of our work is not limited to advancing research in SC. Next to the shift of focus to CSFs described in Section 3.1, we articulate a call to the other communities (MisD, OoD-D, PUQ) to reflect on current practices. In other words, the relevance of our work derives from providing evidence for the necessity of the SC protocol in previously separated research fields and from extending their scope of evaluation (including the current scope of SC) w.r.t. compared methods and considered failure sources.

4. EMPIRICAL STUDY

To demonstrate the relevance of a holistic perspective on failure detection, we performed a large-scale empirical study, which we refer to as FD-Shifts. For the first time, state-of-the-art CSFs from MisD, OoD-D, PUQ, and SC are benchmarked against each other. Also for the first time, analogously to recent robustness studies, CSFs are evaluated on a nuanced variety of distribution shifts to cover the entire spectrum of failure sources.

4.1. UTILIZED DATA SETS

Appendix E features details of all used data sets, and Appendix A describes the considered distribution shifts. FD-Shifts benchmarks CSFs on CAMELYON-17-Wilds (Koh et al., 2021), iWildCam-2020-Wilds (Koh et al., 2021), and BREEDS-ENTITY-13 (Santurkar et al., 2021), which have originally been proposed to evaluate classifier robustness (Task 1 in Figure 1) under sub-class shift in various domains. Further sub-class shifts are considered in the form of super-classes of CIFAR-100 (Krizhevsky, 2009), where one random class per super-class is held out during training. For studying corruption shifts, we report results on the 15 corruption types and 5 intensity levels proposed by Hendrycks and Dietterich (2019).

Confidence scoring functions: Among the benchmarked CSFs is the Maximum Logit Score (MLS), proposed for semantic new-class shifts by Vaze et al. (2022), who argue that the softmax operation cancels out feature magnitudes relevant for OoD-D (we also add MLS scores averaged over MCD samples to the benchmark: MCD-MLS). Finally, we include the recently reported state-of-the-art approach: the Mahalanobis Distance (MAHA) measured on representations of a Vision Transformer (ViT) that has been pretrained on ImageNet (Fort et al., 2021).

Classifiers: Because this change of classifier architecture biases the comparison of CSFs, we additionally report the results for selected CSFs when trained in conjunction with a ViT classifier. For implementation and training details, see Appendix E. Since drawing conclusions from re-implemented baselines requires care, we report reproducibility results for all baselines, including justifications for all hyperparameter deviations from the original configurations, in Appendix J.
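The CIFAR-100 sub-class shift described above (one random class per super-class held out during training) can be sketched as follows; the coarse-to-fine mapping shown is a truncated, hypothetical stand-in for the real 20-super-class hierarchy.

```python
import random

# Hypothetical excerpt of a super-class -> sub-class mapping
# (not the full CIFAR-100 hierarchy).
superclasses = {
    "aquatic_mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "flowers":         ["orchid", "poppy", "rose", "sunflower", "tulip"],
}

def subclass_shift_split(superclasses, seed=0):
    """Hold out one random sub-class per super-class: the held-out classes
    appear only at test time and constitute the sub-class shift."""
    rng = random.Random(seed)
    train, shifted = {}, {}
    for coarse, fine in superclasses.items():
        held_out = rng.choice(fine)
        train[coarse] = [f for f in fine if f != held_out]
        shifted[coarse] = held_out
    return train, shifted

train, shifted = subclass_shift_split(superclasses)
```

At test time, the classifier predicts super-class labels, so a held-out sub-class is a covariate (label-preserving) shift rather than a new-class shift.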

4.3. RESULTS

The broad scope of this work is reflected in the type of empirical observations we make: We view the holistic task protocol as an enabler for future research, thus we showcase the variety of research questions and topics that are now unlocked rather than providing an in-depth analysis of a single observation. Appendix G.1 features a discussion of how this study empirically confirms R1-R3 stated in Section 2.

Proposed CSFs do not generalize across distribution shifts. Even on the test data they have been proposed on, all three evaluated methods struggle to outperform simple baselines. These findings indicate a pressing need to evaluate newly proposed CSFs for failure detection on a wide variety of data sets and distribution shifts in order to draw general methodological conclusions.

Prevalent OoD-D methods are only relevant in a narrow range of distribution shifts. The proposed evaluation protocol allows, for the first time, studying the relevance of the predominant OoD-D methods across a realistic range of distribution shifts. While for non-semantic new-class shifts ("far OoD") the prevalent methods from OoD-D (MLS, MCD-MLS, MAHA) show the best performance across both classifiers, their superiority already vanishes on semantic new-class shifts (only ViT-based MAHA on SVHN shows best performance). On the broad range of more nuanced (and arguably more realistic) covariate shifts, however, OoD-D methods are widely outperformed by softmax baselines. This finding points out an interesting future research direction: developing CSFs that are able to detect failures across the entire range of distribution shifts.

AURC is able to resolve previous obscurities between classifier robustness and CSF performance. The results of ConfidNet provide a vivid example of the relevance of assessing classifier performance and confidence ranking in a single score when evaluating CSFs. The original publication reports superior results on CIFAR-10 and CIFAR-100 compared to the MCD-MSR baseline as measured by the MisD metric AUROC_f.
These results are confirmed in Table 9, but we observe a beneficial effect of MCD on classifier training that leads to improved accuracy (see Table 8). This poses the question: Which of the two methods (ConfidNet or MCD-MSR) will eventually lead to fewer silent failures of the classifier? One directly aids the classifier in producing fewer failures, while the other seems better at detecting the existing ones (at least on its own test set with potentially more easily preventable failures). AURC naturally answers this question by expressing the two effects in a single score that directly relates to the overarching goal of preventing silent failures. This reveals that the MCD-MSR baseline is in fact superior to ConfidNet on the i.i.d. test sets of both CIFAR-10 and CIFAR-100.

ViT outperforms the CNN classifier on most data sets. Figure 8 shows a comparative analysis between the ViT and CNN classifiers across several metrics. As for AURC, ViT outperforms the CNN on all data sets except iWildCam, indicating that the domain gap of ImageNet-pretrained representations might be too large for this task. This is an interesting observation, given that CAMELYON, featuring images from the biomedical domain, could intuitively represent a larger domain gap. Looking further at Accuracy and AUROC_f performance, we see that the performance gains clearly stem from improved classifier accuracy, while the CSF ranking performance is on par for ViT and CNN (although the failure detection task might be harder for ViT given fewer detectable failures compared to the CNN).

Different types of uncertainty are empirically not distinguishable. Considering the associations made in the literature between uncertainty measures and specific types of uncertainty (see Appendix I), we are interested in the extent to which such relations can be confirmed by empirical evidence from our experiments.
As an example, we would expect mutual information (MCD-MI) to perform well on new-class shifts, where model uncertainty should be high, and expected entropy (MCD-EE) to perform well on i.i.d. cases, where inherent uncertainty in the data (seen during training) is considered the prevalent type of uncertainty. Although, as expected, MCD-EE generally performs better than MCD-MI on i.i.d. test sets, the reverse behavior cannot be observed under distribution shifts. Therefore, no clear distinction can be made between aleatoric and epistemic uncertainty based on the expected benefits of the associated uncertainty measures. Furthermore, no general advantage of entropy-based uncertainty measures over the simple MCD-MSR baseline is observed.

CSFs beyond Maximum Softmax Response yield well-calibrated scores. We advocate for a clear purpose statement in research related to confidence scoring, which for most scenarios implies a separation of the tasks of confidence calibration and confidence ranking (see Section 2). Nevertheless, to demonstrate the relevance of our holistic perspective, we extend FD-Shifts to assess the calibration error, a measure previously applied exclusively to softmax outputs, for all considered CSFs. Platt scaling is used to calibrate CSFs with a natural output range beyond [0, 1] (Platt, 1999). Calibration errors of CSFs are reported in Table 10, indicating that currently neglected CSFs beyond MSR provide competitive calibration (e.g. MCD-PE on CNN or MAHA on ViT) and thus constitute appropriate confidence scores to be interpreted directly by the user. This observation points out a potential research direction where, analogously to the quest for CSFs that outperform softmax baselines in confidence ranking, it might be possible to identify CSFs that yield better calibration than softmax outputs across a wide range of distribution shifts.

The Maximum Softmax Response baseline is disadvantaged by numerical errors in the standard setting.
Running inference for our empirical study yields terabytes of output data. When attempting to save disk space by storing logits as 16-bit precision floats instead of 32-bit precision, we found the confidence ranking performance of MSR baselines to drop substantially (worse AURC and AUROC_f scores). This effect is caused by a numerical error, where high logit scores are rounded to 1 during the softmax operation, thereby losing the ranking information between rounded scores. Surprisingly, when returning to 32-bit precision, we found the rate at which rounding errors occur to still be substantial, especially on the ViT classifier (which has higher accuracy and confidence scores compared to the CNN). Table 2 shows error rates as well as affected metrics for different floating point precisions. Crucially, confidence ranking on ViT classifiers is still affected by rounding errors even in the default 32-bit precision setting (effects for the CNN are marginal, as seen in AURC scores); see for instance AUROC_f drops of 9% on CIFAR-10 and 5.47% on BREEDS (i.e. ImageNet data). This finding has far-reaching implications affecting any ViT-based MSR baseline used for confidence ranking tasks (including current OoD-D literature). We recommend either casting logits to 64-bit precision (as performed for our study) or applying temperature scaling prior to the softmax operation in order to minimize the rounding errors.

Further results. Despite its relevance for application, the final step of failure detection, i.e. the definition of a decision threshold on the confidence score, is often neglected in research. In Appendix D we present an approach that does not require the calibration of scores and analyze its reliability under distribution shift. In addition, Appendix G features Accuracy and AUROC_f results for all experiments. For a qualitative study of failure cases, see Appendix H.
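The rounding effect described above (confident logits collapsing to a softmax response of exactly 1) is easy to reproduce. The sketch below compares the maximum softmax response of two hypothetical samples with clearly different logit margins at different precisions; the logit values are invented for illustration.

```python
import numpy as np

def msr(logits, dtype):
    """Maximum softmax response at a given floating point precision,
    using the standard max-subtraction trick for numerical stability."""
    z = np.asarray(logits, dtype=dtype)
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p = p / p.sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

# Two confident samples with different logit margins (20 vs. 25).
logits = [[30.0, 10.0], [30.0, 5.0]]

conf32 = msr(logits, np.float32)   # both collapse to exactly 1.0 -> ranking tie
conf64 = msr(logits, np.float64)   # the margins remain distinguishable
```

In float32 the denominator 1 + exp(-margin) rounds to 1 for any margin above roughly 17 (machine epsilon ≈ 1.2e-7), so the two samples become indistinguishable for ranking; in float64 they stay ordered.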

5. CONCLUSION & TAKE-AWAYS

This work does not propose a novel method, metric, or data set. Instead, following the calls for a more rigorous understanding of existing methods (Lipton and Steinhardt, 2018; Sculley et al., 2018) and evaluation pitfalls (Reinke et al., 2021), its relevance comes from providing compelling theoretical and empirical evidence that a review of current evaluation practices is necessary for all research aiming to detect classification failures. Our results demonstrate vividly that the need for reflection in this field outweighs the need for novelty: None of the prevalent methods proposed in the literature is able to outperform a softmax baseline across a range of realistic failure sources. Therefore, our take-home messages are:

1. Research on confidence scoring (including MisD, OoD-D, PUQ, SC) should come with a clearly defined use case and employ a meaningful evaluation protocol that directly reflects this purpose.
2. If the stated purpose is to detect failures of a classifier, evaluation needs to take into account potential effects on the classifier performance. We recommend AURC as the primary metric, as it combines the two aspects in a single score.
3. Analogously to failure prevention ("robustness"), evaluation of failure detection should include a realistic and nuanced set of distribution shifts covering potential failure sources.
4. Comprehensive evaluation in failure detection requires comparing all relevant solutions towards the same goal, including methods from previously separated fields.
5. The inconsistency of our results across data sets indicates the need to evaluate failure detection on a variety of diverse data sets.
6. Logits should be cast to 64-bit precision or temperature-scaled prior to the softmax operation for any ranking-related task to avoid subpar softmax baselines.
7. Calibration of confidence scoring functions beyond softmax outputs should be considered as an independent task.
8. Our open-source framework features implementations of baselines, metrics, and data sets that allow researchers to perform meaningful benchmarking of confidence scoring functions.

B TASK FORMULATIONS ADDRESSING FAILURE DETECTION

This section provides a technical exposition of current evaluation protocols for all task formulations claiming to address the goal of preventing silent failures of an associated classifier by means of CSFs. Figure 3 gives an overview of relevant metrics in this context.

B.1 CLASSIFICATION

To set the context, we first describe the standard evaluation protocol of a classification task. In our notation, y_cl denotes the class label, distinguishing it from the failure detection label y_f. Given a data set {(x_i, y_cl,i)}_{i=1}^N of size N, with (x_i, y_cl,i) independent samples from X × Y and discrete class labels y_cl ∈ Y = {0, 1, 2, ..., n_classes − 1}, a classifier m(·, w): X → Y with model parameters w maps input images x to predicted labels ŷ. The prediction ŷ_m is obtained as the class with maximum probability in the model's probability output vector P_m(y_cl|x, w) ∈ [0, 1]^{n_classes}:

ŷ_m = argmax_{c ∈ Y} P_m(y_cl = c | x, w),

so that for sample i the prediction reads ŷ_{m,i} = argmax_{c ∈ Y} P_m(y_cl = c | x_i, w). In multi-class setups, the performance of such a classifier is often evaluated via accuracy,

Accuracy = (1/N) Σ_{i=1}^N I(y_cl,i = ŷ_{m,i}),

where I(Event) denotes the indicator function:

I(Event) := 1 if the event Event occurs, 0 otherwise. (6)

Another option is to evaluate classes separately by computing the binary class label y_c = I(y_cl = c), i.e. y_{c,i} = I(y_cl,i = c) for sample i. Subsequently, the confusion matrix can be computed by counting the four possible evaluation outcomes per case and class c:

TP_cl(θ, c) = Σ_i y_{c,i} · I(P_m(y_cl = c|x_i, w) ≥ θ) (7)
FP_cl(θ, c) = Σ_i (1 − y_{c,i}) · I(P_m(y_cl = c|x_i, w) ≥ θ) (8)
TN_cl(θ, c) = Σ_i (1 − y_{c,i}) · I(P_m(y_cl = c|x_i, w) < θ) (9)
FN_cl(θ, c) = Σ_i y_{c,i} · I(P_m(y_cl = c|x_i, w) < θ). (10)

These cardinalities are defined depending on the cut-off θ applied to the predicted class probabilities (PCP).
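The per-class confusion entries of Equations 7-10 translate directly to a few lines of numpy (an illustrative sketch with our own function and variable names):

```python
import numpy as np

def classwise_confusion(probs, y_cl, c, theta):
    """TP/FP/TN/FN for class c at probability cut-off theta (cf. Equations 7-10).

    probs: [N, n_classes] predicted class probabilities; y_cl: [N] integer labels.
    """
    y_c = (y_cl == c).astype(int)        # binary class label y_c = I(y_cl = c)
    selected = probs[:, c] >= theta      # I(P_m(y_cl = c | x, w) >= theta)
    tp = int(np.sum(y_c * selected))
    fp = int(np.sum((1 - y_c) * selected))
    tn = int(np.sum((1 - y_c) * ~selected))
    fn = int(np.sum(y_c * ~selected))
    return tp, fp, tn, fn

def accuracy(probs, y_cl):
    # accuracy of the argmax decision
    return float(np.mean(probs.argmax(axis=1) == y_cl))
```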
Subsequently, various counting metrics can be applied, for instance Sensitivity (also called Recall or True Positive Rate), False Positive Rate (FPR, or 1 − Specificity), and Precision per class c:

Sensitivity_cl(θ, c) = TP_cl(θ, c) / (TP_cl(θ, c) + FN_cl(θ, c)) (11)
FPR_cl(θ, c) = FP_cl(θ, c) / (TN_cl(θ, c) + FP_cl(θ, c)) (12)
Precision_cl(θ, c) = TP_cl(θ, c) / (TP_cl(θ, c) + FP_cl(θ, c)). (13)

Next to evaluating these metrics at a certain cut-off θ on PCPs, multi-threshold metrics are often employed, which scan over cut-offs given by all PCP values present in the data set to obtain ROC curves (Sensitivity plotted over FPR) or Precision-Recall curves (PRC, Precision plotted over Sensitivity). Model performance in the form of a single score is then extracted by computing the respective area under the curve, e.g. the AUROC for ROC curves. Here, the multi-threshold list {θ_t}_{t=0}^T of length T contains the cut-off values obtained as the unique values of the descending ranking of all PCP values. The class-wise AUROC can then be computed as:

AUROC_cl(c) = Σ_{t=1}^T 1/2 · (FPR_cl(θ_t, c) − FPR_cl(θ_{t−1}, c)) · (Sensitivity_cl(θ_t, c) + Sensitivity_cl(θ_{t−1}, c)). (14)

The AUPRC for PR curves is defined similarly and is commonly approximated by the average precision (AP) score due to interpolation issues:

AP_cl(c) = Σ_{t=1}^T (Sensitivity_cl(θ_t, c) − Sensitivity_cl(θ_{t−1}, c)) · Precision_cl(θ_t, c). (15)

Both areas under the curve can be interpreted as ranking metrics, i.e. they reward cases of class c being separated from cases of other classes based on a ranking of PCP values.
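As a sketch of the multi-threshold scan and the trapezoidal class-wise AUROC above (our own minimal implementation, not the paper's code):

```python
import numpy as np

def auroc_cl(scores, y_c):
    """Trapezoidal AUROC for one class from per-class scores and binary labels.

    Thresholds are the unique score values in descending order, as in the text.
    """
    thresholds = np.unique(scores)[::-1]
    pos, neg = y_c.sum(), (1 - y_c).sum()
    sens, fpr = [0.0], [0.0]
    for t in thresholds:
        selected = scores >= t
        sens.append(np.sum(y_c * selected) / pos)       # TP / (TP + FN)
        fpr.append(np.sum((1 - y_c) * selected) / neg)  # FP / (FP + TN)
    sens, fpr = np.array(sens), np.array(fpr)
    # trapezoidal rule over the FPR steps
    return float(np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2))
```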

B.2 FAILURE DETECTION

The concrete application of CSFs for the task of detecting failures in order to prevent incorrect predictions of a classifier is formulated in Equation 1 in Section 3. Based on the failure label of Equation 2, which is 1 for a classification failure and 0 for a success, a confusion matrix can be determined for a cut-off value τ as follows:

TP_f(τ) = Σ_i (1 − y_f,i) · I(g(x_i) ≥ τ) (16)
FP_f(τ) = Σ_i y_f,i · I(g(x_i) ≥ τ) (17)
TN_f(τ) = Σ_i y_f,i · I(g(x_i) < τ) (18)
FN_f(τ) = Σ_i (1 − y_f,i) · I(g(x_i) < τ). (19)

Note that the positive class here corresponds to correct predictions that are selected (i.e. not filtered) by the CSF g. Analogous to the evaluation of classification performance, different counting metrics can be computed for the confidence ranking task:

Sensitivity_f(τ) = TP_f(τ) / (TP_f(τ) + FN_f(τ)) (20)
FPR_f(τ) = FP_f(τ) / (TN_f(τ) + FP_f(τ)) (21)
Precision_f(τ) = TP_f(τ) / (TP_f(τ) + FP_f(τ)). (22)
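These counts translate directly to code; a minimal numpy sketch (our own naming) where g holds the confidence scores and y_f the failure labels:

```python
import numpy as np

def failure_confusion(g, y_f, tau):
    """Equations 16-19: positives are correct predictions selected by the CSF."""
    selected = g >= tau
    tp = int(np.sum((1 - y_f) * selected))   # correct and selected
    fp = int(np.sum(y_f * selected))         # failure but selected (silent failure)
    tn = int(np.sum(y_f * ~selected))        # failure and filtered
    fn = int(np.sum((1 - y_f) * ~selected))  # correct but filtered
    return tp, fp, tn, fn
```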

B.2.1 MISCLASSIFICATION DETECTION

This evaluation protocol for CSFs directly follows the binary classification task between correct and incorrect predictions defined above, and uses ranking metrics analogous to the areas under the curve defined for classification evaluation, based on a multi-threshold list {τ_t}_{t=0}^T of length T whose cut-off values are the unique values of the ascending ranking of confidence scores. Most commonly used are the failure detection AUROC,

AUROC_f = Σ_{t=1}^T (FPR_f(τ_{t−1}) − FPR_f(τ_t)) · (Sensitivity_f(τ_t) + Sensitivity_f(τ_{t−1}))/2, (23)

and the failure detection AUPRC or AP score,

AP_f = Σ_{t=1}^T (Sensitivity_f(τ_{t−1}) − Sensitivity_f(τ_t)) · Precision_f(τ_t). (24)

Expanding Sensitivity_f and Precision_f via Equations 16-22 shows that both factors carry a (1 − y_f,i) weighting, i.e. only correct predictions contribute. The data set in these tasks is often biased towards many correct predictions and few incorrect ones, since one does not usually think about preventing failures if classification performance is very poor in the first place. This biases the failure detection AP score towards higher values. Therefore, the reverse form, which defines errors as positive samples and is not biased in this way, is also evaluated:

AP_f,err = Σ_{t=1}^T (Sensitivity_f,err(τ_{t−1}) − Sensitivity_f,err(τ_t)) · Precision_f,err(τ_t), (25)

where Sensitivity_f,err and Precision_f,err are computed with y_f as the positive label and selection by I(g(x_i) < τ). As described in Section 2, this protocol comes with the pitfall of not considering the classifier performance: potential effects of CSFs on the classifier caused by joint training go unnoticed (for a visualization of this pitfall see Figure 4).
This effect can be pinned down to the (1 − y_f,i) factor of Equation 24: the precision score (second factor of the product) only receives weight for correct predictions (y_f = 0). Under perfect separation it is therefore possible to traverse all correct predictions (all "weights") while precision is one, and any subsequent incorrect predictions are not considered by the metric at all. This behaviour prevents meaningful comparison of arbitrary CSFs in practice. For instance, consider a comparison between two very standard CSFs: the confidence scores derived from the maximum softmax response (MSR) of a classifier versus the scores derived from the softmax mean over Monte Carlo Dropout (MCD) samples of the same classifier. This comparison can already induce considerable evaluation bias, because the classification decision based on MCD means most likely produces a different set of failure cases. What is more, the associated metrics are notoriously sensitive to such label biases: while it is well known that high failure detection AUROC_f scores can be achieved via a bias towards true negatives (such as background objects in object detection), the failure detection AP_f,err with failure as the positive label (Hendrycks and Gimpel, 2017; Corbière et al., 2019) can equally be hacked by a bias towards easy-to-detect true positives. As a consequence, this metric typically yields its highest scores at the beginning of model training, when the classifier still produces large amounts of failures.
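For completeness, a short scikit-learn sketch (illustrative data, our own variable names) of how AUROC_f and the error-as-positive AP are typically computed; note the label and score-sign conventions, since correct predictions are the positive class for AUROC_f while AP_f,err treats failures as positives and therefore ranks by negated confidence:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

g = np.array([0.95, 0.90, 0.60, 0.40, 0.20])   # confidence scores
y_f = np.array([0, 0, 1, 0, 1])                # 1 = failure, 0 = success

auroc_f = roc_auc_score(1 - y_f, g)            # correct predictions as positives
ap_f = average_precision_score(1 - y_f, g)     # success as the positive label
ap_f_err = average_precision_score(y_f, -g)    # failures as positives, ranked by low confidence
```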

B.2.2 OUT-OF-DISTRIBUTION DETECTION

In out-of-distribution detection, the label y_out signals whether a sample is an outlier (y_out = 1) or an inlier (y_out = 0). Based on this label, the AUROC is primarily used as a performance metric:

AUROC_out = Σ_{t=1}^T (FPR_out(τ_{t−1}) − FPR_out(τ_t)) · (Sensitivity_out(τ_t) + Sensitivity_out(τ_{t−1}))/2. (26)

This formulation and the thresholds τ_t are equivalent to the AUROC used for MisD in Equation 23, except for the different label y_out that is employed (for implications and pitfalls of using this label see Figure 2). Due to the vast overlap of task formulations, FD-Shifts enables a seamless integration of OoD-detection methods for holistic comparison (see Section 3).

B.2.3 PREDICTIVE UNCERTAINTY ESTIMATION

Common evaluation in PUQ (e.g. Ovadia et al. (2019); Lakshminarayanan et al. (2017)) employs proper scoring rules such as the Negative Log-Likelihood,

NLL = −(1/N) Σ_{i=1}^N log P_m(y_cl = y_cl,i | x_i, w), (27)

and the Brier Score,

BrierScore = (1/N) Σ_{i=1}^N Σ_{c ∈ Y} (P_m(y_cl = c | x_i, w) − y_c,i)². (28)

Both directly assess the classifier output and are not applicable to other CSFs. Both metrics require ranking as well as calibration of confidence scores, which is interpreted as assessing the "general meaningfulness" of scores (see also Figure 6). However, if there is a clearly defined use case requiring either ranking or calibration of confidence scores, evaluation with proper scoring rules might dilute this focus and reduce the practical relevance of results.
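Both scoring rules are a few lines of numpy (a sketch with our own variable names; probs holds the model's predicted class probabilities):

```python
import numpy as np

def nll(probs, y_cl, eps=1e-12):
    """Negative log-likelihood of the true class (Equation 27)."""
    return float(-np.mean(np.log(probs[np.arange(len(y_cl)), y_cl] + eps)))

def brier_score(probs, y_cl):
    """Squared error between the probability vector and the one-hot label (Equation 28)."""
    onehot = np.eye(probs.shape[1])[y_cl]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))
```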

B.2.4 SELECTIVE CLASSIFICATION

The idea of equipping a classifier with the option to abstain from certain decisions based on a selection function has been around for a long time (Chow, 1957; El-Yaniv and Wiener, 2010). SC directly evaluates the process of detecting failures of a classifier via CSFs as described by Equation 1, and essentially tries to optimize the trade-off between achieving low risk and high coverage. Risk is defined as the error rate of the cases remaining after selection,

Risk(τ) = 1 − Precision_f(τ) = FP_f(τ) / (TP_f(τ) + FP_f(τ)) = Σ_i y_f,i · I(g(x_i) ≥ τ) / Σ_i I(g(x_i) ≥ τ), (29)

and coverage as the ratio of cases remaining after selection,

Coverage(τ) = (TP_f(τ) + FP_f(τ)) / (TP_f(τ) + FP_f(τ) + FN_f(τ) + TN_f(τ)) = (1/N) Σ_i I(g(x_i) ≥ τ). (30)

Typical evaluation reports and compares the resulting risk and coverage scores for single cut-offs τ. One can also calculate the Area under the Risk-Coverage Curve (AURC):

AURC = Σ_{t=1}^T (Coverage(τ_{t−1}) − Coverage(τ_t)) · (Risk(τ_t) + Risk(τ_{t−1}))/2, (31)

which uses the thresholds τ_t equivalently to the failure detection AUROC for MisD in Equation 23, i.e. as the unique values of the ascending ranking of confidence scores. However, AURC is currently not used as the primary metric or as part of an established evaluation protocol in SC. Instead, e-AURC has been proposed (Geifman et al., 2019), which subtracts the AURC of an optimal CSF on the same classifier:

e-AURC = AURC − Risk(τ_0) − (1 − Risk(τ_0)) · ln(1 − Risk(τ_0)) = AURC − (1 − Accuracy) − Accuracy · ln(Accuracy), (32)

where it is used that Risk(τ_0), the risk at full coverage, equals the rate of failures (1 − Prevalence_f), and that Prevalence_f equals the Accuracy of the base classifier m. Therefore, e-AURC effectively subtracts the classifier performance aspect from AURC for an exclusive focus on evaluating the ranking power of CSFs.
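As an illustrative sketch (not the paper's reference implementation, which additionally handles tied confidence scores and the curve endpoint; see Appendix F), AURC and e-AURC can be computed from a risk-coverage scan over the unique confidence values:

```python
import numpy as np

def aurc_eaurc(g, y_f):
    """Trapezoidal AURC over ascending unique thresholds, plus e-AURC."""
    taus = np.unique(g)                    # ascending unique confidence values
    coverages = [1.0]
    risks = [float(np.mean(y_f))]          # Risk(tau_0): full-coverage error rate
    for tau in taus[1:]:                   # tau_0 = min(g) keeps all cases
        sel = g >= tau
        if sel.sum() == 0:
            break
        coverages.append(sel.mean())
        risks.append(y_f[sel].mean())
    coverages, risks = np.array(coverages), np.array(risks)
    aurc = float(np.sum((coverages[:-1] - coverages[1:]) * (risks[1:] + risks[:-1]) / 2))
    acc = 1.0 - risks[0]
    # e-AURC subtracts the (approximate) AURC of an optimal CSF
    eaurc = aurc - (1 - acc) - acc * np.log(acc) if acc > 0 else aurc
    return aurc, float(eaurc)
```

On small samples the analytic optimal-AURC approximation can slightly over- or undershoot the discrete curve, so e-AURC values near zero should not be over-interpreted.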
This process essentially collapses AURC back to a pure ranking metric such as those employed in MisD. However, as laid out in Section 2, evaluating CSFs while being agnostic to the classifier comes with significant pitfalls that prevent comprehensive method assessment.

B.2.5 UNIFICATION OF TASK FORMULATION -TECHNICAL DETAILS

In this work, we propose to establish AURC as the primary metric for failure detection, as it fulfills requirements R1-R3 defined in Section 2. Comparing Equation 31 to Equation 24, there are deviations in technical details, such as the conservative interpolation of the AP score versus trapezoidal interpolation in AURC, or the fact that AURC is defined with a "reverse precision" in the form of Risk, requiring minimization of the score. The one crucial conceptual difference, however, is that in Equation 31 "weight" is put on the risk score (second factor of the product) for all steps of coverage (first factor). This is in contrast to the (1 − y_f,i) factor in Equation 24, which ignores steps of incorrect predictions in the sensitivity (first factor of the equation), so that incorrect predictions go unpenalized as long as the CSF separates them perfectly. We argue that penalizing such failures is an essential part of evaluating CSFs in the context of failure detection: to truly evaluate the ability of a CSF to "prevent silent failures of a classifier", we must not only check whether CSFs are able to filter existing failures, but also whether they might have caused new failures (or prevented further ones) by altering the training compared to a neutral classifier trained without the CSF, thereby creating their own test set. In Appendix F we provide a revised implementation of AURC fixing various shortcomings of prior open-source implementations. Figure 5 visualizes how OoD-D protocols need to be adapted in order to follow the unified task formulation.

Figure 5: Modification of the current OoD-Detection protocol for seamless integration into unified evaluation. Intuitively, data samples labelled as "outlier" can by definition be set to "failure" (or "filter"), because the trained classifier has no possibility to correctly predict unseen classes.
However, not all "inliers" correspond to correct classifier decisions. The current OoD-D protocol rewards CSFs for not detecting inlier misclassifications (see Figure 2). On the other hand, penalizing CSFs for not detecting these cases would dilute the desired evaluation focus on new-class shifts. Thus, we propose to remove inlier misclassifications from the evaluation when reporting a CSF's performance under new-class shift. While taking those cases out of the confidence ranking assessment, we argue that it is still required to fulfil R1, i.e. to assess the effect of CSFs on the classifier performance. We therefore propose to report AURC instead of AUROC_f even on new-class shifts. This follows R1 because the more inlier failures are taken out, the higher the ratio of OoD cases (failure cases by definition) in the test set, and thus the lower the accuracy.
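The proposed protocol modification is a small filtering step; a minimal sketch (our own function and variable names), whose resulting labels and confidences can then be passed to any AURC implementation:

```python
import numpy as np

def new_class_shift_eval_set(y_out, correct, confidence):
    """Drop inlier misclassifications, then treat every remaining OoD sample
    as a failure (y_f = 1) and every remaining (correct) inlier as a success."""
    keep = ~((y_out == 0) & ~correct)   # remove inlier misclassifications
    y_f = y_out[keep].astype(int)       # OoD cases are failures by definition
    return y_f, confidence[keep]
```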

C RELATION BETWEEN CONFIDENCE RANKING AND CONFIDENCE CALIBRATION

Calibration in the context of CSFs describes the requirement that predicted class scores match the empirical accuracy of the associated cases (e.g. cases with a confidence score of 0.8 are correct 80% of the time). This is useful for use cases like interpretable decision making (e.g. human assessment, or a need for interpretable cut-offs like "risk > 0.8") or for assessing the applicability of a trained classifier to entire data sets (Guo et al., 2017). Figure 6 shows how confidence calibration and confidence ranking are two independent requirements, i.e. either task can be solved without solving the other. A CSF can in theory have zero calibration error (perfect calibration) but provide no ranking of single cases, and vice versa. This raises the questions: Which requirement do we actually want a CSF to satisfy given a concrete use case? What can we do with calibrated scores in practice if there is no ranking? And what are concrete use cases where both tasks are required, i.e. where a "general meaningfulness" is measured, as by proper scoring rules like the Negative Log-Likelihood or the Brier Score? We argue that for all use cases where any sort of selection, cut-off, or separation between cases of lower and higher confidence is performed without relying on the interpretation of raw confidence values, only confidence ranking and no calibration is required: given a well-separating CSF (good scores according to ranking metrics), one can define a cut-off value on a validation set according to practical requirements such as "filter only 20% of cases" (coverage guarantee) or "filter such that a maximum of 5% errors remain" (risk guarantee), and select cases according to this cut-off in the subsequent application. Importantly, calibrated scores are not required at any stage of this process. Appendix D demonstrates a realistic example of how to do this under distribution shifts.
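The cut-off selection described above requires only the ranking; a minimal numpy sketch (function and variable names are ours) for deriving a coverage- or risk-based threshold from a validation set:

```python
import numpy as np

def threshold_for_coverage(conf_val, target_coverage=0.8):
    # keep the target fraction of highest-confidence cases
    return np.quantile(conf_val, 1.0 - target_coverage)

def threshold_for_risk(conf_val, errors_val, max_risk=0.05):
    """Smallest cut-off whose selective risk on the validation set stays
    below max_risk (errors_val: 1 = misclassified, 0 = correct)."""
    order = np.argsort(-conf_val)            # descending confidence
    cum_errors = np.cumsum(errors_val[order])
    ks = np.arange(1, len(conf_val) + 1)
    risks = cum_errors / ks                  # selective risk per prefix
    ok = np.where(risks <= max_risk)[0]
    if len(ok) == 0:
        return np.inf                        # no threshold satisfies the bound
    k = ok.max()                             # largest coverage meeting the bound
    return conf_val[order][k]
```

Note that this simple empirical rule provides no statistical guarantee by itself; SGR-style bounds (Appendix D) additionally account for validation-set sampling noise via a confidence parameter.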
Figure 6: This figure is a rough sketch of the results of a hypothetical classifier to illustrate that the two requirements of calibration and ranking are entirely independent of each other. Note that the confidence scores drawn for the combination of "perfect ranking" and "perfect calibration" (bottom right panels) are not reachable for MSR, because MSR has a lower bound at 1/n_classes.

D DEFINING A FINAL DECISION THRESHOLD ON CONFIDENCE SCORES

It is sometimes argued that, despite a well-ranked confidence scoring function, one still requires calibration of scores in order to set a meaningful decision threshold. In Figure 7 we explore selection with guaranteed risk (SGR) as an alternative decision-making tool under distribution shift, studying ConfidNet under image corruptions on CIFAR-100 as an example. Again comparing practical applicability to the calibration approach: when a specific application and a potential practitioner observe 0.11 ECE on the i.i.d. test set, what guarantees for reliable decision making can be derived from this score? The results of the SGR study show that, given a desired risk r*, the thresholds determined on the validation set provide reliable risk guarantees on the i.i.d. test set. Under image corruption shift, as expected, these guarantees start to break in the order of the confidence parameter δ (here in the statistical sense). While providing risk guarantees under unknown distribution shifts is certainly an open problem, SGR might help practitioners to develop a general feeling for risk excesses under potential shifts in their application domain and to understand how choices of δ might counteract certain levels of expected shift.

E.2 CLASSIFIER ARCHITECTURES

For the CNN training, we used a small convolutional network on SVHN (following Corbière et al. ( 2019)), VGG-13 (Simonyan and Zisserman, 2015) on CIFAR-10/100 (following DeVries and Taylor ( 2018)), and a ResNet-50 (He et al., 2015) for the three robustness benchmarks. For ViT training we use the ViT-B/16 architecture and scale up all images to 384x384 pixels following Dosovitskiy et al. (2020) .

E.3 HYPERPARAMETERS

On SVHN and CIFAR-10/100 we stuck as closely as possible to the configurations of the original publications (Corbière et al., 2019; DeVries and Taylor, 2018; Liu et al., 2019) while at the same time converging to identical classification model configurations. On the robustness benchmarks, we stuck to the proposed configurations of the respective baseline experiments (Santurkar et al., 2021; Koh et al., 2021). Thus, data augmentation was only used on CIFAR-10/100 (slight rotation, horizontal flip, and cutout following DeVries and Taylor (2018)) and on BREEDS-Entity-13 (horizontal flip and color jitter following Santurkar et al. (2021)). We used a cosine decay schedule without restarts following Loshchilov and Hutter (2017) for all methods and data sets, as we found this schedule to robustly generate high-quality results (see Appendix J for ablations). All classifiers are trained with SGD and momentum 0.9. ConfidNet is trained with Adam at learning rate 10^-4 and finetuned with learning rate 10^-6, following the original configuration. Table 4 shows training parameters that have been modified on each data set. Table 7 shows the training parameters used for finetuning the ViT.

E.4 MODEL SELECTION

Using the validation set, we selected whether or not to use dropout during training for all deterministic confidence scoring functions, as well as the best hyperparameter o per DeepGambler-based confidence scoring function. For the latter, we repeated all experiments using o = [2.2, 3, 6, 10]. As we found higher o to be beneficial for more complex data sets, we additionally ran o = [12, 15, 20] on CIFAR-100 and o = 15 on iWildCam and BREEDS-Entity-13 (additional runs on the last two data sets had to be reduced due to computing resource constraints). Table 5 shows the selected o per method and data set, and Table 6 shows whether or not dropout was used for training per method and data set. Table 7 shows whether or not dropout was used and which learning rate was selected when finetuning the ViT. To select the learning rate, we performed a single training run for each learning rate in [3·10^-2, 10^-2, 3·10^-3, 10^-3, 3·10^-4, 10^-4] and chose the one with the lowest AURC on the validation set. If dropout was used to finetune the ViT, the dropout rate was 0.1.

Table 7: Training parameters and dropout selection per data set and method for ViT. init-lr: initial learning rate of the cosine scheduler as selected. do: "0" denotes no dropout was used, "1" denotes dropout was used. wd: L2 weight decay. steps: number of batches trained on.
Dataset                    Method  init-lr  do  wd  batch size  steps
SVHN                       MAHA    3·10^-2  1   0   128         40000
                           MCD     10^-2    1
                           MSR     10^-2    0
                           PE      10^-2    0
CIFAR-10                   MAHA    10^-2    1   0   128         40000
                           MCD     10^-2    1
                           MSR     3·10^-4  0
                           PE      3·10^-4  0
CIFAR-100                  MAHA    10^-2    0   0   512         10000
                           MCD     10^-2    1
                           MSR     10^-2    1
                           PE      10^-2    1
CIFAR-100 (super-classes)  MAHA    3·10^-3  0   0   512         10000
                           MCD     10^-3    1
                           MSR     10^-3    1
                           PE      10^-3    1
iWildCam                   MAHA    3·10^-3  0   0   512         40000
                           MCD     3·10^-3  1
                           MSR     10^-3    0
                           PE      10^-3    0
Wilds-Camelyon-17          MAHA    10^-3    0   0   128         40000
                           MCD     3·10^-3  1
                           MSR     10^-3    0
                           PE      10^-3    0
BREEDS-Entity-13           MAHA    3·10^-3  0   0   128         40000
                           MCD     10^-2    1
                           MSR     10^-3    0
                           PE      10^-3    0

F REVISED AURC IMPLEMENTATION

Our implementation of AURC is based on two implementations we found, by Geifman et al. and by Corbière et al. (see the footnotes for links). Both have several shortcomings. In the implementation of Geifman et al., steps in the RC curve are not defined as new unique values in the sorted list of confidence scores; instead, each data sample, including those with equal confidence scores, is considered as leading to an individual classification decision with an associated risk-coverage pair. This effectively leads to a random interpolation of the RC curve between unique confidence scores, adding considerable noise to the result, especially at dense singular points such as confidence scores of 0 or 1. A shortcoming of the implementation by Corbière et al. is that there is no well-defined endpoint of the curve, meaning the risk simply drops to zero after the last RC-curve step (i.e. the coverage value corresponding to thresholding at the lowest confidence score), which effectively favours methods with higher "lowest coverages". This can make a substantial difference in practice, because methods often assign equal confidence scores to more than one case, especially at 0 and 1. Thus, analogously to scikit-learn's PRC implementation, we add a final point at zero coverage, with the risk remaining at the risk of the last RC-curve step.

G ADDITIONAL RESULTS

This section contains additional results for Accuracy (Table 8), AUROC_f (Table 9), as well as the Expected Calibration Error (Table 10). Table 11 shows rankings of confidence scoring functions based on AURC scores (i.e. based on Table 1).

G.1 EMPIRICAL CONFIRMATION OF THE IMPORTANCE OF REQUIREMENTS R1-R3

R1: Comprehensive evaluation requires a single standardized score that applies to arbitrary CSFs while accounting for their effects on the classifier. The arguments that lead to R1 are stated in Section 2 and visualized in Figure 4. In Table 12 we provide evidence for the importance of this argument: not following R1 and instead evaluating CSFs with pure ranking metrics like AUROC_f (as common in MisD) leads to substantially differing rankings of CSFs. This means that the effect of CSFs on the classifier is a crucial factor to consider in practice when ranking CSFs. The finding "AURC is able to resolve previous obscurities between classifier robustness and CSF performance" in Section 4.3 describes a concrete example of how neglecting this factor has led to misleading results in the literature. Equivalently, we argue that conflating the evaluation of failure detection with calibration, as long as the purpose of this dual-purpose assessment is not clearly stated, is a shortcoming of current practices. Empirically showcasing the effects of this conflation on the ranking of CSFs analogously to Table 12 is not possible, because common metrics in PUQ (proper scoring rules) exclusively operate on the predicted class scores and are not compatible with arbitrary CSFs. Instead, this restriction in itself acts as an exclusion criterion for these metrics with regard to the comprehensive evaluation of CSFs. The experiment shown in Table 12 underlines that fulfilling R1 is essential for a meaningful comparison of arbitrary CSFs. Thus, the general findings of our study described in Section 4.3 are all made possible by the proposed protocol following R1 and can be seen as further empirical confirmation of the importance of this requirement.

R2: Analogously to robustness benchmarks, progress in failure detection requires evaluating on a nuanced and diverse set of failure sources.
As described in Section 4.3, our study reveals that "none of the evaluated methods from the literature beats the simple Maximum Softmax Response baseline across a realistic range of failure sources" and that "prevalent OoD-D methods are only relevant in a narrow range of distribution shifts". These findings are a direct result of fulfilling R2 in our evaluation protocol and demonstrate how current protocols (without R2) can lead to the proposition of methods that are beneficial only under a narrow range of failure sources.

Listing: revised AURC implementation as described in Appendix F (risk-coverage points only at unique confidence values; well-defined final point at zero coverage):

```python
import numpy as np

def AURC(residuals, confidence):
    """residuals: 1 for a classification error, 0 for a correct prediction.
    confidence: the CSF output per sample."""
    residuals = np.asarray(residuals, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    n = len(residuals)
    idx_sorted = np.argsort(confidence)      # ascending confidence
    cov = n
    error_sum = residuals.sum()
    coverages = [cov / n]                    # start at full coverage
    risks = [error_sum / n]
    weights = []
    tmp_weight = 0
    for i in range(0, n - 1):
        cov -= 1
        error_sum -= residuals[idx_sorted[i]]
        selective_risk = error_sum / (n - 1 - i)
        tmp_weight += 1
        # record a point only once all samples sharing this confidence value
        # have been removed (no interpolation between tied scores)
        if confidence[idx_sorted[i]] != confidence[idx_sorted[i + 1]]:
            coverages.append(cov / n)
            risks.append(selective_risk)
            weights.append(tmp_weight / n)
            tmp_weight = 0
    # well-defined endpoint: final point at zero coverage with the risk of
    # the last RC-curve step (cf. scikit-learn's PRC implementation)
    tmp_weight += 1
    coverages.append(0.0)
    risks.append(risks[-1])
    weights.append(tmp_weight / n)
    return sum((risks[i] + risks[i + 1]) * 0.5 * w for i, w in enumerate(weights))
```


Table 13: Comparing rankings of AUROC_f based on the current OoD protocol ("original" → O) versus the proposed modification of dismissing inlier misclassifications ("proposed" → P). The fact that these two protocols lead to considerably different rankings of CSFs in many scenarios demonstrates the importance of the effect of inlier misclassifications on OoD-D evaluation and thus the importance of the proposed modification (visualized in Figure 5). The color heatmap is normalized per column and classifier (separately for CNN and ViT), with whiter colors depicting better scores. Scores are averaged over 5 runs per experiment on all data sets. Abbreviations: ncs: new-class shift (s for semantic, ns for non-semantic), c10/100: CIFAR-10/100, ti: TinyImagenet.

Confidence scores are scaled to [0, 1] for interpretability. Note how an "external" confidence score like entropy can be zero even for the predicted class (such decoupling would be inherently impossible with MSR). Looking at overconfident cases again reveals label ambiguities on the i.i.d. test set rather than actual failures of confidence scoring. Examples are a bird in front of a monitor ("monitor" is a training sub-class of "equipment"), a woman wearing a scarf on top of a poncho ("poncho" is a training sub-class of "garment"), or a vehicle behind a fence ("fence" is a sub-class of "man-made structure"). Further, this study reveals interesting shortcut learning, such as focusing on knitting patterns ("garment" vs. "accessory"), or a dog being predicted as unlikely to appear in front of water. A really hard categorization task is posed by the image showing hot air balloons, which are confused with volleyballs (a training sub-class of "equipment").

I TECHNICAL FORMULATION OF CLASSIFIER-OUTPUT BASED CONFIDENCE SCORING FUNCTIONS

The majority of confidence scoring functions explored in this study is based on one of the well-established measures for quantifying the uncertainty of a classifier's prediction. While strictly proper scoring rules such as NLL or the Brier Score only assess failure detection performance for one of those measures, the MSR, the protocol proposed in this study enables comparison across arbitrary measures. While PE is often associated with capturing a prediction's total uncertainty (Smith and Gal, 2018; Depeweg et al., 2018; Malinin and Gales, 2018), MCD-EE is said to capture the uncertainty inherent in the data (aleatoric uncertainty). Consequently, subtracting the latter from the former, i.e. the mutual information (MCD-MI) between the parameters θ and the categorical label y, is often associated with capturing only the model uncertainty (a form of epistemic uncertainty). The Maximum Logit Score (MLS), proposed by Vaze et al. (2022), uses the magnitude of the logits, which the authors argue carries information relevant for OoD-D, where greater logits should correspond to more certain predictions. The considered scores are defined as follows, where f denotes the function producing the logits used to create the probability vector P(y|x*, D) = softmax(f(x*, D)), and θ denotes the model parameters drawn for Monte Carlo Dropout (MCD) with P(y|x*, θ) = softmax(f(x*, θ)):

Maximum Softmax Response (MSR): max_c P(y = c|x*; D), where c runs over the classes.

Predictive Entropy (PE): H[P(y|x*; D)] = −Σ_{c=1}^C P(y = c|x*; D) · ln P(y = c|x*; D).

Maximum Logit Score (MLS): max_c f(x*, D).

Maximum Softmax Response over Monte Carlo Dropout (MCD-MSR): max_c E_{p(θ|D)}[P(y = c|x*; θ)].

Predictive Entropy over Monte Carlo Dropout (MCD-PE): H[E_{p(θ|D)}[P(y|x*; θ)]].

Expected Entropy over Monte Carlo Dropout (MCD-EE): E_{p(θ|D)}[H[P(y|x*; θ)]].

Mutual Information over Monte Carlo Dropout (MCD-MI): MCD-PE − MCD-EE.
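In practice, the MCD variants are computed from a stack of softmax outputs over multiple dropout-enabled forward passes; a minimal numpy sketch for a single input (function name is ours):

```python
import numpy as np

def csfs_from_mcd(softmax_samples):
    """softmax_samples: [n_mcd, n_classes] softmax outputs, one row per MC-Dropout pass."""
    eps = 1e-12                                   # numerical guard for log(0)
    mean_probs = softmax_samples.mean(axis=0)     # expectation over dropout samples
    mcd_msr = float(mean_probs.max())
    mcd_pe = float(-(mean_probs * np.log(mean_probs + eps)).sum())
    mcd_ee = float((-(softmax_samples * np.log(softmax_samples + eps)).sum(axis=1)).mean())
    mcd_mi = mcd_pe - mcd_ee                      # mutual information
    return {"MCD-MSR": mcd_msr, "MCD-PE": mcd_pe, "MCD-EE": mcd_ee, "MCD-MI": mcd_mi}
```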

J REPRODUCIBILITY AND DEVIATIONS FROM ORIGINAL BASELINE CONFIGURATIONS

As described in Section E.3, a meaningful evaluation of confidence scoring methods for failure detection requires deviating from original configurations of baselines in order to assimilate the underlying classifier across compared methods. Since re-implementing and re-configuring baseline methods is prone to inducing biases in comparative studies and respective conclusions (Lipton and Steinhardt, 2018) , in this section, we want to explain all deviations in detail and provide evidence that baselines perform on a par with or superior to the original versions under our configuration.

J.1 DEEPGAMBLERS

In Table 14, we compare our configuration against the original configuration using the results reported on CIFAR-10. We deviate from the original configuration by using additional data augmentation in the form of cutout, decaying the learning rate via a cosine scheduler, training a VGG-13 model instead of VGG-16, and not using dropout for training (as selected by our model selection). The hyperparameter o has been selected as 2.2 on this data set, which is identical to the original configuration. The comparison shows that our results are substantially better than those of the original configuration, mainly due to a better classifier accuracy.

For MLS, the only substantial deviation from the original protocol is the replacement of the proposed MultiStep learning rate scheduler with our cosine scheduler (the latter is more aggressive, with overall higher learning rates). As Figure 10 shows, this deviation has a negative effect on the OoD-detection performance only when training on CIFAR-10 and testing on TinyImagenet, where our configuration achieves 96.5 AUROC instead of the 97.0 AUROC reported in the original paper. In all other settings, the cosine scheduler results in superior performance. Reported reference values: MLS (Vaze et al., 2022): 96.0; MLS (improved baseline) (Vaze et al., 2022): 97.1.

J.5 MAHALANOBIS SCORE

Fort et al. (2021) compare different vision transformer architectures; due to computational constraints, we only consider the ViT-B/16 architecture. Since the original publication does not report how results were obtained, i.e. whether multiple runs were performed and how final results were selected from these runs, we report both our best run and an average over five runs, selected by best accuracy. Table 16 compares our implementation to theirs as measured by AUROC_f when trained on CIFAR-10/CIFAR-100, using CIFAR-100/CIFAR-10 as the out-of-distribution data set, respectively.
We see that our average is slightly below the original results, but results are highly volatile, with large standard deviations across runs. Our best-run results are on par with the original results. Consistent with the original results, MAHA scores are consistently higher than MSR scores.
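To make the score discussed above concrete, the following is a minimal numpy sketch of the Mahalanobis confidence score in the spirit of Lee et al. (2018): class-conditional Gaussians with a shared (tied) covariance are fitted on training features, and the confidence of a test feature is the negative Mahalanobis distance to the closest class mean. Function names and the regularization constant are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_gaussians(feats, labels):
    """Fit class-conditional Gaussians with a shared (tied) covariance."""
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centered = feats - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(feats)
    # small ridge term for numerical stability of the inverse
    return means, np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def mahalanobis_confidence(x, means, prec):
    """Confidence = negative Mahalanobis distance to the closest class mean."""
    d = x[None, :] - means                       # (n_classes, dim)
    dists = np.einsum("cd,de,ce->c", d, prec, d) # quadratic form per class
    return -dists.min()
```

In-distribution features close to a class mean receive high (near-zero) confidence, while far-away features receive strongly negative scores, which is the ranking behavior evaluated in Table 16.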



Footnotes:
1. The "f" denotes "failure" (see Appendix B for details), not to be confused with the classification AUROC.
2. Code is at: https://github.com/IML-DKFZ/fd-shifts
3. We aimed to further include OpenHybrid (Zhang et al., 2020), but were not able to reproduce their results despite running the original code.
4. Notably, the loss attenuation of DG seems to have positive effects on the softmax in i.i.d. settings, leading to DG-MCD-MSR being the top-performing i.i.d. method with the CNN classifier on 3 out of 6 data sets.
5. Our CNN results are not representative of state-of-the-art CNN performance, since we employ small models such as VGG-13 or ResNet-50; see Appendix E.2.
6. The term subpopulation shift has a different meaning in Koh et al. (2021), where it describes variations of category frequencies as opposed to unseen variations.
7. We used the 32x32 resized data set from https://github.com/ShiyuLiang/odin-pytorch
8. https://github.com/geifmany/uncertainty_ICLR/blob/master/utils/uncertainty_tools.py
9. https://github.com/valeoai/ConfidNet/blob/master/confidnet/utils/metrics.py



Figure 3: Overview of Evaluation Metrics in the Context of Failure Detection. The three performance aspects related to confidence scoring are classification performance, confidence ranking, and confidence calibration. As argued in this work, the former two are required for evaluating CSFs in the context of failure detection, while confidence calibration usually constitutes a separate task and use case (see Appendix C). Appendix B provides a definition of all metrics as well as their relations and showcases how we naturally converged to considering AURC as the primary metric for evaluation in failure detection.
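As a concrete reference for the risk-coverage metric named in the caption above, here is a minimal sketch of how AURC can be computed from confidence scores and correctness labels. The helper name is ours; this is an illustrative implementation, not the paper's code.

```python
import numpy as np

def aurc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).

    Predictions are sorted by confidence (descending); at each coverage
    level k/n, the selective risk is the error rate among the k covered
    samples. AURC averages these risks over all coverage levels.
    """
    order = np.argsort(-np.asarray(confidences, float))
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    n = len(errors)
    risks = np.cumsum(errors) / np.arange(1, n + 1)
    return risks.mean()
```

A CSF that ranks all failures below all correct predictions pushes errors to the high-coverage end of the curve and thus attains a low AURC, while a poorly ranking CSF incurs errors already at small coverage.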

Figure 4: Visualizing the pitfall of the MisD protocol for failure detection: This figure demonstrates the need behind Requirement 1 ("Comprehensive evaluation requires a single standardized score that applies to arbitrary CSFs while accounting for their effects on the classifier"). Many CSFs affect the classifier training (e.g. CSFs based on dropout or additional loss functions), resulting in a distinctive classifier per CSF. Evaluating the hypothetical task of binary image classification between apes and bears (see "task 1: Failure prevention" in Figure 1), joint training with CSF1 decreases the accuracy of the classifier (compared to training without any CSF), CSF2 leads to altered predictions with equivalent accuracy, and CSF3 leaves the predictions of the classifier unchanged. Crucially, the results of this evaluation ("failure labels") are used as reference labels for the CSF evaluation (see "task 2: Failure detection" in Figure 1). For simplicity, we assume that all three CSFs output identical confidence scores. However, ranking metrics such as AUROC_f (as applied in MisD) yield vastly different results across CSFs because the evaluation is based on different sets of reference labels. In this case, CSF1 shows the best failure detection performance, but it is also the one causing additional failures in the first place. This shows that isolated evaluation of failure detection (as done in MisD) is flawed, because CSF effects on the classifier training are not accounted for. To serve the overarching purpose of preventing silent failures of a classifier ("task 1 + task 2" in Figure 1), one would need to select a CSF based on both the Accuracy ranking and the AUROC_f ranking. But what is a natural way to weigh the two rankings? The metric proposed in Section 3 directly reflects the purpose of preventing silent failures and can be interpreted as providing a natural weighting between the Accuracy and AUROC_f rankings.
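The pitfall described in the caption above can be reproduced with a few lines of code. The sketch below (our own illustrative numbers, loosely mirroring the figure's scenario) shows two CSFs that output identical confidence scores but, because they alter the classifier differently, are evaluated against different failure labels and thus receive very different AUROC_f values.

```python
import numpy as np

def auroc_f(confidences, failure):
    """AUROC_f via the Mann-Whitney statistic: the probability that a
    random failure receives lower confidence than a random correct case."""
    c = np.asarray(confidences, float)
    f = np.asarray(failure, bool)
    fail, ok = c[f], c[~f]
    greater = (ok[None, :] > fail[:, None]).sum()
    ties = (ok[None, :] == fail[:, None]).sum()
    return (greater + 0.5 * ties) / (len(fail) * len(ok))

# identical confidence outputs, but different failure labels per CSF
conf = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
fails_unchanged = np.array([0, 0, 0, 1, 1], bool)  # classifier left unchanged
fails_altered = np.array([0, 1, 0, 1, 0], bool)    # classifier altered by the CSF
```

Here `auroc_f(conf, fails_unchanged)` evaluates to 1.0 while `auroc_f(conf, fails_altered)` evaluates to 0.5, although the confidence scores are identical, illustrating why isolated ranking evaluation cannot compare CSFs that affect classifier training.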

While Tomani et al. (2021) studied calibration under a variety of distribution shifts, to the best of our knowledge recent work in the context of neural networks only considers a single CSF: the maximum softmax response (MSR) of the classifier. Applying our holistic perspective, the question arises: If other CSFs beat MSR in confidence ranking tasks, is it not likely that they could also provide better-calibrated scores? To this end, we extend our empirical study and analyze the calibration performance of all studied CSFs under distribution shifts (see Appendix G).
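For reference, calibration in this study is measured via the Expected Calibration Error. A minimal sketch with equal-width confidence bins (our own illustrative implementation, not the evaluation code used in the study) looks as follows:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: the coverage-weighted absolute gap
    between mean confidence and accuracy per equal-width confidence bin."""
    c = np.asarray(confidences, float)
    acc = np.asarray(correct, float)
    # map each confidence to a bin index in [0, n_bins - 1]
    bins = np.minimum((c * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(acc[mask].mean() - c[mask].mean())
    return total
```

A perfectly calibrated score (bin accuracy equals bin confidence) yields an ECE of 0; overconfident scores under distribution shift inflate the gap in high-confidence bins.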

Figure 9: Qualitative study of failure cases in failure detection. a) The three most overconfident and underconfident test cases of MSR for different distribution shifts on CIFAR-100. b) The three most overconfident and underconfident test cases of Predictive Entropy obtained by Monte Carlo Dropout (MCD-PE) on BREEDS-Entity-13.

Mutual Information obtained from Monte Carlo Dropout (MCD-MI):

$$\mathcal{I}(y;\theta \mid x^{*},\mathcal{D}) = \mathcal{H}\!\left[\mathbb{E}_{p(\theta\mid\mathcal{D})}\left[P(y\mid x^{*};\theta)\right]\right] - \mathbb{E}_{p(\theta\mid\mathcal{D})}\!\left[\mathcal{H}\!\left[P(y\mid x^{*};\theta)\right]\right] \tag{40}$$

Maximum Logit Score obtained from Monte Carlo Dropout (MCD-MLS):

$$\max_{c}\,\mathbb{E}_{p(\theta\mid\mathcal{D})}\!\left[f(x^{*},\theta)\right]_{c},$$
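The quantities above can be sketched directly from the stacked softmax outputs of T stochastic forward passes. The following minimal numpy implementation (function names are ours) computes predictive entropy and the mutual information of Eq. (40):

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -np.sum(p * np.log(np.where(p > 0, p, 1.0)), axis=axis)

def mcd_scores(probs):
    """probs: array of shape (T, C), softmax outputs of T dropout passes.

    Returns predictive entropy (PE) of the mean prediction and the
    mutual information MI = H[mean prediction] - mean[H[predictions]],
    i.e. total uncertainty minus expected aleatoric uncertainty.
    """
    mean_p = probs.mean(axis=0)
    pe = entropy(mean_p)
    mi = pe - entropy(probs, axis=-1).mean()
    return pe, mi
```

If all dropout samples agree, the mutual information vanishes (only aleatoric uncertainty remains); maximal disagreement between confident samples yields the maximal MI. The MCD-MLS score would analogously average the raw logits over passes before taking the class-wise maximum.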

Figure 10: Ablation study comparing the originally proposed MultiStep scheduler against the cosine scheduler for the confidence scores from DeVries et al. trained on CIFAR-10. One dot denotes one run, the thin line denotes the mean across runs, and the thick line denotes the median across runs.

Figure 11: Ablation study for the three-stage model selection proposed in the original ConfidNet configuration. One dot denotes one run evaluated on the CIFAR-10 i.i.d. test set; the thin line denotes the mean across runs and the thick line denotes the median across runs. MS: model selection; latest: select the latest epoch at each training stage.



Rounding errors occurring during the softmax operation negatively affect the ranking performance of MSR. The table shows, for different floating point precisions, the rate at which rounding errors occur, the affected ranking metrics AURC and AUROC_f, as well as the i.i.d. test-set Accuracy of MSR. The default setting of 32-bit precision leads to substantial ranking performance drops on the ViT classifier. Note that not all rounding errors necessarily decrease ranking performance, especially in high-accuracy settings where failure cases are rare.
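The effect described above is easy to reproduce: at 32-bit precision, two samples with clearly different logit margins can both round to an MSR of exactly 1.0, so their ranking information is lost, while 64-bit precision keeps the scores distinct. The logit values below are illustrative, not taken from the study.

```python
import numpy as np

def msr(logits, dtype):
    """Maximum softmax response computed at a given floating point precision."""
    z = np.asarray(logits, dtype=dtype)
    e = np.exp(z - z.max())  # standard max-shifted softmax for stability
    return (e / e.sum()).max()

a = [20.0, 0.0, 0.0]  # smaller logit margin
b = [25.0, 0.0, 0.0]  # larger logit margin
# float32: both MSR values collapse to exactly 1.0 -> ranking tie
# float64: the two MSR values remain distinguishable
```

Since the competing class probabilities (~2e-9 and ~1e-11) are far below the float32 machine epsilon (~1.2e-7), the normalizing sum rounds to 1.0 and both samples become indistinguishable to AURC and AUROC_f.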

A FORMULATION OF FAILURE SOURCES

In general, we distinguish three sources of error that can cause image classification systems to output false predictions (see Figure 2 for exemplary visualizations).

Inlier Misclassifications: This type of failure source is defined by occurring on cases that are sampled i.i.d. with respect to the training distribution and is commonly addressed by work in MisD and SC. Possible reasons for occurrence are missing evidence in the image related to the ground-truth category, poor model fitting, or data variations that are not yet considered distribution shifts in a specific use case.

Covariate Shift: Images subject to a covariate shift (Quionero-Candela et al., 2009) can still be assigned to one of the training categories. Various examples are investigated by recent robustness benchmarks: image corruption shift (Hendrycks and Dietterich, 2019), domain shift such as medical images from different scanners and clinical sites or satellite images from different seasons (Koh et al., 2021), or subpopulation shift, where unseen semantic variations of the training categories occur, such as unseen breeds of an animal category (Santurkar et al., 2021). We summarize domain shift and subpopulation shift under the term sub-class shift. To the best of our knowledge, these relevant sub-class shifts have not been studied in the context of confidence scoring before.

New-class Shift: Images subject to a new-class shift cannot be assigned to any of the training categories. This type of failure is commonly addressed by work in OoD detection. We follow Ahmed and Courville (2020) in further distinguishing semantic and non-semantic new-class shift (in the former, only the foreground object is subject to semantic variations).



Selected hyperparameter o, based on the validation set, for all confidence scoring functions trained with the DeepGamblers objective.
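For context on the role of o in the caption above, the DeepGamblers objective of Liu et al. (2019) trains the classifier with a gambler's loss over m class outputs plus one abstention output, where the abstention mass is discounted by the payoff o. The following is a minimal per-sample sketch (our own illustration, not the training code of the study):

```python
import numpy as np

def deep_gamblers_loss(probs, target, o):
    """Gambler's loss: -log(p_target + p_abstain / o).

    probs: softmax over m class outputs plus one abstention output
    (last entry). A larger o makes abstaining less rewarding, pushing
    the model to bet its probability mass on an actual class.
    """
    p_target = probs[target]
    p_abstain = probs[-1]
    return -np.log(p_target + p_abstain / o)
```

Because the abstention probability only enters divided by o, small values of o let the model hedge cheaply (lower loss for the same abstention mass), which is why o is tuned per data set on the validation split.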

Whether or not dropout training was selected based on the validation set. This selection is only done for deterministic confidence scoring methods (no MCD). "1" denotes training with dropout and "0" denotes training without dropout.

Classification performance measured as Accuracy [%] (higher is better). The color heatmap is normalized per column and classifier (separately for CNN and ViT), while whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. Accuracy scores are averaged over 5 runs per experiment on all data sets except 10 runs on CAMELYON-17-Wilds (as recommended by the authors due to high volatility in results) and 2 runs on BREEDS.

Confidence Ranking results measured as AUROC f [%] (higher is better). The color heatmap is normalized per column and classifier (separately for CNN and ViT), while whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. AUROC f scores are averaged over 5 runs per experiment on all data sets except 10 runs on CAMELYON-17-Wilds (as recommended by the authors due to high volatility in results) and 2 runs on BREEDS.

Confidence Calibration Results measured as Expected Calibration Error (lower is better). To our knowledge, this table shows the most comprehensive study of confidence calibration to date, featuring CSFs beyond MSR and including a broad range of distribution shifts. Importantly, only the CSFs with a natural output range beyond [0,1] have undergone Platt scaling, so the data does not provide a fair comparison of calibration errors between CSFs. Instead, the purpose is to demonstrate that CSFs beyond MSR can be calibrated and could thus be considered appropriate competitors to MSR in future research. The color heatmap is normalized per column and classifier (separately for CNN and ViT), while whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. ECE scores are averaged over 5 runs per experiment on all data sets except 10 runs on CAMELYON-17-Wilds (as recommended by the authors due to high volatility in results) and 2 runs on BREEDS. Abbreviations: ncs: new-class shift (s for semantic, ns for non-semantic), iid: independent and identically distributed, sub: sub-class shift, cor: image corruptions, c10/100: CIFAR-10/100, ti: TinyImagenet
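The Platt scaling mentioned in the caption above maps an unbounded confidence score s to a calibrated probability sigmoid(a*s + b). A minimal sketch fitting a and b by gradient descent on the negative log-likelihood is shown below; the function names, learning rate, and step count are our own illustrative choices, not the fitting procedure used for the table.

```python
import numpy as np

def platt_scale(scores, correct, lr=0.1, steps=2000):
    """Fit sigmoid(a * s + b) to correctness labels via gradient descent
    on the binary negative log-likelihood."""
    s = np.asarray(scores, float)
    y = np.asarray(correct, float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                  # dNLL/dlogit for the logistic loss
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

def apply_platt(scores, a, b):
    """Map raw scores to calibrated probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores, float) + b)))
```

The fit is monotone in the raw score, so Platt scaling changes calibration (and thus ECE) without affecting ranking metrics such as AURC or AUROC_f.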

Failure Detection Results shown as Rankings according to AURC. While providing results in the form of rankings might facilitate parsing the information quickly, it should be noted that rankings hide important information in method assessment. Specifically, the ability to distinguish between "similar" and "substantially different" performance is lost. The color heatmap is normalized per column and classifier (separately for CNN and ViT), while whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. Scores are averaged over 5 runs per experiment on all data sets except 10 runs on CAMELYON-17-Wilds (as recommended by the authors due to high volatility in results) and 2 runs on BREEDS. Abbreviations: ncs: new-class shift (s for semantic, ns for non-semantic), iid: independent and identically distributed, sub: sub-class shift, cor: image corruptions, c10/100: CIFAR-10/100, ti: TinyImagenet

Comparing rankings of AURC (α) versus AUROC_f (β). The fact that these two metrics result in substantially different rankings of CSFs demonstrates the important effect of CSFs on classifier performance and thus the relevance of the pitfall visualized in Figure 4. In Section 2, we argue that the effects of CSFs on the classifier need to be taken into account for a fair comparison (R1), i.e. the ability to prevent silent failures should be assessed by AURC instead of AUROC_f. The color heatmap is normalized per column and classifier (separately for CNN and ViT), while whiter colors depict better scores. "cor" is the average over 5 intensity levels of image corruption shifts. Scores are averaged over 5 runs per experiment on all data sets except 10 runs on CAMELYON-17-Wilds (as recommended by the authors due to high volatility in results) and 2 runs on BREEDS. Abbreviations: ncs: new-class shift (s for semantic, ns for non-semantic), iid: independent and identically distributed, sub: sub-class shift, cor: image corruptions, c10/100: CIFAR-10/100, ti: TinyImagenet.

Comparing our DeepGamblers configuration to the original configuration (as proposed in Liu et al. (2019)) on CIFAR-10. The table shows risk scores at predefined coverages. Lower is better.

As part of assimilating the training configuration across compared methods, we increased the number of training epochs for DeVries et al. from 200 to 250. Since performance did not benefit from this increase in training time, we see no fairness issue in DeepGamblers and ConfidNet requiring additional training stages and thus overall more training exposure (see Table 4). We did not include additional confidence measures such as MSR, PE, or any MCD variations based on the DeVries et al. learning objective, as we found no interesting synergies (this is in contrast to DeepGamblers and ConfidNet).

Comparing our classifier configuration to the configuration proposed by Vaze et al. (2022) as measured by AUROC f on the SVHN dataset in the open-set setting. Baseline MLS and Baseline MSR+ use the aforementioned improvements to the training strategy.

Comparing our ViT-B/16 implementation to the one by Fort et al. (2021) as measured by AUROC_f when fine-tuned on CIFAR-10/CIFAR-100 using CIFAR-100/CIFAR-10 as the out-of-distribution dataset, respectively. Table columns: ID → OoD dataset; MAHA AUROC_f × 100 (↑); MSR AUROC_f × 100 (↑).

ACKNOWLEDGEMENTS

This work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science. We thank David Zimmerer and Fabian Isensee for insightful discussions and feedback.


For the experiments on SGR, we use the algorithm proposed by Geifman and El-Yaniv (2017), which computes the threshold τ that maximizes coverage while satisfying the condition $P_{S_i}(R > r^{*}) < \delta$, where δ is a defined confidence parameter, r* is the desired risk, R is the risk, and S_i are i.i.d. samples.

Table 3 shows the number of images per utilized data set and associated splits. We used official splits for all data sets, except that we followed Liu et al. (2019) in splitting a small amount (in our case 1000 images) of the i.i.d. training data.

… CSFs with a lack of generalization ability, which often directly contradicts their purpose statement: to enable the safe application of classifiers by detecting silent failures.

R3: If there is a defined classifier whose incorrect predictions are to be detected, its respective failure information should be used to assess CSFs w.r.t. the stated purpose instead of a surrogate task such as distribution shift detection. The fact that evaluation w.r.t. failure labels, as opposed to OoD labels, enables testing methods on a broad and realistic range of distribution shifts is the most important argument for R3. This importance is empirically confirmed by the finding that "prevalent OoD-D methods are only relevant in a narrow range of distribution shifts" (see Section 4.3), which reveals a clear contradiction between many of these methods' purpose statements and their actual functionality. Besides the general importance of following R3, we would like to empirically highlight the importance of dismissing inlier misclassifications from CSF evaluation on new-class shifts (the proposed protocol for new-class shifts is visualized in Figure 5). We argue that when assessing confidence ranking w.r.t. failure detection on new-class shifts, penalizing the detection of failures from the i.i.d. set is an undesired behavior of the current OoD-D protocol, and we subsequently propose to dismiss these cases from the ranking (they still affect the classifier performance considered by AURC, thus fulfilling R1).
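The SGR threshold search described above can be sketched as follows. This is our own illustrative reading of Geifman and El-Yaniv (2017): a binomial tail inversion yields a high-probability upper bound on the true risk from k observed errors in m covered samples, and a binary search over the coverage (as in the original algorithm, assuming the feasible region is well-behaved) picks the largest coverage whose bound stays below the desired risk r*.

```python
import math
import numpy as np

def binom_cdf(k, m, p):
    """P(X <= k) for X ~ Binomial(m, p), summed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= m else 0.0
    total = 0.0
    for j in range(k + 1):
        log_term = (math.lgamma(m + 1) - math.lgamma(j + 1)
                    - math.lgamma(m - j + 1)
                    + j * math.log(p) + (m - j) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)

def risk_bound(k, m, delta, tol=1e-9):
    """Binomial tail inversion: smallest p with P(X <= k) <= delta.

    With probability >= 1 - delta, the true risk is below this bound."""
    if k >= m:
        return 1.0
    lo, hi = k / m, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, m, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def sgr_threshold(confidences, correct, r_star, delta):
    """Binary-search the largest coverage whose risk bound stays below r*."""
    order = np.argsort(-np.asarray(confidences, float))
    conf = np.asarray(confidences, float)[order]
    err = 1 - np.asarray(correct, int)[order]
    cum_err = np.cumsum(err)
    lo, hi, best = 1, len(conf), None
    while lo <= hi:
        mid = (lo + hi) // 2
        if risk_bound(int(cum_err[mid - 1]), mid, delta) <= r_star:
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    if best is None:
        return None  # no coverage satisfies the guarantee
    return conf[best - 1], best / len(conf)
```

The returned threshold τ guarantees (up to the bound's confidence level δ) that the selective risk of the covered predictions does not exceed r*, at the price of rejecting the remaining, lower-confidence predictions.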
We categorize two types of errors for confidence scores: overconfidence, i.e. assigning high confidence to wrong predictions, and underconfidence, i.e. assigning low confidence to correct predictions. Figure 9a) shows the three most overconfident and underconfident test cases of the MSR confidence score on CIFAR-100 per distribution shift (since test cases subject to new-class shift are incorrect by default, there are no underconfident predictions). Looking at overconfident predictions, on the i.i.d. test set we see some expected confusions between semantically close categories. Images under corruption shift are subjectively hard to classify even for the human eye. The sub-class shift reveals an ambiguity in the labels rather than a failure of the confidence score: images contain objects of two valid categories, fish and people, indicating that the superclasses in CIFAR-100 have not been specifically designed for sub-class shift, e.g. by excluding such cases (Figure 9b looks at failure cases on BREEDS-Entity-13, which has been curated for meaningful sub-class shift). On the new-class shifts, we observe the expected behavior of predicting the training categories most semantically similar to the unknown categories. Looking at underconfident cases reveals some interesting shortcut learning (Geirhos et al., 2020): the fact that the two raccoon images received low confidence despite showing their faces indicates that the model relies on spurious features, perhaps related to the background. Similarly, the fox image in the corruption tests may have received low confidence due to the artificial frame. Further, the colored pixel noise in the hamster image as a source of confusion hints at the model relying on local texture patterns rather than global semantic features. While the previous qualitative study only showcases confidence based on MSR, the proposed unified protocol makes it possible to study arbitrary confidence scoring functions in detail and across different distribution shifts.
Thus, in Figure 9b) we study failure cases of the Predictive Entropy obtained from Monte Carlo Dropout (MCD-PE) on the BREEDS-Entity-13 data set. Entropy scores are reversed and

