A CALL TO REFLECT ON EVALUATION PRACTICES FOR FAILURE DETECTION IN IMAGE CLASSIFICATION

Abstract

Reliable application of machine-learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-world application, they currently constitute largely separate research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study that, for the first time, enables benchmarking confidence scoring functions w.r.t. all relevant methods and failure sources. The fact that a simple softmax response baseline emerges as the overall best-performing method underlines the drastic shortcomings of current evaluation amid the abundance of published research on confidence scoring. Code and trained models are available at https://github.com/IML-DKFZ/fd-shifts.

1. INTRODUCTION

"Neural network-based classifiers may silently fail when the test data distribution differs from the training data. For critical tasks such as medical diagnosis or autonomous driving, it is thus essential to detect incorrect predictions based on an indication of whether the classifier is likely to fail". Such or similar mission statements prelude numerous publications in the fields of misclassification detection (MisD) (Corbière et al., 2019; Hendrycks and Gimpel, 2017; Malinin and Gales, 2018) , Out-of-Distribution detection (OoD-D) (Fort et al., 2021; Winkens et al., 2020; Lee et al., 2018; Hendrycks and Gimpel, 2017; DeVries and Taylor, 2018; Liang et al., 2018) , selective classification (SC) (Liu et al., 2019; Geifman and El-Yaniv, 2019; 2017) , and predictive uncertainty quantification (PUQ) (Ovadia et al., 2019; Kendall and Gal, 2017) , hinting at the fact that all these approaches aim towards the same eventual goal: Enabling safe deployment of classification systems by means of failure detection, i.e. the detection or filtering of erroneous predictions based on ranking of associated confidence scores. In this context, any function whose continuous output aims to separate a classifier's failures from correct predictions can be interpreted as a confidence scoring function (CSF) and represents a valid approach to the stated goal. This holistic perspective on failure detection reveals extensive shortcomings in current evaluation protocols, which constitute major bottlenecks in progress toward the goal of making classifiers suitable for application in real-world scenarios. Our work is an appeal to corresponding communities to reflect on current practices and provides a technical deduction of a unified evaluation protocol, a list of empirical insights based on a large-scale study, as well as hands-on recommendations for researchers to catalyze progress in the field. 

2. PITFALLS OF CURRENT EVALUATION PRACTICES

Figure 1 gives an overview of the current state of failure detection research and its relationship to the preceding failure prevention task, which is measured by classifier robustness. This perspective reveals three main pitfalls, from which we derive three requirements R1-R3 for a comprehensive and realistic evaluation of failure detection:

Pitfall 1: Heterogeneous and inconsistent task definitions. To achieve a meaningful evaluation, all relevant solutions toward the stated goal must be part of the competition. In research on failure detection, four separate fields currently exist, each evaluating proposed methods with its own metrics and baselines. Incomplete competition is first and foremost an issue of historically evolved delimitations between research fields, which go so far that employed metrics are by design restricted to certain methods.

MisD: Evaluation in MisD (see Section B.2.1 for a formal task definition) exclusively measures discrimination of a classifier's success versus failure cases by means of ranking metrics such as AUROC_f (Hendrycks and Gimpel, 2017; Jiang et al., 2018; Corbière et al., 2019; Bernhardt et al., 2022). This protocol excludes a substantial part of relevant CSFs from comparison, because any CSF that affects the underlying classifier (e.g. by introducing dropout or alternative loss functions) alters the set of classifier failures, i.e. the ground-truth labels, and thus creates its own individual test set (for a visualization of this pitfall, see Figure 4). As an example, a CSF that negatively affects the accuracy of a classifier might add easy-to-detect failures to its test set and benefit in the form of high AUROC_f scores. As depicted in Figure 1, we argue that the task of detecting failures is not an end in itself; rather, preventing and detecting failures are two sides of the same coin when striving to avoid silent classification failures. Thus, CSFs should be evaluated as part of a symbiotic system with the associated classifier. While additionally reporting classifier accuracy associated with each CSF



The "f" in AUROC_f denotes "failure" (see Appendix B for details), not to be confused with the classification AUROC.
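For concreteness, the sketch below (NumPy/scikit-learn, with hypothetical array inputs) shows how AUROC_f is typically computed and why its ground truth is tied to the evaluated classifier; it is an illustration of the metric as described above, not the benchmark implementation.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auroc_f(confidence: np.ndarray, predictions: np.ndarray,
                labels: np.ndarray) -> float:
        # Binary ground truth of the ranking task: 1 = correct prediction,
        # 0 = failure of the classifier under evaluation.
        correct = (predictions == labels).astype(int)
        # Pitfall: `correct` depends on the classifier itself. A CSF that
        # modifies the classifier (e.g. via dropout or an alternative loss)
        # changes this ground truth and is thus scored on its own test set.
        return roc_auc_score(correct, confidence)

Since the positive class is defined by each classifier's own mistakes, two CSFs that induce different classifiers are never ranked on a common test set, which motivates evaluating classifier and CSF as one symbiotic system.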



Figure 1: Holistic perspective on failure detection. Detecting failures should be seen in the context of the overarching goal of preventing silent failures of a classifier, which comprises two tasks: preventing failures in the first place, as measured by the "robustness" of a classifier (Task 1), and detecting the non-prevented failures by means of CSFs (Task 2, the focus of this work). For failure prevention across distribution shifts, a consistent task formulation exists (featuring accuracy as the primary evaluation metric), and various benchmarks have been released covering a large variety of realistic shifts (e.g. image corruption shifts, sub-class shifts, or domain shifts). In contrast, progress in the subsequent task of detecting the non-prevented failures by means of CSFs is currently obstructed by three pitfalls: 1) A diverse and inconsistent set of evaluation protocols for CSFs exists (MisD, SC, PUQ, OoD-D), impeding comprehensive competition. 2) Only a fraction of the spectrum of realistic distribution shifts, and thus of potential failure sources, is covered, diminishing the practical relevance of evaluation. 3) The task formulation in OoD-D fundamentally deviates from the stated purpose of detecting classification failures. Overall, the holistic perspective on failure detection reveals an obvious need for a unified and comprehensive evaluation protocol, in analogy to current robustness benchmarks, to make classifiers fit for safety-critical applications. Abbreviations: CSF: Confidence Scoring Function, OoD-D: Out-of-Distribution Detection, MisD: Misclassification Detection, PUQ: Predictive Uncertainty Quantification, SC: Selective Classification.

