A CALL TO REFLECT ON EVALUATION PRACTICES FOR FAILURE DETECTION IN IMAGE CLASSIFICATION

Abstract

Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-world application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions with respect to all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation given the abundance of published research on confidence scoring. Code and trained models are available at https://github.com/IML-DKFZ/fd-shifts.

1. INTRODUCTION

"Neural network-based classifiers may silently fail when the test data distribution differs from the training data. For critical tasks such as medical diagnosis or autonomous driving, it is thus essential to detect incorrect predictions based on an indication of whether the classifier is likely to fail". Such or similar mission statements prelude numerous publications in the fields of misclassification detection (MisD) (Corbière et al., 2019; Hendrycks and Gimpel, 2017; Malinin and Gales, 2018) , Out-of-Distribution detection (OoD-D) (Fort et al., 2021; Winkens et al., 2020; Lee et al., 2018; Hendrycks and Gimpel, 2017; DeVries and Taylor, 2018; Liang et al., 2018) , selective classification (SC) (Liu et al., 2019; Geifman and El-Yaniv, 2019; 2017) , and predictive uncertainty quantification (PUQ) (Ovadia et al., 2019; Kendall and Gal, 2017) , hinting at the fact that all these approaches aim towards the same eventual goal: Enabling safe deployment of classification systems by means of failure detection, i.e. the detection or filtering of erroneous predictions based on ranking of associated confidence scores. In this context, any function whose continuous output aims to separate a classifier's failures from correct predictions can be interpreted as a confidence scoring function (CSF) and represents a valid approach to the stated goal. This holistic perspective on failure detection reveals extensive shortcomings in current evaluation protocols, which constitute major bottlenecks in progress toward the goal of making classifiers suitable for application in real-world scenarios. Our work is an appeal to corresponding communities to reflect on current practices and provides a technical deduction of a unified evaluation protocol, a list of empirical insights based on a large-scale study, as well as hands-on recommendations for researchers to catalyze progress in the field.

