DETECTING MISCLASSIFICATION ERRORS IN NEURAL NETWORKS WITH A GAUSSIAN PROCESS MODEL

Abstract

As neural network classifiers are deployed in real-world applications, it is crucial that their predictions are not just accurate, but trustworthy as well. One practical solution is to assign confidence scores to each prediction and then filter out low-confidence predictions. However, existing confidence metrics are not yet sufficiently reliable for this role. This paper presents a new framework that produces more reliable confidence scores for detecting misclassification errors. This framework, RED, calibrates the classifier's inherent confidence indicators and estimates the uncertainty of the calibrated confidence scores using Gaussian Processes. Empirical comparisons with other confidence estimation methods on 125 UCI datasets demonstrate that this approach is effective. An experiment on a vision task with a large deep learning architecture further confirms that the method scales up, and a case study involving out-of-distribution and adversarial samples shows the potential of the proposed method to improve the robustness of neural network classifiers more broadly in the future.

1. INTRODUCTION

Classifiers based on Neural Networks (NNs) are widely deployed in many real-world applications (LeCun et al., 2015; Anjos et al., 2015; Alghoul et al., 2018; Shahid et al., 2019). Although good prediction accuracies are achieved, the lack of safety guarantees becomes a severe issue when NNs are applied to safety-critical domains, e.g., healthcare (Selişteanu et al., 2018; Gupta et al., 2007; Shahid et al., 2019), finance (Dixon et al., 2017), and self-driving (Janai et al., 2017; Hecker et al., 2018). One way to estimate the trustworthiness of a classifier prediction is to use its inherent confidence-related score, e.g., the maximum class probability (Hendrycks & Gimpel, 2017), the entropy of the softmax outputs (Williams & Renals, 1997), or the difference between the highest and second-highest activation outputs (Monteith & Martinez, 2010). However, these scores are unreliable and may even be misleading: high-confidence but erroneous predictions are frequently observed (Provost et al., 1998; Guo et al., 2017; Nguyen et al., 2015; Goodfellow et al., 2014; Amodei et al., 2016). In a practical setting, it is beneficial to have a detector that can raise a red flag whenever the predictions are suspicious. A human observer can then evaluate such predictions, making the classification system safer. To construct such a detector, quantitative metrics for measuring predictive reliability under different circumstances are first developed, and a warning threshold can then be set based on users' preferred precision-recall tradeoff.
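The three inherent confidence indicators mentioned above can be computed directly from a classifier's output activations. The following sketch (plain NumPy; the toy logits are illustrative, not from the paper) shows all three side by side:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_scores(logits):
    """Return the three inherent confidence indicators for each sample."""
    p = softmax(logits)
    max_prob = p.max(axis=-1)                        # maximum class probability
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)  # softmax entropy (lower = more confident)
    top2 = np.sort(logits, axis=-1)[..., -2:]
    margin = top2[..., 1] - top2[..., 0]             # highest minus second-highest activation
    return max_prob, entropy, margin

logits = np.array([[2.0, 1.0, 0.1],    # fairly confident prediction
                   [0.5, 0.4, 0.45]])  # ambiguous prediction
mp, ent, mar = confidence_scores(logits)
```

On the confident first sample, the maximum probability and margin are higher and the entropy is lower than on the ambiguous second sample, which is exactly the ranking behavior these scores are meant to provide.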
Existing methods of this kind can be categorized into three types based on their focus: error detection, which aims to detect the natural misclassifications made by the classifier (Hendrycks & Gimpel, 2017; Jiang et al., 2018; Corbière et al., 2019); out-of-distribution (OOD) detection, which reports samples that come from distributions different from the training data (Liang et al., 2018; Lee et al., 2018a; Devries & Taylor, 2018); and adversarial sample detection, which filters out samples from adversarial attacks (Lee et al., 2018b; Wang et al., 2019; Aigrain & Detyniecki, 2019). Among these categories, error detection, also called misclassification detection (Jiang et al., 2018) or failure prediction (Corbière et al., 2019), is the most challenging (Aigrain & Detyniecki, 2019) and underexplored. For instance, Hendrycks & Gimpel (2017) defined a baseline based on the maximum class probability after the softmax layer. Although the baseline performs reasonably well in most testing cases, reduced efficacy in some scenarios indicates room for improvement (Hendrycks & Gimpel, 2017). Jiang et al. (2018) proposed Trust Score, which measures the similarity between the original classifier and a modified nearest-neighbor classifier. The main limitation of this method is scalability: the Trust Score may provide no or even negative improvement over the baseline for high-dimensional data. ConfidNet (Corbière et al., 2019) builds a separate NN model to learn the true class probability, i.e., the softmax probability for the ground-truth class. However, ConfidNet itself is a standard NN, so its confidence scores may be unreliable or misleading: a random input may generate a random confidence score, and ConfidNet does not provide any information regarding the uncertainty of these confidence scores.
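Error-detection methods like the maximum-class-probability (MSP) baseline are typically evaluated by how well the score separates correct from incorrect predictions, e.g., via AUROC. The sketch below computes AUROC with the rank-sum statistic (the toy MSP scores and correctness labels are hypothetical, chosen only to illustrate the evaluation; ties are not handled for simplicity):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum statistic: the probability that a randomly
    chosen correct prediction receives a higher score than a randomly
    chosen incorrect one. Assumes no tied scores, for simplicity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical toy data: MSP score per sample, and whether the
# classifier's prediction on that sample was actually correct.
msp = np.array([0.95, 0.9, 0.8, 0.6, 0.55, 0.4])
correct = np.array([1, 1, 1, 0, 1, 0])
score = auroc(msp, correct)  # 1.0 would mean perfect separability
```

An AUROC of 0.5 corresponds to a score with no separating power; methods like Trust Score and ConfidNet aim to push this number above the MSP baseline.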
Moreover, none of these methods can differentiate natural classifier errors from risks caused by OOD or adversarial samples; if a detector could do that, it would be easier for practitioners to fix the problem, e.g., by retraining the original classifier or applying better preprocessing techniques to filter out OOD or adversarial data. To meet these challenges, a new framework is developed in this paper for error detection in NN classifiers. The main idea is to utilize RIO (Residual, Input, Output; Qiu et al., 2020) , a regression model based on Gaussian Processes, on top of the original NN classifier. Whereas the original RIO is limited to regression problems, the proposed approach extends it to misclassification detection by modifying its components. The new framework not only produces a calibrated confidence score based on the original maximum class probability, but also provides a quantitative uncertainty estimation of that score. Errors can therefore be detected more reliably. Note that the proposed method does not change the prediction accuracy of the original classification model. Instead, it provides a quantitative metric that makes it possible to detect misclassification errors. This framework, referred to as RED (RIO for Error Detection), is compared empirically to existing approaches on 125 UCI datasets and on a large-scale deep learning architecture. The results demonstrate that the approach is effective and robust. A further case study with OOD and adversarial samples shows the potential of using RED to diagnose the sources of mistakes as well, thereby leading to a possible comprehensive approach for improving trustworthiness of neural network classifiers in the future.
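The core mechanism described above, Gaussian Process regression over a residual, can be illustrated conceptually. The sketch below implements plain exact GP regression with a single RBF kernel and applies it RIO-style: the regression target is the residual between a 0/1 correctness label and the maximum class probability, and the GP's predictive standard deviation serves as the uncertainty estimate of the calibrated score. This is a simplified stand-in under stated assumptions, not the authors' implementation (RIO's actual kernel combines input and output components), and all data values are toy placeholders:

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """RBF kernel matrix between row-vector sets A and B (unit prior variance)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Exact GP regression: posterior mean and standard deviation at X_test."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_test, X_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = 1.0 - (v ** 2).sum(axis=0)  # RBF prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

# Toy setup; in RED-like usage the features would combine the classifier's
# input and output representations.
X_train = np.array([[0.0], [1.0], [2.0]])
max_prob = np.array([0.9, 0.6, 0.8])  # classifier's raw MSP on training samples
correct = np.array([1.0, 0.0, 1.0])   # 1 if the prediction was right
residual = correct - max_prob         # RIO-style regression target
X_test = np.array([[1.0], [5.0]])     # one familiar point, one far-away point
mean, std = gp_posterior(X_train, residual, X_test)
# Calibrated score = raw MSP at test time + predicted residual;
# std quantifies how much that calibrated score can be trusted.
```

The key property for error detection is visible even in this toy case: near the training data the GP predicts the residual with low uncertainty, while at the far-away point the predicted residual shrinks toward zero and the uncertainty grows toward the prior, flagging the calibrated score itself as untrustworthy.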

2. RELATED WORK

In the past two decades, a large volume of work has been devoted to calibrating the confidence scores returned by classifiers. Early works include Platt Scaling (Platt, 1999; Niculescu-Mizil & Caruana, 2005), histogram binning (Zadrozny & Elkan, 2001), and isotonic regression (Zadrozny & Elkan, 2002), with recent extensions like Temperature Scaling (Guo et al., 2017), Dirichlet calibration (Kull et al., 2019), and distance-based learning from errors (Xing et al., 2020). These methods focus on reducing the difference between reported class probability and true accuracy, and the rankings of samples are generally preserved after calibration. As a result, the separability between correct and incorrect predictions is not improved. In contrast, RED aims at deriving a score that differentiates incorrect predictions from correct ones better. A related direction of work is the development of classifiers with a rejection/abstention option. These approaches either introduce new training pipelines/loss functions (Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; Cortes et al., 2016), or define mechanisms for learning rejection thresholds under certain risk levels (Dubuisson & Masson, 1993; Santos-Pereira & Pires, 2005; Chow, 2006; Geifman & El-Yaniv, 2017). In contrast to these methods, RED assumes an existing pretrained NN classifier and provides an additional metric for detecting potential errors made by this classifier, without specifying a rejection threshold. Designing metrics for detecting potential risks in NN classifiers has also become popular recently. While most approaches focus on detecting OOD (Liang et al., 2018; Lee et al., 2018a; Devries & Taylor, 2018) or adversarial examples (Lee et al., 2018b; Wang et al., 2019; Aigrain & Detyniecki, 2019), work on detecting natural errors, i.e., regular misclassifications not caused by external sources, is more limited. Ortega (1995) and Koppel & Engelson (1996) conducted early work in predicting whether a classifier is going to make mistakes, and Seewald & Fürnkranz (2001) built a meta-grading classifier based on similar ideas. However, these early works did not consider NN classifiers. More recently, Hendrycks & Gimpel (2017) and Haldimann et al. (2019) demonstrated that raw maximum class probability is an effective baseline in error detection, although its performance was reduced in some scenarios. More elaborate techniques for error detection have also been developed recently. Mandelbaum & Weinshall (2017) proposed a confidence score based on the data embedding derived from the penul-