DETECTING MISCLASSIFICATION ERRORS IN NEURAL NETWORKS WITH A GAUSSIAN PROCESS MODEL

Abstract

As neural network classifiers are deployed in real-world applications, it is crucial that their predictions are not just accurate, but trustworthy as well. One practical solution is to assign confidence scores to each prediction and then filter out low-confidence predictions. However, existing confidence metrics are not yet sufficiently reliable for this role. This paper presents a new framework that produces more reliable confidence scores for detecting misclassification errors. This framework, RED, calibrates the classifier's inherent confidence indicators and estimates the uncertainty of the calibrated confidence scores using Gaussian Processes. Empirical comparisons with other confidence estimation methods on 125 UCI datasets demonstrate that this approach is effective. An experiment on a vision task with a large deep learning architecture further confirms that the method can scale up, and a case study involving out-of-distribution and adversarial samples shows the potential of the proposed method to improve the robustness of neural network classifiers more broadly in the future.

1. INTRODUCTION

Classifiers based on Neural Networks (NNs) are widely deployed in many real-world applications (LeCun et al., 2015; Anjos et al., 2015; Alghoul et al., 2018; Shahid et al., 2019). Although good prediction accuracies are achieved, the lack of safety guarantees becomes a severe issue when NNs are applied to safety-critical domains, e.g., healthcare (Selişteanu et al., 2018; Gupta et al., 2007; Shahid et al., 2019), finance (Dixon et al., 2017), self-driving (Janai et al., 2017; Hecker et al., 2018), etc. One way to estimate the trustworthiness of a classifier prediction is to use its inherent confidence-related score, e.g., the maximum class probability (Hendrycks & Gimpel, 2017), the entropy of the softmax outputs (Williams & Renals, 1997), or the difference between the highest and second-highest activation outputs (Monteith & Martinez, 2010). However, these scores are unreliable and may even be misleading: high-confidence but erroneous predictions are frequently observed (Provost et al., 1998; Guo et al., 2017; Nguyen et al., 2015; Goodfellow et al., 2014; Amodei et al., 2016). In a practical setting, it is beneficial to have a detector that can raise a red flag whenever the predictions are suspicious. A human observer can then evaluate such predictions, making the classification system safer. To construct such a detector, quantitative metrics for measuring predictive reliability under different circumstances are first developed, and a warning threshold can then be set based on users' preferred precision-recall tradeoff.
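The three inherent confidence indicators mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's method; the function name and the example logits are our own:

```python
import numpy as np

def inherent_confidence_scores(logits):
    """Three common confidence indicators derived from a classifier's
    raw outputs (illustrative sketch; higher value = more confident)."""
    # Numerically stable softmax over the class dimension.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # 1. Maximum class probability (Hendrycks & Gimpel, 2017).
    max_prob = float(probs.max())
    # 2. Negative entropy of the softmax outputs (Williams & Renals, 1997);
    #    a peaked distribution has low entropy, hence a high score here.
    neg_entropy = float(np.sum(probs * np.log(probs + 1e-12)))
    # 3. Margin between the highest and second-highest activations
    #    (Monteith & Martinez, 2010).
    top2 = np.sort(logits)[-2:]
    margin = float(top2[1] - top2[0])
    return max_prob, neg_entropy, margin

scores = inherent_confidence_scores(np.array([2.0, 0.5, 0.1]))
```

Any of these scalar scores can be thresholded to flag suspicious predictions, but as the citations above note, all three can remain high on misclassified inputs.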
Such existing methods can be categorized into three types based on their focus: error detection, which aims to detect the natural misclassifications made by the classifier (Hendrycks & Gimpel, 2017; Jiang et al., 2018; Corbière et al., 2019); out-of-distribution (OOD) detection, which reports samples that come from distributions different from the training data (Liang et al., 2018; Lee et al., 2018a; Devries & Taylor, 2018); and adversarial sample detection, which filters out samples produced by adversarial attacks (Lee et al., 2018b; Wang et al., 2019; Aigrain & Detyniecki, 2019). Among these categories, error detection, also called misclassification detection (Jiang et al., 2018) or failure prediction (Corbière et al., 2019), is the most challenging (Aigrain & Detyniecki, 2019) and underexplored. For instance, Hendrycks & Gimpel (2017) defined a baseline based on the maximum class probability after the softmax layer. Although the baseline performs reasonably well in most testing cases, reduced efficacy in some scenarios indicates room for improvement (Hendrycks & Gimpel,

