TAKING A STEP BACK WITH KCAL: MULTI-CLASS KERNEL-BASED CALIBRATION FOR DEEP NEURAL NETWORKS

Abstract

Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. In high-risk applications like healthcare, practitioners require fully calibrated probability predictions for decision-making. That is, conditioned on the prediction vector, every class' probability should be close to the predicted value. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs, reduce classification accuracy in the process, or only calibrate the predicted class. This paper proposes a new Kernel-based calibration method called KCal. Unlike existing calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, KCal learns a metric space on the penultimate-layer latent embedding and generates predictions using kernel density estimates on a calibration set. We first analyze KCal theoretically, showing that it enjoys a provable full calibration guarantee. Then, through extensive experiments across a variety of datasets, we show that KCal consistently outperforms baselines as measured by the calibration error and by proper scoring rules like the Brier Score. Our code is available at https://github.com/zlin7/KCal.

1. INTRODUCTION

The notable successes of Deep Neural Networks (DNNs) in complex classification tasks, such as object detection (Ouyang & Wang, 2013), speech recognition (Deng et al., 2013), and medical diagnosis (Qiao et al., 2020; Biswal et al., 2017), have made them essential ingredients of various critical decision-making pipelines. In addition to classification accuracy, a classifier should ideally also produce reliable uncertainty estimates in the form of its predicted probability vector. An influential study (Guo et al., 2017) reported that modern DNNs are often overconfident or miscalibrated, which can lead to severe consequences in high-stakes applications such as healthcare (Jiang et al., 2012). Calibration is the process of closing the gap between the prediction and the ground-truth distribution conditioned on that prediction. For a K-class classification problem with covariates X ∈ X and label Y ∈ Y = [K], denote our classifier X → ∆^{K-1} as p = [p_1, . . . , p_K], where ∆^{K-1} is the (K-1)-simplex. Then:

Definition 1 (Full Calibration (Vaicenavicius et al., 2019)). p is fully calibrated if for all k ∈ [K] and all q = [q_1, . . . , q_K] ∈ ∆^{K-1},

P{Y = k | p(X) = q} = q_k.    (1)

It is worth noting that Def. 1 implies nothing about accuracy: ignoring X and always predicting π, the class-frequency vector, yields a fully calibrated but inaccurate classifier. Our goal is therefore always to improve calibration while maintaining accuracy. Another important requirement is that p ∈ ∆^{K-1}: many binary calibration methods, such as Zadrozny & Elkan (2001; 2002), produce vectors that are not interpretable as probabilities and must be normalized. Many existing works consider only confidence calibration (Guo et al., 2017; Zhang et al., 2020; Wenger et al., 2020; Ma & Blaschko, 2021), a much weaker notion than Def. 1 that calibrates only the predicted class (Kull et al., 2019; Vaicenavicius et al., 2019). Definition 2.
(Confidence Calibration). p is confidence-calibrated if for all q ∈ [0, 1],

P{Y = argmax_k p_k(X) | max_k p_k(X) = q} = q.    (2)

However, confidence calibration is far from sufficient. A doctor performing a differential diagnosis must weigh multiple possible diseases, with proper probabilities assigned to all of them, not only the most likely diagnosis. Figure 1 shows an example where the confidence is calibrated but the prediction for an important class like Seizure is poorly calibrated. A classifier can thus be confidence-calibrated yet not useful for such tasks if the probabilities assigned to most diseases are inaccurate. Unlike existing methods, we take one step back and train a new low-dimensional metric space on the penultimate-layer embeddings of the DNN. We then use a kernel density estimation-based classifier to predict the class probabilities directly. We refer to our Kernel-based Calibration method as KCal. Unlike most calibration methods, KCal provides high-probability error bounds for full calibration under standard assumptions. Empirically, we show that with little overhead, KCal outperforms all existing calibration methods in terms of calibration quality across multiple tasks and DNN architectures, while maintaining and sometimes improving classification accuracy.
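The kernel-based prediction step described above can be sketched as follows. This is a minimal illustration under our own naming and an assumed RBF kernel; the learned metric, kernel choice, and bandwidth selection in the actual method may differ:

```python
# Hedged sketch of a KDE-based classifier on latent embeddings (our naming,
# not the authors' implementation): class probabilities for a query point are
# kernel-weighted votes from a held-out calibration set.
import numpy as np

def kde_predict(z_query, z_calib, y_calib, n_classes, bandwidth=1.0):
    """Return a probability vector from an RBF kernel density estimate.

    z_query : (d,)   embedding of the test point.
    z_calib : (M, d) embeddings of the calibration set.
    y_calib : (M,)   integer labels of the calibration set.
    """
    d2 = ((z_calib - z_query) ** 2).sum(axis=1)   # squared distances
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))      # RBF kernel weights
    probs = np.zeros(n_classes)
    for k in range(n_classes):
        probs[k] = w[y_calib == k].sum()          # per-class kernel mass
    return probs / probs.sum()                    # normalize onto the simplex
```

Note that the output sums to one by construction, so it lies on the simplex ∆^{K-1} without the ad hoc renormalization that some binary calibration methods require.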

Summary of Contributions:

• We propose KCal, a principled method that calibrates DNNs using kernel density estimation on the latent embeddings.
• We present an efficient pipeline to train KCal, including a dimension-reducing projection and a stratified sampling method to facilitate efficient training.
• We provide finite-sample bounds for the calibration error of KCal-calibrated outputs under standard assumptions. To the best of our knowledge, this is the first method with a full calibration guarantee, especially for neural networks.
• In extensive experiments on multiple datasets and state-of-the-art models, we find that KCal outperforms existing calibration methods on commonly used evaluation metrics. We also show that KCal provides more reliable predictions for important classes in the healthcare datasets. The code to replicate all our experimental results is submitted along with the supplementary materials.
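As a concrete illustration of the kind of evaluation metric involved, the following sketch estimates a binned classwise calibration error in the spirit of Definition 1. This is our own simplified estimator for illustration, not necessarily the exact metric used in the experiments:

```python
# Hedged sketch (not the paper's code): estimate a classwise calibration
# error by binning each class's predicted probability and comparing the
# empirical class frequency in each bin to the mean predicted probability.
import numpy as np

def classwise_calibration_error(probs, labels, n_bins=10):
    """Bin-weighted average of |P(Y=k | p_k in bin) - mean p_k in bin|.

    probs  : (N, K) predicted probability vectors (rows sum to 1).
    labels : (N,)   integer labels in [0, K).
    """
    N, K = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for k in range(K):
        p_k = probs[:, k]
        hits = (labels == k).astype(float)
        for b in range(n_bins):
            mask = (p_k >= edges[b]) & (p_k < edges[b + 1])
            if b == n_bins - 1:                       # last bin includes 1.0
                mask = (p_k >= edges[b]) & (p_k <= edges[b + 1])
            if mask.sum() == 0:
                continue
            gap = abs(hits[mask].mean() - p_k[mask].mean())
            err += (mask.sum() / N) * gap / K
    return err

# The trivial classifier that always predicts the class-frequency vector pi
# is fully calibrated (but inaccurate), as noted in the introduction:
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=6000)
pi = np.bincount(labels, minlength=3) / len(labels)
probs = np.tile(pi, (len(labels), 1))
print(classwise_calibration_error(probs, labels))  # exactly 0.0
```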

2. RELATED WORK

Research on calibration originated in the context of meteorology and weather forecasting (see Murphy & Winkler (1984) for an overview) and has a long history, much older than the field of machine learning (Brier, 1950; Murphy & Winkler, 1977; Degroot & Fienberg, 1983). We refer to Filho et al. (2021) for a holistic overview and focus below on methods proposed in the context of modern neural networks, which we cluster into distinct categories based on underlying methodological similarities.

Scaling: A popular family of calibration methods is based on scaling, in which a mapping is learned from the predicted logits to probability vectors. Confidence-calibration scaling methods include temperature scaling (TS) (Guo et al., 2017) and its antecedent Platt scaling (Platt, 1999), an ensemble of TS (Zhang et al., 2020), Gaussian-process scaling (Wenger et al., 2020), and a combination of a base calibrator (TS) with a rejection option (Ma & Blaschko, 2021). Matrix scaling with regularization
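For concreteness, temperature scaling can be sketched as below. This is a minimal pure-NumPy illustration with our own naming, using a simple grid search over T rather than the gradient-based fit typically used in practice:

```python
# Hedged sketch of temperature scaling (Guo et al., 2017): a single scalar
# temperature T is fit on a held-out set by minimizing the negative
# log-likelihood of softmax(logits / T).
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Pick T > 0 minimizing the NLL of softmax(logits / T) on held-out data."""
    if grid is None:
        grid = np.exp(np.linspace(-3.0, 3.0, 601))  # T from ~0.05 to ~20
    rows = np.arange(len(labels))

    def nll(t):
        p = softmax(logits / t)
        return -np.log(p[rows, labels] + 1e-12).mean()

    losses = np.array([nll(t) for t in grid])
    return float(grid[losses.argmin()])
```

A fitted T > 1 softens overconfident outputs. Since dividing the logits by a scalar never changes the argmax, accuracy is preserved exactly, but only the confidence in the sense of Eq. (2) is calibrated, not the full probability vector.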



Figure 1: Reliability diagrams for confidence calibration (top) and Seizure (bottom). The popular temperature scaling (right) only calibrates the confidence, leaving Seizure poorly calibrated. See Figure 2 and the Appendix for complete reliability diagrams.

