TAKING A STEP BACK WITH KCAL: MULTI-CLASS KERNEL-BASED CALIBRATION FOR DEEP NEURAL NETWORKS

Abstract

Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. In high-risk applications like healthcare, practitioners require fully calibrated probability predictions for decision-making. That is, conditioned on the prediction vector, every class' probability should be close to the predicted value. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs, reduce classification accuracy in the process, or only calibrate the predicted class. This paper proposes a new Kernel-based calibration method called KCal. Unlike existing calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, KCal learns a metric space on the penultimate-layer latent embedding and generates predictions using kernel density estimates on a calibration set. We first analyze KCal theoretically, showing that it enjoys a provable full calibration guarantee. Then, through extensive experiments across a variety of datasets, we show that KCal consistently outperforms baselines as measured by the calibration error and by proper scoring rules like the Brier Score. Our code is available at https://github.com/zlin7/KCal.

1. INTRODUCTION

The notable successes of Deep Neural Networks (DNNs) in complex classification tasks, such as object detection (Ouyang & Wang, 2013), speech recognition (Deng et al., 2013), and medical diagnosis (Qiao et al., 2020; Biswal et al., 2017), have made them essential ingredients within various critical decision-making pipelines. In addition to classification accuracy, a classifier should ideally also generate reliable uncertainty estimates, represented in the predicted probability vector. An influential study (Guo et al., 2017) reported that modern DNNs are often overconfident or miscalibrated, which could lead to severe consequences in high-stakes applications such as healthcare (Jiang et al., 2012). Calibration is the process of closing the gap between the prediction and the ground-truth distribution given this prediction. For a K-class classification problem with covariates X ∈ 𝒳 and label Y ∈ 𝒴 = [K], denote our classifier 𝒳 → Δ^{K−1} as p̂ = [p̂_1, …, p̂_K], with Δ^{K−1} being the (K−1)-simplex. Then,

Definition 1 (Full Calibration (Vaicenavicius et al., 2019)). p̂ is fully calibrated if, for all k ∈ [K] and all q = [q_1, …, q_K] ∈ Δ^{K−1},

    P{Y = k | p̂(X) = q} = q_k.    (1)

It is worth noting that Def. (1) implies nothing about accuracy. In fact, ignoring X and simply predicting π, the class frequency vector, results in a fully calibrated but inaccurate classifier. As a result, our goal is always to improve calibration while maintaining accuracy. Another important requirement is that p̂ ∈ Δ^{K−1}: many binary calibration methods such as Zadrozny & Elkan (2001; 2002) produce vectors that are not interpretable as probabilities and have to be normalized. Many existing works only consider confidence calibration (Guo et al., 2017; Zhang et al., 2020; Wenger et al., 2020; Ma & Blaschko, 2021), a much weaker notion than that encapsulated by Def. (1), which only calibrates the predicted class (Kull et al., 2019; Vaicenavicius et al., 2019).
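To make the caveat after Def. (1) concrete, the following toy check (our illustration, not from the paper) verifies empirically that the covariate-ignoring predictor p̂(X) ≡ π is fully calibrated despite being useless for discrimination:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])           # true class frequencies
y = rng.choice(3, size=200_000, p=pi)    # i.i.d. labels

# The constant predictor p(X) = pi emits a single prediction vector q = pi,
# so P{Y = k | p(X) = q} is simply the marginal class frequency of Y,
# which matches q_k by construction: fully calibrated, yet uninformative.
empirical = np.bincount(y, minlength=3) / len(y)
gap = np.max(np.abs(empirical - pi))     # calibration gap of the constant predictor
```

With enough samples, `gap` is close to zero even though the predictor cannot distinguish any two inputs.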
However, confidence calibration is far from sufficient. Doctors need to perform differential diagnoses on a patient, where multiple possible diseases should be considered with proper probabilities for all of them, not only the most likely diagnosis. Figure 1 shows an example where the confidence is calibrated, but the prediction for important classes like Seizure is poorly calibrated (see Figure 2 and the Appendix for complete reliability diagrams). A classifier can be confidence-calibrated yet useless for such tasks if the probabilities assigned to most diseases are inaccurate. Recent research effort has started to focus on full calibration, for example in Vaicenavicius et al. (2019); Kull et al. (2019); Widmann et al. (2019). We approach this problem by leveraging the latent neural network embedding in a nonparametric manner. Nonparametric methods such as histogram binning (HB) (Zadrozny & Elkan, 2001) and isotonic regression (IR) (Zadrozny & Elkan, 2002) are natural for calibration and have become popular. Gupta & Ramdas (2021) recently showed a calibration guarantee for HB. However, HB usually leads to noticeable drops in accuracy (Patel et al., 2021), and IR is prone to overfitting (Niculescu-Mizil & Caruana, 2005). Unlike existing methods, we take one step back and train a new low-dimensional metric space on the penultimate-layer embeddings of DNNs. Then, we use a kernel-density-estimation-based classifier to predict the class probabilities directly. We refer to our Kernel-based Calibration method as KCal. Unlike most calibration methods, KCal provides high-probability error bounds for full calibration under standard assumptions. Empirically, we show that with little overhead, KCal outperforms all existing calibration methods in terms of calibration quality, across multiple tasks and DNN architectures, while maintaining and sometimes improving the classification accuracy.

Summary of Contributions:

• We propose KCal, a principled method that calibrates DNNs using kernel density estimation on the latent embeddings.
• We present an efficient pipeline to train KCal, including a dimension-reducing projection and a stratified sampling method to facilitate efficient training.
• We provide finite-sample bounds for the calibration error of KCal-calibrated output under standard assumptions. To the best of our knowledge, this is the first method with a full calibration guarantee, especially for neural networks.
• In extensive experiments on multiple datasets and state-of-the-art models, we found that KCal outperforms existing calibration methods in commonly used evaluation metrics. We also show that KCal provides more reliable predictions for important classes in the healthcare datasets. The code to replicate all our experimental results is submitted along with the supplementary materials.

2. RELATED WORK

Research on calibration originated in the context of meteorology and weather forecasting (see Murphy & Winkler (1984) for an overview) and has a long history, much older than the field of machine learning (Brier, 1950; Murphy & Winkler, 1977; Degroot & Fienberg, 1983). We refer to Filho et al. (2021) for a holistic overview and focus below on methods proposed in the context of modern neural networks. Based on underlying methodological similarities, we cluster them into distinct categories.

Scaling: A popular family of calibration methods is based on scaling, in which a mapping is learned from the predicted logits to probability vectors. Confidence-calibration scaling methods include temperature scaling (TS) (Guo et al., 2017) and its antecedent Platt scaling (Platt, 1999), an ensemble of TS (Zhang et al., 2020), Gaussian-Process scaling (Wenger et al., 2020), and a combination of a base calibrator (TS) with a rejection option (Ma & Blaschko, 2021). Matrix scaling with regularization has also been used to perform full calibration (Kull et al., 2019). While some scaling-based methods can be data-efficient, to the best of our knowledge there are no known theoretical guarantees for them.

Binning: Another cluster of solutions relies on binning and its variants, and includes uniform-mass binning (Zadrozny & Elkan, 2001), scaling before binning (Kumar et al., 2019), and mutual-information-maximization-based binning (Patel et al., 2021). Isotonic regression (Zadrozny & Elkan, 2002) is also often interpreted as binning. Uniform-mass binning (Zadrozny & Elkan, 2001) has a distribution-free finite-sample calibration guarantee (Gupta & Ramdas, 2021) and asymptotically convergent ECE estimation (Vaicenavicius et al., 2019). However, in practice, binning tends to decrease accuracy (Patel et al., 2021; Guo et al., 2017). Binning can also be considered a member of the broader family of nonparametric calibration methods.
Such methods also include Gaussian Process Calibration (Wenger et al., 2020), which, however, also only considers confidence calibration.

Loss regularization: There are also attempts to train a calibrated DNN to begin with. Such methods typically add a suitable regularizer to the loss function (Karandikar et al., 2021; Mukhoti et al., 2020; Kumar et al., 2018), which can sometimes result in expensive optimization and a reduction in accuracy.

Use of Kernels: Although not directly used for calibration, kernels have also been used for uncertainty quantification in deep learning classification. In classification with rejection, the k-nearest-neighbors algorithm (kNN), closely related to kernel-based methods, has been used to provide a "confidence measure" for a binary decision (i.e., whether to reject or to predict) (Papernot & McDaniel, 2018; Jiang et al., 2018). Recently, continuous kernels have also been used to measure calibration quality or as regularization during training (Widmann et al., 2019; Kumar et al., 2018). Zhang et al. (2020) introduced a kernel density estimation (KDE) proxy estimator for estimating the ECE. However, it uses an unoptimized kernel over Δ^{K−1}, and shows that the KDE-ECE estimator (but not the calibration map) is consistent. To the best of our knowledge, using a trained KDE to calibrate predictions has not been proposed before. Further, we also provide a bound on the calibration error.

3. KCAL: KERNEL-BASED CALIBRATION

In this section, we formally introduce KCal, study its calibration properties theoretically, and present crucial implementation details and comparisons with other methods. Specifically, in Section 3.1, we discuss how to construct (automatically) calibrated predictions for test data using a calibration set S_cal. Doing so requires a well-trained kernel and metric space, and we describe a procedure to train such a kernel in Section 3.2. In Section 3.3, we show that an appropriate shrinkage rate of the bandwidth ensures that the KCal prediction is automatically calibrated. Section 3.4 provides implementation details. Finally, in Section 3.5, we compare and contrast KCal with existing methods.

3.1. CLASSIFICATION WITH KERNEL DENSITY ESTIMATION

Following the calibration literature, we first require a holdout calibration set S_cal = {(X_i, Y_i)}_{i=1}^N. In KCal, we fix a kernel function φ which is learned (the learning procedure is described in Section 3.2). For a new datum X_{N+1}, the class probability p̂_k(X_{N+1}) takes the following form:

    p̂_k(X_{N+1}; φ, S_cal) = [Σ_{(x,y)∈S^k_cal} φ(x, X_{N+1})] / [Σ_{(x,y)∈S_cal} φ(x, X_{N+1})],    (3)

where S^k_cal := {(x, y) ∈ S_cal | y = k}. The notation p̂_k(X_{N+1}; φ, S_cal) emphasizes the dependence on φ and S_cal; we will write p̂_k(X_{N+1}) when the dependence is clear from context.

Remarks: What we have described is essentially the classical nonparametric procedure of applying kernel density estimation to classification. For a moment, suppose we know the true density function f_k of P_k (the distribution of the data in class k) and the proportion of class k, denoted π_k (such that Σ_{k∈[K]} π_k = 1). Then, for any particular x_0, the Bayes rule gives:

    P{Y = k | X = x_0} = f_k(x_0) π_k / Σ_{k′∈[K]} f_{k′}(x_0) π_{k′}.    (4)
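A minimal sketch of the KDE prediction above, assuming the projected embeddings have already been computed and using an RBF mother kernel (the function name is ours, not from the released code):

```python
import numpy as np

def kde_class_probs(z_query, z_cal, y_cal, num_classes, bandwidth=1.0):
    """KDE class probabilities for one embedded query point.

    z_query: (d,) projected embedding of the test point.
    z_cal:   (N, d) projected embeddings of the calibration set.
    y_cal:   (N,) integer labels in [0, num_classes).
    Normalizing over all calibration points makes the output a valid
    probability vector on the simplex.
    """
    sq_dists = np.sum((z_cal - z_query) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dists / bandwidth ** 2)  # RBF kernel values
    probs = np.zeros(num_classes)
    for k in range(num_classes):
        probs[k] = weights[y_cal == k].sum()
    total = probs.sum()
    return probs / total if total > 0 else np.full(num_classes, 1.0 / num_classes)
```

The per-class numerator and shared denominator mirror the ratio form of the prediction, so the vector sums to one by construction.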

3.2. TRAINING

For good performance under the kernel density framework, it is crucial to employ an appropriate kernel function φ, which in turn relies on the choice of the underlying metric. Therefore, we train a metric space on top of the penultimate-layer embeddings of deep learning models. To begin, we assume a deep neural network is already trained on S_train = {(X^train_i, Y^train_i)}_{i=1}^M. We place no limitations on the form of the loss function, the optimizer, or the model architecture. However, we do require the neural net to compute an embedding before a final prediction layer, which is always the case in modern classification models. We denote the embedding function 𝒳 → R^h as f. Given a base "mother kernel" function ϕ, such as the Radial Basis Function (RBF) kernel, we denote the kernel with bandwidth b as ϕ_b := (1/b) ϕ(·/b). We parameterize the learnable kernel as:

    φ(x, x′) := φ_{Π,f,b}(x, x′) := ϕ_b(Π(f(x)) − Π(f(x′))),    (5)

where Π : R^h → R^d is a dimension-reducing projection parameterized by a shallow MLP (Section 3.4). Since the inference time is linear in d, letting d < h also affords computational benefits. Given that the embedding function f from the neural network is fixed, the only learnable entities are b and Π. In the training phase, we fix b = 1 and train Π using (stochastic) gradient descent and the log-loss; the specific value of b does not matter here since it can be folded into Π. Let us denote S^k_train := {(x, y) ∈ S_train : y = k}. In each iteration, we randomly sample two batches of data from S_train: the prediction data, denoted S^B_train, on which to evaluate Π, and "background" data for each class k, denoted B_k, drawn from S^k_train \ S^B_train to construct the KDE classifier. Then, the prediction for any x_j is given by:

    p̂_k(x_j; φ, S_train \ S^B_train) := [Σ_{(x,y)∈B_k} (|S^k_train \ S^B_train| / |B_k|) φ(x, x_j)] / [Σ_{k′∈[K]} Σ_{(x,y)∈B_{k′}} (|S^{k′}_train \ S^B_train| / |B_{k′}|) φ(x, x_j)],    (6)

where φ is shorthand for φ_{Π,f,b=1} defined in Eq. (5).
Algorithm 1: Overview of KCal
Input: S_train = {(X^train_i, Y^train_i)}_{i=1}^M, used to train the NN; S_cal = {(X_i, Y_i)}_{i=1}^N, the calibration set; f, the embedding function 𝒳 → R^h (the trained NN); X_{N+1}, an unseen datum for prediction.
Training (of the projection Π): repeat: sample S^B_train = {(x_j, y_j)}_{j=1}^B from S_train; compute p̂(x_j) via Eq. (6); compute the loss l ← (1/B) Σ_{j=1}^B LogLoss(p̂(x_j), y_j); update Π with (stochastic) gradient descent; until the loss l does not improve.
Inference: tune b* on S_cal by minimizing the log-loss, then output

    p̂_k(X_{N+1}) ← [Σ_{(x,y)∈S^k_cal} φ_{b*}(x, X_{N+1})] / [Σ_{(x,y)∈S_cal} φ_{b*}(x, X_{N+1})].

The log-loss is given formally by L = −(1/B) Σ_{(x,y)∈S^B_train} log p̂_y(x; φ, S_train \ S^B_train). Finally, we pick b = b* on the calibration set S_cal using cross-validation, because b should be chosen contingent on the sample size (Section 3.3). Choosing b can be done efficiently (Section 3.4). Algorithm 1 summarizes the steps we have explicated so far.
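The training-time prediction of Eq. (6) and the batch log-loss can be sketched as follows (a simplified NumPy illustration with hypothetical helper names; the gradient update of Π is omitted):

```python
import numpy as np

def reweighted_kde_probs(z, bg_z, bg_y, class_sizes, bandwidth=1.0):
    """KDE class probabilities at one embedded point z, built from a
    class-stratified 'background' sample. The class_sizes[k] / |B_k|
    re-weighting keeps each class's density estimate unbiased (cf. Eq. (6))."""
    num_classes = len(class_sizes)
    sq = np.sum((bg_z - z) ** 2, axis=1)
    w = np.exp(-0.5 * sq / bandwidth ** 2)          # RBF kernel values
    scores = np.zeros(num_classes)
    for k in range(num_classes):
        mask = bg_y == k
        if mask.any():
            scores[k] = (class_sizes[k] / mask.sum()) * w[mask].sum()
    total = scores.sum()
    return scores / total if total > 0 else np.full(num_classes, 1.0 / num_classes)

def batch_log_loss(z_batch, y_batch, bg_z, bg_y, class_sizes, bandwidth=1.0):
    """Average negative log-likelihood of a prediction batch (the training loss)."""
    probs = np.array([reweighted_kde_probs(z, bg_z, bg_y, class_sizes, bandwidth)
                      for z in z_batch])
    return -np.mean(np.log(probs[np.arange(len(y_batch)), y_batch] + 1e-12))
```

In the actual method this loss would be backpropagated through Π; here the embeddings are treated as fixed inputs for clarity.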

3.3. THEORETICAL ANALYSIS: CALIBRATION COMES FREE

In the previous section, we have only described a procedure to improve the prediction accuracy of p̂ on S_train. This section will show that calibration comes free with the p̂ obtained using Algorithm 1. In particular, we show that as the sample size for each class in S_cal increases, p̂ converges to the true frequency vector of Y given the prediction. For smoother presentation, we only state the relevant claims in what follows; detailed proofs are presented in the Appendix. To begin, we make a few standard assumptions, such as in Chacón & Duong (2018), including:

• For every k, the density on the embedded space, that of Π(f(X | Y = k)), denoted f_{Π∘f,k}, is square integrable and twice differentiable, with all second-order partials bounded, continuous, and square integrable.
• ϕ is spherically symmetric, with a finite second moment.

Lemmas 3.1 and 3.2 focus on an arbitrary class k and drop the subscript k on the density f for readability. We denote the size |S^k_cal| = m. Intuitively, due to the bias-variance trade-off, a suitable bandwidth b will depend on m: a small b reduces bias, but with finite m, a smaller b also leads to increased variance. Thus, b should go to 0 "slowly", which is formally stated below:

Lemma 3.1. For almost all x, if b^d m → ∞ and b → 0 as m → ∞, then ‖f̂_{Π∘f,k}(x) − f_{Π∘f,k}(x)‖_2 → 0 in probability as m → ∞. Here f̂_{Π∘f,k} is the estimate of f_{Π∘f,k} using S_cal.

Recall that d is the dimension of Π(f(𝒳)). We call such a bandwidth b admissible, and we sometimes write b(m) to emphasize the dependence on m. The following lemma gives the optimal admissible bandwidth:

Lemma 3.2. The optimal bandwidth is b = Θ(m^{−1/(d+4)}), which leads to the fastest decreasing MSE (i.e., E[‖f̂_{Π∘f,k}(x) − f_{Π∘f,k}(x)‖^2]) of O(m^{−4/(d+4)}).

Now we are in a position to present the main theoretical results. In the following, m denotes the rarest class's count (m := min_k |S^k_cal|).
Theorem 3.3 provides a bound between p̂ and the true conditional probability vector on the embedded space, p(Π(f(X))):

Theorem 3.3. Fix x such that the density of Π(f(x)) is positive. With b(m) = Θ(m^{−1/(d+4)}), for any λ ∈ (0, 2):

    P{|p̂_k(x) − p_k(Π(f(x)))| > (3K + 1) C m^{−λ/(d+4)}} ≤ K e^{−B m^{(4−2λ)/(d+4)}},    (8)

where

    p_k(Π(f(x))) := P{Y = k | Π(f(X)) = Π(f(x))}    (9)

for some constants B and C. As a corollary, p̂(x) converges to p(Π(f(x))) as m → ∞. Under the additional assumptions of Theorem 3.4, for some constants B and C we have:

    P{sup_{X,k} |p̂_k(X) − P{Y = k | p̂(X)}| > (3K + 1) C (log m / m)^{α/(d+2α)}} ≤ K (m^{−1} + m^{−B·2α/(d+2α)} m^{d/(d+2α)}).

We now proceed to present details pertaining to the efficient implementation of KCal.

3.4. IMPLEMENTATION TECHNIQUES

Efficient Training: As might be immediately apparent, using Algorithm 1 for prediction with the full S_train \ S^B_train can be an expensive exercise. To afford a training speedup, we consider a random subset of S_train \ S^B_train drawn via a modified stratified sampling. Specifically, we take m random samples from each S^k_train, denoted S^{k,m}_train, and replace the right-hand side of Eq. (6) with:

    [Σ_{(x,y)∈S^{k,m}_train} (|S^k_train| / m) φ(x, x_0)] / [Σ_{k′∈[K]} Σ_{(x,y)∈S^{k′,m}_train} (|S^{k′}_train| / m) φ(x, x_0)].

The re-scaling term |S^k_train| / m is crucial to obtain an unbiased estimate of f_k π_k. The stratification makes training more stable, while also reducing the estimation variance for rarer classes (more details in Appendix B). The overall complexity is now O(KmdhB) per batch. In all experiments, we used m = 20 and B = 64.

Form of Π: While there is considerable freedom in choosing a suitable form for Π, we parameterize Π with a two-layer MLP with a skip connection. Consequently, Π can reduce to a linear projection when sufficient, and be more expressive when necessary. We also experimented with using only a linear projection; the results are included in the Appendix. We fix the output dimension to d = min{dim(f), 32}, except for ImageNet (d = 128).

Bandwidth Selection: Finally, to find the optimal bandwidth using S_cal, we use Golden-Section search (Kiefer, 1953) to find the log-loss-minimizing b*. This takes O(log((ub − lb) / tol)) steps, where [lb, ub] is the search space and tol is the tolerance. Essentially, we assume that the loss is a convex function of b, permitting an efficient search (see Appendix H, which presents empirical evidence that the convexity assumption is valid across datasets).
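Golden-Section search itself is simple enough to sketch in a few lines (a generic implementation under the unimodality assumption above, not taken from the released code); here f would be the calibration-set log-loss as a function of the bandwidth b:

```python
import math

def golden_section_min(f, lb, ub, tol=1e-3):
    """Golden-Section search for the minimizer of a unimodal f on [lb, ub].
    Each step shrinks the bracket by the inverse golden ratio, so the search
    takes O(log((ub - lb) / tol)) function evaluations."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi ~= 0.618
    a, b = lb, ub
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                           # minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                 # minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2
```

Because each iteration reuses one of the two interior evaluations, only one new loss evaluation is needed per step.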

3.5. COMPARISONS WITH EXISTING CALIBRATION METHODS

Most existing calibration methods discussed in Section 2 and KCal all utilize a holdout calibration set. However, unlike KCal, existing works usually fix the last neural network layer. KCal, on the other hand, "takes a step back" and replaces the last prediction layer with a kernel-density-estimation-based classifier. Since the DNN f is fixed regardless of whether we use the original last layer or not, we are really comparing a KDE classifier (KCal) with linear models trained in various ways, after mapping all the data with f. Note that this characterization is true for most existing methods, with a few exceptions (e.g., those summarized under "loss regularization" in Section 2). Employing a KDE classifier affords some clear advantages, such as a straightforward convergence guarantee and some interpretability [1]. Furthermore, KCal can also be improved in an online fashion, a benefit especially desirable in certain high-stakes applications such as healthcare. For example, a hospital can calibrate a trained model prior to deployment using its own patient data (which is usually not available to train the original model) as it becomes available. Another important advantage of KCal concerns normalization. In fact, simultaneously calibrating all classes while satisfying the constraint that p̂ ∈ Δ^{K−1} is a distinguishing challenge for multi-class calibration. Many calibration methods perform one-vs-rest calibration for each class and require a separate normalization step at test time (Zadrozny & Elkan, 2001; 2002; Patel et al., 2021; Gupta et al., 2021). This creates a gap between training and testing and could lead to a drastic drop in performance (Section 4). On the other hand, KCal automatically satisfies p̂ ∈ Δ^{K−1}, and the normalization is consistent during training and testing. A disadvantage of KCal is the need to remember the Π(f(S_cal)) used to generate the KDE prediction.
This is, however, mitigated to a large extent by the dimension reduction step, which already reduces the computational overhead significantly [2]. For example, in one of our experiments on CIFAR-100, there are 160K scalars to remember (5K images, d = 32), which is only 0.2% of the parameters (85M+) of the original DNN (ViT-base-patch16). Moreover, KDE inference is trivial to parallelize on GPUs. There is also a rich, under-explored literature to further speed up the inference. Examples include KDE merging (Sodkomkham et al., 2016), Dual-Tree (Gray & Moore, 2003), and Kernel Herding (Chen et al., 2010). These methods can easily be used in conjunction with KCal.

4. EXPERIMENTS

4.1. DATA AND NEURAL NETWORKS

We utilize two sets of data: computer vision benchmarks on which previous calibration methods were tested, and health monitoring datasets where full calibration is crucial for diagnostic applications. Table 1 summarizes the datasets and their splits. Benchmark data: Following Kull et al. (2019), we use multiple image benchmark datasets, including CIFAR-10, CIFAR-100, and SVHN (Krizhevsky, 2009; Netzer et al., 2011). We reserve 10% of the training data as the calibration set. We fine-tune pretrained ViT (Dosovitskiy et al., 2021) and MLP-Mixer (Mixer) (Tolstikhin et al., 2021) from the timm library (Wightman, 2019). We chose ViT and Mixer because they are the state-of-the-art neural architectures in computer vision, and accuracy should come before calibration quality. We also included the ImageNet dataset (Deng et al., 2009) and use the pretrained Inception ResNet V2 (Szegedy et al., 2017) following Patel et al. (2021). Health monitoring data: We also use three health monitoring datasets for diagnostic tasks: IIIC (Jing et al., 2021), an ictal-interictal-injury-continuum (IIIC) patterns classification dataset; ISRUC (Khalighi et al., 2016), a sleep staging (classification) dataset using polysomnographic (PSG) recordings; and PN2017 (2017 PhysioNet Challenge) (Clifford et al., 2017; Goldberger et al., 2000), a public electrocardiogram (ECG) dataset for rhythm (particularly Atrial Fibrillation) classification. For the training set, we follow Hong et al. (2019); Jing et al. (2021) for PN2017 and IIIC, and use 69 patients' data for ISRUC. Of the remaining data, 5% is used as the calibration set and 95% for testing. We perform additional experiments after splitting into training/calibration/test sets by patient for IIIC and ISRUC [3], marked as the "pat" version in tables. The calibration/test split is 20/80 in "IIIC (pat)" and "ISRUC (pat)" because the number of patients is small.
For IIIC and ISRUC, we follow the standard practice and train a CNN (ResNet) on the spectrogram (Biswal et al., 2017; Ruffini et al., 2019; Yuan et al., 2019; Yang et al., 2022) . For PN2017, we used a top-performing model from the 2017 PhysioNet Challenge, MINA (Hong et al., 2019) . 

4.2. BASELINES METHODS

We compare KCal with multiple state-of-the-art calibration methods: Temperature Scaling (TS) (Guo et al., 2017), Dirichlet Calibration (DirCal) (Kull et al., 2019), Mutual-information-maximization-based Binning (I-Max) (Patel et al., 2021), Gaussian Process Calibration (GP) (Wenger et al., 2020), Intra Order-preserving Calibration (IOP) (Rahimi et al., 2020), Splines-based Calibration (Spline) (Gupta et al., 2021), Focal-loss-based calibration (Focal) (Mukhoti et al., 2020), and MMCE-based calibration (MMCE) (Kumar et al., 2018).

4.3. EVALUATION METRICS

We report standard evaluation metrics: accuracy, class-wise expected calibration error (CECE) (Kull et al., 2019; Patel et al., 2021; Nixon et al., 2019), expected calibration error (ECE) (Guo et al., 2017), and Brier score (Brier, 1950). CECE is typically used as a proxy for full calibration quality, because directly binning based on the entire vector p̂ requires exponentially (in K) many bins. Similar to Patel et al. (2021); Nixon et al. (2019), we ignore all predictions with very small probabilities (less than max{0.01, 1/K}). ECE, on the other hand, only measures confidence calibration (Def. 2). For both ECE and CECE, we use the "adaptive" version with an equal number of samples in each bin (20 bins), because this has been shown to measure calibration quality better than the equal-width version (Nixon et al., 2019). The Brier score can be viewed as the sum of a "calibration" term and a "refinement" term measuring how discriminative a model is (Kull & Flach, 2015); here we focus on the Brier score of the top class. We refer to Guo et al. (2017); Kull et al. (2019); Nixon et al. (2019) for further discussion of these metrics.
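For concreteness, the adaptive (equal-mass) ECE can be sketched as follows (our illustrative implementation of the general recipe, not the exact evaluation code used in the experiments):

```python
import numpy as np

def adaptive_ece(confidences, correct, n_bins=20):
    """Equal-mass ('adaptive') ECE: sort predictions by confidence, split them
    into bins holding equal numbers of samples (rather than equal width), then
    average |bin accuracy - bin mean confidence|, weighted by bin mass."""
    order = np.argsort(confidences)
    conf = np.asarray(confidences, dtype=float)[order]
    corr = np.asarray(correct, dtype=float)[order]
    ece = 0.0
    for idx in np.array_split(np.arange(len(conf)), n_bins):
        if len(idx) == 0:
            continue
        ece += (len(idx) / len(conf)) * abs(corr[idx].mean() - conf[idx].mean())
    return ece
```

CECE would apply the same binning per class over that class's predicted probabilities instead of over the top-class confidence.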

4.4. RESULTS

The results are presented in Tables 2, 3, 4 and 5. All experiments are repeated 10 times by reshuffling the calibration and test sets, and the standard deviations are reported. For ImageNet, we skipped Focal and MMCE because the base NN is given and these methods require training from scratch. Due to space constraints, we include ablation studies in the Appendix. In general, KCal has the best CECE, accuracy and Brier score, and is highly competitive in terms of ECE as well. Note that KCal is also the only method with a provable calibration guarantee. TS is effective in controlling overall ECE but shows little improvement on CECE over UnCal. DirCal often ranks high for calibration quality but tends to decrease accuracy as K increases. DirCal's performance also comes at a higher cost: every experiment requires training hundreds of models with SGD and taking the best ensemble, accounting for most of the experimental computation cost in this paper. The amount of tuning suggested for good performance indicates sensitivity to the choice of hyper-parameters, which we have indeed observed to be the case. Spline, IOP and GP are similar to DirCal on the vision datasets, but generally perform worse on the healthcare datasets. In Patel et al. (2021), I-Max lowers ECE and CECE significantly. However, it has a critical issue: it does not produce a valid probability vector [4]. Once normalized, as reported in our experiments, its performance worsens. Since calibrating all the classes simultaneously is the distinguishing challenge in multi-class classification, we interpret the observation as follows: if the normalization constraint is removed, the "optimization problem" (lowering calibration error) becomes much simpler, but the results are invalid and hence unusable as probability vectors. Spline also requires a re-normalization step, but its performance stays consistent. Focal is worse than UnCal in many experiments.
While calibration performance may improve by combining Focal with other methods, the drop in accuracy is harder to overcome [5]. We also observed that for the healthcare datasets, being able to tune on a different set of patients boosts performance significantly. This is reflected in the accuracy gains for DirCal and KCal, and suggests that the embeddings/logits are quite transferable, but the prediction criterion itself can vary from patient to patient. Finally, we summarize the rankings across all datasets in Table 6. It is clear that KCal consistently improves calibration quality for all classes and maintains or improves accuracy. Even looking only at the confidence prediction (Brier or ECE), KCal remains highly competitive.

4.5. CASE STUDY FOR SEIZURE PREDICTION

We show reliability diagrams (Kull et al., 2019; Guo et al., 2017) on the IIIC dataset in Figure 2 to illustrate the importance of full calibration. We include both the predicted class (confidence calibration) and Seizure. More reliability diagrams can be found in the Appendix; the results are consistent across all classes. The uncalibrated predictions have large gaps for both confidence and Seizure. Most baselines calibrate the confidence well, but fail to calibrate the output for the rare class Seizure. KCal, on the other hand, achieves the most consistent results. We note again that since all competing classes must be considered together in any clinical decision, full calibration is indispensable in medical applications.

5. CONCLUSION

This paper proposed KCal, a learned-kernel-based calibration method for deep learning models. KCal consists of a supervised dimensionality-reduction step on the penultimate-layer neural network embedding to improve efficiency, followed by a KDE classifier built on the calibration set in this new metric space. As a natural consequence of the construction, KCal provides a calibrated probability vector prediction for all classes. Unlike most existing calibration methods, KCal is also provably asymptotically fully calibrated, with finite-sample error bounds. We also showed empirically that it outperforms existing state-of-the-art calibration methods in terms of accuracy and calibration quality. Moreover, KCal is more robust to distributional shift, which is common in high-risk applications such as healthcare, where calibration is far more crucial. The major limitation of KCal is the need to store the entire calibration set, an overhead that the dimension-reduction step keeps small and that could be reduced further.



[1] That is, one could understand how the prediction is made by examining similar samples.
[2] Experiments about the effect of d on performance and overhead are provided in the Appendix.
[3] PN2017 did not provide patient IDs, so we cannot split by patient.
[4] It generates a vector whose sum ranges from 0.4 to 2.0 in our experiments. The range is wider for a larger K.
[5] In PN2017, rare classes are oversampled during training (Hong et al., 2019). While this did not cause issues for other calibration methods, the distributional shift at test time seems catastrophic for Focal.



Figure 1: Reliability diagrams for confidence calibration (top) and Seizure (bottom). The popular temperature scaling (right) only calibrates the confidence, leaving Seizure poorly calibrated. See Figure 2 and the Appendix for complete reliability diagrams.

Now, replacing f_k with the kernel density estimate f̂_k(x_0) := (Σ_{(x,y)∈S^k_cal} φ_b(x, x_0)) / |S^k_cal|, and the class proportion π_k with π̂_k := |S^k_cal| / |S_cal|, we get back Eq. (3).


Next, we bound the full calibration error with additional standard assumptions. More specifically, we use and build upon the main uniform convergence result for classical KDE presented in Jiang (2017) to obtain Theorem 3.4: Theorem 3.4. Assume f_{Π∘f,k} is α-Hölder continuous and bounded away from 0 for every k. Then, for an admissible b(m) with shrinkage rate Θ((log m / m)^{1/(d+2α)}), the uniform calibration-error bound stated in Section 3.3 holds.

Figure 2: Reliability diagrams for the predicted class (top) and Seizure (bottom) in IIIC. All methods calibrate confidence well, but only KCal achieves reasonable calibration quality for Seizure.

Dataset summary: Splits and number of classes (K).

Accuracy in % (↑ means higher=better). Accuracy numbers lower than the uncalibrated predictions are in dark red and the best are in bold (both at p=0.01). KCal typically improves or maintains the accuracy.



ECE in 10^{−2} (↓ means lower=better). The best accuracy-preserving method is in bold (p=0.01). The lowest but not accuracy-preserving number is underscored. KCal is usually on par with or better than the best baseline.

Brier Score in 10^{−2} (↓ means lower=better). The best accuracy-preserving methods are in bold (p=0.01). The lowest but not accuracy-preserving number is underscored.

Ranks for different evaluation metrics. The best rank is underscored. In general, KCal consistently outperforms baselines on Accuracy, CECE and Brier, and the difference between most methods on ECE is small.

ACKNOWLEDGMENTS

This work was supported by NSF awards SCH-2205289, SCH-2014438, and IIS-1838042, and NIH award 1R01NS107291-01.

