ESD: EXPECTED SQUARED DIFFERENCE AS A TUNING-FREE TRAINABLE CALIBRATION MEASURE

Abstract

Studies have shown that modern neural networks tend to be poorly calibrated due to over-confident predictions. Traditionally, post-processing methods have been used to calibrate the model after training. In recent years, various trainable calibration measures have been proposed to incorporate calibration directly into the training process. However, these methods all contain internal hyperparameters, and the performance of these calibration objectives relies on tuning them, incurring ever greater computational costs as neural networks and datasets grow larger. As such, we present Expected Squared Difference (ESD), a tuning-free (i.e., hyperparameter-free) trainable calibration objective loss that views the calibration error as the squared difference between two expectations. With extensive experiments on several architectures (CNNs, Transformers) and datasets, we demonstrate that (1) incorporating ESD into training improves model calibration in various batch-size settings without the need for internal hyperparameter tuning, (2) ESD yields the best-calibrated results compared with previous approaches, and (3) ESD drastically reduces the computational cost of calibration during training due to the absence of internal hyperparameters.

1. INTRODUCTION

The calibration of a neural network measures the extent to which its predictions align with the true probability distribution. This property becomes especially important in real-world applications such as identification (Kim & Yoo, 2017; Yoon et al., 2022), autonomous driving (Bojarski et al., 2016; Ko et al., 2017), and medical diagnosis (Kocbek et al., 2020; Pham et al., 2022), where uncertainty-based decisions of the neural network are crucial to guarantee the safety of the users. However, despite the success of modern neural networks in accurate classification, they have been shown to be poorly calibrated, owing to the tendency of the network to make high-confidence predictions regardless of the input (i.e., over-confident predictions) (Guo et al., 2017).

Traditionally, post-processing methods such as temperature scaling and vector scaling (Guo et al., 2017) have been used to calibrate the model after training by adjusting the logits before the final softmax layer using the validation set. More recently, various trainable calibration objectives have been proposed, such as MMCE (Kumar et al., 2018) and SB-ECE (Karandikar et al., 2021), which are added to the loss function as regularizers to jointly optimize accuracy and calibration during training. A key advantage of calibration during training is that post-processing calibration methods can still be cascaded after training to achieve even better-calibrated models. Unfortunately, these existing approaches introduce additional hyperparameters in their proposed calibration objectives, and the performance of the calibration objectives is highly sensitive to these design choices. These hyperparameters must therefore be tuned carefully on a per-model, per-dataset basis, which greatly reduces their viability for training on large models and datasets. To this end, we propose Expected Squared Difference (ESD), a trainable calibration objective loss that is hyperparameter-free.
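To make the post-processing baseline concrete, below is a minimal sketch of temperature scaling: a single scalar T, fitted on held-out validation logits, divides the logits before the softmax. The function names and the grid-search fitting (in place of the gradient-based fitting usually used in practice) are our own illustrative choices, not the exact procedure of Guo et al. (2017).

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T (numerically stabilized)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of integer labels under temperature-scaled probs."""
    probs = softmax(logits, T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature minimizing validation NLL via a simple grid search."""
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

For over-confident logits (e.g., well-calibrated logits scaled up by a constant factor), the fitted T should come out above 1, shrinking the logits back toward calibrated probabilities; accuracy is unchanged because dividing by T preserves the argmax.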
ESD is inspired by the KS-Error (Gupta et al., 2021), and it views the calibration error from the perspective of the difference between two expectations. In detail, our contributions can be summarized as follows:

• We propose ESD, a trainable calibration objective loss that can be jointly optimized with the negative log-likelihood (NLL) loss during training. ESD is a binning-free calibration objective loss, and no additional hyperparameters are required. We also provide an unbiased and consistent estimator of the Expected Squared Difference and show that it can be utilized in small-batch training settings.

• With extensive experiments, we demonstrate that across various architectures (CNNs & Transformers) and datasets (in the vision & NLP domains), ESD provides the best calibration results compared to previous approaches. The calibration of these models is further improved by post-processing methods.

• We show that, due to the absence of an internal hyperparameter that needs to be tuned, ESD offers a drastic improvement over previous calibration objective losses with regard to the total computational cost of training. The discrepancy in computational cost between ESD and tuning-required calibration objective losses grows as model complexity and dataset size increase.

3. PROBLEM SETUP

3.1. CALIBRATION ERROR AND METRIC

Let us first consider an arbitrary neural network f_θ : D → [0, 1]^C with network parameters θ, where D is the input domain and C is the number of classes in the multiclass classification task. Furthermore, we assume that the training data (x_i, y_i), i = 1, ..., n, are sampled i.i.d. from the joint distribution P(X, Y) (here we use a one-hot vector for y). Here, Y = (Y_1, ..., Y_C), and y = (y_1, ..., y_C) is a sample from this distribution. We can further define a multivariate random variable Z = f_θ(X) as the distribution of the outputs of the neural network. Similarly, Z = (Z_1, ..., Z_C), and z = (z_1, ..., z_C) is a sample from this distribution. We use (z_{K,i}, y_{K,i}) to denote the output confidence and the one-hot vector element associated with the K-th class of the i-th training sample. Using this formulation, a neural network is said to be perfectly calibrated for class K if and only if

E[Y_K | Z_K = z_K] = z_K,    ∀ z_K ∈ [0, 1].

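A standard way to estimate how far a model is from this classwise condition is a binned comparison of the empirical frequency of class K against the mean confidence within each confidence bin, as in the expected calibration error (ECE) family of metrics. The sketch below is a generic illustration of the definition above, not the ESD estimator; the helper name and the default bin count are our own choices.

```python
import numpy as np

def classwise_calibration_gap(z_k, y_k, n_bins=10):
    """Binned estimate of the classwise calibration error for class K:
    within each confidence bin, compare the empirical frequency of the class
    (an estimate of E[Y_K | Z_K]) with the mean confidence, then take the
    bin-weight-averaged absolute gap."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gap, n = 0.0, len(z_k)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right endpoint only in the last bin
        mask = (z_k >= lo) & ((z_k < hi) if hi < 1.0 else (z_k <= hi))
        if mask.any():
            gap += mask.sum() / n * abs(y_k[mask].mean() - z_k[mask].mean())
    return gap
```

For a perfectly calibrated model (Y_K ~ Bernoulli(Z_K)) this gap shrinks toward zero as the sample grows, while any systematic deviation of E[Y_K | Z_K] from Z_K leaves a persistent gap. Note that n_bins itself is a design choice, which motivates binning-free measures such as ESD.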
Our work focuses on trainable calibration methods, which train neural networks using a hybrid objective that combines a primary training loss with an auxiliary calibration objective loss. One popular objective is the Maximum Mean Calibration Error (MMCE) (Kumar et al., 2018), a kernel embedding-based measure of calibration that is differentiable and therefore suitable as a calibration loss. Moreover, Karandikar et al. (2021) propose two trainable calibration objective losses, SB-ECE and S-AvUC, which soften previously defined calibration measures.
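To make the role of internal hyperparameters concrete, here is a hedged sketch of an MMCE-style penalty in its binary-correctness form: calibration residuals (correctness minus confidence) are paired under a Laplacian kernel over confidences. The kernel width (often set around 0.4, exposed here as a keyword argument) is exactly the kind of internal design choice that must be tuned per model and dataset; this is an illustrative reading of the measure, not the authors' exact training implementation.

```python
import numpy as np

def mmce(confidences, correct, kernel_width=0.4):
    """Kernel embedding-based calibration measure in the spirit of MMCE
    (Kumar et al., 2018). `correct` is a boolean/0-1 array of whether the
    prediction was right; `kernel_width` is the internal hyperparameter."""
    d = correct.astype(float) - confidences                      # residuals c_i - r_i
    # Laplacian kernel over all pairs of confidences
    k = np.exp(-np.abs(confidences[:, None] - confidences[None, :]) / kernel_width)
    m = len(confidences)
    # squared MMCE = (1/m^2) * sum_{i,j} d_i d_j k(r_i, r_j); clamp for safety
    return np.sqrt(max((d[:, None] * d[None, :] * k).sum() / m**2, 0.0))
```

On well-calibrated predictions the residuals are mean-zero given the confidence and the measure stays small, whereas systematic over-confidence inflates it; but the value (and its gradient during training) depends on the kernel width, which is the tuning burden ESD is designed to remove.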

