ESD: EXPECTED SQUARED DIFFERENCE AS A TUNING-FREE TRAINABLE CALIBRATION MEASURE

Abstract

Studies have shown that modern neural networks tend to be poorly calibrated due to over-confident predictions. Traditionally, post-processing methods have been used to calibrate the model after training. In recent years, various trainable calibration measures have been proposed to incorporate calibration directly into the training process. However, these methods all incorporate internal hyperparameters, and the performance of these calibration objectives relies on tuning them, incurring ever greater computational costs as neural networks and datasets grow larger. As such, we present Expected Squared Difference (ESD), a tuning-free (i.e., hyperparameter-free) trainable calibration objective loss, which views the calibration error from the perspective of the squared difference between two expectations. With extensive experiments on several architectures (CNNs, Transformers) and datasets, we demonstrate that (1) incorporating ESD into training improves model calibration in various batch-size settings without the need for internal hyperparameter tuning, (2) ESD yields the best-calibrated results compared with previous approaches, and (3) ESD drastically reduces the computational cost of calibration during training due to the absence of internal hyperparameters.

1. INTRODUCTION

The calibration of a neural network measures the extent to which its predictions align with the true probability distribution. Possessing this property becomes especially important in real-world applications, such as identification (Kim & Yoo, 2017; Yoon et al., 2022), autonomous driving (Bojarski et al., 2016; Ko et al., 2017), and medical diagnosis (Kocbek et al., 2020; Pham et al., 2022), where uncertainty-based decisions of the neural network are crucial to guarantee the safety of the users. However, despite the success of modern neural networks in accurate classification, they are shown to be poorly calibrated due to the tendency of the network to make predictions with high confidence regardless of the input (i.e., over-confident predictions) (Guo et al., 2017). Traditionally, post-processing methods such as temperature scaling and vector scaling (Guo et al., 2017) have been used to calibrate the model on the validation set after training by adjusting the logits before the final softmax layer. Various trainable calibration objectives have been proposed recently, such as MMCE (Kumar et al., 2018) and SB-ECE (Karandikar et al., 2021), which are added to the loss function as a regularizer to jointly optimize accuracy and calibration during training. A key advantage of calibration during training is that post-processing calibration methods can be cascaded after training to achieve even better-calibrated models. Unfortunately, these existing approaches introduce additional hyperparameters in their proposed calibration objectives, and the performance of the calibration objectives is highly sensitive to these design choices. These hyperparameters therefore need to be tuned carefully on a per-model, per-dataset basis, which greatly reduces their viability for training on large models and datasets. To this end, we propose Expected Squared Difference (ESD), a trainable calibration objective loss that is hyperparameter-free.
ESD is inspired by the KS-Error (Gupta et al., 2021), and it views the calibration error from the perspective of the difference between two expectations. In detail, our contributions can be summarized as follows:

• We propose ESD as a trainable calibration objective loss that can be jointly optimized with the negative log-likelihood (NLL) loss during training. ESD is a binning-free calibration objective loss, and no additional hyperparameters are required. We also provide an unbiased and consistent estimator of the Expected Squared Difference and show that it can be utilized in small-batch training settings.

• With extensive experiments, we demonstrate that across various architectures (CNNs & Transformers) and datasets (in the vision & NLP domains), ESD provides the best calibration results when compared to previous approaches. The calibration of these models is further improved by post-processing methods.

• We show that, due to the absence of an internal hyperparameter in ESD that needs to be tuned, it offers a drastic improvement over previous calibration objective losses with regard to the total computational cost of training. The discrepancy in computational cost between ESD and tuning-required calibration objective losses becomes larger as model complexity and dataset size increase.

2. RELATED WORK

Calibration of neural networks has gained much attention following the observation from Guo et al. (2017) that modern neural networks are poorly calibrated. One way to achieve better calibration is to design neural network architectures tailored for uncertainty estimation, e.g., Bayesian Neural Networks (Blundell et al., 2015; Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017). Besides model design, post-processing calibration strategies have been widely used to calibrate a trained machine learning model using a hold-out validation dataset. Examples include temperature scaling (Guo et al., 2017), which scales the logit output of a classifier with a temperature parameter; Platt scaling (Platt, 1999), which fits a logistic regression model on top of the logits; and conformal prediction (Vovk et al., 2005; Lei et al., 2018), which uses a validation set to estimate the quantiles of a given scoring function. Other post-processing techniques include histogram binning (Zadrozny & Elkan, 2001), isotonic regression (Zadrozny & Elkan, 2002), and Bayesian binning into quantiles (Pakdaman Naeini et al., 2015). Our work focuses on trainable calibration methods, which train neural networks using a hybrid objective combining a primary training loss with an auxiliary calibration objective loss. In this regard, one popular objective is Maximum Mean Calibration Error (MMCE) (Kumar et al., 2018), a kernel embedding-based measure of calibration that is differentiable and, therefore, suitable as a calibration loss. Moreover, Karandikar et al. (2021) propose the trainable calibration objective losses SB-ECE and S-AvUC, which soften previously defined calibration measures.

3.1. CALIBRATION ERROR AND METRIC

Let us first consider an arbitrary neural network f_\theta : D \to [0, 1]^C with network parameters \theta, where D is the input domain and C is the number of classes in the multiclass classification task. Furthermore, we assume that the training data (x_i, y_i)_{i=1}^{n} are sampled i.i.d. from the joint distribution P(X, Y) (here we use a one-hot vector for y), with Y = (Y_1, \ldots, Y_C) and samples y = (y_1, \ldots, y_C) from this distribution. We can further define a multivariate random variable Z = f_\theta(X) as the distribution of the outputs of the neural network; similarly, Z = (Z_1, \ldots, Z_C), and z = (z_1, \ldots, z_C) are samples from this distribution. We use (z_{K,i}, y_{K,i}) to denote the output confidence and the one-hot vector element associated with the K-th class of the i-th training sample. Using this formulation, a neural network is said to be perfectly calibrated for class K if and only if

P(Y_K = 1 \mid Z_K = z_K) = z_K. \qquad (1)

Intuitively, Eq. (1) requires the model accuracy for class K to be z_K on average for the inputs where the neural network produces a prediction of class K with confidence z_K. Research on calibration mainly focuses on the calibration of the max output class that the model predicts; thus, calibration error is normally reported with respect to the predicted class (i.e., the max output class) only. As such, k will denote the class with maximum output probability. Furthermore, we write I(\cdot) for the indicator function, which returns one if the Boolean expression is true and zero otherwise. With these notations, one common measurement of calibration error is the difference between confidence and accuracy, mathematically represented as (Guo et al., 2017)

E_{Z_k}\bigl[ \, | P(Y_k = 1 \mid Z_k) - Z_k | \, \bigr]. \qquad (2)
To estimate this, the Expected Calibration Error (ECE) (Naeini et al., 2015) uses B bins with disjoint intervals B_j = (\tfrac{j}{B}, \tfrac{j+1}{B}], j = 0, 1, \ldots, B-1, to compute the calibration error as follows:

\mathrm{ECE} = \frac{1}{|D|} \sum_{j=0}^{B-1} \Bigl| \sum_{i=1}^{N} I\bigl(\tfrac{j}{B} < z_{k,i} \le \tfrac{j+1}{B}\bigr) \bigl( I(y_{k,i} = 1) - z_{k,i} \bigr) \Bigr|. \qquad (3)
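For concreteness, the binned estimator in Eq. (3) can be written in a few lines. The sketch below is our own illustration in numpy (variable names are ours), not the paper's implementation:

```python
import numpy as np

def ece(confidence, correct, n_bins=15):
    """Binned ECE in the style of Eq. (3).

    confidence: max-softmax confidences z_{k,i} of the predicted class.
    correct:    0/1 indicators I(y_{k,i} = 1), i.e., prediction correctness.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    total = 0.0
    for j in range(n_bins):
        # Disjoint interval (j/B, (j+1)/B].
        in_bin = (confidence > j / n_bins) & (confidence <= (j + 1) / n_bins)
        if in_bin.any():
            # Absolute gap between summed correctness and summed confidence in the bin.
            total += abs(np.sum(correct[in_bin] - confidence[in_bin]))
    return total / confidence.size
```

A perfectly calibrated bin (e.g., two samples at confidence 0.5, one correct and one wrong) contributes zero to the sum.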

3.2. CALIBRATION DURING TRAINING

Post-processing methods and calibration during training are the two primary approaches for calibrating a neural network. Our focus in this paper is on the latter, where our goal is to train a calibrated yet accurate classifier directly. Note that calibration and predictive accuracy are independent properties of a classifier: being calibrated does not imply that the classifier has good accuracy, and vice versa. Thus, training a classifier to have both high accuracy and good calibration requires jointly optimizing a calibration objective loss alongside the negative log-likelihood (NLL) with a scaling parameter \lambda for the secondary objective:

\min_\theta \; \mathrm{NLL}(D, \theta) + \lambda \cdot \mathrm{CalibrationObjective}(D, \theta). \qquad (4)

3.2.1. EXISTING TRAINABLE CALIBRATION OBJECTIVES NEED TUNING

Kumar et al. (2018) and Karandikar et al. (2021) suggested that the disjoint bins in ECE introduce discontinuities that are problematic when using it as a calibration loss during training. Therefore, additional parameters were introduced for the purpose of calibration during training. For example, Kumar et al. (2018) proposed Maximum Mean Calibration Error (MMCE), which utilizes a Laplacian kernel instead of the disjoint bins in ECE. Furthermore, Karandikar et al. (2021) proposed soft-binned ECE (SB-ECE) and soft AvUC (S-AvUC), softened versions of the ECE metric and the AvUC loss (Krishnan & Tickoo, 2020), respectively. All these approaches address the issue of discontinuity, making the training objective differentiable. However, they all introduce additional design choices: MMCE requires a careful selection of the kernel width \phi, SB-ECE needs a choice of the number of bins M and a softening parameter T, and S-AvUC requires a user-specified entropy threshold \kappa in addition to the softening parameter T. Searching for the optimal hyperparameters can be computationally expensive, especially as models and datasets become larger.

3.2.2. CALIBRATION DURING TRAINING SUFFERS FROM OVER-FITTING PROBLEM

NLL loss is known to implicitly train for calibration since it is a proper scoring rule, yet models trained with NLL can overfit in terms of calibration error (Mukhoti et al., 2020; Karandikar et al., 2021). Figure 1 provides an example of an accuracy curve and its corresponding ECE curve for a model trained with NLL loss. We see that the model overfits in terms of both accuracy and ECE, which causes the gap between train and test ECE to grow during training. As such, adding a calibration objective computed on the same data to the total loss will not improve model calibration, as it mainly improves the calibration of the model with respect to the training data only. Kumar et al. (2018) therefore proposed to split off a small portion of the training data and dedicate it to optimizing the calibration objective loss. We follow this strategy and introduce the technical details in later sections.

4. ESD: EXPECTED SQUARED DIFFERENCE

We propose Expected Squared Difference (ESD) as a tuning-free (i.e., hyperparameter-free) calibration objective. Our approach is inspired by the viewpoint of calibration error as a measure of distance between two distributions, which yields a binning-free calibration metric (Gupta et al., 2021). Under this view, there is no need to employ kernels or softening operations to handle the bins within calibration metrics such as ECE in order to make them suitable for training. In particular, we consider calibration error as the difference between two expectations. We start from the definition of perfect calibration in Eq. (1):

P(Y_k = 1 \mid Z_k = z_k) = z_k \quad \forall z_k \in [0, 1]
\;\Leftrightarrow\; P(Y_k = 1, Z_k = z_k) = z_k \, P(Z_k = z_k) \quad \forall z_k \in [0, 1] \quad \text{(by Bayes' rule)}. \qquad (5)

Now, accumulating the terms on both sides up to an arbitrary confidence level \alpha \in [0, 1], perfect calibration for an arbitrary class k can be written as:

\int_0^\alpha P(Y_k = 1, Z_k = z_k) \, dz_k = \int_0^\alpha z_k \, P(Z_k = z_k) \, dz_k
\;\Leftrightarrow\; E_{Z_k, Y_k}[I(Z_k \le \alpha, Y_k = 1)] = E_{Z_k, Y_k}[Z_k \, I(Z_k \le \alpha)], \quad \forall \alpha \in [0, 1]. \qquad (6)

This allows us to write the difference between the two expectations as

d_k(\alpha) = \Bigl| \int_0^\alpha P(Y_k = 1, Z_k = z_k) - z_k \, P(Z_k = z_k) \, dz_k \Bigr| = \bigl| E_{Z_k, Y_k}[I(Z_k \le \alpha)(I(Y_k = 1) - Z_k)] \bigr|, \qquad (7)

where d_k(\alpha) = 0 \;\forall \alpha \in [0, 1] if and only if the model is perfectly calibrated for class k. Since this has to hold for all \alpha \in [0, 1], we propose ESD as the expected squared difference between the two expectations:

E_{Z_k'}[d_k(Z_k')^2] = E_{Z_k'}\bigl[ E^2_{Z_k, Y_k}[I(Z_k \le Z_k')(I(Y_k = 1) - Z_k)] \bigr]. \qquad (8)

Due to the close relationship between d_k(\alpha) and the calibration of a neural network, the difference between an uncalibrated and a calibrated neural network can be clearly observed using E_{Z_k'}[d_k(Z_k')^2], as visually shown in Appendix A. Since ESD = 0 if and only if the model is perfectly calibrated, as shown in the following theorem, this metric is a good measure of calibration.

Theorem 1. E_{Z_k'}[E^2_{Z_k, Y_k}[I(Z_k \le Z_k')(I(Y_k = 1) - Z_k)]] = 0 if and only if the model is perfectly calibrated.

Proof. Since d_k(Z_k')^2 = E^2_{Z_k, Y_k}[I(Z_k \le Z_k')(I(Y_k = 1) - Z_k)] is a non-negative random variable induced by Z_k', we have E_{Z_k'}[d_k(Z_k')^2] = 0 if and only if P(d_k(Z_k') = 0) = 1. Furthermore, P(d_k(Z_k') = 0) = 1 \Leftrightarrow d_k(\alpha) = 0 \;\forall \alpha \in I, where I is the support set of Z_k'. Thus, E_{Z_k'}[d_k(Z_k')^2] = 0 if and only if d_k(\alpha) = 0 \;\forall \alpha \in I. From Lemma 1.1 below, d_k(\alpha) = 0 \;\forall \alpha \in I if and only if d_k(\alpha) = 0 \;\forall \alpha \in [0, 1]. Consequently, E_{Z_k'}[E^2_{Z_k, Y_k}[I(Z_k \le Z_k')(I(Y_k = 1) - Z_k)]] = 0 if and only if the model is perfectly calibrated. □

Lemma 1.1. Let I be the support set of the random variable Z_k. Then d_k(\alpha) = 0 \;\forall \alpha \in I if and only if d_k(\alpha) = 0 \;\forall \alpha \in [0, 1].

Proof. The backward direction is straightforward, so we only prove the forward direction: if d_k(\alpha) = 0 \;\forall \alpha \in I, then d_k(\alpha) = 0 \;\forall \alpha \in [0, 1]. For an arbitrary \alpha' \in I^c, let \alpha = \arg\min_{a \in I, \, a < \alpha'} |a - \alpha'|. We then have

d_k(\alpha') = \Bigl| \int_0^{\alpha'} P(Y_k = 1, Z_k = z_k) - z_k \, P(Z_k = z_k) \, dz_k \Bigr|
= \Bigl| \int_0^{\alpha'} \bigl( P(Y_k = 1 \mid Z_k = z_k) - z_k \bigr) P(Z_k = z_k) \, dz_k \Bigr|
= \Bigl| \int_0^{\alpha} \bigl( P(Y_k = 1 \mid Z_k = z_k) - z_k \bigr) P(Z_k = z_k) \, dz_k + \int_\alpha^{\alpha'} \bigl( P(Y_k = 1 \mid Z_k = z_k) - z_k \bigr) P(Z_k = z_k) \, dz_k \Bigr|
= \Bigl| \int_0^{\alpha} \bigl( P(Y_k = 1 \mid Z_k = z_k) - z_k \bigr) P(Z_k = z_k) \, dz_k \Bigr| = 0,

where the second integral vanishes because P(Z_k = z_k) = 0 on (\alpha, \alpha') \subseteq I^c, and the remaining term equals d_k(\alpha) = 0 since \alpha \in I. □

4.1. AN ESTIMATOR FOR ESD

In this section, we use (z_{k,i}, y_{k,i}) to denote the output confidence and the one-hot vector element associated with the k-th class of the i-th training sample, respectively. As the expectations in the true Expected Squared Difference (Eq. (8)) are intractable, we propose an unbiased Monte Carlo estimator for it. A common approach is naive Monte Carlo sampling with respect to both the inner and outer expectations:

E_{Z_k'}\bigl[ E^2_{Z_k, Y_k}[I(Z_k \le Z_k')(I(Y_k = 1) - Z_k)] \bigr] \approx \frac{1}{N} \sum_{i=1}^{N} \bar{g}_i^2, \qquad (9)

where \bar{g}_i = \frac{1}{N-1} \sum_{j=1, j \ne i}^{N} g_{ij} and g_{ij} = I(z_{k,j} \le z_{k,i}) \bigl( I(y_{k,j} = 1) - z_{k,j} \bigr). However, Eq. (9) results in a biased estimator that is an upper bound of the true Expected Squared Difference. To account for the bias, we propose the following unbiased and consistent estimator (proof in Appendix B) of the true Expected Squared Difference (1):

\mathrm{ESD} = \frac{1}{N} \sum_{i=1}^{N} \Bigl( \bar{g}_i^2 - \frac{S^2_{g_i}}{N-1} \Bigr), \quad \text{where} \quad S^2_{g_i} = \frac{1}{N-2} \sum_{j=1, j \ne i}^{N} (g_{ij} - \bar{g}_i)^2. \qquad (10)
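The two estimators are straightforward to compute directly from their definitions. The following naive O(N^2) numpy sketch (our own naming; the paper's vectorized PyTorch version appears in Appendix C) evaluates both the biased plug-in estimate of Eq. (9) and the debiased ESD of Eq. (10), assuming a batch of at least three samples:

```python
import numpy as np

def esd_estimators(z, y):
    """Naive computation of the biased (Eq. 9) and debiased (Eq. 10) estimates.

    z: confidences z_{k,i} of the predicted class.
    y: 0/1 indicators I(y_{k,i} = 1) (prediction correctness).
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    n = z.size
    biased, unbiased = 0.0, 0.0
    for i in range(n):
        # g_ij = I(z_j <= z_i) * (I(y_j = 1) - z_j), over all j != i.
        g = np.array([(z[j] <= z[i]) * (y[j] - z[j]) for j in range(n) if j != i])
        g_bar = g.mean()    # inner expectation estimate \bar{g}_i
        # ddof=1 over the N-1 samples gives exactly the 1/(N-2) normalizer of S^2_{g_i}.
        s2 = g.var(ddof=1)
        biased += g_bar ** 2
        unbiased += g_bar ** 2 - s2 / (n - 1)
    return biased / n, unbiased / n
```

Note that on a finite sample the debiased estimate can be slightly negative even though the true quantity is non-negative; this is expected behavior for an unbiased estimator of a non-negative quantity.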

4.2. INTERLEAVED TRAINING

Negative log-likelihood (NLL) has been shown to greatly overfit to the ECE of the data it is trained on. Thus, training for calibration on the same data used for the NLL has limited effect on reducing the calibration error of the model. Karandikar et al. (2021) proposed interleaved training, where the train set is split into two subsets: one is used to optimize the NLL and the other to optimize the calibration objective. Following this framework, let D_{train} denote the entire train set, which we separate into two subsets D'_{train} and D'_{cal}. The joint training of ESD with NLL becomes

\min_\theta \; \mathrm{NLL}(D'_{train}, \theta) + \lambda \cdot \mathrm{ESD}(D'_{cal}, \theta).

With this training scheme, NLL is optimized on D'_{train} and ESD is optimized on D'_{cal}. This way, as Karandikar et al. (2021) showed, we avoid minimizing a calibration error that has already overfit to the train set.
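Concretely, the interleaved scheme amounts to partitioning the train set once before training and drawing the two loss terms' batches from disjoint subsets. A minimal pure-Python sketch (the function name and the `cal_fraction` default are our own illustrative choices, not values from the paper):

```python
import random

def interleaved_split(train_set, cal_fraction=0.1, seed=0):
    """Partition D_train into disjoint D'_train (for NLL) and D'_cal (for the
    calibration objective). cal_fraction is a hypothetical choice."""
    idx = list(range(len(train_set)))
    random.Random(seed).shuffle(idx)
    n_cal = int(len(idx) * cal_fraction)
    cal_idx, train_idx = idx[:n_cal], idx[n_cal:]
    d_train = [train_set[i] for i in train_idx]
    d_cal = [train_set[i] for i in cal_idx]
    return d_train, d_cal

# Per training step, the two subsets then drive the two terms of the objective:
#   loss = nll(model, batch_from(d_train)) + lam * esd(model, batch_from(d_cal))
```

The split is performed once so that the calibration term never sees the samples the NLL is fitting.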

5.1. DATASETS AND MODELS

Image Classification. For image classification tasks we use the following datasets:

• MNIST (Deng, 2012): a 54,000/6,000/10,000-image train/validation/test split was used. Images were resized to 32×32 before being input to the network.

• CIFAR10 & CIFAR100 (Krizhevsky et al., a;b): a 45,000/5,000/10,000-image train/validation/test split was used. We applied random cropping to 32×32 with padding of 4 and normalized each RGB channel with a mean of 0.5 and a standard deviation of 0.5.

• ImageNet100 (Deng et al., 2009).

For the image classification tasks we use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 10^{-3} and weight decay of 10^{-2} for 250 epochs, except for ImageNet100, where we used a weight decay of 10^{-4} for 90 epochs. For the NLI tasks, we use the AdamW optimizer with a learning rate of 10^{-5} and weight decay of 10^{-2} for 15 epochs. For both tasks, we use a batch size of 512. The internal hyperparameters within MMCE (ϕ) and SB-ECE (M, T) were sequentially optimized following the search for the optimal λ. Following the original experimental setting of SB-ECE by Karandikar et al. (2021), we fix the hyperparameter M to 15. Similar to the model selection criterion of Karandikar et al. (2021), we consider the accuracy and ECE of all hyperparameter configurations in the grid and choose the configuration with the lowest ECE that gives up less than 1.5% accuracy relative to the baseline on the validation set. All experiments were run on NVIDIA Quadro RTX 8000 and NVIDIA RTX A6000 GPUs.

6. EXPERIMENTAL RESULT

In Table 1, we report the accuracy and ECE of the models before and after post-processing (i.e., temperature scaling and vector scaling) for various datasets and models. Compared to the baseline and to the other trainable calibration objective losses (MMCE and SB-ECE), jointly training with ESD as the secondary loss consistently results in a better-calibrated network for all datasets and models, with around 1% degradation in accuracy. Moreover, we observe that applying post-processing (i.e., temperature scaling (TS) and vector scaling (VS)) after training with a calibration objective loss generally results in better-calibrated models than post-processing the baseline alone, in which case ESD still outperforms the other methods. Comparing temperature scaling with vector scaling for models trained with ESD, vector scaling performed comparably if not better as a post-processing method, except for ANLI, with minimal impact on accuracy for all datasets. Additionally, Ovadia et al. (2019) demonstrated that post-processing methods such as temperature scaling can lose effectiveness under distribution shift; we examine this setting in Appendix F. We show in Figure 2 that across different models and datasets the calibration performance of MMCE and SB-ECE is sensitive to their internal hyperparameters, underscoring the importance of hyperparameter tuning in these methods. ESD, on the other hand, not only outperforms MMCE and SB-ECE in terms of ECE but also has no internal hyperparameter to tune.

7.1. COMPUTATIONAL COST OF HYPERPARAMETER SEARCH

We investigate the computational cost required to train a model with ESD compared to MMCE and SB-ECE. As depicted in Figure 3, the computational cost of a single training run is nearly identical across the methods for different models and datasets. However, once the need to tune the additional hyperparameters within MMCE and SB-ECE is taken into account, the discrepancy in total computational cost between ESD and tuning-required calibration objective losses becomes more prominent as model complexity and dataset size increase.
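As a back-of-the-envelope illustration, the number of training runs under sequential tuning can be tallied from the search grids given in the paper's footnotes; treating every run as equally expensive is an assumption (Figure 3 suggests per-run costs are nearly identical):

```python
# Search grids from the paper's footnotes.
lam_grid = [0.2, 0.4, 0.6, 0.8, 1.0] + list(range(2, 11))  # shared lambda grid, 14 values
phi_grid = [0.2, 0.4, 0.6, 0.8]                            # MMCE kernel width
T_grid = [0.0001, 0.001, 0.01, 0.1]                        # SB-ECE softening (M fixed to 15)

# Sequential tuning: lambda is searched first, then the internal hyperparameter.
runs = {
    "NLL+ESD": len(lam_grid),                    # no internal hyperparameter
    "NLL+MMCE": len(lam_grid) + len(phi_grid),
    "NLL+SB-ECE": len(lam_grid) + len(T_grid),
}
print(runs)
```

Under these grids, ESD needs only the lambda sweep, while the tuning-required objectives pay for an extra sweep on top of it, and this gap is multiplied by the per-run cost as models grow.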

7.3. ARE INDICATOR FUNCTIONS TRAINABLE?

Recent papers (Kumar et al., 2018; Karandikar et al., 2021) have suggested that ECE is not suitable for training because of the high discontinuity introduced by binning, which can be seen as a form of indicator function. However, our results show that our measure trains well despite the presence of indicator functions. In addition, previous measures also contain indicator functions in the form of the argmax function, which introduces discontinuities yet remains trainable. This raises the question of whether calibration measures with indicator functions can be used for training. To investigate this, we used ECE as an auxiliary calibration loss at different batch sizes and observed its performance on CIFAR100 (Table 3). We found that, contrary to previous belief, ECE is trainable under large batch sizes but not under small batch sizes, while ours maintains good performance regardless of batch size. The poor performance of ECE in the small-batch setting could potentially be attributed to the high bias present in such cases. This suggests that indicator functions do not by themselves inhibit training for calibration.

Furthermore, the calibration of these models is further improved after post-processing. In addition, we demonstrate that ESD can be utilized in small-batch settings while maintaining performance. More importantly, in contrast to previously proposed trainable calibration objectives, ESD does not contain any internal hyperparameters, which significantly reduces the total computational cost of training. This reduction in cost is more prominent as the complexity of the model and dataset increases, making ESD a more viable calibration objective option.

A VISUAL INTUITION OF EXPECTED SQUARED DIFFERENCE (ESD)

From Eq. (7),

d_k(\alpha) = \Bigl| \int_0^\alpha P(Y_k = 1, Z_k = z_k) - z_k \, P(Z_k = z_k) \, dz_k \Bigr| = \bigl| E_{Z_k, Y_k}[I(Z_k \le \alpha)(I(Y_k = 1) - Z_k)] \bigr|.

This can be viewed as the difference between two quantities:

\text{Cumulative Accuracy}(\alpha) = \int_0^\alpha P(Y_k = 1, Z_k = z_k) \, dz_k = E_{Z_k, Y_k}[I(Z_k \le \alpha, Y_k = 1)] \approx \frac{1}{N} \sum_{i=1}^{N} I(z_{k,i} \le \alpha, y_{k,i} = 1),

\text{Cumulative Confidence}(\alpha) = \int_0^\alpha z_k \, P(Z_k = z_k) \, dz_k = E_{Z_k, Y_k}[Z_k \, I(Z_k \le \alpha)] \approx \frac{1}{N} \sum_{i=1}^{N} z_{k,i} \, I(z_{k,i} \le \alpha).

Due to the close relationship between d_k(\alpha) and the calibration of a neural network, the average squared difference E_{Z_k'}[d_k(Z_k')^2] between the cumulative accuracy and cumulative confidence closely corresponds with the calibration of a network (Figure 4). That is, the average squared difference between the cumulative accuracy and confidence is larger for an uncalibrated network than for a calibrated network. Jointly training with our proposed Expected Squared Difference (ESD) as an auxiliary loss minimizes this squared difference between the two curves on average, thus achieving a better-calibrated model.
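The two cumulative quantities above are cheap to estimate empirically. The following numpy sketch (our own naming) computes both at a given threshold alpha; sweeping alpha over [0, 1] reproduces curves like those in Figure 4:

```python
import numpy as np

def cumulative_curves(z, y, alpha):
    """Empirical cumulative accuracy and cumulative confidence at threshold alpha.

    z: confidences z_{k,i}; y: 0/1 correctness indicators I(y_{k,i} = 1).
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = (z <= alpha).astype(float)
    cum_acc = np.mean(mask * y)    # (1/N) sum_i I(z_i <= alpha, y_i = 1)
    cum_conf = np.mean(mask * z)   # (1/N) sum_i z_i * I(z_i <= alpha)
    return cum_acc, cum_conf
```

The gap |cum_acc - cum_conf| at threshold alpha is exactly the empirical counterpart of d_k(alpha).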

B PROOF THAT ESD IS AN UNBIASED AND CONSISTENT ESTIMATOR

In Theorem 2, we prove that ESD is an unbiased estimator, i.e., the expectation of ESD equals the true Expected Squared Difference.

Theorem 2. ESD is an unbiased estimator, i.e., E_{Z_k, Y_k}[\mathrm{ESD}] = E_{Z_k'}[d_k^2(Z_k')], where d_k(Z_k') = | E_{Z_k, Y_k}[I(Z_k \le Z_k')(I(Y_k = 1) - Z_k)] |.

Proof. Let \mathbf{Z}_k = (Z_{k,1}, \ldots, Z_{k,n}), \mathbf{Y}_k = (Y_{k,1}, \ldots, Y_{k,n}), and G_i = \bar{g}_i^2 - S^2_{g_i}/(N-1). We have

E_{\mathbf{Z}_k, \mathbf{Y}_k}[\mathrm{ESD}] = \frac{1}{N} \sum_{i=1}^{N} E_{\mathbf{Z}_k, \mathbf{Y}_k}[G_i] \quad \text{(by linearity of expectation)}
= \frac{1}{N} \sum_{i=1}^{N} E_{Z_{k,i}, Y_{k,i}}[\mu_i^2] \quad \text{(by Lemma 2.1)}
= \frac{1}{N} \sum_{i=1}^{N} E_{Z_k'}[d_k^2(Z_k')] \quad \text{(since } E_{Z_{k,i}, Y_{k,i}}[\mu_i^2] = E_{Z_k'}[d_k^2(Z_k')] \text{)}
= E_{Z_k'}[d_k^2(Z_k')]. \; \square

Lemma 2.1. E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}[G_i] = \mu_i^2, where \mu_i = E_{Z_k, Y_k}[I(Z_k \le Z_{k,i})(I(Y_k = 1) - Z_k)].

Proof. Since the samples are i.i.d., for a fixed i it holds that

E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}[\bar{g}_i] = \mu_i, \qquad
E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}\Bigl[\frac{S^2_{g_i}}{N-1}\Bigr] = \frac{\sigma_i^2}{N-1} \quad \text{(where } \mathrm{Var}[g_{ij}] = \sigma_i^2 \text{)}, \qquad
\mathrm{Var}[\bar{g}_i] = E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}[\bar{g}_i^2] - \mu_i^2.

Rearranging, we obtain

\mu_i^2 = E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}[\bar{g}_i^2] - \mathrm{Var}[\bar{g}_i] = E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}\Bigl[\bar{g}_i^2 - \frac{S^2_{g_i}}{N-1}\Bigr] \quad \text{(since } \mathrm{Var}[\bar{g}_i] = \sigma_i^2/(N-1) \text{)}
= E_{\mathbf{Z}_{k,-i}, \mathbf{Y}_{k,-i}}[G_i]. \; \square

In Theorem 3, we prove that ESD is a consistent estimator, which means that ESD converges in probability to the true Expected Squared Difference.

Theorem 3. ESD is a consistent estimator, i.e., \mathrm{ESD} \xrightarrow{P} E_{Z_k'}[d_k^2(Z_k')].

Proof. Since ESD is an unbiased estimator, it suffices to prove that \lim_{N \to \infty} \mathrm{Var}[\mathrm{ESD}] = 0. Writing \sum_{i,j=1}^{N} for \sum_{i=1}^{N} \sum_{j=1}^{N}, we have

\mathrm{Var}[\mathrm{ESD}] = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}[G_i] + \frac{1}{N^2} \sum_{i,j=1, \, i \ne j}^{N} \mathrm{Cov}[G_i, G_j].
Since E_{\mathbf{Z}_k, \mathbf{Y}_k}[G_i] = E_{Z_k'}[d_k^2(Z_k')] for all i,

\mathrm{Var}[\mathrm{ESD}] = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}[G_i] + \frac{1}{N^2} \sum_{i \ne j} \bigl( E_{\mathbf{Z}_k, \mathbf{Y}_k}[G_i G_j] - E^2_{Z_k'}[d_k^2(Z_k')] \bigr)
\le \frac{1}{N^2} \sum_{i=1}^{N} 2 + \frac{1}{N^2} \sum_{i \ne j} \bigl( E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_i^2 \bar{g}_j^2] - E^2_{Z_k'}[d_k^2(Z_k')] \bigr) + \frac{1}{N^2} \sum_{i \ne j} \frac{4}{(N-2)^2} \quad \text{(by Lemmas 3.1 and 3.2)}
= \frac{2}{N} + \frac{4N(N-1)}{N^2 (N-2)^2} + \frac{N(N-1)}{N^2} \bigl( E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2] - E^2_{Z_k'}[d_k^2(Z_k')] \bigr). \quad \text{(by Lemma 3.3)}

Thus, by Lemma 3.4 and the non-negativity of variance,

0 \le \lim_{N \to \infty} \mathrm{Var}[\mathrm{ESD}] \le \lim_{N \to \infty} \bigl( E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2] - E^2_{Z_k'}[d_k^2(Z_k')] \bigr) = 0,

and consequently \lim_{N \to \infty} \mathrm{Var}[\mathrm{ESD}] = 0. \; \square

Lemma 3.1. \mathrm{Var}[G_i] \le 2 < \infty.

Proof. Using \mathrm{Var}[G_i] = E[G_i^2] - E^2[G_i] \le E[G_i^2] and expanding,

E[G_i^2] = E\Bigl[ \bar{g}_i^4 - 2 \bar{g}_i^2 \frac{S^2_{g_i}}{N-1} + \Bigl( \frac{S^2_{g_i}}{N-1} \Bigr)^2 \Bigr] \le E\Bigl[ \bar{g}_i^4 + \Bigl( \frac{S^2_{g_i}}{N-1} \Bigr)^2 \Bigr] \quad \text{(since } S^2_{g_i} \ge 0 \text{)}.

By Lemma 3.2, \bar{g}_i^4 \le \bar{g}_i^2 \le 1 and (S^2_{g_i}/(N-1))^2 \le (2/(N-2))^2 \le 1 for N \ge 4. Thus E[G_i^2] \le 2. \; \square

Lemma 3.2. S^2_{g_i}/(N-1) \le 2/(N-2) and |\bar{g}_i| \le 1.

Proof. By the triangle inequality, |\bar{g}_i| = \bigl| \frac{1}{N-1} \sum_{m \ne i} g_{im} \bigr| \le \frac{1}{N-1} \sum_{m \ne i} |g_{im}| \le 1, since |g_{im}| \le 1 for all m. Furthermore, expanding S^2_{g_i} = \frac{1}{N-2} \bigl[ \sum_{m \ne i} g_{im}^2 - \frac{1}{N-1} \bigl( \sum_{m \ne i} g_{im} \bigr)^2 \bigr],

\frac{S^2_{g_i}}{N-1} = \frac{1}{(N-1)(N-2)} \sum_{m \ne i} g_{im}^2 - \frac{1}{(N-1)^2 (N-2)} \Bigl( \sum_{m \ne i} g_{im} \Bigr)^2 \le \frac{1}{(N-1)(N-2)} \sum_{m \ne i} |g_{im}|^2 + \frac{1}{(N-1)^2 (N-2)} \Bigl( \sum_{m \ne i} |g_{im}| \Bigr)^2 \le \frac{1}{N-2} + \frac{1}{N-2} = \frac{2}{N-2},

where the last inequality uses |g_{im}| \le 1 for all m. \; \square

Lemma 3.3. E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_i^2 \bar{g}_j^2] = E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2] for all i, j with i \ne j.

Proof.
Since \bar{g}_i^2 \bar{g}_j^2 = \frac{1}{(N-1)^4} \bigl( \sum_{m \ne i, \, n \ne j} g_{im} g_{jn} \bigr)^2, it suffices to show that the double sum is equal in distribution to its counterpart with (i, j) replaced by (1, 2). Splitting the double sum into five disjoint groups of terms,

\sum_{m \ne i, \, n \ne j} g_{im} g_{jn} = \sum_{\substack{m, n \notin \{i, j\} \\ m \ne n}} g_{im} g_{jn} + \sum_{n \notin \{i, j\}} g_{ij} g_{jn} + \sum_{m \notin \{i, j\}} g_{im} g_{ji} + \sum_{m \notin \{i, j\}} g_{im} g_{jm} + g_{ij} g_{ji}
\stackrel{d}{=} \sum_{\substack{m', n' \notin \{1, 2\} \\ m' \ne n'}} g_{1m'} g_{2n'} + \sum_{n' \notin \{1, 2\}} g_{12} g_{2n'} + \sum_{m' \notin \{1, 2\}} g_{1m'} g_{21} + \sum_{m' \notin \{1, 2\}} g_{1m'} g_{2m'} + g_{12} g_{21}
= \sum_{m' \ne 1, \, n' \ne 2} g_{1m'} g_{2n'}.

Proof of claim above:

As the summation divides into five distinct cases, let i and j be arbitrary with i \ne j.

Case 1: m, n \notin \{i, j\} and m \ne n.
g_{im} g_{jn} = I(Z_{k,m} \le Z_{k,i}) [I(Y_{k,m} = 1) - Z_{k,m}] \; I(Z_{k,n} \le Z_{k,j}) [I(Y_{k,n} = 1) - Z_{k,n}] = h_1(Z_{k,i}, Z_{k,m}, Z_{k,j}, Z_{k,n}, Y_{k,m}, Y_{k,n}) \stackrel{d}{=} h_1(Z_{k,1}, Z_{k,m'}, Z_{k,2}, Z_{k,n'}, Y_{k,m'}, Y_{k,n'}), where m', n' \notin \{1, 2\} and m' \ne n'.

Case 2: m = j and n \notin \{i, j\}.
g_{ij} g_{jn} = I(Z_{k,j} \le Z_{k,i}) [I(Y_{k,j} = 1) - Z_{k,j}] \; I(Z_{k,n} \le Z_{k,j}) [I(Y_{k,n} = 1) - Z_{k,n}] = h_2(Z_{k,i}, Z_{k,j}, Z_{k,n}, Y_{k,j}, Y_{k,n}) \stackrel{d}{=} h_2(Z_{k,1}, Z_{k,2}, Z_{k,n'}, Y_{k,2}, Y_{k,n'}), where m' = 2 and n' \notin \{1, 2\}.

Case 3: n = i and m \notin \{i, j\}.
g_{im} g_{ji} = I(Z_{k,m} \le Z_{k,i}) [I(Y_{k,m} = 1) - Z_{k,m}] \; I(Z_{k,i} \le Z_{k,j}) [I(Y_{k,i} = 1) - Z_{k,i}] = h_3(Z_{k,i}, Z_{k,m}, Z_{k,j}, Y_{k,m}, Y_{k,i}) \stackrel{d}{=} h_3(Z_{k,1}, Z_{k,m'}, Z_{k,2}, Y_{k,m'}, Y_{k,1}), where n' = 1 and m' \notin \{1, 2\}.

Case 4: m = n \notin \{i, j\}.
g_{im} g_{jm} = I(Z_{k,m} \le Z_{k,i}) [I(Y_{k,m} = 1) - Z_{k,m}] \; I(Z_{k,m} \le Z_{k,j}) [I(Y_{k,m} = 1) - Z_{k,m}] = h_4(Z_{k,i}, Z_{k,m}, Z_{k,j}, Y_{k,m}) \stackrel{d}{=} h_4(Z_{k,1}, Z_{k,m'}, Z_{k,2}, Y_{k,m'}), where m' = n' \notin \{1, 2\}.

Case 5: m = j and n = i.
g_{ij} g_{ji} = I(Z_{k,j} \le Z_{k,i}) [I(Y_{k,j} = 1) - Z_{k,j}] \; I(Z_{k,i} \le Z_{k,j}) [I(Y_{k,i} = 1) - Z_{k,i}] = h_5(Z_{k,i}, Z_{k,j}, Y_{k,i}, Y_{k,j}) \stackrel{d}{=} h_5(Z_{k,1}, Z_{k,2}, Y_{k,1}, Y_{k,2}), where m' = 2 and n' = 1.

These distributional equalities follow from the fact that the (Z_{k,i}, Y_{k,i}) are i.i.d. random variables. Therefore,

\Bigl( \sum_{m \ne i, \, n \ne j} g_{im} g_{jn} \Bigr)^2 \stackrel{d}{=} \Bigl( \sum_{m' \ne 1, \, n' \ne 2} g_{1m'} g_{2n'} \Bigr)^2, \quad \text{i.e.,} \quad \bar{g}_i^2 \bar{g}_j^2 \stackrel{d}{=} \bar{g}_1^2 \bar{g}_2^2, \quad \text{so} \quad E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_i^2 \bar{g}_j^2] = E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2]. \; \square

Lemma 3.4. \lim_{N \to \infty} E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2] = E^2_{Z_k'}[d_k^2(Z_k')].

Proof. For arbitrary i, \bar{g}_i \xrightarrow{P} \mu_i by the law of large numbers. Thus \bar{g}_1 \xrightarrow{P} \mu_1 and \bar{g}_2 \xrightarrow{P} \mu_2, and since marginal convergence in probability implies joint convergence in probability (Vaart, 1998), (\bar{g}_1, \bar{g}_2) \xrightarrow{P} (\mu_1, \mu_2).
By the continuous mapping theorem, since f(x, y) = x^2 y^2 is a continuous function, \bar{g}_1^2 \bar{g}_2^2 \xrightarrow{P} \mu_1^2 \mu_2^2. Additionally, since |\bar{g}_1^2 \bar{g}_2^2| \le 1 for all N, it is uniformly bounded and thus uniformly integrable. Combining this with convergence in probability gives \lim_{N \to \infty} E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2] = E_{\mathbf{Z}_k, \mathbf{Y}_k}[\mu_1^2 \mu_2^2]. Thus, since \mu_1 and \mu_2 are independent,

\lim_{N \to \infty} E_{\mathbf{Z}_k, \mathbf{Y}_k}[\bar{g}_1^2 \bar{g}_2^2] = E_{\mathbf{Z}_k, \mathbf{Y}_k}[\mu_1^2 \mu_2^2] = E_{\mathbf{Z}_k, \mathbf{Y}_k}[\mu_1^2] \, E_{\mathbf{Z}_k, \mathbf{Y}_k}[\mu_2^2] = E^2_{Z_k'}[d_k^2(Z_k')]. \; \square

C EXPECTED SQUARED DIFFERENCE (ESD) PSEUDOCODE

In this section, we provide PyTorch-like pseudocode for calculating the Expected Squared Difference (ESD) for a given batch of network outputs.

D UNBIASED ESTIMATORS AND ITS GRADIENT

Since ESD is an unbiased estimator, E_{Z_k, Y_k}[\mathrm{ESD}] = E_{Z_k'}[d_k^2(Z_k')]. Thus, the gradient of the unbiased estimator is, on average, the gradient of the desired metric:

\nabla_\theta E_{Z_k'}[d_k^2(Z_k')] = \nabla_\theta E_{Z_k, Y_k}[\mathrm{ESD}] = \nabla_\theta E_{X, Y}[\mathrm{ESD}] \quad \text{(by the law of the unconscious statistician)}
= E_{X, Y}[\nabla_\theta \mathrm{ESD}]. \quad \text{(since } (X, Y) \text{ is independent of } \theta \text{)}

E ADDITIONAL INFORMATION ON THE CHOICE OF LAMBDA RANGE

To validate the choice of the λ grid range in our experimental settings, we plot the variation in accuracy with respect to increasing λ for all methods (Figure 5). We see that our choice of λ contains points within the acceptable range of accuracy (i.e., within 1.5% degradation in accuracy compared to baseline). However, beyond a particular λ value, which differs on a per-method, per-dataset basis, the accuracy decreases consistently with increasing λ. As such, the chosen grid is suitable for the experiments conducted.

F TRAINABLE CALIBRATION MEASURES UNDER DISTRIBUTION SHIFT

In Figure 6, we evaluate the performance of calibration methods on the distribution-shift benchmark datasets CIFAR10-C and CIFAR100-C, introduced in Hendrycks & Dietterich (2019). We observe that models jointly trained with an auxiliary calibration objective (NLL+MMCE, NLL+SB-ECE, NLL+ESD) are more robust to distribution shifts than those trained solely with NLL. In addition, prior work (Ovadia et al., 2019) has shown that the calibration performance of a neural network after temperature scaling may be significantly reduced under distribution shift; our results on CIFAR10-C and CIFAR100-C show a similar trend. Stacking calibration-during-training methods with temperature scaling (NLL+MMCE+T, NLL+SB-ECE+T, NLL+ESD+T), we observe that they perform comparably to temperature-scaled models trained with NLL (NLL+T), with the exception of ESD, which performs marginally better. As such, training with ESD could potentially improve a model's robustness after temperature scaling under distribution shift and mitigate this issue.



Footnotes. (1) To avoid confusion, ESD from this point onward refers to the estimator instead of its expectation form. (2) For λ, we search over [0.2, 0.4, 0.6, 0.8, 1.0, 2.0, 3.0, ..., 10.0] (Appendix E). (3) For ϕ, we search over [0.2, 0.4, 0.6, 0.8]; for T, we search over [0.0001, 0.001, 0.01, 0.1].



Figure 1: Accuracy (%) curve (left) and its corresponding ECE (%) curve (right) during training with negative log-likelihood (NLL) loss. It can be seen that, since NLL implicitly trains for calibration, the ECE of the train set approaches zero while the ECE of the test set increases during training.

Figure 2: ECE performance curves of MMCE (left) and SB-ECE (right) with respect to their varying internal hyperparameters on the MNIST, CIFAR10, and SNLI datasets.

Figure 3: Computational cost of single-run training (left) and total cost including hyperparameter tuning (right). The x-axis in both cases is ordered by increasing model complexity.

Figure 4: Visual intuition plot showing the cumulative confidence and cumulative accuracy at varying quantile scores of prediction confidence (α) for an uncalibrated (left) and a calibrated (right) network. The uncalibrated network was obtained by training ResNet34 on CIFAR100 with NLL, and the calibrated network was obtained by applying temperature scaling to the aforementioned trained network.
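One plausible construction of the curves in Figure 4 (a sketch of our reading, not necessarily the paper's exact procedure): sort samples by confidence and, at each quantile level α, average confidence and accuracy over the lowest α-fraction of samples.

```python
def cumulative_curves(confidence, correct, num_points=10):
    # Sort samples by ascending confidence, then report running means:
    # at quantile alpha, average confidence and accuracy over the lowest
    # alpha-fraction of samples.
    pairs = sorted(zip(confidence, correct))
    n = len(pairs)
    alphas, cum_conf, cum_acc = [], [], []
    for i in range(1, num_points + 1):
        k = max(1, round(n * i / num_points))
        head = pairs[:k]
        alphas.append(i / num_points)
        cum_conf.append(sum(c for c, _ in head) / k)
        cum_acc.append(sum(a for _, a in head) / k)
    return alphas, cum_conf, cum_acc
```

For a well-calibrated network the two curves track each other at every α; for an over-confident network the cumulative confidence curve sits above the cumulative accuracy curve.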

PyTorch-like Pseudocode: Expected Squared Difference (ESD)

# n: number of samples in a mini-batch.
# confidence: 1D tensor of n elements containing the max softmax output of each sample from a neural network.
# correct: 1D tensor of n elements containing 1/0 corresponding to the correctness of each sample.
def ESD_loss(n, confidence, correct):
    # Compute the difference between correctness and confidence.
    diff = correct.float() - confidence
    # Prepare the split between inner and outer expectation estimation.
    split = torch.ones(n, n) - torch.eye(n)
    # Compute the inner expectation estimation.
    confidence_mat = confidence.expand(n, n)
    ineq = torch.le(confidence_mat, confidence_mat.T).float()
    diff_mat = diff.view(1, n).expand(n, n)
    x_mat = torch.mul(diff_mat, ineq) * split
    mean_row = torch.sum(x_mat, dim=1) / (n - 1)
    x_mat_squared = torch.mul(x_mat, x_mat)
    var = 1 / (n - 2) * torch.sum(x_mat_squared, dim=1) - (n - 1) / (n - 2) * torch.mul(mean_row, mean_row)
    # Compute the outer expectation estimation.
    d_k_sq_vector = torch.mul(mean_row, mean_row) - var / (n - 1)
    ESD = torch.sum(d_k_sq_vector) / n
    return ESD
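The vectorized PyTorch pseudocode above can be cross-checked against a straightforward scalar re-implementation of the same estimator. This is an illustrative reference sketch (not from the paper), matching the vectorized version term by term; it requires a batch of at least 3 samples for the variance correction to be defined.

```python
def esd_reference(confidence, correct):
    # Scalar reference for the ESD estimator; n >= 3 required.
    n = len(confidence)
    diff = [c - p for c, p in zip(correct, confidence)]  # correctness - confidence
    total = 0.0
    for k in range(n):
        # x_{kj} = diff_j * 1[confidence_j <= confidence_k], over j != k.
        x = [diff[j] * (1.0 if confidence[j] <= confidence[k] else 0.0)
             for j in range(n) if j != k]
        mean = sum(x) / (n - 1)
        # Unbiased sample variance of x over the n-1 off-diagonal terms.
        var = (sum(v * v for v in x) - (n - 1) * mean * mean) / (n - 2)
        # Unbiased estimate of the squared inner expectation.
        total += mean * mean - var / (n - 1)
    return total / n
```

Note that the per-sample term `mean**2 - var/(n-1)` is an unbiased estimate of a squared expectation and can itself be negative on a small batch, so the estimator's value can dip below zero even though the quantity it estimates is non-negative.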

∇_θ E_{X,Y}[ESD] = ∇_θ ∫ ESD · p(x, y) dx dy (by the law of the unconscious statistician) = E_{X,Y}[∇_θ ESD]. (since (X, Y) are independent of θ)

Figure 5: Accuracy plots with respect to varying values of λ across different datasets and models trained with MMCE, SB-ECE, and ESD. The threshold accuracy represents the value 1.5% below the baseline accuracy, which was used as the model selection criterion as stated in Section 5.2.

Figure 6: Accuracy and ECE plot for varying intensities of distribution shifts in (a) CIFAR 10-C and (b) CIFAR 100-C across models trained using NLL as well as those jointly trained with an auxiliary calibration objective (i.e., NLL+MMCE, NLL+SB-ECE, NLL+ESD) and their performance after post-processing with temperature scaling (i.e., NLL+T, NLL+MMCE+T, NLL+SB-ECE+T, NLL+ESD+T).

tackle this issue by introducing a weighted version of their calibration objective loss. They assign larger weights to incorrectly predicted samples after observing that the fraction of incorrect to correct samples in the training data is smaller than that in the validation and test data. Instead of changing the calibration objective function itself, Karandikar et al. (2021) introduced a new training scheme called interleaved training, where they split the training data

The data consists of 550,152/10,000/10,000 sentence pairs for the train/val/test sets, respectively. The max length of the input is set to 158.

• ANLI (Nie et al., 2020): The ANLI dataset is a large-scale NLI dataset collected via an adversarial human-and-model-in-the-loop procedure. The data consists of 162,865/3,200/3,200 sentence pairs for the train/val/test sets, respectively. The max length of the input is set to 128.

For the image classification datasets, we used Convolutional Neural Networks (CNNs). Specifically, we used LeNet5 (Lecun et al., 1998), ResNet50, ResNet34, and ResNet18 (He et al., 2016) for MNIST, CIFAR10, CIFAR100, and ImageNet100, respectively. For the NLI datasets, we fine-tuned transformer-based Pre-trained Language Models (PLMs). Specifically, we used BERT-base (Devlin et al., 2019) and RoBERTa-base (Liu et al.) for SNLI and ANLI, respectively.

5.2 EXPERIMENTAL SETUP

We compare our Expected Squared Difference (ESD) to the previously proposed trainable calibration objectives, MMCE and SB-ECE, on the datasets and models mentioned above. For a fair comparison, interleaved training is used for all three calibration objectives. For MMCE, for which Kumar et al. (2018) proposed both an unweighted and a weighted version of the objective, we use the unweighted version, since interleaved training already accounts for the overfitting problem mentioned in Section 3.2.2. For the interleaved training settings, we held out 10% of the train set as the calibration set. The regularizer hyperparameter λ, which weights the calibration measure with respect to NLL, is chosen via a fixed grid search. For measuring calibration error, we use ECE with 20 equally sized



Average accuracy and ECE (with std. across 5 trials) for ESD after training with batch sizes 64, 128, 256, and 512.

Average accuracy and ECE (with std. across 5 trials) for CIFAR100 after training with ESD on batch sizes 128, 256, and 512.

ACKNOWLEDGEMENT

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics), and the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-01381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments).

