CALIBRATION TESTS BEYOND CLASSIFICATION

Abstract

Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and tests for general probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. In particular, we reformulate and generalize the kernel calibration error, its estimators, and hypothesis tests using scalar-valued kernels, and evaluate the calibration of real-valued regression problems.

1. INTRODUCTION

We consider the general problem of modelling the relationship between a feature $X$ and a target $Y$ in a probabilistic setting, i.e., we focus on models that approximate the conditional probability distribution $\mathbb{P}(Y \mid X)$ of target $Y$ for given feature $X$. The use of probabilistic models that output a probability distribution instead of a point estimate demands guarantees on the predictions beyond accuracy, enabling meaningful and interpretable predicted uncertainties. One such statistical guarantee is calibration, which has been studied extensively in the meteorological and statistical literature (DeGroot & Fienberg, 1983; Murphy & Winkler, 1977).

A calibrated model ensures that almost every prediction matches the conditional distribution of targets given this prediction. Loosely speaking, in a classification setting a predicted distribution of the model is called calibrated (or reliable) if the empirically observed frequencies of the different classes match the predictions in the long run, provided the same class probabilities were predicted repeatedly. A classical example is a weather forecaster who predicts each day whether it is going to rain on the next day. If she predicts rain with probability 60% for a long series of days, her forecasting model is calibrated for predictions of 60% if it actually rains on 60% of these days. If this property holds for almost every probability distribution that the model outputs, then the model is considered to be calibrated.

Calibration is an appealing property of a probabilistic model since it provides safety guarantees on the predicted distributions even in the common case when the model does not predict the true distributions $\mathbb{P}(Y \mid X)$. Calibration, however, does not guarantee accuracy (or refinement): a model that always predicts the marginal probabilities of each class is calibrated but probably inaccurate and of limited use.
On the other hand, accuracy does not imply calibration either, since the predictions of an accurate model can be over-confident and hence miscalibrated, as observed, e.g., for deep neural networks (Guo et al., 2017).

In the field of machine learning, calibration has been studied mainly for classification problems (Bröcker, 2009; Guo et al., 2017; Kull et al., 2017; 2019; Kumar et al., 2018; Platt, 2000; Vaicenavicius et al., 2019; Widmann et al., 2019; Zadrozny, 2002) and for quantiles and confidence intervals of models for regression problems with real-valued targets (Fasiolo et al., 2020; Ho & Lee, 2005; Kuleshov et al., 2018; Rueda et al., 2006; Taillardat et al., 2016). In our work, however, we do not restrict ourselves to these problem settings but instead consider calibration for arbitrary predictive models. Thus, we generalize the common notion of calibration as follows.

Definition 1. Consider a model $P_X := P(Y \mid X)$ of a conditional probability distribution $\mathbb{P}(Y \mid X)$. Then model $P$ is said to be calibrated if and only if

$$\mathbb{P}(Y \mid P_X) = P_X \quad \text{almost surely}. \tag{1}$$

If $P$ is a classification model, Definition 1 coincides with the notion of (multi-class) calibration by Bröcker (2009); Kull et al. (2019); Vaicenavicius et al. (2019). Alternatively, in classification some authors (Guo et al., 2017; Kumar et al., 2018; Naeini et al., 2015) study the strictly weaker property of confidence calibration (Kull et al., 2019), which only requires

$$\mathbb{P}(Y = \arg\max P_X \mid \max P_X) = \max P_X \quad \text{almost surely}. \tag{2}$$

This notion of calibration corresponds to calibration according to Definition 1 for a reduced problem with binary targets $\tilde{Y} := \mathbb{1}(Y = \arg\max P_X)$ and Bernoulli distributions $\tilde{P}_X := \mathrm{Ber}(\max P_X)$ as probabilistic models.

For real-valued targets, Definition 1 coincides with the so-called distribution-level calibration by Song et al. (2019).
Distribution-level calibration implies that the predicted quantiles are calibrated, i.e., the outcomes for all real-valued predictions of, e.g., the 75% quantile are actually below the predicted quantile with 75% probability (Song et al., 2019, Theorem 1). Conversely, although quantile-based calibration is a common approach for real-valued regression problems (Fasiolo et al., 2020; Ho & Lee, 2005; Kuleshov et al., 2018; Rueda et al., 2006; Taillardat et al., 2016), it provides weaker guarantees on the predictions. For instance, the linear regression model in Fig. 1 empirically shows quantiles that appear close to being calibrated albeit being uncalibrated according to Definition 1.

Figure 1: Empirically the predicted quantiles on 50 validation data points appear close to being calibrated, although model $P$ is uncalibrated according to Definition 1. Using the framework in this paper, on the same validation data a statistical test allows us to reject the null hypothesis that model $P$ is calibrated at a significance level of α = 0.05 (p < 0.05). See Appendix A.1 for details.

Figure 1 also raises the question of how to assess calibration for general target spaces in the sense of Definition 1, without having to rely on visual inspection. In classification, measures of calibration such as the commonly used expected calibration error (ECE) (Guo et al., 2017; Kull et al., 2019; Naeini et al., 2015; Vaicenavicius et al., 2019) and the maximum calibration error (MCE) (Naeini et al., 2015) try to capture the average and maximal discrepancy between the distributions on the left-hand side and the right-hand side of Eq. (1) or Eq. (2), respectively. These measures can be generalized to other target spaces (see Definition B.1), but unfortunately estimating these calibration errors from observations of features and corresponding targets is problematic.
Typically, the predictions are different for (almost) all observations, and hence estimation of the conditional probability $\mathbb{P}(Y \mid P_X)$, which is needed in the estimation of ECE and MCE, is challenging even for low-dimensional target spaces and usually leads to biased and inconsistent estimators (Vaicenavicius et al., 2019).

Kernel-based calibration errors such as the maximum mean calibration error (MMCE) (Kumar et al., 2018) and the kernel calibration error (KCE) (Widmann et al., 2019) for confidence and multi-class calibration, respectively, can be estimated without first estimating the conditional probability and hence avoid this issue. They are defined as the expected value of a weighted sum of the differences of the left- and right-hand side of Eq. (1) for each class, where the weights are given as a function of the predictions (of all classes) and chosen such that the calibration error is maximized. A reformulation with matrix-valued kernels (Widmann et al., 2019) yields unbiased and differentiable estimators without explicit dependence on $\mathbb{P}(Y \mid P_X)$, which simplifies the estimation and makes it possible to account for calibration explicitly in the training objective (Kumar et al., 2018). Additionally, the kernel-based framework allows the derivation of reliable statistical hypothesis tests for calibration in multi-class classification (Widmann et al., 2019).

However, both the construction as a weighted difference of the class-wise distributions in Eq. (1) and the reformulation with matrix-valued kernels require finite target spaces and hence cannot be applied to regression problems. To be able to deal with general target spaces, we present a new and more general framework of calibration errors without these limitations. Our framework can be used to reason about and test for calibration of any probabilistic predictive model.
As explained above, this is in stark contrast with existing methods that are restricted to simple output distributions, such as classification and scalar-valued regression problems. A key contribution of this paper is a new framework that is applicable to multivariate regression, as well as situations when the output is of a different (e.g., discrete ordinal) or more complex (e.g., graph-structured) type, with clear practical implications. Within this framework a KCE for general target spaces is obtained. We want to highlight that for multi-class classification problems its formulation is more intuitive and simpler to use than the measure proposed by Widmann et al. (2019) based on matrix-valued kernels. To ease the application of the KCE we derive several estimators of the KCE with subquadratic sample complexity and their asymptotic properties in tests for calibrated models, which improve on existing estimators and tests in the two-sample test literature by exploiting the special structure of the calibration framework. Using the proposed framework, we numerically evaluate the calibration of neural network models and ensembles of such models.

2. CALIBRATION ERROR: A GENERAL FRAMEWORK

In classification, the distributions on the left- and right-hand side of Eq. (1) can be interpreted as vectors in the probability simplex. Hence ultimately the distance measure for ECE and MCE (see Definition B.1) can be chosen as a distance measure of real-valued vectors. The total variation, Euclidean, and squared Euclidean distances are common choices (Guo et al., 2017; Kull et al., 2019; Vaicenavicius et al., 2019). However, in a general setting measuring the discrepancy between $\mathbb{P}(Y \mid P_X)$ and $P_X$ cannot necessarily be reduced to measuring distances between vectors. The conditional distribution $\mathbb{P}(Y \mid P_X)$ can be arbitrarily complex, even if the predicted distributions are restricted to a simple class of distributions that can be represented as real-valued vectors. Hence in general we have to resort to dedicated distance measures of probability distributions.

Additionally, the estimation of conditional distributions $\mathbb{P}(Y \mid P_X)$ is challenging, even more so than in the restricted case of classification, since in general these distributions can be arbitrarily complex. To circumvent this problem, we propose to use the following construction: We define a random variable $Z_X \sim P_X$ obtained from the predictive model and study the discrepancy between the joint distributions of the two pairs of random variables $(P_X, Y)$ and $(P_X, Z_X)$, respectively, instead of the discrepancy between the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $P_X$. Since $(P_X, Y) \overset{d}{=} (P_X, Z_X)$ if and only if $\mathbb{P}(Y \mid P_X) = P_X$ almost surely, model $P$ is calibrated if and only if the distributions of $(P_X, Y)$ and $(P_X, Z_X)$ are equal.

The random variable pairs $(P_X, Y)$ and $(P_X, Z_X)$ take values in the product space $\mathcal{P} \times \mathcal{Y}$, where $\mathcal{P}$ is the space of predicted distributions $P_X$ and $\mathcal{Y}$ is the space of targets $Y$.
For instance, in classification, $\mathcal{P}$ could be the probability simplex and $\mathcal{Y}$ the set of all class labels, whereas in the case of Gaussian predictive models for scalar targets $\mathcal{P}$ could be the space of normal distributions and $\mathcal{Y}$ be $\mathbb{R}$.

The study of the joint distributions of $(P_X, Y)$ and $(P_X, Z_X)$ motivates the definition of a generally applicable calibration error as an integral probability metric (Müller, 1997; Sriperumbudur et al., 2009; 2012) between these distributions. In contrast to common f-divergences such as the Kullback-Leibler divergence, integral probability metrics do not require that one distribution is absolutely continuous with respect to the other, which cannot be guaranteed in general.

Definition 2. Let $\mathcal{Y}$ denote the space of targets $Y$, and $\mathcal{P}$ the space of predicted distributions $P_X$. We define the calibration error with respect to a space of functions $\mathcal{F}$ of the form $f : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}$ as

$$\mathrm{CE}_{\mathcal{F}} := \sup_{f \in \mathcal{F}} \Big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \Big|. \tag{3}$$

By construction, if model $P$ is calibrated, then $\mathrm{CE}_{\mathcal{F}} = 0$ regardless of the choice of $\mathcal{F}$. However, the converse statement is not true for arbitrary function spaces $\mathcal{F}$. From the theory of integral probability metrics (see, e.g., Müller, 1997; Sriperumbudur et al., 2009; 2012), we know that for certain choices of $\mathcal{F}$ the calibration error in Eq. (3) is a well-known metric on the product space $\mathcal{P} \times \mathcal{Y}$, which implies that $\mathrm{CE}_{\mathcal{F}} = 0$ if and only if model $P$ is calibrated. Prominent examples include the maximum mean discrepancy (MMD) (Gretton et al., 2007), the total variation distance, the Kantorovich distance, and the Dudley metric (Dudley, 1989, p. 310).

As pointed out above, Definition 2 generalizes the definition for multi-class classification proposed by Widmann et al. (2019), which is based on vector-valued functions and only applicable to finite target spaces, to any probabilistic predictive model.
In Appendix E we show this explicitly and discuss the special case of classification problems in more detail. Previous results (Widmann et al., 2019)

The literature of integral probability metrics suggests that we can resort to estimating $\mathrm{CE}_{\mathcal{F}}$ from i.i.d. samples from the distributions of $(P_X, Y)$ and $(P_X, Z_X)$. For the MMD, the Kantorovich distance, and the Dudley metric, tractable strongly consistent empirical estimators exist (Sriperumbudur et al., 2012). Here the empirical estimator for the MMD is particularly appealing since compared with the other estimators "it is computationally cheaper, the empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension $d$ of the space (for $S = \mathbb{R}^d$)" (Sriperumbudur et al., 2012).

Our specific design of $(P_X, Z_X)$ can be exploited to improve on these estimators. If $\mathbb{E}_{Z_x \sim P_x} f(P_x, Z_x)$ can be evaluated analytically for a fixed prediction $P_x$, then $\mathrm{CE}_{\mathcal{F}}$ can be estimated empirically with reduced variance by marginalizing out $Z_X$. Otherwise $\mathbb{E}_{Z_x \sim P_x} f(P_x, Z_x)$ has to be estimated, but in contrast to the common estimators of the integral probability metrics discussed above, the artificial construction of $Z_X$ allows us to approximate it by numerical integration methods such as (quasi) Monte Carlo integration or quadrature rules with arbitrarily small error and variance. Monte Carlo integration preserves statistical properties of the estimators such as unbiasedness and consistency.
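The two ways of handling $Z_X$ described above can be illustrated with a small sketch. It assumes a scalar normal prediction and the Gaussian target kernel $\exp(-(y - z)^2/2)$ that appears in Appendix A.1; the function names and the numerical values are hypothetical and chosen only for illustration. For this kernel the expectation over $Z \sim \mathcal{N}(m, s^2)$ has a closed form, which the sketch compares against plain Monte Carlo integration.

```python
import math
import random


def expected_kernel_analytic(mean, std, y):
    """Closed form of E exp(-(Z - y)^2 / 2) for Z ~ N(mean, std^2)."""
    var = 1.0 + std ** 2
    return math.exp(-0.5 * (mean - y) ** 2 / var) / math.sqrt(var)


def expected_kernel_monte_carlo(mean, std, y, num_samples, rng):
    """Monte Carlo approximation of the same expectation."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mean, std)  # draw Z from the predicted distribution
        total += math.exp(-0.5 * (z - y) ** 2)
    return total / num_samples


# Hypothetical prediction N(0.3, 0.5^2) and target y = 1.0.
rng = random.Random(0)
exact = expected_kernel_analytic(0.3, 0.5, 1.0)
approx = expected_kernel_monte_carlo(0.3, 0.5, 1.0, 100_000, rng)
```

When a closed form like this is available, it removes the sampling noise of $Z_X$ entirely; otherwise the number of Monte Carlo samples can be increased until the integration error is negligible.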

3. KERNEL CALIBRATION ERROR

For the remaining parts of the paper we focus on the MMD formulation of $\mathrm{CE}_{\mathcal{F}}$ due to the appealing properties of the common empirical estimator mentioned above. We derive calibration-specific analogues of results for the MMD that exploit the special structure of the distribution of $(P_X, Z_X)$ to improve on existing estimators and tests in the MMD literature. To the best of our knowledge these variance-reduced estimators and tests have not been discussed in the MMD literature.

Let $k : (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$ be a measurable kernel with corresponding reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and assume that $\mathbb{E}_{P_X, Y}\, k^{1/2}\big((P_X, Y), (P_X, Y)\big) < \infty$ and $\mathbb{E}_{P_X, Z_X}\, k^{1/2}\big((P_X, Z_X), (P_X, Z_X)\big) < \infty$. We discuss how such kernels can be constructed in a generic way in Section 3.1 below.

Definition 3. Let $\mathcal{F}_k$ denote the unit ball in $\mathcal{H}$, i.e., $\mathcal{F}_k := \{f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le 1\}$. Then the kernel calibration error (KCE) with respect to kernel $k$ is defined as

$$\mathrm{KCE}_k := \mathrm{CE}_{\mathcal{F}_k} = \sup_{f \in \mathcal{F}_k} \Big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \Big|.$$

As known from the MMD literature, a more explicit formulation can be given for the squared kernel calibration error $\mathrm{SKCE}_k := \mathrm{KCE}_k^2$ (see Lemma B.2). A similar explicit expression for $\mathrm{SKCE}_k$ was obtained by Widmann et al. (2019) for the special case of classification problems. However, their expression relies on $\mathcal{Y}$ being finite and is based on matrix-valued kernels over the finite-dimensional probability simplex $\mathcal{P}$. A key difference to the expression in Lemma B.2 is that we instead propose to use real-valued kernels defined on the product space of predictions and targets. This construction is applicable to arbitrary target spaces and does not require $\mathcal{Y}$ to be finite.
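The more explicit formulation mentioned above can be sketched with the standard expansion of a squared MMD. The display is an illustration under the assumption that Lemma B.2, which is not reproduced in this section, takes this usual form. The symbols $(P_{X'}, Y')$ and $(P_{X'}, Z_{X'})$ are notation introduced for this sketch and denote independent copies of $(P_X, Y)$ and $(P_X, Z_X)$.

```latex
\begin{aligned}
\mathrm{SKCE}_k
  &= \mathbb{E}\, k\big((P_X, Y), (P_{X'}, Y')\big)
   - 2\, \mathbb{E}\, k\big((P_X, Y), (P_{X'}, Z_{X'})\big) \\
  &\quad + \mathbb{E}\, k\big((P_X, Z_X), (P_{X'}, Z_{X'})\big)
\end{aligned}
```

All three expectations involve only the real-valued kernel $k$ on $\mathcal{P} \times \mathcal{Y}$, so no estimate of the conditional distribution $\mathbb{P}(Y \mid P_X)$ is required.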

3.1. CHOICE OF KERNEL

The construction of the product space $\mathcal{P} \times \mathcal{Y}$ suggests the use of tensor product kernels $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$, where $k_{\mathcal{P}} : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ and $k_{\mathcal{Y}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are kernels on the spaces of predicted distributions and targets, respectively. [3]

By definition, so-called characteristic kernels guarantee that $\mathrm{KCE}_k = 0$ if and only if the distributions of $(P_X, Y)$ and $(P_X, Z_X)$ are equal (Fukumizu et al., 2004; 2008). Many common kernels such as the Gaussian and Laplacian kernel on $\mathbb{R}^d$ are characteristic (Fukumizu et al., 2008). Szabó & Sriperumbudur (2018, Theorem 4) showed that a tensor product kernel $k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$ is characteristic if $k_{\mathcal{P}}$ and $k_{\mathcal{Y}}$ are characteristic, continuous, bounded, and translation-invariant kernels on $\mathbb{R}^d$, but the implication does not hold for general characteristic kernels (Szabó & Sriperumbudur, 2018, Example 1). For calibration evaluation, however, it is sufficient to be able to distinguish between the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $\mathbb{P}(Z_X \mid P_X) = P_X$. Therefore, in contrast to the regular MMD setting, it is sufficient that kernel $k_{\mathcal{Y}}$ is characteristic and kernel $k_{\mathcal{P}}$ is non-zero almost surely to guarantee that $\mathrm{KCE}_k = 0$ if and only if model $P$ is calibrated.

This suggests constructing kernels on general spaces of predicted distributions as

$$k_{\mathcal{P}}(p, p') = \exp\big(-\lambda\, d_{\mathcal{P}}^{\nu}(p, p')\big), \tag{4}$$

where $d_{\mathcal{P}}(\cdot, \cdot)$ is a metric on $\mathcal{P}$ and $\nu, \lambda > 0$ are kernel hyperparameters. The Wasserstein distance is a widely used metric for distributions from optimal transport theory that allows lifting a ground metric on the target space and possesses many important properties (see, e.g., Peyré & Cuturi, 2019, Chapter 2.4). In general, however, it does not lead to valid kernels $k_{\mathcal{P}}$, apart from the notable exception of elliptically contoured distributions such as normal and Laplace distributions (Peyré & Cuturi, 2019, Chapter 8.3).

In machine learning, common probabilistic predictive models output parameters of distributions such as mean and variance of normal distributions.
Naturally these parameterizations give rise to injective mappings $\phi : \mathcal{P} \to \mathbb{R}^d$ that can be used to define a Hilbertian metric $d_{\mathcal{P}}(p, p') = \|\phi(p) - \phi(p')\|_2$. For such metrics, $k_{\mathcal{P}}$ in Eq. (4) is a valid kernel for all $\lambda > 0$ and $\nu \in (0, 2]$ (Berg et al., 1984, Corollary 3.3.3, Proposition 3.2.7). In Appendix D.3 we show that for many mixture models, and hence model ensembles, Hilbertian metrics between model components can be lifted to Hilbertian metrics between mixture models. This construction is a generalization of the Wasserstein-like distance for Gaussian mixture models proposed by Chen et al. (2019; 2020); Delon & Desolneux (2020).

Similar to the MMD, there exist consistent estimators of the SKCE, both biased and unbiased.

Lemma 1. The plug-in estimator of $\mathrm{SKCE}_k$ is non-negatively biased. It is given by

$$\widehat{\mathrm{SKCE}}_k = \frac{1}{n^2} \sum_{i,j=1}^{n} h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big).$$

Inspired by the block tests for the regular MMD (Zaremba et al., 2013), we define the following class of unbiased estimators. Note that in contrast to $\widehat{\mathrm{SKCE}}_k$ they do not include terms of the form $h\big((P_{X_i}, Y_i), (P_{X_i}, Y_i)\big)$.

Lemma 2. The block estimator of $\mathrm{SKCE}_k$ with block size $B \in \{2, \ldots, n\}$, given by

$$\widehat{\mathrm{SKCE}}_{k,B} := \Big\lfloor \frac{n}{B} \Big\rfloor^{-1} \sum_{b=1}^{\lfloor n/B \rfloor} \binom{B}{2}^{-1} \sum_{(b-1)B < i < j \le bB} h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big),$$

is an unbiased estimator of $\mathrm{SKCE}_k$.

The extremal estimator with $B = n$ is a so-called U-statistic of $\mathrm{SKCE}_k$ (Hoeffding, 1948; van der Vaart, 1998), and hence it is the minimum variance unbiased estimator. All presented estimators are consistent, i.e., they converge to $\mathrm{SKCE}_k$ almost surely as the number $n$ of data points goes to infinity. The sample complexity of $\widehat{\mathrm{SKCE}}_k$ and $\widehat{\mathrm{SKCE}}_{k,B}$ is $O(n^2)$ and $O(Bn)$, respectively.

[3] Note that in contrast to the regular MMD we marginalize out $Z$ and $Z'$.
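The two lemmas can be sketched in a few lines for scalar normal predictions. The sketch rests on several assumptions that are not stated in this section: $h$ is taken to be the kernel $k$ with $Z \sim p$ and $Z' \sim p'$ marginalized out, i.e., $h((p, y), (p', y')) = k((p, y), (p', y')) - \mathbb{E}_{Z \sim p} k((p, Z), (p', y')) - \mathbb{E}_{Z' \sim p'} k((p, y), (p', Z')) + \mathbb{E}_{Z \sim p, Z' \sim p'} k((p, Z), (p', Z'))$; the kernel is the tensor product kernel of Appendix A.1; and the expectations are evaluated with the closed form for Gaussian kernels and normal distributions. All function names and the toy data are hypothetical.

```python
import math
import random


def kernel_predictions(p, q):
    """exp(-W_2(p, q)) for scalar normal predictions given as (mean, std)."""
    return math.exp(-math.hypot(p[0] - q[0], p[1] - q[1]))


def mean_kernel_targets(mu, var):
    """E exp(-W^2 / 2) for W ~ N(mu, var); var = 0 gives the kernel itself."""
    return math.exp(-0.5 * mu * mu / (1.0 + var)) / math.sqrt(1.0 + var)


def h(a, b):
    """Assumed form of h with Z ~ p and Z' ~ q marginalized out analytically."""
    (p, y), (q, z) = a, b
    (mp, sp), (mq, sq) = p, q
    inner = (
        mean_kernel_targets(y - z, 0.0)
        - mean_kernel_targets(mp - z, sp**2)
        - mean_kernel_targets(y - mq, sq**2)
        + mean_kernel_targets(mp - mq, sp**2 + sq**2)
    )
    return kernel_predictions(p, q) * inner


def skce_biased(data):
    """Plug-in estimator of Lemma 1: average of h over all n^2 pairs."""
    n = len(data)
    return sum(h(a, b) for a in data for b in data) / n**2


def skce_block(data, block_size):
    """Block estimator of Lemma 2: average of per-block U-statistics."""
    num_blocks = len(data) // block_size
    pairs_per_block = block_size * (block_size - 1) // 2
    total = 0.0
    for b in range(num_blocks):
        block = data[b * block_size:(b + 1) * block_size]
        block_sum = sum(
            h(block[i], block[j])
            for i in range(block_size)
            for j in range(i + 1, block_size)
        )
        total += block_sum / pairs_per_block
    return total / num_blocks


# Hypothetical calibrated toy data: predictions N(c, 0.1^2), targets drawn from them.
rng = random.Random(1)
data = []
for _ in range(64):
    c = rng.random()
    data.append(((c, 0.1), rng.gauss(c, 0.1)))

biased = skce_biased(data)
linear_time = skce_block(data, 2)
u_statistic = skce_block(data, len(data))
```

With `block_size=2` each block contributes a single evaluation of `h`, which gives the cheapest estimator, whereas `block_size=len(data)` evaluates all pairs $i < j$ and corresponds to the U-statistic.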

3.3. CALIBRATION TESTS

A fundamental issue with calibration errors in general, including ECE, is that their empirical estimates do not answer the question of whether a model is actually calibrated. Even if the measure is guaranteed to be zero if and only if the model is calibrated, usually the estimates of calibrated models are non-zero due to randomness in the data and (possibly) the estimation procedure. In classification, statistical hypothesis tests of the null hypothesis

$$H_0 : \text{model } P \text{ is calibrated},$$

so-called calibration tests, have been proposed as a tool for checking rigorously if $P$ is calibrated (Bröcker & Smith, 2007; Vaicenavicius et al., 2019; Widmann et al., 2019). For multi-class classification, Widmann et al. (2019) suggested calibration tests based on the asymptotic distributions of estimators of the previously formulated KCE. Although for finite data sets the asymptotic distributions are only approximations of the actual distributions of these estimators, in their experiments with 10 classes the resulting p-value approximations seemed reliable, whereas p-values obtained by so-called consistency resampling (Bröcker & Smith, 2007; Vaicenavicius et al., 2019) underestimated the p-value and hence rejected the null hypothesis too often (Widmann et al., 2019).

For fixed block sizes,

$$\sqrt{\lfloor n/B \rfloor}\, \big(\widehat{\mathrm{SKCE}}_{k,B} - \mathrm{SKCE}_k\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_B^2\big) \quad \text{as } n \to \infty,$$

and, under $H_0$,

$$n\, \widehat{\mathrm{SKCE}}_{k,n} \xrightarrow{d} \sum_{i=1}^{\infty} \lambda_i (Z_i - 1) \quad \text{as } n \to \infty,$$

where $Z_i$ are independent $\chi^2_1$ distributed random variables. See Appendix B for details and definitions of the involved constants. From these results one can derive calibration tests that extend and generalize the existing tests for classification problems, as explained in Remarks B.1 and B.2. Our formulation also illustrates the close connection of these tests to different two-sample tests (Gretton et al., 2007; Zaremba et al., 2013).
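A test based on the normal limit for fixed block sizes can be sketched as follows. Since $\mathrm{SKCE}_k = 0$ under $H_0$, the mean of the per-block estimates, scaled by the square root of the number of blocks and divided by an estimate of $\sigma_B$, is approximately standard normal, and large positive values speak against $H_0$. The sketch assumes that $\sigma_B$ is replaced by the sample standard deviation of the per-block values; this plug-in choice, the function name, and the example values are assumptions made here, and Remark B.1 may differ in its details.

```python
import math


def block_test_pvalue(block_estimates):
    """One-sided p-value for H0 from per-block estimates via the normal limit."""
    m = len(block_estimates)
    mean = sum(block_estimates) / m  # equals the block estimator of SKCE
    # Sample variance of the block values as a plug-in for sigma_B^2 (assumption).
    variance = sum((v - mean) ** 2 for v in block_estimates) / (m - 1)
    statistic = math.sqrt(m) * mean / math.sqrt(variance)
    # Upper tail probability of the standard normal distribution.
    return 0.5 * math.erfc(statistic / math.sqrt(2.0))


# Hypothetical per-block values, chosen only to exercise the function.
p_centered = block_test_pvalue([-1.0, 1.0, -2.0, 2.0])
p_shifted = block_test_pvalue([1.0, 2.0, 3.0, 4.0])
```

The null hypothesis would be rejected at significance level α whenever the returned p-value falls below α. The test based on the U-statistic with $B = n$ needs a different treatment because its limiting distribution is not normal.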

4. ALTERNATIVE APPROACHES

For two-sample tests, Chwialkowski et al. (2015) suggested the use of the so-called unnormalized mean embedding (UME) to overcome the quadratic sample complexity of the minimum variance unbiased estimator and its intractable asymptotic distribution. As we show in Appendix C, there exists an analogous measure of calibration, termed unnormalized calibration mean embedding (UCME), with a corresponding calibration mean embedding (CME) test.

As an alternative to our construction based on the joint distributions of $(P_X, Y)$ and $(P_X, Z_X)$, one could try to directly compare the conditional distributions $\mathbb{P}(Y \mid P_X)$ and $\mathbb{P}(Z_X \mid P_X) = P_X$. For instance, Ren et al. (2016) proposed the conditional MMD based on the so-called conditional kernel mean embedding (Song et al., 2009; 2013). However, as noted by Park & Muandet (2020), its common definition as an operator between two RKHS is based on very restrictive assumptions, which are violated in many situations (see, e.g., Fukumizu et al., 2013, Footnote 4) and typically require regularized estimates. Hence, even theoretically, often the conditional MMD is "not an exact measure of discrepancy between conditional distributions" (Park & Muandet, 2020). In contrast, the maximum conditional mean discrepancy (MCMD) proposed in a concurrent work by Park & Muandet (2020) is a random variable derived from much weaker measure-theoretical assumptions. The MCMD provides a local discrepancy conditional on random predictions whereas KCE is a global real-valued summary of these local discrepancies. [5]

5. EXPERIMENTS

In our experiments we evaluate the computational efficiency and empirical properties of the proposed calibration error estimators and calibration tests on both calibrated and uncalibrated models. By means of a classic regression problem from the statistics literature, we demonstrate that the estimators and tests can be used to evaluate the calibration of neural network models and ensembles of such models. To conserve space, this section contains only a high-level overview of these experiments; all experimental details are provided in Appendix A.

5.1. EMPIRICAL PROPERTIES AND COMPUTATIONAL EFFICIENCY

We evaluate error, variance, and computation time of calibration error estimators for calibrated and uncalibrated Gaussian predictive models in synthetic regression problems. The results empirically confirm the consistency of the estimators and the computational efficiency of the estimator with block size $B = 2$ which, however, comes at the cost of increased error and variance.

Additionally, we evaluate empirical test errors of calibration tests at a fixed significance level α = 0.05. The evaluations, visualized in Fig. 2 for models with ten-dimensional targets, demonstrate empirically that the percentage of incorrect rejections of $H_0$ converges to the set significance level as the number of samples increases. Moreover, the results highlight the computational burden of the calibration test that estimates quantiles of the intractable asymptotic distribution of $n\, \widehat{\mathrm{SKCE}}_{k,n}$ by bootstrapping. As expected, due to the larger variance of $\widehat{\mathrm{SKCE}}_{k,2}$ the test with fixed block size $B = 2$ shows a decreased test power although being computationally much more efficient.

[5] In our calibration setting, the MCMD is almost surely equal to

$$\sup_{f \in \mathcal{F}_{\mathcal{Y}}} \Big| \mathbb{E}_{Y \mid P_X}\big(f(Y) \mid P_X\big) - \mathbb{E}_{Z_X \mid P_X}\big(f(Z_X) \mid P_X\big) \Big|,$$

where $\mathcal{F}_{\mathcal{Y}} := \{f : \mathcal{Y} \to \mathbb{R} \mid \|f\|_{\mathcal{H}_{\mathcal{Y}}} \le 1\}$ for an RKHS $\mathcal{H}_{\mathcal{Y}}$ with kernel $k_{\mathcal{Y}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. If kernel $k_{\mathcal{Y}}$ is characteristic, MCMD = 0 almost surely if and only if model $P$ is calibrated (Park & Muandet, 2020, Theorem 3.7). Although the definition of MCMD only requires a kernel $k_{\mathcal{Y}}$ on the target space, a kernel $k_{\mathcal{P}}$ on the space of predictions has to be specified for the evaluation of its regularized estimates.

5.2. FRIEDMAN 1 REGRESSION PROBLEM

The Friedman 1 regression problem (Friedman, 1979; 1991; Friedman et al., 1983) is a classic non-linear regression problem with ten-dimensional features and real-valued targets with Gaussian noise. We train a Gaussian predictive model whose mean is modelled by a shallow neural network and a single scalar variance parameter (consistent with the data-generating model) ten times with different initial parameters. Figure 3 shows estimates of the mean squared error (MSE), the average negative log-likelihood (NLL), $\mathrm{SKCE}_k$, and a p-value approximation for these models and their ensemble on the training and a separate test data set. All estimates indicate consistently that the models are overfit after 1500 training iterations. The estimates of $\mathrm{SKCE}_k$ and the p-values make it possible to focus on calibration specifically, whereas MSE indicates accuracy only and NLL, as any proper scoring rule (Bröcker, 2009), provides a summary of calibration and accuracy. The estimation of $\mathrm{SKCE}_k$ in addition to NLL could serve as another source of information for early stopping and model selection.

6. CONCLUSION

We presented a framework of calibration estimators and tests for any probabilistic model that captures both classification and regression problems of arbitrary dimension as well as other predictive models. We successfully applied it to measure the calibration of (ensembles of) neural network models.

Our framework highlights connections of calibration to two-sample tests and optimal transport theory which we expect to be fruitful for future research. For instance, the power of calibration tests could be improved by heuristics and theoretical results about suitable kernel choices or hyperparameters (cf. Jitkrittum et al., 2016). It would also be interesting to investigate alternatives to KCE captured by our framework, e.g., by exploiting recent advances in optimal transport theory (cf. Genevay et al., 2016).

Since the presented estimators of $\mathrm{SKCE}_k$ are differentiable, we imagine that our framework could be helpful for improving calibration of predictive models, during training (cf. Kumar et al., 2018) or post-hoc. Currently, many calibration methods (see, e.g., Guo et al., 2017; Kull et al., 2019; Song et al., 2019) are based on optimizing the log-likelihood since it is a strictly proper scoring rule and thus encourages both accurate and reliable predictions. However, as for any proper scoring rule, "Per se, it is impossible to say how the score will rank unreliable forecast schemes [. . .]. The lack of reliability of one forecast scheme might be outbalanced by the lack of resolution of the other" (Bröcker, 2009). In other words, if one does not use a calibration method such as temperature scaling (Guo et al., 2017) that keeps accuracy invariant, it is unclear if the resulting model is trading off calibration for accuracy when using log-likelihood for re-calibration. Thus, hypothetically, flexible calibration methods might benefit from using the presented calibration error estimators.

A EXPERIMENTS

The source code of the experiments and instructions for reproducing the results are available at https://github.com/devmotion/Calibration_ICLR2021. Additional material such as automatically generated HTML output and Jupyter notebooks is available at https://devmotion.github.io/Calibration_ICLR2021/.

A.1 ORDINARY LEAST SQUARES

We consider a regression problem with scalar feature $X$ and scalar target $Y$ with input-dependent Gaussian noise that is inspired by a problem by Gustafsson et al. (2020). Feature $X$ is distributed uniformly at random in $[-1, 1]$, and target $Y$ is distributed according to

$$Y \sim \sin(\pi X) + |1 + X|\, \epsilon,$$

where $\epsilon \sim \mathcal{N}(0, 0.15^2)$. We train a linear regression model $P$ with homoscedastic variance using ordinary least squares and a data set of 100 i.i.d. pairs of feature $X$ and target $Y$ (see Fig. 4).

A validation data set of $n = 50$ i.i.d. pairs of $X$ and $Y$ is used to evaluate the empirical cumulative probability

$$n^{-1} \sum_{i=1}^{n} \mathbb{1}_{[0, \tau]}\big(P(Y \le Y_i \mid X = X_i)\big)$$

of model $P$ for quantile levels $\tau \in [0, 1]$. Model $P$ would be quantile calibrated (Song et al., 2019) if

$$\tau = \mathbb{P}_{X', Y'}\big(P(Y \le Y' \mid X = X') \le \tau\big)$$

for all $\tau \in [0, 1]$, where $(X, Y)$ and $(X', Y')$ are independent identically distributed pairs of random variables (see Fig. 5).

Additionally, we compute a p-value estimate of the null hypothesis $H_0$ that model $P$ is calibrated using an estimation of the quantile of the asymptotic distribution of $n\, \widehat{\mathrm{SKCE}}_{k,n}$ with 100000 bootstrap samples on the validation data set (see Remark B.2). Kernel $k$ is chosen as the tensor product kernel

$$k\big((p, y), (p', y')\big) = \exp\big(-W_2(p, p')\big) \exp\big(-(y - y')^2/2\big) = \exp\Big(-\sqrt{(m_p - m_{p'})^2 + (\sigma_p - \sigma_{p'})^2}\Big) \exp\big(-(y - y')^2/2\big),$$

where $W_2$ is the 2-Wasserstein distance and $m_p, m_{p'}$ and $\sigma_p, \sigma_{p'}$ denote the mean and the standard deviation of the normal distributions $p$ and $p'$ (see Appendix D.1). We obtain p < 0.05 in our experiment, and hence the calibration test rejects $H_0$ at the significance level α = 0.05.
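The data-generating process and the quantile check described above can be sketched as follows. This is a minimal illustration, not the authors' experiment code: the variance of the fitted model is taken to be the mean squared residual, which is an assumption because the text does not state which variance estimate accompanies the ordinary least squares fit, and all function names and the random seed are hypothetical.

```python
import math
import random


def sample_data(n, rng):
    """X ~ U[-1, 1] and Y = sin(pi X) + |1 + X| * eps with eps ~ N(0, 0.15^2)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(math.pi * x) + abs(1.0 + x) * rng.gauss(0.0, 0.15) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """Least-squares line plus one shared standard deviation for the noise."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # Mean squared residual as variance estimate (assumption, see lead-in).
    var = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / n
    return intercept, slope, math.sqrt(var)


def predicted_cdf_at_targets(model, xs, ys):
    """Predicted normal CDF evaluated at the targets, P(Y <= Y_i | X = X_i)."""
    intercept, slope, std = model
    return [
        0.5 * math.erfc(-(y - intercept - slope * x) / (std * math.sqrt(2.0)))
        for x, y in zip(xs, ys)
    ]


def empirical_cumulative_probability(cdf_values, tau):
    """Fraction of validation points whose predicted CDF value is at most tau."""
    return sum(u <= tau for u in cdf_values) / len(cdf_values)


rng = random.Random(0)
model = fit_ols(*sample_data(100, rng))
cdf_values = predicted_cdf_at_targets(model, *sample_data(50, rng))
curve = [empirical_cumulative_probability(cdf_values, t / 10) for t in range(11)]
```

For a quantile calibrated model the entries of `curve` would be close to the levels 0, 0.1, ..., 1. A curve close to the diagonal does not imply calibration in the sense of Definition 1, which is the point made with Fig. 1.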

A.2 EMPIRICAL PROPERTIES AND COMPUTATIONAL EFFICIENCY

We study two setups with $d$-dimensional targets $Y$ and normal distributions $P_X$ of the form $\mathcal{N}(c \mathbf{1}_d, 0.1^2 I_d)$ as predictions, where $c \sim \mathcal{U}(0, 1)$. Since calibration analysis is only based on the targets and predicted distributions, we neglect features $X$ in these experiments and specify only the distributions of $Y$ and $P_X$.

In the first setup we simulate a calibrated model. We achieve this by sampling targets from the predicted distributions, i.e., by defining the conditional distribution of $Y$ given $P_X$ as

$$Y \mid P_X = \mathcal{N}(\mu, \Sigma) \sim \mathcal{N}(\mu, \Sigma).$$

In the second setup we simulate an uncalibrated model of the form

$$Y \mid P_X = \mathcal{N}(\mu, \Sigma) \sim \mathcal{N}\big([0.1, \mu_2, \ldots, \mu_d]^{\mathsf{T}}, \Sigma\big).$$

We perform an evaluation of the convergence and computation time of the biased estimator $\widehat{\mathrm{SKCE}}_k$ and the unbiased estimator $\widehat{\mathrm{SKCE}}_{k,B}$ with blocks of size $B \in \{2, \sqrt{n}, n\}$. We use the tensor product kernel

$$k\big((p, y), (p', y')\big) = \exp\big(-W_2(p, p')\big) \exp\big(-(y - y')^2/2\big) = \exp\Big(-\sqrt{(m_p - m_{p'})^2 + (\sigma_p - \sigma_{p'})^2}\Big) \exp\big(-(y - y')^2/2\big),$$

where $W_2$ is the 2-Wasserstein distance and $m_p, m_{p'}$ and $\sigma_p, \sigma_{p'}$ denote the mean and the standard deviation of the normal distributions $p$ and $p'$.

Figures 6 to 9 visualize the mean absolute error and the variance of the resulting estimates for the calibrated and the uncalibrated model with dimensions $d = 1$ and $d = 10$ for 500 independently drawn data sets of $n \in \{4, 16, 64, 256, 1024\}$ samples of $(P_X, Y)$. Computation time indicates the minimum time in the 500 evaluations on a computer with a 3.6 GHz processor. The ground truth values of the uncalibrated models were estimated by averaging the estimates of $\widehat{\mathrm{SKCE}}_{k,1000}$ for 1000 independently drawn data sets of 1000 samples of $(P_X, Y)$ (independent from the data sets used for the evaluation of the estimates). Figures 6 and 7 illustrate that the computational efficiency of $\widehat{\mathrm{SKCE}}_{k,2}$ in comparison with the other estimators comes at the cost of increased error and variance for the calibrated models for fixed numbers of samples.
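The two synthetic setups can be sketched with the standard library alone. The function name, the representation of a prediction as a pair of mean vector and shared standard deviation, and the sample sizes are hypothetical choices made for this illustration.

```python
import random


def sample_setup(n, d, calibrated, rng):
    """Draw n pairs (prediction, target); a prediction is (mean vector, std)."""
    data = []
    for _ in range(n):
        c = rng.random()          # c ~ U(0, 1)
        mean = [c] * d            # predicted mean c * 1_d
        target_mean = list(mean)
        if not calibrated:
            # Uncalibrated setup: the first target coordinate has mean 0.1.
            target_mean[0] = 0.1
        target = [rng.gauss(m, 0.1) for m in target_mean]
        data.append(((mean, 0.1), target))
    return data


rng = random.Random(0)
calibrated_data = sample_setup(2000, 10, True, rng)
uncalibrated_data = sample_setup(2000, 10, False, rng)
```

In the uncalibrated setup only the first coordinate of the target deviates from the prediction, so any discrepancy between $(P_X, Y)$ and $(P_X, Z_X)$ has to be detected along that coordinate.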
[Figures 6 to 9 (plots not reproduced): mean error, variance, and computation time in seconds as functions of the number of samples ($2^2$ to $2^{10}$) for the estimators SKCE, SKCE ($B = 2$), SKCE ($B = \sqrt{n}$), and SKCE ($B = n$).]

A.3 FRIEDMAN 1 REGRESSION PROBLEM

We study the so-called Friedman 1 regression problem, which was initially described for 200 inputs in the six-dimensional unit hypercube (Friedman, 1979; Friedman et al., 1983) and later modified to 100 inputs in the 10-dimensional unit hypercube (Friedman, 1991). In this regression problem real-valued target $Y$ depends on input $X$ via
$$Y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \epsilon,$$
where noise $\epsilon$ is typically chosen to be independently standard normally distributed. We generate a training data set of 100 inputs distributed uniformly at random in the 10-dimensional unit hypercube and corresponding targets with independent and identically distributed noise following a standard normal distribution. We consider models $P^{(\theta,\sigma^2)}$ of normal distributions with fixed variance $\sigma^2$,
$$P^{(\theta,\sigma^2)}_x = \mathcal{N}\big(f_\theta(x), \sigma^2\big),$$
where $f_\theta(x)$, the model of the mean of the distribution $\mathbb{P}(Y|X = x)$, is given by a fully connected neural network with two hidden layers with 200 and 50 hidden units and ReLU activation functions. The parameters of the neural network are denoted by $\theta$. We use a maximum likelihood approach and train the parameters $\theta$ of the model for 5000 iterations by minimizing the mean squared error on the training data set using ADAM (Kingma & Ba, 2015) (default settings in the machine learning framework Flux.jl (Innes, 2018; Innes et al., 2018)). In each iteration, the variance $\sigma^2$ is set to the maximizer of the likelihood of the training data set. We train 10 models with different initializations of parameters $\theta$. The initial values of the weight matrices of the neural networks are sampled from the uniform Glorot initialization (Glorot & Bengio, 2010) and the offset vectors are initialized with zeros. In Fig. 12, we visualize estimates of accuracy and calibration measures on the training and test data set with 100 and 50 samples, respectively, for 5000 training iterations.
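As a hedged illustration of the data-generating process, the following Python sketch draws inputs and targets of the Friedman 1 problem and computes the likelihood-maximizing variance for fixed predicted means, which for a normal model is the mean squared error. The function names are hypothetical, and the sketch does not reproduce the Flux.jl training code.

```python
import math
import random

def friedman1_mean(x):
    """Noise-free Friedman 1 response; only the first five inputs are used."""
    return (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3] + 5 * x[4])

def sample_friedman1(n, rng, dim=10):
    """Inputs uniform on the unit hypercube, targets with standard normal noise."""
    inputs = [[rng.random() for _ in range(dim)] for _ in range(n)]
    targets = [friedman1_mean(x) + rng.gauss(0.0, 1.0) for x in inputs]
    return inputs, targets

def mle_variance(predicted_means, targets):
    """Maximizer of the normal likelihood in sigma^2 for fixed means: the mean squared error."""
    return sum((m - y) ** 2 for m, y in zip(predicted_means, targets)) / len(targets)
```

With this sketch, `sample_friedman1(100, random.Random(0))` corresponds to a training set of the size described above.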
The pinball loss is a common measure and training objective for calibration of quantiles (Song et al., 2019). It is defined as $\mathbb{E}_{X,Y} L_\tau\big(Y, \mathrm{quantile}(P_X, \tau)\big)$, where $L_\tau(y, \tilde{y}) = (1 - \tau)(\tilde{y} - y)_+ + \tau (y - \tilde{y})_+$ and $\mathrm{quantile}(P_x, \tau) = \inf_y \{P_x(Y \leq y) \geq \tau\}$ for quantile level $\tau \in [0, 1]$. In Fig. 12 we plot the average pinball loss (pinball) for quantile levels $\tau \in \{0.05, 0.1, \ldots, 0.95\}$. We evaluate $\widehat{\mathrm{SKCE}}_{k,n}$ (SKCE (unbiased)) and $\widehat{\mathrm{SKCE}}_k$ (SKCE (biased)) for the tensor product kernel
$$k\big((p, y), (p', y')\big) = \exp\big(-W_2(p, p')\big) \exp\big(-(y - y')^2/2\big) = \exp\Big(-\sqrt{(m_p - m_{p'})^2 + (\sigma_p - \sigma_{p'})^2}\Big) \exp\big(-(y - y')^2/2\big).$$
Additionally, we form ensembles of the ten individual models at every training iteration. The evaluations for the ensembles are visualized in Fig. 12 as well. Apart from the unbiased estimates of $\mathrm{SKCE}_k$, the estimates of the ensembles are consistently better than the average estimates of the ensemble members. For the mean squared error and the negative log-likelihood this behaviour is guaranteed theoretically by the generalized mean inequality.
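The average pinball loss for normal predictions can be sketched as follows. This is a minimal illustration, assuming predictions are given as `(mean, std)` pairs; the function names are hypothetical, and the quantiles are obtained from `statistics.NormalDist.inv_cdf` of the Python standard library.

```python
from statistics import NormalDist

def pinball_loss(y, quantile, tau):
    """L_tau(y, q) = (1 - tau) * (q - y)_+ + tau * (y - q)_+."""
    return (1 - tau) * max(quantile - y, 0.0) + tau * max(y - quantile, 0.0)

def average_pinball(predictions, targets, levels):
    """Average pinball loss over all samples and quantile levels."""
    total = 0.0
    for (mean, std), y in zip(predictions, targets):
        dist = NormalDist(mean, std)
        for tau in levels:
            total += pinball_loss(y, dist.inv_cdf(tau), tau)
    return total / (len(targets) * len(levels))

# Quantile levels 0.05, 0.1, ..., 0.95 as used for Fig. 12.
LEVELS = [round(0.05 * i, 2) for i in range(1, 20)]
```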

B THEORY

B.1 GENERAL SETTING

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Define the random variables $X \colon (\Omega, \mathcal{A}) \to (\mathcal{X}, \Sigma_X)$ and $Y \colon (\Omega, \mathcal{A}) \to (\mathcal{Y}, \Sigma_Y)$ such that $\Sigma_X$ contains all singletons, and denote a version of the regular conditional distribution of $Y$ given $X = x$ by $\mathbb{P}(Y|X = x)$ for all $x \in \mathcal{X}$. Let $P \colon (\mathcal{X}, \Sigma_X) \to (\mathcal{P}, \mathcal{B}(\mathcal{P}))$ be a measurable function that maps features in $\mathcal{X}$ to probability measures in $\mathcal{P}$ on the target space $\mathcal{Y}$. We call $P$ a probabilistic model, and denote by $P_x := P(x)$ its output for feature $x \in \mathcal{X}$. This gives rise to the random variable $P_X \colon (\Omega, \mathcal{A}) \to (\mathcal{P}, \mathcal{B}(\mathcal{P}))$ as $P_X := P(X)$. We denote a version of the regular conditional distribution of $Y$ given $P_X = P_x$ by $\mathbb{P}(Y|P_X = P_x)$ for all $P_x \in \mathcal{P}$.

B.2 EXPECTED AND MAXIMUM CALIBRATION ERROR

The common definition of the expected and maximum calibration error (Guo et al., 2017; Kull et al., 2019; Naeini et al., 2015; Vaicenavicius et al., 2019) for classification models can be generalized to arbitrary predictive models. 

B.3 KERNEL CALIBRATION ERROR

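Recall the general notation: Let $k \colon (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$ be a kernel, and denote its corresponding RKHS by $\mathcal{H}$. If not stated otherwise, we assume that
(K1) $k(\cdot, \cdot)$ is Borel-measurable.
(K2) $k$ is integrable with respect to the distributions of $(P_X, Y)$ and $(P_X, Z_X)$, i.e., $\mathbb{E}_{P_X,Y} k^{1/2}\big((P_X, Y), (P_X, Y)\big) < \infty$ and $\mathbb{E}_{P_X,Z_X} k^{1/2}\big((P_X, Z_X), (P_X, Z_X)\big) < \infty$.
Lemma B.1. There exist kernel mean embeddings $\mu_{P_X Y}, \mu_{P_X Z_X} \in \mathcal{H}$ such that for all $f \in \mathcal{H}$
$$\langle f, \mu_{P_X Y} \rangle_{\mathcal{H}} = \mathbb{E}_{P_X,Y} f(P_X, Y) \quad \text{and} \quad \langle f, \mu_{P_X Z_X} \rangle_{\mathcal{H}} = \mathbb{E}_{P_X,Z_X} f(P_X, Z_X).$$
This implies that $\mu_{P_X Y} = \mathbb{E}_{P_X,Y} k\big(\cdot, (P_X, Y)\big)$ and $\mu_{P_X Z_X} = \mathbb{E}_{P_X,Z_X} k\big(\cdot, (P_X, Z_X)\big)$.
Proof. The linear operators $T_{P_X Y} f := \mathbb{E}_{P_X,Y} f(P_X, Y)$ and $T_{P_X Z_X} f := \mathbb{E}_{P_X,Z_X} f(P_X, Z_X)$ for all $f \in \mathcal{H}$ are bounded since
$$|T_{P_X Y} f| = |\mathbb{E}_{P_X,Y} f(P_X, Y)| \leq \mathbb{E}_{P_X,Y} |f(P_X, Y)| = \mathbb{E}_{P_X,Y} \big|\langle k((P_X, Y), \cdot), f \rangle_{\mathcal{H}}\big| \leq \mathbb{E}_{P_X,Y} \big[\|k((P_X, Y), \cdot)\|_{\mathcal{H}} \|f\|_{\mathcal{H}}\big] = \|f\|_{\mathcal{H}} \, \mathbb{E}_{P_X,Y} k^{1/2}\big((P_X, Y), (P_X, Y)\big)$$
and similarly $|T_{P_X Z_X} f| \leq \|f\|_{\mathcal{H}} \, \mathbb{E}_{P_X,Z_X} k^{1/2}\big((P_X, Z_X), (P_X, Z_X)\big)$. Thus the Riesz representation theorem implies that there exist $\mu_{P_X Y}, \mu_{P_X Z_X} \in \mathcal{H}$ such that $T_{P_X Y} f = \langle f, \mu_{P_X Y} \rangle_{\mathcal{H}}$ and $T_{P_X Z_X} f = \langle f, \mu_{P_X Z_X} \rangle_{\mathcal{H}}$. The reproducing property of $\mathcal{H}$ implies
$$\mu_{P_X Y}(p, y) = \langle k((p, y), \cdot), \mu_{P_X Y} \rangle_{\mathcal{H}} = \mathbb{E}_{P_X,Y} k\big((p, y), (P_X, Y)\big)$$
for all $(p, y) \in \mathcal{P} \times \mathcal{Y}$, and similarly $\mu_{P_X Z_X}(p, y) = \mathbb{E}_{P_X,Z_X} k\big((p, y), (P_X, Z_X)\big)$.
Lemma B.2. The squared kernel calibration error (SKCE) with respect to kernel $k$, defined as $\mathrm{SKCE}_k := \mathrm{KCE}_k^2$, is given by
$$\mathrm{SKCE}_k = \mathbb{E}_{P_X,Y,P_{X'},Y'} k\big((P_X, Y), (P_{X'}, Y')\big) - 2 \, \mathbb{E}_{P_X,Y,P_{X'},Z_{X'}} k\big((P_X, Y), (P_{X'}, Z_{X'})\big) + \mathbb{E}_{P_X,Z_X,P_{X'},Z_{X'}} k\big((P_X, Z_X), (P_{X'}, Z_{X'})\big),$$
where $(P_{X'}, Y', Z_{X'})$ is independently distributed according to the law of $(P_X, Y, Z_X)$.
Proof.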
From Lemma B.1 we know that there exist kernel mean embeddings µ P X Y , µ P X Z X ∈ H that satisfy f, µ P X Y -µ P X Z X H = f, µ P X Y H -f, µ P X Z X H = E P X ,Y f (P X , Y ) -E P X ,Z X f (P X , Z X ) for all f ∈ H. Hence by the definition of the dual norm CE F k = sup f ∈F k E P X ,Y f (P X , Y ) -E P X ,Z X f (P X , Z X ) = sup f ∈F k f, µ P X ,Y -µ P X ,Z X H = µ P X ,Y -µ P X ,Z X H , which implies SKCE k = µ P X Y -µ P X Z X , µ P X Y -µ P X Z X H . From Lemma B.1 we obtain SKCE k = E P X ,Y,P X ,Y k (P X , Y ), (P X , Y ) -2 E P X ,Y,P X ,Z X k (P X , Y ), (P X , Z X ) + E P X ,Z X ,P X ,Z X k (P X , Z X ), (P X , Z X ) , which yields the desired result. Recall that (P X1 , Y 1 ), . . . , (P Xn , Y n ) is a validation data set that is sampled i.i.d. according to the law of (P X , Y ) and that for all (p, y), (p , y ) ∈ P × Y Proof. Let i, j ∈ {1, . . . , n}. By assumption (K2) we know that k (P Xi , Y i ), (P Xj , Y j ) ≤ k 1/2 (P Xi , Y i ), (P Xi , Y i ) k 1/2 (P Xj , Y j ), (P Xj , Y j ) < ∞ almost surely. Moreover, E Z X i k (P Xi , Z Xi ), (P Xj , Y j ) ≤ E Z X i k (P Xi , Z Xi ), (P Xj , Y j ) ≤ E Z X i k 1/2 (P Xi , Z Xi ), (P Xi , Z Xi ) k 1/2 (P Xj , Y j ), (P Xj , Y j ) < ∞ almost surely, and similarly E Z X i ,Z X j k (P Xi , Z Xi ), (P Xj , Z Xj ) < ∞ almost surely. Thus h (P Xi , Y i ), (P Xj , Y j ) ≤ k (P Xi , Y i ), (P Xj , Y j ) + E Z X i k (P Xi , Z Xi ), (P Xj , Y j ) + E Z X j k (P Xi , Y i ), (P Xj , Z Xj ) + E Z X i ,Z X j k (P Xi , Z Xi ), (P Xj , Z Xj ) < ∞ almost surely. Lemma 1. The plug-in estimator of SKCE k is non-negatively biased. It is given by SKCE k = 1 n 2 n i,j=1 h (P Xi , Y i ), (P Xj , Y j ) . Proof. From Lemma B.2 we know that KCE k < ∞, and Lemma B.3 implies that SKCE k < ∞ almost surely. For i = 1, . . . 
, n, the linear operators T i f := E Z X i f (P Xi , Z Xi ) for f ∈ H are bounded almost surely since |T i f | = E Z X i f (P Xi , Z Xi ) ≤ E Z X i f (P Xi , Z Xi ) = E Z X i k (P Xi , Z Xi ), • , f H ≤ E Z X i k (P Xi , Z Xi ), • H f H = f H E Z X i k 1/2 (P Xi , Z Xi ), (P Xi , Z Xi ) . Hence Riesz representation theorem implies that there exist ρ i ∈ H such that T i f = f, ρ i H almost surely. From the reproducing property of H we deduce that ρ i (p, y) = k (p, y), • , ρ i H = E Z X i k (p, y), (P Xi , Z Xi ) for all (p, y) ∈ P × Y almost surely. Thus by the definition of the dual norm the plug-in estimator KCE k satisfies KCE k = sup f ∈F k 1 n n i=1 f (P Xi , Y i ) -E Z X i f (P Xi , Z Xi ) = sup f ∈F k 1 n n i=1 k (P Xi , Y i ), • -ρ i , f H = sup f ∈F k 1 n n i=1 k (P Xi , Y i ), • -ρ i , f H = 1 n n i=1 k (G i , Y i ), • -ρ i H = 1 n n i=1 k (P Xi , Y i ), • -ρ i , n i=1 k (P Xi , Y i ), • -ρ i H 1/2 = 1 n n i,j=1 h (P Xi , Y i ), (P Xj , Y j ) 1/2 = SKCE 1/2 k < ∞ almost surely, and hence indeed SKCE 1/2 k is the plug-in estimator of KCE k . Since (P X , Y ), (P X , Y ), (P X1 , Y 1 ), . . . , (P Xn , Y n ) are identically distributed and pairwise independent, we obtain n 2 E SKCE k = n i,j=1, i =j E P X i ,Yi,P X j ,Yj h (P Xi , Y i ), (P Xj , Y j ) + n i=1 E P X i ,Yi h (P Xi , Y i ), (P Xi , Y i ) = n(n -1) E P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) + n E P X ,Y h (P X , Y ), (P X , Y ) = n(n -1)SKCE k + n E P X ,Y h (P X , Y ), (P X , Y ) . (B.1) With the same reasoning as above, there exist ρ, ρ ∈ H such that for all f ∈ H E Z X f (P X , Z X ) = f, ρ H and E Z X f (P X , Z X ) = f, ρ H almost surely. 
Thus we obtain h (P X , Y ), (P X , Y ) = k (P X , Y ), • -ρ, k (P X , Y ), • -ρ H almost surely, and therefore by Lemma B.2 and the Cauchy-Schwarz inequality SKCE k = E P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) = E P X ,Y,P X ,Y k (P X , Y ), • -ρ, k (G , Y ), • -ρ H ≤ E P X ,Y,P X ,Y k (P X , Y ), • -ρ, k (P X , Y ), • -ρ H ≤ E P X ,Y,P X ,Y k (P X , Y ), • -ρ H k (P X , Y ), • -ρ H ≤ E 1/2 P X ,Y k (P X , Y ), • -ρ 2 H E 1/2 P X ,Y k (P X , Y ), • -ρ 2 H . Since (P X , Y ) and (P X , Y ) are identically distributed, we obtain SKCE k ≤ E P X ,Y k (P X , Y ), • -ρ 2 H = E P X ,Y h (P X , Y ), (P X , Y ) . Thus together with Eq. (B.1) we get n 2 E SKCE k ≥ n(n -1)SKCE k + nSKCE k = n 2 SKCE k , and hence SKCE k has a non-negative bias. Lemma 2. The block estimator of SKCE k with block size B ∈ {2, . . . , n}, given by  SKCE k,B := n B -1 n/B b=1 B 2 -1 (b-1)B<i<j≤bB h (P Xi , Y i ), (P Xj , Y j ) , V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞, then for all b ∈ {1, . . . , n/B } V η b = σ 2 B := B 2 -1 2(B -2)ζ 1 + V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) , where η b is defined according to Eq. (B.2) and ζ 1 := E P X ,Y E 2 P X ,Y h (P X , Y ), (P X , Y ) -SKCE 2 k . (B.3) If model P is calibrated, it simplifies to σ 2 B = B 2 -1 E P X ,Y,P X ,Y h 2 (P X , Y ), (P X , Y ) . Proof. Let b ∈ {1, . . . , n/B }. Since V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞, the Cauchy- Schwarz inequality implies V η b < ∞ as well. As mentioned in the proof of Lemma 2 above, η b is a U-statistic of SKCE k . From the general formula of the variance of a U-statistic (see, e.g., Hoeffding, 1948, p. 298-299) we obtain V η b = B 2 -1 2 1 B -2 2 -1 ζ 1 + 2 2 B -2 2 -2 V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) = B 2 -1 2(B -2)ζ 1 + V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) , where ζ 1 = E P X ,Y E 2 P X ,Y h (P X , Y ), (P X , Y ) -SKCE 2 k . 
If model P is calibrated, then (P X , Y ) d = (P X , Z), and hence for all (p, y) ∈ P × Y E P X ,Y h (p, y), (P X , Y ) = E P X ,Y k (p, y), (P X , Y ) -E Z ∼p E P X ,Y k (p, Z ), (P X , Y ) -E P X ,Z k (p, y), (P X , Z) + E Z ∼p E P X ,Z k (p, Z ), (P X , Y ) = 0. This implies ζ 1 = E P X ,Y E 2 P X ,Y h (P X , Y ), (P X , Y ) = 0 and SKCE 2 k = 0 due to Lemma B.2. Thus σ 2 B = B 2 -1 E P X ,Y,P X ,Y h 2 (P X , Y ), (P X , Y ) , as stated above. Corollary B.1. Let B ∈ {2, . . . , n}. If V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞, then V SKCE k,B = n/B -1 σ 2 B . where σ 2 B is defined according to Lemma B.4. Proof. Since the estimators η 1 , . . . , η n/B in each block are pairwise independent, this is an immediate consequence of Lemma B.4. Corollary B.2. Let B ∈ {2, . . . , n}. If V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞, then n/B SKCE k,B -SKCE k d -→ N 0, σ 2 B as n → ∞, where block size B is fixed and σ 2 B is defined according to Lemma B.4. Proof. The result follows from Lemma 2, Lemma B.4, and the central limit theorem (see, e.g., Serfling, 1980, Theorem A in Section 1.9). Remark B.1. Corollary B.2 shows that SKCE k,B is a consistent estimator of SKCE k in the large sample limit as n → ∞ with fixed number B of samples per block. In particular, for the linear estimator with B = 2 we obtain n/2 SKCE k,2 -SKCE k d -→ N 0, σ 2 2 as n → ∞. Moreover, Lemma B.4 and Corollary B.2 show that the p-value of the null hypothesis that model P is calibrated can be estimated by Φ - n/B SKCE k,B σ B , where Φ is the cumulative distribution function of the standard normal distribution and σ B is the empirical standard deviation of the block estimates η 1 , . . . , η n/B , and Φ - n/B B(B -1) SKCE k,B √ 2 σ , where σ 2 is an estimate of E P X ,Y,P X ,Y h 2 (P X , Y ), (P X , Y ) . Similar p-value approximations for the two-sample test with blocks of fixed size were used by Chwialkowski et al. (2015) . Corollary B.3. Assume V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞. 
Let s ∈ {1, . . . , n/2 }. Then for all b ∈ {1, . . . , s} √ B η b -SKCE k d -→ N (0, 4ζ 1 ) as B → ∞, (B.4 ) where η b is defined according to Eq. (B.2) with n = Bs, the number s of equally-sized blocks is fixed, and ζ 1 is defined according to Eq. (B.3). If model P is calibrated, then √ B η b -SKCE k = √ B η b is asymptotically tight since ζ 1 = 0, B η b d -→ ∞ i=1 λ i (Z i -1) as B → ∞, (B.5) where Z i are independent χ 2 1 distributed random variables and λ i ∈ R are eigenvalues of the Hilbert-Schmidt integral operator Kf (p, y) := E P X ,Y h((p, y), (P X , Y ))f (P X , Y ) for Borel-measurable functions f : P × Y → R with E P X ,Y f 2 (P X , Y ) < ∞. Proof. Let s ∈ {1, . . . , n/2 } and b ∈ {1, . . . , s}. As mentioned above in the proof of Lemma 2, the estimator η b , defined according to Eq. (B.2), is a so-called U-statistic of SKCE k (see, e.g., van der Vaart, 1998) . Thus Eq. (B.4) follows from the asymptotic behaviour of U-statistics (see, e.g., van der Vaart, 1998, Theorem 12.3 ). If P is calibrated, then we know from the proof of Lemma B.4 that ζ 1 = 0, and hence η b is a so-called degenerate-U-statistic (see, e.g., van der Vaart, 1998, Section 12.3) . From the theory of degenerate U-statistics it follows that the sequence B η b converges in distribution to the limit distribution in Eq. (B.5), which is known as Gaussian chaos. Corollary B.4. Assume V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞. Let s ∈ {1, . . . , n/2 }. Then √ B SKCE k,B -SKCE k d -→ N (0, 4s -1 ζ 1 ) as B → ∞, where the number s of equally-sized blocks is fixed, n = Bs, and ζ 1 is defined according to Eq. (B.3). 
If model P is calibrated, then √ B SKCE k,B -SKCE k = √ B SKCE k,B is asymptotically tight since ζ 1 = 0, and B SKCE k,B d -→ s -1 ∞ i=1 λ i (Z i -s) as B → ∞, where Z i are independent χ 2 s distributed random variables and λ i ∈ R are eigenvalues of the Hilbert-Schmidt integral operator Kf (p, y) := E P X ,Y h((p, y), (P X , Y ))f (P X , Y ) for Borel-measurable functions f : P × Y → R with E P X ,Y f 2 (P X , Y ) < ∞. Proof. Since the estimators η 1 , . . . , η s in each block are pairwise independent, this is an immediate consequence of Corollary B.3. Remark B.2. Corollary B.4 shows that SKCE k,B is a consistent estimator of SKCE k in the large sample limit as B → ∞ with fixed number n/B of blocks. Moreover, for the minimum variance unbiased estimator with B = n, Corollary B.4 shows that under the null hypothesis that model P is calibrated n SKCE k,n d -→ ∞ i=1 λ i (Z i -1) as n → ∞, where Z i are independent χ 2 1 distributed random variables. Unfortunately quantiles of the limit distribution of ∞ i=1 λ i (Z i -1) (and hence the p-value of the null hypothesis that model P is calibrated) can not be computed analytically but have to be estimated by, e.g., bootstrapping (Arcones & Giné, 1992) , using a Gram matrix spectrum (Gretton et al., 2009) , fitting Pearson curves (Gretton et al., 2007) , or using a Gamma approximation (Johnson et al., 1994, p. 343, p. 359) . Corollary B.5. Assume V P X ,Y,P X ,Y h (P X , Y ), (P X , Y ) < ∞. Then n/B B SKCE k,B -SKCE k d -→ N (0, 4ζ 1 ) as B → ∞ and n/B → ∞, (B.6) where B is the block size and s is the number of equally-sized blocks, n = Bs, and ζ 1 is defined according to Eq. (B.3). 
If model P is calibrated, then n/B B SKCE k,B -SKCE k = n/B B SKCE k,B is asymp- totically tight since ζ 1 = 0, and n/B B SKCE k,B d -→ N 0, ∞ i=1 λ 2 i as B → ∞ and n/B → ∞, where λ i ∈ R are eigenvalues of the Hilbert-Schmidt integral operator Kf (p, y) := E P X ,Y h((p, y), (P X , Y ))f (P X , Y ) for Borel-measurable functions f : P × Y → R with E P X ,Y f 2 (P X , Y ) < ∞. Proof. The result follows from Corollary B.3 and the central limit theorem (see, e.g., Serfling, 1980, Theorem A in Section 1.9). Remark B.3. Corollary B.5 shows that SKCE k,B is a consistent estimator of SKCE k in the large sample limit as B → ∞ and n/B → ∞, i.e., as both the number of samples per block and the number of blocks go to infinity. Moreover, Corollaries B.3 and B.5 show that the p-value of the null hypothesis that P is calibrated can be estimated by Φ - n/B SKCE k,B σ B , where σ B is the empirical standard deviation of the block estimates η 1 , . . . , η n/B . Similar p-value approximations for the two-sample problem with blocks of increasing size were proposed and applied by Zaremba et al. (2013) .
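The block estimator and the normal approximation of the p-value described in Remarks B.1 and B.3 can be sketched as follows. The sketch assumes that the caller supplies a function `h(a, b)` that evaluates $h$ for two samples `a = (p, y)` and `b = (p', y')`; the function names are hypothetical, samples that do not fill a complete block are discarded, and the empirical standard deviation is taken with the usual $n - 1$ normalization, which is an assumption rather than something fixed by the text.

```python
import math
from statistics import NormalDist, mean, stdev

def block_estimates(h, data, block_size):
    """U-statistic eta_b of h within each of the floor(n / B) blocks."""
    n_blocks = len(data) // block_size
    etas = []
    for b in range(n_blocks):
        block = data[b * block_size:(b + 1) * block_size]
        total = 0.0
        for i in range(block_size):
            for j in range(i + 1, block_size):
                total += h(block[i], block[j])
        etas.append(total / math.comb(block_size, 2))
    return etas

def skce_block(h, data, block_size):
    """Block estimator: average of the block-wise U-statistics."""
    return mean(block_estimates(h, data, block_size))

def asymptotic_pvalue(h, data, block_size):
    """Normal approximation Phi(-sqrt(n_blocks) * estimate / sigma_hat)."""
    etas = block_estimates(h, data, block_size)   # needs at least two blocks
    statistic = math.sqrt(len(etas)) * mean(etas) / stdev(etas)
    return NormalDist().cdf(-statistic)
```

With `block_size=2` this gives the linear estimator, and with `block_size=len(data)` the single-block estimator, for which this p-value approximation is not applicable because only one block estimate is available.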

C CALIBRATION MEAN EMBEDDING

C.1 DEFINITION

Similar to the unnormalized mean embedding (UME) proposed by Chwialkowski et al. (2015) in the standard MMD setting, instead of the calibration error $\mathrm{CE}_{\mathcal{F}_k} = \|\mu_{P_X Y} - \mu_{P_X Z_X}\|_{\mathcal{H}}$ we can consider the unnormalized calibration mean embedding (UCME).
Definition C.1. Let $J \in \mathbb{N}$. The unnormalized calibration mean embedding (UCME) for kernel $k$ with $J$ test locations is defined as the random variable
$$\mathrm{UCME}^2_{k,J} = J^{-1} \sum_{j=1}^J \big(\mu_{P_X Y}(T_j) - \mu_{P_X Z_X}(T_j)\big)^2 = J^{-1} \sum_{j=1}^J \Big(\mathbb{E}_{P_X,Y} k\big(T_j, (P_X, Y)\big) - \mathbb{E}_{P_X,Z_X} k\big(T_j, (P_X, Z_X)\big)\Big)^2,$$
where $T_1, \ldots, T_J$ are i.i.d. random variables (so-called test locations) whose distribution is absolutely continuous with respect to the Lebesgue measure on $\mathcal{P} \times \mathcal{Y}$.
As mentioned above, in many machine learning applications we actually have $\mathcal{P} \times \mathcal{Y} \subset \mathbb{R}^d$ (up to some isomorphism). In such a case, if $k$ is an analytic, integrable, characteristic kernel, then for each $J \in \mathbb{N}$, $\mathrm{UCME}_{k,J}$ is a random metric between the distributions of $(P_X, Y)$ and $(P_X, Z_X)$, as shown by Chwialkowski et al. (2015, Theorem 2). In particular, this implies that $\mathrm{UCME}_{k,J} = 0$ almost surely if and only if the two distributions are equal.

C.2 ESTIMATION

Again we assume $(P_{X_1}, Y_1), \ldots, (P_{X_n}, Y_n)$ is a validation data set of predictions and targets, which are i.i.d. according to the law of $(P_X, Y)$. The consistent, but biased, plug-in estimator of $\mathrm{UCME}^2_{k,J}$ is given by
$$\widehat{\mathrm{UCME}}^2_{k,J} = J^{-1} \sum_{j=1}^J \bigg(n^{-1} \sum_{i=1}^n \Big(k\big(T_j, (P_{X_i}, Y_i)\big) - \mathbb{E}_{Z_{X_i}} k\big(T_j, (P_{X_i}, Z_{X_i})\big)\Big)\bigg)^2.$$

C.3 CALIBRATION MEAN EMBEDDING TEST

As Chwialkowski et al. (2015) note, if model $P$ is calibrated, for every fixed sequence of unique test locations $\sqrt{n} \, \widehat{\mathrm{UCME}}^2_{k,J}$ converges in distribution to a sum of correlated $\chi^2$ random variables, as $n \to \infty$. Estimating this asymptotic distribution, and the quantiles required for hypothesis testing, requires a bootstrap or permutation procedure, which is computationally expensive. Hence Chwialkowski et al. (2015) proposed the following test based on Hotelling's $T^2$-statistic (Hotelling, 1931). For $i = 1, \ldots, n$, let
$$Z_i := \begin{pmatrix} k\big(T_1, (P_{X_i}, Y_i)\big) - \mathbb{E}_{Z_{X_i}} k\big(T_1, (P_{X_i}, Z_{X_i})\big) \\ \vdots \\ k\big(T_J, (P_{X_i}, Y_i)\big) - \mathbb{E}_{Z_{X_i}} k\big(T_J, (P_{X_i}, Z_{X_i})\big) \end{pmatrix} \in \mathbb{R}^J,$$
and denote the empirical mean and covariance matrix of $Z_1, \ldots, Z_n$ by $\bar{Z}$ and $S$, respectively. If $\mathrm{UCME}_{k,J}$ is a random metric between the distributions of $(P_X, Y)$ and $(P_X, Z_X)$, then the test statistic
$$Q_n := n \bar{Z}^T S^{-1} \bar{Z}$$
is almost surely asymptotically $\chi^2$ distributed with $J$ degrees of freedom if model $P$ is calibrated, as $n \to \infty$ with $J$ fixed; moreover, if model $P$ is uncalibrated, then for any fixed $r \in \mathbb{R}$ almost surely $\mathbb{P}(Q_n > r) \to 1$ as $n \to \infty$ (Chwialkowski et al., 2015, Proposition 2). We call the resulting calibration test the calibration mean embedding (CME) test.
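Given the vectors $Z_1, \ldots, Z_n$, both the plug-in estimate of the squared UCME and the test statistic $Q_n$ are short computations. The following sketch takes the vectors as a list of lists and is only an illustration: the function names are hypothetical, the covariance matrix uses the $n - 1$ normalization (an assumption), and the linear system is solved with a small Gaussian elimination to stay within the Python standard library.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    m = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, m))
        x[i] = (aug[i][m] - tail) / aug[i][i]
    return x

def column_means(zs):
    n, dim = len(zs), len(zs[0])
    return [sum(z[j] for z in zs) / n for j in range(dim)]

def ucme_squared(zs):
    """Plug-in estimate: average over test locations of the squared mean difference."""
    zbar = column_means(zs)
    return sum(v ** 2 for v in zbar) / len(zbar)

def hotelling_statistic(zs):
    """Q_n = n * zbar^T S^{-1} zbar with empirical mean zbar and covariance S."""
    n, dim = len(zs), len(zs[0])
    zbar = column_means(zs)
    cov = [[sum((z[a] - zbar[a]) * (z[b] - zbar[b]) for z in zs) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    weights = solve(cov, zbar)
    return n * sum(zbar[j] * weights[j] for j in range(dim))
```

The returned statistic would then be compared with a quantile of the $\chi^2$ distribution with $J$ degrees of freedom, for instance obtained from a statistics library.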

D KERNEL CHOICE

A natural choice for the kernel $k \colon (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$ on the product space of predicted distributions $\mathcal{P}$ and targets $\mathcal{Y}$ is a tensor product kernel of the form $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$, i.e., a kernel of the form
$$k\big((p, y), (p', y')\big) = k_{\mathcal{P}}(p, p') \, k_{\mathcal{Y}}(y, y'),$$
where $k_{\mathcal{P}} \colon \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ and $k_{\mathcal{Y}} \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ are kernels on the spaces of predicted distributions and targets, respectively. As discussed in Section 3.1, if kernel $k$ is characteristic, then the kernel calibration error $\mathrm{KCE}_k$ of model $P$ is zero if and only if $P$ is calibrated. Unfortunately, as shown by Szabó & Sriperumbudur (2018, Example 1), even if $k_{\mathcal{P}}$ and $k_{\mathcal{Y}}$ are characteristic, the tensor product kernel $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$ might not be characteristic. However, when analyzing calibration, it is sufficient to be able to distinguish distributions for which the conditional distributions $\mathbb{P}(Y|P_X)$ and $\mathbb{P}(Z_X|P_X) = P_X$ are not equal almost surely. Thus it is sufficient if $k_{\mathcal{Y}}$ is characteristic and $k_{\mathcal{P}}$ is non-zero almost surely. Many common kernels such as the Gaussian and Laplacian kernel on $\mathbb{R}^d$ are characteristic and can therefore be chosen as kernel $k_{\mathcal{Y}}$ for real-valued target spaces. The choice of $k_{\mathcal{P}}$ might be less obvious since $\mathcal{P}$ is a space of probability distributions. Intuitively one might want to use kernels of the form
$$k_{\mathcal{P}}(p, p') = \exp\big(-\lambda d_{\mathcal{P}}^{\nu}(p, p')\big), \tag{D.1}$$
where $d_{\mathcal{P}} \colon \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ is a metric on $\mathcal{P}$ and $\nu, \lambda > 0$ are kernel hyperparameters. Kernels of this form would be a generalization of the Gaussian and Laplacian kernel, and would clearly be non-zero almost surely. Unfortunately, this construction does not necessarily yield valid kernels. Most prominently, the Wasserstein distance does not lead to valid kernels $k_{\mathcal{P}}$ in general (Peyré & Cuturi, 2019, Chapter 8.3). However, if $d_{\mathcal{P}}(\cdot, \cdot)$ is a Hilbertian metric, i.e., a metric of the form $d_{\mathcal{P}}(p, p') = \|\phi(p) - \phi(p')\|_{\mathcal{H}}$ for some Hilbert space $\mathcal{H}$ and mapping $\phi \colon \mathcal{P} \to \mathcal{H}$, then $k_{\mathcal{P}}$ in Eq. (D.1) is a valid kernel for all $\lambda > 0$ and $\nu \in (0, 2]$ (Berg et al., 1984, Corollary 3.3.3, Proposition 3.2.7).
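To make the construction in Eq. (D.1) concrete, the following sketch builds $k_{\mathcal{P}}$ from an explicit feature map into a finite-dimensional Euclidean space, which is one simple instance of a Hilbertian metric. The name `hilbertian_kernel` and the representation of distributions by tuples are assumptions made for illustration only.

```python
import math

def hilbertian_kernel(feature_map, lam, nu):
    """Return k(p, q) = exp(-lam * ||phi(p) - phi(q)||^nu) for phi mapping into R^m."""
    if lam <= 0 or not 0 < nu <= 2:
        raise ValueError("need lam > 0 and nu in (0, 2]")

    def kernel(p, q):
        distance = math.dist(feature_map(p), feature_map(q))  # Euclidean norm
        return math.exp(-lam * distance ** nu)

    return kernel
```

For one-dimensional normal distributions represented as `(mean, std)`, the identity feature map yields the 2-Wasserstein distance as the metric, so `hilbertian_kernel(lambda p: p, 1.0, 1.0)` would correspond to $\exp(-W_2(p, p'))$.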

D.1 NORMAL DISTRIBUTIONS

Assume that Y = R d and P = {N (µ, Σ) : µ ∈ R d , Σ ∈ R d×d psd}, i.e., the model outputs normal distributions P X = N (µ X , Σ X ). The distribution of these outputs is defined by the distribution of their mean µ X and covariance matrix Σ X . Let P x = N (µ x , Σ x ) ∈ P, y ∈ Y = R d , and γ > 0. We obtain E Zx∼Px exp -γ Z x -y 2 2 = I d + 2γΣ x -1/2 exp -γ(µ x -y) T I d + 2γΣ x -1 (µ x -y) from Mathai & Provost (1992, Theorem 3.2.a.3) . In particular, if Σ x = diag(Σ x,1 , . . . , Σ x,d ), then E Zx∼Px exp -γ Z x -y 2 2 = d i=1 1 + 2γΣ x,i -1/2 exp -γ 1 + 2γΣ x,i -1 µ x,i -y i 2 . Let P x = N (µ x , Σ x ) be another normal distribution. Then we have E Zx∼Px,Z x ∼P x exp -γ Z x -Z x 2 2 = I d + 2γΣ x -1/2 E Z x ∼P x exp -γ µ x -Z x T I d + 2γΣ x -1 µ x -Z x = I d + 2γ(Σ x + Σ x ) -1/2 exp -γ µ x -µ x T I d + 2γ(Σ x + Σ x ) -1 µ x -µ x . Thus if Σ x = diag(Σ x,1 , . . . , Σ x,d ) and Σ x = diag Σ x ,1 , . . . , Σ x ,d , then E Zx∼Px,Z x ∼P x exp -γ Z x -Z x 2 2 = d i=1 1 + 2γ(Σ x,i + Σ x ,i ) -1/2 exp -γ 1 + 2γ(Σ x,i + Σ x ,i ) -1 µ x,i -µ x ,i 2 . Hence we see that a Gaussian kernel k Y (y, y ) = exp -γ y -y 2 2 with inverse length scale γ > 0 on the space of targets Y = R d allows us to compute E Zx∼Px k Y (Z x , y) and E Zx∼Px,Z x ∼P x k Y (Z x , Z x ) analytically. Moreover, the Gaussian kernel is characteristic on R d (Fukumizu et al., 2008) . Hence, as discussed above, by choosing a kernel k P that is non-zero almost surely we can guarantee that KCE k = 0 if and only if model P is calibrated. On the space of normal distributions, the 2-Wasserstein distance with respect to the Euclidean distance between P x = N (µ x , Σ x ) and P x = N (µ x , Σ x ) is given by W 2 2 P x , P x = µ x -µ x 2 2 + Tr Σ x + Σ x -2 Σ x 1/2 Σ x Σ x 1/2 1/2 , which can be simplified to W 2 2 P x , P x = µ x -µ x 2 2 + Σ 1/2 x -Σ 1/2 x 2 Frob , if Σ x Σ x = Σ x Σ x . This shows that the 2-Wasserstein distance is a Hilbertian metric on the space of normal distributions. 
Hence as discussed above, the choice k P P x , P x = exp -λW ν 2 (P x , P x ) yields a valid kernel for all λ > 0 and ν ∈ (0, 2]. Thus for all λ, γ > 0 and ν ∈ (0, 2] k (p, y), (p , y ) = exp -λW ν 2 (p, p ) exp -γ y -y 2 2 is a valid kernel on the product space P × Y of normal distributions on R d and R d that allows to evaluate h (p, y), (p , y ) analytically and guarantees that KCE k = 0 if and only if model P is calibrated.
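For one-dimensional normal predictions, the closed-form expectations above make the function $h$ and the plug-in estimator of the SKCE straightforward to evaluate. The sketch below is an illustration under assumptions: predictions are represented as `(mean, variance)` pairs, the defaults $\lambda = \nu = 1$ and $\gamma = 1/2$ are arbitrary choices, the function names are hypothetical, and $h$ is written in the four-term form that appears in the proofs in Section B.3.

```python
import math

def mean_gauss_kernel(mu, var, y, gamma):
    """E exp(-gamma * (Z - y)^2) for Z ~ N(mu, var)."""
    scale = 1.0 + 2.0 * gamma * var
    return math.exp(-gamma * (mu - y) ** 2 / scale) / math.sqrt(scale)

def mean_gauss_kernel_pair(mu1, var1, mu2, var2, gamma):
    """E exp(-gamma * (Z - Z')^2) for independent Z ~ N(mu1, var1), Z' ~ N(mu2, var2)."""
    return mean_gauss_kernel(mu1, var1 + var2, mu2, gamma)

def kernel_p(p, q, lam=1.0, nu=1.0):
    """exp(-lam * W_2^nu) for one-dimensional normals given as (mean, variance)."""
    w2 = math.sqrt((p[0] - q[0]) ** 2 + (math.sqrt(p[1]) - math.sqrt(q[1])) ** 2)
    return math.exp(-lam * w2 ** nu)

def h(p, y, q, z, gamma=0.5):
    """h((p, y), (q, z)) for the tensor product of kernel_p and a Gaussian kernel."""
    target_part = (math.exp(-gamma * (y - z) ** 2)
                   - mean_gauss_kernel(p[0], p[1], z, gamma)
                   - mean_gauss_kernel(q[0], q[1], y, gamma)
                   + mean_gauss_kernel_pair(p[0], p[1], q[0], q[1], gamma))
    return kernel_p(p, q) * target_part

def skce_biased(data, gamma=0.5):
    """Plug-in estimate: average of h over all n^2 pairs of data = [(p, y), ...]."""
    n = len(data)
    return sum(h(p, y, q, z, gamma) for p, y in data for q, z in data) / n ** 2
```

Because the plug-in estimate is a squared RKHS norm, it is non-negative up to floating-point error, which gives a simple sanity check for an implementation.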

D.2 LAPLACE DISTRIBUTIONS

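Assume that $\mathcal{Y} = \mathbb{R}$ and $\mathcal{P} = \{\mathcal{L}(\mu, \beta) : \mu \in \mathbb{R}, \beta > 0\}$, i.e., the model outputs Laplace distributions $P_X = \mathcal{L}(\mu_X, \beta_X)$ with probability density function
$$p_X(y) = \frac{1}{2\beta_X} \exp\big(-\beta_X^{-1} |y - \mu_X|\big)$$
for $y \in \mathcal{Y} = \mathbb{R}$. The distribution of these outputs is defined by the distribution of their mean $\mu_X$ and scale parameter $\beta_X$. Let $P_x = \mathcal{L}(\mu_x, \beta_x) \in \mathcal{P}$, $y \in \mathcal{Y} = \mathbb{R}$, and $\gamma > 0$. If $\beta_x \neq \gamma^{-1}$, we have
$$\mathbb{E}_{Z_x \sim P_x} \exp\big(-\gamma |Z_x - y|\big) = \big(\beta_x^2 \gamma^2 - 1\big)^{-1} \Big(\beta_x \gamma \exp\big(-\beta_x^{-1} |\mu_x - y|\big) - \exp\big(-\gamma |\mu_x - y|\big)\Big).$$
Additionally, if $\beta_x = \gamma^{-1}$, the dominated convergence theorem implies
$$\mathbb{E}_{Z_x \sim P_x} \exp\big(-\gamma |Z_x - y|\big) = \lim_{\gamma \to \beta_x^{-1}} \big(\beta_x^2 \gamma^2 - 1\big)^{-1} \Big(\beta_x \gamma \exp\big(-\beta_x^{-1} |\mu_x - y|\big) - \exp\big(-\gamma |\mu_x - y|\big)\Big) = \frac{1}{2} \big(1 + \gamma |\mu_x - y|\big) \exp\big(-\gamma |\mu_x - y|\big).$$
Let $P_{x'} = \mathcal{L}(\mu_{x'}, \beta_{x'})$ be another Laplace distribution. If $\beta_x \neq \gamma^{-1}$, $\beta_{x'} \neq \gamma^{-1}$, and $\beta_x \neq \beta_{x'}$, we obtain
$$\mathbb{E}_{Z_x \sim P_x, Z_{x'} \sim P_{x'}} \exp\big(-\gamma |Z_x - Z_{x'}|\big) = \frac{\gamma \beta_x^3}{(\beta_x^2 \gamma^2 - 1)(\beta_x^2 - \beta_{x'}^2)} \exp\big(-\beta_x^{-1} |\mu_x - \mu_{x'}|\big) + \frac{\gamma \beta_{x'}^3}{(\beta_{x'}^2 \gamma^2 - 1)(\beta_{x'}^2 - \beta_x^2)} \exp\big(-\beta_{x'}^{-1} |\mu_x - \mu_{x'}|\big) + \frac{1}{(\beta_x^2 \gamma^2 - 1)(\beta_{x'}^2 \gamma^2 - 1)} \exp\big(-\gamma |\mu_x - \mu_{x'}|\big).$$
As above, all other possible cases can be deduced by applying the dominated convergence theorem. More concretely,
• if $\beta_x = \beta_{x'} = \gamma^{-1}$, then
$$\mathbb{E}_{Z_x \sim P_x, Z_{x'} \sim P_{x'}} \exp\big(-\gamma |Z_x - Z_{x'}|\big) = \frac{1}{8} \big(3 + 3\gamma |\mu_x - \mu_{x'}| + \gamma^2 |\mu_x - \mu_{x'}|^2\big) \exp\big(-\gamma |\mu_x - \mu_{x'}|\big),$$
• if $\beta_x = \beta_{x'}$ and $\beta_x \neq \gamma^{-1}$, then
$$\mathbb{E}_{Z_x \sim P_x, Z_{x'} \sim P_{x'}} \exp\big(-\gamma |Z_x - Z_{x'}|\big) = \frac{1}{(\beta_x^2 \gamma^2 - 1)^2} \exp\big(-\gamma |\mu_x - \mu_{x'}|\big) + \bigg(\frac{\gamma \big(\beta_x + |\mu_x - \mu_{x'}|\big)}{2 (\beta_x^2 \gamma^2 - 1)} - \frac{\beta_x \gamma}{(\beta_x^2 \gamma^2 - 1)^2}\bigg) \exp\big(-\beta_x^{-1} |\mu_x - \mu_{x'}|\big),$$
• if $\beta_x \neq \beta_{x'}$ and $\beta_x = \gamma^{-1}$, then
$$\mathbb{E}_{Z_x \sim P_x, Z_{x'} \sim P_{x'}} \exp\big(-\gamma |Z_x - Z_{x'}|\big) = \frac{\beta_{x'}^3 \gamma^3}{(\beta_{x'}^2 \gamma^2 - 1)^2} \exp\big(-\beta_{x'}^{-1} |\mu_x - \mu_{x'}|\big) - \bigg(\frac{1 + \gamma |\mu_x - \mu_{x'}|}{2 (\beta_{x'}^2 \gamma^2 - 1)} + \frac{\beta_{x'}^2 \gamma^2}{(\beta_{x'}^2 \gamma^2 - 1)^2}\bigg) \exp\big(-\gamma |\mu_x - \mu_{x'}|\big),$$
• and if $\beta_x \neq \beta_{x'}$ and $\beta_{x'} = \gamma^{-1}$, then
$$\mathbb{E}_{Z_x \sim P_x, Z_{x'} \sim P_{x'}} \exp\big(-\gamma |Z_x - Z_{x'}|\big) = \frac{\beta_x^3 \gamma^3}{(\beta_x^2 \gamma^2 - 1)^2} \exp\big(-\beta_x^{-1} |\mu_x - \mu_{x'}|\big) - \bigg(\frac{1 + \gamma |\mu_x - \mu_{x'}|}{2 (\beta_x^2 \gamma^2 - 1)} + \frac{\beta_x^2 \gamma^2}{(\beta_x^2 \gamma^2 - 1)^2}\bigg) \exp\big(-\gamma |\mu_x - \mu_{x'}|\big).$$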
The calculations above show that by choosing a Laplacian kernel k Y y, y = exp -γ|y -y | with inverse length scale γ > 0 on the space of targets Y = R, we can compute E Zx∼Px k Y (Z x , y) and E Zx∼Px,Z x ∼P x k Y (Z x , Z x ) analytically. Additionally, the Laplacian kernel is characteristic on R (Fukumizu et al., 2008) . Since the Laplace distribution is an elliptically contoured distribution, we know from Gelbrich (1990, Corollary 2) that the 2-Wasserstein distance with respect to the Euclidean distance between P x = L(µ x , β x ) and P x = L(µ x , β x ) can be computed in closed form and is given by W 2 2 P x , P x = (µ x -µ x ) 2 + 2(β x -β x ) 2 . Thus we see that the 2-Wasserstein distance is also a Hilbertian metric on the space of Laplace distributions, and hence k P P x , P x = exp -λW ν 2 (P x , P x ) is a valid kernel for 0 < ν ≤ 2 and all λ > 0. Therefore, as discussed above, for all λ, γ > 0 and ν ∈ (0, 2] k (p, y), (p , y ) = exp -λW ν 2 (p, p ) exp -γ|y -y | is a valid kernel on the product space P × Y of Laplace distributions and R that allows to evaluate h (p, y), (p , y ) analytically and guarantees that KCE k = 0 if and only if model P is calibrated.
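The closed-form expectations for Laplace predictions can be sketched as follows. The function names and the tolerance `tol` used to detect the limiting cases are assumptions for illustration, and the pairwise function only covers the generic case in which $\beta_x$, $\beta_{x'}$, and $\gamma^{-1}$ are pairwise distinct. Close to the limiting cases the generic formulas suffer from cancellation, so a practical implementation would likely need more careful numerics.

```python
import math

def mean_laplace_kernel(mu, beta, y, gamma, tol=1e-9):
    """E exp(-gamma * |Z - y|) for Z ~ Laplace(mu, beta)."""
    a = abs(mu - y)
    bg = beta * gamma
    if abs(bg - 1.0) < tol:               # limiting case beta = 1 / gamma
        return 0.5 * (1.0 + gamma * a) * math.exp(-gamma * a)
    return (bg * math.exp(-a / beta) - math.exp(-gamma * a)) / (bg ** 2 - 1.0)

def mean_laplace_kernel_pair(mu1, beta1, mu2, beta2, gamma, tol=1e-9):
    """E exp(-gamma * |Z - Z'|) for independent Laplace variables, generic case only."""
    gaps = (abs(beta1 * gamma - 1.0), abs(beta2 * gamma - 1.0), abs(beta1 - beta2))
    if min(gaps) < tol:
        raise ValueError("degenerate case: use the corresponding limiting formula")
    a = abs(mu1 - mu2)
    c1 = beta1 ** 2 * gamma ** 2 - 1.0
    c2 = beta2 ** 2 * gamma ** 2 - 1.0
    return (gamma * beta1 ** 3 / (c1 * (beta1 ** 2 - beta2 ** 2)) * math.exp(-a / beta1)
            + gamma * beta2 ** 3 / (c2 * (beta2 ** 2 - beta1 ** 2)) * math.exp(-a / beta2)
            + math.exp(-gamma * a) / (c1 * c2))
```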

D.3 PREDICTING MIXTURES OF DISTRIBUTIONS

Assume that the model predicts mixture distributions, possibly with different numbers of components. A special case of this setting are ensembles of models, in which each ensemble member predicts a component of the mixture model. Let p, p ∈ P with p = i π i p i and p = j π j p j , where π, π are histograms and p i , p j are the mixture components. For kernel k Y and y ∈ Y we obtain E Z∼p k Y (Z, y) = i π i E W ∼pi k Y (Z, y) and E Z∼p,Z ∼p k Y (Z, Z ) = i,j π i π j E Z∼pi,Z ∼p j k Y (Z, Z ). Of course, for these derivations to be meaningful, we require that they do not depend on the choice of histograms π, π and mixture components p i , p j . Definition D.1 (see Yakowitz & Spragins (1968) ). A family P of finite mixture models is called identifiable if two mixtures p = K i=1 π i p i ∈ P and p = K j=1 π j p j ∈ P, written such that all p i and all p j are pairwise distinct, are equal if and only if K = K and the indices can be reordered such that for all k ∈ {1, . . . , K} there exists some k ∈ {1, . . . , K} with π k = π k and p k = p k . Clearly, if P is identifiable, then the derivations above do not depend on the choice of histograms and mixture components. Prominent examples of identifiable mixture models are Gaussian mixture models and mixture models of families of products of exponential distributions (Yakowitz & Spragins, 1968) . Moreover, similar to optimal transport for Gaussian mixture models by Chen et al. (2019; 2020) If d(•, •) is a (Hilbertian) metric on the space of mixture components, then the Mixture Wasserstein distance of order s defined by MW s (p, p ) := inf w∈Π(π,π ) i,j w i,j d s (p i , p j ) 1/s , (D.2) is a (Hilbertian) metric on P. Proof. First of all, note that for all p, p ∈ P an optimal coupling ŵ exists (Villani, 2009, Theorem 4.1) . Moreover, i,j ŵi,j d s (p i , p j ) ≥ 0, and hence MW s (p, p ) exists. Moreover, since P is identifiable, we see that MW s (p, p ) does not depend on the choice of histograms and mixture components. 
Thus MW s is well-defined. Clearly, for all p, p ∈ P we have MW s (p, p ) ≥ 0 and MW s (p, p ) = MW s (p , p). Moreover, MW s s (p, p) = min w∈Π(π,π) i,j w i,j d s (p i , p j ) ≤ i,j π i δ i,j d s (p i , p j ) = i π i d s (p i , p i ) = i π i 0 2 = 0, and hence MW s (p, p) = 0. On the other hand, let p, p ∈ P with optimal coupling ŵ with respect to π and π , and assume that MW s (p, p ) = 0. We have p = i π i p i = i,j ŵi,j p i = i,j : ŵi,j >0 ŵi,j p i . Since MW s (p, p ) = 0, we have ŵi,j d s (p i , p j ) = 0 for all i, j, and hence d s (p i , p j ) = 0 if ŵi,j > 0. Since d is a metric, this implies p i = p j if ŵi,j > 0. Thus we get p = i,j : ŵi,j >0 ŵi,j p i = i,j : ŵi,j >0 ŵi,j p j = i,j ŵi,j p j = j π j p j = p . Function MW s also satisfies the triangle inequality, following a similar argument as Chen et al. (2019) . Let p (1) , p (2) , p (3) ∈ P and denote the optimal coupling with respect to π (1) and π (2) by ŵ( 12) , and the optimal coupling with respect to π (2) and π (3) by ŵ( 23) . Define w (13) by w i,k := j : π (2) j =0 ŵ(12) i,j ŵ(23) j,k π . Clearly w i,k ≥ 0 for all i, k, and we see that i w (13) i,k = i j : π (2) j =0 ŵ(12) i,j ŵ(23) j,k π (2) j = j : π (2) j =0 i ŵ(12) i,j ŵ(23) j,k π (2) j = j : π (2) j =0 π (2) j ŵ(23) j,k π (2) j = j : π (2) j =0 ŵ(23) j,k = π (3) - j : π (2) j =0 ŵ(23) j,k for all k. Since for all j, k, π 2) j ≥ ŵ(23) j,k , we know that π (2) j = 0 implies ŵ(23) j,k = 0 for all k. Thus for all k i w (13) i,k = π (3) . Similarly we obtain for all i k w (13) i,k = π (1) . 
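Thus $w^{(13)} \in \Pi(\pi^{(1)}, \pi^{(3)})$, and therefore by exploiting the triangle inequality for metric $d$ and the Minkowski inequality we get

$$\begin{aligned}
\mathrm{MW}_s\big(p^{(1)}, p^{(3)}\big) &\leq \Big( \sum_{i,k} w^{(13)}_{i,k} \, d^s\big(p^{(1)}_i, p^{(3)}_k\big) \Big)^{1/s} \\
&= \Big( \sum_{i,k} \sum_{j \colon \pi^{(2)}_j \neq 0} \frac{\hat{w}^{(12)}_{i,j} \hat{w}^{(23)}_{j,k}}{\pi^{(2)}_j} \, d^s\big(p^{(1)}_i, p^{(3)}_k\big) \Big)^{1/s} \\
&\leq \Big( \sum_{i,k} \sum_{j \colon \pi^{(2)}_j \neq 0} \frac{\hat{w}^{(12)}_{i,j} \hat{w}^{(23)}_{j,k}}{\pi^{(2)}_j} \, \big( d\big(p^{(1)}_i, p^{(2)}_j\big) + d\big(p^{(2)}_j, p^{(3)}_k\big) \big)^s \Big)^{1/s} \\
&\leq \Big( \sum_{i,k} \sum_{j \colon \pi^{(2)}_j \neq 0} \frac{\hat{w}^{(12)}_{i,j} \hat{w}^{(23)}_{j,k}}{\pi^{(2)}_j} \, d^s\big(p^{(1)}_i, p^{(2)}_j\big) \Big)^{1/s} + \Big( \sum_{i,k} \sum_{j \colon \pi^{(2)}_j \neq 0} \frac{\hat{w}^{(12)}_{i,j} \hat{w}^{(23)}_{j,k}}{\pi^{(2)}_j} \, d^s\big(p^{(2)}_j, p^{(3)}_k\big) \Big)^{1/s} \\
&= \Big( \sum_i \sum_{j \colon \pi^{(2)}_j \neq 0} \hat{w}^{(12)}_{i,j} \, d^s\big(p^{(1)}_i, p^{(2)}_j\big) \Big)^{1/s} + \Big( \sum_k \sum_{j \colon \pi^{(2)}_j \neq 0} \hat{w}^{(23)}_{j,k} \, d^s\big(p^{(2)}_j, p^{(3)}_k\big) \Big)^{1/s} \\
&\leq \Big( \sum_{i,j} \hat{w}^{(12)}_{i,j} \, d^s\big(p^{(1)}_i, p^{(2)}_j\big) \Big)^{1/s} + \Big( \sum_{j,k} \hat{w}^{(23)}_{j,k} \, d^s\big(p^{(2)}_j, p^{(3)}_k\big) \Big)^{1/s} \\
&= \mathrm{MW}_s\big(p^{(1)}, p^{(2)}\big) + \mathrm{MW}_s\big(p^{(2)}, p^{(3)}\big).
\end{aligned}$$

Thus $\mathrm{MW}_s$ is a metric, and it is just left to show that it is Hilbertian if $d$ is Hilbertian. Since $d$ is a Hilbertian metric, there exists a Hilbert space $\mathcal{H}$ and a mapping $\phi$ such that $d(x, y) = \|\phi(x) - \phi(y)\|_{\mathcal{H}}$. Let $r_1, \ldots, r_n \in \mathbb{R}$ with $\sum_i r_i = 0$ and $p^{(1)}, \ldots, p^{(n)} \in \mathcal{P}$. Denote the optimal coupling with respect to $\pi^{(i)}$ and $\pi^{(j)}$ by $\hat{w}^{(i,j)}$. Then we have

$$\sum_{i,j} r_i r_j \sum_{k,l} \hat{w}^{(i,j)}_{k,l} \big\|\phi\big(p^{(i)}_k\big)\big\|^2_{\mathcal{H}} = \sum_{i,k} r_i \big\|\phi\big(p^{(i)}_k\big)\big\|^2_{\mathcal{H}} \sum_j r_j \sum_l \hat{w}^{(i,j)}_{k,l} = \sum_{i,k} r_i \big\|\phi\big(p^{(i)}_k\big)\big\|^2_{\mathcal{H}} \sum_j r_j \pi^{(i)}_k = \sum_{i,k} r_i \pi^{(i)}_k \big\|\phi\big(p^{(i)}_k\big)\big\|^2_{\mathcal{H}} \sum_j r_j = 0, \tag{D.3}$$

and similarly

$$\sum_{i,j} r_i r_j \sum_{k,l} \hat{w}^{(i,j)}_{k,l} \big\|\phi\big(p^{(j)}_l\big)\big\|^2_{\mathcal{H}} = 0. \tag{D.4}$$

Moreover, for all $k, l$ we get

$$\sum_{i,j} r_i r_j \hat{w}^{(i,j)}_{k,l} \big\langle \phi\big(p^{(i)}_k\big), \phi\big(p^{(j)}_l\big) \big\rangle_{\mathcal{H}} = \Big\langle \sum_i r_i \hat{w}^{(i,j)}_{k,l} \phi\big(p^{(i)}_k\big), \sum_j r_j \hat{w}^{(i,j)}_{k,l} \phi\big(p^{(j)}_l\big) \Big\rangle_{\mathcal{H}} = \Big\| \sum_i r_i \hat{w}^{(i,j)}_{k,l} \phi\big(p^{(i)}_k\big) \Big\|^2_{\mathcal{H}} \geq 0,$$

and hence

$$\sum_{i,j} r_i r_j \sum_{k,l} \hat{w}^{(i,j)}_{k,l} \big\langle \phi\big(p^{(i)}_k\big), \phi\big(p^{(j)}_l\big) \big\rangle_{\mathcal{H}} \geq 0. \tag{D.5}$$

E CLASSIFICATION AS A SPECIAL CASE

We show that the calibration error introduced in Definition 2 is a generalization of the calibration error for classification proposed by Widmann et al. (2019).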
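Their formulation of the calibration error is based on a weighted sum of class-wise discrepancies between the left-hand side and right-hand side of Definition 1, where the weights are output by a vector-valued function of the predictions. Hence their framework can only be applied to finite target spaces, i.e., if $|\mathcal{Y}| < \infty$. Without loss of generality, we assume that $\mathcal{Y} = \{1, \ldots, d\}$ for some $d \in \mathbb{N} \setminus \{1\}$. In our notation, the previously defined calibration error, denoted by CCE (classification calibration error), with respect to a function space $\mathcal{G} \subset \{f : \mathcal{P} \to \mathbb{R}^d\}$ is given by

$$\mathrm{CCE}_{\mathcal{G}} := \sup_{g \in \mathcal{G}} \Big| \mathbb{E}_{P_X} \Big( \sum_{y \in \mathcal{Y}} \big( \mathbb{P}(Y = y \mid P_X) - P_X(\{y\}) \big) \, g_y(P_X) \Big) \Big|.$$

For the function class $\mathcal{F} := \{ f : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}, (p, y) \mapsto g_y(p) \mid g \in \mathcal{G} \}$ we get

$$\mathrm{CCE}_{\mathcal{G}} = \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big| = \mathrm{CE}_{\mathcal{F}}.$$

Similarly, for every function class $\mathcal{F} \subset \{f : \mathcal{P} \times \mathcal{Y} \to \mathbb{R}\}$, we can define the space $\mathcal{G} := \{ g : \mathcal{P} \to \mathbb{R}^d, p \mapsto (f(p, 1), \ldots, f(p, d))^{\mathsf{T}} \mid f \in \mathcal{F} \}$, for which

$$\mathrm{CE}_{\mathcal{F}} = \sup_{g \in \mathcal{G}} \Big| \mathbb{E}_{P_X} \Big( \sum_{y \in \mathcal{Y}} \big( \mathbb{P}(Y = y \mid P_X) - P_X(\{y\}) \big) \, g_y(P_X) \Big) \Big| = \mathrm{CCE}_{\mathcal{G}}.$$

Thus both definitions are equivalent for classification models but the structure of the employed function classes differs. The definition of CCE is based on vector-valued functions on the probability simplex whereas the formulation presented in this paper uses real-valued functions on the product space of the probability simplex and the targets.

An interesting theoretical aspect of this difference is that in the case of KCE we consider real-valued kernels on $\mathcal{P} \times \mathcal{Y}$ instead of matrix-valued kernels on $\mathcal{P}$, as shown by the following comparison. By $e_i \in \mathbb{R}^d$ we denote the $i$th unit vector, and for a prediction $p \in \mathcal{P}$ its representation $v_p \in \mathbb{R}^d$ in the probability simplex is defined as $(v_p)_y = p(\{y\})$ for all targets $y \in \mathcal{Y}$. Let $k : (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$. We define the matrix-valued function $K : \mathcal{P} \times \mathcal{P} \to \mathbb{R}^{d \times d}$ by

$$[K(p, p')]_{y, y'} = k\big((p, y), (p', y')\big)$$

for all $y, y' \in \mathcal{Y}$ and $p, p' \in \mathcal{P}$. From the positive definiteness of kernel $k$ it follows that $K$ is a matrix-valued kernel (Micchelli & Pontil, 2005, Definition 2).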
We obtain

$$\begin{aligned}
\mathrm{SKCE}_k &= \mathbb{E}_{P_X, Y, P_{X'}, Y'} [K(P_X, P_{X'})]_{Y, Y'} - 2 \, \mathbb{E}_{P_X, Y, P_{X'}, Z_{X'}} [K(P_X, P_{X'})]_{Y, Z_{X'}} + \mathbb{E}_{P_X, Z_X, P_{X'}, Z_{X'}} [K(P_X, P_{X'})]_{Z_X, Z_{X'}} \\
&= \mathbb{E}_{P_X, Y, P_{X'}, Y'} \, e_Y^{\mathsf{T}} K(P_X, P_{X'}) e_{Y'} - 2 \, \mathbb{E}_{P_X, Y, P_{X'}, Y'} \, e_Y^{\mathsf{T}} K(P_X, P_{X'}) v_{P_{X'}} + \mathbb{E}_{P_X, Y, P_{X'}, Y'} \, v_{P_X}^{\mathsf{T}} K(P_X, P_{X'}) v_{P_{X'}} \\
&= \mathbb{E}_{P_X, Y, P_{X'}, Y'} \, (e_Y - v_{P_X})^{\mathsf{T}} K(P_X, P_{X'}) (e_{Y'} - v_{P_{X'}}),
\end{aligned}$$

which is exactly the result by Widmann et al. (2019) for matrix-valued kernels.

F TEMPERATURE SCALING

Since many modern neural network models for classification have been demonstrated to be uncalibrated (Guo et al., 2017), it is of high practical interest to be able to improve the calibration of predictive models. Generally, one distinguishes between calibration techniques that are applied during training and post-hoc calibration methods that try to calibrate an existing model after training.

Temperature scaling (Guo et al., 2017) is a simple calibration method for classification models with only one scalar parameter. Due to its simplicity it can trade off calibration of different classes (Kull et al., 2019), but conveniently it does not change the most-confident prediction and hence does not affect the accuracy of classification models with respect to the 0-1 loss.

In regression, common post-hoc calibration methods are based on quantile binning and hence insufficient for our framework. Song et al. (2019) proposed a calibration method for regression models with real-valued targets, based on a special case of Definition 1. This calibration method was shown to perform well empirically but is computationally expensive and requires users to choose hyperparameters for a Gaussian process model and its variational inference. As a simpler alternative, we generalize temperature scaling to arbitrary predictive models in the following way.

Definition F.1. Let $P_x$ be the output of a probabilistic predictive model $P$ for feature $x$.
If $P_x$ has probability density function $p_x$ with respect to a reference measure $\mu$, then temperature scaling with respect to $\mu$ with temperature $T > 0$ yields a new output $Q_x$ whose probability density function $q_x$ with respect to $\mu$ satisfies $q_x \propto p_x^{1/T}$.

The notion for classification models given by Guo et al. (2017) can be recovered by choosing the counting measure on the classes as reference measure. For some exponential families on $\mathbb{R}^d$ we obtain particularly simple transformations with respect to the Lebesgue measure $\lambda^d$ that keep the type of predicted distribution and its mean invariant. Hence in contrast to other calibration methods, for these models temperature scaling yields analytically tractable distributions and does not negatively impact the accuracy of the models with respect to the mean squared error and the mean absolute error. For instance, temperature scaling of multivariate power exponential distributions (Gómez et al., 1998) in $\mathbb{R}^d$, of which multivariate normal distributions are a special case, with respect to $\lambda^d$ corresponds to multiplication of their scale parameter with $T^{1/\beta}$, where $\beta$ is the so-called kurtosis parameter (Gómez-Sánchez-Manzano et al., 2008). For normal distributions, this corresponds to multiplication of the covariance matrix with $T$. Similarly, temperature scaling of Beta and Dirichlet distributions with respect to reference measure

$$\mu(\mathrm{d}x) := x^{-1} (1 - x)^{-1} \mathbb{1}_{(0,1)}(x) \, \lambda^1(\mathrm{d}x) \quad \text{and} \quad \mu(\mathrm{d}x) := \prod_{i=1}^{d} x_i^{-1} \, \mathbb{1}_{(0,1)^d}(x) \, \lambda^d(\mathrm{d}x),$$

respectively, corresponds to division of the canonical parameters of these distributions by $T$ without affecting the predicted mean value.

All in all, we see that temperature scaling for general predictive models preserves some of the nice properties for classification models.
For some exponential families such as normal distributions reference measure µ can be chosen such that temperature scaling is a simple transformation of the parameters of the predicted distributions (and hence leaves the considered model class invariant) that does not affect accuracy of these models with respect to the mean squared error and the mean absolute error.
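The two special cases mentioned above can be checked numerically. The following is a minimal sketch, not the implementation used in the experiments; the function names `temperature_scale_categorical`, `temperature_scale_normal`, and `normal_pdf` are hypothetical and introduced only for illustration. It assumes a categorical prediction with the counting measure as reference measure and a univariate normal prediction with the Lebesgue measure as reference measure.

```python
import numpy as np

def temperature_scale_categorical(probs, T):
    """Reference measure = counting measure: q is proportional to p**(1/T)."""
    log_q = np.log(probs) / T
    log_q -= log_q.max()          # stabilise the exponentials
    q = np.exp(log_q)
    return q / q.sum()

def temperature_scale_normal(mean, cov, T):
    """Reference measure = Lebesgue measure: N(m, S) becomes N(m, T * S)."""
    return mean, T * np.asarray(cov)

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution (hypothetical helper)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

T = 2.5

# Classification: scaling the probabilities equals a softmax of logits / T.
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()
q = temperature_scale_categorical(probs, T)
softmax_T = np.exp(logits / T) / np.exp(logits / T).sum()

# Regression: p**(1/T) is proportional to the density of N(m, T * v),
# so the ratio below is constant in x and the mean is unchanged.
xs = np.linspace(-3.0, 5.0, 9)
m, v = 1.0, 0.7
m_T, v_T = temperature_scale_normal(m, v, T)
ratio = normal_pdf(xs, m, v) ** (1.0 / T) / normal_pdf(xs, m_T, v_T)
```

In both cases the most probable target (the arg max class and the mean, respectively) is left unchanged, which is the property that keeps the accuracy measures discussed above invariant.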

G EXPECTED CALIBRATION ERROR FOR COUNTABLY INFINITE DISCRETE TARGET SPACES

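In the literature, $\mathrm{ECE}_d$ and $\mathrm{MCE}_d$ are defined for binary and multi-class classification problems (Guo et al., 2017; Naeini et al., 2015; Vaicenavicius et al., 2019). For common distance measures on the probability simplex such as the total variation distance and the squared Euclidean distance, $\mathrm{ECE}_d$ and $\mathrm{MCE}_d$ can be formulated as a calibration error in the framework of Widmann et al. (2019), which is a special case of the framework proposed in this paper for binary and multi-class classification problems.

In contrast to previous approaches, our framework handles countably infinite discrete target spaces as well. For every problem with countably infinitely many targets, such as, e.g., Poisson regression, there exists an equivalent regression problem on the set of natural numbers. Hence without loss of generality we assume $\mathcal{Y} = \mathbb{N}$. Denote the space of probability distributions on $\mathbb{N}$, the infinite dimensional probability simplex, with $\Delta^\infty$. Clearly, $\Delta^\infty$ can be viewed as a subspace of the sequence space $\ell^1$ that consists of all sequences $x = (x_n)_{n \in \mathbb{N}}$ with $x_n \geq 0$ for all $n \in \mathbb{N}$ and $\|x\|_1 = 1$.

Theorem G.1. Let $1 < p < \infty$ with Hölder conjugate $q$. If $\mathcal{F} := \{ f : \Delta^\infty \times \mathbb{N} \to \mathbb{R} \mid \mathbb{E}_{P_X} \| (f(P_X, n))_{n \in \mathbb{N}} \|_p^p \leq 1 \}$, then

$$\mathrm{CE}_{\mathcal{F}}^q = \mathbb{E}_{P_X} \| \mathbb{P}(Y \mid P_X) - P_X \|_q^q.$$

Let $\mu$ be the law of $P_X$. If $\mathcal{F} := \{ f : \Delta^\infty \times \mathbb{N} \to \mathbb{R} \mid \mathbb{E}_{P_X} \| (f(P_X, n))_{n \in \mathbb{N}} \|_1 \leq 1 \}$, then

$$\mathrm{CE}_{\mathcal{F}} = \mu\text{-}\operatorname*{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} | \mathbb{P}(Y = y \mid P_X = \xi) - \xi(\{y\}) |.$$

Moreover, if $\mathcal{F} = \{ f : \Delta^\infty \times \mathbb{N} \to \mathbb{R} \mid \mu\text{-}\operatorname*{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} |f(\xi, y)| \leq 1 \}$, then

$$\mathrm{CE}_{\mathcal{F}} = \mathbb{E}_{P_X} \| \mathbb{P}(Y \mid P_X) - P_X \|_1.$$

Proof. Let $1 \leq p \leq \infty$, and let $\mu$ be the law of $P_X$ and $\nu$ be the counting measure on $\mathbb{N}$. Since both $\mu$ and $\nu$ are $\sigma$-finite measures, the product measure $\mu \otimes \nu$ is uniquely determined and $\sigma$-finite as well. Using these definitions, we can reformulate $\mathcal{F}$ as

$$\mathcal{F} = \{ f \in L^p(\Delta^\infty \times \mathbb{N}; \mu \otimes \nu) \mid \|f\|_{p; \mu \otimes \nu} \leq 1 \}.$$

Define the function $\delta : \Delta^\infty \times \mathbb{N} \to \mathbb{R}$ $(\mu \otimes \nu)$-almost surely by

$$\delta(\xi, y) := \mathbb{P}(Y = y \mid P_X = \xi) - \xi(\{y\}).$$

Note that $\delta$ is well-defined since we assume that all singletons on $\Delta^\infty$ are $\mu$-measurable. Moreover, $\delta \in L^q(\Delta^\infty \times \mathbb{N}; \mu \otimes \nu)$, which follows from $(\xi, y) \mapsto \mathbb{P}(Y = y \mid P_X = \xi)$ and $(\xi, y) \mapsto \xi(\{y\})$ being functions in $L^q(\Delta^\infty \times \mathbb{N}; \mu \otimes \nu)$.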
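Since $\mu \otimes \nu$ is a $\sigma$-finite measure, the extremal equality of Hölder's inequality implies that

$$\begin{aligned}
\mathrm{CE}_{\mathcal{F}} &= \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big| \\
&= \sup_{f \in \mathcal{F}} \big( \mathbb{E}_{P_X, Y} f(P_X, Y) - \mathbb{E}_{P_X, Z_X} f(P_X, Z_X) \big) \\
&= \sup_{f \in \mathcal{F}} \int_{\Delta^\infty \times \mathbb{N}} f(\xi, y) \, \delta(\xi, y) \, (\mu \otimes \nu)(\mathrm{d}(\xi, y)) = \|\delta\|_{q; \mu \otimes \nu}.
\end{aligned}$$

Note that the second equality follows from the symmetry of the function space $\mathcal{F}$: for every $f \in \mathcal{F}$, also $-f \in \mathcal{F}$. Hence for $1 < p \leq \infty$, we obtain

$$\mathrm{CE}_{\mathcal{F}}^q = \int_{\Delta^\infty \times \mathbb{N}} |\delta(\xi, y)|^q \, (\mu \otimes \nu)(\mathrm{d}(\xi, y)) = \mathbb{E}_{P_X} \| (\delta(P_X, y))_{y \in \mathbb{N}} \|_q^q = \mathbb{E}_{P_X} \| \mathbb{P}(Y \mid P_X) - P_X \|_q^q.$$

For $p = 1$, we get

$$\mathrm{CE}_{\mathcal{F}} = \mu\text{-}\operatorname*{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} |\delta(\xi, y)| = \mu\text{-}\operatorname*{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} | \mathbb{P}(Y = y \mid P_X = \xi) - \xi(\{y\}) |,$$

which concludes the proof.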
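The extremal equality of Hölder's inequality used in the proof can be illustrated on a finite toy problem. The sketch below is an illustration under simplifying assumptions, not part of the original derivation: the law $\mu$ is supported on four predictions, all distributions are supported on six targets (a truncation of $\mathbb{N}$), and the helper names `lp_norm` and `pairing` are hypothetical. It evaluates the candidate maximizer $f^* = \operatorname{sign}(\delta) |\delta|^{q-1} / \|\delta\|_q^{q-1}$ and compares it with randomly drawn functions in the unit ball.

```python
import numpy as np

def lp_norm(f, mu, p):
    """Norm of f in L^p(mu x counting measure) on a finite grid."""
    return float((mu[:, None] * np.abs(f) ** p).sum() ** (1.0 / p))

def pairing(f, delta, mu):
    """Integral of f * delta with respect to mu x counting measure."""
    return float((mu[:, None] * f * delta).sum())

rng = np.random.default_rng(0)
p = 3.0
q = p / (p - 1.0)  # Hoelder conjugate of p

# Toy setting (assumption): four possible predictions, six targets.
mu = np.array([0.4, 0.3, 0.2, 0.1])          # law of the predictions
preds = rng.dirichlet(np.ones(6), size=4)    # predicted distributions
cond = rng.dirichlet(np.ones(6), size=4)     # conditional distributions
delta = cond - preds

delta_q = lp_norm(delta, mu, q)

# Candidate maximizer from the extremal equality of Hoelder's inequality.
f_star = np.sign(delta) * np.abs(delta) ** (q - 1.0) / delta_q ** (q - 1.0)
value_star = pairing(f_star, delta, mu)

# Random functions on the unit sphere of L^p never exceed ||delta||_q.
best_random = -np.inf
for _ in range(1000):
    f = rng.normal(size=delta.shape)
    f /= lp_norm(f, mu, p)
    best_random = max(best_random, pairing(f, delta, mu))
```

Here `value_star` coincides with `delta_q` up to floating point error, while `best_random` stays below it, in line with $\mathrm{CE}_{\mathcal{F}} = \|\delta\|_{q; \mu \otimes \nu}$.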



The source code of the experiments is available at https://github.com/devmotion/Calibration_ICLR2021. As we discuss in Section 3, the MMD is a metric if and only if the employed kernel is characteristic. As mentioned above, our framework rephrases and generalizes the construction used by Widmann et al. (2019). The matrix-valued kernels that they employ can be recovered by setting $k_{\mathcal{P}}$ to a Laplacian kernel on the probability simplex and $k_{\mathcal{Y}}(y, y') = \delta_{y,y'}$. For a general discussion about characteristic kernels and their relation to universal kernels we refer to the paper by Sriperumbudur et al. (2011). Temperature scaling can be defined and applied for general probabilistic predictive models, see Appendix F.



Figure 1: Illustration of a conditional distribution $\mathbb{P}(Y \mid X)$ with scalar feature and target. We consider a Gaussian predictive model $P$, obtained by ordinary least squares regression with 100 training data points (orange dots). Empirically the predicted quantiles on 50 validation data points appear close to being calibrated, although model $P$ is uncalibrated according to Definition 1. Using the framework in this paper, on the same validation data a statistical test allows us to reject the null hypothesis that model $P$ is calibrated at a significance level of $\alpha = 0.05$ ($p < 0.05$). See Appendix A.1 for details.

Figure 2: Empirical test errors for 500 data sets of n ∈ {4, 16, 64, 256, 1024} samples from models with targets of dimension d = 10. The dashed black line indicates the set significance level α = 0.05.

Figure 3: Mean squared error (MSE), average negative log-likelihood (NLL), SKCE k (SKCE (biased)), and p-value approximation (p-value) of ten Gaussian predictive models for the Friedman 1 regression problem versus the number of training iterations. Evaluations on the training data set (100 samples) are displayed in green and orange, and on the test data set (50 samples) in blue and purple. The green and blue line and their surrounding bands represent the mean and the range of the evaluations of the ten models. The orange and purple lines visualize the evaluations of their ensemble.

Figure 4: Data generating distribution P(Y |X) and predicted distribution P (Y |X) of the linear regression model. Training data is indicated by orange dots.

Figure 5: Cumulative probability versus quantile level for the linear regression model on the validation data (orange curve). The green curve indicates the theoretical ideal for a quantile-calibrated model.

Figure 6: Mean absolute error and variance of 500 calibration error estimates for data sets of n ∈ {4, 16, 64, 256, 1024} samples from the calibrated model of dimension d = 1.

Figure 7: Mean absolute error and variance of 500 calibration error estimates for data sets of n ∈ {4, 16, 64, 256, 1024} samples from the calibrated model of dimension d = 10.

Figure 8: Mean absolute error and variance of 500 calibration error estimates for data sets of n ∈ {4, 16, 64, 256, 1024} samples from the uncalibrated model of dimension d = 1.

Figure 9: Mean absolute error and variance of 500 calibration error estimates for data sets of n ∈ {4, 16, 64, 256, 1024} samples from the uncalibrated model of dimension d = 10.

Figure 10: Empirical test errors for 500 data sets of n ∈ {4, 16, 64, 256, 1024} samples from models with targets of dimension d = 1. The dashed black line indicates the set significance level α = 0.05.

Figure 12: Estimates of different accuracy and calibration measures of ten Gaussian predictive models for the Friedman 1 regression problem versus the number of training iterations. Evaluations on the training data set (100 samples) are displayed in green and orange, and on the test data set (50 samples) in blue and purple. The green and blue line and their surrounding bands represent the mean and the range of the evaluations of the ten models. The orange and purple lines visualize the evaluations of their ensemble.

Let $d(\cdot, \cdot)$ be a distance measure of probability distributions of target $Y$, and let $\mu$ be the law of $P_X$. Then we call $\mathrm{ECE}_d = \mathbb{E} \, d\big(\mathbb{P}(Y \mid P_X), P_X\big)$ and $\mathrm{MCE}_d = \mu\text{-}\operatorname{ess\,sup} d\big(\mathbb{P}(Y \mid P_X), P_X\big)$ the expected calibration error (ECE) and the maximum calibration error (MCE) of model $P$ with respect to measure $d$, respectively.
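As a small worked example of these two quantities, consider a toy model that only ever outputs finitely many distinct predictions, so that the expectation is a weighted sum and the essential supremum is a maximum over predictions with positive probability. This is a sketch under the assumption that the conditional distributions $\mathbb{P}(Y \mid P_X)$ are known, which is not the case in practice; the function names and all numbers are hypothetical, and the total variation distance is used for $d$.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between rows of p and q."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def ece_mce(weights, cond, preds, dist=total_variation):
    """ECE_d and MCE_d for a model with finitely many distinct predictions.

    weights[i] : probability that the model outputs preds[i] (law of P_X)
    cond[i]    : true distribution of Y given that preds[i] was predicted
    """
    d = dist(cond, preds)
    support = weights > 0            # the ess sup ignores null sets
    ece = float(np.sum(weights[support] * d[support]))
    mce = float(np.max(d[support]))
    return ece, mce

# Hypothetical binary classification model with three distinct outputs.
weights = np.array([0.5, 0.3, 0.2])
preds = np.array([[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]])
cond = np.array([[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]])

ece, mce = ece_mce(weights, cond, preds)
```

The first prediction is calibrated and contributes nothing, so the ECE is $0.3 \cdot 0.1 + 0.2 \cdot 0.2 = 0.07$, whereas the MCE is determined by the worst prediction alone and equals $0.2$.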

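$$h\big((p, y), (p', y')\big) := k\big((p, y), (p', y')\big) - \mathbb{E}_{Z \sim p} k\big((p, Z), (p', y')\big) - \mathbb{E}_{Z' \sim p'} k\big((p, y), (p', Z')\big) + \mathbb{E}_{Z \sim p, Z' \sim p'} k\big((p, Z), (p', Z')\big).$$

Lemma B.3. For all $i, j = 1, \ldots, n$, $\big| h\big((P_{X_i}, Y_i), (P_{X_j}, Y_j)\big) \big| < \infty$ almost surely.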

; Delon & Desolneux (2020), we can consider metrics of the form $\inf_{w \in \Pi(\pi, \pi')} \big( \sum_{i,j} w_{i,j} \, c^s(p_i, p'_j) \big)^{1/s}$, where $\Pi(\pi, \pi') = \{ w : \sum_i w_{i,j} = \pi'_j \wedge \sum_j w_{i,j} = \pi_i \wedge \forall i, j : w_{i,j} \geq 0 \}$ are the couplings of $\pi$ and $\pi'$, and $c(\cdot, \cdot)$ is a cost function between the components of the mixture model.

Theorem D.1. Let $\mathcal{P}$ be a family of finite mixture models that is identifiable in the sense of Definition D.1, and let $s \in [1, \infty)$.

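Let $k : (\mathcal{P} \times \mathcal{Y}) \times (\mathcal{P} \times \mathcal{Y}) \to \mathbb{R}$. We define the matrix-valued function $K : \mathcal{P} \times \mathcal{P} \to \mathbb{R}^{d \times d}$ by $[K(p, p')]_{y, y'} = k\big((p, y), (p', y')\big)$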

As a concrete example, Widmann et al. (2019) used a matrix-valued kernel of the form $(p, p') \mapsto \exp(-\gamma \|p - p'\|) I_d$ in their experiments. In our formulation this corresponds to the real-valued tensor product kernel $\big((p, y), (p', y')\big) \mapsto \exp(-\gamma \|p - p'\|) \, \delta_{y, y'}$.
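This correspondence can be verified numerically for categorical predictions, where the expectations in the definition of $h$ reduce to finite sums. The sketch below is only an illustration: the function names are hypothetical, classes are indexed from 0, and the Euclidean norm is assumed for $\|p - p'\|$. It compares $h$ computed from the real-valued tensor product kernel with the quadratic form $(e_y - v_p)^{\mathsf{T}} K(p, p') (e_{y'} - v_{p'})$ of the matrix-valued kernel.

```python
import numpy as np

def k_real(p, y, p2, y2, gamma=1.0):
    """Real-valued tensor product kernel on (prediction, target) pairs."""
    return np.exp(-gamma * np.linalg.norm(p - p2)) * float(y == y2)

def K_matrix(p, p2, gamma=1.0):
    """Matrix-valued kernel exp(-gamma * ||p - p2||) * I_d."""
    return np.exp(-gamma * np.linalg.norm(p - p2)) * np.eye(len(p))

def h_real(p, y, p2, y2, gamma=1.0):
    """h((p, y), (p2, y2)) with expectations written as sums over classes."""
    d = len(p)
    t1 = k_real(p, y, p2, y2, gamma)
    t2 = sum(p[z] * k_real(p, z, p2, y2, gamma) for z in range(d))
    t3 = sum(p2[z2] * k_real(p, y, p2, z2, gamma) for z2 in range(d))
    t4 = sum(p[z] * p2[z2] * k_real(p, z, p2, z2, gamma)
             for z in range(d) for z2 in range(d))
    return t1 - t2 - t3 + t4

def h_matrix(p, y, p2, y2, gamma=1.0):
    """(e_y - v_p)^T K(p, p2) (e_y2 - v_p2) for the matrix-valued kernel."""
    e = np.eye(len(p))
    return (e[y] - p) @ K_matrix(p, p2, gamma) @ (e[y2] - p2)

rng = np.random.default_rng(1)
d = 4
p, p2 = rng.dirichlet(np.ones(d)), rng.dirichlet(np.ones(d))
max_gap = max(abs(h_real(p, y, p2, y2) - h_matrix(p, y, p2, y2))
              for y in range(d) for y2 in range(d))
```

For this kernel both expressions reduce to $\exp(-\gamma \|p - p'\|) \big( \delta_{y,y'} - p'(\{y\}) - p(\{y'\}) + \langle v_p, v_{p'} \rangle \big)$, so `max_gap` is zero up to rounding.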

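$\mathrm{CE}_{\mathcal{F}} = \mu\text{-}\operatorname{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} | \mathbb{P}(Y = y \mid P_X = \xi) - \xi(\{y\}) |$. Moreover, if $\mathcal{F} = \{ f : \Delta^\infty \times \mathbb{N} \to \mathbb{R} \mid \mu\text{-}\operatorname{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} |f(\xi, y)| \leq 1 \}$, then $\mathrm{CE}_{\mathcal{F}} = \mathbb{E}_{P_X} \| \mathbb{P}(Y \mid P_X) - P_X \|_1$.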

Define the function $\delta : \Delta^\infty \times \mathbb{N} \to \mathbb{R}$ $(\mu \otimes \nu)$-almost surely by $\delta(\xi, y) := \mathbb{P}(Y = y \mid P_X = \xi) - \xi(\{y\})$.

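$\mathrm{CE}_{\mathcal{F}} = \mu\text{-}\operatorname{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} |\delta(\xi, y)| = \mu\text{-}\operatorname{ess\,sup}_{\xi \in \Delta^\infty} \sup_{y \in \mathbb{N}} | \mathbb{P}(Y = y \mid P_X = \xi) - \xi(\{y\}) |$, which concludes the proof.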

$T^2$-statistic for $\mathrm{UCME}_{k,10}$ with 10 test locations (see Appendix C). We compute the empirical test errors (percentage of false rejections of the null hypothesis $H_0$ that model $P$ is calibrated if $P$ is calibrated, and percentage of false non-rejections of $H_0$ if $P$ is not calibrated) at a fixed significance level $\alpha = 0.05$ and the minimal computation time for the calibrated and the uncalibrated model with dimensions $d = 1$ and $d = 10$ for 500 independently drawn data sets of $n \in \{4, 16, 64, 256, 1024\}$ samples of $(P_X, Y)$. The 10 test predictions of the CME test are of the form $\mathcal{N}(m, 0.1^2 I_d)$, where $m$ is distributed uniformly at random in the $d$-dimensional unit hypercube $[0, 1]^d$, and the corresponding 10 test targets are i.i.d. according to $\mathcal{N}(0, 0.1^2 I_d)$.

Figures 10 and 11 show that all tests adhere to the set significance level asymptotically as the number of samples increases. The convergence of the CME test with 10 test locations is found to be much slower than the convergence of all other tests. The tests based on the tractable asymptotic distribution of $n/B \, \mathrm{SKCE}_{k,B}$ for fixed block size $B$ are orders of magnitude faster than the test based on the intractable asymptotic distribution of $n \, \mathrm{SKCE}_{k,n}$, approximated with 1000 bootstrap samples. We see that the efficiency gain comes at the cost of decreased test power for smaller numbers of samples, explained by the increasing variance of $\mathrm{SKCE}_{k,B}$ for decreasing block sizes $B$. However, in our examples the test based on $\mathrm{SKCE}_{k,\sqrt{n}}$ still achieves good test power for a reasonably large number of samples (> 30).

where $W_2$ is the 2-Wasserstein distance and $m_p, m_{p'}$ and $\sigma_p, \sigma_{p'}$ denote the mean and the standard deviation of the normal distributions $p$ and $p'$ (see Appendix D.1). The p-value estimate (p-value) is computed by estimating the quantile of the asymptotic distribution of $n \, \mathrm{SKCE}_{k,n}$ with 1000 bootstrap samples (see Remark B.2). The estimates of the mean squared error and the average negative log-likelihood are denoted by MSE and NLL. All estimators indicate consistently that the trained models suffer from overfitting after around 1000 training iterations.

is an unbiased estimator of $\mathrm{SKCE}_k$.

Proof. From Lemma B.2 we know that $\mathrm{SKCE}_k < \infty$, and Lemma B.3 implies that $\mathrm{SKCE}_{k,B} < \infty$ almost surely.

ACKNOWLEDGMENTS

We thank the reviewers for all the constructive feedback on our paper. This research is financially supported by the Swedish Research Council via the projects Learning of Large-Scale Probabilistic Dynamical Models (contract number: 2016-04278), Counterfactual Prediction Methods for Heterogeneous Populations (contract number: 2018-05040), and Handling Uncertainty in Machine Learning Systems (contract number: 2020-04122), by the Swedish Foundation for Strategic Research via the project Probabilistic Modeling and Inference for Machine Learning (contract number: ICA16-0015), by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and by ELLIIT.


$r_i r_j \mathrm{MW}_s^s\big(p^{(i)}, p^{(j)}\big) = \sum_{i,j}$ which shows that $\mathrm{MW}_s^s$ is a negative definite kernel (Berg et al., 1984, Definition 3.1.1). Since $0 < 1/s < \infty$, $\mathrm{MW}_s$ is a negative definite kernel as well (Berg et al., 1984, Corollary 3.2.10), which implies that metric $\mathrm{MW}_s$ is Hilbertian (Berg et al., 1984, Proposition 3.3.2).

Hence we can lift a Hilbertian metric for the mixture components to a Hilbertian metric for the mixture models. For instance, if the mixture components are normal distributions, then the 2-Wasserstein distance with respect to the Euclidean distance is a Hilbertian metric for the mixture components. When we lift it to the space $\mathcal{P}$ of Gaussian mixture models we obtain the $\mathrm{MW}_2$ metric proposed by Chen et al. (2019; 2020); Delon & Desolneux (2020). As shown by Delon & Desolneux (2020), the discrete formulation of $\mathrm{MW}_2$ obtained by our construction is equivalent to the definition for two Gaussian mixtures $p, p'$ on $\mathbb{R}^n$, where $\Pi(p, p')$ are the couplings of $p$ and $p'$ (not of the histograms!) and $\mathrm{GMM}_{2n}(\infty) = \cup_{k \geq 0} \mathrm{GMM}_{2n}(k)$ is the set of all finite Gaussian mixture distributions on $\mathbb{R}^{2n}$. The construction of the discrete formulation as a solution to a constrained optimization problem similar to Eq. (D.7) can be generalized to mixtures of t-distributions. However, it is not possible for arbitrary mixture models such as mixtures of generalized Gaussian distributions, even though they are elliptically contoured distributions (Deledalle et al., 2018; Delon & Desolneux, 2020).

The optimal coupling of the discrete histograms can be computed efficiently using techniques from linear programming and optimal transport theory such as the network simplex algorithm and the Sinkhorn algorithm. As discussed above, if metric $d_{\mathcal{P}}$ is of the form in Eq. (D.2), functions of the form $k_{\mathcal{P}}(p, p') = \exp\big(-\lambda d_{\mathcal{P}}^{\nu}(p, p')\big)$ are valid kernels on $\mathcal{P}$ for all $\lambda > 0$ and $\nu \in (0, 2]$.

Thus taken together, if $k_{\mathcal{Y}}$ is a characteristic kernel on the target space $\mathcal{Y}$ and $d(\cdot, \cdot)$ is a Hilbertian metric on the space of mixture components, then for all $s \in [1, \infty)$, $\lambda > 0$, and $\nu \in (0, 2]$
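The discrete optimization problem in Eq. (D.2) can be sketched as a small linear program. The following example is an illustration under stated assumptions rather than the implementation used in the experiments: it treats mixtures of univariate normal distributions, uses the closed form $W_2^2 = (m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$ for the distance between components, and solves for the optimal coupling with `scipy.optimize.linprog` instead of a network simplex or Sinkhorn solver. The function names `w2_normal`, `mixture_wasserstein`, and `kernel_P` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def w2_normal(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def mixture_wasserstein(pi, comps, pi2, comps2, s=2.0):
    """MW_s between two mixtures of univariate normals via an LP.

    pi, pi2       : histograms (mixture weights)
    comps, comps2 : lists of (mean, standard deviation) pairs
    """
    n, m = len(pi), len(pi2)
    cost = np.array([[w2_normal(*a, *b) ** s for b in comps2] for a in comps])
    # Couplings w (flattened row-major): rows sum to pi, columns to pi2.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([pi, pi2])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return max(res.fun, 0.0) ** (1.0 / s), res.x.reshape(n, m)

def kernel_P(dist, lam=1.0, nu=1.0):
    """Kernel exp(-lam * dist**nu) on predictions, for nu in (0, 2]."""
    return np.exp(-lam * dist ** nu)

pi = np.array([0.5, 0.5])
comps = [(0.0, 1.0), (3.0, 1.0)]
pi2 = np.array([1.0])
comps2 = [(0.0, 1.0)]

mw, coupling = mixture_wasserstein(pi, comps, pi2, comps2)
mw_rev, _ = mixture_wasserstein(pi2, comps2, pi, comps)
mw_self, _ = mixture_wasserstein(pi, comps, pi, comps)
k_value = kernel_P(mw, lam=0.5, nu=1.0)
```

In this example the second mixture has a single component, so the coupling is forced and $\mathrm{MW}_2^2 = 0.5 \cdot 0 + 0.5 \cdot 9 = 4.5$; the distance of a mixture to itself is zero and the distance is symmetric, as required for a metric.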

