QUANTILE REGULARIZATION: TOWARDS IMPLICIT CALIBRATION OF REGRESSION MODELS

Abstract

Deep learning models are often poorly calibrated, i.e., they may produce overconfident predictions that are wrong, implying that their uncertainty estimates are unreliable. While a number of approaches have been proposed recently to calibrate classification models, relatively little work exists on calibrating regression models. Isotonic Regression has recently been advocated for regression calibration. We provide a detailed formal analysis of the side effects of Isotonic Regression when used for regression calibration. To address these, we investigate the idea of quantile calibration (Kuleshov et al., 2018), recast it as entropy estimation, and leverage the new formulation to construct a novel quantile regularizer, which can be used as a black box to calibrate any probabilistic regression model. Unlike most existing approaches for calibrating regression models, which are based on post hoc processing of the model's output and require an additional dataset, our method is trainable in an end-to-end fashion without requiring an additional dataset. We provide empirical results demonstrating that our approach improves calibration for regression models trained on diverse architectures that provide uncertainty estimates, such as Dropout VI and Deep Ensembles.

1. INTRODUCTION

For supervised machine learning, the notion of calibration of a learned predictive model measures how well the model's confidence in its predictions matches the correctness of those predictions. For example, a binary classifier is considered perfectly calibrated if, among all predictions with probability score 0.9, 90% of the predictions are correct (Guo et al., 2017). Likewise, consider a probabilistic regression model that produces credible intervals for the predicted outputs. In this setting, the model is considered perfectly calibrated if the 90% confidence interval contains 90% of the true test outputs (Kuleshov et al., 2018). Unfortunately, modern deep neural networks are known to be poorly calibrated (Guo et al., 2017), raising questions about their reliability. The notion of calibration for classification problems was originally considered in the meteorology literature (Brier, 1950; Murphy, 1972; Gneiting & Raftery, 2007) and saw one of its first prominent uses in the machine learning literature in (Platt et al., 1999), in the context of Support Vector Machines (SVMs), in order to obtain probabilistic predictions from SVMs, which are inherently non-probabilistic models. Recently, there has been renewed interest in calibration, especially for classification models, after it was shown (Guo et al., 2017) that modern deep neural networks for classification are often poorly calibrated. The popular notions of calibration for classification include confidence calibration, multiclass calibration, and classwise calibration (Kumar et al., 2019; Vaicenavicius et al., 2019; Kull et al., 2019). Most calibration methods (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002; Guo et al., 2017; Kull et al., 2017; 2019) for classification models are post hoc: they learn a calibration mapping $R : [0, 1] \to [0, 1]$ using an additional dataset to recalibrate an already trained model.
There has been recent work showing that some of these popular post hoc methods are either themselves miscalibrated or sample-inefficient (Kumar et al., 2019), and that they do not actually help the model output well-calibrated probabilities. An alternative to post hoc processing is to ensure that the model outputs well-calibrated probabilities once training finishes. We refer to these as implicit calibration methods. Notably, such an approach does not require an additional dataset to learn the calibration mapping. While almost all post hoc calibration methods for classification models can be seen in a unified manner as density estimation methods (see Section 2.1), existing implicit calibration methods for classification models have been designed with various, often distinct, considerations. Several heuristics, like Mixup (Zhang et al., 2017; Thulasidasan et al., 2019) and Label Smoothing (Szegedy et al., 2016; Müller et al., 2019), that were part of high-performance deep networks for classification were later shown empirically to achieve calibration. (Maddox et al., 2019) show that their optimization method intrinsically improves calibration. (Pereyra et al., 2017) found that penalizing high-confidence predictions acts as a regularizer. A more principled way of achieving implicit calibration is by minimizing a loss function that is tailored for calibration (Kumar et al., 2018). This is somewhat similar in spirit to our proposed approach, which aims to do the same for regression models. Among the early approaches for calibrating regression models, (Gneiting et al., 2007) were the first to propose a framework for calibrating regression models. However, they do not provide any procedure to correct a miscalibrated model. Recently, (Kuleshov et al., 2018) introduced the notion of Quantile Calibration, which intuitively says that the $p$ confidence interval predicted by the model should contain the target variable with probability $p$.
They use a post hoc calibration method based on Isotonic Regression (Fawcett & Niculescu-Mizil, 2007), which is a well-known calibration technique for classification models. The difference between Isotonic Calibration in classification and Isotonic Calibration in regression is in terms of (i) the dataset on which the calibration mapping is learnt; and (ii) the function with which the learnt calibration mapping is pre-composed. In the former case, it is pre-composed with a probability mass function (PMF), whereas in the latter, it is pre-composed with a conditional cumulative distribution function (CDF). Both these differences have side effects; in particular, (i) the recalibration dataset by construction already satisfies the monotonicity constraint, so there is a risk of overfitting with smaller calibration datasets; and (ii) composing the CDF with a piecewise linear function can make the resultant CDF discontinuous and the corresponding PDF non-differentiable (see Sec. 3 for a detailed discussion of the side effects of the Isotonic Calibration approach). In another recent work, (Song et al., 2019) proposed a much stronger notion of calibration called Distributional Calibration, which guarantees that, among all instances whose predicted probability density function (PDF) of the response variable has mean $\mu$ and standard deviation $\sigma$, the marginal distribution of the target variable should have mean $\mu$ and standard deviation $\sigma$. They too propose a post hoc recalibration method, based on Gaussian processes, which can be computationally expensive. Among other work, (Keren et al., 2018) consider a different setting where neural networks for classification are used for regression problems, and showed that temperature scaling (Hinton et al., 2015; Guo et al., 2017) and their proposed method based on empirical prediction intervals improve calibration for regression problems as well. Again, these are post hoc methods. Our contributions are summarized below: 1.
We analyze in detail the side effects of Isotonic Calibration for regression models. We show how Isotonic Calibration results in truncation of the support, which leads to assigning zero likelihood at test time. We also discuss how Isotonic Calibration produces non-smooth PDFs, and its tendency to produce miscalibration when using small calibration datasets. 2. At test time, after composing the predicted CDF with the learned isotonic mapping, the mean prediction (point estimate) also changes. Kuleshov et al. (2018) do not acknowledge this change in the mean estimate. While Song et al. (2019) acknowledge the issue, they use a trapezoidal approximation to remedy it. In contrast, we derive an analytical expression for the updated point estimate after Isotonic Calibration. We also provide an alternative expression for the updated point estimate that reduces the time complexity from O(m) to O(1), where m is the calibration dataset size. 3. To mitigate these shortcomings of Isotonic Calibration, we propose a simple, yet novel and general-purpose, trainable loss function for quantile calibration, in which the smoothness of the PDF/CDF is not sacrificed for well-calibrated probabilities. Our approach also eliminates the need for an additional calibration dataset. 4. We conduct extensive experiments using the proposed loss function (Quantile Regularization) and show empirically that it improves calibration on a wide range of architectures that produce uncertainty estimates.

2. BACKGROUND AND DEFINITIONS

Before we proceed with definitions, we state the notation followed in the rest of the paper. $\mathcal{X}, \mathcal{Y}$ denote the input and output spaces, respectively. $X, Y$ denote random variables modelling inputs and outputs, and $P$ denotes a probability measure. $f, g$ are reserved for probability density functions (PDFs) and $F, G$ for cumulative distribution functions (CDFs). We use $c$ to denote the evaluation of a CDF at some particular value, i.e., $c = F(\cdot)$, and $p$ for the evaluation of a PDF, i.e., $p = f(\cdot)$. Given a sequence of elements $a_1, a_2, \ldots, a_n$, we write $a_{(1)}, a_{(2)}, \ldots, a_{(n)}$ for the permutation such that $a_{(i)} \le a_{(i+1)}$. Also, $m$ denotes the calibration dataset size, and $(X, y)$ denotes the training data. Given random variables $X, Y$, we use $KL(X \| Y)$ to denote the KL divergence between the corresponding distributions.

A probabilistic regression model can be seen as a conditional PDF or conditional CDF. In the rest of the paper, we express it as a conditional CDF $M : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$, so $M(x)$ denotes the model's predicted CDF for $x \in \mathcal{X}$, denoted $F_x$. In practice, this is achieved by having the model output the parameters that parametrize the CDF, e.g., $(\mu, \sigma)$ for a Gaussian, $\lambda$ for an Exponential, etc. In the rest of the paper, we consider a Gaussian likelihood unless stated otherwise, because it is one of the most prevalent cases. Kuleshov et al. (2018) proposed the following notion of calibration for regression models, called quantile calibration. An appealing aspect of this definition is that we get reliable confidence intervals.

Definition 1 (Quantile Calibration) Given a regression model $M : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$ and $X, Y$ jointly distributed as $P$, the model $M$ is said to be quantile calibrated iff
$$P\big([M(X)](Y) \le p\big) = p \quad \forall p \in [0, 1]$$
In words, $[M(X)](Y)$ is the cumulative density that the model predicts for random input-response pairs drawn from the joint distribution of $(X, Y)$. (Vaicenavicius et al., 2019) call the analogous mapping in the context of classification the canonical calibration mapping.
Theorem 1 For any model $M : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$ and the canonical calibration mapping $R(p) = P\big([M(X)](Y) \le p\big)$, $R \circ M$ is quantile calibrated.

Note that learning the mapping $R$ reduces to density estimation of $P\big([M(X)](Y) \le p\big)$, which is a hard problem in itself. With this insight, and using the fact that the mapping is monotonically increasing, (Kuleshov et al., 2018) fit $R$ using Isotonic Regression. (Mair et al., 2009) provides a nice survey of algorithms for Isotonic Regression; in particular, Scikit-learn (Pedregosa et al., 2011) uses the Pool Adjacent Violators Algorithm (PAVA), of complexity $O(m)$. Given the calibration dataset $\{(x_i, y_i)\}_{i=1}^{m}$, Isotonic Calibration first builds the recalibration dataset as
$$D = \bigg\{ \Big( [M(x_i)](y_i),\ \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}\big[ [M(x_j)](y_j) \le [M(x_i)](y_i) \big] \Big) \bigg\}_{i=1}^{m}$$
Importantly, this implies that the mean prediction changes after calibration as well. (Kuleshov et al., 2018) do not take this aspect into consideration, whereas (Song et al., 2019) use a trapezoidal approximation to find the new mean, which has time complexity $O(m)$. In contrast, we derive an analytical expression for the new mean that reduces the test-time complexity from $O(m)$ to $O(1)$ (see Eq. 4), assuming a Gaussian likelihood model.
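To make the post hoc procedure concrete, the following is a minimal NumPy sketch of building the recalibration dataset and the resulting mapping $R$, assuming a Gaussian likelihood. The function names (`gaussian_cdf`, `build_recalibration_map`) are ours, not from the original method; and, anticipating the observation in Sec. 3, plain linear interpolation on $D$ is used, since isotonic regression leaves the already-monotone targets unchanged.

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(y, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at y."""
    return 0.5 * (1.0 + erf((y - mu) / (sigma * sqrt(2.0))))

def build_recalibration_map(mu, sigma, y):
    """Build D = {(c_(i), i/m)} from calibration predictions and return
    the recalibration mapping R: [0, 1] -> [0, 1]."""
    c = np.sort([gaussian_cdf(yi, mi, si) for yi, mi, si in zip(y, mu, sigma)])
    m = len(c)
    targets = np.arange(1, m + 1) / m

    def R(p):
        # Isotonic regression on D reduces to interpolation here, since
        # the targets (1/m, 2/m, ..., 1) are already monotone.
        return np.interp(p, c, targets, left=0.0, right=1.0)

    return R
```

A model is then recalibrated post hoc by composing its predicted CDF with `R`, i.e., reporting `R(gaussian_cdf(y, mu, sigma))` instead of the raw cumulative density.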

3. ANALYSIS OF ISOTONIC CALIBRATION

Let $\{(x_i, y_i)\}_{i=1}^{m}$ be the calibration dataset, and let $\{(\mu_i, \sigma_i^2)\}_{i=1}^{m}$ be the means and variances predicted by the learned model. Suppose we have obtained the CDF values $(c_1, c_2, \ldots, c_m)$ at the actual outputs $(y_1, y_2, \ldots, y_m)$. Then the recalibration dataset, after sorting on the first coordinates, is $D = \{(c_{(1)}, \frac{1}{m}), (c_{(2)}, \frac{2}{m}), \ldots, (c_{(m)}, 1)\}$, where $C = (c_{(1)}, c_{(2)}, \ldots, c_{(m)})$ is obtained by sorting $(c_1, c_2, \ldots, c_m)$ in ascending order. Isotonic Calibration then fits isotonic regression on the dataset $D$. Our entire analysis is based on the following crucial observation: since $(\frac{1}{m}, \frac{2}{m}, \ldots, 1)$ are already in increasing order, isotonic regression does not modify the values. In the notation of Sec. 2.2, where $(a_i, b_i)$ denote the recalibration pairs and $e_i$ the fitted isotonic values, if $(a_i, b_i) = (c_{(i)}, \frac{i}{m})$ then $e_i = \frac{i}{m}$, i.e., $e_i$ is the same as $b_i$. So we do not even need Isotonic Regression: linear interpolation on $D$ yields the same result. We first give conditions under which smoothness is lost, and later derive the correction one has to apply after Isotonic Calibration. All proofs are provided in the supplementary material.

Claim 1 Let $Y$ be a random variable with CDF $F$, and let $G = R \circ F$ be its CDF after composing with the mapping $R$ obtained from isotonic regression characterized by $C = \{c_{(1)}, c_{(2)}, \ldots, c_{(m)}\}$. If there exist indices $i-1, i, i+1 \in \{0, 1, \ldots, m\}$ such that $c_{(i)} - c_{(i-1)} \ne c_{(i+1)} - c_{(i)}$, then the CDF $G$ is not differentiable, and the corresponding probability density function $g$ is not continuous, at $F^{-1}(c_{(i)})$.

Because of this, the PDF of the transformed random variable becomes discontinuous and spiky (see Fig. 2d). The updated likelihood depends inversely on $m(c_{(i+1)} - c_{(i)})$ (see Eq. 10 for the full expression).
In many cases, this quantity becomes so small that the updated likelihood increases by a factor of $10^2$ to $10^5$ for a single point, thereby completely destroying what the average log-likelihood represents. To show this, we report the maximum likelihood attained over the entire test dataset after Isotonic Calibration in the UCI experiments (see Tables 4 and 5). We now derive the analytical expression for the updated mean after Isotonic Calibration, assuming a Gaussian likelihood.

Claim 2 Let $Y_{iso}$ be the transformed random variable after applying the isotonic mapping $R$ to the random variable $Y$. Then the expectation of $Y_{iso}$ is
$$E[Y_{iso}] = \mu - \frac{\sigma^2}{m} \sum_{i=0}^{m-1} \frac{f(F^{-1}(c_{(i+1)})) - f(F^{-1}(c_{(i)}))}{c_{(i+1)} - c_{(i)}}$$
The summation involves both the recalibration dataset and the test-time prediction. We now use properties of quantile functions (Lemma 1) to decouple this dependency, so that the summation depends only on the recalibration dataset.

Claim 3 Let $c_{(i)} = F_{\mu_{(i)}, \sigma_{(i)}}(y_{(i)})$ and $p_{(i)} = f_{\mu_{(i)}, \sigma_{(i)}}(y_{(i)})$. Then
$$E[Y_{iso}] = \mu - \sigma \underbrace{\frac{1}{m} \sum_{i=0}^{m-1} \frac{\sigma_{(i+1)} p_{(i+1)} - \sigma_{(i)} p_{(i)}}{c_{(i+1)} - c_{(i)}}}_{\delta}$$
Now $\delta$ can be computed once, so at test time, given the model prediction $(\mu, \sigma)$, the updated mean after Isotonic Calibration is $\mu - \delta\sigma$. With this, the time required is reduced from $O(m)$ to $O(1)$. Also, the construction of the recalibration dataset suggested in (Kuleshov et al., 2018) can result in truncation of the support of the updated random variable (see Fig. 2d, and Sec. B.1 for more on this). One way to remedy this is to add $(x, \infty)$ to the calibration dataset for any $x \in \mathbb{R}^n$, so that the recalibration dataset becomes $\{(c_{(1)}, \frac{1}{m+1}), \ldots, (c_{(m)}, \frac{m}{m+1}), (1, 1)\}$. To illustrate all of this, we use Bayesian linear regression with 256 training examples generated from $y = 3x + \epsilon$, where $\epsilon \sim N(0, 1)$ and $x \in [-4, 4]$ is randomly sampled, and a calibration dataset of size 32.
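As a concrete sketch of the $O(1)$ correction (our own naming, assuming a Gaussian likelihood): $\delta$ is computed once from the calibration set, after which the corrected mean for any test prediction $(\mu, \sigma)$ is just $\mu - \delta\sigma$.

```python
import numpy as np

def delta_correction(c, p, sigma):
    """delta from Claim 3. c: CDF values at the calibration targets,
    p: predicted densities at those targets, sigma: predicted stds.
    Computed once in O(m); the corrected test-time mean is then O(1)."""
    order = np.argsort(c)
    sp = (np.asarray(sigma) * np.asarray(p))[order]
    # conventions c_(0) = 0 and sigma_(0) p_(0) = 0
    c_ext = np.concatenate(([0.0], np.asarray(c)[order]))
    sp_ext = np.concatenate(([0.0], sp))
    m = len(c)
    return float(np.sum(np.diff(sp_ext) / np.diff(c_ext)) / m)

def corrected_mean(mu, sigma, delta):
    """O(1) updated point estimate after isotonic calibration."""
    return mu - delta * sigma
```

For example, with $m = 2$, $c = (0.25, 0.75)$, $p = (0.3, 0.3)$, and unit predicted stds, the two summands are $0.3/0.25 = 1.2$ and $0$, giving $\delta = 0.6$.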

4. QUANTILE REGULARIZATION

In quantile calibration, we want $P\big([M(X)](Y) \le p\big) = p$ for all $p \in [0, 1]$. Our method is based on the idea that both the left- and right-hand sides can be viewed as CDFs. Let $R(p) = P\big([M(X)](Y) \le p\big)$ and $S(p) = p$. Here $R$ can be seen as the CDF of $[M(X)](Y)$, while $S$ can be seen as the CDF of Uniform[0, 1]. Quantile calibration mandates that these two CDFs be equal. So, for a perfectly quantile calibrated model $M$, $[M(X)](Y)$ is uniformly distributed. We seek to penalize the model when this random variable deviates from the uniform distribution, which gives us a calibration loss that can be used as a regularizer with any regression loss, achieving the highly desirable calibration during training itself. We name this procedure Quantile Regularization, hereafter denoted QR. We use the KL divergence as the distance measure. Note that the KL divergence between any distribution on [0, 1] and the uniform distribution is the negative of the differential entropy, which provides a very intuitive interpretation of Quantile Regularization: essentially, to improve calibration while training, QR maximizes the differential entropy of $[M(X)](Y)$, i.e., of the predicted cumulative density of the target value. We formalize the statement for completeness below.

Claim 4 Let $M$ be any regression model. Then $M$ is perfectly quantile calibrated iff $KL\big([M(X)](Y) \,\|\, U\big) = 0$, where $U$ is Uniform[0, 1].

[Figure 2: (b) the calibration mapping learnt using Isotonic Calibration with 32 samples; (c) the predicted PDF at test time before Isotonic Calibration, along with the truncation and mean shift that occur once the mapping in (b) is applied, showing that such truncation can affect the support; (d) the resulting PDF after Isotonic Calibration.]
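The identity behind QR, $KL(q \,\|\, \text{Uniform}[0,1]) = -H(q)$, can be checked numerically. The sketch below uses the density $q(p) = 2p$ on $[0, 1]$ (our own illustrative choice, not from the paper), for which the closed form is $KL = \log 2 - \frac{1}{2}$.

```python
import numpy as np

# Density q(p) = 2p on [0, 1]; closed form: KL(q || U[0,1]) = log 2 - 1/2.
p = np.linspace(1e-6, 1.0, 200001)
q = 2.0 * p
dp = p[1] - p[0]
integrand = q * np.log(q)           # q log(q / 1), since the uniform density is 1
kl = float(np.sum(integrand) * dp)  # numeric KL(q || Uniform[0,1])
entropy = -kl                       # differential entropy H(q) = -KL
analytic = float(np.log(2.0) - 0.5)
```

Since $KL \ge 0$ with equality only for the uniform distribution, maximizing the differential entropy of $[M(X)](Y)$ pushes it toward uniformity, i.e., toward quantile calibration.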

4.1. DIFFERENTIAL ENTROPY ESTIMATION

We need a differential entropy estimator to derive our calibration loss function. There is a rich literature on entropy estimation; a brief overview of non-parametric entropy estimation can be found in (Beirlant & Dudewicz, 1997). We use sample-spacing entropy estimation, originally proposed in (Vasicek, 1976). A $k$-spacing of a random variable is defined as the amount of probability mass between ordered samples that are $k$ positions apart (i.e., with $k - 1$ samples between them). Let $S$ be a one-dimensional random variable with CDF $F$, and assume we are given $n$ samples $\{s_i\}_{i=1}^{n} \sim S$ and $k$ such that $1 \le k \le n$. Sample-spacing entropy estimation is based on the following observation about $k$-spacings:
$$E\big[F(s_{(i+k)}) - F(s_{(i)})\big] = \frac{k}{n+1}$$
There are many formulations of sample-spacing entropy estimators, but we use the one in (Learned-Miller et al., 2003, Equation 8):
$$\hat{H}(S) = \frac{1}{n-k} \sum_{i=1}^{n-k} \log\Big( \frac{n+1}{k} \big( s_{(i+k)} - s_{(i)} \big) \Big)$$
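A minimal NumPy sketch of this estimator (the function name is ours); on samples from Uniform[0, 1] the estimate should be close to the true differential entropy 0, and on Uniform[0, 2] close to $\log 2$.

```python
import numpy as np

def spacing_entropy(samples, k=None):
    """Sample-spacing differential entropy estimate:
    (1/(n-k)) * sum_i log((n+1)/k * (s_(i+k) - s_(i)))."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = len(s)
    if k is None:
        k = max(1, int(np.sqrt(n)))   # the k = sqrt(n) choice used in the paper
    gaps = s[k:] - s[:-k]             # the k-spacings s_(i+k) - s_(i)
    return float(np.mean(np.log((n + 1) / k * gaps)))
```

The estimator only needs the order statistics of the samples, which is exactly why a differentiable sorting relaxation is required once it is placed inside the training loop (Sec. 4.2).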

4.2. CALIBRATION LOSS FUNCTION

In our case, the random variable is $[M(X)](Y)$. Given training data with input-output pairs $(x_i, y_i)$, we obtain the samples $[M(x_i)](y_i)$, compute the expression $\hat{H}(S)$, and maximize it. Note that we want to make this part of the training loop to achieve implicit calibration. To do so, we need ordered samples to compute the entropy in Eq. 6; however, sorting is inherently not a differentiable operation. We therefore use NeuralSort (Grover et al., 2019), a differentiable relaxation of the sorting operator. The resulting procedure is summarized in Algorithm 1.

Algorithm 1 Calibration Loss
1: function CL($y, \mu, \sigma$)
2:   for $i \leftarrow 1$ to $n$ do
3:     $\Phi_i \leftarrow \Phi(\mu_i, \sigma_i)$      ($\Phi$: CDF)
4:     $c_i \leftarrow \Phi_i[y_i]$
5:   end for
6:   $s \leftarrow$ DIFFSORT($c$)
7:   $k \leftarrow \sqrt{n}$
8:   $e \leftarrow \frac{1}{n-k} \sum_{i=1}^{n-k} \log\big( \frac{n+1}{k} (s_{i+k} - s_i) \big)$
9:   return $e$
10: end function

Assume that $(X, y)$ is the training data. Let $(\mu_w, \sigma_w) = \mathrm{MODEL}_w(X)$, where $w$ denotes the parameters of the model, $\ell(y, \mu_w, \sigma_w)$ denotes the architecture-specific loss, and CL is the calibration loss computed by Algorithm 1. The overall loss function can be written as follows, where $L$ is the hyperparameter that controls the effect of QR:
$$\mathcal{L}(X, y, \mu_w, \sigma_w) = \ell(y, \mu_w, \sigma_w) - L \times \mathrm{CL}(y, \mu_w, \sigma_w)$$

4.3. DEGENERATE BUT PERFECTLY QUANTILE CALIBRATED MODEL

Every notion of calibration admits models that are perfectly calibrated according to that specific notion yet far from the ideal model. We show below that quantile calibration is no different. Specifically, we give conditions under which a model is quantile calibrated, and then use them to build a model that is degenerate, in the sense that it ignores the input when predicting the output, yet is perfectly quantile calibrated.

Claim 5 Let $f$ be the marginal density of $Y$, let $F_x = M(x)$ be the model's predicted cumulative distribution for $x \in \mathcal{X}$, and let $f_x$ be the corresponding predicted probability density. If $f(F_x^{-1}(p)) = f_x(F_x^{-1}(p))$ for all $x$ and all $p$, then $M$ is quantile calibrated.

Now if we set $f_x = f$ for all $x$, the above condition easily holds.
So, a model that outputs the marginal distribution of $Y$ for every input $x \in \mathcal{X}$ is perfectly quantile calibrated. Note that this observation is general, in that it does not require a Gaussian likelihood to be true. As a simple example, consider $f(x, y) = N(y | 5x, 1)\, N(x | 0, 4^2)$. Then the marginal distribution of $y$ is
$$f(y) = \int_{-\infty}^{\infty} N(y | 5x, 1)\, N(x | 0, 4^2)\, dx = N\big(y \,\big|\, 5 \cdot 0,\ 5^2 \cdot 4^2 + 1^2\big) = N(y | 0, 401)$$
so the model that predicts $N(y | 0, 401)$ regardless of the input is perfectly quantile calibrated, but the true model is $N(y | 5x, 1)$. A model can thus predict good confidence intervals despite being far from ideal. Therefore, we need the model to be both well-calibrated and sharp.
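This degenerate construction can be checked by simulation; the sketch below (our own, with variable names chosen for illustration) samples from the joint distribution above, has the input-independent model predict $N(0, 401)$ everywhere, and verifies that the predicted cumulative densities have roughly uniform coverage.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
x = rng.normal(0.0, 4.0, size=200_000)                  # x ~ N(0, 4^2)
y = 5.0 * x + rng.normal(0.0, 1.0, size=x.size)         # y | x ~ N(5x, 1)

# Degenerate model: ignore x, always predict the marginal N(0, 401).
z = y / sqrt(401.0)
c = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))      # predicted cumulative densities

coverage_50 = float(np.mean(c <= 0.5))   # fraction below the 50% level
coverage_90 = float(np.mean(c <= 0.9))   # fraction below the 90% level
```

Empirically, `coverage_50` and `coverage_90` land near 0.5 and 0.9, i.e., the confidence intervals are reliable even though the model has zero sharpness.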

5. EXPERIMENTS

We evaluate our approach on various regression datasets in terms of the calibration error as well as other standard metrics, such as root-mean-squared error (RMSE) and negative log-likelihood (NLL).

5.1. CALIBRATION METRIC

Choose confidence levels $\{p_i\}_{i=1}^{M}$ in $(0, 1]$ with $p_M = 1$. Given a test set $\{(x_n, y_n)\}_{n=1}^{N}$, with predictions $F_n = M(x_n)$, the $M$-bin estimator of the integral below gives the calibration metric used in (Kuleshov et al., 2018):
$$CE(F) = \int_0^1 \Big( P\big([M(X)](Y) \le p\big) - p \Big)^2 dp \approx \frac{1}{M} \sum_{i=1}^{M} \Big( \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\big[ F_j(y_j) \le p_i \big] - p_i \Big)^2$$
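The $M$-bin estimator above is direct to implement; a short NumPy sketch (the function name is ours) that takes the predicted cumulative densities $c_j = F_j(y_j)$ on the test set:

```python
import numpy as np

def calibration_error(c, num_bins=10):
    """M-bin quantile calibration error: mean squared gap between the
    empirical coverage at each level p_i and p_i itself."""
    c = np.asarray(c, dtype=float)
    levels = np.arange(1, num_bins + 1) / num_bins   # p_1, ..., p_M with p_M = 1
    emp = np.array([np.mean(c <= p) for p in levels])
    return float(np.mean((emp - levels) ** 2))
```

A perfectly calibrated model yields cumulative densities that are uniformly spread over $(0, 1]$, so its empirical coverage matches every level and the error is zero; a model whose cumulative densities all collapse to 0.5 gets a large error.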

5.2. UCI DATA EXPERIMENTS

We consider two architectures: Dropout VI (Gal, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017). The dataset sizes range from 308 to 515,345 and the input feature dimensions range from 6 to 91. Every dataset, except Year Prediction MSD, is divided into 5 splits; for Year Prediction MSD there is a single split, where we train on 463,715 points and test on 51,630 points. Each experiment is repeated 3 times and averages are reported, except for Year Prediction MSD. We use a 2-hidden-layer network with 128 units and ReLU activations, trained with the Adam optimizer with a learning rate of $10^{-2}$. Results are presented in Tables 1 and 2.

For depth estimation, we use the FC-DenseNet architecture (Jégou et al., 2017), which combines principles of DenseNets and U-Net for dense prediction tasks like semantic segmentation and depth estimation. We use the Make3D dataset (Saxena et al., 2005), which has 400 scenes for training and 134 scenes for testing. We use 57- and 103-layer networks, denoted FC-DenseNet57 and FC-DenseNet103 in Jégou et al. (2017). We set $L = 0.1$ and $k = \sqrt{n}$, where $n$ is the batch size. Also, while computing the calibration loss, we use pooling layers on the cumulative density values, to exploit locality and to decrease computation time. Results are presented in Table 2. Isotonic Calibration is a non-parametric method that requires a large amount of data to work well; this could be another reason for the poor performance of isotonic regression in some cases. For large calibration datasets, like Protein Structure (calibration dataset size $m = 36{,}584$) and Year Prediction MSD (calibration dataset size $m = 463{,}715$), Isotonic Regression offers better improvements. Although it can work well in such settings, the PDF after calibration becomes unusable. QR works well not only on small datasets but also on very large datasets, like Year Prediction MSD (nearly 60% improvement on both architectures). There is a small increase in RMSE and NLL for models trained with QR, but in many cases it is negligible.
A possible explanation is that a well-calibrated model is not necessarily close to the true model: being well-calibrated and being close to the true model are two separate things (see Sec. 4.3).

6. CONCLUSION

We have proposed a black-box calibration loss function that can be used with any probabilistic regression model. Unlike current methods for quantile calibration, our method is implicit in nature, does not require an additional calibration dataset and, more importantly, does not sacrifice the smoothness of the PDF. We conducted experiments to show the effectiveness of the proposed method.

APPENDIX A PROOFS

A.1 PROOF OF THEOREM 1

Theorem 1 For any model $M : \mathcal{X} \to (\mathcal{Y} \to [0, 1])$ and the canonical calibration mapping $R(p) = P\big([M(X)](Y) \le p\big)$, $R \circ M$ is quantile calibrated.

Proof: To show that $R \circ M$ is quantile calibrated, we need to show that $P\big[[(R \circ M)(X)](Y) \le R(p)\big] = R(p)$ for all $p \in [0, 1]$:
$$P\big( [(R \circ M)(X)](Y) \le R(p) \big) = P\big( R^{-1}\big([(R \circ M)(X)](Y)\big) \le R^{-1}(R(p)) \big) \quad (R^{-1} \text{ is strictly increasing})$$
$$= P\big( [M(X)](Y) \le p \big) = R(p) \quad \text{(by definition)}$$

A.2 PROOFS OF CLAIMS 1, 2, 3

Claim 1 Let $Y$ be a random variable with CDF $F$, and let $G = R \circ F$ be its CDF after composing with the mapping $R$ obtained from isotonic regression characterized by $C = \{c_{(1)}, c_{(2)}, \ldots, c_{(m)}\}$. If there exist indices $i-1, i, i+1 \in \{0, 1, \ldots, m\}$ such that $c_{(i)} - c_{(i-1)} \ne c_{(i+1)} - c_{(i)}$, then the CDF $G$ is not differentiable, and its corresponding probability density function $g$ is not continuous, at $F^{-1}(c_{(i)})$.

Proof: First, $G$ can be expressed as
$$G(x) = \begin{cases} \dfrac{F(x)}{m c_{(1)}} & -\infty < x \le F^{-1}(c_{(1)}) \\[4pt] \dfrac{F(x) - c_{(1)}}{m(c_{(2)} - c_{(1)})} + \dfrac{1}{m} & F^{-1}(c_{(1)}) < x \le F^{-1}(c_{(2)}) \\[4pt] \quad\vdots & \quad\vdots \\[4pt] \dfrac{F(x) - c_{(m-1)}}{m(c_{(m)} - c_{(m-1)})} + \dfrac{m-1}{m} & F^{-1}(c_{(m-1)}) < x \le F^{-1}(c_{(m)}) \end{cases}$$
Let $a = F^{-1}(c_{(1)})$. We show that $G$ is not differentiable at $a$; the other $m - 2$ switching points are handled similarly. The left derivative is
$$\lim_{x \to a^-} \frac{G(x) - G(a)}{x - a} = \frac{1}{m c_{(1)}} \lim_{x \to a^-} \frac{F(x) - F(a)}{x - a} = \frac{f(a)}{m c_{(1)}}$$
The right derivative is
$$\lim_{x \to a^+} \frac{G(x) - G(a)}{x - a} = \lim_{x \to a^+} \frac{\frac{F(x) - c_{(1)}}{m(c_{(2)} - c_{(1)})} + \frac{1}{m} - \frac{1}{m}}{x - a} = \frac{f(a)}{m(c_{(2)} - c_{(1)})}$$
Hence $G$ is not differentiable. Although the CDF is non-differentiable at only a finite number of points, we can still obtain the PDF by piecewise differentiation:
$$g(x) = \begin{cases} \dfrac{f(x)}{m c_{(1)}} & -\infty < x \le F^{-1}(c_{(1)}) \\[4pt] \dfrac{f(x)}{m(c_{(2)} - c_{(1)})} & F^{-1}(c_{(1)}) < x \le F^{-1}(c_{(2)}) \\[4pt] \quad\vdots & \quad\vdots \\[4pt] \dfrac{f(x)}{m(c_{(m)} - c_{(m-1)})} & F^{-1}(c_{(m-1)}) < x \le F^{-1}(c_{(m)}) \end{cases}$$
Now consider any $i-1, i, i+1 \in \{0, 1, \ldots, m\}$ such that $c_{(i)} - c_{(i-1)} \ne c_{(i+1)} - c_{(i)}$, and let $a = F^{-1}(c_{(i)})$. Then
$$\lim_{x \to a^-} g(x) = \frac{f(a)}{m(c_{(i)} - c_{(i-1)})}, \qquad \lim_{x \to a^+} g(x) = \frac{f(a)}{m(c_{(i+1)} - c_{(i)})}$$
Since the left and right limits do not coincide, by construction of the point $a$ the limit does not exist, and therefore $g$ is not continuous at $a$. Note that, most of the time, the hypothesis is satisfied, so the smoothness is lost.

Claim 2 Let $Y_{iso}$ be the transformed random variable after applying the isotonic mapping $R$ to the random variable $Y$. Then the expectation of $Y_{iso}$ is
$$E[Y_{iso}] = \mu - \frac{\sigma^2}{m} \sum_{i=0}^{m-1} \frac{f(F^{-1}(c_{(i+1)})) - f(F^{-1}(c_{(i)}))}{c_{(i+1)} - c_{(i)}}$$

Proof: Assume that, before the transformation, the random variable is distributed as $N(\mu, \sigma^2)$, so $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\big( -\frac{(x - \mu)^2}{2\sigma^2} \big)$. Writing $x = (x - \mu) + \mu$ and using linearity,
$$E[Y_{iso}] = \int_{-\infty}^{\infty} x\, g(x)\, dx = \sum_{i=0}^{m-1} \frac{1}{m(c_{(i+1)} - c_{(i)})} \int_{F^{-1}(c_{(i)})}^{F^{-1}(c_{(i+1)})} x f(x)\, dx$$
$$= \sum_{i=0}^{m-1} \frac{1}{m(c_{(i+1)} - c_{(i)})} \Big[ -\sigma^2 f(x) \Big]_{F^{-1}(c_{(i)})}^{F^{-1}(c_{(i+1)})} + \mu \sum_{i=0}^{m-1} \frac{F(F^{-1}(c_{(i+1)})) - F(F^{-1}(c_{(i)}))}{m(c_{(i+1)} - c_{(i)})}$$
where the first term uses $\int (x - \mu) f(x)\, dx = -\sigma^2 f(x)$ via the substitution $t = \frac{(x-\mu)^2}{2\sigma^2}$, and the second uses the definition of the CDF. Since $F(F^{-1}(c_{(i+1)})) - F(F^{-1}(c_{(i)})) = c_{(i+1)} - c_{(i)}$, each term of the second sum equals $\frac{1}{m}$, so the second sum equals $\mu$. Therefore
$$E[Y_{iso}] = \mu - \frac{\sigma^2}{m} \sum_{i=0}^{m-1} \frac{f(F^{-1}(c_{(i+1)})) - f(F^{-1}(c_{(i)}))}{c_{(i+1)} - c_{(i)}}$$

Lemma 1 Let $f_{\mu,\sigma}, F_{\mu,\sigma}, F^{-1}_{\mu,\sigma}$ be the density, distribution, and quantile functions, respectively, of the normal distribution with mean $\mu$ and standard deviation $\sigma$. Then
$$f_{\mu,\sigma}\big( F^{-1}_{\mu,\sigma}( F_{\mu_0,\sigma_0}(y_0) ) \big) = \frac{\sigma_0}{\sigma} f_{\mu_0,\sigma_0}(y_0)$$

Proof: We use the following three properties of normally distributed random variables: (1) $F_{\mu,\sigma}(y) = F_{0,1}\big(\frac{y - \mu}{\sigma}\big)$; (2) $f_{\mu,\sigma}(y) = \frac{1}{\sigma} f_{0,1}\big(\frac{y - \mu}{\sigma}\big)$; (3) $F^{-1}_{\mu,\sigma}(p) = \sigma F^{-1}_{0,1}(p) + \mu$. Then
$$f_{\mu,\sigma}\big( F^{-1}_{\mu,\sigma}( F_{\mu_0,\sigma_0}(y_0) ) \big) = f_{\mu,\sigma}\Big( F^{-1}_{\mu,\sigma}\big( F_{0,1}\big(\tfrac{y_0 - \mu_0}{\sigma_0}\big) \big) \Big) \quad \text{by (1)}$$
$$= f_{\mu,\sigma}\Big( \sigma F^{-1}_{0,1}\big( F_{0,1}\big(\tfrac{y_0 - \mu_0}{\sigma_0}\big) \big) + \mu \Big) \quad \text{by (3)}$$
$$= f_{\mu,\sigma}\Big( \sigma \tfrac{y_0 - \mu_0}{\sigma_0} + \mu \Big) \quad (F^{-1}(F(x)) = x)$$
$$= \frac{1}{\sigma} f_{0,1}\Big( \tfrac{y_0 - \mu_0}{\sigma_0} \Big) \quad \text{by (2)}$$
$$= \frac{\sigma_0}{\sigma} \cdot \frac{1}{\sigma_0} f_{0,1}\Big( \tfrac{y_0 - \mu_0}{\sigma_0} \Big) = \frac{\sigma_0}{\sigma} f_{\mu_0,\sigma_0}(y_0) \quad \text{by (2)}$$

Claim 3 Let $c_{(i)} = F_{\mu_{(i)}, \sigma_{(i)}}(y_{(i)})$ and $p_{(i)} = f_{\mu_{(i)}, \sigma_{(i)}}(y_{(i)})$. Then
$$E[Y_{iso}] = \mu - \sigma \underbrace{\frac{1}{m} \sum_{i=0}^{m-1} \frac{\sigma_{(i+1)} p_{(i+1)} - \sigma_{(i)} p_{(i)}}{c_{(i+1)} - c_{(i)}}}_{\delta}$$

Proof: We re-substitute $c_{(i+1)} = F_{\mu_{(i+1)}, \sigma_{(i+1)}}(y_{(i+1)})$ and apply Lemma 1, so that $f(F^{-1}(c_{(i+1)})) = \frac{\sigma_{(i+1)}}{\sigma} p_{(i+1)}$. Using $f(F^{-1}(c_{(0)})) = f(F^{-1}(0)) = \lim_{x \to -\infty} f(x) = 0$ and the conventions $\sigma_{(0)} p_{(0)} = 0$, $c_{(0)} = 0$,
$$\mu - \frac{\sigma^2}{m} \sum_{i=0}^{m-1} \frac{f(F^{-1}(c_{(i+1)})) - f(F^{-1}(c_{(i)}))}{c_{(i+1)} - c_{(i)}} = \mu - \frac{\sigma}{m} \left[ \frac{\sigma_{(1)} p_{(1)}}{c_{(1)}} + \sum_{i=1}^{m-1} \frac{\sigma_{(i+1)} p_{(i+1)} - \sigma_{(i)} p_{(i)}}{c_{(i+1)} - c_{(i)}} \right] = \mu - \frac{\sigma}{m} \sum_{i=0}^{m-1} \frac{\sigma_{(i+1)} p_{(i+1)} - \sigma_{(i)} p_{(i)}}{c_{(i+1)} - c_{(i)}}$$

A.3 PROOFS OF CLAIM 4 AND CLAIM 5

Claim 4 Let $M$ be any regression model. Then $M$ is perfectly quantile calibrated iff $KL\big([M(X)](Y) \,\|\, U\big) = 0$.

Proof: The KL divergence between two distributions is zero iff the distributions coincide almost everywhere. Hence $KL\big([M(X)](Y) \,\|\, U\big) = 0$ iff $[M(X)](Y) \sim \text{Uniform}[0, 1]$, which, as argued in Sec. 4, is exactly the condition for quantile calibration.

Claim 5 Let $f_{\mu,\sigma}$ be the marginal density of $Y$ and $F_{\mu_x,\sigma_x} = M(x)$ the model's predicted cumulative distribution given $X = x$. If $f_{\mu,\sigma}(F^{-1}_{\mu_x,\sigma_x}(p)) = f_{\mu_x,\sigma_x}(F^{-1}_{\mu_x,\sigma_x}(p))$ for all $x$ and all $p \in [0, 1]$, then $M$ is quantile calibrated.

Proof:
$$f_{[M(X)](Y)}(p) = \int_{\mathcal{X}} f_{[M(x)](Y) \mid X = x}(p)\, f_X(x)\, dx = \int_{\mathcal{X}} \frac{d}{dp} P\big( [M(x)](Y) \le p \mid X = x \big)\, f_X(x)\, dx$$
$$= \int_{\mathcal{X}} \frac{d}{dp} P\big( F_{\mu_x,\sigma_x}(Y) \le p \big)\, f_X(x)\, dx = \int_{\mathcal{X}} \frac{d}{dp} P\big( Y \le F^{-1}_{\mu_x,\sigma_x}(p) \big)\, f_X(x)\, dx$$
$$= \int_{\mathcal{X}} \frac{d}{dp} F_{\mu,\sigma}\big( F^{-1}_{\mu_x,\sigma_x}(p) \big)\, f_X(x)\, dx = \int_{\mathcal{X}} f_{\mu,\sigma}\big( F^{-1}_{\mu_x,\sigma_x}(p) \big) \cdot \frac{1}{f_{\mu_x,\sigma_x}\big( F^{-1}_{\mu_x,\sigma_x}(p) \big)}\, f_X(x)\, dx$$
$$= \int_{\mathcal{X}} 1 \cdot f_X(x)\, dx = 1$$
Since $[M(X)](Y) \sim \text{Uniform}[0, 1]$, we conclude that $M$ is perfectly quantile calibrated.

B IMPLEMENTATION DETAILS

B.1 CALCULATING NEGATIVE LOG-LIKELIHOOD AFTER ISOTONIC CALIBRATION

Note that we have an analytical expression for the updated density function $g$ (see Eq. 10). Given a test input $x_{test}$, predicted CDF $F$ with corresponding PDF $f$, and target $y_{test}$, we now describe the procedure to calculate the likelihood of $y_{test}$ after Isotonic Calibration, i.e., $g(y_{test})$. Recall that the $c_{(i)}$ are the ordered x-coordinates of the recalibration dataset. Since $F$ is monotonic, $F^{-1}(c_{(i-1)}) \le x < F^{-1}(c_{(i)})$ implies $c_{(i-1)} \le F(x) < c_{(i)}$. We can therefore binary-search for the index $i$ such that $c_{(i-1)} \le F(x) < c_{(i)}$ and scale appropriately. This is summarized below:

1: $\mu, \sigma \leftarrow \mathrm{MODEL}(x_{test})$
2: $F_y \leftarrow F_{\mu,\sigma}(y_{test})$
3: $f_y \leftarrow f_{\mu,\sigma}(y_{test})$
4: $c \leftarrow [0.0, c_{(1)}, c_{(2)}, \ldots, c_{(m)}]$
5: $i \leftarrow \mathrm{BINARYSEARCH}(c, F_y)$
6: if $F_y \le c_{(m)}$ then
7:   $f_{y,iso} \leftarrow \dfrac{f_y}{m (c_{(i)} - c_{(i-1)})}$
8: else
9:   $f_{y,iso} \leftarrow 0$
10: end if
11: return $f_{y,iso}$

B.2 CALCULATION OF THE TRUNCATION POINT

The above algorithm clearly elucidates why there is a possibility of assigning zero likelihood at test time after Isotonic Calibration, as discussed in Sec. 3. If $F(y_{test}) > c_{(m)}$, then $g(y_{test}) = 0$. So the truncation point is $y_{trun} = F^{-1}(c_{(m)})$, and the support of the random variable is reduced from $(-\infty, +\infty)$ to $(-\infty, y_{trun}]$: every point in $(y_{trun}, \infty)$ is assigned zero likelihood, which is extremely undesirable. A simple way to circumvent this was proposed in the discussion in Sec. 3.
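The lookup procedure in B.1 can be sketched as follows (our own naming, assuming a Gaussian likelihood); the density is rescaled by the width of the bin containing $F(y_{test})$, and zeroed beyond the truncation point $c_{(m)}$.

```python
import bisect
from math import erf, exp, pi, sqrt

def likelihood_after_isotonic(y_test, mu, sigma, c_sorted):
    """Density of y_test under the isotonic-recalibrated model:
    scale the predicted density by 1 / (m * (c_(i) - c_(i-1))) for the
    bin containing F(y_test); zero beyond the truncation point c_(m)."""
    m = len(c_sorted)
    Fy = 0.5 * (1.0 + erf((y_test - mu) / (sigma * sqrt(2.0))))
    fy = exp(-0.5 * ((y_test - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))
    if Fy > c_sorted[-1]:
        return 0.0                        # truncated support: zero likelihood
    edges = [0.0] + list(c_sorted)
    i = bisect.bisect_left(edges, Fy)     # bin with c_(i-1) <= Fy <= c_(i)
    i = max(i, 1)
    return fy / (m * (edges[i] - edges[i - 1]))
```

With uniformly spaced $c_{(i)}$ (e.g., $c = (0.25, 0.5, 0.75, 1.0)$ for $m = 4$), every scaling factor equals 1 and the density is unchanged, which matches the intuition that an already-calibrated model needs no correction.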

C EXPERIMENTS

C.1 BAYESIAN LINEAR REGRESSION

For Bayesian Linear Regression, we use sklearn's implementation (Pedregosa et al., 2011).

Figure 6: On X-axis we have Root Mean Square Error (RMSE). On Y-axis we have the QR-reg parameter L. Each curve with the same color represents RMSE for a particular dataset as we vary the QR-reg parameter L over {1, 2, 3, 4, 5}. Spacing value is set to k = √n. Here the model is Dropout. Datasets are divided into two groups based on the scale of RMSE, which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht, protein}. On the right, the plot is for {fish, red, white, kin8nm}.

Figure 9: On X-axis we have Calibration Error (%). On Y-axis we have the spacing multiplier t. The resultant spacing value for spacing multiplier t is k = t·√n. Each curve with the same color represents Calibration Error (%) for a particular dataset as we vary the spacing multiplier t over {1, 2, 3, 4, 5}. Here the model is Dropout. The Quantile Regularization parameter is fixed at L = 1. Datasets are divided into two groups based on the scale of Calibration Error (%), which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht}. On the right, the plot is for {fish, kin8nm, red, white, protein, year}.

C.5 ISOTONIC CALIBRATION WITH K-FOLD CROSS VALIDATION

Note that, in the UCI experiments, we used the training data as was done in Kuleshov et al. (2018). In this section, we consider Isotonic Calibration with K-fold cross validation, summarized below. FM denotes the final model after recalibration that is used at test time.



Figure 1: Computation of loss in the training loop when augmented with Quantile Regularization (QR). Parts in red constitute QR. Total Loss = Loss − Entropy.

Figure 2: Fig. 2a shows 256 training data points. Fig. 2b shows the calibration mapping learnt using Isotonic Calibration with 32 samples. Fig. 2c shows the predicted PDF at test time before applying Isotonic Calibration, and the truncation and shift in mean that happen if the mapping in Fig. 2b is applied; it also shows that such truncation can affect the support. Fig. 2d shows the resulting PDF after Isotonic Calibration.

We use L = 1 for Dropout-VI and L = 5 for Deep Ensembles. The spacing value is chosen as k = √n for all datasets except Year Prediction MSD, for which we use k = 3·√n for both architectures, where n is the batch size. See Sec. C.3 and Sec. C.4 for detailed experiments on how the values of L and k influence calibration error and RMSE. The code and link to the datasets can be found here: https://github.com/occam-ra-zor/QR

5.3 MONOCULAR DEPTH ESTIMATION

Now we consider the problem of Monocular Depth Estimation. We use the architecture presented in Jégou et al.

Figure 5: On X-axis we have Calibration Error (%). On Y-axis we have the QR-reg parameter L. Each curve with the same color represents calibration error for a particular dataset as we vary the QR-reg parameter L over {1, 2, 3, 4, 5}. Spacing value is set to k = √n. Here the model is Dropout. Datasets are divided into two groups based on the scale of Calibration Error, which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht, kin8nm}. On the right, the plot is for {fish, red, white, protein}.

Figure 7: On X-axis we have Calibration Error (%). On Y-axis we have the QR-reg parameter L. Each curve with the same color represents calibration error for a particular dataset as we vary the QR-reg parameter L over {1, 2, 3, 4, 5}. Spacing value is set to k = √m. Here the model is Deep Ensemble. Datasets are divided into two groups based on the scale of Calibration Error, which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht, kin8nm}. On the right, the plot is for {fish, red, white, protein}.

Figure 8: On X-axis we have Root Mean Square Error (RMSE). On Y-axis we have the QR-reg parameter L. Each curve with the same color represents RMSE for a particular dataset as we vary the QR-reg parameter L over {1, 2, 3, 4, 5}. Spacing value is set to k = √m. Here the model is Deep Ensemble. Datasets are divided into two groups based on the scale of RMSE, which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht, protein}. On the right, the plot is for {fish, kin8nm, red, white}.

Figure 10: On X-axis we have Root Mean Square Error (RMSE). On Y-axis we have the spacing multiplier t. The resultant spacing value for spacing multiplier t is k = t·√n. Each curve with the same color represents RMSE for a particular dataset as we vary the spacing multiplier t over {1, 2, 3, 4, 5}. Here the model is Dropout. The Quantile Regularization parameter is fixed at L = 1. Datasets are divided into two groups based on the scale of RMSE, which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht, protein, year}. On the right, the plot is for {fish, red, white, kin8nm}.

Figure 11: On X-axis we have Calibration Error (%). On Y-axis we have the spacing multiplier t. The resultant spacing value for spacing multiplier t is k = t·√n. Each curve with the same color represents Calibration Error (%) for a particular dataset as we vary the spacing multiplier t over {1, 2, 3, 4, 5}. Here the model is Deep Ensemble. The Quantile Regularization parameter is fixed at L = 5. Datasets are divided into two groups based on the scale of Calibration Error (%), which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht}. On the right, the plot is for {fish, kin8nm, white, protein, year}.

Figure 12: On X-axis we have Root Mean Square Error (RMSE). On Y-axis we have the spacing multiplier t. The resultant spacing value for spacing multiplier t is k = t·√n. Each curve with the same color represents RMSE for a particular dataset as we vary the spacing multiplier t over {1, 2, 3, 4, 5}. Here the model is Deep Ensemble. The Quantile Regularization parameter is fixed at L = 5. Datasets are divided into two groups based on the scale of RMSE, which is useful for viewing the plots. On the left, the plot is for {airfoil, boston, concrete, yacht, protein, year}. On the right, the plot is for {fish, red, white, kin8nm}.

1: Initialize K models M[1], ..., M[K]
2: Get K training datasets TrainData[1], ..., TrainData[K]
3: Get K calibration datasets CalibrationDataset[1], ..., CalibrationDataset[K]
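The steps above can be sketched as follows, with sklearn's `BayesianRidge` standing in as a placeholder base model and `IsotonicRegression` fitting the recalibration map on each held-out fold; the choice of base model, the synthetic data, and K = 5 are our own illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import KFold

def kfold_isotonic_calibration(X, y, K=5):
    """Train one model per fold; fit an isotonic recalibration map on each held-out fold."""
    models, recalibrators = [], []
    for train_idx, cal_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        model = BayesianRidge().fit(X[train_idx], y[train_idx])    # M[k]
        mu, sigma = model.predict(X[cal_idx], return_std=True)
        c = np.sort(norm.cdf(y[cal_idx], loc=mu, scale=sigma))     # predicted CDF values
        emp = np.arange(1, len(c) + 1) / len(c)                    # empirical quantiles
        R = IsotonicRegression(y_min=0.0, y_max=1.0,
                               out_of_bounds="clip").fit(c, emp)   # recalibration map
        models.append(model)
        recalibrators.append(R)
    return models, recalibrators

# synthetic data for a quick run
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)
models, recalibrators = kfold_isotonic_calibration(X, y)
```

The final model FM would then combine the K recalibrated CDFs at test time (e.g., by averaging), which the truncated listing above does not spell out.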

and sorts the points in D based on their first coordinates, and then applies Isotonic Regression to get an isotonic mapping R. Let x_test be a test input and F_xtest = M(x_test) be the CDF of its predicted output. During post-hoc calibration, we get a new CDF as G_xtest = R ∘ F_xtest. Confidence intervals can be obtained from G_xtest. Note that the parameters that parametrize G_xtest (after calibration) are different from those of F_xtest (before calibration).
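Given a fitted mapping R and a Gaussian predicted CDF, a credible interval under G_xtest = R ∘ F_xtest can be obtained by inverting G numerically. The grid-based inversion below is an illustrative sketch under those assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.stats import norm

def credible_interval(R, mu, sigma, lo=0.05, hi=0.95, grid_size=10001):
    """Invert G = R o F on a grid: return y values where G(y) first reaches lo and hi.

    R : monotone map on [0, 1] (e.g., a fitted isotonic recalibrator's predict).
    """
    ys = np.linspace(mu - 8 * sigma, mu + 8 * sigma, grid_size)
    G = R(norm.cdf(ys, loc=mu, scale=sigma))  # recalibrated CDF values on the grid
    return ys[np.searchsorted(G, lo)], ys[np.searchsorted(G, hi)]
```

With R the identity map, this recovers the quantiles of F itself, which is a convenient sanity check.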

as a differentiable relaxation to sorting. We summarize our Quantile Regularization algorithm below.

Algorithm 1 Quantile Regularization
Precondition: (x_i, y_i) are n i.i.d. training instances, μ_i, σ_i = MODEL_w(x_i), and DIFFSORT is any differentiable relaxation of the sorting operation.
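A minimal NumPy sketch of the quantity Algorithm 1 computes: a spacings-based (Vasicek-style) entropy estimate of the sorted CDF values, negated so that it can be added to the loss (Total Loss = Loss − Entropy). A plain hard `np.sort` stands in for DIFFSORT here, so this version is not differentiable end-to-end, and the clipping constant is our own numerical safeguard:

```python
import numpy as np
from scipy.stats import norm

def quantile_regularizer(mu, sigma, y, k=None):
    """Negative spacings-based entropy estimate of the CDF values
    c_i = F_{mu_i, sigma_i}(y_i); near 0 when the c_i look Uniform[0, 1]."""
    n = len(y)
    k = k or int(np.sqrt(n))                         # spacing value k = sqrt(n)
    c = np.sort(norm.cdf(y, loc=mu, scale=sigma))    # sorted CDF values (hard sort)
    spacings = np.clip(c[k:] - c[:-k], 1e-12, None)  # c_(i+k) - c_(i)
    entropy = np.mean(np.log((n + 1) / k * spacings))
    return -entropy                                  # Total Loss = Loss - Entropy

rng = np.random.default_rng(0)
mu = rng.normal(size=5000)
y = rng.normal(loc=mu, scale=1.0)
reg_calibrated = quantile_regularizer(mu, np.ones(5000), y)           # well calibrated
reg_overconfident = quantile_regularizer(mu, 0.3 * np.ones(5000), y)  # sigma too small
```

Since Uniform[0, 1] has the maximal entropy (zero) among distributions on [0, 1], the regularizer is near zero for a calibrated model and grows as the CDF values pile up near 0 and 1 for an overconfident one.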

Base Model is the Dropout-VI model without Quantile Regularization, and QR is the Base Model trained with Quantile Regularization. NLL stands for negative log likelihood. RMSE stands for Root Mean Square Error. Bold indicates that the 1-std intervals do not overlap.

Table 3, with additional details and experiments in Sec. C.6.

5.4 DISCUSSION

In both cases, Dropout VI and Deep Ensembles, calibration error improves when trained with the quantile regularizer. Table 3 indicates that Quantile Regularization is effective in large-scale architectures.

Base Model is Deep Ensemble without Quantile Regularization, and QR is the Base Model trained with Quantile Regularization. NLL stands for negative log likelihood. RMSE stands for Root Mean Square Error. Bold indicates that the 1-std intervals do not overlap.

Base represents model trained without Quantile Regularization and QR represents Base model trained with Quantile Regularization.

1] since we are assuming that R(p) is an invertible function, which gives us that it is surjective. An equivalent way of showing this is that P

base+iso is Dropout-VI without Quantile Regularization and after isotonic calibration, and QR+iso is the Base Model trained with Quantile Regularization and after isotonic calibration. RMSE stands for Root Mean Square Error. Maximum Likelihood is the maximum of the likelihoods over the test points.

base+iso is Deep Ensemble without Quantile Regularization and after isotonic calibration, and QR+iso is the Base Model trained with Quantile Regularization and after isotonic calibration. RMSE stands for Root Mean Square Error. Maximum Likelihood is the maximum of the likelihoods over the test points.

C.3 VARYING QUANTILE REGULARIZATION PARAMETER L

In the following sections, we show how the quantile regularization parameter L affects both calibration error and Root Mean Square Error (RMSE) for Dropout-VI in Sec. C.3.1 and for Deep Ensembles in Sec. C.3.2. We do so by fixing the spacing value to k = √n, where n is the batch size, and varying L.

Base Model is Dropout-VI model without Quantile Regularization and QR is when Base Model is trained with Quantile Regularization. '5 fold cross validation' column considers the procedure described above.

Base Model is Deep Ensemble without Quantile Regularization and QR is when Base Model is trained with Quantile Regularization. '5 fold cross validation' column considers the procedure described above.

C.2.2 CALIBRATION PLOTS

We scaled down the images to 115 × 153. We used a batch size of 4 and a learning rate of 5e-5, and trained for 1500 epochs with the Adam optimizer with a step decay of 4.

C.6.1 USAGE OF POOLING LAYERS

Instead of directly using Quantile Regularization as suggested in Alg. 1, we added an average pooling layer before computing entropy over the cumulative density values. This is justified in principle because an average of CDF functions is again a valid CDF. Note that this serves two purposes.
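The pooling step can be sketched as non-overlapping 2-D average pooling applied to the per-pixel map of CDF values before the entropy computation; the window size below is an arbitrary illustrative choice:

```python
import numpy as np

def avg_pool2d(cdf_map, pool=2):
    """Non-overlapping 2-D average pooling over a per-pixel map of CDF values.

    Since an average of CDF values is again a valid CDF value, the pooled
    outputs stay in [0, 1] and can be fed to the entropy estimator as before.
    """
    H, W = cdf_map.shape
    H, W = (H // pool) * pool, (W // pool) * pool  # drop any ragged border
    blocks = cdf_map[:H, :W].reshape(H // pool, pool, W // pool, pool)
    return blocks.mean(axis=(1, 3))                # one averaged CDF value per window
```

Besides the validity argument, pooling shrinks the number of values entering the sort, which is what makes the regularizer affordable at per-pixel resolution.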

D DATASET DIMENSIONS

We consider the following datasets (size of data, number of input features): AirFoil (1503, 6), Boston Housing (506, 13), Concrete Strength (1030, 8), Fish Toxicity (908, 7), Kin8nm (8192, 9), Protein Structure (45730, 10), Red Wine (1599, 12), White Wine (4898, 12), Yacht Hydrodynamics (308, 6), Year Prediction MSD (515345, 91).

