QUANTILE REGULARIZATION: TOWARDS IMPLICIT CALIBRATION OF REGRESSION MODELS

Abstract

Deep learning models are often poorly calibrated, i.e., they may produce overconfident predictions that are wrong, implying that their uncertainty estimates are unreliable. While a number of approaches have been proposed recently to calibrate classification models, relatively little work exists on calibrating regression models. Isotonic Regression has recently been advocated for regression calibration. We provide a detailed formal analysis of the side-effects of Isotonic Regression when used for regression calibration. To address these, we investigate the idea of quantile calibration (Kuleshov et al., 2018), recast it as entropy estimation, and leverage the new formulation to construct a novel quantile regularizer, which can be used as a black box to calibrate any probabilistic regression model. Unlike most existing approaches for calibrating regression models, which are based on post hoc processing of the model's output and require an additional dataset, our method is trainable in an end-to-end fashion without requiring an additional dataset. We provide empirical results demonstrating that our approach improves calibration for regression models built on diverse architectures that provide uncertainty estimates, such as Dropout VI and Deep Ensembles.

1. INTRODUCTION

For supervised machine learning, the notion of calibration of a learned predictive model measures how well the model's confidence in its predictions matches the correctness of those predictions. For example, a binary classifier is considered perfectly calibrated if, among all predictions with probability score 0.9, 90% of the predictions are correct (Guo et al., 2017). Likewise, consider a probabilistic regression model that produces credible intervals for the predicted outputs. In this setting, the model is considered perfectly calibrated if the 90% confidence interval contains 90% of the true test outputs (Kuleshov et al., 2018). Unfortunately, modern deep neural networks are known to be poorly calibrated (Guo et al., 2017), raising questions about their reliability. The notion of calibration for classification problems was first considered in the meteorology literature (Brier, 1950; Murphy, 1972; Gneiting & Raftery, 2007) and saw one of its first prominent uses in the machine learning literature in (Platt et al., 1999), in the context of Support Vector Machines (SVMs), to obtain probabilistic predictions from SVMs, which are inherently non-probabilistic models. Recently, there has been renewed interest in calibration, especially for classification models, after it was shown (Guo et al., 2017) that modern deep neural networks for classification are often poorly calibrated. Popular notions of calibration for classification include confidence calibration, multiclass calibration, and classwise calibration (Kumar et al., 2019; Vaicenavicius et al., 2019; Kull et al., 2019). Most calibration methods for classification models (Platt et al., 1999; Zadrozny & Elkan, 2001; 2002; Guo et al., 2017; Kull et al., 2017; 2019) are post hoc: they learn a calibration mapping R : [0, 1] → [0, 1] on an additional dataset to recalibrate an already trained model.
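The classification notion above can be checked empirically with the standard reliability-diagram computation: bin predictions by confidence and compare each bin's average confidence to its accuracy. The following is an illustrative sketch (function names and the equal-width binning scheme are our own choices, not taken from the works cited here):

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Bin predictions by confidence and report, per non-empty bin,
    (average confidence, empirical accuracy, bin count) -- the data
    behind a reliability diagram."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # half-open bins (lo, hi]
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean(), mask.sum()))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: count-weighted average gap between confidence and accuracy."""
    n = len(confidences)
    return sum(w * abs(conf - acc)
               for conf, acc, w in reliability_bins(confidences, correct, n_bins)) / n
```

For instance, ten predictions all made with confidence 0.9, of which nine are correct, give an ECE of (essentially) zero, matching the perfectly calibrated case described above.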
There has been recent work showing that some of these popular post hoc methods are either themselves miscalibrated or sample-inefficient (Kumar et al., 2019), and that they do not actually help the model output well-calibrated probabilities. An alternative to post hoc processing is to ensure that the model outputs well-calibrated probabilities when training finishes. We refer to these as implicit calibration methods. Notably, such an approach does not require an additional dataset to learn the calibration mapping. While almost all post hoc calibration methods for classification models can be seen in a unified manner as density estimation methods (see Section 2.1), existing implicit calibration methods for classification models have been designed with various, often distinct, considerations. Several heuristics, such as Mixup (Zhang et al., 2017; Thulasidasan et al., 2019) and Label Smoothing (Szegedy et al., 2016; Müller et al., 2019), that were part of high-performance deep networks for classification were later shown empirically to improve calibration. (Maddox et al., 2019) show that their optimization method intrinsically improves calibration. (Pereyra et al., 2017) found that penalizing high-confidence predictions acts as a regularizer. A more principled way of achieving implicit calibration is to minimize a loss function tailored for calibration (Kumar et al., 2018). This is similar in spirit to our proposed approach, which aims to achieve this for regression models. Among early work on calibrating regression models, (Gneiting et al., 2007) were the first to propose a framework for assessing calibration; however, they do not provide any procedure to correct a miscalibrated model. Recently, (Kuleshov et al., 2018) introduced the notion of Quantile Calibration, which intuitively says that the p-confidence interval predicted by the model should contain the target variable with probability p.
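To make the Quantile Calibration notion concrete, the sketch below (an illustrative example of ours, not code from any of the cited papers) checks it empirically for a model that outputs Gaussian predictive distributions with per-instance mean and standard deviation: for each level p, the fraction of observed targets falling below the predicted p-quantile should be close to p.

```python
import math
import numpy as np

def gaussian_cdf(y, mu, sigma):
    # Predicted CDF F_i(y_i), assuming Gaussian predictive distributions.
    z = (np.asarray(y) - np.asarray(mu)) / (np.asarray(sigma) * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def quantile_calibration_curve(mu, sigma, y, levels):
    """Empirical frequency of targets below each predicted p-quantile.

    A quantile-calibrated model returns frequencies close to the levels p
    (y_i lies below the p-quantile exactly when F_i(y_i) <= p).
    """
    pit = gaussian_cdf(y, mu, sigma)  # probability integral transform values
    return np.array([(pit <= p).mean() for p in levels])
```

A well-specified model traces the diagonal; a systematically overconfident model (predicted sigma too small) piles PIT mass near 0 and 1, bending the curve away from it.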
They use a post hoc calibration method based on Isotonic Regression (Fawcett & Niculescu-Mizil, 2007), a well-known calibration technique for classification models. Isotonic Calibration in classification differs from Isotonic Calibration in regression in terms of (i) the dataset on which the calibration mapping is learned; and (ii) the function with which the learned calibration mapping is pre-composed: in the former case it is pre-composed with a probability mass function (PMF), whereas in the latter it is pre-composed with a cumulative distribution function (CDF). Both these differences have side effects. In particular, (i) the recalibration dataset already satisfies the monotonicity constraint by construction, so there is a risk of overfitting with smaller calibration datasets; and (ii) composing the CDF with a piecewise linear function can make the resultant CDF discontinuous and the corresponding PDF non-differentiable (see Sec. 3 for a detailed discussion of the side effects of the Isotonic Calibration approach). In another recent work, (Song et al., 2019) proposed a much stronger notion of calibration called Distributional Calibration, which guarantees that among all instances whose predicted probability density function (PDF) of the response variable has mean µ and standard deviation σ, the marginal distribution of the target variable has mean µ and standard deviation σ. They too propose a post hoc recalibration method, based on Gaussian processes, which can be computationally expensive. Among other work, (Keren et al., 2018) consider a different setting where neural networks for classification are used for regression problems, and showed that temperature scaling (Hinton et al., 2015; Guo et al., 2017) and their proposed method based on empirical prediction intervals improve calibration for regression problems as well. Again, these are post hoc methods. Our contributions are summarized below: 1.
We analyze in detail the side effects of Isotonic Calibration for regression models. We show how using Isotonic Calibration truncates the support of the predictive distribution, which can result in assigning zero likelihood to observations at test time. We also discuss how Isotonic Calibration yields non-smooth PDFs, and its tendency to produce miscalibration when small calibration datasets are used. 2. At test time, after composing the predicted CDF with the learned isotonic mapping, the mean prediction (point estimate) also changes. Kuleshov et al. (2018) do not acknowledge the change in the mean estimate. While Song et al. (2019) acknowledge this issue, they use a trapezoidal approximation to remedy it. In contrast, we derive an analytical expression for the updated point estimate after Isotonic Calibration. We also provide an alternative expression for the updated point estimate, which reduces the time complexity from O(m) to O(1), where m is the calibration dataset size. 3. To mitigate these shortcomings of Isotonic Calibration, we propose a simple, yet novel and general-purpose, trainable loss function for quantile calibration in which the smoothness of the PDF/CDF is not sacrificed for well-calibrated probabilities. Our approach also eliminates the need for an additional calibration dataset. 4. We conduct extensive experiments using the proposed loss function (Quantile Regularization) and show empirically that it improves calibration on a wide range of architectures that produce uncertainty estimates.
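The post hoc isotonic recalibration of Kuleshov et al. (2018) discussed above can be sketched as follows. This is a minimal NumPy sketch under our own naming, not the authors' code: because the recalibration pairs are monotone by construction (the point made above about overfitting risk), the fitted isotonic map agrees at the data points with the empirical CDF of the PIT values, which we use directly here.

```python
import numpy as np

def fit_recalibrator(pit_cal):
    """Fit a post hoc recalibration map R : [0, 1] -> [0, 1] from the
    calibration-set PIT values t_i = F_i(y_i).

    R(p) is the empirical frequency of PIT values <= p; the recalibrated
    CDF of a new prediction is then the composition R(F(y))."""
    t = np.sort(np.asarray(pit_cal, dtype=float))

    def R(p):
        # Fraction of calibration PIT values <= p: a step function on [0, 1].
        return np.searchsorted(t, np.asarray(p, dtype=float), side="right") / len(t)

    return R
```

Note that R is a step function (sklearn's IsotonicRegression would instead interpolate piecewise linearly between the same points), so composing it with a smooth predicted CDF produces exactly the discontinuous CDF / non-differentiable PDF side effect analyzed in contribution 1.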

