CONDITIONAL COVERAGE ESTIMATION FOR HIGH-QUALITY PREDICTION INTERVALS

Anonymous

Abstract

Deep learning has achieved state-of-the-art performance in generating high-quality prediction intervals (PIs) for uncertainty quantification in regression tasks. The high-quality criterion requires PIs to be as narrow as possible, whilst maintaining a pre-specified level of data (marginal) coverage. However, most existing works on high-quality PIs lack accurate information on conditional coverage, which may cause unreliable predictions if it is significantly smaller than the marginal coverage. To address this problem, we propose a novel end-to-end framework which outputs high-quality PIs and simultaneously provides their conditional coverage estimation. In doing so, we design a new loss function that is both easy-to-implement and theoretically justified via an exponential concentration bound. Our evaluation on real-world benchmark datasets and synthetic examples shows that our approach not only outperforms state-of-the-art methods for high-quality PIs in terms of average PI width, but also accurately estimates conditional coverage information that is useful in assessing model uncertainty.

1. INTRODUCTION

The prediction interval (PI) is poised to play an increasingly prominent role in uncertainty quantification for regression tasks (Khosravi et al., 2010; 2011; Galván et al., 2017; Rosenfeld et al., 2018; Tagasovska & Lopez-Paz, 2018; 2019; Romano et al., 2019; Wang et al., 2019; Kivaranovic et al., 2020). A high-quality PI should be as narrow as possible, whilst maintaining a pre-specified level of data coverage, or marginal coverage (Pearce et al., 2018). Compared with PIs obtained from coverage-only considerations, the "high-quality" criterion is beneficial in balancing marginal coverage probability against interval width. However, the conditional coverage given a feature, which is critical for making reliable context-based decisions, is unassessed and missing in most existing works on high-quality PIs. In the presence of heteroskedasticity and model misspecification, the marginal coverage can be very different from the conditional coverage at a given point, which affects any downstream decision-making task that relies on the uncertainty information provided by the PI. Our main goal is to meaningfully incorporate and assess conditional coverage in high-quality PIs.

Conditional coverage estimation is challenging for two reasons. The first is that the natural evaluation metric of conditional coverage error, an L_p distance between the estimated and ground-truth conditional coverages, is difficult to compute, as it requires obtaining the conditional probability given a feature x, which is arguably as challenging as the regression problem itself. Our first goal in this paper is to address this issue by developing a new metric, the calibration-based conditional coverage error, for measuring conditional coverage estimation. Our approach is inspired by the calibration notion in classification (Guo et al., 2017). The basic idea is to relax conditional coverage at any given point to an average over all points that bear the same estimated value.
An estimator satisfying the relaxed property is regarded as well-calibrated. In regression, the calibration-based conditional coverage error provides a middle ground between the enforcement of marginal coverage (lacking any conditional information) and conditional coverage (computationally intractable). Compared with conditional coverage, this middle-ground metric can be viewed as a "dimension reduction" of the conditioning variable from the original sample space to the space [0, 1], so that we can easily discretize to compute the empirical metric values.

The second challenge is the discontinuity in the above metrics, which hinders efficient training of PIs that are both high-quality and possess reliable conditional coverage information. To address this, we design a new loss function based on a combination of the high-quality criterion and a coverage assessment loss. The latter can be flexibly added as a separate module to any neural network (NN) used to train PIs. It is based on an empirical version of a tight upper bound on the coverage error in terms of a Kullback-Leibler (KL) divergence, which can be readily employed for running gradient descent. We theoretically show how training with our proposed loss function attains this upper-bounding value via a concentration bound. We also demonstrate the empirical performance of our approach in terms of PI quality and conditional coverage assessment compared with benchmark methods.

Summary of Contributions:

(1) We identify the conditional coverage estimation problem as a new challenge for high-quality PIs and introduce a new evaluation metric for coverage estimation.

(2) We propose an end-to-end algorithm that can simultaneously construct high-quality PIs and generate conditional coverage estimates. In addition, we provide theoretical justification of the effectiveness of our algorithm by developing concentration bounds relating the coverage assessment loss and the conditional coverage error.

(3) By evaluating on benchmark datasets and synthetic examples, we empirically demonstrate that our approach not only achieves high performance on conditional coverage estimation, but also outperforms state-of-the-art algorithms on high-quality PI generation.

2. EVALUATING CONDITIONAL COVERAGE FOR HIGH-QUALITY PIS

Let X ∈ 𝒳 and Y ∈ 𝒴 ⊂ R be random variables denoting the input feature and label, where the pair (X, Y) follows an (unknown) ground-truth joint distribution π(X, Y). Let π(Y|X) be the conditional distribution of Y given X. We are given the training data D := {(x_i, y_i), i = 1, 2, ..., n}, where the (x_i, y_i) are i.i.d. realizations of the random variables (X, Y). A PI refers to an interval [L(x), U(x)], where L, U are two functions mapping from 𝒳 to 𝒴 trained on the data D. [L(x), U(x)] is called a PI at prediction level 1 - α (0 ≤ α ≤ 1) if its marginal coverage is not less than 1 - α, i.e., P[Y ∈ [L(X), U(X)] | L, U] ≥ 1 - α, where P is with respect to a new test point (X, Y) ~ π. We say that [L(x), U(x)] is of high quality if its marginal coverage attains a pre-specified target prediction level and it has a short width on average. In particular, a best-quality PI at prediction level 1 - α is an optimal solution to the following constrained optimization problem:

min_{L,U} E[U(X) - L(X)]  subject to  P[Y ∈ [L(X), U(X)] | L, U] ≥ 1 - α.   (2.1)

The high-quality criterion has been widely adopted in previous work (see Section 6). However, this criterion alone may fail to carry important model uncertainty information at specific test points. Consider a simple example where x ~ Uniform[0, 1], y = 0 for x ∈ [0, 0.95], and y|x ~ Uniform[0, 1] for x ∈ (0.95, 1]. Then, according to equation 2.1, a best-quality 95% PI is precisely L(x) = U(x) = 0 for all x ∈ [0, 1]. This PI has nonconstant coverage if we condition at different points (1 for x ∈ [0, 0.95] and 0 for x ∈ (0.95, 1]), which can deviate significantly from the overall coverage of 95%. More examples highlighting the need for conditional coverage information can be found in our numerical experiments in Section 5.1. To mitigate this drawback of the high-quality criterion, we define:

Definition 2.1 (Conditional Coverage and Its Estimator).
The conditional coverage associated with a PI [L(x), U(x)] is A(x) := P[Y ∈ [L(X), U(X)] | L, U, X = x] for a.e. x ∈ 𝒳, where P is taken with respect to π(Y|X). For a (conditional) coverage estimator P̂, which is a measurable function from 𝒳 to [0, 1], we define its L_p conditional coverage error (CE_p) as

CE_p := || P[Y ∈ [L(X), U(X)] | L, U, X] - P̂(X) ||_{L_p}

where the L_p-norm is taken with respect to the randomness of X (1 ≤ p ≤ +∞).

Note that evaluating CE_p relies on approximating the conditional coverage A(x), which can be as challenging as the original prediction problem. To address this, we leverage the similarity between estimating A(x) and generating prediction probabilities in binary classification, which motivates us to borrow the notion of calibration in classification. This idea is based on a relaxed error criterion that looks at the conditional coverage among all points that bear the same coverage estimator value, instead of conditioning at any given point. The resulting error metric then only relies on probabilities conditioned on variables in the space [0, 1], of much lower dimension than 𝒳. To explain concretely, we introduce a "perfect-calibrated coverage estimator" as:

Definition 2.2 (Perfect Calibration). A coverage estimator P̂ is called a perfect-calibrated coverage estimator associated with [L(x), U(x)] if it satisfies

P̂(x) = P[Y ∈ [L(X), U(X)] | L, U, P̂(X) = P̂(x)],  a.e. P̂(x) ∈ [0, 1],   (2.2)

where a.e. is with respect to the probability measure on [0, 1] induced by the random variable P̂(X). Equation 2.2 means that a point x with conditional coverage estimate P̂(x) = p has an average coverage of precisely p among all points in 𝒳 that possess the same estimated value. That is, the average coverage of the PI restricted to the subset {x ∈ 𝒳 : P̂(x) = p} should be precisely p. Corresponding to Definition 2.2, we define:

Definition 2.3 (Calibration-based Error).
The L_p (1 ≤ p ≤ +∞) calibration-based conditional coverage error, or coverage error for short (ĈE_p), of a coverage estimator P̂ is:

ĈE_p := || P[Y ∈ [L(X), U(X)] | L, U, P̂(X)] - P̂(X) ||_{L_p}   (2.3)

where the L_p-norm is taken with respect to the randomness of P̂(X). In the above definition, the conditional probability P[Y ∈ [L(X), U(X)] | L, U, P̂(X)] is a measurable function of the random variable P̂(X), say γ(P̂(X)). By a change of variable,

ĈE_p^p := || γ(P̂(X)) - P̂(X) ||^p_{L_p} = ∫_0^1 |γ(t) - t|^p dF_{P̂(X)}(t)   (2.4)

where F_{P̂(X)} is the probability distribution of P̂(X) on [0, 1]. Here, ĈE_p only requires estimating γ(t) for t ∈ [0, 1], which can be done easily by discretizing [0, 1] for empirical calculation. We call the resulting quantity the empirical L_p calibration-based conditional coverage error, ECE_p. More details on this empirical error can be found in Appendix A.4.

The calibration-based error ĈE_p provides a middle ground between the enforcement of marginal coverage and conditional coverage. The ground-truth conditional coverage is perfectly calibrated, but not vice versa. However, if we enforce the perfect calibration criterion for a coverage estimator to hold when restricted to any measurable subset of 𝒳, then the choice of estimator reduces uniquely to the conditional coverage (Definition A.1 and Lemma A.2). In this sense, ĈE_p is a natural relaxation of CE_p, and although less precise, ĈE_p is computationally much more tractable than CE_p.

Evaluation Metric for Coverage Estimator. We use ECE_1 as the primary evaluation metric to measure the quality of a coverage estimator. A high ECE_1 value indicates unreliable coverage estimation, while a small ECE_1 value indicates that the coverage estimator is close to the perfect-calibrated property. Ideally, an effective algorithm should output a coverage estimator with a small ECE_1 value.
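The gap between marginal and conditional coverage in the toy example earlier in this section is easy to check numerically. The following is a minimal simulation sketch (our own code, not from the paper): it draws data from the stated distribution and evaluates the degenerate 95% PI with L(x) = U(x) = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy data-generating process from the text: x ~ Uniform[0, 1];
# y = 0 for x <= 0.95, and y | x ~ Uniform[0, 1] for x > 0.95.
x = rng.uniform(0, 1, n)
y = np.where(x <= 0.95, 0.0, rng.uniform(0, 1, n))

# The degenerate "best-quality" 95% PI: L(x) = U(x) = 0 for all x.
covered = (y >= 0.0) & (y <= 0.0)

marginal = covered.mean()              # close to the target 0.95
cond_left = covered[x <= 0.95].mean()  # conditional coverage 1 on [0, 0.95]
cond_right = covered[x > 0.95].mean()  # conditional coverage 0 on (0.95, 1]
print(marginal, cond_left, cond_right)
```

The marginal coverage constraint is satisfied, yet the interval is useless on (0.95, 1], which is exactly the information a conditional coverage estimator should expose.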

3. METHODOLOGY

We propose a novel end-to-end algorithm, named the coverage assessment network (CaNet), to simultaneously generate a coverage estimator along with the high-quality PI. As illustrated in Figure 1, our CaNet consists of two major modules: (1) the predictor module and (2) the coverage assessment module (Ca-Module). The predictor module provides the upper and lower bounds of the estimated PIs. Meanwhile, the Ca-Module is added to the output layer to assess the conditional coverage information of the PIs from the predictor module. Our model is jointly optimized by three loss functions: the coverage assessment loss L_CA, the interval width loss L_IW, and the coverage probability loss L_CP. Benefiting from these modules, CaNet can generate and validate the coverage estimator at the same time, without any requirement for further post-processing steps.

Figure 1: The framework of our proposed coverage assessment network (CaNet).

Coverage Assessment Module. Our Ca-Module consists of two neurons fully connected to the last hidden layer to estimate the conditional coverage. After passing through the softmax activation function, it outputs a two-point probability distribution (P̂(x), 1 - P̂(x)), where P̂(x) is the coverage estimator of the PI from the predictor module. Our Ca-Module can be easily integrated into the output layer of deep neural networks to estimate their conditional coverage.

Loss Function Design and Tuning Procedure. Our loss function is a sum of the predictor loss and the coverage loss. The predictor loss aims to narrow the prediction intervals as much as possible, while maintaining a specified marginal coverage of the data.
Inspired by (Khosravi et al., 2010; 2011; Pearce et al., 2018; Rosenfeld et al., 2018), our predictor loss is formed by the sum of the interval width (IW) loss L_IW and the coverage probability (CP) loss L_CP:

L_IW = (1/n) Σ_{i=1}^n (U(x_i) - L(x_i)),  L_CP = (1/n) Σ_{i=1}^n k̃_i,  CP = (1/n) Σ_{i=1}^n k_i,

where k_i indicates whether each data point has been captured by the PI: k_i = 1 if L(x_i) ≤ y_i ≤ U(x_i) and k_i = 0 otherwise. k̃_i is a soft version of k_i, defined as k̃_i := σ(λ_3(U(x_i) - y_i)) · σ(λ_3(y_i - L(x_i))), where λ_3 ≥ 0 is a tunable parameter and σ(t) := 1/(1 + e^{-t}) is the sigmoid function. Therefore, L_CP is a soft version of CP that can be used for gradient descent. Associated with the Ca-Module, we introduce a coverage assessment loss L_CA to estimate the conditional coverage:

L_CA = -(1/n) Σ_{i=1}^n [k_i log(P̂(x_i)) + (1 - k_i) log(1 - P̂(x_i))]   (3.3)

We will show in Section 4 that the expectation of the coverage assessment loss L_CA provides an upper bound for both the conditional coverage error (Definition 2.1) and the calibration-based conditional coverage error (Definition 2.3). Hence, minimizing L_CA contributes to the recovery of the conditional coverage. In order to run gradient-based methods, we replace the discrete indicator (k_i, 1 - k_i) in L_CA with its soft version (k̃_i, 1 - k̃_i):

L̃_CA = -(1/n) Σ_{i=1}^n [k̃_i log(P̂(x_i)) + (1 - k̃_i) log(1 - P̂(x_i))]   (3.4)

Our total loss function for CaNet is defined as:

Total Loss = L_IW + λ_1 (1 - L_CP) + λ_2 L̃_CA   (3.5)

where λ_1 ≥ 0, λ_2 ≥ 0 are tunable parameters. We propose an easy-to-implement yet effective tuning procedure to select these parameters. Please refer to Appendix D for more algorithmic details.

Deep Ensembles. Following previous research (Lee et al., 2015; Lakshminarayanan et al., 2017; Pearce et al., 2018; Fort et al., 2019; Ovadia et al., 2019; Gustafsson et al., 2020; Pearce et al., 2020), we apply the deep ensemble technique to provide more robust and better results.
During the training period, with the same hyperparameters λ_i, i = 1, 2, 3, m networks are trained with different initializations. The prediction results from the i-th network are denoted as ([L_i(x), U_i(x)], P̂_i(x), 1 - P̂_i(x)). Finally, the output from CaNet is:

Lower bound L̄ := (1/m) Σ_{i=1}^m L_i,  Upper bound Ū := (1/m) Σ_{i=1}^m U_i,  Coverage estimator P̄ := (1/m) Σ_{i=1}^m P̂_i.   (3.6)
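The loss components of Section 3 can be sketched in a few lines of NumPy. This is an illustrative implementation under assumed names (`canet_losses`, `total_loss`, and the default λ values are ours, not from the paper); in actual training, L, U, and P̂ would be network outputs optimized jointly by gradient descent.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def canet_losses(L, U, y, P_hat, lam3=10.0, eps=1e-6):
    """Sketch of the CaNet loss components; names and defaults are ours.

    L, U   : lower/upper PI bounds at the n training points
    y      : labels
    P_hat  : Ca-Module coverage estimates in (0, 1)
    """
    # Hard coverage indicator k_i and its soft (sigmoid-product) version.
    k = ((L <= y) & (y <= U)).astype(float)
    k_soft = sigmoid(lam3 * (U - y)) * sigmoid(lam3 * (y - L))

    L_IW = np.mean(U - L)      # interval width loss
    L_CP = np.mean(k_soft)     # soft coverage probability
    # Soft coverage assessment loss: cross-entropy between k~ and P_hat.
    L_CA = -np.mean(k_soft * np.log(P_hat + eps)
                    + (1.0 - k_soft) * np.log(1.0 - P_hat + eps))
    return L_IW, L_CP, L_CA

def total_loss(L, U, y, P_hat, lam1=1.0, lam2=1.0, lam3=10.0):
    # Total loss of Eq. 3.5: L_IW + lam1 * (1 - L_CP) + lam2 * soft L_CA.
    L_IW, L_CP, L_CA = canet_losses(L, U, y, P_hat, lam3)
    return L_IW + lam1 * (1.0 - L_CP) + lam2 * L_CA
```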

4. THEORETICAL ANALYSIS

In this section, we provide theoretical insights showing that minimizing L_CA is equivalent to minimizing a tight upper bound of the conditional coverage error, and thus can recover the true conditional coverage. To achieve this, we first prove that both CE_p and ĈE_p are bounded above by the expectation of a Kullback-Leibler-divergence-type random variable K_1(X). We then show that L_CA is an empirical counterpart of K_1(X) and establish a concentration bound between L_CA and E[K_1(X)].

Theorem 4.1. Let A(x) := P[Y ∈ [L(X), U(X)] | L, U, X = x] be the conditional coverage in Definition 2.1. Let

K(x) = A(x) log(A(x)/P̂(x)) + (1 - A(x)) log((1 - A(x))/(1 - P̂(x))).

Then

ĈE_p ≤ CE_p ≤ ((1/2) E[K(X)])^{α_p/2},  ∀ 1 ≤ p ≤ +∞,

where α_p = 1 for 1 ≤ p ≤ 2 and α_p = 2/p for 2 ≤ p ≤ +∞. Moreover, the inequality is attainable if, e.g., P̂(x) equals the conditional coverage A(x).

From Theorem 4.1, we see that minimizing E[K(X)] is equivalent to minimizing a tight upper bound for the coverage error. For every x, K(x) is the Kullback-Leibler divergence between the distributions represented by (A(x), 1 - A(x)) and (P̂(x), 1 - P̂(x)). We can decompose K(x) = K_0(x) + K_1(x), where K_0(x) = A(x) log(A(x)) + (1 - A(x)) log(1 - A(x)) and K_1(x) = -A(x) log(P̂(x)) - (1 - A(x)) log(1 - P̂(x)). Since K_0 does not involve P̂, minimizing E[K(X)] over P̂ is equivalent to minimizing E[K_1(X)]. Results of the type in Theorem 4.1, bounding an L_p conditional coverage error via a Kullback-Leibler-type error, are new as far as we know.

Next, to show that L_CA approximates E[K_1(X)], we need the following assumptions:

Assumption 4.2. The four classes of functions ([L(x), U(x)], P̂(x), 1 - P̂(x)) output by the neural network (NN) in Figure 1 have finite VC dimensions, say bounded above by V_0.

Assumption 4.2 holds for a wide range of NNs (e.g., Theorem 8.14 in Anthony & Bartlett (2009), Theorem 7 in Bartlett et al. (2019)).
In particular, it holds for the NN we adopt in the experiments (where we use a ReLU-activated NN to construct ψ_i, i = 1, 2, 3, 4; see Section 5):

Theorem 4.3. Suppose ψ_i, i = 1, 2, 3, 4 are the pre-activated output neurons of the NN in Figure 1 using the ReLU activation function. Then Assumption 4.2 holds. Moreover, suppose the NN has W parameters and U computation units (nodes). Then V_0 = O(W U).

Assumption 4.4. |log(P̂(x))| ≤ M and |log(1 - P̂(x))| ≤ M for all x and P̂.

This is a natural assumption in practice because log(P̂(x)) and log(1 - P̂(x)) are replaced by log(P̂(x) + ε) and log(1 - P̂(x) + ε), respectively, to avoid explosion when implementing the algorithm. In particular, in our experiments in Section 5, ε = 0.1^6 and thus M = 14.

Let F = {f(x, y) = 1_{y ∈ [L(x), U(x)]}} be the class of coverage indicators induced by the NN, and let G denote the corresponding class of coverage estimators P̂.

Theorem 4.5. Suppose Assumptions 4.2 and 4.4 hold and D = {(x_i, y_i), i = 1, 2, ..., n}, where the (x_i, y_i) are i.i.d. samples ~ π. Recall that the (hard) coverage assessment loss is

L_CA = -(1/n) Σ_{i=1}^n [f(x_i, y_i) log(P̂(x_i)) + (1 - f(x_i, y_i)) log(1 - P̂(x_i))].

Then for any t > 0, we have

P[ sup_{f ∈ F, P̂ ∈ G} |L_CA - E[K_1(X)]| ≥ t ] ≤ C* e^{-nt²/(16M²)},

where C* only depends on V_0 in Assumption 4.2.

Theorem 4.5 shows that the coverage assessment loss approximates E[K_1(X)] well, with an exponential tail bound. The difficulty in proving Theorem 4.5 lies in the fact that the hypothesis classes in Assumption 4.2 (which are constructed by the NN) are different from the hypothesis classes used in L_CA. To overcome this difficulty, we use the theory of VC-subgraph classes to connect the VC dimension among multiple hypothesis classes, including the class of ψ_i, the four classes of output functions, and F, log G. We then establish covering number bounds for the classes F and log G, and finally prove Theorem 4.5. To conclude, minimizing E[K_1(X)] over P̂ is equivalent to minimizing E[K(X)], which in turn minimizes a tight upper bound for the conditional coverage error.
Our coverage assessment loss empirically approximates E[K_1(X)] well, so that its minimization can ultimately reduce the conditional coverage error.
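The Pinsker-type inequality underlying Theorem 4.1 can be checked numerically for p = 1. The sketch below (our own illustration, not part of the paper's experiments) draws random pairs (A(x), P̂(x)) and verifies that E|A - P̂| never exceeds (E[K(X)]/2)^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random conditional coverages A(x) and estimates P_hat(x) in (0, 1).
A = rng.uniform(0.01, 0.99, 100_000)
P = rng.uniform(0.01, 0.99, 100_000)

# Binary KL divergence K(x) between (A, 1-A) and (P, 1-P).
K = A * np.log(A / P) + (1 - A) * np.log((1 - A) / (1 - P))

ce1 = np.mean(np.abs(A - P))        # CE_1 under this empirical distribution
bound = np.sqrt(np.mean(K) / 2.0)   # (E[K(X)] / 2)^(1/2), the p = 1 bound
print(ce1, bound)
```

Pointwise, |A - P̂| ≤ (K/2)^{1/2} by Pinsker's inequality for Bernoulli distributions, and Jensen's inequality lifts this to the expectation, so the assertion holds for any sample.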

5. EXPERIMENTS

Experimental Setup. We empirically verify the effectiveness of our proposed CaNet on both synthetic examples and benchmark regression datasets. These datasets have been widely used for the evaluation of methods in regression tasks (Hernández-Lobato & Adams, 2015; Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Rosenfeld et al., 2018; Pearce et al., 2018; Zhu et al., 2019). In addition, we adopt the same experimental procedure as Pearce et al. (2018) for data normalization and dataset splitting. To avoid overfitting, we apply a simple network architecture with only 2 hidden layers, each with 64 neurons. For each hidden layer, the ReLU activation function is applied to capture non-linear features. We empirically set the ensemble number m to 5, the smallest number leading to stable prediction results. Please refer to Appendix E for implementation details, including those for the baseline algorithms.

Evaluation Metrics. To evaluate the conditional coverage estimation of our CaNet, we examine the quality of our coverage estimator using the empirical calibration-based error ECE_1 introduced in Section 2.

5.1. CONDITIONAL COVERAGE ON SYNTHETIC EXAMPLES

In this section, we conduct a series of experiments on synthetic examples to directly compare our prediction results with the ground-truth conditional coverage. In these examples, the conditional coverage can be analytically calculated under the known data distribution. Figure 2 compares the conditional coverage with our predicted coverage under the following settings: x ~ Uniform[-2, 2] and y|x is drawn from f_i(x) = (1/3) sin(x) + ε_i(x), x ∈ [-2, 2], where ε_1(x) = 0.1 × N(0, 1), ε_2(x) = 0.1|x| × N(0, 1), ε_3(x) = 0.1|x| × t_4. Here N(0, 1) is a standard Gaussian variable and t_4 is a standard t random variable with 4 degrees of freedom. Then, the conditional coverage in Definition 2.1 can be analytically calculated as:

P[Y ∈ [L(X), U(X)] | L, U, X = x] = F_i(U(x) - (1/3) sin(x)) - F_i(L(x) - (1/3) sin(x))

where F_i is the cumulative distribution function of N(0, 0.1²) for i = 1, of N(0, (0.1x)²) for i = 2, and of 0.1|x| × t_4 for i = 3.

As shown in Figure 2, the conditional coverages of high-quality PIs at different points diverge from each other, and they deviate from the marginal coverage. Thus, having access only to the marginal coverage over the whole dataset is not sufficient for decision making, which highlights the need for conditional coverage. In addition, the predicted coverage estimator of our model is highly consistent with the conditional coverage on all of the synthetic examples. These results confirm that our CaNet can accurately estimate the conditional coverage on noisy datasets.
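For the first setting (Gaussian noise ε_1), the closed-form conditional coverage above is straightforward to evaluate. A small sketch (ours; the PI below is a hypothetical oracle interval used for illustration, not CaNet's output):

```python
import math

def normal_cdf(z):
    """CDF of the standard normal via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Ground-truth conditional coverage for the epsilon_1 setting:
#   A(x) = F(U(x) - sin(x)/3) - F(L(x) - sin(x)/3),
# where F is the CDF of N(0, 0.1^2).
def conditional_coverage(x, L_fn, U_fn, sigma=0.1):
    mu = math.sin(x) / 3.0
    return (normal_cdf((U_fn(x) - mu) / sigma)
            - normal_cdf((L_fn(x) - mu) / sigma))

# Hypothetical oracle PI: mean +/- 1.96 * sigma, so its conditional
# coverage is 95% at every x.
L_fn = lambda x: math.sin(x) / 3.0 - 1.96 * 0.1
U_fn = lambda x: math.sin(x) / 3.0 + 1.96 * 0.1
print(conditional_coverage(0.5, L_fn, U_fn))
```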

5.2. PERFORMANCE OF PIS ON BENCHMARK DATASETS

In this section, we compare the predictor module of our CaNet on real-world benchmark datasets with the following baseline algorithms: (1) nearest-neighbors kernel conditional density estimation (NNKCDE) (Dalmasso et al., 2020), (2) quantile regression forest (QRF) (Meinshausen, 2006), (3) split conformal learning (SCL) (Lei et al., 2018), and (4) the quality-driven PI method (QD-Ens) (Pearce et al., 2018). We quote the results from (Pearce et al., 2018) as a comparison since we share the same experimental setup. Compared with post-hoc calibration results reported in the classification literature (Guo et al., 2017; Kull et al., 2019), the ECE_1 values from CaNet are similar to, and sometimes less than, their post-calibrated ECE_1 results (usually around 1% to 3%), even though most regression datasets are smaller than the classification datasets. These results further demonstrate that our CaNet can accurately estimate the coverage information of 95% high-quality PIs on real-world regression tasks.

Coverage for PIs at Different Prediction Levels. We conduct multiple experiments at different PI prediction levels to show the robustness of our CaNet. By only modifying the parameter λ_1 in Equation 3.5, our Ca-Module can target different levels of coverage probability.

6. RELATED WORK

One classical line of work constructs PIs via Bayesian statistical modeling (Box & Tiao, 2011). While powerful, these approaches do not directly provide the conditional coverage information investigated in this work. Coverage-only criteria, on the other hand, focus solely on coverage satisfaction as the guarantee. These approaches include conformal learning (CL) and its conditional variants (Vovk et al., 2005; 2009; Lei & Wasserman, 2014; Lei et al., 2015; 2018; Kuchibhotla & Ramdas, 2019; Romano et al., 2019; Barber et al., 2019a; b). CL is desirably distribution- or model-free, and in many cases enjoys finite-sample guarantees. However, unlike high-quality PIs, these methods do not explicitly account for the interval width as a quality metric.
We also mention conditional density estimation (Holmes et al., 2007; Dutordoir et al., 2018; Izbicki & Lee, 2016; Dalmasso et al., 2020; Freeman et al., 2017; Izbicki et al., 2017) and the closely related quantile regression (Koenker & Hallock, 2001; Meinshausen, 2006) as PI construction approaches, obtained by converting from the estimated conditional quantile function. These approaches focus on the quality of the conditional distribution/quantile, instead of the high-quality criterion. In this work, we use deep learning to construct high-quality PIs (Khosravi et al., 2010; Pearce et al., 2018; Kivaranovic et al., 2020).

Uncertainty Measurement in Deep Learning. The Bayesian framework offers principled approaches for model uncertainty measurement by computing the posterior distribution over the NN parameters (MacKay, 1992; Neal, 2012). These approaches can also be used to construct PIs. However, they provide a different perspective from the frequentist view taken in this paper, and they focus on the parameter uncertainty of the NN instead of the coverage over a test point. In addition, exact Bayesian inference is computationally intractable for deep neural networks, making it less practical to implement; Gal & Ghahramani (2016) applied a Monte Carlo dropout method as a proxy for the inference. Directly generated from the networks, the softmax response is also commonly used for uncertainty measurement in deep learning models (Bridle, 1990; Lakshminarayanan et al., 2017; Geifman et al., 2018; Sensoy et al., 2018; Ozbulak et al., 2018). Moreover, Niculescu-Mizil & Caruana (2005) showed that NNs typically produce well-calibrated probabilities on binary classification tasks without the need for any post-hoc techniques. In this paper, we follow this line of work and use the softmax output to assess the model uncertainty information.

7. CONCLUSION

In this paper, we identify and investigate the conditional coverage estimation problem for high-quality PIs, which is critical for risk-based decision making in regression settings. To address the challenge, we propose an end-to-end algorithm with two modules: a coverage assessment module and a predictor module. Benefiting from these modules, our model can generate and validate the coverage estimator without any requirement for further post-processing steps. In addition, we conduct theoretical analysis to show the effectiveness of our proposed model. Experimental results on synthetic examples and benchmark datasets further demonstrate that our model can robustly provide accurate coverage estimation while simultaneously producing high-quality PIs. Moreover, our Ca-Module can be easily integrated into other deep-learning-based algorithms to access their coverage information, opening up opportunities for broad applications. In the future, we will extend our work by conducting comparison studies with Bayesian methods.

A.1 TYPES OF COVERAGE

Note that in the high-quality criterion, only Type II coverage is considered in the constraint, while Type IV coverage is lacking. Since Type I and III coverage are not considered in our paper, for simplicity Type II coverage is called the marginal coverage, and Type IV coverage is called the conditional coverage in Definition 2.1. Throughout the Appendix, we let A(x) := P[Y ∈ [L(X), U(X)] | L, U, X = x] denote the conditional coverage in Definition 2.1. Moreover, since we are only concerned with Type II and IV coverage in this work, we make the following convention: throughout Appendices A-B, P and E should be understood as probability and expectation conditional on L, U. So for A(x), we simply write A(x) := P[Y ∈ [L(X), U(X)] | X = x] and omit "conditional on L, U".

A.2 THE TERMINOLOGY "PERFECT-CALIBRATED"

The terminology "perfect-calibrated" is borrowed from confidence calibration in classification tasks. We first review the notion of "confidence" in classification. Confidence calibration is the problem of predicting probability estimates representative of the true correctness likelihood (Guo et al., 2017). Intuitively, a reliable confidence should reflect the true correctness likelihood of the prediction (Kumar et al., 2019). For example, given 100 predictions, each with confidence 0.8, we expect that 80 should be correctly classified (Guo et al., 2017). Now let h be the prediction of any model, a map from 𝒳 to 𝒴 trained on the data D. According to their definition, the "best" confidence map P̂ should be the true probability of correctness:

P̂(x) = E[1_{h(X) is correct for Y} | h, X = x],  ∀x ∈ 𝒳,   (A.1)

which is a measurable function from 𝒳 to [0, 1]. In particular, we consider h as a PI in regression tasks, and "correctness" of a PI is naturally defined as successful coverage of the outcome Y. Then the right-hand side of Equation (A.1) becomes

E[1_{h(X) is correct for Y} | h, X = x] = E[1_{Y ∈ [L(X), U(X)]} | L, U, X = x] = P[Y ∈ [L(X), U(X)] | L, U, X = x],

which is the conditional coverage in our Definition 2.1. In addition, Guo et al. (2017) introduce the perfect-calibrated confidence as follows:

P(Ŷ = Y | Ŷ, P̂ = p) = p,  ∀p ∈ [0, 1],

where Ŷ is the class prediction and Ŷ = Y means that the predicted and true class labels coincide. Obviously, if P̂(X) = P(Ŷ(X) = Y | X), i.e., the "best" confidence, then the above equality holds. Transferring this idea to PIs, we can naturally define the perfect-calibrated coverage estimator as

p = E[1_{h(X) is correct for Y} | L, U, P̂ = p] = P[Y ∈ [L(X), U(X)] | L, U, P̂ = p],  ∀p ∈ [0, 1],

which is the perfect calibration condition in our Definition 2.2.

Under review as a conference paper at ICLR 2021

A.3 DETAILS ON COVERAGE ESTIMATOR

A perfect-calibrated coverage estimator inherits some properties of the conditional coverage. For example, both admit the following interpretation: if we have 1000 testing points for a PI, each with the same conditional/perfect-calibrated coverage 0.9, then approximately 900 of them are correctly covered by the PI. Note that the conditional coverage is uniquely defined, but a perfect-calibrated coverage estimator is not necessarily so. Moreover, we have the following facts: (a) The conditional coverage is always perfect-calibrated, but not vice versa. (b) A perfect-calibrated coverage estimator can be viewed as an averaged conditional coverage. (c) A perfect-calibrated coverage estimator is less "informative" than the conditional coverage.

A perfect-calibrated coverage estimator is an averaged conditional coverage. Let P̂ be a (general) coverage estimator. We have

P[Y ∈ [L(X), U(X)] | P̂(X) = P̂(x)]
= P[Y ∈ [L(X), U(X)] | X ∈ P̂^{-1}(P̂(x))]
= P[Y ∈ [L(X), U(X)], X ∈ P̂^{-1}(P̂(x))] / P[X ∈ P̂^{-1}(P̂(x))]
= ∫_{t ∈ P̂^{-1}(P̂(x))} E[1_{Y ∈ [L(X), U(X)]} | X = t] P[X ∈ dt] / P[X ∈ P̂^{-1}(P̂(x))]
= ∫_{t ∈ P̂^{-1}(P̂(x))} A(t) P[X ∈ dt] / P[X ∈ P̂^{-1}(P̂(x))].

Suppose P̂ is a perfect-calibrated coverage estimator. Then we have

P̂(x) = ∫_{t ∈ P̂^{-1}(P̂(x))} A(t) P[X ∈ dt] / P[X ∈ P̂^{-1}(P̂(x))],

which implies that P̂(x) is a weighted average of A(t) over the set P̂^{-1}(P̂(x)), with weights based on the marginal distribution of X.

A conditional coverage is perfect-calibrated. If P̂(x) = A(x), then A(t) = A(x) for any t ∈ A^{-1}(A(x)), so

∫_{t ∈ A^{-1}(A(x))} A(t) P[X ∈ dt] / P[X ∈ A^{-1}(A(x))] = A(x) ∫_{t ∈ A^{-1}(A(x))} P[X ∈ dt] / P[X ∈ A^{-1}(A(x))] = A(x).

This shows that A(x) must be a perfect-calibrated coverage estimator. Another way to see this is to take the conditional expectation given A(X) = p in Definition 2.1. Then we get

p = E[A(X) | A(X) = p] = E[P[Y ∈ [L(X), U(X)] | X] | A(X) = p] = P[Y ∈ [L(X), U(X)] | A(X) = p]

by the tower property.
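The tower-property argument above can be illustrated numerically: if coverage events are Bernoulli with success probability A(x), then within any band of A-values the realized coverage rate matches the average A over that band. A quick simulation sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Sample conditional coverages A(x_i) and coverage events with
# P(covered | A) = A, i.e. the conditional coverage is the truth.
A = rng.uniform(0.0, 1.0, n)
covered = rng.uniform(0, 1, n) < A

# Restrict to one band of A-values and compare the realized coverage
# rate with the average A(x) over the band.
lo, hi = 0.6, 0.8
mask = (A > lo) & (A <= hi)
realized = covered[mask].mean()
avg_A = A[mask].mean()
print(realized, avg_A)
```

Shrinking the band toward a single value of A recovers the perfect-calibration identity of Definition 2.2.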
A perfect-calibrated coverage estimator may be less informative and may not be the conditional coverage. Suppose we have a PI [L(X), U(X)] at the exact prediction level 1 - α, i.e., P[Y ∈ [L(X), U(X)]] = 1 - α. Then the constant coverage estimator P̂(x) = 1 - α, ∀x ∈ 𝒳, can be viewed as an average coverage estimator over the entire space 𝒳. It is a perfect-calibrated coverage estimator since, by definition, P[Y ∈ [L(X), U(X)] | P̂(X) = 1 - α] = P[Y ∈ [L(X), U(X)]] = 1 - α = P̂(X). But it is not the conditional coverage in general (e.g., the second synthetic example in Section 5).

Now we extend the definition of "perfect-calibrated" coverage estimator, allowing it to be defined on any measurable subset. For a measurable subset S ⊂ 𝒳 with P(S) > 0, let ĈE_p(S) denote the calibration-based error of P̂ restricted to S, and say that P̂ is perfect-calibrated on S if ĈE_p(S) = 0 (Definition A.1). Lemma A.2 states that P̂ is perfect-calibrated on every measurable subset S ⊂ 𝒳 with P(S) > 0 if and only if P̂ is the conditional coverage.

Proof. We first show that the conditional coverage is perfect-calibrated on subsets. Taking the conditional expectation given {A(X) = p, X ∈ S} in Definition 2.1, we get

p = E[A(X) | A(X) = p, X ∈ S] = E[P[Y ∈ [L(X), U(X)] | X] | A(X) = p, X ∈ S] = P[Y ∈ [L(X), U(X)] | A(X) = p, X ∈ S]

by the tower property. So A(x) is a perfect-calibrated coverage estimator on any measurable subset S with P(S) > 0. Hence ĈE_p(S) = 0 for any measurable subset S ⊂ 𝒳 with P(S) > 0.

On the other hand, similarly to Section A.3, we can express

P[Y ∈ [L(X), U(X)] | P̂(X) = P̂(x), X ∈ S] = ∫_{t ∈ P̂^{-1}(P̂(x)) ∩ S} A(t) P[X ∈ dt] / P[X ∈ P̂^{-1}(P̂(x)) ∩ S].

Suppose P̂(x) is not the conditional coverage; then P[P̂(X) ≠ A(X)] > 0. Without loss of generality, we assume P[P̂(X) > A(X)] > 0. Let S_0 := {x ∈ 𝒳 : P̂(x) > A(x)}. Note that S_0 = ∪_{n=1}^{+∞} {x ∈ 𝒳 : P̂(x) > A(x) + 1/n}. Since P(S_0) > 0, there exists n_0 such that S := {x ∈ 𝒳 : P̂(x) > A(x) + 1/n_0} satisfies P(S) > 0. Then for x ∈ S, we have

∫_{t ∈ P̂^{-1}(P̂(x)) ∩ S} A(t) P[X ∈ dt] / P[X ∈ P̂^{-1}(P̂(x)) ∩ S] ≤ ∫_{t ∈ P̂^{-1}(P̂(x)) ∩ S} (P̂(t) - 1/n_0) P[X ∈ dt] / P[X ∈ P̂^{-1}(P̂(x)) ∩ S] = P̂(x) - 1/n_0.

Then we have ĈE_p(S) ≥ ĈE_1(S) ≥ 1/n_0 > 0, so ĈE_p(S) > 0, which is a contradiction. Hence P̂(x) is the conditional coverage.
Next, we describe in detail how to discretize the right-hand side of equation (2.4) for empirical calculation. We construct a discrete version of (2.2) and then introduce an empirical counterpart of $CE_p$ (2.3), which we refer to as the $L^p$ empirical calibration-based conditional coverage error $ECE_p$. The ideas behind these are natural extensions of the classification case (Guo et al., 2017; Kull et al., 2019; Kumar et al., 2019; Nixon et al., 2019) to PIs.

We consider the following partition $\Delta$ of $[0, 1]$: divide $[0, 1]$ into $M$ intervals $I_m = (a_{m-1}, a_m]$ ($m = 1, \dots, M$), where $0 = a_0 \le a_1 \le \dots \le a_M = 1$. Let $B_m = \{i = 1, \dots, n : \hat P(x_i) \in I_m\}$, i.e., the set (bin) of indices $i$ of samples whose coverage estimator $\hat P(x_i)$ falls into the interval $I_m$. Note that coverage estimators that are close to each other fall into the same interval. The coverage probability (i.e., the proportion of successful coverage) in $B_m$ is defined as
$$cp(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}_{y_i \in [L(x_i), U(x_i)]}. \tag{A.4}$$
The average estimated coverage in $B_m$ is defined as
$$cove(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat P(x_i). \tag{A.5}$$
Based on the partition $\Delta$, we can introduce an empirical version of $CE_p$ (which we refer to as $ECE_p$):

Definition A.5. The $L^p$ empirical calibration-based conditional coverage error ($ECE_p$) of a coverage estimator $\hat P$ is defined as
$$ECE_p = \big\| \big(cp(B_{m(i)}) - cove(B_{m(i)})\big)_{i=1,2,\dots,n} \big\|_{\ell_p}, \tag{A.6}$$
where $B_{m(i)}$ is the bin containing sample $i$ and $\ell_p$ is the normalized $p$-norm on $\mathbb{R}^n$, $\|v\|_{\ell_p} = (\frac{1}{n}\sum_{i=1}^{n} |v_i|^p)^{1/p}$. Equivalently,
$$ECE_p = \left( \sum_{m=1}^{M} \frac{|B_m|}{n}\,\big|cp(B_m) - cove(B_m)\big|^p \right)^{1/p}, \quad 1 \le p < +\infty, \qquad ECE_\infty = \max_{m=1,2,\dots,M} \big|cp(B_m) - cove(B_m)\big|.$$
Note that $ECE_p$ is discontinuous and cannot be used easily for training with gradient-based methods. Therefore, we use the coverage assessment loss $L_{CA}$ introduced in Section 3 for conditional coverage estimation. Finally, we explain why the conditional coverage error $\overline{CE}_p$ cannot be used easily for training.
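The binning computation above is straightforward to implement. The following is our own illustrative numpy sketch (an equal-width partition $a_m = m/M$ is assumed, which is one common choice) computing $cp(B_m)$, $cove(B_m)$, and $ECE_p$:

```python
import numpy as np

def ece(covered, p_hat, M=10, p=1):
    """Empirical calibration-based conditional coverage error ECE_p.

    covered: 0/1 array; covered[i] = 1 iff y_i lies in [L(x_i), U(x_i)]
    p_hat:   array of coverage estimates P_hat(x_i) in [0, 1]
    M:       number of equal-width bins I_m = ((m-1)/M, m/M]
    p:       order of the norm (use np.inf for ECE_infinity)
    """
    covered = np.asarray(covered, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    n = len(p_hat)
    # bin index m(i): p_hat_i in ((m-1)/M, m/M] maps to m - 1 (0-based)
    bins = np.clip(np.ceil(p_hat * M).astype(int) - 1, 0, M - 1)
    errs, weights = [], []
    for m in range(M):
        idx = bins == m
        if not idx.any():
            continue  # empty bins contribute nothing
        cp = covered[idx].mean()   # cp(B_m): empirical coverage in the bin
        cove = p_hat[idx].mean()   # cove(B_m): average estimated coverage
        errs.append(abs(cp - cove))
        weights.append(idx.sum() / n)
    errs, weights = np.array(errs), np.array(weights)
    if np.isinf(p):
        return errs.max()
    return float((weights @ errs**p) ** (1.0 / p))
```

For a perfect-calibrated estimator, $cp(B_m) = cove(B_m)$ in every bin and the returned value is 0 up to sampling noise.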
Unlike $L_{CA}$, which is unbiased and provides guaranteed estimation accuracy for $\mathbb{E}[K_1(X)]$ (Theorem 4.5), it is in general not easy to establish an empirical calculation for $\overline{CE}_p$. Take $\overline{CE}_2$ as an instance. A heuristic argument is to use $\hat\theta := \frac{1}{n}\sum_{i=1}^{n} (k_i - \hat P(x_i))^2$ to approximate $\overline{CE}_2^2 = \mathbb{E}[|A(X) - \hat P(X)|^2]$, where $k_i \in \{0, 1\}$ indicates whether each data point has been captured by the PI; see Section 3. Unfortunately, $\hat\theta$ is in general not an unbiased estimator of $\overline{CE}_2^2$, due to the following observations:
$$\mathbb{E}[\hat\theta] = \mathbb{E}\big[(k_1 - \hat P(x_1))^2\big] = \mathbb{E}\Big[\mathbb{E}\big[(k_1 - \hat P(x_1))^2 \mid x_1\big]\Big] = \mathbb{E}\Big[\mathbb{E}[k_1^2 \mid x_1] - 2\hat P(x_1)\mathbb{E}[k_1 \mid x_1] + \hat P(x_1)^2\Big]$$
$$= \mathbb{E}\Big[\mathbb{E}[k_1 \mid x_1] - 2\hat P(x_1)\mathbb{E}[k_1 \mid x_1] + \hat P(x_1)^2\Big] = \mathbb{E}\big[A(x_1) - 2\hat P(x_1)A(x_1) + \hat P(x_1)^2\big]$$
$$= \mathbb{E}\big[A(x_1)^2 - 2\hat P(x_1)A(x_1) + \hat P(x_1)^2\big] + \mathbb{E}\big[A(x_1) - A(x_1)^2\big] = \mathbb{E}\big[|A(X) - \hat P(X)|^2\big] + \mathbb{E}\big[A(X) - A(X)^2\big].$$
Hence $\mathbb{E}[\hat\theta] \neq \overline{CE}_2^2$ unless $A(X) = 0$ or $1$ almost surely. Note that the gap $\mathbb{E}[A(X) - A(X)^2]$ does not depend on $n$ and thus does not vanish as $n$ grows. To obtain a reasonable estimator for $\overline{CE}_p$, one needs more information about $A(x)$ besides the marginal coverage $\mathbb{E}[A(X)]$. However, this information is usually hard to obtain locally because of the nature of the (possibly high-dimensional) feature space. In fact, this observation led us to the idea of "dimension reduction" from the sample space to the space $[0, 1]$ and to the definition of $CE_p$.
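The non-vanishing bias $\mathbb{E}[A(X) - A(X)^2]$ is easy to observe in simulation. The sketch below is a hypothetical setup of our own (with $A(X) \sim \mathrm{Uniform}(0.5, 1)$): we give the estimator the oracle $\hat P = A$, so the true $\overline{CE}_2^2$ is exactly 0, yet $\hat\theta$ concentrates around $\mathbb{E}[A - A^2] \approx 1/6$ instead of 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
A = rng.uniform(0.5, 1.0, size=n)   # hypothetical true conditional coverage
k = rng.binomial(1, A)              # k_i = 1 iff the i-th point is covered
P_hat = A                           # oracle estimator, so CE_2^2 = 0 exactly

theta = np.mean((k - P_hat) ** 2)   # heuristic plug-in estimate of CE_2^2
gap = np.mean(A - A ** 2)           # empirical E[A - A^2], about 1/6 here

# theta estimates CE_2^2 + E[A - A^2], not CE_2^2: the bias does not vanish
print(round(theta, 3), round(gap, 3))
```

Increasing $n$ only tightens the concentration of $\hat\theta$ around the biased target, which matches the derivation above.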

B MATHEMATICAL DEVELOPMENTS FOR THEOREM 4.1

This section proves Theorem 4.1: we show that both the coverage error and the conditional coverage error are tightly bounded above by the expectation of a Kullback-Leibler-divergence-type random variable $K_1(X)$. This means that minimizing $\mathbb{E}[K_1(X)]$ can recover the true conditional coverage and effectively reduce the coverage error. The proof consists of several inequalities regarding coverage errors and their relations. We begin with the following connection between $CE_p$ and the conditional coverage error $\overline{CE}_p$.

Theorem B.1. For any PI and its associated $\hat P$, the $L^p$ coverage error is always less than or equal to the $L^p$ conditional coverage error, i.e., $CE_p \le \overline{CE}_p$ for all $1 \le p \le +\infty$.

Proof. We note that $t \mapsto |t|^p$ is a convex function. We also note that $\sigma(\hat P(X)) \subset \sigma(X)$, where $\sigma(Y)$ denotes the $\sigma$-field generated by a random variable $Y$. Then
$$\overline{CE}_p^p = \mathbb{E}\big[|A(X) - \hat P(X)|^p\big] = \mathbb{E}\Big[\mathbb{E}\big[|A(X) - \hat P(X)|^p \mid \hat P(X)\big]\Big] \ge \mathbb{E}\Big[\big|\mathbb{E}[A(X) - \hat P(X) \mid \hat P(X)]\big|^p\Big] \quad \text{(Jensen's inequality)}$$
$$= \mathbb{E}\Big[\big|\mathbb{E}[\mathbf{1}_{Y \in [L(X), U(X)]} \mid \hat P(X)] - \hat P(X)\big|^p\Big] \quad \text{(tower property)} \quad = CE_p^p.$$
Therefore $CE_p \le \overline{CE}_p$ for all $1 \le p \le +\infty$.

Next we have the following bound on the $L^p$ conditional coverage error:

Theorem B.2. The $L^p$ conditional coverage error is bounded above by a power of the $L^2$ conditional coverage error. Formally, $\overline{CE}_p \le \overline{CE}_2^{\alpha_p}$ for all $1 \le p \le +\infty$, where $\alpha_p = 1$ for $1 \le p \le 2$ and $\alpha_p = \frac{2}{p}$ for $2 \le p \le +\infty$.

Proof. By Hölder's inequality, $\overline{CE}_p \le \overline{CE}_2$ if $1 \le p \le 2$. Since $0 \le |A(x) - \hat P(x)| \le 1$, we have $|A(x) - \hat P(x)|^p \le |A(x) - \hat P(x)|^2$ for all $p \ge 2$, and thus
$$\overline{CE}_p^p = \mathbb{E}\big[|A(X) - \hat P(X)|^p\big] \le \mathbb{E}\big[|A(X) - \hat P(X)|^2\big] \le \overline{CE}_2^2, \quad \forall p \ge 2.$$

Next, recall that
$$K(x) = A(x)\log\frac{A(x)}{\hat P(x)} + (1 - A(x))\log\frac{1 - A(x)}{1 - \hat P(x)}, \qquad K_0(x) = A(x)\log(A(x)) + (1 - A(x))\log(1 - A(x)),$$
$$K_1(x) = -A(x)\log(\hat P(x)) - (1 - A(x))\log(1 - \hat P(x)).$$

Theorem B.3. The $L^2$ conditional coverage error is bounded above by the expectation of $K(X)$. Formally, $\overline{CE}_2^{\alpha_p} \le \left(\frac12 \mathbb{E}[K(X)]\right)^{\alpha_p/2}$, where $\alpha_p$ is defined in Theorem B.2.

Proof.
For any fixed $x$, consider two random variables with Bernoulli distributions:
$$W_1 = \begin{cases} 1 & \text{w.p. } A(x), \\ 0 & \text{w.p. } 1 - A(x), \end{cases} \qquad W_2 = \begin{cases} 1 & \text{w.p. } \hat P(x), \\ 0 & \text{w.p. } 1 - \hat P(x). \end{cases}$$
Let $P_i$ be the distribution of $W_i$. It follows from Pinsker's inequality, e.g., Theorem 2.16 in Massart (2007), that
$$\|P_1 - P_2\|_{TV}^2 \le \tfrac12 K(P_1, P_2),$$
where $\|\cdot\|_{TV}$ denotes the total variation distance and $K$ denotes the KL divergence. Since the $P_i$ are Bernoulli distributions, this can be expressed as
$$|A(x) - \hat P(x)|^2 \le \tfrac12 K(x).$$
Taking expectations, we obtain $\mathbb{E}[|A(X) - \hat P(X)|^2] \le \tfrac12 \mathbb{E}[K(X)]$. Hence
$$\overline{CE}_2^{\alpha_p} \le \left(\tfrac12 \mathbb{E}[K(X)]\right)^{\alpha_p/2}.$$

C JUSTIFICATION OF ASSUMPTION 4.2 AND MATHEMATICAL DEVELOPMENTS FOR THEOREM 4.5

In this section, we analyze the rationality of Assumption 4.2 and build the essential ingredients for proving Theorems 4.3 and 4.5. In some literature, the VC dimension $V(\mathcal{C})$ of a class $\mathcal{C}$ is alternatively defined as the largest $n$ for which there exists a set $\{x_1, \dots, x_n\}$ of size $n$ shattered by $\mathcal{C}$, i.e., it is the value in Definition C.1 minus 1. The VC dimension can be defined more formally via the growth function (Definition C.2).

Proof of Theorem C.11. The result follows from Lemma 9.9(viii) in Kosorok (2007), since all of the transformations involved are monotone functions.

Our second observation concerns the class $\mathcal{F} = \{f(x, y) = \mathbf{1}_{y \in [L(x), U(x)]} : L, U \text{ are output by the NN}\}$. Note that the domain of the functions in $\mathcal{F}$ differs from the domain of the functions in $\mathcal{H}_i$ ($i = 1, 2$), as it includes the outcome space. Below we derive a result that connects the VC dimension of the $\mathcal{H}_i$ with that of $\mathcal{F}$.

Theorem C.12. Suppose $V(\mathcal{H}_i) \le V_0$ ($i = 1, 2$). Then $V(1 - \mathcal{F}) \le V(\mathcal{F}) \le 10(V_0 - 1) < +\infty$, where $1 - \mathcal{F} := \{1 - f(x, y) : f \in \mathcal{F}\}$.

Proof. The first inequality follows from Lemma 9.9(viii) in Kosorok (2007). We consider the following two classes:
$$\mathcal{F}_1 := \{\mathbf{1}_{L(x) \le t} : L \in \mathcal{H}_1, t \in \mathbb{R}\}, \qquad \mathcal{F}_2 := \{\mathbf{1}_{U(x) \ge t} : U \in \mathcal{H}_2, t \in \mathbb{R}\}.$$
Since the functions in $\mathcal{F}_1$ are all indicator functions, by Lemma C.5,
$$V(\mathcal{F}_1) = V(\{\{(x, t) : L(x) \le t\} : L \in \mathcal{H}_1, t \in \mathbb{R}\}).$$
Note that the latter is the VC dimension of the closed supergraphs of all functions in $\mathcal{H}_1$. Then, by Definition C.3 and Lemma C.4, we have
$$V(\{\{(x, t) : L(x) \le t\} : L \in \mathcal{H}_1, t \in \mathbb{R}\}) = V(\mathcal{H}_1).$$
Therefore $V(\mathcal{F}_1) = V(\mathcal{H}_1) \le V_0$ and, similarly, $V(\mathcal{F}_2) = V(\mathcal{H}_2) \le V_0$. Note that we can write $\mathbf{1}_{y \in [L(x), U(x)]} = \mathbf{1}_{L(x) \le y}\,\mathbf{1}_{U(x) \ge y}$. By the definition of growth functions,
$$\Pi_{\mathcal{F}}(m) := \max_{(x_1, y_1), \dots, (x_m, y_m)} \big|\{(\mathbf{1}_{y_1 \in [L(x_1), U(x_1)]}, \dots, \mathbf{1}_{y_m \in [L(x_m), U(x_m)]}) : L \in \mathcal{H}_1, U \in \mathcal{H}_2\}\big|$$
$$\le \max_{(x_1, y_1), \dots, (x_m, y_m)} \big|\{(\mathbf{1}_{L(x_1) \le y_1}, \dots, \mathbf{1}_{L(x_m) \le y_m}) : L \in \mathcal{H}_1\}\big| \times \max_{(x_1, y_1), \dots, (x_m, y_m)} \big|\{(\mathbf{1}_{U(x_1) \ge y_1}, \dots, \mathbf{1}_{U(x_m) \ge y_m}) : U \in \mathcal{H}_2\}\big|$$
$$= \Pi_{\mathcal{F}_1}(m)\,\Pi_{\mathcal{F}_2}(m) \le \left(\frac{em}{V_0 - 1}\right)^{2(V_0 - 1)} \quad \text{for all } m \ge V_0,$$
where the last inequality is due to the Sauer-Shelah lemma. Taking $m = 10(V_0 - 1)$, we obtain
$$\left(\frac{em}{V_0 - 1}\right)^{2(V_0 - 1)} = (10e)^{2(V_0 - 1)} \le 750^{V_0 - 1} < 2^m.$$
Combining the above inequalities, we have $\Pi_{\mathcal{F}}(m) < 2^m$. This shows that $V(\mathcal{F}) \le m = 10(V_0 - 1)$.

Recall that the (hard) coverage estimator assessment loss is
$$L_{CA} = -\frac{1}{n}\sum_{i=1}^{n}\Big[f(x_i, y_i)\log(\hat P(x_i)) + (1 - f(x_i, y_i))\log(1 - \hat P(x_i))\Big].$$
Then for any $t > 0$, we have
$$\mathbb{P}\Big[\sup_{f \in \mathcal{F}, \hat P \in \mathcal{G}} \big|L_{CA} - \mathbb{E}[K_1(X)]\big| \ge t\Big] \le C^* e^{-\frac{nt^2}{16M^2}},$$
where $C^*$ only depends on $V_0$ in Assumption 4.2.

Proof. Note that $\mathbb{E}[f(x_i, y_i) \mid x_i] = A(x_i)$ for any fixed $L$ and $U$. Taking the expectation of $L_{CA}$, we have
$$\mathbb{E}[L_{CA}] = \mathbb{E}\big[\mathbb{E}[L_{CA} \mid x_1, x_2, \dots, x_n]\big] = \mathbb{E}\Big[-\frac{1}{n}\sum_{i=1}^{n}\big(A(x_i)\log(\hat P(x_i)) + (1 - A(x_i))\log(1 - \hat P(x_i))\big)\Big] = \mathbb{E}[K_1(X)].$$
We consider the first part, $A(x)\log(\hat P(x))$; the second part can be handled by the same argument. Note that by Theorem 9.15 in Kosorok (2007), we have
$$\sup_Q \log N(\epsilon M, \mathcal{F} \cdot \mathcal{G}', L_2(Q)) \le \sup_Q \log N(\epsilon/2, \mathcal{F}, L_2(Q)) + \sup_Q \log N(\epsilon M/2, \mathcal{G}', L_2(Q)).$$
Consider the class
$$\tfrac12 + \tfrac{1}{2M}\,\mathcal{F} \cdot \mathcal{G}' := \Big\{\tfrac12 + \tfrac{1}{2M}\big(f(x, y)\log(\hat P(x))\big) : f \in \mathcal{F},\ \log(\hat P) \in \mathcal{G}'\Big\},$$
which consists of functions taking values in $[0, 1]$.
We have
$$\sup_Q \log N\big(\epsilon, \tfrac12 + \tfrac{1}{2M}\mathcal{F} \cdot \mathcal{G}', L_2(Q)\big) = \sup_Q \log N(2\epsilon M, \mathcal{F} \cdot \mathcal{G}', L_2(Q)) \le \sup_Q \log N(\epsilon, \mathcal{F}, L_2(Q)) + \sup_Q \log N(\epsilon M, \mathcal{G}', L_2(Q)) \le K_2 \left(\frac{1}{\epsilon}\right)^{1/e},$$
where the last inequality follows from Lemma C.13 and Lemma C.14, and $K_2$ only depends on $V(\mathcal{F})$ and $V(\mathcal{G})$. (Recall that we have shown $V(\mathcal{G}') \le V(\mathcal{G})$ in Lemma C.14.) Moreover, by Theorems C.11 and C.12, we can claim that $K_2$ only depends on $V_0$. This inequality shows that $\tfrac12 + \tfrac{1}{2M}\mathcal{F} \cdot \mathcal{G}'$ satisfies the conditions in Theorem 2.14.10 in Van der Vaart & Wellner (1996), and thus for every $\delta > 0$ and $t > 0$,
$$\mathbb{P}\Big[\sup_{\phi \in \frac12 + \frac{1}{2M}\mathcal{F} \cdot \mathcal{G}'} \Big|\frac{1}{n}\sum_{i=1}^{n}\phi(x_i, y_i) - \mathbb{E}[\phi(x, y)]\Big| \ge t\Big] \le C e^{D(\sqrt{n}t)^{U + \delta}} e^{-2nt^2},$$
where $U \in (0, 1)$ is the exponent given by Theorem 2.14.10 and the constants $C$ and $D$ depend on $K_2$ and $\delta$ only. Let $\delta = 1 - U$. Note that $-2(\sqrt{n}t)^2 + D(\sqrt{n}t) \le -(\sqrt{n}t)^2 + (D/2)^2$. Hence we have
$$\mathbb{P}\Big[\sup_{\phi \in \frac12 + \frac{1}{2M}\mathcal{F} \cdot \mathcal{G}'} \Big|\frac{1}{n}\sum_{i=1}^{n}\phi(x_i, y_i) - \mathbb{E}[\phi(x, y)]\Big| \ge t\Big] \le C^* e^{-nt^2},$$
where $C^*$ only depends on $K_2$, i.e., only on $V_0$. This shows that
$$\mathbb{P}\Big[\sup_{f \in \mathcal{F}, \hat P \in \mathcal{G}} \Big|\frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i)\log(\hat P(x_i)) - \mathbb{E}[A(X)\log(\hat P(X))]\Big| \ge t\Big] \le C^* e^{-\frac{nt^2}{4M^2}}.$$
A similar result can be established for the second part, since the hypothesis classes there have been studied in Theorems C.11 and C.12:
$$\mathbb{P}\Big[\sup_{f \in \mathcal{F}, \hat P \in \mathcal{G}} \Big|\frac{1}{n}\sum_{i=1}^{n}(1 - f(x_i, y_i))\log(1 - \hat P(x_i)) - \mathbb{E}[(1 - A(X))\log(1 - \hat P(X))]\Big| \ge t\Big] \le C^* e^{-\frac{nt^2}{4M^2}}.$$
Combining the two parts and noting the following fact:
$$\Big\{\sup|\gamma + \beta| \ge t\Big\} \subset \Big\{\sup|\gamma| + \sup|\beta| \ge t\Big\} \subset \Big\{\sup|\gamma| \ge \tfrac{t}{2}\Big\} \cup \Big\{\sup|\beta| \ge \tfrac{t}{2}\Big\},$$
we conclude that
$$\mathbb{P}\Big[\sup_{f \in \mathcal{F}, \hat P \in \mathcal{G}} \big|L_{CA} - \mathbb{E}[K_1(X)]\big| \ge t\Big] \le C^* e^{-\frac{nt^2}{16M^2}},$$
where $C^*$ only depends on $V_0$.

Lastly, the following corollary explicitly connects our theoretical developments to the experimental setup:

Corollary C.17. Suppose the NN is designed as specified in the experiments (Section 5), and the training data $D = \{(x_i, y_i), i = 1, 2, \dots, n\}$ consist of i.i.d. samples from $\pi$. Then for any $t > 0$, we have
$$\mathbb{P}\Big[\sup_{f \in \mathcal{F}, \hat P \in \mathcal{G}} \big|L_{CA} - \mathbb{E}[K_1(X)]\big| \ge t\Big] \le C^* e^{-\frac{nt^2}{16M^2}},$$
where $C^*$ only depends on $V_0$ in Assumption 4.2.

Proof.
We note that Assumptions 4.2 and 4.4 hold in this case by Theorem 4.3 and the observation after Assumption 4.4. So Theorem 4.5 implies Corollary C.17.
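The loss $L_{CA}$ analyzed throughout this section is simply a binary cross-entropy between the coverage indicators $f(x_i, y_i)$ and the estimates $\hat P(x_i)$. A minimal numpy sketch (the function name is ours; the clipping constant `eps` mirrors the $\epsilon$-truncation that keeps $|\log \hat P| \le M$, as discussed after Lemma C.14):

```python
import numpy as np

def l_ca(covered, p_hat, eps=1e-6):
    """Hard coverage assessment loss L_CA: binary cross-entropy between
    coverage indicators f(x_i, y_i) and coverage estimates P_hat(x_i)."""
    covered = np.asarray(covered, dtype=float)
    # truncate to [eps, 1 - eps] so |log(.)| stays bounded, as in the text
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(covered * np.log(p) + (1.0 - covered) * np.log(1.0 - p))
```

Unlike $ECE_p$, this loss is smooth in $\hat P$, which is what makes it usable for gradient-based training while still targeting $\mathbb{E}[K_1(X)]$ in expectation.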

D ALGORITHM DETAILS

We provide additional algorithm details for our framework in Section 3. Algorithm 1 describes our tuning procedure for the hyper-parameters $\lambda_1, \lambda_2, \lambda_3$. Let the marginal coverage probability $CP_{D'}$ and the average coverage estimation $AC_{D'}$ on the validation set $D'$ be defined as
$$CP_{D'} = \frac{1}{|D'|}\sum_{i \in D'} \mathbf{1}_{y_i \in [L(x_i), U(x_i)]}, \qquad AC_{D'} = \frac{1}{|D'|}\sum_{i \in D'} \hat P(x_i),$$
where $([L(x), U(x)], \hat P(x))$ are prediction results from the deep ensemble. Then $\lambda_i$, $i = 1, 2, 3$, are adjusted to ensure that $CP_{D'}$ coincides roughly with $AC_{D'}$ and that $CP_{D'}$ attains the target prediction level.

E EXPERIMENTAL DETAILS AND MORE RESULTS

This section illustrates experimental details and more experimental results from our proposed model.

Algorithm 1: Tuning algorithm
Goal: Tune hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$.
Input: Prediction level $1 - \alpha$, training dataset $D$, validation dataset $D'$.
Procedure:
(1) Initialize $\lambda_i$ ($i = 1, 2, 3$) so that $CP_{D'}$ is nontrivial, i.e., not (almost) 0 or 1.
(2) While $CP_{D'}$ is nontrivial: tune $\lambda_2$ and $\lambda_3$ so that $|CP_{D'} - AC_{D'}| \le \epsilon$ (e.g., $\epsilon = 1\%$).
(3) Otherwise, tune $\lambda_1$ so that $CP_{D'}$ is nontrivial; then repeat step (2) until $\lambda_2$ and $\lambda_3$ are found.
(4) Tune $\lambda_1$ so that $CP_{D'} > 1 - \alpha$, with $\lambda_2$ and $\lambda_3$ fixed from step (3).
Output: $\lambda_1$, $\lambda_2$, and $\lambda_3$.
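The two validation quantities driving the tuning procedure can be sketched as follows (our own illustrative code; the function and variable names are assumptions, not the authors' implementation):

```python
import numpy as np

def cp_ac(lower, upper, p_hat, y_val):
    """Marginal coverage CP_D' and average coverage estimation AC_D' on a
    validation set, from per-sample PI bounds and coverage estimates."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    covered = (y_val >= lower) & (y_val <= upper)
    cp = covered.mean()    # CP_D': fraction of validation points covered
    ac = np.mean(p_hat)    # AC_D': average estimated coverage
    return cp, ac
```

Steps (2)-(4) of Algorithm 1 then amount to adjusting $\lambda_2, \lambda_3$ until $|CP_{D'} - AC_{D'}| \le \epsilon$, and $\lambda_1$ until $CP_{D'}$ exceeds the target level $1 - \alpha$.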

E.1 EXPERIMENTAL DETAILS

Table 3 gives a detailed description of the datasets we use. These open-access real-world benchmark regression datasets are widely used for evaluating methods on regression tasks (Hernández-Lobato & Adams, 2015; Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Rosenfeld et al., 2018; Pearce et al., 2018; Zhu et al., 2019). For the synthetic datasets, 2000 i.i.d. data points are generated for each synthetic setting and randomly split into 1000 training and 1000 testing points. For the benchmark datasets, we first normalize the data and then randomly split 80% for training and 20% for testing. The choice of an 80%/20% split, compared with the 90%/10% split in Pearce et al. (2018), is motivated by the need for a larger test set to obtain a meaningful ECE evaluation. As specified in Section 2 (Equation (2.4)) and Appendix A.4 (Equation (A.6)), ECE is evaluated by dividing [0, 1] into $M$ sub-intervals: the larger $M$ is, the more closely ECE approximates CE, the ideal conditional coverage error measure, but a larger test set is then needed to sustain the statistical quality of the resulting ECE estimate. This delicate tradeoff motivates us to increase the share of the test set in our split. Following Pearce et al. (2018), our hyper-parameters are selected using the validation set from one random split and then fixed during the evaluation on the other random splits. For the synthetic datasets, the hyperparameters and corresponding results in Figure 2 are: (a) $\lambda_1 = 1.7$, $\lambda_2 = 10^{-5}$, $\lambda_3 = 1500$, CP = 0.95, IW = 0.40, $ECE_1 = 0.62\%$.
(b) $\lambda_1 = 1.9$, $\lambda_2 = 10^{-5}$, $\lambda_3 = 1000$, CP = 0.96, IW = 0.40, $ECE_1 = 0.12\%$. (c) $\lambda_1 = 3.4$, $\lambda_2 = 10^{-5}$, $\lambda_3 = 1000$, CP = 0.95, IW = 0.50, $ECE_1 = 0.65\%$.

For the benchmark datasets, the implementation details for the baseline algorithms in Table 1 are:

(1) Nearest-neighbors kernel conditional density estimation (NNKCDE). The algorithm is based on Section 2.1 in Dalmasso et al. (2020). We use the Python code provided by Dalmasso et al. (2020) with the default Gaussian kernel. The two tuning parameters, i.e., the number of nearest neighbors $k$ and the bandwidth $h$ of the smoothing kernel, are chosen in a principled way by minimizing the CDE loss on validation data, in the same way as in Dalmasso et al. (2020).

(2) Quantile regression forest (QRF). The algorithm is based on Meinshausen (2006). We use the RandomForestQuantileRegressor from the scikit-garden package in Python.

(3) Split conformal learning (SCL). The algorithm is based on Algorithm 2 in Lei et al. (2018). The regression algorithm inside SCL is a neural network trained with the mean squared error loss; it has the same hidden-layer structure as in Section 5.

E.3 COMPARISONS WITH A TWO-STAGE APPROACH

We compare the performance of CaNet with a two-stage approach to further demonstrate the effectiveness of our Ca-Module. The two-stage approach is implemented in two separate steps: (1) given a regression dataset, we train a neural network to generate the prediction interval; (2) after obtaining this predictor, we train another network to estimate the conditional coverage of the PI from the previous stage using the $L_{CA}$ loss. Figure 3 compares the reliability diagrams (introduced in Appendix A.4) and the coverage histograms of CaNet and the two-stage approach on the dataset "Protein". The coverage histograms show the percentage of samples in each bin $B_m$ (Equation (A.5)) for $m \in \{1, \dots, M\}$. The average estimated coverage of our model closely matches its coverage probability, while the average estimated coverage of the two-stage algorithm is substantially lower than its coverage probability. In addition, the $ECE_1$ from CaNet (0.77%) is much lower than the $ECE_1$ from the two-stage approach (3.9%).
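The per-bin quantities behind a reliability diagram and coverage histogram of this kind can be computed from test-set predictions in a few lines. This is our own sketch, assuming an equal-width partition of $[0, 1]$:

```python
import numpy as np

def reliability_diagram(covered, p_hat, M=10):
    """Per-bin points for a reliability diagram and coverage histogram:
    returns (cove(B_m), cp(B_m), |B_m|/n) for each non-empty bin."""
    covered = np.asarray(covered, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    bins = np.clip(np.ceil(p_hat * M).astype(int) - 1, 0, M - 1)
    points = []
    for m in range(M):
        idx = bins == m
        if idx.any():
            points.append((p_hat[idx].mean(),    # average estimated coverage
                           covered[idx].mean(),  # empirical coverage in bin
                           idx.mean()))          # share of samples in bin
    return points
```

Plotting $cp(B_m)$ against $cove(B_m)$ gives the reliability diagram: a well-calibrated estimator lies on the diagonal, and deviations from the diagonal are what $ECE_1$ aggregates.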



$$\mathcal{F} = \{f(x, y) = \mathbf{1}_{y \in [L(x), U(x)]} : L, U \text{ are output by the NN}\}, \qquad \mathcal{G} = \{\hat P(x) : \hat P \text{ is output by the NN}\}.$$

Figure 2: Prediction and conditional coverage of 95% PIs on synthetic examples. (a) CP = 0.95, IW = 0.40, ECE 1 = 0.62%. (b) CP = 0.96, IW = 0.40, ECE 1 = 0.12%. (c) CP = 0.95, IW = 0.50, ECE 1 = 0.65%. The predicted coverage estimation from CaNet is highly consistent with the conditional coverage under different noise settings.

Type I: $\mathbb{P}[Y \in [L(X), U(X)]]$ (marginal coverage);
Type II: $\mathbb{P}[Y \in [L(X), U(X)] \mid L, U]$ (conditional coverage given the PI);
Type III: $\mathbb{P}[Y \in [L(X), U(X)] \mid X = x]$ (conditional coverage given $X = x$);
Type IV: $\mathbb{P}[Y \in [L(X), U(X)] \mid L, U, X = x]$ (conditional coverage given the PI and $X = x$).

where $\hat P(x_i)$ is the coverage estimator for sample $i$. Here $cp(B_m)$ and $cove(B_m)$ approximate the left- and right-hand sides of (2.2), respectively, on the interval $I_m$. A perfect-calibrated coverage estimator should satisfy $cp(B_m) = cove(B_m)$ for all $m \in \{1, \dots, M\}$. The plot of $cp(B_m)$ versus $cove(B_m)$ for all $m \in \{1, \dots, M\}$ is called the reliability diagram in some literature (Guo et al., 2017).

The difficulty in analyzing Theorem 4.5 lies in the fact that the hypothesis classes in Assumption 4.2 (which are constructed by the NN) are different from the hypothesis class used in $L_{CA}$. To overcome this difficulty, we use the theory of VC-subgraph classes to analyze the connection between the VC dimensions of the two hypothesis classes.

C.1 REVIEW OF THE VC DIMENSION

To keep the paper self-contained, we first review the definitions of the VC-subgraph class and the VC dimension.

Definition C.1. Consider an arbitrary collection $\{x_1, \dots, x_n\}$ of points in a set $\mathcal{X}$ and a collection $\mathcal{C}$ of subsets of $\mathcal{X}$. We say that $\mathcal{C}$ shatters $\{x_1, \dots, x_n\}$ if each of the $2^n$ possible subsets of $\{x_1, \dots, x_n\}$ can be written as $A = C \cap \{x_1, \dots, x_n\}$ for some $C \in \mathcal{C}$. The VC dimension $V(\mathcal{C})$ of the class $\mathcal{C}$ is the smallest $n$ for which no set of size $n$ is shattered by $\mathcal{C}$. If $\mathcal{C}$ shatters sets of arbitrarily large size, we set $V(\mathcal{C}) = +\infty$. We say that $\mathcal{C}$ is a VC-class if $V(\mathcal{C}) < +\infty$.

(The constants $C$ and $D$ depend on $K_2$ and $\delta$ only.)

Evaluation metrics of different models on benchmark datasets. The CP values are marked in blue if they meet the 95% prediction level. The best IW results, marked in bold, are achieved by models with the smallest IW value among those that meet the 95% prediction level. Our model outperforms the baseline algorithms on high-quality PI generation. Meanwhile, it provides accurate coverage estimation on real-world datasets.

ECE 1 results of our model on benchmark datasets with different coverage probabilities.

5.3 PERFORMANCE OF COVERAGE ESTIMATOR ON BENCHMARK DATASETS

Coverage for 95% PIs. We use $ECE_1$ to evaluate the coverage estimation performance on real-world datasets, as the conditional coverage is unknown. As shown in Table 1, the $ECE_1$ values in all experiments are generally around or below 1%, with better performance on larger datasets. The coverage estimators produced by CaNet have small $ECE_1$ values and are thus very close to perfect-calibrated coverage estimators (Definition 2.2). Compared with $ECE_1$ values obtained from the state-of-the-art algorithms in classification tasks

Table 2 reports the CP, IW, and $ECE_1$ values from CaNet at different PI prediction levels on three benchmark datasets; results for more datasets can be found in Appendix E. As can be seen, all $ECE_1$ values in Table 2 are fairly small ($\sim 1\%$), demonstrating the stability of our proposed model. Thus, CaNet can provide accurate coverage estimation for PIs at different prediction levels. These results demonstrate the robustness of CaNet on real-world datasets, further suggesting its broad applicability.

This appendix presents further results and discussions, and it consists of five parts. Appendix A gives more detailed properties of the coverage estimator and the coverage error. Appendix B contains the mathematical argument for Theorem 4.1. Appendix C discusses how to satisfy Assumption 4.2 and proves Theorems 4.3 and 4.5. Appendix D presents our algorithm details as a supplement to Section 3. Appendix E illustrates experimental details and additional experimental results.

A FURTHER DETAILS ON COVERAGE ESTIMATOR AND COVERAGE ERROR

A.1 COVERAGE PROBABILITY TYPES OF PIS

Zhang et al. (2019) introduce the following four coverage probability types of PIs. In general, most of the coverage notions for PIs considered in the literature fall into one of these types.

Theorem C.16 (Restated Theorem 4.5). Suppose Assumptions 4.2 and 4.4 hold. The training data D = {(x i , y i ), i = 1, 2, • • • , n} where (x i , y i ) are i.i.d. samples ∼ π. Recall that the (hard) coverage estimator assessment loss is

gives additional experimental results.

Full names and details of benchmarking regression datasets. N is the number of samples in the dataset and d is the dimension of the feature vector.

Evaluation metrics of our CaNet on benchmark datasets and synthetic examples with different coverage probabilities.


Definition A.1. A coverage estimator $\hat P$ is called a perfect-calibrated coverage estimator on a measurable subset $S \subset \mathcal{X}$ with $\mathbb{P}(S) > 0$, associated with $[L(x), U(x)]$, if it satisfies
$$\hat P(x) = \mathbb{P}[Y \in [L(X), U(X)] \mid L, U, \hat P(X) = \hat P(x), X \in S], \quad \text{a.e. } \hat P(x) \in [0, 1], \tag{A.2}$$
where a.e. is with respect to the probability measure on $[0, 1]$ induced by the random variable $\hat P(X)|_S$. Note that the conditional probability space is standard: $(S, \mathcal{F}_S := \{A \cap S : A \in \mathcal{F}\}, \mathbb{P}_S(A \cap S) := \mathbb{P}(A \mid S))$. (As per our convention in Section A.1, we omit "conditional on $L, U$" for simplicity.)

Lemma A.2. (a) A coverage estimator is the conditional coverage if and only if it is a perfect-calibrated coverage estimator on every positive-probability measurable subset $S$ of $\mathcal{X}$. (b) Suppose $\hat P$ is a perfect-calibrated coverage estimator on two disjoint positive-probability measurable subsets $S_1, S_2$. Then $\hat P$ is a perfect-calibrated coverage estimator on $S_1 \cup S_2$.

Proof. (a) The proof can be found in Lemma A.4. (b) We note that, by the law of total probability,
$$\mathbb{P}[Y \in [L(X), U(X)] \mid \hat P(X) = \hat P(x), X \in S_1 \cup S_2] = \sum_{j=1}^{2} \mathbb{P}[Y \in [L(X), U(X)] \mid \hat P(X) = \hat P(x), X \in S_j]\,\mathbb{P}[X \in S_j \mid \hat P(X) = \hat P(x), X \in S_1 \cup S_2] = \hat P(x).$$
Hence $\hat P$ is a perfect-calibrated coverage estimator on $S_1 \cup S_2$.

Lemma A.2(a) is motivated from a theoretical point of view: it provides the guidance that, in order to resemble the conditional coverage well, an estimator should be perfect-calibrated on as many subsets of the feature space as possible.

A.4 DETAILS ON COVERAGE ERROR

In Section 2, we introduced $CE_p$ to quantify the discrepancy between a coverage estimator and a perfect-calibrated coverage estimator, and $\overline{CE}_p$ to quantify the discrepancy between a coverage estimator and the conditional coverage. We note that, by Hölder's inequality, $CE_p \le CE_q$ for $1 \le p \le q \le +\infty$: a larger value of $p$ corresponds to a larger CE value. Continuing Definition A.1, we can further introduce the calibration-based conditional coverage error on a measurable subset as follows:

Definition A.3. The $L^p$ ($1 \le p \le +\infty$) calibration-based conditional coverage error, or coverage error for short, of a coverage estimator $\hat P$ on a measurable subset $S \subset \mathcal{X}$ with $\mathbb{P}(S) > 0$ is defined as
$$CE_p(S) := \big\| \mathbb{P}[Y \in [L(X), U(X)] \mid \hat P(X) = \hat P(x), X \in S] - \hat P(x) \big\|_{L^p},$$
where the $L^p$-norm is taken with respect to the randomness of $\hat P(X)$ on the conditional probability space $(S, \mathcal{F}_S := \{A \cap S : A \in \mathcal{F}\}, \mathbb{P}_S(A \cap S) := \mathbb{P}(A \mid S))$. In particular, we have $CE_p := CE_p(\mathcal{X})$.

Lemma A.4. A coverage estimator $\hat P$ is the conditional coverage if and only if its coverage error $CE_p(S) = 0$ for every measurable subset $S \subset \mathcal{X}$ with $\mathbb{P}(S) > 0$. In particular, a coverage estimator is the conditional coverage if and only if it is a perfect-calibrated coverage estimator on every measurable subset $S$ of $\mathcal{X}$ with $\mathbb{P}(S) > 0$.

Definition C.2. Define the $n$-th shatter coefficient (or growth function) of $\mathcal{C}$ as
$$\Pi_{\mathcal{C}}(n) := \max_{x_1, \dots, x_n \in \mathcal{X}} \big|\{C \cap \{x_1, \dots, x_n\} : C \in \mathcal{C}\}\big|.$$

Definition C.3. For a function $f : \mathcal{X} \to \mathbb{R}$, the set $\{(x, t) \in \mathcal{X} \times \mathbb{R} : t < f(x)\}$ is the (open) subgraph of $f$. A collection $\mathcal{F}$ of measurable real functions on the sample space $\mathcal{X}$ is a VC-subgraph class, or VC-class, if the collection of all subgraphs of functions in $\mathcal{F}$ forms a VC-class of sets (as sets in $\mathcal{X} \times \mathbb{R}$). Let $V(\mathcal{F})$ denote the VC dimension of the set of subgraphs of $\mathcal{F}$.

Proof. This result follows from Lemma 9.33 and Lemma 9.9(iv) in Kosorok (2007).

For indicator functions of sets, we have the following equivalence.

Lemma C.5. For any class $\mathcal{C}$ of sets in a set $\mathcal{X}$, the class $\mathcal{F}_{\mathcal{C}}$ of indicator functions of sets in $\mathcal{C}$ is a VC-class if and only if $\mathcal{C}$ is a VC-class. Moreover, whenever at least one of $\mathcal{C}$ or $\mathcal{F}_{\mathcal{C}}$ is a VC-class, the respective VC dimensions are equal.

Proof. This is Lemma 9.8 in Kosorok (2007).
Note that the sets of C are in X while the subgraphs of functions of F C are in X × R.

C.2 JUSTIFYING ASSUMPTION 4.2

We first restate the assumption:

Assumption C.6 (Restated Assumption 4.2). The four classes of functions $([L(x), U(x)], \hat P(x), 1 - \hat P(x))$ output by the neural network (NN) in Figure 1 have finite VC dimensions, say bounded above by $V_0$.

In Figure 1, the four output neurons of the NN are denoted $(L(x), U(x), \hat P(x), 1 - \hat P(x))$. We further let $(\psi_1(x), \psi_2(x), \psi_3(x), \psi_4(x))$ denote the pre-activated values of $(L(x), U(x), \hat P(x), 1 - \hat P(x))$. In other words,
$$L(x) = \min(\psi_1(x), \psi_2(x)), \quad U(x) = \max(\psi_1(x), \psi_2(x)), \quad \hat P(x) = \sigma(\psi_3(x) - \psi_4(x)), \quad 1 - \hat P(x) = \sigma(\psi_4(x) - \psi_3(x)),$$
where $\sigma$ is the sigmoid function. Let the function classes $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{G}$, and $1 - \mathcal{G}$ collect, respectively, the functions $L(x)$, $U(x)$, $\hat P(x)$, and $1 - \hat P(x)$ over all NN parameters. Assumption 4.2 holds for a wide range of NNs, in particular the one we adopt in the experiments (where we use a ReLU-activated NN to construct the $\psi_i$, $i = 1, 2, 3, 4$; see Section 5). Our first result shows concretely that the four NN output classes above, $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{G}$, and $1 - \mathcal{G}$, all have finite VC dimensions under the ReLU setting and thus satisfy Assumption 4.2.

Theorem C.7 (Restated Theorem 4.3). Suppose $\psi_i$, $i = 1, 2, 3, 4$, are the pre-activated output neurons of the NN in Figure 1 using the ReLU activation function. Then Assumption 4.2 holds. Moreover, suppose the NN has $W$ parameters and $U$ computation units (nodes). Then $V_0 = O(WU)$.

Proof. First, we look at $L(x)$ and $U(x)$. The class of $\psi_i$ ($i = 1, 2$) is constructed by a NN with the ReLU activation function, so by Theorem 8 in Bartlett et al. (2019) it has finite VC dimension $O(WU)$. By Lemma 9.9(i) in Kosorok (2007), $\mathcal{H}_1$ (pairwise minima) has finite VC dimension; by Lemma 9.9(ii) in Kosorok (2007), $\mathcal{H}_2$ (pairwise maxima) has finite VC dimension. Next, we look at $\hat P(x)$ and $1 - \hat P(x)$. We add an additional neuron after the layer where $\psi_3$ and $\psi_4$ stand, defined as $\psi_5 = \psi_3 - \psi_4$, a linear combination of $\psi_3$ and $\psi_4$. The class of $\psi_5$ is constructed by a NN with ReLU and linear activation functions (obtained by adding one unit and two parameters to the original NN), so by Theorem 8 in Bartlett et al. (2019) it also has finite VC dimension $O(WU)$. By Lemma 9.9(viii) in Kosorok (2007), $\mathcal{G} = \{\sigma(\psi_5)\}$ has finite VC dimension, since $\sigma$ is a monotone function.
Again, by Lemma 9.9(viii) in Kosorok (2007), $1 - \mathcal{G}$ has finite VC dimension as well.

We also list results for other activations here. From these results, and using the same argument as above, we see that Assumption 4.2 holds similarly for all of these activations.

Lemma C.8. Suppose the class of functions is constructed by a NN with $W$ parameters and $U$ units, with activation functions that are piecewise polynomials with at most $p$ pieces and degree at most $d$. Then it has VC dimension $O(WU \log((d + 1)p))$.

Proof. This is Theorem 8 in Bartlett et al. (2019).

Note that the activation functions in Lemma C.8 include, in particular, the ReLU activation and the linear activation.

Lemma C.9. Suppose the class of functions is constructed by a NN with $W$ parameters with binary as well as linear activation functions. Then it has VC dimension $O(W^2)$.

Proof. This is Theorem 5 in Sontag (1998).

Lemma C.10. Suppose the class of functions is constructed by a NN with $W$ parameters and $U$ units with the standard sigmoid activation function (except that the output unit is a linear threshold unit). Then it has VC dimension $O(W^2 U^2)$.

Proof. This is Theorem 8.13 in Anthony & Bartlett (2009).

Let $\mathcal{G}' := \log(\mathcal{G}) := \{\log(\hat P(x)) : \hat P \text{ is output by the NN}\}$, and let $N(\epsilon, \mathcal{F}, L_2(Q))$ denote the covering number, i.e., the minimal number of balls $\{g : \|g - h\|_{L_2(Q)} < \epsilon\}$ of radius $\epsilon$ needed to cover the set $\mathcal{F}$. We need the following bounds:

Lemma C.13. For every $0 < \epsilon < 1$,
$$\sup_Q \log N(\epsilon, \mathcal{F}, L_2(Q)) \le K_2 \left(\frac{1}{\epsilon}\right)^{1/e},$$
where the constant $K_2$ depends on $V(\mathcal{F})$ only.

Proof. It follows from Theorem 2.6.7 in Van der Vaart & Wellner (1996) that there exists a universal constant $K$ such that
$$\sup_Q N(\epsilon, \mathcal{F}, L_2(Q)) \le K V(\mathcal{F}) (16e)^{V(\mathcal{F})} \left(\frac{1}{\epsilon}\right)^{2(V(\mathcal{F}) - 1)},$$
where one can take $K_3 := \log(K V(\mathcal{F})(16e)^{V(\mathcal{F})})$ and $K_2 := K_3 + V(\mathcal{F}) - 1$, both depending on $V(\mathcal{F})$ only.

We remark that a similar result can also be obtained for the class $1 - \mathcal{F}$ by Theorem C.12.

Lemma C.14. Suppose $\mathcal{G}$ is a class of functions $\hat P : \mathcal{X} \to [0, 1]$ with finite VC dimension $V(\mathcal{G})$ and $|\log(\hat P(x))| \le M$. Let $\mathcal{G}' := \{\log(\hat P) : \hat P \in \mathcal{G}\}$. Then, for every $0 < \epsilon < 1$,
$$\sup_Q \log N(\epsilon M, \mathcal{G}', L_2(Q)) \le K_2 \left(\frac{1}{\epsilon}\right)^{1/e},$$
where the constant $K_2$ depends on $V(\mathcal{G})$ only.

Proof.
First note that $\phi(t) := \log(t)$ is a monotone function. Hence $\mathcal{G}' := \{\log(\hat P) : \hat P \in \mathcal{G}\}$ is a VC-class with VC dimension at most $V(\mathcal{G})$, by Lemma 9.9(viii) in Kosorok (2007). The rest of the proof is similar to that of Lemma C.13.

We remark that a similar result can also be obtained for the class $\log(1 - \mathcal{G}) := \{\log(1 - \hat P) : \hat P \in \mathcal{G}\}$.

As discussed in Section 4, the boundedness condition $|\log(\hat P(x))| \le M$ is a natural assumption in practice, because $\log(\hat P(x))$ and $\log(1 - \hat P(x))$ are replaced by $\log(\hat P(x) + \epsilon)$ and $\log(1 - \hat P(x) + \epsilon)$, respectively, to avoid explosion when implementing the algorithm. In particular, in our experiments in Section 5, $\epsilon = 0.1^6$ and thus $M = 14$.

We are now ready to prove Theorem 4.5.

