CHEPAN: CONSTRAINED BLACK-BOX UNCERTAINTY MODELLING WITH QUANTILE REGRESSION

Abstract

Most predictive systems currently in use report no useful information for auditing their associated uncertainty and evaluating the corresponding risk. Given that their replacement may not be advisable in the short term, in this paper we propose a novel approach to modelling confidence in such systems while preserving their predictions. The method is based on the Chebyshev Polynomial Approximation Network (ChePAN), a new way of modelling aleatoric uncertainty in a regression scenario. In the case addressed here, uncertainty is modelled by building conditional quantiles on top of the original pointwise forecasting system, considered as a black box, i.e. without making assumptions about its internal structure. Furthermore, the ChePAN allows users to consistently choose how to constrain any predicted quantile with respect to the original forecaster. Experiments show that the proposed method scales to large data sets and transfers the advantages of quantile regression to estimating black-box uncertainty.

1. INTRODUCTION

Figure 1: Description of the uncertainty modelling of a black-box predictive system, β. This modelling is done by means of an uncertainty wrapper (the only part of the ChePAN that requires a neural network), which produces the whole distribution p(y | x) as quantiles, q_{p(y|x)}. The ChePAN ensures that the original prediction of β corresponds to a desired statistic of p(y | x), i.e. the constraint.

The present paper proposes a novel method for adding aleatoric uncertainty estimation to any pointwise predictive system currently in use. Considering the system as a black box, i.e. avoiding any hypothesis about its internal structure, the method offers a solution to the technical debt debate. The concept of technical debt was introduced in 1992 to initiate a debate on the long-term costs incurred when moving quickly in software engineering (Sculley et al. (2015); Cunningham (1992)). Specifically, most of the predictive systems currently in use have previously required much effort in terms of code development, documentation writing, unit test implementation, preparing dependencies or even compliance with the appropriate regulations (e.g., medical (Ustun & Rudin (2016)) or financial models (Rudin (2019)) may have to satisfy interpretability constraints). However, once the system is being used on real-world problems, a new requirement can arise regarding the confidence of its predictions when the cost of an erroneous prediction is high. That being said, replacing the currently-in-use system may not be advisable in the short term. To address this issue, the aim of this work is to report any information that is useful for auditing the system's associated uncertainty without modifying its predictions.
In general terms, sources of uncertainty can be understood by analysing the conditional members of the joint distribution: p(y, x) = ∫_𝓜 p(y | x, M) p(M | x) p(x) dM, where M ∈ 𝓜 is the family (assumed non-finite) of models being considered. Not all methods developed to model uncertainty can be applied in the black-box scenario, since the main hypothesis is that the black box is a single fixed model whose internals are unknown. Here, we refer specifically to those solutions that model epistemic uncertainty, which requires modelling p(M | x). By epistemic, we mean the uncertainty that derives from ignorance about the model, tackled, for example, by ensemble models (Lakshminarayanan et al. (2017)), Bayesian neural networks (Rasmussen (1996); Blundell et al. (2015); Hernández-Lobato & Adams (2015b); Teye et al. (2018)) or MC-Dropout (Gal & Ghahramani (2016)). However, the black box could be a non-parametric predictive system or even a handcrafted rule-based system, as shown in Figure 1. Hence the reason for studying aleatoric uncertainty (Der Kiureghian & Ditlevsen (2009); Kendall & Gal (2017); Brando et al. (2019)), which originates from the variability of possible correct answers given the same input data, p(y | x). This type of uncertainty can be tackled by modelling the response variable distribution, for instance by imposing a conditional normal distribution whose location parameter is the black-box function and whose scale parameter is learnt. However, the more restrictive the assumptions made about this distribution, the more difficult it will be to model heterogeneous distributions. One solution to this limitation is the type of regression analysis used in statistics and econometrics known as Quantile Regression (QR), which provides a more comprehensive estimation. Unlike classic regression methods, which only estimate a selected statistic such as the mean or the median, QR allows us to approximate any desired quantile.
The main advantage of this method is that it captures confidence intervals without strong assumptions about the distribution function to be approximated. Recently, several works (Dabney et al. (2018a); Tagasovska & Lopez-Paz (2018); Brando et al. (2019)) have proposed single deep learning models that implicitly learn all the quantiles at the same time, i.e. the model can be evaluated for any real value τ ∈ [0, 1] to give a pointwise estimation of any quantile of the response variable. Nevertheless, these QR solutions are not directly applicable to the uncertainty modelling of a black box, because the predicted quantiles need to be linked to the black-box prediction in some way. In the present paper, we propose a novel QR method, based on estimating the derivative of the final function using a Chebyshev polynomial approximation, to model the uncertainty of a black-box system. Specifically, this method disentangles the estimation of a selected statistic β of the distribution p(y | x) from the estimation of the quantiles of p(y | x) (shown in Figure 2). Hence, our method is not restricted to scenarios where we can jointly train both estimators, but can also be applied to pre-existing regression systems as a wrapper that produces the information necessary to evaluate aleatoric uncertainty. Additionally, the proposed method scales to several real-world data sets. This paper is organised as follows. Section 2 states the real-world motivation of the current research as well as the contribution presented. Section 3 introduces the problem of QR and reviews the classic neural-network approach, showing why it cannot be applied directly to constrained black-box uncertainty modelling. Section 4 explores an approach for modelling the derivative of a function using neural networks. These two sections provide the baseline for developing our proposed model and its properties, presented in Section 5.
Finally, in Section 6, we show how our model can be applied to large data sets and how it defines a new way of modelling the aleatoric uncertainty of a black box. The results are then summarised in the conclusion.

2. RESEARCH GOAL AND CONTRIBUTION

The present article was motivated by a real-world need that arises in the pointwise regression forecasting system of a large company. Due to the risky nature of the internal problem where it is applied, uncertainty modelling is important. However, similarly to the medical and financial cases presented in the introduction, interpretability requirements were essential in defining the model currently used by the company, which does not report confidence for any prediction made. The need for this research arises in cases where the replacement of the aforementioned system is not advisable in the short term, despite the ongoing need to estimate that system's uncertainty.

Definition of constrained black-box uncertainty modelling

From the probabilistic perspective, solving a regression problem involves determining a conditional density model, q(y | x). This model fits an observed set of samples D = (X, Y) = {(x_i, y_i) | x_i ∈ R^D, y_i ∈ R}_{i=1}^n, which we assume to be sampled from an unknown distribution p(y | x), i.e. the real data. Given this context, the pointwise forecasting system mentioned above is a function, β: R^D → R, which tries to approximate a certain conditional summary statistic (a percentile or moment) of p(y | x). Regarding notation, we will call the "constraint" the known or assumed summary statistic approximated by β(x) (e.g. if β reduces the mean square error, it corresponds to the conditional mean; if it minimises the mean absolute error, it corresponds to the median). Importantly, in the constrained black-box uncertainty modelling context, the mismatch between the real conditional statistic and the black box, β, becomes a new source of aleatoric uncertainty, different from the one derived from the data. However, the way to model it continues to be by estimating p(y | x). Therefore, a poorly estimated β will impact the modelling of p(y | x), given that we always force the constraint to be satisfied (as shown in Figure 3 of the Experiments section). So far, we have attempted to highlight the fact that we have no strong hypothesis about the internals of this β function; we have only assumed that it approximates a certain statistic of p(y | x). Accordingly, we call this function the "constrained black box". This flexible assumption will enable us to consider several pointwise models as β, as shown in Figure 1. The overall goal of the present article is, taking a pre-defined black box β(x) that estimates a certain conditional summary statistic of p(y | x), to model q(y | x) under the constraint that the corresponding summary statistic of this predicted conditional distribution equals β(x).
As mentioned in the Introduction, since we have a fixed black box, we are unable to apply Bayesian techniques such as those that infer the distribution of parameters within the model, p(M | x). In general, even though they are very common techniques in generic uncertainty modelling, no such epistemic uncertainty techniques can be applied in this context due to the limitation of only having a single fixed model. In addition, it should be noted that not all models that estimate p(y | x) can be used in the constrained black-box uncertainty modelling context. To solve this problem, we require models that predict q(y | x) while also forcing the chosen conditional summary statistic of q(y | x) to have the same value as β(x). The main contribution of this work is to present a new approach that allows us not only to outperform other baseline models when tackling this problem, but also to decide which kind of constraint we wish to impose between β(x) and q(y | x). The distribution q(y | x) will be approximated using Quantile Regression (explained in Section 3), and the constraint will be imposed through the integration constant of the derivative of q(y | x) (shown in Section 5.1).

3. CONDITIONAL QUANTILE REGRESSION

In Quantile Regression (QR), we estimate q in a discrete manner by means of quantiles, which does not impose any typical parametric family on the predicted distribution, i.e. it goes beyond central-tendency or unimodality assumptions. For each quantile value τ ∈ [0, 1] and each input value x ∈ R^D, the conditional quantile function will be f: [0, 1] × R^D → R. In our case, we use deep learning as a generic function approximator (Hornik et al. (1989)) to build the model f, as we shall see later. Consequently, f is a parametric function that will be optimised by minimising the following loss function with respect to its weights w,

L(x, y, τ) = (y − f_w(τ, x)) · (τ − 1[y < f_w(τ, x)]),   (1)

where 1[c] denotes the indicator function that verifies the condition c. Equation 1 is an asymmetric convex loss function that penalises overestimation errors with weight τ and underestimation errors with weight 1 − τ. Recently, different works (Dabney et al. (2018b;a); Wen et al. (2017)) have proposed deep learning models that minimise a QR loss function similar to Equation 1. For instance, in the field of reinforcement learning, the Implicit Quantile Network (IQN) model was proposed (Dabney et al. (2018a)) and subsequently applied to regression problems as the Simultaneous Quantile Regression (SQR) model (Tagasovska & Lopez-Paz (2019)) or the IQN in (Brando et al. (2019)). These models consist of a neural network ψ: [0, 1] × R^D → R that directly learns the function f minimising Equation 1, i.e. f = ψ. In order to optimise ψ for all possible τ values, these models pair up each input x with a τ ∼ U(0, 1) sampled from a uniform distribution in each iteration of stochastic gradient descent. Thus, the final loss function is an expectation over τ of Equation 1.
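As a sketch of how Equation 1 behaves, the snippet below implements the loss in NumPy (rather than the paper's TensorFlow stack; all names are ours) and checks the classic property that the constant minimising the mean pinball loss at level τ is the empirical τ-quantile:

```python
import numpy as np

def pinball_loss(y, pred, tau):
    # Equation 1: under-predictions are weighted by tau,
    # over-predictions by (1 - tau)
    return (y - pred) * (tau - (y < pred))

# Sanity check: the constant c minimising the mean pinball loss at level tau
# is the empirical tau-quantile of the sample
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
tau = 0.8
grid = np.linspace(-3.0, 3.0, 601)
losses = [np.mean(pinball_loss(y, c, tau)) for c in grid]
best = grid[int(np.argmin(losses))]   # close to np.quantile(y, 0.8)
```

In the IQN/SQR training loop described above, `tau` would be resampled from U(0, 1) for every mini-batch element, so that a single network is optimised for all quantile levels simultaneously.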
However, these QR models cannot be applied to the constrained black-box scenario, given that they do not link their predicted quantiles with a pointwise forecasting system in a constrained way (Section 5.1). Other models, such as quantile forests, have a similar limitation. In the next section, we introduce the other main component required to define our proposed method.

4. MODELLING THE DERIVATIVE WITH A NEURAL NETWORK

Recently, a non-QR approach was proposed to build a monotonic function based on deep learning: the Unconstrained Monotonic Neural Network (UMNN) (Wehenkel & Louppe (2019)). The UMNN estimates the derivative of a function by means of a neural network, φ, whose output is restricted to strictly positive values, i.e. it approaches H(z) such that

H(z) = ∫_0^z φ(t) dt + H(0).   (2)

Therefore, if the neural network φ(z) ≈ ∂H/∂z(z) > 0, this is in fact a sufficient condition to force H(z) to be monotone. To compute the integral of ∂H/∂z, the UMNN approximates the integral of Equation 2 using the Clenshaw-Curtis quadrature, which has a closed expression. The UMNN is designed to obtain a general monotonic function with respect to all the model inputs, z, but our interest is to build a partially monotonic function with respect to the quantile value, as we will explain hereafter. Such a partially monotonic function could be obtained using the Clenshaw-Curtis Network (CCN) model, an extension of the UMNN introduced in Section A.3 of the Appendix and an intermediate step we took to arrive at the main proposal of the current article. Importantly, we have not included it in the main article because it cannot be applied to the constrained black-box uncertainty modelling scenario (as described in Section A.3).

5. CHEPAN: THE CHEBYSHEV POLYNOMIAL APPROXIMATION NETWORK

In this section, we extend the UMNN to a model that is monotonic only with respect to the quantile input τ. Moreover, we exploit the fact that the quantile domain is [0, 1] to provide an approach that is uniformly defined over the whole interval. We call this approach the Chebyshev Polynomial Approximation Network (ChePAN), which allows us to transfer the advantages of quantile regression to the constrained uncertainty modelling of a black box. As Figure 2 shows, the ChePAN contains a neural network φ: [0, 1] × R^D → R⁺ that only produces positive outputs and models the derivative of the final function with respect to τ. The goal is to optimise the neural network φ(τ, x) by calculating the coefficients of a truncated Chebyshev polynomial expansion p(τ, x; d) of degree d with respect to τ.
That is, we will use a Chebyshev polynomial (described in Section A.1 of the Appendix) to give a representation of the neural network, φ, uniformly defined in τ ∈ [0, 1]. After that, we will use its properties to model the uncertainty of a black box in a constrained way (described in Section 5.1). Internally, the ChePAN considers a finite mesh of quantile values, called Chebyshev roots, {t_k}_{k=0}^{d−1} ⊂ [0, 1], defined by

t_k = (1/2) cos(π(k + 1/2)/d) + 1/2,   0 ≤ k < d.

The truncated Chebyshev expansion of a function can be interpreted as a linear transformation of a set of evaluations of φ at the roots, i.e. {φ(t_k, x)}_{k=0}^{d−1}. This linear transformation gives a vector of coefficients, known as Chebyshev coefficients, which depend on x, i.e. {c_k(x)}_{k=0}^{d−1}, as illustrated in Figure 2. A generic linear transformation has quadratic complexity; however, the transformation that yields the Chebyshev coefficients can be computed with Θ(d log d) complexity. In fact, the algorithm that speeds up the computation is based on the Fast Fourier Transform (FFT) and is known as the Discrete Cosine Transform of type II (DCT-II) (discussed in Section A.1 of the Appendix). Once the Chebyshev coefficients c_k(x) have been computed, we can write the expansion as a linear combination of Chebyshev polynomials T_k(t), i.e.

p(τ, x; d) = (1/2) c_0(x) + Σ_{k=1}^{d−1} c_k(x) T_k(2τ − 1),   (4)

where the T_k(t) are defined recurrently as T_0(t) = 1, T_1(t) = t and T_{k+1}(t) = 2t T_k(t) − T_{k−1}(t) for k ≥ 1. These polynomials T_k do not need to be explicitly computed to evaluate p at a quantile (Clenshaw (1955)). Note that, given the construction of the coefficients c_k(x), p(t_k, x; d) is equal to φ(t_k, x) at each of the root points t_k. These equalities must be understood in terms of machine precision, classically ε ≈ 10⁻¹⁶ in double-precision or ε ≈ 10⁻⁸ in single-precision arithmetic.
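To make the root-evaluation and coefficient steps concrete, the sketch below computes {c_k} from evaluations at the roots t_k and evaluates the expansion with Clenshaw's recurrence. A naive O(d²) cosine transform stands in for the Θ(d log d) FFT-based DCT-II of the text, and all function names are ours:

```python
import numpy as np

def cheb_roots(d):
    # Chebyshev roots mapped from [-1, 1] onto the quantile interval [0, 1]
    k = np.arange(d)
    return 0.5 * np.cos(np.pi * (k + 0.5) / d) + 0.5

def cheb_coeffs(fvals):
    # DCT-II of the root evaluations (naive O(d^2) version of the
    # FFT-based Theta(d log d) transform mentioned in the text)
    d = len(fvals)
    k = np.arange(d)[:, None]
    j = np.arange(d)[None, :]
    return (2.0 / d) * (np.cos(np.pi * k * (j + 0.5) / d) @ fvals)

def clenshaw(c, tau):
    # Evaluate p(tau) = c_0/2 + sum_{k>=1} c_k T_k(2 tau - 1)
    # without building the T_k explicitly (Clenshaw, 1955)
    u = 2.0 * tau - 1.0
    b1 = b2 = 0.0
    for ck in c[:0:-1]:            # c_{d-1}, ..., c_1
        b1, b2 = 2.0 * u * b1 - b2 + ck, b1
    return u * b1 - b2 + 0.5 * c[0]
```

For a smooth positive function such as exp in place of φ, sixteen roots already reproduce the function at the roots (and in between) to near machine precision, matching the ε discussion above.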
In Figure 2, we denote this root-evaluation step as p_{t_k}. The final goal is to provide P(τ, x; d) so that it approximates the integral of p, that is, ∫_0^τ p(t, x; d) dt. Specifically, this integral will also approximate the integral of the neural network φ,

P(τ, x; d) ≈ Φ(τ, x) = ∫_0^τ φ(t, x) dt + K(x).   (5)

Since φ(τ, x) is positive for all τ ∈ [0, 1], P(τ, x; d) will be an increasing function with respect to τ. Additionally, given that p(τ, x; d) is a Chebyshev polynomial (defined in Equation 4), its integral w.r.t. τ reduces to integrating the Chebyshev polynomials T_k, which yields a new Chebyshev polynomial. Using the recurrent definition of T_k, we deduce the indefinite integrals

∫ T_0(t) dt = T_1(t),   ∫ T_1(t) dt = T_2(t)/4 − T_0(t)/4,   ∫ T_k(t) dt = T_{k+1}(t)/(2(k + 1)) − T_{k−1}(t)/(2(k − 1)),  k ≥ 2,

which leads to the conclusion that P can be given in terms of Chebyshev coefficients as well. Thus,

P(τ, x; d) = (1/2) C_0(x) + Σ_{k=1}^{d−1} C_k(x) T_k(2τ − 1),   (6)

where the coefficients C_k(x) have a recurrent expression in terms of a Toeplitz matrix (see Clenshaw (1955)). Indeed, by reordering the coefficients of the integral of Equation 4, we deduce that

C_k(x) = (c_{k−1}(x) − c_{k+1}(x)) / (4k),   0 < k < d − 1,   C_{d−1}(x) = c_{d−2}(x) / (4(d − 1)),   (7)

and C_0(x) depends on the constant of integration K(x) in Equation 5 and the other coefficient values in Equation 7. This freedom in C_0(x) allows us to impose a new condition, which holds uniformly over the whole interval [0, 1]. In Section 5.1, we will discuss how to define C_0(x) depending on the desired black-box constraint.
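Equation 7 can be checked numerically. The sketch below (variable names ours) integrates the coefficients of exp, whose antiderivative vanishing at 0 is exp(τ) − 1, and fixes the free coefficient so that P(0) = 0; note that NumPy's `chebval` uses the full-c₀ convention, hence the halving:

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

d = 16
k = np.arange(d)
roots = 0.5 * np.cos(np.pi * (k + 0.5) / d) + 0.5          # roots in [0, 1]
M = np.cos(np.pi * k[:, None] * (k[None, :] + 0.5) / d)
c = (2.0 / d) * (M @ np.exp(roots))                        # coefficients of exp

# Equation 7: coefficients of the antiderivative; the extra 1/2 w.r.t. the
# classic [-1, 1] rule comes from the change of variable u = 2 tau - 1
C = np.zeros(d)
C[1:d - 1] = (c[0:d - 2] - c[2:d]) / (4.0 * np.arange(1, d - 1))
C[d - 1] = c[d - 2] / (4.0 * (d - 1))

# Fix the integration constant so that P(0) = 0, using T_k(-1) = (-1)^k
C[0] = -2.0 * np.sum(C[1:] * (-1.0) ** np.arange(1, d))

def P(tau):
    a = C.copy()
    a[0] *= 0.5            # our series halves C_0; chebval does not
    return chebval(2.0 * tau - 1.0, a)
```

Since the integrand exp is positive, P is increasing in τ, which is exactly the monotonicity argument used for the quantile function in the text.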

5.1. ADDING AN UNCERTAINTY ESTIMATION TO A BLACK-BOX PREDICTION SYSTEM

In this subsection, we tackle the constrained black-box uncertainty modelling problem introduced in Section 2. The main assumption is that we have a pointwise predictive system, which we will refer to as β(x), that approximates a desired statistic such as the mean, the median or a certain quantile of p(y | x), as shown in Figure 1. This system does not need to be a deep learning model, or even parametric. All that the ChePAN requires to train its neural network, φ, are the corresponding β-evaluation values of the training set, i.e. {x, β(x)}. Thus, the ChePAN models the response distribution conditioned on the input without assuming symmetry or unimodality, while tying the desired statistic of this distribution to β(x). The formula used to calculate the constant of integration, C_0(x), depends on which statistic we choose foot_1. If we impose the quantile τ = 0 to be β (which we shall call ChePAN-β=q0), then

C_0(x) = 2β(x) − 2 Σ_{k=1}^{d−1} C_k(x) (−1)^k.   (9)

However, if we force the quantile τ = 1 to be β (which we shall call ChePAN-β=q1), then

C_0(x) = 2β(x) − 2 Σ_{k=1}^{d−1} C_k(x).   (10)

For instance, predicting extreme weather events requires the forecasting system to predict the minimum or maximum values of p(y | x). In these cases, the pre-trained system could be used as β in Equation 9 or Equation 10, respectively, to determine the overall quantile distribution of p(y | x), taking β as a reference point. If the median (equivalently, τ = 0.5) is β (which we shall call ChePAN-β=Med), then

C_0(x) = 2β(x) − 2 Σ_{k=2, k even}^{d−1} (−1)^{k/2} C_k(x).   (11)

Finally, if the mean is forced to be β (which we shall call ChePAN-β=Mean), then, since ∫_0^1 T_k(2τ − 1) dτ = 1/(1 − k²) for even k and 0 for odd k,

C_0(x) = 2β(x) − 2 Σ_{k=2, k even}^{d−1} C_k(x) / (1 − k²).   (12)

Additionally, β(x) can be approximated by means of another neural network, which can be optimised simultaneously with φ(τ, x).
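As a numerical sketch of Equations 9-12 (helper names are ours; the coefficient vector {C_k}_{k≥1} could come from Equation 7, here it is just random), each choice of C_0 pins the corresponding statistic of P to β:

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def constrain_c0(C, beta, constraint):
    # Choose the free coefficient C_0 of
    # P(tau) = C_0/2 + sum_{k>=1} C_k T_k(2 tau - 1)
    # so that the requested statistic of P equals beta
    C = C.copy()
    k = np.arange(1, len(C))
    if constraint == "q0":        # Eq. 9:  P(0) = beta, T_k(-1) = (-1)^k
        s = np.sum(C[1:] * (-1.0) ** k)
    elif constraint == "q1":      # Eq. 10: P(1) = beta, T_k(1) = 1
        s = np.sum(C[1:])
    elif constraint == "median":  # Eq. 11: P(1/2) = beta, T_k(0) = (-1)^{k/2} (even k)
        even = k[k % 2 == 0]
        s = np.sum(C[even] * (-1.0) ** (even // 2))
    elif constraint == "mean":    # Eq. 12: integral of T_k(2t - 1) over [0, 1]
        even = k[k % 2 == 0]      # is 1/(1 - k^2) for even k, 0 for odd k
        s = np.sum(C[even] / (1.0 - even.astype(float) ** 2))
    else:
        raise ValueError(constraint)
    C[0] = 2.0 * beta - 2.0 * s
    return C

def eval_P(C, tau):
    a = C.copy()
    a[0] *= 0.5               # chebval expects the full c_0 convention
    return chebval(2.0 * tau - 1.0, a)
```

Only C_0 changes between the variants, so switching the constraint never touches the learnt coefficients, which is why the same trained φ supports all four choices.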
We will use this approach to compare the ChePAN with other baseline models in the results section regarding black-box modelling.

6. EXPERIMENTS

The source code used to reproduce the results of the ChePAN in the following experiments can be found in the GitHub repository foot_2. The DCT-II method referred to in Section 5 was used in this source code. In this section, we describe the performance of the proposed model compared to other baselines. The main goal is to show that, by using QR, the ChePAN improves on other black-box uncertainty modelling baselines because it avoids centrality or unimodality assumptions, while also allowing users to choose how to constrain the predicted quantiles with respect to the black-box prediction.

6.1. MODELS UNDER EVALUATION

Exponential power distributions satisfy the condition that one of their parameters corresponds to the mode. Thus, models that approximate such parametric distributions, where the mode parameter is the black-box function and the other, uncertainty-related parameter is estimated, can be used as baselines.

• The Heteroscedastic Normal distribution (N). Similarly to (Bishop (1994); Kendall & Gal (2017); Tagasovska & Lopez-Paz (2019); Brando et al. (2019)), two neural networks, μ and σ, can be used to approximate the conditional normal distribution, N(μ(x), σ(x)), such that they maximise the likelihood. In the black-box scenario proposed here, μ is the black-box function and we only need to optimise the σ neural network. Once optimised, the desired quantile τ can be obtained with

F(τ, x) = μ(x) + σ(x) √2 · erf⁻¹(2τ − 1),   τ ∈ (0, 1),

where erf⁻¹ is the inverse error function.

• The Heteroscedastic Laplace distribution (LP). As a more outlier-robust alternative, a conditional Laplace distribution, LP(μ(x), b(x)), can be considered. Here, the quantile function is

F(τ, x) = μ(x) + b(x) log(2τ) · 1[τ ≤ 1/2] − b(x) log(2 − 2τ) · 1[τ > 1/2],   τ ∈ (0, 1).

• The Chebyshev Polynomial Approximation Network (ChePAN). In order to use the same black boxes as the other baselines, Equation 12 is considered, given that these black boxes optimise the mean square error. Other alternative equations are considered in the pseudo-code and in Figure 6 of the supplementary material.
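The two closed-form baseline quantile functions above can be sketched with the standard library alone (function names are ours; `NormalDist.inv_cdf` computes exactly μ + σ√2·erf⁻¹(2τ − 1)):

```python
import math
from statistics import NormalDist

def normal_quantile(tau, mu, sigma):
    # F(tau, x) = mu + sigma * sqrt(2) * erfinv(2 tau - 1)
    return NormalDist(mu, sigma).inv_cdf(tau)

def laplace_quantile(tau, mu, b):
    # Piecewise inverse CDF of the Laplace distribution
    if tau <= 0.5:
        return mu + b * math.log(2.0 * tau)
    return mu - b * math.log(2.0 - 2.0 * tau)
```

Both return μ at τ = 0.5 and are strictly increasing in τ; the ChePAN obtains this monotonicity by construction (a positive derivative network) instead of inheriting it from a symmetric parametric family.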

6.2. DATA SETS AND EXPERIMENT SETTINGS

All experiments were implemented in TensorFlow (Abadi et al. (2015)) and Keras (Chollet et al. (2019)), running on a workstation with a Titan X (Pascal) GPU and a GeForce RTX 2080 GPU. All details of the data sets used and the model hyper-parameters for the results section are described in the supplementary material.

6.3. RESULTS

Table 1 shows a comparison of the uncertainty modelled for two given black-box systems (a Random Forest (RF) (Liaw et al. (2002)) and an XGBoost (Chen & Guestrin (2016))) in four data sets.

Figure 3: Heterogeneous synthetic distribution proposed by (Brando et al. (2019)). In the upper part of the figure, the learnt quantiles, φ, are noisy because their mean is the black box, defined as an inaccurate MSE Random Forest (RF), β, following Equation 12. In the lower part, φ and β are learnt jointly, and asymmetries and multimodalities can be seen more clearly, while still respecting the constraint in Equation 12.

The first four columns correspond to each part of the synthetic distribution proposed by (Brando et al. (2019)) and shown in Figure 3; the fifth column is the full Year Prediction MSD UCI data set (Dua & Graff (2017a)), predicting the release year of a song from 90 audio features; and, finally, the last two columns correspond to predicting the room price of Airbnb flats (RPF) in Barcelona and Vancouver, extracted from (Brando et al. (2019)). The mean QR loss value (see Equation 1) is evaluated over ten thousand randomly selected quantiles for ten executions of each model, {m_k}_{k=1}^{10},

L_{m_k}(X_test, Y_test) = (1 / (N_test · N_τ)) Σ_{i=1}^{N_test} Σ_{j=1}^{N_τ} (y_i − f_{m_k}(τ_j, x_i)) · (τ_j − 1[y_i < f_{m_k}(τ_j, x_i)]),

where N_test is the number of points in the test set, N_τ = 10,000 is the number of Monte Carlo samplings and f_{m_k} is any of the models considered in Table 1. Considering how the QR loss is defined in Equation 1, its value not only informs us about each system's performance but also about how well calibrated its predicted quantiles generally are.
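The evaluation metric above can be sketched as a Monte-Carlo estimate of the mean pinball loss (function and variable names are ours):

```python
import numpy as np

def qr_eval_loss(f, X_test, Y_test, n_tau=10_000, seed=0):
    # Mean pinball loss over n_tau quantile levels sampled uniformly
    # at random, averaged over the test set
    rng = np.random.default_rng(seed)
    taus = rng.uniform(0.0, 1.0, size=n_tau)
    total = 0.0
    for x, y in zip(X_test, Y_test):
        preds = f(taus, x)                              # one value per tau
        total += np.mean((y - preds) * (taus - (y < preds)))
    return total / len(Y_test)
```

For a constant predictor f(τ, x) = c and a single observation y, the expected loss over uniform τ has a simple closed form, which makes the implementation easy to sanity-check.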
Furthermore, in Table 1 we observe that the ChePAN outperforms the other methods in most cases, since it transfers QR's capacity to capture asymmetries and multimodalities in p(y | x) to the black-box problem, where our uncertainty modelling must be restricted in order to maintain the statistic associated with the black box. This restriction of conserving the black box can be seen qualitatively in the upper part of Figure 3, where the restriction must be met in any situation, i.e. even if performance worsens because the black box, β(x), is not correctly fitted (as described in Section 2). In this case, β(x) is an inaccurate Random Forest predicting the mean. Importantly, the ChePAN propagates the β(x) noise to the predicted quantiles (in blue) because the constraint is always enforced. On the other hand, the ability of the ChePAN to model heterogeneous distributions using QR is better displayed in the lower part of Figure 3. In this case, the black box is a neural network learnt concurrently with the quantiles; since the black box is better approximated, the quantiles are better. Finally, since Table 1 shows a similar performance order between the baselines when using the RF or XGBoost, we also report additional experiments that directly measure the calibration of the predicted quantiles and compare the predicted width of certain desired intervals.



foot_1: All details of how these formulas are reached can be found in the supplementary material.
foot_2: The camera-ready version of this paper will include all of the source code to reproduce the experiments.



Figure 2: Graphic representation of the ChePAN. For any degree d, {p_{t_k}}_{k=0}^{d−1} are evaluations of the initial Chebyshev polynomial expansion, {c_k}_{k=0}^{d−1} their coefficients, {C_k}_{k=0}^{d−1} the coefficients of the integrated polynomial, β the black-box function and P the conditional prediction of the quantile τ.

Following the UCI data sets used in (Hernández-Lobato & Adams (2015b); Gal & Ghahramani (2016); Lakshminarayanan et al. (2017); Tagasovska & Lopez-Paz (2019)), we performed two empirical studies to assess this point in a black-box scenario where the black box is an MSE-XGBoost. Following the hidden-layer architecture proposed in (Tagasovska & Lopez-Paz (2019)), the Prediction Interval Coverage Probability (PICP) and the Mean Prediction Interval Width (MPIW) are reported in Table 3 of the appendix, considering the 0.025 and the 0.975 quantiles.

Table 1: Mean and standard deviation of the QR loss value (mean ± std) over 10 executions for each black box - uncertainty wrapper, using all of the test distributions in Figure 3 and three data sets (described in Section A.6). The ranges that overlap with the best range are highlighted in bold.

Black box - wrapper | Syn. 1       | Syn. 2       | Syn. 3       | Syn. 4        | MSD         | RPF-Bar     | RPF-Van
RF - N              | 42.37 ±0.04  | 23.19 ±1.00  | 66.44 ±0.26  | 151.51 ±0.24  | 57.50 ±.05  | 23.47 ±.14  | 27.27 ±.39
RF - LP             | 42.88 ±0.04  | 22.10 ±0.03  | 67.13 ±0.09  | 153.06 ±0.22  | 57.58 ±.02  | 23.07 ±.17  | 28.06 ±.12
RF - ChePAN         | 41.52 ±0.35  | 23.19 ±0.70  | 65.98 ±0.20  | 148.39 ±0.16  | 48.28 ±.18  | 23.17 ±.07  | 28.16 ±.14
XGBoost - N         | 42.42 ±0.05  | 23.35 ±0.99  | 66.38 ±0.26  | 149.35 ±0.40  | 51.17 ±.08  | 24.52 ±.26  | 27.79 ±.08
XGBoost - LP        | 42.90 ±0.02  | 23.02 ±0.43  | 67.13 ±0.17  | 150.94 ±0.12  | 51.24 ±.02  | 22.63 ±.11  | 27.86 ±.07
XGB. - ChePAN       | 41.95 ±0.40  | 23.69 ±0.68  | 65.89 ±0.17  | 146.20 ±0.30  | 48.54 ±.08  | 22.00 ±.04  | 27.51 ±.13
N                   | 43.63 ±2.89  | 23.70 ±6.85  | 67.45 ±1.68  | 148.78 ±2.88  | 49.00 ±.24  | 27.28 ±1.25 | 28.62 ±1.61
LP                  | 43.46 ±0.15  | 20.72 ±0.47  | 68.06 ±0.82  | 149.99 ±0.64  | 48.67 ±.28  | 23.51 ±.28  | 22.32 ±.06
ChePAN              | 41.72 ±0.24  | 22.94 ±1.81  | 68.55 ±6.61  | 145.93 ±3.14  | 46.76 ±.25  | 20.67 ±.40  | 21.97 ±.12


For the sake of completeness, in Figure 4 and its associated table we have also computed an additional metric, not only to verify the calibration of the 0.025 and 0.975 quantiles, but also to obtain a measure of general calibration over the entire quantile distribution: the mean absolute error, averaged over all folds, between the empirical predicted calibration and the perfect ideal calibration of 980 equidistant quantiles, using Equation 14. Given an N_τ-equidistant set of quantiles to evaluate, τ = [10⁻², ..., 1 − 10⁻²], the percentage of actual test data that falls below each predicted quantile is compared to the corresponding ideal quantile value. In addition, two extra figures showing the disentangled visualisation of this calibration metric for each quantile can be found in Figure 5 of the Appendix. As all of the figures and tables show, in terms of calibration, the ChePAN generally displays better performance in the black-box scenario than the other models.
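This calibration measure can be sketched as follows (our naming; Equation 14 itself is in the appendix, so the exact form here is our reconstruction of "empirical coverage versus ideal coverage"):

```python
import numpy as np

def calibration_error(f, X_test, Y_test, n_tau=980):
    # Mean absolute gap between the fraction of test targets falling
    # below each predicted quantile and the ideal fraction tau
    taus = np.linspace(1e-2, 1.0 - 1e-2, n_tau)
    Y_test = np.asarray(Y_test)
    gaps = []
    for tau in taus:
        preds = np.array([f(tau, x) for x in X_test])
        gaps.append(abs(np.mean(Y_test <= preds) - tau))
    return float(np.mean(gaps))
```

A perfectly calibrated quantile model drives this value towards the sampling noise floor, while a model that ignores τ pays roughly the mean distance of τ from its median level.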

7. CONCLUSION

The uncertainty modelling of a black-box predictive system requires designing wrapper solutions that avoid assumptions about the internal structure of the system. Specifically, this system could be a non-deep-learning model (such as the ones presented in Table 1 and Figure 3) or even a non-parametric predictive system, as proposed in Figure 1. Therefore, not all models or types of uncertainty can be considered in this framework. The present paper introduces the Chebyshev Polynomial Approximation Network (ChePAN), a model based on Chebyshev polynomials and deep learning that has a dual purpose: firstly, it predicts the aleatoric uncertainty of any pointwise predictive system; and secondly, it respects the statistic predicted by that system. To conclude, the ChePAN transfers the advantages of Quantile Regression (QR) to the problem of modelling aleatoric uncertainty in an existing, fixed pointwise predictive system (denoted as β and referred to as a black box). Experiments using different large-scale real data sets, as well as a synthetic one containing several heterogeneous distributions, confirm these novel features.

