UNCERTAINTY PREDICTION FOR DEEP SEQUENTIAL REGRESSION USING META MODELS

Anonymous authors
Paper under double-blind review

Abstract

Generating high-quality uncertainty estimates for sequential regression, particularly with deep recurrent networks, remains a challenging and open problem. Existing approaches often make restrictive assumptions (such as stationarity) yet still perform poorly in practice, especially in the presence of real-world non-stationary signals and drift. This paper describes a flexible method that can generate symmetric and asymmetric uncertainty estimates, makes no assumptions about stationarity, and outperforms competitive baselines in both drift and non-drift scenarios. This work helps make sequential regression more effective and practical for use in real-world applications, and is a powerful new addition to the modeling toolbox for sequential uncertainty quantification in general.

1. INTRODUCTION

The ability to quantify the uncertainty of a model is one of the fundamental requirements in trusted, safe, and actionable AI (Arnold et al., 2019; Jiang et al., 2018; Begoli et al., 2019). This paper focuses on uncertainty quantification in regression tasks, particularly in the context of deep neural networks (DNN). We define a sequential task as one involving an ordered series of input elements, represented by features, and an ordered series of outputs. In sequential regression tasks (SRT), the output elements are (possibly multivariate) real-valued variables. SRT occur in numerous applications, among others in weather modeling, environmental modeling, energy optimization, and medicine. When the cost of making an incorrect prediction is particularly high, such as in human safety, models without reliable uncertainty estimation are perceived as high risk and may not be adopted.

Uncertainty prediction in DNNs has been subject to active research, spurred in particular by what has become known as the "Overconfidence Problem" of DNNs (Guo et al., 2017) and by their susceptibility to adversarial attacks (Madry et al., 2017). However, the bulk of this work is concerned with non-sequential classification tasks (see Section 2), leaving a noticeable gap for SRT. In this paper we introduce a meta-modeling concept as an approach to achieving high-quality uncertainty quantification in DNNs for SRT. We demonstrate that it not only outperforms competitive baselines but also provides consistent results across a variety of drift scenarios. We believe the approach represents a new powerful addition to the modeling toolbox in general.

The novel contributions of this paper are summarized as follows: (1) application of the meta-modeling concept to SRT, (2) development of a joint base-meta model along with a comparison to white-box and black-box alternatives, (3) generation of asymmetric uncertainty bounds in DNNs, and (4) a new evaluation methodology for SRT.

2. RELATED WORK

Classical statistics on time series offers an abundance of work dealing with uncertainty quantification (Papoulis & Saunders, 1989). Most notably in econometrics, a variety of heteroskedastic variance models have led to highly successful applications in financial market volatility analyses (Engle, 1982; Bollerslev, 1986; Mills, 1991). The Autoregressive Conditional Heteroskedastic (ARCH) model (Engle, 1982) and its generalized version, GARCH (Bollerslev, 1986), are two such methods, the latter of which serves as one of our baselines.

An illuminating study (Kendall & Gal, 2017) describes an integration of two sources of uncertainty, namely the epistemic (due to the model) and the aleatoric (due to the data). The authors propose a variational approximation of Bayesian Neural Networks and an implicit Gaussian model to quantify both types of variability in non-sequential classification and regression tasks. Based on Nix & Weigend (1994), Lakshminarayanan et al. (2017) also use an implicit Gaussian model to improve the predictive performance of a base model, again in a non-sequential setting. Similar to Kendall & Gal (2017), that study does not focus on comparing the quality of the uncertainty to that generated by other methods. We use the implicit variance model of Kendall & Gal (2017), Oh et al. (2020), and Lakshminarayanan et al. (2017), as well as the variational dropout method of Gal & Ghahramani (2016) and Kendall & Gal (2017), as baselines in our work.

A meta-modeling approach was taken by Chen et al. (2019), aiming at the task of instance filtering using white-box models. That work relates to ours through the meta-modeling concept but concentrates on classification in a non-sequential setting. Besides its application in filtering, meta-modeling has been widely applied to learning to learn and lifelong learning (Schmidhuber, 1987; Finn et al., 2019). However, these two applications of meta-modeling are not comparable because their objectives differ.

Uncertainty under data drift was assessed in a recent study (Snoek et al., 2019). The authors employ calibration-based metrics to examine various methods for uncertainty in classification tasks (image and text data) and conclude, among other findings, that the quality of most methods degrades with drift. Acknowledging drift as an important experimental aspect, our study takes it into account by testing in matched and drifted scenarios. Finally, Shen et al. (2018) described a multi-objective training of a DNN for wind power prediction, minimizing two types of cost related to coverage and bandwidth. We expand on these metrics in Section 3.3.

3.1. META MODELING APPROACH

The basic concept of Meta Modeling (MM), depicted in Figure 1, involves a combination of two models: a base model, performing the main task (e.g., regression), and a meta model, learning to predict the base model's error behavior. Depending on the amount of information shared between these two, we distinguish several settings, namely (1) the base model is a black box (BB), (2) the base model is a white box (WB, base parameters are accessible), and (3) the base and meta components are trained jointly (JM). The advantage of WB and JM is that rich information is available for the meta model to capture salient patterns and generate accurate predictions. On the other hand, the BB setting often occurs in practice and has to be taken as a given.

Formally, the base parameters $\phi$ are obtained as $\arg\min_{\phi} E[l_b(\hat{y}, y)]$ and the meta parameters $\gamma$ as $\arg\min_{\gamma} E[l_m(\hat{z}, l_z(\hat{y}, y))]$, involving a loss $l_z$ quantifying the target error (residual) from the base model, and $l_m$ quantifying the prediction error of the meta model. In general, $l_b$, $l_z$, and $l_m$ may differ. The expectations are estimated using an available dataset. We used the squared Frobenius norm, $\|\cdot\|_F^2$, for $l_b$ and $l_m$, and the $L_1$ norm for $l_z$, as described in Section 4. Given differentiable loss functions and the DNN setting, the base and the meta model can be integrated in a single network (JM). In this case the parameters are estimated jointly via

$$\phi^*, \gamma^* = \arg\min_{\phi, \gamma} E\left[\beta\, l_b(\hat{y}, y) + (1 - \beta)\, l_m(\hat{z}, l_z(\hat{y}, y))\right] \qquad (1)$$

whereby dedicated output nodes of the network generate $\hat{y}_t$ and $\hat{z}_t$, and $\beta$ is a hyperparameter trading off the base loss against the meta loss. Thus, one part of the network tackles the base task, minimizing the base residual, while another models the residual as the eventual measure of uncertainty. As in Kendall & Gal (2017), one can argue that the base objective minimizes the epistemic (parametric) uncertainty, while the meta objective captures the aleatoric uncertainty present in the data. Due to their interaction, the base loss is influenced by the estimated uncertainty, encouraging it to focus on feature-space regions with lower aleatoric uncertainty.
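As an illustration of the joint objective in Eq. (1), the following minimal sketch evaluates the loss for a single sequence in plain Python. The function name `joint_loss` and the nested-list layout are assumptions made for this example; the squared-error form of $l_b$ and $l_m$ and the absolute residual as the meta target follow the loss choices stated above. An actual implementation would compute the same quantity inside a DNN framework so that it can be differentiated.

```python
def joint_loss(y_hat, y, z_hat, beta=0.5):
    """Eq. (1) for one sequence: beta * l_b + (1 - beta) * l_m.

    y_hat, y, z_hat are D x M nested lists (D outputs, M time steps).
    l_b and l_m are squared errors summed over all elements; the meta
    target is the element-wise absolute residual |y_hat - y|.
    The function name and data layout are illustrative assumptions.
    """
    base = 0.0
    meta = 0.0
    for yh_row, y_row, zh_row in zip(y_hat, y, z_hat):
        for yh, yt, zh in zip(yh_row, y_row, zh_row):
            residual = abs(yh - yt)        # meta target z = l_z(y_hat, y)
            base += (yh - yt) ** 2         # contribution to l_b
            meta += (zh - residual) ** 2   # contribution to l_m
    return beta * base + (1.0 - beta) * meta
```

Setting `beta=1.0` recovers pure base training and `beta=0.0` pure meta training, which corresponds to the staged schedule described in Section 4.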
We further conjecture that the DNN base model is encouraged to encode the input in ways suitable for uncertainty quantification. Figure 2 shows an overview of the sequential DNN architecture applied throughout our study. It includes a base encoder-decoder pair and a meta decoder connected to them. Each of these contains a recurrent memory cell, the LSTM (Hochreiter & Schmidhuber, 1997). The role of the encoder is to process the sequential input, $x$, compress its information into a context vector, and pass it to the base decoder. The recurrent decoder produces the regression output $\hat{y}$ in $M$ time steps, feeding its predictions back as input in subsequent time steps. Evolving in time, the two base LSTMs update their internal states $b_t$ and $h_t$, whereby the last encoder state, $b_N$, serves as the context vector for the decoder. This architecture has gained wide popularity in applications such as speech-to-text (Chiu et al., 2018; Tüske et al., 2019), text-to-speech (Sotelo et al., 2017), machine translation (Sutskever et al., 2014), and image captioning (Rennie et al., 2016). Following the MM concept, we attach an additional decoder (the meta decoder) via connections to the encoder and decoder outputs. The context vector, $b_N$, is transformed by a fully connected layer (FCN in Figure 2), and both the output $\hat{y}_t$ and the internal state $h_t$ are fed into the meta component. As mentioned above, the meta decoder generates the uncertainty estimates, $\hat{z}_t$. Given the architecture depicted in Figure 2, we summarize the three settings as follows: (1) Joint Model (JM): parameters are trained according to Eq. (1) with certain values of $\beta$.

Generating Symmetric and Asymmetric Bounds The choice of loss function, $l_z$, gives rise to two scenarios. If $l_z$ is an even function, e.g., $l_z(\hat{y}, y) = \|\hat{y} - y\|_1$, the meta-model targets $z$ capture the base error equally in both directions, above and below the target. Hence, the uncertainty $\hat{z}$ predicted at test time represents an interval symmetric around $\hat{y}$.
If, on the other hand, $l_z$ takes the sign of $z$ into account, it is possible to dedicate separate network nodes $\gamma_l, \gamma_u \in \gamma$ to capturing lower and upper band estimates, $\hat{z}^l$ and $\hat{z}^u$, respectively, thus accomplishing asymmetric prediction. Let $\delta = \hat{y} - y$. For the asymmetric scenario the meta objective is modified as follows:

$$\gamma^* = \arg\min_{\gamma} E\left[l_m(\hat{z}^l, \max\{\delta, 0\}) + l_m(\hat{z}^u, \max\{-\delta, 0\})\right] \qquad (2)$$

Prior work (Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Oh et al., 2020) applied a Gaussian model $N(\mu, \sigma^2)$ to the output of a neural network predictor, where $\mu$ represents the prediction and $\sigma^2$ its uncertainty due to observational (aleatoric) noise. The model is trained to minimize the negative log-likelihood (NLL), with the variance being an implicit uncertainty parameter (in that it is trained indirectly) which is allowed to vary across the feature space (heteroskedasticity). We apply the Gaussian in the sequential setting by placing it on the base decoder's output (replacing the meta decoder) and train the network using the NLL objective:

$$\phi^* = \arg\min_{\phi} E\left[\sum_{t=1}^{M} \sum_{d=1}^{D} \left(\frac{(\hat{y}_{t,d} - y_{t,d})^2}{\sigma^2_{t,d}} + \log \sigma^2_{t,d}\right)\right]$$

with $D$ output nodes modeling the regression variable, $\hat{y}_t$, and a separate $D$ output nodes modeling $\log \sigma^2_t$, at time $t$.

Variational Dropout Gal & Ghahramani (2016) established a connection between dropout (Srivastava et al., 2014), i.e., the process of randomly omitting network connections, and approximate Bayesian inference. We apply the variational dropout method to the base encoder and decoder. By performing multiple runs per test sequence, each with a different random dropout pattern, the base predictions are calculated as the mean and the base uncertainty as the variance over such runs. This Bayesian method, along with the variational approximation, captures the parametric (epistemic) uncertainty of the model; hence it differs fundamentally from the Gaussian model as well as from our proposed approach.
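The asymmetric meta targets of Eq. (2) and the Gaussian NLL objective can be sketched in a few lines of plain Python. The function names are hypothetical and the code operates on scalars and flat lists rather than on network tensors; it is meant only to make the sign convention $\delta = \hat{y} - y$ and the role of the log-variance output concrete.

```python
import math


def asymmetric_targets(y_hat, y):
    """Targets for the lower and upper meta outputs, following Eq. (2).

    With delta = y_hat - y, a positive delta means the prediction lies above
    the observation, so the lower band has to absorb the error.
    """
    delta = y_hat - y
    return max(delta, 0.0), max(-delta, 0.0)


def gaussian_nll(y_hat, y, log_var):
    """Sum over elements of (y_hat - y)^2 / sigma^2 + log sigma^2.

    log_var holds the network's log-variance outputs (log sigma^2), so
    sigma^2 = exp(log_var) is positive by construction.
    """
    total = 0.0
    for yh, yt, lv in zip(y_hat, y, log_var):
        total += (yh - yt) ** 2 / math.exp(lv) + lv
    return total
```

For example, `asymmetric_targets(3.0, 1.0)` assigns the whole error of 2.0 to the lower band and 0.0 to the upper band.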
GARCH Variance Introduced in (Bollerslev, 1986; Engle, 1982), the Generalized Autoregressive Conditional Heteroskedastic (GARCH) variance model is among the most popular statistical methods. A GARCH($p$, $q$) model assumes the series to follow an autoregressive moving average model and estimates the variance at time $t$ as a linear combination of the past $q$ squared residual terms, $\epsilon^2$, and the $p$ previous variances, $\sigma^2$:

$$\sigma^2_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon^2_{t-i} + \sum_{i=1}^{p} \beta_i \sigma^2_{t-i}.$$

The $\alpha_0$ term represents a constant component of the variance. The parameters $\alpha$, $\beta$ are estimated via maximum likelihood on a training set. The GARCH process relates to the concept in Figure 1 in that it acts as the meta model predicting the squared residual. We use GARCH as a baseline on only one of the datasets, for reasons discussed in Section 4.

Constant-Band Baseline A consistent comparison of uncertainty methods is difficult because each generates an uncertainty around different base predictions. Therefore, as a reference we also generate a constant symmetric band around each base predictor. Such a bound represents a homoskedastic process, a sensible choice in many well-behaved sequential regression problems, and corresponds to a GARCH(0,0) model. We use this reference point to compute a relative gain of each method, as explained in Section 3.3.
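To make the GARCH($p$, $q$) recursion concrete, the sketch below evaluates the conditional variance for a given residual series and fixed coefficients. It is not the estimation procedure used in the experiments (which relies on maximum likelihood); the function name and the `init_var` argument, used for terms that reach before the start of the series, are assumptions for this example.

```python
def garch_variance(residuals, alpha0, alphas, betas, init_var):
    """Conditional variances sigma2[t] of a GARCH(p, q) model.

    sigma2[t] = alpha0 + sum_{i=1..q} alphas[i-1] * residuals[t-i]**2
                       + sum_{i=1..p} betas[i-1]  * sigma2[t-i]

    Terms that would index before the start of the series fall back to
    init_var (an assumption of this sketch, not part of the model itself).
    """
    q, p = len(alphas), len(betas)
    sigma2 = []
    for t in range(len(residuals)):
        var = alpha0
        for i in range(1, q + 1):
            past_sq = residuals[t - i] ** 2 if t - i >= 0 else init_var
            var += alphas[i - 1] * past_sq
        for i in range(1, p + 1):
            past_var = sigma2[t - i] if t - i >= 0 else init_var
            var += betas[i - 1] * past_var
        sigma2.append(var)
    return sigma2
```

With empty coefficient lists the recursion reduces to the constant variance `alpha0`, which is the GARCH(0,0) case corresponding to the constant-band baseline.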

3.3. EVALUATION METHODOLOGY

Core Metrics Unlike classification tasks, where standard calibration-based metrics apply (Snoek et al., 2019), regression requires us to consider two aspects, roughly speaking: (1) to what extent observations fall outside the uncertainty bounds (Type 1 cost), and (2) how excessive the bounds are (Type 2 cost). An optimal bound captures all of the observations while being least excessive in terms of its bandwidth. Shen et al. (2018), among others, defined two measures reflecting these aspects (miss rate and bandwidth), which we adopt below (Eqs. (3) and (4)) while adding two refinements (Eqs. (5) and (6)). Let $\hat{y}^l = \hat{y} - \hat{z}^l$ and $\hat{y}^u = \hat{y} + \hat{z}^u$ denote the predicted lower and upper bound, respectively. Recall that $\hat{y} \in R^{D \times M}$. We define the following metrics:

$$\mathrm{Missrate}(\hat{y}^l, \hat{y}^u, y) = 1 - \frac{1}{MD} \sum_{d,t:\; y_{dt} \in [\hat{y}^l_{dt}, \hat{y}^u_{dt}]} 1 \qquad (3)$$

$$\mathrm{Bandwidth}(\hat{y}^l, \hat{y}^u, y) = \frac{1}{2MD} \sum_{d=1}^{D} \sum_{t=1}^{M} \left(\hat{y}^u_{dt} - \hat{y}^l_{dt}\right) \qquad (4)$$

$$\mathrm{Excess}(\hat{y}^l, \hat{y}^u, y) = \frac{1}{MD} \sum_{d,t:\; y_{dt} \in [\hat{y}^l_{dt}, \hat{y}^u_{dt}]} \min\left\{y_{dt} - \hat{y}^l_{dt},\; \hat{y}^u_{dt} - y_{dt}\right\} \qquad (5)$$

$$\mathrm{Deficit}(\hat{y}^l, \hat{y}^u, y) = \frac{1}{MD} \sum_{d,t:\; y_{dt} \notin [\hat{y}^l_{dt}, \hat{y}^u_{dt}]} \min\left\{|y_{dt} - \hat{y}^l_{dt}|,\; |y_{dt} - \hat{y}^u_{dt}|\right\} \qquad (6)$$

The Type 1 cost is reflected by the Missrate, Eq. (3), which counts the observations falling outside the bounds but not how far outside they fall; the Deficit, Eq. (6), captures this. The Type 2 cost is captured by the Bandwidth, Eq. (4). However, its range is indirectly compounded by the underlying variation in $\hat{y}$ and $y$. Therefore we propose the Excess measure, Eq. (5), which also reflects the Type 2 cost, but only the portion above the minimum bandwidth necessary to include the observation.

Calibration In general, DNNs offer few guarantees about the behavior of their output. DNNs tend to produce miscalibrated classification probabilities (Guo et al., 2017). In order to evaluate the uncertainty across models, it is necessary to establish a common operating point (OP). We achieve this via a scaling calibration. For symmetric bounds, we assume that $y = \hat{y} + \varepsilon \odot \hat{z}$, where $\varepsilon \in R^{D \times M}$ is a random i.i.d. matrix, $\odot$ denotes the element-wise product, and $\hat{z}$ is the predicted non-negative uncertainty band. Let $Z_{dt} = \frac{y_{dt} - \hat{y}_{dt}}{\hat{z}_{dt}}$.
Using a held-out dataset, we can obtain an empirical distribution in each output dimension: $\mathrm{cdf}(\{Z_{dt}\}_{1 \le t \le M})$, $d = 1, ..., D$. It is then possible to find the scale value $\varepsilon$ corresponding to a desired operating point, e.g., a target miss rate. This scaling is applied in our evaluation to compare the excess, bandwidth, and deficit at fixed miss rates, as well as to set a minimum-cost OP (see Section 3.3). An algorithm to find a scale factor for a desired value of any of the four metrics in $O(M^2)$ operations is given in the Appendix.
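The four core metrics of Eqs. (3) to (6) can be computed in a single pass over the data. The following minimal sketch assumes the $D \times M$ arrays have been flattened into plain Python lists of equal length; the function name and the dictionary return format are choices made for this example only.

```python
def interval_metrics(y, lower, upper):
    """Missrate, Bandwidth, Excess and Deficit, Eqs. (3)-(6).

    y, lower, upper are flat lists with one entry per (d, t) pair,
    so len(y) plays the role of M * D.
    """
    n = len(y)
    hits = 0
    band = 0.0
    excess = 0.0
    deficit = 0.0
    for yt, lo, up in zip(y, lower, upper):
        band += up - lo
        if lo <= yt <= up:
            hits += 1
            # distance to the nearer bound: bandwidth beyond what was needed
            excess += min(yt - lo, up - yt)
        else:
            # distance by which the observation falls outside the interval
            deficit += min(abs(yt - lo), abs(yt - up))
    return {
        "missrate": 1.0 - hits / n,
        "bandwidth": band / (2.0 * n),
        "excess": excess / n,
        "deficit": deficit / n,
    }
```

Note that widening the bounds trades Deficit and Missrate for Excess and Bandwidth, which is why the metrics are only comparable across systems at a common operating point.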

Metrics Used in Reporting

The following OP-based measures are used in reporting: (1) Excess, Deficit, and Bandwidth at a fixed Missrate, averaged over Missrate $\in \{0.1, 0.05, 0.01\}$, and (2) the minimum Excess-Deficit cost, where $\mathrm{cost} = \frac{1}{2}(\mathrm{Excess} + \mathrm{Deficit})$, with the minimum found over all calibrations (OPs). For each system and measure, $m_s$, a symmetric constant-band baseline, $m_{fixed}$, is also generated and a relative gain with respect to this reference is calculated:

$$\mathrm{gain}_s = 100 \times \frac{m_{fixed} - m_s}{m_{fixed}}\ \%.$$

Finally, the error rate of the base predictor is calculated as

$$E_{base}(\hat{y}, y) = \frac{1}{D} \sum_{d=1}^{D} \frac{\|\hat{y}_d - y_d\|_1}{\|y_d\|_1},$$

where $\hat{y}_d$, $y_d$ are the $d$-th row vectors of $\hat{y}$, $y$.
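The reporting quantities above are simple to compute once the core metrics are available. The sketch below uses hypothetical function names; the factor of one half in the cost follows the formula as read from the (partly garbled) source text and should be treated as an assumption.

```python
def relative_gain(m_fixed, m_s):
    """Relative gain (in percent) of a system over the constant-band reference."""
    return 100.0 * (m_fixed - m_s) / m_fixed


def excess_deficit_cost(excess, deficit):
    """Excess-Deficit cost at one operating point (assumed factor 1/2)."""
    return 0.5 * (excess + deficit)


def base_error(y_hat, y):
    """E_base: mean over output dimensions of ||y_hat_d - y_d||_1 / ||y_d||_1.

    y_hat and y are D x M nested lists (one row per output dimension).
    """
    total = 0.0
    for yh_row, y_row in zip(y_hat, y):
        num = sum(abs(a - b) for a, b in zip(yh_row, y_row))
        den = sum(abs(b) for b in y_row)
        total += num / den
    return total / len(y)
```

A positive gain means the system achieves a lower cost than the constant band; a gain of zero means it is no better than the homoskedastic reference.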

4. EXPERIMENTS

Datasets We experimented with two sequential regression datasets, namely the Metro Interstate Traffic Volume (MITV) dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/00492/) and the SPE9 Reservoir Production Rates (SPE9PR) dataset (https://developer.ibm.com/technologies/artificial-intelligence/data/oil-reservoir-simulations). Both originate from real-world applications, involve sequential input/output variables, and provide scenarios with varying degrees of difficulty. The MITV dataset is a collection of hourly weather features, with the regression target being the hourly traffic volume, recorded by the Minnesota DoT continuously between 2012 and 2018. The SPE9PR, on the other hand, is a large collection of mathematical simulations of a reservoir field with varying input and output sequences, each simulation comprising a sequence of 100 time steps. The regression targets in this case are multivariate and correspond to field production rates. The SPE9PR also contains a data partition collected under distributional drift. Full detail on both datasets, including their preprocessing, can be found in the Appendix.

Training Procedure Each dataset was partitioned into TRAIN, DEV, DEV2, and TEST sets (with SPE9PR also providing a TEST-drift set), as listed in Table 1. While the TRAIN/DEV partitions served basic training, DEV2 was used to determine hyperparameters and operating points (calibration). The TEST sets were used to produce the reported metrics. While a single sample in the SPE9PR represents a complete sequence of 100 steps, the MITV data come as a single contiguous sequence. The partitioning of the MITV set is strictly ordered by time, whereby the DEV sequence follows TRAIN, DEV2 follows DEV, and TEST follows DEV2. After partitioning, each MITV sequence was processed by a sliding window of length 36 hours (in 1-hour steps). This resulted in a series of $n - 35$ subsequences of 36 steps each, i.e., an $(n - 35) \times 36$ array ($n$ denotes the partition size), to feed the encoder-decoder model. When testing on the MITV, DNN predictions from such sliding windows were recombined into a contiguous prediction sequence.
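The sliding-window step described above can be sketched as follows. The helper name `sliding_windows` is hypothetical; the window length of 36 and the stride of one step come from the text, and a partition of size $n$ yields $n - 35$ windows.

```python
def sliding_windows(series, length=36):
    """Return all contiguous windows of `length` steps, moving one step at a time.

    A series of n elements yields n - length + 1 windows (n - 35 for length 36);
    a series shorter than `length` yields an empty list.
    """
    return [series[i:i + length] for i in range(len(series) - length + 1)]
```

For instance, a 40-hour partition produces five overlapping 36-hour windows, the first starting at hour 0 and the last at hour 4.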
The base encoder-decoder network (see Figure 2) is trained using the Adam optimizer (Kingma & Ba, 2014) with a varying initial learning rate, lr, in two stages: (1) Training of all parameters using TRAIN while providing the ground truth as the decoder input at each time step. (2) Building on the previous stage, the training continues, but decoder predictions from step $t - 1$ are fed as decoder inputs at step $t$, a mode referred to as emulation by Bengio et al. (2015). All hyperparameter values are listed in the Appendix.

JOINT MODEL, SYMMETRIC (JMS), AND ASYMMETRIC (JMA): The common training steps are first performed using the objective in Eq. (1) with $\beta = 1.0$. Then, the joint training continues with $\beta = 0.5$ as long as the objective improves on DEV. In a final step, the model switches to using DEV as training data with $\beta = 0.0$ until no improvement on TRAIN is seen. A similar procedure is followed for the JMA, except using Eq. (2).

JOINT MODEL WITH VARIANCE (JMV): The two common steps are performed using the NLL objective (see Section 3.2) with the variance-related parameters first fixed and, in a subsequent step, with the variance parameters allowed to be adjusted, until convergence. This is to aid stability in training (Nix & Weigend, 1994).

DROPOUT MODEL SYMMETRIC (DOMS): Dropout rates of 0.25 (inputs, outputs) and 0.1 (internal states), determined as best performing on the DEV2 set, were applied in both the base encoder and the decoder. The two common steps were performed. At test time, the model was run 10 times per test sequence to obtain the mean prediction and standard deviation.

GARCH: The GARCH model from Section 3.2 is used with $p = q = 5$. The lag value was determined using an autocorrelation chart showing attenuation at lags > 5. We only apply this baseline to the MITV dataset as it provides a contiguous time series. The model parameters were trained using the DEV2 partition.
Throughout the experiments, the size of each LSTM cell was kept fixed at 32 for the base encoder/decoder and at 16 for the meta decoder. The base sizing was largely driven by a preliminary study showing that it suffices to provide accurate base predictions.

Testing Procedure As mentioned above, the SPE9PR dataset has two TEST partitions: one for a matched and one for a drifted condition. While the MITV dataset does not provide an explicit source of drift, we induce drift by creating a discrepancy in the modeling procedure between training and test. In the non-drift condition, the DNN's decoder is given access to the past 12 hours' worth of traffic observations to make a forecast for the next 24 hours. This is achieved by spanning a 36-hour window and feeding the first 12 hours of ground truth as decoder inputs during training. To create the drift scenario, we test the MITV model without providing those first 12 hours of observations, so that the model uses its own predictions for that period instead. This emulates a "model drift" condition in that the model, trained to rely on actual observations, receives its own noisy predictions.

Results with Symmetric Bounds Table 2 compares the proposed symmetric-bounds systems (JMS, WBMS, BBMS) with the baselines (JMV, DOMS, GARCH). The relative error of the base predictor is given in the $E_{base}$ column. The uncertainty quality reported in Table 2 is the average gain in excess-deficit metrics, as defined in Section 3.3. Columns labeled $G^*$ contain measurements made at an operating point (OP) determined on the test set itself, while those labeled $G_x$ use an OP from a held-out (DEV2) set. While $G_x$ reflects generalization of the calibration, the $G^*$ values are interesting as they reveal the potential of each method. Based on a paired permutation test (Dwass, 1957), all entries except those marked with † are mutually significant at $p < 0.01$.
From Table 2 we make the following observations: (1) The JMS model dominates all other models across all conditions. The fact that it outperforms the WBMS indicates there is a benefit to the joint training setup, as conjectured earlier. (2) The WBMS dramatically outperforms the BBMS model, which remains only as good as a constant band for MITV data, indicating that it is hard to reliably predict residuals from the input features alone. (3) The most competitive baseline is the JMV model. As discussed in Section 3.2, the JMV shares some similarity with the meta-modeling approach. (4) The JMS and WBMS models perform particularly well in the strong drift scenario (SPE9PR), suggesting that white-box features play an essential role in achieving generalization. Finally, (5) the DOMS model works well on MITV data but provides no benefit on the SPE9PR, which could be due to the aleatoric uncertainty playing a dominant role in this dataset. In almost all cases, however, the averaging of base predictions in DOMS results in the lowest error rates of the base predictor.

Representative samples of JMS and JMV uncertainty bounds are shown in Figure 4 (MITV) and Figure 5 (SPE9PR). They illustrate a clear trend in the results, namely that the JMS model (and likewise the WBMS model) is better able to cover the actual observation, particularly when the base prediction tends to make large errors. Additional plots can be found in the Appendix, and a notebook to visualize all test samples is provided as part of the Supplementary Material.

Results with Asymmetric Bounds Generating asymmetric bounds is a new and intriguing aspect of DNN-based meta-models. Using the JMA model, we first recorded the accuracy with which the asymmetric output agrees in sign (orientation) with the observed base discrepancy. Averaged over each of the two datasets, this accuracy is 83.3% and 91.1%. The promise of asymmetric bounds lies in their potential to reduce the bandwidth cost.
Since the Excess and Deficit metrics ignore the absolute bandwidth, we also evaluate the JMA model using the Bandwidth metric (Eq. (4)), averaged over the same OPs. The results are shown in Table 3, comparing the JMA model to the best symmetric model, JMS. The JMS model outperforms JMA in all scenarios on Excess-Deficit. Compared on the Bandwidth metric, however, the JMA dominates, benefiting from its orientation capability. Upon visual inspection, the output of the JMA is appreciably better in bandwidth: Figure 6 shows samples on both datasets. In most instances the bounds behave as expected, expending the bulk of the bandwidth in the correct direction. An interesting question is whether the asymmetric output can be used as a correction to the base predictor. Our preliminary investigation shows that a naive combination leads to degradation in the base error; this question remains of interest for future work.

5. CONCLUSIONS

In this work we demonstrated that meta-modeling (MM) provides a powerful new framework for uncertainty prediction. Through a systematic evaluation of the proposed MM variants we report considerable relative gains over a constant reference baseline and show that they not only outperform all competitive baselines but also show stability across drift scenarios. A jointly trained model integrating the base with a meta component fares best, followed by a white-box setup, indicating that trainable white-box features play an essential role in the task. Besides symmetric uncertainty, we also investigated generating asymmetric bounds using dedicated network nodes and showed their benefit in reducing the uncertainty bandwidth. We believe these results open an exciting new research avenue for uncertainty quantification in sequential regression.

Under review as a conference paper at ICLR 2021

A ALGORITHM TO FIND A SCALING FACTOR

Section 3.3 discusses the scaling calibration in the context of the four metrics: Missrate, Bandwidth, Excess, and Deficit. Algorithm 1 finds a scale factor for a desired value of any of these four metrics in $O(M^2)$ operations.

Algorithm 1 Find best scale for a given metric value
Input: Observation, base and meta predictions $\{\hat{y}_t, y_t, \hat{z}^l_t, \hat{z}^u_t\}_{1 \le t \le M}$; metric function $f$; target value $\rho^*$
Output: Scale factor $\varepsilon^*$
for $t \leftarrow 1$ to $M$ do
    $\delta_t \leftarrow \hat{y}_t - y_t$
    $\varepsilon_t \leftarrow \delta_t / \hat{z}^l_t$ if $\delta_t \ge 0$, otherwise $\varepsilon_t \leftarrow -\delta_t / \hat{z}^u_t$
    for $k \leftarrow 1$ to $M$ do
        $\hat{y}^l_k \leftarrow \hat{y}_k - \varepsilon_t \hat{z}^l_k$
        $\hat{y}^u_k \leftarrow \hat{y}_k + \varepsilon_t \hat{z}^u_k$
    end for
    $\rho_t \leftarrow f(\{\hat{y}^l_k, \hat{y}^u_k, y_k\}_{1 \le k \le M})$
end for
$t^* \leftarrow \arg\min_t |\rho_t - \rho^*|$
$\varepsilon^* \leftarrow \varepsilon_{t^*}$
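Algorithm 1 translates almost directly into Python. The following sketch is one possible reading of it for a single output dimension: each observation proposes the scale that would place it exactly on the boundary, and the candidate whose metric value is closest to the target is returned. The function names, the flat-list inputs, and the assumption that all band predictions are strictly positive (so the divisions are defined) are choices made for this example; the upper bound is formed as $\hat{y} + \varepsilon \hat{z}^u$, in line with the bound definitions in Section 3.3.

```python
def missrate(lower, upper, y):
    """Fraction of observations outside [lower, upper], cf. Eq. (3)."""
    hits = sum(1 for lo, up, yt in zip(lower, upper, y) if lo <= yt <= up)
    return 1.0 - hits / len(y)


def find_scale(y_hat, y, z_lo, z_up, metric, target):
    """Return the candidate scale whose metric value is closest to `target`.

    Each time step t proposes the scale that puts observation t exactly on
    the bound; every candidate is then evaluated on all M steps, giving
    O(M^2) work overall. Assumes z_lo and z_up are strictly positive.
    """
    best_eps, best_gap = None, float("inf")
    for yh_t, y_t, zl_t, zu_t in zip(y_hat, y, z_lo, z_up):
        delta = yh_t - y_t
        eps = delta / zl_t if delta >= 0 else -delta / zu_t
        lower = [yh - eps * zl for yh, zl in zip(y_hat, z_lo)]
        upper = [yh + eps * zu for yh, zu in zip(y_hat, z_up)]
        gap = abs(metric(lower, upper, y) - target)
        if gap < best_gap:
            best_eps, best_gap = eps, gap
    return best_eps
```

Any of the four metrics can be passed as `metric` as long as it accepts `(lower, upper, y)`; ties between candidates are resolved in favor of the earliest time step.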

B.1.1 METRO INTERSTATE TRAFFIC VOLUME (MITV)

The dataset is a collection of hourly westbound-traffic volume measurements on Interstate 94 reported by the Minnesota DoT ATR station 301 between the years 2012 and 2018. These measurements are aligned with hourly weather features (provided by OpenWeatherMap) as well as holiday information, also part of the dataset. The target of regression is the hourly traffic volume. This dataset was released in May 2019. The MITV input features were preprocessed to convert all categorical features to trainable vector embeddings, as outlined in Figure 2. All real-valued features as well as the regression output were standardized before modeling (with the test predictions restored to their original range before calculating final metrics). Overall dataset statistics are listed in Table 1 and further processing steps are given in Section 4.

B.1.2 MITV PREPROCESSING

As described in Section B.1.1 and Table 1 , the MITV dataset comes with 8 input features, among which 3 are categorical. Here we list the relevant parsing and encoding steps used in our setup. The raw time stamp information was parsed to extract additional features such as day of the week, day of the month, year-day fraction, etc. Table 4 shows the corresponding list. Standardization was performed on the input as well as output, as per Table 4 , whereby the model predictions were transformed to their original range before calculating final metrics.
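The parsing and standardization steps above might look as follows in plain Python. This is a sketch under stated assumptions: the timestamp format string, the feature names, and the leap-year handling of the year-day fraction are hypothetical choices for illustration, and only a subset of the time-derived features is shown.

```python
import calendar
from datetime import datetime


def time_features(stamp):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' time stamp.

    The format string and the feature names are assumptions of this sketch.
    """
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    days_in_year = 366 if calendar.isleap(dt.year) else 365
    return {
        "day_of_week": dt.weekday(),   # integer in [0, 6], Monday = 0
        "month": dt.month - 1,         # integer in [0, 11]
        "frac_yday": dt.timetuple().tm_yday / days_in_year,  # in (0, 1]
    }


def standardize(values):
    """Zero-mean, unit-variance scaling; returns the statistics for later inversion."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values], mean, std


def destandardize(values, mean, std):
    """Map standardized predictions back to the original range."""
    return [v * std + mean for v in values]
```

In practice the mean and standard deviation would be estimated on the training partition and reused for the other partitions, so that predictions can be restored to their original range before metrics are computed.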

B.1.3 SPE9 RESERVOIR PRODUCTION RATES (SPE9PR)

This dataset originates from an application of oil reservoir modeling. A reservoir model (RM) is a space-discretized approximation of a geological region subject to modeling. Given a sequence of drilling actions (input), a physics-based PDE-solver (simulator) is applied to the RM to generate sequences of future production rates (oil, gas, water production), typically over long horizons (Killough, 1995). The objective is to train a DNN to accurately predict outputs on unseen input sequences. We used the publicly available SPE9 RM (https://github.com/OPM/opm-data/blob/master/spe9/SPE9.DATA), considered a reference for benchmarking reservoir simulation in the industry, and an open-source simulator (https://opm-project.org/) to produce 28,000 simulations, each with 100 randomized actions (varying type, location, and control parameters of a well) inducing production rate sequences over a span of 25 years, in 90-day increments, i.e., 100 time steps. Furthermore, the RM was partitioned into two regions, A and B. While most of the actions are located in region A, we also generated 1000 sequences with actions located in region B, thus creating a large degree of mismatch between training and test. The test condition in region B will be referred to as the "drift" scenario.

Table 4: MITV input and output features (table rows recovered from the extracted text; column headers are inferred from context, and the name and type of the first input feature were lost)

Feature          Type / range               Embedded  Emb. dim.  Standardized  Final dim.
INPUT
(missing)        (missing)                  Y         3          N             3
day of week      integer ∈ [0, 6]           Y         3          N             3
month            integer ∈ [0, 11]          Y         3          N             3
frac yday        real ∈ [1/365, 1]          N         -          Y             1
weather type     integer ∈ [0, 10]          Y         3          N             3
holiday type     integer ∈ [0, 11]          Y         3          N             3
temperature      real ∈ R                   N         -          Y             1
rain 1h          real ∈ R_0^+               N         -          Y             1
snow 1h          real ∈ R_0^+               N         -          Y             1
clouds all       real ∈ [0, 100]            N         -          Y             1
Total                                                                          20
OUTPUT
traffic volume   real ∈ R_0^+               N         -          Y             1
Total                                                                          1

B.1.4 SPE9PR PREPROCESSING

Table 5 lists details on the SPE9PR features (also refer to Section B.1.3 and Table 1). The SPE9PR dataset contains input sequences of actions and output sequences of production rates. An action (feature "type of well"), at a particular time, represents a decision whether to drill, and if so, what type of well to drill (an injector or a producer well), or not to drill (encoded by "0"); hence the cardinality is 3. In case of a drill decision, further specifications apply, namely the x- and y-location on the surface of the reservoir, local geological features at the site, and well control parameters. There are 15 vertical cells in the SPE9, each coming with 3 geological features (rel. permeability, rel. porosity, rock type); thus the local geology is a 45-dimensional feature vector at a particular (x, y) location. Finally, every well drilled so far may be controlled by a parameter called "Bottom-Hole Pressure" (BHP). Since we provision up to 100 wells of each of the two types, a 200-dimensional vector arises containing BHP values for these wells at any given time. Standardization was performed on the input as well as the output, as specified in Table 5, whereby the model predictions were transformed to their original range before calculating and reporting final metrics.

Hyperparameter selection: (1) The learning rate, regularization, batch size, and LSTM size were adopted from an unrelated experimental study performed on a modified SPE9 reservoir (Anonymized). (2) We used DEV2 to determine the dropout rates in the DOMS model. The value $\beta = 0.5$ was chosen ad hoc (as a midpoint between pure base and pure meta loss) without further optimization.

B.3 IMPLEMENTATION NOTES

All DNNs were implemented in TensorFlow 1.11. Training was done on a Tesla K80 GPU, with total training time ranging between 3 (MITV) and 24 (SPE9PR) hours. The GARCH Python implementation provided in the arch library was used.

C ADDITIONAL RESULTS

C.1 INDIVIDUAL METRICS

For a more detailed view of the averages in Table 2 , we show a split by the individual metrics in Table 9 , and, for the SPE9PR which has a total of four output variables, a split by the individual variables in Table 10 .

C.2 ADDITIONAL VISUALIZATIONS

In addition to the sample visualizations shown in Section 4 for the JMS, JMV, and JMA systems, here we show the same sections of the data and visualize the output of all systems. Figures 7 and 8 show the first simulation in the test set of the SPE9PR dataset, with all its output components, for the drift and non-drift condition, respectively. Figures 9 and 10 show the output on the MITV drift and non-drift condition, respectively. For each model, a miss rate value of 0.1 across the entire test set was used in the visualizations.

C.2.1 INTERACTIVE NOTEBOOK

We also provide an interactive notebook that allows for inspecting all system output on an arbitrary portion of the test data in both the non-drift and drift condition. Please refer to the README file within the zip-file uploaded as the Supplementary Material part of our submission.

C.2.2 JMV VARIANCE TUNING ON DEV DATA

Section 4 lists the individual training steps for each system. The meta-modeling arrangements use the DEV partition for tuning in a final step. The motivation for using a partition not included in training the base model is to avoid meta-training on biased targets, i.e., targets generated by the base model on its own training data. In this context, a question arises whether a similar tuning step could help the JMV model. We followed the training steps described in Section 4 and then updated the network nodes tied to the variance parameter while keeping the rest of the network fixed. Table 11 shows the results on the MITV dataset. The tuning step does not appear to yield a benefit: in all but the cross-validated drift case the gain decreases (albeit insignificantly) when applying the DEV-only tuning. We conjecture that the tuning step benefits the meta-model because the meta-model's prediction is directly supervised. In contrast, the variance in the JMV setting is learned implicitly and may not suffer from the "biased-target" problem mentioned above.

Table 9: Symmetric gains split by individual metric (compare to Table 2). Columns: Model, Deficit@0.01, Deficit@0.05, Deficit@0.1, Excess@0.01, Excess@0.05, Excess@0.1, MinCost, Average.



https://archive.ics.uci.edu/ml/machine-learning-databases/00492/
https://developer.ibm.com/technologies/artificial-intelligence/data/oil-reservoir-simulations
provided by OpenWeatherMap
https://github.com/OPM/opm-data/blob/master/spe9/SPE9.DATA
https://opm-project.org/



Figure 1: The concept of meta modeling

We now formalize the MM concept as it applies to sequential regression. Let $\hat{y} = F_\phi(x)$ be the base model function parameterized by $\phi$, where $x = x_1, \ldots, x_N$ and $\hat{y} = \hat{y}_1, \ldots, \hat{y}_M$ represent sequences of $N$ input feature vectors and $M$ $D$-dimensional output vectors, with $\hat{y} \in \mathbb{R}^{D \times M}$. Let $\hat{z} = G_\gamma(\hat{y}, x, \phi)$ denote the meta model, parameterized by $\gamma$, taking as input the predictions, the original features, and the parameters of the base to produce a sequence of error predictions, $\hat{z} \in \mathbb{R}^{D \times M}$. The parameters $\phi$ are obtained by solving an optimization problem, $\arg\min_\phi E[l_b(\hat{y}, y)]$, using a smooth loss function $l_b$, e.g., the Frobenius norm $l_b = \|\hat{y} - y\|_F^2$. Similarly, the parameters $\gamma$ are determined via $\arg\min_\gamma E[l_m(\hat{z}, z)] = \arg\min_\gamma E[l_m(\hat{z}, l_z(\hat{y}, y))]$, involving a loss $l_z$ quantifying the target error (residual) from the base model, and $l_m$ quantifying the prediction error of the meta model. In general, $l_b$, $l_z$, and $l_m$ may differ. The expectations are estimated using an available dataset. We used the $L_F^2$ norm for $l_b$ and $l_m$, and $L_1$ for $l_z$, as described in Section 4. Given differentiable loss functions and the DNN setting, the base and the meta model can be integrated in a single network (JM). In this case the parameters are estimated jointly via

(2) White-Box model (WB): base parameters $\phi$ are trained first, followed by parameters $\gamma$, also accessing $\phi$.
(3) Black-Box model (BB): same as (2) without access to $\phi$.
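The black-box arrangement can be illustrated with a minimal sketch under strong simplifying assumptions: linear least-squares models stand in for the encoder-decoder networks, the data are synthetic, the meta model sees only the features x, and the residual target uses the $L_1$ (absolute) error. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_bias(a):
    """Append a constant column so the linear models have an intercept."""
    return np.hstack([a, np.ones((a.shape[0], 1))])

def make_data(n):
    """Synthetic regression data whose noise level grows with x."""
    x = rng.uniform(0.0, 2.0, size=(n, 1))
    noise = rng.normal(size=(n, 1)) * (0.1 + 0.5 * x)
    return x, 3.0 * x + noise

x_train, y_train = make_data(2000)  # used for the base model only
x_dev, y_dev = make_data(2000)      # held out for the meta model

# Step 1: fit the base model parameters (phi) on the training partition.
phi, *_ = np.linalg.lstsq(add_bias(x_train), y_train, rcond=None)

def base_predict(x):
    return add_bias(x) @ phi

# Step 2: generate residual targets z on data the base model has not seen.
z_dev = np.abs(y_dev - base_predict(x_dev))

# Step 3: fit the meta model parameters (gamma) to predict z from x,
# without any access to phi (black box).
gamma, *_ = np.linalg.lstsq(add_bias(x_dev), z_dev, rcond=None)

def meta_predict(x):
    return add_bias(x) @ gamma

# The predicted error should grow with x, mirroring the noise model.
z_low = meta_predict(np.array([[0.1]]))[0, 0]
z_high = meta_predict(np.array([[1.9]]))[0, 0]
```

Generating the residual targets on a held-out partition follows the motivation given in Section C.2.2: residuals computed on the base model's own training data would be biased targets for the meta model.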

Figure 2: Encoder-Decoder architecture integrating a base and a meta model

Figure 3: Bandwidth, excess, and deficit costs.

Implicit Heteroskedastic Variance Lakshminarayanan et al. (2017); Kendall & Gal (2017); Oh et al. (

quantile $p$, e.g., $\varepsilon_d^{(0.95)}$, and construct the prediction bound at test time: $[\hat{y}_{dt} - \hat{z}_{dt}\,\varepsilon_d^{(p)},\ \hat{y}_{dt} + \hat{z}_{dt}\,\varepsilon_d^{(p)}]$. Assuming $Z_d$ is stationary, in expectation, this bound will contain the desired proportion $p$ of the observations, i.e., $E[\mathrm{Missrate}(\hat{y}_l, \hat{y}_u, y)] = 1 - p$.
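A minimal sketch of this bound construction follows. It rests on an assumption not confirmed by the text above: that $\varepsilon_d^{(p)}$ is the empirical $p$-quantile, per output dimension $d$, of the ratio between the observed absolute error and the predicted error on held-out data. The data are synthetic, arrays are laid out as (time, dimension), and all function names are hypothetical.

```python
import numpy as np

def quantile_scale(y, y_hat, z_hat, p):
    """Per-dimension empirical p-quantile of |y - y_hat| / z_hat (assumed definition)."""
    ratios = np.abs(y - y_hat) / z_hat
    return np.quantile(ratios, p, axis=0)

def prediction_bounds(y_hat, z_hat, eps):
    """Symmetric bound y_hat -/+ z_hat * eps, broadcast over time steps."""
    return y_hat - z_hat * eps, y_hat + z_hat * eps

def miss_rate(lower, upper, y):
    """Proportion of observations falling outside the bounds."""
    return float(np.mean((y < lower) | (y > upper)))

rng = np.random.default_rng(0)
T, D = 20000, 2
y_hat = rng.normal(size=(T, D))              # stand-in base predictions
z_hat = rng.uniform(0.5, 1.5, size=(T, D))   # stand-in error predictions
y = y_hat + z_hat * rng.normal(size=(T, D))  # stationary scaled residuals

half = T // 2
p = 0.95
# Estimate the scale on the first half, evaluate on the second half.
eps = quantile_scale(y[:half], y_hat[:half], z_hat[:half], p)
lower, upper = prediction_bounds(y_hat[half:], z_hat[half:], eps)
rate = miss_rate(lower, upper, y[half:])
```

Because the scaled residuals in this synthetic example are stationary by construction, the miss rate on the second half should land close to 1 - p; under drift the same procedure carries no such guarantee, which is the limitation the stationarity assumption points to.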

).

WHITE-BOX MODEL, SYMMETRIC (WBMS): The basic training is performed with β = 1.0. Next, only the meta model parameters are estimated using the DEV/TRAIN sets with β = 0.0.

BLACK-BOX MODEL, SYMMETRIC (BBMS): Base training is performed. The base model processes the DEV set to generate the residuals z. A separate encoder-decoder model is then trained using (x, z).

Figure 4: Sample of traffic volume predictions with uncertainty generated by the JMS and JMV models, along with a constant bound (around JMV) (miss rate set to 0.1 on TEST).

Figure 5: SPE9PR samples of oil (left) and water (right) production rates ("drift" scenario, miss rate set to 0.1 on TEST).

Figure 6: Samples of asymmetric bounds produced by the JMA. MITV sample (left) and SPE9PR (right) correspond to segments shown in Figure 4 and 5.

Figure 7: Sample from SPE9PR (simulation 0, drift condition), all components and systems shown.

Figure 8: Sample from SPE9PR (simulation 0, non-drift condition), all components and all systems shown.

Figure 9: Sample from MITV (drift condition), all systems shown.

Figure 10: Sample from MITV (non-drift condition), all systems shown.

Overview statistics of the MITV and the SPE9PR datasets

Figure 3 illustrates these metrics. The relative proportion of observations lying outside the bounds (miss rate) ignores the extent of the bound's shortfall. The Deficit, Eq. (

Relative optimum and cross-validated gains ($G^*$, $G_x$) using the Excess-Deficit metrics. $E_{base}$ denotes the base predictor's error. Within each column, elements marked † are in a statistical tie; all other values are mutually significant at p < 0.01. Column groups: $E_{base}$, %$G^*$, %$G_x$.

Relative optimum and cross-validated gains ($G^*$ and $G_x$) on Bandwidth and Excess-Deficit metrics for the asymmetric JMA model.

MITV Input and Output Specifications

SPE9PR Input and Output Specifications

Relative optimum and cross-validated gains ($G^*$, $G_{xval}$) for the MITV dataset, using the Excess-Deficit metrics. $E_{base}$ denotes the base predictor's error. Within each column, elements marked † are in a statistical tie; all other values are mutually significant at p <
Columns (MITV): System; $E_{base}$, %$G^*$, %$G_{xval}$ for each of match and drift.

Relative optimum and cross-validated gains ($G^*$, $G_{xval}$) for the SPE9PR dataset, using the Excess-Deficit metrics. $E_{base}$ denotes the base predictor's error. Within each column, elements marked † are in a statistical tie; all other values are mutually significant at p < 0.01.
Columns (SPE9PR): System; $E_{base}$, %$G^*$, %$G_{xval}$ for each of match and drift.

%G * , MITV "Non-Drift" Scenario

