UNCERTAINTY PREDICTION FOR DEEP SEQUENTIAL REGRESSION USING META MODELS

Anonymous authors
Paper under double-blind review

Abstract

Generating high-quality uncertainty estimates for sequential regression, particularly with deep recurrent networks, remains a challenging and open problem. Existing approaches often make restrictive assumptions (such as stationarity) yet still perform poorly in practice, particularly in the presence of real-world non-stationary signals and drift. This paper describes a flexible method that can generate symmetric and asymmetric uncertainty estimates, makes no assumptions about stationarity, and outperforms competitive baselines in both drift and non-drift scenarios. This work helps make sequential regression more effective and practical for use in real-world applications, and is a powerful new addition to the modeling toolbox for sequential uncertainty quantification in general.

1. INTRODUCTION

The ability to quantify the uncertainty of a model is one of the fundamental requirements of trusted, safe, and actionable AI (Arnold et al., 2019; Jiang et al., 2018; Begoli et al., 2019). This paper focuses on uncertainty quantification in regression tasks, particularly in the context of deep neural networks (DNN). We define a sequential task as one involving an ordered series of input elements, represented by features, and an ordered series of outputs. In sequential regression tasks (SRT), the output elements are (possibly multivariate) real-valued variables. SRT occur in numerous applications, among others in weather modeling, environmental modeling, energy optimization, and medical applications. When the cost of making an incorrect prediction is particularly high, such as where human safety is involved, models without reliable uncertainty estimation are perceived as high risk and may not be adopted.

Uncertainty prediction in DNNs has been the subject of active research, spurred in particular by what has become known as the "Overconfidence Problem" of DNNs (Guo et al., 2017) and by their susceptibility to adversarial attacks (Madry et al., 2017). However, the bulk of this work is concerned with non-sequential classification tasks (see Section 2), leaving a noticeable gap for SRT. In this paper we introduce a meta-modeling concept as an approach to achieving high-quality uncertainty quantification in DNNs for SRT. We demonstrate that it not only outperforms competitive baselines but also provides consistent results across a variety of drift scenarios. We believe the approach represents a powerful new addition to the modeling toolbox in general. The novel contributions of this paper are summarized as follows: (1) application of the meta-modeling concept to SRT, (2) development of a joint base-meta model, along with a comparison to white- and black-box alternatives, (3) generation of asymmetric uncertainty bounds in DNNs, and (4) a new evaluation methodology for SRT.

2. RELATED WORK

Classical statistics on time series offers an abundance of work dealing with uncertainty quantification (Papoulis & Saunders, 1989). Most notably in econometrics, a variety of heteroskedastic variance models have led to highly successful applications in financial market volatility analyses (Engle, 1982; Bollerslev, 1986; Mills, 1991). The Autoregressive Conditional Heteroskedastic (ARCH) model (Engle, 1982) and its generalized version, GARCH (Bollerslev, 1986), are two such methods, the latter of which serves as one of our baselines. An illuminating study (Kendall & Gal, 2017) describes an integration of two sources of uncertainty, namely the epistemic (due to the model) and the aleatoric (due to the data). The authors propose a variational approximation of Bayesian Neural Networks and an implicit Gaussian model to quantify both types of variability in non-sequential classification and regression tasks. Building on Nix & Weigend (1994), Lakshminarayanan et al. (2017) also use an implicit Gaussian model to improve the predictive performance of a base model, again in a non-sequential setting. Similar to Kendall & Gal (2017), that study does not focus on comparing the quality of the uncertainty to that generated by other methods. We use the implicit variance model of Kendall & Gal (2017); Oh et al. (2020); Lakshminarayanan et al. (2017), as well as the variational dropout method of Gal & Ghahramani (2016); Kendall & Gal (2017), as baselines in our work. A meta-modeling approach was taken in Chen et al. (2019), aiming at the task of instance filtering using white-box models. That work relates to ours through the meta-modeling concept but concentrates on classification in a non-sequential setting. Besides its application in filtering, meta-modeling has been widely applied to learning to learn and lifelong learning (Schmidhuber, 1987; Finn et al., 2019). However, these applications of meta-modeling are not comparable to ours due to their different objectives. Uncertainty under data drift conditions was assessed in a recent study (Snoek et al., 2019). The authors employ calibration-based metrics to examine various methods for uncertainty in classification tasks (image and text data) and conclude, among other things, that the quality of most methods degrades with drift. Acknowledging drift as an important experimental aspect, our study takes it into account by testing in matched and drifted scenarios. Finally, Shen et al. (2018) described a multi-objective training of a DNN for wind power prediction, minimizing two types of cost related to coverage and bandwidth. We expand on these metrics in Section 3.3.

3.1. META MODELING APPROACH

The basic concept of Meta Modeling (MM), depicted in Figure 1, involves a combination of two models: a base model, performing the main task (e.g., regression), and a meta model, learning to predict the base model's error behavior. Depending on the amount of information shared between the two, we distinguish several settings, namely (1) the base model is a black box (BB), (2) the base is a white box (WB; base parameters are accessible), and (3) the base and meta components are trained jointly (JM). The advantages of WB and JM are obvious: rich information is available for the meta model to capture salient patterns and generate accurate predictions. On the other hand, the BB setting often occurs in practice and is then a given.

Figure 1: The concept of meta modeling

We now formalize the MM concept as it applies to sequential regression. Let ŷ = F_φ(x) be the base model function parametrized by φ, where x = x_1, ..., x_N and ŷ = ŷ_1, ..., ŷ_M represent sequences of N input feature vectors and M D-dimensional output vectors, with ŷ ∈ R^(D×M). Let ẑ = G_γ(ŷ, x, φ) denote the meta model, parameterized by γ, taking as input the predictions, the original features, and the parameters of the base model to produce a sequence of error predictions, ẑ ∈ R^(D×M). The parameters φ are obtained by solving an optimization problem, arg min_φ E[l_b(ŷ, y)], using a smooth loss function l_b, e.g., the squared Frobenius norm l_b = ||ŷ − y||²_F. Similarly, the parameters γ are determined via arg min_γ E[l_m(ẑ, z)] = arg min_γ E[l_m(ẑ, l_z(ŷ, y))], involving a loss l_z quantifying the target error (residual) of the base model and a loss l_m quantifying the prediction error of the meta model. In general, l_b, l_z, and l_m may differ. The expectations are estimated using an available dataset. We used the squared Frobenius norm for l_b and l_m, and the L1 norm for l_z, as described in Section 4. Given differentiable loss functions and the DNN setting, the base and the meta model can be integrated in a single network (JM). In this case the parameters are estimated jointly via

φ*, γ* = arg min_{φ,γ} E[l_b(ŷ, y) + l_m(ẑ, l_z(ŷ, y))]   (1)
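To make the joint objective concrete, the following is a minimal sketch with scalar linear base and meta models trained by gradient descent on a toy heteroskedastic sequence. It is illustrative only, not the paper's implementation: the data, models, learning rate, and the simplification of treating the residual target l_z as a constant when updating the meta model are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heteroskedastic data: y = 2x + noise whose scale grows with |x|.
N = 2000
x = rng.uniform(-1, 1, N)
y = 2.0 * x + rng.normal(0.0, 0.1 + 0.5 * np.abs(x))

w = 0.0          # base model parameter (phi): y_hat = w * x
v, b = 0.0, 0.0  # meta model parameters (gamma): z_hat = v * |x| + b
lr = 0.05
for _ in range(300):
    y_hat = w * x                  # base prediction
    z_hat = v * np.abs(x) + b      # meta prediction of the error magnitude
    resid = y_hat - y
    l_z = np.abs(resid)            # L1 residual target for the meta model
    # Gradient steps on the combined objective E[l_b + l_m],
    # with l_b = resid^2 and l_m = (z_hat - l_z)^2 (l_z held constant).
    w -= lr * np.mean(2 * resid * x)
    v -= lr * np.mean(2 * (z_hat - l_z) * np.abs(x))
    b -= lr * np.mean(2 * (z_hat - l_z))

# After training, w approaches the true slope (~2.0) and z_hat increases
# with |x|, mirroring the growing noise scale.
```

The only coupling assumed here is that both losses are minimized simultaneously; in a DNN realization of JM, both terms would instead be differentiated through a shared network.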


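In the BB setting, a meta model can only be fit post hoc to the residuals of a fixed base predictor. One simple way to obtain asymmetric bounds, in the spirit of contribution (3), is to fit separate meta models to positive and negative residuals. The sketch below is a hypothetical illustration under assumed data and a toy black-box base model, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed black-box base model: we can query it but not inspect it.
def base_model(x):
    return 1.9 * x  # slightly biased predictor of the true relation y = 2x

x = rng.uniform(0, 1, 1000)
y = 2.0 * x + rng.normal(0.0, 0.3, 1000)
y_hat = base_model(x)
resid = y - y_hat

# Fit one linear meta model per residual sign for asymmetric bounds.
A = np.c_[x, np.ones_like(x)]
over = resid > 0  # cases where the base under-predicted
upper = np.linalg.lstsq(A[over], resid[over], rcond=None)[0]
lower = np.linalg.lstsq(A[~over], -resid[~over], rcond=None)[0]

def bounds(x_new):
    """Asymmetric (lower, upper) uncertainty bounds around the base output."""
    a = np.c_[x_new, np.ones_like(x_new)]
    return base_model(x_new) - a @ lower, base_model(x_new) + a @ upper
```

Because the two sides are modeled independently, a biased base model yields a wider band on the side where it errs more often, which a single symmetric variance term cannot express.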