DEEP PROBABILISTIC TIME SERIES FORECASTING OVER LONG HORIZONS

Abstract

Recent advances in neural network architectures for time series have led to significant improvements on deterministic forecasting metrics like mean squared error. We show that for many common benchmark datasets with deterministic evaluation metrics, intrinsic stochasticity is so significant that simply predicting summary statistics of the inputs outperforms many state-of-the-art methods, despite these simple forecasters capturing essentially no information from the noisy signals in the dataset. We demonstrate that using a probabilistic framework and moving away from deterministic evaluation offers a simple fix for this apparent misalignment between strong benchmark performance and poor understanding of the data. With simple and scalable approaches for uncertainty representation, we can adapt state-of-the-art architectures for point prediction into excellent probabilistic forecasters, outperforming complex probabilistic methods constructed from deep generative models (DGMs) on popular benchmarks. Finally, we demonstrate that our simple adaptations to point predictors yield reliable probabilistic forecasts on many additional problems of practical significance, namely large and highly stochastic datasets of climatological and economic data.



Figure 1: Predictions on exchange rate (left), ETTm2 (a sequence of electricity transformer temperature readings, center), and weather (right) for NHiTS (Challu et al., 2022b), Autoformer (Wu et al., 2021), and last-value predictions, as well as the historical standard deviation of the change from the last observed value. On the exchange (Lai et al., 2018) and ETTm2 (Zhou et al., 2021) datasets there is minimal structure to be exploited except on very short horizons, and forecasts tend to under-perform simple baselines. On semi-structured datasets like weather, models can capture some overall structure, such as NHiTS accurately predicting the final values in the forecasting window, but are still only on par with naive predictions. From these plots we see why probabilistic evaluation is necessary and point estimates are insufficient.

INTRODUCTION

Following deep learning's breakthroughs in sequence modeling for text and audio, significant research effort has sought to achieve comparable success in time series, where the unique challenges of data scarcity and long-range dependencies have created a niche for creative architecture design. In rapid succession, many new models and architectures have demonstrated improved point predictions on benchmarks adopted by the community (Challu et al., 2022b; Salinas et al., 2020; Wu et al., 2021). These methods hold incredible promise for real impact on time-series-based decision making, especially in economic domains that require highly accurate long-term predictions.

While demonstrating steady improvements, however, research on deep learning for point prediction frequently ignores a key but simple fact: the real world is complex, and predicting the future accurately from past observations alone is often impossible. In highly structured time series, such as observed traffic with strong periodicity from both daily and weekly cycles, we may be able to produce point predictions with a high degree of accuracy. However, in stochastic time series which display only modest structure (e.g. periodicity or seasonality), such as precipitation or wind speed patterns, we cannot hope to produce accurate predictions of specific future outcomes using only historical observations.

Researchers working in architecture design for time series frequently overlook this intrinsic stochasticity in benchmark datasets. In Figure 1 we show how even sophisticated methods struggle to forecast accurately on data with a low signal-to-noise ratio. In Table 1 we take this observation further and show that, shockingly, naive constant predictors outperform two state-of-the-art time series models (Lai et al., 2018; Zhou et al., 2021) on several widely reported MSE evaluations. The numbers shown are taken directly from Challu et al. (2022b), Wu et al. (2021), and Zhou et al. (2021), but include new columns showing the performance of simply predicting either the mean or the last value of the observations. Notably, these datasets span many domains of practical interest (finance, energy, and climatology), and contain varying levels of structure and periodicity. We present these surprising shortcomings of state-of-the-art models in order to encourage the adoption of more meaningful evaluations. Although these deep learning methods are extremely good at extracting and extrapolating trends and periodic structure far into the future, aspiring at best to match a constant prediction reflects a misalignment between evaluation and understanding. Meaningful comparisons require a baseline that excels in both highly predictable and stochastic environments.

Keeping in mind the need for probabilistic, rather than deterministic, frameworks for forecasting, we instead ask whether the strong trend-extrapolation performance of point prediction models is being overlooked in the probabilistic time series literature. For datasets that have both noise and structure (e.g. wind), we find that high-performing point predictors, such as NHiTS, typically outperform the mean forecasts provided by state-of-the-art probabilistic methods. In Figure 2 we show the cumulative mean squared error for forecasts produced by NHiTS and the mean prediction taken from several popular probabilistic frameworks on common benchmark datasets as well as real-world data.²

Motivated by improving probabilistic forecasting while retaining the strength in trend extrapolation found in some deterministic models, we explore adapting point predictors to the probabilistic setting rather than building new models entirely from the ground up. In particular, we examine two baselines for constructing high-performing probabilistic forecasters, built on quantile regression and heteroscedastic variance models (Section 3). These methods leverage advances in architecture design for time series modeling without succumbing to the pitfalls of point prediction. In Section 4, we provide a detailed comparison of our models with state-of-the-art deep learning methods for probabilistic forecasting, demonstrating that recent advances in architecture design are directly relevant to uncertainty quantification. Finally, in Section 5, we demonstrate that these simple adaptations yield reliable probabilistic forecasts on large and highly stochastic climatological and economic datasets.

² Models and datasets are described in Sections 4 and 5, along with plots of cumulative CRPS scores.

Table 1: Multivariate results with varying prediction lengths. Bolded results indicate the best performing model, and italics the second best. In all cases simple statistics of the input data are either the first or second best performing models in terms of both MSE and MAE accuracy. Historical Inertia (HI) (Cui et al., 2021) was also introduced as a trivial baseline but has worse performance than our constants.
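The constant baselines discussed above amount to only a few lines of code. The following sketch is purely illustrative (the `constant_baselines` helper and the random-walk toy data are our own hypothetical construction, not the paper's evaluation code): it builds the last-value and input-mean forecasts and compares their MSE on data with no exploitable structure.

```python
import numpy as np

def constant_baselines(context: np.ndarray, horizon: int):
    """Build the two naive forecasts from an input window.

    context: array of shape (batch, context_length).
    Returns last-value and mean forecasts, each of shape (batch, horizon).
    """
    last = np.repeat(context[:, -1:], horizon, axis=1)                  # repeat last observation
    mean = np.repeat(context.mean(axis=1, keepdims=True), horizon, axis=1)  # repeat window mean
    return last, mean

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

# Toy example: a pure random walk, where the last observed value is the
# optimal point forecast and the window mean lags behind the current level.
rng = np.random.default_rng(0)
steps = rng.normal(size=(64, 128)).cumsum(axis=1)
context, target = steps[:, :96], steps[:, 96:]
last, mean = constant_baselines(context, horizon=32)
assert mse(last, target) < mse(mean, target)
```

On such a series any learned model can at best tie the last-value forecast in expectation, which is exactly the failure mode Table 1 documents.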


[Table 1 body omitted: paired MSE and MAE columns per dataset, beginning with Exchange.]
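Both adaptations of a point predictor come down to swapping its training loss. The sketch below is a minimal NumPy illustration, assuming a standard pinball loss for a quantile head and a Gaussian negative log-likelihood for a heteroscedastic mean/variance head; the function names are ours, not from this paper.

```python
import numpy as np

def pinball_loss(pred_q: np.ndarray, target: np.ndarray, q: float) -> float:
    """Quantile (pinball) loss: minimized when pred_q is the q-th quantile of target."""
    diff = target - pred_q
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def gaussian_nll(mean: np.ndarray, var: np.ndarray, target: np.ndarray) -> float:
    """Heteroscedastic Gaussian negative log-likelihood with a per-step variance head."""
    return float(np.mean(0.5 * (np.log(2 * np.pi * var) + (target - mean) ** 2 / var)))

# Sanity checks on synthetic unit-Gaussian targets: the median (q = 0.5) forecast
# at the true center beats a biased one, and the NLL penalizes overconfident variance.
rng = np.random.default_rng(1)
target = rng.normal(loc=0.0, scale=1.0, size=10_000)
assert pinball_loss(np.zeros_like(target), target, q=0.5) < pinball_loss(np.full_like(target, 2.0), target, q=0.5)
assert gaussian_nll(0.0, np.ones_like(target), target) < gaussian_nll(0.0, np.full_like(target, 0.1), target)
```

Training the same backbone against either loss leaves the architecture untouched; only the output head and objective change.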

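Probabilistic forecasts of this kind are commonly scored with the continuous ranked probability score (CRPS). As a generic illustration (the standard sample-based, energy-form estimator, not code from this paper), a single time step can be scored as follows.

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    samples: 1-D array of forecast draws for one time step.
    Lower is better; for a degenerate (point) forecast it reduces to absolute error.
    """
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return float(term1 - term2)

# A forecast distribution centered on the outcome scores better than a shifted one.
rng = np.random.default_rng(2)
good = rng.normal(loc=0.0, scale=1.0, size=2_000)
bad = rng.normal(loc=3.0, scale=1.0, size=2_000)
assert crps_from_samples(good, y=0.0) < crps_from_samples(bad, y=0.0)
```

Unlike MSE, this score rewards a well-calibrated spread as well as an accurate center, which is why the cumulative CRPS curves referenced above are the more meaningful comparison on noisy data.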
