MQTRANSFORMER: MULTI-HORIZON FORECASTS WITH CONTEXT DEPENDENT AND FEEDBACK-AWARE ATTENTION

Abstract

Recent advances in neural forecasting have produced major improvements in accuracy for probabilistic demand prediction. In this work, we propose novel improvements to the current state of the art, incorporating changes inspired by recent advances in Transformer architectures for natural language processing. We develop a novel decoder-encoder attention for context alignment, improving forecasting accuracy by allowing the network to study its own history based on the context for which it is producing a forecast. We also present a novel positional encoding that allows the neural network to learn context-dependent seasonality functions as well as arbitrary holiday distances. Finally, we show that current state-of-the-art MQ-Forecaster (Wen et al., 2017) models display excess variability by failing to leverage previous errors in the forecast to improve accuracy. We propose a novel decoder self-attention scheme for forecasting that produces significant improvements in the excess variation of the forecast.

1. INTRODUCTION

Time series forecasting is a fundamental problem in machine learning with relevance to many application domains including supply chain management, finance, healthcare analytics, and more. Modern forecasting applications require predictions of many correlated time series over multiple horizons. In multi-horizon forecasting, the learning objective is to produce forecasts for multiple future horizons at each time step. Beyond simple point estimation, decision-making problems require a measure of uncertainty about the forecasted quantity. Access to the full distribution is usually unnecessary, and several quantiles are sufficient (many problems in Operations Research use the 50th and 90th percentiles, for example). As a motivating example, consider a large e-commerce retailer with a system to produce forecasts of the demand distribution for a set of products at a target time T. Using these forecasts as an input, the retailer can then optimize buying and placement decisions to maximize revenue and/or customer value. Accurate forecasts are important, but, perhaps less obviously, forecasts that don't exhibit excess volatility as a target date approaches minimize costly bullwhip effects in a supply chain (Chen et al., 2000; Bray and Mendelson, 2012).

Recent work applying deep learning to time-series forecasting focuses primarily on the use of recurrent and convolutional architectures (Nascimento et al., 2019; Yu et al., 2017; Gasparin et al., 2019; Mukhoty et al., 2019; Wen et al., 2017). These are Seq2Seq architectures (Sutskever et al., 2014), which consist of an encoder that takes an input sequence and summarizes it into a fixed-length context vector, and a decoder that produces an output sequence. It is well known that Seq2Seq models suffer from an information bottleneck by transmitting information from encoder to decoder via a single hidden state. To address this, Bahdanau et al.
(2014) introduce a method called attention, allowing the decoder to take as input a weighted combination of relevant latent encoder states at each output time step, rather than using a single context vector to produce all decoder outputs. While NLP is the predominant application of attention architectures, in this paper we show how novel attention modules and positional embeddings can be used to introduce the proper inductive biases for probabilistic time-series forecasting into the model architecture. Despite these shortcomings, this line of work has led to major advances in forecast accuracy for complex problems, and real-world forecasting systems increasingly rely on neural networks. Accordingly, a need for black-box forecasting system diagnostics has arisen. Stine and Foster (2020b;a) use probabilistic martingales to study the dynamics of forecasts produced by an arbitrary forecasting system. Their tools can detect the degree to which forecasts adhere to the martingale model of forecast evolution (Heath and Jackson, 1994) and detect unnecessary volatility (above and beyond any inherent uncertainty) in the forecasts produced. Thus, Stine and Foster (2020b;a) describe a way to connect the excess variation of a forecast to accuracy misses against the realized target. While multi-horizon forecasting networks such as those of Wen et al. (2017) and Madeka et al. (2018) minimize quantile loss, their architectures do not explicitly handle excess variation, since forecasts made on any particular date are not aware of errors in forecasts made on previous dates. In short, such tools can detect flaws in forecasts, but the question of how to incorporate that information into model design is unexplored.

Our Contributions. In this paper, we are concerned with both improving forecast accuracy and reducing excess forecast volatility.
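To make the attention mechanism described above concrete, here is a minimal sketch of a single attention step in NumPy. The function name and the use of simple (unscaled) dot-product scores are our own illustrative choices, not a specific model's implementation:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return a weighted combination of encoder states for one decoder step.

    decoder_state:  (d,)   current decoder hidden state, used as the query
    encoder_states: (T, d) latent encoder states, one per input time step
    """
    # One relevance score per encoder time step (dot-product scoring).
    scores = encoder_states @ decoder_state            # (T,)
    # Softmax over time turns scores into attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context is a convex combination of encoder states.
    return weights @ encoder_states                    # (d,)
```

Because the weights are recomputed for every decoder step, each output can draw on a different summary of the input history, rather than a single fixed context vector.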
We present a set of novel architectures that seek to remedy some of the inductive biases currently missing from state-of-the-art MQ-Forecasters (Wen et al., 2017). The major contributions of this paper are:

1. Positional Encoding from Event Indicators: Current MQ-Forecasters use explicitly engineered holiday "distances" to provide the model with information about the seasonality of the time series. We introduce a novel positional encoding mechanism that allows the network to learn a seasonality function depending on other information about the time series being forecasted, and demonstrate that it is a strict generalization of conventional positional encoding schemes.

2. Horizon-Specific Decoder-Encoder Attention: Wen et al. (2017), Madeka et al. (2018), and other MQ-Forecasters learn a single encoder representation for all future dates and periods being forecasted. We present a novel horizon-specific decoder-encoder attention scheme that allows the network to learn a representation of the past that depends on which period is being forecasted.

3. Decoder Self-Attention for Forecast Evolution: To the best of our knowledge, this is the first work to consider the impact of network architecture design on forecast evolution. Importantly, we accomplish this by using attention mechanisms to introduce the right inductive biases, not by explicitly penalizing a measure of forecast variability. This allows us to maintain a single objective function without needing to make trade-offs between accuracy and volatility.

By providing MQ-Forecasters with the structure necessary to learn context-dependent encodings, we observe major increases in accuracy (3.9% in overall P90 quantile loss throughout the year, and up to 60% during peak periods) on our demand forecasting application, along with a significant reduction in excess volatility (a 52% reduction at P50 and 30% at P90).
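The idea behind horizon-specific decoder-encoder attention can be sketched as follows. This is a simplified illustration under our own assumptions (one learned query per forecast horizon, single-head unscaled dot-product attention); it is not the paper's exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def horizon_specific_contexts(horizon_queries, encoder_states):
    """Give each forecast horizon its own attention over the encoder history.

    horizon_queries: (H, d) one (learned) query vector per horizon
    encoder_states:  (T, d) latent encoder states over the past
    Returns (H, d): a separate summary of the past for each horizon,
    instead of a single shared encoder representation for all horizons.
    """
    scores = horizon_queries @ encoder_states.T        # (H, T)
    weights = softmax(scores, axis=-1)                 # per-horizon attention
    return weights @ encoder_states                    # (H, d)
```

In contrast, a standard MQ-Forecaster decoder would consume the same encoder summary for every horizon; here, near and distant horizons are free to emphasize different parts of the history.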
We also apply MQTransformer to four public datasets, and show parity with the state of the art on simple, univariate tasks. On a substantially more complex public dataset (retail forecasting) we demonstrate a 38% improvement over the previously reported state of the art, and improvements of 5% in P50 QL and 11% in P90 QL versus our baseline. Because our innovations are compatible with efficient training schemes, our architecture also achieves a significant speedup (several orders of magnitude greater throughput) over earlier transformer models for time-series forecasting.

2. BACKGROUND AND RELATED WORK

For a complete overview of deep learning approaches to forecasting, see Benidis et al. (2020).

2.1 TIME SERIES FORECASTING

Formally, the task considered in our work is the high-dimensional regression problem

p(y_{t+1,i}, \ldots, y_{t+H,i} \mid y_{:t,i}, x_{:t,i}, x_{t:,i}, x_i^{(s)}),

where y_{t+s,i}, y_{:t,i}, x_{:t,i}, x_{t:,i}, and x_i^{(s)} denote future observations of the target time series i, observations of the target time series observed up until time t, the past covariates, known future information, and static covariates, respectively.
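The quantile (pinball) loss that multi-horizon forecasters such as MQ-Forecaster minimize over these targets can be sketched as follows; the function name and array shapes are our own illustrative choices:

```python
import numpy as np

def quantile_loss(y_true, y_pred, quantiles):
    """Average pinball loss over horizons and quantiles.

    y_true:    (H,)   realized values for H forecast horizons
    y_pred:    (H, Q) predicted quantiles for each horizon
    quantiles: (Q,)   quantile levels, e.g. [0.5, 0.9] for P50/P90
    """
    diff = y_true[:, None] - y_pred                    # (H, Q)
    # Under-prediction is penalized by q, over-prediction by (1 - q).
    loss = np.maximum(quantiles * diff, (quantiles - 1.0) * diff)
    return loss.mean()
```

Minimizing this loss for each (horizon, quantile) pair yields calibrated quantile forecasts, but, as noted above, it says nothing by itself about how forecasts for a fixed target date should evolve as that date approaches.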

