MQTRANSFORMER: MULTI-HORIZON FORECASTS WITH CONTEXT DEPENDENT AND FEEDBACK-AWARE ATTENTION

Abstract

Recent advances in neural forecasting have produced major improvements in accuracy for probabilistic demand prediction. In this work, we propose novel improvements to the current state of the art by incorporating changes inspired by recent advances in Transformer architectures for natural language processing. We develop a novel decoder-encoder attention for context alignment, improving forecasting accuracy by allowing the network to study its own history based on the context for which it is producing a forecast. We also present a novel positional encoding that allows the neural network to learn context-dependent seasonality functions as well as arbitrary holiday distances. Finally, we show that current state-of-the-art MQ-Forecaster models (Wen et al., 2017) display excess variability by failing to leverage previous errors in the forecast to improve accuracy. We propose a novel decoder self-attention scheme for forecasting that produces significant improvements in the excess variation of the forecast.

1. INTRODUCTION

Time series forecasting is a fundamental problem in machine learning with relevance to many application domains, including supply chain management, finance, and healthcare analytics. Modern forecasting applications require predictions of many correlated time series over multiple horizons. In multi-horizon forecasting, the learning objective is to produce forecasts for multiple future horizons at each time step. Beyond simple point estimation, decision-making problems require a measure of uncertainty about the forecasted quantity. Access to the full distribution is usually unnecessary, and several quantiles suffice (many problems in Operations Research use the 50th and 90th percentiles, for example). As a motivating example, consider a large e-commerce retailer with a system that produces forecasts of the demand distribution for a set of products at a target time T. Using these forecasts as an input, the retailer can then optimize buying and placement decisions to maximize revenue and/or customer value. Accurate forecasts are important, but, perhaps less obviously, forecasts that don't exhibit excess volatility as a target date approaches minimize costly bullwhip effects in a supply chain (Chen et al., 2000; Bray and Mendelson, 2012). Recent work applying deep learning to time-series forecasting focuses primarily on recurrent and convolutional architectures (Nascimento et al., 2019; Yu et al., 2017; Gasparin et al., 2019; Mukhoty et al., 2019; Wen et al., 2017). These are Seq2Seq architectures (Sutskever et al., 2014), which consist of an encoder that summarizes an input sequence into a fixed-length context vector and a decoder that produces an output sequence. It is well known that Seq2Seq models suffer from an information bottleneck by transmitting information from encoder to decoder via a single hidden state. To address this, Bahdanau et al.
(2014) introduced a method called attention, allowing the decoder to take as input a weighted combination of relevant latent encoder states at each output time step, rather than using a single context to produce all decoder outputs. While NLP is the predominant application of attention architectures, in this paper we show how novel attention modules and positional embeddings can introduce the proper inductive biases for probabilistic time-series forecasting. Even with the shortcomings above, this line of work has led to major advances in forecast accuracy for complex problems, and real-world forecasting systems increasingly rely on neural nets. Accordingly, a need for black-box forecasting system diagnostics has arisen. Stine and Foster (2020b; a) use probabilistic martingales to study the dynamics of forecasts produced by an arbitrary forecasting system. These tools can detect the degree to which forecasts adhere to the martingale model of forecast evolution (Heath and Jackson, 1994) and detect unnecessary volatility (above and beyond any inherent uncertainty) in the forecasts produced. Thus, Stine and Foster (2020b; a) describe a way to connect the excess variation of a forecast to accuracy misses against the realized target. While multi-horizon forecasting networks such as (Wen et al., 2017; Madeka et al., 2018) minimize quantile loss, their architectures do not explicitly handle excess variation, since forecasts on any particular date are not made aware of errors in the forecasts for previous dates. In short, such tools can be used to detect flaws in forecasts, but the question of how to incorporate that information into model design is unexplored.

Our Contributions  In this paper, we are concerned with both improving forecast accuracy and reducing excess forecast volatility.
We present a set of novel architectures that supply some of the inductive biases currently missing from state-of-the-art MQ-Forecasters (Wen et al., 2017). The major contributions of this paper are:

1. Positional Encoding from Event Indicators: Current MQ-Forecasters use explicitly engineered holiday "distances" to provide the model with information about the seasonality of the time series. We introduce a novel positional encoding mechanism that allows the network to learn a seasonality function depending on other information about the time series being forecasted, and demonstrate that it is a strict generalization of conventional position encoding schemes.

2. Horizon-Specific Decoder-Encoder Attention: Wen et al. (2017); Madeka et al. (2018) and other MQ-Forecasters learn a single encoder representation for all future dates and periods being forecasted. We present a novel horizon-specific decoder-encoder attention scheme that allows the network to learn a representation of the past that depends on which period is being forecasted.

3. Decoder Self-Attention for Forecast Evolution: To the best of our knowledge, this is the first work to consider the impact of network architecture design on forecast evolution. Importantly, we accomplish this by using attention mechanisms to introduce the right inductive biases, not by explicitly penalizing a measure of forecast variability. This allows us to maintain a single objective function without trading off accuracy against volatility.

By providing MQ-Forecasters with the structure necessary to learn context-dependent encodings, we observe major increases in accuracy (3.9% in overall P90 quantile loss throughout the year, and up to 60% during peak periods) on our demand forecasting application, along with a significant reduction in excess volatility (a 52% reduction at P50 and 30% at P90).
We also apply MQTransformer to four public datasets, and show parity with the state of the art on simple, univariate tasks. On a substantially more complex public dataset (retail forecasting) we demonstrate a 38% improvement over the previously reported state of the art, and a 5% improvement in P50 QL and 11% in P90 QL versus our baseline. Because our innovations are compatible with efficient training schemes, our architecture also achieves a significant speedup (several orders of magnitude greater throughput) over earlier transformer models for time-series forecasting.

2. BACKGROUND AND RELATED WORK

2.1. MULTI-HORIZON FORECASTING

The generic probabilistic multi-horizon forecasting problem is to model

p(y_{t+1,i}, ..., y_{t+H,i} | y_{:t,i}, x_{:t,i}, x_{t:,i}, x^{(s)}_i),   (1)

where y_{t+s,i}, y_{:t,i}, x_{:t,i}, x_{t:,i}, and x^{(s)}_i denote future observations of the target time series i, observations of the target time series observed up until time t, the past covariates, known future information, and static covariates, respectively. For sequence modeling problems, Seq2Seq (Sutskever et al., 2014) is the canonical deep learning framework; although originally applied to neural machine translation (NMT) tasks, it has since been adapted to time series forecasting (Nascimento et al., 2019; Yu et al., 2017; Gasparin et al., 2019; Mukhoty et al., 2019; Wen et al., 2017; Salinas et al., 2020; Wen and Torkkola, 2019). The MQ-Forecaster framework (Wen et al., 2017) solves (1) by treating each series i as a sample from a joint stochastic process and feeding it into a neural network that predicts Q quantiles for each horizon. These models, however, inherit from the Seq2Seq architecture the limited contextual information available to the decoder as it produces each estimate ŷ^{(q)}_{t+s,i}, the q-th quantile of the distribution of the target y_{t+s,i}. Seq2Seq models rely on a single encoded context to produce forecasts for all horizons, imposing an information bottleneck and making it difficult for the model to understand long-term dependencies.
Our MQTransformer architecture, like other MQ-Forecasters, uses the direct strategy: the model outputs the quantiles of interest directly, rather than the parameters of a distribution from which samples are to be generated. This approach has been shown (Wen et al., 2017) to outperform parametric models, like DeepAR (Salinas et al., 2020), on a wide variety of tasks. Recently, Lim et al. (2019) considered an application of attention to multi-horizon forecasting, but their method still produces a single context for all horizons. Furthermore, by using an RNN decoder, their models do not enjoy the same scaling properties as MQ-Forecaster models. To the best of our knowledge, our work is the first to devise attention mechanisms for this problem that readily scale. Bahdanau et al. (2014) introduced the concept of an attention mechanism to solve the information bottleneck and sequence alignment problems in Seq2Seq architectures for NMT. Since then, attention has enjoyed success across a diverse range of applications including natural language processing (NLP), computer vision (CV) and time-series forecasting (Galassi et al., 2019; Xu et al., 2015; Shun-Yao Shih and Fan-Keng Sun and Hung-yi Lee, 2019; Kim and Kang, 2019; Cinar et al., 2017; Li et al., 2019; Lim et al., 2019). Many variants have been proposed, including self-attention and dot-product attention (Luong et al., 2015; Cheng et al., 2016; Vaswani et al., 2017; Devlin et al., 2019), and transformer architectures (end-to-end attention with no recurrent layers) achieve state-of-the-art performance on most NLP tasks. Time series forecasting applications exhibit seasonal trends, however, and the absolute position encodings commonly used in the literature cannot be applied directly.
Our work differs from previous work on relative position encodings (Dai et al., 2019; Huang et al., 2018; Shaw et al., 2018) in that we learn a representation from a time series of indicator variables which encode events relevant to the target application (such as holidays and promotions). If event indicators relevant to the application are provided, this imposes a strong inductive bias that allows the model to generalize well to future observations. Existing encoding schemes either involve feature engineering (e.g., sinusoidal encodings) or impose a maximum input sequence length; ours requires no feature engineering (the model learns it directly from raw data) and extends to arbitrarily long sequences.

2.2. ATTENTION MECHANISMS

In the vanilla transformer (Vaswani et al., 2017), a sinusoidal position embedding is added to the network input and each encoder layer consists of a multi-headed attention block followed by a feed-forward sub-layer. For each head h, the attention score between the query at position s and the key at position t is defined as follows for the input layer:

A^h_{s,t} = (x_s + r_s)^T W^{hT}_q W^h_k (x_t + r_t),

where x_s and r_s are the observation of the time series and the position encoding, respectively, at time s. Section 3 introduces attention mechanisms that differ in their treatment of the position-dependent biases. See Appendix A for additional discussion of attention mechanisms.
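As a concrete illustration, the score above can be computed for all query/key pairs at once; this is a minimal sketch (single head, no softmax or scaling, randomly shaped weights) rather than the authors' implementation:

```python
import numpy as np

def attention_scores(x, r, Wq, Wk):
    """A^h_{s,t} = (x_s + r_s)^T Wq^T Wk (x_t + r_t) for one head,
    computed for all (s, t) pairs at once.

    x, r : (T, d) observations and position encodings
    Wq, Wk : (d_k, d) query/key projection weights
    """
    z = x + r        # add position encodings to the inputs
    q = z @ Wq.T     # queries, shape (T, d_k)
    k = z @ Wk.T     # keys,    shape (T, d_k)
    return q @ k.T   # scores,  shape (T, T); entry (s, t) is A^h_{s,t}
```

With identity projections and zero position encodings, the scores reduce to the Gram matrix of the inputs, which makes the formula easy to sanity-check.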

2.3. MARTINGALE DIAGNOSTICS

Originally the martingale model of forecast evolution (MMFE) was conceived as a way to simulate demand forecasts used in inventory planning problems (Heath and Jackson, 1994). Denoting by Ŷ_{T|t} the forecast for Y_T made at time t ≤ T, the MMFE assumes that the forecast process {Ŷ_{T|t}}_t is a martingale. Informally, a martingale captures the notion that a forecast should use all information available to the forecasting system at time t. Mathematically, a discrete-time martingale is a stochastic process {X_t} such that E[X_{t+1} | X_t, ..., X_1] = X_t. We assume a working knowledge of martingales and direct the reader to Williams (1991) for a thorough treatment in discrete time. Taleb (2018) describes how martingale forecasts correspond to rational updating, an idea then expanded by Augenblick and Rabin (2019). Taleb (2018), Taleb and Madeka (2019) and Augenblick and Rabin (2019) go on to develop tests for forecasts that rule out martingality and indicate irrational or predictable updating for binary bets. Stine and Foster (2020a; b) further extend these ideas to quantile forecasts. Specifically, they consider the coverage probability process p_t := P[Y_T ≤ τ | Y_s, s ≤ t] = E[I(Y_T ≤ τ) | Y_s, s ≤ t], where τ denotes the forecast announced in the first period t = 0. Because {p_t} is also a martingale, the authors show that

E[(p_T - π)^2] = Σ_{t=1}^T E[(p_t - p_{t-1})^2] = π(1 - π),

where π = p_0 is the expected value of p_T, a Bernoulli random variable, across realizations of the coverage process. In the context of quantile forecasting, π is simply the quantile forecasted. The measure of excess volatility proposed is the quadratic variation process associated with {p_t}, Q_s := Σ_{t=1}^s (p_t - p_{t-1})^2. While this process is not a martingale, we do know that under the MMFE assumption, E[Q_T] = π(1 - π).
A second quantity of interest is the martingale V_t := Q_t - (p_t - π)^2, which follows the typical structure of subtracting the compensator to turn a sub-martingale into a martingale. In Section 4 we leverage the properties of {V_t} and {Q_t} to compare the dynamics of forecasts produced by a variety of models, demonstrating that our feedback-aware decoder self-attention units reduce excess forecast volatility.
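The two diagnostic processes are simple to compute from a realized coverage-probability path. The following sketch (a hypothetical helper, not the diagnostics code of Stine and Foster) computes Q_t and V_t as defined above:

```python
import numpy as np

def forecast_evolution_diagnostics(p, pi):
    """Given a coverage-probability path p_0, ..., p_T (with p_0 = pi),
    return the quadratic-variation process Q_t = sum_{u<=t} (p_u - p_{u-1})^2
    and the martingale V_t = Q_t - (p_t - pi)^2."""
    p = np.asarray(p, dtype=float)
    increments = np.diff(p) ** 2              # (p_t - p_{t-1})^2 for t >= 1
    Q = np.concatenate([[0.0], np.cumsum(increments)])
    V = Q - (p - pi) ** 2
    return Q, V
```

Under the MMFE, E[Q_T] = π(1 - π); a forecast whose average Q_T sits well above that benchmark exhibits excess volatility, which is exactly what Figure 2 visualizes via {V_t}.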

3. METHODOLOGY

As mentioned, this work is motivated in part by the needs of the consumers of forecasting systems. We therefore care about whether our innovations can be used in practice: our methodology must scale to forecasting tens of thousands to millions of signals, at hundreds of horizons. We extend the MQ-Forecaster family of models (Wen et al., 2017) because, unlike many other architectures considered in the literature, it can be applied at large scale (millions of samples) due to its use of forking sequences, a technique that dramatically increases the effective batch size during training and avoids expensive data augmentation. In this section we present our MQTransformer architecture, building upon the MQ-Forecaster framework. For ease of exposition, we reformulate the generic probabilistic forecasting problem in (1) as

p(y_{t+1,i}, ..., y_{t+H,i} | y_{:t,i}, x_{:t,i}, x^{(l)}_i, x^{(g)}, s_i),

where x_{:t,i} are past observations of all covariates, x^{(l)}_i = {x^{(l)}_{s,i}}_{s=1}^∞ are known covariates specific to time series i, and x^{(g)} = {x^{(g)}_s}_{s=1}^∞ are the global known covariates. In this setting, known signifies that the model has access to (potentially noisy) observations of past and future values. Note that this formulation is equivalent to (1), and that known covariates can be included in the past covariates x_{:t}. When it can be inferred from context, the time series index i is omitted.
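To make the forking-sequences idea concrete: every time step within a sampled series serves as its own forecast creation date, so a single series of length T yields T training examples rather than one. A minimal sketch of the target construction (a hypothetical helper, assuming the padding convention shown, not the authors' code):

```python
import numpy as np

def forked_targets(y, horizons):
    """Build a (T, H) target matrix: row t holds y[t+h] for each horizon h,
    so every time step doubles as a forecast creation date. Targets past
    the end of the series are NaN-padded and masked out of the loss."""
    T, H = len(y), len(horizons)
    out = np.full((T, H), np.nan)
    for t in range(T):
        for j, h in enumerate(horizons):
            if t + h < T:
                out[t, j] = y[t + h]
    return out
```

Because the loss is summed over all rows at once, one forward pass through the encoder trains the decoder at every forecast creation date, which is the source of the large effective batch size.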

3.1. LEARNING OBJECTIVE

We train a quantile regression model to minimize the quantile loss, summed over all forecast creation dates, horizons, and quantiles:

Σ_t Σ_q Σ_k L_q(y_{t+k}, ŷ^{(q)}_{t+k}), where L_q(y, ŷ) = q(y - ŷ)_+ + (1 - q)(ŷ - y)_+,

(·)_+ is the positive part operator, q denotes a quantile, and k denotes the horizon.
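The quantile (pinball) loss above can be written directly in a few lines; this is a minimal numpy sketch with hypothetical tensor shapes (T forecast creation dates, K horizons, Q quantiles):

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """L_q(y, y_hat) = q * (y - y_hat)_+ + (1 - q) * (y_hat - y)_+."""
    diff = y - y_hat
    return q * np.maximum(diff, 0.0) + (1.0 - q) * np.maximum(-diff, 0.0)

def total_quantile_loss(y_future, y_pred, quantiles):
    """Sum over forecast creation dates t, horizons k and quantiles q.
    y_future: (T, K) actuals; y_pred: (T, K, Q) predicted quantiles."""
    return sum(
        quantile_loss(y_future, y_pred[..., j], q).sum()
        for j, q in enumerate(quantiles)
    )
```

Note the asymmetry: for q = 0.9 an under-forecast of 2 units costs 1.8 while an over-forecast of 2 units costs only 0.2, which is what pushes the model's output toward the 90th percentile.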

3.2. NETWORK ARCHITECTURE

The design of the architecture is similar to MQ-RNN (Wen et al., 2017), and consists of encoder, decoder and position encoding blocks (see Figure 4 in Appendix B). The position encoding outputs, for each time step t, are a representation of global position information, r^{(g)}_t = PE^{(g)}_t(x^{(g)}), as well as time-series-specific context information, r^{(l)}_t = PE^{(l)}_t(x^{(l)}_i). Intuitively, r^{(g)}_t captures position information that is independent of the time series i (such as holidays), whereas r^{(l)}_t encodes time-series-specific context information (such as promotions). In both cases, the inputs are a time series of indicator variables and require no feature engineering or handcrafted functions. The encoder then summarizes past observations of the covariates into a sequence of hidden states h_t := encoder(y_{:t}, x_{:t}, r^{(g)}_{:t}, r^{(l)}_{:t}, s). Using these representations, the decoder produces an H × Q matrix of forecasts Ŷ_t = decoder(h_{:t}, r^{(g)}, r^{(l)}). Note that in the decoder, the model has access to position encodings. In this section we focus on our novel attention blocks and position encodings; the reader is directed to Appendix B for the other architecture details.

MQTransformer  Now we describe a design, evaluated in Section 4, following the generic pattern given above. We define the combined position encoding as r := [r^{(g)}; r^{(l)}]. In the encoder we use a stack of dilated temporal convolutions (van den Oord et al., 2016; Wen et al., 2017) to encode the historical time series and a multi-layer perceptron to encode the static features, as in (3):

h^1_t = TemporalConv(y_{:t}, x_{:t}, r_{:t}),   h^2_t = FeedForward(s),   h_t = [h^1_t; h^2_t].   (3)

Our decoder incorporates our horizon-specific and decoder self-attention blocks, and consists of two branches:

c^{hs}_{t,h} = HSAttention(h_{:t}, r),   c^a_t = FeedForward(h_t, r),   c_t = [c^{hs}_{t,1}; ...; c^{hs}_{t,H}; c^a_t],   c̃_{t,h} = DSAttention(c_{:t}, h_{:t}, r).   (4)
The first (global) branch summarizes the encoded representations into horizon-specific (c^{hs}_{t,h}) and horizon-agnostic (c^a_t) contexts. Formally, the global branch c_t := m_G(·) is given by (4). The output branch consists of a self-attention block followed by a local MLP, which produces outputs using the same weights for each horizon. For forecast creation time (FCT) t and horizon h, the output is given by (ŷ^1_{t+h}, ..., ŷ^Q_{t+h}) = m_L(c^a_t, c^{hs}_{t,h}, c̃_{t,h}, r_{t+h}), where c_{:t} denotes the output of the global branch up through the FCT t. Next we describe the specifics of our position encoding and attention blocks.

3.3. LEARNING POSITION AND CONTEXT REPRESENTATIONS FROM EVENT INDICATORS

Prior work typically uses one of two approaches to provide attention blocks with position information: (1) a handcrafted representation (such as sinusoidal encodings), or (2) a matrix M ∈ R^{L×d} of position encodings, where L is the maximum sequence length and each row corresponds to the position encoding for one time point. In contrast, our novel encoding scheme maps sequences of indicator variables to d-dimensional representations. For demand forecasting, this enables our model to learn an arbitrary function of events (like holidays and promotions) to encode position information. As noted above, our model includes two position encodings, r^{(g)}_t := PE^{(g)}_t(x^{(g)}) and r^{(l)}_t := PE^{(l)}_t(x^{(l)}), one shared among all time series i and one that is series-specific. For the design we use in Section 4, PE^{(g)} is implemented as a bidirectional 1-D convolution (looking both forward and backward in time) and PE^{(l)} is an MLP applied separately at each time step. Figure 1 shows an example of PE^{(g)} learned from holiday indicator variables. For reference, MQ-(C)RNN (Wen et al., 2017) uses linear holiday and promotion distances to represent position information.

Connection to matrix embeddings  Another way to view our position encoding scheme is as a form of set-based indexing into the rows of an infinite-dimensional matrix. We note that the traditional method of learning a matrix embedding M can be recovered as a special case of our approach. Consider a sequence of length L, and take x^{(g)} := [e_1, ..., e_L], where e_s denotes the vector in R^L with a 1 in the s-th position and 0s elsewhere. Defining PE^{matrix}_t(x^{(g)}) := x^{(g)T}_t M recovers the matrix embedding scheme. Thus our scheme is a strict generalization of the matrix embedding approach commonly used in the NLP community.
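The reduction to a matrix embedding is a one-liner: when the event indicators are the one-hot position vectors e_1, ..., e_L, applying the encoding is exactly a table lookup. A minimal sketch (the "learned" table M is a stand-in for trained weights):

```python
import numpy as np

def pe_matrix(x_onehot, M):
    """PE^matrix_t(x) := x_t^T M: recovers a learned L x d embedding
    matrix when the indicators are the one-hot vectors e_1, ..., e_L."""
    return x_onehot @ M

L, d = 4, 3
M = np.arange(L * d, dtype=float).reshape(L, d)  # stand-in for a learned table
x = np.eye(L)                                     # indicator sequence e_1..e_L
# Row t of pe_matrix(x, M) is exactly row t of M: one-hot indexing == lookup.
```

Replacing the one-hot vectors with arbitrary event indicators (multiple events active at once, events outside any fixed window of length L) is precisely what the general scheme permits and the matrix scheme does not.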

3.4. CONTEXT DEPENDENT AND FEEDBACK-AWARE ATTENTION

Horizon-Specific Decoder-Encoder Attention  Our horizon-specific attention mechanism is a multi-headed attention mechanism where the projection weights are shared across all horizons and each head corresponds to a different horizon. It differs from a traditional multi-headed attention mechanism in that its purpose is to attend over representations of past time points to produce a representation specific to the target period. In our architecture, the inputs to the block are the encoder hidden states and position encodings. Mathematically, for time s and horizon h, the attention weight for the value at time t is computed as in (5). There are two key differences between these attention scores and those in the vanilla transformer architecture: (a) the projection weights are shared by all H heads, and (b) the position encoding of the target horizon h is added to the query. The output of our horizon-specific decoder-encoder attention block, c^{hs}_{t,h}, is obtained by taking a weighted sum of the encoder hidden contexts, up to a maximum look-back of L periods, as in (6).

Decoder Self-Attention  The martingale diagnostic tools developed in (Stine and Foster, 2020b) indicate a deep connection between accuracy and volatility. We leverage this connection to develop a novel decoder self-attention scheme for multi-horizon forecasting. Throughout, let H(t, h) := {(s, r) | s + r = t + h} denote the set of (forecast time, horizon) pairs whose forecasts target the same date t + h. To motivate the development, consider a model whose forecasts alternate between 40 and 60 units when demand has constantly been 50 units. We would consider this model to have excess volatility. By contrast, a model forecasting 40 and 60 when demand actually jumps between 40 and 60 units would not be considered to have excess volatility. The first model fails to learn from its past forecasts: it continues jumping between 40 and 60 when the demand is 50 units. To ameliorate this, we need to pass information about previous forecast errors into the current forecast.
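The index set H(t, h) over which the decoder self-attention sums is easy to enumerate; this small sketch (a hypothetical helper with illustrative arguments, not the authors' code) makes the "same target date" constraint explicit:

```python
def same_target_pairs(t, h, horizons, fct_range):
    """H(t, h) = {(s, r) : s + r = t + h}: all (forecast creation time,
    horizon) pairs whose forecasts refer to the same target date t + h."""
    return [(s, r) for s in fct_range for r in horizons if s + r == t + h]
```

For example, the forecast made at FCT 5 for horizon 2 (target date 7) can attend on the forecast made at FCT 4 for horizon 3, since both refer to date 7; restricting attention to these pairs is what lets the model see its own earlier errors for the same target period.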
For each FCT t and horizon h, the model attends on the previous forecasts using a query containing the demand information for that period. The attention mechanism has a separate head for each horizon. Rather than attend on the demand information and prior outputs directly, a richer representation of the same information is used: the demand information at time t is incorporated via the encoded context h_t, and previous forecasts are represented via the corresponding horizon-specific context c^{hs}_{s,r} (in the absence of decoder self-attention, c^{hs}_{s,r} would be passed through the local MLP to generate the forecasts). Formally, the attention scores are given by (7), and the horizon-specific, feedback-aware outputs c̃_{t,h} are given by (8). Note that we sum only over previous forecasts of the same target period.

4. EXPERIMENTS

4.1. DEMAND FORECASTING

We conduct our experiments on a subset of products (∼2 million products) in the US store. Each model is trained using a single machine with 8 NVIDIA V100 Tensor Core GPUs, on three years of demand data (2015-2018); one year (2018-2019) is held out for back-testing.

Forecast Accuracy  Table 3 summarizes several key metrics that demonstrate the accuracy improvements achieved by adding our proposed attention mechanisms to the MQ-CNN architecture; the full set of results can be found in Appendix C. Introducing the horizon-specific decoder-encoder attention alone yields improvements along all metrics evaluated. Overall we see a 1.6% improvement in P50 QL and a 3.9% improvement in P90 QL. Notably, the attention mechanism yields significant improvements on short LT-SP (LT-SP 0/4). Further, Table 3 demonstrates improved performance on seasonal peaks and promotions. Observe that while MQ-CNN performs well on some seasonal peaks, it is also misaligned and fails to ramp down post-peak; post-peak ramp-down issues occur when the model continues to forecast high for target weeks after the peak week.
By including MQTransformer's attention mechanisms in the architecture, we see a 43% improvement for Seasonal Peak 1 and a 56% improvement on Post-Peak Rampdown. In retail, promotions are used to provide a demand lift for products. Accordingly, a model should be able to react to an upcoming promotion and forecast an accurate lift in demand for the target weeks in which the promotion is placed. From Table 3 we see that MQTransformer achieves a 49% improvement on items with Promotion Type 1 versus the baseline.

Forecast Volatility

We study the effect of our proposed attention mechanisms on excess forecast volatility using diagnostic tools recently proposed by Stine and Foster (2020b; a). Figure 2 plots the process {V_t} (see Section 2). In the plot, the lines should appear horizontal under the MMFE; any deviation above this (on an aggregate level) indicates excess volatility in the forecast evolution. While none of the models produce ideal forecasts, both attention models outperform the baseline, and the model with both proposed attention mechanisms performs best in terms of these evolution diagnostics. The green line corresponds to the attention model with only the horizon-specific decoder-encoder attention; compared to the baseline, this model achieves up to a 27% reduction in excess volatility at P50 and 7% at P90. Adding decoder self-attention yields a further reduction in excess volatility of an additional 20% at P50 and 21% at P90.

4.2. PUBLICLY AVAILABLE DATASETS

Following Lim et al. (2019), we consider applications to brick-and-mortar retail sales, electricity load, securities volatility and traffic forecasting. For the retail task, we predict the next 30 days of sales given the previous 90 days of history. This dataset contains a rich set of static, time-series, and known features. At the other end of the spectrum, the electricity load dataset is univariate. See Appendix D for additional information about these tasks. Table 4 compares MQTransformer's performance with other recent works: DeepAR (Salinas et al., 2020), ConvTrans (Li et al., 2019), MQ-RNN (Wen et al., 2017), and TFT (Lim et al., 2019). Our MQTransformer architecture is competitive with or beats the state of the art on the electricity load, volatility and traffic prediction tasks, as shown in Table 4. On the most challenging task, it dramatically outperforms the previously reported state of the art by 38% and the MQ-CNN baseline by 5% at P50 and 11% at P90. Because MQ-CNN and MQTransformer are trained using forking sequences, we can use the entire training population, rather than downsample as is required to train TFT (Lim et al., 2019); see Appendix D. To ascertain what portion of the gain is due to learning from more trajectories, versus our innovations alone, we retrain the optimal MQTransformer architecture using a random sub-sample of 450K trajectories (the same sampling procedure as TFT) and without forking sequences; the results are indicated in parentheses in Table 4. We observe that MQTransformer still dramatically outperforms TFT, and its performance is similar to the MQ-CNN baseline trained on all trajectories. Furthermore, on the Favorita retail forecasting task, Figure 3 shows that, as expected, MQTransformer substantially reduces excess volatility in the forecast evolution compared to the MQ-CNN baseline. Somewhat surprisingly, TFT exhibits much lower volatility than does MQTransformer.
In Figure 3, the right-hand plot displays quantile loss as the target date approaches; trajectories for each model are zero-centered to emphasize the trends exhibited. While TFT is less volatile, it is also less accurate.

Computational Efficiency  For purposes of illustration, consider the retail dataset. Prior work (Lim et al., 2019) was only able to make use of 450K out of 20M trajectories, and the optimal TFT architecture required 13 minutes per epoch (minimum validation error at epoch 6) using a single V100 GPU. By contrast, our innovations are compatible with forking sequences, so our architecture can make use of all available trajectories. MQTransformer requires only 5 minutes per epoch (on 20M trajectories) using a single V100 GPU (minimum validation error reached after 5 epochs). Some of the difference in runtime can be attributed to the use of different deep learning frameworks, but it is clear that MQTransformer (and other MQ-Forecasters) can be trained much more efficiently than models like TFT and DeepAR.

5. CONCLUSIONS AND FUTURE WORK

In this work, we presented a series of architectural innovations for probabilistic time-series forecasting that address bottlenecks in state-of-the-art MQ-Forecasters, including a novel context-alignment decoder-encoder attention and a decoder self-attention scheme tailored to the problem of multi-horizon forecasting. To the best of our knowledge, this is the first work to consider the impact of model architecture on forecast evolution. We also demonstrated how position embeddings can be learned directly from domain-specific event indicators, and how horizon-specific contexts can improve performance on difficult sub-problems such as promotions or seasonal peaks. Together, these innovations produced significant improvements in both the excess variation of the forecast and accuracy across different dimensions. Finally, we applied our model to several public datasets, where it outperformed the baseline architecture by 5% at P50 and 11% at P90, and the previously reported state of the art (TFT) by 38% on the most complex task. On the three less complex public datasets, our architecture achieved parity with or slightly exceeded previous state-of-the-art results. Beyond accuracy gains, by making our architectural innovations compatible with forking sequences, our model achieves massive increases in throughput compared to existing transformer architectures for time-series forecasting. An interesting direction we intend to explore in future work is incorporating encoder self-attention so that the model can leverage arbitrarily long historical time series, rather than the fixed length consumed by the convolution encoder.

A ADDITIONAL BACKGROUND AND RELATED WORK

A.1 ATTENTION MECHANISMS

Attention mechanisms can be viewed as a form of content-based addressing that computes an alignment between a set of queries and keys to extract a value. Formally, let q_1, ..., q_T, k_1, ..., k_T and v_1, ..., v_T be a series of queries, keys and values, respectively. The s-th attended value is defined as c_s = Σ_{t=1}^T score(q_s, k_t) v_t, where score is a scoring function, commonly score(u, v) := u^T v. In the vanilla transformer model, q_s = k_s = v_s = h_s, where h_s is the hidden state at time s. Because attention mechanisms have no concept of absolute or relative position, some form of position information must be provided. Vaswani et al. (2017) add a sinusoidal positional encoding to the input of each attention block, providing each token's position in the input sequence.

B MQTRANSFORMER ARCHITECTURE DETAILS

In this section we describe in detail the layers of our MQTransformer architecture, which is based on the MQ-Forecaster framework (Wen et al., 2017) and uses a WaveNet encoder (van den Oord et al., 2016) for time-series covariates. Before describing the layers in each component, Figure 4 outlines the MQTransformer architecture. On different datasets, we consider the following variations: the choice of encoding for categorical variables, a tunable hidden-layer dimension d_h, a dropout rate p_drop, a list of dilation rates for the WaveNet encoder, and a list of dilation rates for the position encoding. The ReLU activation function is used throughout the network.

B.1 INPUT EMBEDDINGS AND POSITION ENCODING

Static categorical variables are encoded using either one-hot encoding or an embedding layer. Time-series categorical variables are one-hot encoded and then passed through a single feed-forward layer of dimension d_h. The global position encoding module takes as input the known time-series covariates and consists of a stack of dilated, bi-directional 1-D convolution layers with d_h filters. Each convolution is followed by a ReLU activation and a dropout layer with rate p_drop. The local position encoding is implemented as a single dense layer of dimension d_h.
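A minimal numpy sketch of the global position encoding module under these assumptions (the kernel width of 3 and the specific dilation rates are hypothetical choices for illustration, not values from the paper):

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Non-causal ("bi-directional") 1-D convolution: each output step sees
    past and future inputs at the given dilation. x: (T, d_in), w: (k, d_in, d_out)."""
    k = w.shape[0]
    half = (k - 1) // 2 * dilation
    xp = np.pad(x, ((half, half), (0, 0)))   # zero-pad both ends
    T = x.shape[0]
    out = np.zeros((T, w.shape[2]))
    for tap in range(k):
        out += xp[tap * dilation : tap * dilation + T] @ w[tap]
    return out

def global_position_encoding(events, layers, p_drop=0.0, rng=None):
    """Stack of dilated convolutions over known event indicators, each
    followed by a ReLU and (during training) dropout with rate p_drop."""
    h = events
    for w, dilation in layers:
        h = np.maximum(dilated_conv1d(h, w, dilation), 0.0)  # conv + ReLU
        if p_drop > 0 and rng is not None:
            h *= (rng.random(h.shape) >= p_drop) / (1 - p_drop)
    return h

T, d_in, d_h = 30, 4, 16
rng = np.random.default_rng(0)
events = rng.random((T, d_in))                       # e.g. holiday indicators
layers = [(rng.standard_normal((3, d_in, d_h)) * 0.1, 1),
          (rng.standard_normal((3, d_h, d_h)) * 0.1, 2)]
pe = global_position_encoding(events, layers)
assert pe.shape == (T, d_h)
```

Because the convolution is centered rather than causal, the encoding at each step can reflect both upcoming and past events, which is what lets the network learn arbitrary holiday distances.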

B.2 ENCODER

After the categorical encodings are applied, the inputs are passed through the encoder block. The encoder consists of two components: a single dense layer that encodes the static features, and a stack of dilated temporal convolutions. The time-series covariates are concatenated with the position encoding to form the input to the convolution stack. The output of the encoder block is produced by replicating the encoded static features across all time steps and concatenating them with the output of the convolution.
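A minimal numpy sketch of how the encoder block assembles its output (the stand-in for the dilated convolution stack and all shapes are illustrative, not the paper's code):

```python
import numpy as np

def encode(static_enc, covariates, position_enc, conv_stack):
    """Encoder sketch: run the conv stack over [covariates; position encoding],
    then replicate the encoded static features across time and concatenate."""
    x = np.concatenate([covariates, position_enc], axis=-1)  # (T, d_cov + d_h)
    h = conv_stack(x)                                        # (T, d_h)
    s = np.tile(static_enc[None, :], (h.shape[0], 1))        # (T, d_s)
    return np.concatenate([h, s], axis=-1)                   # (T, d_h + d_s)

T, d_cov, d_h, d_s = 12, 4, 8, 3
out = encode(np.ones(d_s), np.zeros((T, d_cov)), np.zeros((T, d_h)),
             lambda x: x[:, :d_h])   # toy stand-in for the dilated conv stack
assert out.shape == (T, d_h + d_s)
```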

B.3 DECODER

Please see Table 1 for a description of the blocks in the decoder. The dimension of each head in both the horizon-specific and decoder self-attention blocks is d_h/8. The dense layer used to compute c_t^a has dimension d_h. The output block is a two-layer MLP with hidden layer dimension d_h, whose weights are shared across all time points and horizons. The output layer has one output per (horizon, quantile) pair.

Figure 4: MQTransformer architecture with learned global/local positional encoding, horizon-specific decoder-encoder attention, and decoder self-attention
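The shared output block described above admits a simple sketch (numpy, with illustrative shapes; not the paper's code):

```python
import numpy as np

def output_block(decoder_state, W1, b1, W2, b2, n_horizons, n_quantiles):
    """Two-layer MLP with weights shared across time points; the final layer
    has one output per (horizon, quantile) pair, reshaped to (T, H, Q)."""
    hidden = np.maximum(decoder_state @ W1 + b1, 0.0)   # ReLU hidden layer
    out = hidden @ W2 + b2                              # (T, H * Q)
    return out.reshape(decoder_state.shape[0], n_horizons, n_quantiles)

T, d_in, d_h, H, Q = 10, 32, 32, 5, 2    # Q = 2 quantiles, e.g. P50 and P90
rng = np.random.default_rng(1)
W1 = rng.standard_normal((d_in, d_h)); b1 = np.zeros(d_h)
W2 = rng.standard_normal((d_h, H * Q)); b2 = np.zeros(H * Q)
y = output_block(rng.standard_normal((T, d_in)), W1, b1, W2, b2, H, Q)
assert y.shape == (T, H, Q)
```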

D EXPERIMENTS ON PUBLIC DATASETS

In this section we describe the experiment setup used for the public datasets in Section 4.

D.1 DATASETS

We evaluate our MQTransformer on four public datasets. We summarize the datasets and preprocessing logic below; the reader is referred to Lim et al. (2019) for more details.

RETAIL

This dataset is provided by Corporación Favorita (a major grocery chain in Ecuador) as part of a Kaggle competition foot_5 to predict sales for thousands of items at multiple brick-and-mortar locations. In total there are 135K items (item, store combinations are treated as distinct entities), and the dataset contains a variety of features including local, regional and national holidays; static features about each item; and total sales volume at each location. The task is to predict log-sales for each (item, store) combination over the next 30 days, using the previous 90 days of history. The training period is January 1, 2015 through December 1, 2015. The following 30 days are used as a validation set, and the 30 days after that as the test set. These 30-day windows correspond to a single forecast creation time. While Lim et al. (2019) extract only 450K samples from the histories during the training window, there are in fact 20M trajectories available for training; because our models can produce forecasts for multiple forecast creation dates (FCDs) simultaneously, we train using all available data from the training window. For the volatility analysis presented in Figure 3, we used a 60-day validation window (March 1, 2016 through May 1, 2016), which corresponds to 30 forecast creation times.
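The forking-sequences training referenced above (producing forecasts for multiple FCDs simultaneously) can be illustrated with a toy target-construction sketch; the windowing details are simplified and the names are ours, not the paper's:

```python
import numpy as np

def forking_targets(series, horizon):
    """For every forecast creation time t, the targets are the next
    `horizon` values. One pass over the series therefore yields T training
    trajectories instead of one sampled window per series."""
    T = len(series) - horizon
    idx = np.arange(T)[:, None] + 1 + np.arange(horizon)[None, :]
    return series[idx]                  # shape (T, horizon)

series = np.arange(100.0)               # toy demand history
targets = forking_targets(series, horizon=30)
assert targets.shape == (70, 30)        # 70 FCDs, 30-step horizon each
```

This is why a single training window can expose 20M trajectories even though only 450K sampled windows were used in prior work.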

ELECTRICITY

This dataset consists of hourly load time series for 370 customers. The univariate data is augmented with day-of-week, hour-of-day and an offset from a fixed time point. The task is to predict the hourly load over the next 24 hours for each customer, given the past seven days of usage. From the training period (January 1, 2014 through September 1, 2014), 500K samples are extracted.

TRAFFIC

This dataset consists of lane occupancy information for 440 San Francisco area freeways. The data is aggregated to an hourly grain, and the task is to predict the hourly occupancy over the next 24 hours given the past seven days. The training period consists of all data before 2008-06-15, with the final 7 days used as a validation set. The 7 days immediately following the training window are used for evaluation. The model takes as input the lane occupancy, hour of day, day of week, hours from start, and an entity identifier.

VOLATILITY

The volatility dataset consists of 5-minute sub-sampled realized volatility measurements from 2000-01-03 to 2019-06-28. Using the past year's worth of daily measurements, the goal is to predict the next week's (5 business days) volatility. The period ending on 2015-12-31 is used as the training set, 2016-2017 as the validation set, and 2018-01-01 through 2019-06-28 as the evaluation set. The region identifier is provided as a static covariate, along with the time-varying covariates: daily returns, day-of-week, week-of-year and month. A log transformation is applied to the target.

D.2 TRAINING PROCEDURE

We tune only two hyper-parameters: the hidden layer size d_h ∈ {32, 64, 128} and the learning rate α ∈ {1 × 10^-2, 1 × 10^-3, 1 × 10^-4}. The model is trained using the ADAM optimizer (Kingma and Ba, 2015) with parameters β_1 = 0.9, β_2 = 0.999, ε = 1 × 10^-8 and a minibatch size of 256, for a maximum of 100 epochs with an early stopping patience of 5 epochs. We train a model for each hyperparameter setting in the search grid, select the one with the minimal validation loss, and report the selected model's test-set error in Table 4.
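A sketch of this tuning procedure; `train_fn` and `val_loss_fn` are hypothetical stand-ins for the real training and validation routines:

```python
import itertools

def tune(train_fn, val_loss_fn):
    """Grid search over the hidden size d_h and learning rate described
    above; keep the model with the lowest validation loss."""
    best = None
    for d_h, lr in itertools.product([32, 64, 128], [1e-2, 1e-3, 1e-4]):
        model = train_fn(d_h=d_h, learning_rate=lr)  # ADAM + early stopping inside
        loss = val_loss_fn(model)
        if best is None or loss < best[0]:
            best = (loss, model, (d_h, lr))
    return best

# Toy stand-ins so the loop is runnable: pretend validation loss is
# minimized at d_h = 64 and the smallest learning rate.
best = tune(lambda d_h, learning_rate: (d_h, learning_rate),
            lambda m: abs(m[0] - 64) + m[1])
assert best[2] == (64, 1e-4)
```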



Footnotes:
For a complete overview see Benidis et al. (2020).
In Wen et al. (2017), a figure shows MQ-CNN (labeled "MQ CNN wave") outperforming MQ-RNN (all variants) and DeepAR (labeled "Seq2SeqC") on the test set.
Results for TFT, MQ-RNN, DeepAR and ConvTrans are from Lim et al. (2019). For MQ-CNN and MQTransformer, we use their pre-processing and evaluation code to ensure parity.
Timing results were obtained by running the source code provided by Lim et al. (2019).
The original competition can be found here.



Figure 1: Position encoding learned from daily-grain event indicators

Figure 2: Martingale diagnostic process {V_t} averaged over all weeks in the test period (2018-2019)

MQTransformer encoder and decoder

Attention weight and output computations for blocks introduced in Section 3.4

Aggregate Quantile Loss Metrics We evaluate our architecture on a demand forecasting problem for a large-scale e-commerce retailer, with the objective of producing multi-horizon forecasts that span up to one year. Each horizon is specified by a (lead time, span) combination: the lead time (LT) is the number of periods from the FCT to the start of the horizon, and the span (SP) is the number of periods covered by the forecast. To assess the effects of each innovation, we ablate by removing components one at a time: MQTransformer is denoted Dec-Enc & Dec-Self Att; Dec-Enc Att contains only the horizon-specific decoder-encoder unit; and Baseline is the vanilla MQ-CNN model. MQ-CNN is selected as the baseline since prior work demonstrates that MQ-CNN outperforms MQ-RNN and DeepAR on this dataset and, as can be seen in Table 4, MQ-CNN similarly outperforms MQ-RNN and DeepAR on public datasets.

P50 (50th percentile) and P90 (90th percentile) QL on electricity and retail datasets, with the best results on each task emphasized. For the retail task, MQTransformer results in parentheses correspond to training without forking sequences, on 450K trajectories only.

MQ-CNN's forecast is also less accurate, as it fails to incorporate newly available information. By contrast, MQTransformer is both less volatile and more accurate when compared with MQ-CNN. See Appendix D for more details on the experiment setup and training procedure used.

52-week aggregate quantile loss metrics for a set of representative lead times and spans

P90 quantile loss metrics on seasonal peak target weeks

P90 quantile loss metrics on (item, week) pairs with promotions
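The P50/P90 quantile loss (QL) reported in these tables is the standard pinball loss; a minimal sketch:

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """Pinball loss for quantile q: under-forecasts (y > y_hat) are
    penalized by q, over-forecasts by (1 - q)."""
    diff = y - y_hat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y = np.array([10.0, 20.0, 30.0])
assert quantile_loss(y, y, 0.5) == 0.0
# At q = 0.9, under-forecasting by 1 costs 9x more than over-forecasting by 1:
assert quantile_loss(y, y - 1, 0.9) > quantile_loss(y, y + 1, 0.9)
```

Reported metrics typically normalize this loss by the total actuals; we omit the normalization here for brevity.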

C LARGE SCALE DEMAND FORECASTING EXPERIMENTS

C.1 EXPERIMENT SETUP

In this section we describe the details of the model architecture and training procedure used in the experiments on the large-scale demand forecasting application.

TRAINING PROCEDURE

Because we did not have enough history available to set aside a true holdout set, all models are trained for 100 epochs, and the final model is evaluated on the test set. For the same reason, no hyperparameter tuning was performed.

ARCHITECTURE AND HYPERPARAMETERS

The categorical variables consist of static features of the item, and the time-series categorical variables are event indicators (e.g. holidays). The parameters are summarized in Table 5. Tables 6, 7, and 8 contain the full set of results on the large-scale demand forecasting task. Per the discussion in Section 4, we use MQ-CNN as the baseline model; we did not compare to TFT due to scaling issues and the fact that, on the most similar public task (retail), MQ-CNN significantly outperformed TFT.

Quantile loss by horizon Table 6 demonstrates how the attention mechanism yields significant improvements for shorter LTSPs (e.g. LTSP 3/1 and LTSP 0/4): a 7% improvement in P90 QL for LTSP 3/1 and a 7.6% improvement in P90 QL for LTSP 0/4. We still see improvements for longer LTSPs, but they are less substantial: a 3.8% improvement in P90 QL for LTSP 12/3 and a 3.9% improvement in P90 QL for LTSP 0/33. By also adding decoder self-attention, we continue to see improved results for shorter LTSPs compared to decoder-encoder attention alone, but we do see slight degradations for longer LTSPs.

Promotions Performance In Table 8 we see that MQTransformer outperforms the prior state of the art on all promotion types. After adding the horizon-specific decoder-encoder and decoder self-attentions, we see, versus the baseline, a 49% improvement for Promotion Type 1 products, a 31% improvement for Promotion Type 2 products, and a 17% improvement for Promotion Type 3 products.

Peak Performance Table 3 illustrates that while MQ-CNN performs well on some seasonal peaks, it is also misaligned and fails to ramp down post-peak; ramp-down issues occur when the model continues to forecast high for target weeks after the peak week.
By including MQTransformer's attention mechanisms in the architecture, we see a 43% improvement for Seasonal Peak 1, a 21% improvement for Seasonal Peak 2, a 7% improvement for Seasonal Peak 3, and a 56% improvement on Post-Peak Rampdown.

D.3 ARCHITECTURE DETAILS

Our MQTransformer architecture used for these experiments contains a single tunable hyperparameter, the hidden layer dimension d_h. Dataset-specific settings are used for the dilation rates. For static categorical covariates we use an embedding layer with dimension d_h, and we use one-hot encoding for time-series covariates. A dropout rate of 0.15 and ReLU activations are used throughout the network. The only difference between this variant and the one used for the non-public large-scale demand forecasting task is the use of an embedding layer, rather than one-hot encoding, for static categorical covariates.

D.4 REPORTED MODEL PARAMETERS

The optimal parameters for each task are given in Table 9.

