VQ-TR: VECTOR QUANTIZED ATTENTION FOR TIME SERIES FORECASTING

Abstract

Modern time series datasets can easily contain hundreds or thousands of time points; however, Transformer based models scale poorly with the sequence length, constraining their context size in the seq-to-seq setting. In this work, we introduce VQ-TR, which maps large sequences to a discrete set of latent representations as part of the Attention module. This allows us to attend over larger context windows with linear complexity with respect to the sequence length. We compare this method with other competitive deep learning and classical univariate probabilistic models and highlight its performance using both probabilistic and point forecasting metrics on a variety of open datasets from different domains.

1. INTRODUCTION

Deep learning based probabilistic time series forecasting models (Benidis et al., 2022) typically consist of one component that learns representations of a time series' history, and another component that learns some emission model (point or probabilistic) conditioned on this representation of the history. One can typically use Recurrent Neural Networks (RNNs), Causal Convolutional networks, or the currently popular architecture, namely Transformers (Vaswani et al., 2017), to learn historical representations. Transformers offer a good inductive bias for the forecasting task (Zhou et al., 2021), as they can look back over the full context history of a time series, while RNNs suffer from forgetting and convolutional networks have limited temporal receptive fields for their parameter count. Transformers, however, suffer from quadratic memory and compute complexity in the length of the sequence over which they learn representations. This constrains the size of the contexts over which Transformer based models can make predictions, which can hinder these models from learning long-range dependencies, and it also constrains the depth of these models, leading to poorer representations being learnt. A number of techniques have been developed to deal with this issue by reducing the computation or reducing the sequence length. In this work we start with an observation on the approximation of the Attention module in the Transformer and design a novel encoder-decoder Transformer based architecture for the forecasting use case which is linear in its computational and memory use with respect to the sequence size. We do this by incorporating a Vector Quantization module (van den Oord et al., 2017) into the Transformer, which allows us to learn discrete representations in a hierarchical fashion.

2. BACKGROUND

2.1. PROBABILISTIC TIME SERIES FORECASTING

The task of probabilistic time series forecasting in the univariate setting consists of training on a dataset of $D \geq 1$ time series $\mathcal{D}_\text{train} = \{x^i_{1:T^i}\}$, where $i \in \{1, \ldots, D\}$ and at each time point $t$ we have $x^i_t \in \mathbb{R}$ or $\mathbb{N}$. We are tasked with predicting the potentially complex distribution of the next $P > 1$ time steps into the future, and we are given a back-testing test set $\mathcal{D}_\text{test} = \{x^i_{T^i+1:T^i+P}\}$. Each time index $t$ is in reality a date-time value which increments regularly based on the frequency of the dataset in question, and the last training point $T^i$ of each time series may or may not be the same date-time. Autoregressive models like (Graves, 2013; Salinas et al., 2019b) estimate the prediction density by decomposing the joint distribution of all $P$ points via the chain rule of probability as
$$p_X(x^i_{T^i+1:T^i+P}) \approx \prod_{t=1}^{P} p(x^i_{T^i+t} \mid x^i_{1:T^i-1+t}, c^i_{1:T^i+P}; \theta),$$
parameterized by some model with trained weights $\theta$. This requires that the next time point is conditioned on all of the past and on covariates $c^i_t$ (detailed in Section 3.3), which is computationally challenging to scale, especially if the time series has a large history. Models like DeepAR (Salinas et al., 2019b) typically resort to the seq-to-seq paradigm (Sutskever et al., 2014) and consider a context window of fixed size $C$, sampled randomly from the full time series history, to learn a historical representation, which the decoder then uses to learn the distribution of the time points following the context. This does however mean that the model falls short of capturing long-range temporal dependencies in its prediction, which can lead to a worse approximation of the future distribution.
Encoder-decoder Transformers (Vaswani et al., 2017) naturally fit the seq-to-seq paradigm, where $N$ encoding Transformer layers can be used to learn a context-window-sized sequence of representations:
$$\{h_t\}_{t=1}^{C-1} = \mathrm{Enc} \circ \cdots \circ \mathrm{Enc}(\{\mathrm{concat}(x^i_t, c^i_{t+1})\}_{t=1}^{C-1}; \theta).$$
Afterwards, $M$ layers of a causal or masked decoding Transformer can be used to model the subsequent $P$ future time points conditioned on the encoder representations as
$$\prod_{t=C}^{C+P-1} p(x^i_{t+1} \mid x^i_{t:C}, c^i_{t+1:C+1}, h_1, \ldots, h_{C-1}; \theta).$$
For example, if we assume the data comes from a Gaussian, then the outputs of the Transformer's $M$ decoder layers can be passed to a layer which returns appropriately signed parameters of a Gaussian whose log-likelihood, given by
$$\sum_{t=C}^{C+P-1} \log p_\mathcal{N}(x^i_{t+1} \mid x^i_{t:C}, c^i_{t+1:C+1}, h_1, \ldots, h_{C-1}; \theta),$$
can be maximized for all $i$ and $t$ from $\mathcal{D}_\text{train}$ using stochastic gradient descent (SGD), as detailed in Section 3.1. Although Transformers offer a better alternative to recurrent neural networks (RNNs), like the LSTM (Hochreiter & Schmidhuber, 1997), their attention cost grows quadratically with the context size.

2.2. VECTOR QUANTIZATION

The VQ-VAE (van den Oord et al., 2017) consists of an encoder that maps inputs onto a set of $J \geq 1$ discrete latent variables called the codebook $\{z_1, \ldots, z_J\}$, and a decoder that reconstructs the inputs from the resulting discrete vectors. An input vector is quantized with respect to its distance to its nearest codebook vector:
$$\mathrm{Quantize}(q) := z_n, n \quad \text{where} \quad n = \arg\min_j \|q - z_j\|_2. \quad (1)$$
The codebook is learned by back-propagation of the gradient coming upstream of the VQ module and, due to the non-differentiable arg-min operation, one uses the Straight-Through gradient estimator (Hinton et al., 2012; Bengio et al., 2013) to copy the gradients downstream. Additionally the VQ has two extra losses, namely a latent loss which encourages the alignment of the codebook vectors to the inputs of the VQ, and a commitment loss which penalizes the inputs for switching codebook vectors too frequently.
This is done via the "stop-gradient" or "detach" operator of deep learning frameworks, denoted $\mathrm{sg}(\cdot)$, which blocks gradients from flowing into its argument. The two additional VQ losses can thus be written as
$$\|\mathrm{sg}(q) - z\|_2^2 + \beta \|\mathrm{sg}(z) - q\|_2^2, \quad (2)$$
where $\beta$ is the hyperparameter weighting the commitment loss. Since the optimal codes would be the $k$-means clusters of the input representations, van den Oord et al. (2017) propose an exponential moving average training scheme for the latents instead of the latent loss (the first term of (2)). Additionally, to aid learning, Jukebox (Dhariwal et al., 2020) proposes to replace codebook vectors whose exponential moving average cluster size falls below some threshold with a random incoming vector from the batch.
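The quantization step and the two auxiliary losses above can be sketched as a minimal NumPy forward pass (illustrative only; in an autodiff framework $\mathrm{sg}(\cdot)$ corresponds to `detach`, and the straight-through estimator copies the gradient of the quantized output back to the input):

```python
import numpy as np

def quantize(q, codebook):
    """Map each input vector to its nearest codebook vector.

    q:        (T, F) input vectors
    codebook: (J, F) codebook {z_1, ..., z_J}
    Returns the quantized vectors and their codebook indices.
    """
    # Squared L2 distance between every input and every codebook vector.
    dists = ((q[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, J)
    indices = dists.argmin(axis=1)                                 # (T,)
    return codebook[indices], indices                              # (T, F), (T,)

def vq_losses(q, z, beta=0.25):
    """Latent loss ||sg(q) - z||^2 plus commitment loss beta * ||sg(z) - q||^2.

    NumPy has no autograd, so sg(.) is a no-op here; in PyTorch one would
    write ((q.detach() - z) ** 2).mean() and beta * ((z.detach() - q) ** 2).mean().
    """
    latent = ((q - z) ** 2).mean()
    commit = beta * ((z - q) ** 2).mean()
    return latent + commit

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))         # 6 inputs of dimension F = 4
codebook = rng.normal(size=(3, 4))  # J = 3 codebook vectors
z, idx = quantize(q, codebook)
loss = vq_losses(q, z)
```

The exponential-moving-average variant would replace the latent term by updating each codebook vector toward the mean of the inputs assigned to it.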

3. VQ-TR MODEL

We motivate this method with an observation on the effect of approximating the query vector in self-attention. Recall that in self-attention the incoming sequence of vectors is mapped to query, key, and value vectors, denoted for each step $t$ by $q_t$, $k_t$, and $v_t$, respectively. Let $\hat{q}_t$ denote an approximation of the query vector $q_t$. The attention weight for step $t$ attending on some step $u$ is (Phuong & Hutter, 2022)
$$w_{tu} = \frac{\exp(q_t^T k_u)}{\sum_j \exp(q_t^T k_j)},$$
and the output representation is $o_t = \sum_u w_{tu} v_u$. We then have the following:

Theorem 1. If $\max_u |q_t^T k_u - \hat{q}_t^T k_u| \leq \delta$, then for sufficiently small $\delta > 0$ the attention weight with respect to the approximation $\hat{q}_t$, denoted $\hat{w}_{tu}$, is bounded by $w_{tu}(1 - 2\delta) \leq \hat{w}_{tu} \leq w_{tu}(1 + 2\delta)$, and as a result the output representation satisfies $|o_t - \hat{o}_t| \preceq 2\delta\, o_t$.

Proof.
$$w_{tu} = \frac{\exp(q_t^T k_u)}{\sum_j \exp(q_t^T k_j)} = \frac{\exp(q_t^T k_u - \hat{q}_t^T k_u + \hat{q}_t^T k_u)}{\sum_j \exp(q_t^T k_j - \hat{q}_t^T k_j + \hat{q}_t^T k_j)} = \frac{\exp(\hat{q}_t^T k_u)\exp(q_t^T k_u - \hat{q}_t^T k_u)}{\sum_j \exp(\hat{q}_t^T k_j)\exp(q_t^T k_j - \hat{q}_t^T k_j)}.$$
Since $\max_j |q_t^T k_j - \hat{q}_t^T k_j| \leq \delta$, we have $\exp(-\delta) \leq \exp(q_t^T k_j - \hat{q}_t^T k_j) \leq \exp(\delta)$ for all $j$, and therefore
$$\exp(-2\delta) \leq \frac{w_{tu}}{\hat{w}_{tu}} \leq \exp(2\delta).$$
Assuming $\delta$ is small, $w_{tu}(1 - 2\delta) \leq \hat{w}_{tu} \leq w_{tu}(1 + 2\delta)$, or $|\hat{w}_{tu} - w_{tu}| \leq 2\delta w_{tu}$. Since $o_t = \sum_u w_{tu} v_u$,
$$|o_t - \hat{o}_t| \preceq \left|\sum_u w_{tu} v_u - \hat{w}_{tu} v_u\right| \preceq \sum_u |w_{tu} - \hat{w}_{tu}|\, v_u \preceq \sum_u 2\delta w_{tu} v_u \preceq 2\delta\, o_t.$$

With the above result we see that we can bound the error in the output representation of self-attention resulting from approximating the query vector. We next show how to keep $\delta$ small using a discrete set of approximations $\{z_1, \ldots, z_J\}$ to the query vectors.
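As a quick numerical sanity check of this bound (a sketch, not from the paper): perturbing each query-key score by at most $\delta$ keeps every attention weight within a factor $\exp(\pm 2\delta)$ of its original value, the exact form of the bound before linearization.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # max-shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
T, F = 8, 16
q = rng.normal(size=F)       # a single query vector q_t
K = rng.normal(size=(T, F))  # keys k_1, ..., k_T

delta = 1e-3
scores = K @ q
# Perturb each score q^T k_j by at most delta, mimicking |q^T k_j - qhat^T k_j| <= delta.
scores_hat = scores + rng.uniform(-delta, delta, size=T)

w, w_hat = softmax(scores), softmax(scores_hat)
ratio = w_hat / w
# exp(-2*delta) <= w_hat / w <= exp(2*delta) must hold for every attention weight.
ok = bool(np.all(ratio >= np.exp(-2 * delta)) and np.all(ratio <= np.exp(2 * delta)))
```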
We ideally want to solve
$$\min_{z_1,\ldots,z_J}\ \max_{q \in Q,\, k \in K}\ \min_{j=1}^{J} \left| q^T k - z_j^T k \right|,$$
or equivalently
$$\min_{z_1,\ldots,z_J}\ \max_{q \in Q,\, k \in K}\ \min_{j=1}^{J} \left( q^T k - z_j^T k \right)^2 = \min_{z_1,\ldots,z_J}\ \max_{q \in Q,\, k \in K}\ \min_{j=1}^{J} (q - z_j)^T k k^T (q - z_j).$$
Instead of minimizing the maximum over all possible $q, k$, we can minimize the expectation:
$$\min_{z_1,\ldots,z_J}\ \mathbb{E}_{q \in Q,\, k \in K}\ \min_{j=1}^{J} (q - z_j)^T k k^T (q - z_j),$$
and taking the expectation over the keys inside gives
$$\min_{z_1,\ldots,z_J}\ \mathbb{E}_{q \in Q}\ \min_{j=1}^{J} (q - z_j)^T\, \mathbb{E}_{k \in K}\!\left[k k^T\right] (q - z_j) = \min_{z_1,\ldots,z_J}\ \mathbb{E}_{q \in Q}\ \min_{j=1}^{J} (q - z_j)^T \left( \Sigma_k + \mu_k \mu_k^T \right) (q - z_j).$$
Letting $\|x\|^2_M := x^T M x$, this is
$$\min_{z_1,\ldots,z_J}\ \mathbb{E}_{q \in Q}\ \min_{j=1}^{J} \|q - z_j\|^2_{\Sigma_k + \mu_k \mu_k^T},$$
which is the same as the weighted $K$-means objective, where the weighting matrix depends on the covariance $\Sigma_k$ and mean $\mu_k$ of the key vectors. Thus, in order to learn good $K$-means approximations of the query vectors from a discrete set, we introduce the following model.

In the VQ-TR model we modify the Transformer's encoder architecture by first mapping the $C$ incoming vectors, denoted by $X \in \mathbb{R}^{C \times F}$, through a VQ module: $Z_0, \mathrm{indices} := \mathrm{VQ}(X)$, which returns the sequence of $C$ indices into the set of only $J$ vectors denoted by $Z_0 \in \mathbb{R}^{J \times F}$. We then apply Transformer based cross-attention to obtain the latents $Z_1 \in \mathbb{R}^{J \times F}$: $Z_1 := \mathrm{CrossAttn}(X, Z_0)$. Since there are only $J$ latent vectors, and typically in practice $J \ll C$, we can process them further via self-attention $L$ times: $Z_{l+1} := \mathrm{SelfAttn}(Z_l)$. Finally, we restore the original sequence length by gathering the resulting latents via the indices of the quantization of the input vectors $X$: $\mathbb{R}^{C \times F} \ni Z := \mathrm{Gather}(Z_{L+1}, \mathrm{indices})$. This construction leads to an architecture with memory and compute complexity of $O(CJ) + O(LJ^2)$, from the cross-attention and the latent self-attention respectively (Jaegle et al., 2021; Hawthorne et al., 2022), per encoding layer. One downside of this architecture, however, is that we lose the ability to model causal latents.
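The encoder construction above can be sketched as a minimal single-head NumPy forward pass (illustrative only: it omits the learned projection matrices, residual connections, and layer norms that the real Transformer blocks would have):

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values):
    """Single-head attention without learned projections (illustrative)."""
    w = softmax(queries @ keys_values.T / np.sqrt(queries.shape[-1]))
    return w @ keys_values

def vq_tr_encoder(X, codebook, L=2):
    """One VQ-TR encoder layer: quantize, cross-attend, self-attend, gather.

    X:        (C, F) context window embeddings
    codebook: (J, F) codebook vectors, with J << C
    """
    # 1. VQ: each of the C inputs picks its nearest codebook vector -> C indices.
    dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)      # (C,)
    Z = codebook                        # (J, F), the latent array Z_0
    # 2. Cross-attention: the J latents attend over the C inputs -> O(C*J).
    Z = attention(Z, X)
    # 3. L rounds of latent self-attention -> O(L*J^2).
    for _ in range(L):
        Z = attention(Z, Z)
    # 4. Gather back to sequence length C via the quantization indices.
    return Z[indices]                   # (C, F)

rng = np.random.default_rng(2)
C, J, F = 32, 4, 8
out = vq_tr_encoder(rng.normal(size=(C, F)), rng.normal(size=(J, F)))
```

Note that positions which quantize to the same codebook index receive the same output latent, which is exactly what makes the cost independent of $C^2$.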
On the other hand, as is commonly done in time series forecasting, we use this non-causal encoder to learn discrete representations of large context windows, while the decoder remains a causal Transformer decoder and therefore scales as $O(MP^2)$ for $M$ decoding layers. Since $P \ll C$ for the datasets we train on, this does not hinder training or inference conditioned on large histories. We present a schematic of the VQ-TR model in Figure 1 for both the training (Section 3.1) and inference (Section 3.2) scenarios. An added benefit of this approach is that we mitigate the quadratic memory and computation cost incurred by large input sequences in the vanilla Transformer, allowing the model to learn long-term dependencies.

3.1. TRAINING

Given a set $\mathcal{D}_\text{train}$ of $D \geq 1$ time series, we construct batches of inputs by randomly sampling a time series $\{x^i_{1:T^i}\}$, with $i \in \mathbb{Z}^+$ such that $i \leq D$, then selecting a random $t \in \mathbb{Z}^+$ with $t \leq T^i - C - P$, and sampling the context window $\{x^i_{t:t+C}\}$ and the subsequent prediction window $\{x^i_{t+C:t+C+P}\}$, for fixed context window length $C$ and prediction window length $P$. For each batch we then jointly minimize the negative log-likelihood of the predicted distribution with respect to the ground truth, together with the $N$ latent and commitment losses from the VQ modules of the encoder. This is in contrast to the practice of first learning the discrete latent representations in an unsupervised fashion and then using these latents for downstream tasks, as in for example the DALL·E (Ramesh et al., 2021) model.
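The window sampling described above can be sketched as follows (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def sample_window(series, C, P, rng):
    """Sample one (context, prediction) window pair from a single time series.

    series: 1-D array of length T
    C, P:   context and prediction window lengths
    """
    T = len(series)
    # Random start t such that t + C + P still fits inside the series.
    t = rng.integers(0, T - C - P + 1)
    return series[t : t + C], series[t + C : t + C + P]

rng = np.random.default_rng(3)
series = rng.normal(size=200)
context, target = sample_window(series, C=48, P=12, rng=rng)
```

A training batch would stack many such pairs, drawn from randomly chosen series, along the batch dimension.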

3.2. INFERENCE

At inference time we go over each time series $i$ in the dataset and feed the last context-sized window (except for its last entry) to the encoder, and the very last entry to the decoder, to obtain the parameters of the distribution of the next time point. We can then sample one or more values from this distribution and feed them back to the decoder to obtain samples for each time point of our desired horizon of $P$ time steps. Note that we only need to run the encoder once in order to predict, and we can repeat tensors in the batch dimension to obtain many samples from the distribution in parallel. If a point forecast is required, we can evaluate the empirical mean or median at each time point of the prediction. Unlike some generative models, here we do not sample from a reduced-temperature distribution to obtain high quality samples, as we are interested in the empirical data distribution of the next time point conditioned on the past as well as the covariates.
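The parallel sampling scheme can be sketched as follows, with a hypothetical `decoder_step` standing in for the Transformer decoder plus its Gaussian distribution head (the toy head below is an assumption for illustration, not the paper's model):

```python
import numpy as np

def sample_forecast(decoder_step, last_value, P, S, rng):
    """Draw S sample paths of length P by repeating the last context value
    across the batch dimension and feeding samples back autoregressively.

    decoder_step(x, rng) -> per-sample (mean, std) of the next-step Gaussian.
    """
    x = np.full(S, float(last_value))   # S parallel copies of the last input
    paths = np.empty((S, P))
    for t in range(P):
        mean, std = decoder_step(x, rng)
        x = rng.normal(mean, std)       # one sample per path, fed back in
        paths[:, t] = x
    return paths

rng = np.random.default_rng(4)
# Hypothetical stand-in head: next value ~ N(0.9 * current, 0.1).
toy_step = lambda x, rng: (0.9 * x, np.full_like(x, 0.1))
paths = sample_forecast(toy_step, last_value=1.0, P=24, S=100, rng=rng)
point_forecast = np.median(paths, axis=0)   # empirical median per time step
```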

3.3. COVARIATES

Positional encodings give the Transformer the ability to encode positional information of sequences when needed, since Attention is a permutation equivariant layer. In the time series setting we can naturally create positional encodings, like the Rotary Positional Embedding (RoPE) (Su et al., 2021), via date-time covariates. More specifically, for a particular time point $t$, depending on the frequency of time series $i$, we can create hour-of-day, day-of-week, week-of-month, etc. features as a vector we denote by $c^i_t$. Due to their temporal nature, we can build these covariates for all future time points we wish to forecast. Additional covariates can be constructed by considering running means and the age of a time series, as well as by embedding the identity $i$ of each time series in a dataset via Embedding layers, as done in the DeepAR method.
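A minimal sketch of such date-time covariates for an hourly series (the particular features and the scaling to $[-0.5, 0.5]$ are illustrative assumptions, not the paper's exact choices):

```python
from datetime import datetime, timedelta

def datetime_covariates(t: datetime):
    """Hypothetical covariate vector c_t for an hourly frequency:
    hour-of-day, day-of-week, day-of-month, each scaled to [-0.5, 0.5]."""
    return [
        t.hour / 23.0 - 0.5,        # hour-of-day
        t.weekday() / 6.0 - 0.5,    # day-of-week
        (t.day - 1) / 30.0 - 0.5,   # day-of-month
    ]

start = datetime(2024, 1, 1)
# Because they depend only on the calendar, covariates are available for
# future time steps too, so we can build them over the full C + P range.
cov = [datetime_covariates(start + timedelta(hours=h)) for h in range(48)]
```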

3.4. SCALING

As detailed in Salinas et al. (2019b), time series can be of arbitrary numerical magnitude within a dataset. This is unlike the vision, NLP or even audio modalities, and so, in order to train a shared model over potentially very different time series, we calculate the mean value of the signal within its context window and divide the signal by it to normalize it. This context window scale is kept as a covariate and, more importantly, the model's output distribution is transformed back to the original scale via it during training and inference, to calculate log-probabilities or to sample from, respectively. If the scaling cannot be done in the output distribution's parameter space, one can also do it in the data space after sampling. All deep learning based methods in Section 4 incorporate this heuristic.
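A sketch of this mean scaling heuristic (using the mean absolute value plus a small constant, one common variant; the exact formula may differ from the paper's implementation):

```python
import numpy as np

def mean_scale(context, eps=1.0):
    """Context-window mean scaling: divide the signal by its mean absolute
    value (eps keeps the scale strictly positive for all-zero windows)."""
    scale = np.abs(context).mean() + eps
    return context / scale, scale

rng = np.random.default_rng(5)
context = rng.normal(loc=500.0, scale=50.0, size=48)   # large-magnitude series
scaled, scale = mean_scale(context)
# At inference, samples drawn in the scaled space are mapped back with the
# same per-window scale (here shown in data space, after sampling).
samples_original_scale = scaled * scale
```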

4. EXPERIMENTS

We test the performance of VQ-TR for the forecasting task in this section against a number of competing methods on open datasets. The CRPS is given by
$$\mathrm{CRPS}(F, x) = \int_{\mathbb{R}} \left( F(y) - \mathbb{I}\{x \leq y\} \right)^2 dy,$$
where $\mathbb{I}\{x \leq y\}$ is 1 if $x \leq y$ and 0 otherwise. We approximate the CDF via empirical samples at each time point, and the final metric is averaged over the prediction horizon and the time series of a dataset. The point-forecasting performance of the models is measured by the normalized root mean square error (NRMSE), the mean absolute scaled error (MASE) (Hyndman & Koehler, 2006), and the symmetric mean absolute percentage error (sMAPE) (Makridakis, 1993). For pointwise metrics we use sampled medians, with the exception of NRMSE, where we take the mean over our prediction samples. The results of our extensive experiments are detailed in Table 2. As can be seen, VQ-TR performs competitively with respect to the methods compared, where the models have been trained using the hyperparameters from their respective papers with Student-T (-t), Negative Binomial (-nb) or Implicit Quantile Network (-iqn) emission heads. In particular, for VQ-TR we can afford to use a larger context length of $C = 20 \times P$, where $P$ is the prediction horizon of each dataset, with a total of $N = 2$ encoder layers and $M = 6$ decoder layers. For training we use $J = 25$ codebook vectors and a batch size of 256 for 20 epochs, using the Adam (Kingma & Ba, 2015) optimizer with default parameters and a learning rate of 0.001. At inference time we sample $S = 100$ times for each time point and feed these samples in parallel, via the batch dimension, autoregressively through the decoder to produce the reported metrics. The full source code will be made available after the review period.
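A sample-based CRPS estimate can be computed via the energy-form identity $\mathrm{CRPS}(F, x) = \mathbb{E}|X - x| - \tfrac{1}{2}\mathbb{E}|X - X'|$ with $X, X' \sim F$ (a sketch, not the paper's evaluation code):

```python
import numpy as np

def crps_from_samples(samples, x):
    """Empirical CRPS from forecast samples X ~ F at a single time point,
    via CRPS(F, x) = E|X - x| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - x).mean()
    term2 = 0.5 * np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - term2

rng = np.random.default_rng(6)
samples = rng.normal(0.0, 1.0, size=1000)      # forecast samples at one step
score_good = crps_from_samples(samples, 0.0)   # observation near the center
score_bad = crps_from_samples(samples, 5.0)    # observation far in the tail
```

A sharper and better-calibrated forecast distribution yields a lower score, which is why the observation in the tail is penalized more heavily.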

5. RELATED WORKS

This method separates the length of the input sequence from the computation of the attention block by using a discrete set of latents. This strategy for reducing the computational cost is similar to the Perceiver (Jaegle et al., 2021).

6. SUMMARY AND DISCUSSION

We have presented VQ-TR, a novel probabilistic forecasting architecture which scales linearly with the encoder sequence size, and demonstrated its performance against competitive models on a number of open datasets. VQ-TR reports good performance at test time, and we can control the trade-off between computation and memory use by increasing or decreasing the number of discrete latents $J$. As the reader might guess, this architecture can also work in other sequential modeling use cases, and in future work we would like to investigate the performance of VQ-TR on NLP, Audio or Vision based problems.



http://www.unic.ac.cy/test/wp-content/uploads/sites/2/2018/09/ M4-Competitors-Guide.pdf



Figure 1: VQ-TR model, which consists of $N$ encoding vector-quantized cross-attention blocks and $M$ causal decoding Transformer blocks. During training (left) the encoder can take a potentially long sequence of length $C-1$ from a time series, and the decoder outputs, for each of the $P$ prediction steps, the parameters of some chosen distribution, which are learned via the negative log-likelihood together with the $N$ Vector Quantizer losses. During inference (right) we pass the last context window of length $C-1$ seen during training to the encoder and the very last training value to the decoder, which allows us to sample the next time step, which we can autoregressively pass back to the decoder to obtain predictions for our desired horizon.

We compare against the following methods:

• DeepAR Salinas et al. (2019b): an RNN based probabilistic model which learns the parameters of some chosen distribution for the next time point;
• MQCNN Wen et al. (2017): a Convolutional Neural Network model which outputs chosen quantiles of the forecast, on which we regress the ground truth via the Quantile loss;
• SQF-RNN Gasthaus et al. (2019): an RNN based non-parametric method which models the quantiles via linear splines and also regresses the Quantile loss;
• IQN-RNN Gouttes et al. (2021): combines an RNN model with an Implicit Quantile Network (IQN) Dabney et al. (2018) head to learn the distribution, similar to SQF-RNN;
• VQ-AR Rasul et al. (2022): an RNN based encoder-decoder model which quantizes its input via a VQ;

as well as the classical ETS Hyndman & Khandakar (2008), an exponential smoothing method using weighted averages of past observations, with exponentially decaying weights as the observations get older, together with Gaussian additive errors (E), modeling trend (T) and seasonality (S) effects separately.

We follow the recommendations of the M4 competition Makridakis et al. (2020) for reporting forecasting performance metrics. In this regard, we report the mean scaled interval score Gneiting & Raftery (2007) (MSIS) for a 95% prediction interval, the 50th and 90th percentile quantile losses (QL50 and QL90 respectively), as well as the Continuous Ranked Probability Score (CRPS) Gneiting & Raftery (2007); Matheson & Winkler (1976). The CRPS is a proper scoring rule which measures the compatibility of a predicted cumulative distribution function (CDF) $F$ with the ground-truth samples $x$.

Figure 2: Memory usage of VQ-TR and other Transformer based models when training with the same hyperparameters.

Table 1: Number of time series, domain, frequency, total training time steps and prediction length of the training datasets used in the experiments.

We train on the Elec., Traffic, Taxi, and Wikipedia datasets, preprocessed exactly as in Salinas et al. (2019a), with their properties listed in Table 1. As can be noted in the table, we do not need to normalize scales for the Traffic dataset. The datasets cover a number of time series domains, including finance, weather, energy, logistics and page-views.

Table 2: Forecasting metrics (lower is better) using SQF-RNN with 50 knots, ETS, MQCNN, IQN-RNN, DeepAR, VQ-AR and VQ-TR, with Student-T (-t), Negative Binomial (-nb) or IQN (-iqn) emission heads, on the open datasets. The best metrics are highlighted in bold.

Closely related models include Perceiver-AR (Hawthorne et al., 2022), the Set Transformer (Lee et al., 2019), Luna (Ma et al., 2021) and the Compressive Transformer (Rae et al., 2020). Perceiver-AR is the closest related method; however, it is a decoder-only architecture, and thus at inference time, with many parallel samples in the probabilistic forecasting use case, we have to run the cross-attention over a large context window $P$ times, causing both memory and computation bottlenecks. The use of VQ in sequential generative modeling has been explored in the Audio/Speech setting (Dhariwal et al., 2020; Zeghidour et al., 2021; Baevski et al., 2020), where typically a VQ-VAE is trained on the data and then a generative model is trained separately on the learned latents. The use of VQ for the time series forecasting problem has been explored in the VQ-AR (Rasul et al., 2022) model; however, it uses an RNN to encode the history of a time series into a discrete latent, which is then used by the forecasting decoder RNN. In contrast, this work incorporates the VQ module as part of an approximate Attention block to mitigate the issues of forecasting over large temporal contexts.

Forecasting metrics (lower is better) using Vanilla Transformer and other Transformer based models with Student-T (-t), Negative Binomial (-nb) or IQN (-iqn) emission heads, on the open datasets. The best metrics are highlighted in bold.

ACKNOWLEDGMENTS

We acknowledge and thank the authors and contributors of the many open source libraries that were used in this work, in particular: GluonTS (Alexandrov et al., 2020), Vector Quantize PyTorch (Wang & Contributors, 2022), NumPy (Harris et al., 2020), Pandas (Pandas development team, 2020), Matplotlib (Hunter, 2007), Jupyter (Kluyver et al., 2016) and PyTorch (Paszke et al., 2019).

ETHICS STATEMENT

Time series models considered in this work have been trained on open datasets which do not contain any personal information or ways of deducing personal information. All the models considered, however, could potentially be used, when trained on personal information, to predict behaviour or to deduce private information such as location, which poses a potential risk.

REPRODUCIBILITY STATEMENT

We will open source the code for the paper after the review period. The datasets and preprocessing used in this work are based on open datasets with predefined preprocessing provided by GluonTS (Alexandrov et al., 2020). All experiments have their hyperparameters in the appropriate Jupyter (Kluyver et al., 2016) notebooks.

