VQ-TR: VECTOR QUANTIZED ATTENTION FOR TIME SERIES FORECASTING

Abstract

Modern time series datasets can easily contain hundreds or thousands of time points; however, Transformer based models scale poorly with sequence length, constraining their context size in the seq-to-seq setting. In this work, we introduce VQ-TR, which maps large sequences to a discrete set of latent representations as part of the Attention module. This allows us to attend over larger context windows with linear complexity in the sequence length. We compare this method with other competitive deep learning and classical univariate probabilistic models and highlight its performance, using both probabilistic and point forecasting metrics, on a variety of open datasets from different domains.

1. INTRODUCTION

Deep learning based probabilistic time series forecasting models (Benidis et al., 2022) typically consist of one component that learns representations of a time series' history and another component that learns an emission model (point or probabilistic) conditioned on this representation of the history. One can typically use Recurrent Neural Networks (RNNs), Causal Convolutional networks, or the architecture most popular at the time of writing, namely Transformers (Vaswani et al., 2017), to learn historical representations. Transformers offer a good inductive bias for the forecasting task (Zhou et al., 2021), as they can look back over the full context history of a time series, whereas RNNs suffer from forgetting and convolutional networks have limited temporal receptive fields for their number of parameters. Transformers, however, suffer from quadratic complexity in memory and compute with respect to the length of the sequence over which they learn representations. This constrains the size of the contexts over which Transformer based models can make predictions, which can hinder these models from learning long-range dependencies and also constrains model depth, leading to poorer learned representations. A number of techniques have been developed to deal with this issue by reducing the computation or reducing the sequence length. In this work, we start from an observation on the approximation of the Attention module in the Transformer and design a novel encoder-decoder Transformer based architecture for the forecasting use case that is linear in its computational and memory use with respect to the sequence length. We do this by incorporating Vector Quantization (van den Oord et al., 2017) into the Transformer, which allows us to learn discrete representations in a hierarchical fashion.

2. BACKGROUND

2.1. PROBABILISTIC TIME SERIES FORECASTING

The task of probabilistic time series forecasting in the univariate setting consists of training on a dataset of $D \geq 1$ time series $\mathcal{D}_{\text{train}} = \{x^i_{1:T^i}\}$, where $i \in \{1, \dots, D\}$ and at each time point $t$ we have $x^i_t \in \mathbb{R}$ or $\mathbb{N}$. We are tasked with predicting the potentially complex distribution of the next $P > 1$ time steps into the future, and we are given a back-testing test set $\mathcal{D}_{\text{test}} = \{x^i_{T^i+1:T^i+P}\}$. Each time index $t$ is in reality a date-time value that increments regularly based on the frequency of the dataset in question, and the last training point $T^i$ of each time series may or may not correspond to the same date-time. Autoregressive models like (Graves, 2013; Salinas et al., 2019b) estimate the prediction density by decomposing the joint distribution of all $P$ points via the chain rule of probability as: $p_X(x^i_{T^i+1:T^i+P}) \approx \prod_{t=1}^{P} p(x^i_{T^i+t} \mid x^i_{1:T^i-1+t}, c^i_{1:T^i+P}; \theta)$, parameterized by some model with trained weights $\theta$. This requires that the next time point be conditioned on all of the past and on covariates $c^i_t$ (detailed in Section 3.3), which is computationally challenging to scale, especially if the time series has a large history. Models like DeepAR (Salinas et al., 2019b) typically resort to the seq-to-seq paradigm (Sutskever et al., 2014) and consider a context window of fixed size $C$ sampled randomly from the full time series history to learn a historical representation, which the decoder then uses to learn the distribution of the time points that follow the context. This does, however, mean that the model falls short of capturing long-range temporal dependencies in its prediction, which can lead to a worse approximation of the future distribution.
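The fixed-size context window sampling described above can be sketched as follows; this is an illustrative minimal example of the seq-to-seq training setup (the function name and toy series are our own, not from the paper):

```python
import numpy as np

def sample_context_windows(series, C, P, n_samples, rng):
    """Sample (context, target) pairs: C past points and the next P points."""
    T = len(series)
    # last valid start index so that context + prediction horizon fit in the series
    max_start = T - (C + P)
    starts = rng.integers(0, max_start + 1, size=n_samples)
    contexts = np.stack([series[s : s + C] for s in starts])
    targets = np.stack([series[s + C : s + C + P] for s in starts])
    return contexts, targets

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 500))  # toy univariate series
ctx, tgt = sample_context_windows(series, C=48, P=24, n_samples=32, rng=rng)
print(ctx.shape, tgt.shape)  # (32, 48) (32, 24)
```

Each sampled pair then serves as one training instance: the model encodes the context of length $C$ and learns the distribution of the following $P$ points.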
Encoder-decoder Transformers (Vaswani et al., 2017) naturally fit the seq-to-seq paradigm, where $N$ encoding Transformer layers can be used to learn a context-window-sized sequence of representations denoted by $\{h_t\}_{t=1}^{C-1} = \text{Enc} \circ \cdots \circ \text{Enc}(\{\text{concat}(x^i_t, c^i_{t+1})\}_{t=1}^{C-1}; \theta)$. Afterwards, $M$ layers of a causal or masked decoding Transformer can be used to model the subsequent $P$ future time points conditioned on the encoder representations as $\prod_{t=C}^{C+P-1} p(x^i_{t+1} \mid x^i_{t:C}, c^i_{t+1:C+1}, h_1, \dots, h_{C-1}; \theta)$. For example, if we assume the data comes from a Gaussian, then the outputs of the Transformer's $M$ decoder layers can be passed to a layer that returns appropriately signed parameters of a Gaussian whose log-likelihood, given by $\sum_{t=C}^{C+P-1} \log p_{\mathcal{N}}(x^i_{t+1} \mid x^i_{t:C}, c^i_{t+1:C+1}, h_1, \dots, h_{C-1}; \theta)$, can be maximized for all $i$ and $t$ from $\mathcal{D}_{\text{train}}$ using stochastic gradient descent (SGD), as detailed in Section 3.1. Although Transformers offer a better alternative to recurrent neural networks (RNNs) (like the LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Chung et al., 2014), which, apart from being sequential, suffer from forgetting with large context windows) or convolutional models like the TCN (Bai et al., 2018) (which have limited temporal receptive fields), they scale quadratically, in terms of compute and memory, with the sequence length per layer. To reduce the computational requirements of Transformers, which is an active area of research, one can employ a number of strategies, for example compressing the sequence, exploiting locality, or reducing the computation performed per input element.
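The Gaussian emission head described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: decoder outputs are projected to a mean and a scale, with a softplus ensuring the scale parameter is positive (the "appropriately signed" constraint), and training minimizes the negative log-likelihood. The class name `GaussianHead` is ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Project decoder representations to the parameters of a Gaussian."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 2)  # -> (mu, raw_sigma)

    def forward(self, h):
        mu, raw_sigma = self.proj(h).unbind(-1)
        sigma = F.softplus(raw_sigma) + 1e-6  # constrain the scale to be positive
        return torch.distributions.Normal(mu, sigma)

torch.manual_seed(0)
head = GaussianHead(d_model=16)
h = torch.randn(8, 24, 16)         # (batch, prediction length P, d_model)
x = torch.randn(8, 24)             # observed targets
nll = -head(h).log_prob(x).mean()  # maximizing log-likelihood = minimizing nll
nll.backward()                     # gradients for SGD on theta
print(float(nll))
```

In practice the same pattern extends to other output distributions (e.g. Student-t or negative binomial for count data) by swapping the projection and the distribution object.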

2.2. VECTOR QUANTIZATION (VQ)

The VQ-VAE (van den Oord et al., 2017; Razavi et al., 2019) is an encoder-decoder Variational Autoencoder (VAE) (Kingma & Welling, 2019) that maps inputs onto a set of $J \geq 1$ discrete latent variables called the codebook, $\{z_1, \dots, z_J\}$, with a decoder that reconstructs the inputs from the resulting discrete vectors. An input vector is quantized according to its distance to its nearest codebook vector:

$\text{Quantize}(q) := z_n, n \quad \text{where } n = \arg\min_j \|q - z_j\|_2. \quad (1)$

The codebook is learned by back-propagation of the gradient coming from upstream of the VQ module, and, since the quantization operation is non-differentiable, one uses the Straight-Through gradient estimator (Hinton et al., 2012; Bengio et al., 2013) to copy the gradients downstream. Additionally, the VQ has two extra losses: a latent loss, which encourages the alignment of the codebook vectors with the inputs of the VQ, and a commitment loss, which penalizes the inputs for switching codebook vectors too frequently. This is done via the "stop-gradient" or "detach" operators of deep learning frameworks, which block gradients from flowing into their argument. The two additional VQ losses can thus be written as:

$\|\text{sg}(q) - z\|_2^2 + \beta \|\text{sg}(z) - q\|_2^2, \quad (2)$

where $\beta$ is the hyperparameter weighting the commitment loss. Since the optimal codes would be the k-means clusters of the input representations, van den Oord et al. (2017) provides an exponential
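A minimal sketch of this quantization step, assuming PyTorch: nearest-codebook lookup as in Eq. (1), the straight-through gradient copy, and the latent and commitment losses of Eq. (2) with `detach()` playing the role of $\text{sg}(\cdot)$. The function name is illustrative:

```python
import torch

def vector_quantize(q, codebook, beta=0.25):
    """q: (..., d) inputs, codebook: (J, d) learnable code vectors."""
    dists = torch.cdist(q, codebook)  # pairwise L2 distances to all J codes
    n = dists.argmin(dim=-1)          # Eq. (1): index of the nearest code
    z = codebook[n]                   # quantized vectors z_n
    # straight-through estimator: forward pass uses z, backward pass
    # copies gradients from z_st straight to q
    z_st = q + (z - q).detach()
    latent_loss = ((q.detach() - z) ** 2).mean()  # pulls codes toward inputs
    commit_loss = ((z.detach() - q) ** 2).mean()  # keeps inputs near their code
    return z_st, n, latent_loss + beta * commit_loss  # Eq. (2)

torch.manual_seed(0)
codebook = torch.randn(32, 8, requires_grad=True)  # J = 32 codes of dim 8
q = torch.randn(5, 8, requires_grad=True)          # 5 input vectors
z_st, idx, vq_loss = vector_quantize(q, codebook)
vq_loss.backward()  # gradients reach both the codebook and the inputs
print(z_st.shape, idx.shape)  # torch.Size([5, 8]) torch.Size([5])
```

Note that only the latent loss updates the codebook, while the commitment loss (weighted by $\beta$) regularizes the encoder inputs, mirroring the two stop-gradient terms in Eq. (2).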

