VQ-TR: VECTOR QUANTIZED ATTENTION FOR TIME SERIES FORECASTING

Abstract

Modern time series datasets can easily contain hundreds or thousands of time points; however, Transformer-based models scale poorly with sequence length, constraining their context size in the sequence-to-sequence setting. In this work, we introduce VQ-TR, which maps large sequences to a discrete set of latent representations as part of the attention module. This allows us to attend over larger context windows with linear complexity in the sequence length. We compare this method with other competitive deep learning and classical univariate probabilistic models and highlight its performance, using both probabilistic and point forecasting metrics, on a variety of open datasets from different domains.

1. INTRODUCTION

Deep learning based probabilistic time series forecasting models (Benidis et al., 2022) typically consist of one component that learns representations of a time series' history and another component that learns an emission model (point or probabilistic) conditioned on this representation of the history. One can use Recurrent Neural Networks (RNNs), causal convolutional networks, or the currently popular Transformer architecture (Vaswani et al., 2017) to learn historical representations. Transformers offer a good inductive bias for the forecasting task (Zhou et al., 2021), as they can look back over the full context history of a time series, whereas RNNs suffer from forgetting and convolutional networks have limited temporal receptive fields for their parameter count. Transformers, however, incur quadratic memory and compute costs in the length of the sequence over which they learn representations. This constrains the context size over which Transformer-based models can make predictions, potentially hindering them from learning long-range dependencies, and it also constrains the depth of these models, leading to poorer learnt representations. A number of techniques have been developed to deal with this issue by reducing either the computation or the sequence length. In this work, we start from an observation about approximating the attention module of the Transformer and design a novel encoder-decoder Transformer-based architecture for the forecasting use case whose computational and memory use is linear in the sequence length. We do this by incorporating Vector Quantization (van den Oord et al., 2017) into the Transformer, which allows us to learn discrete representations in a hierarchical fashion.
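To make the core idea concrete, the following is a minimal NumPy sketch of the vector quantization step: each of the n input representations is mapped to its nearest entry in a small codebook of K learned vectors, so that subsequent attention can operate over the K codes rather than all n tokens (O(n·K) instead of O(n²)). The function name, codebook size, and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each row of z to its nearest codebook entry (Euclidean distance).

    z:        (n, d) array of input representations.
    codebook: (K, d) array of learned code vectors.
    Returns the quantized vectors (n, d) and the code indices (n,).
    """
    # Pairwise squared distances between inputs and codes: shape (n, K).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)          # nearest code index per input
    return codebook[idx], idx       # quantized representations, assignments

# Illustrative usage with a random codebook (K=8 codes, dimension 4).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
z = rng.normal(size=(100, 4))       # a sequence of 100 token representations
z_q, idx = vector_quantize(z, codebook)
```

In a trained model the codebook would be learned (e.g. with a straight-through estimator and commitment loss, as in van den Oord et al., 2017); here it is random purely to show the mapping.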

2. BACKGROUND

2.1 PROBABILISTIC TIME SERIES FORECASTING

The task of probabilistic time series forecasting in the univariate setting consists of training on a dataset of $D \geq 1$ time series $\mathcal{D}_{\text{train}} = \{x^i_{1:T^i}\}$, where $i \in \{1, \ldots, D\}$ and at each time point $t$ we have $x^i_t \in \mathbb{R}$ or $\mathbb{N}$. We are tasked with predicting the potentially complex distribution of the next $P > 1$ time steps into the future, and we are given a back-testing test set $\mathcal{D}_{\text{test}} = \{x^i_{T^i+1:T^i+P}\}$. Each time index $t$ is in reality a date-time value that increments regularly according to the frequency of the dataset in question, and the last training point $T^i$ of each time series may or may not fall on the same date-time. Autoregressive models (Graves, 2013; Salinas et al., 2019b) estimate the prediction density by decomposing the joint distribution of all $P$ points via the chain rule of probability as: $$p_X(x^i_{T^i+1:T^i+P}) \approx \prod_{t=1}^{P} p(x^i_{T^i+t} \mid x^i_{1:T^i-1+t}, c^i_{1:T^i+P}; \theta),$$
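The chain-rule factorization above is typically used at inference time via ancestral sampling: predict the one-step-ahead distribution, draw a sample, append it to the context, and repeat for P steps. The following is a minimal sketch of that loop, assuming a Gaussian emission; `predict_one_step` is a hypothetical stand-in for the learned model's one-step-ahead conditional, not an API from the paper.

```python
import numpy as np

def autoregressive_forecast(history, predict_one_step, P, num_samples=100, seed=0):
    """Draw sample paths from the factorized density prod_t p(x_{T+t} | x_{1:T-1+t}).

    history:          list of observed values x_{1:T}.
    predict_one_step: callable context -> (mu, sigma), the model's one-step
                      Gaussian emission parameters (hypothetical stand-in).
    Returns an array of shape (num_samples, P) of sampled future paths.
    """
    rng = np.random.default_rng(seed)
    paths = []
    for _ in range(num_samples):
        ctx = list(history)
        for _ in range(P):
            mu, sigma = predict_one_step(ctx)     # condition on history + own samples
            ctx.append(rng.normal(mu, sigma))     # ancestral sampling step
        paths.append(ctx[len(history):])
    return np.array(paths)

# Illustrative usage: a toy "model" that predicts the mean of the last 3 values.
history = [1.0, 2.0, 3.0]
samples = autoregressive_forecast(
    history, lambda ctx: (np.mean(ctx[-3:]), 0.1), P=5, num_samples=200
)
```

The empirical distribution of `samples` at each horizon then serves as the probabilistic forecast, from which quantiles or point summaries can be read off.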

