UNIVARIATE VS MULTIVARIATE TIME SERIES FORECASTING WITH TRANSFORMERS

Abstract

Multivariate time series forecasting is a challenging problem, and a number of Transformer-based long-term time series forecasting models have been developed to tackle it. These models, however, are impeded by the additional information available in multivariate forecasting. In this paper we propose a simple univariate setting as an alternative method for producing multivariate forecasts. A single univariate model is trained on each individual dimension of the time series and is then used to forecast each dimension of the multivariate output in turn. A comparative study shows that our setting outperforms state-of-the-art Transformers in the multivariate setting on benchmark datasets. To investigate why, we formulate three hypotheses and verify them via an empirical study, which leads to a criterion for when our univariate setting is likely to achieve better performance and reveals flaws in current multivariate Transformers for long-term time series forecasting.

1. INTRODUCTION

In an increasingly digital world, data collection is becoming ever cheaper. Time series are being generated at greater lengths and dimensionality, and so more is being demanded of time series forecasting (TSF). Applications range across industry, from electricity forecasting and stock prediction to health (Torres et al., 2021), so any improvement in TSF can have far-reaching benefits for society. The advent of deep learning brought multiple TSF architectures and saw new ground continuously broken by models such as the recurrent neural network (RNN) (Hochreiter & Schmidhuber, 1997), temporal convolutional networks (TCNs) (Bai et al., 2018), and attention (Vaswani et al., 2017). Attention-based models offer the unique benefit that the path length between distant dependencies is always one. This property has seen models such as the Transformer outperform previous state-of-the-art (SOTA) methods by a significant margin in fields such as natural language processing (NLP) (Brown et al., 2020) and computer vision (Dosovitskiy et al., 2020).

While the Transformer performs very well in TSF, it suffers from O(l^2) complexity, where l is the length of the model's input. Predictive patterns can be found across distant time steps, and increasing the input length has therefore been found to improve accuracy (Li et al., 2019). Research into efficient Transformers for long-term TSF has consequently become an important new frontier. The authors of these efficient Transformers evaluate their models on the same pool of benchmark datasets, in both a multivariate and a univariate setting. While this is valuable in determining which model achieves the best forecasting accuracy, the way their univariate mode is implemented means that comparisons between the two settings cannot be made.
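The quadratic cost of full attention mentioned above can be seen directly from the shape of the score matrix. The sketch below is a plain NumPy illustration (not the implementation of any model discussed in this paper): for an input of length l, attention materialises an l-by-l score matrix, so memory and compute grow as O(l^2).

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention for a length-l input.

    q, k, v: (l, d) arrays. The score matrix s is (l, l), so memory
    and compute grow quadratically with the input length l.
    """
    l, d = q.shape
    s = q @ k.T / np.sqrt(d)                      # (l, l) score matrix
    w = np.exp(s - s.max(axis=1, keepdims=True))  # numerically stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ v, s

x = np.ones((512, 64))           # l = 512 time steps, d = 64 channels
out, scores = attention(x, x, x)
print(scores.shape)              # (512, 512): l * l entries
```

Doubling l quadruples the number of score entries, which is exactly the bottleneck the efficient Transformers below are designed to avoid.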
In the multivariate mode, multivariate sequences of dimension d are input and all d dimensions are forecast in a single multivariate output. The reported loss is the loss of the forecast averaged over every dimension. Their univariate setting, however, involves selecting a single dimension from each dataset on which the model is trained and tested. Since only one dimension is forecast, the results cannot be compared to the multivariate results, which comprise all dimensions. We alter the univariate mode to enable comparison with its multivariate counterpart: a single univariate model is trained to forecast every dimension individually, and these separate forecasts are combined to recreate the multivariate forecast. In doing so, we find that our univariate setting produces more accurate forecasts than the multivariate setting, despite being limited to operating on a single dimension at a time. This finding is surprising, since the intra-dimensional patterns that the univariate model learns are still present in the multivariate setting; the multivariate model should, in theory, be able to learn these as well as inter-dimensional patterns. This raises two questions: first, whether the multivariate datasets contain predictive inter-dimensional patterns for the models to learn; and second, why the multivariate models are unable to learn the same intra-dimensional patterns that univariate models find. Via an empirical study, we address these two issues and in doing so devise guidelines for when a multivariate model is likely to be more appropriate than combining the results of a univariate one. Our work should provide valuable insight into improving Transformer-based long-term TSF models for the multivariate setting.
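The recombination protocol described above can be sketched as follows. This is an illustrative NumPy mock-up, not the paper's actual code: the trivial last-value forecaster stands in for a trained univariate Transformer, and the shapes (l history steps, d dimensions, horizon h) are assumptions for the example.

```python
import numpy as np

def univariate_forecast(x, h):
    """Stand-in for a trained univariate model: naive last-value forecast.

    x: (l,) history of a single dimension; returns an (h,) forecast.
    """
    return np.full(h, x[-1])

def multivariate_via_univariate(X, h):
    """Apply the same univariate model to each of the d dimensions in
    turn, then stack the forecasts to recreate an (h, d) multivariate
    forecast that is directly comparable to a multivariate model's output."""
    return np.stack([univariate_forecast(X[:, j], h)
                     for j in range(X.shape[1])], axis=1)

X = np.arange(24.0).reshape(8, 3)     # (l=8, d=3) multivariate history
Y_hat = multivariate_via_univariate(X, h=4)
print(Y_hat.shape)                    # (4, 3)

# The reported loss is then the MSE averaged over every dimension,
# matching how the multivariate mode is scored.
mse = ((Y_hat - Y_hat) ** 2).mean()   # placeholder: compare against true Y
```

Because the recombined forecast covers all d dimensions, its averaged loss can be compared directly with the multivariate setting, which the original single-dimension univariate protocol does not allow.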

2. METHOD

We compare the multivariate setting and our univariate setting across four Transformer-based TSF models and find that in the majority of cases our univariate setting achieves SOTA results. Multivariate models should, in theory, be able to match or outperform their univariate counterpart, since they have access to the same information and more. We address three hypotheses that this finding raises: (1) there are no inter-dimensional patterns for the multivariate models to learn; (2) the multivariate models are impeded by the additional dimensions (the dimension count); (3) models in the multivariate setting require more training data to outperform the univariate mode.

2.1. MODELS

We evaluate our univariate setting and its multivariate counterpart with four models. Three are SOTA Transformer-based long-term TSF models: the Informer (Zhou et al., 2021) achieves a computational complexity of O(l · log l) via its ProbSparse self-attention mechanism, which only allows each key to attend to the most dominant queries; the Autoformer (Xu et al., 2021) likewise achieves O(l · log l) with its Auto-Correlation mechanism, which discovers period-based dependencies; and the FEDformer (Zhou et al., 2022) achieves linear complexity with its Fourier-enhanced structure, representing the series in the frequency domain and randomly selecting a constant number of Fourier components. Additionally, we test with a Vanilla Transformer (VT) to examine what advantages its O(l^2) complexity brings; including this model also helps us to generalise our findings to all Transformer-based TSF architectures. We aim to keep this model as simple as possible: the time series are embedded with a 1-D convolutional neural network (CNN) with kernel size 7, and the output of the single-layer Transformer is projected to the horizon length h with a fully connected layer. The exact implementation of the models can be found in our repository www.github.com.
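The shape bookkeeping of the VT's embedding and output stages can be sketched as below. This is a minimal NumPy illustration of the two stages only (the 1-D convolutional embedding with kernel size 7 and the fully connected projection to horizon h); the attention layer in between is omitted, and the specific sizes l, d, d_model, and h are assumptions for the example, not the paper's configuration.

```python
import numpy as np

def conv1d_embed(x, W, b):
    """1-D convolutional embedding with kernel size 7 and 'same' padding.

    x: (l, d) input series; W: (7, d, d_model) kernel; b: (d_model,).
    Returns an (l, d_model) embedding that would feed the Transformer layer.
    """
    l, d = x.shape
    k, _, d_model = W.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((l, d_model))
    for t in range(l):
        # each output step mixes a 7-step window across all d channels
        out[t] = np.einsum('kd,kde->e', xp[t:t + k], W) + b
    return out

def project_to_horizon(z, P):
    """Fully connected projection of the (flattened) encoder output to
    an h-step forecast. z: (l, d_model); P: (l * d_model, h * d)."""
    return z.reshape(-1) @ P

rng = np.random.default_rng(0)
l, d, d_model, h = 16, 3, 8, 4
x = rng.normal(size=(l, d))
z = conv1d_embed(x, rng.normal(size=(7, d, d_model)), np.zeros(d_model))
y = project_to_horizon(z, rng.normal(size=(l * d_model, h * d))).reshape(h, d)
print(z.shape, y.shape)  # (16, 8) (4, 3)
```

In the univariate setting the same pipeline runs with d = 1, which is what allows one model to be applied to every dimension in turn.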

2.2. BENCHMARK DATASETS

We evaluate the models on the standard benchmark datasets used by previous long-term TSF models such as the Informer, Autoformer, and FEDformer. The datasets are: Electricity, which contains hourly usage measurements of 321 customers from 2012-2015; and Electricity Transformer Temperature (ETT), collected from 2016-2018 in China, which includes 7 features/dimensions of the load and oil temperatures of two stations. Of the 4 variations



Multiple efficient Transformers have been developed for TSF, such as the Informer (Zhou et al., 2021) and the Autoformer (Xu et al., 2021), which both achieve a complexity of O(l · log l). The FEDformer (Zhou et al., 2022) achieves linear complexity.

