UNIVARIATE VS MULTIVARIATE TIME SERIES FORECASTING WITH TRANSFORMERS

Abstract

Multivariate time series forecasting is a challenging problem, and a number of Transformer-based long-term time series forecasting models have been developed to tackle it. These models, however, are impeded by the additional information available in the multivariate setting. In this paper we propose a simple univariate setting as an alternative method for producing multivariate forecasts. A single univariate model is trained on each individual dimension of the time series and then used to forecast each dimension of the multivariate forecast in turn. A comparative study shows that our setting outperforms state-of-the-art Transformers in the multivariate setting on benchmark datasets. To investigate why, we formulate three hypotheses and verify them via an empirical study, which leads to a criterion for when our univariate setting is likely to perform better and reveals flaws in current multivariate Transformers for long-term time series forecasting.

1. INTRODUCTION

In an increasingly digital world, technology and data collection are becoming cheaper. Time series are being generated at greater lengths and dimensionalities, and so more is being demanded of time series forecasting (TSF). Its applications range across industry, including electricity forecasting, stock prediction, and health (Torres et al., 2021), so any improvement in TSF can have far-reaching benefits for society. The advent of deep learning brought with it multiple different TSF architectures and saw new ground continuously broken with models such as the recurrent neural network (RNN) (Hochreiter & Schmidhuber, 1997), temporal convolutional networks (TCNs) (Bai et al., 2018), and attention (Vaswani et al., 2017). Attention-based models offer the unique benefit that the path length between distant dependencies is always one. This property has seen models such as the Transformer outperform previous state-of-the-art (SOTA) methods by a significant margin in fields such as natural language processing (NLP) (Brown et al., 2020) and computer vision (Dosovitskiy et al., 2020). While the Transformer performs very well in TSF, it suffers from O(l^2) complexity, where l is the length of the input to the model. Predictive patterns can be found across distant time steps, and increasing the length of the input has been found to improve accuracy (Li et al., 2019). Research into efficient Transformers for long-term TSF has therefore become an important new frontier. Multiple efficient Transformers have been developed for TSF, such as the Informer by Zhou et al. (2021) and the Autoformer by Xu et al. (2021), which both achieve a complexity of O(l · log(l)). The FEDformer, by Zhou et al. (2022), achieves linear complexity. The authors evaluate each of these models on the same pool of benchmark datasets, in both a multivariate and a univariate setting.
While this is valuable in determining which model achieves the best forecasting accuracy, the method by which their univariate mode is implemented means that comparisons between the two settings cannot be made. In the multivariate mode, multivariate sequences of dimension d are inputted and all d dimensions are forecasted in a single multivariate output. The reported loss is the average loss of the forecast over every dimension. Their univariate setting, however, involves selecting a single dimension from each dataset, on which the model is trained and tested. Since only one dimension is forecasted, the results cannot be compared to the multivariate results, which comprise all dimensions. We alter the univariate mode, enabling it to be compared with its multivariate counterpart: we train a single univariate model to forecast every dimension individually, and these separate forecasts are combined to recreate the multivariate forecast. In doing so, we find that our univariate setting produces more accurate forecasts than the multivariate setting, despite its limitation of only being able to operate on a single dimension at a time. The finding is surprising since the intra-dimensional patterns that the univariate model learns are still present in the multivariate setting; the multivariate model, in theory, should be able to learn these as well as inter-dimensional patterns. This raises two questions: first, whether the multivariate datasets contain predictive inter-dimensional patterns for the models to learn, and second, why the multivariate models are unable to learn the same intra-dimensional patterns that univariate models find. Via empirical study, we address these two issues and in doing so devise guidelines for when a multivariate model is likely to be more appropriate than combining the results of a univariate one. Our work should provide valuable insight into improving Transformer-based long-term TSF models for the multivariate setting.

2. METHOD

We compare the multivariate setting and our univariate setting across four Transformer-based TSF models and find that, in the majority of cases, our univariate setting achieves SOTA results. Multivariate models, in theory, should be able to match or outperform their counterpart since they have access to the same information and more. We address three hypotheses that this raises: (1) there are no inter-dimensional patterns for the multivariate models to learn; (2) the multivariate models are impeded by the additional dimensions (the dimension count); (3) models in the multivariate setting require more training data to outperform the univariate mode.

2.1. MODELS

We evaluate our univariate setting and its multivariate counterpart with four models. Three of the models are SOTA Transformer-based long-term TSF models: the Informer, by Zhou et al. (2021), which achieves a computational complexity of O(l · log(l)) via its ProbSparse self-attention mechanism, only allowing each key to attend to the most dominant queries; the Autoformer, by Xu et al. (2021), which likewise achieves O(l · log(l)) with its Auto-Correlation mechanism, discovering period-based dependencies; and the FEDformer, by Zhou et al. (2022), which achieves linear complexity with its Fourier enhanced structure, representing the series in the frequency domain and randomly selecting a constant number of Fourier components. Additionally, we test with a Vanilla Transformer (VT) to examine what advantages its O(l^2) complexity brings. Including this model also helps us generalise our findings to all Transformer-based TSF architectures. We aim to keep the model as simple as possible: the time series are embedded with a 1-D convolutional neural network (CNN) of kernel size 7, and the output of the single-layer Transformer is projected to the horizon length h with a fully connected layer. The exact implementation of the models can be found in our repository www.github.com.
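The VT described above can be sketched as follows in PyTorch. Only the kernel size of 7, the single encoder layer, and the fully connected projection to the horizon come from the text; all other names and hyperparameters (`d_model`, `n_heads`, the window and horizon lengths) are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class VanillaTSF(nn.Module):
    """Minimal sketch of the VT baseline: a 1-D CNN embedding with
    kernel size 7, a single Transformer encoder layer, and a fully
    connected projection from the encoded window to the horizon h."""

    def __init__(self, input_len=96, horizon=24, d_model=64, n_heads=4):
        super().__init__()
        # 1-D CNN embedding; padding=3 keeps the sequence length unchanged
        self.embed = nn.Conv1d(1, d_model, kernel_size=7, padding=3)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # single layer
        # fully connected projection to the horizon length
        self.project = nn.Linear(input_len * d_model, horizon)

    def forward(self, x):
        # x: (batch, input_len), a univariate input window
        z = self.embed(x.unsqueeze(1))       # (batch, d_model, input_len)
        z = self.encoder(z.transpose(1, 2))  # (batch, input_len, d_model)
        return self.project(z.flatten(1))    # (batch, horizon)
```

In this sketch the projection layer fixes the input length, which keeps the model as simple as possible at the cost of flexibility.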

2.2. BENCHMARK DATASETS

We evaluate the models on all the standard benchmark datasets that have been used by previous long-term TSF models such as the Informer, Autoformer and the FEDformer. 

2.3. UNIVARIATE FORECASTING

Recent SOTA long-term TSF models (Informer, Autoformer, and FEDformer) are evaluated on a range of datasets in both the univariate and multivariate settings; however, due to differences between the methods, comparisons between the two cannot be made. In the multivariate setting, multivariate sequences are both inputted to and outputted from the model, M_m. The reported loss is the average loss over the entire forecast over every dimension. The univariate setting, however, involves selecting a single dimension from each dataset, on which the model is trained and tested. Since only one dimension is forecasted, the results cannot be compared to the multivariate results, which comprise all dimensions. Therefore, we alter the univariate setting to produce a multivariate output. Traditionally, to obtain a multivariate forecast X̂_t with univariate models M_t, each dimension of the input X would be separated, Eq. (1), and a different model applied to each dimension, Eq. (2). This requires d separate models to be trained, incurring a significant computational cost for high-dimensional datasets; the Traffic dataset we use has 862 dimensions, and so 862 models would be required. Therefore, we train a single univariate model, M_u, on all the individual dimensions. During training, this involves repeatedly taking sequences from a random dimension to feed into the model. The end result is a model that can forecast any single dimension. To produce the multivariate forecast X̂_u, all the univariate forecasts are concatenated together, see Eq. (3). For the multivariate model M_m, the input X is not separated and the entire multivariate forecast X̂_m is produced in one step, see Eq. (4). With our univariate setting, the forecast X̂_u can be directly compared with the forecast X̂_m from the multivariate model, and does not require the computational cost of producing X̂_t.
X = [x_1, x_2, ..., x_d]    (1)

X̂_t = [M_{t_1}(x_1), M_{t_2}(x_2), ..., M_{t_d}(x_d)]    (2)

X̂_u = [M_u(x_1), M_u(x_2), ..., M_u(x_d)]    (3)

X̂_m = M_m(X)    (4)
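Eq. (3) and the random-dimension training procedure can be sketched as below. Here `model_u` stands for the shared univariate model M_u; its vector-in, vector-out interface and the helper names are assumptions for illustration:

```python
import numpy as np

def multivariate_from_univariate(model_u, X):
    """Sketch of Eq. (3): apply one shared univariate model M_u to each
    dimension x_j of X (shape (l, d)) and concatenate the d univariate
    forecasts into a multivariate forecast X_hat_u of shape (h, d)."""
    return np.stack([model_u(X[:, j]) for j in range(X.shape[1])], axis=1)

def sample_training_window(X, window):
    """During training, repeatedly draw a window from one randomly
    chosen dimension of the multivariate series X."""
    l, d = X.shape
    j = np.random.randint(d)               # random dimension
    t = np.random.randint(l - window + 1)  # random start position
    return X[t:t + window, j]
```

The same trained model is reused d times at inference, avoiding the cost of training the d separate models required by Eq. (2).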

2.4. INTER-DIMENSIONAL DEPENDENCIES

Multivariate models have the potential to outperform univariate models since they have access to more information. Univariate models operate over single dimensions only, limiting them to learning intra-dimensional patterns. Multivariate models operate over multiple dimensions and so can learn inter-dimensional patterns in addition to the intra-dimensional kind. We therefore want to understand how the multivariate models make their forecasts and whether they utilise inter-dimensional patterns. If the benchmark datasets contain predictive inter-dimensional patterns, then the multivariate models should outperform their univariate counterparts. Attention-based models often lend themselves to interpretation via their attention weights: the query-key dot products show which inputs are amplified and are therefore of high predictive relevance. This method, however, is not realisable here due to the one-dimensional CNNs used to embed the input sequence. The kernel of the CNN sums all the dimensions together before attention is applied, obfuscating direct access to the dimensions. Instead, to find the strength of inter-dimensional dependencies, we look at what impact each input dimension has on the forecasted dimensions. If one dimension contributes to the forecast of multiple output dimensions, then the model has likely learnt a predictive pattern between input dimensions, a pattern that could not be identified with a univariate model. We achieve this by computing the Jacobian, J, of the output: for every forecasted dimension within every time step of the output X̂, we compute the partial derivative with respect to the input X,

J = [ ∂X̂_{1,1}/∂X  ...  ∂X̂_{1,h}/∂X
      ...
      ∂X̂_{d,1}/∂X  ...  ∂X̂_{d,h}/∂X ]    (5)

We are interested in the overall effect of each input dimension on the output dimensions, so we simplify J by averaging along the output time-step axis. This reduces the size of J from d × h to d, where h is the forecast horizon,

J̄_i = (1/h) Σ_{k=1}^{h} |J_{i,k}|    (6)

Each element J̄_i is still of size d × l, where l is the length of the input, so we summarise further by averaging over the input time steps,

D_{i,j} = (1/l) Σ_{k=1}^{l} (J̄_i)_{j,k}    (7)

The matrix D ∈ R^{d×d} shows the overall influence each input dimension has over each output dimension. Input dimensions that have a high impact on certain output dimensions can be interpreted as having high predictive utility for those specific outputs. From D we can tell whether inter-dimensional patterns have been learnt. If no such patterns have been found, D should be similar to the identity matrix. If patterns are present, then input dimensions should affect multiple output dimensions, see Figure 1.
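Treating the model as a differentiable function from an input window X of shape (l, d) to a forecast of shape (d, h), Eqs. (5)-(7) can be sketched with PyTorch's automatic differentiation. The interface below is an illustrative assumption, not the paper's implementation:

```python
import torch

def influence_matrix(model, x):
    """Sketch of the dimension-influence matrix D of Eqs. (5)-(7).
    `model` is assumed to map an input window x of shape (l, d) to a
    forecast of shape (d, h). Returns D of shape (d, d), where
    D[i, j] is the mean absolute sensitivity of output dimension i
    to input dimension j."""
    # Full Jacobian: shape (d, h, l, d) = (out dim, out step, in step, in dim)
    J = torch.autograd.functional.jacobian(model, x)
    # Average |J| over the output time steps (Eq. 6) and the input
    # time steps (Eq. 7), leaving a (d, d) influence matrix.
    return J.abs().mean(dim=(1, 2))
```

For a model whose dimensions are fully independent, this matrix comes out (close to) diagonal, matching the identity-matrix interpretation above.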

2.5. DIMENSION COUNT IMPACT

To better understand the interplay between the number of dimensions a dataset has and the performance of the trained model, we create an array of new datasets. These are subsets of the benchmark datasets with dimensions removed. For example, the Electricity dataset contains 321 dimensions, see Table 1. We take r randomly selected dimensions, maintaining the series length l, to create a new subdataset, which is then used to train and test the model. By repeating this process over a range of values of r, we can plot how the number of dimensions affects the accuracy of the multivariate model. When r equals the full number of dimensions in the dataset, d, the case reduces to the full multivariate setting; when r equals 1, we reach the univariate setting. This experiment gives a clear view of how additional dimensions impact the performance of a model.

2.6. DATASET SIZE REQUIREMENTS

Often in machine learning, poor model performance is blamed on a lack of training data, and so we test the data requirements of both the univariate and multivariate settings. While it is challenging, if not impossible, to increase the size of a dataset, we can easily reduce its size by removing time steps. If increasing the size is expected to improve performance, then it should hold that reducing it will worsen performance.
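A minimal sketch of this reduction, assuming time steps are dropped in chronological order from the start of the training split (so that a fraction of 0.2 keeps only the final 20%), with the validation and test splits untouched:

```python
import numpy as np

def reduce_training_set(train, fraction):
    """Sketch: keep only the final `fraction` of the training split.
    Time steps are removed in chronological order from the beginning
    of the series; validation and test splits are left unchanged."""
    keep = int(round(len(train) * fraction))
    return train[len(train) - keep:]
```

Training on each reduced split and comparing losses then shows how sensitive each setting is to the amount of data.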

3. RESULTS

3.1. UNIVARIATE VS MULTIVARIATE

We test the four architectures in both the multivariate setting and our univariate setting. Both methods produce a forecast involving all the dimensions, enabling direct comparisons to be made. The results can be found in Table 2. All models are trained to minimise the MSE and then evaluated on both MSE and MAE. We highlight the best overall loss in bold, and the best losses per model are underlined. The VT, Informer and Autoformer experience a significant improvement in forecasting accuracy in our univariate setting. The results for the FEDformer are more mixed, with neither mode having a clear advantage over the other. Overall, the VT performs the best, which is unsurprising given its O(l^2) complexity, the highest of the four. In Appendix C we test the VT in the traditional univariate setting, where a separate model is trained for each dimension, see Eq. (2). We were only able to carry out this test for one architecture due to the high computational cost: the Traffic dataset alone has 862 dimensions, see Table 1, requiring 862 models to be trained and tested, which took 3 weeks on a pair of Nvidia V100 GPUs. The success of our univariate setting is surprising as it has less information on which to base the forecast. The following experiments explore the reasons behind this finding.
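For reference, the two reported metrics, averaged over all forecast time steps and dimensions, can be written as:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error, averaged over every forecast element."""
    return float(np.mean((y_hat - y) ** 2))

def mae(y_hat, y):
    """Mean absolute error, averaged over every forecast element."""
    return float(np.mean(np.abs(y_hat - y)))
```

In the multivariate setting these averages run over all d dimensions at once; in our univariate setting they run over the d concatenated per-dimension forecasts, which makes the two directly comparable.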

3.2. INTER-DIMENSIONAL DEPENDENCIES

By visualising the impact each input dimension has on the dimensions within the forecast, we can identify whether the model has learnt inter-dimensional patterns. These present themselves as an input dimension that affects multiple output dimensions. For the VT, the Exchange dataset saw the largest improvement in the univariate setting, and the Illness dataset was the only one that improved in the multivariate setting. We visualise and compare the input-output dimension interactions for the two datasets in Figure 1. For Exchange, a strong diagonal pattern through the centre is present. This indicates that each dimension is highly independent and the model has not learnt any inter-dimensional predictive patterns. In contrast, the visualisation for the Illness dataset shows multiple vertical lines, meaning the model draws on multiple dimensions when predicting each output dimension; the forecast is based on inter-dimensional patterns. For example, dimension 3 is predictive of all dimensions, especially 0 and 1. Dimension 4, on the other hand, has nearly no predictive value. The presence of predictive inter-dimensional patterns, which the VT in the multivariate setting has learnt, explains why it outperforms its univariate counterpart on this dataset. However, in Appendix B we give the visualisations for all the datasets, and it can be seen that Electricity and Traffic also have inter-dimensional patterns, yet the univariate setting still outperforms the multivariate one. These datasets have a much higher dimension count, see Table 1, and in the next section we explore the impact of the dimension count.

Figure 1: Here we visualise the importance of each input dimension relative to each output dimension, given by a brighter colour. The dimensions within Exchange are shown to be highly independent, with no predictive inter-dimensional patterns being learnt. In contrast, the vertical columns of the Illness visualisation show that the model uses the information within specific dimensions to forecast multiple others. Visualisations for all the other datasets with the VT model can be found in Appendix B.

3.3. DIMENSION COUNT IMPACT

Our univariate setting outperforms the multivariate setting even when inter-dimensional patterns are present. By training the VT on a varying number of dimensions per dataset, we can determine the precise impact of the dimension count, see Figure 2. For 4 of the 6 datasets, there is a clear and strong positive correlation between the number of dimensions and the model's forecasting loss, with Exchange showing the strongest link: model accuracy is significantly reduced as more dimensions are added. A notable exception is the Illness dataset, which has a weak negative correlation, meaning that the model benefits from access to the other dimensions and is likely learning inter-dimensional patterns. In Appendix A we give an alternative version of Figure 2 in which each individual data point is shown, and we explain why there is generally higher variance for the first points of each plot. In summary, Figure 2 shows that the VT is impeded by high-dimensional datasets and that this effect drowns out the benefits of inter-dimensional patterns for the Electricity and Traffic datasets.

3.4. DATASET SIZE REQUIREMENTS

Larger training datasets can often improve the performance of a model, so we investigate whether either our univariate or the multivariate setting would be improved with additional data. We make the assumption that if the performance decreases with less data, then it will improve with more data, see Figure 3. The data requirements of the multivariate setting are greater than those of the univariate setting. For the Electricity, ETT, Traffic and Weather datasets, the univariate setting converges rapidly, whereas the multivariate setting needs a larger amount of data, with Weather requiring the most (80% of the full training set). All these datasets appear to converge before a set size of 100%, and so the performance of the multivariate setting is likely not limited by the amount of data. The results for the Exchange and Illness datasets are more complex. For Exchange, the univariate plot quickly improves and converges, but the multivariate setting sees no improvement and is unstable, suggesting that more data is required. The Illness dataset gives the surprising pattern of an increasing loss with an increasing data size. This suggests that as more data is added to the training set, it becomes less representative of the test split; however, we are cautious about drawing conclusions from this due to the exceptionally small size of the dataset. Illness has a time series length 70 times shorter than the longest, ETT, and Exchange is the second smallest dataset, see Table 1. In summary, the multivariate setting requires larger datasets than the univariate setting, but the 4 largest datasets are large enough for this not to be an issue. The two smallest datasets, Exchange and Illness, would appear to benefit from more data, given their unusual and unstable results.

4. CONCLUSION

Our univariate setting outperforms its multivariate counterpart in the majority of test cases. This is due to the models being impeded by the additional dimensions in the multivariate setting, pointing to a flaw in the current Transformer-based long-term TSF architectures. We expect that a form of variable selection is likely needed to prevent the noise from the dimensions crowding out useful inter-dimensional signals. Work from Lim et al. (2021) is relevant; however, their mechanism selects relevant input features at each time step and does not address the issue of certain dimensions only being relevant to the forecast of other specific dimensions. If a dataset contains inter-dimensional predictive patterns and has a low dimension count, the multivariate setting should perform best. In all other cases, we expect our univariate mode to achieve the best forecasting accuracy. Our code will be made publicly available on the completion of the review process.

Table 3: Here we include the traditional univariate setting, which involves training a separate model for each dimension, see Eq. (2). We compare this setting to the multivariate one and to our univariate one. Likely due to the extra capacity of the d models, the setting improves on the datasets that have a high number of dimensions. This method, however, is untenable due to the massive number of models that must be trained; for this reason we were only able to test one architecture, the VT. Values are bold if they are the best in the row and underlined if they are the best in the row for that model.



Figure 5: A visualisation of the importance of each input dimension in relation to each output dimension over all the datasets for the VT model. Brighter colours indicate a greater relevance.

The datasets are: Electricity, which contains hourly usage measurements of 321 customers from 2012-2015. Electricity Transformer Temperature (ETT), collected from 2016-2018 in China, which includes 7 features/dimensions of the load and oil temperatures of two stations; of the 4 variations we use the M2. Exchange, the daily exchange rates of 8 countries between 1990 and 2016. Influenza-like illness (ILI), the weekly number of patients in the United States from 2002 to 2021, which includes 7 features. Traffic, 862 dimensions of hourly lane occupancy rates in California between 2015-2016. Weather, which contains 21 meteorological indicators for the year of 2020. Table 1 gives a useful summary of the dataset statistics.

Table 1: Statistics of the 6 benchmark datasets.

Therefore, we create an array of subdatasets with reduced training set sizes. The validation and test splits remain unchanged. When time steps are removed, they are removed in chronological order starting at the beginning of the series. Training and testing a model on all the subdatasets, in both the univariate and multivariate settings, shows how sensitive each mode is to the amount of data.

Table 2: Comparing the VT (Vanilla Transformer), Informer, Autoformer and FEDformer in our univariate and the multivariate settings. Values are bold if they are the best in the row and underlined if they are the best in the row for that model. The univariate setting outperforms its multivariate counterpart in the vast majority of tests.

Figure 2: Here we explore how training VT models on datasets with a varying number of dimensions impacts forecasting accuracy. Random subsets of dimensions are selected from every dataset and a model is trained; this is repeated multiple times to calculate the standard error. The dashed line indicates the linear trend, and for most datasets accuracy worsens as the model is trained on additional dimensions. The Illness dataset is the only one that sees an improvement when all dimensions are available.

Figure 3: Evaluating model performance on reduced training set sizes. The loss is plotted against the training size, the fraction of the training data that is used. A value of 0.2 means that only the final 20% of the set is used. The error bars represent the standard error.

APPENDIX

There is greater variance at lower dimension counts due to the high variation in how hard each individual dimension is to forecast. At higher dimension counts, this variance averages out.

