CROSSFORMER: TRANSFORMER UTILIZING CROSS-DIMENSION DEPENDENCY FOR MULTIVARIATE TIME SERIES FORECASTING

Abstract

Recently, many deep models have been proposed for multivariate time series (MTS) forecasting. In particular, Transformer-based models have shown great potential because they can capture long-term dependency. However, existing Transformer-based models mainly focus on modeling the temporal dependency (cross-time dependency) yet often omit the dependency among different variables (cross-dimension dependency), which is critical for MTS forecasting. To fill the gap, we propose Crossformer, a Transformer-based model utilizing cross-dimension dependency for MTS forecasting. In Crossformer, the input MTS is embedded into a 2D vector array through the Dimension-Segment-Wise (DSW) embedding to preserve time and dimension information. Then the Two-Stage Attention (TSA) layer is proposed to efficiently capture the cross-time and cross-dimension dependency. Utilizing the DSW embedding and TSA layer, Crossformer establishes a Hierarchical Encoder-Decoder (HED) to use the information at different scales for the final forecasting. Extensive experimental results on six real-world datasets show the effectiveness of Crossformer against previous state-of-the-art methods.

1. INTRODUCTION

Multivariate time series (MTS) are time series with multiple dimensions, where each dimension represents a specific univariate time series (e.g. a climate feature of weather). MTS forecasting aims to forecast the future values of an MTS using its historical values. MTS forecasting benefits the decision-making of downstream tasks and is widely used in many fields including weather (Angryk et al., 2020), energy (Demirel et al., 2012), finance (Patton, 2013), etc. With the development of deep learning, many models have been proposed and achieved superior performance in MTS forecasting (Lea et al., 2017; Qin et al., 2017; Flunkert et al., 2017; Rangapuram et al., 2018; Li et al., 2019a; Wu et al., 2020; Li et al., 2021). Among them, the recent Transformer-based models (Li et al., 2019b; Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a; Zhou et al., 2022; Chen et al., 2022) show great potential thanks to their ability to capture long-term temporal dependency (cross-time dependency). Besides cross-time dependency, the cross-dimension dependency is also critical for MTS forecasting, i.e. for a specific dimension, information from associated series in other dimensions may improve prediction. For example, when predicting future temperature, not only the historical temperature but also the historical wind speed helps. Some previous neural models explicitly capture the cross-dimension dependency, i.e. they preserve dimension information in the latent feature space and use a convolutional neural network (CNN) (Lai et al., 2018) or graph neural network (GNN) (Wu et al., 2020; Cao et al., 2020) to capture the dependency. However, recent Transformer-based models only implicitly utilize this dependency by embedding: in general, they embed data points in all dimensions at the same time step into one feature vector and try to capture dependency among different time steps (like Fig. 1(b)).
In this way, cross-time dependency is well captured, but cross-dimension dependency is not, which may limit their forecasting capability. To fill the gap, we propose Crossformer, a Transformer-based model that explicitly utilizes cross-dimension dependency for MTS forecasting. Specifically, we devise the Dimension-Segment-Wise (DSW) embedding to process the historical time series. In DSW embedding, the series in each dimension is first partitioned into segments and then embedded into feature vectors. The output of DSW embedding is a 2D vector array whose two axes correspond to time and dimension. We then propose the Two-Stage Attention (TSA) layer to efficiently capture the cross-time and cross-dimension dependency among the 2D vector array. Using the DSW embedding and TSA layer, Crossformer establishes a Hierarchical Encoder-Decoder (HED) for forecasting. In HED, each layer corresponds to a scale: the encoder's upper layer merges adjacent segments output by the lower layer to capture the dependency at a coarser scale, while decoder layers generate predictions at different scales and sum them as the final prediction.

The contributions of this paper are: 1) We dive into the existing Transformer-based models for MTS forecasting and find that the cross-dimension dependency is not well utilized: these models simply embed data points of all dimensions at a specific time step into a single vector and focus on capturing the cross-time dependency among different time steps. Without adequate and explicit mining and utilization of cross-dimension dependency, their forecasting capability is empirically shown to be limited. 2) We develop Crossformer, a Transformer-based model utilizing cross-dimension dependency for MTS forecasting. To the best of our knowledge, it is among the first Transformer-based models that explicitly explore and utilize cross-dimension dependency for MTS forecasting. 3) Extensive experimental results on six real-world benchmarks show the effectiveness of Crossformer against previous state-of-the-art methods. Specifically, Crossformer ranks top-1 among the 9 compared models on 36 out of the 58 settings of varying prediction lengths and metrics, and ranks top-2 on 51 settings.
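To make the DSW embedding described above concrete, here is a minimal NumPy sketch. It partitions each dimension's series into segments and maps every segment to a vector, producing the 2D array of vectors whose axes correspond to time and dimension. A fixed random projection stands in for the learned linear embedding, and the names `seg_len` and `d_model` are illustrative choices, not the paper's notation.

```python
import numpy as np

def dsw_embedding(x, seg_len, d_model, rng=None):
    """Sketch of Dimension-Segment-Wise (DSW) embedding.

    x: (T, D) multivariate series with T time steps and D dimensions.
    Each dimension is split into T // seg_len segments of length seg_len,
    and every segment is mapped to a d_model-dim vector by a linear map
    (a shared random projection here, standing in for a learned layer).
    Returns a (num_seg, D, d_model) array: a 2D grid of feature vectors
    whose two axes correspond to time (segments) and dimension.
    """
    T, D = x.shape
    assert T % seg_len == 0, "T must be divisible by seg_len"
    num_seg = T // seg_len
    rng = np.random.default_rng(0) if rng is None else rng
    W = rng.standard_normal((seg_len, d_model))  # shared linear embedding
    # (T, D) -> (num_seg, seg_len, D) -> (num_seg, D, seg_len):
    # segs[i, d] is the i-th length-seg_len segment of dimension d.
    segs = x.reshape(num_seg, seg_len, D).transpose(0, 2, 1)
    return segs @ W  # (num_seg, D, d_model)

# Toy input: 24 time steps, 3 dimensions.
x = np.arange(24 * 3, dtype=float).reshape(24, 3)
H = dsw_embedding(x, seg_len=6, d_model=8)
print(H.shape)  # (4, 3, 8)
```

The key point is that, unlike the point-wise embedding of prior Transformer models, each vector here summarizes one segment of one dimension, so attention over this grid can relate dimensions to each other as well as time steps.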

2. RELATED WORKS

Multivariate Time Series Forecasting. MTS forecasting models can be roughly divided into statistical and neural models. The vector auto-regressive (VAR) model (Kilian & Lütkepohl, 2017) and the vector auto-regressive moving average (VARMA) model are typical statistical models, which assume linear cross-dimension and cross-time dependency. With the development of deep learning, many neural models have been proposed and often empirically outperform statistical ones. TCN (Lea et al., 2017) and DeepAR (Flunkert et al., 2017) treat the MTS data as a sequence of vectors and use CNN/RNN to capture the temporal dependency. LSTnet (Lai et al., 2018) employs CNN to capture cross-dimension dependency and RNN for cross-time dependency. Another category of works uses graph neural networks (GNNs) to capture the cross-dimension dependency explicitly for forecasting (Li et al., 2018; Yu et al., 2018; Cao et al., 2020; Wu et al., 2020). For example, MTGNN (Wu et al., 2020) uses temporal convolution and graph convolution layers to capture cross-time and cross-dimension dependency. These neural models capture the cross-time dependency through CNN or RNN, which have difficulty in modeling long-term dependency.

Transformers for MTS Forecasting. Transformers (Vaswani et al., 2017) have achieved great success in natural language processing (NLP) (Devlin et al., 2019), computer vision (CV) (Dosovitskiy et al., 2021) and speech processing (Dong et al., 2018). Recently, many Transformer-based models have been proposed for MTS forecasting and show great potential (Li et al., 2019b; Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a; Zhou et al., 2022; Du et al., 2022). LogTrans (Li et al., 2019b) proposes the LogSparse attention that reduces the computation complexity of Transformer from O(L^2) to O(L(log L)^2).[1] Informer (Zhou et al., 2021) exploits the sparsity of attention scores through KL-divergence estimation and proposes ProbSparse self-attention, which achieves O(L log L) complexity. Autoformer (Wu et al., 2021a) introduces a decomposition architecture with an Auto-Correlation mechanism to Transformer, which also achieves O(L log L) complexity. Pyraformer (Liu et al., 2021a) introduces a pyramidal attention module that summarizes features at different resolutions and models temporal dependencies of different ranges with O(L) complexity. FEDformer (Zhou et al., 2022) observes that time series have a sparse representation in the frequency domain and develops a frequency-enhanced Transformer with O(L) complexity. Preformer (Du et al., 2022) divides the embedded feature vector sequence into segments and utilizes segment-wise correlation-based attention

[1] In this paper, complexity refers to both time and space overhead, which are often the same. This is also in line with the presentation in existing works (Wu et al., 2021a; Liu et al., 2021a).
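The asymptotic complexities quoted above can be compared with a toy calculation. The sketch below tabulates leading-order operation counts for an input of length L; constants and logarithm bases are dropped (as in O-notation), so only the growth rates, not the absolute numbers, are meaningful, and the grouping of models under each formula is illustrative.

```python
import math

def attn_costs(L):
    """Leading-order operation counts for the attention variants
    discussed above, as a function of sequence length L.
    Constants are dropped, so only growth rates are meaningful."""
    return {
        "full (vanilla Transformer)": L ** 2,
        "LogSparse (LogTrans)": L * math.log2(L) ** 2,
        "ProbSparse / Auto-Correlation": L * math.log2(L),
        "Pyraformer / FEDformer": L,
    }

# Doubling L eight-fold multiplies the full-attention cost 64x,
# but the linear-complexity variants only 8x.
for L in (512, 4096):
    print(L, attn_costs(L))
```

Such reductions matter because MTS forecasting benchmarks often use look-back windows of hundreds to thousands of steps, where quadratic attention dominates the runtime and memory footprint.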

Funding

* Junchi Yan is the corresponding author. This work was in part supported by NSFC (61972250, U19B2035, 62222607) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

