CROSSFORMER: TRANSFORMER UTILIZING CROSS-DIMENSION DEPENDENCY FOR MULTIVARIATE TIME SERIES FORECASTING

Abstract

Recently, many deep models have been proposed for multivariate time series (MTS) forecasting. In particular, Transformer-based models have shown great potential because they can capture long-term dependency. However, existing Transformer-based models mainly focus on modeling the temporal dependency (cross-time dependency) yet often omit the dependency among different variables (cross-dimension dependency), which is critical for MTS forecasting. To fill the gap, we propose Crossformer, a Transformer-based model utilizing cross-dimension dependency for MTS forecasting. In Crossformer, the input MTS is embedded into a 2D vector array through Dimension-Segment-Wise (DSW) embedding to preserve time and dimension information. A Two-Stage Attention (TSA) layer is then proposed to efficiently capture the cross-time and cross-dimension dependency. Utilizing the DSW embedding and TSA layer, Crossformer establishes a Hierarchical Encoder-Decoder (HED) to use the information at different scales for the final forecasting. Extensive experimental results on six real-world datasets show the effectiveness of Crossformer against previous state-of-the-art methods.

1. INTRODUCTION

Multivariate time series (MTS) are time series with multiple dimensions, where each dimension represents a specific univariate time series (e.g. a climate feature of weather). MTS forecasting aims to forecast the future values of MTS using their historical values. MTS forecasting benefits the decision-making of downstream tasks and is widely used in many fields including weather (Angryk et al., 2020), energy (Demirel et al., 2012), finance (Patton, 2013), etc. With the development of deep learning, many models have been proposed and achieved superior performance in MTS forecasting (Lea et al., 2017; Qin et al., 2017; Flunkert et al., 2017; Rangapuram et al., 2018; Li et al., 2019a; Wu et al., 2020; Li et al., 2021). Among them, the recent Transformer-based models (Li et al., 2019b; Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a; Zhou et al., 2022; Chen et al., 2022) show great potential thanks to their ability to capture long-term temporal dependency (cross-time dependency). Besides cross-time dependency, the cross-dimension dependency is also critical for MTS forecasting: for a specific dimension, information from associated series in other dimensions may improve prediction. For example, when predicting future temperature, not only the historical temperature but also the historical wind speed helps to forecast. Some previous neural models explicitly capture the cross-dimension dependency, i.e. they preserve the information of dimensions in the latent feature space and use convolutional neural networks (CNN) (Lai et al., 2018) or graph neural networks (GNN) (Wu et al., 2020; Cao et al., 2020) to capture the dependency. However, recent Transformer-based models only implicitly utilize this dependency through embedding. In general, Transformer-based models embed the data points of all dimensions at the same time step into a feature vector and try to capture dependency among different time steps (as in Fig. 1 (b)).
In this way, cross-time dependency is well captured, but cross-dimension dependency is not, which may limit their forecasting capability.

2. RELATED WORK

Transformers for MTS forecasting. Transformer-based models for MTS forecasting (Li et al., 2019b; Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a) mainly focus on reducing the complexity of cross-time dependency modeling, but omit the cross-dimension dependency which is critical for MTS forecasting.

Vision Transformers. Transformer was initially applied to NLP for sequence modeling; recent works apply Transformers to CV tasks to process images (Dosovitskiy et al., 2021; Touvron et al., 2021; Liu et al., 2021b; Chen et al., 2021; Han et al., 2021). These works achieve state-of-the-art performance on various CV tasks and inspire our work. ViT (Dosovitskiy et al., 2021) is one of the pioneers of vision Transformers. The basic idea of ViT is to split an image into non-overlapping medium-sized patches and rearrange these patches into a sequence that is input to the Transformer. The idea of partitioning images into patches inspires our DSW embedding, where the MTS is split into dimension-wise segments. Swin Transformer (Liu et al., 2021b) performs local attention within a window to reduce the complexity and builds hierarchical feature maps by merging image patches. Readers can refer to the recent survey (Han et al., 2022) for a comprehensive study of vision Transformers.

3. METHODOLOGY

In multivariate time series forecasting, one aims to predict the future values $x_{T+1:T+\tau} \in \mathbb{R}^{\tau \times D}$ of a time series given its history $x_{1:T} \in \mathbb{R}^{T \times D}$, where $\tau$ and $T$ are the numbers of time steps in the future and the past, respectively, and $D > 1$ is the number of dimensions. A natural assumption is that these $D$ series are associated (e.g. climate features of weather), which helps to improve forecasting accuracy. To utilize the cross-dimension dependency, in Section 3.1 we embed the MTS using Dimension-Segment-Wise (DSW) embedding. In Section 3.2, we propose a Two-Stage Attention (TSA) layer to efficiently capture the dependency among the embedded segments. In Section 3.3, using the DSW embedding and TSA layer, we construct a Hierarchical Encoder-Decoder (HED) to utilize information at different scales for the final forecasting.

3.1. DIMENSION-SEGMENT-WISE EMBEDDING

To motivate our approach, we first analyze the embedding methods of previous Transformer-based models for MTS forecasting (Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a; Zhou et al., 2022). As shown in Fig. 1 (b), existing methods embed the data points at the same time step into a vector: $x_t \to h_t$, $x_t \in \mathbb{R}^D$, $h_t \in \mathbb{R}^{d_{model}}$, where $x_t$ represents all the data points of the $D$ dimensions at step $t$. In this way, the input $x_{1:T}$ is embedded into $T$ vectors $\{h_1, h_2, \ldots, h_T\}$, and the dependency among these $T$ vectors is captured for forecasting. Therefore, previous Transformer-based models mainly capture cross-time dependency, while the cross-dimension dependency is not explicitly captured during embedding, which limits their forecasting capability. Transformer was originally developed for NLP (Vaswani et al., 2017), where each embedded vector represents an informative word. For MTS, a single value at one step alone provides little information; it only forms an informative pattern together with nearby values in the time domain. Fig. 1 (a) shows a typical attention score map of the original Transformer for MTS forecasting: attention values tend to be segmented, i.e. close data points have similar attention weights. Based on these two points, we argue that an embedded vector should represent a series segment of a single dimension (Fig. 1 (c)), rather than the values of all dimensions at a single step (Fig. 1 (b)). To this end, we propose Dimension-Segment-Wise (DSW) embedding, where the points in each dimension are divided into segments of length $L_{seg}$ and then embedded:

$$x_{1:T} = \Big\{x^{(s)}_{i,d} \,\Big|\, 1 \le i \le \tfrac{T}{L_{seg}},\ 1 \le d \le D\Big\}, \qquad x^{(s)}_{i,d} = \big\{x_{t,d} \,\big|\, (i-1) \times L_{seg} < t \le i \times L_{seg}\big\} \tag{1}$$

where $x^{(s)}_{i,d} \in \mathbb{R}^{L_{seg}}$ is the $i$-th segment in dimension $d$ with length $L_{seg}$. For convenience, we assume that $T$ and $\tau$ are divisible by $L_{seg}$ (padding handles the indivisible cases; see Appendix D.1).
Each segment is then embedded into a vector using a linear projection plus a position embedding:

$$h_{i,d} = E\, x^{(s)}_{i,d} + E^{(pos)}_{i,d}$$

where $E \in \mathbb{R}^{d_{model} \times L_{seg}}$ denotes the learnable projection matrix and $E^{(pos)}_{i,d} \in \mathbb{R}^{d_{model}}$ denotes the learnable position embedding for position $(i, d)$. After embedding, we obtain a 2D vector array $H = \big\{h_{i,d} \,\big|\, 1 \le i \le \tfrac{T}{L_{seg}},\ 1 \le d \le D\big\}$, where each $h_{i,d}$ represents a univariate time series segment. The idea of segmentation is also used in Du et al. (2022), which splits the embedded 1D vector sequence into segments to compute a Segment-Correlation, enhancing locality and reducing computational complexity. However, like other Transformers for MTS forecasting, it does not explicitly capture cross-dimension dependency.
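To make the DSW embedding concrete, here is a minimal NumPy sketch of Eq. 1 and the projection above. The function and variable names, and the small sizes used, are illustrative and not from the paper's implementation:

```python
import numpy as np

def dsw_embed(x, L_seg, E, E_pos):
    """Dimension-Segment-Wise embedding.
    x:     (T, D) multivariate series, T divisible by L_seg
    E:     (d_model, L_seg) shared learnable projection matrix
    E_pos: (T//L_seg, D, d_model) learnable position embeddings
    returns H: (T//L_seg, D, d_model) 2D vector array
    """
    T, D = x.shape
    assert T % L_seg == 0
    # split each dimension into segments: (T//L_seg, L_seg, D) -> (T//L_seg, D, L_seg)
    segs = x.reshape(T // L_seg, L_seg, D).transpose(0, 2, 1)
    # linearly project every segment and add its position embedding
    return segs @ E.T + E_pos

rng = np.random.default_rng(0)
T, D, L_seg, d_model = 24, 3, 6, 8
x = rng.standard_normal((T, D))
E = rng.standard_normal((d_model, L_seg))
E_pos = rng.standard_normal((T // L_seg, D, d_model))
H = dsw_embed(x, L_seg, E, E_pos)
print(H.shape)  # (4, 3, 8): T/L_seg segments x D dimensions x d_model
```

Each entry `H[i, d]` is exactly $E\,x^{(s)}_{i,d} + E^{(pos)}_{i,d}$, i.e. one embedded univariate segment.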

3.2. TWO-STAGE ATTENTION LAYER

For the obtained 2D array $H$, one could flatten it into a 1D sequence and feed it to a canonical Transformer, as ViT (Dosovitskiy et al., 2021) does in vision. However, two considerations are specific to MTS: 1) unlike images, where the height and width axes are interchangeable, the time and dimension axes of MTS have different meanings and should be treated differently; 2) directly applying self-attention to the 2D array causes a complexity of $O\!\big(\tfrac{D^2 T^2}{L_{seg}^2}\big)$, which is unaffordable for large $D$. Therefore, we propose the Two-Stage Attention (TSA) layer to capture the cross-time and cross-dimension dependency over the 2D vector array, as sketched in Fig. 2 (a).

Cross-Time Stage Given a 2D array $Z \in \mathbb{R}^{L \times D \times d_{model}}$ as the input of the TSA layer, where $L$ and $D$ are the number of segments and dimensions, respectively. $Z$ can be the output of the DSW embedding or of a lower TSA layer. In the following, we use $Z_{i,:}$ to denote the vectors of all dimensions at time step $i$ and $Z_{:,d}$ for those of all time steps in dimension $d$. In the cross-time stage, we directly apply multi-head self-attention (MSA) to each dimension:

$$\hat{Z}^{time}_{:,d} = \mathrm{LayerNorm}\big(Z_{:,d} + \mathrm{MSA}^{time}(Z_{:,d}, Z_{:,d}, Z_{:,d})\big), \qquad Z^{time} = \mathrm{LayerNorm}\big(\hat{Z}^{time} + \mathrm{MLP}(\hat{Z}^{time})\big) \tag{3}$$

where $1 \le d \le D$; $\mathrm{LayerNorm}$ denotes layer normalization as widely adopted in Vaswani et al. (2017); Dosovitskiy et al. (2021); Zhou et al. (2021); $\mathrm{MLP}$ denotes a multi-layer (two in this paper) feedforward network; and $\mathrm{MSA}(Q, K, V)$ denotes the multi-head self-attention layer (Vaswani et al., 2017) where $Q, K, V$ serve as queries, keys and values. All dimensions ($1 \le d \le D$) share the same MSA layer. $\hat{Z}^{time}, Z^{time}$ denote the outputs of the MSA and MLP, respectively. The computation complexity of the cross-time stage is $O(DL^2)$. After this stage, the dependency among time segments within the same dimension is captured in $Z^{time}$, which then becomes the input of the cross-dimension stage.

Cross-Dimension Stage As shown in Fig. 2 (c), we set a small fixed number $c$ ($c \ll D$) of learnable vectors for each time step $i$ as routers. These routers first aggregate messages from all dimensions, serving as the query in MSA while the vectors of all dimensions serve as key and value; the routers then distribute the received messages back, with the vectors of the dimensions as query and the aggregated messages as key and value. In this way, an all-to-all connection among the $D$ dimensions is built:

$$B_{i,:} = \mathrm{MSA}^{dim}_1(R_{i,:}, Z^{time}_{i,:}, Z^{time}_{i,:}), \qquad \bar{Z}^{dim}_{i,:} = \mathrm{MSA}^{dim}_2(Z^{time}_{i,:}, B_{i,:}, B_{i,:}), \qquad 1 \le i \le L$$
$$\hat{Z}^{dim} = \mathrm{LayerNorm}\big(Z^{time} + \bar{Z}^{dim}\big), \qquad Z^{dim} = \mathrm{LayerNorm}\big(\hat{Z}^{dim} + \mathrm{MLP}(\hat{Z}^{dim})\big) \tag{4}$$

where $R \in \mathbb{R}^{L \times c \times d_{model}}$ denotes the learnable router array. Adding up Eq. 3 and Eq. 4, we model the two stages as:

$$Y = Z^{dim} = \mathrm{TSA}(Z) \tag{5}$$

where $Z, Y \in \mathbb{R}^{L \times D \times d_{model}}$ denote the input and output vector arrays of the TSA layer, respectively. The overall computation complexity of the TSA layer is $O(DL^2 + DL) = O(DL^2)$. After the cross-time and cross-dimension stages, every pair of segments (i.e. $Z_{i_1,d_1}, Z_{i_2,d_2}$) in $Z$ is connected, so both cross-time and cross-dimension dependencies are captured in $Y$.
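The router mechanism of the cross-dimension stage can be sketched as follows. This is a simplified single-head version in NumPy, without LayerNorm, MLP, or learned projections, so it illustrates only the information flow and the $O(cD)$ cost, not the paper's full MSA layers; all names are ours:

```python
import numpy as np

def attn(Q, K, V):
    """Plain single-head scaled dot-product attention (stand-in for MSA)."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def router_stage(Z_i, R_i):
    """Cross-dimension stage at one time step i (Eq. 4, simplified).
    Z_i: (D, d_model) vectors of all dimensions at step i
    R_i: (c, d_model) learnable routers, c << D
    """
    B = attn(R_i, Z_i, Z_i)   # routers aggregate from all D dims: O(c*D)
    Z_out = attn(Z_i, B, B)   # dimensions read back from routers: O(D*c)
    return Z_out              # total O(2cD) = O(D) instead of O(D^2)

rng = np.random.default_rng(1)
D, c, d_model = 300, 10, 16
Z_i = rng.standard_normal((D, d_model))
R_i = rng.standard_normal((c, d_model))
Z_out = router_stage(Z_i, R_i)
print(Z_out.shape)  # (300, 16)
```

Every dimension can influence every other dimension, but only through the $c$ routers, which is what makes the stage linear in $D$.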

3.3. HIERARCHICAL ENCODER-DECODER

Hierarchical structures are widely used in Transformers for MTS forecasting to capture information at different scales (Zhou et al., 2021; Liu et al., 2021a). In this section, we use the proposed DSW embedding, TSA layer and segment merging to construct a Hierarchical Encoder-Decoder (HED). As shown in Fig. 3, each upper layer utilizes information at a coarser scale, and the forecasting values at different scales are added to produce the final result. Encoder In each layer of the encoder (except the first), every two adjacent vectors in the time domain are merged to obtain a representation at a coarser level; then a TSA layer is applied to capture dependency at this scale. The process is modeled as $Z^{enc,l} = \mathrm{Encoder}(Z^{enc,l-1})$:

$$l = 1:\ \hat{Z}^{enc,l} = H; \qquad l > 1:\ \hat{Z}^{enc,l}_{i,d} = M\big[Z^{enc,l-1}_{2i-1,d} \cdot Z^{enc,l-1}_{2i,d}\big],\ 1 \le i \le \tfrac{L_{l-1}}{2},\ 1 \le d \le D; \qquad Z^{enc,l} = \mathrm{TSA}(\hat{Z}^{enc,l}) \tag{6}$$

where $H$ denotes the 2D array obtained by DSW embedding; $Z^{enc,l}$ denotes the output of the $l$-th encoder layer; $M \in \mathbb{R}^{d_{model} \times 2d_{model}}$ denotes a learnable matrix for segment merging; $[\cdot]$ denotes concatenation; and $L_{l-1}$ denotes the number of segments in each dimension in layer $l-1$ (if it is not divisible by 2, we pad $Z^{enc,l-1}$ to the proper length). $\hat{Z}^{enc,l}$ denotes the array after segment merging in the $l$-th layer. Suppose there are $N$ layers in the encoder; we use $Z^{enc,0}, Z^{enc,1}, \ldots, Z^{enc,N}$ (with $Z^{enc,0} = H$) to represent the $N+1$ outputs of the encoder. The complexity of each encoder layer is $O\!\big(\tfrac{D T^2}{L_{seg}^2}\big)$. Decoder Taking the $N+1$ feature arrays output by the encoder, we use $N+1$ layers (indexed by $0, 1, \ldots, N$) in the decoder for forecasting. Layer $l$ takes the $l$-th encoded array as input and outputs a decoded 2D array of layer $l$.
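The segment-merging step of Eq. 6 above can be sketched in NumPy as follows; the names and sizes are illustrative, and the paper's padding for odd lengths is omitted by assuming an even number of segments:

```python
import numpy as np

def merge_segments(Z, M):
    """Merge every two adjacent time segments (Eq. 6, l > 1 case).
    Z: (L, D, d_model) output of the previous encoder layer, L even
    M: (d_model, 2*d_model) learnable merging matrix
    returns: (L//2, D, d_model) coarser-scale array
    """
    L, D, d = Z.shape
    # concatenate each pair (2i-1, 2i) along the feature axis
    pairs = Z.reshape(L // 2, 2, D, d).transpose(0, 2, 1, 3).reshape(L // 2, D, 2 * d)
    # project the concatenated pair back to d_model
    return pairs @ M.T

rng = np.random.default_rng(2)
L, D, d_model = 8, 3, 16
Z = rng.standard_normal((L, D, d_model))
M = rng.standard_normal((d_model, 2 * d_model))
Z_merged = merge_segments(Z, M)
print(Z_merged.shape)  # (4, 3, 16): half as many segments, same d_model
```

Stacking this halving across layers is what gives each upper layer its coarser time scale.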
This process is summarized as $Z^{dec,l} = \mathrm{Decoder}(Z^{dec,l-1}, Z^{enc,l})$:

$$l = 0:\ \tilde{Z}^{dec,l} = \mathrm{TSA}(E^{(dec)}); \qquad l > 0:\ \tilde{Z}^{dec,l} = \mathrm{TSA}(Z^{dec,l-1})$$
$$\bar{Z}^{dec,l}_{:,d} = \mathrm{MSA}\big(\tilde{Z}^{dec,l}_{:,d}, Z^{enc,l}_{:,d}, Z^{enc,l}_{:,d}\big),\ 1 \le d \le D$$
$$\hat{Z}^{dec,l} = \mathrm{LayerNorm}\big(\tilde{Z}^{dec,l} + \bar{Z}^{dec,l}\big), \qquad Z^{dec,l} = \mathrm{LayerNorm}\big(\hat{Z}^{dec,l} + \mathrm{MLP}(\hat{Z}^{dec,l})\big)$$

where $E^{(dec)} \in \mathbb{R}^{\frac{\tau}{L_{seg}} \times D \times d_{model}}$ denotes the learnable position embedding for the decoder. $\tilde{Z}^{dec,l}$ is the output of the TSA layer. The MSA layer takes $\tilde{Z}^{dec,l}_{:,d}$ as query and $Z^{enc,l}_{:,d}$ as key and value to build the connection between encoder and decoder; its output is denoted $\bar{Z}^{dec,l}_{:,d}$. $\hat{Z}^{dec,l}, Z^{dec,l}$ denote the outputs of the skip connection and the MLP, respectively. We use $Z^{dec,0}, Z^{dec,1}, \ldots, Z^{dec,N}$ to represent the decoder outputs. The complexity of each decoder layer is $O\!\big(\tfrac{D \tau (T+\tau)}{L_{seg}^2}\big)$. Linear projection is applied to each layer's output to yield the prediction of that layer, and the layer predictions are summed to make the final prediction (for $l = 0, \ldots, N$):

$$x^{(s),l}_{i,d} = W^l Z^{dec,l}_{i,d}, \qquad x^{pred,l}_{T+1:T+\tau} = \Big\{x^{(s),l}_{i,d} \,\Big|\, 1 \le i \le \tfrac{\tau}{L_{seg}},\ 1 \le d \le D\Big\}, \qquad x^{pred}_{T+1:T+\tau} = \sum_{l=0}^{N} x^{pred,l}_{T+1:T+\tau}$$

where $W^l \in \mathbb{R}^{L_{seg} \times d_{model}}$ is a learnable matrix that projects a vector to a time series segment, and $x^{(s),l}_{i,d} \in \mathbb{R}^{L_{seg}}$ denotes the $i$-th segment in dimension $d$ of the prediction of layer $l$. The segments of each layer are rearranged to form the layer prediction $x^{pred,l}_{T+1:T+\tau}$, and the predictions of all layers are summed to obtain the final forecast $x^{pred}_{T+1:T+\tau}$.

4. EXPERIMENTS

Setup We use the same setting as in Zhou et al. (2021): train/val/test sets are zero-mean normalized with the mean and std of the training set. On each dataset, we evaluate the performance over a changing future window size $\tau$. For each $\tau$, the past window size $T$ is regarded as a hyper-parameter to search, which is a common protocol in the recent MTS Transformer literature (Zhou et al., 2021; Liu et al., 2021a). We roll the whole set with stride 1 to generate different input-output pairs. Mean Squared Error (MSE) and Mean Absolute Error (MAE) are used as evaluation metrics. All experiments are repeated 5 times and the mean of the metrics is reported. Our Crossformer only utilizes the past series to forecast the future, while the baseline models use additional covariates such as hour-of-the-day. Details about datasets, baselines, implementation and hyper-parameters are given in Appendix A.

Baselines We use the following popular models for MTS forecasting as baselines: 1) LSTMa (Bahdanau et al., 2015), 2) LSTnet (Lai et al., 2018), 3) MTGNN (Wu et al., 2020), and recent Transformer-based models for MTS forecasting: 4) Transformer (Vaswani et al., 2017), 5) Informer (Zhou et al., 2021), 6) Autoformer (Wu et al., 2021a), 7) Pyraformer (Liu et al., 2021a) and 8) FEDformer (Zhou et al., 2022).

FEDformer and Autoformer outperform our model on ILI. We conjecture this is because dataset ILI is small and these two models introduce the prior knowledge of sequence decomposition into the network structure, which makes them perform well when data is limited. Crossformer still outperforms the other baselines on this dataset.

4.3. ABLATION STUDY

In our approach, there are three components: DSW embedding, TSA layer and HED. We perform an ablation study on the ETTh1 dataset in line with Zhou et al. (2021); Liu et al. (2021a). We use Transformer as the baseline and DSW+TSA+HED to denote Crossformer without ablation. Three ablation versions are compared: 1) DSW, 2) DSW+TSA, 3) DSW+HED. We analyze the results shown in Table 2. 1) DSW performs better than Transformer in most settings. The only difference between DSW and Transformer is the embedding method, which indicates the usefulness of DSW embedding and the importance of cross-dimension dependency. 2) TSA consistently improves forecasting accuracy, suggesting that it is reasonable to treat time and dimension differently. Moreover, TSA makes it possible to apply Crossformer to datasets with a large number of dimensions (e.g. D = 862 for the Traffic dataset). 3) Comparing DSW+HED with DSW, HED decreases forecasting accuracy when the prediction length is short but improves it for long-term prediction. A possible reason is that information at different scales is helpful for long-term prediction. 4) Combining DSW, TSA and HED, our Crossformer yields the best results in all settings.

4.4. EFFECT OF HYPER-PARAMETERS

We evaluate the effect of two hyper-parameters on the ETTh1 dataset: the segment length ($L_{seg}$ in Eq. 1) and the number of routers in the TSA layer ($c$ in the Cross-Dimension Stage). Segment Length: In Fig. 4 (a), we prolong the segment length from 4 to 24 and evaluate the MSE for different prediction windows. For short-term forecasting ($\tau$ = 24, 48), smaller segments yield slightly better results, but the prediction accuracy is stable. For long-term forecasting ($\tau$ ≥ 168), prolonging the segment length from 4 to 24 causes the MSE to decrease, which indicates that long segments should be used for long-term forecasting. Further prolonging the segment length to 48 for $\tau$ = 336, 720 gives an MSE slightly larger than that of 24. A possible reason is that 24 hours exactly matches the daily period of this dataset, while 48 is too coarse to capture fine-grained information. Number of Routers in the TSA Layer: The number of routers $c$ controls the information bandwidth among the dimensions. As Fig. 4 (b) shows, the performance of Crossformer is stable w.r.t. $c$ for $\tau$ ≤ 336. For $\tau$ = 720, the MSE is large when $c$ = 3 but decreases and stabilizes when $c$ ≥ 5. In practice, we set $c$ = 10 to balance prediction accuracy and computational efficiency.

4.5. COMPUTATIONAL EFFICIENCY ANALYSIS

The theoretical complexity per layer of Transformer-based models is compared in Table 3; for Crossformer, the encoder layer costs $O\!\big(\tfrac{D}{L_{seg}^2} T^2\big)$ and the decoder layer $O\!\big(\tfrac{D}{L_{seg}^2} \tau(\tau+T)\big)$. The complexity of the Crossformer encoder is thus quadratic w.r.t. $T$; however, for long-term prediction, where a large $L_{seg}$ is used, the $\tfrac{1}{L_{seg}^2}$ coefficient significantly reduces the practical complexity. We evaluate the memory occupation of these models on ETTh1. We set the prediction window $\tau$ = 336 and prolong the input length $T$. For Crossformer, $L_{seg}$ is set to 24, the best value for $\tau$ ≥ 168 (see Fig. 4 (a)). The results in Fig. 4 (c) show that Crossformer achieves the best efficiency among the five methods within the tested length range. Theoretically, Informer, Autoformer and FEDformer are more efficient as $T$ approaches infinity; in practice, Crossformer performs better when $T$ is not extremely large (e.g. $T \le 10^4$). We also evaluate the memory occupation w.r.t. the number of dimensions $D$. For baseline models, where cross-dimension dependency is not modeled explicitly, $D$ has little effect; therefore, we compare Crossformer with its ablation versions from Section 4.3. We also evaluate a TSA layer that directly uses MSA in the Cross-Dimension Stage without the router mechanism, denoted TSA (w/o Router). Fig. 4 (d) shows that Crossformer without the TSA layer (DSW and DSW+HED) has quadratic complexity w.r.t. $D$. TSA (w/o Router) helps to reduce the complexity, and the router mechanism further makes it linear, so Crossformer can process data with $D$ = 300. Moreover, HED slightly reduces the memory cost; we attribute this to the fact that there are fewer vectors in the upper layers after segment merging (see Fig. 3). Besides memory occupation, the actual running time is evaluated in Appendix B.6.
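To make the comparison concrete, the following sketch counts attention-score entries per layer for flattened all-to-all 2D attention versus the TSA layer (single head, constants ignored). The function names are ours; $D$ = 862 and $L_{seg}$ = 24 follow the Traffic setting and Fig. 4, while the ratio is only a rough back-of-the-envelope indicator, not a measured result:

```python
# Attention-pair counts per layer, ignoring constant factors and heads.
def full_2d_pairs(D, L):
    """Flatten the (L, D) array and attend all-to-all: O(D^2 L^2)."""
    return (D * L) ** 2

def tsa_pairs(D, L, c):
    """Two-Stage Attention: cross-time O(D L^2) plus
    router-based cross-dimension O(2 c D L)."""
    return D * L * L + 2 * c * D * L

D, c = 862, 10        # e.g. the Traffic dataset, with c = 10 routers
L = 336 // 24         # segments per dimension for T = 336, L_seg = 24
print(full_2d_pairs(D, L), tsa_pairs(D, L, c))
```

The flattened variant pays hundreds of times more attention pairs here, which matches the out-of-memory behavior reported for direct 2D attention on high-dimensional datasets.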

5. CONCLUSIONS AND FUTURE WORK

We have proposed Crossformer, a Transformer-based model utilizing cross-dimension dependency for multivariate time series (MTS) forecasting. Specifically, the Dimension-Segment-Wise (DSW) embedding embeds the input data into a 2D vector array to preserve the information of both time and dimension. The Two-Stage Attention (TSA) layer is devised to capture the cross-time and cross-dimension dependency of the embedded array. Using the DSW embedding and TSA layer, a Hierarchical Encoder-Decoder (HED) is devised to utilize information at different scales. Experimental results on six real-world datasets show its effectiveness over previous state-of-the-art methods. We analyze the limitations of our work and briefly discuss some directions for future research: 1) In the Cross-Dimension Stage, we build a simple full connection among dimensions, which may introduce noise on high-dimensional datasets. Recent sparse and efficient graph Transformers (Wu et al., 2022) could benefit our TSA layer on this problem. 2) A concurrent work (Zeng et al., 2023), accepted after the submission of this work, came to our attention. It questions the effectiveness of Transformers for MTS forecasting and proposes DLinear, which outperforms all Transformers including our Crossformer on three of the six datasets (details are in Appendix B.2). It argues that the main reason is that MSA in Transformers is permutation-invariant; therefore, enhancing the order-preserving capability of Transformers is a promising direction to overcome this shortcoming. 3) Considering that the datasets used in MTS analysis are much smaller and simpler than those used in vision and text, besides new models, large datasets with various patterns are also needed for future research.

A.3.1 MAIN EXPERIMENTS

For the main experiments, we use Crossformer with 3 encoder layers. The number of routers in the TSA layer, $c$, is set to 10. For datasets ETTh1, ETTm1, WTH and ILI, the dimension of the hidden state $d_{model}$ is set to 256 and the number of attention heads to 4; for datasets ECL and Traffic, $d_{model}$ is set to 64 and the number of heads to 2. The segment length $L_{seg}$ is chosen from {6, 12, 24} via grid search. We use MSE as the loss function and set the batch size to 32. The Adam optimizer is used for training, with the initial learning rate chosen from {5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5} via grid search. The total number of epochs is 20; if the validation loss does not decrease within three epochs, training stops early. For baseline models, if the original papers conduct experiments on a dataset we use, the hyper-parameters (except the input length $T$) recommended in the original papers are used, including the number of layers, dimension of hidden states, etc.; otherwise, the hyper-parameters are chosen through grid search on the validation set. Following Zhou et al. (2021), on datasets ETTh1, WTH, ECL and Traffic, for each prediction length $\tau$, the input length $T$ is chosen from {24, 48, 96, 168, 336, 720}; on ETTm1, from {24, 48, 96, 192, 288, 672}; on ILI, from {24, 36, 48, 60}. All models, including Crossformer and the baselines, are implemented in PyTorch and trained on a single NVIDIA Quadro RTX 8000 GPU with 48GB memory.

A.3.2 EFFICIENCY ANALYSIS

To evaluate the computational efficiency w.r.t. the input length $T$ in Figure 4 (c) of the main paper, we align the hyper-parameters of all Transformer-based models as follows: the prediction length $\tau$ is set to 336, the number of encoder layers to 2, the dimension of the hidden state $d_{model}$ to 256, and the number of attention heads to 4. To evaluate the computational efficiency w.r.t. the number of dimensions $D$ in Figure 4 (d) of the main paper, we align the hyper-parameters of the ablation versions of Crossformer as follows: both input length $T$ and prediction length $\tau$ are set to 336, the number of encoder layers to 3, $d_{model}$ to 64, and the number of attention heads to 2. Experiments in the computational efficiency analysis are conducted on a single NVIDIA GeForce RTX 2080Ti GPU with 11GB memory.

A.4 DETAILS OF ABLATION VERSIONS OF CROSSFORMER

We describe the models used in the ablation study below: 1) DSW represents Crossformer without TSA and HED. The input is embedded by DSW embedding and flattened into a 1D sequence, which is input to the original Transformer; the only difference between this model and the Transformer is the embedding method. 2) DSW+TSA represents Crossformer without HED. Compared with Crossformer, the encoder does not use segment merging to capture dependency at different scales, and the decoder takes the final output of the encoder (i.e. $Z^{enc,N}$) as input instead of using the encoder's output at each scale. 3) DSW+HED represents Crossformer without TSA. In each encoder and decoder layer, the 2D vector array is flattened into a 1D sequence and input to the original self-attention layer for dependency capture.

B.1 SHOWCASES OF MAIN RESULTS

Figure 5 shows forecasting cases for three dimensions of the ETTm1 dataset with prediction length τ = 288. For dimension "HUFL", all five models capture the periodic pattern, but Crossformer is the closest to the ground truth. For "HULL", Pyraformer fails to capture the periodic pattern from the noisy data. For "LUFL", where the data has no clear periodic pattern, MTGNN, FEDformer and Crossformer capture its trend and show significantly better results than the other two models. Figure 6 shows forecasting cases for three dimensions of the WTH dataset with prediction length τ = 336. For dimension "DBT", all five models capture the periodic pattern. For "DPT", Autoformer and FEDformer fail to capture the increasing trend of the data. For "WD", all models capture the periodic pattern from the noisy data, and the curves output by MTGNN and Crossformer are sharper than those of the other three models.

B.2 COMPARISON WITH EXTRA METHODS

We further compare with two additional concurrent methods which were either not peer-reviewed (Grigsby et al., 2022) or accepted after the submission of this work (Zeng et al., 2023): 1) STformer (Grigsby et al., 2022), a Transformer-based model that directly flattens the multivariate time series $x_{1:T} \in \mathbb{R}^{T \times D}$ into a 1D sequence input to the Transformer; 2) DLinear (Zeng et al., 2023), a simple linear model with seasonal-trend decomposition that challenges Transformer-based models for MTS forecasting. Results are shown in Table 4; LSTMa and LSTnet are omitted as they are not competitive with the other models. The basic idea of STformer is similar to our Crossformer: both extend 1D attention to 2D. The explicit utilization of cross-dimension dependency makes STformer competitive with previous Transformer-based models on ETTh1, ETTm1 and WTH, especially for short-term prediction. However, STformer directly flattens the raw 2D time series into a 1D sequence input to the Transformer. This straightforward method does not distinguish the time and dimension axes and is computationally inefficient; therefore, despite its good short-term performance, STformer has difficulty with long-term prediction and encounters the out-of-memory (OOM) problem on high-dimensional datasets (ECL and Traffic). In contrast, Crossformer uses the DSW embedding to capture local dependency and reduce complexity, and the TSA layer with the router mechanism is devised to handle the heterogeneity of the time and dimension axes and further improve efficiency. DLinear is on par with our Crossformer on ETTh1 and ETTm1 (τ ≤ 96); has performance similar to FEDformer on ILI; performs worse than Crossformer on WTH; and outperforms all Transformer-based models, including our Crossformer, on ETTm1 (τ ≥ 288), ECL and Traffic. Considering its simplicity, this performance is impressive.
Based on the results, we analyze the limitations of Crossformer and propose some directions to improve it in the future: 1) In the Cross-Dimension Stage of the TSA layer, we simply build an all-to-all connection among the D dimensions with the router mechanism. Besides capturing cross-dimension dependency, this full connection also introduces noise, especially for high-dimensional datasets. We believe high-dimensional data has a sparsity property: each dimension is only relevant to a small fraction of all dimensions. Therefore, utilizing this sparsity to reduce noise and improve the computational efficiency of the TSA layer could be a promising direction. 2) The authors of DLinear (Zeng et al., 2023) argue that Transformer-based models have difficulty preserving ordering information because the attention mechanism is permutation-invariant and the absolute position embedding injected into the model is not enough for time series forecasting, which is an order-sensitive task. Although Yun et al. (2020) theoretically prove that Transformers with

C DISCUSSION ON THE SELECTION OF HYPER-PARAMETERS

We recommend first determining the segment length $L_{seg}$, as it affects both the model performance and the computational efficiency. The general idea is to use a small $L_{seg}$ for short-term prediction and a large $L_{seg}$ for long-term prediction. Priors about the data also help to select $L_{seg}$: for example, if hourly sampled data has a daily period, it is reasonable to set $L_{seg}$ = 24. Next, we select the number of routers $c$.

D.1 PADDING FOR INDIVISIBLE LENGTHS

In the main paper, we assume that the input length $T$ and prediction length $\tau$ are divisible by the segment length $L_{seg}$. In this section, we use a padding mechanism to handle cases where this assumption does not hold. If $T$ is not divisible by $L_{seg}$, we have $(k_1 - 1) L_{seg} < T < k_1 L_{seg}$ for some $k_1$. We pad $k_1 L_{seg} - T$ duplicates of $x_1$ in front of $x_{1:T}$ to get $\tilde{x}_{1:k_1 L_{seg}}$:

$$\tilde{x}_{1:k_1 L_{seg}} = \big[\underbrace{x_1, \ldots, x_1}_{k_1 L_{seg} - T},\; x_{1:T}\big] \tag{9}$$

where $[\,,]$ denotes the concatenation operation. $\tilde{x}_{1:k_1 L_{seg}} \in \mathbb{R}^{k_1 L_{seg} \times D}$ can then be input to the encoder of Crossformer. If $\tau$ is not divisible by $L_{seg}$, we have $(k_2 - 1) L_{seg} < \tau < k_2 L_{seg}$ for some $k_2$. We set the learnable position embedding for the decoder as $E^{(dec)} \in \mathbb{R}^{k_2 \times D \times d_{model}}$ and input it to the decoder to get an output of shape $\mathbb{R}^{k_2 L_{seg} \times D}$; the first $\tau$ steps of the output are then used as $x^{pred}_{T+1:T+\tau}$. We conduct experiments on the ETTm1 dataset to evaluate the effect of indivisible lengths. Results in Table 6 show that with the padding mechanism, indivisible lengths do not degrade model performance, for either short-term or long-term prediction.
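The front-padding of Eq. 9 can be sketched in a few lines of NumPy; the function name and the toy sizes are illustrative:

```python
import numpy as np

def pad_front(x, L_seg):
    """Pad x (T, D) at the front with copies of x[0] so the length
    becomes the smallest multiple of L_seg that is >= T (Eq. 9)."""
    T = x.shape[0]
    k1 = -(-T // L_seg)                          # ceil(T / L_seg)
    pad = np.repeat(x[:1], k1 * L_seg - T, axis=0)
    return np.concatenate([pad, x], axis=0)

x = np.arange(10, dtype=float).reshape(5, 2)     # T = 5, D = 2
x_padded = pad_front(x, L_seg=4)
print(x_padded.shape)  # (8, 2): padded up to k1 * L_seg = 2 * 4
```

The padded series is then segmentable by the DSW embedding exactly as in the divisible case.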

D.2 INCORPORATING COVARIATES

In the main text, we only use the historical series $x_{1:T}$ to forecast the future $x_{T+1:T+\tau}$. In this section, we try to incorporate covariates $c_{1:T+\tau}$ into Crossformer using a straightforward method: first embed the covariates into point-wise vectors $\{d_1, d_2, \ldots, d_{T+\tau}\}$ as previous Transformer-based models do (Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a); then merge the point-wise vectors into segment-wise vectors using a learnable linear combination; finally, add the segment-wise vectors to each dimension of the 2D vector array obtained by DSW embedding:

$$c_t \to d_t,\ 1 \le t \le T+\tau; \qquad d^{(s)}_i = \sum_{0 < j \le L_{seg}} \alpha_j\, d_{(i-1) \times L_{seg} + j},\ 1 \le i \le \tfrac{T}{L_{seg}}; \qquad h^{cov}_{i,d} = h_{i,d} + d^{(s)}_i,\ 1 \le i \le \tfrac{T}{L_{seg}},\ 1 \le d \le D$$

where $\to$ denotes the embedding method for point-wise covariates; $\alpha_j$ ($1 \le j \le L_{seg}$) denote the learnable factors for the linear combination; $d^{(s)}_i$ denotes the segment-wise covariate embedding; and $h^{cov}_{i,d}$ denotes the embedded vector with covariate information for the $i$-th segment in dimension $d$, where $h_{i,d}$ is the embedded vector obtained from DSW embedding in the main text. The processing for the decoder input is similar: the segment-wise covariate embedding is added to the position embedding for the decoder, i.e. $E^{(dec)}$. We conduct experiments on the ETTh1 dataset to evaluate the effect of covariates. Hour-of-the-day, day-of-the-week, day-of-the-month and day-of-the-year are used as covariates. Results in Table 7 show that incorporating covariates does not improve the performance of Crossformer. A possible reason is that this straightforward embedding method does not cooperate well with Crossformer; incorporating covariates into Crossformer to further improve prediction accuracy remains an open problem.
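The merge-and-add step above can be sketched in NumPy as follows. The point-wise embedding of the raw covariates is not shown (we start from already-embedded vectors), and all names and sizes are illustrative:

```python
import numpy as np

def add_covariates(H, d_points, alpha):
    """Merge point-wise covariate embeddings into segment-wise vectors
    and add them to every dimension of the DSW array.
    H:        (T//L_seg, D, d_model) DSW-embedded array
    d_points: (T, d_model) point-wise covariate embeddings d_1..d_T
    alpha:    (L_seg,) learnable combination weights
    """
    n_seg, D, d_model = H.shape
    L_seg = len(alpha)
    # d_seg[i] = sum_j alpha[j] * d_points[(i-1)*L_seg + j]
    d_seg = (d_points.reshape(n_seg, L_seg, d_model) * alpha[:, None]).sum(axis=1)
    # broadcast the same segment-wise covariate over all D dimensions
    return H + d_seg[:, None, :]

rng = np.random.default_rng(3)
T, D, L_seg, d_model = 12, 3, 4, 8
H = rng.standard_normal((T // L_seg, D, d_model))
d_points = rng.standard_normal((T, d_model))
alpha = rng.standard_normal(L_seg)
H_cov = add_covariates(H, d_points, alpha)
print(H_cov.shape)  # (3, 3, 8)
```

Note that the covariate vector $d^{(s)}_i$ is shared across all $D$ dimensions of segment $i$, matching the equation above.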



In this work, we mainly focus on forecasting using only past data without covariates, but covariates can be easily incorporated into Crossformer through embedding; details are shown in Appendix D.2. If T and τ are not divisible by L_seg, we pad them to a proper length; see details in Appendix D.1. Pyraformer is not evaluated because it requires the additional compiler TVM to achieve linear complexity.



Figure 1: Illustration for our DSW embedding. (a) Self-attention scores from a 2-layer Transformer trained on ETTh1, showing that MTS data tends to be segmented. (b) Embedding method of previous Transformer-based models (Li et al., 2019b; Zhou et al., 2021; Wu et al., 2021a; Liu et al., 2021a): data points in different dimensions at the same step are embedded into a vector. (c) DSW embedding of Crossformer: in each dimension, nearby points over time form a segment for embedding.
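The per-dimension segmentation of panel (c) can be sketched as follows: each dimension's series is split into segments of length L_seg, and every segment is embedded by the same linear map. A minimal NumPy sketch (the shared weight `W` stands in for a learned projection; bias and position embedding are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, seg_len, d_model = 24, 7, 6, 32
n_seg = T // seg_len

x = rng.standard_normal((T, D))              # input MTS of shape (T, D)
W = rng.standard_normal((seg_len, d_model))  # shared linear embedding weight

# DSW embedding: in each dimension, nearby points form a segment; every
# segment of length seg_len is embedded into a d_model-dimensional vector.
segments = x.reshape(n_seg, seg_len, D).transpose(0, 2, 1)  # (n_seg, D, seg_len)
H = segments @ W                                            # (n_seg, D, d_model)
```

The result `H` is the 2D vector array (segments × dimensions) that the TSA layer consumes.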

Figure 2: The TSA layer. (a) Two-Stage Attention Layer to process a 2D vector array representing multivariate time series: each vector refers to a segment of the original series. The whole vector array goes through the Cross-Time Stage and Cross-Dimension Stage to get corresponding dependency. (b) Directly using MSA in Cross-Dimension Stage to build the D-to-D connection results in O(D 2 ) complexity. (c) Router mechanism for Cross-Dimension Stage: a small fixed number (c) of "routers" gather information from all dimensions and then distribute the gathered information. The complexity is reduced to O(2cD) = O(D).
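The router mechanism of panel (c) can be illustrated with plain dot-product attention. This is a simplified sketch: learned query/key/value projections and multiple heads are omitted, and it shows only the vectors of one time step; `attend` and the variable names are ours.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head dot-product attention (learned projections omitted)."""
    scores = softmax(q @ kv.T / np.sqrt(q.shape[-1]))
    return scores @ kv

rng = np.random.default_rng(0)
D, c, d_model = 100, 5, 16
z = rng.standard_normal((D, d_model))        # one time step, all D dimensions
routers = rng.standard_normal((c, d_model))  # c learnable "router" vectors

gathered = attend(routers, z)  # gather: c routers attend to D dimensions, O(cD)
z_out = attend(z, gathered)    # distribute: D dimensions attend to c routers, O(cD)
```

Both attention calls cost O(cD) rather than the O(D^2) of direct dimension-to-dimension MSA, since c is a small constant.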

Figure 3: Architecture of the Hierarchical Encoder-Decoder in Crossformer with 3 encoder layers. The length of each vector denotes the covered time range. The encoder (left) uses TSA layer and segment merging to capture dependency at different scales: a vector in upper layer covers a longer range, resulting in dependency at a coarser scale. Exploring different scales, the decoder (right) makes the final prediction by forecasting at each scale and adding them up.
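The segment merging between encoder layers can be sketched as below: adjacent segments are concatenated along the feature axis and projected back to d_model, so each vector in the upper layer covers a longer time range. This is an illustrative NumPy sketch; a window of 2 segments and the weight `W_merge` are our assumptions standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seg, D, d_model = 8, 7, 16
win = 2  # number of adjacent segments merged per step (assumed)

H = rng.standard_normal((n_seg, D, d_model))       # output of previous layer
W_merge = rng.standard_normal((win * d_model, d_model))

# Group every `win` adjacent segments, concatenate their features per
# dimension, then project back to d_model: segment count is halved.
grouped = H.reshape(n_seg // win, win, D, d_model)
concat = grouped.transpose(0, 2, 1, 3).reshape(n_seg // win, D, win * d_model)
H_coarse = concat @ W_merge                        # (n_seg // win, D, d_model)
```

Applying this between TSA layers yields the coarser scales used by the upper encoder layers.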

x^(s),l_{i,d} ∈ R^{L_seg} denotes the i-th segment in dimension d of the layer-l prediction. All the segments in layer l together form the prediction of that layer, and the final forecast is obtained by summing the predictions of all layers.

PROTOCOLS

Datasets. We conduct experiments on six real-world datasets following Zhou et al. (2021); Wu et al. (2021a): 1) ETTh1 (Electricity Transformer Temperature-hourly), 2) ETTm1 (Electricity Transformer Temperature-minutely), 3) WTH (Weather), 4) ECL (Electricity Consuming Load), 5) ILI (Influenza-Like Illness), 6) Traffic. The train/val/test splits for the first four datasets are the same as Zhou et al. (2021); the last two are split by the ratio of 0.7:0.1:0.2 following Wu et al. (2021a).

Figure 4: Evaluation of hyper-parameter impact and computational efficiency. (a) MSE against the segment length L_seg in DSW embedding on ETTh1. (b) MSE against the number of routers c in the Cross-Dimension Stage of the TSA layer on ETTh1. (c) Memory occupation against the input length T on ETTh1. (d) Memory occupation against the number of dimensions D on synthetic datasets with different numbers of dimensions.

Figure 5: Forecasting cases of three dimensions: High UseFul Load (HUFL), High UseLess Load (HULL) and Low UseFul Load (LUFL) of the ETTm1 dataset with prediction length τ = 288. The red / blue curves stand for the ground truth / prediction. Each row represents one model and each column represents one dimension.

Figure 6: Forecasting cases of three dimensions: Dry Bulb Temperature (DBT), Dew Point Temperature (DPT) and Wind Direction (WD) of the WTH dataset with prediction length τ = 336. The red / blue curves stand for the ground truth / prediction. Each row represents one model and each column represents one dimension.

(Annotation from Figure 7: attention scores are shown for the query time segment T + 19 ∼ T + 24.)

Figure 7: Attention scores calculated by the decoder of the ablation version of Crossformer (i.e. DSW) on the ETTh1 dataset. The input length, prediction length and segment length are set as T = 168, τ = 24, L_seg = 6. The x-axis in each sub-figure represents the time steps that serve as keys in the attention mechanism, while the y-axis denotes dimensions. Brighter color denotes higher attention weight.

Figure 9 (b) shows the running time per batch of Crossformer and its ablation versions w.r.t. the number of dimensions D. The versions without the TSA layer (DSW and DSW+HED) are faster when D is small (D ≤ 30). However, they have difficulty processing high-dimensional MTS due to their quadratic complexity w.r.t. D. Indeed, on a single NVIDIA GeForce RTX 2080Ti GPU with 11GB memory, DSW and DSW+HED encounter the out-of-memory (OOM) problem when D > 50. Moreover, TSA(w/o Router) encounters the OOM problem when D > 200.

MSE/MAE with different prediction lengths. Bold/underline indicates the best/second. Results of LSTMa, LSTnet, Transformer, Informer on the first 4 datasets are from Zhou et al. (2021).



Component ablation of Crossformer: DSW embedding, TSA layer and HED on ETTh1.

Computation complexity per layer of Transformer-based models. T denotes the length of past series, τ denotes the length of prediction window, D denotes the number of dimensions, L seg denotes the segment length of DSW embedding in Crossformer.

MSE/MAE comparison with extra methods: STformer (Grigsby et al., 2022) and DLinear (Zeng et al., 2023). Bold/underline indicates the best/second. OOM indicates the out-of-memory problem. Gray background marks the CNN/GNN-based model; yellow marks Transformer-based models where cross-dimension dependency is omitted; blue marks Transformer-based models explicitly utilizing cross-dimension dependency; red marks the linear model with series decomposition.

Complementary results to the ablation study in Table 2. TSA(w/o Router) denotes the TSA layer without the router mechanism, i.e. directly using MSA in the Cross-Dimension Stage.

MSE and MAE evaluation with different segment lengths on the ETTm1 dataset. * denotes the segment length used in the main text, which is a divisor of T and τ.

Next, we select the number of encoder and decoder layers N. Crossformer with larger N can utilize information at more scales, but also requires more computing resources. The number of routers c in the TSA layer can be set to 5 or 10 to balance prediction accuracy and computation efficiency. Finally, the dimension of hidden states d_model and the number of heads in multi-head attention can be determined based on the available computing resources.

MSE and MAE evaluation of Crossformer without/with covariates on ETTh1 dataset.

FUNDING

* Junchi Yan is the corresponding author. This work was in part supported by NSFC (61972250, U19B2035, 62222607) and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

A DETAILS OF EXPERIMENTS

A.1 BENCHMARKING DATASETS

We conduct experiments on the following six real-world datasets following Zhou et al. (2021); Wu et al. (2021a): 1) ETTh1 (Electricity Transformer Temperature-hourly) contains 7 indicators of an electricity transformer over two years, including oil temperature, useful load, etc. Data points are recorded every hour and the train/val/test split is 12/4/4 months. 2) ETTm1 (Electricity Transformer Temperature-minutely) contains the same indicators recorded every 15 minutes. The train/val/test splits for ETTh1, ETTm1, WTH and ECL are the same as Zhou et al. (2021); those for ILI and Traffic are the same as Wu et al. (2021a). The first four datasets are publicly available at https://github.com/zhouhaoyi/Informer2020 and the last two at https://github.com/thuml/Autoformer.

A.2 BASELINE METHODS

We briefly describe the selected baselines:
1) LSTMa (Bahdanau et al., 2015) treats the input MTS as a sequence of multi-dimensional vectors. It builds an encoder-decoder using RNNs and automatically aligns target future steps with their relevant past steps.
2) LSTnet (Lai et al., 2018) uses a CNN to extract cross-dimension dependency and short-term cross-time dependency. The long-term cross-time dependency is captured through an RNN. The source code is available at https://github.com/laiguokun/LSTNet.
3) MTGNN (Wu et al., 2020) explicitly utilizes cross-dimension dependency using a GNN. A graph learning layer learns a graph structure where each node represents one dimension of the MTS. Graph convolution modules are then interleaved with temporal convolution modules to explicitly capture cross-dimension and cross-time dependency, respectively. The source code is available at https://github.com/nnzhan/MTGNN.
4) Transformer is close to the original Transformer (Vaswani et al., 2017), using the self-attention mechanism to capture cross-time dependency. The Informer-style one-step generative decoder is used for forecasting; it is therefore denoted as Informer† in Informer (Zhou et al., 2021).
5) Informer (Zhou et al., 2021) is a Transformer-based model using the ProbSparse self-attention to capture cross-time dependency for forecasting. The source code of Transformer and Informer is available at https://github.com/zhouhaoyi/Informer2020.
6) Autoformer (Wu et al., 2021a) is a Transformer-based model using a decomposition architecture with the Auto-Correlation mechanism to capture cross-time dependency for forecasting. The source code is available at https://github.com/thuml/Autoformer.
7) Pyraformer (Liu et al., 2021a) is a Transformer-based model that learns a multi-resolution representation of the time series via the pyramidal attention module to capture cross-time dependency for forecasting. The source code is available at https://github.com/alipay/Pyraformer.
8) FEDformer (Zhou et al., 2022) is a Transformer-based model that uses seasonal-trend decomposition with frequency-enhanced blocks to capture cross-time dependency for forecasting. The source code is available at https://github.com/MAZiqing/FEDformer.

As quoted from its paper, the authors of DLinear mention that it "does not model correlations among variates". Therefore, incorporating cross-dimension dependency into DLinear to further improve prediction accuracy is also a promising direction. Moreover, our DSW embedding to enhance locality and our HED to capture dependency at different scales could also potentially inspire and enhance DLinear.

B.3 ABLATION STUDY OF THE ROUTER MECHANISM

The ablation study of the three main components of Crossformer is shown in Sec. 4.3. In this section, we conduct an ablation study of the router mechanism, a sub-module of the TSA layer, and evaluate its impact on prediction accuracy. Note that the router mechanism is mainly proposed to reduce the computation complexity when D is large. Results are shown in Table 5. Adding TSA(w/o Router) consistently improves the prediction accuracy of DSW and DSW+HED, showing the necessity of capturing cross-time and cross-dimension dependency in two different stages. For short-term prediction (τ ≤ 168), the performances of TSA(w/o Router) and TSA are similar, no matter whether HED is used. For long-term prediction (τ ≥ 336), the router mechanism slightly improves prediction accuracy. A possible reason is that we set separate routers for each time step, which helps capture long-term dependency that varies over time.

