EDGE-VARYING FOURIER GRAPH NETWORKS FOR MULTIVARIATE TIME SERIES FORECASTING

Anonymous authors
Paper under double-blind review

Abstract

The key to multivariate time series (MTS) analysis and forecasting is to disclose the underlying couplings between variables that drive their co-movements. Many recent successful MTS methods are built with graph neural networks (GNNs) due to their essential capacity for relational modeling. However, previous work often used a static graph structure of time-series variables to model MTS and thus failed to capture their ever-changing correlations over time. In this paper, we build a fully-connected supra-graph, representing non-static correlations between any two variables at any two timestamps, to capture high-resolution spatial-temporal dependencies. Yet, conducting graph convolutions on such a supra-graph heavily increases computational complexity. As a result, we propose the novel Edge-Varying Fourier Graph Networks (EV-FGN), which reformulate the graph convolutions in the frequency domain with high efficiency and scale-free parameters, and apply edge-varying graph filters to capture the time-varying variable dependencies. Extensive experiments show that EV-FGN outperforms state-of-the-art methods on seven real-world MTS datasets.

To capture time-varying diffusion over the supra-graph, we introduce an edge-varying graph filter that weights the graph edges differently across iterations, and we reformulate the edge-varying graph filter with multiple frequency-invariant FGSOs in the frequency domain to reduce the computational cost of graph convolutions. Finally, a novel Edge-Varying Fourier Graph Network (EV-FGN) is designed for MTS analysis, which stacks multiple FGSOs to perform highly efficient multi-layer graph convolutions in the Fourier space.
The main contributions of this paper are summarized as follows:

• We adaptively learn a supra-graph, representing non-static correlations between any two variables at any two timestamps, to capture high-resolution spatial-temporal dependencies.

• To efficiently compute graph convolutions over the supra-graph, we reformulate the graph convolutions in the Fourier space by leveraging FGSO. To the best of our knowledge, this work makes the first step toward reformulating graph convolutions in the Fourier space.

• We design a novel network, EV-FGN, evolved from edge-varying graph filters for MTS analysis to capture the time-varying variable dependencies in the Fourier space. This study makes the first attempt to design a complex-valued feed-forward network in the Fourier space to efficiently compute multi-layer graph convolutions.

• Extensive experimental results on seven MTS datasets demonstrate that EV-FGN achieves state-of-the-art performance with high efficiency and fewer parameters. Multifaceted visualizations further interpret the efficacy of EV-FGN in graph representation learning for MTS forecasting.

2. RELATED WORK

2.1. MULTIVARIATE TIME SERIES FORECASTING

Classic time series forecasting methods are linear models, such as VAR Watson (1993), ARIMA Asteriou & Hall (2011) and state space models (SSM) Hyndman et al. (2008). Recently, deep learning based methods Lai et al. (2018); Sen et al. (2019); Zhou et al. (2021) have dominated MTS forecasting due to their capability of fitting complex nonlinear correlations Lim & Zohren (2021).

MTS with GNNs. More recently, MTS forecasting has embraced GNNs Wu et al. (2019); Bai et al. (2020); Wu et al. (2020); Yu et al. (2018b); Chen et al. (2022); Li et al. (2018) due to their strong capability of modeling structural dependencies between variables. Most of these models, such as STGCN Yu et al. (2018b), DCRNN Li et al. (2018) and TAMP-S2GCNets Chen et al. (2022), require a pre-defined graph structure, which is usually unknown in most cases.
In recent years, some GNN-based works Kipf et al. (2018); Deng & Hooi (2021) account for dynamic dependencies through network design, such as time-varying attention Deng & Hooi (2021). In comparison, our proposed model captures dynamic dependencies by leveraging the high-resolution correlations in the supra-graph without introducing specialized networks.

MTS with Fourier transform. Recently, an increasing number of MTS forecasting models have introduced Fourier theory into neural networks as high-efficiency convolution operators Guibas et al. (2022); Chi et al. (2020). SFM Zhang et al. (2017) decomposes the hidden state of an LSTM into multiple frequencies by the discrete Fourier transform (DFT). mWDN Wang et al. (2018) decomposes the time series into multilevel sub-series by discrete wavelet decomposition and feeds each to an LSTM network. ATFN Yang et al. (2022) utilizes a time-domain block to learn the trending features of complicated non-stationary time series and a frequency-domain block to capture their dynamic and complicated periodic patterns. FEDformer Zhou et al. (2022) proposes an attention mechanism with low-rank approximation in frequency and a mixture-of-expert decomposition to control the distribution shift. However, these models only capture temporal dependencies in the frequency domain. StemGNN Cao et al. (2020) takes advantage of both inter-series correlations and temporal dependencies by modeling them in the spectral domain, but it captures the temporal and spatial dependencies separately. Unlike these efforts, our model jointly encodes spatial-temporal dependencies in the Fourier space.

1. INTRODUCTION

Multivariate time series (MTS) forecasting is a key ingredient in many real-world scenarios, including weather forecasting Zheng et al. (2015), decision making Borovykh et al. (2017), traffic forecasting Yu et al. (2018a); Bai et al. (2020), COVID-19 prediction Cao et al. (2020); Chen et al. (2022), etc. Recently, deep neural networks, such as long short-term memory (LSTM) Hochreiter & Schmidhuber (1997), convolutional neural networks (CNN) Borovykh et al. (2017) and Transformers Vaswani et al. (2017), have dominated MTS modeling. In particular, graph neural networks (GNNs) have demonstrated promising performance on MTS forecasting with their essential capability to capture the complex couplings between time-series variables. Some studies adaptively learn the graph for MTS forecasting even without an explicit graph structure, e.g., by node similarity Mateos et al. (2019); Bai et al. (2020); Wu et al. (2019) and/or self-attention mechanisms Cao et al. (2020). Despite the success of GNNs on MTS forecasting, three practical challenges remain to be addressed: 1) the dependencies between each pair of time-series variables are generally non-static, which demands modeling the full dependencies with all possible lags; 2) a high-efficiency dense graph learning method is needed to replace the high-cost operators of graph convolutions and attention (whose time complexity is quadratic in the graph size); 3) the graph over MTS varies with different temporal interactions, which demands an efficient dynamic structure encoding method.

In this paper, different from most GNN-based methods that construct graphs to model the spatial correlation between variables (i.e., nodes) Bai et al. (2020); Wu et al. (2019; 2020), we attempt to build a supra-graph that sheds light on the "high-resolution" correlations between any two variables at any two timestamps (i.e., fine-grained spatial-temporal dependencies) and largely enhances the expressiveness of non-static spatial-temporal dependencies. Obviously, the supra-graph heavily increases the computational complexity of a GNN-based model, so a high-efficiency learning method is required to reduce the cost of model training. Inspired by the Fourier Neural Operator (FNO) Li et al. (2021), we reformulate the graph convolution (in the time domain) as much lower-complexity matrix multiplication in the frequency domain by leveraging an efficient and newly-defined Fourier Graph Shift Operator (FGSO). In addition, it is necessary to consider multiple iterations (layers) of graph convolutions to expand the receptive neighborhood and mix the diffusion information in the graph.


Figure 1: The network architecture of our proposed model. Given an input X ∈ R N×T, we 1) embed X into X ∈ R N×T×d; 2) transform X to the Fourier space X ∈ C N×T×d by 2D DFT on the discrete N × T spatial-temporal space; 3) perform graph convolutions in the Fourier space by multiplying the FGSOs S with X in the K-layer EV-FGN; 4) transform the output of EV-FGN back to the time domain X_Ψ ∈ R N×T×d by 2D IDFT; 5) generate τ-step predictions X̂ ∈ R N×τ by feeding X_Ψ to a two-layer feed-forward network.

2.2. GRAPH SHIFT OPERATOR

Graph shift operators (GSOs) (e.g., the adjacency matrix and the Laplacian matrix) are a general set of linear operators which are used to encode neighbourhood topologies in the graph. Klicpera et al. (2019) shows that applying the varying GSOs in the message passing step of GNNs can lead to significant improvement of performance. Dasoulas et al. (2021) proposes a parameterized graph shift operator to automatically adapt to networks with varying sparsity. Isufi et al. (2021) allows different nodes to use different GSOs to weight the information of different neighbors. Hadou et al. (2022) introduces a linear composition of the graph shift operator and time-shift operator to design space-time filters for time-varying graph signals. Inspired by these works, in this paper we design a varying parameterized graph shift operator in Fourier space.

2.3. FOURIER NEURAL OPERATOR

Different from classical neural networks, which learn mappings between finite-dimensional Euclidean spaces, neural operators learn mappings between infinite-dimensional function spaces Kovachki et al. (2021b). Fourier neural operators (FNOs), currently the most promising of the neural operators, are universal, in the sense that they can approximate any continuous operator to the desired accuracy Kovachki et al. (2021a). Li et al. (2021) formulates a new neural operator by parameterizing the integral kernel directly in the Fourier space, allowing for an expressive and efficient architecture for partial differential equations. Guibas et al. (2022) proposes an efficient token mixer that learns to mix in the Fourier domain, a principled architectural modification of FNO. In this paper, we learn a Fourier graph shift operator by leveraging the Fourier neural operator.

3. METHODOLOGY

Let us denote the entire MTS raw data as X ∈ R N×L with N variables and L timestamps. Under the rolling setting, we have window-sized time-series inputs with T timestamps, i.e., {X | X ⊂ X, X ∈ R N×T}. Accordingly, we formulate the problem of MTS forecasting as learning the spatial-temporal dependencies simultaneously on a supra-graph G = (X, S) attributed to each X. The supra-graph G contains N * T nodes that represent the value of each variable at each timestamp in X, and S ∈ R (N*T)×(N*T) is a graph shift operator (GSO) representing the connection structure of G. Since the underlying graph is unknown in most MTS scenarios, we assume all nodes in the supra-graph are connected with each other, i.e., a fully-connected graph, and perform graph convolutions on this graph to learn spatial-temporal representations. Then, given the observed values of the previous T steps at timestamp t, i.e., X_t−T:t ∈ R N×T, the task of multi-step multivariate time series forecasting is to predict the values of the N variables for the next τ steps, denoted X̂_t+1:t+τ ∈ R N×τ, on the supra-graph G, formulated as follows:

X̂_t+1:t+τ = F(X_t−T:t; G; Θ) (1)

where F is the forecasting model with parameters Θ.
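The rolling setting above can be sketched with a small numpy helper that slices the full series into T-step histories and τ-step targets; `rolling_windows` is a hypothetical name for illustration, not the paper's code:

```python
import numpy as np

def rolling_windows(X, T, tau):
    """Slice a full series X (N variables x L timestamps) into
    (input, target) pairs: T-step history -> tau-step horizon."""
    N, L = X.shape
    inputs, targets = [], []
    for t in range(T, L - tau + 1):
        inputs.append(X[:, t - T:t])       # X_{t-T:t}, shape (N, T)
        targets.append(X[:, t:t + tau])    # X_{t+1:t+tau}, shape (N, tau)
    return np.stack(inputs), np.stack(targets)

X = np.random.rand(5, 100)                 # N=5 variables, L=100 timestamps
xs, ys = rolling_windows(X, T=12, tau=3)
print(xs.shape, ys.shape)                  # (86, 5, 12) (86, 5, 3)
```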

3.1. OVERALL ARCHITECTURE

The overall architecture of our model is illustrated in Fig. 1. Given input data X ∈ R N×T, we first embed the data into embeddings X ∈ R N×T×d by assigning a d-dimensional vector to each node using an embedding matrix Φ ∈ R N×T×d, i.e., X = X × Φ. Instead of directly learning the huge embedding matrix, we introduce two small parameter matrices to factorize Φ: 1) a variable embedding matrix ϕ_v ∈ R N×1×d, and 2) a temporal embedding matrix ϕ_u ∈ R 1×T×d, i.e., Φ = ϕ_v × ϕ_u. Subsequently, we perform a 2D discrete Fourier transform (DFT) on each discrete N × T spatial-temporal plane of the embeddings X and obtain the frequency input X := DFT(X) ∈ C N×T×d. We then feed X to the K-layer Edge-Varying Fourier Graph Networks (denoted Ψ_K) to perform graph convolutions that capture the spatial-temporal dependencies simultaneously in the Fourier space. To make predictions in the time domain, we perform a 2D inverse Fourier transform to generate the representations X_Ψ := IDFT(Ψ_K(X)) ∈ R N×T×d. The representation is then fed to a two-layer feed-forward network (FFN, see more details in Appendix F.4), parameterized with weights and biases denoted ϕ_ff, to make predictions for the future τ steps, X̂ ∈ R N×τ, in one forward pass. The L2 loss function for multi-step forecasting can be formulated as:

L(X̂; X; Θ) = Σ_t ‖X̂_t+1:t+τ − X_t+1:t+τ‖₂² (2)

with parameters Θ = {ϕ_v, ϕ_u, ϕ_ff, Ψ_K} and the ground truth X_t+1:t+τ ⊂ X at timestamp t.
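The five steps above can be sketched end-to-end in numpy. The toy sizes, the complex tanh activation and the ReLU head are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d, tau, K = 4, 8, 6, 3, 3                # toy sizes; real ones are larger

# 1) factorized embedding: Phi = phi_v * phi_u, broadcast to (N, T, d)
phi_v = rng.standard_normal((N, 1, d))
phi_u = rng.standard_normal((1, T, d))
X_raw = rng.standard_normal((N, T))
X_emb = X_raw[:, :, None] * (phi_v * phi_u)    # (N, T, d)

# 2) 2D DFT over the N x T spatial-temporal plane (per channel)
Xf = np.fft.fft2(X_emb, axes=(0, 1))           # complex (N, T, d)

# 3) K-layer EV-FGN: X_k = sigma(X_{k-1} S_k + b_k), output = sum over k
S = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(K)]
b = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(K)]
out, Xk = Xf, Xf                               # X_0 := F(X) is included in the sum
for k in range(K):
    Xk = np.tanh(Xk @ S[k] + b[k])             # complex activation, a stand-in
    out = out + Xk

# 4) inverse 2D DFT back to the time domain
X_psi = np.fft.ifft2(out, axes=(0, 1)).real    # (N, T, d)

# 5) two-layer FFN head mapping T*d features to tau predictions per variable
W1 = rng.standard_normal((T * d, 32))
W2 = rng.standard_normal((32, tau))
pred = np.maximum(X_psi.reshape(N, T * d) @ W1, 0) @ W2
print(pred.shape)                              # (4, 3)
```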

3.2. FOURIER GRAPH SHIFT OPERATOR

According to discrete signal processing on graphs Sandryhaila & Moura (2013), a graph shift operator (GSO) is a general family of operators that enables the diffusion of information over graph structures Gama et al. (2020); Dasoulas et al. (2021).

Definition 1 (Graph Shift Operator). Given a graph G with n nodes, a matrix S ∈ R n×n is called a Graph Shift Operator (GSO) if it satisfies S_ij = 0 whenever i ≠ j and nodes i, j are not connected.

The graph shift operator includes the adjacency and Laplacian matrices and their normalisations as instances of its class, and represents the connection structure of the graph. Accordingly, given the graph G attributed to X ∈ R n×d (corresponding to the supra-graph with n = N * T), a general form of spatial-based graph convolution is defined as

O(X) := SXW (3)

with the parameter matrix W ∈ R d×d. Regarding S as n × n scores, we can define a matrix-valued kernel κ : [n] × [n] → R d×d with κ[i, j] = S_ij · W, where [n] = {1, 2, ..., n}. Then the graph convolution can be viewed as a kernel summation.

O(X)[i] = Σ_{j=1}^{n} X[j] κ[i, j], ∀ i ∈ [n]. (4)

In the special case of the Green's kernel κ[i, j] = κ[i − j], we can rewrite the kernel summation as

O(X)[i] = Σ_{j=1}^{n} X[j] κ[i − j] = (X * κ)[i], ∀ i ∈ [n], (5)

where X * κ denotes the convolution of the discrete sequences X and κ. According to the convolution theorem Katznelson (1970) (see Appendix B), the graph convolution can be rewritten as

O(X)[i] = F⁻¹(F(X)F(κ))[i], ∀ i ∈ [n], (6)

where F and F⁻¹ denote the discrete Fourier transform (DFT) and its inverse (IDFT), respectively. The multiplication in the Fourier space has much lower complexity than the graph convolution, and the DFT can be efficiently implemented by the fast Fourier transform (FFT).

Definition 2 (Fourier Graph Shift Operator). Given a graph G = (X, S) with input X ∈ R n×d, GSO S ∈ R n×n and weight matrix W ∈ R d×d, the graph convolution is formulated as

F(SXW) = F(X) ×n F(κ), (7)

where F denotes the DFT, κ satisfies κ[i, j] = κ[i − j], and ×n denotes matrix multiplication over all dimensions except that of n. We define S := F(κ) ∈ C n×d×d as a Fourier graph shift operator (FGSO).

In particular, turning to our case of the fully-connected supra-graph G with an all-one GSO S ∈ {1} n×n, this yields the space-invariant kernel κ[i, j] = S_ij · W = W and F(SXW) = F(X)F(κ). Accordingly, we can parameterize the FGSO S with a complex-valued matrix in C d×d, which is frequency-invariant and computationally cheap compared to a varying kernel that results in a parameter matrix in C n×d×d. Furthermore, we can extend Definition 2 to the 2D discrete space, i.e., from [n] to [N] × [T], corresponding to the finite discrete spatial-temporal space of multivariate time series. See Appendix C for more explanations of the frequency-invariant FGSO and the extension to the 2D domain.

Remarks. Compared with the GSO and FNO, the frequency-invariant FGSO has several advantages. Assume a graph with n nodes and embedding dimension d (d ≤ n).
1) Efficiency: the time complexity of the frequency-invariant FGSO is O(nd log n + nd²) for the DFT, IDFT and matrix multiplication, compared with O(n²d + nd²) for a GSO. 2) Scale-free parameters: the frequency-invariant FGSO strategically shares O(d²) parameters across all nodes, and the parameter volume is agnostic to the data scale, while the parameter count of FNO is O(nd²).
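The efficiency gain hinges on the convolution theorem: for a Green's-kernel (circulant) GSO, the graph convolution SXW can be computed by a per-frequency multiplication. A minimal numpy sketch of this equivalence, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3
s = rng.standard_normal(n)                     # Green's kernel kappa[i - j]
# circulant GSO: S[i, j] = s[(i - j) mod n]
S = np.array([[s[(i - j) % n] for j in range(n)] for i in range(n)])
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))

direct = S @ X @ W                             # O(n^2 d + n d^2) graph convolution

# FGSO route: per-frequency multiplication along the node axis
freq = np.fft.fft(X @ W, axis=0) * np.fft.fft(s)[:, None]
fourier = np.fft.ifft(freq, axis=0).real       # O(n d log n + n d^2)

print(np.allclose(direct, fourier))            # True
```

The direct product touches all n² entries of S, while the Fourier route never materializes S at all, which is what makes the fully-connected supra-graph tractable.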

3.3. EDGE-VARYING FOURIER GRAPH NETWORKS

Graph filters, as core operations in graph signal processing, are linear transformations expressed as polynomials of the graph shift operator Isufi et al. (2021); Mateos et al. (2019), and can be used to exactly model graph convolutions and capture multi-order diffusion over graph structures Segarra et al. (2017). To capture time-varying dependencies and weight the information of different neighbors differently in each diffusion order, the edge-variant graph filter is defined as follows Isufi et al. (2021); Segarra et al. (2017): given a GSO S ∈ R n×n corresponding to a graph with n nodes,

H_EV = S_0 + S_1 S_0 + ... + S_K S_{K−1} ⋯ S_0 = Σ_{k=0}^{K} S_{k:0}, (8)

where S_0 denotes the identity matrix, {S_k}_{k=1}^{K} is a collection of K edge-weighting GSOs sharing the sparsity pattern of S, and S_k ∈ R n×n corresponds to the k-th diffusion step. The edge-variant graph filter is proven effective in yielding a highly discriminative model and lays the foundation for the unification of GCNs and GATs Isufi et al. (2021). We can extend Equation 7 to H_EV and reformulate the multi-order graph convolution in a recursive form, where we omit the weight W for conciseness.

Proposition 1. Given a graph input X ∈ R n×d, the K-order graph convolution under the edge-variant graph filter H_EV = Σ_{k=0}^{K} S_{k:0} with {S_k ∈ R n×n}_{k=0}^{K} can be reformulated as

H_EV X = F⁻¹( Σ_{k=0}^{K} F(X) ×n S_{0:k} ) s.t. S_{0:k} = S_0 ×n ⋯ ×n S_{k−1} ×n S_k, (9)

where S_k ∈ C n×d×d is the k-th FGSO satisfying F(S_k X) = F(X) ×n S_k, S_0 is the identity matrix, and F and F⁻¹ denote the discrete Fourier transform and its inverse, respectively.

Proposition 1, proved in Appendix D.1, states that we can rewrite the multi-order graph convolution corresponding to H_EV as a summation over a recursive multiplication of individual FGSOs in the Fourier space. Corresponding to our case of the supra-graph G, we can similarly adopt K frequency-invariant FGSOs and parameterize each FGSO S_k with a complex-valued matrix in C d×d.
This saves a large amount of computation and results in the concise form

H_EV X = F⁻¹( Σ_{k=0}^{K} F(X) S_{0:k} ) s.t. S_{0:k} = Π_{i=0}^{k} S_i. (10)

The recursive composition has the nice property that S_{0:k} = S_{0:k−1} S_k, which inspires us to design a complex-valued feed-forward network with S_k as the complex weights of the k-th layer. However, both the FGSO and the edge-variant graph filter are linear transformations, limiting the capability of modeling non-linear information diffusion on graphs. Following the convention in GNNs, we introduce non-linear activations and biases and reformulate the k-th layer as

X_k = σ(X_{k−1} S_k + b_k), (11)

with the complex weight S_k ∈ C d×d, bias b_k ∈ C d and activation function σ. Accordingly, we design an edge-varying Fourier graph network (EV-FGN) in the Fourier space according to Equations 9 and 11, as shown in Fig. 1. The K-layer EV-FGN Ψ_K is formulated as

Ψ_K(X) = Σ_{k=0}^{K} X_k s.t. X_k = σ(X_{k−1} S_k + b_k), (12)

where X_0 := F(X), and {S_k ∈ C d×d}_{k=1}^{K} and {b_k ∈ C d}_{k=1}^{K} are complex-valued parameters. The frequency output Ψ_K(X) is then transformed via IDFT and fed to feed-forward networks (see Appendix F.4 for more details) to make multi-step forecasts in the time domain.

Remarks. EV-FGN efficiently learns edge- and neighbor-dependent weights to compute multi-order graph convolutions in a single pass in the Fourier space with a scale-free parameter volume. For K iterations of graph convolutions, GCNs have a general time complexity of O(Kn²d + Knd²), FNO needs O(Knd log n + Knd²), and EV-FGN achieves a time complexity of O(nd log n + Knd² + Knd). EV-FGN saves the cost of repeated Fourier transforms relative to FNO and is much more efficient than GCNs, especially in our case where n = NT for a given raw input X ∈ R N×T. See Appendix E for more details on the relations and differences between EV-FGN and FNO, adaptive FNO, and GNNs.
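As a toy check of Proposition 1, the sketch below compares the edge-variant filter applied in the time domain with the recursive Fourier-domain computation. Circulant GSOs with d = 1 are our simplifying assumptions, chosen so that each FGSO reduces to a per-frequency multiplier:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 8, 3
circ = lambda s: np.array([[s[(i - j) % n] for j in range(n)] for i in range(n)])

# S_0 = I plus K circulant (hence Fourier-diagonalizable) edge-weighting GSOs
kernels = [rng.standard_normal(n) for _ in range(K)]
S = [np.eye(n)] + [circ(s) for s in kernels]
x = rng.standard_normal(n)

# time domain: H_EV x = (S_0 + S_1 S_0 + ... + S_K ... S_0) x
direct, prod = np.zeros(n), np.eye(n)
for k in range(K + 1):
    prod = S[k] @ prod
    direct = direct + prod @ x

# Fourier domain: sum of cumulative products of the per-frequency multipliers
xf = np.fft.fft(x)
symbols = [np.ones(n)] + [np.fft.fft(s) for s in kernels]
mult = np.ones(n, dtype=complex)               # Fourier symbol of S_0 = I
acc = np.zeros(n, dtype=complex)
for k in range(K + 1):
    mult = mult * symbols[k]                   # S_{0:k} = S_{0:k-1} S_k
    acc = acc + xf * mult
fourier = np.fft.ifft(acc).real

print(np.allclose(direct, fourier))            # True
```

The cumulative-product recursion in the loop is exactly what the stacked complex layers of EV-FGN realize, with learned matrices in place of the fixed multipliers here.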

4. EXPERIMENTS

4.1. SETUP

Datasets. We select seven representative datasets from different application scenarios for evaluation, covering traffic, energy, web traffic, electrocardiogram, and COVID-19. These datasets are summarized in Table 1. All datasets are normalized using min-max normalization. Except for the COVID-19 dataset, we split the datasets into training, validation, and test sets with a ratio of 7:2:1 in chronological order. For the COVID-19 dataset, the ratio is 6:2:2.

Baselines. We compare the forecasting performance of our EV-FGN with other representative and SOTA models on the seven datasets, including VAR Watson (1993), SFM Zhang et al. (2017), LSTNet Lai et al. (2018), TCN Bai et al. (2018), GraphWaveNet Wu et al. (2019), DeepGLO Sen et al. (2019), StemGNN Cao et al. (2020), MTGNN Wu et al. (2020) and AGCRN Bai et al. (2020). More details about the datasets, baselines and experimental settings can be found in Appendix F.
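The stated preprocessing (min-max normalization plus a chronological 7:2:1 split) can be sketched as follows; fitting the scaling statistics on the training portion only is our assumption, as the paper does not specify it:

```python
import numpy as np

def minmax_split(X, ratios=(0.7, 0.2, 0.1)):
    """Chronological train/val/test split with min-max scaling per variable.
    A sketch of the stated protocol; exact per-dataset handling may differ."""
    N, L = X.shape
    n_train = int(L * ratios[0])
    n_val = int(L * ratios[1])
    train = X[:, :n_train]
    val = X[:, n_train:n_train + n_val]
    test = X[:, n_train + n_val:]
    lo = train.min(axis=1, keepdims=True)      # statistics from train only
    hi = train.max(axis=1, keepdims=True)
    scale = lambda A: (A - lo) / (hi - lo + 1e-8)
    return scale(train), scale(val), scale(test)

X = np.arange(50, dtype=float).reshape(2, 25)  # 2 variables, 25 timestamps
tr, va, te = minmax_split(X)
print(tr.shape, va.shape, te.shape)            # (2, 17) (2, 5) (2, 3)
```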

4.2. RESULTS

We summarize the evaluation results in Table 2; more results can be found in Appendix G. Overall, our model EV-FGN establishes a new state-of-the-art on all datasets. On average, EV-FGN improves MAE by 8.8% and RMSE by 10.5% over the best baseline across all datasets. Among the baselines, Reformer, Informer, Autoformer and FEDformer are transformer-based models that achieve competitive performance on the Electricity and COVID-19 datasets, since they are competent at capturing temporal dependencies; however, they do not capture spatial dependencies explicitly. GraphWaveNet, MTGNN, StemGNN and AGCRN are GNN-based models that show promising performance on the Wiki, Traffic, Solar and ECG datasets due to their capability of handling spatial dependencies among variables; however, they are limited in simultaneously capturing spatial-temporal dependencies. EV-FGN significantly outperforms the baseline models since it learns comprehensive spatial-temporal dependencies simultaneously and attends to time-varying dependencies among variables. In Appendix G, we further report results on four datasets and compare EV-FGN with models requiring pre-defined graph structures.

We report the parameter volumes and the average time costs over five rounds of experiments in Table 3. In terms of parameter volume, EV-FGN requires the fewest parameters among the compared models. Specifically, it achieves 32.2% and 9.5% parameter reductions over GraphWaveNet on the Traffic and Wiki datasets, respectively. This is largely attributable to EV-FGN sharing scale-free parameters across nodes. Regarding training time, EV-FGN runs much faster than all baseline models, showing 5.8% and 23.3% efficiency improvements over the fastest baseline, GraphWaveNet, on the Traffic and Wiki datasets, respectively.
Besides, the variable count of the Wiki dataset is about twice that of the Traffic dataset, yet EV-FGN shows even larger efficiency gaps over the baselines. These results demonstrate that EV-FGN computes graph convolutions with high efficiency and is scalable to large datasets with large graphs. Notably, the supra-graph in EV-FGN has N × T nodes, which is much larger than the graphs (with N nodes) in the baselines.

Ablation study. We conduct an ablation study on the METR-LA dataset to evaluate the contribution of the different components of our model. The results shown in Table 4 verify the effectiveness of each component.

Visualization of spatial representations learned by EV-FGN. We visualize the adjacency matrix generated from the learned representations of EV-FGN on the METR-LA dataset. Specifically, we randomly select 20 detectors and visualize their corresponding adjacency matrix via a heat map, as shown in Fig. 2. Correlating the adjacency matrix with the real road map, we observe: 1) detectors 7, 8, 9, 11, 13, 18, and 19 are very close in physical distance, corresponding to the high pairwise correlation values in the heat map; 2) detectors 4, 14 and 16 have small overall correlation values since they are far from the other detectors; 3) however, compared with detectors 14 and 16, detector 4 has slightly higher correlation values with other detectors, e.g., 7, 8 and 9, which can be attributed to the fact that, although far apart, detectors 4, 7, 8 and 9 lie on the same road. These results verify the effectiveness of EV-FGN in learning highly interpretable correlations.

Visualization over four consecutive days. Furthermore, we visualize the adjacency matrices of 10 randomly-selected counties over four consecutive days on the COVID-19 dataset. The visualization results via heat maps are shown in Fig. 3.
From the figure, we observe that EV-FGN learns clear spatial patterns that evolve continuously along the time dimension. This is because EV-FGN highlights the edge-varying design and attends to the time-varying variability of the supra-graph. These results verify that our model is capable of exploiting the time-varying dependencies among variables.

5. CONCLUSION

In this paper, we define a Fourier graph shift operator (FGSO) and construct the efficient edge-varying Fourier graph networks (EV-FGN) for MTS forecasting. EV-FGN simultaneously captures high-resolution spatial-temporal dependencies and accounts for time-varying variable dependencies. This study makes the first attempt to design a complex-valued feed-forward network in the Fourier space for efficiently computing multi-layer graph convolutions. Extensive experiments demonstrate that EV-FGN achieves state-of-the-art performance with higher efficiency and fewer parameters, and shows high interpretability in graph representation learning. This study sheds light on efficiently computing graph operations in the Fourier space by learning a Fourier graph shift operator.

A NOTATION

B CONVOLUTION THEOREM

The convolution theorem Katznelson (1970) is one of the most important properties of the Fourier transform. It states that the Fourier transform of a convolution of two signals equals the pointwise product of their Fourier transforms in the frequency domain. Given a signal x[n] and a filter h[n], the convolution theorem can be stated as

F((x * h)[n]) = F(x)F(h), (13)

where (x * h)[n] = Σ_{m=0}^{N−1} h[m] x[(n − m)_N] denotes the convolution of x and h, (n − m)_N denotes (n − m) modulo N, and F(x) and F(h) denote the discrete Fourier transforms of x[n] and h[n], respectively.
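A quick numerical check of the theorem with numpy, using the circular convolution defined above:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 16
x = rng.standard_normal(N)
h = rng.standard_normal(N)

# circular convolution in the time domain: (x*h)[n] = sum_m h[m] x[(n-m) mod N]
conv = np.array([sum(h[m] * x[(n - m) % N] for m in range(N)) for n in range(N)])

# convolution theorem: same result as the IDFT of the pointwise spectral product
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

print(np.allclose(conv, via_fft))              # True
```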

C EXPLANATIONS C.1 THE INTERPRETATION OF FREQUENCY-INVARIANT FGSO

In addition to reducing parameter volume and saving computation costs, the frequency-invariant parameterized FGSO is empirically shown to improve model generalization. As mentioned above, we apply EV-FGN to the input embeddings and adopt frequency-invariant FGSOs. Relative to directly adopting frequency-variant FGSOs (C n×d×d), we subtly "factorize" the frequency-variant parameterization into the time domain (i.e., the embedding matrix Φ ∈ R n×d) and the frequency domain (i.e., the frequency-invariant FGSO in C d×d). In the time domain, we embed the raw MTS inputs to improve the model's learning capability, while in the frequency domain we learn the same transformation (FGSO) for all N * T frequency points (similar to a CNN with shared convolution kernels that slide along the input features). Note that each point of the frequency spectrum has a global view, attending to all variables and timestamps. This treatment guarantees the model capacity of EV-FGN and is empirically superior to the variants without embeddings or with frequency-variant FGSOs (please refer to the Ablation study for detailed results). Although the frequency-variant parameterization may be more powerful and flexible than the frequency-invariant one, it introduces more parameters in the frequency domain, especially for multi-layer EV-FGN, and may not obtain superior performance due to inadequate training or overfitting.

C.2 THE EXTENSION TO THE 2D DOMAIN

Eq. 4 (kernel summation): O(X)[i] = Σ_{j=1}^{n} X[j] κ[i, j], ∀ i ∈ [n]
Eq. 5 (convolution): O(X)[i] = Σ_{j=1}^{n} X[j] κ[i − j] = (X * κ)[i], ∀ i ∈ [n]

When we extend these equations to the 2D domain, i.e., from [n] to [N] × [T], we perform the kernel summation/graph convolution over the discrete spatial-temporal space corresponding to all nodes in the supra-graph. These kernel summations extend straightforwardly to the 2D domain. Similarly, according to the convolution theorem, we obtain a 2D version of Eq. 6:

Eq. 6 (graph convolution): O(X)[i] = F⁻¹(F(X)F(κ))[i], ∀ i ∈ [N] × [T],

where F and F⁻¹ denote the 2D discrete Fourier transform and its inverse, respectively. Accordingly, when extending Definition 2 to the 2D domain, F denotes the 2D discrete Fourier transform. Therefore, given input embeddings X ∈ R N×T×d, we perform the 2D DFT on each discrete N × T spatial-temporal plane of the embeddings to obtain the frequency spectrum, and then feed the frequency input into the K-layer EV-FGN, followed by two-layer (real-valued) feed-forward networks, to generate multi-step forecasts (see Section 3.1 for more details). Note that we adopt the frequency-invariant FGSO (d × d) in EV-FGN, where the feed-forward computations act on the embedding dimension d. In addition, when extending Definition 2 to the 2D domain, the frequency-invariant FGSO is invariant to the frequency components derived from both the spatial dimension (N) and the time dimension (T).

C.3 INTERPRETATION OF OMITTING THE WEIGHT MATRIX W IN PROPOSITION 1

This treatment of omitting the weight matrix W in Eq. 9 is feasible since W can be absorbed into the embedding parameters, precisely the embedding matrix. Note that we feed the input embeddings into EV-FGN (refer to Fig. 1 and Section 3.1). Moreover, this treatment does not intuitively reduce the capability of EV-FGN, since we adopt edge-varying filters. Note that traditional GCNs adopt different weight matrices (which provide the model capacity) but the same GSO (e.g., the adjacency or Laplacian matrix of the given graph) across diffusion orders. In contrast, there is no pre-given graph structure in MTS forecasting scenarios, so we adopt edge-varying filters, i.e., varying GSOs, in EV-FGN, which preserves the model capacity and achieves desirable performance empirically.

D PROOFS D.1 PROOF OF PROPOSITION 1

The proof expands the graph convolution corresponding to H_EV using a set of FGSOs in the Fourier space. According to the concise form of Equation 7, i.e.,

F(SX) = F(X) ×n S, (14)

where F denotes the discrete Fourier transform (DFT) and we omit the weight matrix W for convenience (we can treat W as an identity matrix), it follows that

F(S_K S_{K−1} ⋯ S_0 X) = F(S_K (S_{K−1} ⋯ S_0 X)) = F(S_{K−1} ⋯ S_0 X) ×n S_K = F(X) ×n S_0 ×n ⋯ ×n S_{K−1} ×n S_K, (15)

where {S_i}_{i=0}^{K} is a set of GSOs and {S_i}_{i=0}^{K} is the set of FGSOs corresponding to {S_i}_{i=0}^{K} individually. Thus, the edge-varying graph filter H_EV can be rewritten as

F(H_EV X) = F(S_0 X + S_1 S_0 X + ... + S_K S_{K−1} ⋯ S_0 X)
= F(S_0 X) + F(S_1 S_0 X) + ... + F(S_K S_{K−1} ⋯ S_0 X)
= F(X) ×n S_0 + F(X) ×n S_0 ×n S_1 + ... + F(X) ×n S_0 ×n ⋯ ×n S_{K−1} ×n S_K. (16)

Accordingly, we have

H_EV X = F⁻¹( Σ_{k=0}^{K} F(X) ×n S_{0:k} ) s.t. S_{0:k} = S_0 ×n ⋯ ×n S_{k−1} ×n S_k. (17)

This completes the proof.

E COMPARED WITH OTHER NETWORKS

E.1 GNNS

A K-layer EV-FGN infers the graph structure in the Fourier space with a time complexity proportional to n log n, where n is the number of nodes. In addition, compared with GATs, which implicitly achieve edge-varying weights across different layers, EV-FGN adopts different FGSOs in different diffusion steps explicitly.

E.2 FNO

Inspired by the Fourier neural operator (FNO) Li et al. (2021), which computes global convolutions in the Fourier space, we elaborately design EV-FGN to compute multi-order graph convolutions in the Fourier space. Both FNO and EV-FGN replace the time-consuming graph/global convolutions in the time domain with efficient spectral convolutions in the Fourier space according to the convolution theorem. However, FNO and EV-FGN differ considerably in network architecture. As shown in Fig. 5, FNO consists of a stack of Fourier layers, where each Fourier layer serially performs 1) DFT to obtain the spectrum of the input, 2) the spectral convolution in the Fourier space, and 3) IDFT to transform the output back to the time domain. Accordingly, FNO needs K pairs of DFT and IDFT for K Fourier layers. In contrast, EV-FGN, with just one pair of DFT and IDFT, performs multi-order (multi-layer) graph convolutions in one pass by stacking multiple FGSOs in the Fourier space, as shown in Fig. 1. In addition, the adaptive Fourier neural operator (AFNO) Guibas et al. (2022) is a variant of FNO: a neural unit used in a plug-and-play fashion as an alternative to self-attention to reduce the quadratic complexity. Similarly, it requires a pair of DFT and IDFT per self-attention computation. In contrast, our proposed EV-FGN is a neural network with a set of FGSOs in a well-designed connection.

F EXPERIMENT DETAILS

Specifically, given the ground truth $X_{t+1:t+\tau} \in \mathbb{R}^{N \times \tau}$ and the predictions $\hat{X}_{t+1:t+\tau} \in \mathbb{R}^{N \times \tau}$ for the future $\tau$ steps at timestamp $t$, the metrics are defined as follows:
$$\mathrm{MAE} = \frac{1}{\tau N}\sum_{i=1}^{N}\sum_{j=1}^{\tau} \left|x_{ij} - \hat{x}_{ij}\right| \quad (18)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{\tau N}\sum_{i=1}^{N}\sum_{j=1}^{\tau} \left(x_{ij} - \hat{x}_{ij}\right)^2} \quad (19)$$
$$\mathrm{MAPE} = \frac{1}{\tau N}\sum_{i=1}^{N}\sum_{j=1}^{\tau} \left|\frac{x_{ij} - \hat{x}_{ij}}{x_{ij}}\right| \times 100\%$$
with $x_{ij} \in X_{t+1:t+\tau}$ and $\hat{x}_{ij} \in \hat{X}_{t+1:t+\tau}$. We further compare the sub adjacency matrix with the real road map (generated by the Google Maps tool) to verify the learned dependencies between different detectors.

Figure 4: On the COVID-19 dataset, we randomly choose 10 counties out of N = 55 counties and obtain their four sub adjacency matrices ($\mathbb{R}^{10 \times 10}$) for four consecutive days for visualization. Each sub adjacency matrix embodies the dependencies between counties in one day. Figure 4 reflects the time-varying dependencies between counties (i.e., variables).

Figure 5: Since we adopt a 3-layer EV-FGN, we can calculate four adjacency matrices based on the input $\mathcal{X}$ of EV-FGN and the outputs of each layer, i.e., $\mathcal{X}^1, \mathcal{X}^2, \mathcal{X}^3$. Following the visualization scheme of Figure 4, we select 10 counties and two timestamps on the four adjacency matrices for visualization. Figure 5 shows the effect of each layer of EV-FGN in filtering or enhancing variable correlations.

Figure 10: We select 8 counties and visualize the correlations between 12 consecutive time steps for each selected county. Figure 10 reflects the temporal correlations within each variable.

Table 7: From the table, we can find that EV-FGN achieves the best MAE and RMSE for all prediction lengths. On average, EV-FGN improves MAE and RMSE by 30.0% and 27.9% respectively over the best baseline, i.e., TAMP-S2GCNets. Among these models, TAMP-S2GCNets, which requires a pre-defined graph topology, achieves competitive performance since it enhances the resultant graph learning mechanisms with multi-persistence. However, it constructs the graph in the spatial dimension only, while our model EV-FGN adaptively learns a supra-graph connecting any two variables at any two timestamps, which is more effective and more powerful for capturing high-resolution spatial-temporal dependencies.

Table 10: The table shows that EV-FGN achieves increasingly better performance from K = 1 to K = 3 and obtains the best results at K = 3; with a further increase of K, performance degrades. The results indicate that high-order diffusion information is beneficial for forecasting accuracy, but as the order increases its effect gradually weakens and it may even introduce noise into the forecast.
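For concreteness, the three evaluation metrics above can be implemented in a few lines of NumPy. This is a generic sketch; the small `eps` guard in MAPE against division by zero is our addition, not part of the paper's definition.

```python
import numpy as np

def mae(y, y_hat):
    # Mean absolute error over all N variables and tau horizon steps.
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # Root mean squared error, square root taken after averaging.
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat, eps=1e-8):
    # Mean absolute percentage error; eps avoids division by zero.
    return np.mean(np.abs((y - y_hat) / (y + eps))) * 100.0

y = np.array([[1.0, 2.0], [4.0, 8.0]])      # ground truth, shape (N, tau)
y_hat = np.array([[1.5, 2.0], [3.0, 8.0]])  # predictions
print(mae(y, y_hat))   # 0.375
```

Averaging over both the variable and horizon dimensions at once (rather than per-series) matches the $\frac{1}{\tau N}$ normalization in Eqs. 18 and 19.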

G MORE RESULTS

In addition, we conduct extensive experiments on the ECG dataset to analyze the effect of the input length and the embedding dimension d, as shown in Fig. 7 and Fig. 8 respectively. Fig. 7 shows that the performance (RMSE and MAPE) of EV-FGN improves as the input length increases, indicating that EV-FGN can learn a comprehensive supra-graph from long MTS inputs to capture the spatial and temporal dependencies. Moreover, Fig. 8 shows that the performance first improves and then degrades as the embedding size grows, which is attributed to the fact that a larger embedding size improves the fitting ability of EV-FGN but can lead to overfitting when the embedding size becomes too large.

Ablation Study. We provide more details about each variant used in this section and Section 4.3.
• w/o Embedding. A variant of EV-FGN that feeds the raw MTS input instead of its embeddings into the graph convolution in the Fourier space.
• w/o Residual. A variant of EV-FGN without the K = 0 layer output, i.e., $\mathcal{X}$, in the summation.
• w/o Summation. A variant of EV-FGN that adopts the last order (layer) output $\mathcal{X}^K$ as the final frequency output of EV-FGN.

We conduct another ablation study on the COVID-19 dataset to further investigate the effects of the different components of EV-FGN. The results are shown in Table 11, which confirms the results in Table 4 and further verifies the effectiveness of each component. Both Table 11 and Table 4 report that the embedding and the dynamic filter contribute more to the state-of-the-art performance of EV-FGN than the residual and summation designs.

Ordering of the time series. Note that the supra-graph connecting all nodes is a fully-connected graph, indicating that any spatial order is feasible. How, then, can we perform the 2D DFT to achieve the graph convolution? First, the graph convolution on the supra-graph can be viewed as a kernel summation (cf. Eq. 4, i.e., $\mathcal{O}(X)[i] = \sum_{j=1}^{n} X[j]\,\kappa[i,j],\ \forall i \in [n]$) and does not depend on a specific spatial order. From Eq. 4 to Eq. 5, we extend the kernel summation to a kernel integral in the continuous spatial-temporal space and introduce a special kernel, the shift-invariant Green's kernel $\kappa(i,j) = \kappa(i-j)$. According to the convolution theorem, we can reformulate the graph convolution with the continuous Fourier transform (i.e., Eq. 6, $\mathcal{O}(X)(i) = \mathcal{F}^{-1}\big(\mathcal{F}(X)\mathcal{F}(\kappa)\big)(i),\ \forall i \in D$) and then apply Eq. 6 to the finite discrete spatial-temporal space (i.e., the supra-graph). Accordingly, we obtain Definition 2 and Eq. 7, reformulating the graph convolution with the 2D DFT. To understand the learning paradigm of EV-FGN, recall the self-attention mechanism as an analogy: it makes no assumption on datapoint order and introduces extra position embeddings to capture temporal patterns. EV-FGN performs the 2D DFT on each discrete N × T spatial-temporal plane of the embeddings $\mathcal{X}$ instead of the raw data X. A key insight underpinning EV-FGN is to introduce variable embeddings and temporal position embeddings that equip EV-FGN with a sense of variable correlations (spatial and temporal patterns). To verify this claim, we conducted experiments on the ECG dataset: we randomly shuffled the order of the time-series variables of the raw data five times and evaluated EV-FGN on each shuffled dataset. The results are reported in the table below. Note that we can also conduct shuffling experiments on the temporal order by applying the same shuffling scheme to each window-sized time-series input, which leads to the same conclusion. To demonstrate the ability of EV-FGN to jointly learn spatial-temporal dependencies, we visualize the temporal adjacency matrices of different variables; the spatial adjacency matrices of different days are reported in Fig. 3. Specifically, we randomly select 8 counties from the COVID-19 dataset and calculate the correlations of 12 consecutive time steps for each county. We then visualize the adjacency matrix via a heat map, with the results shown in Fig. 10, where N denotes the index of the county (variable). From the figure, we observe that EV-FGN learns clear and specific temporal patterns for each county. These results show that EV-FGN can not only learn highly interpretable spatial correlations (see Fig. 2 and Fig. 3) but also capture discriminative temporal patterns.
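The order-independence argument can be illustrated with a small NumPy check: relabeling the nodes in the kernel summation of Eq. 4 simply permutes the output rows, so no spatial order is privileged. The code below is our illustrative sketch, not part of the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.standard_normal((n, d))       # node features
kappa = rng.standard_normal((n, n))   # kernel kappa[i, j]

def kernel_sum(X, kappa):
    # O(X)[i] = sum_j kappa[i, j] X[j]  (Eq. 4 style kernel summation)
    return kappa @ X

# Relabel the nodes with a random permutation.
perm = rng.permutation(n)
X_p = X[perm]                         # permuted features
kappa_p = kappa[np.ix_(perm, perm)]   # consistently permuted kernel

# The output is permuted the same way: the summation is order-equivariant.
assert np.allclose(kernel_sum(X_p, kappa_p), kernel_sum(X, kappa)[perm])
```

This mirrors the shuffling experiment in the text: a consistent relabeling of variables changes nothing about the computation itself, which is why position and variable embeddings are needed to inject order information.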



VISUALIZATION

To better investigate the spatial-temporal representations learned by EV-FGN, we conduct more visualization experiments on the METR-LA and COVID-19 datasets. More details about the visualization procedure can be found in Appendix F.5.

Figure 2: The adjacency matrix (right) learned by EV-FGN and the corresponding road map (left).

Figure 3: The adjacency matrix for four consecutive days on the COVID-19 dataset.

Visualization of the EV-FGN diffusion process. To understand how EV-FGN works, we analyze the frequency input of each layer. We choose 10 counties from the COVID-19 dataset and visualize their adjacency matrices at two different timestamps, as shown in Fig. 4. From left to right, the results correspond to $\mathcal{X}^0, \cdots, \mathcal{X}^3$ respectively. In the top case, we find that as the number of layers increases, some correlation values are reduced, indicating that some correlations are filtered out. In contrast, the bottom case illustrates that some correlations are enhanced as the number of layers increases. These results show that EV-FGN can adaptively and effectively capture important patterns while removing noise to learn a discriminative model. More visualizations are provided in Appendix H.2.

Figure 4: The diffusion process of EV-FGN at two timestamps (top and bottom) on COVID-19.

EXPLANATION OF THE EXTENSION OF DEFINITION 2 TO THE 2D DOMAIN

Recall Eqs. 4 and 5. Eq. 4 (kernel summation): $\mathcal{O}(X)[i] = \sum_{j=1}^{n} X[j]\,\kappa[i,j]$.

Figure 5: The simplified structure of FNO, derived from Li et al. (2021). $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse respectively.

We compare our model EV-FGN with eight neural MTS models (STGCN, DCRNN, StemGNN, AGCRN, GraphWaveNet, MTGNN, Informer and CoST) on the METR-LA dataset, and the results are shown in Table 9. On average, we improve MAE by 5.7% and RMSE by 2.5%. Among these models, StemGNN achieves competitive performance because it combines GFT to capture the spatial dependencies and DFT to capture the temporal dependencies. However, it is still limited in capturing spatial-temporal dependencies simultaneously. CoST learns disentangled seasonal-trend representations for time series forecasting via contrastive learning and also obtains competitive results, but our model still outperforms it: compared with CoST, our model not only learns dynamic temporal representations but also captures discriminative spatial representations. Besides, STGCN and DCRNN require pre-defined graph structures, yet StemGNN and our model outperform them at all steps, and AGCRN outperforms them at prediction lengths 9 and 12. This also shows that adaptive graph learning can precisely capture the hidden spatial dependencies. In addition, we compare EV-FGN with the baseline models under different prediction lengths on the ECG dataset, as shown in Fig. 6, which reports that EV-FGN achieves the best MAE, RMSE and MAPE for all prediction lengths.

Figure 6: Performance comparison in different prediction lengths on the ECG dataset.

Figure 7: Influence of input length.

Figure 8: Influence of embedding size.


Figure 9: The sensitivity between MAE, RMSE, MAPE and number of nodes on the Wiki dataset.

Figure 10: The temporal adjacency matrix of eight variables on COVID-19 dataset.

Summary of datasets.

All experiments are implemented in Python with PyTorch 1.8 (SFM in Keras) and conducted on one NVIDIA RTX 3080 card. Our model is trained using RMSProp with a learning rate of 0.00001 and MSE (mean squared error) as the loss function. The best parameters for all comparative models are chosen through careful parameter tuning on the validation set. We use MAE (mean absolute error), RMSE (root mean squared error) and MAPE (mean absolute percentage error) to measure performance.
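A minimal training-loop sketch matching this setup (RMSProp, learning rate 1e-5, MSE loss) might look as follows. The model, batch shapes and step count are placeholders for illustration, not the actual EV-FGN pipeline.

```python
import torch
import torch.nn as nn

# Stand-in model; in the paper this would be EV-FGN.
model = nn.Linear(12, 3)

# RMSProp with the reported learning rate, MSE loss.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

x = torch.randn(32, 12)   # a batch of window-sized inputs
y = torch.randn(32, 3)    # the corresponding future targets

for _ in range(2):        # a couple of optimization steps for illustration
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```

Model selection in the paper is done on the validation set; the loop above only shows the optimizer and loss wiring.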

Forecasting results on the six datasets. We investigate the parameter volumes and training time costs of EV-FGN, StemGNN, AGCRN, GraphWaveNet and MTGNN on two representative datasets (Wiki and Traffic).

Results of parameter volumes and training time costs on the Traffic and Wiki datasets. w/o Embedding shows the significance of performing embedding to improve model generalization. w/o Dynamic Filter, which uses the same graph shift operator throughout, verifies the effectiveness of applying different graph shift operators for capturing time-varying dependencies. In addition, w/o Residual represents EV-FGN without the K = 0 layer, while w/o Summation adopts the last order (layer) output $\mathcal{X}^K$ as the output of EV-FGN. These results demonstrate the importance of high-order diffusion and the contribution of multi-order diffusion. More results and analysis of the ablation study are provided in Appendix H.1.

Ablation study on the METR-LA dataset.



Solar 1: This dataset contains solar power records collected by the National Renewable Energy Laboratory. We choose the power plant data points in Florida as the dataset, which contains 593 points. The data is collected from 2006/01/01 to 2016/12/31 with a sampling interval of 1 hour.
Wiki 2: This dataset contains daily views of different Wikipedia articles, collected from 2015/07/01 to 2016/12/31. It consists of approximately 145k time series, from which we randomly choose 2k as our experimental dataset.
Traffic 3: This dataset contains hourly traffic data from 963 San Francisco freeway car lanes, collected since 2015/01/01 with a sampling interval of 1 hour.
ECG 4: This dataset contains electrocardiogram (ECG) data from the UCR time-series classification archive Dau et al. (2019). It contains 140 nodes, each with a length of 5000.
Electricity 5: This dataset contains the electricity consumption of 370 clients, collected since 2011/01/01 with a sampling interval of 15 minutes.
COVID-19 6: This dataset covers COVID-19 hospitalization in the U.S. state of California (CA) from 01/02/2020 to 31/12/2020, provided by Johns Hopkins University, with a sampling interval of one day.

Performance comparison under different prediction lengths on the COVID-19 dataset.

Performance comparison under different prediction lengths on the Wiki dataset.

In addition, we compare our model EV-FGN with five neural MTS models (StemGNN, AGCRN, GraphWaveNet, MTGNN and Informer) on the Wiki dataset under different prediction lengths, and the results are shown in Table 8. From the table, we observe that EV-FGN outperforms the other models on the MAE, RMSE and MAPE metrics for all prediction lengths. On average, EV-FGN improves MAE, RMSE and MAPE by 6.8%, 3.2% and 22.9%, respectively. Among these models, AGCRN shows promising performance since it captures the spatial and temporal correlations adaptively. However, it fails to capture spatial-temporal dependencies simultaneously, limiting its forecasting performance. In contrast, our model learns a supra-graph to capture comprehensive spatial-temporal dependencies simultaneously for multivariate time series forecasting.

Performance comparison under different prediction lengths on the METR-LA dataset.

Performance at different diffusion steps on the COVID-19 dataset.

Ablation studies on the COVID-19 dataset.

• w/o Dynamic Filter. A variant of EV-FGN that uses the same FGSO for all diffusion steps instead of applying different FGSOs in different diffusion steps. It corresponds to a vanilla graph filter.

Five-round results on randomly shuffled data on the ECG dataset.

F.4 EXPERIMENTAL SETTINGS

We summarize the implementation details of the proposed EV-FGN as follows. Note that the details of the baselines are introduced in their corresponding descriptions (see Section F.2).

Network details. The fully connected feed-forward network (FFN) consists of three linear transformations with LeakyReLU activations in between. The FFN is formulated as
$$\mathrm{FFN}(X) = \mathrm{LeakyReLU}\big(\mathrm{LeakyReLU}(X W^1)\, W^2\big)\, W^3,$$
where $W^1 \in \mathbb{R}^{(ld) \times d^{ffn}_1}$, $W^2 \in \mathbb{R}^{d^{ffn}_1 \times d^{ffn}_2}$ and $W^3 \in \mathbb{R}^{d^{ffn}_2 \times \tau}$ are the weights of the three layers respectively.

Training details. We carefully tune the hyperparameters, including the embedding size and batch size, on the validation set and choose the settings with the best performance for EV-FGN on each dataset. Specifically, the embedding size and batch size are tuned over {32, 64, 128, 256, 512} and {2, 4, 8, 16, 32, 64, 128} respectively. For the COVID-19 dataset, the embedding size is 256 and the batch size is 4. For the Traffic, Solar and Wiki datasets, the embedding size is 128 and the batch size is 2. For the METR-LA, ECG and Electricity datasets, the embedding size is 128 and the batch size is 32. Note that the supra-graph connecting all nodes is a fully-connected graph, indicating that any spatial order is feasible. Therefore, although we perform the 2D DFT in EV-FGN, no particular spatial order is necessary, and we directly adopt the raw datasets for the experiments. To reduce the number of parameters, we adopt a linear transform to map the original time-domain representation $\mathcal{X}_\Psi \in \mathbb{R}^{N \times T \times d}$ to a low-dimensional tensor $\mathcal{X}'_\Psi \in \mathbb{R}^{N \times l \times d}$ with $l < T$. We then reshape it to $\mathbb{R}^{N \times (ld)}$ and feed it to the FFN. We perform a grid search on the dimensions of the FFN, i.e., $d^{ffn}_1$ and $d^{ffn}_2$, over {32, 64, 128, 256, 512}, and tune the intermediate dimension $l$ over {2, 4, 6, 8, 12}. The settings of these three hyperparameters for all datasets are shown in Table 6. Finally, we set the diffusion step K = 3 for all datasets.
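Under our reading of the description above (a linear map T → l to shrink the temporal dimension, then a three-layer FFN with LeakyReLU in between producing the τ-step forecast), the output head could be sketched in PyTorch as follows. The class name, shapes and default dimensions are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ForecastHead(nn.Module):
    """Hypothetical output head: linear reduction T -> l, flatten to l*d,
    then a 3-layer FFN with LeakyReLU activations mapping to tau steps."""

    def __init__(self, T, l, d, d_ffn1, d_ffn2, tau):
        super().__init__()
        self.reduce = nn.Linear(T, l)             # acts on the temporal axis
        self.ffn = nn.Sequential(
            nn.Linear(l * d, d_ffn1), nn.LeakyReLU(),
            nn.Linear(d_ffn1, d_ffn2), nn.LeakyReLU(),
            nn.Linear(d_ffn2, tau),
        )

    def forward(self, x):                         # x: (N, T, d)
        n = x.size(0)
        x = self.reduce(x.transpose(1, 2))        # (N, d, T) -> (N, d, l)
        x = x.transpose(1, 2).reshape(n, -1)      # flatten to (N, l*d)
        return self.ffn(x)                        # (N, tau)

head = ForecastHead(T=12, l=4, d=64, d_ffn1=64, d_ffn2=32, tau=3)
out = head(torch.randn(5, 12, 64))
assert out.shape == (5, 3)
```

Reducing T to l before flattening keeps the first FFN layer at l·d rather than T·d input units, which is the parameter-saving motivation stated in the text.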

F.5 DETAILS FOR VISUALIZATION EXPERIMENTS

To verify the effectiveness of EV-FGN in learning spatial-temporal dependencies on the fully-connected supra-graph, we take the output of EV-FGN as the node representation, denoted as $\mathcal{R} = \mathrm{IDFT}(\Psi_K(\mathcal{X})) \in \mathbb{R}^{N \times T \times d}$, with the inverse discrete Fourier transform (IDFT) and the K-layer EV-FGN $\Psi_K$. Then we visualize the adjacency matrix A calculated from the flattened node representation $R \in \mathbb{R}^{NT \times d}$, formulated as $A = RR^T \in \mathbb{R}^{NT \times NT}$, to show the variable correlations. Note that A is normalized via $A/\max(A)$. Since it is not feasible to directly visualize the huge adjacency matrix A of the supra-graph, we visualize its different subgraphs in Figures 3, 4, 5 and 10 to better examine the learned spatial-temporal information on the supra-graph from different perspectives.

Figure 3: On the METR-LA dataset, we average the adjacency matrix A over the temporal dimension (i.e., marginalizing T) to obtain $A' \in \mathbb{R}^{N \times N}$. Then, we randomly select 20 detectors out of all N = 207 detectors and obtain their corresponding sub adjacency matrix ($\mathbb{R}^{20 \times 20}$) from $A'$ for visualization.
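The adjacency computation described above is straightforward to reproduce. The sketch below uses random stand-in node representations (the shapes are small illustrative assumptions) and mirrors the flatten → $RR^T$ → normalize → marginalize-T pipeline used for the figures.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, d = 5, 4, 8

# Stand-in for the node representation R = IDFT(Psi_K(X)), shape (N, T, d).
R = rng.standard_normal((N, T, d))

# Flatten the spatial-temporal nodes and form the supra-graph adjacency.
R_flat = R.reshape(N * T, d)
A = R_flat @ R_flat.T        # (N*T, N*T)
A = A / A.max()              # normalize via A / max(A)

# Marginalize the temporal dimension to get a variable-level adjacency A'.
A_blocks = A.reshape(N, T, N, T)
A_prime = A_blocks.mean(axis=(1, 3))   # (N, N), averaged over time

assert A.shape == (N * T, N * T)
assert A_prime.shape == (N, N)
```

Sub adjacency matrices for a handful of variables or timestamps (as in Figures 3, 4, 5 and 10) are then just row/column slices of `A` or `A_prime`.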

