MULTIVARIATE TIME SERIES FORECASTING BY GRAPH ATTENTION NETWORKS WITH THEORETICAL GUARANTEES Anonymous authors Paper under double-blind review

Abstract

Multivariate time series forecasting (MTSF) aims to predict future values of multiple variables from past values of a multivariate time series, and has been applied in fields including traffic flow prediction, stock price forecasting, and anomaly detection. Capturing the inter-dependencies among variables is a significant challenge in MTSF. Recent works have considered several methods that model the correlations between variables to improve test prediction accuracy; however, none of them has theoretical guarantees. In this paper, we develop a new norm-bounded graph attention network (GAT) for MTSF by upper-bounding the Frobenius norm of the weights in each layer of the GAT model to achieve optimal performance. Under optimal parameters, we theoretically show that our model achieves a generalization error bound expressed as a product of the Frobenius norms of the weights in each layer and of the numbers of neighbors and attention heads, where the latter two enter as polynomial terms whose degree is the number of layers. Empirically, we investigate the impact of different components of GAT models on the performance of MTSF, and our experiments verify our theoretical findings. We also observe that the generalization performance of our method depends on the number of attention heads, the number of neighbors, the scales (norms) of the weight matrices, the scale of the input features, and the number of layers. Our method provides novel perspectives for improving the generalization performance for MTSF, and our theoretical guarantees have substantial implications for designing attention-based methods for MTSF.

1. INTRODUCTION AND BACKGROUNDS

Substantial time series data generated in the real world make multivariate time series forecasting (MTSF) a crucial topic in various scenarios, such as traffic forecasting, sensor signal anomaly detection in the Internet of Things, demand and supply prediction in supply chain management, and stock market price prediction in financial investment (Cao et al., 2020). Traditional methods simply deploy time series models, e.g., auto-regressive (AR) (Mills & Mills, 1990), auto-regressive integrated moving average (ARIMA) (Box et al., 2015), and vector auto-regression (VAR) (Box et al., 2015; Hamilton, 2020; Lütkepohl, 2005), for forecasting. Specifically, ARIMA, though one of the classic forecasting methods in univariate situations, fails to accommodate multivariate problems due to its high computational complexity. VAR, an extension of the AR model to multivariate situations, is widely used in MTSF tasks due to its simplicity; however, it cannot handle nonlinear relationships among variables, leading to reduced forecasting accuracy. In addition to traditional statistical methods, deep learning methods have been applied to MTSF problems and have demonstrated potential to solve them (Tokgöz & Ünal, 2018). Long short-term memory (LSTM) (Graves, 2012), gated recurrent units (GRU) (Cho et al., 2014), gated linear units (GLU) (Dauphin et al., 2017), temporal convolution networks (TCN) (Bai et al., 2018), and the state frequency memory (SFM) network (Zhang et al., 2017) have found success in practical time series tasks. However, another important issue in time series data, complex inter-dependency (i.e., the correlations among multiple correlated time series), is still unaddressed in these methods, restricting forecasting accuracy (Bai et al., 2020; Cao et al., 2020). For example, in the traffic forecasting task, adjacent roads naturally interplay with each other.
Another example is stock price prediction, in which it is easier to predict a stock price based on the historical information of stocks in similar categories, while information on stocks from other sectors can be relatively useless. A graph is a special form of data that describes the relationships between different entities. Recently, graph neural networks (GNNs) (Scarselli et al., 2008) have achieved great success in handling graph data with development in permutation-invariance, local connectivity, and compositionality. In general, GNNs assume that the state of a node is influenced by the states of its neighbors. By disseminating information through structures, GNNs allow each node in a graph to be aware of its neighborhood context. MTSF can be viewed naturally from a graph perspective. Variables from multivariate time series can be considered as nodes in a graph, where they are interlinked with each other through hidden dependency relationships. It follows that modeling multivariate time series data using GNNs can be a promising way to preserve their temporal trajectory while exploiting the inter-dependency among time series. In the meantime, due to the popularity of convolutional neural networks (CNNs), considerable studies attempt to generalize convolutions to graph-structured data, leading to the creation of graph convolutional networks (GCNs) (Duvenaud et al., 2015; Atwood & Towsley, 2016; Monti et al., 2018; Niepert et al., 2016; Kipf & Welling, 2017). GCNs model a node's feature representation by aggregating the representations of its one-step neighbors. Many studies have shown that GNN- and GCN-based methods outperform prior methods in time series forecasting tasks (Yu et al., 2017; Wu et al., 2019; Chen et al., 2020). The graph attention network (GAT) (Veličković et al., 2017), one of the most popular GNN architectures, is considered a state-of-the-art neural architecture for processing graph-structured data.
Building on the aggregating approach of GCNs, in GATs every node computes the importance of its neighboring nodes and then uses the importance as weights to update its feature representations during the aggregation. Compared to the well-known GCNs, GATs have demonstrated equivalent, if not improved, performance across well-established benchmarks of multiclass node classification. Within the GAT framework, Guo et al. (2019) and Deng & Hooi (2021) use GAT-based models to adaptively adjust the correlations among multiple time series, showing better accuracy than GNNs and GCNs. The numeric and experimental successes of GATs for MTSF notwithstanding, theoretical understanding of the underlying mechanisms of GATs for MTSF is still limited: none of these methods has theoretical guarantees with respect to generalization error bounds, the most commonly used method to theoretically evaluate a prediction model. The generalization error bound provides a standard approach to evaluate neural networks, as it characterizes the predictive performance of a class of learning models on unseen data (Golowich et al., 2018). Therefore, understanding the generalization error bound of GATs for MTSF will shed light on the relationship between the architecture of GATs and their generalization performance for MTSF, advancing the understanding of the underlying mechanisms. Studies show that deriving a generalization error bound for neural network classes requires constraints on the size of the weights. Bartlett (1998) first gave a generalization error bound for neural networks by bounding the size of the cover of neural network function classes, suggesting that the bound depends on the number of training samples and the size of the weights, rather than the number of weights. In subsequent studies, the empirical Rademacher complexity (ERC) was shown to be an essential component of generalization error bounds for neural network classes.
Bartlett & Mendelson (2002) introduced generalization error bounds using the Rademacher complexity of function classes that include neural networks with constraints on the magnitudes of weights for binary classification. Bartlett et al. (2017) then presented a margin-based generalization error bound using the Rademacher complexity of neural network function classes with the spectral norms of weight matrices being controlled for multiclass classification. Neyshabur et al. (2015) used Rademacher complexity bounds to show that the generalization error of deep neural networks can be upper bounded in terms of the Frobenius norms of weights. Golowich et al. (2018) further demonstrated that the generalization error bound for deep neural network classes with bounded Frobenius norm of weights can be independent of the number of layers and the width of each layer if proper techniques are employed. These methods have also been extended to graph-based neural networks. Garg et al. (2020) derived generalization error bounds for GNNs using Rademacher complexity for binary classification. Lv (2021) provided a generalization error bound for GCNs via Rademacher complexity for binary classification. Contributions. In this study, to capture the inter-dependencies among variables of MTSF, we develop a GAT-based method for MTSF; to secure the generalization error bound, we require the norm of the weight matrices in our model to be bounded; to evaluate the performance of our method, we compare it with two SOTA methods and show that it outperforms these prior methods. We also provide the theoretical generalization error bound for our method, aiming to develop models with a desired generalization error for MTSF. Specifically, we derive the generalization error bounds of two-layer GAT models for the multi-step MTSF task. We also extend our generalization error bounds to deep GAT models with more than two layers.
Generalization error bounds derived in this study are based on a bound on the ERC of GAT models with the weight matrix norm being controlled. This approach is characterized by controlling the Frobenius norm of the hidden-layer weight matrix, a common method to derive norm-based generalization error bounds for DNNs, CNNs, and GNNs. In particular, we show that the ERC derived for GAT models for MTSF has a polynomial dependence on the number of neighbors considered in the attention representation and on the number of attention heads being used. This ERC also depends on the product of the norms of the weight matrices of each layer, the $L_2$-norm of the input feature vector, and the Lipschitz constants of the loss and activation functions. To further understand the effectiveness of GATs for MTSF, in addition to the theoretical analysis of the relationships between components of the GAT model and this bound, we also investigate the influence of different components of GAT models on the performance of MTSF using experiments with complex stock price data. Our experimental results are consistent with the theoretical findings. To the best of our knowledge, we develop the first GAT-based method for MTSF with theoretical guarantees.

2.1. PROBLEM FORMULATION

In this paper, we focus on the task of MTSF, considering a multivariate situation that contains N correlated univariate time series represented as $\{X_1, \dots, X_N\}$, where $X_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,t}, \dots\}$ denotes time series i from time step 1 onward. Based on a sequence of T historical time steps of values up to the current time t, our goal is to predict the multi-step-away values $\{y_1, \dots, y_N\}$ using an appropriate prediction model f, where each $y_i = \{y_{i,t+1}, \dots, y_{i,t+C}\}$ holds the values at C future timestamps. In addition, the historical inputs can be representative of multiple aspects if complemented with auxiliary features, so our problem can be characterized as

$$\{y_1, \dots, y_N\} = f\big(\{x_{1,t}, \dots, x_{1,t-T+1}\}, \dots, \{x_{N,t}, \dots, x_{N,t-T+1}\}\big).$$

To accurately capture the inter-dependency, the problem is formulated on the graph structure introduced below.
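As a minimal sketch of the input/target construction described above, the following helper slices one window out of N aligned series. The function name `make_window`, the 0-based time index, and the array layout are illustrative assumptions, not part of the paper.

```python
import numpy as np

def make_window(series, t, T, C):
    """Build one (input, target) pair from N aligned series.

    series: array of shape (N, length); t is the 0-based current time step.
    Returns x of shape (N, T), ordered (x_t, ..., x_{t-T+1}) as in the text,
    and y of shape (N, C), holding the values at steps t+1, ..., t+C.
    """
    series = np.asarray(series, dtype=float)
    if t - T + 1 < 0 or t + C >= series.shape[1]:
        raise ValueError("window does not fit inside the series")
    x = series[:, t - T + 1 : t + 1][:, ::-1]  # most recent value first
    y = series[:, t + 1 : t + C + 1]           # the C future values
    return x, y
```

Each row of `x` plays the role of the feature vector of one node, and each row of `y` is that node's C-step target.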

2.2. THE GRAPH STRUCTURE

Now we consider an undirected graph $G = (\mathcal{N}, \mathcal{E})$. $\mathcal{N} = (n_1, \dots, n_N)$, with $|\mathcal{N}| = N$, is a set of node labels representing the sources of the N time series. $\mathcal{E} \subset \mathcal{N} \times \mathcal{N}$ is the set of edges representing the connections between series. We let $x_i \in \mathcal{X}$, $i \in [N]$, be a random variable representing the input feature vector of node $n_i$ for time series i. For node i, its random input feature $x_i \in \mathcal{X} \subset \mathbb{R}^D$ is a multi-dimensional vector that contains all the historical values from T time steps; in other words, we let $x_i = (x_{i,t}, \dots, x_{i,t-T+1})$ be the concatenation of T time steps. Its true label $y_i \in \mathcal{Y} \subset \mathbb{R}^C$ is the vector of the C-step-away values. During the learning period, the C-step-away true values y are known for some nodes, which are treated as the training set. We denote the set of indices of those nodes as $\mathcal{M} \subset \mathcal{N}$ such that $M = |\mathcal{M}| < N$. We order the nodes in $\mathcal{M}$ by their node labels $n_i$ and re-index them by their order number, $j \in [M]$. The random inputs and labels over $\mathcal{M}$ are $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}$. In the following paragraphs, we introduce the GATs for our problem.

2.3. GATS MODEL

We consider the GATs defined by Veličković et al. (2017). Given the random feature matrix $X = [x_1, x_2, \dots, x_N]$, an L-layer GAT model f has the final output $Z^{(L)}$,

$$Z^{(L)} = [f(x_1), \dots, f(x_M)] = P^{(L)} \Big[ \oplus_{k=1}^{K} \sigma\Big( P^{(L-1)} \cdots \Big[ \oplus_{k=1}^{K} \sigma\Big( P^{(2)} \Big[ \oplus_{k=1}^{K} \sigma\big( P^{(1,k)} X W^{(1,k)} \big) \Big] W^{(2)} \Big) \Big] \cdots W^{(L-1)} \Big) \Big] W^{(L)}. \tag{1}$$

In a GAT with more than two layers, the output of hidden layer l is

$$Z^{(l)} = \oplus_{k=1}^{K} \sigma\big( P^{(l,k)} Z^{(l-1)} W^{(l,k)} \big),$$

where the index (l) or (l-1) indicates which layer a variable belongs to. We have $l \in \{1, 2, \dots, L-1\}$ and $Z^{(0)} = X$. Here, $W^{(l,k)} \in \mathbb{R}^{D_{l-1} \times D_l}$ is an (l-1)-to-l weight matrix for a hidden layer with $D_l$ feature maps, $\sigma$ is the activation function, and $\oplus$ denotes the concatenation over the attention heads. This definition is specific to GATs, and its detailed description can be found in Veličković et al. (2017). Layer l has K such matrices in total, with each matrix $W^{(l,k)}$ corresponding to one attention head. $P^{(1,k)}$, $P^{(2)}$, and $P^{(L-1)}$ are the attention matrices that we introduce; they function as operators that incorporate the attention. We will justify their equivalence to the original GAT models (Veličković et al., 2017) in later paragraphs. Even though our analysis covers GATs with more than two layers, we focus on the two-layer GAT model, which is also the one implemented by Veličković et al. (2017), with the following simple form:

$$Z^{(2)} = [f(x_1), \dots, f(x_M)] = P^{(2)} \Big[ \oplus_{k=1}^{K} \sigma\big( P^{(1,k)} X W^{(1,k)} \big) \Big] W^{(2)}. \tag{2}$$

The first layer consists of K attention heads computing $D_1$ features each (a total of $K \times D_1$ features), followed by an activation function $\sigma$. Here, $W^{(1,k)} \in \mathbb{R}^{D \times D_1}$ is an input-to-hidden weight matrix for a hidden layer with $D_1$ feature maps, and we have K such matrices. The second layer is used for prediction: a single attention head that predicts y. For C-step-away forecasting, we have $D_L = C$. $W^{(2)} \in \mathbb{R}^{KD_1 \times C}$ is a hidden-to-output weight matrix.
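The two-layer form can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the attention matrices are passed in as fixed inputs rather than computed from the features, ReLU stands in for the activation $\sigma$, and the function name `two_layer_gat` is hypothetical.

```python
import numpy as np

def relu(z):
    # Assumed activation: 1-Lipschitz, zero at zero, positively homogeneous.
    return np.maximum(z, 0.0)

def two_layer_gat(X, P1, W1, P2, W2, sigma=relu):
    """Two-layer forward pass with given attention matrices.

    X:  (N, D) node features.
    P1: list of K attention matrices, each (N, N), one per head.
    W1: list of K weight matrices, each (D, D1), one per head.
    P2: (N, N) attention matrix of the output layer.
    W2: (K * D1, C) hidden-to-output weight matrix.
    Returns an (N, C) array; row j is the C-step prediction for node j.
    """
    heads = [sigma(P @ X @ W) for P, W in zip(P1, W1)]
    Z1 = np.concatenate(heads, axis=1)  # concatenation over heads: (N, K * D1)
    return P2 @ Z1 @ W2
```

Restricting the returned rows to the M labeled nodes gives the outputs that enter the empirical risk.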

Attention Model

We now give more explanation of the attention introduced in the GAT model. In section 2 of the original GAT paper, Veličković et al. (2017) describe a learnable linear transformation that maps the input features into higher-level features with sufficient expressive power. In their process, they first apply a shared linear transformation, parameterized by the weight matrix $W \in \mathbb{R}^{D \times D_1}$, to every node. Then they perform self-attention on the nodes: a shared attentional mechanism $a : \mathbb{R}^{D_1} \times \mathbb{R}^{D_1} \to \mathbb{R}$ computes attention coefficients $e_{ij} = a(x_i W, x_j W)$ that indicate the importance of node j's features to node i. They also inject the graph structure into the mechanism by performing masked attention: they compute $e_{ij}$ only for nodes $j \in \mathcal{N}(i)$, the neighbors of node i, which might include node i itself. Then they normalize the coefficients with the softmax function to make them easily comparable across different nodes, obtaining

$$p_{i,j} = \phi([e_{i,1}, e_{i,2}, \dots])_j = \frac{\exp(e_{i,j})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{i,k})}.$$

The output features of the first layer for each node are then $\sigma\big(\sum_{j \in \mathcal{N}(i)} p_{i,j} x_j W\big)$. They also propose K-head attention: K independent attention mechanisms execute the aforementioned transformation, and their outputs are concatenated, resulting in the feature representation

$$\oplus_{k=1}^{K} \sigma\Big( \sum_{j \in \mathcal{N}(i)} p_{i,j}^{(k)} x_j W^{(k)} \Big).$$

To make the above process more integrated, in our GAT models we introduce the attention matrix P, which contains the normalized attention coefficients used to compute a linear combination of the neighborhood features, yielding the new feature representation for every node. This matrix contains each node's importance weights with respect to every other node in its neighborhood. Let $P^{(l)} \in \mathbb{R}^{N \times N}$, $l \in [L-1]$, be the graph attention matrix defined by the attention coefficients. We use $N_e \le N$ as the fixed number of neighbors for each node, and let $\mathcal{N}(n)$ be the set of neighbors of node n.
$$P^{(l)} = \begin{bmatrix} -\; p_1^{(l)} \;- \\ -\; p_2^{(l)} \;- \\ \vdots \\ -\; p_N^{(l)} \;- \end{bmatrix} = \begin{bmatrix} \underbrace{0 \;\; p_{1,i}^{(l)} \;\; p_{1,k}^{(l)} \;\cdots\; p_{1,m}^{(l)}}_{\{i,k,m\} = \mathcal{N}(1)} \\ \vdots \\ \underbrace{p_{N,s}^{(l)} \;\; p_{N,q}^{(l)} \;\cdots\; p_{N,j}^{(l)} \;\; 0}_{\{s,q,j\} = \mathcal{N}(N)} \end{bmatrix},$$

where $p_{n,e}^{(l)} \in [0, 1]$ is the coefficient of node n attributed from node e. Each row of $P^{(l)}$ sums to 1, i.e., $\sum_{e \in \mathcal{N}(n)} p_{n,e}^{(l)} = 1$. Here $p_n^{(l,k)}$ (row n of $P^{(l,k)}$) collects all coefficients of node n, which are computed by

$$p_n^{(l,k)} = \phi\Big( M \circ \Big( \big[\, Z^{(l-1)} W^{(l,k)} N^{(l,k)},\; u_n \cdot \mathbf{1}_N \,\big]\, \mathbf{1}_2 \Big) \Big) \in \mathbb{R}^N,$$

where $\phi$ is a softmax function used to calculate the importance weights of the other nodes. $N^{(l,k)} \in \mathbb{R}^{D_l \times 1}$ is a convolutional filter with filter size $1 \times 1$ and output channel size 1, and $u_n$ is the n-th entry of $Z^{(l-1)} W^{(l,k)} N^{(l,k)}$. M is a mask matrix with $M_{n,e} = 1$ for $n \in [N]$ and $e \in \mathcal{N}(n)$, and 0 otherwise; $\circ$ is the element-wise product; and $\mathbf{1}_N$ is the size-N vector with all entries equal to 1 (likewise $\mathbf{1}_2$ for size 2, so that the bracketed product equals $Z^{(l-1)} W^{(l,k)} N^{(l,k)} + u_n \cdot \mathbf{1}_N$). In the paper by Veličković et al. (2017), the proposed attention mechanism a consists of the following steps: first, apply another convolutional filter N to $xW$; second, sum these representations to get the attention coefficients $e_{i,j} = x_i W N + x_j W N$. We integrate the result of this whole process into the attention matrix $P^{(l)}$, which functions exactly as the attention mechanism in the original architecture introduced by Veličković et al. (2017). Since each row of P sums to one, the following property holds for $P^{(l)}$: $\|P^{(l)}\|_F \le N$, i.e., the Frobenius norm of $P^{(l)}$ is bounded by the total number of nodes.
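To make the construction of the attention matrix concrete, here is a hedged NumPy sketch. It follows the simplified score $e_{i,j} = x_i W N + x_j W N$ described in the text and, as an assumption, applies the mask by excluding non-neighbors from the softmax (the usual masked-softmax convention). The names `attention_matrix` and `n_filter` are illustrative.

```python
import numpy as np

def attention_matrix(Z, W, n_filter, mask):
    """Row-stochastic attention matrix for one head.

    Z: (N, D_prev) layer input; W: (D_prev, D_l) weights;
    n_filter: (D_l,) filter producing one score per node;
    mask: (N, N) boolean, True where column e is a neighbor of row n
          (each row must contain at least one True).
    """
    u = Z @ W @ n_filter                # one scalar score per node, shape (N,)
    e = u[:, None] + u[None, :]         # e[i, j] = u_i + u_j
    e = np.where(mask, e, -np.inf)      # non-neighbors get zero weight
    e = e - e.max(axis=1, keepdims=True)  # shift for numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)
```

Every row of the result is non-negative and sums to 1, so its Frobenius norm is at most $\sqrt{N}$, which in particular satisfies the bound $\|P^{(l)}\|_F \le N$ used in the text.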

2.4. THE EMPIRICAL RISK FRAMEWORK FOR MTSF

We first introduce the function spaces of GATs for MTSF. Let $\mathcal{X}$ and $\mathcal{Y}$ be the feature and true label spaces, respectively, and Q an unknown distribution over $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{F} \subset \mathcal{V}^{\mathcal{X}}$ be the hypothesis class for predictions, where $\mathcal{V}$ is another space that might be different from $\mathcal{Y}$. In our paper, we let the function space $\mathcal{F}$ be the space of our GAT classes, which contains the GAT functions f. We defer the detailed definition of $\mathcal{F}$, in terms of Weights-bounded GATs for the MTSF problem, to later sections; see definition 15. Given $\mathcal{F}$, $\mathcal{X}$, and $\mathcal{Y}$, we let $g : \mathcal{V} \times \mathcal{Y} \to [0, B]$ be the loss function defined over $\mathcal{F}$. We assume that g is bounded, i.e., the range of the loss is $[0, B]$. Additionally, we require $B = 1$ without loss of generality (if not, we can scale the loss function). We also introduce the function class $g_{\mathcal{F}} \subset [0, B]^{\mathcal{X} \times \mathcal{Y}}$ obtained by composing the functions in $\mathcal{F}$ with $g(\cdot, \cdot)$, i.e., $g_{\mathcal{F}} = \{(x, y) \mapsto g(f(x), y) : f \in \mathcal{F}\}$. For any risk function g defined over $\mathcal{F}$, given the training set $S = \{(x_1, y_1), \dots, (x_M, y_M)\}$, which includes M i.i.d. samples from $\mathcal{X} \times \mathcal{Y}$ drawn according to the distribution Q, the expected (population) risk $E(f)$ and the empirical risk $\hat{E}(f)$ are defined as

$$E(f) = \mathbb{E}_{(x,y) \sim Q}\,[g(f(x), y)], \quad f \in \mathcal{F}, \tag{5}$$

$$\hat{E}(f) = \frac{1}{M} \sum_{j=1}^{M} g(f(x_j), y_j).$$

A predictor $f \in \mathcal{F}$ generalizes if, for any $\delta > 0$, $\limsup_{|S| = M \to \infty} \hat{E}(f) \to E(f)$ a.s. Whether a predictor has a generalization guarantee is closely related to the complexity of its hypothesis space. In that sense, the generalization error bound for $\mathcal{F}$ is characterized by the condition that $E(f)$ is bounded by the sum of $\hat{E}(f)$, the ERC (which is generally the dominating term), and an error function associated with the confidence of the bound and the sample size M.

2.5. THE RADEMACHER COMPLEXITY

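Suppose $\mathcal{F} = \{f : x \mapsto f(x)\}$ is a model space. We define the ERC $R(\mathcal{F})$ and the Rademacher complexity $R_S(\mathcal{F})$ as

$$R(\mathcal{F}) = \mathbb{E}_{\epsilon}\left[ \frac{1}{M} \sup_{f \in \mathcal{F}} \sum_{j=1}^{M} \epsilon_j f(x_j) \,\Big|\, x_1, \dots, x_N \right], \tag{7}$$

$$R_S(\mathcal{F}) = \mathbb{E}_{S \sim Q}\, R(\mathcal{F}),$$

where $\{\epsilon_1, \dots, \epsilon_M\}$ are i.i.d. Rademacher variables satisfying $P(\epsilon_j = 1) = P(\epsilon_j = -1) = 1/2$.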
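To build intuition for the ERC, it can be estimated by Monte Carlo whenever the supremum has a closed form. The sketch below does this for a much simpler class than the GAT class, namely norm-bounded linear predictors $\{x \mapsto \langle w, x \rangle : \|w\| \le M_1\}$, for which the supremum equals $M_1 \|\sum_j \epsilon_j x_j\| / M$ by the Cauchy-Schwarz inequality. The choice of class and the name `erc_linear_class` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def erc_linear_class(X, weight_bound, n_draws=2000, seed=0):
    """Monte Carlo estimate of the ERC of {x -> <w, x> : ||w|| <= weight_bound}.

    X: (M, D) array of sample points. For each draw of Rademacher signs the
    supremum over w is weight_bound * ||sum_j eps_j x_j|| / M.
    """
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, M))  # i.i.d. Rademacher signs
    sups = weight_bound * np.linalg.norm(eps @ X, axis=1) / M
    return sups.mean()
```

By Jensen's inequality the exact ERC of this class is at most $M_1 \sqrt{\sum_j \|x_j\|^2} / M \le M_1 B / \sqrt{M}$ when $\|x_j\| \le B$, which shows the same $M^{-1/2}$ and weight-norm dependence that appears in the bounds of this paper.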

3.1. NOTATION

We use bold-faced letters to denote vectors and capital letters to denote matrices or fixed parameters (which should be clear from the context). Given a vector $w \in \mathbb{R}^D$, $\|w\|$ refers to the Euclidean norm, and for $p \ge 1$, $\|w\|_p = \big(\sum_{i=1}^{D} |w_i|^p\big)^{1/p}$ refers to the $L_p$ norm. For a matrix W, $\|W\|_F$ refers to the Frobenius norm, $\|W\|_F = \sqrt{\sum_i \sum_j |w_{i,j}|^2}$. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is L-Lipschitz, $L \ge 0$, if $\|f(a) - f(b)\| \le L \|a - b\|$ for all $a, b \in \mathbb{R}^n$. We use standard big-O notation, with $\Omega(\cdot)$, $\Theta(\cdot)$, and $O(\cdot)$ hiding constants.

3.2. FUNCTION CLASS OF GATS

Given the inputs $X = (x_1, \dots, x_N)$ as multiple time series, with each $x_i$ the input feature of node $n_i$, the class of 2-layer GATs for MTSF maps x to the output $f(x)$ that represents a C-step-away prediction, as expressed in equation 2. We consider a subset of this class in which each f has bounded weight norms, expressed as

$$\mathcal{F} = \big\{ x \mapsto f(x_j) \in \mathbb{R}^C \;;\; \|W^{(1,k)}\|_F \le M_1,\; \|w_c^{(2)}\| \le M_2,\; j \in [M] \big\},$$

where $f(x_j)$ denotes the output of f corresponding to node j, $j \in [M]$, whose true label $y_j$ we know. Furthermore, we also provide a model space $\mathcal{F}_c \subset \mathbb{R}^{\mathcal{X}}$ with a single-dimensional output that corresponds to the c-th component of the model output $f(x)$, i.e., the c-th time step, expressed as

$$\mathcal{F}_c = \big\{ x \mapsto f(x_j)_c \;;\; \|W^{(1,k)}\|_F \le M_1,\; \|w_c^{(2)}\| \le M_2,\; c \in [C],\; j \in [M] \big\}.$$

3.3. AN UPPER BOUND OF RADEMACHER COMPLEXITY OF GAT CLASS

Here we first provide an upper bound on the ERC of the GAT class $\mathcal{F}_c$ for the single-dimensional output of MTSF. Theorem 1 (Upper Bound of ERC of GAT class $\mathcal{F}_c$ for MTSF). Let the activation function $\sigma(\cdot)$ be $L_\sigma$-Lipschitz continuous, and also satisfy $\sigma(0) = 0$ and $\sigma(\alpha z) = \alpha \sigma(z)$ for all $\alpha \ge 0$. Assume the feature vector x comes from a domain with bounded $L_2$-norm, $\mathcal{X} = \{x : \|x\| \le B\}$. Assume the Frobenius norm of every weight matrix in the first layer of the GAT class is bounded, namely, $\|W^{(1,k)}\|_F \le M_1$ with some constant $M_1 > 0$ for every k. Also, the norm of the weight vector of the second layer of the GATs is bounded, $\|w_c^{(2)}\| \le M_2$, where $c \in [C]$, with some constant $M_2 > 0$. Let $\mathcal{N}_i$ denote the neighborhood of node i (including i), and let every node have the same number of neighbors, namely, $N_e := |\mathcal{N}_i|$ for all nodes $i \in \mathcal{N}$ and some common constant $N_e \in \mathbb{N}^+$. Furthermore, we consider the most general formulation, which allows every node to attend to every other node, i.e., $N_e = N$. Let $R(\mathcal{F}_c)$ be the ERC defined in definition 7 for the GAT class $\mathcal{F}_c$ in definition 17. Then, given the M-sized input set $\{x_1, \dots, x_M\}$, we have

$$R(\mathcal{F}_c) = O\big( L_\sigma B K N^{3/2} M^{-1/2} M_1 M_2 \big).$$

We see that this bound has a polynomial dependence on N, the total number of nodes. N appears here because we consider all the nodes as neighbors; it can be replaced by $N_e$, the number of neighbors. A small $N_e < N$ can result in a potentially smaller bound. The proof details can be found in §A.

3.4. GENERALIZATION ERROR BOUND OF THE GAT CLASS FOR MTSF

In this section, we give the final generalization error bound of the GAT class for MTSF. The formal result is stated in the following theorem, and the proof is in §B. Theorem 2. Define the hypothesis class $\mathcal{F}$ as in definition 15. Suppose g is Lipschitz with constant $L_g$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$, we have

$$E(f) \le \hat{E}(f) + 2\sqrt{2}\, C L_g R(\mathcal{F}_c) + 3 \sqrt{\frac{\ln(2/\delta)}{2M}},$$

where $R(\mathcal{F}_c) = O\big( L_\sigma B K N^{3/2} M^{-1/2} M_1 M_2 \big)$ from Theorem 1.

4. EXTENSION TO GAT CLASS WITH LAYERS L > 2

Now we extend the analysis to GATs with more than two layers for MTSF and provide the corresponding generalization error bounds. Here, the proof is done by a simple induction argument using the "peeling-off" technique employed for Rademacher complexity bounds for neural networks. The output of an L-layer GAT represents a multi-step-away prediction, as shown in expression 1. We define the function class over $\mathcal{M}$, according to the definition of the GAT network, as

$$\mathcal{F} = \big\{ x \mapsto f(x_j) \in \mathbb{R}^C : \|W^{(1,k)}\|_F \le M_1, \dots, \|w_c^{(L)}\| \le M_L,\; c \in [C],\; j \in [M] \big\}. \tag{11}$$

Furthermore, we require that the Frobenius norm of every weight matrix in every layer of the GAT class is bounded, namely, for any $l \in [L]$, $\|W^{(l,k)}\|_F \le M_l$ with some constant $M_l > 0$ for every k. We also define the single-dimensional GAT function space as

$$\mathcal{F}_c = \big\{ x \mapsto f(x_j)_c : f \in \mathcal{F},\; c \in [C],\; j \in [M] \big\}. \tag{12}$$

The output up to layer l, $l \in [L-1]$, has the form

$$[f_l(x_1), \dots, f_l(x_M)] = \oplus_{k_l=1}^{K_l} \sigma\Big( P^{(l)} \cdots \Big[ \oplus_{k_2=1}^{K_2} \sigma\Big( P^{(2)} \Big[ \oplus_{k_1=1}^{K_1} \sigma\big( P^{(1,k)} X W^{(1,k)} \big) \Big] W^{(2)} \Big) \Big] \cdots W^{(l)} \Big) \in \mathbb{R}^{D_l}.$$

Thus, we define a layer-wise class of functions as

$$\mathcal{F}_l = \big\{ f_l : x \mapsto f_l(x_j) \in \mathbb{R}^{D_l},\; \|W^{(1,k)}\|_F \le M_1, \dots, \|W^{(l,k)}\|_F \le M_l,\; j \in [M] \big\}.$$

We provide an upper bound on the ERC of the GAT class $\mathcal{F}_c$ with L layers, with proof details in §C. Theorem 3 (Upper Bound of ERC of GAT class $\mathcal{F}_c$ with L layers). Let all the assumptions of Theorem 1 be fulfilled. Furthermore, let the Frobenius norm of every weight matrix in the first L-1 layers of the GATs be bounded, namely, $\|W^{(l,k)}\|_F \le M_l$ with some constant $M_l > 0$ for every k. Also, let the norm of the weight vector of the last layer be bounded, $\|w_c^{(L)}\| \le M_L$, where $c \in [C]$, with some constant $M_L > 0$. Let $R(\mathcal{F}_c)$ be the ERC defined in Equation 7 for the GAT class $\mathcal{F}_c$ in definition 12. Then, given the M-sized input set $\{x_1, \dots, x_M\}$, we have

$$R(\mathcal{F}_c) = O\Big( (4 L_\sigma K)^{L-1} \prod_{l=1}^{L} M_l \,\big( B N^{L-1/2} M^{-1/2} \big) \Big).$$
Based on Theorem 3, we provide the generalization error bounds for the GAT class with more than two layers in the following theorem. Theorem 4 (Generalization Error Bounds for the GAT Class with More than Two Layers). Given any real $\delta \in (0, 1)$, with probability at least $1 - \delta$, for any $f \in \mathcal{F}$ defined in definition 11, which contains L-layer GATs, and given the M-sized training data set $S = \{(x_1, y_1), \dots, (x_M, y_M)\}$, the generalization error for MTSF is upper bounded as

$$E(f) \le \hat{E}(f) + 2\sqrt{2}\, C L_g R(\mathcal{F}_c) + 3 \sqrt{\frac{\ln(2/\delta)}{2M}},$$

where $R(\mathcal{F}_c)$ is given in Theorem 3. To give the generalization error bound of a deep GAT, we first derive its ERC. To bound the ERC, we apply the layer-peeling strategy, in which the ERC of L-layer networks is expressed as a factor multiplied by the ERC of (L-1)-layer networks. Specifically, this factor is the Frobenius norm of the weight matrix, so our current bounds scale with the product of these norms as the number of layers increases. However, the number of attention heads and the number of attention neighbors appear as polynomial terms with an order roughly equal to the number of layers L, so the bound has an exponential dependence on the network depth. The generalization error bound in Theorem 4 implies that the following steps can be taken to reduce the generalization error: i) increase the number of training samples, ii) minimize the empirical loss, and iii) design the neural network carefully to achieve a proper hypothesis class. Increasing the complexity of the hypothesis class can decrease the approximation error but also increases the estimation error due to a large ERC, which leads to undesired test performance in a practical task. In the next section, we empirically show how structural components of GATs related to complexity affect the test performance for an MTSF task, providing empirical support for our theoretical findings.
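To see how the quantities in Theorems 3 and 4 interact, the following sketch evaluates the leading term of the ERC bound and the right-hand side of the generalization bound. It is illustrative only: the constant hidden by $O(\cdot)$ is set to 1, the coefficient of the ERC term is read as $2\sqrt{2}\,C L_g$, N is replaced by the number of neighbors $N_e$ as the text permits, and the function names are hypothetical.

```python
import math

def erc_scale(L, K, N_e, B, M, weight_bounds, L_sigma=1.0):
    """Leading term of the L-layer ERC bound, with the hidden constant set to 1.

    weight_bounds: the L norm bounds (M_1, ..., M_L), one per layer.
    """
    if len(weight_bounds) != L:
        raise ValueError("need one weight bound per layer")
    depth_factor = (4.0 * L_sigma * K) ** (L - 1)
    return depth_factor * math.prod(weight_bounds) * B * N_e ** (L - 0.5) / math.sqrt(M)

def generalization_rhs(train_loss, erc, C, L_g, M, delta):
    """Empirical risk + ERC term + confidence term of the generalization bound."""
    confidence = 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * M))
    return train_loss + 2.0 * math.sqrt(2.0) * C * L_g * erc + confidence
```

For instance, quadrupling M halves `erc_scale`, while adding a layer multiplies it by $4 L_\sigma K \cdot M_l \cdot N_e$, which is the exponential depth dependence discussed above.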

5. EXPERIMENT DETAILS

Our goal is to show the relationship between the upper bound of the generalization error of the GAT model and the variables in the ERC, including the number of attention heads, the number of neighbors, the Frobenius norm of the model weights, the number of model layers, the norm of the inputs, and the number of labeled nodes. We use daily stock price data from Nasdaq and NYSE. The multivariate time series cover about 1500 stocks. For each stock, one week of historical data is used to predict future returns. For every stock, we assume that every feature is i.i.d. per day. By default, we use a three-layer (input layer, hidden layer, and output layer) GAT to do single-step forecasting of the returns of each stock. For each variable, we try various values with all other variables' values fixed. We repeat the experiment 20 times and report the loss on the out-of-sample test dataset to represent the generalization error. We use mean squared error (MSE) as the evaluation metric. The experiment results in Figure 1 show the test error beginning to increase quadratically after some value of K. This is consistent with our theoretical results on the generalization error bound.

6. EXPERIMENT RESULTS

Number of Neighbors ($N_e$). For three-layer GATs, Theorem 3 indicates that as the number of neighbors considered in the attention operation increases, the upper bound of the ERC grows as $O(N_e^{L-1/2})$. Figure 1 demonstrates that as $N_e$ increases, the test error conforms to this theoretical error bound. It is noteworthy, however, that when the number is too small, the loss is also high. It is possible that in a certain range, the information loss due to the limited number of neighbors is the dominant influence. Norm of Weight Matrix ($M_l$). We also empirically evaluate the relationship between the generalization error bound and the Frobenius norm of the GAT models' weights. Theorem 3 indicates that as the bound on the Frobenius norm increases, the upper bound of the ERC increases polynomially. The experiment results in Figure 1 corroborate this conclusion. When the weight norm increases, the generalization error first decreases, then increases. The initial decrease occurs because the bound on the norms is so small that it severely prevents the weights from receiving sufficient updates. Thus, the scales (norms) of the weight matrices should be neither too large (which induces a large generalization error) nor too small (which harms the weights' updates), and choosing proper scales is important in practice, as prior work has shown (Li et al., 2018). Number of Layers (Model Depth, L). Theorem 3 and Theorem 1 indicate that the upper bound of the ERC increases exponentially with an increasing number of layers, suggesting that the number of layers has a negative impact on the test performance of GATs for MTSF. However, Sun et al. (2016) reported that deeper nets, which have larger representation power, are able to fit training data better and achieve a smaller empirical error. This observation indicates a positive impact of the number of layers on the test performance for MTSF. Our empirical results are consistent with the above discussion of these double-edged impacts.
The experiment results are shown in Figure 1 and indicate that as the number of layers increases, the test error first decreases and then increases. This observation indicates that if the ERC increases quickly, the representation power cannot compensate for the negative impact of the increased number of layers. Norm of Inputs (B). The experiment results in Figure 1 show that when the input norm is greater than a certain value, the generalization loss of the GAT has a linear relationship with the upper bound of the input norm. This empirical observation aligns with Theorem 3. Number of Labeled Nodes (M). Theorem 3 also indicates that when the number of labeled nodes increases, the generalization loss of the GAT decreases, and the upper bound of the generalization loss scales as $O(M^{-1/2})$. The empirical results in Figure 1 confirm this relationship. We include the details of how we control the above six variables in §D.2. In addition, we conduct experiments on two-layer GATs to show that the empirical results are also consistent with the error bounds in Theorem 1. Results and discussions are in §E. It is noteworthy that Figure 1 also shows that the test error decreases initially and only then starts to increase as Theorem 3 suggests. The inconsistency at the beginning could be caused by other factors that affect the test error. As seen in Theorem 4, in addition to the ERC, the training error also contributes to the upper bound. Nevertheless, the ERC becomes dominant as the number of attention heads, the number of neighbors, the upper bound of the weight norms, and the number of hidden layers increase, indicating the importance of a proper design of neural networks for MTSF to guarantee a smaller test error. Among all the above factors that affect the generalization error, since the order of the upper bound of the weight norms is the highest, properly controlling the upper bound of the weight norms is especially significant.
As reported in the previous literature (Wu et al., 2021), increasing the complexity of the hypothesis class through larger weight matrix bounds can decrease the approximation error but may also increase the estimation error, which corresponds to the second term on the RHS in Theorem 4. Therefore, in practice, we generally start with a simple neural network and gradually increase its complexity through larger weight matrix bounds to improve the test performance; the bound can be treated as a tuning parameter in our model. We call this weight control. To illustrate the value of weight control, we develop an improved version of the GAT called the Weights-bounded GAT and conduct an additional experiment to demonstrate its improvement over the vanilla GAT. As the name suggests, the Frobenius norm of the weights of each layer is bounded by a hyperparameter. We compare the Weights-bounded GAT with the vanilla GAT using the test loss on the above stock return forecasting task. We repeat the same train-test pipeline 20 times and collect 20 test losses for each model. As Figure 2 shows, the Weights-bounded GAT achieves a lower test loss than the vanilla GAT in terms of the minimum, maximum, first quartile, third quartile, and median.
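A minimal sketch of the weight control step follows. The appendix states that each element of the weight matrix is clipped to a threshold so that the Frobenius norm stays within the bound, but it does not give the threshold. The choice below, `bound / sqrt(number of entries)`, is an assumption that is sufficient for the guarantee; the helper names are hypothetical, and plain nested lists stand in for the layer's weight tensor.

```python
import math


def frobenius_norm(weight):
    """Frobenius norm of a matrix given as a non-empty list of rows."""
    return math.sqrt(sum(v * v for row in weight for v in row))


def clip_to_frobenius_bound(weight, bound):
    """Clip every entry to [-c, c] with c = bound / sqrt(#entries).

    If all |w_ij| <= c, then ||W||_F <= sqrt(#entries * c^2) = bound.
    The threshold c is an assumed choice, not taken from the paper.
    """
    num_entries = sum(len(row) for row in weight)
    threshold = bound / math.sqrt(num_entries)
    return [[max(-threshold, min(threshold, v)) for v in row] for row in weight]


weights = [[3.0, -4.0], [0.1, 0.0]]
clipped = clip_to_frobenius_bound(weights, 2.0)
print(frobenius_norm(weights), frobenius_norm(clipped))
```

In a training loop, such a step would typically be applied to each layer's weight matrix after every optimizer update. Element-wise clipping is conservative: it can shrink matrices whose norm is already within the bound if a single entry exceeds the threshold.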

APPENDICES TO MULTIVARIATE TIME SERIES FORECASTING BY GRAPH ATTENTION NETWORKS WITH THEORETICAL GUARANTEES

A PROOF OF THEOREM 1

Here we derive the upper bound of the ERC of two-layer GATs with one-dimensional output for single-time-step prediction. As mentioned in section 2.3, the second layer $l^{(2)}$ of our GAT model takes $Z^{(1)} \in \mathbb{R}^{N \times KD_1}$ as input and outputs $Z^{(2)} \in \mathbb{R}^{M \times C}$. The input to the first layer $l^{(1)}$ is a set of node features $X = (x_1, \dots, x_N)$, where each $x_i \in \mathbb{R}^{D}$ and $D$ is the number of features of each node. The first layer produces $Z^{(1)}$ with each row $z^{(1)} \in \mathbb{R}^{KD_1}$. $Z^{(1)}$ is the concatenation of the $K$ outputs of the $K$ identically structured attention layers $l^{(1,k)}$, each output being

$$H^{(1,k)} = \sigma\big(P^{(1,k)} X W^{(1,k)}\big).$$

Now we write the output vector of the GAT for all classes of node $j$, $j \in [M]$, as

$$z_j^{(2)} = \sum_{kr=1}^{KD_1} w_{kr}^{(2)} \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{d=1}^{D} w_{d,r}^{(1,k)} \sum_{i=1}^{N} p_{t,i}^{(1,k)} x_{i,d}\Big),$$

where $\Vert$ denotes concatenation over the attention heads. Here we use the index notation $kr \in [KD_1]$ because we already have the indexes $k \in [K]$ and $r \in [D_1]$. The above output can be written in vector format as

$$z_j^{(2)} = \sum_{kr=1}^{KD_1} w_{kr}^{(2)} \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big),$$

where $w_r^{(1,k)}$ represents column $r$ of $W^{(1,k)}$, $w_{kr}^{(2)}$ represents row $kr$ of $W^{(2)}$, and $w_c^{(2)}$ represents column $c$ of $W^{(2)}$. Then the class of functions defined over the subset $\mathcal{M}$ of the node set, $|\mathcal{M}| = M$, is

$$\mathcal{F} = \Big\{ x \mapsto f(x_j) = \sum_{kr=1}^{KD_1} w_{kr}^{(2)} \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big) \in \mathbb{R}^{C} ;\ \big\|W^{(1,k)}\big\|_F \le M_1,\ \big\|w_c^{(2)}\big\| \le M_2,\ j \in [M] \Big\}. \tag{15}$$

Furthermore, we also provide a model space with a single-dimensional output that corresponds to the $c$-th component of the model output $f(x)$, for the $c$-th time step. We write the output of node $j$ at time step $c$ of $l^{(2)}$ as

$$z_{j,c}^{(2)} = \sum_{kr=1}^{KD_1} w_{kr,c}^{(2)} \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{d=1}^{D} w_{d,r}^{(1,k)} \sum_{i=1}^{N} p_{t,i}^{(1,k)} x_{i,d}\Big),$$

and its vector format is

$$z_{j,c}^{(2)} = \sum_{kr=1}^{KD_1} w_{kr,c}^{(2)} \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big).$$
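Now let the hypothesis class $\mathcal{F}_c \subset \mathbb{R}^{\mathcal{X}}$ be a set of functions on $x \in \mathcal{X}$. Specifically, the single-dimensional GAT function space is defined as

$$\mathcal{F}_c = \Big\{ x \mapsto f(x_j)_c = \sum_{kr=1}^{KD_1} w_{kr,c}^{(2)} \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big) ;\ \big\|W^{(1,k)}\big\|_F \le M_1,\ \big\|w_c^{(2)}\big\| \le M_2,\ c \in [C],\ j \in [M] \Big\}.$$

The weight matrix of the first layer is $W^{(1,k)} = \big(w_1^{(1,k)}, \dots, w_{D_1}^{(1,k)}\big)$. The output of the first layer $l^{(1)}$ is

$$Z^{(1)} = \big(h_1^{(1,k)}, h_2^{(1,k)}, \dots, h_N^{(1,k)}\big) \in \mathbb{R}^{N \times KD_1}.$$

We further write the output of attention head $k$ of the first layer $l^{(1)}$ as

$$H^{(1,k)} = \big(h^{(1,k,1)}, h^{(1,k,2)}, \dots, h^{(1,k,N)}\big) \in \mathbb{R}^{N \times D_1}.$$

By the concatenation relationship, row $t$ of $Z^{(1)}$ is

$$z_t^{(1)} = \big(h^{(1,1,t)}, h^{(1,2,t)}, \dots, h^{(1,K,t)}\big) \in \mathbb{R}^{KD_1},$$

and we define each $h^{(1,k,t)}$ as

$$h^{(1,k,t)} = \big(h_1^{(1,k,t)}, h_2^{(1,k,t)}, \dots, h_{D_1}^{(1,k,t)}\big) \in \mathbb{R}^{D_1},$$

with each $h_r^{(1,k,t)} \in \mathbb{R}$ defined as

$$h_r^{(1,k,t)} = \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big).$$

Then the output of node $j$ for class $c$ can be rewritten as

$$z_{j,c}^{(2)} = \Big\langle \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)},\ w_c^{(2)} \Big\rangle,$$

where $w_c^{(2)}$ represents column $c$ of $W^{(2)}$. According to the definition of $\mathcal{F}_c$ in definition 17 and the definition of the ERC, we have

$$\begin{aligned}
R(\mathcal{F}_c) &= \mathbb{E}\Big[\frac{1}{M} \sup_{f \in \mathcal{F}} \sum_{j=1}^{M} \epsilon_j f(x_j)_c\Big] \\
&= \mathbb{E}\Big[\frac{1}{M} \sup_{\substack{\|W^{(1,k)}\|_F \le M_1 \\ \|w_c^{(2)}\| \le M_2}} \sum_{j=1}^{M} \epsilon_j \Big\langle \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)},\ w_c^{(2)} \Big\rangle\Big] \\
&= \mathbb{E}\Big[\frac{1}{M} \sup_{\substack{\|W^{(1,k)}\|_F \le M_1 \\ \|w_c^{(2)}\| \le M_2}} \Big\langle \sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)},\ w_c^{(2)} \Big\rangle\Big] \\
&\le \frac{M_2}{M}\, \mathbb{E}\Big[\sup_{\|W^{(1,k)}\|_F \le M_1} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)}\Big\|_2\Big].
\end{aligned}$$

The inequality comes from the Cauchy-Schwarz inequality.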
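The Cauchy-Schwarz step above can be checked numerically on a toy class. The sketch below is an illustration only, not the GAT class of the proof: it takes the norm-bounded linear class $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le M_2\}$, for which the supremum over $w$ equals $M_2 \|\sum_j \epsilon_j x_j\|_2$, and computes the ERC exactly by enumerating all sign patterns. The function names are hypothetical. The second function evaluates the bound obtained from Jensen's inequality together with the zero-mean independence of the Rademacher variables.

```python
import itertools
import math


def exact_erc_linear(points, weight_bound):
    """Exact ERC of {x -> <w, x> : ||w||_2 <= weight_bound} on the given points.

    Enumerates all 2^m Rademacher sign vectors, so it is only meant for tiny m.
    """
    m = len(points)
    dim = len(points[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        # v = sum_j eps_j x_j; by Cauchy-Schwarz, sup_w <w, v> = weight_bound * ||v||_2.
        v = [sum(s * x[d] for s, x in zip(signs, points)) for d in range(dim)]
        total += weight_bound * math.sqrt(sum(c * c for c in v))
    return total / (2 ** m) / m


def jensen_upper_bound(points, weight_bound):
    """weight_bound * sqrt(sum_j ||x_j||_2^2) / m, an upper bound on the exact ERC."""
    m = len(points)
    return weight_bound * math.sqrt(sum(c * c for x in points for c in x)) / m


toy_points = [[1.0], [1.0]]
print(exact_erc_linear(toy_points, 1.0), jensen_upper_bound(toy_points, 1.0))
```

For the two one-dimensional points above, the four sign patterns give norms 2, 0, 0 and 2, so the exact value is 0.5, below the Jensen bound $\sqrt{2}/2$.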
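We further unfold the expectation and obtain

$$\mathbb{E}\Big[\sup_{\|W^{(1,k)}\|_F \le M_1} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)}\Big\|_2\Big] = \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{k=1}^{K} \Big|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big|\Big].$$

The equality is due to the following derivation:

$$\begin{aligned}
\Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)}\Big\|_2^2
&= \sum_{kr=1}^{KD_1} \Big(\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} h_r^{(1,k,t)}\Big)^2
= \sum_{k=1}^{K} \sum_{r=1}^{D_1} \Big(\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} h_r^{(1,k,t)}\Big)^2 \\
&= \sum_{k=1}^{K} \sum_{r=1}^{D_1} \Big(\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big)\Big)^2.
\end{aligned}$$

For a fixed $k$-th attention head, let $w_1^{(1,k)}, w_2^{(1,k)}, \dots, w_{D_1}^{(1,k)}$ be the columns of $W^{(1,k)}$. Then, by the positive homogeneity of $\sigma$, we have

$$\sum_{k=1}^{K} \sum_{r=1}^{D_1} \Big(\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w_r^{(1,k)}, x_i \big\rangle\Big)\Big)^2
= \sum_{k=1}^{K} \sum_{r=1}^{D_1} \big\|w_r^{(1,k)}\big\|_2^2 \Big(\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \Big\langle \frac{w_r^{(1,k)}}{\|w_r^{(1,k)}\|_2}, x_i \Big\rangle\Big)\Big)^2.$$

The supremum of this quantity over $w_1^{(1,k)}, w_2^{(1,k)}, \dots, w_{D_1}^{(1,k)}$ under the constraint $\|W^{(1,k)}\|_F^2 = \sum_{r=1}^{D_1} \|w_r^{(1,k)}\|_2^2 \le M_1^2$ is attained when $\|w_{r'}^{(1,k)}\|_2 = M_1$ for some $r'$ and $\|w_r^{(1,k)}\|_2 = 0$ for all other $r \ne r'$. In the end, only the $r'$ terms remain. For simplicity of notation, we write $w^{(1,k)}$ for that column $w_{r'}^{(1,k)}$. Therefore, we have

$$\begin{aligned}
\mathbb{E}\Big[\sup_{\|W^{(1,k)}\|_F \le M_1} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)}\Big\|_2\Big]
&= \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{k=1}^{K} \Big|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big|\Big] \\
&\overset{(a)}{\le} \sum_{k=1}^{K} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \Big|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big|\Big] \\
&\overset{(b)}{\le} 2 \sum_{k=1}^{K} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big].
\end{aligned}$$

The inequality (a) holds because $\sup(\sum_i x_i) \le \sum_i \sup(x_i)$.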
For the inequality (b) above, write for brevity

$$S_k(w) := \sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \langle w, x_i \rangle\Big),$$

with all suprema taken over $\|w^{(1,k)}\|_2 = M_1$. Then

$$\begin{aligned}
\mathbb{E}\Big[\sup \big|S_k(w^{(1,k)})\big|\Big]
&\overset{(a)}{=} \mathbb{E}\Big[\sup \Big(\big(S_k(w^{(1,k)})\big)_+ + \big(S_k(w^{(1,k)})\big)_-\Big)\Big] \\
&\overset{(b)}{\le} \mathbb{E}\Big[\sup \big(S_k(w^{(1,k)})\big)_+\Big] + \mathbb{E}\Big[\sup \big(S_k(w^{(1,k)})\big)_-\Big] \\
&\overset{(c)}{=} 2\, \mathbb{E}\Big[\sup \big(S_k(w^{(1,k)})\big)_+\Big] \\
&\overset{(d)}{=} 2\, \mathbb{E}\Big[\Big(\sup S_k(w^{(1,k)})\Big)_+\Big] \\
&\overset{(e)}{=} 2\, \mathbb{E}\Big[\sup S_k(w^{(1,k)})\Big],
\end{aligned}$$

where the equality (a) is due to $|x| = (x)_+ + (x)_-$, the inequality (b) is due to $\sup(A + B) \le \sup A + \sup B$, and the equality (c) comes from the symmetry of the distribution of the random variables $\epsilon_j$. The equality (d) is due to $\sup(A_+) = (\sup A)_+$. The equality (e) holds because the supremum is nonnegative, since for $w^{(1,k)} = 0$ we get $S_k(w^{(1,k)}) = 0$. We then rewrite

$$\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big) = \sum_{t=1}^{N} \sum_{j=1}^{M} \epsilon_j\, p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big).$$

By the same argument, $\sup(\sum_i x_i) \le \sum_i \sup(x_i)$, we have

$$\mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big] \le \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j\, p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big].$$

For any fixed $t$, we have

$$\begin{aligned}
\mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j\, p_{j,t}^{(2)} \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big]
&\overset{(a)}{\le} 2 \max_j p_{j,t}^{(2)} \cdot \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j\, \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big] \\
&\overset{(b)}{\le} 2\, \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j\, \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big].
\end{aligned}$$

The inequality (a) is due to the contraction property of ERC.
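The inequality (b) is due to the definition of the graph attention matrix $P$, i.e., the maximum value of the entries in each row is at most 1. Putting (19), (20), (22), (23) together, we get

$$\mathbb{E}\Big[\sup_{\|W^{(1,k)}\|_F \le M_1} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(2)} \cdot z_t^{(1)}\Big\|_2\Big] \le 4 \sum_{k=1}^{K} \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \Big(\sum_{j=1}^{M} \epsilon_j\, \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big)\Big].$$

Then we use the fact that $\sigma$ is $L_\sigma$-Lipschitz continuous together with the contraction property of ERC, and further derive that

$$\begin{aligned}
4 \sum_{k=1}^{K} \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \Big(\sum_{j=1}^{M} \epsilon_j\, \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(1,k)} \big\langle w^{(1,k)}, x_i \big\rangle\Big)\Big)\Big]
&\le 4 L_\sigma \sum_{k=1}^{K} \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \sum_{j=1}^{M} \epsilon_j \Big\langle \sum_{i=1}^{N} p_{t,i}^{(1,k)} x_i,\ w^{(1,k)} \Big\rangle\Big] \\
&\overset{(a)}{\le} 4 L_\sigma \sum_{k=1}^{K} \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\|w^{(1,k)}\|_2 = M_1} \Big(\Big\|\sum_{j=1}^{M} \epsilon_j \sum_{i=1}^{N} p_{t,i}^{(1,k)} x_i\Big\|_2^2 \big\|w^{(1,k)}\big\|_2^2\Big)^{1/2}\Big] \\
&\overset{(b)}{\le} 4 L_\sigma M_1 \sum_{k=1}^{K} \sum_{t=1}^{N} \Big(\mathbb{E}\Big[\Big\|\sum_{j=1}^{M} \epsilon_j \sum_{i=1}^{N} p_{t,i}^{(1,k)} x_i\Big\|_2^2\Big]\Big)^{1/2} \\
&\overset{(c)}{=} 4 L_\sigma M_1 \sum_{k=1}^{K} \sum_{t=1}^{N} \Big(\sum_{j=1}^{M} \Big\|\sum_{i=1}^{N} p_{t,i}^{(1,k)} x_i\Big\|_2^2\Big)^{1/2} \\
&= 4 L_\sigma M_1 \sum_{k=1}^{K} \sum_{t=1}^{N} \Big(\sum_{j=1}^{M} \big\|X p_t^{(1,k)}\big\|_2^2\Big)^{1/2},
\end{aligned}$$

where the first inequality uses the Lipschitz continuity of $\sigma$ and the contraction property, the inequality (a) is by Cauchy-Schwarz, and the inequality (b) is by Jensen's inequality. The equality (c) follows from the fact that the Rademacher variables are i.i.d. with zero mean. Next, we continue to bound the rest:

$$\begin{aligned}
4 L_\sigma M_1 \sum_{k=1}^{K} \sum_{t=1}^{N} \Big(\sum_{j=1}^{M} \big\|X p_t^{(1,k)}\big\|_2^2\Big)^{1/2}
&\le 4 L_\sigma M_1 \sqrt{M} \sum_{k=1}^{K} \sum_{t=1}^{N} \|X\|_2 \big\|p_t^{(1,k)}\big\|_2 \\
&\le 4 L_\sigma M_1 \sqrt{M} \sum_{k=1}^{K} \sum_{t=1}^{N} \|X\|_2 \big\|p_t^{(1,k)}\big\|_1
= 4 L_\sigma M_1 \sqrt{M} \sum_{k=1}^{K} \sum_{t=1}^{N} \|X\|_2 \\
&\le 4 L_\sigma M_1 \sqrt{M} \sum_{k=1}^{K} \sum_{t=1}^{N} \sup_{\|w\|_2 = 1} \|Xw\|_2
\le 4 L_\sigma M_1 \sqrt{M} \sum_{k=1}^{K} \sum_{t=1}^{N} \Big(\sup_{\|w\|_2 = 1} \sum_{i=1}^{N} \langle x_i, w \rangle^2\Big)^{1/2} \\
&\le 4 L_\sigma M_1 \sqrt{M} \sum_{k=1}^{K} \sum_{t=1}^{N} \Big(\sum_{i=1}^{N} \|x_i\|_2^2\Big)^{1/2}
\le 4 L_\sigma M_1 B K N^{3/2} M^{1/2},
\end{aligned}$$

where the first inequality is by $\|Ax\|_2 \le \|A\|_2 \|x\|_2$. Combined with the earlier result, we get

$$R(\mathcal{F}_c) \le 4 L_\sigma B K N^{3/2} M^{-1/2} M_1 M_2.$$

B PROOF OF THEOREM 2

We now turn to the proof of Theorem 2. Our proof strategy is the following.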
We first provide a classic theorem that bounds the expected loss in terms of the empirical loss and the upper bound of the ERC of the loss functions associated with GATs for the multi-time-step situation $\mathcal{F}$. We then derive this upper bound of the ERC of the loss functions of $\mathcal{F}$ by extending the upper bound of the ERC of GATs with one-dimensional output for the single-time-step prediction $\mathcal{F}_c$, based on an existing theorem in the literature.

Proof. From the theorem of Mohri et al. (2018), the following holds for all $f$:

$$E(f) \le \hat{E}(f) + 2 R(g_{\mathcal{F}}) + 3 \sqrt{\frac{\ln(2/\delta)}{2M}},$$

where $E(f)$, $R(g_{\mathcal{F}})$, and $\hat{E}(f)$ are defined in section 2.4. In order to extend to multi-dimensional vector-valued functions for MTSF, we use the contraction inequality for the hypothesis class $\mathcal{F}$ of vector-valued functions $f \in \mathbb{R}^{C}$.

Lemma 5 (Corollary 4 in Maurer (2016)). Let $\mathcal{F}$ be a class of vector-valued functions $f = (f_1, \dots, f_C) \in \mathbb{R}^{C}$, with each $f_c \in \mathcal{F}_c \subset \mathbb{R}^{\mathcal{X}}$, and let $\{x_1, \dots, x_M\}$, $\{y_1, \dots, y_M\}$ be a data set with each $x_j \in \mathcal{X}$ and $y_j \in \mathcal{Y}$. Let $\psi(\cdot, \cdot)$ be a 1-Lipschitz function mapping $\mathcal{V} \times \mathcal{Y}$ to $\mathbb{R}$, associated to $\mathcal{F}$, where $\mathcal{F} \subset \mathcal{V}^{\mathcal{X}}$. Then we have

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \sum_{j=1}^{M} \epsilon_j\, \psi\big(f(x_j), y_j\big)\Big] \le \sqrt{2}\, \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \sum_{j=1}^{M} \sum_{c=1}^{C} \epsilon_{cj} f_c(x_j)\Big], \tag{26}$$

where $\epsilon_{cj}$ is the $(c, j)$-th entry of a $C \times M$ matrix of independent Rademacher variables.

However, the RHS of equation 26 has a supremum over all $f \in \mathcal{F}$, which is hard to compute. We can reduce it to scalar classes and derive the following bound (Maurer, 2016):

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}} \sum_{j=1}^{M} \sum_{c=1}^{C} \epsilon_{cj} f_c(x_j)\Big] \le \sum_{c=1}^{C} \mathbb{E}\Big[\sup_{f \in \mathcal{F}_c} \sum_{j=1}^{M} \epsilon_j f(x_j)\Big]. \tag{27}$$

Then, based on equations 26 and 27, where the RHS of equation 27 is related to $R(\mathcal{F}_c)$, we can derive the following upper bound for the loss function $g(f(x), y)$ with $f(x)$ being vector-valued:

$$R(g_{\mathcal{F}}) \le \sqrt{2}\, C\, R(\mathcal{F}_c).$$

C PROOF OF THEOREM 3 AND THEOREM 4

Proof.
For $l = L$, by the definition of the network output and the Rademacher complexity, we have

$$\begin{aligned}
R(\mathcal{F}_c) &= \mathbb{E}\Big[\frac{1}{M} \sup_{f \in \mathcal{F}} \sum_{j=1}^{M} \epsilon_j f(x_j)_c\Big]
= \mathbb{E}\Big[\frac{1}{M} \sup_{\substack{f^{(L-1)} \in \mathcal{F}^{L-1} \\ \|w_c^{(L)}\| \le M_L}} \sum_{j=1}^{M} \epsilon_j \Big\langle \sum_{t=1}^{N} p_{j,t}^{(L)} \cdot z_t^{(L-1)},\ w_c^{(L)} \Big\rangle\Big] \\
&\le \frac{M_L}{M}\, \mathbb{E}\Big[\sup_{\substack{\|W^{(L-1,k)}\|_F \le M_{L-1} \\ f^{(L-2)} \in \mathcal{F}^{L-2}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(L)} \cdot z_t^{(L-1)}\Big\|_2\Big],
\end{aligned}$$

so we get

$$\frac{M}{M_L} R(\mathcal{F}_c) \le \mathbb{E}\Big[\sup_{\substack{\|W^{(L-1,k)}\|_F \le M_{L-1} \\ f^{(L-2)} \in \mathcal{F}^{L-2}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(L)} \cdot z_t^{(L-1)}\Big\|_2\Big].$$

We denote the RHS as $R(L)$. Unfolding the expectation further, we have

$$\mathbb{E}\Big[\sup_{\substack{\|W^{(L-1,k)}\|_F \le M_{L-1} \\ f^{(L-2)} \in \mathcal{F}^{L-2}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(L)} \cdot z_t^{(L-1)}\Big\|_2\Big] \le 4 \sum_{k=1}^{K} \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\substack{\|w^{(L-1,k)}\|_2 = M_{L-1} \\ f^{(L-2)} \in \mathcal{F}^{L-2}}} \Big(\sum_{j=1}^{M} \epsilon_j\, \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(L-1,k)} \big\langle w^{(L-1,k)}, z_i^{(L-2)} \big\rangle\Big)\Big)\Big].$$

This follows by the same reasoning as equation 24. Now for any $l \in [L-1]$, we have

$$\begin{aligned}
\mathbb{E}\Big[\sup_{\substack{\|w^{(l,k)}\|_2 = M_l \\ f^{(l-1)} \in \mathcal{F}^{l-1}}} \Big(\sum_{j=1}^{M} \epsilon_j\, \sigma\Big(\sum_{i=1}^{N} p_{t,i}^{(l,k)} \big\langle w^{(l,k)}, z_i^{(l-1)} \big\rangle\Big)\Big)\Big]
&\le L_\sigma\, \mathbb{E}\Big[\sup_{\substack{\|w^{(l,k)}\|_2 = M_l \\ f^{(l-1)} \in \mathcal{F}^{l-1}}} \sum_{j=1}^{M} \epsilon_j \Big\langle \sum_{i=1}^{N} p_{t,i}^{(l,k)} z_i^{(l-1)},\ w^{(l,k)} \Big\rangle\Big] \\
&\le L_\sigma\, \mathbb{E}\Big[\sup_{\substack{\|w^{(l,k)}\|_2 = M_l \\ f^{(l-1)} \in \mathcal{F}^{l-1}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{i=1}^{N} p_{t,i}^{(l,k)} z_i^{(l-1)}\Big\|_2 \big\|w^{(l,k)}\big\|_2\Big] \\
&= L_\sigma M_l\, \mathbb{E}\Big[\sup_{\substack{\|W^{(l-1,k)}\|_F \le M_{l-1} \\ f^{(l-2)} \in \mathcal{F}^{l-2}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{i=1}^{N} p_{t,i}^{(l,k)} z_i^{(l-1)}\Big\|_2\Big].
\end{aligned}$$

Then we get the induction relation

$$\mathbb{E}\Big[\sup_{\substack{\|W^{(l,k)}\|_F \le M_l \\ f^{(l-1)} \in \mathcal{F}^{l-1}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{t=1}^{N} p_{j,t}^{(l+1,k)} \cdot z_t^{(l)}\Big\|_2\Big] \le 4 L_\sigma M_l \sum_{k=1}^{K} \sum_{t=1}^{N} \mathbb{E}\Big[\sup_{\substack{\|W^{(l-1,k)}\|_F \le M_{l-1} \\ f^{(l-2)} \in \mathcal{F}^{l-2}}} \Big\|\sum_{j=1}^{M} \epsilon_j \sum_{i=1}^{N} p_{t,i}^{(l,k)} z_i^{(l-1)}\Big\|_2\Big],$$

that is,

$$R(l+1) \le 4 L_\sigma M_l \sum_{k=1}^{K} \sum_{t=1}^{N} R(l).$$

By induction, we get

$$R(L) \le (4 L_\sigma K N)^{L-1} \prod_{l=1}^{L-1} M_l\, R(1) \le (4 L_\sigma K N)^{L-1} \prod_{l=1}^{L-1} M_l \Big(\mathbb{E}\Big[\Big\|\sum_{j=1}^{M} \epsilon_j \sum_{i=1}^{N} p_{t,i}^{(1,k)} x_i\Big\|_2^2\Big]\Big)^{1/2} \le (4 L_\sigma K N)^{L-1} \prod_{l=1}^{L-1} M_l\, B (N M)^{1/2}.$$

Combining with the previous result, we get

$$R(\mathcal{F}_c) \le (4 L_\sigma K)^{L-1} \prod_{l=1}^{L} M_l\, B N^{L-1/2} M^{-1/2}.$$

We implement our neural network as a three-layer graph attention network. This includes the input layer, one hidden layer, and the output layer. Each layer is a GATConv layer from the Pytorch-Geometric package. We use ELU activation (Clevert et al., 2015) and Dropout (Srivastava et al., 2014) after both the input layer and the hidden layer.

For the number of heads variable, we change the number of attention heads of each layer, and all layers use the same number of attention heads. For the number of neighbors variable, we adjust the dropout rate inside each layer's attention mechanism (different from the dropout layer) so that only a percentage of the nodes are considered when attention aggregates information from a node's neighbors. For the weight norm variable, we adjust the bound of the Frobenius norm of the weight matrix inside each layer by weight clipping: each element of the weight matrix is clipped to a threshold so that the Frobenius norm of the matrix is less than or equal to the bound. For the number of layers variable, we vary the number of hidden layers from 1 to 8. For the input norm variable, we make sure the L1 norm of each node's feature vector is less than or equal to a bound ranging from 1 to 28. For the number of labeled nodes variable, we use a ratio to adjust the size of the training dataset.

In the experiment section, whenever the number of layers $L$ needs to be fixed to study the relationship between the generalization error and the GAT components that impact the ERC (the number of attention heads, the number of neighbors, and the upper bound of the weight norms), we use a three-layer GAT to demonstrate Theorem 3. However, the way the upper bound of the input norm and the number of labeled nodes affect the ERC does not depend on the number of layers. Thus, in our supplemental experiment for two-layer GATs, we only consider the three variables (the number of attention heads, the number of neighbors, and the upper bound of the weight norms) whose impact on the ERC depends on the number of layers.
Notably, Theorem 1 for the generalization error bound of two-layer GATs is a special case of Theorem 3 for deep GATs: setting $L = 2$ in Theorem 3 yields the bound in Theorem 1 up to constant factors. We now give empirical results on two-layer GATs to check whether they are consistent with Theorem 1 and Theorem 3.

D.3 MODEL HYPERPARAMETERS

In this appendix section, we include more experiment results from a two-layer GAT on three variables in the ERC. As Figure 3 shows, when the number of attention heads increases, the generalization error also increases at a linear rate beyond a certain value. When the number of neighbors increases, the generalization error initially decreases while the number of neighbors is small, then starts increasing following the trend of a polynomial function. As for the upper bound of the weight norm, the generalization error increases quadratically as the norm increases. The empirical results for all three variables generally conform to the theoretical bound suggested by Theorem 1 as well as Theorem 3. The inconsistency at the beginning between the results and the reference red line (the theoretical result) can, as we have discussed, be due to the trade-off between approximation error and estimation error: when the complexity of the hypothesis class increases, the former decreases and the latter increases. In the long run, the estimation error, i.e., the ERC, dominates the bound. Thus, the results for $L = 2$ along with the three-layer experiments provide evidence for our theoretical findings (Theorem 1 and Theorem 3).

$$E(f) \le \hat{E}(f) + 2\sqrt{2}\, C L_g R(\mathcal{F}_c) + 3 \sqrt{\frac{\ln(2/\delta)}{2\tilde{n}}},$$

where we have $R(\mathcal{F}_c) = O\big(L_\sigma B K M_1 M_2 (N_e)^{3/2} \tilde{n}^{-1/2}\big)$ from Theorem 6.
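The generalization bound above is a sum of three terms: the empirical risk, a complexity term proportional to the ERC, and a confidence term. The sketch below evaluates it numerically so that the relative size of the terms can be inspected. The function name and argument names are illustrative assumptions, and the ERC value passed in is whatever upper bound on $R(\mathcal{F}_c)$ is available.

```python
import math


def generalization_bound(empirical_risk, erc_fc, num_steps, loss_lipschitz,
                         sample_size, delta):
    """Evaluate E_hat + 2*sqrt(2)*C*L_g*R(F_c) + 3*sqrt(ln(2/delta) / (2*n)).

    num_steps is C (forecast horizon), loss_lipschitz is L_g, and
    sample_size is the effective sample size. Names are hypothetical.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    complexity = 2.0 * math.sqrt(2.0) * num_steps * loss_lipschitz * erc_fc
    confidence = 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * sample_size))
    return empirical_risk + complexity + confidence


# Larger effective sample sizes shrink the confidence term.
small = generalization_bound(0.5, 0.1, 2, 1.0, 100, 0.05)
large = generalization_bound(0.5, 0.1, 2, 1.0, 10000, 0.05)
print(small, large)
```

Note that in the bound itself the ERC also decreases with the sample size; the example holds it fixed only to isolate the confidence term.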

F.2 RESULTS UNDER SUPERVISED SETTING

In this section, we provide the results of experiments under the fully-supervised setting. The multivariate time series comprise about 1500 stocks, and all of these stocks are used for training and testing. We ensure that the training and testing data do not overlap. The relationship between the upper bound of the generalization error of the model and the variables in the ERC is shown in Figure 4.

Guo et al. (2019) and Deng & Hooi (2021) use GAT-based models for multiple time series data and report better accuracy than traditional linear methods (e.g., VAR), neural-network based methods (e.g., LSTM), and graph-network based methods (e.g., GNNs and GCNs). Both works differ from ours in that they do not control model variables (e.g., the weight matrix norm) and lack theoretical guarantees on the generalization error. GDN was proposed by Deng & Hooi (2021) for anomaly detection in multivariate time series. ASTGCN was developed by Guo et al. (2019) for traffic flow forecasting. We use GDN and ASTGCN as baselines to compare with the Weight-bounded GAT. To adapt them to our setting, we modify the output layer of GDN to perform regression instead of classification. We also keep the other hyperparameters the same across the three methods. Table 3 reports their average test error and standard deviation over 20 runs. We use two evaluation metrics, the MSE and MAE. Additionally, Figure 5 shows the test loss of the three methods. As Table 3 and Figure 5 show, the Weight-bounded GAT performs better than GDN and ASTGCN. Apart from using the graph attention mechanism, GDN also learns a directed graph to model the cause-effect relationship between nodes, which can capture asymmetric dependency patterns. Learning a graph structure can capture the relationship between nodes more dynamically, but it also increases the learning time and model complexity.
In the meantime, besides attention between nodes, ASTGCN also considers attention along the time axis. Our method does not consider attention along time because the features at every time step already embed the historical information. Compared to GDN and ASTGCN, the weight-bounded control in our method plays an important role in controlling the generalization error, which is also supported by our theoretical analysis. This result further corroborates the advantage of our method over the GAT-based SOTA methods for MTSF.

F.4 MODEL ARCHITECTURE

Figure 6 : The demonstration of the three-layer Weight-bounded GAT model we use. The input layer and the hidden layer are both followed by an ELU activation and a dropout. Each GAT layer is implemented using a Pytorch-Geometric GATConv layer.



In the final layer, we have $P^{(L)} \in \mathbb{R}^{M \times N}$, since we obtain output for the nodes in $\mathcal{M}$.

It is known that MSE loss is not Lipschitz continuous over all $\mathcal{Y}$, however, since we consider a finite hypothesis class satisfying bounded input and weights conditions, the output of GATs is bounded, thus, the MSE is a locally Lipschitz continuous function.

https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html

https://www.kaggle.com/datasets/paultimothymooney/stock-market-data

https://www.dropbox.com/sh/gpa3283fgpq0yx1/AAAYnzAmnIhVdp4KFXw5wTzAa?dl=0



Figure 1: Experiment results on six variables in the ERC. We run the experiment 20 times and obtain a standard deviation of the generalization error. (a) relationship between test loss and the number of attention heads. (b) relationship between test loss and the number of neighbors. (c) relationship between test loss and the upper bound of weight norm. (d) relationship between test loss and the number of (hidden) layers. (e) relationship between test loss and the upper bound of input norm. (f) relationship between test loss and the number of labeled nodes. The red line is a possible theoretical upper bound. The plots show that all test losses generally conform to the big O of the theoretical upper bound.

Figure 2: Test loss of two types of GATs. The train-test pipeline runs 20 times over 20 random seeds. The Weights-bounded GAT has a better test loss than the vanilla GAT regarding the minimum, maximum, first quartile, third quartile and the median of the test loss.

Figure 3: Additional experiment results using a two-layer GAT on three variables in the ERC whose generalization error is related to the number of layers. We run the experiment 20 times and obtain a standard deviation of the generalization error. (a) relationship between test loss and the number of attention heads. (b) relationship between test loss and the number of neighbors. (c) relationship between test loss and the upper bound of weight norm. The red line is a possible theoretical upper bound. The plots show that when $L$ equals 2, all test losses still generally conform to the big O of the theoretical upper bound.

Figure 4: Experiment results on six variables in the ERC. We run the experiment 20 times and obtain a standard deviation of the generalization error. (a) relationship between test loss and the number of attention heads. (b) relationship between test loss and the number of neighbors. (c) relationship between test loss and the upper bound of weight norm. (d) relationship between test loss and the number of (hidden) layers. (e) relationship between test loss and the upper bound of input norm. (f) relationship between test loss and the training data set size. The red line is a possible theoretical upper bound. The plots show that all test losses generally conform to the big O of the theoretical upper bound.

Figure 5: Test loss of three methods. The train-test pipeline runs 20 times over 20 random seeds. The Weight-bounded GAT has a better test loss than the GDN and ASTGCN regarding the first quartile, third quartile and the mean of the test loss.

Shizhao Sun, Wei Chen, Liwei Wang, Xiaoguang Liu, and Tie-Yan Liu. On the depth of deep neural networks: A theoretical view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

D EXPERIMENT AND DATA

D.1 EXPERIMENT SETUP

The model is trained with the Adam optimizer (Kingma & Ba, 2014). The learning rate is 1e-4. The number of training epochs is 30. The batch size is set to 5. We split the dataset into three parts for training, validation and testing with a ratio of 0.6 : 0.2 : 0.2. All the deep learning models are implemented in Python with Pytorch and executed on a server with 8 NVIDIA GeForce GTX 2080Ti GPUs. The Nvidia driver version is 470.141.03 and the CUDA version is 11.4.

Experiment environment. The list only includes major packages. All the packages are installed using Anaconda and Pip.
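The training settings in D.1 can be collected in a small configuration, together with a helper for the 0.6 : 0.2 : 0.2 split. This is a sketch under assumptions: the text does not say whether the split is chronological or random, so the helper below assumes a chronological split (common for time series), and the names `TRAIN_CONFIG` and `split_indices` are hypothetical.

```python
# Values taken from the experiment setup; the dictionary keys are illustrative.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "epochs": 30,
    "batch_size": 5,
    "split_ratios": (0.6, 0.2, 0.2),
}


def split_indices(num_samples, ratios=TRAIN_CONFIG["split_ratios"]):
    """Split range(num_samples) into contiguous train/validation/test parts.

    Assumes samples are ordered in time, so earlier samples go to training.
    Rounding leftovers are assigned to the test part.
    """
    n_train = round(num_samples * ratios[0])
    n_val = round(num_samples * ratios[1])
    indices = list(range(num_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10)
print(len(train_idx), len(val_idx), len(test_idx))
```

A contiguous split keeps the three parts disjoint by construction, which matches the requirement that training and testing data do not overlap.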

Default hyperparameters. When we study the impact of different values of a variable such as the number of heads, we keep all other variables fixed to the same set of values.

Comparison of Weight-bounded GAT with Baselines ASTGCN, GDN. Metrics ($10^{-3}$): MSE, MAE, STD(MSE), STD(MAE).

7. ETHICS STATEMENT

We have read the code of ethics carefully and ensured that our paper conforms to them.

8. REPRODUCIBILITY STATEMENT

We make our best effort to ensure the reproducibility of the paper's experiments and provide clear guidelines. More specifically, we include detailed experiment setup information in D.1. Besides, we give a detailed description of the architecture of the neural network we use and explain how we control and adjust different variables in D.2. We list all the important hyperparameters and their default values in D.3. We give the website link for the dataset we use in D.4. There are no extra data processing steps to be conducted. All the work is done in the code in an end-to-end manner. We also give the link to the source code in D.5. We explain all assumptions of our theorems (Theorem 1, 2, 3, 4) clearly, and provide the complete proofs of the theorems in the appendix §A, §B, and §C.

F.1 MTSF UNDER FULLY-SUPERVISED SETTING INSTEAD OF SEMI-SUPERVISED SETTING

We consider an undirected graph $G = (\mathcal{N}, \mathcal{E})$. $\mathcal{N} = (n_1, \dots, n_N)$, $|\mathcal{N}| = N$, is a set of node labels representing the sources of $N$ time series. $\mathcal{E} \subset \mathcal{N} \times \mathcal{N}$ is the set of edges representing the connection between series. We let $x_i \in \mathcal{X}$, $i \in [N]$, be a random variable representing the input feature vector of node $n_i$ for time series $i$. For node $i$, its random input feature $x_i \in \mathcal{X} \subset \mathbb{R}^{D}$ is a multi-dimensional vector, which contains all the historical values from $T$ time steps, in other words, we let $x_i = (x_{i,t}, \dots, x_{i,t-T+1})$ be the concatenation of $T$ time steps; its true label $y_i \in \mathcal{Y} \subset \mathbb{R}^{C}$ is the vector for the $C$-step-away values. We sample $n$ training data over $G$, where $n = N \times V$ for some $V \in \mathbb{N}_+$. In other words, we have $V$ batches of samples over graph $G$. For any risk function $g$ defined over $\mathcal{F}$, given the training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ which includes $n$ samples from $\mathcal{X} \times \mathcal{Y}$ according to distribution $Q$, the expected/population risk $E(f)$ and the empirical risk function $\hat{E}(f)$ are defined as:

We introduce $1 \le \tilde{N} \le N$ as the effective size of data because the data among nodes corresponding to different time series are not independent. If $\tilde{N} = 1$, all of the time series are fully dependent. If $\tilde{N} = N$, they are mutually independent. So $\tilde{N}$ characterizes the strength of the independence among different time series. Then we use $\tilde{n}$ as the effective sample size of data which are i.i.d., where $V \le \tilde{n} \le NV = n$. So the training set $S$ will contain $\tilde{n}$ data.

Given the inputs $X = (x_1, \dots, x_N)$ as multiple time series with each $x_i$ as input feature for node $n_i$, the class of 2-layer GATs for MTSF $f$ maps $x$ to the output $f(x)$ that represents a $C$-step-away prediction expressed in equation 2.
We consider a subset of such class requiring each $f$ with a bounded weights norm, expressed as

Furthermore, we also provide a model space $\mathcal{F}_c \subset \mathbb{R}^{\mathcal{X}}$ with a single dimensional output that corresponds to the $c$-th component of model output from $f(x)$ for the $c$-th time step, expressed as

Here we first provide an upper bound of ERC of GAT class $\mathcal{F}_c$ for single dimensional output of MTSF.

Theorem 6 (Upper Bound of ERC of GAT class $\mathcal{F}_c$ for MTSF). Let the activation function $\sigma(\cdot)$ be $L_\sigma$-Lipschitz continuous, and also satisfy $\sigma(0) = 0$ and $\sigma(\alpha z) = \alpha \sigma(z)$ for all $\alpha \ge 0$. Assume that the $L_2$-norm of the feature vector $x$ comes from a bounded domain $\mathcal{X} = \{x : \|x\| \le B\}$. Assume that the Frobenius norm of every weights matrix in the first layer of the GAT class is bounded, namely, $\|W^{(1,k)}\|_F \le M_1$ with some constant $M_1 > 0$ for every $k$. Also, the norm of the weights vector of the second layer of the GATs is bounded, $\|w_c^{(2)}\| \le M_2$, where $c \in [C]$, with some constant $M_2 > 0$. Let $\mathcal{N}_i$ denote the neighborhood of node $i$ (including $i$), let the number of neighbors of each node be identical, namely, for some common constant $N_e \in \mathbb{N}_+$, assume $N_e := |\mathcal{N}_i|$ for all node $i \in \mathcal{N}$, furthermore, we consider the most general formulation, which allows every node to attend on every other node, i.e., $N_e = N$. Then let $R(\mathcal{F}_c)$ be the ERC defined in the definition 7 for GAT class $\mathcal{F}_c$ in the definition 32, given the $\tilde{n}$ sized input set $\{x_1, \dots, x_{\tilde{n}}\}$, then we have $R(\mathcal{F}_c) = O\big(L_\sigma B K M_1 M_2 (N_e)^{3/2} \tilde{n}^{-1/2}\big)$.

Similar proof details can be found in §A.

Theorem 7. Define the hypothesis class $\mathcal{F}$ as the definition 15. We suppose $g$ is Lipschitz with constant $L_g$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$, we have

