MULTIVARIATE TIME SERIES FORECASTING BY GRAPH ATTENTION NETWORKS WITH THEORETICAL GUARANTEES

Anonymous authors
Paper under double-blind review

Abstract

Multivariate time series forecasting (MTSF) aims to predict the future values of multiple variables from their past values, and has been applied in fields including traffic flow prediction, stock price forecasting, and anomaly detection. Capturing the inter-dependencies among variables is a significant challenge in MTSF. Recent works have modeled the correlations between variables with the aim of improving test prediction accuracy; however, none of them offers theoretical guarantees. In this paper, we develop a new norm-bounded graph attention network (GAT) for MTSF that upper-bounds the Frobenius norm of the weights in each layer of the GAT model. Under optimal parameters, we theoretically show that our model achieves a generalization error bound expressed as a product of the Frobenius norms of the per-layer weights and terms in the numbers of neighbors and attention heads, where the latter appear as polynomials whose degree is the number of layers. Empirically, we investigate the impact of different components of GAT models on MTSF performance, and our experiments verify our theoretical findings: the generalization performance of our method depends on the number of attention heads, the number of neighbors, the scales (norms) of the weight matrices, the scale of the input features, and the number of layers. Our method provides novel perspectives for improving generalization performance in MTSF, and our theoretical guarantees give substantial implications for designing attention-based methods for MTSF.

1. INTRODUCTION AND BACKGROUND

The substantial time series data generated in the real world make multivariate time series forecasting (MTSF) a crucial topic in various scenarios, such as traffic forecasting, sensor signal anomaly detection in the Internet of Things, demand and supply prediction in supply chain management, and stock price prediction in financial investment (Cao et al., 2020). Traditional methods simply deploy time series models, e.g., auto-regression (AR) (Mills & Mills, 1990), auto-regressive integrated moving average (ARIMA) (Box et al., 2015), and vector auto-regression (VAR) (Box et al., 2015; Hamilton, 2020; Lütkepohl, 2005), for forecasting. Specifically, ARIMA, though one of the classic forecasting methods in univariate settings, fails to accommodate multivariate problems due to its high computational complexity. VAR, an extension of the AR model to the multivariate setting, is widely used in MTSF tasks due to its simplicity; however, it cannot handle the nonlinear relationships among variables, leading to reduced forecasting accuracy. In addition to traditional statistical methods, deep learning methods have been applied to MTSF problems and have demonstrated potential to solve them (Tokgöz & Ünal, 2018). Long short-term memory (LSTM) (Graves, 2012), gated recurrent units (GRU) (Cho et al., 2014), gated linear units (GLU) (Dauphin et al., 2017), temporal convolution networks (TCN) (Bai et al., 2018), and the state frequency memory (SFM) network (Zhang et al., 2017) have found success in practical time series tasks. However, another important aspect of time series data, complex inter-dependency (i.e., the correlations among multiple correlated time series), remains unaddressed in these methods, restricting forecasting accuracy (Bai et al., 2020; Cao et al., 2020). For example, in the traffic forecasting task, adjacent roads naturally interplay with each other.
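As a concrete illustration of the classical baseline discussed above, a VAR(1) model predicts each variable as a linear combination of all variables at the previous step and can be fit by least squares. The sketch below is illustrative only (the coefficient matrix `A_true` and noise scale are invented for the demonstration, not values from the paper):

```python
import numpy as np

# Minimal VAR(1) sketch: x_t = A x_{t-1} + noise.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.2],
                   [0.1, 0.4]])          # illustrative coefficient matrix
T, d = 500, 2
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + 0.1 * rng.standard_normal(d)

# Least-squares estimate of A from lagged pairs: future = past @ A^T.
past, future = X[:-1], X[1:]
B, *_ = np.linalg.lstsq(past, future, rcond=None)
A_hat = B.T

forecast = A_hat @ X[-1]   # one-step-ahead prediction
print(np.round(A_hat, 2))
```

Note that the fit is purely linear in the lagged values, which is exactly the limitation the text points out: nonlinear cross-variable relationships are outside the model class.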
Another example is stock price prediction, in which it is easier to predict a stock price from the historical information of stocks in similar categories, while information on stocks from other sectors can be relatively useless. A graph is a special form of data that describes the relationships between different entities. Recently, graph neural networks (GNNs) (Scarselli et al., 2008) have achieved great success in handling graph data, with developments in permutation invariance, local connectivity, and compositionality. In general, GNNs assume that the state of a node is influenced by the states of its neighbors. By disseminating information through the graph structure, GNNs allow each node to be aware of its neighborhood context. MTSF can be viewed naturally from a graph perspective: variables from a multivariate time series can be considered as nodes in a graph, interlinked with each other through hidden dependency relationships. It follows that modeling multivariate time series data using GNNs can be a promising way to preserve their temporal trajectories while exploiting the inter-dependency among the series. In the meantime, due to the popularity of convolutional neural networks (CNNs), considerable studies have attempted to generalize convolutions to graph-structured data, leading to the creation of graph convolutional networks (GCNs) (Duvenaud et al., 2015; Atwood & Towsley, 2016; Monti et al., 2018; Niepert et al., 2016; Kipf & Welling, 2017). GCNs model a node's feature representation by aggregating the representations of its one-step neighbors. Many studies have shown that GNN- and GCN-based methods outperform prior methods in time series forecasting tasks (Yu et al., 2017; Wu et al., 2019; Chen et al., 2020). The graph attention network (GAT) (Veličković et al., 2017), one of the most popular GNN architectures, is considered a state-of-the-art neural architecture for processing graph-structured data.
Building on the aggregation approach of GCNs, in GATs every node computes the importance of its neighboring nodes and then uses these importance scores as weights when aggregating their feature representations to update its own. Compared to the well-known GCNs, GATs have demonstrated equivalent, if not improved, performance across well-established benchmarks of node multiclass classification. Within the GAT framework, Guo et al. (2019); Deng & Hooi (2021) use GAT-based models to adaptively adjust the correlations among multiple time series, showing better accuracy than GNNs and GCNs. The numerical and experimental successes of GATs for MTSF notwithstanding, theoretical understanding of the underlying mechanisms of GATs for MTSF is still limited: none of these methods has theoretical guarantees in the form of generalization error bounds, the most commonly used tool for theoretically evaluating a prediction model. The generalization error bound provides a standard approach to evaluate neural networks, as it characterizes the predictive performance of a class of learning models on unseen data (Golowich et al., 2018). Therefore, understanding the generalization error bound of GATs for MTSF will shed light on the relationship between the architecture of GATs and their generalization performance for MTSF, advancing understanding of the underlying mechanisms. Studies show that deriving a generalization error bound for a neural network class requires constraints on the size of the weights. Bartlett (1998) first gave a generalization error bound for neural networks by bounding the size of the cover of neural network function classes, suggesting that the bound depends on the number of training samples and the size of the weights, rather than the number of weights. In subsequent studies, the empirical Rademacher complexity (ERC) was shown to be an essential component of generalization error bounds for neural network classes.
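The attention-weighted aggregation described above can be sketched as a minimal single-head GAT-style layer. This is a generic illustration of the mechanism (the shapes, LeakyReLU slope, and the split of the attention vector into `a_src`/`a_dst` are standard GAT conventions, not details taken from the paper's model):

```python
import numpy as np

def gat_layer(H, adj, W, a_src, a_dst):
    """Single-head GAT-style layer.
    H: (n, d_in) node features; adj: (n, n) binary adjacency with
    self-loops; W: (d_in, d_out); a_src, a_dst: (d_out,) attention vectors."""
    Z = H @ W                                      # shared linear map
    # e[i, j] = LeakyReLU(a_src . z_i + a_dst . z_j): raw importance of j for i
    e = np.add.outer(Z @ a_src, Z @ a_dst)
    e = np.where(e > 0, e, 0.2 * e)                # LeakyReLU, slope 0.2
    e = np.where(adj > 0, e, -1e9)                 # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)      # softmax over each node's neighbors
    return alpha @ Z                               # attention-weighted aggregation

rng = np.random.default_rng(1)
n, d_in, d_out = 4, 3, 2
adj = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)   # path graph + self-loops
H = rng.standard_normal((n, d_in))
out = gat_layer(H, adj, rng.standard_normal((d_in, d_out)),
                rng.standard_normal(d_out), rng.standard_normal(d_out))
print(out.shape)  # (4, 2)
```

Multi-head attention repeats this with independent `W`, `a_src`, `a_dst` per head and concatenates (or averages) the outputs.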
Bartlett & Mendelson (2002) introduced generalization error bounds using the Rademacher complexity of function classes that include neural networks with constraints on the magnitudes of the weights for binary classification. Bartlett et al. (2017) then presented a margin-based generalization error bound using the Rademacher complexity of neural network function classes with the spectral norms of the weight matrices being controlled for multiclass classification. Neyshabur et al. (2015) used Rademacher complexity bounds to show that the generalization error of deep neural networks can be upper-bounded in terms of the Frobenius norms of the weights. Golowich et al. (2018) further demonstrated that, with proper techniques, the generalization error bound for deep neural network classes with bounded Frobenius norm of the weights can be made independent of the number of layers and the width of each layer. These methods have also been extended to graph-based neural networks. Garg et al. (2020) derived generalization error bounds for GNNs using Rademacher complexity for binary classification. Lv (2021) provided a generalization error bound for GCNs via Rademacher complexity for binary classification.

Contributions. In this study, to capture the inter-dependencies among variables in MTSF, we develop a GAT-based method for MTSF; to secure the generalization error bound, we require the norm of the weight matrix in our model to be bounded; and to evaluate the performance of our method, we compare it with two SOTA methods and show that it outperforms these prior methods. We also provide a theoretical generalization error bound for our method, aiming to develop models with a desired generalization error for MTSF. Specifically, we derive generalization error bounds for two-layer GAT models on the multi-step MTSF task, and we extend these bounds to deep GAT models with more than two layers. The generalization error bounds derived in this study are based on a bound on the ERC of GAT models with the weight matrix norm being controlled. This approach is characterized by controlling the Frobenius norm of the hidden-layer weight matrix, a common method to derive norm-based generalization error bounds for DNNs, CNNs, and GNNs. In particular, we show that the ERC derived for GAT models for MTSF has a polynomial dependence on the number of neighbors considered
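A common way to enforce the kind of weight-norm constraint discussed here is to project the weight matrix back onto a Frobenius-norm ball after each update. The sketch below shows this projection in isolation; the bound `c` is an illustrative hyperparameter, not a value from any of the cited works:

```python
import numpy as np

def project_frobenius(W, c):
    """Project W onto the ball {W : ||W||_F <= c} by rescaling."""
    norm = np.linalg.norm(W)       # np.linalg.norm is Frobenius by default
    return W if norm <= c else W * (c / norm)

W = np.full((3, 3), 2.0)           # ||W||_F = sqrt(9 * 4) = 6
W_proj = project_frobenius(W, c=1.5)
print(np.linalg.norm(W_proj))      # 1.5
```

In training, such a projection is typically applied to each layer's weights after every gradient step, so the hypothesis class the learned network lives in has bounded per-layer Frobenius norms, which is the quantity the norm-based bounds above depend on.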

