DISCRETE GRAPH STRUCTURE LEARNING FOR FORECASTING MULTIPLE TIME SERIES

Abstract

Time series forecasting is an extensively studied subject in statistics, economics, and computer science. Exploration of the correlation and causation among the variables in a multivariate time series shows promise in enhancing the performance of a time series model. When using deep neural networks as forecasting models, we hypothesize that exploiting the pairwise information among multiple (multivariate) time series also improves their forecast. If an explicit graph structure is known, graph neural networks (GNNs) have been demonstrated as powerful tools to exploit the structure. In this work, we propose learning the structure simultaneously with the GNN if the graph is unknown. We cast the problem as learning a probabilistic graph model through optimizing the mean performance over the graph distribution. The distribution is parameterized by a neural network so that discrete graphs can be sampled differentiably through reparameterization. Empirical evaluations show that our method is simpler, more efficient, and better performing than a recently proposed bilevel learning approach for graph structure learning, as well as a broad array of forecasting models, either deep or non-deep learning based, and graph or non-graph based.

1. INTRODUCTION

Time series data are widely studied in science and engineering disciplines that involve temporal measurements. Time series forecasting is concerned with the prediction of future values based on observed ones in the past. It has played important roles in climate studies, market analysis, traffic control, and energy grid management (Makridakis et al., 1997) and has inspired the development of various predictive models that capture the temporal dynamics of the underlying system. These models range from early autoregressive approaches (Hamilton, 1994; Asteriou & Hall, 2011) to recent deep learning methods (Seo et al., 2016; Li et al., 2018; Yu et al., 2018; Zhao et al., 2019). Analysis of univariate time series (a single longitudinal variable) has been extended to multivariate time series and to multiple (univariate or multivariate) time series. Multivariate forecasting models find strong predictive power in stressing the interdependency (and even causal relationship) among the variables. The vector autoregressive model (Hamilton, 1994) is an example of multivariate analysis, wherein the coefficient magnitudes offer hints into the Granger causality (Granger, 1969) of one variable to another.

For multiple time series, pairwise similarities or connections among them have also been explored to improve forecasting accuracy (Yu et al., 2018). An example is the traffic network, where each node denotes a time series captured by a particular sensor. The spatial connections of the roads offer insights into how traffic dynamics propagates along the network. Several graph neural network (GNN) approaches (Seo et al., 2016; Li et al., 2018; Yu et al., 2018; Zhao et al., 2019) have been proposed recently to leverage the graph structure for forecasting all time series simultaneously. The graph structure, however, is not always available, or it may be incomplete.
There could be several reasons, including the difficulty of obtaining such information or a deliberate shielding for the protection of sensitive information. For example, a data set comprising sensory readings of a nation-wide energy grid may be granted to specific users without disclosure of the grid structure. Such practical situations incentivize the automatic learning of the hidden graph structure jointly with the forecasting model.

Because GNN approaches show promise in forecasting multiple interrelated time series, in this paper we are concerned with structure learning methods applicable to the downstream use of GNNs. A prominent example is the recent work of Franceschi et al. (2019) (named LDS), a meta-learning approach that treats the graph as a hyperparameter in a bilevel optimization framework (Franceschi et al., 2017). Specifically, let X_train and X_val denote the training and validation sets of time series, respectively; let A ∈ {0, 1}^{n×n} denote the graph adjacency matrix of the n time series; let w denote the parameters of the GNN; and let L and F denote the loss functions used during training and validation, respectively (which may not be identical). LDS formulates the problem as learning the probability matrix θ ∈ [0, 1]^{n×n}, which parameterizes the element-wise Bernoulli distribution from which the adjacency matrix A is sampled:

\min_{\theta} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta)}\big[ F(A, w(\theta), X_{\mathrm{val}}) \big], \quad \text{s.t.} \quad w(\theta) = \operatorname*{argmin}_{w} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta)}\big[ L(A, w, X_{\mathrm{train}}) \big]. \tag{1}

Formulation (1) is a bilevel optimization problem. The constraint (which is itself an optimization problem) defines the GNN weights as a function of the given graph, so that the objective is optimized over the graph alone. Note that for differentiability, one does not operate directly on the discrete graph adjacency matrix A, but on the continuous probabilities θ instead.

LDS has two drawbacks. First, its computation is expensive.
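To make the sampling step in formulation (1) concrete, the following minimal NumPy sketch draws a discrete adjacency matrix A element-wise from Ber(θ). The size n and the probability matrix here are hypothetical stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4  # hypothetical number of time series
theta = rng.uniform(size=(n, n))  # edge probabilities, theta[i, j] in [0, 1]

# Element-wise Bernoulli sample: A[i, j] = 1 with probability theta[i, j].
A = (rng.uniform(size=(n, n)) < theta).astype(int)

print(A.shape)  # (4, 4)
```

An A sampled this way is discrete, so gradients cannot flow through it directly; this is why LDS optimizes the continuous θ through the bilevel objective rather than operating on A itself.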
The derivative of w with respect to θ is computed by applying the chain rule to a recursive-dynamics surrogate of the inner optimization (the argmin). Applying the chain rule to this surrogate is equivalent to differentiating an RNN, which is either memory intensive if done in the reverse mode or time consuming if done in the forward mode, when unrolling a deep dynamics. Second, it is challenging to scale: the matrix θ has Θ(n^2) entries to optimize, and thus the method is hard to scale to increasingly many time series.

In light of the challenges of LDS, we instead advocate a unilevel optimization:

\min_{w} \; \mathbb{E}_{A \sim \mathrm{Ber}(\theta(w))}\big[ F(A, w, X_{\mathrm{train}}) \big]. \tag{2}

Formulation (2) trains the GNN model as usual, except that the probabilities θ (which parameterize the distribution from which A is sampled) are themselves parameterized. We absorb these parameters, together with the GNN parameters, into the notation w. We still use a validation set X_val for usual hyperparameter tuning, but these hyperparameters are not θ as treated by (1). In fact, formulation (1) may need a second validation set to tune other hyperparameters.

The major distinction of our approach from LDS is the parameterization θ(w), as opposed to an inner optimization w(θ). In our approach, a modeler has the freedom to design the parameterization and to better control the number of parameters as n^2 increases. To this end, time series representation learning and link prediction techniques offer ample inspiration for modeling. In contrast, LDS is more agnostic, as no modeling is needed; the effort instead lies in the nontrivial treatment of the inner optimization (in particular, its differentiation). As such, our approach is advantageous in two regards. First, its computation is less expensive, because the gradient computation of a unilevel optimization is straightforward and efficient, and implementations are mature.
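The contrast with (1) can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: per-series embeddings feed a pairwise link predictor whose parameter count depends on the embedding width d rather than on n^2, and a binary Gumbel-softmax (concrete) relaxation yields a differentiable sample of A. The embeddings z, the weight vector W, and the temperature tau are illustrative stand-ins (a learned feature extractor would produce z in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 8                          # hypothetical: n series, d-dim embeddings
z = rng.normal(size=(n, d))          # per-series representations (stand-in)
W = rng.normal(size=(2 * d,)) * 0.1  # link-predictor weights: O(d), not O(n^2)

def edge_probabilities(z, W):
    """theta[i, j] = sigmoid(W . [z_i, z_j]) -- a pairwise link predictor.

    The parameter count scales with the embedding width d, independent of n.
    """
    pairs = np.concatenate(
        [np.repeat(z, len(z), axis=0), np.tile(z, (len(z), 1))], axis=1)
    logits = pairs @ W
    return (1.0 / (1.0 + np.exp(-logits))).reshape(len(z), len(z))

def gumbel_bernoulli(theta, tau=0.5):
    """Binary Gumbel-softmax (concrete) relaxation of A ~ Ber(theta).

    Returns soft values in [0, 1] that approach {0, 1} as tau -> 0; the
    expression is differentiable in theta, enabling end-to-end training.
    """
    u1 = rng.uniform(1e-10, 1.0, size=theta.shape)
    u2 = rng.uniform(1e-10, 1.0, size=theta.shape)
    noise = -np.log(-np.log(u1)) + np.log(-np.log(u2))  # Gumbel difference
    logits = np.log(theta) - np.log1p(-theta)           # log-odds of theta
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

theta = edge_probabilities(z, W)
A_soft = gumbel_bernoulli(theta)
```

In a full model, gradients of the forecasting loss would flow through A_soft back into W and the feature extractor producing z, which is the θ(w) parameterization of (2).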
Second, it scales better, because the number of parameters does not grow quadratically with the number of time series. We coin our approach GTS (short for "graph for time series"), signaling the usefulness of graph structure learning for enhancing time series forecasting. It is important to note that the end purpose of the graph is to improve forecasting quality, rather than to identify causal relationships among the series or to recover the ground-truth graph, if any. While causal discovery of multiple scalar variables is an

