MULTIVARIATE PROBABILISTIC TIME SERIES FORECASTING VIA CONDITIONED NORMALIZING FLOWS

Abstract

Time series forecasting is often fundamental to scientific and engineering problems and enables decision making. With ever-increasing data set sizes, a trivial solution to scale up predictions is to assume independence between interacting time series. However, modeling statistical dependencies can improve accuracy and enable analysis of interaction effects. Deep learning methods are well suited for this problem, but multivariate models often assume a simple parametric distribution and do not scale to high dimensions. In this work we model the multivariate temporal dynamics of time series via an autoregressive deep learning model, where the data distribution is represented by a conditioned normalizing flow. This combination retains the power of autoregressive models, such as good performance in extrapolation into the future, with the flexibility of flows as a general-purpose high-dimensional distribution model, while remaining computationally tractable. We show that it improves over the state-of-the-art on standard metrics for many real-world data sets with several thousand interacting time series.

1. INTRODUCTION

Classical time series forecasting methods, such as those in Hyndman & Athanasopoulos (2018), typically provide univariate forecasts and require hand-tuned features to model seasonality and other parameters. Time series models based on recurrent neural networks (RNN), like the LSTM (Hochreiter & Schmidhuber, 1997), have become popular methods due to their end-to-end training, the ease of incorporating exogenous covariates, and their automatic feature extraction abilities, which are the hallmarks of deep learning. Forecasting outputs can either be points or probability distributions, in which case the forecasts typically come with uncertainty bounds. The problem of modeling uncertainties in time series forecasting is of vital importance for assessing how much to trust the predictions for downstream tasks, such as anomaly detection or (business) decision making. Without probabilistic modeling, a forecast in a region of low noise (small variance around a mean value) cannot be distinguished from one made under high noise. Hence, point estimation models ignore the risk stemming from this noise, which is of particular importance in some contexts, such as making (business) decisions. Finally, individual time series are in many cases statistically dependent on each other, and models need the capacity to adapt to this in order to improve forecast accuracy (Tsay, 2014). For example, to model the demand for a retail article, it is important to not only model its sales dependent on its own past sales, but also to take into account the effect of interacting articles, which can lead to cannibalization effects in the case of article competition. As another example, consider traffic flow in a network of streets as measured by occupancy sensors. A disruption on one particular street will also ripple to occupancy sensors of nearby streets; a univariate model would arguably not be able to account for these effects.
In this work, we propose end-to-end trainable autoregressive deep learning architectures for probabilistic forecasting that explicitly model multivariate time series and their temporal dynamics by employing a normalizing flow, such as the Masked Autoregressive Flow (Papamakarios et al., 2017) or Real NVP (Dinh et al., 2017). These models scale to thousands of interacting time series; we show that they can learn the ground-truth dependency structure on toy data, and we establish new state-of-the-art results on diverse real-world data sets against competitive baselines. Additionally, by using a normalizing flow, these methods adapt to a broad class of underlying data distributions, and our Transformer-based model is highly efficient during training due to the parallel nature of attention layers. The paper first provides some background context in Section 2. We cover related work in Section 3. Section 4 introduces our model and the experiments are detailed in Section 5. We conclude with some discussion in Section 6. The Appendix contains details of the data sets, additional metrics, exploratory plots of forecast intervals, and details of our model.

2. BACKGROUND

2.1 DENSITY ESTIMATION VIA NORMALIZING FLOWS

Normalizing flows (Tabak & Turner, 2013; Papamakarios et al., 2019) are mappings from R^D to R^D such that densities p_X on the input space X = R^D are transformed into some simple distribution p_Z (e.g. an isotropic Gaussian) on the space Z = R^D. These mappings, f : X → Z, are composed of a sequence of bijections or invertible functions. Due to the change of variables formula we can express p_X(x) by

p_X(x) = p_Z(f(x)) |det(∂f(x)/∂x)|,

where ∂f(x)/∂x is the Jacobian of f at x. Normalizing flows have the property that the inverse x = f^{-1}(z) is easy to evaluate and computing the Jacobian determinant takes O(D) time.

The bijection introduced by Real NVP (Dinh et al., 2017), called the coupling layer, satisfies both of these properties. It leaves part of its input unchanged and transforms the other part via functions of the untransformed variables (with superscripts denoting the coordinate indices):

y^{1:d} = x^{1:d}
y^{d+1:D} = x^{d+1:D} ⊙ exp(s(x^{1:d})) + t(x^{1:d}),

where ⊙ is an element-wise product, and s(·) is a scaling and t(·) a translation function from R^d → R^{D-d}, given by neural networks. To model a nonlinear density map f(x), a number of coupling layers which map X → Y_1 → ... → Y_{K-1} → Z are composed together, all the while alternating the dimensions which are unchanged and transformed. Via the change of variables formula, the log probability density function (PDF) of the flow given a data point can be written as

log p_X(x) = log p_Z(z) + log |det(∂z/∂x)| = log p_Z(z) + Σ_{i=1}^{K} log |det(∂y_i/∂y_{i-1})|.

Note that the Jacobian of a Real NVP coupling layer is a block-triangular matrix, and thus the log-determinant of each map simply becomes

log |det(∂y_i/∂y_{i-1})| = log |exp(sum(s_i(y_{i-1}^{1:d})))| = sum(s_i(y_{i-1}^{1:d})),

where sum(·) is the sum over all the vector elements.
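As a concrete illustration, the affine coupling layer above can be sketched in a few lines of NumPy. The linear maps standing in for the neural networks s(·) and t(·), and all variable names, are illustrative choices for this sketch, not the implementation used in the paper:

```python
import numpy as np

def coupling_forward(x, d, s, t):
    """Real NVP affine coupling layer: copy x[:, :d], transform x[:, d:]."""
    y = x.copy()
    log_scale = s(x[:, :d])                    # shape (batch, D - d)
    y[:, d:] = x[:, d:] * np.exp(log_scale) + t(x[:, :d])
    log_det = log_scale.sum(axis=-1)           # log|det J| = sum of log-scales
    return y, log_det

def coupling_inverse(y, d, s, t):
    """Invert the layer; only the unchanged coordinates are needed."""
    x = y.copy()
    x[:, d:] = (y[:, d:] - t(y[:, :d])) * np.exp(-s(y[:, :d]))
    return x

# Toy linear maps stand in for the s and t neural networks.
rng = np.random.default_rng(0)
D, d = 4, 2
W_s = 0.1 * rng.normal(size=(d, D - d))
W_t = rng.normal(size=(d, D - d))
s = lambda a: a @ W_s
t = lambda a: a @ W_t

x = rng.normal(size=(5, D))
y, log_det = coupling_forward(x, d, s, t)
x_rec = coupling_inverse(y, d, s, t)
```

Because the first d coordinates pass through unchanged, the inverse and the triangular Jacobian are both cheap; composing several such layers while alternating which coordinates are copied yields the full flow.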
This model, parameterized by the weights θ of the scaling and translation neural networks, is then trained via stochastic gradient descent (SGD), where for each batch D of training data points we maximize the average log-likelihood

L = (1/|D|) Σ_{x∈D} log p_X(x; θ).

In practice, Batch Normalization (Ioffe & Szegedy, 2015) is applied as a bijection to the outputs of successive coupling layers to stabilize the training of normalizing flows. This bijection implements the normalization procedure using a weighted moving average of the layer's mean and standard deviation values, which has to be switched between training and inference regimes.
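To make the maximum-likelihood training concrete, the following sketch fits a single scalar affine flow z = (x − t)·exp(−s) to Gaussian data by gradient ascent on the average log-likelihood, with the gradients written out by hand. The one-dimensional setup, the hand-derived gradients, and all names are illustrative simplifications, not the paper's training procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=2000)  # toy 1-D observations

# Flow z = (x - t) * exp(-s) with base density p_Z = N(0, 1).
# Average log-likelihood: L = mean(-0.5 * z**2) - 0.5*log(2*pi) - s.
t, s, lr = 0.0, 0.0, 0.1
for _ in range(500):
    z = (data - t) * np.exp(-s)
    grad_t = np.mean(z) * np.exp(-s)   # dL/dt
    grad_s = np.mean(z**2) - 1.0       # dL/ds
    t += lr * grad_t                   # gradient *ascent* on L
    s += lr * grad_s
```

At the optimum both gradients vanish, so t recovers the sample mean and exp(s) the sample standard deviation, which is exactly what maximizing the likelihood of an affine flow with a standard-normal base distribution should give.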

