NEURAL MULTI-EVENT FORECASTING ON SPATIO-TEMPORAL POINT PROCESSES USING PROBABILISTICALLY ENRICHED TRANSFORMERS

Abstract

Predicting discrete events in time and space has many scientific applications, such as forecasting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for multi-event forecasting of spatio-temporal point processes, utilizing transformers augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.

1. INTRODUCTION

Predicting the occurrence of discrete events in time and space has been the focus of many scientific studies and applications. Problems such as predicting earthquake hazards (Ogata, 1998; Chen et al., 2020), infectious diseases over a population (Meyer et al., 2012; Schoenberg et al., 2019), mobility and traffic in cities (Du et al., 2016), and brain neuronal spikes (Perkel et al., 1967) fall under this category and have attracted considerable interest. Over the years, many works have used the history-dependent spatio-temporal Hawkes process (Ozaki, 1979; Ogata, 1988; 1998; Ogata & Zhuang, 2006; Ogata et al., 1993; Nandan et al., 2017; Sornette & Werner, 2005; Zhuang, 2012; Helmstetter & Sornette, 2003; Bansal & Ogata, 2013) to model these point events. The stochasticity and the excitatory history-dependency of the Hawkes process are modeled by a conditional intensity function that varies in time and space.

Assuming $n$ events and their associated markers $(t, x, M)$ collected in the history $\mathcal{H}_t = \{(t_i, x_i, M_i) \mid t_i < t,\ i = 1:n\}$, the Hawkes intensity function is defined as

$$\lambda(t, x \mid \mathcal{H}_t) := \lim_{\Delta t, \Delta x \to 0} \frac{P_{\Delta t, \Delta x}(t, x \mid \mathcal{H}_t)}{|B(x, \Delta x)|\,\Delta t} = \mu_\theta(x) + \sum_{i:\, t_i < t} g_\phi(t - t_i,\, x - x_i,\, M_i), \tag{1}$$

where $P_{\Delta t, \Delta x}(t, x \mid \mathcal{H}_t)$ denotes the history-dependent probability of having an event in a small time interval $[t, t + \Delta t)$ and a small ball $B(x, \Delta x)$ centered at $x \in \mathbb{R}^d$ with radius $\Delta x \in \mathbb{R}^d$, and $\mathcal{H}_t$ denotes all events happening up to but not including time $t$. The functions $\mu_\theta$ and $g_\phi$ represent the (parametric) background intensity function and spatio-temporal kernel, respectively, together forming $\lambda(t, x \mid \mathcal{H}_t)$. A variety of parametric forms for both $\mu_\theta$ and $g_\phi$ have been proposed (Ogata, 1998). Despite their use across different domains, Hawkes processes have several limitations. First, one has to predetermine parametric forms for $\mu_\theta$ and $g_\phi$.
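To make (1) concrete, the sketch below evaluates a spatio-temporal Hawkes intensity with a constant background $\mu_\theta(x) \equiv \mu$ and a separable kernel (exponential decay in time, isotropic Gaussian in space). The kernel form and all parameter values here are illustrative assumptions, not the forms studied later in the paper:

```python
import numpy as np

def hawkes_intensity(t, x, history, mu=0.2, alpha=0.8, beta=1.0, sigma=0.5):
    """Evaluate lambda(t, x | H_t) as in (1) with a constant background mu and
    a separable kernel: exponential decay in time, isotropic Gaussian in space.
    history is a list of (t_i, x_i) pairs; markers M_i are omitted for brevity."""
    lam = mu
    d = len(x)
    for t_i, x_i in history:
        if t_i < t:  # only past events excite the intensity
            g_time = alpha * np.exp(-(t - t_i) / beta) / beta
            diff = np.asarray(x) - np.asarray(x_i)
            g_space = np.exp(-diff @ diff / (2 * sigma**2)) / ((2 * np.pi * sigma**2) ** (d / 2))
            lam += g_time * g_space
    return lam

history = [(0.5, (0.0, 0.0)), (1.2, (1.0, 0.5))]
print(hawkes_intensity(2.0, (0.2, 0.1), history))
```

Far from all past events the intensity reduces to the background rate $\mu$, and the excitation of each event decays as the query time moves away from it.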
This limits the expressive power of the model and, more importantly, might lead to a mismatch between the proposed model and one that better represents the observed data. Second, (1) imposes similar behavior across all events over time, meaning that every preceding event has the same form of impact on the occurrence of future events. As an example, in the classical Hawkes process with the temporally decaying kernel $g_\phi(t - t_i) = \frac{1}{\beta}\exp\!\left(-\frac{t - t_i}{\beta}\right)$, the same parameter $\beta$ is applied to all events via their corresponding time differences. Third, the intensity function $\lambda(t, x \mid \mathcal{H}_t)$ can be used to predict only one step ahead via, e.g., the commonly used first-order moments (Rasmussen, 2011; Snyder & Miller, 2012) given by

$$\mathbb{E}[t_{n+1} \mid \mathcal{H}_{t_n}] = \int_{t_n}^{\infty} t \int_{V} \lambda(t, x \mid \mathcal{H}_{t_n}) \exp\!\left(-\int_{V}\!\int_{t_n}^{t} \lambda(u, v \mid \mathcal{H}_{t_n})\, du\, dv\right) dx\, dt, \tag{2}$$

$$\mathbb{E}[x_{n+1} \mid \mathcal{H}_{t_n}] = \int_{V} x \int_{t_n}^{\infty} \lambda(t, x \mid \mathcal{H}_{t_n}) \exp\!\left(-\int_{V}\!\int_{t_n}^{t} \lambda(u, v \mid \mathcal{H}_{t_n})\, du\, dv\right) dt\, dx,$$

where solving the integral in a continuous high-dimensional space can be computationally expensive, or inaccurate when sampling methods such as Monte Carlo sampling (Hastings, 1970) are used. Fourth, although (2) makes it feasible to predict the next event, multi-event prediction demands sequential application of the first-order moment, where at each step the history is updated with the most recent predicted event. This makes multi-event prediction challenging and prone to error accumulation. Zhu et al. (2021) proposed learning the Hawkes intensity function based on Gaussian diffusion kernels while using imitation learning as a flexible model-fitting approach. However, their method could not tackle any of the aforementioned shortcomings. Therefore, in the present work, we introduce a novel neural network that is capable of simultaneous spatio-temporal multi-event forecasting while addressing all the above-mentioned shortcomings. To the best of our knowledge, this is the first work proposing a data-driven multi-event forecasting network for stochastic point processes.
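The integrals in (2) are rarely tractable in closed form, so in practice one resorts to sampling. The sketch below estimates $\mathbb{E}[t_{n+1} \mid \mathcal{H}_{t_n}]$ for a purely temporal Hawkes process via thinning-based Monte Carlo, and then illustrates the sequential multi-event scheme described above, in which each predicted event is fed back into the history; the kernel and all parameter values are illustrative assumptions:

```python
import numpy as np

def intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    # Purely temporal Hawkes intensity with an exponential kernel
    # (illustrative parameters: background mu, excitation alpha, decay scale beta).
    return mu + sum(alpha / beta * np.exp(-(t - ti) / beta) for ti in history if ti < t)

def sample_next_event(history, t_now, rng, mu=0.5, alpha=0.8, beta=1.0):
    """Draw one sample of t_{n+1} by thinning (Ogata-style rejection sampling)."""
    t = t_now
    while True:
        # Each kernel term is non-increasing after t, so this bounds the intensity on (t, inf).
        lam_bar = mu + sum(alpha / beta * np.exp(-(t - ti) / beta) for ti in history if ti <= t)
        t = t + rng.exponential(1.0 / lam_bar)
        if rng.random() * lam_bar <= intensity(t, history, mu, alpha, beta):
            return t

def expected_next_event(history, t_now, n_samples=2000, seed=0):
    # Monte Carlo estimate of the temporal first-order moment in (2).
    rng = np.random.default_rng(seed)
    return float(np.mean([sample_next_event(history, t_now, rng) for _ in range(n_samples)]))

# Sequential multi-event forecasting: each step conditions on its own prediction,
# which is exactly where errors accumulate over the horizon.
history, t, preds = [0.3, 0.9, 1.4], 1.4, []
for _ in range(3):
    t = expected_next_event(history, t, n_samples=500)
    history.append(t)
    preds.append(t)
print(preds)
```

Already in this one-dimensional toy setting, each forecast step requires hundreds of intensity evaluations, and any bias in one step propagates into all subsequent ones; in a continuous spatio-temporal domain the cost and error growth are considerably worse.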
Our architecture augments the encoder and decoder blocks of a transformer (Vaswani et al., 2017) from natural language processing (NLP) with probabilistic and bijective layers. This network design provides a batch of rich spatio-temporal distributions associated with multiple future events. We propose to directly learn and predict these distributions (without the need to learn the intensity function, as opposed to previous works) using self-supervised learning (Liu et al., 2021), marking a further point of departure from existing work.

The rest of this paper is organized as follows. In Section 2, we provide background on attention mechanisms and normalizing flows, since these are used as constituent blocks in our solution. In Section 3, we formally define the problem to be solved and present our proposed neural multi-event forecasting network. In Section 4, we evaluate the performance of our network on a variety of datasets and compare it with baseline models. Finally, in Section 5 we conclude and outline future directions.

Our contributions can be summarized as follows:

• We introduce a neural architecture that is capable of simultaneous multi-event forecasting of the time and location of discrete events in continuous time and space.

• We compare the performance of our solution with state-of-the-art models through extensive experiments on a variety of datasets that represent stochastic point events.
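The bijective layers mentioned above build on normalizing flows, which are reviewed in Section 2. As a minimal illustration of the underlying change-of-variables rule, the sketch below pushes a standard normal base variable through a single affine bijection and evaluates the resulting log-density; the parameter values are arbitrary and this is not the flow used in our model:

```python
import numpy as np

def affine_forward(z, scale, shift):
    # Bijection x = scale * z + shift, with log|det Jacobian| = sum(log|scale|).
    return scale * z + shift, float(np.sum(np.log(np.abs(scale))))

def log_prob(x, scale, shift):
    """Log-density of x under the flow when the base variable z is standard normal:
    log p_X(x) = log p_Z(f^{-1}(x)) - log|det J_f(f^{-1}(x))|."""
    z = (x - shift) / scale
    log_pz = -0.5 * float(np.sum(z ** 2)) - 0.5 * z.size * np.log(2 * np.pi)
    return log_pz - float(np.sum(np.log(np.abs(scale))))

x, logdet = affine_forward(np.array([0.5, 2.0]), np.array([2.0, 0.5]), np.array([0.0, 1.0]))
print(x, logdet, log_prob(x, np.array([2.0, 0.5]), np.array([0.0, 1.0])))
```

Stacking several such bijections (with learned, input-dependent parameters) yields flexible yet exactly normalized densities, which is what makes flows attractive for modeling event distributions without an intensity function.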



To alleviate some of the mentioned shortcomings, previous works have proposed learning the intensity function $\lambda(t, x \mid \mathcal{H}_t)$ directly from data using neural networks. Early studies in this direction (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017) attempted to learn $\lambda(t, x \mid \mathcal{H}_t)$ using recurrent neural networks (RNNs) and variants thereof. In a recent work, Chen et al. (2020) designed a neural ordinary differential equation (ODE) architecture that can learn a continuous intensity function, attained by combining jump and attentive continuous normalizing flows (CNFs) (Chen et al., 2018) for space with an RNN architecture for time. Despite being data-driven, these works do not tackle the last three drawbacks mentioned above. Moreover, due to the memory limitations of RNNs, they are incapable of unveiling long-term dependencies. On the other hand, RNNs have been replaced with attention-based architectures such as the transformer Hawkes process (Zuo et al., 2020) and the self-attentive Hawkes process (Zhang et al., 2020) to alleviate the second drawback mentioned earlier. The capability of discovering long-term dependencies while achieving fast training (via parallelization) is another benefit of this line of work. However, the last two drawbacks are still present in those approaches, hindering their utility for multi-event forecasting. Directly learning the temporal distribution using variational autoencoders (VAEs) (Kingma & Welling, 2014) augmented with CNFs (Chen et al., 2018), as proposed by Mehrasa et al. (2019), or utilizing the deep sigmoidal flow (DSF) (Huang et al., 2018) and the sum-of-squares (SOS) polynomial flow (Jaini et al., 2019), as proposed by Shchur et al. (2019), could tackle the first and third limitations. However, the second and fourth shortcomings remain unsolved.

