NEURAL MULTI-EVENT FORECASTING ON SPATIO-TEMPORAL POINT PROCESSES USING PROBABILISTICALLY ENRICHED TRANSFORMERS

Abstract

Predicting discrete events in time and space has many scientific applications, such as predicting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for multi-event forecasting of spatio-temporal point processes, utilizing transformers, augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.

1. INTRODUCTION

Predicting the occurrence of discrete events in time and space has been the focus of many scientific studies and applications. Problems such as predicting earthquake hazards (Ogata, 1998; Chen et al., 2020), infectious diseases over a population (Meyer et al., 2012; Schoenberg et al., 2019), mobility and traffic in cities (Du et al., 2016), and brain neuronal spikes (Perkel et al., 1967) fall under this category and have attracted considerable interest. Over the years, many works have used the history-dependent spatio-temporal Hawkes process (Ozaki, 1979; Ogata, 1988; 1998; Ogata & Zhuang, 2006; Ogata et al., 1993; Nandan et al., 2017; Sornette & Werner, 2005; Zhuang, 2012; Helmstetter & Sornette, 2003; Bansal & Ogata, 2013) to model these point events. The stochasticity and the excitatory history-dependency of the Hawkes process are modeled by a conditional intensity function that varies in time and space. Assuming $n$ events and their associated markers $(t, x, M)$ collected in the history $\mathcal{H}_t = \{(t_i, x_i, M_i) \mid t_i < t,\ i = 1{:}n\}$, the Hawkes intensity function is defined as

$$\lambda(t, x \mid \mathcal{H}_t) := \lim_{\Delta t, \Delta x \to 0} \frac{P_{\Delta t, \Delta x}(t, x \mid \mathcal{H}_t)}{|B(x, \Delta x)|\,\Delta t} = \mu_\theta(x) + \sum_{i:\, t_i < t} g_\phi(t - t_i,\ x - x_i,\ M_i), \tag{1}$$

where $P_{\Delta t, \Delta x}(t, x \mid \mathcal{H}_t)$ denotes the history-dependent probability of having an event in a small time interval $[t, t + \Delta t)$ and a small ball $B(x, \Delta x)$ centered at $x \in \mathbb{R}^d$ with radius $\Delta x$, and $\mathcal{H}_t$ denotes all events happening up to but not including time $t$. The functions $\mu_\theta$ and $g_\phi$ represent the (parametric) background intensity function and spatio-temporal kernel, respectively, which together form $\lambda(t, x \mid \mathcal{H}_t)$. A variety of parametric forms for both $\mu_\theta$ and $g_\phi$ have been proposed (Ogata, 1998).

Despite their use across different domains, Hawkes processes have several limitations. First, one has to predetermine parametric forms for $\mu_\theta$ and $g_\phi$.
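As a concrete illustration of such a predetermined parametric form, the following minimal Python sketch evaluates a Hawkes intensity with a constant background rate and a kernel that factorizes into an exponential decay in time and an isotropic Gaussian in space. The constant background, the Gaussian spatial factor, the excitation weight `alpha`, the default parameter values, and the function name `hawkes_intensity` are illustrative assumptions rather than choices made in this work, and the markers $M_i$ are ignored for brevity.

```python
import math


def hawkes_intensity(t, x, history, mu=0.1, alpha=0.5, beta=1.0, sigma=1.0):
    """Evaluate mu + sum over past events of g(t - t_i, x - x_i).

    history is an iterable of (t_i, x_i) pairs, with x_i a coordinate tuple
    of the same dimension as x. All parameter values are hypothetical.
    """
    d = len(x)
    total = mu  # assumed constant background intensity
    for t_i, x_i in history:
        if t_i >= t:
            continue  # only events strictly before t belong to the history
        dt = t - t_i
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
        # Temporal factor: (1 / beta) * exp(-(t - t_i) / beta)
        temporal = math.exp(-dt / beta) / beta
        # Spatial factor: isotropic Gaussian density in d dimensions (assumed)
        spatial = math.exp(-sq_dist / (2 * sigma**2)) / (2 * math.pi * sigma**2) ** (d / 2)
        total += alpha * temporal * spatial
    return total


# Toy history of three events in the plane; the event at t = 2.0 lies in the
# future of the query time t = 1.0 and therefore does not contribute.
history = [(0.0, (0.0, 0.0)), (0.5, (1.0, 0.0)), (2.0, (0.0, 1.0))]
lam = hawkes_intensity(1.0, (0.0, 0.0), history)
```

In this sketch both the exponential shape in time and the Gaussian shape in space are fixed before any data are seen, and only a handful of scalar parameters remain to be fitted.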
This limits the expressive power of the model and, more importantly, might lead to a mismatch between the proposed model and one that can better represent the observed data. Second, Eq. (1) imposes similar behavior across all events over time, meaning that every preceding event has the same form of impact on the occurrence of future events. As an example, in the classical Hawkes process with a temporally decaying kernel $g_\phi(t - t_i) = \frac{1}{\beta} \exp\left(-\frac{t - t_i}{\beta}\right)$, the same parameter $\beta$ is applied to all events via their corresponding time differences. Third, the intensity function $\lambda(t, x \mid \mathcal{H}_t)$ can be used to predict only one

