Neural Point Process for Learning Spatiotemporal Event Dynamics

Abstract

Learning the dynamics of spatiotemporal events is a fundamental problem. Neural point processes enhance the expressivity of point process models with deep neural networks. However, most existing methods only consider temporal dynamics without spatial modeling. We propose Deep Spatiotemporal Point Process (DeepSTPP), a deep dynamics model that integrates spatiotemporal point processes. Our method is flexible, efficient, and can accurately forecast irregularly sampled events over space and time. The key construction of our approach is the nonparametric space-time intensity function, governed by a latent process. The intensity function enjoys closed form integration for the density. The latent process captures the uncertainty of the event sequence. We use amortized variational inference to infer the latent process with deep networks. Using synthetic datasets, we validate our model can accurately learn the true intensity function. On real-world benchmark datasets, our model demonstrates superior performance over state-of-the-art baselines.

1. Introduction

Accurate modeling of spatiotemporal event dynamics is fundamentally important for disaster response (Veen and Schoenberg, 2008), logistic optimization (Safikhani et al., 2018), and social media analysis (Liang et al., 2019). Compared to other sequence data such as texts or time series, spatiotemporal events occur irregularly with uneven time and space intervals. Discrete-time deep dynamics models such as recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) assume events to be evenly sampled. Interpolating an irregularly sampled sequence into a regular sequence can introduce significant biases (Rehfeld et al., 2011). Furthermore, event sequences contain strong spatiotemporal dependencies. The rate of an event depends on the preceding events, as well as the events geographically correlated to it.

Spatiotemporal point processes (STPPs) (Daley and Vere-Jones, 2007; Reinhart et al., 2018) provide the statistical framework for modeling continuous-time event dynamics. As shown in Figure 1, given the history of an event sequence, an STPP estimates the intensity function that evolves in space and time. However, traditional statistical methods for estimating STPPs often require strong modeling assumptions and feature engineering, and can be computationally expensive.

The machine learning community is showing a growing interest in continuous-time deep dynamics models that can handle irregular time intervals. For example, Neural ODE (Chen et al., 2018) parametrizes the hidden states in an RNN with an ODE. Shukla and Marlin (2018) use a separate network to interpolate between reference time points. Neural temporal point processes (TPPs) (Mei and Eisner, 2017; Zhang et al., 2020; Zuo et al., 2020) are an exciting area that combines fundamental concepts from temporal point processes with deep learning to model continuous-time event sequences; see a recent review on neural TPPs (Shchur et al., 2021).
However, most of the existing models only focus on temporal dynamics without considering spatial modeling. In the real world, while time is a unidirectional process (arrow of time), space extends in multiple directions. This fundamental difference from TPPs makes it nontrivial to design a unified STPP model. The naive approach of approximating the intensity function with a deep neural network would lead to an intractable integral in the likelihood.

Prior research such as Du et al. (2016) discretizes the space as "markers" and uses a marked TPP to classify the events. This approach cannot produce the space-time intensity function. Okawa et al. (2019) model the spatiotemporal density using a mixture of symmetric kernels, which ignores the unidirectional property of time. Chen et al. (2021) propose to model the temporal intensity and the spatial density separately with neural ODEs, which is computationally expensive.

We propose a simple yet efficient approach to learn STPPs. Our model, Deep Spatiotemporal Point Process (DeepSTPP), marries the principles of spatiotemporal point processes with deep learning. We take a non-parametric approach and model the space-time intensity function as a mixture of kernels. The parameters of the intensity function are governed by a latent stochastic process, which captures the uncertainty of the event sequence. The latent process is then inferred via amortized variational inference. That is, we draw a sample from the variational distribution for every event. We use a Transformer network to parametrize the variational distribution conditioned on the previous events. Compared with existing approaches, our model is non-parametric and hence does not make assumptions on the parametric form of the distribution. Our approach learns the space-time intensity function jointly, without requiring separate models for the time-intensity function and the spatial density as in Chen et al. (2021).
Our model is probabilistic by nature and can describe various uncertainties in the data. More importantly, our model enjoys closed-form integration, making it feasible for processing large-scale event datasets. To summarize, our work makes the following key contributions:

• Deep Spatiotemporal Point Process. We propose a novel Deep Point Process model for forecasting unevenly sampled spatiotemporal events. It integrates deep learning with spatiotemporal point processes to learn continuous space-time dynamics.
• Neural Latent Process. We model the space-time intensity function using a nonparametric approach, governed by a latent stochastic process. We use amortized variational inference to perform inference on the latent process conditioned on the previous events.
• Effectiveness. We demonstrate our model using many synthetic and real-world spatiotemporal event forecasting tasks, where it achieves superior performance in accuracy and efficiency. We also derive and implement efficient algorithms for simulating STPPs.

2. Methodology

We first introduce the background of spatiotemporal point process, and then describe our approach to learn the underlying spatiotemporal event dynamics.

2.1. Background on Spatiotemporal Point Process

Spatiotemporal Point Process. A spatiotemporal point process (STPP) models the number of events $N(S \times (a, b])$ that occur in the Cartesian product of the spatial domain $S \subseteq \mathbb{R}^2$ and the time interval $(a, b]$. It is characterized by a non-negative space-time intensity function given the history $H_t := \{(s_1, t_1), \dots, (s_n, t_n)\}_{t_n \le t}$:

$$\lambda^*(s, t) := \lim_{\Delta s \to 0,\, \Delta t \to 0} \frac{\mathbb{E}\left[N\big(B(s, \Delta s) \times (t, t + \Delta t]\big) \mid H_t\right]}{|B(s, \Delta s)|\, \Delta t} \tag{1}$$

which is the probability, per unit of space and time, of finding an event in an infinitesimal time interval $(t, t + \Delta t]$ and an infinitesimal spatial ball $B(s, \Delta s)$ centered at location $s$.

Example 1: Spatiotemporal Hawkes process (STH). The spatiotemporal Hawkes (or self-exciting) process assumes every past event has an additive, positive, decaying, and spatially local influence over future events. Such a pattern resembles neuronal firing and earthquakes. It is characterized by the following intensity function (Reinhart et al., 2018):

$$\lambda^*(s, t) := \mu g_0(s) + \sum_{i: t_i < t} g_1(t, t_i)\, g_2(s, s_i), \quad \mu > 0 \tag{2}$$

where $g_0(s)$ is the probability density of a distribution over $S$, $g_1$ is the triggering kernel, often implemented as the exponential decay function $g_1(\Delta t) := \alpha \exp(-\beta \Delta t)$ with $\alpha, \beta > 0$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at $s_i$.

Example 2: Spatiotemporal Self-Correcting process (STSC). The self-correcting spatiotemporal point process (Isham and Westcott, 1979) assumes that the background intensity increases with a varying speed at different locations, and that the arrival of each event reduces the intensity nearby. STSC can model certain regular event sequences, such as an alternating home-to-work travel sequence. It has the following intensity function:

$$\lambda^*(s, t) = \mu \exp\Big(g_0(s)\beta t - \sum_{i: t_i < t} \alpha\, g_2(s, s_i)\Big), \quad \alpha, \beta, \mu > 0 \tag{3}$$

Here $g_0(s)$ is the density of a distribution over $S$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at location $s_i$.

Maximum Likelihood Estimation. Given a history of $n$ events $H_t$, the joint log-likelihood function of the observed events for an STPP is as follows:

$$\log p(H_t) = \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \int_S \int_0^t \lambda^*(u, \tau)\, d\tau\, du \tag{4}$$

Here, the space-time intensity function $\lambda^*(s, t)$ plays a central role. Maximum likelihood estimation seeks the $\lambda^*(s, t)$ that maximizes Eqn. (4) on the data.

Predictive distribution. Denote the probability density function (PDF) for an STPP as $f(s, t \mid H_t)$, which represents the conditional probability density that the next event will occur at location $s$ and time $t$, given the history. The PDF is closely related to the intensity function:

$$f(s, t \mid H_t) = \lambda^*(s, t)\,\big(1 - F^*(s, t \mid H_t)\big) = \lambda^*(s, t) \exp\Big(-\int_S \int_{t_n}^{t} \lambda^*(u, \tau)\, d\tau\, du\Big) \tag{5}$$

where $F^*$ is the cumulative distribution function (CDF); see derivations in Appendix A.1. This means the intensity function specifies the expected number of events in a region conditional on the past.

The predicted time of the next event is the expected value of the predictive distribution for time $f^*(t)$ over the entire spatial domain:

$$\mathbb{E}[t_{n+1} \mid H_t] = \int_{t_n}^{\infty} t \int_S f^*(s, t)\, ds\, dt = \int_{t_n}^{\infty} t \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) \lambda^*(t)\, dt$$

Similarly, the predicted location of the next event evaluates to:

$$\mathbb{E}[s_{n+1} \mid H_t] = \int_S s \int_{t_n}^{\infty} f^*(s, t)\, dt\, ds = \int_{t_n}^{\infty} \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) \int_S s\, \lambda^*(s, t)\, ds\, dt$$

Unfortunately, Eqn. (4) is generally intractable. It requires either strong modeling assumptions or expensive Monte Carlo sampling. We propose the DeepSTPP model to simplify the learning.
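To make the two example intensities concrete, the following is a minimal sketch that evaluates Eqn. (2) and Eqn. (3) at a query point. It is an illustration only: the isotropic Gaussian choice for $g_0$ and $g_2$, the function names, and all default parameter values are assumptions made for this example and are not taken from the paper's implementation.

```python
import math

def gaussian_2d(s, center, var):
    """Isotropic 2D Gaussian density with per-axis variance `var` (assumed kernel shape)."""
    dx, dy = s[0] - center[0], s[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * var)) / (2.0 * math.pi * var)

def sth_intensity(s, t, history, mu=0.2, alpha=0.5, beta=1.0,
                  bg_center=(0.0, 0.0), bg_var=1.0, trig_var=0.1):
    """Eqn. (2): mu * g0(s) + sum over past events of g1(t, t_i) * g2(s, s_i)."""
    lam = mu * gaussian_2d(s, bg_center, bg_var)
    for s_i, t_i in history:
        if t_i < t:  # only strictly earlier events excite the process
            lam += alpha * math.exp(-beta * (t - t_i)) * gaussian_2d(s, s_i, trig_var)
    return lam

def stsc_intensity(s, t, history, mu=1.0, alpha=0.2, beta=0.2,
                   bg_center=(0.0, 0.0), bg_var=1.0, corr_var=0.1):
    """Eqn. (3): mu * exp(g0(s) * beta * t - sum over past events of alpha * g2(s, s_i))."""
    exponent = gaussian_2d(s, bg_center, bg_var) * beta * t
    for s_i, t_i in history:
        if t_i < t:  # each past event lowers the intensity near its location
            exponent -= alpha * gaussian_2d(s, s_i, corr_var)
    return mu * math.exp(exponent)

# Hypothetical history of (location, time) pairs.
history = [((0.0, 0.0), 1.0), ((0.5, 0.5), 2.0)]
hawkes_after_event = sth_intensity((0.0, 0.0), 1.1, history)
correcting_after_event = stsc_intensity((0.0, 0.0), 1.1, history)
```

In this sketch a past event raises the Hawkes intensity near its location and lowers the self-correcting intensity there, which is the qualitative contrast between the two examples.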

2.2. Deep Spatiotemporal Point Process (DeepSTPP)

We propose DeepSTPP, a simple and efficient approach for learning the space-time event dynamics. Our model (1) introduces a latent process to capture the uncertainty, (2) parametrizes the latent process with deep neural networks to increase model expressivity, and (3) approximates the intensity function with a set of spatial and temporal kernel functions.

Figure 2: Design of our DeepSTPP model. For a historical event sequence, we encode it with a transformer network and map it to the latent process $(z_1, \dots, z_n)$. We use a decoder to generate the parameters $(w_i, \gamma_i, \beta_i)$ for each event $i$ given the latent process. The estimated intensity is calculated using the kernel functions $k_s$ and $k_t$ and the decoded parameters.

Neural latent process. Given a sequence of $n$ events, we wish to model the conditional density of observing the next event given the history, $f(s, t \mid H_t)$. We introduce a latent process to capture the uncertainty of the event history and infer the latent process with amortized variational inference. The latent process dictates the parameters in the space-time intensity function. We sample from the latent process using the re-parameterization trick (Kingma and Welling, 2013). As shown in Figure 2, given the event sequence $H_t = \{(s_1, t_1), \dots, (s_n, t_n)\}_{t_n \le t}$, we encode the entire sequence into a high-dimensional embedding. We use positional encoding to encode the sequence order. To capture the stochasticity in the temporal dynamics, we introduce a latent process $z = (z_1, \dots, z_n)$ for the entire sequence. We assume the latent process follows a multivariate Gaussian at each time step:

$$z_i \sim q_\phi(z_i \mid H_t) = \mathcal{N}(\mu, \mathrm{Diag}(\sigma))$$

where the mean $\mu$ and covariance $\mathrm{Diag}(\sigma)$ are the outputs of the embedding neural network.
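The re-parameterized sampling step can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' PyTorch code; it assumes that `sigma` holds per-dimension standard deviations and that the prior is a standard normal, neither of which is stated explicitly in the text. The function names are hypothetical.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Re-parameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is a deterministic,
    differentiable function of mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions (assumed prior)."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))

rng = random.Random(0)
z_i = sample_latent([0.3, -1.2], [0.5, 0.1], rng)   # one latent vector for one event
kl_value = kl_to_standard_normal([0.3, -1.2], [0.5, 0.1])
```

In an automatic-differentiation framework, the same expression lets gradients flow from the sampled `z_i` back to the encoder outputs.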
In our implementation, we found using a Transformer (Vaswani et al., 2017) with sinusoidal positional encoding to be beneficial. The positions to be encoded are the normalized event times instead of the index numbers, to account for the unequal time intervals. Recently, Zuo et al. (2020) also demonstrated that the Transformer enjoys better performance for learning the intensity in temporal point processes.

Non-parametric model. We take a non-parametric approach and model the space-time intensity function $\lambda^*(s, t)$ as:

$$\lambda^*(s, t \mid z) = \sum_{i=1}^{n+J} w_i\, k_s(s, s_i; \gamma_i)\, k_t(t, t_i; \beta_i) \tag{7}$$

Here $w_i(z), \gamma_i(z), \beta_i(z)$ are the parameters for each event, which are conditioned on the latent process. Specifically, $w_i$ represents the non-negative intensity magnitude, implemented with a softplus activation function. $k_s(\cdot, \cdot)$ and $k_t(\cdot, \cdot)$ are the spatial and temporal kernel functions, respectively. We parametrize both kernel functions as normalized RBF kernels:

$$k_s(s, s_i) = \alpha^{-1} \exp\big(-\gamma_i \lVert s - s_i \rVert\big), \qquad k_t(t, t_i) = \exp\big(-\beta_i \lVert t - t_i \rVert\big) \tag{8}$$

where the bandwidth parameter $\gamma_i$ controls an event's influence over the spatial domain, and the parameter $\beta_i$ is the decay rate that represents the event's influence over time. $\alpha = \int_S \exp(-\gamma_i \lVert s - s_i \rVert)\, ds$ is the normalization constant.

We use a decoder network to generate the parameters $\{w_i, \gamma_i, \beta_i\}$ given $z$ separately, as shown in Figure 2. Each decoder is a 4-layer feed-forward network. We use a softplus activation function to ensure $w_i$ and $\gamma_i$ are positive. The decay rate $\beta_i$ can be any number, such that an event could have constant or increasing triggering intensity over time.

In addition to the $n$ historical events, we also randomly sample $J$ representative points from the spatial domain to approximate the background intensity. This is to account for the influence from unobserved events in the background, with varying rates at different absolute locations.
The inclusion of these representative points can approximate this background distribution. The model design in (7) enjoys a closed-form integration, which gives the conditional PDF as:

$$f(s, t \mid H_t, z) = \lambda^*(s, t \mid z) \exp\Big(-\sum_{i=1}^{n+J} \frac{w_i}{\beta_i} \big[k_t(t_n, t_i) - k_t(t, t_i)\big]\Big) \tag{9}$$

See the derivation details in Appendix A.2. DeepSTPP circumvents the numerical integration of the intensity function and enjoys fast inference in forecasting future events. In contrast, NSTPP (Chen et al., 2021) is relatively inefficient, as its ODE solver also requires additional numerical integration.

Parameter learning. Due to the latent process, the posterior becomes intractable. Instead, we use amortized inference by optimizing the evidence lower bound (ELBO) of the likelihood. In particular, given the event history $H_t$ and a sample $z$ from $q_\phi(z \mid H_t)$, the conditional log-likelihood of the next event is bounded as:

$$\log p(s, t \mid H_t) \ge \log p_\theta(s, t \mid H_t, z) - \mathrm{KL}\big(q_\phi(z \mid H_t)\, \|\, p(z)\big) \tag{10}$$

$$= \log \lambda^*(s, t \mid z) - \int_{t_n}^{t} \lambda^*(\tau)\, d\tau - \mathrm{KL}(q\, \|\, p) \tag{11}$$

where $\phi$ represents the parameters of the encoder network and $\theta$ the parameters of the decoder network. $p(z)$ is the prior distribution, which we assume to be Gaussian. $\mathrm{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence between two distributions. We can maximize the objective function in Eqn. (11) w.r.t. the parameters $\phi$ and $\theta$ using back-propagation.
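The closed-form integral behind Eqn. (9) can be checked numerically. The sketch below is an illustration under stated assumptions: the spatial kernels are taken to integrate to one (so only the temporal part remains), all events satisfy $t_i \le t_n$, every $\beta_i$ is nonzero, and the triples in `events` are made-up values rather than decoded model outputs.

```python
import math

def temporal_intensity(t, events):
    """lambda*(t) = sum_i w_i * exp(-beta_i * (t - t_i)) after integrating out space."""
    return sum(w * math.exp(-b * (t - t_i)) for w, b, t_i in events)

def closed_form_integral(t_n, t, events):
    """sum_i (w_i / beta_i) * [k_t(t_n, t_i) - k_t(t, t_i)], the exponent in Eqn. (9)."""
    return sum(w / b * (math.exp(-b * (t_n - t_i)) - math.exp(-b * (t - t_i)))
               for w, b, t_i in events)

def trapezoid_integral(t_n, t, events, steps=20000):
    """Reference value: composite trapezoid rule applied to lambda*(tau) on [t_n, t]."""
    h = (t - t_n) / steps
    total = 0.5 * (temporal_intensity(t_n, events) + temporal_intensity(t, events))
    total += sum(temporal_intensity(t_n + k * h, events) for k in range(1, steps))
    return total * h

# Hypothetical (w_i, beta_i, t_i) triples; a negative beta_i gives a growing kernel.
events = [(0.8, 1.5, 0.0), (0.3, 0.7, 0.4), (1.1, -0.2, 0.9)]
t_n, t = 0.9, 2.0
exact = closed_form_integral(t_n, t, events)
approx = trapezoid_integral(t_n, t, events)
survival = math.exp(-exact)  # factor multiplying lambda*(s, t | z) in Eqn. (9)
```

The closed form costs one pass over the events, whereas the quadrature above needs one intensity evaluation per grid point.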

3. Related Work

Spatiotemporal Dynamics Learning. Modeling the spatiotemporal dynamics of a system in order to forecast the future is a fundamental task in many fields. Most work on spatiotemporal dynamics has focused on spatiotemporal data measured at regular space-time intervals, e.g., (Xingjian et al., 2015; Li et al., 2018; Yao et al., 2019; Fang et al., 2019; Geng et al., 2019). For discrete spatiotemporal events, statistical methods include space-time point processes, see (Moller and Waagepetersen, 2003; Mohler et al., 2011). Zhao et al. (2015) propose multi-task feature learning, whereas Yang et al. (2018) propose an RNN-based model to predict spatiotemporal check-in events. These discrete-time models assume data are sampled evenly and are thus unsuitable for our task.

Continuous Time Sequence Models. Continuous time sequence models provide an elegant approach for describing irregularly sampled time series. For example, (Chen et al., 2018; Jia and Benson, 2019; Dupont et al., 2019; Gholami et al., 2019; Finlay et al., 2020; Kidger et al., 2020; Norcliffe et al., 2021) assume the latent dynamics are continuous and can be modeled by an ODE. But for high-dimensional spatiotemporal processes, this approach can be computationally expensive. (2017) shows that there is no significant benefit of using continuous-time RNN for discrete event data. Special treatment is still needed for modeling unevenly sampled events.

Deep Point Process. Point processes are well studied in statistics (Moller and Waagepetersen, 2003; Daley and Vere-Jones, 2007; Reinhart et al., 2018). Deep point processes couple deep learning with point processes and have received considerable attention. For example, the neural Hawkes process applies RNNs to approximate the temporal intensity function (Du et al., 2016; Mei and Eisner, 2017; Xiao et al., 2017; Zhang et al., 2020), and Zuo et al. (2020) employ Transformers. Shang and Sun (2019) integrate a graph convolution structure.
However, all existing works focus on temporal point processes without spatial modeling. For datasets with spatial information, they discretize the space and treat them as discrete "markers". 

4. Experiments

We evaluate DeepSTPP for spatiotemporal prediction using both synthetic and real-world data.

Baselines. We compare DeepSTPP with the state-of-the-art models, including
• Spatiotemporal Hawkes Process (MLE) (Reinhart et al., 2018): it learns a spatiotemporal parametric intensity function using maximum likelihood estimation, see derivation in Appendix A.3.
• Recurrent Marked Temporal Point Process (RMTPP) (Du et al., 2016): it uses GRU to model the temporal intensity function. We modify this model to take spatial location as marks.
• Neural Spatiotemporal Point Process (NSTPP) (Chen et al., 2021): a neural point process model that parameterizes the spatial PDF and temporal intensity with continuous-time normalizing flows. Specifically, we use Jump CNF as it is a better fit for Hawkes processes.

All models are implemented in PyTorch and trained using the Adam optimizer. We set the number of representative points to be 100. The details of the implementation are deferred to Appendix C.1. For the baselines, we use the authors' original repositories whenever possible.

Datasets. We simulated two types of STPPs: the spatiotemporal Hawkes process (STH) and the spatiotemporal self-correcting process (STSC). For both STPPs, we generate three synthetic datasets, each with a different parameter setting, denoted as DS1, DS2, and DS3 in the tables. We also derive and implement efficient algorithms for simulating STPPs based on Ogata's thinning algorithm (Ogata, 1981). We view the simulator construction as an independent contribution of this work. The details of the simulation can be found in Appendix B. We use two real-world spatiotemporal event datasets from NSTPP (Chen et al., 2021) to benchmark the performance.
• Earthquakes Japan: catalog earthquake data including the location and time of all earthquakes in Japan from 1990 to 2020 with magnitude of at least 2.5, from the U.S. Geological Survey. There are 1,050 sequences in total. The number of events per sequence ranges from 19 to 545.
• COVID-19: daily county-level COVID-19 case data in New Jersey state published by The New York Times. There are 1,650 sequences and the number of events per sequence ranges from 7 to 305.

For both synthetic data and real-world data, we partition long event sequences into non-overlapping subsequences according to a fixed time range $T$. The target is the last event of each subsequence, and the input is the rest of the events. The number of input events varies across subsequences. We split each dataset into train/val/test sets with the ratio of 8:1:1. All results are the average of 3 runs.
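The partitioning described above can be sketched as follows. This is a hedged illustration, not the authors' preprocessing code: the function names are hypothetical, windows are aligned to the first event time, subsequences with fewer than two events are dropped, and the 8:1:1 split is taken in order without shuffling, all of which are assumptions.

```python
def partition_by_time(events, window):
    """Cut a time-sorted list of (location, time) events into non-overlapping windows.

    Returns (input_events, target_event) pairs, where the target is the last
    event in each window. Windows with fewer than two events are skipped
    because they have no input (an assumption of this sketch).
    """
    if not events:
        return []
    t0 = events[0][1]
    buckets = {}
    for ev in events:
        buckets.setdefault(int((ev[1] - t0) // window), []).append(ev)
    return [(b[:-1], b[-1]) for _, b in sorted(buckets.items()) if len(b) >= 2]

def split_811(samples):
    """Split samples into train/val/test with an 8:1:1 ratio, preserving order."""
    n_train, n_val = int(0.8 * len(samples)), int(0.1 * len(samples))
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Hypothetical toy sequence of (location, time) events.
events = [((0, 0), 0.0), ((0, 1), 0.5), ((1, 1), 0.9),
          ((1, 0), 1.2), ((2, 2), 1.8), ((3, 3), 3.1)]
samples = partition_by_time(events, window=1.0)
train, val, test = split_811(list(range(10)))
```

Because the windows are defined by time rather than by event count, the number of input events differs from one sample to the next, as noted in the text.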

4.1. Synthetic Experiment Results

For synthetic data, we know the ground-truth intensity function. We compare our method with the best possible estimator, the maximum likelihood estimator (MLE), as well as the NSTPP model. The MLE is learned by optimizing the log-likelihood using the BFGS algorithm. RMTPP can only learn the temporal intensity and is thus not included in this comparison.

Predictive log-likelihood. Table 1 shows the comparison of the predictive distribution for space and time. We report the log-likelihood (LL) of $f(s, t \mid H_t)$ and the Hellinger distance (HD) between the predictive distributions and the ground truth, averaged over time. On both the STH and STSC datasets with different parameter settings, DeepSTPP outperforms the baseline NSTPP in terms of LL and HD. This shows that DeepSTPP can estimate the spatiotemporal intensity more accurately for point processes with unknown parameters.

Temporal intensity estimate. Table 2 shows the mean absolute percentage error (MAPE) between the models' estimated temporal intensity and the ground truth $\lambda^*(t)$ over a short sampled range. On the STH datasets, since MLE has the correct parametric form, it is the theoretical optimum. Compared to the baselines, DeepSTPP generally obtains the same or lower MAPE. This shows that joint spatiotemporal modeling also improves the performance of temporal prediction.

Intensity visualization. Figure 3 visualizes the learned space-time intensity and the ground truth for STH and STSC, providing strong evidence that DeepSTPP can correctly learn the underlying dynamics of the spatiotemporal events. In particular, NSTPP has difficulty in modeling the complex dynamics of a multimodal distribution such as the spatiotemporal Hawkes process. NSTPP sometimes produces overly smooth intensity surfaces and loses most of the details at the peaks. In contrast, our DeepSTPP can better fit the multimodal distribution through the form of kernel summation and obtain more accurate intensity functions.
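For reference, the two error measures used above can be written down in a few lines. This is a generic sketch, not the paper's evaluation script: it assumes the predictive and ground-truth densities have been discretized on a common grid and normalized to sum to one, and the function names are illustrative.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same grid.

    Both inputs are assumed non-negative and normalized to sum to one;
    the result lies in [0, 1], with 0 meaning identical distributions.
    """
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def mape(estimate, truth):
    """Mean absolute percentage error (in percent) of an estimated intensity curve."""
    return 100.0 * sum(abs(e - g) / abs(g) for e, g in zip(estimate, truth)) / len(truth)

# Hypothetical discretized densities and intensity samples.
hd = hellinger([0.2, 0.5, 0.3], [0.25, 0.45, 0.3])
err = mape([1.1, 1.8, 3.3], [1.0, 2.0, 3.0])
```

Lower values are better for both measures, whereas higher is better for the log-likelihood.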

Computational efficiency. Figure 4 provides the run time comparison for the training between DeepSTPP and NSTPP for 100 epochs. To ensure a fair comparison, all experiments are conducted on one GTX 1080 Ti GPU with an Intel Core i7-4770 CPU and 64 GB of RAM. Our method is 100 times faster than NSTPP in training. This is mainly because our spatiotemporal kernel formulation has a closed-form integral, which bypasses the complex and cumbersome numerical integration.

For real-world data evaluation, we report the conditional spatial and temporal log-likelihoods, i.e., $\log f^*(s \mid t)$ and $\log f^*(t)$, of the final event given the input events, respectively. The total log-likelihood, $\log f^*(s, t)$, is the summation of the two values.

4.2. Real-World Experiment Results

Predictive performance. As our model is probabilistic, we compare against baseline models on the test predictive LL for space and time separately in Table 3. RMTPP can only produce the temporal intensity, thus we only include its time likelihood. We observe that DeepSTPP outperforms NSTPP most of the time in terms of accuracy. It takes only half of the time to train, as shown in Figure 4. Furthermore, we see that STPP models (first three rows) achieve higher LL compared with modeling only the time (RMTPP). This suggests an additional benefit of joint spatiotemporal modeling: it increases the time prediction ability.

As shown in Table 4, we see that (1) Shared decoders decrease the number of parameters but reduce the performance. (2) Separate process largely increases the number of parameters but has negligible influence on the test log-likelihood. (3) LSTM encoder: changing the encoder from Transformer to LSTM also results in slightly worse performance. Therefore, we validate the design of DeepSTPP: we assume all distribution parameters are governed by one single hidden stochastic process, with separate decoders and a Transformer as the encoder.

5. Conclusion

We propose a family of deep dynamics models for irregularly sampled spatiotemporal events. Our model, Deep Spatiotemporal Point Process (DeepSTPP), integrates a principled spatiotemporal point process with deep neural networks. We derive a tractable inference procedure by modeling the space-time intensity function as a composition of kernel functions and a latent stochastic process. We infer the latent process with neural networks following the variational inference procedure. Using synthetic data from the spatiotemporal Hawkes process and the self-correcting process, we show that our model can learn the spatiotemporal intensity accurately and efficiently. We demonstrate superior forecasting performance on many real-world benchmark spatiotemporal event datasets. Future work includes further considering the mutual-exciting structure in the intensity function, as well as modeling multiple heterogeneous spatiotemporal processes simultaneously.

Appendix A. Model Details

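A.1. Spatiotemporal Point Process Derivation

Conditional Density. The intensity function and the probability density function of an STPP are related:

$$f(s, t \mid H_t) = \lambda^*(s, t)\,\big(1 - F^*(s, t)\big) = \lambda^*(s, t) \exp\Big(-\int_S \int_{t_n}^{t} \lambda^*(u, \tau)\, d\tau\, du\Big) = \lambda^*(s, t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big)$$

The last equation uses the relation $\lambda^*(s, t) = \lambda^*(t) f^*(s \mid t)$, according to Daley and Vere-Jones (2007), Chapter 2.3 (4). Here $\lambda^*(t)$ is the time intensity and $f^*(s \mid t) := f(s \mid t, H_t)$ is the spatial PDF that the next event will be at location $s$ given time $t$. According to Daley and Vere-Jones (2007), Chapter 15.4, we can also view an STPP as a type of TPP with continuous (spatial) marks.

Likelihood. Given an STPP, the log-likelihood of observing a sequence $H_t = \{(s_1, t_1), (s_2, t_2), \dots, (s_n, t_n)\}_{t_n \le t}$ is given by:

$$\begin{aligned}
\mathcal{L}(H_{t_n}) &= \log\Big[\prod_{i=1}^{n} f(s_i, t_i \mid H_{t_{i-1}})\,\big(1 - F^*(s, t)\big)\Big] \\
&= \sum_{i=1}^{n}\Big[\log \lambda^*(s_i, t_i) - \int_S \int_{t_{i-1}}^{t_i} \lambda^*(s, \tau)\, d\tau\, ds\Big] + \log\big(1 - F^*(s, t)\big) \\
&= \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \int_S \int_{0}^{t_n} \lambda^*(s, \tau)\, d\tau\, ds - \int_S \int_{t_n}^{T} \lambda^*(s, \tau)\, d\tau\, ds \\
&= \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \int_S \int_{0}^{T} \lambda^*(s, \tau)\, d\tau\, ds \\
&= \sum_{i=1}^{n} \log \lambda^*(t_i) + \sum_{i=1}^{n} \log f^*(s_i \mid t_i) - \int_{0}^{T} \lambda^*(\tau)\, d\tau
\end{aligned}$$

Inference. With a trained STPP and a sequence of history events, we can predict the next event time and location using their expectations, which evaluate to

$$\mathbb{E}[t_{n+1} \mid H_{t_n}] = \int_{t_n}^{\infty} t \int_S f(s, t \mid H_{t_n})\, ds\, dt = \int_{t_n}^{\infty} t \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) \int_S \lambda^*(s, t)\, ds\, dt = \int_{t_n}^{\infty} t \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) \lambda^*(t)\, dt$$

The predicted location for the next event is, as in Section 2.1:

$$\mathbb{E}[s_{n+1} \mid H_{t_n}] = \int_{t_n}^{\infty} \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) \int_S s\, \lambda^*(s, t)\, ds\, dt$$

A.2. DeepSTPP Derivation

Also notice that $f^*(s, t) = f^*(s \mid t) f^*(t)$, $\lambda^*(s, t) = f^*(s \mid t) \lambda^*(t)$, and $\lambda^*(t) = \dfrac{f^*(t)}{1 - F^*(t)}$. Therefore

$$f^*(s, t) = f^*(s \mid t) f^*(t) = f^*(s \mid t)\, \lambda^*(t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) = \lambda^*(s, t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big)$$

For DeepSTPP, the spatiotemporal intensity is

$$\lambda^*(s, t) = \sum_i w_i \exp\big(-\beta_i (t - t_i)\big)\, k_s(s, s_i)$$

The temporal intensity simply removes $k_s$, which integrates to one over $S$, so the bandwidth does not matter:

$$\lambda^*(t) = \sum_i w_i \exp\big(-\beta_i (t - t_i)\big)$$

Integrating $\lambda^*(\tau)$ yields

$$\int \lambda^*(\tau)\, d\tau = -\sum_i \frac{w_i}{\beta_i} \exp\big(-\beta_i (\tau - t_i)\big) + C$$

Note that differentiating the exponential contributes the factor $-\beta_i$, which cancels the $-1/\beta_i$. The definite integral is

$$\int_{t_n}^{t} \lambda^*(\tau)\, d\tau = -\sum_i \frac{w_i}{\beta_i} \Big[\exp\big(-\beta_i (t - t_i)\big) - \exp\big(-\beta_i (t_n - t_i)\big)\Big]$$

Replacing the integral in the original formula then yields

$$f^*(s, t) = \lambda^*(s, t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big) = \lambda^*(s, t) \exp\Big(\sum_i \frac{w_i}{\beta_i} \Big[\exp\big(-\beta_i (t - t_i)\big) - \exp\big(-\beta_i (t_n - t_i)\big)\Big]\Big)$$

Substituting the temporal kernel function $k_t(t, t_i) = \exp(-\beta_i (t - t_i))$, we reach the closed-form formula.

Inference. The expectation of the next event time is

$$\mathbb{E}^*[t_i] = \int_{t_{i-1}}^{\infty} t f^*(t)\, dt = \int_{t_{i-1}}^{\infty} t\, \lambda^*(t) \exp\Big(-\int_{t_{i-1}}^{t} \lambda^*(\tau)\, d\tau\Big)\, dt$$

where the inner integral has a closed form, so the expectation requires only 1D numerical integration. Given the predicted time $\hat{t}_i$, the expectation of the location can be efficiently approximated by

$$\mathbb{E}^*[s_i] \approx \mathbb{E}^*[s_i \mid \hat{t}_i] = \sum_{i' < i} \alpha^{-1} w_{i'}\, k_t(\hat{t}_i, t_{i'})\, s_{i'}$$

where $\alpha = \sum_{i' < i} w_{i'}\, k_t(\hat{t}_i, t_{i'})$ is a normalizing coefficient.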
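The inference formulas above can be sketched as follows. This is an illustration under assumptions, not the released implementation: the infinite upper limit is truncated at a finite `horizon`, a plain trapezoid rule stands in for whatever quadrature the authors use, every $\beta_i$ is nonzero, and the `events` tuples are made-up values. Note also that when every $\beta_i > 0$ the total intensity mass is finite, so the truncated integral is only an approximation of the expectation.

```python
import math

def lam(t, events):
    """Temporal intensity: sum_i w_i * exp(-beta_i * (t - t_i))."""
    return sum(w * math.exp(-b * (t - t_i)) for w, b, t_i, _ in events)

def compensator(t_start, t, events):
    """Closed-form integral of lam over [t_start, t]."""
    return sum(w / b * (math.exp(-b * (t_start - t_i)) - math.exp(-b * (t - t_i)))
               for w, b, t_i, _ in events)

def expected_next_time(t_prev, events, horizon=30.0, steps=30000):
    """Trapezoid rule for the integral of t * lam(t) * exp(-compensator) on a finite window."""
    h = horizon / steps
    vals = []
    for k in range(steps + 1):
        t = t_prev + k * h
        vals.append(t * lam(t, events) * math.exp(-compensator(t_prev, t, events)))
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def expected_location(t_hat, events):
    """Weighted average of past locations with weights w_i * k_t(t_hat, t_i)."""
    weights = [w * math.exp(-b * (t_hat - t_i)) for w, b, t_i, _ in events]
    alpha = sum(weights)  # normalizing coefficient
    x = sum(wt * ev[3][0] for wt, ev in zip(weights, events)) / alpha
    y = sum(wt * ev[3][1] for wt, ev in zip(weights, events)) / alpha
    return (x, y)

# Hypothetical (w_i, beta_i, t_i, s_i) tuples for two past events.
events = [(1.0, 0.5, 0.0, (0.0, 0.0)), (1.0, 0.5, 0.0, (2.0, 4.0))]
t_hat = expected_next_time(1.0, events)
s_hat = expected_location(1.0, events)
```

Only the time expectation needs a numerical integral; the location estimate is a single weighted average once the predicted time is fixed.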

A.3. Spatiotemporal Hawkes Process Derivation

Spatiotemporal Hawkes process (STHP). The spatiotemporal Hawkes (or self-exciting) process is one of the most well-known STPPs. It assumes every past event has an additive, positive, decaying, and spatially local influence over future events. Such a pattern resembles neuronal firing and earthquakes. The spatiotemporal Hawkes process is characterized by the following intensity function (Reinhart et al., 2018):

$$\lambda^*(s, t) := \mu g_0(s) + \sum_{i: t_i < t} g_1(t, t_i)\, g_2(s, s_i), \quad \mu > 0$$

where $g_0(s)$ is the probability density of a distribution over $S$, $g_1$ is the triggering kernel, often implemented as the exponential decay function $g_1(\Delta t) := \alpha \exp(-\beta \Delta t)$ with $\alpha, \beta > 0$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at $s_i$.

Maximum Likelihood. For the spatiotemporal Hawkes process, we pre-specify the model kernels $g_0(s)$ and $g_2(s, s_j)$ to be Gaussian:

$$g_0(s) := \frac{1}{2\pi} |\Sigma_{g0}|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2} (s - s_\mu)\, \Sigma_{g0}^{-1}\, (s - s_\mu)^T\Big) \tag{15}$$

$$g_2(s, s_j) := \frac{1}{2\pi} |\Sigma_{g2}|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2} (s - s_j)\, \Sigma_{g2}^{-1}\, (s - s_j)^T\Big)$$

Specifically for the STHP, the second term in the STPP likelihood evaluates to

$$\begin{aligned}
\int_0^T \lambda^*(\tau)\, d\tau &= \mu T + \alpha \int_0^T \int_0^{\tau} e^{-\beta(\tau - u)}\, dN(u)\, d\tau \\
&\qquad (0 \le u \le \tau,\ 0 \le \tau \le T) \rightarrow (u \le \tau \le T,\ 0 \le u \le T) \\
&= \mu T + \alpha \int_0^T \int_u^T e^{-\beta(\tau - u)}\, d\tau\, dN(u) \\
&= \mu T - \frac{\alpha}{\beta} \int_0^T \Big[e^{-\beta(T - u)} - 1\Big]\, dN(u) \\
&= \mu T - \frac{\alpha}{\beta} \sum_{i=0}^{N} \Big[e^{-\beta(T - t_i)} - 1\Big]
\end{aligned}$$

Finally, the STHP log-likelihood is

$$\mathcal{L} = \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \mu T + \frac{\alpha}{\beta} \sum_{i=0}^{N} \Big[e^{-\beta(T - t_i)} - 1\Big]$$

This model has 11 scalar parameters: 2 for $s_\mu$, 3 for $\Sigma_{g0}$, 3 for $\Sigma_{g2}$, and $\alpha$, $\beta$, and $\mu$. We directly estimate $s_\mu$ as the mean of $\{s_i\}_0^n$, and then estimate the other 9 parameters by minimizing the negative log-likelihood using the BFGS algorithm. $T$ in the likelihood function is treated as $t_n$.
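The log-likelihood above is simple enough to evaluate directly. The following is a minimal sketch under assumptions: the function and argument names are invented for illustration, covariances are passed as symmetric 2x2 nested tuples, $T$ is taken to be the last event time as stated in the text, and no optimizer (BFGS or otherwise) is included.

```python
import math

def gauss2d(s, mean, cov):
    """2D Gaussian density with a full symmetric covariance ((a, b), (b, c))."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    det = a * c - b * b
    dx, dy = s[0] - mean[0], s[1] - mean[1]
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det  # (s-m) cov^{-1} (s-m)^T
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def sthp_log_likelihood(events, mu, alpha, beta, s_mu, cov_g0, cov_g2):
    """Sum of log-intensities at the events minus the closed-form compensator."""
    T = events[-1][1]  # T is treated as the last event time t_n
    log_sum = 0.0
    for i, (s_i, t_i) in enumerate(events):
        lam = mu * gauss2d(s_i, s_mu, cov_g0)
        for s_j, t_j in events[:i]:  # contributions from earlier events
            lam += alpha * math.exp(-beta * (t_i - t_j)) * gauss2d(s_i, s_j, cov_g2)
        log_sum += math.log(lam)
    comp = mu * T - alpha / beta * sum(math.exp(-beta * (T - t_i)) - 1.0
                                       for _, t_i in events)
    return log_sum - comp

# Hypothetical events and parameters.
identity = ((1.0, 0.0), (0.0, 1.0))
events = [((0.0, 0.0), 0.5), ((0.2, -0.1), 1.0), ((0.1, 0.3), 2.0)]
ll = sthp_log_likelihood(events, mu=1.0, alpha=0.5, beta=1.0,
                         s_mu=(0.0, 0.0), cov_g0=identity, cov_g2=identity)
```

A function of this shape could be handed to a generic optimizer after negating its value, which is how the text describes the MLE baseline being fitted.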
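Inference. Based on the general formulas in Appendix A.1, and also noting that for an STHP,

$$\begin{aligned}
\int_{t_{i-1}}^{t} \lambda^*(\tau)\, d\tau &= \int_0^{t} \lambda^*(\tau)\, d\tau - \int_0^{t_{i-1}} \lambda^*(\tau)\, d\tau \\
&= \Big[\mu t - \frac{\alpha}{\beta} \sum_{j=0}^{i-1} \big(e^{-\beta(t - t_j)} - 1\big)\Big] - \Big[\mu t_{i-1} - \frac{\alpha}{\beta} \sum_{j=0}^{i-1} \big(e^{-\beta(t_{i-1} - t_j)} - 1\big)\Big] \\
&= \mu (t - t_{i-1}) - \frac{\alpha}{\beta} \sum_{j=0}^{i-1} \Big[e^{-\beta(t - t_{i-1} + t_{i-1} - t_j)} - e^{-\beta(t_{i-1} - t_j)}\Big] \\
&= \mu (t - t_{i-1}) - \frac{\alpha}{\beta} \big(e^{-\beta(t - t_{i-1})} - 1\big) \sum_{j=0}^{i-1} e^{-\beta(t_{i-1} - t_j)}
\end{aligned}$$

and

$$\begin{aligned}
\int_S s\, \mu\, g_0(s)\, ds &= \mu s_\mu \\
\int_S s \sum_{i=0}^{n} g_1(t, t_i)\, g_2(s, s_i)\, ds &= \sum_{i=0}^{n} g_1(t, t_i) \int_S s\, g_2(s, s_i)\, ds = \sum_{i=0}^{n} g_1(t, t_i)\, s_i \\
\int_S s\, \lambda^*(s, t)\, ds &= \mu s_\mu + \sum_{i=0}^{n} g_1(t, t_i)\, s_i,
\end{aligned}$$

we have

$$\mathbb{E}[t_i \mid H_{t_{i-1}}] = \int_{t_{i-1}}^{\infty} t \Big(\mu + \alpha \sum_{j=0}^{i-1} e^{-\beta(t - t_j)}\Big) \exp\Big(\frac{\alpha}{\beta} \big(e^{-\beta(t - t_{i-1})} - 1\big) \sum_{j=0}^{i-1} e^{-\beta(t_{i-1} - t_j)} - \mu (t - t_{i-1})\Big)\, dt$$

and

$$\mathbb{E}[s_i \mid H_{t_{i-1}}] = \int_{t_{i-1}}^{\infty} \Big(\mu s_\mu + \alpha \sum_{j=0}^{i-1} e^{-\beta(t - t_j)} s_j\Big) \exp\Big(\frac{\alpha}{\beta} \big(e^{-\beta(t - t_{i-1})} - 1\big) \sum_{j=0}^{i-1} e^{-\beta(t_{i-1} - t_j)} - \mu (t - t_{i-1})\Big)\, dt$$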

Both require only 1D numerical integration.

Spatiotemporal Self-Correcting process (STSCP). A lesser-known example is the self-correcting spatiotemporal point process (Isham and Westcott, 1979). It assumes that the background intensity increases with a varying speed at different locations, and that the arrival of each event reduces the intensity nearby. The next event is likely to be in a high-intensity region with no recent events. The spatiotemporal self-correcting process is capable of modeling some regular event sequences, such as an alternating home-to-work travel sequence. It has the following intensity function:

$$\lambda^*(s, t) = \mu \exp\Big(g_0(s)\beta t - \sum_{i: t_i < t} \alpha\, g_2(s, s_i)\Big), \quad \alpha, \beta, \mu > 0 \tag{17}$$

Here $g_0(s)$ is the density of a distribution over $S$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at $s_i$.

When simulating a process with decreasing inter-event intensity, such as the Hawkes process, $M^*(t)$ and $L^*(t)$ can be simply chosen to be $\lambda^*(t)$ and $\infty$. When simulating a process with increasing inter-event intensity, such as the self-correcting process, $L^*(t)$ is often empirically chosen to be $2/\lambda^*(t)$, since the next event is very likely to arrive before twice the mean interval length at the beginning of the interval. $M^*(t)$ is therefore $\lambda^*(t + L^*(t))$.

B.2. STPP Simulation

It has been mentioned in Section 2.1 that an STPP can be seen as attaching the locations sampled from $f^*(s \mid t)$ to the events generated by a TPP. Simulating an STPP is basically adding one step to Algorithm 1: sample a new location from $f^*(s \mid t)$ after retaining a new event at $t$. As for a spatiotemporal self-correcting process, neither $f^*(s, t)$ nor $\lambda^*(t)$ has a closed form, so the process's spatial domain has to be discretized for simulation. $\lambda^*(t)$ can be approximated by $\sum_{s \in S} \lambda^*(s, t) / |S|$, where $S$ is the set of discretized coordinates. $L^*(t)$ and $M^*(t)$ are chosen to be $2/\lambda^*(t)$ and $\lambda^*(t + L^*(t))$. Since $f^*(s \mid t)$ is proportional to $\lambda^*(s, t)$, sampling a location from $f^*(s \mid t)$ is implemented as sampling from a multinomial distribution whose probability mass function is the normalized $\lambda^*(s, t)$.
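The discretized location-sampling step can be sketched as below. This is an illustrative example, not the authors' simulator: the grid resolution, the toy intensity function `toy_intensity`, and all names are assumptions chosen only to make the snippet self-contained.

```python
import math
import random

def approx_temporal_intensity(grid, intensity_at, t):
    """Approximate lambda*(t) by averaging lambda*(s, t) over the discretized coordinates."""
    return sum(intensity_at(s, t) for s in grid) / len(grid)

def sample_location(grid, intensity_at, t, rng):
    """Draw one grid cell with probability proportional to lambda*(s, t)."""
    weights = [intensity_at(s, t) for s in grid]
    # random.choices normalizes the weights internally, giving the multinomial draw.
    return rng.choices(grid, weights=weights, k=1)[0]

def toy_intensity(s, t):
    """Hypothetical intensity peaked at (0.5, 0.5) and growing with time."""
    d2 = (s[0] - 0.5) ** 2 + (s[1] - 0.5) ** 2
    return math.exp(0.1 * t - 20.0 * d2)

# 5 x 5 grid over the unit square.
grid = [(x / 4.0, y / 4.0) for x in range(5) for y in range(5)]
rng = random.Random(0)
location = sample_location(grid, toy_intensity, t=1.0, rng=rng)
rate = approx_temporal_intensity(grid, toy_intensity, t=1.0)
```

A finer grid gives a closer approximation of $f^*(s \mid t)$ at the cost of more intensity evaluations per retained event.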

B.3. STHP Simulation

To simulate a spatiotemporal Hawkes process with a Gaussian kernel, we mainly followed an efficient procedure proposed by Zhuang (2004), which exploits the clustering structure of the Hawkes process and thus does not require repeated calculations of $\lambda^*(s, t)$.
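A cluster-based simulation of this kind can be sketched as below. The sketch rests on assumptions that are not stated in this section: an exponential temporal triggering kernel $\alpha e^{-\beta \Delta t}$ (so each event has a Poisson number of offspring with mean $\alpha/\beta$, which must be below 1 for the cascade to die out), isotropic Gaussian spatial kernels, and hypothetical names and parameter values. It is not Zhuang's (2004) exact procedure.

```python
import random

def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate exponential arrivals in [0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival <= mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_sthp(T, mu=1.0, alpha=0.5, beta=1.0, center=(0.0, 0.0),
                  sigma_bg=1.0, sigma_off=0.1, seed=0):
    """Simulate events (t, (x, y)) on [0, T] generation by generation."""
    rng = random.Random(seed)
    # Generation 0: homogeneous Poisson(mu) in time, Gaussian background locations.
    generation = [
        (rng.uniform(0.0, T),
         (rng.gauss(center[0], sigma_bg), rng.gauss(center[1], sigma_bg)))
        for _ in range(poisson(rng, mu * T))
    ]
    events = []
    while generation:
        events.extend(generation)
        offspring = []
        for t_parent, (x, y) in generation:
            # Each event triggers Poisson(alpha / beta) children (assumed kernel).
            for _ in range(poisson(rng, alpha / beta)):
                t_child = t_parent + rng.expovariate(beta)  # delay ~ Exp(beta)
                if t_child < T:
                    offspring.append(
                        (t_child, (rng.gauss(x, sigma_off), rng.gauss(y, sigma_off))))
        generation = offspring
    return sorted(events)  # tuples sort by time first
```

Each generation is drawn directly from its parents, so the full intensity $\lambda^*(s, t)$ is never evaluated, which is the efficiency gain the clustering structure provides.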



The statistics differ slightly from the original paper due to updates in the data source.



Figure 1: Illustration of learning a spatiotemporal point process. We aim to learn the space-time intensity function given the historical event sequence and representative points as background.

Figure 3: Ground-truth and learned intensity on two synthetic datasets. Top: ground-truth intensity. Middle: intensity learned by our DeepSTPP model. Bottom: conditional intensity learned by NSTPP. 'X's mark the event history; a smaller 'X' indicates a larger time difference.

Figure 4: Log train time comparison on all datasets

Algorithm 1 Ogata Modified Thinning Algorithm for Simulating a TPP
1: Input: Interval [0, T], model parameters
2: t ← 0, H ← ∅
3: while true do
4: Compute m ← M(t|H), l ← L(t|H)
5: Draw ∆t ∼ Exp(m) (exponential distribution with mean 1/m)
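The listing above is cut off after step 5 in this copy. As a hedged illustration, the following sketch shows a standard thinning loop consistent with the steps that survive; the remaining steps are filled in from the usual formulation of Ogata's method rather than from the lost listing. The function names and the exponential-kernel Hawkes intensity used for the demonstration are assumptions.

```python
import math
import random

def hawkes_intensity(t, history, mu=1.0, alpha=0.5, beta=1.0):
    """Illustrative exponential-kernel Hawkes intensity; counts events with t_i <= t."""
    return mu + sum(alpha * math.exp(-beta * (t - t_i)) for t_i in history if t_i <= t)

def simulate_tpp(intensity, upper_bound, lookahead, T, seed=0):
    """Thinning: upper_bound(t, H) must dominate the intensity on [t, t + lookahead(t, H)]."""
    rng = random.Random(seed)
    t, history = 0.0, []
    while True:
        m = upper_bound(t, history)
        l = lookahead(t, history)
        dt = rng.expovariate(m)  # exponential with mean 1/m
        if dt > l:
            # Candidate falls outside the window where the bound is valid: skip ahead.
            t += l
        else:
            t += dt
            # Retain the candidate with probability intensity / m.
            if t < T and rng.random() * m <= intensity(t, history):
                history.append(t)
        if t >= T:
            return history
```

For a Hawkes process the intensity only decays between events, so the current intensity can serve as the bound with an infinite look-ahead, for example `simulate_tpp(hawkes_intensity, hawkes_intensity, lambda t, h: math.inf, T=50.0)`.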

Okawa et al. (2019) extend Du et al. (2016) for spatiotemporal event prediction, but they only predict the density instead of the next location and time of the event. Zhu et al. (2019) parameterize the spatial kernel with a neural network embedding without considering the temporal sequence. Recently, Chen et al. (2021) proposed the neural spatiotemporal point process (NSTPP), which combines continuous-time neural networks with continuous-time normalizing flows to parameterize spatiotemporal point processes. However, this approach is computationally expensive because it requires evaluating an ODE solver over multiple time steps.


Estimated $\lambda^*(t)$ MAPE on synthetic data

Test log likelihood (LL) comparison for space and time on real-world data over 3 runs.

Test LL for alternative model designs over 3 runs.

We use one shared decoder to generate model parameters. Shared decoders input the sampled $z$ to one decoder and partition its output to generate model parameters. (2) Separate process: We assume that each of the $\{w_i, \beta_i, \gamma_i\}$ follows a separate latent process and we sample them separately. Separate processes use three sets of means and variances to sample $\{w_i, \beta_i, \gamma_i\}$

Computational Complexity. It is worth noting that both learning and inference require the conditional intensity. If the conditional intensity has no analytic formula, then we need to compute numerical integration over $S$. Evaluating the likelihood or either expectation then requires at least a triple integral. Note that $E[t_i | H_{t_{i-1}}]$ and $E[s_i | H_{t_{i-1}}]$ are actually sextuple integrals, but we can memoize all $\lambda^*(s, t)$ from $t = t_{i-1}$ to $t \gg t_{i-1}$ to avoid recomputing the intensities. However, memoization leads to high space complexity. As a result, we generally want to avoid an intractable conditional intensity in the model.

Appendix B. Simulation Details

In this appendix, we discuss a general algorithm for simulating any STPP, and a specialized algorithm for simulating an STHP. Both are based on an algorithm for simulating any TPP.

B.1. TPP Simulation

The most widely used technique to simulate a temporal point process is Ogata's modified thinning algorithm, as shown in Algorithm 1 (Daley and Vere-Jones, 2007). It is a rejection technique: it samples points from a stationary Poisson process whose intensity is always higher than the ground-truth intensity, and then randomly discards some samples to recover the ground-truth intensity.

The algorithm requires picking the forms of $M^*(t)$ and $L^*(t)$ such that

$$\sup_{\Delta t \in [0,\, L^*(t)]} \lambda^*(t + \Delta t) \le M^*(t).$$

In other words, $M^*(t)$ is an upper bound of the actual intensity in $[t, t + L^*(t)]$. Note that if $M^*(t)$ is chosen too high, most sampled points are rejected and the simulation becomes inefficient.

Algorithm 2 Simulating spatiotemporal Hawkes process with Gaussian kernel
1: Generate the background events $G^{(0)}$ with the intensity $\lambda^*(s, t) = \mu g_0(s)$, i.e., simulate a homogeneous Poisson process Pois($\mu$) and sample each event's location from a bivariate Gaussian distribution $N(s_\mu, \Sigma)$
end for
8: end while
9: return S

B.4. Parameter Settings

For the synthetic dataset, we pre-specified both the STSCP's and the STHP's kernels $g_0(s)$ and $g_2(s, s_j)$ to be Gaussian. The STSCP is defined on $S = [0, 1] \times [0, 1]$, while the STHP is defined on $S = \mathbb{R}^2$. The STSCP's kernel functions are normalized according to their cumulative probability on $S$. Table 5 shows the simulation parameters. The STSCP's spatial domain is discretized as a 101 × 101 grid during the simulation.

Appendix C. Experiment Details

In this section, we include experiment configurations and some additional experiment results.

C.1. Model Setup Details

To clarify the DeepSTPP configuration, we list the detailed hyperparameter settings in Table 6. We use the same set of hyperparameters across all datasets.

