NEURAL MULTI-EVENT FORECASTING ON SPATIO-TEMPORAL POINT PROCESSES USING PROBABILISTICALLY ENRICHED TRANSFORMERS

Abstract

Predicting discrete events in time and space has many scientific applications, such as forecasting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for multi-event forecasting of spatio-temporal point processes that combines transformers with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.

1. INTRODUCTION

Predicting the occurrence of discrete events in time and space has been the focus of many scientific studies and applications. Problems such as predicting earthquake hazards (Ogata, 1998; Chen et al., 2020), the spread of infectious diseases over a population (Meyer et al., 2012; Schoenberg et al., 2019), mobility and traffic in cities (Du et al., 2016), and neuronal spikes in the brain (Perkel et al., 1967) fall under this category and have attracted considerable interest. Over the years, many works have used the history-dependent spatio-temporal Hawkes process (Ozaki, 1979; Ogata, 1988; 1998; Ogata & Zhuang, 2006; Ogata et al., 1993; Nandan et al., 2017; Sornette & Werner, 2005; Zhuang, 2012; Helmstetter & Sornette, 2003; Bansal & Ogata, 2013) to model these point events. The stochasticity and the excitatory history-dependency of the Hawkes process are modeled by a conditional intensity function that varies in time and space. Assuming n events and their associated markers (t, x, M) collected in the history H_t = {(t_i, x_i, M_i) | t_i < t, i = 1 : n}, the Hawkes intensity function is defined as

λ(t, x|H_t) := lim_{∆t,∆x→0} P_{∆t,∆x}(t, x|H_t) / (|B(x, ∆x)| ∆t) = µ_θ(x) + Σ_{i: t_i < t} g_ϕ(t − t_i, x − x_i, M_i),   (1)

where P_{∆t,∆x}(t, x|H_t) denotes the history-dependent probability of having an event in the small time interval [t, t + ∆t) and the small ball B(x, ∆x) centered at x ∈ R^d with radius ∆x ∈ R^d, and H_t denotes all events happening up to but not including time t. The functions µ_θ and g_ϕ represent the (parametric) background intensity function and spatio-temporal kernel, respectively, which together form λ(t, x|H_t). A variety of parametric forms for both µ_θ and g_ϕ have been proposed (Ogata, 1998). Despite their use across different domains, Hawkes processes have several limitations. First, one has to predetermine parametric forms for µ_θ and g_ϕ.
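To make the role of the intensity concrete, the following is a minimal temporal-only sketch of (1) with the classical exponential kernel; space and markers are omitted, and the values of µ and β are arbitrary illustrative choices, not parameters from the paper:

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.5, beta=1.0):
    """Temporal Hawkes intensity with an exponential decay kernel:
    lambda(t | H_t) = mu + sum_{t_i < t} (1 / beta) * exp(-(t - t_i) / beta)."""
    past = np.asarray([ti for ti in history if ti < t])
    if past.size == 0:
        return mu
    return mu + np.sum(np.exp(-(t - past) / beta)) / beta

# three past events excite the process, so the intensity at t = 2.0 exceeds mu
history = [0.2, 0.9, 1.5]
lam = hawkes_intensity(2.0, history)
```

Each past event adds an exponentially decaying excitation, so events cluster in time; far from any past event the intensity relaxes back to the background rate µ.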
This limits the expressive power of the model and, more importantly, might lead to a mismatch between the proposed model and one that better represents the observed data. Second, (1) imposes similar behavior across all events over time, meaning that every preceding event has the same form of impact on the occurrence of future events. For example, in the classical Hawkes process with the temporal decaying kernel g_ϕ(t − t_i) = (1/β) exp(−(t − t_i)/β), the same parameter β is applied to all events via their corresponding time differences. Third, the intensity function λ(t, x|H_t) can be used to predict only one step ahead via, e.g., the commonly-used first-order moment (Rasmussen, 2011; Snyder & Miller, 2012) given by

E[t_{n+1}|H_{t_n}] = ∫_{t_n}^∞ t ∫_V λ(t, x|H_{t_n}) exp(−∫_V ∫_{t_n}^t λ(u, v|H_{t_n}) du dv) dx dt,
E[x_{n+1}|H_{t_n}] = ∫_V x ∫_{t_n}^∞ λ(t, x|H_{t_n}) exp(−∫_V ∫_{t_n}^t λ(u, v|H_{t_n}) du dv) dt dx,   (2)

where solving the integral in a continuous high-dimensional space can be computationally expensive, or inaccurate when using sampling methods such as Monte Carlo sampling (Hastings, 1970). Fourth, although (2) makes it feasible to predict the next event, multi-event prediction demands sequential application of the first-order moment, where at each step the history is updated with the most recently predicted event. This makes multi-event prediction challenging and prone to error accumulation. To alleviate some of these shortcomings, previous works have proposed learning the intensity function λ(t, x|H_t) directly from data using neural networks. Early studies in this regard (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017) attempted to learn λ(t, x|H_t) using a recurrent neural network (RNN) and variants thereof. In a recent work, Chen et al. (2020) designed a neural ordinary differential equation (ODE) architecture that can learn a continuous intensity function.
This is attained by combining jump and attentive continuous normalizing flows (CNFs) (Chen et al., 2018) for space with an RNN architecture for time. Despite being data-driven, these works do not tackle the last three drawbacks mentioned above. Moreover, due to the memory limitations of RNNs, these works are incapable of unveiling long-term dependencies. RNNs and their variants also suffer from vanishing and exploding gradients, especially when applied to long sequences. Furthermore, sequential training of both RNNs and jump CNFs as proposed by Chen et al. (2018) results in a slow and computationally expensive training process. Other recent works (Zuo et al., 2020; Zhou et al., 2022) proposed using attention mechanisms (already present in the attentive CNF of Chen et al. (2020)) to alleviate the second drawback mentioned earlier. The capability of discovering long-term dependencies while achieving fast training (via parallelization) is another benefit of this line of work. However, the last two drawbacks persist in those approaches, hindering their utility for multi-event forecasting. Directly learning the temporal distribution using variational autoencoders (VAEs) (Kingma & Welling, 2014) augmented with CNFs (Chen et al., 2018) as proposed by Mehrasa et al. (2019), or utilizing the deep sigmoidal flow (DSF) (Huang et al., 2018) and the sum-of-squares (SOS) polynomial flow (Jaini et al., 2019) as proposed by Shchur et al. (2019), can tackle the first and third limitations. However, the second and fourth shortcomings remain unsolved. In another recent work, Zhu et al. (2021) proposed learning the Hawkes intensity function based on Gaussian diffusion kernels while using imitation learning as a flexible model-fitting approach. However, their method does not tackle any of the aforementioned shortcomings.
Therefore, in the present work, we introduce a novel neural network capable of simultaneous spatio-temporal multi-event forecasting that addresses all of the above-mentioned shortcomings. To the best of our knowledge, this is the first work proposing a data-driven multi-event forecasting network for stochastic point processes. Our architecture augments the encoder and decoder blocks of a transformer (Vaswani et al., 2017) from natural language processing (NLP) with probabilistic and bijective layers. This network design provides a batch of rich multi-event spatio-temporal distributions associated with multiple future events. We propose to directly learn/predict these distributions (without the need to learn the intensity function, as opposed to previous works) using self-supervised learning (Liu et al., 2021), marking a further point of departure from existing work. The rest of this paper is organized as follows. In Section 2, we provide background on attention mechanisms and normalizing flows, since these are constituent blocks of our solution. In Section 3, we give a formal definition of the problem to be solved along with our proposed neural multi-event forecasting network. In Section 4, we evaluate the performance of our network on a variety of datasets and compare it with baseline models. Finally, in Section 5, we conclude the paper and outline future directions. Our contributions can be summarized as follows:
• We introduce a neural architecture that is capable of simultaneous multi-event forecasting of the time and location of discrete events in continuous time and space.
• We compare the performance of our solution with state-of-the-art models through extensive experiments on a variety of datasets that represent stochastic point events.

2. BACKGROUND

2.1. ATTENTION AND TRANSFORMERS

Attention blocks play a critical role in the transformer architecture (Vaswani et al., 2017). Here we introduce their fundamental pieces, already specialized to the problem at hand. Let {κ_i = [t_i, x_i, M_i]}_{i=1:n} be a sequence of (column) vectors κ_i ∈ R^{d+2} representing n discrete events, formed by concatenating the associated time t_i ∈ R and markers x_i ∈ R^d and M_i ∈ R. The marker x_i denotes the location of the event at time t_i in a d-dimensional space, and M_i can be used to encode any other information of interest about the event, such as the magnitude of an earthquake or a biker's age. Even though we represent M_i here as a scalar, these markers can also be higher dimensional. Moreover, the events κ_i are temporally sorted such that t_i < t_j for i < j. Using learnable linear transformations, we form query, key, and value vectors (denoted by q_i, k_i, and v_i, respectively) for each event κ_i as

q_i = W_q κ_i,  k_i = W_k κ_i,  v_i = W_v κ_i,

where W_q ∈ R^{d_k×(d+2)}, W_k ∈ R^{d_k×(d+2)}, and W_v ∈ R^{d_v×(d+2)}. For a given event κ_i, we can build the matrix K_{(i)} = [k_1 k_2 . . . k_i], which contains as columns the key vectors of all events up to (and including) event κ_i, and, similarly, the matrix of values V_{(i)} = [v_1 v_2 . . . v_i]. Based on this notation, we compute the hidden representation h_i of event κ_i as

h_i = V_{(i)} softmax(K_{(i)}^⊤ q_i / √d_k).   (5)

Note that in (5) we are generating h_i by linearly combining the columns of V_{(i)}, i.e., the values of all events preceding (and including) κ_i. The weights in this linear combination are given by a softmax applied to the inner products between the keys of every event and the query specific to κ_i. In this sense, h_i contains a weighted combination of the values of all preceding events, where both these values and the weights in the combination can be learned by tuning the transformations W_q, W_k, and W_v.
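As an illustration, the masked attention computation in (5) can be sketched in plain NumPy; the dimensions, random inputs, and random weight matrices below are toy stand-ins for the learned W_q, W_k, and W_v:

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_attention(events, Wq, Wk, Wv):
    """Hidden state h_i for each event, attending only to events 1..i, as in (5)."""
    Q, K, V = events @ Wq.T, events @ Wk.T, events @ Wv.T
    dk = Wq.shape[0]
    H = np.zeros((len(events), Wv.shape[0]))
    for i in range(len(events)):
        scores = K[: i + 1] @ Q[i] / np.sqrt(dk)   # keys of events 1..i vs query of event i
        w = np.exp(scores - scores.max())
        w /= w.sum()                               # softmax weights
        H[i] = w @ V[: i + 1]                      # weighted combination of preceding values
    return H

# toy setup: n = 4 events with features (t, x1, x2, M), i.e. d + 2 = 4, and d_k = d_v = 3
events = rng.normal(size=(4, 4))
Wq, Wk, Wv = (rng.normal(size=(3, 4)) for _ in range(3))
H = causal_attention(events, Wq, Wk, Wv)
```

Because the first event can attend only to itself, its hidden state h_1 equals its own value vector v_1, which makes a convenient sanity check.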

2.2. NORMALIZING FLOWS

A normalizing flow is a transformation of a simple probability distribution into a complex one using a sequence of differentiable and invertible bijective maps (Kobyzev et al., 2020). Assume that κ ∈ R^{d_κ} has a complex unknown distribution p_κ. Having z ∈ R^{d_κ} as a random sample drawn from a simple known distribution p_z (e.g., a normal distribution), we can use a sequence of invertible and differentiable functions F = F_1 ∘ F_2 ∘ · · · ∘ F_k to map z into κ according to κ = F(z). Therefore, using the change of variables formula we have that

p_z(z) = p_κ(F(z)) |det(DF(z))|,   (6)

where det(DF(z)) denotes the determinant of the Jacobian matrix of F(z) and accounts for the volume change in the density as we move from the simple distribution to the complex one. Equation (6) yields an explicit form for the complex underlying distribution p_κ, and we can effectively sample from it by generating samples from p_z and transforming those via F.
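A quick numerical sanity check of (6): take the toy flow F(z) = exp(z), chosen purely for illustration (it is not one of the flows used later), which pushes a standard normal forward to a log-normal distribution whose density is known in closed form:

```python
import numpy as np

def normal_pdf(z):
    """Standard normal base density p_z."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def flow_density(kappa):
    """Density of kappa = F(z) = exp(z) via the change of variables:
    p_kappa(kappa) = p_z(F^{-1}(kappa)) * |dF/dz|^{-1}, with dF/dz = exp(z) = kappa."""
    z = np.log(kappa)
    return normal_pdf(z) / kappa

def lognormal_pdf(kappa):
    """Closed-form standard log-normal density, for comparison."""
    return np.exp(-0.5 * np.log(kappa) ** 2) / (kappa * np.sqrt(2 * np.pi))

k = np.linspace(0.1, 5.0, 50)
```

Evaluating `flow_density` and `lognormal_pdf` on the same grid gives identical values, confirming that the Jacobian term correctly accounts for the volume change.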

3. NEURAL MULTI-EVENT FORECASTING

Given a history of n events, we want to determine the probability distribution (in time and space) of the next L events. Formally, we want to solve the following problem.

Problem 1. Given the event history H_{t_{n+1}} = {(t_i, x_i, M_i) | i = 1 : n}, estimate the probability distributions {p_l(t_l, x_l | H_{t_l})}_{l=n+1}^{n+L} of the next L events.

We refer to the above problem as batched or multi-event prediction because, given the history up to event n, we want to estimate the spatio-temporal probability densities of a batch of L events, denoted by p_l in Problem 1. Note that, despite using the word batched, these distributions do not need to be independent of each other. We present a data-driven solution where, during a training phase, we get to observe several sequences of length n + L of our events of interest. We propose a transformer-based architecture trained in a self-supervised manner to extract the relations between historical events. Furthermore, we enhance the transformer block with probabilistic and bijective layers to construct history-dependent spatio-temporal distributions from the hidden states of the transformer. Since p_l(t_l, x_l | H_{t_l}) = p_l(x_l | t_l, H_{t_l}) p_l(t_l | H_{t_l}), we separately predict two batched history-dependent temporal and spatial distributions. Figure 1 shows a high-level view of our proposed network. Out of the n + L events in the training sequences, n are used as the history information and the inputs to the encoder, whereas the remaining L are the inputs to the decoder in the training phase, and our goal (consistent with Problem 1) is to predict their spatio-temporal distributions in the test phase.
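Because the joint density factorizes as p_l(t_l, x_l | H) = p_l(x_l | t_l, H) p_l(t_l | H), sampling a future event can proceed time-first: draw t_l from the temporal distribution, then draw x_l conditioned on it. A minimal sketch with stand-in components (an exponential interval and a Gaussian location whose toy mean function depends on the sampled time; all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_next_event(beta, mean_fn, cov):
    """Sample (t, x) from p(t, x | H) = p(x | t, H) p(t | H):
    first the time, then the location conditioned on it."""
    t = rng.exponential(scale=beta)                # p(t | H): exponential interval
    x = rng.multivariate_normal(mean_fn(t), cov)   # p(x | t, H): Gaussian location
    return t, x

# toy conditional mean: the expected location drifts as the waiting time grows
samples = [sample_next_event(0.5, lambda t: np.array([t, 0.0]), np.eye(2) * 0.01)
           for _ in range(2000)]
```

The sample mean of the drawn waiting times recovers the chosen exponential scale, and each location sample is a 2-dimensional vector tied to its own time.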

3.1. NETWORK ARCHITECTURE

As shown in Figure 1, the transformer consists of two main blocks known as the encoder and the decoder. The encoder is fed with input sequence batches, where each sequence has n events with their associated markers (t_i, x_i, M_i). Using multi-head attention blocks (Vaswani et al., 2017), we extract representations for all n events that capture history dependency. More precisely, via (5), we separately extract the hidden representations h_{t_i} for time and h_{x_i} for space, which are further used as part of the inputs to the decoder block.

Figure 1: The framework of our proposed architecture. Our network is composed of an encoder, a decoder, and two parallel batched probabilistic layers, each followed by bijective layers known as normalizing flows. The input to the decoder is used to learn the output of the bijective layers via log-likelihood maximization in the training phase. In the test phase, a sequence of all zeros is input to the decoder and the outputs of the bijective layers are the predicted batched probability distributions, i.e., our proposed solution to Problem 1.

The decoder contains two attention blocks that extract hidden representations of the output data (events n + 1 through n + L) during the training phase. We denote these representations by {h_{t_l}, h_{x_l}}_{l=n+1}^{n+L}. The first attention block in the decoder captures history dependency within the subsequence of events between n + 1 and n + L. We denote the outputs of this first block by q_{t_l} and q_{x_l}, since these are used as queries for the second multi-head attention block in the decoder; see Figure 1. The hidden states {h_{t_i}, h_{x_i}}_{i=1}^n from the encoder are used as both the key and the value inputs to the second attention block in the decoder layer. In this way, the history dependency between the input (events 1 to n) and output (events n + 1 to n + L) events is encoded in the output hidden representations of the decoder, i.e., in {h_{t_l}, h_{x_l}}_{l=n+1}^{n+L}.
Despite the typical implementation of the transformer decoder using look-ahead masks, due to our multi-event objective we remove this mask in the second attention block of the decoder. It is worth mentioning that, during testing, the time and markers for events n + 1 through n + L are input as zeros, so that no information beyond H_{t_{n+1}} is used for prediction, thus abiding by the constraint of Problem 1. Finally, normalizing, residual, and dense layers are used in both the encoder and the decoder blocks for dimension matching and to enhance training. In the next step, we inject {h_{t_l}}_{l=n+1}^{n+L} and {h_{x_l}}_{l=n+1}^{n+L} into two separate batched probabilistic layers (Dürr et al., 2020). In essence, for the time variable t_l, we define a trainable map going from h_{t_l} to the parameter of an exponential distribution. We denote this distribution by p^{(z_t)}_l(·; h_{t_l}). Similarly, for the spatial variable x_l, we define a trainable map going from (h_{x_l}, t_l) to the parameters of a multivariate Gaussian distribution. We denote this distribution by p^{(z_x)}_l(·; t_l, h_{x_l}). We learn the parameters of these distributions independently for every l ∈ {n + 1, . . . , n + L}. However, it should be noted that the hidden states {h_{t_l}}_{l=n+1}^{n+L} and {h_{x_l}}_{l=n+1}^{n+L} already encode spatio-temporal dependencies, so the parameters learned for the probability distributions of different events already capture correlations with other events across time. Also notice that the parameters of the spatial distribution depend on the time instant t_l, since we want to estimate the joint probability density over time and space (see Problem 1) as p_l(t_l, x_l | H_{t_l}) = p_l(x_l | t_l, H_{t_l}) p_l(t_l | H_{t_l}).
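A sketch of what such a probabilistic layer computes per event; the class, its dimensions, and the random weights below are hypothetical stand-ins for trained parameters, not the paper's exact implementation. It maps a hidden state to a positivity-constrained exponential parameter β_l and, together with the time, to a Gaussian mean:

```python
import numpy as np

rng = np.random.default_rng(3)

def softplus(a):
    """Smooth map to (0, inf), used to keep the exponential parameter positive."""
    return np.log1p(np.exp(a))

class ProbabilisticLayer:
    """Trainable maps from hidden states to distribution parameters:
    beta(h_t) for the exponential time distribution and mu(h_x, t) for the
    Gaussian space distribution. A full layer would also output a covariance."""
    def __init__(self, d_hidden, d_space):
        self.w_beta = rng.normal(size=d_hidden)            # stand-in weights
        self.W_mu = rng.normal(size=(d_space, d_hidden + 1))
    def time_params(self, h_t):
        return softplus(h_t @ self.w_beta)                 # beta > 0
    def space_params(self, h_x, t):
        return self.W_mu @ np.append(h_x, t)               # mean mu(h_x, t)

layer = ProbabilisticLayer(d_hidden=8, d_space=2)
h = rng.normal(size=8)
beta = layer.time_params(h)
mu = layer.space_params(h, t=0.3)
```

The softplus guarantees a valid exponential parameter for any hidden state, and the spatial mean explicitly receives the time t, mirroring the conditioning p^{(z_x)}_l(·; t_l, h_{x_l}).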
Although the probabilistic layers have a critical role in forming the underlying batched distributions, they can only take simple and tractable forms (in our case, exponential and multivariate Gaussian), which are not expressive enough to capture the true underlying distributions of the times and locations of future events. Thus, a set of bijective layers, known as normalizing flows (Kobyzev et al., 2020), is used to transform them into richer and more complex batched forms. As discussed in Section 2.2, we can use (6) to transform the distributions p^{(z_t)}_l and p^{(z_x)}_l into more complex forms. We use the exponential and multivariate normal distributions as the base distributions p^{(z_t)}_l and p^{(z_x)}_l because they have been used extensively as parametric kernels to form the Hawkes intensity function in (1) (Ogata, 1988; 1998). Moreover, in Appendix A.7 we show that the histograms depicting the space and time distributions in various datasets are not too far from a multivariate Gaussian for space and an exponential for consecutive time intervals. In this sense, the normalizing flows only need to learn to further adjust these distributions to the observed data. More precisely, we find continuous and invertible functions F_1 and F_2 such that, for every l, these functions transform samples from p^{(z_t)}_l and p^{(z_x)}_l (denoted by z_{t_l} and z_{x_l}) into t_l and x_l as in

t_l = F_1(z_{t_l}),  x_l = F_2(z_{x_l}).   (7)

Using (6), we can relate the known expressions of the base distributions to the distributions of t_l and x_l through the Jacobians of F_1 and F_2, respectively. We denote these learned distributions by p^{(t)}_l(·; h_{t_l}) and p^{(x)}_l(·; t_l, h_{x_l}). In terms of parametric functional forms, we use a softsign bijector for F_1 and a RealNVP bijector (Dinh et al., 2016) for F_2.
To train these bijections, we maximize the log-likelihood of the true events {t_l, x_l}_{l=n+1}^{n+L} under the learned distributions p^{(t)}_l(·; h_{t_l}) and p^{(x)}_l(·; t_l, h_{x_l}). More details about the training of our network are discussed in Section 4.3 and Appendix A.2. Going back to Problem 1, by maximizing the log-likelihood we are training our network in an end-to-end fashion such that p^{(t)}_l(t_l; h_{t_l}) ≈ p_l(t_l | H_{t_l}) and p^{(x)}_l(x_l; t_l, h_{x_l}) ≈ p_l(x_l | t_l, H_{t_l}), where, we recall, H_{t_l} contains the history of events up to (but not including) event l. From here we can obtain

p_l(t_l, x_l | H_{t_l}) = p_l(x_l | t_l, H_{t_l}) p_l(t_l | H_{t_l}),  for l = n + 1, . . . , n + L.   (8)

Finally, by recalling that during testing every event beyond n is completed via zero padding, in (8) we are effectively obtaining a distribution for the time and space of every event l in the range n + 1 to n + L given H_{t_{n+1}}, as we aimed for in Problem 1. Note that we use the true known time values t_l as inputs to the probabilistic layers when modeling p^{(z_x)}_l(·; t_l, h_{x_l}) in both the train and test phases. More details are provided in Appendix A.2.

3.2. ADDRESSING THE LIMITATIONS OF EXISTING APPROACHES

As previewed in Section 1, our architecture in Figure 1 was inspired by the need to address four shortcomings present in classical Hawkes processes and partially shared by more modern alternatives. Here, we explicitly discuss how the different constituent blocks of our architecture address these deficiencies. First, we do not depend on prespecified kernels to model the probability densities of time and location. Notice that, although we adopt exponential and Gaussian distributions for the latent variables z_{t_l} and z_{x_l} as further described in Appendix A.1, these are later transformed via normalizing flows to better represent the observed data. Second, the attention mechanisms in both the encoder and decoder blocks of the transformer can learn heterogeneous effects between events across time. Third, the explicit forms of p^{(t)}_l(t_l; h_{t_l}) and p^{(x)}_l(x_l; t_l, h_{x_l}) obtained via normalizing flows facilitate fast sampling of future events via (8), without having to rely on integration as in (2). Fourth, the incorporation of the decoder block of the transformer enables multi-event prediction, bypassing the need for sequential one-by-one event prediction. To be more precise, the use of our decoder block implies that, during training, we incorporate the future information {h_{t_l}, h_{x_l}}_{l=n+1}^{n+L} along with its dependence on the encoded history {h_{t_i}, h_{x_i}}_{i=1}^n. This makes our training procedure self-supervised rather than unsupervised, unlike all previous works in the area. In essence, during training, the network learns to fit the distribution of multiple future events jointly. Hence, during testing, we can zero-pad the input to the decoder and still obtain an estimate of the spatio-temporal probability densities of several events in the future.

4. EXPERIMENTS

4.1. DATA

We used a variety of synthetic and real-world datasets representing point processes with discrete events. We briefly introduce these datasets here; more details can be found in Appendix A.6. In all datasets, we collected sequences of 500 events with an overlap of 498 events. We used n = 497 events in each sequence as the inputs and the final L = 3 events as the outputs that we want to predict during the testing phase. Train, validation, and test sets are formed according to an 80%-14%-6% split after shuffling the formed sequences.

South California Earthquakes. Earthquake events from 2008 to 2016 in South California (Ross et al., 2019) of magnitude at least 2.5. The event description includes time, location in 3 dimensions, magnitude, and consecutive events' time intervals as event markers.

Citibike. Rental events from a bike sharing service in New York City (Amazonawstripdata). The event description includes the starting time of the ride, location in 2 dimensions, the biker's birth year, and consecutive rentals' time intervals as event markers.

Covid-19. Daily Covid-19 cases in different states of the United States (The-NewYork-Times). The event description includes the day of catching Covid-19, location in 2 dimensions, the number of cases on that day, and consecutive events' time intervals as event markers.

Pinwheel. Hawkes pinwheel dataset introduced in Chen et al. (2020). Hawkes time instances were simulated using the thinning algorithm introduced in Ogata (1981) and assigned to a cluster-based pinwheel distribution. For simplicity, we assigned the same magnitude to all spatial points sampled from the formed distribution when forming the data sequences.

4.2. BASELINES

We consider two categories of baseline methods: those that predict the time and those that predict the location of the next event. Models in the time category learn the intensity function from the input times, and we predict the expected time of multiple future events by sequentially applying (2). These models include the Hawkes process, the self-correcting point process (Isham & Westcott, 1979), and the homogeneous Poisson process (Pasupathy, 2010). Note that the sequential prediction of future time-points beyond one event, for the Hawkes and self-correcting intensities, relies on a redefined history that incorporates the event time predicted in the previous step. We also use space models to sequentially learn the space distributions of multiple future events. In this regard, we use the conditional multivariate Gaussian mixture model (GMM) (Murphy, 2012; Bishop & Nasrabadi, 2006) to learn the space distribution, where the Gaussian kernel parameters are learned from the historical events.
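For the simplest baseline, the homogeneous Poisson process, the sequential multi-event recipe reduces to a short loop, since E[t_next | H] = t_last + 1/λ with the rate λ estimated from the history; the numbers below are toy values for illustration:

```python
def poisson_multi_event_forecast(times, L):
    """Sequentially predict L future event times under a homogeneous Poisson
    process: rate = n / t_n, and E[t_next | H] = t_last + 1 / rate.
    Each prediction is appended to the history before predicting the next."""
    times = list(times)
    rate = len(times) / times[-1]
    preds = []
    for _ in range(L):
        nxt = times[-1] + 1.0 / rate
        preds.append(nxt)
        times.append(nxt)  # redefine the history with the predicted event
    return preds

# toy history: 5 events over 2.5 time units -> rate 2, expected gap 0.5
history = [0.5, 1.1, 1.6, 2.2, 2.5]
preds = poisson_multi_event_forecast(history, L=3)
```

For the Hawkes and self-correcting baselines, the same loop applies but each step re-evaluates the (history-dependent) intensity, which is where prediction errors accumulate.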

4.3. NETWORK TRAIN AND TEST

Training starts by passing batches of input and output sequences to the encoder and the decoder, respectively. The encoder outputs are also passed to the decoder, as explained in Section 3.1. We used inputs and outputs with all the existing markers for each dataset, as mentioned in Section 4.1. A list of the hyperparameters used in this work is shown in Table 3 in Appendix A.7. We define the loss as the negative log-likelihood

loss = − Σ_{l=n+1}^{n+L} log p^{(t)}_l(t_l; h_{t_l}) − Σ_{l=n+1}^{n+L} log p^{(x)}_l(x_l; t_l, h_{x_l}).   (9)

Thus, we are effectively learning the coefficients in the transformer blocks (which affect the hidden representations h_{t_l} and h_{x_l}) and the coefficients in the probabilistic and bijective layers (which parameterize the distributions) to maximize the likelihood of generating the observed data. The baseline methods are trained to encode the history of past events and to sequentially predict the expected next event based on (2). Therefore, in each step we compute the expected time t̂_l and location x̂_l of event l, and we want these to be close to the true values t_l and x_l before updating the history for the next prediction step. To satisfy this condition, during the training phase, we add regularizers to the negative log-likelihood as in

loss = − Σ_{l=1}^{n+L} log p_t(t_l | H_{t_l}) − Σ_{l=1}^{n+L} log p_x(x_l | t_l, H_{t_l}) + λ_1 Σ_{l=n+1}^{n+L} |t_l − t̂_l| + λ_2 Σ_{l=n+1}^{n+L} ||x_l − x̂_l||_2,   (10)

where p_t(· | H_{t_l}) and p_x(· | t_l, H_{t_l}) are the learned probability densities for the baseline method at hand, and λ_1 and λ_2 encode the relative weights of the regularizers. To be more precise, if a method only predicts the time, we use only the first and third terms in (10) as the loss, and if a method only predicts the location, we use the second and fourth terms. For a fair comparison between the loss of our network and the baselines, during the testing phase we report the negative log-likelihood for events n + 1 through n + L for all tested models.
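The mechanics of fitting by negative log-likelihood can be illustrated on the temporal part alone: for i.i.d. exponential intervals, minimizing the NLL recovers the data-generating scale. Here a grid search stands in for the Adam updates used for the actual network, and the scale 0.7 is an arbitrary illustrative choice:

```python
import numpy as np

def neg_log_likelihood(beta, intervals):
    """NLL of i.i.d. exponential intervals under scale beta:
    -sum(log((1/beta) * exp(-x/beta))) = sum(log(beta) + x/beta)."""
    return np.sum(np.log(beta) + intervals / beta)

rng = np.random.default_rng(1)
intervals = rng.exponential(scale=0.7, size=5000)

# minimize the NLL over a grid of candidate scales (stand-in for gradient descent)
betas = np.linspace(0.1, 2.0, 400)
best = betas[np.argmin([neg_log_likelihood(b, intervals) for b in betas])]
```

The NLL minimizer of the exponential scale is the sample mean of the intervals, so the grid search lands on (essentially) `intervals.mean()`, close to the true 0.7.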
We train our network using the Adam optimizer with an initial learning rate of 0.001 and a scheduled learning-rate rule. We used dropout layers to reduce overfitting and ran 1000 epochs using batches of size 32.

4.4. RESULTS

Figure 2 shows the predicted batched spatial and temporal distributions for the next three events on the South California Earthquake data (Ross et al., 2019) using our proposed architecture. In the top-right panel, we average 1000 samples drawn from the predicted batched history-dependent time distributions to forecast the times of future events, shown by the red stars. The linear trend is shown to emphasize that our network is capable of predicting events with different in-between time intervals. The bottom panels show the 3-dimensional representation of the batched history-dependent multi-modal predicted densities along the main faults prone to earthquake occurrence. In the 3-d visualization we show three slices along the z-axis, representing the lowest depth, the highest depth, and the depth associated with the true future event, shown by the green circle. In the middle, we show the 2-dimensional density visualizations associated with the true event's depth shown in the 3-d panels below them. Note that, while the bottom panels represent the predicted densities, in the middle panels we show the predicted density on the left (i.e., for the first predicted event), while the other two middle panels depict the consecutive predicted density differences for the two subsequent events. This is done to highlight the performance of our network in predicting simultaneous but different multi-event densities. Each predicted space density is associated with its event's predicted time, shown above it. Despite only slight differences between the predicted space densities, the regions most likely to contain the next event are still well-recognized. Note that while the same history is encoded for all three future events, the differences between the predicted batched densities are highly influenced by the decoder layer, which incorporates the future information that was available during the training phase.
Considering that consecutive outputs in most sequences occur very close to each other in time and space, the decoder layer is trained to reproduce the same behavior in the test phase, predicting batched densities with slight but informative differences. The high concentration of density in the upper left of the space density maps is caused by the highly earthquake-prone regions/faults in South California. Figure 3, on the other hand, presents the sequential prediction performance of the baseline models on the same dataset used in Figure 2. The Hawkes model and the conditional GMM model are used to learn the time and space distributions, respectively, where we used (2) to sequentially predict the time and location of the next event. Comparing this result with Figure 2, we see the better performance of our network in predicting the time and space distributions of future events on the earthquake data. Moreover, the conditional GMM model is incapable of predicting a complex, well-defined density along the fault lines representing earthquake-prone regions. The results shown in Table 1 for all the experiments indicate that, except for the pinwheel synthetic dataset with Hawkes-simulated times, our network outperforms all the baseline models. Due to the very high error accumulation associated with sequential prediction using the conditional GMM model, we used the true rather than the predicted values at each prediction step to update its history. More visualization results are given in Appendix A.4. Since the times associated with the pinwheel dataset are synthetically simulated from the Hawkes model, Hawkes is a better candidate for forecasting the expected time distribution on this dataset, as shown in Table 1. Among the baseline models, the self-correcting and homogeneous Poisson processes perform the worst; the self-correcting process, with its inhibitory assumption on history dependency, never converged on the earthquake dataset.
While all the provided baseline models assume certain parametric forms for the space or time distributions, our proposed network is applicable to any dataset of discrete events, even in the absence of underlying physics or known models that could guide the right choice of a spatio-temporal kernel.

5. CONCLUSION

We have proposed a novel probabilistic approach that enhances a transformer architecture to conditionally learn the spatio-temporal distribution of multiple stochastic discrete events. Our network combines the transformer model and normalizing flows augmented with batched probabilistic layers to simultaneously learn the underlying distributions of multiple future events using self-supervised learning. The attention blocks of the transformer architecture assign score-based history dependency among events, which can capture both excitatory and inhibitory behaviors while being parallelizable and fast to train. We show that our approach achieves state-of-the-art performance on different spatio-temporal datasets collected from a wide range of domains. For future work, we are interested in using other available sources of data, as proposed by Okawa et al. (2022), such as GPS data and beneath-earth images, to build a context-aware batched spatio-temporal distribution forecasting tool, enhancing the performance of our network on stochastic point processes. Another interesting direction is to use diffusion models (Song et al., 2020; Ho et al., 2020; Nichol & Dhariwal, 2021) instead of normalizing flows as the generative models that learn the probability distributions. We hope that the gradual noise added in the reverse diffusion denoising steps might better capture regions associated with rare events.

A.1 PROBABILISTIC AND BIJECTIVE LAYERS

The batched probabilistic layers take the forms

p^{(z_t)}_l(z_{t_l}; h_{t_l}) = (1/β_l) exp(−z_{t_l}/β_l),
p^{(z_x)}_l(z_{x_l}; t_l, h_{x_l}) = ((2π)^{n_{z_{x_l}}} |Σ_l|)^{−1/2} exp(−(1/2)(z_{x_l} − µ_l)^⊤ Σ_l^{−1} (z_{x_l} − µ_l)),   (11)

where n_{z_{x_l}} denotes the dimension of z_{x_l}, β_l is learned as a function of h_{t_l}, and {Σ_l, µ_l} are learned as functions of {t_l, h_{x_l}}.
The distributions $p_l^{(z_t)}(\,\cdot\,; h_{t_l})$ and $p_l^{(z_x)}(\,\cdot\,; t_l, h_{x_l})$ in (11) are individually followed by a softsign bijector $F_1$ and a RealNVP bijector $F_2$ (Dinh et al., 2016), respectively, as two separate bijective layers modeling the desired batched conditional temporal and spatial distributions $p_l^{(t)}(\,\cdot\,; h_{t_l})$ and $p_l^{(x)}(\,\cdot\,; t_l, h_{x_l})$. Using (6), we have

$$p_l^{(t)}(\,\cdot\,; h_{t_l}) = p_l^{(z_t)}(\,\cdot\,; h_{t_l})\,\big|\det\!\big(DF_1(z_{t_l})\big)\big|^{-1}, \qquad p_l^{(x)}(\,\cdot\,; t_l, h_{x_l}) = p_l^{(z_x)}(\,\cdot\,; t_l, h_{x_l})\,\big|\det\!\big(DF_2(z_{x_l})\big)\big|^{-1}.$$

Notice that the softsign bijective function $F_1(y) = \frac{y}{|y|+1}$, applied after the exponential probabilistic layer, restricts the outputs to lie between 0 and 1, i.e., $F_1: (0, \infty) \to (0, 1)$. This is consistent with the fact that we always normalize our time inputs $\{t_i, t_l\}$ to lie between 0 and 1, so that the time intervals also always stay between 0 and 1. Therefore, this bijective map (Dillon et al., 2017) ensures that the outputs of the exponential probabilistic layer stay in the desired range. Moreover, as shown in the time-interval histograms for all datasets in Figure 12, all follow approximately exponential behavior, indicating that little flexibility is required in the choice of the temporal bijective layer. Also note that since time is 1-dimensional, well-known bijective functions such as RealNVP (Dinh et al., 2016), NICE (Dinh et al., 2014), and the Masked Autoregressive normalizing flow (Papamakarios et al., 2017) cannot be used to model the batched temporal distribution, as these flow architectures are designed for dimensions higher than 1.
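To make the change-of-variables step above concrete, the following NumPy sketch pushes an exponential base density through the softsign bijector and checks numerically that the transformed density lives on (0, 1). The function names and the choice β = 0.3 are illustrative, not taken from the paper:

```python
import numpy as np

def softsign(y):
    # F1: (0, inf) -> (0, 1)
    return y / (np.abs(y) + 1.0)

def softsign_inverse(u):
    # inverse of F1 on (0, 1): u = z / (1 + z)  =>  z = u / (1 - u)
    return u / (1.0 - u)

def exp_base_density(z, beta):
    # exponential base density (1 / beta) * exp(-z / beta), z > 0
    return np.exp(-z / beta) / beta

def transformed_time_density(u, beta):
    # change of variables: p(u) = p_base(F1^{-1}(u)) * |det DF1(z)|^{-1};
    # for z > 0, dF1/dz = 1 / (1 + z)^2, so |det DF1(z)|^{-1} = (1 + z)^2
    z = softsign_inverse(u)
    return exp_base_density(z, beta) * (1.0 + z) ** 2

beta = 0.3
u = np.linspace(1e-6, 1.0 - 1e-6, 200001)
p = transformed_time_density(u, beta)
# the transformed density should integrate to ~1 over (0, 1)
mass = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(u))
```

Because the substitution is exact, the numerical mass deviates from 1 only by the quadrature error and the mass excluded near the endpoints.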

A.2 DETAILS ON NETWORK TRAINING AND TESTING

As pointed out earlier, the training and testing phases of our network differ slightly, which we clarify in this section. As shown in Figure 1, during training both the input batches $\{(t_i, x_i, M_i)\}_{i=1}^{n}$ and the output batches $\{(t_l, x_l, M_l)\}_{l=n+1}^{n+L}$ are fed into the encoder and the decoder, respectively. Using (9), the goal is to learn a parametric description of a batched probability distribution that seeks to explain future events in time and space only, i.e., $p_l^{(t)}$ and $p_l^{(x)}$. The reason for considering only these markers (time and space) is that they are the main markers covered by the original Hawkes process, as shown in (1). Learning batched distributions associated with other markers may require models beyond Hawkes; for example, the Gutenberg-Richter (GR) frequency-magnitude law in earth science (Gutenberg & Richter, 1944; Knopoff, 2000; Tinti & Mulargia, 1985; Rhoades, 1996) needs to be considered when modeling the batched magnitude distribution $p_l^{(M)}$ associated with future earthquake events. Also notice that, as indicated by the Hawkes intensity function in (1), using the other markers M as historical information does not conflict with learning only the batched spatio-temporal distributions. During testing, on the other hand, we only feed the encoder with the input batches $\{(t_i, x_i, M_i)\}_{i=1}^{n}$, whereas a batch of zeros, $\{(t_l, x_l, M_l)\}_{l=n+1}^{n+L} = 0$, is fed to the decoder. We expect the trained network to predict the batched spatio-temporal distributions $p_l^{(t)}$ and $p_l^{(x)}$ associated with the L future events by relying only on the historical information $\{(t_i, x_i, M_i)\}_{i=1}^{n}$. Using zero inputs might raise confusion about passing zero queries to the second attention layer of the decoder.
However, this can be addressed by relying on the biases associated with the decoder block, which were tuned in the training phase. Another question raised by feeding zero inputs to the decoder in the testing phase is how the decoder can distinguish between different inputs. Notice that we use the positional embedding (PE) proposed by Vaswani et al. (2017) in both the training and testing phases. Moreover, the decoder is also fed with the output of the encoder, which is computed during testing from the true history $\{(t_i, x_i, M_i)\}_{i=1}^{n}$ that we use for prediction. Consequently, the decoder can distinguish between different event histories via the output of the encoder, in addition to the positions preserved by the PE.

Unless we consider trips between states, the patients' time and location in one state might even provide false information, influencing the chances of others catching the disease in another state. Therefore, removing this information might even enhance the prediction of space distributions. Our interpretation of the very small changes in time-distribution forecasting in the absence of historical information is that Covid-19 is among the fastest diseases to catch, so no very long-term dependency in time is needed to forecast the time of catching the disease. We conjecture that the short-term dependency captured by the decoder during training is enough to predict when the next person will catch the disease.

Pinwheel: As shown at the top of Figure 11, the sequences of pinwheel events are formed in a clockwise way. Therefore, historical positional and temporal information should have a large effect on predicting the locations of future events. This is in line with the results shown in Table 2, indicating how removing historical information worsens spatial multi-event distribution forecasting.
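Returning to the positional-embedding point above: the sinusoidal encoding of Vaswani et al. (2017) gives each of the decoder's output slots a distinct query even when the event embeddings themselves are all zero at test time. The following is an independent sketch of that mechanism, not the paper's implementation:

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal positional encoding from Vaswani et al. (2017):
    sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(length)[:, None].astype(float)
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# at test time the decoder receives all-zero event embeddings; adding the
# positional encoding still gives each of the L output slots a distinct query
L, d_model = 3, 64
queries = np.zeros((L, d_model)) + positional_encoding(L, d_model)
```

Since the encodings differ across positions, the zero-padded output slots remain distinguishable to the attention layers.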
On the other hand, despite using synthetic Hawkes times (Chen & Stindl, 2022) for the pinwheel events, we saw no change in time forecasting when removing history dependency. We infer that the thinning process (Ogata et al., 1993) used to extract the times assigned to the events (Chen & Stindl, 2022) does not impose long-term, long-range dependency. Therefore, the short-term history dependency among the output events $\{(t_l, x_l)\}_{l=n+1}^{n+L}$, captured during training by the decoder layer, suffices for predicting the time distributions of multiple events without any further look into the history.

A.3.2 NO DECODER

Here, we no longer feed $\{(t_l, x_l, M_l)\}_{l=n+1}^{n+L}$ to the decoder during the training phase. This resembles the test phase of our model, where we input $\{(t_l, x_l, M_l)\}_{l=n+1}^{n+L} = 0$ to the already trained network; in this study, however, we input $\{(t_l, x_l, M_l)\}_{l=n+1}^{n+L} = 0$ during both the training and testing phases. Results in Table 2 indicate that, except for spatial distribution forecasting on the pinwheel dataset, removing the decoder does not significantly change the prediction results, although the results of the full network are slightly better. As mentioned in Section 3.2, during training the network learns to fit the distribution of multiple future events jointly, in a self-supervised setting. Hence, during testing, zero-padding the input to the decoder still yields an estimate of the spatio-temporal probability densities of several future events. This is due to the network's reliance on the historical input data, as well as the learned bias associated with the decoder layer. However, we acknowledge that the learned bias might provide only little history-dependency information between output events, especially when the number of events to be predicted is small (L = 3 in our case). Since the no-decoder setup differs from the main setup of our proposed network only in the learned bias associated with the decoder layer, we do not expect to see much difference in the results. We suspect, however, that larger output sequences would highlight the difference between the two settings: in that case, we could consider larger distances in time and space among the output events, which would tune the decoder layers' bias differently during the training phase. In the case of the pinwheel dataset, we found that removing the decoder enhanced the multi-event spatial distribution forecasting results.
At the same time, the batched temporal distribution forecasting results are negatively affected. In line with the outcomes of the previous ablation study, we infer that the pinwheel events are massively influenced by long-term history dependency in space; removing the short-term dependency among the output events therefore allows the network to focus more on the long-range effects of the historical data. As expected from the outcome of the study in Appendix A.3.1, the worse temporal results might be due to the importance of short-term dependency between the times of the events, modeled by the thinning process (Ogata et al., 1993).
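Since the thinning process comes up repeatedly in these ablations, a minimal Ogata-style thinning simulator for a purely temporal Hawkes process with an exponential kernel may help fix ideas. The parameter values are illustrative, and this is not the RHawkes implementation used to generate the dataset:

```python
import numpy as np

def intensity(t, times, mu, alpha, beta):
    """Hawkes intensity lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))."""
    if not times:
        return mu
    past = np.asarray(times)
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def simulate_hawkes_thinning(mu, alpha, beta, t_max, rng):
    """Ogata's thinning: propose candidates from a homogeneous bound and
    accept each with probability lambda(t) / bound. The current intensity
    is a valid bound because the exponential kernel decays between events."""
    times, t = [], 0.0
    while t < t_max:
        lam_bar = intensity(t, times, mu, alpha, beta)
        t += rng.exponential(1.0 / lam_bar)   # candidate arrival
        if t >= t_max:
            break
        if rng.uniform() < intensity(t, times, mu, alpha, beta) / lam_bar:
            times.append(t)                    # accepted event
    return np.array(times)

rng = np.random.default_rng(1)
ts = simulate_hawkes_thinning(mu=0.5, alpha=0.8, beta=1.5, t_max=50.0, rng=rng)
```

Note that each accepted event excites the intensity only transiently (the kernel decays at rate β), which is consistent with the short-term, rather than long-range, temporal dependency discussed above.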

A.3.3 CONCLUSION ON ABLATION STUDY

Through these two studies and the discussion in the main body of this paper, we showed the critical role of each layer forming our network. Mainly, we observed that each layer is designed for a specific task, driven by the physics behind the formation of the dataset used. Unlike other methods that rely solely on the Hawkes process, our network is designed to fully consider any sort of short- or long-term history dependency demanded by the physics behind the data. In general, we conclude from these studies that the performance of our proposed network is highly data-driven. Predicting batched distributions for datasets formed by both long- and short-range history dependencies, such as the earthquake dataset, demands the use of both the encoder and decoder blocks of our model. When events are more influenced by short-term dependencies, or when other factors influencing the events' occurrence are absent, the encoder might negatively affect the prediction results, as shown by the outcome of the first study on the Covid-19 dataset. The reverse applies to events with long-range history dependencies, such as the pinwheel and earthquake datasets. There might also be cases where removing the encoder has no influence on the prediction results because there is no significant history dependency, such as the Citibike dataset. On the other hand, capturing short-term dependencies, mainly done by the decoder layer, is important when events are also influenced by them. This conclusion is based on the outcomes of applying the second ablation study to several datasets: earthquake, Citibike, Covid-19, and the temporal events of the pinwheel dataset. Since all datasets used in this work, except for the pinwheel dataset, represent real-world phenomena, we believe no single model can learn or predict the distributions associated with all of them.
Therefore, it is critical to have a network that can be driven by the data's underlying physics, whilst being capable of simultaneous multi-event prediction.

A.4 MORE RESULTS AND ANALYSIS

To show the performance of our network on synthetic Hawkes-time simulated data versus the baseline model, we used the pinwheel dataset introduced in Section 4.1, where the events are distributed with less bias towards a specific region. As depicted at the top of Figure 5, not only are the history-dependent batched spatial distributions of the next three events well predicted by our network, but so are the regions prone to events occurring further away in time, shown in a clockwise pattern in Figure 11. The conditional GMM model, on the other hand, was not successful in extracting the complex distribution associated with the next events' occurrences. Note that when using the conditional GMM model to perform sequential next-event prediction, at each step we incorporate the true event, shown by the green circle, rather than the newly predicted event, shown by the red star, to redefine the history for the next prediction step. This is due to the high error accumulation caused by using the sequentially predicted events. Figure 6 shows the predicted times of the next three events when applying our proposed network and the temporal Hawkes model to the simulated Hawkes pinwheel dataset. While the results in the top figure show simultaneous prediction of multiple future event times, the bottom figures depict applying (2) sequentially while updating the history with the most recently predicted event. These results agree with Table 1, indicating that the Hawkes model performs better on this dataset. As mentioned earlier, this is because the times assigned to the synthetic Hawkes dataset are simulated by running the thinning algorithm on a temporal Hawkes process.
As mentioned earlier, the expected times extracted by applying our network to this and all other datasets are measured

Figure 7: Visualizations of the learned history-dependent batched spatial distributions for the last three events of an unseen earthquake sequence using our proposed architecture. The bottom figures depict the predicted batched spatial distribution associated with the last three events in 3 dimensions, where the z-axis is along the events' depth. The 2-d representations of the last three events' predicted spatial distributions, mapped to the events' depth, are shown in the top plots, where, except for the one on the right, we show the differences in the predicted distributions. The black stars on the spatial distributions represent the last 50 events of the input sequence as the history information; in this figure the history events are closely aligned along the main fault in South California.

Figure 8: Same visualization as in Figure 7 on another earthquake sequence of length 500, where the history events, shown by black stars, are distributed along all fault lines rather than just one.

A.5 BASELINES

We use baselines to separately model the temporal and spatial event distributions, conditioned on the history. Earlier, in (10), we showed that the loss defined for the benchmarks differs from the loss we defined for our proposed architecture in the training phase. As discussed in the main body of the paper, the baseline models only provide the expected next event using (2), which, unlike our proposed method, gives the predicted event explicitly rather than as a distribution. For further prediction into the future, the already predicted event has to be incorporated into the history, which accordingly updates the intensity function $\lambda(t_i | H_{t_i})$ as well as the space distribution $p(x_i | t_i, H_{t_i})$ used for further predictions. To have a precisely updated history, we need our predicted points to be as close to the true values as possible. Therefore, we expand (10) to provide

Conditional Gaussian mixture model: We implemented two forms of the conditional Gaussian mixture model as baselines to learn the spatial distributions $p(x_i | t_i, H_{t_i}),\ i = 1:n$, mentioned in Section 4.2.

Pinwheel. We sample from a multivariate process with 15 clusters, each containing 150 events. Since the spatial Hawkes synthetic dataset is not available, we used the spatial distribution provided in Chen et al. (2020) with Hawkes timings assigned using the RHawkes package (Chen & Stindl, 2022). Since we have sequences of 500 events, each sequence contains a small number of clusters, as shown in Figure 11. This resulted in 613 training sequences, 157 validation sequences, and 106 test sequences of the same length.

Figure 11: Top: the pinwheel synthetic dataset, where simulated Hawkes times are assigned to the events of the clusters in a clockwise way; these timings are simulated using the thinning algorithm (Ogata, 1981). Bottom: South California earthquakes distributed over the main faults.
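The sequential baseline procedure described in this section — predict the next event, fold it (or, to curb error accumulation, the true event) back into the history, repeat — can be sketched as follows. Here `predict_next` is a hypothetical stand-in for the expected-value prediction of (2), not an actual intensity model:

```python
import numpy as np

def sequential_forecast(history, true_future, predict_next, use_true=True):
    """Roll out L next-event predictions, updating the history after
    each step with either the predicted or the true event."""
    history = list(history)
    preds = []
    for true_event in true_future:
        pred = predict_next(history)
        preds.append(pred)
        # using the true event curbs the error accumulation noted in the text
        history.append(true_event if use_true else pred)
    return preds

# stand-in model: predict the mean inter-event time added to the last time
def predict_next(history):
    h = np.asarray(history)
    dt = np.diff(h).mean() if len(h) > 1 else 1.0
    return h[-1] + dt

hist = [0.0, 1.0, 2.0]
future = [3.0, 4.5, 5.5]
preds = sequential_forecast(hist, future, predict_next)
```

Setting `use_true=False` reproduces the pure autoregressive rollout, where every prediction error is compounded into all subsequent steps.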

A.7 HYPER-PARAMETERS

We set the number of attention layers in both the encoder and the decoder to 6, where each layer contains 6 heads. These choices follow the original transformer network proposed for NLP (Vaswani et al., 2017), which suggests choosing between 6 and 8 attention heads and layers. Before feeding any of the sequences to the encoder and the decoder, we first use multi-layer perceptron (MLP) layers with ELU activation functions to embed the events into a 64-dimensional space. The latent representations $h_{t_i}, h_{x_i}$ depicted in Figure 1 live in this 64-dimensional space, and we use another set of MLP layers with ELU activation functions to map $h_{t_l}$ and $h_{x_l}$ back to the same dimensions as $t_l$ and $x_l$ before injecting them into the probabilistic layers. This is necessary because the bijective layers (normalizing flows) that follow the probabilistic layers can only map to a space of the same dimension. As mentioned in Section 3.1, we used multivariate Gaussian and exponential probabilistic layers for space and time, respectively, as assumed by the Hawkes model. The marker histograms shown in Figure 12 likewise indicate the suitability of these choices for our probabilistic layers. We also tried modeling the time distribution using Gaussian probabilistic layers, which led to poor results. For the bijective flow layers, we tried RealNVP (Dinh et al., 2016) and Masked Autoregressive flows (Papamakarios et al., 2017) for the space distributions, and softplus and softsign flows for time. Our results indicate that RealNVP and softplus better model the expected batched space and time distributions. Note that both RealNVP and Masked Autoregressive flows are only applicable to data of dimension higher than 1, and are therefore not good candidates for transforming the batched time distributions.
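A minimal sketch of the embedding step described above (raw event markers mapped into a 64-dimensional latent space by an ELU MLP). The layer sizes, input dimension, and weight initialization are illustrative, not the trained model's:

```python
import numpy as np

def elu(x, a=1.0):
    # ELU activation: identity for x > 0, a * (exp(x) - 1) otherwise
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def mlp_embed(x, w1, b1, w2, b2):
    """Two-layer MLP with ELU activations, embedding raw event
    markers into the model's 64-dimensional latent space."""
    return elu(elu(x @ w1 + b1) @ w2 + b2)

rng = np.random.default_rng(0)
d_in, d_model = 3, 64                    # hypothetical marker dim -> 64-d latent
w1 = rng.normal(0.0, 0.1, (d_in, 32));   b1 = np.zeros(32)
w2 = rng.normal(0.0, 0.1, (32, d_model)); b2 = np.zeros(d_model)
events = rng.normal(size=(500, d_in))    # a length-500 event sequence
h = mlp_embed(events, w1, b1, w2, b2)
```

The inverse-direction MLPs mentioned in the text would follow the same pattern, mapping the 64-dimensional decoder outputs back down to the dimension of the corresponding marker.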
To overcome overfitting when training our model, we used dropout layers with rates from {0.1, 0.15, 0.2}. For the earthquake and Hawkes pinwheel synthetic data, both 0.1 and 0.15 were good options, whereas for the Citibike and Covid-19 datasets we had to increase the rate to 0.2 to reduce overfitting.



Notice that, as a convention, we tend to use subindex i (as in $h_{t_i}$) to refer to events in the interval $\{1, \ldots, n\}$ and subindex l (as in $h_{t_l}$) for events in the interval $\{n+1, \ldots, n+L\}$. We also use bold notation to show the vector form of a variable: $h_{t_l}$ denotes a 1-dimensional hidden state, whereas its bold version stands for dimensions higher than 1. Note that during the test phase we only feed the decoder with $t_l = 0$ when modeling $p_l^{(z_t)}$. Code is available as supplementary material and will be publicly released on GitHub if the paper is accepted.



Figure 2: Visualizations of the learned history-dependent batched spatial and temporal distributions for the last three events of an unseen earthquake sequence using our proposed architecture. The red stars in the top figures are the simultaneous predictions of the true events' times, shown by green circles, obtained by averaging 1000 samples drawn from the predicted output time distributions. The bottom figures depict the predicted batched spatial distribution associated with the last three events in 3 dimensions, where the z-axis is along the events' depth. The 2-d representations of the last three events' predicted spatial distributions, mapped to the events' depth, are shown in the middle plots, where, except for the one on the left, we show the differences in the predicted distributions. The black stars on the spatial distributions represent the last 50 events of the input sequence as the history information.

Figure 3: Same visualizations as Figure 2 for the same data, using the Hawkes model for time augmented with a conditional GMM for space. Here the prediction in time and space is performed sequentially using (2), updating the history at each prediction step.

Figure 4: Comparing the performance of the proposed network with two ablation studies applied to the earthquake data. Both removing the decoder and removing the history dependency worsen the prediction results, highlighting the importance of both long- and short-range history dependencies among seismic events.

Figure 5: Visualizations of the learned history-dependent batched spatial distributions for the last three events of a simulated Hawkes pinwheel sequence using our proposed architecture vs. the conditional GMM. The first figure on the left depicts the predicted distribution associated with the first event, while the middle and right figures depict only the consecutive differences between the predicted densities. The green circle shows the true event that is expected to occur. The black stars represent the last 50 events of the input sequence as history information. The red star in the bottom figures shows the sequentially predicted next event using (2), which must be injected back into the history for the next prediction step. Due to high error accumulation, we instead use the green circle to update the history in the bottom figures.

Figure 9: Visualization of the first 3 of the 6 encoder-decoder attention layers, showing 3 of the 6 heads per layer. These figures result from applying our proposed network to the pinwheel dataset. The x-axis represents the historical events, whereas the y-axis shows the three expected future events. The colors show how weakly or strongly past events influence future events' occurrence.

Figure 12: Histograms of the different markers associated with the datasets used in this work. The "extra markers" associated with the Earthquake, Citibike, Covid-19, and pinwheel datasets are the magnitude, birth year, number of cases, and all-one intensity, respectively.

Results representing the negative log-likelihood for the events in the range n+1 to n+L for the learned distributions (lower is better). ± indicates the standard deviation of the loss across all test batch sequences.

Hyper-parameters and layer types used in our model for both training and testing.

A APPENDIX

A.1 FURTHER DETAILS ON NETWORK ARCHITECTURE

In Section 3.1 we showed the usage of two individual batched probabilistic layers that are separately fed with $\{h_{t_l}\}_{l=n+1}^{n+L}$ and $\{(h_{x_l}, t_l)\}_{l=n+1}^{n+L}$, where $\{h_{t_l}, h_{x_l}\}_{l=n+1}^{n+L}$ are the outputs of the decoder block and $\{t_l\}_{l=n+1}^{n+L}$ are the known time values. We denoted the outputs of these layers by $p_l^{(z_t)}(\,\cdot\,; h_{t_l})$ and $p_l^{(z_x)}(\,\cdot\,; t_l, h_{x_l})$, the conditional exponential and multivariate Gaussian distributions, respectively, whose parameters are learned from $\{h_{t_l}\}_{l=n+1}^{n+L}$ and $\{(h_{x_l}, t_l)\}_{l=n+1}^{n+L}$. To further expand the mathematical description of these layers' outputs, we have

$$p_l^{(z_t)}(\,\cdot\,; h_{t_l}) = \frac{1}{\beta_l}\exp\!\Big(-\frac{z_{t_l}}{\beta_l}\Big), \qquad p_l^{(z_x)}(\,\cdot\,; t_l, h_{x_l}) = \big((2\pi)^{n_{z_{x_l}}}\,|\Sigma_l|\big)^{-\frac{1}{2}}\exp\!\Big(-\frac{1}{2}(z_{x_l}-\mu_l)^\top\Sigma_l^{-1}(z_{x_l}-\mu_l)\Big), \tag{11}$$

where $\beta_l$ is learned as a function of $h_{t_l}$, and $\{\Sigma_l, \mu_l\}$ are learned as functions of $\{t_l, h_{x_l}\}$.

As illustrated in Figure 1, we designed our model to jointly predict both the time and location distributions by first drawing a time sample $\tilde{t}_l$ from the learned $p_l^{(t)}$ and then fixing and using $\tilde{t}_l$ as an input to the batched spatial probabilistic layer to model $p_l^{(x)}(\,\cdot\,; \tilde{t}_l, h_{x_l})$. Since our baselines address either time prediction or location prediction, we compare them with our method by simultaneously learning $p_l^{(t)}(\,\cdot\,; h_{t_l})$ and $p_l^{(x)}(\,\cdot\,; t_l, h_{x_l})$, where for $p_l^{(x)}(\,\cdot\,; t_l, h_{x_l})$ we assume that the time $t_l$ is known and given. In this way, the error is comparable with methods that predict location only, since we control for the error that could be introduced by a wrong time prediction $\tilde{t}_l$.
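The joint time-then-space sampling described above (draw a time from the temporal density, then condition the spatial Gaussian on it) can be sketched as follows. The conditioning functions `mu_fn` and `cov_fn` are hypothetical stand-ins for the learned probabilistic layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_event(beta, mu_fn, cov_fn):
    """Draw a joint (time, location) sample: first a time from the
    exponential temporal density, then a location from the Gaussian
    spatial density conditioned on that time."""
    t = rng.exponential(beta)          # time sample, scale beta
    mu, cov = mu_fn(t), cov_fn(t)      # spatial parameters depend on the time
    x = rng.multivariate_normal(mu, cov)
    return t, x

# hypothetical conditioning: location mean drifts linearly with time
t, x = sample_event(beta=0.2,
                    mu_fn=lambda t: np.array([t, -t]),
                    cov_fn=lambda t: 0.1 * np.eye(2))
```

Repeating this draw many times and averaging recovers the point predictions used for comparison with the baselines, while the samples themselves trace out the joint density.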

A.3 ABLATION STUDY

To further demonstrate the importance of each layer/block used in our proposed model, in addition to what we described in Sections 3.1 and 3.2, we performed two ablation studies on our network using all the available datasets.

In the first study, the predicted distributions $p_l^{(t)}$ and $p_l^{(x)}$ are completely independent of the history. For this, we provide batches of zero inputs to the encoder, $\{(t_i, x_i, M_i)\}_{i=1}^{n} = 0$, in both the training and testing phases, while keeping all other steps exactly as before. Results for this study on the different datasets are given in Table 2.

South California Earthquakes: As shown in Table 2, removing history dependency from our network worsens multi-event forecasting in both time and space for this dataset. This is in line with the well-known epidemic-type aftershock sequence (ETAS) model (Ogata, 1998), suggesting long-term and long-range dependencies in both time and space between seismic events.

Citibike: Removing history dependency on this dataset did not strongly affect the results. Our interpretation is that many factors other than the previous bikers' markers included in this dataset may influence the time or location of riding a bike. Therefore, removing history dependency does not strongly influence forecasting the time or location of multiple future bikers when no other influencing factors are considered.

Covid-19:

We realized that removing the history dependency did not affect time distribution forecasting, but instead enhanced the prediction of space distributions. Despite the disease being very contagious, the space-distribution forecasting results are not surprising, because we take into account long-range history dependency over many different states of the United States.

In the first method, we compute the pairwise Gaussian log-likelihood among all $x_i$ and $x_j$ associated with events $1 \le j \le i \le n$, as well as the pairwise log time decay $\Delta t_{ij}$. We minimize the loss obtained by adding the pairwise Gaussian log-likelihood and the log time decay to learn the history-dependent Gaussian mixture model. The second method is the general K-cluster Gaussian mixture model introduced in Murphy (2012) and Bishop & Nasrabadi (2006), where the learned parameters are functions of the historical events' markers. Having a mixture of K Gaussian densities

$$p(x_l \mid \mu, \Sigma) = \sum_{k=1}^{K} \frac{n_k}{n}\,\mathcal{N}(x_l;\, \mu_k, \Sigma_k), \tag{14}$$

where $n_k$ represents the number of samples associated with cluster k, we learn $\mu_k$ and $\Sigma_k$ from $\{\{(t_i, x_i, M_i)\}_{i=1}^{n}\}_k$, the times, locations, and extra markers of all events assigned to cluster k. In this case, $p(x_l | \mu, \Sigma)$ in (14) denotes the conditional spatial probability associated with event $x_{l=n+1}$. Although the second method can be conditioned on any other markers associated with the events, the first method performed better on all datasets. Therefore, in this work we only report results based on the first method; both implementations are given in our publicly available code.

A.6 PREPROCESSING OF DATA SEQUENCES

South California Earthquakes. We use discrete earthquake events distributed over the South of California with latitude, longitude, and depth. The distribution of all events along the main faults, as well as four sequences of length 500 projected on a 2-d map of the South California faults, are shown in Figure 11.
Since we do not work with in-between distances, there is no need to convert to the Cartesian coordinate system. After forming the sequences as described in Section 4.1, we normalize the space and magnitude of each sequence with respect to the mean and standard deviation gathered from all events. Easing the normalization-induced bias towards more earthquake-prone regions, previously shown in Figure 2, is left to later work, as discussed in Section 5.

Citibike. We use the data from April to August of 2019. No preprocessing was performed for this dataset. This resulted in 3500 training sequences, 900 validation sequences, and 600 test sequences of the same length.

Covid-19. We use the data from March of 2020 to May of 2022. No preprocessing was performed for this dataset. This resulted in 1400 training sequences, 360 validation sequences, and 240 test sequences of the same length.

All the mentioned hyperparameters can be set via argument parsing once the code is publicly available. Our network can be applied to any discrete-event data with variable-dimensional markers.
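The K-cluster mixture of the second baseline evaluates a density of the form $p(x) = \sum_k \pi_k \mathcal{N}(x; \mu_k, \Sigma_k)$. The following is a self-contained sketch of that evaluation with fixed, illustrative parameters rather than the history-learned ones used in the paper:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def gmm_density(x, weights, mus, covs):
    """K-cluster Gaussian mixture density: sum_k pi_k * N(x; mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, mus, covs))

# illustrative two-cluster mixture in 2-d
weights = [0.5, 0.5]
mus = [np.zeros(2), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
p = gmm_density(np.zeros(2), weights, mus, covs)
```

In the baseline itself the mixture weights, means, and covariances would be functions of the historical event markers, which is precisely what makes the model conditional.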

