ADDITIVE POISSON PROCESS: LEARNING INTENSITY OF HIGHER-ORDER INTERACTION IN POINT PROCESSES

Anonymous

Abstract

We present the Additive Poisson Process (APP), a novel framework that can model the higher-order interaction effects of the intensity functions in point processes using lower-dimensional projections. Our model combines techniques from information geometry to model higher-order interactions on a statistical manifold and from generalized additive models to use lower-dimensional projections to overcome the curse of dimensionality. Our approach solves a convex optimization problem by minimizing the KL divergence from a sample distribution in lower-dimensional projections to the distribution modeled by an intensity function in the point process. Our empirical results show that our model is able to use samples observed in a lower-dimensional space to estimate the higher-order intensity function under extremely sparse observations.

1. INTRODUCTION

Consider two point processes whose event arrival times are correlated. For a given time interval, what is the probability of observing an event from both processes? Can we learn the joint intensity function using only the observations from each individual process? Our proposed model, the Additive Poisson Process (APP), provides a novel solution to this problem. The Poisson process is a counting process used in a wide range of disciplines involving time-space sequence data, including transportation (Zhou et al., 2018), finance (Ilalan, 2016), ecology (Thompson, 1955), and violent crime (Taddy, 2010), to model the arrival times of a single system by learning an intensity function. For a given time, the intensity function represents the probability of a point being excited at that time. Despite recent advances in the modeling of Poisson processes and their wide applicability, the majority of point process models do not consider the correlation between two or more point processes. Our proposed approach learns the joint intensity function of the point process, defined over the simultaneous occurrence of events. For example, in a spatio-temporal problem, we want to learn the intensity function for a taxi picking up customers at a given time and location. In this problem, each point is multi-dimensional, that is, {(x_i, y_i, t_i)}_{i=1}^N, where the pair (x_i, y_i) represents the two spatial dimensions and t_i represents the time dimension. For any given location or time we can expect only very few pick-up events, making it difficult for any model to learn the low-valued intensity function. Previous approaches such as kernel density estimation (KDE) (Rosenblatt, 1956) are able to learn the joint intensity function. However, KDE suffers from the curse of dimensionality: it requires a large sample size or a high intensity function to build an accurate model.
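To make the KDE baseline concrete, the sketch below estimates a joint 3-D intensity from sparse events with a Gaussian kernel. The data, sample size, and query point are all hypothetical choices for illustration; the intensity is recovered as the sample size times the estimated density.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical 3-D events (x, y, t), e.g. taxi pick-ups; only 50 sparse points.
events = rng.uniform(0.0, 1.0, size=(50, 3))

# KDE estimate of the joint intensity: lambda(z) ~ N * density(z).
# gaussian_kde expects the data with shape (dimensions, samples).
kde = gaussian_kde(events.T)
query = np.array([[0.5, 0.5, 0.5]]).T
intensity = len(events) * kde(query)[0]
print(intensity)
```

With so few points in three dimensions the estimate is noisy, which is exactly the curse-of-dimensionality failure mode described above.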
In addition, the complexity of the model grows exponentially with the number of dimensions, which makes it infeasible to compute. Bayesian approaches, such as a mixture of beta distributions with a Dirichlet prior (Kottas, 2006) and Reproducing Kernel Hilbert Space (RKHS) methods (Flaxman et al., 2017), have been proposed to quantify uncertainty through a prior on the intensity function. However, these approaches are often non-convex, making it difficult to obtain the globally optimal solution. In addition, when observations are sparse, it is hard for these approaches to learn a reasonable intensity function. All previous models are unable to efficiently and accurately learn the intensity of the interaction between point processes. This is because the intensity of the joint process is often low, leading to sparse samples or, in the extreme case, no direct observations of the simultaneous event at all, making it difficult to learn the intensity function from the joint samples. In this paper, we propose a novel framework to learn the higher-order interaction effects of intensity functions in point processes. Our model combines the techniques introduced by Luo & Sugiyama (2019) to model higher-order interactions between point processes and by Friedman & Stuetzle (1981) in generalized additive models to learn the intensity function from samples in a lower-dimensional space. Our proposed approach decomposes a multi-dimensional point process into lower-dimensional representations. For example, in the x-dimension we have points (x_i)_{i=1}^N, in the y-dimension we have points (y_i)_{i=1}^N, and in the time dimension we have (t_i)_{i=1}^N. The data in these lower-dimensional spaces can be used to improve the estimate of the joint intensity function. This differs from the traditional approach, where only the simultaneous events are used to learn the joint intensity function. We first show the connection between generalized additive models and Poisson processes.
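The decomposition into lower-dimensional representations can be sketched as follows; the event coordinates and bin count are illustrative assumptions. The key observation is that every joint event contributes to each one-dimensional projection, so the marginals stay densely populated even when the joint space is nearly empty.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical joint events (x_i, y_i, t_i); names are illustrative.
events = rng.uniform(0.0, 1.0, size=(100, 3))

# Lower-dimensional projections: keep one coordinate per view.
xs, ys, ts = events[:, 0], events[:, 1], events[:, 2]

# A 1-D histogram of a projection (an empirical marginal intensity)
# receives all N events, unlike a sparsely occupied 3-D histogram.
x_counts, _ = np.histogram(xs, bins=10, range=(0.0, 1.0))
print(x_counts.sum())  # all 100 events survive the projection
```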
We then provide the connection between generalized additive models and the log-linear model (Agresti, 2012), which has a well-established theoretical background in information geometry (Amari, 2016). We draw parallels between the formulation of generalized additive models and the binary log-linear model on a partially ordered set (poset) (Sugiyama et al., 2017). Learning in our model is formulated as a convex optimization problem solved with the natural gradient, which minimizes the Kullback-Leibler (KL) divergence from the sample distribution in a lower-dimensional space to the distribution modeled by the learned intensity function, and thus arrives at a unique optimal solution. This connection provides our model with a remarkable property: the ability to learn higher-order intensity functions from lower-dimensional projections, thanks to the Kolmogorov-Arnold representation theorem. This property makes our proposed approach advantageous in cases with no observations, missing samples, or a low event rate. Our model is flexible because it captures interactions between processes as a partial-order structure in the log-linear model, and the parameters of the model are fully customizable to meet the requirements of the application. Our empirical results show that our model effectively uses samples projected onto a lower-dimensional space to estimate the higher-order intensity function. Our model is also robust to various sample sizes.
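As a toy illustration of fitting an additive log-linear model by KL minimization, the sketch below uses plain gradient descent (not the natural gradient of the actual method) on a discretized 2-D grid; the grid size, learning rate, and synthetic counts are all assumptions. The gradient of the KL divergence with respect to each additive component reduces to a marginal mismatch, so the fit matches the lower-dimensional projections of the data.

```python
import numpy as np

rng = np.random.default_rng(2)
B = 8  # grid resolution (illustrative)

# Empirical joint distribution on a 2-D grid (hypothetical sample counts).
counts = rng.poisson(3.0, size=(B, B)).astype(float)
p_emp = counts / counts.sum()

# Additive log-linear model: log p(i, j) = f1[i] + f2[j] - log Z.
f1, f2 = np.zeros(B), np.zeros(B)
for _ in range(500):
    logits = f1[:, None] + f2[None, :]
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # d KL(p_emp || p) / d f1[i] = p(i) - p_emp(i): a marginal mismatch.
    f1 -= 0.5 * (p.sum(axis=1) - p_emp.sum(axis=1))
    f2 -= 0.5 * (p.sum(axis=0) - p_emp.sum(axis=0))

# After convergence, the model marginals match the empirical projections.
mismatch = np.abs(p.sum(axis=1) - p_emp.sum(axis=1)).max()
print(mismatch)
```

Because the objective is convex in (f1, f2), gradient descent reaches the unique optimum up to the additive shift f1 + c, f2 - c, which leaves the distribution unchanged.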

2. FORMULATION

In this section we first introduce the technical background of the Poisson process and its extension to a multi-dimensional Poisson process. We then introduce the Generalized Additive Model (GAM) and its connection to the Poisson process. This is followed by our novel framework, called the Additive Poisson Process (APP), which is our main technical contribution and has a tight link to Poisson processes modeled by GAMs. We show that learning of the APP can be achieved via convex optimization using the natural gradient.

The Poisson process is characterized by an intensity function λ: R^D → R, where we assume D processes. An inhomogeneous Poisson process is a general type of process in which the arrival intensity changes with time. The process with time-changing intensity λ(t) is defined as a counting process N(t) with the independent increment property. For all times t ≥ 0 and changes in time δ ≥ 0, the probabilities of the observations are given by

p(N(t + δ) - N(t) = 0) = 1 - δλ(t) + o(δ),
p(N(t + δ) - N(t) = 1) = δλ(t) + o(δ),
p(N(t + δ) - N(t) ≥ 2) = o(δ),

where o(·) denotes little-o notation (Daley & Vere-Jones, 2007). Given a realization of timestamps t_1, t_2, ..., t_N with t_i ∈ [0, T]^D from an inhomogeneous (multi-dimensional) Poisson process with intensity λ, each t_i is the time of occurrence of the i-th event across the D processes and T is the observation duration. The likelihood of the Poisson process (Daley & Vere-Jones, 2007) is given by

p({t_i}_{i=1}^N | λ(t)) = exp(-∫ λ(t) dt) ∏_{i=1}^N λ(t_i), (1)

where t = [t^(1), ..., t^(D)] ∈ R^D. We define the functional prior on λ(t) as

λ(t) := g(f(t)) = exp(f(t)). (2)

The function g(·) is a positive function chosen to guarantee the non-negativity of the intensity, which we take to be the exponential function; our objective is to learn the function f(·). The log-
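The Poisson process likelihood above, exp(-∫ λ(t) dt) ∏_i λ(t_i) with λ(t) = exp(f(t)), is straightforward to evaluate numerically in one dimension. The log-intensity f, horizon T, and event times below are hypothetical choices for illustration; the integral is approximated by a midpoint Riemann sum.

```python
import numpy as np

def f(t):
    # Hypothetical log-intensity; the APP would learn this function.
    return np.sin(2.0 * np.pi * t)

def log_likelihood(times, T=1.0, grid=10_000):
    # log p({t_i} | lambda) = -integral_0^T exp(f(t)) dt + sum_i f(t_i)
    ts = (np.arange(grid) + 0.5) * (T / grid)      # midpoint grid on [0, T]
    integral = np.exp(f(ts)).sum() * (T / grid)    # numerical ∫ lambda dt
    return -integral + np.sum(f(np.asarray(times)))  # log lambda(t_i) = f(t_i)

ll = log_likelihood([0.1, 0.2, 0.4])
print(ll)
```

Working with log λ = f is what makes the non-negativity constraint on the intensity vanish, at the cost of the normalizing integral term.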

