ADDITIVE POISSON PROCESS: LEARNING INTENSITY OF HIGHER-ORDER INTERACTION IN POISSON PROCESSES

Abstract

We present the Additive Poisson Process (APP), a novel framework that can model the higher-order interaction effects of the intensity functions in Poisson processes using projections into lower-dimensional spaces. Our model combines techniques from information geometry, to model higher-order interactions on a statistical manifold, with techniques from generalized additive models, which use lower-dimensional projections to overcome the curse of dimensionality. Our approach solves a convex optimization problem by minimizing the KL divergence from a sample distribution in lower-dimensional projections to the distribution modeled by an intensity function in the Poisson process. Our empirical results show that our model effectively uses samples observed in lower-dimensional spaces to estimate a higher-order intensity function with sparse observations.

1. INTRODUCTION

The Poisson process is a counting process used in a wide range of disciplines, such as spatial-temporal sequential data in transportation (Zhou et al., 2021), finance (Ilalan, 2016), and ecology (Thompson, 1955), to model the arrival rate by learning an intensity function. For a given time interval, the integral of the intensity function represents the average number of events occurring in that interval. The intensity function can be generalized to multiple dimensions. However, in most practical applications, learning the multi-dimensional intensity function is challenging due to the sparsity of observations. Despite recent advances, current Poisson process models are unable to learn the intensity function of a multi-dimensional Poisson process. Our research question is: "Is there a good way to approximate the high-dimensional intensity function?" Our proposal, the Additive Poisson Process (APP), provides a novel solution to this problem.

Throughout this paper, we use a running example in a spatial-temporal setting. Suppose we want to learn the intensity function for a taxi picking up customers at a given time and location. In this setting, each event is multi-dimensional, that is, (x, y, W), where the pair (x, y) represents two spatial coordinates and W represents the day of the week. In addition, an observation time t is associated with each event. For any given location or time, we can expect at most a few pick-up events, which makes it difficult for any model to learn the low-valued intensity function. Figure 1b visualizes this problem. In this setup, if we want to learn the intensity function at a given location (x, y) and day of the week W, the naïve approach would be to learn the intensity at (x, y, W) directly from observations. This is extremely difficult because there may be only a few events for a given location and day.
However, there is useful information in lower-dimensional spaces; for example, the marginalized observations at the location (x, y) across all days of the week, or on the day W at all locations. This information can be included in the model to improve the estimation of the joint intensity function. Using information in lower-dimensional spaces provides a structured way to incorporate prior knowledge based on the location or day of the week. For example, a given location could be a shopping center or a hotel, where it is common for taxis to pick up passengers, and therefore we expect more passengers at this location. There could also be additional patterns uncovered from the day of the week. We can then use the observations of events to update our knowledge of the intensity function. Previous approaches, such as kernel density estimation (KDE) (Rosenblatt, 1956), learn the joint intensity function using information in lower dimensions. However, KDE suffers from the curse of dimensionality, which means that it requires a large number of samples to build an accurate model. In addition, the complexity of the model expands exponentially with the number of dimensions, which makes it infeasible to compute. Bayesian approaches, such as a mixture of beta distributions with a Dirichlet prior (Kottas, 2006; Kottas & Sansó, 2007) and Reproducing Kernel Hilbert Space (RKHS) methods (Flaxman et al., 2017), have been proposed to quantify the uncertainty of the intensity function. However, these approaches are often non-convex, making it difficult to obtain the globally optimal solution. Moreover, when observations are sparse, these approaches struggle to learn a reasonable intensity function. Additional related work on Bayesian inference for Poisson processes and Poisson factorization can be found in Appendix A.
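To make the sparsity problem concrete, the following sketch (with entirely hypothetical data and bin sizes) contrasts a joint histogram over (x, y, W) with its lower-dimensional projections, and forms a simple multiplicative (log-additive) reconstruction from the marginals, in the same spirit as combining lower-order terms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pick-up events: (x, y, weekday) with x, y binned into a 10x10 grid.
n_events = 200
events = np.column_stack([
    rng.integers(0, 10, n_events),   # x bin
    rng.integers(0, 10, n_events),   # y bin
    rng.integers(0, 7, n_events),    # day of week
])

# Joint histogram: 10 * 10 * 7 = 700 cells for only 200 events -> mostly empty.
joint, _ = np.histogramdd(events, bins=(10, 10, 7))
print("fraction of empty joint cells:", (joint == 0).mean())

# Lower-dimensional projections are far better populated.
loc_counts = joint.sum(axis=2)        # counts at (x, y), marginalized over days
day_counts = joint.sum(axis=(0, 1))   # counts per weekday, marginalized over space

# Reconstruct the joint from the two projections; additive in log-space.
est = loc_counts[:, :, None] * day_counts[None, None, :] / n_events
print("estimate shape:", est.shape)
```

The joint table is dominated by empty cells even though every 1D or 2D projection has dozens of observations per bin, which is exactly the information the APP exploits.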
In this paper, we propose a novel framework to learn the higher-order interaction effects of intensity functions in Poisson processes. Our model combines the techniques introduced by Luo & Sugiyama (2019) to model higher-order interactions between Poisson processes with those of Friedman & Stuetzle (1981) in generalized additive models to learn the joint intensity function using samples in a lower-dimensional space. Our proposed approach decomposes a multi-dimensional Poisson process into lower-dimensional representations. For example, we have points $(x_i)_{i=1}^N$ in the x-dimension and $(y_i)_{i=1}^N$ in the y-dimension. Such data in the lower-dimensional space can be used to improve the estimation of the joint intensity function. This is different from traditional approaches, where only the joint occurrence of events is used to learn the joint intensity. We first show the connection between generalized additive models and Poisson processes, and then provide the connection between generalized additive models and the log-linear model (Agresti, 2012), which has a well-established theoretical background in information geometry (Amari, 2016). We draw parallels between the formulation of generalized additive models and the binary log-linear model on a partially ordered set (poset) (Sugiyama et al., 2017). The learning process in our model is formulated as a convex optimization problem solved by the natural gradient, arriving at the unique optimal solution that minimizes the Kullback-Leibler (KL) divergence from the sample distribution in a lower-dimensional space to the distribution modeled by the learned intensity function. This connection provides a remarkable property to our model: the ability to learn higher-order intensity functions using lower-dimensional projections, thanks to the Kolmogorov-Arnold representation theorem. This property makes our approach advantageous in cases with no observations, missing samples, or low event rates.
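The convex KL-minimization step can be illustrated on a toy discrete log-linear model. The sketch below (not the paper's algorithm; a plain-gradient simplification with made-up marginals) fits an additive log-density log q(x, y) = θ_x[x] + θ_y[y] − log Z so that the model's marginals match given empirical marginals; for exponential-family models this gradient is exactly the difference between model and empirical expectations (moment matching):

```python
import numpy as np

# Hypothetical empirical 1D marginal distributions over two discrete axes.
p_x = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
p_y = np.array([0.05, 0.15, 0.40, 0.20, 0.20])

# Additive model in log-space: log q(x, y) = theta_x[x] + theta_y[y] - log Z.
theta_x = np.zeros(5)
theta_y = np.zeros(5)

def model(tx, ty):
    q = np.exp(tx[:, None] + ty[None, :])
    return q / q.sum()  # normalize by the partition function Z

# Gradient descent on the convex objective log Z - <p, theta>; its gradient is
# the gap between the model's marginals and the empirical marginals.
for _ in range(1000):
    q = model(theta_x, theta_y)
    theta_x -= 0.5 * (q.sum(axis=1) - p_x)
    theta_y -= 0.5 * (q.sum(axis=0) - p_y)

q = model(theta_x, theta_y)
print("max marginal error:", np.abs(q.sum(axis=1) - p_x).max())
```

Because the objective is convex in θ, any gradient method reaches the unique optimum; the paper's use of the natural gradient accelerates this by rescaling with the Fisher information.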
Our model is flexible because it captures the interaction effects between events in a Poisson process as a partial order structure in the log-linear model, and the parameters of the model are fully customizable to meet the requirements of the application. Our empirical results show that our model effectively uses samples projected onto a lower-dimensional space to estimate the higher-order intensity function. Moreover, our model is robust across a range of sample sizes.

2. FORMULATION

We start this section by introducing the technical background of the Poisson process and its extension to multiple dimensions. We then introduce the Generalized Additive Model (GAM) and its connection to the Poisson process. This is followed by our novel framework, the Additive Poisson Process (APP), which is our main technical contribution and has a tight link to Poisson processes modeled by GAMs. We show that learning in the APP can be achieved via convex optimization using the natural gradient.

The Poisson process is characterized by an intensity function λ : R → R. An inhomogeneous Poisson process is an extension of the homogeneous Poisson process in which the arrival rate changes with time. The process with time-varying intensity λ(t) is defined as a counting process N(t) with the independent increment property. For any t ≥ 0 and infinitesimal interval δ ≥ 0, the event-count probabilities are p(N(t + δ) − N(t) = 0) = 1 − δλ(t) + o(δ), p(N(t + δ) − N(t) = 1) = δλ(t) + o(δ), and p(N(t + δ) − N(t) ≥ 2) = o(δ), where o(·) denotes little-o notation (Daley & Vere-Jones, 2007).

To take (inhomogeneous) multi-dimensional attributes into account, we consider multiple intensity functions, each given as λ_J : R → R and associated with a subset J of the domain of possible states [D] = {1, . . . , D}. Each J ⊆ [D] determines the condition for the occurrence of an event. To flexibly consider any combination of possible states, [D] is composed
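The infinitesimal characterization above is what makes the standard thinning (Lewis–Shedler) algorithm valid for sampling an inhomogeneous process: candidates from a homogeneous process at rate λ_max are kept with probability λ(t)/λ_max. A minimal sketch, with an assumed toy intensity λ(t) = 5(1 + sin t):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_inhomogeneous(lam, lam_max, t_end):
    """Lewis-Shedler thinning: draw a homogeneous process at rate lam_max,
    then keep each candidate time t with probability lam(t) / lam_max."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_max)   # candidate inter-arrival time
        if t > t_end:
            break
        if rng.uniform() < lam(t) / lam_max:  # acceptance step
            times.append(t)
    return np.array(times)

# Assumed toy intensity with maximum 10; expected count over [0, 2*pi]
# is the integral of lam, i.e. 10*pi, approximately 31.4 events.
lam = lambda t: 5.0 * (1.0 + np.sin(t))
events = sample_inhomogeneous(lam, lam_max=10.0, t_end=2 * np.pi)
print("number of events:", len(events))
```

The requirement λ(t) ≤ λ_max for all t in the window is what keeps the acceptance probability valid; a tighter λ_max wastes fewer candidates.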

