FADIN: FAST DISCRETIZED INFERENCE FOR HAWKES PROCESSES WITH GENERAL PARAMETRIC KERNELS

Abstract

Temporal point processes (TPPs) are a natural tool for modeling event-based data. Among TPP models, Hawkes processes have proven to be the most widely used, mainly because they adequately model many applications, in particular with exponential or non-parametric kernels. Non-parametric kernels are flexible but require large datasets. Exponential kernels are more data efficient and relevant for applications where events immediately trigger further events, but they are ill-suited when latencies must be estimated, as in neuroscience. This work offers an efficient solution to TPP inference using general parametric kernels with finite support. The proposed solution is a fast ℓ2 gradient-based solver leveraging a discretized version of the events. After supporting the use of discretization theoretically, the statistical and computational efficiency of the novel approach is demonstrated through various numerical experiments. Finally, the effectiveness of the method is evaluated by modeling the occurrence of stimuli-induced patterns in brain signals recorded with magnetoencephalography (MEG). Thanks to the use of general parametric kernels, results show that the proposed approach yields a more plausible estimation of pattern latency than the state-of-the-art.

1. INTRODUCTION

The statistical framework of Temporal Point Processes (TPPs; see e.g., Daley & Vere-Jones 2003) is well adapted to modeling event-based data. It offers a principled way to predict the rate of events as a function of time and of the history of previous events. TPPs were historically used to model intervals between events, as in renewal theory, which studies the sequence of intervals between successive replacements of a component susceptible to failure. TPPs find many applications in neuroscience, in particular to model single-cell recordings and neural spike trains (Truccolo et al., 2005; Okatan et al., 2005; Kim et al., 2011; Rad & Paninski, 2011), occasionally combined with spatial statistics (Pillow et al., 2008) or network models (Galves & Löcherbach, 2015). In the machine learning community, there is a growing interest in these statistical tools (Bompaire, 2019; Shchur et al., 2020; Mei et al., 2020). Multivariate Hawkes processes (MHPs; Hawkes 1971) are likely the most popular, as they can model interactions between the univariate processes. They also have the peculiarity that a process can be self-exciting, meaning that a past event increases the probability of future events on the same process. The key quantity of a TPP is the conditional intensity function, which describes the instantaneous probability of an event occurring at a given time; for an MHP, it is composed of a baseline parameter and kernels, the latter representing how processes influence each other or themselves. The most commonly used inference method to obtain the baseline and the kernel parameters of an MHP is maximum likelihood estimation (MLE; see e.g., Daley & Vere-Jones, 2007 or Lewis & Mohler, 2011). An alternative and often overlooked estimation criterion is the ℓ2 least-squares error, inspired by the theory of empirical risk minimization (Reynaud-Bouret & Rivoirard, 2010; Hansen et al., 2015; Bacry et al., 2020). A key feature of MHP modeling is the choice of kernels.
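For concreteness, these two quantities can be sketched in standard notation (our notation, following e.g. Bacry et al., 2020; the exact normalization used later in the paper may differ). For a p-dimensional MHP with baselines μ_i and kernels φ_ij, the conditional intensity of the i-th process and the least-squares (ERM) criterion over an observation window [0, T] read:

```latex
% Conditional intensity of the i-th process of a p-dimensional MHP,
% where t_n^j denotes the n-th event of process j
\lambda_i(t) = \mu_i + \sum_{j=1}^{p} \sum_{t_n^{j} < t} \phi_{ij}\!\left(t - t_n^{j}\right),
% and the least-squares (ERM) criterion over [0, T]
\mathcal{L}(\mu, \phi) = \sum_{i=1}^{p} \left( \int_0^{T} \lambda_i(t)^2 \,\mathrm{d}t
    - 2 \sum_{n} \lambda_i\!\left(t_n^{i}\right) \right).
```

The second term of the loss only involves the intensity evaluated at the observed events, which is what makes this criterion amenable to the discretization and precomputations introduced below.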
Non-parametric and parametric kernels are the two options. In the non-parametric setting, kernel functions are approximated by histograms (Lewis & Mohler, 2011; Lemonnier & Vayatis, 2014), by a linear combination of pre-defined functions (Zhou et al., 2013a; Xu et al., 2016), by functions lying in an RKHS (Yang et al., 2017) or, alternatively, by neural networks (Mei & Eisner, 2017; Shchur et al., 2019; Pan et al., 2021). In addition to the frequentist approach, many Bayesian approaches, such as Gibbs sampling (Ishwaran & James, 2001) or (stochastic) variational inference (Hoffman et al., 2013), have been adapted to MHPs, in particular to fit non-parametric kernels. Bayesian methods also rely on modeling the kernel by histograms (e.g., Donnet et al., 2020) or by a linear combination of pre-defined functions (e.g., Linderman & Adams, 2015). These approaches are designed either in continuous time (Rasmussen, 2013; Zhang et al., 2018; Donnet et al., 2020; Sulem et al., 2021) or in discrete time (Mohler et al., 2013; Linderman & Adams, 2015; Zhang et al., 2018; Browning et al., 2022). Such functions allow great flexibility in the shape of the kernel, yet at the risk of poor kernel estimation when only a small amount of data is available (Xu et al., 2017). Another approach to estimating the intensity function is to consider kernels parametrized by a vector η. Although it can introduce a bias by assuming a particular kernel shape, this approach has several benefits. First, it eases inference, as η is typically of lower dimension than non-parametric kernels. Moreover, for kernels satisfying the Markov property (Bacry et al., 2015), computing the conditional intensity function is linear in the total number of events. The most popular kernel in this family is the exponential kernel (Ogata, 1981), defined by η = (α, β) ↦ α exp(−βt), where α and β are the scaling and decay parameters, respectively (Veen & Schoenberg, 2008; Zhou et al., 2013b). However, as pointed out by Lemonnier & Vayatis (2014), the maximum likelihood estimator for MHPs with exponential kernels is efficient only if the decay is fixed. Thus, only the scaling parameter is usually inferred. This implies that the decay hyperparameter must be chosen in advance, usually via grid search, random search, or Bayesian optimization, which leads to a computational burden when the dimension of the MHP is high. The second option is to define a decay parameter common to all kernels, which results in a loss of model expressiveness. In both cases, the relevance of the exponential kernel hinges on the choice of the decay parameter, which may not be adapted to the data (Hall & Willett, 2016). For more general parametric kernels that do not satisfy the Markov property, inference with either the MLE or the ℓ2 loss scales poorly, as both are quadratic in the number of events, limiting their use in practice (see e.g., Bompaire, 2019, Chapter 1). These limitations of parametric and non-parametric kernels prevent their usage in some applications, as pointed out by Carreira (2021) in finance or Allain et al. (2021) in neuroscience.

Neuroscience applications are also a strong motivation for this work. The quantitative analysis of electrophysiological signals such as electroencephalography (EEG) or magnetoencephalography (MEG) is a challenging modern neuroscience research topic (Cohen, 2014). By giving a non-invasive way to record human neural activity with high temporal resolution, EEG and MEG offer a unique opportunity to study cognitive processes triggered by controlled stimulation (Baillet, 2017). Convolutional dictionary learning (CDL) is an unsupervised algorithm that has recently been proposed to study M/EEG signals (Jas et al., 2017; Dupré la Tour et al., 2018).
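Returning to the Markov property of the exponential kernel discussed above: the sum over past events can be updated recursively, so the intensity at all events is obtained in linear time. A minimal univariate sketch (the function name and interface are illustrative, not taken from any library):

```python
import numpy as np

def exp_intensity_at_events(events, mu, alpha, beta):
    """Evaluate a univariate exponential-kernel Hawkes intensity
        lambda(t) = mu + alpha * sum_{t_n < t} exp(-beta * (t - t_n))
    at each event in O(n) time, using the Markov recursion
        R_{k+1} = exp(-beta * (t_{k+1} - t_k)) * (R_k + 1),
    instead of the naive O(n^2) double sum."""
    intensities = np.empty(len(events))
    R = 0.0  # running sum of exp(-beta * (t_k - t_n)) over past events t_n < t_k
    for k, t in enumerate(events):
        intensities[k] = mu + alpha * R  # intensity just before event k
        if k + 1 < len(events):
            dt = events[k + 1] - t
            R = np.exp(-beta * dt) * (R + 1.0)
    return intensities
```

This recursion is precisely what breaks down for general parametric kernels without the Markov property, motivating the discretization-based approach below.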
CDL extracts patterns of interest from M/EEG signals: it learns a combination of time-invariant patterns, called atoms, together with their activation functions, to sparsely reconstruct the signal. However, while CDL recovers the local structure of signals, it does not provide any global information, such as interactions between patterns or how their activations are affected by stimuli. Atoms typically correspond to transient bursts of neural activity (Sherman et al., 2016) or to artifacts such as eye blinks or heartbeats. By offering an event-based perspective on non-invasive electromagnetic brain signals, CDL makes Hawkes processes amenable to M/EEG-based studies. Given the estimated events, one important goal is then to uncover potential temporal dependencies between external stimuli presented to the subject and the appearance of the atoms in the data. More precisely, one is interested in statistically quantifying such dependencies, e.g., by estimating the mean and variance of the neural response latency following a stimulus. Allain et al. (2021) address this precise problem. Their approach is based on an EM algorithm and a truncated Gaussian kernel, which can cope with limited amounts of brain data, unlike more data-hungry non-parametric kernels. Beyond neuroscience, Carreira (2021) uses a likelihood-based approach with exponential kernels to model order book events from high-frequency trading data, accounting for latency in the proposed loss. This paper proposes a new inference method, named FaDIn, to estimate any parametric kernel for Hawkes processes. Our approach is based on two key features. First, we use finite-support kernels and a discretization applied to the ERM-inspired least-squares loss. Second, we employ precomputations that significantly reduce the computational cost.
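As a rough illustration of the first ingredient, discretization with a finite-support kernel can be sketched as follows (names and interfaces are ours, not FaDIn's actual API; the real solver adds the precomputations and gradient-based fitting):

```python
import numpy as np

def discretize_events(events, T, delta):
    """Project timestamps onto a regular grid of step `delta` over [0, T],
    returning event counts per grid bin: the discretized process used in
    place of the exact timestamps."""
    n_bins = int(np.round(T / delta))
    counts = np.zeros(n_bins)
    idx = np.clip(np.round(np.asarray(events) / delta).astype(int), 0, n_bins - 1)
    np.add.at(counts, idx, 1.0)  # unbuffered add handles repeated bins correctly
    return counts

def kernel_on_grid(kernel_fn, W, delta):
    """Evaluate a parametric kernel with finite support [0, W] on the grid,
    so that the intensity can be computed by discrete convolution with the
    binned event counts."""
    L = int(np.round(W / delta))
    return kernel_fn(delta * np.arange(L + 1))
```

On the grid, the intensity becomes a discrete convolution of the event counts with the sampled kernel values, which is what makes the loss and its gradients amenable to precomputation.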
We then show that the implicit bias induced by the discretization procedure is negligible compared to the statistical error. Further, we highlight the efficiency of FaDIn, both computationally and statistically, over the non-parametric approach. Finally, we demonstrate the benefit of using a general kernel on MEG data.

