EFFICIENT INFERENCE OF FLEXIBLE INTERACTION IN SPIKING-NEURON NETWORKS

Abstract

The Hawkes process provides an effective statistical framework for analyzing the time-dependent interaction of neuronal spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modelling inhibitory interactions among neurons. In contrast, the nonlinear Hawkes process allows for a more flexible influence pattern with excitatory or inhibitory interactions. In this paper, three sets of auxiliary latent variables (Pólya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented so that the functional connection weights appear in a Gaussian form, which allows for a simple iterative algorithm with analytical updates. As a result, an efficient expectation-maximization (EM) algorithm is derived to obtain the maximum a posteriori (MAP) estimate. We demonstrate the accuracy and efficiency of our algorithm on synthetic and real data. For real neural recordings, we show that our algorithm can estimate the temporal dynamics of interaction and reveal the interpretable functional connectivity underlying neural spike trains.

1. INTRODUCTION

One of the most important tracks in neuroscience is to examine the neuronal activity in the cerebral cortex under varying experimental conditions. Recordings of neuronal activity are represented through a series of action potentials or spike trains. The transmitted information and functional connection between neurons are considered to be primarily represented by spike trains (Kass et al., 2014; Kass & Ventura, 2001; Brown et al., 2004; 2002) . A spike train is a sequence of recorded times at which a neuron fires an action potential and each spike may be considered to be a timestamp. Spikes occur irregularly both within and across multiple trials, so it is reasonable to consider a spike train as a point process with the instantaneous firing rate being the intensity function of point processes (Perkel et al., 1967; Paninski, 2004; Eden et al., 2004) . An example of spike trains for multiple neurons is shown in Fig. 2a in the real data experiment. Despite many existing applications, the classic point process models, e.g., Poisson processes, neglect the time-dependent interaction within one neuron and between multiple neurons, so fail to capture the complex temporal dynamics of a neural population. In contrast, Hawkes process is one type of point processes which is able to model the self-exciting interaction between past and future events. Existing applications cover a wide range of domains including seismology (Ogata, 1998; 1999) , criminology (Mohler et al., 2011; Lewis et al., 2012) , financial engineering (Bacry et al., 2015; Filimonov & Sornette, 2015) and epidemics (Saichev & Sornette, 2011; Rizoiu et al., 2018) . Unfortunately, due to the linearly additive intensity, the vanilla Hawkes process can only represent the purely excitatory interaction because a negative firing rate may exist with inhibitory interaction. 
This makes the vanilla version inappropriate in the neuroscience domain where the influence between neurons is a mixture of excitation and inhibition (Maffei et al., 2004; Mongillo et al., 2018) . In order to reconcile Hawkes process with inhibition, various nonlinear Hawkes process variants are proposed to allow for both excitatory and inhibitory interactions. The core point of nonlinear Hawkes process is a nonlinearity which maps the convolution of the spike train with a causal influential kernel to a nonnegative conditional intensity, such as rectifier (Reynaud-Bouret et al., 2013) , exponential (Gerhard et al., 2017) and sigmoid (Linderman, 2016; Apostolopoulou et al., 2019) . The sigmoid mapping function has the advantage that the Pólya-Gamma augmentation scheme can be utilized to convert the likelihood into a Gaussian form, which makes the inference tractable. In Linderman (2016) , a discrete-time model is proposed to convert the likelihood from a Poisson process to a Poisson distribution. Then Pólya-Gamma random variables are augmented on discrete observations to propose a Gibbs sampler. This method is further extended to a continuous-time regime in Apostolopoulou et al. (2019) by augmenting thinned points and Pólya-Gamma random variables to propose a Gibbs sampler. However, the influence function is limited to be purely exciting or inhibitive exponential decay. Besides, due to the nonconjugacy of the excitation parameter of exponential decay influence function, a Metropolis-Hastings sampling step has to be embedded into the Gibbs sampler making the Markov chain Monte Carlo (MCMC) algorithm further inefficient. 
To address the parametric restriction and the inefficiency of the aforementioned works, we develop a flexible sigmoid nonlinear multivariate Hawkes process (SNMHP) model in the continuous-time regime, which (1) can represent flexible excitation-inhibition-mixture temporal dynamics among the neural population and (2) admits efficient conjugate inference. An EM inference algorithm is proposed to fit neural spike trains. Inspired by Donner & Opper (2017; 2018), three sets of auxiliary latent variables (Pólya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented so that the functional connection weights appear in a Gaussian form. As a result, the EM algorithm has analytical updates with drastically improved efficiency. As shown in the experiments, it is even more efficient than maximum likelihood estimation (MLE) for the parametric Hawkes process in high-dimensional cases.

2. OUR MODEL

Neurons communicate with each other by action potentials (spikes) and chemical neurotransmitters. A spike causes the pre-synaptic neuron to release a chemical neurotransmitter that induces impulse responses, either exciting or inhibiting the post-synaptic neuron from firing its own spikes. The addition of excitatory and inhibitory influence to a neuron determines whether a spike will occur. At the same time, the impulse response characterizes the temporal dynamics of the exciting or inhibiting influence which can be complex and flexible (Purves et al., 2014; Squire et al., 2012; Bassett & Sporns, 2017) . Arguably, the flexible nonlinear multivariate Hawkes processes are a suitable choice for representing the temporal dynamics of mutually excitatory or inhibitory interactions and functional connectivity of neuron networks.

2.1. MULTIVARIATE HAWKES PROCESSES

The vanilla multivariate Hawkes processes (Hawkes, 1971) are sequences of timestamps $D = \{\{t_n^i\}_{n=1}^{N_i}\}_{i=1}^{M}$ on $[0, T]$, where $t_n^i$ is the timestamp of the $n$-th event on the $i$-th dimension, $N_i$ is the number of points on the $i$-th dimension, $M$ is the number of dimensions and $T$ is the observation window. The $i$-th dimensional conditional intensity $\lambda_i(t)$, for which $\lambda_i(t)\,dt$ is the probability of an event occurring on the $i$-th dimension in $[t, t + dt)$ given the history of all dimensions before $t$, is designed in a linear superposition form:

$$\lambda_i(t) = \mu_i + \sum_{j=1}^{M} \sum_{t_n^j < t} \phi_{ij}(t - t_n^j), \tag{1}$$

where $\mu_i > 0$ is the baseline rate of the $i$-th dimension and $\phi_{ij}(\cdot) \geq 0$ is the causal influence function (impulse response) from the $j$-th dimension to the $i$-th dimension, which is normally a parameterized function, e.g., exponential decay. The summation explains the self- and mutual-excitation phenomenon, i.e., the occurrence of previous events increases the intensity of events in the future. Unfortunately, the vanilla multivariate Hawkes processes allow only nonnegative (excitatory) influence functions, because negative (inhibitory) influence functions may yield a negative intensity, which is meaningless. To reconcile the vanilla version with inhibitory effects and flexible influence functions, we propose the SNMHP.
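The limitation described above can be made concrete with a minimal sketch of the linear intensity under an exponential-decay kernel. The function name `linear_hawkes_intensity`, the weight matrix `alpha` and the shared `decay` rate are illustrative assumptions, not part of the original model specification.

```python
import numpy as np

def linear_hawkes_intensity(t, events, mu, alpha, decay):
    """Evaluate lambda_i(t) = mu_i + sum_j sum_{t_n^j < t} phi_ij(t - t_n^j)
    with the assumed kernel phi_ij(s) = alpha[i, j] * decay * exp(-decay * s)."""
    alpha = np.asarray(alpha, dtype=float)
    lam = np.array(mu, dtype=float)            # copy of the baseline rates
    for j, times in enumerate(events):
        past = np.asarray(times, dtype=float)
        past = past[past < t]                  # causal: only events strictly before t
        kernel_sum = (decay * np.exp(-decay * (t - past))).sum()
        lam += alpha[:, j] * kernel_sum        # contribution of dimension j to every i
    return lam

# Two neurons; neuron 0 fired once at t = 1. A negative weight on the
# second row mimics inhibition and pushes the linear intensity below zero.
events = [[1.0], []]
lam = linear_hawkes_intensity(2.0, events, mu=[0.5, 0.5],
                              alpha=[[0.2, 0.0], [-5.0, 0.0]], decay=1.0)
print(lam)
```

With the inhibitory weight, the second component of the returned vector is negative, which is not a valid intensity; this is the failure mode that motivates the nonlinear link in the next section.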

2.2. SIGMOID NONLINEAR MULTIVARIATE HAWKES PROCESSES

Similar to the classic nonlinear multivariate Hawkes processes (Brémaud & Massoulié, 1996), the $i$-th dimensional conditional intensity of SNMHP is defined as

$$\lambda_i(t) = \bar{\lambda}_i \sigma(h_i(t)), \qquad h_i(t) = \mu_i + \sum_{j=1}^{M} \sum_{t_n^j < t} \phi_{ij}(t - t_n^j), \tag{2}$$

where $\mu_i$ is the base activation of neuron $i$, $h_i(t)$ is a real-valued activation and $\sigma(\cdot)$ is the logistic (sigmoid) function, which maps the activation into a positive real value in $(0, 1)$, with $\bar{\lambda}_i$ being an upper-bound that scales it to $(0, \bar{\lambda}_i)$. The sigmoid function is chosen because, as seen later, the Pólya-Gamma augmentation scheme can be utilized to make the inference tractable. After incorporating the nonlinearity, it is straightforward to see that the influence functions $\phi_{ij}(\cdot)$ can be positive or negative. If $\phi_{ij}(\cdot)$ is negative, the superposition of $\phi_{ij}(\cdot)$ leads to a negative activation $h_i(t)$ that drives the intensity towards 0; with a positive $\phi_{ij}(\cdot)$, the intensity instead tends to $\bar{\lambda}_i$. To achieve a flexible impulse response, the influence function is assumed to be a weighted sum of basis functions

$$\phi_{ij}(\cdot) = \sum_{b=1}^{B} w_{ijb} \tilde{\phi}_b(\cdot), \tag{3}$$

where $\{\tilde{\phi}_b\}_{b=1}^{B}$ are predefined basis functions and $w_{ijb}$ is the weight capturing the influence from the $j$-th dimension to the $i$-th dimension through the $b$-th basis function, with positive values indicating excitation and negative values indicating inhibition. The basis functions are nonnegative functions capturing the temporal dynamics of the interaction. Although basis functions can take any form, in order for the weights to represent functional connection strength, the basis functions are chosen to be probability densities with compact support, i.e., they have bounded support $[0, T_\phi]$ and integrate to one.

As a result, the $i$-th dimensional activation is

$$h_i(t) = \mu_i + \sum_{j=1}^{M} \sum_{t_n^j < t} \sum_{b=1}^{B} w_{ijb} \tilde{\phi}_b(t - t_n^j) = \mu_i + \sum_{j=1}^{M} \sum_{b=1}^{B} w_{ijb} \sum_{t_n^j < t} \tilde{\phi}_b(t - t_n^j) = \mu_i + \sum_{j=1}^{M} \sum_{b=1}^{B} w_{ijb} \Phi_{jb}(t) = w_i^T \cdot \Phi(t), \tag{4}$$

where $\Phi_{jb}(t)$ is the convolution of the $j$-th dimensional observation with the $b$-th basis function and can be precomputed; $w_i = [\mu_i, w_{i11}, \ldots, w_{iMB}]^T$ and $\Phi(t) = [1, \Phi_{11}(t), \ldots, \Phi_{MB}(t)]^T$ are both $(MB + 1) \times 1$ vectors. A similar model is used in Linderman (2016), where a binary variable is included to characterize the sparsity of functional connection. As shown later, the sparsity in our model is instead guaranteed by a Laplace prior on the weights. In this paper, the basis functions are scaled (shifted) Beta densities, but alternatives such as Gaussian or Gamma densities can also be used. We choose the Beta distribution because, with infinite-support densities, the inference of weights is subject to edge effects close to the endpoints of $[0, T_\phi]$. The weighted sum of Beta densities is a natural choice: with appropriate mixing, it can approximate functions on bounded intervals arbitrarily well (Kottas, 2006).
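The feature construction $\Phi(t)$ and the sigmoid intensity can be sketched as follows. This is a minimal illustration, assuming that the "scaled (shifted)" Beta densities map onto the `loc` and `scale` arguments of `scipy.stats.beta` and that lags are restricted to $(0, T_\phi]$; the helper names `beta_basis`, `features` and `intensity` are hypothetical.

```python
import numpy as np
from scipy.stats import beta

def beta_basis(a, b, shift, scale):
    """Scaled/shifted Beta density as a function of the lag (assumed mapping to loc/scale)."""
    return lambda lag: beta.pdf(lag, a, b, loc=shift, scale=scale)

def features(t, events, bases, t_phi):
    """Phi(t) = [1, Phi_11(t), ..., Phi_MB(t)], Phi_jb(t) = sum_{t_n^j < t} basis_b(t - t_n^j)."""
    phi = [1.0]
    for times in events:                        # loop over source dimensions j
        lags = t - np.asarray(times, dtype=float)
        lags = lags[(lags > 0) & (lags <= t_phi)]   # causal lags inside the support
        for basis in bases:                     # loop over basis functions b
            phi.append(basis(lags).sum())
    return np.array(phi)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def intensity(t, events, w_i, lam_bar_i, bases, t_phi):
    """lambda_i(t) = lam_bar_i * sigmoid(w_i^T Phi(t)), always inside (0, lam_bar_i)."""
    return lam_bar_i * sigmoid(w_i @ features(t, events, bases, t_phi))

bases = [beta_basis(50, 50, s, 6.0) for s in (-2, -1, 0, 1)]
events = [[0.5, 2.0], [1.0]]
w_inhib = np.concatenate([[0.0], -3.0 * np.ones(8)])   # purely inhibitory weights
print(intensity(4.0, events, w_inhib, 5.0, bases, 6.0))
```

Unlike the linear model, negative weights here only push the intensity towards zero, and the value stays positive and bounded by the upper-bound.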

3. INFERENCE

The likelihood of a point process model is provided in Daley & Vere-Jones (2003). Correspondingly, the probability density (likelihood) of SNMHP on the $i$-th dimension as a function of parameters in continuous time is

$$p(D|w_i, \bar{\lambda}_i) = \prod_{n=1}^{N_i} \bar{\lambda}_i \sigma(h_i(t_n^i)) \exp\left(-\int_0^T \bar{\lambda}_i \sigma(h_i(t))\,dt\right). \tag{5}$$

It is worth noting that $h_i(t)$ depends on $w_i$ and on the observations on all dimensions. Our goal is to infer the parameters, i.e., weights and intensity upper-bounds, from observations, e.g., neural spike trains, over a time interval $[0, T]$. The functional connectivity in cortical circuits has been shown to be sparse (Thomson & Bannister, 2003; Sjöström et al., 2001). To include sparsity, a factorizing Laplace prior is applied on the weights, which characterize the functional connection. With the likelihood Eq. 5 and the Laplace prior $p_L(w_i) = \prod_{j,b} \frac{1}{2\alpha} \exp\left(-\frac{|w_{ijb}|}{\alpha}\right)$, the log-posterior corresponds to an $L_1$-penalized log-likelihood. The $i$-th dimensional MAP estimate can be expressed as

$$w_i^*, \bar{\lambda}_i^* = \operatorname*{argmax}_{w_i, \bar{\lambda}_i} \left\{ \log p(D|w_i, \bar{\lambda}_i) + \log p_L(w_i) \right\}, \tag{6}$$

where $w_i^*$ and $\bar{\lambda}_i^*$ are the MAP estimates. The dependency of the log-posterior on the parameters is complex because the sigmoid function appears in the log-likelihood term and the absolute value function appears in the log-prior term. As a result, there is no closed-form solution for the MAP estimates. Numerical optimization methods can be applied but, unfortunately, their efficiency is low due to the high dimensionality of the parameters, which is $(MB + 2) \times M$. To circumvent this issue, three sets of auxiliary latent variables (Pólya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make the weights appear in a Gaussian form in the posterior. As a result, an efficient EM algorithm with analytical updates is derived to obtain the MAP estimate.
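The objective in Eq. 6 can be evaluated numerically as in the sketch below, which is an illustration rather than the authors' implementation. It assumes that $\Phi(t)$ has been precomputed at the event times (`phi_events`) and at a set of quadrature nodes on $[0, T]$ (`phi_nodes`, with weights `quad_weights`), and that the Laplace prior covers all $MB + 1$ entries of $w_i$; all names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def log_posterior(w_i, lam_bar_i, phi_events, phi_nodes, quad_weights, alpha):
    """L1-penalized log-likelihood of one dimension (Eq. 5 plus the log Laplace prior)."""
    h_events = phi_events @ w_i                 # activations at observed spikes
    h_nodes = phi_nodes @ w_i                   # activations at quadrature nodes
    # sum_n log(lam_bar * sigmoid(h(t_n)))
    log_lik = np.sum(np.log(lam_bar_i) + log_sigmoid(h_events))
    # quadrature approximation of int_0^T lam_bar * sigmoid(h(t)) dt
    log_lik -= lam_bar_i * np.sum(quad_weights * np.exp(log_sigmoid(h_nodes)))
    # factorizing Laplace prior with scale alpha
    log_prior = np.sum(-np.abs(w_i) / alpha - np.log(2.0 * alpha))
    return log_lik + log_prior
```

Such a function could be handed to a generic optimizer, which is the numerical route the text describes as inefficient in high dimensions; the augmentation scheme that follows avoids it.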

3.1. AUGMENTATION OF PÓLYA-GAMMA VARIABLES

Following Polson et al. (2013), binomial likelihoods parametrized by log odds can be represented as mixtures of Gaussians w.r.t. a Pólya-Gamma distribution. Therefore, we can define a Gaussian representation of the sigmoid function

$$\sigma(z) = \int_0^\infty e^{f(\omega, z)} p_{PG}(\omega|1, 0)\,d\omega, \tag{7}$$

where $f(\omega, z) = z/2 - z^2\omega/2 - \log 2$ and $p_{PG}(\omega|1, 0)$ is the Pólya-Gamma distribution with $\omega \in \mathbb{R}^+$. Substituting Eq. 7 into the likelihood Eq. 5, the products of observations $\sigma(h_i(t_n^i))$ are transformed into a Gaussian form.
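As a short check of Eq. 7, the identity follows from the Laplace transform of the Pólya-Gamma distribution given in Polson et al. (2013), $\mathbb{E}_{PG(1,0)}[e^{-\omega s}] = 1/\cosh(\sqrt{s/2})$, evaluated at $s = z^2/2$:

```latex
\begin{align*}
\int_0^\infty e^{f(\omega, z)}\, p_{PG}(\omega|1,0)\, d\omega
  &= \frac{e^{z/2}}{2} \int_0^\infty e^{-z^2 \omega / 2}\, p_{PG}(\omega|1,0)\, d\omega \\
  &= \frac{e^{z/2}}{2 \cosh(z/2)}
   = \frac{e^{z/2}}{e^{z/2} + e^{-z/2}}
   = \frac{1}{1 + e^{-z}} = \sigma(z).
\end{align*}
```

Because $f(\omega, z)$ is quadratic in $z$, and $z = h_i(t) = w_i^T \Phi(t)$ is linear in the weights, each augmented factor is Gaussian in $w_i$ once $\omega$ is given.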

3.2. AUGMENTATION OF MARKED POISSON PROCESSES

Inspired by Donner & Opper (2018), a latent marked Poisson process is augmented to linearize the exponential integral term in the likelihood. Applying the property of the sigmoid function $\sigma(z) = 1 - \sigma(-z)$ and Eq. 7, the exponential integral term is transformed to

$$\exp\left(-\int_0^T \bar{\lambda}_i \sigma(h_i(t))\,dt\right) = \exp\left(-\int_0^T \int_0^\infty \left(1 - e^{f(\omega, -h_i(t))}\right) \bar{\lambda}_i\, p_{PG}(\omega|1, 0)\,d\omega\,dt\right). \tag{8}$$

The right-hand side is a characteristic functional of a marked Poisson process. According to Campbell's theorem (Kingman, 2005) (App. I), the exponential integral term can be rewritten as

$$\exp\left(-\int_0^T \bar{\lambda}_i \sigma(h_i(t))\,dt\right) = \mathbb{E}_{p_{\bar{\lambda}_i}}\left[\prod_{(\omega, t) \in \Pi_i} e^{f(\omega, -h_i(t))}\right], \tag{9}$$

where $\Pi_i = \{(\omega_k^i, t_k^i)\}_{k=1}^{K_i}$ denotes a realization of a marked Poisson process and $p_{\bar{\lambda}_i}$ is the probability measure of the marked Poisson process $\Pi_i$ with intensity $\bar{\lambda}_i(t, \omega) = \bar{\lambda}_i\, p_{PG}(\omega|1, 0)$. The events $\{t_k^i\}_{k=1}^{K_i}$ follow a Poisson process with rate $\bar{\lambda}_i$, and the latent Pólya-Gamma variable $\omega_k^i$ denotes the independent mark at each location $t_k^i$. After substituting Eq. 9 into the likelihood Eq. 5, the exponential integral term is also transformed into a Gaussian form.
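For reference, the form of Campbell's theorem used in this step can be stated as follows for a Poisson process $\Pi$ on a space $\mathcal{X}$ with intensity $\Lambda(x)$ and a function $u$ for which the integral converges. The symbols $\mathcal{X}$, $\Lambda$ and $u$ are generic notation introduced only for this statement.

```latex
\begin{equation*}
\mathbb{E}\left[\prod_{x \in \Pi} e^{u(x)}\right]
  = \exp\left(-\int_{\mathcal{X}} \left(1 - e^{u(x)}\right) \Lambda(x)\, dx\right).
\end{equation*}
```

Setting $x = (\omega, t)$, $\Lambda(x) = \bar{\lambda}_i\, p_{PG}(\omega|1, 0)$ and $u(x) = f(\omega, -h_i(t))$ recovers the step from Eq. 8 to Eq. 9.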

3.3. AUGMENTATION OF SPARSITY VARIABLES

The augmentation of the two sets of auxiliary latent variables above gives the augmented likelihood a Gaussian form w.r.t. the weights. However, the absolute value in the exponent of the Laplace prior hampers the Gaussian form of the weights in the posterior. To circumvent this issue, we augment the third set of auxiliary latent variables: sparsity variables. It has been proved that a Laplace distribution can be represented as an infinite mixture of Gaussians (Donner & Opper, 2017; Pontil et al., 2000)

$$p_L(w_{ijb}) = \frac{1}{2\alpha} \exp\left(-\frac{|w_{ijb}|}{\alpha}\right) = \int_0^\infty \sqrt{\frac{\beta_{ijb}}{2\pi\alpha^2}} \exp\left(-\frac{\beta_{ijb}}{2\alpha^2} w_{ijb}^2\right) p(\beta_{ijb})\,d\beta_{ijb}, \tag{10}$$

where $p(\beta_{ijb}) = (\beta_{ijb}/2)^{-2} \exp(-1/(2\beta_{ijb}))$. It is straightforward to see that the weights are transformed into a Gaussian form in the prior after the augmentation of the latent sparsity variables $\beta$.
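The scale-mixture representation can be checked numerically. The sketch below is an illustration under a stated assumption: it parameterizes the mixture by the variance scale $\tau = 1/\beta$ and uses the normalized exponential mixing density $\frac{1}{2}e^{-\tau/2}$, which is the standard form of this identity; normalizing constants of the mixing density do not affect the dependence on the weights. The function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad

def laplace_pdf(w, alpha):
    """Laplace density with scale alpha."""
    return np.exp(-abs(w) / alpha) / (2.0 * alpha)

def gaussian_mixture_pdf(w, alpha):
    """Integrate N(w | 0, alpha^2 * tau) against an Exp(1/2) density over tau = 1/beta."""
    def integrand(tau):
        var = alpha**2 * tau
        gauss = np.exp(-w**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
        return gauss * 0.5 * np.exp(-tau / 2.0)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for w in (0.3, -1.2, 2.0):
    print(w, gaussian_mixture_pdf(w, 0.5), laplace_pdf(w, 0.5))
```

The two columns printed for each weight should agree up to quadrature error, which illustrates why conditioning on the sparsity variable leaves a Gaussian factor in $w_{ijb}$.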

3.4. AUGMENTED LIKELIHOOD AND PRIOR

After the augmentation of three sets of latent variables, we obtain the augmented joint likelihood and prior (derivation in App. II)

$$p(D, \Pi_i, \omega_i | w_i, \bar{\lambda}_i) = \prod_{n=1}^{N_i} \left[\bar{\lambda}_i(t_n^i, \omega_n^i)\, e^{f(\omega_n^i, h_i(t_n^i))}\right] \cdot p_{\bar{\lambda}_i}(\Pi_i | \bar{\lambda}_i) \prod_{(\omega, t) \in \Pi_i} e^{f(\omega, -h_i(t))}, \tag{11a}$$

$$p(w_i, \beta_i) = \prod_{j,b}^{MB+1} \sqrt{\frac{\beta_{ijb}}{2\pi\alpha^2}} \exp\left(-\frac{\beta_{ijb}}{2\alpha^2} w_{ijb}^2\right) \left(\frac{2}{\beta_{ijb}}\right)^2 \exp\left(-\frac{1}{2\beta_{ijb}}\right), \tag{11b}$$

where $\omega_i$ is the vector of $\omega_n^i$ on each $t_n^i$, $\beta_i = [\beta_{i00}, \beta_{i11}, \ldots, \beta_{iMB}]^T$ is an $(MB + 1) \times 1$ vector and $\bar{\lambda}_i(t_n^i, \omega_n^i) = \bar{\lambda}_i\, p_{PG}(\omega_n^i|1, 0)$. The motivation for augmenting auxiliary latent variables should now be clear: the augmented likelihood and prior contain the weights in a Gaussian form, which corresponds to a quadratic expression for the log-posterior ($L_1$-penalized log-likelihood).

3.5. EM ALGORITHM

The original MAP estimate is given by Eq. 6. With the support of auxiliary latent variables, we propose an analytical EM algorithm to obtain the MAP estimate instead of performing numerical optimization. In the standard EM algorithm framework, the lower-bound (surrogate function) of the log-posterior can be represented as

$$Q(w_i, \bar{\lambda}_i | w_i^{s-1}, \bar{\lambda}_i^{s-1}) = \mathbb{E}_{\Pi_i, \omega_i}\left[\log p(D, \Pi_i, \omega_i | w_i, \bar{\lambda}_i)\right] + \mathbb{E}_{\beta_i}\left[\log p(w_i, \beta_i)\right], \tag{12}$$

with expectations over the posterior distributions $p(\Pi_i, \omega_i | w_i^{s-1}, \bar{\lambda}_i^{s-1})$ and $p(\beta_i | w_i^{s-1}, \bar{\lambda}_i^{s-1})$, where $s - 1$ indicates parameters from the last iteration.

E step: Based on the joint distributions in Eq. 11, the posterior of the latent variables can be derived. The detailed derivation is provided in App. III. The posterior distributions of the Pólya-Gamma variables $\omega_i$ and sparsity variables $\beta_i$, and the posterior intensity of the marked Poisson process $\Pi_i$, are

$$p(\omega_i | w_i^{s-1}) = \prod_{n=1}^{N_i} p_{PG}(\omega_n^i | 1, h_i^{s-1}(t_n^i)), \tag{13a}$$

$$\Lambda_i(t, \omega | w_i^{s-1}, \bar{\lambda}_i^{s-1}) = \bar{\lambda}_i^{s-1} \sigma(-h_i^{s-1}(t))\, p_{PG}(\omega | 1, h_i^{s-1}(t)), \tag{13b}$$

$$p(\beta_i | w_i^{s-1}) = \prod_{j,b}^{MB+1} p_{IG}\left(\beta_{ijb} \,\Big|\, \frac{\alpha}{w_{ijb}^{s-1}}, 1\right), \tag{13c}$$

where $\Lambda_i(t, \omega)$ is the posterior intensity of $\Pi_i$ and $p_{IG}$ is the inverse Gaussian distribution. It is worth noting that $h_i^{s-1}(t)$ depends on $w_i^{s-1}$. The first-order moments, $\mathbb{E}[\omega_n^i] = \frac{1}{2h_i^{s-1}(t_n^i)} \tanh\left(\frac{h_i^{s-1}(t_n^i)}{2}\right)$ and $\mathbb{E}[\beta_{ijb}] = \alpha / w_{ijb}^{s-1}$, will be used in the M step.

M step: Substituting Eq. 13 into Eq. 12, we obtain the lower-bound $Q(w_i, \bar{\lambda}_i | w_i^{s-1}, \bar{\lambda}_i^{s-1})$. The updated parameters can be obtained by maximizing the lower-bound. The detailed derivation is provided in App. III. Due to the augmentation of auxiliary latent variables, the update of parameters has a closed-form solution

$$\bar{\lambda}_i^s = (N_i + K_i)/T, \qquad w_i^s = \Sigma_i \int_0^T B_i(t) \Phi(t)\,dt, \tag{14}$$

where $K_i = \int_0^T \int_0^\infty \Lambda_i(t, \omega | w_i^{s-1}, \bar{\lambda}_i^{s-1})\,d\omega\,dt$ and $\Sigma_i = \left[\int_0^T A_i(t) \Phi(t) \Phi^T(t)\,dt + \mathrm{diag}\left(\alpha^{-2}\, \mathbb{E}[\beta_i]\right)\right]^{-1}$, with $\mathrm{diag}(\cdot)$ indicating the diagonal matrix of a vector, $A_i(t) = \sum_{n=1}^{N_i} \mathbb{E}[\omega_n^i]\, \delta(t - t_n^i) + \int_0^\infty \omega \Lambda_i(t, \omega)\,d\omega$ and $B_i(t) = \frac{1}{2} \sum_{n=1}^{N_i} \delta(t - t_n^i) - \frac{1}{2} \int_0^\infty \Lambda_i(t, \omega)\,d\omega$, with $\delta(\cdot)$ being the Dirac delta function. It is worth noting that numerical quadrature methods, e.g., Gaussian quadrature, need to be applied to the intractable integrals above.
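One EM iteration for a single dimension can be sketched as below. This is a minimal illustration under stated assumptions, not the released implementation: $\Phi(t)$ is assumed precomputed at the spike times (`phi_events`) and at quadrature nodes on $[0, T]$ (`phi_nodes` with weights `quad_weights`), the inverse Gaussian mean is taken as $\alpha/|w|$ with a small floor `eps` to avoid division by zero, and the names `pg_mean` and `em_step` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pg_mean(c):
    """Mean of PG(1, c): tanh(c/2) / (2c), with the limit 1/4 at c = 0."""
    c = np.asarray(c, dtype=float)
    small = np.abs(c) < 1e-8
    safe = np.where(small, 1.0, c)
    return np.where(small, 0.25, np.tanh(safe / 2.0) / (2.0 * safe))

def em_step(w_prev, lam_prev, phi_events, phi_nodes, quad_weights, T, alpha, eps=1e-10):
    # ---- E step: first-order moments of the latent variables ----
    h_ev = phi_events @ w_prev                   # h^{s-1} at observed spikes
    h_nd = phi_nodes @ w_prev                    # h^{s-1} at quadrature nodes
    e_omega_ev = pg_mean(h_ev)                   # E[omega_n]
    lam_int = lam_prev * sigmoid(-h_nd)          # int Lambda_i(t, omega) d omega
    omega_lam_int = lam_int * pg_mean(h_nd)      # int omega Lambda_i(t, omega) d omega
    e_beta = alpha / np.maximum(np.abs(w_prev), eps)   # E[beta], assumed alpha / |w|

    # ---- M step: closed-form updates ----
    K = np.sum(quad_weights * lam_int)           # expected number of latent points
    lam_new = (phi_events.shape[0] + K) / T
    precision = ((phi_events.T * e_omega_ev) @ phi_events
                 + (phi_nodes.T * (quad_weights * omega_lam_int)) @ phi_nodes
                 + np.diag(e_beta / alpha**2))   # inverse of Sigma_i
    rhs = 0.5 * phi_events.sum(axis=0) - 0.5 * phi_nodes.T @ (quad_weights * lam_int)
    w_new = np.linalg.solve(precision, rhs)      # Sigma_i times the B-integral
    return w_new, lam_new
```

Solving the linear system rather than forming $\Sigma_i$ explicitly is a design choice of this sketch. Note that a weight initialized at exactly zero receives a very large $\mathbb{E}[\beta]$ and stays near zero, so a small nonzero initialization is advisable.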

3.6. COMPLEXITY AND HYPERPARAMETERS

Algorithm 1: EM inference for SNMHP
Result: $\{\lambda_i(t) = \bar{\lambda}_i \sigma(w_i^T \cdot \Phi(t))\}_{i=1}^{M}$
Predefine basis functions $\{\tilde{\phi}_b(\cdot)\}_{b=1}^{B}$

The computational complexity of the algorithm is $O(N N_{T_\phi} B + L(N(MB + 1)^2 + M(MB + 1)^3))$, where $N$ is the number of observations on all dimensions, $N_{T_\phi}$ is the average number of observations on the support of $T_\phi$ on all dimensions and $L$ is the number of iterations. The first term is due to the convolution nature of the Hawkes process; the second and third terms are due to the matrix multiplication and inversion in the EM iterations. For a given application, the number of dimensions $M$ and the number of basis functions $B$ are fixed and much smaller than $N$. Therefore, the complexity can be simplified to $O(N(N_{T_\phi} + L))$. The hyperparameter $\alpha$ of the Laplace prior, which encodes the sparsity of the weights, and the parameters of the basis functions can be chosen by cross validation or by maximizing the lower-bound $Q$ using numerical methods. For the number of basis functions, a large number leads to a more flexible functional space, while a small number results in faster inference. In experiments, we gradually increase it until there is no further significant improvement. Similarly, the number of quadrature nodes and the number of EM iterations are gradually increased until a suitable value is reached. The pseudocode is provided in Alg. 1.

4. EXPERIMENTS

We validate the EM algorithm for SNMHP in analyzing both synthetic and real-world spike data collected from the cat primary visual cortex. For comparison, the following most relevant baselines are considered: (1) parametric linear multivariate Hawkes processes that are vanilla multivariate Hawkes processes with exponential decay influence functions, for which the inference is performed by MLE (Ozaki, 1979) ; (2) nonparametric linear multivariate Hawkes processes with flexible influence functions, for which the inference is by majorization minimization Euler-Lagrange (MMEL) (Zhou et al., 2013) ; (3) parametric nonlinear multivariate Hawkes processes with exponential decay influence functions, for which the inference is by MCMC based on augmentation and Poisson thinning (MCMC-Aug) (Apostolopoulou et al., 2019) . The implementation of our model is publicly available at https://github.com/zhoufeng6288/SNMHawkesBeta.

4.1. SYNTHETIC DATA

We analyze spike trains obtained from the synthetic network model shown in Fig. 1a. The synthetic neural network contains four groups of two neurons each. In each group, the 2 neurons are self-exciting and mutual-inhibitive, while the groups are independent of each other. We assume 4 scaled (shifted) Beta distributions as basis functions with support $[0, T_\phi = 6]$ (Fig. 1b). For the ground truth, it is assumed that $\phi_{11} = \phi_{33} = \phi_{55} = \phi_{77} = \tilde{\phi}_1$, $\phi_{22} = \phi_{44} = \phi_{66} = \phi_{88} = \tilde{\phi}_4$, $\phi_{12} = \phi_{34} = \phi_{56} = \phi_{78} = -\frac{1}{2}\tilde{\phi}_2$ and $\phi_{21} = \phi_{43} = \phi_{65} = \phi_{87} = -\frac{1}{2}\tilde{\phi}_3$, with positive values indicating excitation and negative values indicating inhibition. With base activations $\{\mu_i\}_{i=1}^{8} = 0$ and upper-bounds $\{\bar{\lambda}_i\}_{i=1}^{8} = 5$, we use the thinning algorithm (Ogata, 1998) to generate two sets of synthetic spike data on the time window $[0, T = 1000]$, one being the training dataset (Fig. 1c) and the other the test dataset (App. IV). Each dataset contains 8 sequences and each sequence consists of 3340 events on average. We aim to identify the functional connectivity of the neural population and the temporal dynamics of influence functions from statistically dependent spike trains. More experimental details, e.g., hyperparameters, are given in App. IV. The temporal dynamics of interactions among the neural population is shown in Fig. 1d, where we plot the estimated influence functions of the 1-st and 2-nd neurons (other neurons are shown in App. IV). The estimated $\hat{\phi}_{11}$ and $\hat{\phi}_{22}$ exhibit the self-exciting relation, while $\hat{\phi}_{12}$ and $\hat{\phi}_{21}$ characterize the mutual-inhibitive interactions. All estimated influence functions are in a flexible form and close to the ground truth. Besides, as shown in Fig. 1e, the estimated functional connectivity recovers the ground-truth structure successfully. The functional connectivity is defined as $\int |\phi_{ij}(t)|\,dt$, meaning that there is no connection only if neither excitation nor inhibition exists. The training and test log-likelihood (LogL) curves w.r.t. 
EM iterations are shown in Fig. 1f where our EM algorithm converges fast with only 50 iterations needed to obtain a plateau. The trade-off between accuracy (LogL) and efficiency (running time) w.r.t. the number of quadrature nodes and basis functions is shown in Fig. 1g where we can see the accuracy is not sensitive to the number of quadrature nodes over 100 and the optimal number of basis functions is 4. A larger number does not significantly improve the accuracy but leads to a longer running time. Moreover, we compare the running time of our method with alternatives in Fig. 1h where the number of dimensions M is fixed to 2, basis functions B to 4, quadrature nodes to 200 and iterations of all methods to 200. We can observe that our EM algorithm is the most efficient, even superior to MLE for the classic parametric case, which verifies its efficiency. Also, we compare our model's fitting and pre- 

4.2. REAL DATA

In this section, we analyze our model performance on a real multi-neuron spike train dataset. We aim to draw some conclusions about the functional connectivity of cortical circuits and make inferences about the temporal dynamics of influence.

Spike Train Data (Blanche, 2005; Apostolopoulou et al., 2019). Several multi-channel silicon electrode arrays are designed to record simultaneously the spontaneous neural activity of multiple isolated single units in area 17 of the primary visual cortex of an anesthetized, paralyzed cat. The spike train dataset contains spike times of 25 simultaneously recorded neurons.

Preliminary Setup. We extract the spike times in the time window [0, 300] (time unit: 100ms, the same applies to the following) as the training data (Fig. 2a) and [300, 600] as the test data (App. IV). Both datasets contain approximately 7000 timestamps. All hyperparameters are fine tuned to obtain the maximum test LogL: the scaled (shifted) Beta distribution Beta(α = 50, β = 50, shift = -5) with support $[0, T_\phi = 10]$ is designed as the basis function; the number of quadrature nodes is set to 1000 and the number of EM iterations to 100. More experimental details, e.g., hyperparameters, are given in App. IV.

Results. 25 × 25 influence functions among the neuron population are estimated in this application. An example of the influence functions between the 8-th and 9-th neurons is plotted in Fig. 2b, where our SNMHP model successfully captures the exciting or inhibitive interaction between neurons. Besides, the estimated functional connectivity is shown in Fig. 2c, where we can see that the functional connection structure among the neural population is sparse. Unfortunately, because the ground-truth functional connectivity of cortical circuits is unknown, the estimated functional connectivity cannot be compared with the ground truth; here we resort to the test LogL to verify whether the estimation is good. The training and test LogL curves are shown in Fig. 
2d where they both reach a close plateau indicating the estimation is appropriate without overfitting or underfitting. A significant advantage of our EM algorithm is the efficiency. The 25-dimensional observation in the real data is a challenge for the inference. For the running time, our EM algorithm costs 3 minutes, the MCMC-Aug costs 1 hour and 45 minutes with the same number of iterations while MLE and MMEL cannot finish in 2 days due to the curse of dimensionality. Moreover, the fitting and prediction ability is compared in Tab. 2. The superior performance of SNMHP w.r.t. training and test LogL demonstrates our model can capture the complex mixture of exciting and inhibitive interactions among neural population which leads to better goodness-of-fit.

5. DISCUSSION AND CONCLUSION

Although we propose a point-estimation method (EM algorithm) in this work, a straightforward extension to Gibbs sampler is already at hand. Based on the augmented likelihood and prior, we can obtain the conditional densities of latent variables and parameters in closed form, which constitutes a Gibbs sampler with better efficiency than MCMC-Aug since the time-consuming Metropolis-Hasting sampling in MCMC-Aug is not needed. However, the proposed Gibbs sampler is less efficient than the proposed EM algorithm because the latent Poisson processes have to be sampled by thinning algorithm in Gibbs sampler which is time consuming. For the model in Apostolopoulou et al. (2019) , a tighter intensity upper-bound is used to reduce the number of thinned points to accelerate the sampler. Instead, our EM algorithm does not encounter this problem as we compute the expectation rather than sampling. Moreover, Apostolopoulou et al. (2019) can only use one basis function, which limits influence functions to be purely exciting or inhibitive exponential decay. Instead, our model can utilize multiple basis functions to characterize an influence function that is a mixture of excitation and inhibition. In this paper, we develop a SNMHP model in the continuous-time regime which can characterize excitation-inhibition-mixture temporal dependencies among the neural population. Three auxiliary latent variables are augmented to make the corresponding EM algorithm in a closed form to improve efficiency. The synthetic and real data experimental results confirm that our model's accuracy and efficiency are superior to the state of the arts. From the application perspective, although our model is proposed in the neuroscience domain, it can be applied to other applications where the inhibition is a vital factor, e.g., in the coronavirus (COVID-19) spread, the inhibitive effect may represent the medical treatment or cure, or the forced isolation by government. 
From the inference perspective, our EM algorithm is a point-estimation method; other efficient distribution-estimation methods can be developed, e.g., the Gibbs sampler mentioned above or the mean-field variational inference.

where we utilize the tilted Pólya-Gamma density $p_{PG}(\omega|b, c) \propto e^{-c^2\omega/2}\, p_{PG}(\omega|b, 0)$ (Polson et al., 2013).

2. The posterior of the sparsity variables $\beta_i$ is an inverse Gaussian distribution which depends on the weights $w_i^{s-1}$:

$$p(\beta_i | w_i^{s-1}) = \prod_{j,b}^{MB+1} p_{IG}\left(\beta_{ijb} \,\Big|\, \frac{\alpha}{w_{ijb}^{s-1}}, 1\right).$$

3. The posterior of $\Pi_i$ depends on both $h_i^{s-1}(t)$ and $\bar{\lambda}_i^{s-1}$:

$$p(\Pi_i | w_i^{s-1}, \bar{\lambda}_i^{s-1}) = \frac{p_{\bar{\lambda}_i}(\Pi_i | \bar{\lambda}_i^{s-1}) \prod_{(\omega, t) \in \Pi_i} e^{f(\omega, -h_i^{s-1}(t))}}{\int p_{\bar{\lambda}_i}(\Pi_i | \bar{\lambda}_i^{s-1}) \prod_{(\omega, t) \in \Pi_i} e^{f(\omega, -h_i^{s-1}(t))}\,d\Pi_i}.$$

Campbell's theorem can be applied to convert the denominator, so that the equation above becomes

$$p(\Pi_i | w_i^{s-1}, \bar{\lambda}_i^{s-1}) = \frac{p_{\bar{\lambda}_i}(\Pi_i | \bar{\lambda}_i^{s-1}) \prod_{(\omega, t) \in \Pi_i} e^{f(\omega, -h_i^{s-1}(t))}}{\exp\left(-\iint \left(1 - e^{f(\omega, -h_i^{s-1}(t))}\right) \bar{\lambda}_i^{s-1}\, p_{PG}(\omega|1, 0)\,d\omega\,dt\right)} = \prod_{(\omega, t) \in \Pi_i} \left[e^{f(\omega, -h_i^{s-1}(t))}\, \bar{\lambda}_i^{s-1}\, p_{PG}(\omega|1, 0)\right] \cdot \exp\left(-\iint e^{f(\omega, -h_i^{s-1}(t))}\, \bar{\lambda}_i^{s-1}\, p_{PG}(\omega|1, 0)\,d\omega\,dt\right).$$

The above posterior distribution is in the likelihood form of a marked Poisson process with intensity function

$$\Lambda_i(t, \omega | w_i^{s-1}, \bar{\lambda}_i^{s-1}) = e^{f(\omega, -h_i^{s-1}(t))}\, \bar{\lambda}_i^{s-1}\, p_{PG}(\omega|1, 0) = \bar{\lambda}_i^{s-1} \sigma(-h_i^{s-1}(t))\, p_{PG}(\omega|1, h_i^{s-1}(t)).$$

M STEP

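Substituting the posterior distributions of the latent variables into Eq. 12, we obtain the lower-bound $Q$. The first term of Eq. 12 is

$$\mathbb{E}_{\Pi_i, \omega_i}\left[\log p(D, \Pi_i, \omega_i | w_i, \bar{\lambda}_i)\right] = -\frac{1}{2} w_i^T \cdot \int_0^T A_i(t) \Phi(t) \Phi^T(t)\,dt \cdot w_i + w_i^T \cdot \int_0^T B_i(t) \Phi(t)\,dt - \bar{\lambda}_i T + \left(N_i + \iint \Lambda_i(t, \omega)\,d\omega\,dt\right) \log \bar{\lambda}_i + C,$$

where we utilize the mean rule in Campbell's theorem, $C$ is a constant and

$$A_i(t) = \sum_{n=1}^{N_i} \mathbb{E}[\omega_n^i]\, \delta(t - t_n^i) + \int_0^\infty \omega \Lambda_i(t, \omega)\,d\omega, \qquad B_i(t) = \frac{1}{2} \sum_{n=1}^{N_i} \delta(t - t_n^i) - \frac{1}{2} \int_0^\infty \Lambda_i(t, \omega)\,d\omega,$$

with $\delta(\cdot)$ being the Dirac delta function and $\mathbb{E}[\omega_n^i] = \frac{1}{2h_i^{s-1}(t_n^i)} \tanh\left(\frac{h_i^{s-1}(t_n^i)}{2}\right)$ (Polson et al., 2013). The integral of the intensity function has no closed-form solution but can be solved by numerical quadrature methods.

The second term of Eq. 12 is

$$\mathbb{E}_{\beta_i}\left[\log p(w_i, \beta_i)\right] = -\frac{1}{2} w_i^T \cdot \mathrm{diag}\left(\frac{\mathbb{E}[\beta_i]}{\alpha^2}\right) \cdot w_i + C,$$

where $C$ is a constant, $\mathbb{E}[\beta_i] = \{\mathbb{E}[\beta_{ijb}]\}_{jb}^{MB+1} = \{\alpha / w_{ijb}^{s-1}\}_{jb}^{MB+1}$ and $\mathrm{diag}(\cdot)$ indicates the diagonal matrix of a vector.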

Published as a conference paper at ICLR 2021

The updated parameters $\bar{\lambda}_i^s$ and $w_i^s$ can be obtained by setting the gradient of $Q$ to zero. Due to the augmentation of auxiliary variables, the weights appear in a quadratic form in the lower-bound, which leads to the analytical expression

$$\bar{\lambda}_i^s = (N_i + K_i)/T, \qquad w_i^s = \Sigma_i \int_0^T B_i(t) \Phi(t)\,dt,$$

where $K_i = \int_0^T \int_0^\infty \Lambda_i(t, \omega | w_i^{s-1}, \bar{\lambda}_i^{s-1})\,d\omega\,dt$ and $\Sigma_i = \left[\int_0^T A_i(t) \Phi(t) \Phi^T(t)\,dt + \mathrm{diag}\left(\alpha^{-2}\, \mathbb{E}[\beta_i]\right)\right]^{-1}$. It is worth noting that numerical quadrature methods need to be applied to the intractable integrals above.
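The gradient step can be written out explicitly. Collecting the two terms of Eq. 12 and differentiating with respect to $w_i$ and $\bar{\lambda}_i$ gives the following worked equations, which use only quantities defined in this appendix:

```latex
\begin{align*}
\frac{\partial Q}{\partial w_i}
  &= -\left[\int_0^T A_i(t)\, \Phi(t) \Phi^T(t)\, dt
     + \mathrm{diag}\!\left(\alpha^{-2}\, \mathbb{E}[\beta_i]\right)\right] w_i
     + \int_0^T B_i(t)\, \Phi(t)\, dt = 0
  &&\Rightarrow\quad w_i^s = \Sigma_i \int_0^T B_i(t)\, \Phi(t)\, dt, \\
\frac{\partial Q}{\partial \bar{\lambda}_i}
  &= -T + \frac{N_i + K_i}{\bar{\lambda}_i} = 0
  &&\Rightarrow\quad \bar{\lambda}_i^s = \frac{N_i + K_i}{T}.
\end{align*}
```

The matrix in brackets is positive definite whenever the expectations $\mathbb{E}[\omega_n^i]$ and $\mathbb{E}[\beta_{ijb}]$ are positive, so the stationary point in $w_i$ is the unique maximizer of the quadratic lower-bound.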

IV EXPERIMENTAL DETAILS

In this appendix, we elaborate on some experimental details.

SYNTHETIC DATA EXPERIMENTS

For the synthetic data, the intensities and spike times of our simulated training and test data are shown in Fig. 1. As shown in the experiment of log-likelihood and running time w.r.t. the number of basis functions, the optimal number of basis functions is 4, which are chosen as the ground truth: $\tilde{\phi}_{\{1,2,3,4\}}$ = Beta(α = 50, β = 50, scale = 6, shift = {-2, -1, 0, 1}). By cross validation, the hyperparameter α is chosen to be 0.05. As shown in the experiment of log-likelihood and running time w.r.t. the number of quadrature nodes, the accuracy is not sensitive to the number of quadrature nodes over 100, so the number of quadrature nodes is set to 2000. The number of EM iterations is set to 200, which is large enough for convergence. We plot the estimated influence functions of 8 neurons in Fig. 2. For comparison, we also plot the estimated influence functions of 8 neurons from vanilla multivariate Hawkes processes using the MLE algorithm in Fig. 3 and the functional connectivity graph in Fig. 4. We can see that both the estimated influence functions and the functional connectivity graph are far from the ground truth. This demonstrates the necessity of incorporating inhibitive interaction into the model when the Hawkes process is applied in the neuroscience domain. The running time experiment and the fitting and prediction experiment are both conducted for 2 neurons because the baseline models cannot finish in 2 days with 8 neurons due to the curse of dimensionality.

REAL DATA EXPERIMENTS

All hyperparameters are fine tuned in the real data experiments. Specifically, the optimal basis function is chosen as $\tilde{\phi}$ = Beta(α = 50, β = 50, scale = 10, shift = -5). The hyperparameter α is optimised to be 0.1 by cross validation. The number of quadrature nodes is chosen to be 1000, for which the running time is acceptable. The number of EM iterations is set to 100, which is large enough for convergence.



Figure 1: The synthetic network model and experimental results. (a): The synthetic neural population contains 4 independent groups. In each group, the interdependencies between the 2 neurons are self-exciting and mutually inhibitory, with red arrows indicating excitation and blue arrows indicating inhibition. (b): Four scaled (shifted) Beta densities as basis functions on the support [0, 6]. (c): The intensities and spike times of the 8 neurons in the synthetic data. (d): The estimated influence functions of the 1st and 2nd neurons, where the estimated φ_11, φ_12, φ_21, φ_22 are close to the ground truth; the other ground-truth functions φ_13, ..., φ_18 and φ_23, ..., φ_28 are not labeled since they are all zero (GT = Ground Truth). (e): The heat map of functional connectivity among the neural population, with ground truth (left) and estimation (right). (f): The training and test log-likelihood curves w.r.t. EM iterations. (g): The trade-off between accuracy and efficiency w.r.t. the number of quadrature nodes and basis functions for synthetic data. (h): The running time on 2D data for the EM algorithm and alternatives w.r.t. the average number of observations on each dimension (the precomputation of Φ(t) is included).

Figure 2: The real data experimental results. (a): The training spike trains extracted from real data (test spike trains in App. IV). (b): The estimated influence functions between the 8th and 9th neurons. (c): The heat map of estimated functional connectivity among 25 neurons. (d): The training and test LogL curves w.r.t. EM iterations.

Figure 1: The intensities and spike times of 8 neurons in our synthetic training dataset (left) and test dataset (right).

Figure 2: The estimated influence functions of all neurons, where the estimated φ's are close to the ground truth; some ground-truth functions are not labeled since they are all zero (GT = Ground Truth).

Figure 3: The estimated influence functions of all neurons from vanilla multivariate Hawkes processes using MLE; some influence functions are not labeled since they are all zero (GT = Ground Truth).

•)}_{b=1}^{B};
Initialize the hyperparameter α and {λ_i, w_i, ω_i, Π_i, β_i}_{i=1}^{M};
for Iteration do
    for Dimension i do
        Update the posterior of ω_i by Eq. 13a;
        Update the posterior intensity of Π_i by Eq. 13b;
        Update the posterior of β_i by Eq. 13c;
        Update the intensity upper-bound λ_i by Eq. 14a;
        Update the weights w_i by Eq. 14b.
    end
    Update the hyperparameter α.
end

Training/test LogL (×10 3 ) of different models for synthetic data.

Training/test LogL (×10 3 ) and running times of different models for real data.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for insightful comments which greatly improved the paper. This work was supported by NSFC Projects (Nos. 62061136001, 61620106010), Beijing NSF Project (No. JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. F. Zhou was partially funded by China Postdoctoral Science Foundation.

APPENDIX I CAMPBELL'S THEOREM

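Let Π_Ẑ = {(z_n, ω_n)}_{n=1}^{N} be a marked Poisson process on the product space Ẑ = Z × Ω with intensity Λ(z, ω) = Λ(z)p(ω|z). Here Λ(z) is the intensity of the unmarked Poisson process {z_n}_{n=1}^{N}, and ω_n ∼ p(ω_n|z_n) is an independent mark drawn at each z_n. Furthermore, we define a function h(z, ω): Z × Ω → R and the sum H(Π_Ẑ) = Σ_{(z,ω)∈Π_Ẑ} h(z, ω). Then

$$\mathbb{E}_{\Pi_{\hat{Z}}}\left[\exp\left(\xi H(\Pi_{\hat{Z}})\right)\right] = \exp\left[\iint_{\hat{Z}} \left(e^{\xi h(z,\omega)} - 1\right) \Lambda(z,\omega)\, d\omega\, dz\right]$$

for any ξ ∈ C. The above equation defines the characteristic functional of a marked Poisson process. This proves Eq. 9 in the main paper. The mean is

$$\mathbb{E}_{\Pi_{\hat{Z}}}\left[H(\Pi_{\hat{Z}})\right] = \iint_{\hat{Z}} h(z,\omega)\, \Lambda(z,\omega)\, d\omega\, dz,$$

which is used when substituting Eq. 13 into Eq. 12.

II DERIVATION OF AUGMENTED LIKELIHOOD AND PRIOR

Substituting Eq. 7 and Eq. 9 into Eq. 5 in the main paper, the augmented likelihood is obtained [...], where ω_i is the vector of ω_n^i and [...], which is Eq. 11a. Similarly, the integrand in Eq. 10 is just the augmented prior in Eq. 11b.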
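The mean formula of Campbell's theorem in Appendix I states that the expected sum of h(z, ω) over the points of a marked Poisson process equals the integral of h(z, ω) against the intensity Λ(z, ω). A quick Monte Carlo check can make this concrete. The following is a minimal sketch under assumed toy settings that do not come from the paper: a homogeneous process with rate 5 on [0, 2], independent Uniform(0, 1) marks, and h(z, ω) = zω. All function names are hypothetical.

```python
import random

def sample_marked_poisson(rate, horizon, rng):
    """Homogeneous Poisson process on [0, horizon] with i.i.d. Uniform(0, 1) marks."""
    points = []
    t = rng.expovariate(rate)  # exponential inter-arrival times give a Poisson process
    while t < horizon:
        points.append((t, rng.random()))
        t += rng.expovariate(rate)
    return points

def h(z, omega):
    """Toy test function (an assumption for this example)."""
    return z * omega

def monte_carlo_mean(rate, horizon, trials, seed=0):
    """Average of H = sum of h(z, omega) over the points, across independent realisations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(h(z, w) for z, w in sample_marked_poisson(rate, horizon, rng))
    return total / trials

def campbell_mean(rate, horizon):
    """Closed form of the double integral of z * omega * rate over [0, horizon] x [0, 1]."""
    return rate * horizon ** 2 / 4.0

estimate = monte_carlo_mean(rate=5.0, horizon=2.0, trials=20000)
exact = campbell_mean(rate=5.0, horizon=2.0)
print(f"Monte Carlo: {estimate:.3f}, Campbell: {exact:.3f}")
```

With these settings the closed-form value is 5, and the Monte Carlo estimate should agree up to sampling noise.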

III DERIVATION OF EM ALGORITHM

In the standard EM framework, the lower bound of the log-posterior is given in Eq. 12. The posteriors of the latent variables can be derived from the joint distribution in Eq. 11. The derivation is relatively easy for ω_i and β_i but more involved for Π_i. In the following, s − 1 and s denote the previous and current iterations of the EM algorithm.

E STEP

1. The posterior of the Pólya-Gamma variables ω_i depends on the activation h_i^{s−1}(t) at {t_n^i}_{n=1}^{N_i}, which in turn depends on w_i^{s−1} through Eq. 4

