GRAPH-INFORMED NEURAL POINT PROCESSES WITH MONOTONIC NETS

Abstract

Multi-class event data are ubiquitous in real-world applications. Recent neural temporal point processes (Omi et al., 2019) have used monotonic nets to model the cumulative conditional intensity, so as to avoid an intractable integration in the likelihood. While successful, they are restricted to single-type events and can easily fall into poor learning results. To address these limitations and to exploit valuable structural information among event participants, we develop a Graph-Informed Neural Point Process (GINPP) that can freely handle multiple event types, greatly improve learning efficiency, and effectively integrate graph information to facilitate training. First, we find that the bottleneck of the previous model arises from the standard softplus transformation over the output of the monotonic net, which enlarges the prediction variations of the monotonic net and increases the training challenge. We propose a shift-scale variant that significantly reduces this variation and promotes learning efficiency. Second, we use a conditional mark distribution to model multiple event types, without the need to explicitly estimate the intensity for each type; the latter can be much more challenging. Third, we use random walks to collect the neighborhood of each event participant, and an attention mechanism to update the hidden state of each participant according to the observed events of both the participant itself and its neighborhood. In this way, we can effectively leverage the graph knowledge and scale up to large graphs. We show the advantage of our approach in both ablation studies and real-world applications.

1. Introduction

Real-world applications often involve multi-class events. For example, 911 calls request a variety of help, traffic records include different types of accidents, and social network users engage in various types of interactions (tweeting, following, poking, etc.). Neural temporal point processes (e.g., Du et al., 2016; Mei and Eisner, 2017; Zhang et al., 2020a; Zuo et al., 2020) are a family of powerful methods for event modeling and prediction, which use neural networks (NNs) to model the intensity of events and can flexibly estimate the complex dependencies among the observed events. However, due to the use of NNs, the cumulative (i.e., integral of the) conditional intensity in the point process likelihood is often analytically intractable, and demands a complex, expensive approximation. To bypass this issue, the recent work of Omi et al. (2019) uses a monotonic net (Sill, 1997; Chilinski and Silva, 2020) to model the monotonically increasing cumulative intensity so as to avoid the integration; the intensity is then obtained by simply taking the derivative. To ensure positiveness, a softplus transformation is applied to the output of the monotonic net. Despite its elegance and success, this method only supports single-type events. More importantly, it often suffers from inefficient learning and easily falls into poor performance. In this paper, we propose GINPP, a graph-informed neural point process model that overcomes these problems and further utilizes the valuable structural knowledge among the event participants, which is often available in practice. The major contributions of our work are as follows.

• First, we investigate the learning challenge of Omi et al. (2019), and find that the bottleneck arises from the softplus transformation over the monotonic net prediction that ensures positiveness. To obtain an output slightly above zero, the standard softplus demands that the input, i.e., the monotonic net prediction, be negative and of much larger magnitude. Hence, a small output range can induce a much wider input (monotonic net prediction) range, biased toward the negative domain. The large variation of the prediction scale makes the estimation of the monotonic net much more difficult and inefficient.

• Second, we propose a shift-scale variant of the softplus function, where the scale controls the shape and the shift controls the position. By setting these two hyperparameters properly, the required input range can be greatly shrunk and brought close to the output range. Accordingly, the variation of the prediction scales is significantly reduced, and the learning of the monotonic net becomes much easier and more efficient.

• Third, we construct a marked point process for multi-class events. By introducing a conditional mark distribution, we can freely handle different event types with only a single-output monotonic net, which models the unified cumulative conditional intensity. This is more efficient and convenient than a naive extension that separately estimates the cumulative intensity for each particular event type.

• Fourth, to incorporate the graph structure in training, we use random walks to collect the neighborhood of each participant, and an attention mechanism to update the hidden state of each participant according to the observed events of not only the participant itself but also its neighborhood. In this way, the estimation of the hidden state is improved with enriched observations, and the event dependencies are more comprehensively captured. The random walks further enable us to scale to large graphs. Accordingly, we develop an efficient, scalable stochastic mini-batch learning algorithm.

For evaluation, we first examined GINPP in ablation studies. We tested the performance of the monotonic net with our shift-scale softplus transformation in learning two benchmark functions: one is monotonic and the other is not. Our method converges fast, accurately learns the first function, and finds a close monotonic approximation to the second. By contrast, with the standard softplus, learning saturates early at large loss values and the estimation is much worse. Then, we tested on a synthetic bi-type event dataset. GINPP accurately recovered the intensity of each event type via the learned overall intensity and the mark distribution. Next, we evaluated GINPP on six real-world benchmark datasets, examining the accuracy in predicting the time and type of future events. In both tasks, GINPP consistently outperforms all the competing methods. Even without incorporating the graphs, GINPP still achieves better accuracy; when the graph structure is available, GINPP improves the accuracy further.

2. Background

Temporal Point Process (TPP) is a general mathematical framework for event modeling (Daley and Vere-Jones, 2007). A TPP is specified by the conditional intensity (or rate) of the events. Suppose we have K types of events, and denote by λ_k(t) the conditional intensity for event type k. Given a sequence of observed events and their types, Γ = [(t_1, s_1), ..., (t_N, s_N)], where t_n is the timestamp and s_n the type of event n (1 ≤ s_n ≤ K, t_n ≤ t_{n+1}), the likelihood of the TPP is

p(Γ) = ∏_{k=1}^K exp(−∫_0^T λ_k(t) dt) · ∏_{n=1}^N λ_{s_n}(t_n),   (1)

where T is the entire span of the observed events. One popular TPP is the homogeneous Poisson process, which assumes each conditional intensity λ_k(t) is a time-invariant constant λ_k^0, independent of the previous events {(t_n, s_n) | t_n < t}. While simple and convenient, Poisson processes ignore the complex relationships among the events. The Hawkes process (Hawkes, 1971) is more expressive in that it models the excitation effect among the events,

λ_k(t) = λ_k^0 + Σ_{t_n < t} ρ_{s_n→k}(t − t_n),   (2)

where λ_k^0 ≥ 0 is the background rate, and ρ_{s_n→k}(Δ) > 0 is the triggering kernel, which quantifies how much contribution the past event at t_n, of type s_n, makes to trigger a new event of type k at time t. The most commonly used triggering kernel is an exponential kernel, which assumes an exponential decay of the excitation effect with the time lag Δ.

Neural Temporal Point Process. Hawkes processes only account for additive, excitation effects, and are inadequate to capture various complex event dependencies. To overcome this limitation, recent works (Du et al., 2016; Mei and Eisner, 2017) use neural networks to model the conditional intensity. Typically, a recurrent neural network (RNN) is used to capture the complex event dependencies.
For each event n, we introduce a hidden state h_n, which is computed from the previous state h_{n−1}, the current time, the event type, and other input features. An illustrative example is

h_n = RNN-Cell(h_{n−1}, t_n, s_n).   (3)

Then we obtain the conditional intensity through a positive transformation over the hidden state, e.g., λ_k(t) = f(w_k^⊤ h_n). We then substitute the intensity into (1), and maximize the likelihood to estimate the model parameters. Although the NN modeling of λ_k(t) greatly increases the model capacity/expressivity, it makes the cumulative intensity in the likelihood (1), namely ∫_0^T λ_k(t) dt, analytically intractable to compute. We have to use approximations, such as Monte-Carlo sampling or numerical quadrature, which can be expensive and complex. To sidestep this issue, Omi et al. (2019) instead modeled the cumulative conditional intensity with the RNN output,

φ(h_{n−1}, t) = f_sp(MNet(h_{n−1}, t)) = ∫_{t_{n−1}}^t λ(τ) dτ,   (4)

where t_{n−1} ≤ t ≤ t_n, h_{n−1} is the RNN state corresponding to the last observed event, MNet is a monotonic net (Sill, 1997; Chilinski and Silva, 2020), which guarantees that the output is monotonically increasing in the input time t and hence consistent with the cumulative intensity, and f_sp is the softplus function, f_sp(·) = log(1 + exp(·)), which ensures positiveness. Note that f_sp is also monotonically increasing, so the transformation with f_sp does not change the monotonicity in t. Since Omi et al. (2019) only considered single-type events, we omit the subscript k and denote the single conditional intensity by λ(t). Given (4), we can obtain the conditional intensity by taking the derivative, λ(t_n) = ∂φ(h_{n−1}, t)/∂t |_{t=t_n}. The likelihood of an event sequence [t_1, ..., t_N] is

p(t_1, ..., t_N) = ∏_{n=1}^N exp(−∫_{t_{n−1}}^{t_n} λ(t) dt) · exp(−∫_{t_N}^T λ(t) dt) · ∏_{n=1}^N λ(t_n)
= ∏_{n=1}^N exp(−φ(h_{n−1}, t_n)) · exp(−φ(h_N, T)) · ∏_{n=1}^N ∂φ(h_{n−1}, t)/∂t |_{t=t_n},   (5)

where t_0 = 0 and h_0 is the initial state of the RNN. Since there is no integration, the computation and optimization are much easier and more convenient, especially with automatic differentiation libraries.
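To make the construction in (4) concrete, here is a minimal numerical sketch: a toy two-layer "monotonic net" (nonnegative weights via `abs`, increasing activations), a softplus on top for positiveness, and a finite difference standing in for the automatic differentiation that yields the intensity. The network shape, sizes, and names here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def monotonic_mlp(t, W1, b1, W2):
    # Toy monotonic net: nonnegative weights (via abs) composed with an
    # increasing activation (tanh) make the output nondecreasing in t.
    h = np.tanh(np.abs(W1) * t + b1)
    return float(np.abs(W2) @ h)

rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)

def Phi(t):
    # Cumulative conditional intensity: softplus keeps it positive, and
    # monotonicity in t is preserved because softplus is increasing.
    return softplus(monotonic_mlp(t, W1, b1, W2))

def intensity(t, eps=1e-5):
    # The intensity is dPhi/dt; a central finite difference stands in
    # for autodiff here, and is nonnegative since Phi is nondecreasing.
    return (Phi(t + eps) - Phi(t - eps)) / (2 * eps)
```

Because Phi is nondecreasing in t and strictly positive, the derived intensity is automatically nonnegative, with no integral to approximate in the likelihood.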

3. Model

Although the model of Omi et al. (2019) is smart and successful, it only supports single-type events. More importantly, we found that it often suffers from inefficient learning and easily falls into poor performance. To address these issues and to further take advantage of the structural knowledge among the event participants, we develop GINPP, a graph-informed neural point process model based on the monotonic net, presented as follows. Specifically, we assume our dataset includes K types of events, launched by M participants. Each observed event sequence is a series of mixed-type events launched by a particular participant. For example, in online social media, a Twitter account can be viewed as a participant, which can launch a series of events of different types: tweeting, retweeting, replying, liking, direct messaging, etc. We denote the event sequence of participant m by Γ_m = [(t_1^m, s_1^m), ..., (t_{N_m}^m, s_{N_m}^m)], where each event type s_n^m ∈ {1, ..., K} (1 ≤ n ≤ N_m). Among the M participants is a graph structure that encodes their correlations, denoted by G = (E, V), where V = {1, ..., M} is the vertex set and E = {(i, j)} is the edge set. First, we consider extending Omi et al. (2019) to support multiple event types. A straightforward extension is to follow the idea of Mei and Eisner (2017) and expand the output dimension of the monotonic net to K (see (4)) for each participant (vertex) m. We then apply an element-wise softplus transformation to obtain the cumulative conditional intensity for each event type k, namely φ_k(h_{n−1}^m, t) = f_sp([MNet(h_{n−1}^m, t)]_k), where h_{n−1}^m is the hidden state of vertex m. However, this method increases the learning challenge of the monotonic net, because K monotonic constraints have to be satisfied simultaneously. To circumvent this issue, we construct a marked point process (Daley and Vere-Jones, 2007), where the event type is considered as a mark of the event.
As in (4), we still use a single-output monotonic net, but to model a global cumulative conditional intensity φ(h_{n−1}^m, t). We then introduce a mark distribution to sample the event type according to the last state h_{n−1}^m and the time lag Δ_t = t − t_{n−1}^m,

p(s = k | t) ∝ exp(u_k^⊤ β(h_{n−1}^m, Δ_t)),   (6)

where β(·) is the output of a neural network, and u_k is the embedding of event type k, which is jointly estimated during training. The conditional intensity is therefore given by λ_k^m(t) = λ^m(t) p(s = k | t), where λ^m(t) = ∂φ(h_{n−1}^m, t)/∂t is the global conditional intensity. The likelihood of the event sequence Γ_m = [(t_1^m, s_1^m), ..., (t_{N_m}^m, s_{N_m}^m)] is therefore a minor adjustment of (5),

p(Γ_m) = exp(−∫_0^T Σ_{k=1}^K λ_k^m(t) dt) · ∏_{n=1}^{N_m} λ_{s_n^m}^m(t_n^m)
= ∏_{n=1}^{N_m} exp(−φ(h_{n−1}^m, t_n^m)) · exp(−φ(h_{N_m}^m, T)) · ∏_{n=1}^{N_m} p(s = s_n^m | t_n^m) ∂φ(h_{n−1}^m, t)/∂t |_{t=t_n^m}.   (7)

This simple modification enables us to freely model multiple types of events while sidestepping the difficulty of learning a multi-output monotonic net. Empirically, we found our method to be much more effective. Second, we investigate the learning challenge of Omi et al. (2019). We find that the bottleneck arises from the standard softplus transformation f_sp in (4), which enlarges the prediction scales of the monotonic net and increases the training difficulty. Specifically, as a continuous relaxation of the ReLU activation, the output f_sp(x) is approximately equal to the input x only when the output is relatively large, e.g., 3.05 = f_sp(3). By contrast, when the output is small, e.g., close to zero, the input x is a negative number with a much larger magnitude, e.g., 0.0009 = f_sp(−7). This can be seen from the inverse, f_sp^{−1}(y) = log(e^y − 1). As shown in Fig. 1a, when y is close to zero, e^y − 1 is close to zero, and the inverse function varies violently. As a consequence, a small output range requires a much wider input range, and the monotonic net's prediction has to cover this input range.
For example, an output range [10^{−4}, 3] corresponds to the input range [−9.21, 2.95]. Hence, the variation of the monotonic net's prediction is greatly enlarged. This can be further verified from a probabilistic analysis. Suppose the training output follows a uniform distribution on [0, 3]. This is reasonable, because in practice we often normalize the data to prevent their scales from growing too large, for better numerical stability and optimization efficiency. We then look into the corresponding input distribution of the standard softplus. As shown in Fig. 1b (the blue line), the input distribution is much wider, and includes a long, fat tail over the negative domain. That means the prediction of the monotonic net has to fit many negative values of much larger magnitude, which brings back the troubles in optimization efficiency and numerical stability. To alleviate this issue, we propose a shift-scale variant of the softplus function,

f̃_sp(x; α, γ) = γ log(1 + exp((x − α)/γ)),   (8)

where α and γ are two hyperparameters, the shift and the scale. When α > 0, the function body is moved toward the right of the x-axis, and hence the negative input range needed to obtain close-to-zero outputs is shrunk. Furthermore, when we choose γ ∈ (0, 1), the absolute value of the input is amplified inside the exponent. That means, to achieve the same output, e.g., 0.0009, the magnitude of the input x, i.e., the output of the monotonic net, can be greatly decreased. Therefore, both α and γ can shrink the input range and reduce the fat tail of the distribution over the negative inputs, so as to make monotonic net learning easier and more efficient. Fig. 1b shows the input distribution with α = 0, γ = 0.5 and with α = γ = 0.5. In both cases, the distribution over the large negative inputs is greatly reduced; the reduction with α = γ = 0.5 is more significant. Our experiments verify the improvement in learning with our shift-scale softplus function (see Sec. 6.1).
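The shrinkage of the required input range is easy to check numerically from the two inverse functions. A quick sketch (the choice α = γ = 0.5 follows the Fig. 1b discussion):

```python
import math

def softplus_inv(y):
    # inverse of the standard softplus f_sp(x) = log(1 + exp(x))
    return math.log(math.exp(y) - 1.0)

def shift_scale_softplus_inv(y, alpha, gamma):
    # inverse of the shift-scale variant
    # f_sp(x; alpha, gamma) = gamma * log(1 + exp((x - alpha) / gamma))
    return alpha + gamma * math.log(math.exp(y / gamma) - 1.0)

# Input the monotonic net must produce for an output of 1e-4:
x_std = softplus_inv(1e-4)                        # roughly -9.21
x_ss = shift_scale_softplus_inv(1e-4, 0.5, 0.5)   # roughly -3.76
```

So to emit the same near-zero output, the shift-scale variant only needs an input of magnitude about 3.8 instead of about 9.2, which is the range reduction argued above.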

[Figure 2: Graph-informed state updates. The RNN states h_1^m, ..., h_7^m of vertex m are updated by the events on both vertex m itself and its neighborhood vertexes.]

Finally, to incorporate the valuable graph knowledge, we use an attention mechanism to model the RNN states at each vertex (participant), based on the observed events that occurred on both the vertex itself and its neighborhood (see Fig. 2). Specifically, we introduce an embedding vector v_m to represent each vertex m. To compute the RNN states at vertex m, we flatten the observed event sequences on m and its neighborhood B_m to obtain a single sequence Γ̃_m = [(t_1, s_1, r_1), ..., (t_N, s_N, r_N)], where t_1 ≤ ... ≤ t_N, {s_n}_{n=1}^N are the event types, {r_n}_{n=1}^N are the vertex indices, and each r_n ∈ {m} ∪ B_m. For vertex m, we introduce a hidden state h_n^m at each event (t_n, s_n, r_n). We update h_n^m by

g_n^m = RNN-Cell(h_{n−1}^m, u_{s_n}, t_n − t_{n−1}),   a_n = σ(τ · v_m^⊤ v_{r_n}),   (9)
h_n^m = a_n · g_n^m + (1 − a_n) · h_{n−1}^m,   (10)

where σ is the sigmoid function, a_n is an attention score computed from the inner product of the vertex embeddings, and τ > 0 adjusts how much the score leans toward 0 or 1 (in our experiments we set τ = 2.0). We can see that even if an event did not occur on vertex m, i.e., r_n ≠ m, we can still update the state on m according to its similarity to the neighborhood vertex r_n, i.e., the attention score a_n. This score reflects how much vertex m should pay attention to an event that happened on its neighborhood when updating itself. If vertices m and r_n are very dissimilar, i.e., a_n ≈ 0, then there is almost no update.
In this way, the correlations between the vertexes (participants), which are reflected by the graph structure, are integrated into our model to bolster the estimation of the cumulative conditional intensity (see (4)), better fit the observed data, and improve the prediction. When there is no graph structure, we have all r_n = m and a_n ≈ 1, and we recover the standard RNN updates.
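The attention-gated update (numbered (9) in the text) can be sketched in a few lines. Here a single tanh layer stands in for the RNN cell, and all dimensions, weights, and embeddings are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_state(h_prev, u_s, dt, v_m, v_r, W, tau=2.0):
    """One graph-attentive state update: a recurrent cell proposes g_n,
    and the attention score a_n, computed from the similarity of the
    vertex embeddings, gates how much of the proposal is kept."""
    # stand-in for RNN-Cell: one tanh layer over [h_prev, u_s, dt]
    x = np.concatenate([h_prev, u_s, [dt]])
    g = np.tanh(W @ x)
    a = sigmoid(tau * (v_m @ v_r))   # attention score in (0, 1)
    return a * g + (1.0 - a) * h_prev, a

rng = np.random.default_rng(0)
d, d_u = 4, 3
h = rng.normal(size=d)
W = rng.normal(size=(d, d + d_u + 1))
u = rng.normal(size=d_u)
v_m = rng.normal(size=5)

# An event on vertex m itself: identical embeddings, score near 1.
h_self, a_self = update_state(h, u, 0.3, v_m, v_m, W)
# An event on a very dissimilar neighbor: score near 0, little change.
h_far, a_far = update_state(h, u, 0.3, v_m, -v_m, W)
```

The second call illustrates the point made above: when the embeddings are dissimilar, the gate stays near zero and the state on m barely moves.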

4. Algorithm

Given a collection of observed event sequences on all the vertexes (participants), D = {S_1, ..., S_L}, where each S_l = {Γ_l^1, ..., Γ_l^M} (1 ≤ l ≤ L) and each Γ_l^m is an event sequence on vertex m (1 ≤ m ≤ M), we develop a stochastic mini-batch learning algorithm to estimate the vertex embeddings {v_j}_{j=1}^M, the event type embeddings {u_k}_{k=1}^K, the parameters of the RNN and the monotonic net (see (9) and (4)), and the NN parameters of the mark distribution (see (6)). Specifically, to be scalable to a large graph G, at each step we use a random walk to sample a small set of vertexes B. We view the vertexes in B as neighbors of each other. Thereby, the neighborhood includes not only the vertexes that are directly connected, but also the ones connected via a short path. In this way, we can exploit more abundant local structural information. Next, to handle a large number of sequences L, we randomly select one sequence set S_l (l ∈ {1, ..., L}), and use the subset of sequences on the sampled vertexes, {Γ_l^m | m ∈ B}, to compute the likelihood (see (7)) and its gradient. We use this gradient as a stochastic gradient to update the model parameters, and we can apply any stochastic optimization algorithm. Our stochastic training is summarized in Algorithm 1.

Algorithm 1 GINPP (D, E, G, α, γ)

Initialize the model parameters.
repeat
    Randomly shuffle the sequence sets in D = {S_1, ..., S_L}.
    for each set S_l do
        Run a random walk over the graph G to sample a small set of vertexes B.
        Pick the subset of sequences associated with B: A = {Γ_l^m | m ∈ B}.
        Compute the gradient of the likelihood on A according to (7).
        Update the model parameters with the gradient (e.g., via ADAM).
    end for
until E epochs are done
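The random-walk sampling step of Algorithm 1 can be sketched as follows. The adjacency-list representation, walk length, and number of walks are illustrative assumptions; any short-walk scheme that pools the visited vertexes into one neighborhood set B fits the description above:

```python
import random

def random_walk_sample(adj, start, walk_len, num_walks=3, seed=0):
    """Sample a small vertex set B by short random walks from `start`.
    `adj` maps each vertex to its list of neighbors. All vertexes
    reached within `walk_len` hops are pooled into one neighborhood,
    so B also contains vertexes connected to `start` via a short path."""
    rng = random.Random(seed)
    B = {start}
    for _ in range(num_walks):
        v = start
        for _ in range(walk_len):
            if not adj[v]:
                break            # dead end: stop this walk
            v = rng.choice(adj[v])
            B.add(v)
    return B

# A toy path graph 0 - 1 - 2 - 3 - 4:
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
B = random_walk_sample(adj, start=0, walk_len=2)
```

Only the sequences of vertexes in B enter the mini-batch likelihood, which is what keeps each gradient step cheap on a large graph.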

5. Related Work

The Poisson process is a popular tool for event data analysis, e.g., (Lloyd et al., 2015; Schein et al., 2015; 2016; 2019), but its independent-increment assumption ignores the event dependencies and interactions. Many works therefore propose to use Hawkes processes (HPs) (Hawkes, 1971) to capture the mutual excitation effects among events, such as (Blundell et al., 2012; Tan et al., 2016; Linderman and Adams, 2014; Du et al., 2015; He et al., 2015; Wang et al., 2017; Yang et al., 2017a; Xu and Zha, 2017; Xu et al., 2018). A series of works improve learning with HPs, such as nonparametric kernel estimation (Zhou et al., 2013; Zhang et al., 2020b; Zhou et al., 2020), short doubly-censored event sequences (Xu et al., 2017), Granger causality (Xu et al., 2016), and online estimation (Yang et al., 2017b). Another recent line of works (Zhe and Du, 2018; Pan et al., 2020; Wang et al., 2020) uses the HP framework for high-order structure decomposition and representation learning. To estimate more complex event dependencies, recent research has attempted to use neural networks to construct temporal point processes. Du et al. (2016) used an RNN to construct a marked temporal point process, where the conditional intensity is formulated as a linear transformation of the RNN state and the time difference, followed by an exponential transformation to ensure positiveness. Mei and Eisner (2017) proposed a continuous-time LSTM (Hochreiter and Schmidhuber, 1997) to model the conditional intensity and used the softplus transformation to ensure positiveness. To handle the intractable integration in the point process likelihood, Mei and Eisner (2017) used a Monte-Carlo approximation. The recent works (Zhang et al., 2020a; Zuo et al., 2020) use the attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2014) to replace the RNN framework, but they still have to approximate the integration in the likelihood. Omi et al.
(2019) bypassed this problem by feeding the RNN states into a monotonic net (Sill, 1997; Chilinski and Silva, 2020) to directly model the cumulative conditional intensity, so that the integration need not be computed explicitly. The monotonic net is typically a multi-layer perceptron that imposes a nonnegativity constraint on the weights during learning to fulfill the monotonicity. While successful, the model of Omi et al. (2019) only supports single-type events, which can be limiting in practice. Our work GINPP extends their model to support multi-type events with the marked point process framework (Daley and Vere-Jones, 2007), which is simple and efficient. Note that Du et al. (2016) also used the marked point process framework. GINPP uses the RNN framework to model the hidden state, due to its stable and excellent performance, but an attention mechanism is used to incorporate the graph structure into the state computation. Hence, GINPP can be viewed as a hybrid approach; it is straightforward to extend GINPP to a full attention model. Other works include Zhou et al. (2021), which extends HPs with a mixture of shifted Beta densities, ODE-based models (Rubanova et al., 2019), neural network influence kernels (Zhu et al., 2021a), deep Fourier kernels (Zhu et al., 2021b), the intensity-free point process learning of Shchur et al. (2019), which models the time difference between successive events, etc. Based on the intensity-free framework (Shchur et al., 2019), Zhang et al. (2021) also proposed a neural point process model that incorporates a prior graph. But their goal is to infer the labels of nodes in the graph to detect clusters and anomalies, rather than to predict the occurrence and type of new events. The recent work of Pan et al. (2021) develops a nonparametric decaying model of the temporal influence, which can explicitly recover various excitation and inhibition effects, and their decay patterns, among the events. While flexible and interpretable, it still needs to approximate the cumulative conditional intensity, which is done via Gauss-Laguerre quadrature.

6.1. Ablation Study

We first performed an ablation study to confirm the effectiveness of our shift-scale softplus transformation (8). To this end, we tested with two functions, g1(x) = 0.5·σ(5(x − 1)) and g2(x) = 0.3·0.5 sin(10x)e^(−x/2) + e^(x−1) − 0.2, where g1(x) > 0 monotonically increases with the input x and g2 does not. We used a monotonic net plus the shift-scale softplus transformation to learn the two functions. The monotonic net includes two layers, with 256 neurons per layer and tanh activation. We uniformly sampled 50 training points from x ∈ [0, 2]. We implemented the model with TensorFlow and used Adam for stochastic optimization, with the learning rate set to 10^−3. We tested two settings, α = 0, γ = 0.2 and α = 0.5, γ = 0.2, and also tested with the standard softplus.

Figure 3: The learning curves and estimation results of the monotonic net with softplus transformation.

The learning curves and the estimated functions are shown in Fig. 3. As we can see, with our shift-scale softplus, the learning converges much faster (see Fig. 3a and Fig. 3c). With the standard softplus, the learning first fell into a relatively large loss and stayed there for quite a long time before improvement started to happen. By contrast, during the stage where the standard softplus was stuck, our shift-scale softplus had already driven the training loss to drop quickly and converge (see the red line). This shows that the shift-scale version can greatly boost the learning efficiency. Note that for the standard softplus, the long stuck stage can send the wrong message that training can stop. The estimated functions with our shift-scale softplus are also more accurate. From Fig. 3b and 3d, our learned functions are quite close to the ground truth. In Fig. 3b, our estimation (α = 0.5, γ = 0.2) almost overlaps with the ground truth. By contrast, the estimation with the standard softplus (the blue line) shows a clear deviation.
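The difficulty with the standard softplus can be seen directly from its inverse: for the transformed output to sit just above zero, the monotonic net itself must produce a value of much larger (negative) scale. A quick numerical check (plain Python, illustrative only):

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def inv_softplus(y):
    # inverse of softplus: x = log(exp(y) - 1), defined for y > 0
    return math.log(math.expm1(y))

# to output a value as small as 0.01, the net must produce about -4.6
assert inv_softplus(0.01) < -4.0
# whereas moderate outputs only require moderate inputs
assert abs(inv_softplus(1.0)) < 1.0
# round-trip sanity check
assert abs(softplus(inv_softplus(0.37)) - 0.37) < 1e-12
```

This widening of the required input range is what enlarges the prediction variation of the monotonic net and slows training, the problem the shift-scale variant is designed to mitigate.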
This shows that our shift-scale softplus not only bolsters the learning efficiency but also leads to better learning results. Next, we examined whether GINPP can recover the ground-truth conditional intensity. To this end, we generated a synthetic dataset of bi-type events, where type 0 events excite type 1 events, type 1 events inhibit type 0 events, and events of the same type do not influence each other: ρ_{0→1}(Δ) = max(1.0 − 0.05Δ², 0), ρ_{1→0}(Δ) = −0.5 exp(−0.5Δ). We then substituted ρ_{0→1} and ρ_{1→0} into (2) (the background rate is zero). To ensure the positiveness, we applied a softplus transformation to obtain the conditional intensity. We used the thinning algorithm (Lewis and Shedler, 1979) to sample 10K sequences for training and 1K for validation. Each sequence consists of 64 events. We evaluated the likelihood of the validation set after each training epoch, and halted the training when no improvement was observed (i.e., early stopping). We used the GRU cell for our RNN component. The dimensions of the hidden state and the event type embeddings were set to 64 and 16, respectively. For the monotonic net component, we used two hidden layers, with 256 neurons per layer and tanh activation. For the NN in the mark distribution (see (6)), we used one hidden layer, with 256 neurons and ReLU activation. We set α = γ = 0.5 for the shift-scale softplus. We used Adam for stochastic optimization, with learning rate 10^−3 and mini-batch size 64. The maximum number of epochs was set to 100. Fig. 4a shows the estimated total conditional intensity, and we can see that it matches the ground truth quite closely. Fig. 4b then reports the conditional intensities for type 0 and type 1 derived from the learned mark distribution. As we can see, GINPP can also accurately recover the intensity for each event type, even though it does not estimate them explicitly. These results confirm the capability of our multi-type event model.
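The data generation above can be sketched with Lewis-Shedler-style thinning as follows (plain Python; the paper's exact simulation settings are not reproduced here, so the dominating-rate choice and function names are illustrative):

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def rho_01(d):          # type 0 excites type 1
    return max(1.0 - 0.05 * d * d, 0.0)

def rho_10(d):          # type 1 inhibits type 0
    return -0.5 * math.exp(-0.5 * d)

def intensities(t, events):
    """Conditional intensities at time t given past events [(time, type)];
    the background rate is zero and softplus ensures positiveness."""
    pre0 = sum(rho_10(t - s) for s, k in events if k == 1)
    pre1 = sum(rho_01(t - s) for s, k in events if k == 0)
    return softplus(pre0), softplus(pre1)

def sample_sequence(n_events, seed=0):
    rng = random.Random(seed)
    events, t = [], 0.0
    while len(events) < n_events:
        # valid dominating rate: each type-0 event contributes at most 1 to
        # the type-1 pre-intensity, and inhibition only lowers intensities
        n0 = sum(1 for _, k in events if k == 0)
        lam_bar = softplus(float(n0)) + softplus(0.0)
        t += rng.expovariate(lam_bar)                # candidate arrival
        lam0, lam1 = intensities(t, events)
        if rng.random() < (lam0 + lam1) / lam_bar:   # accept candidate
            k = 0 if rng.random() < lam0 / (lam0 + lam1) else 1
            events.append((t, k))
    return events

seq = sample_sequence(64)
assert len(seq) == 64
assert all(k in (0, 1) for _, k in seq)
```

The accepted event's type is drawn in proportion to the per-type intensities, which is the standard way to extend thinning to marked processes.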

6.2. Predictive Performance

Next, we evaluated the performance of GINPP in predicting the time and type of future events. To this end, we used six real-world benchmark datasets, where the first four do not include graph information and the remaining two are associated with graphs. (1) Retweet (Zhao et al., 2015), including 24K event sequences and three event types: retweeting by "small", "medium" and "large" retweeters. (2) SO (Du et al., 2016), 6,633 awarding sequences in the Q/A site Stack Overflow. The event type is the award, such as "Nice Question", "Guru", and "Great Answer"; there are 22 event types. (3) MIMIC (Du et al., 2016), clinical visit events from 650 anonymous patients over seven years, where the event type is the diagnosis outcome; there are 75 outcomes. (4) Social, 48.9K student activity events on the campus of a university. The event type corresponds to the campus location of the event; we have 50 event types and 1,614 event sequences among 51 students. (5) 911Call, 251K emergency calls in Montgomery County, Pennsylvania, from 2015 to 2017. There are three types of calls (events): EMS, FIRE and Traffic. We collected 6,187 event sequences from 73 zones. We viewed each zone as a vertex, and added an edge if two zones neighbor each other. (6) SLCCrime, 48.7K criminal events that happened in 23 zones of Salt Lake City. There are 25 event types (e.g., robbery, murder and arson) and 1,403 sequences. We viewed each zone as a vertex and created the graph according to the geographic neighboring relationship.
We compared with the following popular and/or state-of-the-art approaches: (1) PP, the simple homogeneous Poisson process; (2) Hawkes process (HP) with an exponential triggering kernel; (3) Recurrent Marked Temporal Point Process (RMTPP) (Du et al., 2016); (4) Neural Hawkes Process (NeuralHP) (Mei and Eisner, 2017); (5) Self-Attentive Hawkes Process (SAHP) (Zhang et al., 2020a), which uses the self-attention mechanism to model the conditional intensity for each event type; (6) Transformer Hawkes Process (TRHP) (Zuo et al., 2020), which also uses the attention mechanism to model the intensity. In addition, we compared with (7) Simple Statistics (SS), which predicts the occurrence time of new events with the average lag between successive events, and predicts the new event type with the most frequent type of the observed events. We used the original implementations of NeuralHP (https://github.com/HMEIatJHU/neurawkes), SAHP (https://github.com/QiangAIResearcher/sahp_repo) and TRHP (https://github.com/SimiaoZuo/Transformer-Hawkes-Process), and a popular open-source implementation of RMTPP (https://github.com/woshiyyya/ERPP-RMTPP). For GINPP, we employed the same setting as in the ablation study (see Sec. 6.1). In addition, we set the dimension of the vertex embeddings to 16. To perform the random walk, we randomly selected an initial vertex, and at each step randomly hopped to a vertex that connects to the current vertex, with probability equal to the inverse of the degree of the current vertex. We maintained the visited vertex set until its size reached the mini-batch size. Except for SS, all the methods used Adam for stochastic optimization with learning rate 10^−3, and the mini-batch size was chosen from {8, 16, 32}. We used the default settings of all the other methods, and early stopping for every method. We randomly split each dataset into 70% for training, 10% for validation, and 20% for testing.
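The random-walk collection described above can be sketched as follows (plain Python; the function name and adjacency-list format are illustrative, and an explicit budget stands in for the mini-batch size):

```python
import random

def collect_neighborhood(adj, start, budget, seed=0):
    """Random-walk vertex collection: from the current vertex, hop to a
    uniformly chosen neighbor (probability 1/degree) and record it, until
    `budget` distinct vertices have been visited."""
    rng = random.Random(seed)
    visited, current = {start}, start
    while len(visited) < budget:
        neighbors = adj[current]
        if not neighbors:                       # dead end: restart anywhere
            current = rng.choice(list(adj))
        else:
            current = rng.choice(neighbors)     # uniform hop = 1/degree
        visited.add(current)
    return visited

# toy graph: a ring of 6 vertices
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
walk = collect_neighborhood(adj, start=0, budget=4)
assert len(walk) == 4 and 0 in walk
```

Because each walk only touches a bounded number of vertices per step, this kind of neighborhood sampling scales to large graphs without materializing the full adjacency structure.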
Each method was used to predict the occurrence time and type of the last event in each test sequence. We repeated the experiment five times, and computed the average root-mean-square error (RMSE) and classification accuracy (ACC) for time and type predictions, respectively, along with the standard deviations. For all the datasets, we also ran our method without graphs (i.e., we set an empty G), denoted by GINPP-1. The results are reported in Table 1. As we can see, on every dataset GINPP outperforms all the competing methods, in many cases by a large margin. This shows that our method is superior in both event time and type prediction. When the graph knowledge is available, GINPP is better than the variant that does not incorporate the graph structure, i.e., GINPP-1. These results show that the graph structure can further facilitate training and prediction, and that our attention method is effective. Finally, we investigated the attention scores between the vertices after training. Specifically, we examined three zones in the 911Call dataset: Zones 2, 52 and 67. Fig. 5a shows the locations of all the zones. The points represent the observed events in those zones, and the color represents the region of each zone. Figs. 5b, 5c and 5d show the attention scores between all the event locations and the events in Zones 2, 52 and 67, respectively. The color indicates the magnitude; see the color bar. We can see that in many cases, when the event locations are in the neighboring zones, the attention scores with them are high, implying those events have a significant impact on updating the hidden state of the zone. However, there are also cases where the attention score with neighboring zones is small or close to zero, for example, the purple points surrounding Zone 2 (see Fig. 5b) and those on the right side of Zone 67 (see Fig. 5d). In such cases, the neighborhood events have little effect on the state update. This shows the selective effect of the attention mechanism.
On the other hand, the attention scores are in general small or close to zero for distant zones. This might be attributed to the incorporation of the graph structure: distant vertices are unlikely to be collected by the random walk, and hence their embeddings can become more dissimilar after training.

7. Conclusion

We have developed GINPP, a graph-informed neural point process that avoids the intractable integration in the likelihood, supports multiple event types, and incorporates valuable graph knowledge into training and prediction. The experiments in the ablation studies and real-world applications have shown encouraging results.



http://realitycommons.media.mit.edu/SocialEvolutionData.html
https://www.kaggle.com/datasets/mchirico/montcoalert
https://opendata.utah.gov/browse?category=Public%20Safety



Figure 1: Inverse of the standard softplus and the input distribution for shift-scale softplus.

Figure 2: State update for a vertex.

Figure 4: Estimation of the conditional intensity.

Figure 5: The attention scores of zone 52, 2, and 67. The cross is the center of the zone.

Table 1: Predictive performance of the time (RMSE) and type (ACC) of the future events.

SS          ± 0.002        0.366 ± 0.010  0.305 ± 0.021  0.122 ± 0.006  0.526 ± 0.008  0.222 ± 0.006
PP        0.549 ± 0.002  0.366 ± 0.010  0.186 ± 0.050  0.110 ± 0.005  0.544 ± 0.008  0.181 ± 0.011
HP        0.540 ± 0.014  0.357 ± 0.007  0.294 ± 0.041  0.123 ± 0.009  0.511 ± 0.013  0.216 ± 0.013
RMTPP     0.575 ± 0.006  0.376 ± 0.008  0.848 ± 0.018  0.509 ± 0.008  0.531 ± 0.005  0.203 ± 0.009
NeuralHP  0.574 ± 0.014  0.383 ± 0.007  0.718 ± 0.041  0.556 ± 0.009  0.511 ± 0.013  0.218 ± 0.013
SAHP      0.497 ± 0.027  0.305 ± 0.025  0.337 ± 0.036  0.053 ± 0.015  0.293 ± 0.021  0.041 ± 0.016
TRHP      0.541 ± 0.004  0.375 ± 0.008  0.768 ± 0.015  0.485 ± 0.007  0.532 ± 0.008  0.220 ± 0.011
GINPP-1   0.607 ± 0.004  0.401 ± 0.009  0.863 ± 0.015  0.616 ± 0.011  0.550 ± 0.005  0.228 ± 0.011
GINPP     --             --             --             --             0.553 ± 0.005  0.233 ± 0.009

