ADVERSARIAL DATA GENERATION OF MULTI-CATEGORY MARKED TEMPORAL POINT PROCESSES WITH SPARSE, INCOMPLETE, AND SMALL TRAINING SAMPLES

Abstract

Asynchronous stochastic discrete event based processes are commonplace in application domains such as social science, homeland security, and health informatics. Modeling the complex interactions in such event data via marked temporal point processes (MTPPs) enables the detection and prediction of specific interests or profiles. We present a novel multi-category MTPP generation technique for applications where training datasets are inherently sparse, incomplete, and small. The proposed adversarial architecture augments an adversarial autoencoder (AAE) with feature mapping techniques, which include a transformation between the categories and timestamps of marked points and the percentile distribution of the particular category. Transforming the training data to this distribution facilitates accurate capture of the underlying process characteristics despite the sparseness and incompleteness of the data. The proposed method is validated using several benchmark datasets. The similarity between actual and generated MTPPs is evaluated and compared with a Markov process based baseline. Results demonstrate the effectiveness and robustness of the proposed technique.

1. INTRODUCTION

Marked Temporal Point Processes (MTPPs) are widely used for modeling and analysis of asynchronous stochastic discrete events in continuous time (Upadhyay et al., 2018; Türkmen et al., 2019; Yan, 2019), with applications in numerous domains such as homeland security, cybersecurity, consumer analytics, health care analytics, and social science. An MTPP models stochastic discrete events as marked points $e_i$, each defined by its time of occurrence $t_i$ and its category $c_i$. Point processes are usually characterized by the conditional intensity function, $\lambda^*(t)\,dt = \lambda(t \mid H_t)\,dt = P[\text{event} \in [t, t + dt) \mid H_t]$, which, given the past $H_t = \{e_i = (c_i, t_i) \mid t_i < t\}$, specifies the probability of an event occurring at future time points. Many functional forms of the intensity are in common use. The Hawkes process (self-exciting process) (Hawkes, 1971) is a point process used in both statistical and machine learning contexts in which the intensity is a linear function of past events $H_t$ (Türkmen et al., 2019). In traditional parametric models, the conditional intensity functions are manually pre-specified (Yan, 2019). Recently, various neural network models (generally called neural TPPs) have been used to learn arbitrary and unknown distributions while eliminating manual selection of the intensity function. Reinforcement learning (Zhu et al., 2019; Li et al., 2018), recurrent neural networks (RNNs) (Du et al., 2016), and generative neural networks (Xiao et al., 2018) are used to approximate the intensity functions and learn complex MTPP distributions using larger datasets. Recent advances in data collection techniques make it possible to collect complex event data that form heterogeneous MTPPs, in which a marked point $e_{ij}$ specifies a time of occurrence $t_i$ and a category $c_j$ separately. Multi-category MTPPs therefore concern not only the time of occurrence but also the category of the next marked point.
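The conditional intensity and the Hawkes process introduced in this section can be illustrated with a minimal Python sketch. It is not the method proposed in this paper. It assumes an exponential decay kernel, and the function name `hawkes_intensity`, the parameters `mu`, `alpha`, and `beta`, and the sample history are hypothetical choices made only for illustration.

```python
import math
from typing import List, Tuple

# A marked point e_i = (t_i, c_i): time of occurrence and category label.
MarkedPoint = Tuple[float, str]


def hawkes_intensity(
    t: float,
    history: List[MarkedPoint],
    mu: float = 0.2,     # assumed baseline rate
    alpha: float = 0.5,  # assumed jump added by each past event
    beta: float = 1.0,   # assumed exponential decay rate
) -> float:
    """Conditional intensity lambda*(t) of a Hawkes process.

    Assumes an exponential kernel:
        lambda*(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))
    so the intensity is linear in the contributions of past events.
    Categories are ignored here; only the timestamps excite the process.
    """
    excitation = sum(
        math.exp(-beta * (t - t_i)) for t_i, _ in history if t_i < t
    )
    return mu + alpha * excitation


# Illustrative (made-up) history of marked points with two categories.
history: List[MarkedPoint] = [(0.5, "a"), (1.2, "b"), (1.3, "a")]

for t in (0.1, 1.0, 1.31):
    print(f"lambda*({t}) = {hawkes_intensity(t, history):.4f}")
```

Before the first event the intensity equals the baseline `mu`, and it rises after each event before decaying back, which is the self-exciting behavior described above. A multi-category model would additionally have to account for the category $c_i$ of each point, which this sketch deliberately omits.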
Multi-category MTPPs add an extra dimension to the distribution, which complicates learning with existing techniques. At the same time, multi-category MTPPs are highly useful for modeling the behavioral patterns of suspicious or specific individuals and groups in homeland security (Campedelli et al., 2019a;b; Hung et al., 2018; 2019), potential malicious network activities in cybersecurity (Peng et al., 2017), recommendation systems in consumer analytics (Vassøy et al., 2019), and the behavioral patterns of patients to determine certain illnesses (Islam et al., 2017; Mancini & Paganoni, 2019). A number of challenges limit the collection of and access to data in many fields, often resulting in small and incomplete datasets. Data on social, political, and criminal behavior are often incomplete because of data collection challenges such as data quality maintenance, privacy, and confidentiality issues (National Institutes of Health & Services, 2020), yet a rigorous analysis with complete data is essential to produce accurate and reliable outcomes. There is therefore a critical need for a technique that can capture and learn an MTPP distribution, and support the development and application of machine learning algorithms, from a small dataset, part of which may be incomplete. We present an adversarial multi-category MTPP generation technique that is capable of generating sparse, asynchronous, stochastic, multi-category, discrete events in continuous time based on a limited dataset. Adversarial training has evolved rapidly and delivers strong results in many data generation applications, mostly image, audio, and video generation, while closely mimicking the features of an actual dataset. The original GAN architecture (Goodfellow et al., 2014) works well only for continuous and complete data distributions, and GANs were initially not used for learning the distribution of discrete variables (Choi et al., 2017).
Later, GAN architectures for discrete events were introduced (Makhzani et al., 2015; Yu et al., 2017) and applied to MTPP generation using extensive training data (Xiao et al., 2018; 2017). Adversarial autoencoders (AAEs) are adept at capturing latent discrete or continuous distributions (Makhzani et al., 2015). In this work, we present feature mapping modules that accommodate incomplete data and make the AAE capable of capturing the MTPP distributions of incomplete and small datasets. Incompleteness of the data can arise in the following ways: some marked points were not collected, or actors did not originally expose some marked points owing to the dynamic nature of these stochastic processes, which is especially the case in social and behavioral domains. The main contribution of the paper is a novel technique to synthetically generate high-fidelity multi-category MTPPs using adversarial autoencoders and feature mapping techniques by leveraging sparse, incomplete, and small datasets. To the best of our knowledge, no technique is available for multi-category MTPP generation from such a dataset, which is significantly more challenging than existing generation scenarios. Section 2 reviews related literature on MTPPs and AAEs. Section 3 presents the definition of multi-category MTPPs, and Section 4 discusses the use of AAEs for incomplete, multi-category MTPP generation. Section 5 then presents the preprocessing and postprocessing techniques included in the feature mapping encoder and decoder. Section 6 discusses the results of the experiments, and Section 7 summarizes the conclusions and future work.

2. RELATED WORK

MTPPs are widely used for modeling asynchronous stochastic discrete events in continuous time (Upadhyay et al., 2018; Du et al., 2016; Li et al., 2018; Türkmen et al., 2019). Usually, an MTPP is defined using a conditional intensity function (Türkmen et al., 2019), which provides the instantaneous rate of events given previous points. Intensity functions are often approximated by various processes such as the Poisson process, the Hawkes process (self-exciting process) (Hawkes, 1971), and the self-correcting process (Isham & Westcott, 1979). In traditional MTPPs, the intensity function has to be explicitly defined; however, any mismatch between the manually defined intensity function and the underlying intensity function of a process can have a significant adverse impact on the accuracy of models and outcomes. Deep generative networks remove the need to identify the intensity manually and thus allow the use of arbitrary and complex distributions. Recurrent Neural Networks (RNNs) with reinforcement learning have been widely used in recent years (Du et al., 2016; Li et al., 2018), and several hybrid and extended models have also been presented. A stochastic sequential model is proposed in (Sharma et al., 2019) as a combination of a deep state space model and a deterministic RNN for modeling MTPPs. FastPoint (Türkmen et al., 2019) uses deep RNNs to capture complex temporal patterns, while self-excitation dynamics within each mark are modeled using Hawkes processes. A semi-parametric generative model is introduced in (Zhu et al., 2019) for spatio-temporal event data by combining spatial statistical models with reinforcement learning. Advanced data collection techniques and online social media platforms produce complex event data, and thus social network analysis can now be used to inform solutions to many societal issues (Bonchi et al., 2011). Many such

