Learning Stochastic Behaviour from Aggregate Data

Abstract

Learning nonlinear dynamics from aggregate data is a challenging problem because the full trajectory of each individual is not available: an individual observed at one time point may not be observed at the next, or the identities of individuals may be unavailable altogether. This is in sharp contrast to learning dynamics from trajectory data, on which the majority of existing methods are based. We propose a novel method that uses the weak form of the Fokker-Planck equation (FPE) to describe the density evolution of the data in a sampled form, which is then combined with a Wasserstein generative adversarial network (WGAN) during training. In this sample-based framework we are able to study nonlinear dynamics from aggregate data without solving the partial differential equation (PDE). With the help of deep neural networks the model can also handle high-dimensional cases. We demonstrate our approach on a series of synthetic and real-world data sets.

1. Introduction

In the context of a dynamic system, aggregate data refers to data sets in which the full trajectory of each individual is not available, meaning that there is no known individual-level correspondence across time. Typical examples include data sets collected for DNA evolution, social gathering, density in control problems, and bird migration, during which it is impossible to identify each individual bird, among many others. In those applications, some individuals observed at one time point may be unobserved at the next, or individual identities are blocked or unavailable for various technical and ethical reasons. Rather than inferring the exact information for each individual, the main objective of learning dynamics from aggregate data is to recover and predict the evolution of the distribution of all individuals together. Trajectory data, in contrast, is data in which we are able to acquire the information of each individual at all times, although some studies have considered the case where individual trajectories are partially missing; the identities of the individuals, whenever they are observable, are always assumed available. Examples include stock prices, weather, customer behaviors and most training data sets for computer vision and natural language processing. There are many popular models for learning the dynamics of full-trajectory data. Typical ones include the Hidden Markov Model (HMM) (Alshamaa et al., 2019; Eddy, 1996), the Kalman Filter (KF) (Farahi & Yazdi, 2020; Harvey, 1990; Kalman, 1960) and the Particle Filter (PF) (Santos et al., 2019; Djuric et al., 2003), as well as models built upon HMM, KF and PF (Deriche et al., 2020; Fang et al., 2019; Hefny et al., 2015; Langford et al., 2009). They all require the full trajectory of each individual, which may not be available in aggregate data situations. On the other side, only a few methods in the recent learning literature focus on aggregate data. In the work of Hashimoto et al. 
(2016), the authors assume that the hidden dynamics of particles follows a stochastic differential equation (SDE); in particular, they use a recurrent neural network to parameterize the drift term. Furthermore, Wang et al. (2018) improved the traditional HMM by using an SDE to describe the evolving process of the hidden states. To the best of our knowledge, there is not yet a method that directly learns the evolution of the density of objects from aggregate data. We propose to learn the dynamics of the density through the weak form of the Fokker-Planck equation (FPE), a parabolic partial differential equation (PDE) governing many dynamical systems subject to random noise perturbations, including the typical SDE models in existing studies. Learning is accomplished by minimizing the Wasserstein distance between the predicted distribution given by the FPE and the empirical distribution of the data samples, and we utilize neural networks to handle higher-dimensional cases. More importantly, by leveraging the framework of the Wasserstein Generative Adversarial Network (WGAN) (Arjovsky et al., 2017), our model is capable of approximating the distribution of samples at different time points without solving the SDE or the FPE. More specifically, we treat the drift coefficient in the FPE, the goal of learning, as a generator, and the test function in the weak form of the FPE as a discriminator. In other words, our method can also be regarded as a data-driven method to estimate the transport coefficient in the FPE, which corresponds to the drift term of the SDE. Additionally, though we treat the diffusion term as a constant in our model, it is straightforward to generalize it to a neural network as well, which can be an extension of this work. We would like to mention that several methods for solving SDEs and FPEs (Weinan et al., 2017; Beck et al., 2018; Li et al., 2019) take the opposite direction to ours: they utilize neural networks to estimate the distribution P(x, t) given the drift and diffusion terms.
In conclusion, our contributions are:
• We design an algorithm that recovers the density evolution of nonlinear dynamics by minimizing the Wasserstein discrepancy between real aggregate data and our generated data.
• By leveraging the weak form of the FPE, we compute the Wasserstein distance directly without solving the FPE.
• Finally, we demonstrate the accuracy and effectiveness of our algorithm on several synthetic and real-world examples.

2. Proposed Method

2.1. Fokker-Planck Equation for the Density Evolution

We assume the individuals evolve in the space R^D following the pattern shown in Figure 1. One process satisfying such a pattern is the stochastic differential equation (SDE), also known as the Itô process (Øksendal, 2003): dX_t = g(X_t, t)dt + σdW_t. Here dX_t represents an infinitesimal change of {X_t} along the time increment dt, g(·, t) = (g_1(·, t), ..., g_D(·, t))^T is the drift term (drifting vector field) that drives the dynamics of the SDE, σ is the diffusion constant, and {W_t} is standard Brownian motion. The probability density of {X_t} is then governed by the Fokker-Planck equation (FPE) (Risken & Caugheyz, 1991), as stated below in Lemma 1.

Lemma 1. Suppose {X_t} solves the SDE dX_t = g(X_t, t)dt + σdW_t, and denote by p(·, t) the probability density of the random variable X_t. Then p(x, t) solves the following equation:
$$\frac{\partial p(x,t)}{\partial t} = \sum_{i=1}^{D} -\frac{\partial}{\partial x_i}\big(g_i(x,t)\,p(x,t)\big) + \frac{1}{2}\sigma^2 \sum_{i=1}^{D} \frac{\partial^2}{\partial x_i^2}\, p(x,t). \tag{1}$$

As a linear evolution PDE, the FPE describes the evolution of the density function of the stochastic process driven by an SDE. For this reason, the FPE plays a crucial role in stochastic calculus, statistical physics and modeling (Nelson, 1985; Qi & Majda, 2016; Risken, 1989). Its importance is also drawing increasing attention in the statistics and machine learning communities (Liu & Wang, 2016; Pavon et al., 2018; Rezende & Mohamed, 2015). 
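To make Lemma 1 concrete, the sketch below (our own illustration, not code from this work; names such as `simulate_sde` are hypothetical) simulates a 1-D Ornstein-Uhlenbeck process dX_t = -aX_t dt + σdW_t with the Euler-Maruyama method and checks the empirical variance at time T against the variance of the FPE's closed-form Gaussian solution, σ²/(2a)(1 - e^(-2aT)).

```python
import numpy as np

def simulate_sde(a=1.0, sigma=0.5, T=1.0, dt=1e-2, n_paths=100_000, seed=0):
    """Euler-Maruyama simulation of the 1-D OU process dX = -a X dt + sigma dW, X_0 = 0."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_paths)
    for _ in range(int(T / dt)):
        x += -a * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return x

# For this linear drift the FPE has a closed-form solution: p(., T) is Gaussian
# with mean 0 and variance sigma^2/(2a) * (1 - exp(-2aT)).
a, sigma, T = 1.0, 0.5, 1.0
x_T = simulate_sde(a, sigma, T)
var_fpe = sigma**2 / (2 * a) * (1 - np.exp(-2 * a * T))
print(abs(x_T.var() - var_fpe))  # small (Monte-Carlo + O(dt) discretization error)
```

The same agreement between sample statistics and the FPE solution is what the sample-based framework in this paper exploits, without ever solving the PDE.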
In this paper, we utilize the weak form of the FPE as a basis for studying the hidden dynamics of time-evolving aggregate data without solving the FPE. Our task can be described as follows: assume that the individuals evolve with the process indicated by Figure 1, which can be simulated by an Itô process. Given observations x_t along the time axis, we aim to recover the drift coefficient g(x, t) in the FPE, so that we can recover and predict the density evolution of the dynamics. For simplicity we treat g(x, t) as independent of time t, namely g(x, t) = g(x). Notice that although the evolving process of the individuals can be simulated by an Itô process, in reality we lose the identity information of the individuals and the observed data become aggregate data, so we need a new approach, beyond the traditional methods, to study the swarm's distribution.

2.2. Weak Form of Fokker Planck Equation

Given the FPE stated in Lemma 1, if we multiply both sides by a test function f ∈ H¹₀(R^D) and integrate, we obtain
$$\int \frac{\partial p}{\partial t} f(x)\,dx = \sum_{i=1}^{D}\int -\frac{\partial}{\partial x_i}\big(g_i(x)\,p(x,t)\big)\, f(x)\,dx + \frac{1}{2}\sigma^2 \sum_{i=1}^{D}\int \frac{\partial^2}{\partial x_i^2}\, p(x,t)\, f(x)\,dx.$$
Integrating by parts on the right-hand side then leads to the weak form of the FPE:
$$\int \frac{\partial p}{\partial t} f(x)\,dx = \sum_{i=1}^{D}\int g_i(x)\,\frac{\partial}{\partial x_i} f(x)\, p(x,t)\,dx + \frac{1}{2}\sigma^2 \sum_{i=1}^{D}\int \frac{\partial^2}{\partial x_i^2} f(x)\, p(x,t)\,dx,$$
where H¹₀(R^D) denotes the Sobolev space. The first advantage of the weak formulation is that a classical solution of a PDE usually requires strong regularity and thus may not exist for certain classes of equations, whereas a weak solution has fewer regularity requirements and its existence is guaranteed for a much larger class of equations. The second advantage is that the weak formulation may provide new perspectives for numerically solving PDEs (Zienkiewicz & Cheung, 1971; Sirignano & Spiliopoulos, 2018; Zang et al., 2019).

Suppose the observed samples at time points t_{m-1} and t_m follow the true densities p(·, t_{m-1}) and p(·, t_m). Consider the following SDE:
$$d\hat{X}_t = g_\omega(\hat{X}_t)\,dt + \sigma\, dW_t, \quad t_{m-1} \le t \le t_m, \qquad \hat{X}_{t_{m-1}} \sim p(\cdot, t_{m-1}). \tag{2}$$
Here g_ω is an approximation of the real drift term g; in our research, g_ω is a neural network with parameters ω. Denote by p̂(·, t) the density of X̂_t. It is then natural to compute and minimize the discrepancy between our approximated density p̂(·, t_m) and the true density p(·, t_m); in doing so we optimize g_ω and thus recover the true drift term g. We choose the Wasserstein-1 distance as our discrepancy function (Villani, 2008; Arjovsky et al., 2017). 
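Before moving to the dual estimate, it is worth recalling that in one dimension the Wasserstein-1 distance between two equal-size empirical samples has an exact closed form: the mean absolute difference of the sorted samples (a discretization of the quantile-function formula). The sketch below is our own sanity baseline against which a learned dual estimate can be compared; it is not part of the proposed method, which relies on the dual formulation in higher dimensions.

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """Exact Wasserstein-1 distance between two equal-size 1-D empirical
    distributions: mean absolute difference of the sorted samples."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000) + 1.0   # same shape, shifted by 1
print(w1_empirical_1d(a, b))  # close to 1.0: W1 between N(0,1) and N(1,1) is 1
```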
Now we apply the Kantorovich-Rubinstein duality (Villani, 2008), which leads to:
$$W_1\big(\hat{p}(\cdot,t_m),\, p(\cdot,t_m)\big) = \sup_{\|\nabla f\| \le 1} \mathbb{E}_{x_r \sim p(x,t_m)}[f(x_r)] - \mathbb{E}_{x_g \sim \hat{p}(x,t_m)}[f(x_g)]. \tag{3}$$
The first term in Equation (3) can be conveniently computed by the Monte Carlo method, since we are already provided with the real data points x_r ∼ p(·, t_m). To evaluate the second term, we first approximate p̂(·, t_m) by the trapezoidal rule (Atkinson, 2008):
$$\hat{p}(x,t_m) \approx \hat{p}(x,t_{m-1}) + \frac{\Delta t}{2}\left(\frac{\partial \hat{p}(x,t_{m-1})}{\partial t} + \frac{\partial \hat{p}(x,t_m)}{\partial t}\right), \tag{4}$$
where ∆t = t_m − t_{m-1}. Thus we can compute:
$$\mathbb{E}_{x_g \sim \hat{p}(\cdot,t_m)}[f(x_g)] \approx \int f(x)\,\hat{p}(x,t_{m-1})\,dx + \frac{\Delta t}{2}\left(\int \frac{\partial \hat{p}(x,t_{m-1})}{\partial t}\, f(x)\,dx + \int \frac{\partial \hat{p}(x,t_m)}{\partial t}\, f(x)\,dx\right). \tag{5}$$
In Equation (5), the second and third terms on the right-hand side can be reformulated via the weak form of the FPE, so we can derive a computable formulation of W₁(p̂(·, t_m), p(·, t_m)). Furthermore, we can use the Monte Carlo method to approximate the expectations in (5): the first and second terms can be computed directly from data points drawn from p(·, t_{m-1}). For the third term, we need to generate samples from p̂(·, t_m); to achieve this, we apply the Euler-Maruyama scheme (Kloeden & Platen, 2013) to SDE (2) in order to acquire the desired samples x̂_{t_m}:
$$\hat{x}_{t_m} = \hat{x}_{t_{m-1}} + g_\omega(\hat{x}_{t_{m-1}})\,\Delta t + \sigma\sqrt{\Delta t}\, z, \quad z \sim N(0, I), \quad \hat{x}_{t_{m-1}} \sim p(\cdot, t_{m-1}). \tag{6}$$
Here N(0, I) is the standard Gaussian distribution on R^D. Now we summarize these results in Proposition 1.

Proposition 1. For a set of points X = {x⁽¹⁾, ..., x⁽ᴺ⁾} in R^D. 
We denote
$$F_f(X) = \frac{1}{N}\sum_{k=1}^{N}\left(\sum_{i=1}^{D} g^i_\omega(x^{(k)})\,\frac{\partial}{\partial x_i} f(x^{(k)}) + \frac{1}{2}\sigma^2 \sum_{i=1}^{D} \frac{\partial^2}{\partial x_i^2} f(x^{(k)})\right).$$
Then at time point t_m, the Wasserstein distance between p̂(·, t_m) and p(·, t_m) can be approximated by
$$W_1\big(\hat{p}(\cdot,t_m),\, p(\cdot,t_m)\big) \approx \sup_{\|\nabla f\|\le 1}\left\{\frac{1}{N}\sum_{k=1}^{N} f(x^{(k)}_{t_m}) - \frac{1}{N}\sum_{k=1}^{N} f(\hat{x}^{(k)}_{t_{m-1}}) - \frac{\Delta t}{2}\Big(F_f(\hat{X}_{m-1}) + F_f(\hat{X}_m)\Big)\right\}.$$
Here {x̂⁽ᵏ⁾_{t_{m-1}}} ∼ p(·, t_{m-1}), {x̂⁽ᵏ⁾_{t_m}} ∼ p̂(·, t_m), and we denote X̂_{m-1} = {x̂⁽¹⁾_{t_{m-1}}, ..., x̂⁽ᴺ⁾_{t_{m-1}}}, X̂_m = {x̂⁽¹⁾_{t_m}, ..., x̂⁽ᴺ⁾_{t_m}}, where each x̂⁽ᵏ⁾_{t_m} is computed by the Euler-Maruyama scheme (6).
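When f is a neural network, the operator F_f can be evaluated with automatic differentiation. Below is a minimal sketch in PyTorch (ours, with hypothetical names; the paper does not prescribe an implementation), verified on f(x) = |x|², for which the gradient is 2x and the Laplacian is 2D.

```python
import torch

def F_f(f, g, X, sigma):
    """Monte-Carlo estimate of F_f(X) from Proposition 1:
    (1/N) sum_k [ g(x_k) . grad f(x_k) + 0.5 sigma^2 * laplacian f(x_k) ]."""
    X = X.clone().requires_grad_(True)
    fx = f(X).sum()
    (grad,) = torch.autograd.grad(fx, X, create_graph=True)       # N x D gradient
    lap = torch.zeros(len(X))
    for i in range(X.shape[1]):                                    # Laplacian, dim by dim
        (g2,) = torch.autograd.grad(grad[:, i].sum(), X, create_graph=True)
        lap += g2[:, i]
    drift = (g(X) * grad).sum(dim=1)
    return (drift + 0.5 * sigma**2 * lap).mean()

# Sanity check with f(x) = |x|^2 (grad f = 2x, laplacian f = 2D) and g(x) = -x:
X = torch.randn(512, 3)
val = F_f(lambda x: (x**2).sum(dim=1), lambda x: -x, X, sigma=1.0)
expected = (-2 * (X**2).sum(dim=1) + 0.5 * 2 * 3).mean()
print(torch.allclose(val, expected, atol=1e-5))  # True
```

Because the result keeps the autograd graph (`create_graph=True`), it can be differentiated again with respect to the parameters of f or g during training.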

2.3. Wasserstein Distance on Time Series

In real cases, it is not realistic to observe the data at arbitrary pairs of consecutive time nodes, especially when ∆t is small. To make our model more flexible, we extend our formulation so that observed data at arbitrary time points can be plugged in. To be more precise, suppose we observe data sets X_{t_n} = {x⁽¹⁾_{t_n}, ..., x⁽ᴺ⁾_{t_n}} at J + 1 different time points t_0, t_1, ..., t_J, and denote the generated data sets by X̂_{t_n} = {x̂⁽¹⁾_{t_n}, ..., x̂⁽ᴺ⁾_{t_n}}, where each x̂_{t_n} is derived from the n-step Euler-Maruyama scheme:
$$\hat{x}_{t_j} = \hat{x}_{t_{j-1}} + g_\omega(\hat{x}_{t_{j-1}})\,\Delta t + \sigma\sqrt{\Delta t}\, z, \quad z \sim N(0, I), \quad 1 \le j \le n, \quad \hat{x}_{t_0} \sim p(\cdot, t_0). \tag{7}$$
Denote by p̂(·, t) the solution of FPE (1) with g replaced by g_ω and initial condition p̂(·, t_0) = p(·, t_0). The approximation formula for evaluating the Wasserstein distance W₁(p̂(·, t_n), p(·, t_n)) is provided in the following proposition.

Proposition 2. Keeping all notations defined above, we have the approximation
$$W_1\big(\hat{p}(\cdot,t_n),\, p(\cdot,t_n)\big) \approx \sup_{\|\nabla f\|\le 1}\left\{\frac{1}{N}\sum_{k=1}^{N} f(x^{(k)}_{t_n}) - \frac{1}{N}\sum_{k=1}^{N} f(\hat{x}^{(k)}_{t_0}) - \frac{\Delta t}{2}\Big(F_f(\hat{X}_0) + F_f(\hat{X}_n) + 2\sum_{s=1}^{n-1} F_f(\hat{X}_s)\Big)\right\}.$$

Minimizing the Objective Function: Based on Proposition 2, we obtain the objective function by summing the accumulated Wasserstein distances over the J observations along the time axis. Thus, our ultimate goal is to minimize the following objective function:
$$\min_{g_\omega}\; \sum_{n=1}^{J} \sup_{\|\nabla f_n\|\le 1}\left\{\frac{1}{N}\sum_{k=1}^{N} f_n(x^{(k)}_{t_n}) - \frac{1}{N}\sum_{k=1}^{N} f_n(\hat{x}^{(k)}_{t_0}) - \frac{\Delta t}{2}\Big(F_{f_n}(\hat{X}_0) + F_{f_n}(\hat{X}_n) + 2\sum_{s=1}^{n-1} F_{f_n}(\hat{X}_s)\Big)\right\}.$$
Notice that since we have observations at J distinct time points (besides the initial point), for each time point we compute the Wasserstein distance with the help of a dual function f_n, so J test functions are involved in total. In our actual implementation, we choose these dual functions as neural networks. 
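The composite weights in Proposition 2 are exactly what one obtains by chaining the single-interval trapezoidal rule of Proposition 1 across the n steps; the interior terms each appear twice. A quick numeric check (ours, with stand-in values for F_f(X̂_s)):

```python
import numpy as np

# Chaining per-interval trapezoids (dt/2)(F_{j-1} + F_j) over j = 1..n gives the
# composite weights (dt/2)(F_0 + F_n + 2 * sum of interior F_s) of Proposition 2.
rng = np.random.default_rng(0)
F = rng.standard_normal(11)          # stand-ins for F_0, ..., F_n with n = 10
dt = 0.05

per_interval = sum(dt / 2 * (F[j - 1] + F[j]) for j in range(1, len(F)))
composite = dt / 2 * (F[0] + F[-1] + 2 * F[1:-1].sum())
print(np.isclose(per_interval, composite))  # True
```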
We call our algorithm the Fokker-Planck Process (FPP); the entire procedure is shown in Algorithm 1. We also provide an error analysis in the Appendix.

3. Experiments

In this section, we evaluate our model on various synthetic and realistic data sets by employing Algorithm 1. We generate samples x̂_t and make all predictions based on Equation (6), starting with x̂_0. Baselines: We compare our model with two recently proposed methods. One model (NN) adopts a recurrent neural network (RNN) to learn dynamics directly from observations of aggregate data (Hashimoto et al., 2016). The other model (LEGEND) learns dynamics in an HMM framework (Wang et al., 2018). These baselines are two typical representatives with state-of-the-art performance on learning from aggregate data. Furthermore, though we simulate the evolving process of the data as an SDE, which is on the same track as NN, NN trains its RNN by optimizing the Sinkhorn distance (Cuturi, 2013), whereas our model starts from the weak form of the PDE and focuses on the WGAN framework and easier computation.

Algorithm 1 Fokker Planck Process Algorithm

Require: Initialize f_{θ_n} (1 ≤ n ≤ J) and g_ω
Require: Set ℓ_f as the inner-loop learning rate for f_{θ_n} and ℓ_g as the outer-loop learning rate for g_ω
1: for # training iterations do
2:   for k steps do
3:     for observed time t_s in {t_1, ..., t_J} do
4:       Compute the generated data set X̂_{t_s} from the Euler-Maruyama scheme (7), for 1 ≤ s ≤ J
5:       Acquire the data set X_{t_s} = {x⁽¹⁾_{t_s}, ..., x⁽ᴺ⁾_{t_s}} from the real distribution p(·, t_s), for 1 ≤ s ≤ J
6:     end for
7:     For each dual function f_{θ_n}, compute:
8:       F_n = F_{f_{θ_n}}(X̂_{t_0}) + F_{f_{θ_n}}(X̂_{t_n}) + 2 Σ_{s=1}^{n-1} F_{f_{θ_n}}(X̂_{t_s})
9:     Update each f_{θ_n} by θ_n ← θ_n + ℓ_f ∇_{θ_n} [ (1/N) Σ_{k=1}^{N} f_{θ_n}(x⁽ᵏ⁾_{t_n}) − (1/N) Σ_{k=1}^{N} f_{θ_n}(x̂⁽ᵏ⁾_{t_0}) − (∆t/2) F_n ]
10:  end for
11:  Update g_ω by ω ← ω − ℓ_g ∇_ω Σ_{n=1}^{J} [ (1/N) Σ_{k=1}^{N} f_{θ_n}(x̂⁽ᵏ⁾_{t_n}) − (1/N) Σ_{k=1}^{N} f_{θ_n}(x̂⁽ᵏ⁾_{t_0}) − (∆t/2) F_n ]
12: end for

3.1. Synthetic Data

We first evaluate our model on three synthetic data sets generated by three artificial dynamics: Synthetic-1, Synthetic-2 and Synthetic-3. Experiment Setup: In all synthetic data experiments, we set the drift term g and the discriminator f as two simple fully connected networks. The g network has one hidden layer and the f network has three hidden layers; each layer has 32 nodes in both g and f. The only activation function we choose is Tanh. Notice that since we need to calculate ∂²f/∂x², the activation function of f must be twice differentiable to avoid losing the weight gradient. For the training process, we use the Adam optimizer (Kingma & Ba, 2014) with learning rate 10⁻⁴. Furthermore, we use spectral normalization to realize ‖∇f‖ ≤ 1 (Miyato et al., 2018). We initialize the weights with Xavier initialization (Glorot & Bengio, 2010) and train our model by Algorithm 1. 
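The discriminator described in the setup might be sketched in PyTorch as follows (our reconstruction from the text, not the authors' code): three hidden Tanh layers of 32 units, with each linear layer wrapped in spectral normalization to approximately enforce the 1-Lipschitz constraint ‖∇f‖ ≤ 1.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_f(dim_in, hidden=32, n_hidden=3):
    """Test function f: Tanh is twice differentiable, so the second-order term
    in F_f is well defined; spectral normalization keeps each linear map
    (approximately) 1-Lipschitz, and Tanh itself is 1-Lipschitz."""
    layers, d = [], dim_in
    for _ in range(n_hidden):
        layers += [spectral_norm(nn.Linear(d, hidden)), nn.Tanh()]
        d = hidden
    layers.append(spectral_norm(nn.Linear(d, 1)))
    return nn.Sequential(*layers)

f = make_f(2)
print(f(torch.randn(5, 2)).shape)  # torch.Size([5, 1])
```

A ReLU network would not work here: its second derivative vanishes almost everywhere, killing the diffusion term of F_f, which is why the text insists on a twice-differentiable activation.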
Synthetic-1:
$$\hat{x}_0 \sim N(0, \Sigma_0), \qquad \hat{x}_{t+\Delta t} = \hat{x}_t - (A\hat{x}_t + b)\,\Delta t + \sigma\sqrt{\Delta t}\,N(0, I)$$
Synthetic-2:
$$\hat{x}_0 \sim N(0, \Sigma_0), \qquad \hat{x}_{t+\Delta t} = \hat{x}_t - G\,\Delta t + \sigma\sqrt{\Delta t}\,N(0, I),$$
$$G = \begin{pmatrix} \frac{1}{\sigma_1}\frac{N_1}{N_1+N_2}(\hat{x}^1_t - \mu_{11}) + \frac{1}{\sigma_2}\frac{N_2}{N_1+N_2}(\hat{x}^1_t - \mu_{21}) \\[4pt] \frac{1}{\sigma_1}\frac{N_1}{N_1+N_2}(\hat{x}^2_t - \mu_{12}) + \frac{1}{\sigma_2}\frac{N_2}{N_1+N_2}(\hat{x}^2_t - \mu_{22}) \end{pmatrix},$$
$$N_1 = \frac{1}{\sqrt{2\pi}\sigma_1}\exp\!\left(-\frac{(\hat{x}^1_t-\mu_{11})^2}{2\sigma_1^2} - \frac{(\hat{x}^2_t-\mu_{12})^2}{2\sigma_1^2}\right), \qquad N_2 = \frac{1}{\sqrt{2\pi}\sigma_2}\exp\!\left(-\frac{(\hat{x}^1_t-\mu_{21})^2}{2\sigma_2^2} - \frac{(\hat{x}^2_t-\mu_{22})^2}{2\sigma_2^2}\right)$$
Synthetic-3 (Van der Pol oscillator (Li, 2018)):
$$\hat{x}_0 \sim N(0, \Sigma_0), \qquad \hat{x}^1_{t+\Delta t} = \hat{x}^1_t + 10\left(\hat{x}^2_t - \tfrac{1}{3}(\hat{x}^1_t)^3 + \hat{x}^1_t\right)\Delta t + \sigma\sqrt{\Delta t}\,N(0,1),$$
$$\hat{x}^2_{t+\Delta t} = \hat{x}^2_t + 3(1 - \hat{x}^1_t)\,\Delta t + \sigma\sqrt{\Delta t}\,N(0,1)$$
The data size at each time point is N = 2000, and the dimension of the data is D = 2. We treat 1200 data points as the training set and the other 800 data points as the test set. In Synthetic-1, the data follows a simple linear dynamic, where A = [(4, 0), (0, 1)], b = [-12, -12]^T. We let ∆t = 0.01, σ = 1, Σ_0 = [(1, 0), (0, 1)]. We utilize the true x_0, x_20 and x_200 in the training process and predict the distributions of x_50 and x_500. In Synthetic-2, the data follows a complex nonlinear dynamic. We let ∆t = 0.01, σ = σ_1 = σ_2 = 1, µ_1 = [15, 15]^T and µ_2 = [-15, -15]^T. We utilize the true x_0, x_3 and x_6 in the training process and predict x_2, x_4 and x_10. In Synthetic-3, ∆t = 0.01, σ = 1; we utilize the true x_3, x_7 and x_20 in the training process and then predict the distributions of x_10, x_30 and x_50. We also consider cases in higher dimensions: D = 6 and 10. In each high-dimensional case, for convenience, we let every two dimensions follow the dynamics of the low-dimensional case (D = 2) of each data set. Notice that in Syn-2 and Syn-3, x̂ⁱ_t represents the i-th dimension of x̂_t. Results: We first show the capability of our model for learning the hidden dynamics of low-dimensional (d = 2) data. 
As visualized in Figure 2, the generated data (blue) covers all areas of the ground truth (red), which demonstrates that our model is able to precisely learn the dynamics and correctly predict the future distribution of the data. The samples we predict finally converge to stationary distributions, as the ground truths suggest. We then evaluate the three models using the Wasserstein distance as the error metric for both low-dimensional (d = 2) and high-dimensional (d = 6, 10) data. As reported in Table 1 in the Appendix, our model achieves lower Wasserstein error than the two baseline models in all cases.
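The synthetic dynamics are cheap to re-simulate, which makes the stationary behavior mentioned above directly checkable. For instance, Synthetic-1 is an Ornstein-Uhlenbeck-type process whose samples should settle around the fixed point of Ax + b = 0, i.e. x* = (3, 12) for the A and b stated above. The following is our own sketch of the data generator, not the authors' code:

```python
import numpy as np

# Synthetic-1: x_{t+dt} = x_t - (A x_t + b) dt + sigma sqrt(dt) N(0, I).
# The deterministic part has a stable fixed point at x* solving A x + b = 0.
A = np.array([[4.0, 0.0], [0.0, 1.0]])
b = np.array([-12.0, -12.0])
dt, sigma = 0.01, 1.0
rng = np.random.default_rng(0)

x = rng.standard_normal((20_000, 2))           # x_0 ~ N(0, I)
for _ in range(2000):                           # evolve well past t_200
    x += -(x @ A.T + b) * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)

print(x.mean(axis=0))   # close to the fixed point (3, 12)
```

Each coordinate also equilibrates to the OU stationary variance σ²/(2aᵢ), giving roughly 0.125 and 0.5 for the two dimensions, which matches the stationary clouds seen in the figure.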

3.2. Realistic Data -RNA Sequence of Single Cell

In this section, we evaluate our model on a realistic biology data set, Single-cell RNA-seq (Klein et al., 2015), which is typically used for studying the evolution of cell differentiation. The cell population begins to differentiate at day 0 (D0). Single-cell RNA-seq observations are then sampled at day 0 (D0), day 2 (D2), day 4 (D4) and day 7 (D7). At each time point, the expression of 24,175 genes is measured for several hundred cells (933, 303, 683 and 798 cells on D0, D2, D4 and D7, respectively). Notice that only the whole group's distribution is available; there is no trajectory information for individual cells across days. We pick 10 gene markers out of the 24,175 to form a 10-dimensional data set. In the first task, we treat the gene expression at D0, D4 and D7 as training data to learn the hidden dynamics and predict the distribution of gene expression at D2. In the second task, we train the model with the gene expression at D0, D2 and D4, then predict the distribution of gene expression at D7. We plot the prediction results of two out of the ten markers, Mt1 and Mt2, in Figure 3. Experiment Setup: We set both f and g as fully connected neural networks with three hidden layers of 64 nodes each. The only activation function we choose is Tanh. The other setups of the neural networks and the training process are the same as those used for the synthetic data. Notice that in realistic cases, ∆t and T/∆t become hyperparameters; here we choose ∆t = 0.05, T/∆t = 35, which means the data evolves 10∆t from D0 to D2, then 10∆t from D2 to D4 and finally 15∆t from D4 to D7. For preprocessing, we apply standard normalization procedures (Hicks et al., 2015) to correct batch effects and use non-negative matrix factorization to impute missing expression levels (Hashimoto et al., 2016; Wang et al., 2018). 
Results: As shown in Table 1 in the Appendix, compared with the baselines, our model achieves lower Wasserstein error on both Mt1 and Mt2, which shows that our model is capable of learning the hidden dynamics of the two studied gene expressions. In Figure 3 (a) to (d), we visualize the predicted distributions of the two genes. The distributions of Mt1 and Mt2 predicted by our model (curves in blue) are closer to the true distributions (curves in red) on both D2 and D7. Furthermore, our model precisely captures the correlations between Mt1 and Mt2, as shown in Figure 3 (e) and (f), which also demonstrates its effectiveness, since a prediction closer to the true correlation represents better performance. In Figure 3 (g) and (h), we see that, owing to its simpler structure, our model trains more easily and takes the least computation time.

3.3. Realistic Data -Daily Trading Volume

In this section we demonstrate the performance of our model in the financial domain. Trading volume is the total quantity of shares or contracts traded for specified securities such as stocks, bonds, options contracts, futures contracts and all types of commodities. It can be measured on any type of security traded during a trading day or a specified time period. In our case, daily trading volume is measured on stocks. Predicting traded volume is an essential component of financial research, since traded volume, as a basic component or input of other financial algorithms, tells investors about the market's activity and liquidity. The data set we use is the historical traded volume of the stock "JPM". The data covers the period from January 2018 to January 2020 and is obtained from Bloomberg. Each day from 14:30 to 20:55, we have one observation every 5 minutes, for a total of 78 observations per day. Our task is described as follows: given the first two years of data, we use the traded volume at 14:30, 14:40, 15:05, 15:20 and 16:20 as training data, namely x_0, x_2, x_7, x_10, x_22, to train our model; then for the next 100 days we predict the traded volume at 14:35, 15:15, 15:35 and 16:15, namely x_1, x_9, x_13, x_21. One of the baselines we choose is the classical rolling means (RM) method, which predicts the intraday volume of a particular time interval by the average volume traded in the same interval over past days. The other baseline is a Kalman-filter-based model (Chen et al., 2016) that outperforms all available models in predicting intraday trading volume. Experiment Setup: Following a similar setup as for the RNA data set, we use the same network structures here. For hyperparameters we set ∆t = 0.02, T/∆t = 22, which means it takes a single ∆t to go from x_t to x_{t+1}. For preprocessing, we rescale the data by taking the natural logarithm of the trading volume, which is a common practice in trading volume research. 
We conduct experiments on two groups to show the advantages of our method. For the first group, we train our model on the complete data set, in which case the data has full trajectories; for the second group, we manually delete some trajectories from the data, for instance by randomly removing some samples of x_0, x_2, x_7, x_10, x_22, and then follow the same training and prediction procedures. Results: We present the prediction results in Figure 4. As shown in the first four figures, with full trajectories, the prediction made by RM is almost a straight line, the prediction value bouncing up and down within a 

4. Discussions

In this section we discuss the limitations and extensions of our model. An essential challenge in recovering the drift term: Mathematically, it is impossible to recover the exact drift term of an SDE if we are only given the information of the density evolution on certain time intervals, because there may be infinitely many drift functions inducing the same density evolution. More precisely, suppose p(x, t) solves FPE (1), and consider the following equation:
$$0 = -\sum_{i=1}^{D} \frac{\partial}{\partial x_i}\big(u_i(x,t)\,p(x,t)\big). \tag{8}$$
One can prove, under mild assumptions, that there are infinitely many vector fields u(x, t) = (u_1(x, t), ..., u_D(x, t)) solving equation (8). One can then check that the solution of FPE (1) with drift term g(x, t) + u(x, t) is still p(x, t), i.e. the vector field u(x, t) never affects the density evolution of the dynamics. This illustrates that given the density evolution p(·, t), the solution for the drift term is not necessarily unique, which is an essential difficulty in determining the exact drift term from the density. In this study, the main goal is to recover the entire density evolution (i.e. interpolate the density between observation time points) and predict how the density will evolve in the future. As a result, although we cannot always acquire the exact drift term of the dynamics, we can still accurately recover and predict the density evolution. This is still meaningful and may find applications in various scientific domains. Curl field: The drift functions we showed in the numerical experiments all visibly change the distribution. If the drift function is a curl field, namely g = ∇ × F, then the distribution does not change; is it still possible to learn the density evolution? The answer is yes. To demonstrate this, we simulate the curl field (y, −x) induced by A = [(0, 1), (−1, 0)] on a Gaussian distribution (indicated by green points in Figure 5 (a)); here we set the noise part to 0. 
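The condition for a drift perturbation u to leave the density unchanged is that the transport term Σᵢ ∂(uᵢ p)/∂xᵢ vanishes identically. A minimal symbolic check of this non-uniqueness (our illustration, using sympy) for the rotation field u = (−y, x) and a standard Gaussian density:

```python
import sympy as sp

# For an isotropic Gaussian p and the rotation field u = (-y, x), the extra
# transport term div(u p) vanishes identically, so adding u to the drift
# leaves the density evolution unchanged.
x, y = sp.symbols("x y", real=True)
p = sp.exp(-(x**2 + y**2) / 2) / (2 * sp.pi)     # standard Gaussian density
u = (-y, x)                                       # rotation (curl-type) field

div_up = sp.diff(u[0] * p, x) + sp.diff(u[1] * p, y)
print(sp.simplify(div_up))  # 0
```

This is exactly the situation of the curl-field experiment: the rotation moves individual samples while the population density stays fixed.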
Since our learning is based on samples rather than on the density itself, suppose we observe samples at four time points that do not perfectly follow the distribution (indicated by four colors: pink (t_0), orange (t_1), red (t_2) and brown (t_3) in Figure 5 (a)). We learn and predict the distribution at t_1 and t_3 (indicated by blue points in Figure 5 (b) and (c)), and show the vector field learned by our model in Figure 5 (d). We see that the predictions and the vector field all follow the correct curl-field pattern. Learning the diffusion coefficient: Our framework also works for learning an unknown diffusion coefficient in the Itô process. As an extension of our work, if we approximate the diffusion coefficient with a neural network σ_η (with parameters η), we revise the operator F as
$$F_f(X) = \frac{1}{N}\sum_{k=1}^{N}\left(\sum_{i=1}^{D} g^i_\omega(x^{(k)})\,\frac{\partial}{\partial x_i} f(x^{(k)}) + \frac{1}{2}\sum_{i=1}^{D}\sum_{j=1}^{D} \big(\sigma^{ij}_\eta(x^{(k)})\big)^2\, \frac{\partial^2}{\partial x_i \partial x_j} f(x^{(k)})\right),$$
which can be derived by the same technique we used to derive Proposition 1. We test this formulation on a synthetic data set where we only consider the diffusion influence, namely, the drift term is ignored. We set the ground truth of the diffusion coefficient to σ = [(1, 0), (0, 2)]. We design the neural network as a single fully connected layer with 32 nodes, and show the result in Figure 6: the predictions (blue) follow the same pattern as the ground truth (red).

5. Conclusion

In this paper, we formulate a novel method to recover hidden dynamics from aggregate data. In particular, our work shows that one can simulate the evolving process of aggregate data as an Itô process; to investigate aggregate data, we derive a new model that employs the weak form of the FPE together with the framework of WGAN. Furthermore, in the Appendix we prove theoretical guarantees on the error bound of our model. Finally, we demonstrate our model through experiments on three synthetic data sets and two real-world data sets.

A Appendix 

B Error Analysis

In this section, we provide an error analysis of our model. Suppose the hidden dynamics is driven by g_r(x) and the dynamics that we learn from data is g_f(x). Then the original Itô process and the Euler processes computed with the true g_r and the estimated g_f are:
$$dX = g_r(X)\,dt + \sigma\, dW, \qquad x^r_{t+\Delta t} = x^r_t + g_r(x^r_t)\,\Delta t + \sigma\sqrt{\Delta t}\,N(0, I), \qquad x^f_{t+\Delta t} = x^f_t + g_f(x^f_t)\,\Delta t + \sigma\sqrt{\Delta t}\,N(0, I),$$
where X is the ground truth, x^r is computed with the true g_r, and x^f is computed with the estimated g_f. Estimating the error between the original Itô process and its Euler discretization can be very complex, hence we cite the conclusion from (Milstein & Tretyakov, 2013) and focus on the error between the original process and our model.

Lemma 2. With the same initial value X_{t_0} = x_{t_0} = x_0, if there is a global Lipschitz constant K satisfying
$$|g(x,t) - g(y,t)| \le K|x - y|,$$
then after n steps, the expected error between the Itô process x_{t_n} and the Euler forward process x^r_{t_n} is:
$$\mathbb{E}\,|x_{t_n} - x^r_{t_n}| \le K\big(1 + \mathbb{E}|X_0|^2\big)^{1/2}\,\Delta t.$$
Lemma 2 illustrates that the expected error between the original Itô process and its Euler form depends not on the total number of steps n but on the time step ∆t.

Proposition 3. With the same initial value x_0, suppose the generalization error of the neural network g is ε and a global Lipschitz constant K exists:
$$|g(x) - g(y)| \le K|x - y|.$$
Then after n steps with step size ∆t = T/n, the expected error between the Itô process x_{t_n} and the approximated forward process x^f_{t_n} is bounded by:
$$\mathbb{E}\,|x_{t_n} - x^f_{t_n}| \le \frac{\varepsilon}{K}\big(e^{KT} - 1\big) + K\big(1 + \mathbb{E}|x_0|^2\big)^{1/2}\,\Delta t. \tag{10}$$
Proposition 3 implies that besides the time step size ∆t, the expected error interacts with three factors: the generalization error, the Lipschitz constant of g, and the total time length. In our experiments, we find the best way to decrease the expected error is to reduce the values of K and n.

C Proofs

C.1 Proof of Proposition 1

Proof. Suppose x⁽ᵏ⁾_{t_m} and x̂⁽ᵏ⁾_{t_{m-1}} are the observed samples at t_m and t_{m-1} respectively. Then the expectations can be approximated by:
$$\mathbb{E}_{x\sim p(x,t_m)}[f(x)] = \int f(x)\,p(x,t_m)\,dx \approx \frac{1}{N}\sum_{k=1}^{N} f(x^{(k)}_{t_m}) \tag{11}$$
$$\mathbb{E}_{x\sim \hat p(x,t_m)}[f(x)] = \int f(x)\,\hat p(x,t_m)\,dx = \int f(x)\left(\hat p(x,t_{m-1}) + \int_{t_{m-1}}^{t_m} \frac{\partial \hat p(x,\tau)}{\partial t}\,d\tau\right)dx$$
$$\approx \frac{1}{N}\sum_{k=1}^{N} f(\hat x^{(k)}_{t_{m-1}}) + \underbrace{\int f(x)\int_{t_{m-1}}^{t_m}\left(-\sum_{i=1}^{D} \frac{\partial}{\partial x_i}\big(g^i_\omega(x)\,\hat p(x,\tau)\big) + \frac{1}{2}\sigma^2\sum_{i=1}^{D} \frac{\partial^2}{\partial x_i^2}\,\hat p(x,\tau)\right)d\tau\,dx}_{I} \tag{12}$$
The term I is difficult to calculate directly, but using integration by parts we can rewrite it as:
$$I = \int_{t_{m-1}}^{t_m}\int\left(\sum_{i=1}^{D} -f(x)\,\frac{\partial}{\partial x_i}\big(g^i_\omega(x)\,\hat p(x,\tau)\big) + \frac{1}{2}\sigma^2\sum_{i=1}^{D} f(x)\,\frac{\partial^2}{\partial x_i^2}\,\hat p(x,\tau)\right)dx\,d\tau$$
$$= \int_{t_{m-1}}^{t_m}\int\left(\sum_{i=1}^{D} g^i_\omega(x)\,\hat p(x,\tau)\,\frac{\partial}{\partial x_i} f(x) + \frac{1}{2}\sigma^2\sum_{i=1}^{D} \hat p(x,\tau)\,\frac{\partial^2}{\partial x_i^2} f(x)\right)dx\,d\tau$$
$$= \int_{t_{m-1}}^{t_m}\left(\mathbb{E}_{x\sim\hat p(x,\tau)}\left[\sum_{i=1}^{D} g^i_\omega(x)\,\frac{\partial}{\partial x_i} f(x)\right] + \mathbb{E}_{x\sim\hat p(x,\tau)}\left[\frac{1}{2}\sigma^2\sum_{i=1}^{D}\frac{\partial^2}{\partial x_i^2} f(x)\right]\right)d\tau$$
$$\approx \int_{t_{m-1}}^{t_m} \frac{1}{N}\sum_{k=1}^{N}\left(\sum_{i=1}^{D} g^i_\omega(x^{(k)})\,\frac{\partial}{\partial x_i} f(x^{(k)}) + \frac{1}{2}\sigma^2\sum_{i=1}^{D}\frac{\partial^2}{\partial x_i^2} f(x^{(k)})\right)d\tau \tag{13}$$
To approximate the integral from t_{m-1} to t_m we adopt the trapezoidal rule, so we can rewrite the expectation in Equation (12) as:
$$\mathbb{E}_{x\sim\hat p(x,t_m)}[f(x)] \approx \frac{1}{N}\sum_{k=1}^{N} f(\hat x^{(k)}_{t_{m-1}}) + \frac{\Delta t}{2}\Big(F_f(\hat X_{m-1}) + F_f(\hat X_m)\Big) \tag{14}$$
We subtract (14) from (11) to finish the proof.

C.2 Proof of Proposition 2

Proof. Suppose each generated sample x̂⁽ᵏ⁾_{t_n} is derived from the n-step Euler-Maruyama scheme (7). 
Then the expectations can be rewritten as:
$$\mathbb{E}_{x\sim p(x,t_n)}[f(x)] = \int f(x)\,p(x,t_n)\,dx \approx \frac{1}{N}\sum_{k=1}^{N} f(x^{(k)}_{t_n}) \tag{15}$$
$$\mathbb{E}_{x\sim \hat p(x,t_n)}[f(x)] \approx \frac{1}{N}\sum_{k=1}^{N} f(\hat x^{(k)}_{t_0}) + \sum_{j=1}^{n}\int_{t_{j-1}}^{t_j} \frac{1}{N}\sum_{k=1}^{N}\left(\sum_{i=1}^{D} g^i_\omega(\hat x^{(k)})\,\frac{\partial}{\partial x_i} f(\hat x^{(k)}) + \frac{1}{2}\sigma^2\sum_{i=1}^{D}\frac{\partial^2}{\partial x_i^2} f(\hat x^{(k)})\right)d\tau \tag{16}$$
Applying the trapezoidal rule on each subinterval, this is:
$$\mathbb{E}_{x\sim \hat p(x,t_n)}[f(x)] \approx \frac{1}{N}\sum_{k=1}^{N} f(\hat x^{(k)}_{t_0}) + \frac{\Delta t}{2}\Big(F_f(\hat X_0)+F_f(\hat X_1)\Big) + \frac{\Delta t}{2}\Big(F_f(\hat X_1)+F_f(\hat X_2)\Big) + \cdots + \frac{\Delta t}{2}\Big(F_f(\hat X_{n-1})+F_f(\hat X_n)\Big) \tag{17}$$
Finally it comes to:
$$\mathbb{E}_{x\sim \hat p(x,t_n)}[f(x)] \approx \frac{1}{N}\sum_{k=1}^{N} f(\hat x^{(k)}_{t_0}) + \frac{\Delta t}{2}\left(F_f(\hat X_0) + F_f(\hat X_n) + 2\sum_{s=1}^{n-1} F_f(\hat X_s)\right) \tag{18}$$
We subtract (18) from (15) to finish the proof.

C.3 Proof of Error Analysis

Proof. The proof of Lemma 2 is quite long and beyond the scope of this paper; for details please see the first two chapters of the reference book (Milstein & Tretyakov, 2013). For the proof of Proposition 3, start from the initial condition and the first one-step iteration:
$$
x^r_{t_0} = x_{t_0}, \qquad x^f_{t_0} = x_{t_0}, \tag{19}
$$
$$
\begin{cases}
x^r_{t_1} = x^r_{t_0} + g^r(x^r_{t_0})\,\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1),\\[2pt]
x^f_{t_1} = x^f_{t_0} + g^f(x^f_{t_0})\,\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1).
\end{cases} \tag{20}
$$
Then we have
$$
\mathbb{E}\big|x^r_{t_0} - x^f_{t_0}\big| = \mathbb{E}\big|x_{t_0} - x_{t_0}\big| = 0, \tag{21}
$$
$$
\begin{aligned}
\mathbb{E}\big|x^r_{t_1} - x^f_{t_1}\big|
&= \mathbb{E}\big|x^r_{t_0} - x^f_{t_0} + g^r(x^r_{t_0})\Delta t - g^f(x^f_{t_0})\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1) - \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1)\big|\\
&\le \mathbb{E}\big|x^r_{t_0} - x^f_{t_0}\big| + \mathbb{E}\big|g^r(x^r_{t_0}) - g^f(x^f_{t_0})\big|\,\Delta t\\
&= \mathbb{E}\big|g^r(x^r_{t_0}) - g^f(x^r_{t_0}) + g^f(x^r_{t_0}) - g^f(x^f_{t_0})\big|\,\Delta t\\
&\le \mathbb{E}\big|g^r(x^r_{t_0}) - g^f(x^r_{t_0})\big|\,\Delta t + \mathbb{E}\big|g^f(x^r_{t_0}) - g^f(x^f_{t_0})\big|\,\Delta t\\
&\le \varepsilon\Delta t + \mathbb{E}\big|g^f(x^r_{t_0}) - g^f(x^f_{t_0})\big|\,\Delta t\\
&= \varepsilon\Delta t + \mathbb{E}\big|{g^f}'(x^\xi_{t_0})\big(x^r_{t_0} - x^f_{t_0}\big)\big|\,\Delta t \qquad \big(x^\xi_{t_0}\in[x^r_{t_0}, x^f_{t_0}]\big)\\
&\le \varepsilon\Delta t + K\,\mathbb{E}\big|x^r_{t_0} - x^f_{t_0}\big|\,\Delta t = \varepsilon\Delta t. \tag{22}
\end{aligned}
$$
Following the same pattern,
$$
\begin{cases}
x^r_{t_2} = x^r_{t_1} + g^r(x^r_{t_1})\,\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1),\\[2pt]
x^f_{t_2} = x^f_{t_1} + g^f(x^f_{t_1})\,\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1),
\end{cases}
\qquad\cdots\qquad
\begin{cases}
x^r_{t_n} = x^r_{t_{n-1}} + g^r(x^r_{t_{n-1}})\,\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1),\\[2pt]
x^f_{t_n} = x^f_{t_{n-1}} + g^f(x^f_{t_{n-1}})\,\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1),
\end{cases}
$$
which leads to
$$
\mathbb{E}\big|x^r_{t_2} - x^f_{t_2}\big|
= \mathbb{E}\big|x^r_{t_1} - x^f_{t_1} + g^r(x^r_{t_1})\Delta t - g^f(x^f_{t_1})\Delta t + \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1) - \sigma\sqrt{\Delta t}\,\mathcal{N}(0,1)\big|
\le (1 + K\Delta t)\,\mathbb{E}\big|x^r_{t_1} - x^f_{t_1}\big| + \varepsilon\Delta t,
$$
and, iterating the same estimate,
$$
\mathbb{E}\big|x^r_{t_n} - x^f_{t_n}\big| \le \varepsilon\Delta t\sum_{i=0}^{n-1}(1 + K\Delta t)^i.
$$
Now let $S = \sum_{i=0}^{n-1}(1 + K\Delta t)^i$ and consider the following:
$$
S\,K\Delta t = S(1 + K\Delta t) - S = \sum_{i=1}^{n}(1 + K\Delta t)^i - \sum_{i=0}^{n-1}(1 + K\Delta t)^i = (1 + K\Delta t)^n - 1 = \Big(1 + \frac{KT}{n}\Big)^n - 1 \le e^{KT} - 1,
$$
so that
$$
\mathbb{E}\big|x^r_{t_n} - x^f_{t_n}\big| \le \varepsilon\Delta t\,S = \frac{\varepsilon}{K}\,S\,K\Delta t \le \frac{\varepsilon\,(e^{KT} - 1)}{K},
$$
which finishes the proof.
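The bound can be checked numerically with a coupled Euler–Maruyama simulation. This is a toy sanity check under stated assumptions, not part of the proof: the state is scalar, both chains share the same Gaussian noise as in Eqs. (19)-(20), and the drifts `g_r`, `g_f` are illustrative choices differing by a constant $\varepsilon$ with Lipschitz constant $K=1$.

```python
import numpy as np

def coupled_euler_maruyama(g_r, g_f, x0, sigma, T, n_steps, rng):
    # Evolve the "real" and "fake" chains of Eqs. (19)-(20) with SHARED
    # Gaussian noise, so only the drift mismatch drives the chains apart.
    dt = T / n_steps
    xr = xf = float(x0)
    for _ in range(n_steps):
        z = rng.normal()
        xr += g_r(xr) * dt + sigma * np.sqrt(dt) * z
        xf += g_f(xf) * dt + sigma * np.sqrt(dt) * z
    return xr, xf
```

With `g_r = lambda x: -x` and `g_f = lambda x: -x + eps`, the terminal gap $|x^r_{t_n} - x^f_{t_n}|$ stays below $\varepsilon(e^{KT}-1)/K$, matching the final inequality of the proof.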



Figure 1: State model of the stochastic process $X_t$.

Figure 2: Comparison of generated data (blue) and ground truth (red) on Synthetic-1 ((a) to (c)), Synthetic-2 ((d) to (f)) and Synthetic-3 ((g) to (i)). In each case, the generated distribution finally converges to a stationary distribution.

Figure 3: (a) and (b): Wasserstein loss of Mt1 as the number of iterations increases. (c) to (f): Performance comparison among different models on D2 and D7 of Mt1 and Mt2. (g) and (h): True (red) and predicted (blue) correlations between Mt1 (x-axis) and Mt2 (y-axis) on D2 (left) and D7 (right).

Figure 4: (a) to (d): Group A: with full trajectories of training data, predictions of traded volume over the next 100 days. RM (yellow) fails to capture the regularities of traded volume in the time series, the Kalman-filter-based model (green) fails to handle the noise information and make reasonable predictions, while our model (blue) is able to seize the movements of traded volume and yield better predictions. (e) to (h): Group B: predictions of our model without full trajectories.

... very small range; thus this model cannot capture the volume movements, namely the regularities existing in the time series. The prediction made by the Kalman-filter-based model captures the regularities better than the RM model, but it fails to deal with the noise component existing in the time series, so some of its predictions fall outside a reasonable range. The traded volume predicted by our model is closer to the real case; moreover, our model captures the regularities while giving stable predictions. Furthermore, without full trajectories the Kalman-filter-based model can no longer be applied and the RM model still fails to capture the regularities. We randomly drop half of the training samples and display the predictions made by our model in the last four panels of Figure 4; our model still works well.

Figure 5: Results of learning a curl field. (a): The whole distribution is indicated by green, from which we choose subsets for training; training samples at $t_0$, $t_1$, $t_2$ and $t_3$ are indicated by pink, orange, red and brown respectively. (b) and (c): The predictions at $t_1$ and $t_3$ are indicated by blue points. (d): The vector field generated by our model.

Figure 6: Results of learning diffusion coefficient.

The Wasserstein error of different models on Synthetic-1/2/3 and RNA-sequence data sets.



