VARIATIONAL DYNAMIC MIXTURES

Abstract

Deep probabilistic time series forecasting models have become an integral part of machine learning. While several powerful generative models have been proposed, we provide evidence that their associated inference models are oftentimes too limited and cause the generative model to predict mode-averaged dynamics. Mode-averaging is problematic since many real-world sequences are highly multi-modal, and their averaged dynamics are unphysical (e.g., predicted taxi trajectories might run through buildings on the street map). To better capture multi-modality, we develop variational dynamic mixtures (VDM): a new variational family to infer sequential latent variables. The VDM approximate posterior at each time step is a mixture density network, whose parameters come from propagating multiple samples through a recurrent architecture. This results in an expressive multi-modal posterior approximation. In an empirical study, we show that VDM outperforms competing approaches on highly multi-modal datasets from different domains.

1. INTRODUCTION

Making sense of time series data is an important challenge in various domains, including ML for climate change. One important milestone to reach the climate goals is to significantly reduce the CO2 emissions from mobility (Rogelj et al., 2016). Accurate forecasting models of typical driving behavior and of typical pollution levels over time can help both lawmakers and automotive engineers to develop solutions for cleaner mobility. In these applications, no accurate physical model of the entire dynamic system is known or available. Instead, data-driven models, specifically deep probabilistic time series models, can be used to solve the necessary tasks, including forecasting. The dynamics in such data can be highly multi-modal. At any given part of the observed sequence, there might be multiple distinct continuations of the data that are plausible, but the average of these behaviors is unlikely, or even physically impossible. Consider for example a dataset of taxi trajectories. In each row of Fig. 1a, we have selected 50 routes from the dataset with similar starting behavior (blue). Even though these routes are quite similar to each other in the first 10 way points, the continuations of the trajectories (red) can exhibit quite distinct behaviors and lead to points on any far edge of the map. The trajectories follow a few main traffic arteries; these could be considered the main modes of the data distribution. We would like to learn a generative model of the data that, based on some initial way points, can forecast plausible continuations for the trajectories. Many existing methods make restrictive modeling assumptions such as Gaussianity to make learning tractable and efficient. But trying to capture the dynamics through unimodal distributions can lead either to "over-generalization" (i.e., putting probability mass in spurious regions) or to focusing only on the dominant mode and thereby neglecting important structure of the data.
Even neural approaches with very flexible generative models can fail to fully capture this multi-modality because their capacity is often limited through the assumptions of their inference model. To address this, we develop variational dynamic mixtures (VDM). Its generative process is a sequential latent variable model. The main novelty is a new multi-modal variational family which makes learning and inference multi-modal yet tractable. In summary, our contributions are:

• A new inference model. We establish a new type of variational family for variational inference of sequential latent variables. By successively marginalizing over previous latent states, the procedure can be efficiently carried out in a single forward pass and induces a multi-modal posterior approximation. We can see in Fig. 1b that VDM trained on a dataset of taxi trajectories produces forecasts with the desired multi-modality while other methods over-generalize.

• An evaluation metric for multi-modal tasks. The negative log-likelihood measures predictive accuracy but neglects an important aspect of multi-modal forecasts: sample diversity. In Section 4, we derive a score based on the Wasserstein distance (Villani, 2008) which evaluates both sample quality and diversity. This metric complements our evaluation based on log-likelihoods.

• An extensive empirical study. In Section 4, we use VDM to study various datasets, including a synthetic dataset with four modes, a stochastic Lorenz attractor, the taxi trajectories, and a U.S. pollution dataset with measurements of various pollutants over time.

Figure 1: Forecasting taxi trajectories is challenging due to the highly multi-modal nature of the data (Fig. 1a). VDM (Fig. 1b) succeeds in generating diverse plausible predictions (red), based on the beginning of a trajectory (blue). The other methods, AESMC (Le et al., 2018), CF-VAE (Bhattacharyya et al., 2019), VRNN (Chung et al., 2015), and RKN (Becker et al., 2019), suffer from mode averaging.
We illustrate VDM's ability to model multi-modal dynamics, and provide quantitative comparisons to other methods showing that VDM compares favorably to previous work.

2. RELATED WORK

Neural recurrent models. Recurrent neural networks (RNNs) such as LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Chung et al., 2014) have proven successful on many time series modeling tasks. However, as deterministic models they cannot capture uncertainties in their dynamic predictions. Stochastic RNNs make these sequence models non-deterministic (Chung et al., 2015; Fraccaro et al., 2016; Gemici et al., 2017; Li & Mandt, 2018) . For example, the variational recurrent neural network (VRNN) (Chung et al., 2015) enables multiple stochastic forecasts due to its stochastic transition dynamics. An extension of VRNN (Goyal et al., 2017) uses an auxiliary cost to alleviate the KL-vanishing problem. It improves on VRNN inference by forcing the latent variables to also be predictive of future observations. Another line of related methods rely on particle filtering (Naesseth et al., 2018; Le et al., 2018; Hirt & Dellaportas, 2019) and in particular sequential Monte Carlo (SMC) to improve the evidence lower bound. In contrast, VDM adopts an explicitly multi-modal posterior approximation. Another SMC-based work (Saeedi et al., 2017) employs search-based techniques for multi-modality but is limited to models with finite discrete states. Recent works (Schmidt & Hofmann, 2018; Schmidt et al., 2019; Ziegler & Rush, 2019) use normalizing flows in the latent space to model the transition dynamics. A normalizing flow requires many layers to transform its base distribution into a truly multi-modal distribution in practice. In contrast, mixture density networks (as used by VDM) achieve multi-modality by mixing only one layer of neural networks. A task orthogonal to multi-modal inference is learning disentangled representations. Here too, mixture models are used (Chen et al., 2016; Li et al., 2017) . These papers use discrete variables and a mutual information based term to disentangle different aspects of the data. 
VAE-like models (Bhattacharyya et al., 2018; 2019) and GAN-like models (Sadeghian et al., 2019; Kosaraju et al., 2019) only have global, time-independent latent variables. Yet, they show good results on various tasks, including forecasting. With a deterministic decoder, these models focus on average dynamics and don't capture local details (including multi-modal transitions) very well. Sequential latent variable models are described next.

Figure 2: Graphical illustrations of VDM: (a) generation (Eqs. (1) and (2)) and (b) inference (Eqs. (4), (5) and (7)). Dashed lines denote deterministic dependencies such as transformations, marginalization, or computing the mean, as explained in the main text, while bold lines denote stochastic dependencies. The half-shaded node for s_t indicates that s_t is being marginalized out as opposed to conditioned on.

Deep state-space models. Classical state-space models (SSMs) are popular due to their tractable inference and interpretable predictions. Similarly, deep SSMs with locally linear transition dynamics enjoy tractable inference (Karl et al., 2017; Fraccaro et al., 2017; Rangapuram et al., 2018; Becker et al., 2019). However, these models are often not expressive enough to capture complex (or highly multi-modal) dynamics. Nonlinear deep SSMs (Krishnan et al., 2017; Zheng et al., 2017; Doerr et al., 2018; De Brouwer et al., 2019; Gedon et al., 2020) are more flexible. Their inference is often no longer tractable and requires variational approximations. Unfortunately, in order for the inference model to be tractable, the variational approximations are often simplistic and don't approximate multi-modal posteriors well, with negative effects on the trained models.
Multi-modality can be incorporated via additional discrete switching latent variables, such as recurrent switching linear dynamical systems (Linderman et al., 2017; Nassar et al., 2018; Becker-Ehmck et al., 2019) . However, these discrete states make inference more involved.

3. VARIATIONAL DYNAMIC MIXTURES

We develop VDM, a new sequential latent variable model for multi-modal dynamics. Given sequential observations x_{1:T} = (x_1, ..., x_T), VDM assumes that the underlying dynamics are governed by latent states z_{1:T} = (z_1, ..., z_T). We first present the generative process and the multi-modal inference model of VDM. We then derive a new variational objective that encourages multi-modal posterior approximations and explain how it is regularized via hybrid training. Finally, we introduce a new sampling method used in the inference procedure.

Generative model. The generative process consists of a transition model and an emission model. The transition model p(z_t | z_{<t}) describes the temporal evolution of the latent states and the emission model p(x_t | z_{≤t}) maps the states to observations. We assume they are parameterized by two separate neural networks, the transition network φ_tra and the emission network φ_dec. To give the model the capacity to capture longer-range temporal correlations, we parametrize the transition model with a recurrent architecture φ_GRU (Auger-Méthé et al., 2016; Zheng et al., 2017) such as a GRU (Chung et al., 2014). The latent states z_t are sampled recursively from

    z_t | z_{<t} ~ N(μ_{0,t}, σ²_{0,t} I),  where [μ_{0,t}, σ²_{0,t}] = φ_tra(h_{t-1}),  h_{t-1} = φ_GRU(z_{t-1}, h_{t-2}),   (1)

and are then decoded such that the observations can be sampled from the emission model,

    x_t | z_{≤t} ~ N(μ_{x,t}, σ²_{x,t} I),  where [μ_{x,t}, σ²_{x,t}] = φ_dec(z_t, h_{t-1}).   (2)

This generative process is similar to (Chung et al., 2015), though we did not incorporate autoregressive feedback due to its negative impact on long-term generation (Ranzato et al., 2016; Lamb et al., 2016). The competitive advantage of VDM comes from a more expressive inference model.

Inference model. VDM is based on a new procedure for multi-modal inference.
The main idea is that to approximate the posterior at time t, we can use the posterior approximation of the previous time step and exploit the generative model's transition model φ_GRU. This leads to a sequential inference procedure. We first use the forward model to transform the approximate posterior at time t−1 into a distribution at time t. In a second step, we use samples from the resulting transformed distribution and combine each sample with data evidence x_t, where every sample parameterizes a Gaussian mixture component. As a result, we obtain a multi-modal posterior distribution that depends on data evidence, but also on the previous time step's posterior.

In more detail, for every z_t, we define its corresponding recurrent state as the transformed random variable s_t = φ_GRU(z_t, h_{t-1}), using a deterministic hidden state h_{t-1} = E[s_{t-1}]. The variational family of VDM is defined as follows:

    q(z_{1:T} | x_{1:T}) = ∏_{t=1}^T q(z_t | x_{≤t}) = ∏_{t=1}^T ∫ q(z_t | s_{t-1}, x_t) q(s_{t-1} | x_{≤t}) ds_{t-1}.   (3)

Chung et al. (2015) also use a sequential inference procedure, but without considering the distribution of s_t. Only a single sample is propagated through the recurrent network and all other information about the distribution of previous latent states z_{<t} is lost. In contrast, VDM explicitly maintains s_t as part of the inference model. Through marginalization, the entire distribution is taken into account for inferring the next state z_t. Beyond the factorization assumption and the marginal consistency constraint of Eq. (3), the variational family of VDM needs two more choices to be fully specified: first, one has to choose the parametrizations of q(z_t | s_{t-1}, x_t) and q(s_{t-1} | x_{≤t}), and second, one has to choose a sampling method to approximate the marginalization in Eq. (3). These choices determine the resulting factors q(z_t | x_{≤t}) of the variational family.
We assume that the variational distribution of the recurrent state factorizes as q(s_{t-1} | x_{≤t}) = ω(s_{t-1}, x_t) q(s_{t-1} | x_{<t}), i.e. it is the distribution of the recurrent state given the past observations, re-weighted by a weighting function ω(s_{t-1}, x_t) which involves only the current observations. For VDM, we only need samples from q(s_{t-1} | x_{<t}), which are obtained by sampling from the previous posterior approximation q(z_{t-1} | x_{<t}) and transforming the samples with the RNN,

    s^(i)_{t-1} ~ q(s_{t-1} | x_{<t})  equiv. to  s^(i)_{t-1} = φ_GRU(z^(i)_{t-1}, h_{t-2}),  z^(i)_{t-1} ~ q(z_{t-1} | x_{<t}),   (4)

where i indexes the samples. The RNN φ_GRU has the same parameters as in the generative model. Augmenting the variational model with the recurrent state has another advantage: approximating the marginalization in Eq. (3) with k samples from q(s_{t-1} | x_{≤t}) and choosing a Gaussian parametrization for q(z_t | s_{t-1}, x_t) results in a q-distribution q(z_t | x_{≤t}) that resembles a mixture density network (Bishop, 2006), which is a convenient choice to model multi-modal distributions:

    q(z_t | x_{≤t}) = Σ_{i=1}^k ω^(i)_t N(μ^(i)_{z,t}, σ^(i)2_{z,t} I),  where [μ^(i)_{z,t}, σ^(i)2_{z,t}] = φ_inf(s^(i)_{t-1}, x_t).   (5)

We assume q(z_t | s_{t-1}, x_t) to be Gaussian and use an inference network φ_inf to model the effect of the observation x_t and recurrent state s_{t-1} on the mean and variance of the mixture components. The mixture weights ω^(i)_t := ω(s^(i)_{t-1}, x_t)/k come from the variational distribution q(s_{t-1} | x_{≤t}) = ω(s_{t-1}, x_t) q(s_{t-1} | x_{<t}) and importance sampling. We are free to choose how to parametrize the weights, as long as all variational distributions are properly normalized. Setting

    ω^(i)_t = ω(s^(i)_{t-1}, x_t)/k := 1(i = argmax_j p(x_t | h_{t-1} = s^(j)_{t-1}))   (6)

achieves this. In Appendix A, we explain this choice with importance sampling, and in Appendix H, we compare the performance of VDM under alternative variational choices for the weights.
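To make the construction concrete, here is a minimal numpy sketch of one mixture-posterior step: k recurrent-state samples each parameterize a Gaussian component as in Eq. (5), and the one-hot weights of Eq. (6) select the component that best explains x_t. The random linear maps and the stand-in log-likelihood are toy assumptions, not the paper's learned networks.

```python
import numpy as np

rng = np.random.default_rng(1)
k, D_S, D_Z = 5, 4, 2  # number of samples, recurrent-state dim, latent dim

# Hypothetical inference step: k recurrent-state samples s_{t-1}^{(i)}
# (in VDM these come from the GRU, Eq. (4)); each one parameterizes a
# Gaussian mixture component of q(z_t | x_<=t).
s_samples = rng.normal(size=(k, D_S))
x_t = rng.normal(size=D_Z)  # current observation

W_inf = rng.normal(size=(2 * D_Z, D_S + D_Z))
def phi_inf(s, x):
    """Toy stand-in for the inference network: mean and variance of a component."""
    out = W_inf @ np.concatenate([s, x])
    return out[:D_Z], np.exp(out[D_Z:])  # variance made positive via exp

def log_lik(x, s):
    """Toy stand-in for log p(x_t | h_{t-1} = s); a real model would decode s."""
    return -0.5 * np.sum((x - s[:D_Z]) ** 2)

# Eq. (6): one-hot weights -- the component whose recurrent state best
# explains x_t gets weight 1, all others get 0.
best = int(np.argmax([log_lik(x_t, s) for s in s_samples]))
weights = np.eye(k)[best]

components = [phi_inf(s, x_t) for s in s_samples]

def sample_posterior():
    """Draw z_t from the mixture q(z_t | x_<=t) = sum_i w_i N(mu_i, var_i I)."""
    i = rng.choice(k, p=weights)
    mu, var = components[i]
    return mu + np.sqrt(var) * rng.normal(size=D_Z)

z_t = sample_posterior()
```

With these one-hot weights only a single component is ever selected per observation, which is what frees the remaining components to track other modes.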
In the next time step, plugging the variational distribution q(z_t | x_{≤t}) into Eq. (4) yields the next distribution over recurrent states q(s_t | x_{≤t}). For this, the expected recurrent state h_{t-1} is required. We approximate the update using the same k samples (and therefore the same weights) as in Eq. (5):

    h_{t-1} = E[s_{t-1}] = ∫ s_{t-1} q(s_{t-1} | x_{≤t}) ds_{t-1} ≈ Σ_{i=1}^k ω^(i)_t s^(i)_{t-1}.   (7)

A schematic view of the generative and inference model of VDM is shown in Fig. 2. In summary, the inference model of VDM alternates between Eqs. (4) to (7). Latent states are sampled from the posterior approximation of the previous time step and transformed by Eq. (4) into samples of the recurrent state of the RNN. These are then combined with the new observation x_t to produce the next variational posterior Eq. (5), and the expected recurrent state is updated (Eq. (7)). These are then used in Eq. (4) again. Approximating the marginalization in Eq. (3) with a single sample recovers the inference model of VRNN (Chung et al., 2015), which fails in modeling multi-modal dynamics as shown in Fig. 3. In comparison, VDM's approximate marginalization over the recurrent states with multiple samples succeeds in modeling multi-modal dynamics.

Variational objective. We develop an objective to optimize the variational parameters of VDM, φ = [φ_tra, φ_dec, φ_GRU, φ_inf]. The evidence lower bound (ELBO) at each time step is

    L_ELBO(x_{≤t}, φ) := (1/k) Σ_{i=1}^k ω(s^(i)_{t-1}, x_t) E_{q(z_t | s^(i)_{t-1}, x_t)}[ log p(x_t | z_t, h_{t-1} = s^(i)_{t-1}) ]
        + (1/k) Σ_{i=1}^k ω(s^(i)_{t-1}, x_t) E_{q(z_t | s^(i)_{t-1}, x_t)}[ log ( p(z_t | h_{t-1} = s^(i)_{t-1}) / q(z_t | s^(i)_{t-1}, x_t) ) ]
        − (1/k) Σ_{i=1}^k ω(s^(i)_{t-1}, x_t) log ω(s^(i)_{t-1}, x_t) + C.   (8)

Claim 1. The ELBO in Eq. (8) is a lower bound on the log evidence log p(x_t | x_{<t}), i.e. log p(x_t | x_{<t}) ≥ L_ELBO(x_{≤t}, φ) (see proof in Appendix B).
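The per-step bound of Eq. (8) can be estimated by Monte Carlo; writing the normalized weights as w_i = ω(s^(i)_{t-1}, x_t)/k, the weight term contributes −Σ_i w_i log(k w_i). The sketch below uses one z sample per component and toy placeholder networks (our assumptions, not the paper's learned models).

```python
import numpy as np

def gauss_logpdf(x, mu, var):
    """Log-density of a diagonal Gaussian N(mu, diag(var))."""
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))

def elbo_step(x_t, s_samples, w, phi_inf, phi_tra, phi_dec, rng):
    """One-sample Monte Carlo estimate of L_ELBO in Eq. (8),
    with w_i the normalized weights omega(s^(i), x_t)/k."""
    k, total = len(s_samples), 0.0
    for s, w_i in zip(s_samples, w):
        if w_i == 0.0:
            continue  # 0 * log 0 := 0 for one-hot weights
        mu_q, var_q = phi_inf(s, x_t)              # component of q(z_t | s, x_t)
        z = mu_q + np.sqrt(var_q) * rng.normal(size=mu_q.shape)
        mu_p, var_p = phi_tra(s)                   # prior p(z_t | h_{t-1} = s)
        mu_x, var_x = phi_dec(z, s)                # likelihood p(x_t | z_t, h = s)
        total += w_i * (gauss_logpdf(x_t, mu_x, var_x)    # reconstruction term
                        + gauss_logpdf(z, mu_p, var_p)    # KL term, sampled:
                        - gauss_logpdf(z, mu_q, var_q)    #   log p - log q
                        - np.log(k * w_i))                # weight (entropy) term
    return total

# Toy setup: placeholder networks with matching latent/observation dims.
rng = np.random.default_rng(0)
D_Z, D_S, k = 2, 4, 5
phi_inf = lambda s, x: (0.5 * (s[:D_Z] + x), np.ones(D_Z))
phi_tra = lambda s: (np.tanh(s[:D_Z]), np.ones(D_Z))
phi_dec = lambda z, s: (z, np.ones(D_Z))

x_t = rng.normal(size=D_Z)
s_samples = rng.normal(size=(k, D_S))
w = np.eye(k)[0]  # one-hot weights, as in Eq. (6)
bound = elbo_step(x_t, s_samples, w, phi_inf, phi_tra, phi_dec, rng)
```

With one-hot weights the last term reduces to the constant −log k, so it does not affect the gradients of the selected component.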
In addition to the ELBO, the objective of VDM has two regularization terms,

    L_VDM(φ) = Σ_{t=1}^T { E_{p_data}[ −L_ELBO(x_{≤t}, φ) − ω₁ L_pred(x_{≤t}, φ) ] + ω₂ L_adv(x_{≤t}, φ) }.   (9)

In an ablation study in Appendix E, we compare the effect of including and excluding the regularization terms in the objective. VDM is competitive without these terms, but we got the strongest results by setting ω_{1,2} = 1 (this is the only nonzero value we tried; these hyperparameters could be tuned further). The first regularization term, L_pred, encourages the variational posterior (from the previous time step) to produce samples that maximize the predictive likelihood,

    L_pred(x_{≤t}, φ) = log E_{q(s_{t-1} | x_{<t})}[ p(x_t | s_{t-1}, x_{<t}) ] ≈ log ( (1/k) Σ_{i=1}^k p(x_t | s^(i)_{t-1}) ).   (10)

This regularization term is helpful to improve the prediction performance, since it depends on the predictive likelihood of samples, which isn't involved in the ELBO. The second, optional regularization term L_adv (Eq. (12)) is based on ideas from hybrid adversarial-likelihood training (Grover et al., 2018; Lucas et al., 2019). These training strategies have been developed for generative models of images to generate sharper samples while avoiding "mode collapse". We adapt these ideas to generative models of dynamics. The adversarial term L_adv uses a forward KL-divergence, which enables "quality-driven training" to discourage probability mass in spurious areas:

    L_adv(x_{≤t}, φ) = D_KL( p(x_t | x_{<t}) ∥ p_D(x_t | x_{<t}) ) = E[ log p(x_t | x_{<t}) − log p_D(x_t | x_{<t}) ].   (12)

The expectation is taken w.r.t. p(x_t | x_{<t}). The true predictive distribution p_D(x_t | x_{<t}) is unknown. Instead, we can train the generator of a conditional GAN (Mirza & Osindero, 2014), while assuming an optimal discriminator. As a result, we optimize Eq. (12) in an adversarial manner, conditioning on x_{<t} at each time step. Details about the discriminator are in Appendix G.

Stochastic cubature approximation (SCA).
The variational family of VDM is defined by a number of modeling choices, including the factorization and marginal consistency assumptions of Eq. ( 3), the parametrization of the transition and inference networks Eqs. ( 4) and ( 5), and the choice of weighting function ω(•). It is also sensitive to the choice of sampling method which we discuss here. In principle, we could use Monte-Carlo methods. However, for a relatively small number of samples k, Monte-Carlo methods don't have a mechanism to control the quality of samples. We instead develop a semi-stochastic approach based on the cubature approximation (Wan & Van Der Merwe, 2000; Wu et al., 2006; Arasaratnam & Haykin, 2009) , which chooses samples more carefully. The cubature approximation proceeds by constructing k = 2d + 1 so-called sigma points, which are optimally spread out on the d-dimensional Gaussian with the same mean and covariance as the distribution we need samples from. In SCA, the deterministic sigma points are infused with Gaussian noise to obtain stochastic sigma variables. A detailed derivation of SCA is in Appendix D. We use SCA for various reasons: First, it typically requires fewer samples than Monte-Carlo methods because the sigma points are carefully chosen to capture the first two moments of the underlying distribution. Second, it ensures a persistence of the mixture components; when we resample, we sample another nearby point from the mixture component and not an entirely new location.

4. EVALUATION AND EXPERIMENTS

In this empirical study, we evaluate VDM's ability to model multi-modal dynamics and show its competitive forecasting performance in various domains. We first introduce the evaluation metrics and baselines. Experiments on synthetic data demonstrate that VDM is truly multi-modal, thereby supporting the modeling choices of Section 3, especially for the inference model. Then, experiments on real-world datasets with challenging multi-modal dynamics show the benefit of VDM over state-of-the-art (deep) probabilistic time-series models.

Evaluation metrics. In the experiments, we always create a training set, a validation set, and a test set. During validation and test, each trajectory is split into two parts: initial observations (given to the models for inference) and continuations of the trajectories (to be predicted and not accessible to the models). The inference models are used to process the initial observations and to infer latent states. These are then processed by the generative models to produce forecasts. We use 3 criteria to evaluate these forecasts: (i) multi-step-ahead prediction p(x_{t+1:t+τ} | x_{1:t}), (ii) one-step-ahead prediction p(x_{t+1} | x_{1:t}), and (iii) the empirical Wasserstein distance. As in other work (Lee et al., 2017; Bhattacharyya et al., 2018; 2019), (i) and (ii) are reported in terms of negative log-likelihood. While the predictive distribution for one-step-ahead prediction is in closed form, the long-term forecasts have to be computed using samples. For each ground-truth trajectory x, we generate n = 1000 forecasts x̂_i, given initial observations from the beginning of the trajectory, and compute

    NLL = −log ( (1/n) Σ_{i=1}^n (1/√(2π)) exp( −(x̂_i − x)²/2 ) ).

This evaluates the predictive accuracy but neglects a key aspect of multi-modal forecasts: diversity. We propose a new evaluation metric, which takes both diversity and accuracy of predictions into account.
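A minimal numpy sketch of this sample-based NLL, written for d-dimensional forecasts with a unit-variance Gaussian kernel and a numerically stable log-mean-exp (the generalization to d dimensions and the variable names are ours, not the paper's):

```python
import numpy as np

def forecast_nll(forecasts, truth):
    """Sample-based NLL of the ground truth under a unit-variance Gaussian
    kernel density over n model forecasts (shape (n, d))."""
    diffs = forecasts - truth  # (n, d) residuals
    # per-forecast Gaussian kernel log-density
    logs = (-0.5 * np.sum(diffs ** 2, axis=-1)
            - 0.5 * diffs.shape[-1] * np.log(2 * np.pi))
    # stable log-mean-exp over the n forecasts
    m = logs.max()
    return -(m + np.log(np.mean(np.exp(logs - m))))

rng = np.random.default_rng(0)
truth = np.zeros(2)
good = rng.normal(scale=0.1, size=(1000, 2))  # forecasts near the truth
bad = rng.normal(loc=5.0, size=(1000, 2))     # forecasts far from the truth
```

As a sanity check, forecasts concentrated near the ground truth should score a lower (better) NLL than forecasts concentrated far away.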
It relies on computing the Wasserstein distance between two empirical distributions P, Q,

    W(P, Q) = inf_π (1/n) Σ_{i=1}^n ‖x_i − y_{π(i)}‖₂,   (14)

where x and y are the discrete samples of P and Q, and π ranges over all permutations (Villani, 2008). To use this as an evaluation measure for multi-modal forecasts, we do the following. We select n samples from the test set with similar initial observations. If the dynamics in the data are multi-modal, the continuations of those n trajectories will be diverse, and this should be reflected in the forecasts. For each of the n samples, the model generates 10 forecasts and we get n groups of samples. With Eq. (14), the empirical W-distance between the n true samples and each group of generated samples can be calculated. The averaged empirical W-distance over groups evaluates how well the generated samples match the ground truth. Repeating this procedure with different initial trajectories evaluates the distance between the modeled distribution and the data distribution.

Figure 3: Experiments on 2d synthetic data with 4 modes highlight the multi-modality of VDM. We train VDM(k = 9) (left), VDM(k = 1) (middle), and AESMC(k = 9) (right) on a training set of trajectories D of length 4, and plot generated trajectories (2 colors for 2 dimensions). We also plot the aggregated posterior p(z_2 | D) and the predictive prior p(z_2 | x_{≤1}) (4 colors for 4 clusters, not related to the colors in the trajectory plots) at the second time step. Only VDM learns a multi-modal predictive prior, which explains its success in modeling multi-modal dynamics.

RKN (Becker et al., 2019) models the latent space with locally linear SSMs, which makes the prediction step and update step analytic (as for Kalman filters (Kalman, 1960)).
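For small n, the empirical W-distance of Eq. (14) can be computed exactly by brute force over permutations; the paper does not specify a solver, and an assignment solver (e.g. the Hungarian algorithm) would scale better. A sketch:

```python
import itertools
import numpy as np

def empirical_w(P, Q):
    """Empirical Wasserstein distance between two equally sized sample sets:
    minimize the mean pairwise distance over all permutations (Eq. (14)).
    Brute force -- only feasible for small n."""
    n = len(P)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        cost = np.mean([np.linalg.norm(P[i] - Q[j])
                        for i, j in enumerate(perm)])
        best = min(best, cost)
    return best

P = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
```

The distance of a sample set to itself is zero, while shifting the samples produces a strictly positive distance, which is the diversity-plus-accuracy behavior the metric is meant to capture.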
A final baseline is the conditional flow variational autoencoder (CF-VAE) (Bhattacharyya et al., 2019), which uses conditional normalizing flows to model a global prior for the future continuations and achieves state-of-the-art performance. To investigate the necessity of taking multiple samples in the VDM inference model, we also compare to VDM(k = 1), which uses only a single sample in Eq. (5). VDM(k = 1) has a simpler generative model than VRNN (it considers no autoregressive feedback of the observations x), but the same inference model. More ablations for the modeling choices of VDM are in Appendix H. For a fair comparison, we fix the dimension of the latent variables z_t and h_t to be the same for VDM, AESMC, and VRNN, which then have the same resulting model size (except for the additional autoregressive feedback in VRNN). AESMC and VDM always use the same number of particles/samples. RKN does not have recurrent states, so we choose a higher latent dimension to make the model size comparable. In contrast, CF-VAE has only one global latent variable which needs more capacity, and we make it higher-dimensional than z_t. Details for each experiment are in Appendix G.

Synthetic data with multi-modal dynamics. We generate synthetic data with two dimensions and four modes and compare the performance of VDM with 9 samples (Fig. 3, left), VDM with a single sample (Fig. 3, middle), and AESMC using 9 particles (Fig. 3, right). Since variational inference is known to try to match the aggregated posterior with the predictive prior (Tomczak & Welling, 2018), it is instructive to fit all three models and to look at their predictive prior p(z_2 | x_{≤1}) and the aggregated posterior p(z_2 | D). Because of the multi-modal nature of the problem, all 3 aggregated posteriors are multi-modal, but only VDM(k = 9) learns a multi-modal predictive prior (thanks to its choice of variational family).
Although AESMC achieves a good match between the prior and the aggregated posterior, the predictive prior does not clearly separate into different modes. In contrast, the inference model of VDM successfully uses the weights (Eq. ( 6)), which contain information about the incoming observation, to separate the latent states into separate modes. Stochastic Lorenz attractor. The Lorenz attractor is a system governed by ordinary differential equations. We add noise to the transition and emission function to make it stochastic (details in Appendix F.1). Under certain parameter settings it is chaotic -even small errors can cause considerable differences in the future. This makes forecasting its dynamics very challenging. All models are trained and then tasked to predict 90 future observations given 10 initial observations. Fig. 4 illustrates qualitatively that VDM (Fig. 4b ) and AESMC (Fig. 4c ) succeed in modeling the chaotic dynamics of the stochastic Lorenz attractor, while CF-VAE (Fig. 4d ) and VRNN (Fig. 4e ) miss local details, and RKN (Fig. 4f ) which lacks the capacity for stochastic transitions does not work at all. VDM achieves the best scores on all metrics (Table 1 ). Since the dynamics of the Lorenz attractor are governed by ordinary differential equations, the transition dynamics at each time step are not obviously multi-modal, which explains why all models with stochastic transitions do reasonably well. Next, we will show the advantages of VDM on real-world data with multi-modal dynamics. Taxi trajectories. The taxi trajectory dataset involves taxi trajectories with variable lengths in Porto, Portugal. Each trajectory is a sequence of two dimensional locations over time. Here, we cut the trajectories to a fixed length of 30 to simplify the comparison (details in Appendix F.2). The task is to predict the next 20 observations given 10 initial observations. Ideally, the forecasts should follow the street map (though the map is not accessible to the models). 
The results in Table 2 show that VDM outperforms the other sequential latent variable models in all evaluations. However, it turns out that for multi-step forecasting, learning global structure is advantageous, and CF-VAE, which is a global latent variable model, achieves the best score. However, this value doesn't match the qualitative results in Fig. 1. Since CF-VAE has to encode the entire structure of the trajectory forecast into a single latent variable, its predictions seem to average over plausible continuations but are locally neither plausible nor accurate. In comparison, VDM and the other models involve a sequence of latent variables. As the forecasting progresses, the methods update their distribution over latent states, and the impact of the initial observations becomes weaker and weaker. As a result, local structure is captured more accurately. While the forecasts are plausible and can be highly diverse, they potentially evolve into other directions than the ground truth. For this reason, their multi-step prediction results are worse in terms of log-likelihood. That's why the empirical W-distance is useful to complement the evaluation of multi-modal tasks. It reflects that the forecasts of VDM are diverse and plausible. Additionally, we illustrate the predictive prior p(z_t | x_{<t}) at different time steps in Fig. 5. VDM(k = 13) learns a multi-modal predictive prior, while VDM(k = 1) and AESMC approximate it with a uni-modal Gaussian.

U.S. pollution data. The goal is to predict monthly pollution values for the coming 18 months, given observations of the previous six months. We ignore the geographical location and time information to treat the development tendency of pollution in different counties and at different times as i.i.d. The unknown context information makes the dynamics multi-modal and challenging to predict accurately. Due to the small size and high dimensionality of the dataset, there are not enough samples with very similar initial observations.
Thus, we cannot evaluate the empirical W-distance in this experiment. In multi-step predictions and one-step predictions, VDM outperforms the other methods.

NBA SportVu data. This dataset of sequences of 2D coordinates describes the movements of basketball players and the ball. We extract the trajectories and cut them to a fixed length of 30 to simplify the comparisons (details in Appendix F.4). The task is to predict the next 20 observations given 10 initial observations. Players can move anywhere on the court and hence their movement is less structured than the taxi trajectories, which are constrained by the underlying street map. Due to this, the initial movement patterns are not similar enough to each other to evaluate the empirical W-distance. In multi-step and one-step predictions, VDM outperforms the other baselines (Table 4). Fig. 6 illustrates qualitatively that VDM (Fig. 6b) and CF-VAE (Fig. 6d) succeed in capturing the multi-modal dynamics. The forecasts of AESMC (Fig. 6c) are less plausible (not as smooth as the data), and VRNN (Fig. 6e) and RKN (Fig. 6f) fail in capturing the multi-modality.

5. CONCLUSION

We have presented variational dynamic mixtures (VDM), a sequential latent variable model for multi-modal dynamics. The main contribution is a new variational family. It propagates multiple samples through an RNN to parametrize the posterior approximation with a mixture density network. Additionally, we have introduced the empirical Wasserstein distance for the evaluation of multi-modal forecasting tasks, since it accounts for forecast accuracy and diversity. VDM succeeds in learning challenging multi-modal dynamics and outperforms existing work in various applications.

A SUPPLEMENTARY TO WEIGHTING FUNCTION

In this appendix we give intuition for our choice of weighting function Eq. (6). Since we approximate the integrals in Eqs. (3) and (7) with samples from q(s_{t-1} | x_{<t}) instead of samples from q(s_{t-1} | x_{≤t}), importance sampling tells us that the weights should be

    ω(s_{t-1}, x_t) = q(s_{t-1} | x_{≤t}) / q(s_{t-1} | x_{<t})
                    = [ q(x_t | s_{t-1}, x_{<t}) / q(x_t | x_{<t}) ] · [ q(s_{t-1} | x_{<t}) / q(s_{t-1} | x_{<t}) ]
                    = q(x_t | s_{t-1}, x_{<t}) / q(x_t | x_{<t})
                    ∝ q(x_t | s_{t-1}, x_{<t}).

This is consistent with our earlier definition of q(s_{t-1} | x_{≤t}) = ω(s_{t-1}, x_t) q(s_{t-1} | x_{<t}). The weights are proportional to the likelihood of the variational model q(x_t | s_{t-1}, x_{<t}). We choose to parametrize it using the likelihood of the generative model p(x_t | h_{t-1} = s_{t-1}) and get

    ω^(i)_t = ω(s^(i)_{t-1}, x_t)/k := 1(i = argmax_j p(x_t | h_{t-1} = s^(j)_{t-1})).

With this choice of the weighting function, only the mixture component with the highest likelihood is selected to be in charge of modeling the current observation x_t. As a result, other mixture components have the capacity to focus on different modes. This helps avoid the effect of mode-averaging. An alternative weight function is given in Appendix H.
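For illustration, the one-hot rule of Eq. (6) can be contrasted with a normalized-likelihood variant. The softmax-style version below is our hypothetical stand-in for an alternative weighting; the actual alternative of Appendix H is not shown in this excerpt.

```python
import numpy as np

def one_hot_weights(log_liks):
    """Eq. (6): put all weight on the component with the highest likelihood."""
    w = np.zeros(len(log_liks))
    w[int(np.argmax(log_liks))] = 1.0
    return w

def normalized_weights(log_liks):
    """Hypothetical alternative: weights proportional to the likelihood
    p(x_t | h_{t-1} = s^(j)) itself (a numerically stable softmax of
    the per-sample log-likelihoods)."""
    log_liks = np.asarray(log_liks)
    e = np.exp(log_liks - log_liks.max())
    return e / e.sum()

log_liks = [-3.0, -0.5, -2.0]  # toy per-sample log p(x_t | h_{t-1} = s^(j))
```

The one-hot rule gives a single component full responsibility for x_t, while the normalized variant spreads responsibility across all components in proportion to how well each explains the observation.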

B SUPPLEMENTARY TO LOWER BOUND

Claim. The ELBO in Eq. (8) is a lower bound on the log-evidence log p(x_t | x_{<t}),

log p(x_t | x_{<t}) ≥ L_ELBO(x_{≤t}, φ).

Proof. We write the data evidence as a double integral over the latent variables z_t and z_{<t},

log p(x_t | x_{<t}) = log ∫∫ p(x_t | z_{≤t}, x_{<t}) p(z_t | z_{<t}, x_{<t}) p(z_{<t} | x_{<t}) dz_t dz_{<t}.

We multiply the posterior at the previous time step p(z_{<t} | x_{<t}) by the ratio of the approximate posterior q(z_{<t} | x_{<t}) / q(z_{<t} | x_{<t}) and the ratio f(a, b) / f(a, b), where f is any suitable function of two variables a and b. The following equality holds, since both ratios equal one:

log p(x_t | x_{<t}) = log ∫∫ [f(a, b) / f(a, b)] [q(z_{<t} | x_{<t}) / q(z_{<t} | x_{<t})] p(z_{<t} | x_{<t}) p(x_t | z_{≤t}, x_{<t}) p(z_t | z_{<t}, x_{<t}) dz_t dz_{<t}. (19)

We move the integral over z_{<t} with respect to f(a, b) q(z_{<t} | x_{<t}) out of the logarithm by applying Jensen's inequality:

log p(x_t | x_{<t}) ≥ E_{f(a,b) q(z_{<t} | x_{<t})} [ log ∫ p(x_t | z_{≤t}, x_{<t}) p(z_t | z_{<t}, x_{<t}) dz_t ] − E_{f(a,b) q(z_{<t} | x_{<t})} [ log f(a, b) + log q(z_{<t} | x_{<t}) / p(z_{<t} | x_{<t}) ]. (20)

We introduce the variational posterior q(z_t | z_{<t}, x_{≤t}) and apply Jensen's inequality again to replace the intractable integral log ∫ p(x_t | z_{≤t}, x_{<t}) p(z_t | z_{<t}, x_{<t}) dz_t with its lower bound:

log p(x_t | x_{<t}) ≥ E_{f(a,b) q(z_{<t} | x_{<t})} [ E_{q(z_t | z_{<t}, x_{≤t})} [ log ( p(x_t | z_{≤t}, x_{<t}) p(z_t | z_{<t}, x_{<t}) / q(z_t | z_{<t}, x_{≤t}) ) ] ] − E_{f(a,b) q(z_{<t} | x_{<t})} [ log f(a, b) + log q(z_{<t} | x_{<t}) / p(z_{<t} | x_{<t}) ]. (21)

The cubature weights γ^{(i)} and sigma-point locations ξ^{(i)} (used in Appendix D) are given by

γ^{(i)} = 1/(2(n + κ)) for i = 1, ..., 2n, and γ^{(0)} = κ/(n + κ),

ξ^{(i)} = √(n + κ) e_i for i = 1, ..., n; ξ^{(i)} = −√(n + κ) e_{i−n} for i = n + 1, ..., 2n; ξ^{(0)} = 0, (24)

where κ is a hyperparameter controlling the spread of the sigma points in the n-dimensional sphere. Further, e_i denotes a basis of the n-dimensional space, chosen to be unit vectors in Cartesian coordinates, e.g., e_1 = [1, 0, ..., 0].

Stochastic cubature approximation.

In SCA, we adopt the computation of ξ^{(i)} in Eq. (24) and infuse the sigma points with standard Gaussian noise ε ∼ N(0, I) to obtain stochastic sigma variables s^{(i)} = μ_z + σ_z (ξ^{(i)} + ε). We choose κ = 0.5 so that the weights γ^{(i)} are all equal.
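A minimal NumPy sketch of SCA under these formulas (the function name is illustrative, and μ_z, σ_z are placeholder inputs):

```python
import numpy as np

def stochastic_sigma_variables(mu_z, sigma_z, kappa=0.5, rng=None):
    """Stochastic cubature approximation (SCA) sketch: build the 2n+1
    sigma-point locations xi^(i) and weights gamma^(i) of Eq. (24), then
    perturb each location with standard Gaussian noise eps ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    n = mu_z.shape[0]
    scale = np.sqrt(n + kappa)
    xi = np.zeros((2 * n + 1, n))
    xi[1:n + 1] = scale * np.eye(n)    # +sqrt(n+kappa) e_i, i = 1..n
    xi[n + 1:] = -scale * np.eye(n)    # -sqrt(n+kappa) e_{i-n}, i = n+1..2n
    gamma = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    gamma[0] = kappa / (n + kappa)     # equal to the others when kappa = 0.5
    eps = rng.standard_normal(xi.shape)
    s = mu_z + sigma_z * (xi + eps)    # stochastic sigma variables s^(i)
    return s, gamma

s, gamma = stochastic_sigma_variables(np.zeros(6), np.ones(6))
print(s.shape)  # -> (13, 6), i.e. 2*6 + 1 samples of dimension 6
```

With κ = 0.5 every γ^{(i)} equals 1/(2n + 1), matching the statement above that the weights are set equally.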

E SUPPLEMENTARY TO ABLATION STUDY OF REGULARIZATION TERMS

We investigate the effect of the regularization terms using the synthetic data from Fig. 3. As Table 5 shows, VDM (k = 9) can be trained successfully with L_ELBO alone, and both regularization terms improve the performance (negative log-likelihood of multi-step-ahead predictions), while VDM (k = 1) fails regardless of the regularization terms. Additionally, we tried to train the model with the regularization terms only (each separately or together), but these options diverged during training.

F.1 STOCHASTIC LORENZ ATTRACTOR SETUP

The Lorenz attractor is a system of three ordinary differential equations:

dx/dt = σ(y − x), dy/dt = x(ρ − z) − y, dz/dt = xy − βz,

where σ, ρ, and β are system parameters. We set σ = 10, ρ = 28, and β = 8/3 to make the system chaotic. We simulate the trajectories with RK4 and a step size of 0.01. To make the system stochastic, we add process noise to the transition, a mixture of two Gaussians 0.5 N(m_0, P) + 0.5 N(m_1, P), where m_0 = [0, 1, 0]^T and m_1 = [0, −
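The simulation described above can be sketched as follows (the noise parameters m_0, m_1, P and the initial condition are placeholders, since their exact values are not fully legible here; the function names are illustrative):

```python
import numpy as np

def lorenz_rk4_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One RK4 step of dx/dt = sigma(y-x), dy/dt = x(rho-z)-y, dz/dt = xy-beta*z."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(T, rng, m0, m1, chol_P):
    """Trajectory with two-component Gaussian-mixture process noise
    0.5 N(m0, P) + 0.5 N(m1, P) added after each RK4 transition."""
    traj = np.zeros((T, 3))
    state = np.array([1.0, 1.0, 1.0])  # assumed initial condition
    for t in range(T):
        state = lorenz_rk4_step(state)
        mean = m0 if rng.random() < 0.5 else m1
        state = state + mean + chol_P @ rng.standard_normal(3)
        traj[t] = state
    return traj

rng = np.random.default_rng(0)
# m0, m1, P are placeholder values, not the paper's exact settings.
traj = simulate(1000, rng, np.array([0.0, 0.2, 0.0]),
                np.array([0.0, -0.2, 0.0]), 0.1 * np.eye(3))
print(traj.shape)  # -> (1000, 3)
```

The mixture noise makes each transition bimodal while the deterministic RK4 step keeps the trajectory on the attractor, which is the setup the experiments use.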

G IMPLEMENTATION DETAILS

Here, we provide implementation details of the VDM models used across the three datasets in the main paper. VDM consists of:
• encoder: embeds the first observation x_0 into the latent space as the initial latent state z_0.
• transition network: propagates the latent states z_t.
• decoder: maps the latent states z_t and the recurrent states h_t to observations x_t.
• inference network: updates the latent states z_t given observations x_t.
• latent GRU: summarizes the historic latent states z_{≤t} in the recurrent states h_t.
• discriminator: used for adversarial training.

The optimizer is Adam with a learning rate of 1e-3. In all experiments, the networks have the same architectures but different sizes. The model size depends on the observation dimension d_x, the latent state dimension d_z, and the recurrent state dimension d_h. The number of samples used at each time step during training is 2 d_z + 1. If a model output is a variance, we take its exponential to ensure non-negativity.
• Encoder: input size d_x; 3 linear layers of sizes 32, 32, and 2 d_z, with 2 ReLUs.
• Transition network: input size d_h; 3 linear layers of sizes 64, 64, and 2 d_z, with 3 ReLUs.
• Decoder: input size d_h + d_z; 3 linear layers of sizes 32, 32, and 2 d_x, with 2 ReLUs.
• Inference network: input size d_h + d_x; 3 linear layers of sizes 64, 64, and 2 d_z, with 3 ReLUs.
• Latent GRU: one-layer GRU with input size d_z and hidden size d_h.
• Discriminator: one-layer GRU with input size d_x and hidden size d_h to summarize the previous observations as the condition, and a stack of 3 linear layers of sizes 32, 32, and 1, with 2 ReLUs and a sigmoid output activation, whose input size is d_h + d_x.
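To make the size specification concrete, a small Python sketch that counts the parameters of the MLP components listed above (GRUs and the discriminator are omitted; the helper names are illustrative):

```python
def linear_params(sizes):
    """Weights plus biases in a stack of fully-connected layers."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def vdm_mlp_params(d_x, d_z, d_h):
    """Parameter counts following the layer sizes in the text above."""
    return {
        "encoder": linear_params([d_x, 32, 32, 2 * d_z]),
        "transition": linear_params([d_h, 64, 64, 2 * d_z]),
        "decoder": linear_params([d_h + d_z, 32, 32, 2 * d_x]),
        "inference": linear_params([d_h + d_x, 64, 64, 2 * d_z]),
    }

# Stochastic Lorenz attractor setting: d_x = 3, d_z = 6, d_h = 32
counts = vdm_mlp_params(3, 6, 32)
print(counts["encoder"])  # -> 1580
```

This makes explicit how the total model size scales with d_x, d_z, and d_h, as stated above.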



Footnotes:
1. https://www.kaggle.com/crailtap/taxi-trajectory
2. q(s_{t-1} | x_{<t}) is the distribution obtained by transforming the previous z_{t-1} ∼ q(z_{t-1} | x_{<t}) through the RNN. It can be expressed analytically, using the Kronecker δ to test whether the stochastic variable s_{t-1} equals the output of the RNN: q(s_{t-1} | x_{<t}) ∝ ∫ δ(s_{t-1} − φ_GRU(z_{t-1}, h_{t-2})) q(z_{t-1} | x_{t-1}, λ_{t-1}) dz_{t-1}. The weights ω adjust for using samples from q(s_{t-1} | x_{<t}) when marginalizing over ω(s_{t-1}, x_t) q(s_{t-1} | x_{<t}).
3. A version of the dataset is available at https://www.stats.com/data-science/
4. The ∼ just helps to visually distinguish the two distributions that appear in the main text.
5. https://www.kaggle.com/sogun3/uspollution



Figure 4: Generated samples from VDM and baselines for the stochastic Lorenz attractor. The models generate the remaining 990 observations (blue) based on the first 10 observations (red). Due to the chaotic nature of the system, exact reconstruction is impossible even if the model learns the right dynamics. VDM and AESMC capture the dynamics very well, while RKN fails to capture the stochastic dynamics.

Figure 6: VDM and CF-VAE generate plausible multi-modal trajectories of basketball plays. Each model's forecasts (blue) are based on the first 10 observations (red). Ground truth data is green.

Figure 7: Generated trajectories of the stochastic Lorenz attractor from VDM variants. The first ten observations (red) are obtained from the models given the first 10 true observations. The remaining 990 observations (blue) are predicted. All variants give very good qualitative results. Since the underlying dynamics are governed by ordinary differential equations, the transition at each time step is not highly multi-modal. Once the model is equipped with a stochastic transition, it is able to model these dynamics.

Figure 8: 50 generated taxi trajectories in 3 different areas from VDM and the baselines. All models are required to predict the future continuation (red) based on the beginning of a trajectory (blue). VDM generates more plausible trajectories than the baselines. While the generated trajectories from VDM follow the street map, the generated trajectories from all baselines are physically impossible. AESMC and CF-VAE capture the general direction of travel, but struggle to capture the multi-modality at each time step.

Figure 9: 50 generated taxi trajectories from VDM variants. All models are required to predict the future continuation (red) based on the beginning of a trajectory (blue). VDM-SCA+δ achieves the best qualitative results among all variants; it generates plausible trajectories even though it is trained without the adversarial term L_adv. We can see that, for the weighting function, Eq. (27) is better than Eq. (28), and for the sampling method, SCA is better than the Monte Carlo method.

Prediction error on stochastic Lorenz attractor for three evaluation metrics (details in main text). VDM(k = 13) achieves the best performance, and AESMC also gives comparable results.



Prediction error on U.S. pollution data for two evaluation metrics (details in main text). VDM makes the most accurate multi-step and one-step predictions.

Prediction error on basketball players' trajectories (details in main text). VDM makes the most accurate multi-step and one-step predictions.

Table 5: Ablation study of the regularization terms for synthetic data from Fig. 3. The columns compare training with L_ELBO alone, with L_ELBO & L_pred, and with L_ELBO & L_adv.

F.2 TAXI TRAJECTORIES SETUP

The full dataset is very large and the length of the trajectories varies. We select the trajectories inside the Porto city area with lengths in the range of 30 to 45, and only extract the first 30 coordinates of each trajectory. Thus we obtain a dataset with a fixed sequence length of 30. We split it into a training set of size 86386, a validation set of size 200, and a test set of size 10000.

F.3 U.S. POLLUTION DATA SETUP

The U.S. pollution dataset consists of four pollutants (NO2, O3, SO2, and CO). Each of them has 3 major values (mean, max value, and air quality index). It is collected from counties in different states, for every day from 2000 to 2016. Since the daily measurements are too noisy, we first compute the monthly average values of each measurement and then extract non-overlapping segments of length 24 from the dataset. In total we extract 1639 sequences as the training set, 25 sequences as the validation set, and 300 sequences as the test set.

F.4 NBA SPORTVU DATA SETUP

We use a sliding window of width 30 and stride 30 to cut the long sequences into short sequences of a fixed length of 30. We split them into a training set of size 8324, a validation set of size 489, and a test set of size 980.
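The F.4 windowing step can be sketched as follows (toy data; `sliding_windows` is an illustrative helper, not code from the paper):

```python
import numpy as np

def sliding_windows(seq, width=30, stride=30):
    """Cut a long trajectory into fixed-length segments (non-overlapping
    when stride == width, as in the NBA SportVu setup)."""
    return [seq[i:i + width] for i in range(0, len(seq) - width + 1, stride)]

seq = np.arange(100).reshape(100, 1)  # toy 1-D "trajectory" of length 100
segments = sliding_windows(seq)
print(len(segments))  # -> 3
```

With stride equal to the window width, a length-100 sequence yields three disjoint length-30 segments and the trailing remainder is discarded.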

Stochastic Lorenz attractor. Observation dimension d_x is 3, latent state dimension d_z is 6, and recurrent state dimension d_h is 32.
Taxi trajectories. Observation dimension d_x is 2, latent state dimension d_z is 6, and recurrent state dimension d_h is 32.
U.S. pollution data. Observation dimension d_x is 12, latent state dimension d_z is 8, and recurrent state dimension d_h is 48.

Ablation study of VDM's variants on taxi trajectories for three distance metrics (see main text). The variants are defined in Table 7. VDM-SCA+δ outperforms the other variants and approaches our default VDM (trained additionally with L_adv).

Ablation study of VDM's variants on U.S. pollution data for two distance metrics (see main text). The variants are defined in Table 7. VDM-SCA+δ outperforms the other variants.


The expectation with respect to f(a, b) q(z_{<t} | x_{<t}) is approximated with samples. Instead of re-sampling the entire history, samples from previous time steps are reused (they have been aggregated by the RNN) and we sample according to Eq. (4). We plug in the weighting function ω(s^{(i)}_{t-1}, x_t) for f(a, b). The term log q(z_{<t} | x_{<t}) / p(z_{<t} | x_{<t}) is not affected by the incoming observation x_t and can be treated as a constant. In this step, we plug in our generative model and inference model for p and q, as described in the main text. The conditional independence assumptions can be read off Fig. 2. In the generative model, h_{t-1}, and in the inference model, s_{t-1}, summarize the dependencies of z_t on the previous latent variables z_{<t} and observations x_{<t}. In other words, we assume z_t is conditionally independent of z_{<t} and x_{<t} given s^{(i)}_{t-1} in the inference model (or given h_{t-1} in the generative model).

C ALGORITHMS OF GENERATIVE MODEL AND INFERENCE MODEL

Algorithm 1: Generative model.
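The generative loop can be sketched as follows, based on the component descriptions in Appendix G (random linear maps and a tanh update stand in for the trained transition network, decoder, and latent GRU; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sequence(T, d_z=6, d_h=32, d_x=3):
    """Hedged sketch of the generative model: the transition network maps the
    recurrent state h_{t-1} to the parameters of p(z_t), the decoder maps
    (h_{t-1}, z_t) to the parameters of p(x_t), and a GRU-like update folds
    z_t into h_t. Networks output (mean, log-variance) pairs, and the
    log-variance is exponentiated, as described in Appendix G."""
    W_trans = rng.standard_normal((2 * d_z, d_h)) / np.sqrt(d_h)
    W_dec = rng.standard_normal((2 * d_x, d_h + d_z)) / np.sqrt(d_h + d_z)
    W_h = rng.standard_normal((d_h, d_h + d_z)) / np.sqrt(d_h + d_z)
    h = np.zeros(d_h)
    xs = []
    for _ in range(T):
        mu_z, logvar_z = np.split(W_trans @ h, 2)
        z = mu_z + np.exp(0.5 * logvar_z) * rng.standard_normal(d_z)
        mu_x, logvar_x = np.split(W_dec @ np.concatenate([h, z]), 2)
        x = mu_x + np.exp(0.5 * logvar_x) * rng.standard_normal(d_x)
        h = np.tanh(W_h @ np.concatenate([h, z]))  # stand-in for the GRU
        xs.append(x)
    return np.stack(xs)

samples = sample_sequence(5)
print(samples.shape)  # -> (5, 3)
```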

D SUPPLEMENTARY TO STOCHASTIC CUBATURE APPROXIMATION

Cubature approximation. The cubature approximation is widely used in the engineering community as a deterministic method to numerically integrate a nonlinear function f(·) of a Gaussian random variable z ∼ N(μ_z, σ_z^2 I), with z ∈ R^d. The method proceeds by constructing 2d + 1 sigma points z^{(i)} = μ_z + σ_z ξ^{(i)}. The cubature approximation is simply a weighted sum of the sigma points propagated through the nonlinear function f(·),

E[f(z)] ≈ Σ_{i=0}^{2d} γ^{(i)} f(z^{(i)}).

Simple analytic formulas determine the weights γ^{(i)} and the locations ξ^{(i)} (Eq. (24)).

NBA SportVu data. Observation dimension d_x is 2, latent state dimension d_z is 6, and recurrent state dimension d_h is 32.

Here, we give the number of parameters for each model in the different experiments in Table 6.
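As a sanity check of the rule above, a brief NumPy sketch (the helper name is illustrative) that approximates E[f(z)] with the 2d + 1 sigma points; for quadratic f the rule is exact:

```python
import numpy as np

def cubature_expectation(f, mu, sigma, kappa=0.5):
    """Approximate E[f(z)] for z ~ N(mu, diag(sigma^2)) by the weighted sum
    of the 2d+1 sigma points of Eq. (24) propagated through f."""
    d = mu.shape[0]
    scale = np.sqrt(d + kappa)
    xi = np.concatenate([np.zeros((1, d)),
                         scale * np.eye(d),
                         -scale * np.eye(d)])
    gamma = np.full(2 * d + 1, 1.0 / (2 * (d + kappa)))
    gamma[0] = kappa / (d + kappa)
    points = mu + sigma * xi
    return sum(g * f(p) for g, p in zip(gamma, points))

# For a quadratic f, compare with the analytic value E[sum z^2] = sum(mu^2 + sigma^2).
mu, sigma = np.array([1.0, 2.0]), np.array([0.5, 0.5])
approx = cubature_expectation(lambda z: (z ** 2).sum(), mu, sigma)
exact = (mu ** 2 + sigma ** 2).sum()
print(bool(np.isclose(approx, exact)))  # -> True
```

The deterministic placement of the points is what SCA later perturbs with Gaussian noise to obtain stochastic samples.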

H ADDITIONAL EVALUATION RESULTS

We evaluate more variants of VDM in the chosen experiments to investigate the different choices of sampling methods (the Monte Carlo method and SCA) and weighting functions (Eqs. (27) and (28)). In addition to Eq. (27) described in the main text, we define one other choice,

ω^{(i)}_t := 1(i = j), with j ∼ Cat(π), π_j ∝ p(x_t | h_{t-1} = s^{(j)}_{t-1}), (28)

i.e., we define the weighting function as an indicator function: in Eq. (27) we set the non-zero component by selecting the sample that achieves the highest likelihood, while in Eq. (28) the non-zero index is sampled from a categorical distribution with probabilities proportional to the likelihood. The first choice (Eq. (27)) is referred to as the δ-function, and the second choice (Eq. (28)) as the categorical distribution. Besides, in VDM-Net, we evaluate the performance of replacing the closed-
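A minimal sketch of the categorical alternative (assuming the per-sample likelihoods are given as an array; `categorical_weights` is an illustrative name):

```python
import numpy as np

def categorical_weights(likelihoods, rng):
    """Eq. (28)-style weighting: the non-zero index is sampled from a
    categorical distribution with probabilities proportional to the
    per-sample likelihoods, instead of taking the argmax as in Eq. (27)."""
    p = np.asarray(likelihoods, dtype=float)
    p = p / p.sum()                     # normalize to a valid categorical
    j = rng.choice(len(p), p=p)         # sample the winning component
    w = np.zeros(len(p))
    w[j] = 1.0
    return w

rng = np.random.default_rng(0)
w = categorical_weights([0.1, 0.7, 0.2], rng)
print(w.sum())  # -> 1.0 (always a one-hot vector)
```

Compared with the argmax choice, this variant occasionally assigns the observation to a lower-likelihood component, trading some sharpness for extra exploration across mixture components.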

