VARIATIONAL DYNAMIC MIXTURES

Abstract

Deep probabilistic time series forecasting models have become an integral part of machine learning. While several powerful generative models have been proposed, we provide evidence that their associated inference models are oftentimes too limited and cause the generative model to predict mode-averaged dynamics. Mode-averaging is problematic since many real-world sequences are highly multi-modal, and their averaged dynamics are unphysical (e.g., predicted taxi trajectories might run through buildings on the street map). To better capture multi-modality, we develop variational dynamic mixtures (VDM): a new variational family to infer sequential latent variables. The VDM approximate posterior at each time step is a mixture density network, whose parameters come from propagating multiple samples through a recurrent architecture. This results in an expressive multi-modal posterior approximation. In an empirical study, we show that VDM outperforms competing approaches on highly multi-modal datasets from different domains.

1. INTRODUCTION

Making sense of time series data is an important challenge in various domains, including ML for climate change. One important milestone to reach the climate goals is to significantly reduce the CO 2 emissions from mobility (Rogelj et al., 2016). Accurate forecasting models of typical driving behavior and of typical pollution levels over time can help both lawmakers and automotive engineers to develop solutions for cleaner mobility. In these applications, no accurate physical model of the entire dynamic system is known or available. Instead, data-driven models, specifically deep probabilistic time series models, can be used to solve the necessary tasks including forecasting. The dynamics in such data can be highly multi-modal. At any given part of the observed sequence, there might be multiple distinct continuations of the data that are plausible, but the average of these behaviors is unlikely, or even physically impossible. Consider for example a dataset of taxi trajectories. In each row of Fig. 1a, we have selected 50 routes from the dataset with similar starting behavior (blue). Even though these routes are quite similar to each other in the first 10 way points, the continuations of the trajectories (red) can exhibit quite distinct behaviors and lead to points on any far edge of the map. The trajectories follow a few main traffic arteries; these could be considered the main modes of the data distribution. We would like to learn a generative model of the data that, based on some initial way points, can forecast plausible continuations for the trajectories. Many existing methods make restrictive modeling assumptions such as Gaussianity to make learning tractable and efficient. But trying to capture the dynamics through unimodal distributions can lead either to "over-generalization" (i.e., putting probability mass in spurious regions) or to focusing only on the dominant mode, thereby neglecting important structure of the data.
Even neural approaches with very flexible generative models can fail to fully capture this multi-modality, because their capacity is often limited by the assumptions of their inference model. To address this, we develop variational dynamic mixtures (VDM). Its generative process is a sequential latent variable model. The main novelty is a new multi-modal variational family, which makes learning and inference multi-modal yet tractable. In summary, our contributions are:

• A new inference model. We establish a new type of variational family for variational inference of sequential latent variables. By successively marginalizing over previous latent states, the procedure can be carried out efficiently in a single forward pass and induces a multi-modal posterior approximation. Fig. 1b shows that VDM, trained on a dataset of taxi trajectories, produces forecasts with the desired multi-modality, while other methods over-generalize.

• An evaluation metric for multi-modal tasks. The negative log-likelihood measures predictive accuracy but neglects an important aspect of multi-modal forecasts: sample diversity. In Section 4, we derive a score based on the Wasserstein distance (Villani, 2008) which evaluates both sample quality and diversity. This metric complements our evaluation based on log-likelihoods.

• An extensive empirical study. In Section 4, we use VDM to study various datasets, including a synthetic dataset with four modes, a stochastic Lorenz attractor, the taxi trajectories, and a U.S. pollution dataset with measurements of various pollutants over time. We illustrate VDM's ability to model multi-modal dynamics, and provide quantitative comparisons showing that VDM compares favorably to previous work.
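The metric itself is derived in Section 4. As a rough, hypothetical illustration of why a Wasserstein-type score rewards diversity where the log-likelihood does not, the sketch below compares two one-dimensional forecast distributions against bimodal ground-truth continuations, using the standard empirical Wasserstein-1 estimator for equal-sized sample sets (mean absolute difference of sorted samples). This is not the paper's metric; all distributions and sample sizes here are invented for illustration.

```python
import numpy as np

def empirical_w1(samples_a, samples_b):
    """Empirical 1-D Wasserstein-1 distance between two equal-sized
    sample sets: the mean absolute difference of the sorted samples."""
    a = np.sort(np.asarray(samples_a))
    b = np.sort(np.asarray(samples_b))
    assert a.shape == b.shape, "this simple estimator assumes equal sample sizes"
    return np.mean(np.abs(a - b))

rng = np.random.default_rng(0)
# Ground-truth continuations drawn from a bimodal distribution.
truth = np.concatenate([rng.normal(-2.0, 0.3, 500), rng.normal(2.0, 0.3, 500)])
# A mode-averaged forecaster puts its mass between the two modes...
averaged = rng.normal(0.0, 0.5, 1000)
# ...while a multi-modal forecaster covers both modes.
multimodal = np.concatenate([rng.normal(-2.0, 0.4, 500), rng.normal(2.0, 0.4, 500)])

print(empirical_w1(truth, averaged))    # large: forecasts miss both modes
print(empirical_w1(truth, multimodal))  # small: sample diversity matches the data
```

The mode-averaged forecaster scores much worse under this distance even though its samples sit in a high-density region "on average", which is exactly the failure mode the metric is meant to expose.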

2. RELATED WORK

Neural recurrent models. Recurrent neural networks (RNNs) such as LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Chung et al., 2014) have proven successful on many time series modeling tasks. However, as deterministic models they cannot capture uncertainties in their dynamic predictions. Stochastic RNNs make these sequence models non-deterministic (Chung et al., 2015; Fraccaro et al., 2016; Gemici et al., 2017; Li & Mandt, 2018). For example, the variational recurrent neural network (VRNN) (Chung et al., 2015) enables multiple stochastic forecasts due to its stochastic transition dynamics. An extension of VRNN (Goyal et al., 2017) uses an auxiliary cost to alleviate the KL-vanishing problem; it improves on VRNN inference by forcing the latent variables to also be predictive of future observations. Another line of related methods relies on particle filtering (Naesseth et al., 2018; Le et al., 2018; Hirt & Dellaportas, 2019), in particular sequential Monte Carlo (SMC), to improve the evidence lower bound. In contrast, VDM adopts an explicitly multi-modal posterior approximation. Another SMC-based work (Saeedi et al., 2017) employs search-based techniques for multi-modality but is limited to models with finite discrete states. Recent works (Schmidt & Hofmann, 2018; Schmidt et al., 2019; Ziegler & Rush, 2019) use normalizing flows in the latent space to model the transition dynamics. In practice, a normalizing flow requires many layers to transform its base distribution into a truly multi-modal distribution. In contrast, mixture density networks (as used by VDM) achieve multi-modality by mixing the outputs of only one layer of neural networks. A task orthogonal to multi-modal inference is learning disentangled representations; here too, mixture models are used (Chen et al., 2016; Li et al., 2017). These papers use discrete variables and a mutual-information-based term to disentangle different aspects of the data.
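To make the contrast with normalizing flows concrete, a mixture density network needs only a single hidden layer to output the weights, means, and scales of a Gaussian mixture, which is multi-modal by construction. The sketch below is purely schematic: the weights are random and untrained, and all shapes and names (`K`, `H`, `D`, `mdn_sample`) are illustrative assumptions, not the architecture used in VDM.

```python
import numpy as np

rng = np.random.default_rng(1)
K, H, D = 3, 16, 8  # mixture components, hidden units, context size (illustrative)

# Random (untrained) parameters of a one-hidden-layer mixture density network.
W1, b1 = rng.normal(0, 0.5, (H, D)), np.zeros(H)
W_pi, W_mu, W_s = (rng.normal(0, 0.5, (K, H)) for _ in range(3))

def mdn_sample(context, n=1000):
    """Map a context vector to Gaussian-mixture parameters, then sample."""
    h = np.tanh(W1 @ context + b1)
    logits = W_pi @ h
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                            # mixture weights (softmax)
    mu = 4.0 * (W_mu @ h)                     # component means
    sigma = np.exp(np.clip(W_s @ h, -3, 1))   # positive component scales
    comp = rng.choice(K, size=n, p=pi)        # pick a component per sample
    return rng.normal(mu[comp], sigma[comp])  # sample within the component

samples = mdn_sample(rng.normal(size=D))
print(samples.mean(), samples.std())  # summary statistics of the sampled mixture
```

A flow would instead have to warp a unimodal base density through many invertible layers to tear it into separate modes; the mixture parameterization gets separate modes directly from the `K` components.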
VAE-like models (Bhattacharyya et al., 2018; 2019) and GAN-like models (Sadeghian et al., 2019; Kosaraju et al., 2019) have only global, time-independent latent variables. Yet, they show good results on various tasks, including forecasting. With a deterministic decoder, these models focus on average dynamics and do not capture local details (including multi-modal transitions) very well. Sequential latent variable models are described next.



Taxi trajectory dataset: https://www.kaggle.com/crailtap/taxi-trajectory



Figure 1: Forecasting taxi trajectories is challenging due to the highly multi-modal nature of the data (Fig. 1a). VDM (Fig. 1b) succeeds in generating diverse plausible predictions (red), based on the beginning of a trajectory (blue). The other methods, AESMC (Le et al., 2018), CF-VAE (Bhattacharyya et al., 2019), VRNN (Chung et al., 2015), and RKN (Becker et al., 2019), suffer from mode averaging.

