GENERATIVE TIME-SERIES MODELING WITH FOURIER FLOWS

Abstract

Generating synthetic time-series data is crucial in various application domains, such as medical prognosis, where research is hamstrung by the lack of access to data due to privacy concerns. Most recently proposed methods for generating synthetic time-series rely on implicit likelihood modeling using generative adversarial networks (GANs), but such models can be difficult to train and may jeopardize privacy by "memorizing" temporal patterns in the training data. In this paper, we propose an explicit likelihood model based on a novel class of normalizing flows that view time-series data in the frequency domain rather than the time domain. The proposed flow, dubbed a Fourier flow, uses a discrete Fourier transform (DFT) to convert variable-length time-series with arbitrary sampling periods into fixed-length spectral representations, then applies a (data-dependent) spectral filter to the frequency-transformed time-series. We show that, by virtue of the analytic properties of the DFT, the Jacobian determinants and inverse mapping of the Fourier flow can be computed efficiently in linearithmic time, without imposing explicit structural constraints as in existing flows such as NICE (Dinh et al., 2014), RealNVP (Dinh et al., 2016) and GLOW (Kingma & Dhariwal, 2018). Experiments show that Fourier flows perform competitively compared to state-of-the-art baselines.

1. INTRODUCTION

Lack of access to data is a key hindrance to the development of machine learning solutions in application domains where data sharing may lead to privacy breaches (Walonoski et al., 2018). Areas where this problem is most conspicuous include medicine, where access to (highly sensitive) clinical data is stringently regulated by medical institutions; such strict regulations undermine scientific progress by hindering model development and reproducibility. Generative models that produce sensible and realistic synthetic data present a viable solution to this problem: artificially generated data sets produced by such models can be shared widely without privacy concerns (Buczak et al., 2010). In this paper, we focus on the time-series data setup, where observations are collected sequentially over arbitrary periods of time with different observation frequencies across different features. This general data setup is pervasive in the medical domain; it captures the kind of data maintained in electronic health records (Shickel et al., 2017) or collected in intensive care units (Johnson et al., 2016). While many machine learning-based predictive models that capitalize on such data have been proposed over the past few years (Jagannatha & Yu, 2016; Choi et al., 2017; Alaa & van der Schaar, 2019), much less work has been done on generative models that could emulate and synthesize these data sets. Existing generative models for (medical) time-series are based predominantly on implicit likelihood modeling using generative adversarial networks (GANs), e.g., the Recurrent Conditional GAN (RCGAN) (Esteban et al., 2017) and TimeGAN (Yoon et al., 2019). These models apply representation learning via recurrent neural networks (RNNs) combined with adversarial training in order to map noise sequences in a latent space to synthetic sequential data in the output space.
Albeit capable of flexibly learning complex representations, GAN-based models can be difficult to train (Srivastava et al., 2017), especially in the complex time-series data setup. Moreover, because they hinge on implicit likelihood modeling, GAN-based models can be hard to evaluate quantitatively due to the absence of an explicitly computable likelihood function. Finally, GANs are vulnerable to training data memorization (Nagarajan et al., 2018), a problem that would be exacerbated in the temporal setting, where memorizing even a partial segment of a medical time-series may suffice to reveal a patient's identity, defeating the original purpose of using synthetic data in the first place. Here, we propose an alternative explicit likelihood approach for generating time-series data based on a novel class of normalizing flows which we call Fourier flows. Our proposed flow-based model operates on time-series data in the frequency domain rather than the time domain: it converts variable-length time-series with varying sampling rates across different features into a fixed-size spectral representation using the discrete Fourier transform (DFT), and then learns the distribution of the data in the frequency domain by applying a chain of data-dependent spectral filters to the frequency-transformed time-series. Using the convolution property of the DFT (Oppenheim, 1999), we show that spectral filtering of a time-series in the frequency domain, an operation that mathematically resembles the affine transformations used in existing flows (Dinh et al., 2016), is equivalent to a convolutional transformation in the time domain. This enhancement in the richness of distributions learned by our flow comes at no extra computational cost: using Fast Fourier Transform (FFT) algorithms, we show that all steps of our flow run in O(T log T) time, compared to the quadratic O(T^2) complexity of a direct, time-domain convolutional transformation.
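The convolution property invoked above can be verified numerically. The following sketch (illustrative code, not the authors' implementation) shows that an element-wise spectral filter applied in the frequency domain, computed in O(T log T) via the FFT, coincides with a circular convolution carried out directly in the time domain, which costs O(T^2):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8
x = rng.standard_normal(T)  # a toy time-series
h = rng.standard_normal(T)  # a toy (data-dependent) filter

# Spectral filtering: multiply the DFTs element-wise, then invert.
# Total cost is O(T log T) using the FFT.
y_freq = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# The same operation in the time domain: a circular convolution,
# computed directly here at O(T^2) cost.
y_time = np.array([sum(x[m] * h[(n - m) % T] for m in range(T))
                   for n in range(T)])

assert np.allclose(y_freq, y_time)
```

The equivalence holds exactly (up to floating-point error), which is why an affine-style transformation in the frequency domain corresponds to a richer convolutional transformation in the time domain at no extra cost.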
We also show that, because the DFT is a linear transform with a Vandermonde transformation matrix, computing its Jacobian determinant is trivial. The zero-padding and interpolation properties of the DFT enable natural handling of variable-length and inconsistently sampled time-series. Unlike existing explicit-likelihood models for time-series data, such as deep state-space models (Krishnan et al., 2017; Alaa & van der Schaar, 2019), our model can be optimized and assessed through the exact likelihood rather than a variational lower bound.

2. PROBLEM SETUP

We consider a general temporal data setup where each instance of a (discrete) time-series comprises a sequence of vectors $\mathbf{x} = [\,x_0, \ldots, x_{T-1}\,]$, $x_t \in \mathcal{X}$, $\forall\, 0 \leq t \leq T-1$, covering a period of $T$ time steps. We assume that each dimension of the feature vector $x_t$ is sampled at a different rate, i.e., at each time step $t$, the observed feature vector is $x_t = [\,x_{t,1}[r_1], \ldots, x_{t,D}[r_D]\,]$, where $r_d \in \mathbb{N}^+$ is the sampling period of feature dimension $d \in \{1, \ldots, D\}$. That is, for a given sampling period $r_d$, we observe a value of $x_{t,d}$ every $r_d$ time steps, and a missing value (denoted as $*$) otherwise, i.e.,

$$x_{t,d}[r_d] = \begin{cases} x_{t,d}, & t \bmod r_d = 0, \\ *, & t \bmod r_d \neq 0. \end{cases} \qquad (1)$$

The data setup described above is primarily motivated by medical time-series modeling problems, wherein a patient's clinical measurements and bio-markers are collected over time at different rates (Johnson et al., 2016; Jagannatha & Yu, 2016). Despite our focus on medical data, our proposed generative modeling approach applies more generally to other applications, such as speech synthesis (Prenger et al., 2019) and financial data generation (Wiese et al., 2020). Each realization of the time-series $\mathbf{x}$ is drawn from a probability distribution $\mathbf{x} \sim p(\mathbf{x})$. In order to capture variable-length time-series (common in medical problems), the length $T$ of each sequence is also assumed to be a random variable; for notational convenience, we absorb the distribution of $T$ into $p$. One possible way to represent the joint distribution $p(\mathbf{x})$ is through the factorization:

$$p(\mathbf{x}) = p(x_0, \ldots, x_{T-1}, T) = p(T) \cdot \prod_{t=0}^{T-1} p(x_t \mid x_0, \ldots, x_{t-1}, T). \qquad (2)$$

We assume that the sampling period $r_d$ for each feature $d$ is fixed across all realizations of $\mathbf{x}$. The feature space $\mathcal{X}$ is assumed to accommodate a mix of continuous and discrete variables across its $D$ dimensions.



Our proposed method is not restricted to any specific factorization of p(x).
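The data setup of Eq. (1) can be made concrete with a small sketch. The code below (hypothetical illustration; the function name and `np.nan` standing in for the missing-value symbol $*$ are our choices, not the paper's) generates a $D$-dimensional series of length $T$ in which feature $d$ is observed only at time steps divisible by its sampling period $r_d$:

```python
import numpy as np

def sample_series(T, sampling_periods, rng):
    """Draw a toy D-dimensional series of length T where feature d is
    observed every r_d steps and missing (np.nan) otherwise."""
    D = len(sampling_periods)
    x = rng.standard_normal((T, D))
    for d, r_d in enumerate(sampling_periods):
        # Per Eq. (1): x_{t,d} is missing whenever t mod r_d != 0.
        missing = np.arange(T) % r_d != 0
        x[missing, d] = np.nan
    return x

rng = np.random.default_rng(0)
x = sample_series(T=6, sampling_periods=[1, 2, 3], rng=rng)

# Feature 0 (r_1 = 1) is fully observed; feature 1 (r_2 = 2) is
# observed at t = 0, 2, 4; feature 2 (r_3 = 3) at t = 0 and t = 3.
assert not np.isnan(x[:, 0]).any()
assert np.isnan(x[1, 1]) and not np.isnan(x[2, 1])
```

This is the kind of inconsistently sampled input that the DFT-based representation later converts into a fixed-length spectrum.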

