LATENT LINEAR ODES WITH NEURAL KALMAN FILTERING FOR IRREGULAR TIME SERIES FORECASTING

Abstract

Over the past four years, models based on Neural Ordinary Differential Equations have become state of the art in the forecasting of irregularly sampled time series. Describing the data-generating process as a dynamical system in continuous time allows predictions at arbitrary time points. However, the numerical integration of Neural ODEs typically comes with a high computational burden or may even fail completely. We propose a novel Neural ODE model that embeds the observations into a latent space with dynamics governed by a linear ODE. Consequently, we do not require any specialized numerical integrator but only an implementation of the matrix exponential readily available in many numerical linear algebra libraries. We also introduce a novel state update component inspired by the classical Kalman filter, which, to our knowledge, makes our model the first Neural ODE variant to explicitly satisfy a specific self-consistency property. It allows forecasting irregularly sampled time series with missing values and comes with some numerical stability guarantees. We evaluate the performance on medical and climate benchmark datasets, where the model outperforms the state of the art by margins up to 30%.

1. INTRODUCTION

Continuous dynamical systems described by ordinary differential equations (ODEs) propagate a given state to any time in the future. Hence, ODE-based models are natural candidates for the task of forecasting irregularly sampled time series. Furthermore, many real-world systems are well described by ODEs. Since the seminal paper by Chen et al. (2018), Neural ODEs have become building blocks of state-of-the-art models for irregularly sampled time series forecasting. To predict a future state, an ODE model needs an estimate of the present state, which it then propagates by solving an initial value problem. The present work proposes a model that introduces novel ideas with respect to both the state estimation and the propagation.

One serious issue with Neural ODEs is the cost and possible failure of the numerical integration. Many numerical schemes exist for this purpose, but in every case the cost of integrating to a required accuracy depends on the analytical properties of the right-hand side and can become arbitrarily large or lead to failure. This problem has been tackled by different types of regularization (Finlay et al., 2020; Ghosh et al., 2020; Kelly et al., 2020). We propose a model where the observations are nonlinearly mapped into a latent space and a linear ODE with constant coefficients describes the latent dynamics. Solving the initial value problem simplifies to taking the matrix exponential, for which efficient and stable numerical implementations are available. According to Koopman operator theory (Brunton et al., 2022), such linear ODEs are expressive enough to approximate nonlinear ODEs. Furthermore, such linear dynamics are well understood and can be analyzed and modified using tools from linear algebra.

For the state estimation we propose a filter inspired by the classical Kalman filter that updates the state given a new observation.
However, it does not operate in the linear latent domain, but in the observation domain, and it is not probabilistic. The filter is designed to deal in a natural way with missing values and satisfies a self-consistency condition, such that the model state only changes at an observation if the observation differs from the model prediction. To the best of our knowledge, our model is the first that gives provable guarantees of forward stability at initialization. We evaluate the model on three benchmark datasets for forecasting irregularly sampled time series with missing values (USHCN, MIMIC-III, MIMIC-IV) and improve on the existing models by a considerable margin in all cases.

The contributions of this work are as follows:
(1) We provide a joint view of many ODE-based and related models as latent state space models with four model components (system, filter, encoder and decoder), which by design can handle irregular time series data.
(2) We formulate and argue for five desiderata for the properties of such models, in particular fast and simple system components / ODE integrators, self-consistency, forward stability and handling of missing values.
(3) We propose LinODEnet, a model consisting of a linear ODE system and a Kalman-like filter, and show that it fulfills these desired properties.
(4) In experiments on forecasting three different irregular time series datasets with missing values, we show that LinODEnet reduces the error by up to 30% over the previous state of the art.

Kalman Filtering. The Kalman Filter (Kalman, 1960) provides optimal incremental state updates for linear differential equations with Gaussian noise. While the original version requires complete observations, a modified Kalman filter copes with missing values (Cipra & Romera, 1997), which is one of the motivations for our filter design.

2. RELATED WORK

The Normalizing Kalman Filter (de Bézenac et al., 2020) (Wilson & Finkel, 2009; Millidge et al., 2021) either addresses very special cases or is related to the modelling of the brain and rather far away from our work.

Koopman Theory. A Koopman operator (Koopman, 1931) is a linear operator on a space of time-dependent functions that describes the propagation of observations of a dynamical system through time. While these function spaces are infinite dimensional, in many cases there exist useful finite-dimensional approximations of the Koopman operator, which can be constructed by various methods summarized in a recent review (Brunton et al., 2022). Such Koopman representations have been combined with Kalman filters for the linear operator (Netto & Mili, 2018), but the linear representation is obtained by a classical method which does not work for irregularly sampled observations with missing values. Some works use neural networks to learn Koopman representations (cf. Section 5.4 of Brunton et al. (2022)), but they have largely different architectures and are not applied to irregularly sampled time series with missing values. The model we present can be seen as learning an approximate Koopman operator, but it is outside the scope of this work to exploit this representation along the lines of Koopman theory.

Neural Flows. Neural Flows (Biloš et al., 2021) are another related model class. The authors propose replacing ODEs by invertible time-dependent diffeomorphisms and use different parametrizations for such diffeomorphisms (ResNet, GRU, coupling flow). This is similar to, and in special cases amounts to, learning the solution of a differential equation instead of learning the right-hand side. They also mention parametrizing the solutions by a matrix exponential, but opted for other parametrizations.

3. PROBLEM FORMULATION

A time series dataset $D$ is a set of time series instances, i.e. sequences $(t_i, x^{\text{obs}}_{t_i})_i \in (\mathbb{R} \times \mathcal{O})^*$ encoding observations $x^{\text{obs}}_{t_i}$ at times $t_i$. The observation space $\mathcal{O}$ usually is just composed of $M$ channels: $\mathcal{O} := \mathbb{R}^M$. If observations in some channels can be missing, we write $\mathcal{O} := (\mathbb{R} \cup \{\text{NaN}\})^M$ and distinguish it from the space $\mathcal{X} := \mathbb{R}^M$ of complete observations.

The time series forecasting problem is, given a time series dataset $D$ from an unknown distribution $q$, a loss $\ell : \mathcal{O}^* \times \mathcal{X}^* \to \mathbb{R}$ on observations, and a function $\text{split} : \mathbb{R}^* \to (\mathbb{R}^* \times \mathbb{R}^*)^*$ that splits the time points of a time series into (possibly multiple) pairs of two subsequences, pasts $p$ and futures $s$, to find a model $\hat{x} : \mathbb{R}^* \times (\mathbb{R} \times \mathcal{O})^* \to \mathcal{X}^*$ that, for given future time points and past observations, predicts future observations, minimizing the expected loss

$$\mathbb{E}_{(t, x^{\text{obs}}) \sim q} \left[ \frac{1}{|\text{split}(t)|} \sum_{(p,s) \in \text{split}(t)} \ell\big(x^{\text{obs}}_s, \hat{x}(s, (p, x^{\text{obs}}_p))\big) \right]$$

A simple index-based split function just outputs all possible splits into a past of $P$ time points and a future of $F$ time points, and the loss usually is just an instance-wise loss, e.g., the instance-wise mean squared error:

$$\text{split}_{\text{index}}(t; P, F) := \big\{ \big((t_{i-P+1}, \ldots, t_i), (t_{i+1}, \ldots, t_{i+F})\big) \;\big|\; i \in P : |t| - F \big\}, \qquad P, F \in \mathbb{N}$$

$$\ell_{\text{MSE}}(x^{\text{obs}}, \hat{x}) := \frac{1}{|x^{\text{obs}}|} \sum_{i=1}^{|x^{\text{obs}}|} \left\| \operatorname{diag}\big(\text{not-missing}(x^{\text{obs}}_i)\big) \big(x^{\text{obs}}_i - \hat{x}_i\big) \right\|^2$$

forward in time by a system component. For each observation, the latent state is decoded and updated with a filter component.

Algorithm 1
  $D \leftarrow \operatorname{sort}\big(D \cup \{(s_j, \text{NaN})\}_{j=1:m}\big)$
  $t_0 \leftarrow t_1$; $\hat{z}'_{t_0} \leftarrow c$
  for $(t_i, x^{\text{obs}}_{t_i})$ in $D$ do
    $\Delta t_i \leftarrow t_i - t_{i-1}$
    $\hat{z}_{t_i} \leftarrow \text{System}(\Delta t_i, \hat{z}'_{t_{i-1}})$
    $\hat{x}_{t_i} \leftarrow \text{Decoder}(\hat{z}_{t_i})$
    $\hat{x}'_{t_i} \leftarrow \text{Filter}(\hat{x}_{t_i}, x^{\text{obs}}_{t_i})$
    $\hat{z}'_{t_i} \leftarrow \text{Encoder}(\hat{x}'_{t_i})$
  end for
  Return: Estimated states $(\hat{x}'_{t_i})_{i=1:n}$.

This setup is indeed a very general model class: Table 1 shows how many current state space models can be described by this schema. Neural ODEs, introduced in 2018 by Chen et al., use an ordinary differential equation to represent the system component.
The vector field that describes the right-hand side of the ODE is given by a neural network $f$:

$$\dot{z}(t) = f(t, z(t)), \quad z(t_0) = z_0 \implies z(t) = \operatorname{odeint}(f, z_0, [t_0, t])$$

Crucially, they introduced a continuous version of backpropagation through time that allows computing gradients of a loss function for a Neural ODE model by solving a so-called adjoint equation backwards in time. This allows one to compute gradients without implementing a differentiable ODE integrator and without back-propagating through the integrator steps.

Model | System | Filter
GRU-D | $\dot{z}(t) = -\operatorname{diag}(\max(0, w_i)) \odot z(t)$ | $x'_t \leftarrow m_t x^{\text{obs}}_t + (1 - m_t) \hat{x}_t$, $z'_t \leftarrow \text{GRUCell}(x'_t, z_t)$
ODE-RNN | $\dot{z}(t) = \text{NN}(t, z(t))$ | $z'_t \leftarrow \text{RNNCell}(x^{\text{obs}}_t, z_t)$
GRU-ODE-Bayes | $\dot{z}(t) = \text{GRUCell}(0, z(t)) - z(t)$ | $z'_t \leftarrow \text{GRUCell}(f(x^{\text{obs}}_t, m_t, z_t), z_t)$
NCDE | $\dot{z}(t) = g(z(t)) \dot{s}(t)$ | $s(t) \leftarrow \text{CubicSpline}(t \mid (t_i, x_i)_{i=1 \ldots n})$
Neural Flow | $F(\Delta t, x) = x + \varphi(\Delta t) g(\Delta t, x)$ | Bayesian Filtering
LinODEnet (ours) | $\dot{z}(t) = A z(t)$ | $x'_t = \text{KalmanCell}(x^{\text{obs}}_t, \hat{x}_t) = \hat{x}_t - f(m_t \odot (\hat{x}_t - x^{\text{obs}}_t))$

By design, latent state space models can handle irregularly sampled time series. We formulate five further principled desiderata and argue for them in the following:
D1. fast (and simple) integrators: solving the ODE system for propagating the latent state should be fast (and simple).
D2. self-consistency: the model does not change its forecast when observing its own predictions.
D3. forward stability: at initialization the model maps data with zero mean and unit variance to outputs with zero mean and unit variance for arbitrarily long sequences.
D4. can handle missing values: the model can handle missing values in the observations.
D5. observed channels can influence all other channels: observed values in one channel can influence the estimation of unobserved states of all channels.

D1. Fast (and simple) integrators.
Neural ODE models only faithfully represent the solution to an ODE when the numerical integrators use sufficiently small adaptive step-sizes (Ott et al., 2020)

$$\tilde{D} = \{(s_j, \hat{y}(s_j)) \mid j = 1 \ldots m\}: \qquad \hat{y}(t \mid D) = \hat{y}(t \mid D \cup \tilde{D})$$

For probabilistic models $p(y(t) \mid D)$ we similarly define self-consistency as the condition

$$\mathbb{E}_{\hat{y} \sim p(y(t) \mid D)}[\hat{y}] = \mathbb{E}_{\hat{y} \sim p(y(t) \mid D \cup \tilde{D})}[\hat{y}]$$

Note that in this case it is expected that $\operatorname{Var}_{\hat{y} \sim p(y(t) \mid D \cup \tilde{D})}[\hat{y}]$ should decrease with the size of $\tilde{D}$.

D3. Model stability. Model stability is crucial in order to allow training over long sequences without issues of divergence.

Definition 1 (forward stability). We say a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is forward stable if and only if it maps data with zero mean and unit variance to outputs with zero mean and unit variance:

$$\forall i : \begin{pmatrix} \mathbb{E}_{x \sim p}[x_i] \\ \mathbb{V}_{x \sim p}[x_i] \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \implies \forall j : \begin{pmatrix} \mathbb{E}_{x \sim p}[f(x)_j] \\ \mathbb{V}_{x \sim p}[f(x)_j] \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \tag{4}$$

Similarly, one can define backward stability as the condition that the gradients, or more precisely the vector-Jacobian product, map data with zero mean and unit variance to gradients with zero mean and unit variance. Typically, the random distributions of network parameters are chosen to ensure either forward or backward stability at initialization, or a compromise between the two, as in general simultaneous forward and backward stability is impossible (He et al., 2015). For example, Attention layers (Vaswani et al., 2017) introduce a scaling factor of $1/\sqrt{d}$, and Dropout (Srivastava et al., 2014) multiplies the input by the reciprocal of the keep probability (one minus the drop rate). However, recently a new approach has emerged that achieves both in a ResNet architecture by simply introducing an additional single scalar parameter, initialized with 0, that masks the nonlinearity, making the model look like an identity map. We will refer to this as the ReZero-technique (Bachlechner et al., 2021), although previous works also showed similar ideas (Hayou et al., 2021).

$$x \leftarrow x + \alpha f(x), \qquad \alpha: \text{learnable scalar initialized with } 0 \tag{5}$$

In particular, recent research suggests that this technique allows one to refrain from using batch normalization layers (Ioffe & Szegedy, 2015). We use variants of the ReZero-technique throughout all model components.

D4. Can handle missing values. The model can handle missing values. For latent state space models the filter needs to be able to do so.

D5. Observed channels can influence all other channels. This is crucial as channels are often correlated with each other, hence observing one channel can provide information about all other channels.
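To make the ReZero update (5) concrete, the following minimal Python sketch stacks residual blocks whose gates are all zero and checks that the stack then acts as the identity, so zero-mean, unit-variance inputs keep these statistics at any depth. The inner map `f`, its weights `W`, the depth of 50 and the helper name `rezero_block` are assumptions chosen for illustration only; they are not taken from the reference implementation.

```python
import numpy as np


def rezero_block(x, alpha, f):
    """Residual update x + alpha * f(x) with a scalar gate alpha, as in (5)."""
    return x + alpha * f(x)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))            # hypothetical weights of the inner map


def f(x):
    """Hypothetical nonlinearity standing in for a network block."""
    return np.tanh(x @ W.T)


x = rng.normal(size=(10_000, 8))       # inputs with zero mean and unit variance
y = x.copy()
for _ in range(50):                    # 50 stacked blocks, every gate at its initial value 0
    y = rezero_block(y, alpha=0.0, f=f)

print("identity at initialization:", np.array_equal(y, x))
print("per-channel mean:", y.mean(axis=0).round(3))
print("per-channel variance:", y.var(axis=0).round(3))
```

Once the gates move away from zero during training, the statistics are no longer guaranteed to be preserved; the sketch only illustrates the situation at initialization.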

5. LATENT LINEAR ODES WITH NEURAL KALMAN FILTERING (LINODENET)

We propose two specific innovations for ODE-based latent state space models: (i) to use a linear ODE for the system component and (ii) to use a Kalman-like filter component, such that the overall model fulfills the desiderata D1 to D5. We call the resulting model LinODEnet. LinODEnet is structured as shown in Algorithm 1. We describe its components in turn.
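As a rough illustration of the loop in Algorithm 1 and of desideratum D2, the Python sketch below runs the schema with deliberately simple stand-ins: a linear system evaluated with `scipy.linalg.expm`, identity encoder and decoder, and a hand-picked idempotent filter that overwrites observed channels. All of these choices, as well as the names `forecast` and `idempotent_filter` and the toy data, are assumptions made for this example; the actual model uses learned encoder, decoder and KalmanCell components.

```python
import numpy as np
from scipy.linalg import expm


def idempotent_filter(x_hat, x_obs):
    """Overwrite observed channels, keep the estimate where x_obs is NaN."""
    mask = ~np.isnan(x_obs)
    residual = np.where(mask, x_hat - x_obs, 0.0)
    return x_hat - residual


def forecast(A, observations, query_times, c):
    """Run the filtering loop and return the state estimates at query_times."""
    nan = np.full_like(c, np.nan)
    # Query times enter the timeline as fully missing observations.
    events = sorted(list(observations) + [(s, nan) for s in query_times],
                    key=lambda event: event[0])
    query = set(query_times)
    z, t_prev, predictions = c.copy(), events[0][0], {}
    for t, x_obs in events:
        z_hat = expm(A * (t - t_prev)) @ z        # System: linear ODE step
        x_hat = z_hat                             # Decoder: identity (assumption)
        x_new = idempotent_filter(x_hat, x_obs)   # Filter
        z = x_new                                 # Encoder: identity (assumption)
        t_prev = t
        if t in query:
            predictions[t] = x_new
    return predictions


A = np.array([[0.0, -1.0], [1.0, 0.0]])           # toy system matrix
c = np.zeros(2)                                   # initial latent state
D = [(0.0, np.array([1.0, np.nan])), (0.5, np.array([np.nan, 0.3]))]
s, t_query = [1.0, 1.5], 2.0

first = forecast(A, D, s + [t_query], c)
D_tilde = [(s_j, first[s_j]) for s_j in s]        # the model's own predictions
second = forecast(A, D + D_tilde, [t_query], c)
print(np.allclose(first[t_query], second[t_query]))
```

Feeding the model its own predictions at the times in `s` leaves the forecast at `t_query` unchanged, because the filter returns the estimate as-is whenever the observation agrees with it and the identity encoder is left-inverse to the identity decoder.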

5.1. SYSTEM COMPONENT

To avoid having to use complicated numerical integrators, LinODEnet uses a simple homogeneous linear ODE with constant coefficients. This has the major advantage that the solution can be expressed in closed form in terms of the matrix exponential.

Definition 2 (Linear ODE). An ODE is called linear if and only if its vector field is an affine function of the state vector, i.e. if it is of the form

$$\dot{x}(t) = A(t) \cdot x(t) + b(t), \qquad x(t_0) = x_0 \tag{6}$$

for some matrix-valued function $A : \mathbb{R} \to \mathbb{R}^{n \times n}$ and vector-valued function $b : \mathbb{R} \to \mathbb{R}^n$. If $A$ and $b$ are constant, we call it a linear ODE with constant coefficients. If $b = 0$, we say it is homogeneous.

Lemma 1 (Solution of Linear ODE). The solution of a homogeneous linear ordinary differential equation with constant coefficients can be expressed in terms of the matrix exponential:

$$\dot{x}(t) = A x(t) \iff x(t + \Delta t) = e^{A \Delta t} x(t)$$

Proof. See for instance Teschl (2012).

In particular, implementations of the matrix exponential are readily available in many popular numerical libraries such as SCIPY (Virtanen et al., 2020), TENSORFLOW (Abadi et al., 2016) or PYTORCH (Paszke et al., 2019). Typical implementations such as the scaling and squaring approaches by Higham (2005) and Al-Mohy & Higham (2009) offer high performance and tight error bounds, establishing desideratum D1. A second advantage of the linear system is the possibility to parametrize or regularize the kernel matrix in order to achieve certain properties. We highlight the initialization with a skew-symmetric matrix as particularly important, and we use it as the default initialization in all experiments.

Lemma 2. If $K$ is skew-symmetric, then $e^K$ is orthogonal and $h(t) = e^{K \cdot t} x$ is forward stable for all $t$ and $x \sim \mathcal{N}(0_n, I_n)$.
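As a small worked instance of Lemma 2, consider the case $n = 2$ with a frequency $\omega > 0$ chosen purely for illustration:

$$K = \omega \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \qquad K^\top = -K, \qquad K^2 = -\omega^2 I_2 .$$

Splitting the exponential series into even and odd powers gives

$$e^{K t} = \cos(\omega t)\, I_2 + \frac{\sin(\omega t)}{\omega}\, K = \begin{pmatrix} \cos \omega t & -\sin \omega t \\ \sin \omega t & \cos \omega t \end{pmatrix},$$

which is a rotation and therefore orthogonal. For $x \sim \mathcal{N}(0_2, I_2)$ it follows that

$$e^{K t} x \sim \mathcal{N}\big(0_2,\; e^{K t} I_2 (e^{K t})^\top\big) = \mathcal{N}(0_2, I_2),$$

so mean and variance are preserved for every $t$. The eigenvalues of $K$ are $\pm i \omega$, i.e. purely imaginary, which corresponds to undamped oscillation of the latent state.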

Algorithm 2 LinODECell

Input: scalar time delta $\Delta t$, latent state $z_t$
Parameters: matrix $K$, zero-initialized scalar $\varepsilon$, parametrization $\psi$
  $z_{t+\Delta t} \leftarrow e^{\varepsilon \cdot \psi(K) \cdot \Delta t} \cdot z_t$
Return: latent state $z_{t+\Delta t}$

Motivated by these properties, we define the LinODECell (Alg. 2). Note that, crucially, in comparison to the GRU-D model, a general latent linear ODE model allows for imaginary eigenvalues of the system matrix, corresponding to oscillatory system behaviour. In the GRU-D model, the authors intentionally restricted the model to a non-positive, real diagonal matrix.
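The update in Algorithm 2 can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions: it uses `scipy.linalg.expm` for the matrix exponential, picks the skew-symmetric projection as one possible parametrization ψ, and uses the hypothetical names `linode_cell` and `skew_symmetric`; the reference implementation may differ in all of these points.

```python
import numpy as np
from scipy.linalg import expm


def skew_symmetric(K):
    """One possible choice of psi (an assumption): project K onto skew-symmetric matrices."""
    return 0.5 * (K - K.T)


def linode_cell(dt, z, K, eps, psi=skew_symmetric):
    """Propagate the latent state z by dt via z <- exp(eps * psi(K) * dt) z."""
    return expm(eps * psi(K) * dt) @ z


rng = np.random.default_rng(0)
K = rng.normal(size=(4, 4))
z = rng.normal(size=4)

z_init = linode_cell(0.7, z, K, eps=0.0)          # eps = 0: the cell is the identity
z_next = linode_cell(0.7, z, K, eps=1.0)          # skew-symmetric generator: a rotation
z_two_steps = linode_cell(0.4, linode_cell(0.3, z, K, eps=1.0), K, eps=1.0)

print(np.allclose(z_init, z))
print(np.isclose(np.linalg.norm(z_next), np.linalg.norm(z)))
print(np.allclose(z_next, z_two_steps))           # one step of 0.7 equals steps of 0.3 and 0.4
```

The last check reflects that the cell represents a dynamical system: propagating by Δt₁ and then by Δt₂ gives the same state as propagating once by Δt₁ + Δt₂, without any step-size control.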

5.2. FILTER COMPONENT

Any state space model must have a way of incorporating new measurements as they come in.

Definition 3. We call a function of the form $f : \mathcal{O} \times \mathcal{X} \to \mathcal{X}$, $(x^{\text{obs}}, \hat{x}) \mapsto \hat{x}'$ a filter. If the observation space $\mathcal{O}$ contains NaN values, we say it allows for missing observations. If the state space $\mathcal{X}$ is equal to the non-missing part of the observation space $\mathcal{O}$, we call it autoregressive. Finally, we say a filter cross-correlates channels if even the observation of just a single channel can potentially update the state estimate in all channels.

One of the big achievements of classical filtering theory is the Kalman Filter (Kalman, 1960), which is the provably optimal state update in terms of squared error loss when the system consists of normally distributed variables evolving according to a linear dynamical system. Assume the state is distributed as $x \sim \mathcal{N}(\mu_t, \Sigma_t)$ at time $t$ and evolves according to a linear dynamical system $\dot{x}_t = A_t x_t + w_t$. Since the family of normal distributions is closed under linear transformations, the state is then normally distributed for all times $t$. Given a noisy measurement $y_t = H_t x_t + v_t$ with $R_t = \mathbb{E}[v_t v_t^\top]$, which is only partially observed according to a mask $m_t$, the optimal state update is (Cipra & Romera, 1997)

$$\mu'_t = \mu_t - \Sigma_t H_t^\top \Pi_t (H_t \Sigma_t H_t^\top + R_t)^{-1} \Pi_t (H_t \mu_t - \tilde{y}_t)$$

$$\Sigma'_t = \Sigma_t - \Sigma_t H_t^\top \Pi_t (H_t \Sigma_t H_t^\top + R_t)^{-1} \Pi_t H_t \Sigma_t$$

where $\Pi_t = \operatorname{diag}(m_t)$ and $\tilde{y}_t$ is $y_t$ with the missing values replaced by arbitrary values. Inspired by this formula, we introduce the linear and non-linear KalmanCell, which can be used as drop-in replacements for regular RNN-, GRU- or LSTMCells.

linear KalmanCell:
$$\hat{x}'_t \leftarrow \hat{x}_t - \alpha B H^\top \Pi_t A \Pi_t (H \hat{x}_t - x^{\text{obs}}_t)$$

non-linear KalmanCell:
$$\hat{x}'_t \leftarrow \hat{x}_t - \varepsilon \phi\big(B H^\top \Pi_t A \Pi_t (H \hat{x}_t - x^{\text{obs}}_t)\big)$$

Algorithm 3 Non-Linear KalmanCell
Input: current state estimate $\hat{x}_t \in \mathbb{R}^n$, observed datapoint $x^{\text{obs}}_t \in \mathbb{R}^m$
Parameters: learnable matrices $A$, $B$, $H$, zero-initialized scalar $\varepsilon$, neural network $\phi$
Options: if autoregressive, $m = n$ and $H = I_n$
  $\Pi_t \leftarrow \operatorname{diag}(\text{not-missing}(x^{\text{obs}}_t))$
  $\hat{x}'_t \leftarrow \hat{x}_t - \varepsilon \phi\big(B H^\top \Pi_t A \Pi_t (H \hat{x}_t - x^{\text{obs}}_t)\big)$
Return: updated state estimate $\hat{x}'_t$

In both cases $A$, $B$, $H$ are learnable weight matrices. In the linear case we introduce the special parametrizations $A = I + \varepsilon_A \tilde{A}$ and $B = I + \varepsilon_B \tilde{B}$. Here $\varepsilon$, $\varepsilon_A$, $\varepsilon_B$ are learnable scalars that are initialized with zero, which ensures forward stability. $\phi$ is an arbitrary neural network with $\phi(0) = 0$. By design the KalmanCell can handle NaN values (for implementation details see 4 and 5), establishing D4.

Lemma 3 (KalmanCell at initialization). At initialization, the non-linear KalmanCell is the identity function. The linear KalmanCell's behaviour depends on the choice of $\alpha$: if $\alpha = 1$, it always updates the state to the last observed value, whereas if $\alpha = 0$ it carries the first observed value through. The choice $\alpha = \tfrac{1}{2}$ corresponds to the classical Kalman Filter (cf. Appendix B.2).

Note that (9) is different from Szirtes et al. (2005), Wilson & Finkel (2009), and the recent pre-prints Millidge et al. (2021) and Revach et al. (2022). Moreover, since the KalmanCells, in contrast to the GRU-D model, use full and not diagonal matrices, D5 is satisfied. It is also possible to stack multiple filters $(f_i)_{i=1 \ldots k}$.

Stacked Filter:
$$\hat{x}^{(i+1)}_t = f_i(x^{\text{obs}}_t, \hat{x}^{(i)}_t) \qquad \text{for } i = 1 \ldots k$$

With regard to desideratum D2, there is a strong relationship to the setup of the filter component.

Definition 4. We say an autoregressive filter $F : \mathcal{X} \times \mathcal{X} \to \mathcal{X}$, $(x^{\text{obs}}, \hat{x}) \mapsto \hat{x}'$ is idempotent if and only if it returns the original state estimate as-is whenever all non-missing observations agree with it:

$$x^{\text{obs}}_i = \hat{x}_i \;\; \forall i : x^{\text{obs}}_i \neq \text{NaN} \implies F(x^{\text{obs}}, \hat{x}) = \hat{x}$$

Lemma 4. If a latent state space model (Alg. 1) is self-consistent, then its filter must be idempotent.

Proof. If this were not the case, then $\hat{y}(t \mid D) \neq \hat{y}(t \mid D \cup \{(t, \hat{y}(t \mid D))\})$.
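The non-linear KalmanCell of Algorithm 3 can be sketched as follows. This is a minimal NumPy illustration rather than the reference implementation: the element-wise `tanh` merely stands in for a network ϕ with ϕ(0) = 0, the matrices are random instead of learned, the autoregressive option H = I is used, and the function name `nonlinear_kalman_cell` is hypothetical.

```python
import numpy as np


def nonlinear_kalman_cell(x_hat, x_obs, A, B, H, eps, phi=np.tanh):
    """x_hat - eps * phi(B H^T Pi A Pi (H x_hat - x_obs)), with Pi = diag(not-missing(x_obs))."""
    mask = ~np.isnan(x_obs)                             # not-missing(x_obs)
    residual = np.where(mask, H @ x_hat - x_obs, 0.0)   # Pi (H x_hat - x_obs), NaNs zeroed out
    correction = B @ H.T @ (mask * (A @ residual))      # B H^T Pi A Pi (...)
    return x_hat - eps * phi(correction)


rng = np.random.default_rng(0)
n = 4
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
H = np.eye(n)                                           # autoregressive option
x_hat = rng.normal(size=n)

# Observation that agrees with the estimate on channels 0 and 2 and is missing elsewhere.
agreeing = x_hat.copy()
agreeing[[1, 3]] = np.nan
unchanged = nonlinear_kalman_cell(x_hat, agreeing, A, B, H, eps=0.5)

# Observation of channel 0 only, differing from the estimate by 1.
surprising = np.full(n, np.nan)
surprising[0] = x_hat[0] + 1.0
updated = nonlinear_kalman_cell(x_hat, surprising, A, B, H, eps=0.5)
at_init = nonlinear_kalman_cell(x_hat, surprising, A, B, H, eps=0.0)

print(np.allclose(unchanged, x_hat))    # idempotent (Definition 4)
print(np.all(updated != x_hat))         # one observed channel moves all channels (D5)
print(np.allclose(at_init, x_hat))      # identity at initialization (Lemma 3)
```

Because the residual is masked before and after multiplication by `A`, missing entries never enter the update, and an observation that matches the estimate produces a zero correction regardless of the values of `A` and `B`.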

5.3. OVERALL MODEL PROPERTIES

Proposition 1. If in Algorithm 1 the system component represents a dynamical system, the filter component is idempotent, and the encoder is left-inverse to the decoder, then the model is self-consistent.

Corollary 1. LinODEnet is self-consistent at initialization. If the encoder is left-inverse to the decoder, it is self-consistent throughout training, establishing desideratum D2.

The proofs can be found in Appendix B. Table 2 summarizes the properties of LinODEnet in comparison to other models.

✓ ✓ ✓ ✓ ✓
continuous time ✓ ✗ ✓ ✓ ✓ ✓
global existence <NA> ✓ ✓ ✗ ✓ ✓
self-consistency ✗ ✗ ✗ ✗ ✗ ✓
forward stability ✗ ✗ ✗ ✗ ✗ ✓
coupled channels <NA> ✗ ✗ ✓ ✓ ✓

Remark 1 (LinODEnet with hidden state). Since LinODEnet can parse NaN values, we consider a small modification consisting of concatenating a number of dummy channels, completely filled with NaN values, to all input data. This allows the model to have a working memory in case the state space does not capture the full dynamics.

6. EMPIRICAL EVALUATION

6.1. EXPERIMENTS ON IRREGULAR TIME SERIES

While there are many publications dealing with irregular time series for classification or imputation tasks, few approach irregularly sampled time series forecasting natively. We identify GRU-ODE-Bayes (De Brouwer et al., 2019) and Neural Flow (Biloš et al., 2021). Irregular time series occur naturally whenever collections of sensor devices sample data independently, sometimes at vastly different rates. This is for example the case with clinical records such as the MIMIC-III (Johnson et al., 2016) and MIMIC-IV (Johnson et al., 2021) datasets or with weather observations such as the USHCN dataset (Menne et al., 2015). We use the same data pre-processing and evaluation protocol as our baselines De Brouwer et al. (2019) and Biloš et al. (2021). The task is to predict the next 3 samples using an observation horizon of 36 hours. Table 3 shows that LinODEnet outperforms all baselines by a significant margin. We also observe that on the USHCN dataset, adding hidden channels (Remark 1) gives an additional lift. We suspect that this is due to the number of channels being very small (5) for this dataset.

Training Details. We use the ADAMW optimizer (Loshchilov & Hutter, 2018), a variant of the popular ADAM optimizer (Kingma & Ba, 2015) that provides a correction when using weight decay. For the filter we use a stack of a linear KalmanCell and two nonlinear KalmanCells. The full hyperparameter selection is in Appendix E.2.

Reproducibility. We created two pip-installable Python packages: tsdm provides utilities for dataset acquisition, pre-processing and a library of encoders. A reference implementation of the model in PYTORCH is available as the package linodenet. The experimental code is in a separate repository.

During training, we consistently observed the emergence of correlation between the rows and columns of the system component's kernel matrix (Figure 4). This indicates that the matrix gets close to a low-rank matrix.
Since the rank itself cannot be computed in a numerically stable manner, we considered a smoothed relaxation known as the effective rank (Roy & Vetterli, 2007).

Definition 5. The effective rank of a matrix is defined as the exponential of the entropy, $\operatorname{erank}(A) = e^{H(p)}$, where $p = \sigma / \|\sigma\|_1$ is the discrete probability distribution given by normalizing the singular values $\sigma$ of $A$.
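Definition 5 translates directly into a few lines of NumPy. The sketch below is an illustration, not the code used in the experiments; the function name `effective_rank` is hypothetical, the convention 0 · log 0 = 0 is applied to zero singular values, and a nonzero input matrix is assumed.

```python
import numpy as np


def effective_rank(A):
    """exp(H(p)) with p the singular values of A normalized to sum to one."""
    sigma = np.linalg.svd(A, compute_uv=False)
    p = sigma / sigma.sum()                 # assumes A is not the zero matrix
    p = p[p > 0]                            # convention: 0 * log(0) = 0
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


rng = np.random.default_rng(0)
identity = np.eye(4)                                          # all singular values equal
rank_one = np.outer(rng.normal(size=5), rng.normal(size=5))   # a single dominant direction

print(effective_rank(identity))   # close to 4
print(effective_rank(rank_one))   # close to 1
```

Unlike the integer rank, this quantity varies continuously with the singular values, which makes it suitable for tracking how the kernel matrix drifts towards a low-rank matrix over the course of training.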

6.2. OBSERVATIONS

Figure 3 shows the evolution of the spectrum of the kernel matrix for a sample run on the USHCN dataset. One can see that the eigenvalues stay close to being purely imaginary. We speculate that this is because the main dynamics are essentially periodic in nature, as weather patterns repeat over time.

7. CONCLUSIONS

We propose a novel forecasting model for irregularly sampled time series with missing values that maps the observation space to a latent space with constant linear ODE dynamics and performs state estimation by an update rule inspired by the Kalman filter. For solving the linear ODE we do not need numerical integration but just matrix exponentials, for which stable and efficient implementations exist. Forward stability of the model at initialization is guaranteed. The model is evaluated on most of the existing forecasting benchmarks for irregularly sampled time series and improves on existing models by a considerable margin. The model opens a way for interesting future work: it naturally allows for future covariates in the forecasting problem, and the linear representation of the dynamics allows modification and analysis by means of linear algebra.



where $\text{not-missing}(x^{\text{obs}}_i)$ yields the indicator vector for $x^{\text{obs}}_{i,m} \neq \text{NaN}$.

4. LATENT STATE SPACE MODELS

First, consider a general class of latent state space models that follow the schema from Algorithm 1. That is, the model has a state estimate $\hat{x}_t$ which is encoded into a latent state $\hat{z}_t$ and propagated



); Skorski et al. (2021); De & Smith (2020); Zhang et al. (2018); Balduzzi et al. (2017); Shang et al. (2017) ReZero:

Figure 2

Figure 3: Evolution of the kernel spectrum.

Figure 4: Evolution of the kernel values.

is closest in spirit to our model: it proposes a discrete-time linear Gaussian model and an invertible observation map given by a normalizing flow. Like ours, the model can cope with missing values, but it is not suited to irregularly sampled time series. KalmanNet (Revach et al., 2022) is another model for regularly sampled time series that uses an RNN to calculate the Kalman gain. It is meant to present a robust alternative to other approaches of nonlinear Kalman filtering (extended Kalman filter) and not a general time series forecasting algorithm. Some other work on Kalman filtering by neural networks

Comparison of Continuous Time Latent State Space Models

Thus, fitting generic Neural ODEs is a challenging task, since during training the ODE can become stiff, which forces adaptive step-size integrators to take minuscule time steps. To this end, several remedies are available: Ghosh et al. (2020) and Finlay et al. (2020) propose temporal regularization terms, and many models implicitly address this issue by choosing a special form of vector field. An overview is in Table 1. For example, GRU-ODE-Bayes uses a GRUCell with a tanh activation function that induces global Lipschitz continuity, which increases stability.

D2. Self-consistency. All of the proposed models use different filter mechanisms to update the state estimate when new observations are recorded. However, in our analysis we noted that a classical Kalman Filter (Kalman, 1960) satisfies a self-consistency property that none of the published Neural ODE models seems to incorporate.

We say a point-forecasting model is self-consistent if and only if the model does not change its forecast when observing its own predictions. More specifically, given a forecasting model $\hat{y}(t \mid D)$, where $D = \{(t_i, x_i) \mid i = 1 \ldots n\}$ is the set of observations, $\hat{y}$ is self-consistent if and only if for all finite sets of generated predictions

Comparison of forecasting model features (†: transformer model)

Average MSE and standard deviation across 5 cross-validation folds. †: results reported by De Brouwer et al. (2019). ‡: results reported by Biloš et al. (2021).

