ANAMNESIC NEURAL DIFFERENTIAL EQUATIONS WITH ORTHOGONAL POLYNOMIAL PROJECTIONS

Abstract

Neural ordinary differential equations (Neural ODEs) are an effective framework for learning dynamical systems from irregularly sampled time series data. These models provide a continuous-time latent representation of the underlying dynamical system in which new observations at arbitrary time points can be used to update the latent state. However, existing parameterizations of the dynamics function limit the ability of the model to retain global information about the time series: in particular, piece-wise integration of the latent process between observations can result in a loss of memory of the dynamic patterns of previously observed data points. We propose PolyODE, a Neural ODE that models the latent continuous-time process as a projection onto a basis of orthogonal polynomials. This formulation enforces long-range memory and preserves a global representation of the underlying dynamical system. Our construction is backed by favourable theoretical guarantees, and in a series of experiments we demonstrate that it outperforms previous works in the reconstruction of past and future data and in downstream prediction tasks. Our code is available at https://github.com/edebrouwer/polyode.

1. INTRODUCTION

Time series are ubiquitous in many fields of science and, as such, represent an important but challenging data modality for machine learning. Their temporal nature, along with their potentially high dimensionality, makes them arduous to manipulate as mathematical objects. A long-standing line of research has thus focused on learning informative time series representations, such as simple vectors, that capture local and global structure in the data (Franceschi et al., 2019; Gu et al., 2020). Such architectures include recurrent neural networks (Malhotra et al., 2017), temporal transformers (Zhou et al., 2021) and neural ordinary differential equations (neural ODEs) (Chen et al., 2018). In particular, neural ODEs have emerged as a popular choice for time series modelling due to their sequential nature and their ability to handle irregularly sampled time series data. By positing an underlying continuous-time dynamical process, neural ODEs sequentially process irregularly sampled time series via piece-wise numerical integration of the dynamics between observations. The flexibility of this model family arises from the use of neural networks to parameterize the temporal derivative, and different choices of parameterization lead to different properties; for instance, bounding the output of the neural network can enforce Lipschitz constants on the temporal process (Onken et al., 2021). The problem this work tackles is that the piece-wise integration of the latent process between observations can fail to retain a global representation of the time series. Specifically, each change to the hidden state of the dynamical system from a new observation can result in a loss of memory about the prior dynamical states the model was in. This pathology limits the utility of neural ODEs when information about the recent and distant past must be retained; i.e., current neural ODE formulations are amnesic.
We illustrate this effect in Figure 1, where we see that backward integration of a learned neural ODE (one that is competent at forecasting) quickly diverges, indicating that the state retains only the local information needed for the future dynamics.


Figure 1: Illustration of the ability of PolyODE to reconstruct past trajectories. The solid lines show the forecast trajectories conditioned on past observations for NODE (blue) and PolyODE (red). The dotted lines show the backward reconstruction of the past trajectories conditioned on the latent process at the last observation. PolyODE accurately reconstructs the past trajectories while NODE quickly diverges; PolyODE is also more accurate in terms of forecasting.

One strategy that has been explored in the past to address this pathology is to regularize the model to capture long-range patterns by reconstructing the time series from the last observation, using an auto-encoder architecture (Rubanova et al., 2019). This class of approaches results in higher complexity and does not provide any guarantees on the retention of the history of a time series. In contrast, our work proposes an alternative parameterization of the dynamics function that captures long-range memory within a neural ODE by design. Inspired by the recent successes of the HiPPO framework (Gu et al., 2020), we achieve this by enforcing that the dynamics of the hidden process follow the dynamics of the projection of the observed temporal process onto a basis of orthogonal polynomials. The resulting model, PolyODE, is a new neural ODE architecture that encodes long-range past information in the latent process and is thus anamnesic. As depicted in Figure 1, the resulting time series embeddings are able to reconstruct the past of the time series with significantly better accuracy.

Contributions. (1) We propose a novel dynamics function for neural ODEs, resulting in PolyODE, a model that learns a global representation of high-dimensional time series and is capable of long-term forecasting and reconstruction by design; PolyODE is the first investigation of the potential of the HiPPO operator for neural ODE architectures. (2) Methodologically, we highlight the practical challenges in learning PolyODE and show how adaptive ODE solvers can overcome them; theoretically, we provide bounds characterizing the quality of the time series reconstruction obtained with PolyODE. (3) Empirically, we assess the ability of the learnt embeddings to reconstruct the past of the time series and study their utility as inputs for downstream predictive tasks. We show that our model provides better time series representations than several existing neural ODE architectures, as measured by the accuracy of downstream predictions on chaotic time series and on irregularly sampled data from patients in the intensive care unit.

2. RELATED WORK

Time series modelling in machine learning:

There is a vast literature on the use of machine learning for time series modelling, and we highlight some of the ideas that have been explored to adapt diverse kinds of models to irregular time series data. Although not naturally well suited to learning representations of such data, discrete-time models such as recurrent neural networks (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) have been modified to handle them. Models such as mTANs (Shukla and Marlin, 2021) leverage an attention-based approach to interpolate sequences, creating discrete-time data from irregularly sampled data. Another strategy has been architectural modifications to the recurrence equations, e.g. CT-GRU (Mozer et al., 2017), GRU-D (Che et al., 2018) and Unitary RNNs (Arjovsky et al., 2016). Much more closely aligned with our work, and a natural fit for irregularly sampled data, is research that uses differential equations to model continuous-time processes (Chen et al., 2018). By parameterizing the derivative of a time series with neural networks and integrating the dynamics over unobserved time points, this class of models is well suited to irregularly sampled data; it includes ODE-RNN (Rubanova et al., 2019), ODE-LSTM (Lechner and Hasani, 2020) and Neural CDEs (Kidger et al., 2020). ODE-based approaches require differential equation solvers during training and inference, which can come at the cost of runtime (Shukla and Marlin, 2021). PolyODE lies in this family of models; specifically, this work proposes a new parameterization of the dynamics function and a practical method for learning it, enabling this model family to accurately forecast the future and reconstruct the past, greatly enhancing the scope and utility of the learned embeddings.

Orthogonal polynomials: PolyODE is inspired by a rich line of work on orthogonal decompositions of time series data.
Orthogonal polynomials have been a mainstay in the toolkits of engineering (Heuberger et al., 2003) and uncertainty quantification (Li et al., 2011). In the context of machine learning, the limitations of RNNs in retaining long-term memory have been studied empirically and theoretically (Zhao et al., 2020); indeed, the GRU (Chung et al., 2014) and LSTM (Graves et al., 2007) architectures were created in part to improve the long-term memory of such models. Recent approaches for discrete-time models have used orthogonal polynomials for their ability to represent temporal processes in a memory-efficient manner: the Legendre Memory Unit (Voelker et al., 2019) and the Fourier Recurrent Unit can be seen as projections of the data onto Legendre polynomials and the Fourier basis, respectively. Our method builds upon and is inspired by the HiPPO framework, which defines an operator to compute the coefficients of the projection onto a basis of orthogonal polynomials; HiPPO-RNN and S4 are the most prominent examples of architectures building upon that framework (Gu et al., 2020; 2021). These models rely on a linear interpolation of the data in between observations, which can lead to a decrease in performance when the sampling rate of the input process is low. Furthermore, HiPPO-RNN and S4 perform the orthogonal polynomial projection of a non-invertible representation of the input data, and therefore do not enforce reconstruction in the observation space by design. Their design choices are motivated by the goal of efficient mechanisms for capturing long-term dependencies for a target task (such as trajectory classification). In contrast, this work explores the abilities of the HiPPO operator for representation learning of irregular time series, when the downstream task is not known in advance.
Despite attempts to improve the computational performance of learning from long-term sequences (Morrill et al., 2021), to our knowledge, PolyODE is the first work that investigates the advantages of the HiPPO operator in the context of memory retention for continuous-time architectures.

3. BACKGROUND

Orthogonal Polynomial Projections: Orthogonal polynomials are defined with respect to a measure $\mu$ as a sequence of polynomials $\{P_0(s), P_1(s), \ldots\}$ such that $\deg(P_i) = i$ and
$$\langle P_n, P_m \rangle = \int P_n(s) P_m(s) \, d\mu(s) = \delta_{nm} \, \alpha_n, \quad (1)$$
where the $\alpha_n$ are normalizing scalars and $\delta$ is the Kronecker delta. For simplicity, we consider only measures that are absolutely continuous with respect to the Lebesgue measure, so that there exists a weight function $\omega(\cdot)$ with $d\mu(s) = \omega(s)\,ds$. The measure $\mu$ determines the class of polynomials obtained from the conditions above (Eq. 1); examples include the Legendre, Hermite and Laguerre classes of orthogonal polynomials. The measure $\mu$ also defines an inner product $\langle \cdot, \cdot \rangle_\mu$ such that the orthogonal projection of a one-dimensional continuous process $f(\cdot) : \mathbb{R} \to \mathbb{R}$ onto the space of polynomials of degree $N$, $\mathcal{P}_N$, is given by
$$f_N(t) = \sum_{n=0}^{N} c_n P_n(t) \frac{1}{\alpha_n} \quad \text{with} \quad c_n = \langle f, P_n \rangle_\mu = \int f(s) P_n(s) \, d\mu(s). \quad (2)$$
This projection minimizes the distance $\| f - p \|_\mu$ over all $p \in \mathcal{P}_N$ and is thus optimal with respect to the measure $\mu$. One can therefore encode a process $f$ by storing its projection coefficients $\{c_0, \ldots, c_N\}$. We write the vector of coefficients up to degree $N$ as $c$ (the degree $N$ is omitted), with $c_i$ its $i$-th entry. Intuitively, the measure assigns different weights to different times of the process and thus allows modulating the importance of different parts of the input signal for the reconstruction.
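To make the projection concrete, the sketch below (an illustration we provide, not the paper's released code) projects a function onto the Legendre basis on $[-1, 1]$, for which $\omega(s) = 1$ and $\alpha_n = 2/(2n+1)$, computing the inner products with Gauss-Legendre quadrature:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coeffs(f, N, num_quad=64):
    """Coefficients c_n = <f, P_n> / alpha_n of f on [-1, 1], with
    alpha_n = ||P_n||^2 = 2 / (2n + 1), via Gauss-Legendre quadrature."""
    s, w = legendre.leggauss(num_quad)
    return np.array([
        np.sum(w * f(s) * legendre.Legendre.basis(n)(s)) * (2 * n + 1) / 2.0
        for n in range(N + 1)
    ])

def reconstruct(coeffs, s):
    """Evaluate f_N(s) = sum_n c_n P_n(s)."""
    return legendre.Legendre(coeffs)(s)

# A cubic is represented exactly as soon as N >= 3:
f = lambda s: s**3 - 0.5 * s + 0.2
c = legendre_coeffs(f, N=5)
grid = np.linspace(-1.0, 1.0, 101)
err = np.max(np.abs(reconstruct(c, grid) - f(grid)))  # ~ machine precision
```

Storing only the $N + 1$ scalars in `c` suffices to reproduce $f$ on the whole interval, which is the memory mechanism PolyODE builds on.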

Continuous update of approximation coefficients:

The projection of a process $f$ onto a basis of orthogonal polynomials provides an optimal representation for reconstruction. However, there is often a need to update this representation continuously as new observations of the process $f$ become available. Let $f_{<t}$ be the temporal process observed up until time $t$. We wish to compute the coefficients of this process at different times $t$. For this purpose, we can define a time-varying measure $\mu^t$ and corresponding weight function $\omega_t$ that incorporate our requirements in terms of reconstruction abilities over time. For instance, if one cares about reconstruction of the process over the last $\theta$ temporal units, one can use the time-varying weight function $\omega_t(s) = \mathbb{I}[s \in (t - \theta, t)]$. This time-varying weight function induces a time-varying basis of orthogonal polynomials $P_n^t$ for $n = 0, \ldots, N$. We define the time-varying orthogonal projection and its coefficients $c_n(t)$ as
$$f_{<t} \approx f_{<t,N} = \sum_{n=0}^{N} c_n(t) P_n^t \frac{1}{\alpha_n^t} \quad \text{with} \quad c_n(t) = \langle f_{<t}, P_n^t \rangle_{\mu^t} = \int f_{<t}(s) P_n^t(s) \, d\mu^t(s). \quad (3)$$

Dynamics of the projection coefficients: Computing the coefficients of the projection from scratch at each time step would be computationally wasteful and would require storing the whole time series in memory, going against the principle of sequential updates. Instead, we can leverage the fact that the coefficients evolve according to known linear dynamics over time. Remarkably, for a wide range of time-varying measures $\mu^t$, Gu et al. (2020) show that the coefficients $c(t) = (c_0(t), \ldots, c_N(t))$ follow
$$\frac{dc_n(t)}{dt} = \frac{d}{dt} \int f_{<t}(s) P_n^t(s) \, d\mu^t(s), \;\forall n \leq N \quad \Longrightarrow \quad \frac{dc(t)}{dt} = A_\mu c(t) + B_\mu f(t), \quad (4)$$
where $A_\mu$ and $B_\mu$ are fixed matrices (for completeness, we provide a derivation of this relation for the translated Legendre measure in Appendix A). We use the translated Legendre measure in all our experiments. Using the dynamics of Eq.
4, it is possible to update the coefficients of the projection sequentially using only the new incoming sample $f(t)$, while retaining the desired reconstruction abilities. Gu et al. (2020) use a discretization of the above dynamics to model discrete-time sequential data via a recurrent neural network architecture. Specifically, their architecture projects the hidden representation of an RNN onto a single time series that is in turn projected onto a polynomial basis. Our approach differs in two ways. First, we work with a continuous-time model. Second, we jointly model the evolution of a $d$-dimensional time-varying process as an overparameterized hidden representation that uses orthogonal projections as memory banks. The resulting model is a new neural ODE architecture, as we detail below.
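For the translated Legendre measure, the matrices $A_\mu$ and $B_\mu$ have a known closed form. The sketch below follows the LegT/LMU convention of Voelker et al. (2019); sign and scaling conventions differ across references, so treat the exact entries as an assumption rather than a transcription of Appendix A. It advances Eq. 4 with an implicit Euler step, anticipating the stiffness discussed in Section 4.2:

```python
import numpy as np

def legt_matrices(N, theta=1.0):
    """Translated-Legendre (LegT / LMU) matrices for dc/dt = A c + B f,
    which track the Legendre coefficients of f over a sliding window of
    width theta (convention of the LMU paper)."""
    A = np.zeros((N, N))
    B = np.zeros(N)
    for i in range(N):
        B[i] = (2 * i + 1) * (-1.0) ** i / theta
        for j in range(N):
            if i < j:
                A[i, j] = -(2 * i + 1) / theta
            else:
                A[i, j] = (2 * i + 1) * (-1.0) ** (i - j + 1) / theta
    return A, B

def backward_euler_step(c, f_next, A, B, h):
    """Implicit step c_{k+1} = (I - h A)^{-1} (c_k + h B f_{k+1});
    stable even for stiff A, unlike explicit Euler."""
    return np.linalg.solve(np.eye(len(c)) - h * A, c + h * B * f_next)

# Sanity check: feeding the constant signal f = 1, the window content is
# the constant function, whose Legendre expansion is c = (1, 0, ..., 0).
A, B = legt_matrices(N=6, theta=1.0)
c = np.zeros(6)
for _ in range(4000):
    c = backward_euler_step(c, 1.0, A, B, h=0.05)
```

In PolyODE, one such coefficient system is maintained per input dimension, driven by the model's own prediction of that dimension rather than by a raw interpolation of the data.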

4. METHODOLOGY

Problem Setup. We consider a collection of sequences of temporal observations $x = \{(x_i, m_i, t_i) : i \in \{1, \ldots, T\}\}$ consisting of time-stamped observations and masks ($x_i \in \mathbb{R}^d$, $m_i \in \{0,1\}^d$, $t_i \in \mathbb{R}$). We write $x_{i,j}$ and $m_{i,j}$ for the value of the $j$-th dimension of $x_i$ and $m_i$ respectively. The mask $m_i$ encodes the presence of each dimension at a specific time point: $m_{i,j} = 1$ if $x_{i,j}$ is observed and $m_{i,j} = 0$ otherwise. The number of observations $T$ can vary across sequences. We define the set of sequences as $\mathcal{S}$ and the distance between two time series observed at the same times as $d(x, x') = \frac{1}{T} \sum_{i=1}^{T} \| x_i - x'_i \|_2$. Our goal is to embed a sequence $x$ into a vector $h \in \mathbb{R}^{d_h}$ such that (1) $h$ retains a maximal amount of the information contained in $x$ and (2) $h$ is informative for downstream prediction tasks. We formalize both objectives below.

Definition (Reverse reconstruction). Given an embedding $h_t$ of a time series $x$ at time $t$, we define the reverse reconstruction $\hat{x}_{<t}$ as the predicted values of the time series at times prior to $t$. We write the observed time series prior to $t$ as $x_{<t}$.

Objective 1 (Long memory representation). Let $h_t$ and $h'_t$ be two embeddings of the same time series $x$, and let $\hat{x}_{<t}$ and $\hat{x}'_{<t}$ be their reverse reconstructions. We say that $h_t$ enjoys more memory than $h'_t$ if $d(\hat{x}_{<t}, x_{<t}) < d(\hat{x}'_{<t}, x_{<t})$.

Objective 2 (Downstream task performance). Let $y \in \mathbb{R}^{d_y}$ be an auxiliary vector drawn from an unknown distribution depending on $x$. Let $\hat{y}(x)$ and $\hat{y}'(x)$ be the predictions obtained from embeddings $h_t$ and $h'_t$. For a performance metric $\alpha : \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \to \mathbb{R}$, we say that $h_t$ is more informative than $h'_t$ if $\mathbb{E}_{x,y}[\alpha(\hat{y}(x), y)] > \mathbb{E}_{x,y}[\alpha(\hat{y}'(x), y)]$.
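As a small illustration of the notation (with hypothetical helper names of our own, not taken from the paper's codebase), the distance $d(x, x')$ for two series observed at the same $T$ time points can be written as:

```python
import numpy as np

def distance(x, x_prime):
    """d(x, x') = (1/T) * sum_i ||x_i - x'_i||_2 for arrays of shape (T, d).
    (We read the norm as the Euclidean norm; the squared-norm reading would
    differ only by squaring each summand.)"""
    return np.linalg.norm(x - x_prime, axis=1).mean()

# Two toy 2-dimensional series with T = 2 observations:
x       = np.array([[0.0, 1.0], [2.0, 0.0]])
x_prime = np.array([[3.0, 5.0], [2.0, 0.0]])
# ||(-3, -4)||_2 = 5 and ||(0, 0)||_2 = 0, so d = (5 + 0) / 2 = 2.5
d_val = distance(x, x_prime)
```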

4.1. POLYODE: ANAMNESIC NEURAL ODES

We assume that the observed time series $x$ comes from an unknown but continuous temporal process $x(t)$. Given $h(t) \in \mathbb{R}^{d_h}$ and a read-out function $g : \mathbb{R}^{d_h} \to \mathbb{R}^d$, we posit the following generative process for the data:
$$x(t) = g(h(t)), \qquad \frac{dh(t)}{dt} = \phi(h(t)), \quad (5)$$
where part of $\phi(\cdot)$ is parametrized by a neural network $\phi_\theta(\cdot)$. Augmenting the state space is a known technique to improve the expressivity of Neural ODEs (Dupont et al., 2019). Here, to ensure that the hidden representation of our model has the capacity to retain long-term memory, we augment the state space with the dynamics of the orthogonal polynomial coefficients described in Equation 4. Similarly to classical filtering architectures (e.g. Kalman filters and ODE-RNN (Rubanova et al., 2019)), PolyODE alternates between two regimes: an integration step (in between observations) and an update step (at observation times), described below. We structure the hidden state as $h(t) = [h_0(t), h_1(t), \ldots, h_d(t)]$, where $h_0(t) \in \mathbb{R}^d$ has the same dimension as the input process $x$, each $h_i(t) \in \mathbb{R}^N$ ($i = 1, \ldots, d$) has the same dimension as the vector of projection coefficients $c^i(t)$, and $[\cdot, \cdot]$ is the concatenation operator. We define the read-out functions $g_i(\cdot) : \mathbb{R}^{(N+1)d} \to \mathbb{R}$ as $g_i(h(t)) = h_0(t)_i$; that is, $g_i$ is fixed and returns the $i$-th entry of $h_0(t)$. This leads to the following system of ODEs characterizing the evolution of $h(t)$:

Integration Step.
$$\frac{dc^j(t)}{dt} = A_\mu c^j(t) + B_\mu g_j(h(t)), \;\; j = 1, \ldots, d, \qquad \frac{dh(t)}{dt} = \phi_\theta(h(t)). \quad (6)$$
This parametrization allows learning arbitrarily complex dynamics for the temporal process $x$: we define a sub-system of projection-coefficient updates for each dimension of the input temporal process $x(t) \in \mathbb{R}^d$.
This sub-system is equivalent to Equation 4, where the input process has been substituted by the prediction from the hidden process $h(t)$ through the mapping $g_j(\cdot)$. The hidden process $h_0(t)$ acts similarly to the state of a classical Neural ODE, while the coefficients $c(t)$ capture long-range information about the observed time series. During the integration step, we integrate both the hidden process $h(t)$ and the coefficients $c(t)$ forward in time using the system of Equation 6. At each time step, we can provide an estimate of the time series conditioned on the hidden process: $\hat{x}(t) = g(h(t))$. The coefficients $c(t)$ are influenced by the values of $h(t)$ through $h_0(t)$ only. The process $h_0(t)$ provides the signal that is memorized by projection onto the orthogonal polynomial basis. The coefficients $c(t)$ serve as memory banks and do not influence the dynamics of $h(t)$ during the integration step.
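A minimal sketch of the integration-step vector field of Eq. 6, with an untrained random network standing in for $\phi_\theta$ and placeholder $(A, B)$ matrices (any valid translated-Legendre pair could be substituted), assuming a flat state layout $h = [h_0, h_1, \ldots, h_d]$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 4                  # input dimensions; coefficients per dimension
dim_h = (N + 1) * d          # h = [h_0 (d,), h_1 (N,), ..., h_d (N,)]

# Stand-in for the learned dynamics phi_theta: a one-layer tanh network
# with random, untrained weights (purely illustrative).
W1 = 0.1 * rng.normal(size=(16, dim_h))
W2 = 0.1 * rng.normal(size=(dim_h, 16))
phi_theta = lambda h: W2 @ np.tanh(W1 @ h)

# Placeholder memory matrices; a real model would use the LegT A_mu, B_mu.
A, B = -np.eye(N), np.ones(N)

def rhs(h, c):
    """Integration step (Eq. 6): dh/dt = phi_theta(h) for the full hidden
    state, and dc^j/dt = A c^j + B g_j(h) with readout g_j(h) = h_0[j]."""
    dh = phi_theta(h)
    dc = c @ A.T + np.outer(h[:d], B)   # row j holds A c^j + B h_0[j]
    return dh, dc

h, c = rng.normal(size=dim_h), np.zeros((d, N))
dh, dc = rhs(h, c)
```

Note that the memory banks `c` are driven only through the readout of $h_0$, and they feed back into $h$ only at update steps, exactly as described above.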


Figure 2: PolyODE time series embedding process. The model processes the time series sequentially by alternating between integration steps (between observations) and update steps (when observations are collected). Informative embeddings should allow for (1) reconstructing the past of the time series (reverse reconstruction, in red), (2) forecasting the future of the sequence (forward prediction, in blue) and (3) being informative for downstream predictions (in green).

The system of equations in Eq. 6 characterises the dynamics in between observations. When a new observation becomes available, we update the system as follows.

Update Step. At time $t = t_i$, after observing $x_i$ with mask $m_i$, we set
$$h_j(t_i) := c^j(t_i), \qquad h_0(t_i)_j := x_{i,j}, \qquad \forall j \text{ s.t. } m_{i,j} = 1. \quad (7)$$
The update step incorporates new observations into the hidden representation of the system. It proceeds by (1) reinitializing the hidden states of the system with the orthogonal polynomial projection coefficients $c(t)$: $h_j(t_i) := c^j(t_i)$; and (2) resetting $h_0(t)$ to the newly collected observation: $h_0(t_i)_j := x_{i,j}$.

Remarks: Our model blends orthogonal polynomials with the flexibility that Neural ODEs offer in modelling the observations. The consequence is that while the coefficients serve as memory banks for each dimension of the time series, the Neural ODE over $h_0(t)$ can be used to forecast from the model. That said, we acknowledge that a significant limitation of our current design is the need for the hidden state to track $N$ coefficients for each time series dimension. Given that many adjacent time series dimensions may be correlated, we anticipate that methods to reduce the space footprint of the coefficients are fertile ground for future work.
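The update step of Eq. 7 is a plain state overwrite. A sketch, again assuming the flat layout $h = [h_0, h_1, \ldots, h_d]$ (the layout itself is our illustrative choice):

```python
import numpy as np

def update_step(h, c, x_i, m_i, d, N):
    """PolyODE update (Eq. 7): for every observed dimension j (m_i[j] == 1),
    copy the memory bank c^j into the slot h_j and reset h_0[j] to x_{i,j};
    unobserved dimensions are left untouched."""
    h = h.copy()
    for j in range(d):
        if m_i[j] == 1:
            h[d + j * N : d + (j + 1) * N] = c[j]   # h_j(t_i) := c^j(t_i)
            h[j] = x_i[j]                           # h_0(t_i)_j := x_{i,j}
    return h

d, N = 2, 3
h = np.zeros(d + d * N)
c = np.arange(d * N, dtype=float).reshape(d, N)     # c^1 = [0,1,2], c^2 = [3,4,5]
x_i, m_i = np.array([7.0, 9.0]), np.array([1, 0])   # only dimension 0 observed
h_new = update_step(h, c, x_i, m_i, d, N)
```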

4.2. TRAINING

We train this architecture by minimizing the reconstruction error between the predictions and the observations: $\mathcal{L} = \sum_{i=1}^{T} \| \hat{x}(t_i) - x_i \|_2^2$. We initialize the hidden processes as $c(0) = 0$ and $h(0) = 0$, though they can be initialized from static information $b$, if available (e.g. $h(0) = \psi_\theta(b)$ for a learned embedding network $\psi_\theta$). We then alternate between integration steps between observations and update steps at observation times, updating the loss at each observation time $t_i$. A pseudo-code description of the overall procedure is given in Algorithm 1.

Numerical integration. We integrate the system of differential equations of Equation 6 using differentiable numerical solvers, as introduced in Chen et al. (2018). However, one technical challenge in learning PolyODE is that the dynamical system of Equation 6 is relatively stiff, and integrating it with acceptable precision using explicit solvers would lead to prohibitive computation. To deal with this instability, we use an implicit solver such as backward Euler or Adams-Moulton for the numerical integration (Sauer, 2011). A comparison of numerical integration schemes and an analysis of the stability of the ODE are available in Appendix I.

Algorithm 1: PolyODE training.
Data: sequence $x$, matrices $A_\mu$, $B_\mu$, number of dimensions $d$, number of observations $T$, number of polynomial coefficients $N$
Result: training loss $\mathcal{L}$ over a whole sequence $x$
$t^* \leftarrow 0$; initialize $h_j(0) = c^j(0) = 0_N$ for $j = 1, \ldots, d$; $\mathcal{L} \leftarrow 0$
for $i \leftarrow 1$ to $T$ do
  Integrate $c^{1,\ldots,d}(t)$ and $h_{0,\ldots,d}(t)$ from $t = t^*$ until $t = t_i$
  $\hat{x}_i \leftarrow h_0(t^*)$
  Update $c^{1,\ldots,d}(t_i)$ and $h_{0,\ldots,d}(t_i)$ with $x_i$, $m_i$
  $\mathcal{L} \leftarrow \mathcal{L} + \| (\hat{x}_i - x_i) \odot m_i \|_2^2$
  $t^* \leftarrow t_i$
end

Forecasting: From time $t$, we forecast the time series at an arbitrary time $t^*$ as
$$\hat{x}_{>t}(t^*) = g\left(h(t) + \int_t^{t^*} \phi_\theta(h(s)) \, ds\right), \quad (8)$$
where $\phi_\theta(\cdot)$ is the learned dynamics model used in the integration step and introduced in Eq. 5.

Reverse Reconstruction: Using Equation 3, we can compute the reverse reconstruction of the time series at any time $t$ from the projection-coefficient part of the hidden process:
$$\hat{x}_{<t,j} = \sum_{n=0}^{N} c_n^j(t) \, P_n^t \, \frac{1}{\alpha_n^t}. \quad (9)$$
More details about this reconstruction process and its differences with respect to classical NODEs are available in Appendix E. The error between the prediction obtained during the integration step, $\hat{x}(t)$, and the above reconstruction estimator is bounded above, as Result 4.1 shows.

Result 4.1. For a shifted rectangular weighting function with width $\theta$, $\omega_t(x) = \frac{1}{\theta} \mathbb{I}_{[t-\theta, t]}$ (which generates the Legendre polynomials), the mean squared error between the forward prediction ($\hat{x}$) and the reverse prediction ($\hat{x}_{<t}$) at each time $t$ is bounded by
$$\| \hat{x} - \hat{x}_{<t} \|^2_{\mu^t} \;\leq\; C_0 \, \frac{\theta^2 L^2 (K+1)^2}{N(2N-1)} \;+\; C_1 \, L (K+1) \, S_K \, \zeta\!\left(\tfrac{3}{2}, N\right) \;+\; C_2 \, S_K^2 \, \zeta\!\left(\tfrac{3}{2}, N\right),$$
where $K$ is the number of observations in the interval $[t - \theta, t]$, $L$ is the Lipschitz constant of the forward process, $N$ is the degree of the polynomial approximation and $\zeta(\cdot, \cdot)$ is the Hurwitz zeta function. $S_K = \sum_{i=1}^{K} | \hat{x}(t_i) - x_i |$ is the sum of absolute errors between the forward process and the observations, incurred at the update steps. $C_0$, $C_1$ and $C_2$ are constants.

As expected, the bound goes to 0 as the degree of the approximation increases. A lower cumulative absolute error $S_K$ also reduces the bound. Since the cumulative absolute error $S_K$ and our loss function $\mathcal{L}$ share the same optimum, for fixed $\theta$, $L$, $K$ and $N$ our training objective implicitly minimizes the reconstruction error. This corresponds to optimizing Objective 1 with $d(x, x') = \| x - x' \|^2_{\mu^t}$. Our architecture thus jointly minimizes both the global reconstruction error and the forecasting error. Notably, when $S_K = 0$, this result reduces to the well-known projection error for orthogonal polynomial projections of continuous processes (Canuto and Quarteroni, 1982). Moreover, increasing the width $\theta$ of the weighting function predictably results in a higher reconstruction error; this can however be compensated by increasing the dimension of the polynomial basis accordingly.
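To make the $N$-dependence of the bound concrete, the fragment below evaluates its three terms with all constants $C_0, C_1, C_2$ set to one and illustrative values of $\theta$, $L$, $K$ and $S_K$ (our choices, for illustration only), using SciPy's Hurwitz zeta function:

```python
from scipy.special import zeta  # zeta(x, q) is the Hurwitz zeta function

def bound(N, theta=1.0, L=1.0, K=5, S_K=0.1):
    """The three terms of the Result 4.1 bound, with C_0 = C_1 = C_2 = 1."""
    t0 = theta**2 * L**2 * (K + 1) ** 2 / (N * (2 * N - 1))
    t1 = L * (K + 1) * S_K * zeta(1.5, N)
    t2 = S_K**2 * zeta(1.5, N)
    return t0 + t1 + t2

# The bound shrinks monotonically as the polynomial degree N grows:
vals = [bound(N) for N in (4, 16, 64, 256)]
```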
We also note a quadratic dependency on the Lipschitz constant of the temporal process, which can limit the reverse reconstruction abilities for high-frequency components. The full proof can be found in Appendix B.
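The overall training alternation of Algorithm 1 can be sketched as follows; `step_fn` and `update_fn` are hypothetical stand-ins for the Eq. 6 integration and the Eq. 7 update (a real implementation would use a differentiable, implicit ODE solver):

```python
import numpy as np

def sequence_loss(xs, ms, ts, step_fn, update_fn, h0, c0, readout):
    """Algorithm 1 sketch: alternate integration (between observations) and
    update steps, accumulating the masked squared prediction error
    L = sum_i ||(x_hat(t_i) - x_i) * m_i||_2^2."""
    h, c, t_prev, loss = h0, c0, 0.0, 0.0
    for x_i, m_i, t_i in zip(xs, ms, ts):
        h, c = step_fn(h, c, t_prev, t_i)     # integrate Eq. 6 over (t_prev, t_i]
        x_hat = readout(h)                    # x_hat(t_i) = g(h(t_i)) = h_0(t_i)
        loss += np.sum(((x_hat - x_i) * m_i) ** 2)
        h, c = update_fn(h, c, x_i, m_i)      # incorporate the new observation
        t_prev = t_i
    return loss

# Toy instantiation with d = 1, N = 2: frozen dynamics and a last-value update.
step_fn   = lambda h, c, t0, t1: (h, c)
update_fn = lambda h, c, x_i, m_i: (np.concatenate([x_i, c]), c)
readout   = lambda h: h[:1]
xs = [np.array([1.0]), np.array([3.0])]
ms = [np.array([1.0]), np.array([1.0])]
L = sequence_loss(xs, ms, [0.5, 1.0], step_fn, update_fn,
                  np.zeros(3), np.zeros(2), readout)
# predictions are 0 then 1, so L = (0 - 1)^2 + (1 - 3)^2 = 5
```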

5. EXPERIMENTS

We evaluate our approach on two objectives: (1) the ability of the learned embedding to encode global information about the time series, through the reverse reconstruction performance (or memorization), and (2) the ability of the embedding to provide an informative input for a downstream task. We study our method on the following datasets.

Synthetic Univariate. We validate our approach on a univariate synthetic time series. We simulate 1000 realizations of this process and sample it at irregularly spaced time points using a Poisson point process. For each generated irregularly sampled time series $x$, we create a binary label $y = \mathbb{I}[x(5) > 0.5]$. Further details about the datasets can be found in Appendix G.

Chaotic Attractors. Chaotic dynamical systems exhibit a large dependence of the dynamics on the initial conditions, which means that a noisy or incomplete evaluation of the state space may not contain much information about the past of the time series. We consider two widely used chaotic dynamical systems: Lorenz63 and a 5-dimensional Lorenz96. We generate 1000 irregularly sampled time series from different initial conditions and completely remove one dimension of the time series, so that the state space is never fully observed. This forces the model to remember past trajectories to create an accurate estimate of the state space at each time $t$.

MIMIC-III dataset. We use a pre-processed version of the MIMIC-III dataset (Johnson et al., 2016; Wang et al., 2020), consisting of the first 24 hours of follow-up for ICU patients. For each time series, the label $y$ is in-hospital mortality.

Baselines: We compare our approach against two sets of baselines: Neural ODE architectures and variants of recurrent neural network architectures designed for long-term memory. To ensure a fair comparison, we use the same dimensionality of the hidden state for all models.

Neural ODE baselines.
We use GRU-ODE-Bayes (De Brouwer et al., 2019), a filtering implementation of Neural ODEs, and ODE-RNN (Rubanova et al., 2019), an auto-encoder relying on a Neural ODE for both the encoder and the decoder. For these baselines, we compute the reverse reconstruction by integrating the system of learnt ODEs backward in time; in the case of ODE-RNN, we use the ODE of the decoder. Additionally, we compare against Neural RDEs (Morrill et al., 2021), neural controlled differential equations designed for long time series.

Long-term memory RNN baselines. We compare against HiPPO-RNN (Gu et al., 2020), a recurrent neural network architecture that uses orthogonal polynomial projections of the hidden process. We also use a variant of this approach, which we call HiPPO-obs, where we apply the HiPPO operator directly on the observed time series rather than on the hidden process. We further compare against S4, an efficient state space model relying on the HiPPO matrix (Gu et al., 2021).

Long-range representation learning: For each dataset, we evaluate our method and the various baselines on the following tasks. Implementation details are available in Appendix H.

Downstream Classification. We train the models on the available time series. After training, we extract a time series embedding from each model and use it as input to a multi-layer perceptron trained to predict the time series label $y$. We report the area under the receiver operating characteristic curve (AUROC) evaluated on a held-out test set, over 5 repetitions.

Time Series Reconstruction. As for the downstream classification, we extract the time series embeddings from models trained on the time series. We then compute the reverse reconstruction $\hat{x}_{<t}$ and evaluate its MSE with respect to the true time series.

Forecasting. We compare the ability of all models to forecast the future of the time series. We compute the embedding of the time series observed until some time $t_{cond}$ and predict over a horizon $t_{horizon}$.
We then report the MSE between the predicted and the true trajectories.

Table 1: Downstream task and reverse reconstruction results for the synthetic and Lorenz datasets.

Results for these tasks are presented in Table 1 for the synthetic and Lorenz datasets and in Table 2 for MIMIC. We report additional results, with a larger array of irregular sampling rates, in Appendix C. We observe that the reconstruction abilities of PolyODE clearly outperform those of the other baselines for all datasets under consideration. A similar trend holds for the downstream classification on the synthetic and Lorenz datasets: accurate prediction of the label $y$ requires a global representation of the time series, which results in better performance for our approach. On the MIMIC dataset, our approach compares favourably with the other methods for the downstream classification objective and outperforms them for trajectory forecasting. What is more, the reconstruction ability of PolyODE is significantly better than that of the compared approaches. In Figure 3, we plot the reverse reconstructions of PolyODE for several vitals of a random patient over the first 24 hours in the ICU. This reconstruction is obtained by first sequentially processing the time series until t = 24 hours and subsequently using the hidden process to reconstruct the time series as in Equation 9. We observe that PolyODE indeed captures the overall trend of the time series over the whole history.

Ablation study (the importance of the auxiliary dynamical system): Is there utility in leveraging the neural network $\phi_\theta(\cdot)$ to learn the dynamics of the time series? How well would various interpolation schemes for irregularly sampled observations perform for reverse reconstruction and classification? In response to these questions, we first note that interpolation schemes do not support extrapolation and are thus incapable of forecasting the future of the time series.
However, we compare performance in terms of reverse reconstruction and classification in Table 3, considering constant interpolation (last observation carried forward), linear interpolation and Hermite spline interpolation. Our results indicate a significant gap in performance between PolyODE and the linear and constant interpolation schemes. The Hermite spline interpolation captures most of the signal needed for the downstream classification task but results in significantly worse reverse reconstruction error. These results therefore strongly support the importance of $\phi_\theta(\cdot)$ for producing informative time series embeddings. Complementary results are available in Appendix C.

Incorporating global time series uncertainty: The previous experiments demonstrate the ability of PolyODE to retain memory of the past trajectory. A similar capability can be obtained for capturing global model uncertainties over the time series history. In Figure 4, we evaluate the association between the uncertainties recovered by PolyODE and the reverse reconstruction errors, plotting the predicted uncertainties against the root mean squared error (RMSE) on a logarithmic scale. We compare our approach with using the uncertainty of the model at the last time step only. We observe that the uncertainties recovered by PolyODE are significantly more correlated with the errors (Pearson $\rho = 0.56$) than those obtained from the last time step (Pearson $\rho = 0.11$). More details are available in Appendix F.

6. CONCLUSION

Producing time series representations that are easy to manipulate, representative of global dynamics, practically useful for downstream tasks and robust to irregular sampling remains an ongoing challenge. In this work, we took a step in that direction by proposing a simple but novel architecture that satisfies these requirements by design. As a Neural ODE, PolyODE inherits the ability to handle irregular time series elegantly, but it also incurs the computational cost associated with numerical integration. Our approach currently requires a large hidden state dimension, and finding methods that exploit the correlation between dimensions of the time series to address this is a fruitful direction for future work.

Reproducibility Statement

Details for reproducing the experiments are available in Appendix H. The code for reproducing all experiments will be made publicly available.




Figure 4: Association between uncertainties and reverse reconstruction errors for PolyODE (top) and classical Neural ODEs (bottom).

Table 2: Performance on the MIMIC-III dataset.

Table 3: Impact of the interpolation scheme on performance.

Acknowledgements

EDB is funded by a FWO-SB PhD research grant (S98819N) and a FWO research mobility grant (V424722N). RGK was supported by a CIFAR AI Chair. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

