CUBIC SPLINE SMOOTHING COMPENSATION FOR IRREGULARLY SAMPLED SEQUENCES

Abstract

The marriage of recurrent neural networks and neural ordinary differential equations (ODE-RNN) is effective in modeling irregularly sampled sequences. While the ODE produces smooth hidden states within each observation interval, the RNN triggers a hidden state jump whenever a new observation arrives, causing the interpolation discontinuity problem. To address this issue, we propose cubic spline smoothing compensation, a stand-alone module placed upon either the output or the hidden state of ODE-RNN that can be trained end-to-end. We derive its analytical solution and provide its theoretical interpolation error bound. Extensive experiments indicate its merits over both ODE-RNN and cubic spline interpolation.

1. INTRODUCTION

Recurrent neural networks (RNNs) are commonly used for modeling regularly sampled sequences (Cho et al., 2014). However, the standard RNN can only process discrete series without accounting for the unequal temporal intervals between sample points, making it fail to model the irregularly sampled time series commonly seen in domains such as healthcare (Rajkomar et al., 2018) and finance (Fagereng & Halvorsen, 2017). While some works adapt RNNs to handle such irregular scenarios, they often assume an exponential decay (either at the output or the hidden state) during the time interval between observations (Che et al., 2018; Cao et al., 2018), which may not always hold. To remove the exponential decay assumption and better model the underlying dynamics, Chen et al. (2018) proposed the neural ordinary differential equation (ODE) to model the continuous dynamics of hidden states during the observation intervals. Leveraging a learnable ODE parameterized by a neural network, their method offers greater modeling capability and flexibility. However, an ODE determines its trajectory entirely from the initial state and cannot adjust the trajectory according to subsequent observations. A popular way to leverage subsequent observations is ODE-RNN (Rubanova et al., 2019; De Brouwer et al., 2019), which updates the hidden state upon observations using an RNN and evolves the hidden state using an ODE between observation intervals. While the ODE produces smooth hidden states between observation intervals, the RNN triggers a hidden state jump at each observation point. This inconsistency (discontinuity) is hard to reconcile and thus jeopardizes continuous time series modeling, especially for interpolation tasks (Fig. 1, top-left). We propose a Cubic Spline Smoothing Compensation (CSSC) module to tackle this challenging discontinuity problem; it is especially suitable for continuous time series interpolation.
Our CSSC employs the cubic spline as a compensation for the ODE-RNN to eliminate the jump, as illustrated in Fig. 1, top-right. While the latent ODE (Rubanova et al., 2019) with an encoder-decoder structure can also produce continuous interpolation, CSSC further ensures that the interpolated curve passes strictly through the observation points. Importantly, we can derive the closed-form solution for CSSC and obtain its interpolation error bound. The error bound suggests two key factors for a good interpolation: the time interval between observations and the performance of the ODE-RNN. Furthermore, we propose the hidden CSSC, which compensates the hidden state of the ODE-RNN (Fig. 1, bottom); it not only assuages the discontinuity problem but is also more efficient when the observations are high-dimensional and only continuous at the semantic level. We conduct extensive experiments and ablation studies to demonstrate the effectiveness of CSSC and hidden CSSC, both of which outperform the comparison methods.

2. RELATED WORK

Spline interpolation is a practical way to construct smooth curves between a set of points (De Boor et al., 1978), even unequally spaced ones. Cubic spline interpolation leverages piecewise third-order polynomials to avoid Runge's phenomenon (Runge, 1901) and is a classical way to impute missing data (Che et al., 2018). Recent literature focuses on adapting RNNs to model irregularly sampled time series, given their strong modeling ability. Since standard RNNs can only process discrete series without considering the unequal temporal intervals between sample points, different improvements have been proposed. One solution is to augment the input with an observation mask or concatenate it with the time lag ∆t and expect the network to use the interval information in an unconstrained manner (Lipton et al., 2016; Mozer et al., 2017). While such a flexible structure can achieve good performance under some circumstances (Mozer et al., 2017), a more popular approach is to use prior knowledge for missing data imputation. GRU-D (Che et al., 2018) imputes missing values with a weighted sum of an exponential decay of the previous observation and the empirical mean. Shukla & Marlin (2019) employ radial basis function kernels to construct an interpolation network. Cao et al. (2018) let the hidden state exponentially decay at non-observed time points and use a bi-directional RNN for temporal modeling. Another track is the probabilistic generative model. Due to their ability to model the uncertainty of missing data, Gaussian processes (GPs) are adopted for missing data imputation (Futoma et al., 2017; Tan et al., 2020; Moor et al., 2019). However, this approach introduces several design choices, such as the covariance function, making it hard to tune in practice. Neural processes (Garnelo et al., 2018) eliminate such constraints by introducing a global latent variable that represents the whole process.
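The decay-based imputation mechanism described for GRU-D can be sketched as follows (a minimal scalar sketch of the decay idea only; the actual GRU-D decay rate is learned per feature, and the helper name is ours):

```python
import math

def decayed_impute(x_last, x_mean, dt, w=1.0):
    """GRU-D-style imputation sketch: decay from the last observation
    toward the empirical mean as the time gap dt grows, with rate w."""
    gamma = math.exp(-max(0.0, w * dt))   # decay weight in (0, 1]
    return gamma * x_last + (1.0 - gamma) * x_mean
```

With `dt = 0` the imputation equals the last observation; as `dt` grows it relaxes toward the empirical mean, encoding the prior that stale observations become less informative.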
Generative adversarial networks have also been adopted for imputation (Luo et al., 2018). As discussed in the introduction, while the ODE produces smooth hidden states between observation intervals, the RNN triggers a jump of the hidden state at each observation point, leading to a discontinuous hidden state along the trajectory; this discontinuity is hard to reconcile and jeopardizes continuous time series modeling, especially for interpolation. The neural CDE (Kidger et al., 2020) directly applies cubic spline interpolation to the input sequence to make the sparse input continuous and thus produce continuous output. In contrast, our method tackles the jumping problem by introducing the cubic spline as a compensation for the vanilla ODE-RNN, at either the output or the hidden space.

3. METHODS

In this section, we first formalize the irregularly sampled time series interpolation problem and then review the ODE-RNN model. Given n+1 observations {x_k | x_k = x(t_k)}_{k=0}^n ∈ R^d sampled from x(t) at irregularly spaced time points Π : a = t_0 < t_1 < ... < t_n = b, the goal is to learn a function F(t) : R → R^d approximating x such that F(t_k) = x_k. ODE-RNN (Rubanova et al., 2019) achieves the interpolation by applying an ODE and an RNN interchangeably through the time series, as illustrated in the top-left of Fig. 1. On each time interval t ∈ [t_k, t_{k+1}), the function F is described by a neural ODE with initial hidden state h(t_k):

ḣ(t) = f(h(t));   (1)
o(t) = g(h(t)),   (2)

where h ∈ R^m is the hidden embedding of the data, ḣ = dh/dt is the temporal derivative of the hidden state, and o ∈ R^d is the interpolation output of F(t). Here, f : R^m → R^m and g : R^m → R^d are the transfer function and the output function, respectively, each parameterized by a neural network. At an observation time t = t_k, the hidden state is updated by an RNN as:

h(t_k) = RNNCell(h(t_k^-), x_k);   (3)
o(t_k) = g(h(t_k)),   (4)

where the input x ∈ R^d, and t_k^- and t_k^+ denote the left- and right-hand limits at t_k. The above formulation has two downsides. The first is the discontinuity problem: while the function described by the ODE is right-continuous, o(t_k) = o(t_k^+), the RNN cell in Eq. (3) renders the hidden state discontinuous, h(t_k^-) ≠ h(t_k^+), and therefore the output discontinuous, o(t_k^-) ≠ o(t_k^+). The second is that the model cannot guarantee o(t_k) = x_k without explicit constraints.
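The evolve-then-jump loop above can be sketched in a toy one-dimensional form (a minimal sketch with hand-rolled Euler integration; the function names, step size, and scalar setting are ours, not the paper's implementation):

```python
def ode_rnn_interpolate(obs, f, rnn_cell, g, h0=0.0, dt=0.05):
    """Toy 1-D ODE-RNN (cf. Eqs. (1)-(4)): Euler-evolve the hidden state h
    between observations and jump-update h with an RNN cell at each t_k.
    obs: time-sorted list of (t_k, x_k). Returns a list of (t, o(t))."""
    h, traj = h0, []
    for i, (t_k, x_k) in enumerate(obs):
        h = rnn_cell(h, x_k)            # Eq. (3): jump h(t_k^-) -> h(t_k^+)
        if i + 1 == len(obs):
            traj.append((t_k, g(h)))    # final observation time
            break
        t = t_k
        while t < obs[i + 1][0] - 1e-12:
            traj.append((t, g(h)))      # Eq. (2): output o(t) = g(h(t))
            h = h + dt * f(h)           # Eq. (1): Euler step of dh/dt = f(h)
            t += dt
    return traj
```

With `f` a decay field and a pass-through RNN cell, the output decays between observations and snaps to the new value at each observation, reproducing exactly the discontinuity discussed above.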

3.3. CUBIC SPLINE SMOOTHING COMPENSATION

To remedy the two downsides, we propose the Cubic Spline Smoothing Compensation (CSSC) module, shown in the top-right of Fig. 1. It computes a compensated output ô(t) as:

ô(t) = c(t) + o(t),   (5)

where o(t) is the ODE-RNN output and c(t) is a compensation composed of piecewise continuous functions. Our key insight is that adding a suitable correction to the already piecewise continuous o(t) ensures global continuity. For simplicity, we set c(t) as a piecewise polynomial function and then narrow it to a piecewise cubic, since cubics are the most commonly used polynomials for interpolation (Burden & Faires, 1997). As the cubic spline is computed for each dimension of c individually, w.l.o.g. we discuss one dimension of o, c, ô, x and denote them as o, c, ô, x, respectively. c(t) is composed of pieces, c(t) = Σ_{k=0}^{n-1} c_k(t), with each piece c_k defined on the domain [t_k, t_{k+1}). To guarantee smoothness, we impose four constraints on ô(t):

1. ô(t_k^-) = ô(t_k^+) = x_k, k = 1, ..., n-1, and ô(t_0) = x_0, ô(t_n) = x_n (output continuity);
2. ô^(1)(t_k^-) = ô^(1)(t_k^+), k = 1, ..., n-1 (first-order output continuity);
3. ô^(2)(t_k^-) = ô^(2)(t_k^+), k = 1, ..., n-1 (second-order output continuity);
4. ô^(2)(t_0) = ô^(2)(t_n) = 0 (natural boundary condition).

Constraint 1 ensures the interpolated curve passes continuously through the observations. Constraints 2 and 3 enforce first- and second-order continuity at the observation points, which usually holds when the underlying curve x is smooth. Constraint 4 specifies the natural boundary condition, owing to the lack of information at the endpoints (Burden & Faires, 1997). Given o(t) and these four constraints, c(t) has a unique analytical solution, expressed in Theorem 1.

Theorem 1. Define the first- and second-order jump differences of the ODE-RNN as

ṙ_k = ȯ(t_k^+) - ȯ(t_k^-);   (6)
r̈_k = ö(t_k^+) - ö(t_k^-),   (7)
where the analytical expressions of ȯ and ö are

ȯ = (∂g/∂h) f;   ö = ((∂²g/∂h²) f)ᵀ f + (∂g/∂h)(∂f/∂h) f,   (8)

and define the errors

ε_k^+ = x_k - o(t_k^+);   (9)
ε_k^- = x_k - o(t_k^-).   (10)

Then c_k is uniquely determined as

c_k(t) = ((M_{k+1} + r̈_{k+1} - M_k) / (6τ_k)) (t - t_k)³ + (M_k / 2)(t - t_k)²
         + ((ε_{k+1}^- - ε_k^+) / τ_k - τ_k (M_{k+1} + r̈_{k+1} + 2M_k) / 6)(t - t_k) + ε_k^+,   (11)

where the M_k are obtained from M = A⁻¹ d with the tridiagonal system

A =
⎡ 2       λ_1                              ⎤
⎢ μ_2     2       λ_2                      ⎥
⎢         ...     ...      ...             ⎥
⎢                 μ_{n-2}  2       λ_{n-2} ⎥
⎣                          μ_{n-1} 2       ⎦ ,

M = (M_1, M_2, ..., M_{n-2}, M_{n-1})ᵀ,  d = (d_1, d_2, ..., d_{n-2}, d_{n-1})ᵀ,   (12)

τ_k = t_{k+1} - t_k,  μ_k = τ_{k-1} / (τ_{k-1} + τ_k),  λ_k = τ_k / (τ_{k-1} + τ_k),

d_k = 6 ([t_k^+, t_{k+1}^-] - [t_{k-1}^+, t_k^-]) / (τ_{k-1} + τ_k) + (6ṙ_k - 2r̈_k τ_{k-1} - r̈_{k+1} τ_k) / (τ_{k-1} + τ_k),   (13)

[t_k^+, t_{k+1}^-] = (ε_{k+1}^- - ε_k^+) / τ_k,  and M_0 = M_n = 0.

The proof of Theorem 1 is in Appx. A. c(t) is obtained by computing each piece c_k(t) individually according to Theorem 1. Computational Complexity. The major cost is the inverse of A, a tridiagonal matrix, which can be computed efficiently in O(n) with the tridiagonal matrix (Thomas) algorithm (implementation detailed in Appx. C.1). Another concern is that ȯ and ö require the Jacobian and Hessian in Eq. (8). We can circumvent this cost by computing numerical derivatives or an empirical substitution, detailed in Appx. C.2. Model Reduction. CSSC reduces to cubic spline interpolation if o in Eq. (5) is set to zero. In light of this, we further analyze our model with techniques used for cubic spline interpolation and experimentally show our advantages over it in Sec. 4.2.
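As a concrete instance of the o ≡ 0 reduction, the following pure-Python sketch (function names ours) solves the plain natural cubic spline for the second-derivative moments M_k, using the same μ/λ/2 tridiagonal structure as A and the O(n) Thomas algorithm mentioned above:

```python
def thomas_solve(sub, diag, sup, rhs):
    """Solve a tridiagonal system in O(n) (Thomas algorithm).
    sub: n-1 subdiagonal, diag: n diagonal, sup: n-1 superdiagonal."""
    n = len(diag)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = sup[0] / diag[0] if n > 1 else 0.0
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):                 # forward elimination
        m = diag[i] - sub[i - 1] * cp[i - 1]
        cp[i] = sup[i] / m if i < n - 1 else 0.0
        dp[i] = (rhs[i] - sub[i - 1] * dp[i - 1]) / m
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def natural_spline_moments(ts, xs):
    """Second derivatives M_0..M_n of the natural cubic spline through
    (t_k, x_k): M_0 = M_n = 0, interior M from the tridiagonal system
    mu_k M_{k-1} + 2 M_k + lam_k M_{k+1} = d_k (the o = 0 special case)."""
    n = len(ts) - 1
    tau = [ts[k + 1] - ts[k] for k in range(n)]
    mu = [tau[k - 1] / (tau[k - 1] + tau[k]) for k in range(1, n)]
    lam = [tau[k] / (tau[k - 1] + tau[k]) for k in range(1, n)]
    d = [6.0 * ((xs[k + 1] - xs[k]) / tau[k] - (xs[k] - xs[k - 1]) / tau[k - 1])
         / (tau[k - 1] + tau[k]) for k in range(1, n)]
    return [0.0] + thomas_solve(mu[1:], [2.0] * (n - 1), lam[:-1], d) + [0.0]
```

The full CSSC solve would additionally fold the jump terms ṙ_k, r̈_k and the errors ε_k^± into the right-hand side d_k, as in Eq. (13).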

3.4. INFERENCE AND TRAINING

For inference, first compute the predicted value o from the ODE-RNN (Eqs. (1)-(4)), then calculate the compensation c with CSSC (Eq. (11)), yielding the smoothed output ô (Eq. (5)). For training, CSSC is a standalone nonparametric module (since we have its analytical solution) on top of the ODE-RNN, which allows end-to-end training of the ODE-RNN parameters. We employ a Mean Squared Error (MSE) loss to supervise ô. In addition, we expect the compensation c to be small, to push the ODE-RNN into the leading role for interpolation and take full advantage of its model capacity. Therefore, a 2-norm penalty on c is added to construct the final loss:

L = (1/N) Σ_{i=1}^N (||x(t_i) - ô(t_i)||² + α ||c(t_i)||²).   (14)

The ablation study (Sec. 4.5) shows that the balance weight α can effectively arrange the contributions of the ODE-RNN and CSSC. Gradient flow. Although c is a nonparametric module, gradients can still flow through it into the ODE-RNN, because c(t) depends on the left and right limits of o(t_k), ȯ(t_k), and ö(t_k). We further find that ȯ(t_k) plays a more important role than ö(t_k) in the contribution to c(t), elaborated in Appx. C.3.
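Per dimension, the loss of Eq. (14) amounts to the following minimal sketch (the per-sample averaging and scalar setting are our simplification; in the paper x, ô, c are vectors per time point):

```python
def cssc_loss(x, o_hat, c, alpha=1.0):
    """Loss of Eq. (14), scalar per-dimension form: MSE between the
    observations x and the smoothed output o_hat = o + c, plus an
    alpha-weighted squared-norm penalty keeping the compensation c small."""
    n = len(x)
    mse = sum((xi - oi) ** 2 for xi, oi in zip(x, o_hat)) / n
    penalty = sum(ci ** 2 for ci in c) / n
    return mse + alpha * penalty
```

A larger `alpha` pushes the ODE-RNN to carry the interpolation on its own, matching the ablation in Sec. 4.5.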

3.5. INTERPOLATION ERROR BOUND

With CSSC, we can further derive an interpolation error bound, which is hard to obtain for ODE-RNN. Without loss of generality, we analyze one dimension of ô; the result scales to all dimensions.

Theorem 2. Given the CSSC at the output space as in Eq. (5), if x ∈ C⁴[a, b], f ∈ C³, g ∈ C⁴, then the error and the first-order error are bounded as

||(x - ô)^(r)||_∞ ≤ C_r ||(x - o)^(4)||_∞ τ^{4-r},  (r = 0, 1),   (15)

where ||·||_∞ is the uniform norm, (·)^(r) is the r-th derivative, C_0 = 5/384, C_1 = 1/24, and τ is the maximum interval over Π. The proof of Theorem 2 is in Appx. B. The bound guarantees that the error converges to zero if τ → 0 or ||(x - o)^(4)||_∞ → 0. This suggests that a better interpolation can come from denser observations or a better ODE-RNN output. Interestingly, Eq. (15) reduces to the error bound for cubic spline interpolation (Hall & Meyer, 1976) if o is set to zero. Compared with ODE-RNN, which lacks a convergence guarantee in τ, our model mitigates the error for densely sampled curves at rate O(τ⁴); compared with cubic spline interpolation, our bound has an adjustable o that can lead to a smaller ||(x - o)^(4)||_∞ than ||x^(4)||_∞. An implicit assumption of this error bound is that x should be four times differentiable; hence the model is not suitable for sharply changing signals.
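The O(τ⁴) rate in Eq. (15) can be observed numerically. The sketch below (helper names ours) uses a cubic Hermite interpolant, the building block of the appendix proof (Lemma 6), on a single interval of sin(t): halving the interval length shrinks the maximum error by roughly 2⁴ = 16:

```python
import math

def hermite(t, t0, t1, v0, v1, dv0, dv1):
    """Cubic Hermite interpolant on [t0, t1] matching endpoint values and slopes."""
    tau = t1 - t0
    s = (t - t0) / tau
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * v0 + h10 * tau * dv0 + h01 * v1 + h11 * tau * dv1

def max_err(tau, t0=1.0):
    """Max |sin - its cubic Hermite interpolant| over [t0, t0 + tau]."""
    t1 = t0 + tau
    grid = [t0 + i * tau / 200 for i in range(201)]
    return max(abs(math.sin(t)
                   - hermite(t, t0, t1, math.sin(t0), math.sin(t1),
                             math.cos(t0), math.cos(t1))) for t in grid)

ratio = max_err(0.2) / max_err(0.1)  # approximately 2**4 = 16: fourth order
```

The ratio is close to 16 because the fourth derivative of sin is nearly constant over the short interval, isolating the τ⁴ factor of the bound.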

3.6. EXTEND TO INTERPOLATE HIDDEN STATE

Although CSSC has theoretical advantages from the error bound, it still faces two challenges. Consider video frames: each frame is high-dimensional, and each pixel is not continuous through time, but the spatial movement of the content (the semantic manifold) is continuous. The first challenge is that CSSC has linear complexity w.r.t. the data dimension, which is still computationally demanding when the data dimension becomes very high. The second is that CSSC assumes the underlying function x is continuous, so it cannot handle data that is discontinuous per dimension yet continuous on its semantic manifold (e.g., video). To tackle these two challenges, we propose a variant of CSSC applied to the hidden states, named hidden CSSC, illustrated at the bottom of Fig. 1. Since we only compute the compensation for the hidden state, whose dimension stays fixed regardless of the data dimension, the computational complexity of hidden CSSC is only weakly related to the data dimension. Also, the hidden state typically encodes a meaningful low-dimensional manifold of the data, so smoothing the hidden state is, to a certain degree, equivalent to smoothing the semantic manifold. Accordingly, Eq. (5) is modified to

ĥ(t) = c(t) + h(t),   (16)

and the output (Eqs. (2), (4)) to

o(t) = g(ĥ(t)).   (17)

However, since there is no ground truth for the hidden state to play the role of constraint 1, we take the h(t_k^+) as the knots passed through by the compensated curve, rendering constraint 1 as ĥ(t_k^-) = ĥ(t_k^+) = h(t_k^+). The rationale for these knots is that h(t_k^+) is updated by x(t_k) and thus contains more information than h(t_k^-). Given the redefined variables, the closed-form solution for c is provided in Theorem 3.

Theorem 3.
Given the first- and second-order jump differences at the hidden state as

ṙ_k = ḣ(t_k^+) - ḣ(t_k^-);  r̈_k = ḧ(t_k^+) - ḧ(t_k^-),

where ḣ = f and ḧ = (∂f/∂h) f, and the errors defined as

ε_k^+ = h(t_k^+) - h(t_k^+) = 0;  ε_k^- = h(t_k^+) - h(t_k^-),

c_k is uniquely determined as in Eq. (11). Theorem 3 suggests another prominent advantage: hidden CSSC can be implemented more efficiently because its computation does not involve the Hessian matrix.

4.2. TOY SINUOUS WAVE DATASET

The toy dataset is composed of 1,000 periodic trajectories with varying frequencies and amplitudes. Following the setting of Rubanova et al. (2019), each trajectory contains 100 irregularly sampled time points with the initial point sampled from a standard Gaussian distribution. Fixed percentages of observations are randomly selected, with the first and last time points always included. The goal is to interpolate the full set of 100 points. The interpolation error on the test data is shown in Table 1, where our method outperforms all baselines across observation percentages.

4.3. MUJOCO PHYSICS SIMULATION

We test our proposed method with the "hopper" model provided by the DeepMind Control Suite (Tassa et al., 2018), based on the MuJoCo physics engine. To increase the trajectory's complexity, the hopper is thrown up, rotates, and freely falls to the ground (Fig. 3). We interpolate the 7-dimensional state that describes the position of the hopper. Both our hidden CSSC and CSSC improve upon ODE-RNN, especially when observations become sparser, as Tab. 1 indicates. The visual result is shown in Fig. 3, where CSSC resembles the ground-truth trajectory most closely.

4.4. MOVING MNIST

In addition to low-dimensional data, we further evaluate our method on high-dimensional image interpolation. Moving MNIST consists of 20-frame video sequences in which two handwritten digits drawn from MNIST move with arbitrary velocity and direction within 64×64 patches, with potential overlap and bouncing at the boundaries. For expediency, we use a subset of 10k videos and resize the frames to 32×32. Either 4 (20%) or 6 (30%) of the 20 frames are randomly observed, always including the starting and ending frames. We encode each image with 2 ResBlocks (He et al., 2016) into a 32-d hidden vector and decode it back to pixel space with a stack of transposed convolution layers. Since the pixels are only continuous at the semantic level, only hidden CSSC is evaluated against the comparison methods. As shown in Tab. 1, hidden CSSC further improves ODE-RNN's result, while spline interpolation performs worst since it can only interpolate in pixel space, which is discontinuous through time. The visual result (Fig. 4) shows that the performance gain comes from smoother movement and clearer overlapping.

4.5. ABLATION STUDY

The effect of end-to-end training. Apart from our standard end-to-end training of CSSC, two alternative training strategies are pre-hoc and post-hoc CSSC. Pre-hoc CSSC trains a standard CSSC but uses only the ODE-RNN part at inference. Conversely, post-hoc CSSC trains an ODE-RNN without CSSC but applies CSSC to the ODE-RNN output at inference. The comparison of pre-hoc, post-hoc, and standard CSSC is presented in Tab. 2, where standard CSSC performs best. Pre-hoc CSSC is the worst because training with CSSC tolerates the error of the ODE-RNN; inference without CSSC then exposes that error, and it even performs worse than standard ODE-RNN. Post-hoc CSSC can increase the performance of ODE-RNN with simple post-processing at inference. However, such a performance gain is not guaranteed when observations are sparse; for example, post-hoc CSSC even decreases the performance of ODE-RNN in the 10% observation setting. Standard CSSC outperforms ODE-RNN in all our experiments, indicating the importance of end-to-end training. The effect of α. We study the effect of α ranging from 0 to 10000 under different percentages of observation on the MuJoCo dataset. The performance of CSSC is quite robust to the choice of α, as shown in Tab. 4 (in the Appendix); notably, the MSE only fluctuates from 0.000375 to 0.000463 as α ranges from 1 to 10000 in the 30% observation setting. Fig. 5 (in the Appendix) visually compares the interpolations o, c, and ô under varying α and indicates that a higher α yields a smaller c, so o dominates the smoothed output ô. Conversely, a smaller α makes o less correlated with the ground truth, but the compensation c can always make ô a well-interpolated curve.

5. DISCUSSION

Limitations. While CSSC interpolates a trajectory that passes strictly through the observation points, such interpolation is unsuitable for noisy data whose observations are inaccurate. Moreover, the interpolation error bound (Eq. (15)) requires the underlying data to be fourth-order differentiable, which indicates that CSSC is not suited to interpolating sharply changing data, e.g., step signals. Future work. CSSC is not restricted to ODE-RNN; it can smooth any piecewise continuous function, and applying it to other, more general models is desirable future work. Also, while interpolation of noisy data is beyond this paper's scope, hidden CSSC shows the potential to tolerate data noise by capturing continuity in the semantic space rather than the observation space, which can be a future direction.

6. CONCLUSION

We introduced CSSC to address the discontinuity issue of ODE-RNN. We derived the analytical solution for CSSC and proved its error bound for the interpolation task, which is hard to obtain for pure neural network models. CSSC combines the modeling ability of deep neural networks (ODE-RNN) with the smoothness advantage of cubic spline interpolation, and our experiments have shown the benefit of this combination. The hidden CSSC extends the smoothness from the output space to the hidden semantic space, enabling a more general class of continuous signals.

A PROOF OF INTERPOLATION

Substituting ô via Eq. (5), the four constraints become

c_k(t_k^+) + o(t_k^+) = x_k   (18)
c_k(t_{k+1}^-) + o(t_{k+1}^-) = x_{k+1}   (19)
ċ_{k-1}(t_k^-) + ȯ(t_k^-) = ċ_k(t_k^+) + ȯ(t_k^+)   (20)
c̈_{k-1}(t_k^-) + ö(t_k^-) = c̈_k(t_k^+) + ö(t_k^+)   (21)
c̈_0(t_0^+) + ö(t_0^+) = 0   (22)
c̈_{n-1}(t_n^-) + ö(t_n^-) = 0   (23)

Let c̈_k(t_k^+) = M_k, k = 0, 1, ..., n-1, and c̈_{n-1}(t_n^-) = M_n. Define the first- and second-order jump differences of the ODE-RNN as

ṙ_k = ȯ(t_k^+) - ȯ(t_k^-);  r̈_k = ö(t_k^+) - ö(t_k^-).   (24)(25)

With Eq. (21), we have

M̄_{k+1} = c̈_k(t_{k+1}^-) = M_{k+1} + r̈_{k+1}.   (26)

Using constraints (18) and (19), denote

c_k(t_k) = x_k - o(t_k^+) = ε_k^+   (27)
c_k(t_{k+1}) = x_{k+1} - o(t_{k+1}^-) = ε_{k+1}^-.   (28)

Also denote the step size τ_k = t_{k+1} - t_k. Applying constraints (18), (19), and (21), the cubic piece can be expressed as

c_k(t) = ((M̄_{k+1} - M_k) / (6τ_k)) (t - t_k)³ + (M_k / 2)(t - t_k)² + ((ε_{k+1}^- - ε_k^+) / τ_k - τ_k (M̄_{k+1} + 2M_k) / 6)(t - t_k) + ε_k^+.   (29)

Next, we solve for all M_k. We first express ċ_{k-1}(t_k^-) and ċ_k(t_k^+) as

ċ_{k-1}(t_k^-) = ((M̄_k - M_{k-1}) / 2) τ_{k-1} + M_{k-1} τ_{k-1} + (ε_k^- - ε_{k-1}^+) / τ_{k-1} - ((M̄_k + 2M_{k-1}) / 6) τ_{k-1},   (30)
ċ_k(t_k^+) = (ε_{k+1}^- - ε_k^+) / τ_k - ((M̄_{k+1} + 2M_k) / 6) τ_k.   (31)

Applying Eq. (20), we obtain

2M_k + (τ_{k-1} / (τ_{k-1} + τ_k)) M_{k-1} + (τ_k / (τ_{k-1} + τ_k)) M_{k+1}
  = 6 ([t_k^+, t_{k+1}^-] - [t_{k-1}^+, t_k^-]) / (τ_{k-1} + τ_k) + (6ṙ_k - 2r̈_k τ_{k-1} - r̈_{k+1} τ_k) / (τ_{k-1} + τ_k),   (32)

where [t_k^+, t_{k+1}^-] = (ε_{k+1}^- - ε_k^+) / τ_k. Define

μ_k = τ_{k-1} / (τ_{k-1} + τ_k),   (33)
λ_k = τ_k / (τ_{k-1} + τ_k),   (34)
d_k = 6 ([t_k^+, t_{k+1}^-] - [t_{k-1}^+, t_k^-]) / (τ_{k-1} + τ_k) + (6ṙ_k - 2r̈_k τ_{k-1} - r̈_{k+1} τ_k) / (τ_{k-1} + τ_k).   (35)

Then the M_k are obtained by solving the system of linear equations AM = d:

A =
⎡ 2       λ_1                              ⎤
⎢ μ_2     2       λ_2                      ⎥
⎢         ...     ...      ...             ⎥
⎢                 μ_{n-2}  2       λ_{n-2} ⎥
⎣                          μ_{n-1} 2       ⎦ ,

M = (M_1, M_2, ..., M_{n-2}, M_{n-1})ᵀ,  d = (d_1, d_2, ..., d_{n-2}, d_{n-1})ᵀ.   (36)(37)

A is non-singular since it is strictly diagonally dominant, which guarantees a unique solution for the M_k. Hence M = A⁻¹ d.
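The strict diagonal dominance of A can be checked numerically; this small sketch (helper name ours) assembles the interior rows (μ_k, 2, λ_k) for arbitrary knots and verifies that every off-diagonal row sum equals μ_k + λ_k = 1 < 2 (the first and last rows drop one off-diagonal entry, which only strengthens the dominance):

```python
def build_A(ts):
    """Interior rows (mu_k, 2, lam_k) of the tridiagonal matrix A of
    Eq. (37) for knots ts; there are n-1 rows for n = len(ts) - 1 intervals."""
    n = len(ts) - 1
    tau = [ts[k + 1] - ts[k] for k in range(n)]
    rows = []
    for k in range(1, n):
        mu = tau[k - 1] / (tau[k - 1] + tau[k])    # subdiagonal weight
        lam = tau[k] / (tau[k - 1] + tau[k])       # superdiagonal weight
        rows.append((mu, 2.0, lam))
    return rows
```

Since |2| > μ_k + λ_k on every row, Gershgorin's theorem already excludes a zero eigenvalue, matching the non-singularity argument above.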
Now we still need to calculate ȯ(t) and ö(t):

ȯ(t) = (∂g/∂h)(∂h/∂t) = (∂g/∂h) f,   (39)
ḧ(t) = df(h(t))/dt = (∂f/∂h)(dh/dt) = (∂f/∂h) f,   (40)
ö(t) = (d(∂g/∂h)/dt) ḣ + (∂g/∂h) ḧ = ((∂²g/∂h²) f)ᵀ f + (∂g/∂h)(∂f/∂h) f.   (41)

B PROOF FOR ERROR BOUND

The following proof is based on the default notation and setting for scalar time series interpolation: given a set of n+1 points {x(t_i)}_{i=0}^n irregularly spaced at time points Π : a = t_0 < t_1 < ... < t_n = b, the goal is to approximate the underlying ground-truth function x(t), t ∈ Ω = [a, b]. Let o(t) be the ODE-RNN prediction and ô(t) = c(t) + o(t) our smoothed output, where c(t) is the compensation defined by Eq. (11). To simplify notation, we drop the argument t of a function, e.g., x(t) → x. We first introduce several lemmas, then come to the main proof of the interpolation error bound.

Lemma 4. (Hall & Meyer, 1976) Let e be any function in C²(Π) ∩ ⋂_{k=1}^n C⁴(t_{k-1}, t_k) with constraints e(t_k) = 0, k = 0, ..., n, and natural boundary ë(a) = ë(b) = 0. Then

|e^(r)(t_k)| ≤ ρ_r ||e^(4)||_∞ τ^{4-r}  (k = 0, ..., n; r = 1, 2),   (42)

where ρ_1 = 1/24, ρ_2 = 1/4.

Lemma 5. Let e = x - ô. If x ∈ C⁴(Ω), f ∈ C³, g ∈ C⁴, then e ∈ C²(Π) ∩ ⋂_{k=1}^n C⁴(t_{k-1}, t_k).

Proof. Since ô ∈ C²(Π) and c ∈ ⋂_{k=1}^n C⁴(t_{k-1}, t_k), we only have to prove that o ∈ ⋂_{k=1}^n C⁴(t_{k-1}, t_k). Since o(t) on each period (t_{k-1}, t_k) is generated by the ODE (Eqs.
(1) and (2)), we can express the successive derivatives of o as:

ȯ = (∂g/∂h) f,   (43)
ö = ((∂²g/∂h²) f)ᵀ f + (∂g/∂h)(∂f/∂h) f,   (44)
o^(3) = fᵀ ((∂f/∂h)ᵀ (∂²g/∂h²) + (∂³g/∂h³) • f + 2 (∂²g/∂h²)(∂f/∂h)) f + (∂g/∂h)((∂²f/∂h²) • f + (∂f/∂h)(∂f/∂h)) f,   (45)
o^(4) = fᵀ [ (∂⁴g/∂h⁴) • f • f + (∂³g/∂h³) • ((∂f/∂h)(∂f/∂h)) + 3 (∂³g/∂h³) • f (∂f/∂h) + 2 (∂f/∂h)ᵀ (∂³g/∂h³) • f + 3 (∂²g/∂h²)(∂²f/∂h²) • f + 3 (∂²g/∂h²)(∂f/∂h)(∂f/∂h) + (∂²f/∂h²) • f (∂²g/∂h²) + (∂f/∂h)(∂f/∂h)(∂²g/∂h²) + 3 (∂f/∂h)(∂²g/∂h²)(∂f/∂h) ] f + (∂g/∂h) [ (∂³f/∂h³) • f • f + (∂²f/∂h²) • ((∂f/∂h) f) + 2 (∂²f/∂h²) • f (∂f/∂h) + (∂f/∂h)(∂²f/∂h²) • f + (∂f/∂h)(∂f/∂h)(∂f/∂h) ] f,   (46)

where ∂f/∂h is the Jacobian matrix (f is vector-valued) and ∂²g/∂h² is the Hessian matrix (o is a scalar in the per-dimension analysis, so g is single-valued). Derivatives of order higher than two become multi-dimensional tensors that cannot be expressed with standard matrix products, so we denote the product of such a tensor with a vector by •, which has higher precedence than normal matrix multiplication. From Eq. (46), o^(4) depends on ∂⁴g/∂h⁴ and ∂³f/∂h³. Hence, given g ∈ C⁴ and f ∈ C³, we obtain o ∈ ⋂_{k=1}^n C⁴(t_{k-1}, t_k).

|(x(t) - ô(t))^(r)| ≤ B_r(x) ||e^(4)||_∞ τ^{4-r} = B_r(x) ||(x - o)^(4)||_∞ τ^{4-r},

with B_r(t) = ρ_1 |H_1^(r)(t)| + |H_2^(r)(t)| τ^{r-1}. Using the triangle inequality, Lemma 6, and Eq. (57), it follows that

|x^(r)(t) - ô^(r)(t)| = |x^(r)(t) - o^(r)(t) - u^(r)(t) + u^(r)(t) - c^(r)(t)|
  ≤ |x^(r)(t) - o^(r)(t) - u^(r)(t)| + |u^(r)(t) - c^(r)(t)|
  ≤ (A_r(x) + B_r(x)) ||(x - o)^(4)||_∞ τ^{4-r}.

Let C_r(x) = A_r(x) + B_r(x); an analysis of optimality (Hall & Meyer, 1976) yields C_0(x) ≤ 5/384 and C_1(x) ≤ 1/12.

C COMPUTATIONAL COMPLEXITY

C.1 IMPLEMENTATION OF THE INVERSE OF MATRIX

For the sake of simplicity, our PyTorch implementation adopts torch.inverse to compute A⁻¹ in Eq. (12), which is implemented via LU decomposition with partial pivoting at best complexity O(n²). Its complexity is higher than the O(n) of the tridiagonal matrix algorithm, whose implementation is left for future work.

C.2 COMPUTATION OF ȯ AND ö

As Eq. (8) indicates, computing ȯ and ö requires the Hessian and Jacobian of g and the Jacobian of f. These Jacobians and Hessians first participate in the inference stage and are then involved in gradient backpropagation. However, the latest PyTorch¹ does not support computing Jacobians and Hessians in batch, so the training process would be very slow in practice, rendering the computing cost prohibitively high. We therefore propose two ways to circumvent this issue, from numerical and analytical views respectively. Numerical Differentiation. The first solution is to use numerical derivatives. We approximate the left and right limits of ȯ(t) and ö(t) as

ȯ(t⁻) = (o(t⁻) - o(t - Δ)) / Δ,   (60)
ö(t⁻) = (o(t⁻) - 2 o(t - Δ) + o(t - 2Δ)) / Δ²,   (61)
ȯ(t⁺) = (o(t + Δ) - o(t⁺)) / Δ,   (62)
ö(t⁺) = (o(t + 2Δ) - 2 o(t + Δ) + o(t⁺)) / Δ²,   (63)

where Δ = 0.001. In this way, we avoid computing Jacobians or Hessians, and the computational complexity remains almost the same as ODE-RNN's. The last row of Table 3 shows that numerical differentiation maintains the same performance as the analytical solution at 30% and 50% observation. However, when observations become sparser, e.g., 10%, analytical differentiation gains better performance. Analytical Approximation. For the analytical solution, the major computing burden is the Hessian matrix; thus, we approximate ö with Eq. (65) by simply dropping the Hessian term. The motivation and rationale are detailed in Appx. C.3.
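Eqs. (60)-(63) amount to one-sided first and second differences; a direct transcription for a scalar callable o (the helper name is ours):

```python
def numeric_derivs(o, t, delta=1e-3):
    """One-sided finite differences for the left/right first and second
    derivatives of o at t, as in Eqs. (60)-(63). Only samples from one side
    of t are used per limit, consistent with a possible jump at t."""
    do_left = (o(t) - o(t - delta)) / delta
    d2o_left = (o(t) - 2 * o(t - delta) + o(t - 2 * delta)) / delta**2
    do_right = (o(t + delta) - o(t)) / delta
    d2o_right = (o(t + 2 * delta) - 2 * o(t + delta) + o(t)) / delta**2
    return do_left, d2o_left, do_right, d2o_right
```

For a smooth o the left and right values agree up to O(Δ); across a jump they differ, which is exactly what the jump differences ṙ_k, r̈_k measure.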

C.3 COMPUTATION REDUCTION FOR ANALYTICAL DERIVATIVE

To further reduce the computation of the analytical derivatives ȯ and ö, we investigate whether blocking the gradient of ȯ or ö affects the interpolation performance. From the first three rows of Table 3, we see that blocking the gradient of ȯ performs worse than blocking ö, indicating that ȯ is more important than ö for computing the compensation c. In light of this, we further drop ö in Eq. (7), i.e., set the second-order jump difference r̈_k to zero. According to the performance shown in the fourth row of Table 3, ö has a minor impact on the compensation c when observations are dense. To investigate why, we check how ö enters the computation of c and find that they are linked through d_k in Eq. (12). We restate d_k here for clarity:

d_k = 6 ([t_k^+, t_{k+1}^-] - [t_{k-1}^+, t_k^-]) / (τ_{k-1} + τ_k) + (6ṙ_k - 2r̈_k τ_{k-1} - r̈_{k+1} τ_k) / (τ_{k-1} + τ_k).   (64)

From Eqs. (6)-(7) and the second term above, ṙ_k and r̈_k serve as the bridge between ȯ, ö and d_k, so the importance of ȯ and ö in generating the compensation c can be examined by estimating the relative significance of the r̈ terms versus the ṙ term in d_k. We denote this relative significance as the fraction

s = |2r̈_k τ_{k-1} + r̈_{k+1} τ_k| / |6ṙ_k| ≤ 3 max{|r̈_k τ_{k-1}|, |r̈_{k+1} τ_k|} / (6|ṙ_k|).

Without loss of generality, let max{|r̈_k τ_{k-1}|, |r̈_{k+1} τ_k|} = |r̈_k τ_{k-1}|; then s ≤ |r̈_k τ_{k-1}| / (2|ṙ_k|). The factor τ_{k-1} in the numerator is the time interval between two adjacent observations. In our experiments, 100 samples are uniformly drawn over 5 s, and a certain percentage of them is selected as observations. With 50% of the samples observed, the average τ_k = 5/50 = 0.1; thus s can be informally estimated as s ≤ |r̈_k| / (20|ṙ_k|), indicating that ȯ is more important than ö in our experiments. As the observation ratio grows, τ_k becomes smaller, so the relative significance of ṙ_k over r̈_k becomes even larger.
Armed with the above intuition that ö is less important, and the fact that the Hessian contributes the major complexity O(W²), we drop the Hessian term, leading to the final approximation of ö:

ö ≈ (∂g/∂h)(∂f/∂h) f.   (65)

Such an approximation yields decent performance in practice. In addition, because o is multi-dimensional and g is vector-valued in practice, the Jacobian needs to run in batch; we implement our own Jacobian operation to tackle these difficulties and run fast.

Table 4 (MSE of CSSC on the MuJoCo test set; rows are observation ratios, columns correspond to α increasing from 0 to 10000):
10%: 0.006551, 0.011915, 0.009691, 0.005421, 0.006097, 0.005886
30%: 0.000745, 0.000463, 0.000426, 0.000400, 0.000375, 0.000422
50%: 0.000126, 0.000102, 0.000074, 0.000072, 0.000087, 0.000057
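The approximation ö ≈ (∂g/∂h)(∂f/∂h) f needs only Jacobian-vector products, never a materialized Jacobian or Hessian. A finite-difference JVP sketch (our own helper, not the paper's implementation); chaining jvp(g, h, jvp(f, h, f(h))) then realizes the right-hand side of Eq. (65):

```python
def jvp(f, h, v, eps=1e-6):
    """Jacobian-vector product (df/dh) v by central finite differences,
    without materializing the Jacobian. f maps a list to a list."""
    fp = f([hi + eps * vi for hi, vi in zip(h, v)])
    fm = f([hi - eps * vi for hi, vi in zip(h, v)])
    return [(a - b) / (2 * eps) for a, b in zip(fp, fm)]
```

Two chained JVPs cost only a handful of extra function evaluations per point, versus the O(W²) work of forming the Hessian.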



¹ PyTorch version 1.6.0, updated in July.



Figure 2: Visual results for the sinuous wave dataset with 10 observations. (a) compares CSSC against other methods; (b) demonstrates the effect of c in smoothing o; (c) shows that the hidden CSSC output is smooth because its unsmooth hidden state h is smoothed into ĥ.

Figure 2 (a) illustrates the benefit of CSSC over cubic spline interpolation and ODE-RNN. Cubic spline interpolation, which does not learn from the dataset, cannot recover the curve when observations are sparse. ODE-RNN performs poorly when a jump occurs at an observation. CSSC eliminates such jumps and guarantees smoothness at the observation times, yielding a good interpolation. In Figure 2 (b), we visualize the ODE-RNN output o, the compensation c, and the CSSC output ô in detail to demonstrate how a bad o becomes a good ô by adding c. Finally, Fig. 2 (c) shows the first dimension of the hidden states before and after smoothing, where the smoothed hidden state leads to a smoothed output.

Figure 3: Visual results for MuJoCo. We visualize part of the interpolated trajectory at 5-frame intervals. 10% of the 100 frames are observed. Observations are highlighted with white boxes.

Figure 4: Visual results for Moving MNIST. Observations are indicated with white boxes. The discontinuity comparison is highlighted with red boxes.



Figure 5: The interpolation of CSSC with different α on the MuJoCo dataset. The curve shown is the 4th dimension of the hopper's state.


Table 1: Interpolation MSE on the toy, MuJoCo, and Moving MNIST test sets with different percentages of observations.

Table 2: The MSE on the MuJoCo test set for the study of different training strategies.

Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine, 1(1):18, 2018.
Yulia Rubanova, Tian Qi Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In NeurIPS, 2019.
Carl Runge. Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik, 46(224-243):20, 1901.
Satya Narayan Shukla and Benjamin Marlin. Interpolation-prediction networks for irregularly sampled time series. In ICLR, 2019.
Qingxiong Tan, Mang Ye, Baoyao Yang, Si-Qi Liu, and Andy Jinhua Ma. DATA-GRU: Dual-attention time-aware gated recurrent unit for irregular multivariate time series. 2020.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Table 3: Interpolation MSE on the toy sinuous wave test set at different observation ratios. We compare the analytical and numerical differentiation of ȯ and ö for CSSC under different settings, where "block" means blocking the gradient, "drop" means setting the term to zero, and CSSC is the standard implementation.

Table 4: The MSE of CSSC for different α on the MuJoCo test set. It compares different data samples.


Lemma 6. (Birkhoff & Priver, 1967) Given any function v ∈ C⁴(t_k, t_{k+1}), let u be the cubic Hermite interpolant matching v (its values and first derivatives at both endpoints); then

|(v - u)^(r)(t)| ≤ A_r ||v^(4)||_∞ τ^{4-r},  r = 0, 1.

This is the error bound for cubic Hermite interpolation. Given the above lemmas, we are ready for the formal proof of Theorem 2.

Proof of Theorem 2:

Proof. According to Lemma 6, we let v = x - o (since x - o ∈ C⁴(t_k, t_{k+1}) by Lemma 5) and let u be the cubic Hermite interpolant matching v; the bound above then holds with v = x - o. The difference u - c is a cubic function with endpoint values and derivatives determined on each interval [t_k, t_{k+1}) as in Eqs. (51) and (52), which can be reconstructed in the Hermite basis. From Lemma 4 and Lemma 5, e is bounded as in Eq. (42). Combining Eqs. (42) and (53) yields the bound of Eq. (15).

