NEURAL CDES FOR LONG TIME SERIES VIA THE LOG-ODE METHOD

Abstract

Neural Controlled Differential Equations (Neural CDEs) are the continuous-time analogue of an RNN, just as Neural ODEs are analogous to ResNets. However, just like RNNs, Neural CDEs can be difficult to train on long time series. Here, we propose to apply a technique drawn from stochastic analysis, namely the log-ODE method. Instead of using the original input sequence, our procedure summarises the information over local time intervals via the log-signature map, and uses the resulting shorter stream of log-signatures as the new input. This represents a length/channel trade-off. In doing so we demonstrate efficacy on problems of length up to 17k observations, and observe significant training speed-ups, improvements in model performance, and reduced memory requirements compared to the existing algorithm.

1. INTRODUCTION

Neural controlled differential equations (Neural CDEs) (Kidger et al., 2020) are the continuous-time analogue to a recurrent neural network (RNN), and provide a natural method for modelling temporal dynamics with neural networks. Neural CDEs are similar to neural ordinary differential equations (Neural ODEs), as popularised by Chen et al. (2018). A Neural ODE is determined by its initial condition, without a direct way to modify the trajectory given subsequent observations. In contrast, the vector field of a Neural CDE depends upon the time-varying data, so that the trajectory of the system is driven by a sequence of observations.

1.1. CONTROLLED DIFFERENTIAL EQUATIONS

We begin by stating the definition of a CDE. Let a, b ∈ R with a < b, and let v, w ∈ N. Let ξ ∈ R^w. Let X : [a, b] → R^v be a continuous function of bounded variation (which is implied, for example, by X being Lipschitz), and let f : R^w → R^{w×v} be continuous. Then we may define Z : [a, b] → R^w as the unique solution of the controlled differential equation

$$Z_a = \xi, \qquad Z_t = Z_a + \int_a^t f(Z_s) \, dX_s \quad \text{for } t \in (a, b]. \tag{1}$$

The notation "f(Z_s) dX_s" denotes a matrix-vector product, and if X is differentiable then

$$\int_a^t f(Z_s) \, dX_s = \int_a^t f(Z_s) \frac{dX}{ds}(s) \, ds.$$

If dX_s in equation (1) were replaced with ds, then the equation would just be an ODE. Using dX_s causes the solution to depend continuously on the evolution of X. We say that the solution is "driven by the control X".
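As a toy illustration of a CDE being "driven by the control", the following sketch (not from the paper; the control X(t) = sin t and vector field f(z) = z are made-up choices with a known closed-form solution) steps an Euler scheme with the increments of X:

```python
import math

def solve_cde_euler(f, X, a, b, xi, n_steps=100_000):
    """Euler scheme for dZ = f(Z) dX: step with the increments of the control X."""
    Z, h = xi, (b - a) / n_steps
    for k in range(n_steps):
        t = a + k * h
        Z = Z + f(Z) * (X(t + h) - X(t))  # f(Z_s) dX_s, scalar case
    return Z

# Scalar CDE dZ = Z dX driven by X(t) = sin(t); exact solution Z_t = xi * exp(X_t - X_a).
X = math.sin
Z_num = solve_cde_euler(lambda z: z, X, 0.0, 2.0, xi=1.0)
Z_exact = math.exp(X(2.0) - X(0.0))
print(Z_num, Z_exact)
```

Replacing the increment `X(t + h) - X(t)` with `h` in the update line recovers a plain ODE solver, matching the remark above.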

1.2. NEURAL CONTROLLED DIFFERENTIAL EQUATIONS

We recall the definition of a Neural CDE as introduced in Kidger et al. (2020). Consider a time series x as a collection of points x_i ∈ R^{v−1} with corresponding time-stamps t_i ∈ R, such that x = ((t_0, x_0), (t_1, x_1), ..., (t_n, x_n)) and t_0 < ... < t_n. Let X : [t_0, t_n] → R^v be some interpolation of the data such that X_{t_i} = (t_i, x_i). Kidger et al. (2020) use natural cubic splines; here we will actually find piecewise linear interpolation to be a more convenient choice. (We avoid the issues with adaptive solvers discussed in Kidger et al. (2020, Appendix A) simply by using fixed solvers.) Let ξ_θ : R^v → R^w and f_θ : R^w → R^{w×v} be neural networks, and let ℓ_θ : R^w → R^q be linear, for some output dimension q ∈ N. Here θ is used to denote dependence on learnable parameters. We define Z as the hidden state and Y as the output of a neural controlled differential equation driven by X if

$$Z_{t_0} = \xi_\theta(t_0, x_0), \qquad Z_t = Z_{t_0} + \int_{t_0}^t f_\theta(Z_s) \, dX_s, \qquad Y_t = \ell_\theta(Z_t) \quad \text{for } t \in (t_0, t_n]. \tag{2}$$

That is, just like an RNN, we have an evolving hidden state Z, from which we take a linear map to produce an output. This formulation is a universal approximator (Kidger et al., 2020, Appendix B). The output may be either the time-evolving Y_t or just the final Y_{t_n}. This is then fed into a loss function (L², cross entropy, ...) and trained via stochastic gradient descent in the usual way.

The question remains how to compute the integral of equation (2). Kidger et al. (2020) let

$$g_{\theta,X}(Z, s) = f_\theta(Z) \frac{dX}{ds}(s), \tag{3}$$

where the right hand side denotes a matrix multiplication, and then note that the integral can be written as

$$Z_t = Z_{t_0} + \int_{t_0}^t g_{\theta,X}(Z_s, s) \, ds. \tag{4}$$

This reduces the CDE to an ODE, so that existing tools for Neural ODEs may be used to evaluate it, and to backpropagate.
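A minimal sketch of this reduction in plain Python (illustrative only: the data, the hidden size, and the constant matrix standing in for the trained network f_θ are all invented; a real Neural CDE would use a learnt vector field and a proper ODE solver such as torchdiffeq):

```python
# Reduce the CDE to an ODE: interpolate discrete data piecewise linearly,
# then Euler-solve dZ/ds = g(Z, s) = f(Z) @ dX/ds.
ts = [0.0, 1.0, 2.0, 3.0]
xs = [[0.0, 0.0], [1.0, 0.5], [2.0, 0.3], [3.0, 1.0]]  # (time, value) channels, v = 2

def dX_ds(s):
    # derivative of the piecewise-linear interpolation X with X(t_i) = x_i
    i = min(int(s), len(ts) - 2)
    return [(xs[i + 1][c] - xs[i][c]) / (ts[i + 1] - ts[i]) for c in range(2)]

def f(z):
    # stand-in for the network f_theta : R^w -> R^{w x v}; here a fixed matrix, w = 2
    return [[0.1, -0.2], [0.3, 0.05]]

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

Z, s, h = [1.0, 0.0], 0.0, 3.0 / 3000  # Euler solve from t_0 to t_n
for _ in range(3000):
    dz = matvec(f(Z), dX_ds(s))
    Z = [Z[k] + h * dz[k] for k in range(2)]
    s += h
print(Z)
```

Because f is constant here, the result is just Z_{t_0} plus the matrix applied to the total increment of X, which makes the example easy to check by hand.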
By moving from the discrete-time formulation of an RNN to the continuous-time formulation of a Neural CDE, every kind of time series data is put on the same footing, whether it is regularly or irregularly sampled, whether or not it has missing values, and whether or not the input sequences are of consistent length. Besides this, the continuous-time or differential equation formulation may be useful in applications where such models are explicitly desired, as when modelling physics.

1.3. CONTRIBUTIONS

Neural CDEs, as with RNNs, begin to break down for long time series. Training loss/accuracy worsens, and training time becomes prohibitive due to the sheer number of forward operations within each training epoch. Here, we apply the log-ODE method, a numerical method from stochastic analysis and rough path theory. It is a method for converting a CDE to an ODE, which may in turn be solved via standard ODE solvers, and thus acts as a drop-in replacement for the original procedure that uses the derivative of the control path. We find that this method is particularly beneficial for long time series (and, incidentally, it does not require differentiability of the control path). With this method both training time and model performance of Neural CDEs are improved, and memory requirements are reduced. The resulting scheme has two very neat interpretations. In terms of numerical differential equation solvers, it corresponds to taking integration steps larger than the discretisation of the data, whilst incorporating substep information through additional terms. In terms of machine learning, it corresponds to binning the data prior to running a Neural CDE, with bin statistics carefully chosen to extract precisely the information most relevant to solving a CDE.

Figure 1: Left: The original Neural CDE formulation. The path X is quickly varying, meaning a lot of integration steps are needed to resolve it. Right: The log-ODE method. The log-signature path is more slowly varying (in a higher dimensional space), and needs fewer integration steps to resolve.

2. THEORY

We begin with motivating theory, though we note that this section is not essential for using the method. Readers more interested in practical applications should feel free to skip to section 3.

2.1. SIGNATURES AND LOG-SIGNATURES

The signature transform is a map from paths to a vector of real values, specifying a collection of statistics about the path. It is a central component of the theory of controlled differential equations, since these statistics describe how the data interacts with dynamical systems. The log-signature is then formed by representing the same information in a compressed format. We begin by providing a formal definition of the signature and a description of the log-signature. We then give some intuition, first into the geometry of the first few terms of the (log-)signature, and then by providing a short example of how these terms appear when solving CDEs.

Signature transform Let x = (x_1, ..., x_n), where x_i ∈ R^v. Let T > 0 and 0 = t_1 < t_2 < ... < t_{n−1} < t_n = T be arbitrary. Let X = (X^1, ..., X^v) : [0, T] → R^v be the unique continuous function such that X(t_i) = x_i and which is affine on the intervals in between (essentially just a linear interpolation of the data). Letting

$$S^{i_1, \ldots, i_k}_{a,b}(X) = \underset{a < t_1 < \cdots < t_k < b}{\int \cdots \int} \ \prod_{j=1}^k \frac{dX^{i_j}}{dt}(t_j) \, dt_j,$$

the depth-N signature transform of X is given by

$$\mathrm{Sig}^N_{a,b}(X) = \left( \left\{ S^{(i)}_{a,b}(X) \right\}_{i=1}^v,\ \left\{ S^{(i,j)}_{a,b}(X) \right\}_{i,j=1}^v,\ \ldots,\ \left\{ S^{(i_1,\ldots,i_N)}_{a,b}(X) \right\}_{i_1,\ldots,i_N=1}^v \right).$$

This definition is independent of the choice of T and t_i (Bonnier et al., 2019, Proposition A.7). We see that the signature is a collection of integrals, with each integral defining a real value. It is a graded sequence of statistics that characterise the input time series. In particular, Hambly & Lyons (2010) show that under mild conditions, Sig^∞(X) completely determines X up to translation (provided time is included as a channel in X).

Log-signature transform However, the signature transform has some redundancy: a little algebra shows that, for example,

$$S^{1,2}_{a,b}(X) + S^{2,1}_{a,b}(X) = S^1_{a,b}(X) \, S^2_{a,b}(X),$$

so that for instance we already know S^{2,1}_{a,b}(X) provided we know the other three quantities.

Figure 2: Geometric intuition for the first two levels of the log-signature of a 2-dimensional path. The depth 1 terms correspond to the change in each of the coordinates over the interval. The depth 2 term corresponds to the Lévy area of the path, this being the signed area between the curve and the chord joining its start and endpoints.

The log-signature transform is then essentially obtained by computing the signature transform and throwing out redundant terms, to obtain some (nonunique) minimal collection. Starting from the depth-N signature transform and removing some fixed set of redundancies produces the depth-N log-signature transform. We denote this LogSig^N_{a,b}, which is a map from Lipschitz continuous paths [a, b] → R^v into R^{β(v,N)}, where β(v, N) denotes the dimension of the log-signature. The precise procedure is a little involved; both it and a formula for β(v, N) can be found in Appendix A.
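For piecewise linear paths the first two levels can be computed by hand. The helper below (a hand-rolled 2-dimensional illustration of our own, not the general library implementation) returns the two increments together with the Lévy area, using the convention A = (S^{1,2} − S^{2,1})/2:

```python
def logsig_depth2_2d(points):
    """Depth-2 log-signature of a piecewise-linear 2-d path: the two coordinate
    increments plus the Levy area (the depth 2 term of figure 2)."""
    x0, y0 = points[0]
    inc_x, inc_y = points[-1][0] - x0, points[-1][1] - y0
    area = 0.0
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        # shoelace-style contribution of each linear segment, relative to the start
        area += 0.5 * ((xa - x0) * (yb - ya) - (ya - y0) * (xb - xa))
    return inc_x, inc_y, area

right_up = logsig_depth2_2d([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
up_right = logsig_depth2_2d([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
print(right_up, up_right)  # same increments, opposite Levy areas
```

The two paths share their increments but have opposite Lévy areas, which is exactly the extra information the depth 2 term contributes.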

Geometric intuition

In figure 2 we provide geometric intuition for the first two levels of the log-signature (which have particularly natural interpretations).

(Log-)Signatures and CDEs (Log-)signatures are intrinsically linked to solutions of CDEs. Let D_f denote the Jacobian of a function f. Now expand equation (1) by linearising the vector field f and neglecting higher order terms:

$$\begin{aligned} Z_t &\approx Z_a + \int_a^t \left[ f(Z_a) + D_f(Z_a)(Z_s - Z_a) \right] \frac{dX}{ds}(s) \, ds \\ &= Z_a + \int_a^t \left[ f(Z_a) + D_f(Z_a) \int_a^s f(Z_u) \frac{dX}{du}(u) \, du \right] \frac{dX}{ds}(s) \, ds \\ &\approx Z_a + f(Z_a) \int_a^t \frac{dX}{ds}(s) \, ds + D_f(Z_a) f(Z_a) \int_a^t \int_a^s \frac{dX}{du}(u) \, du \, \frac{dX}{ds}(s) \, ds \\ &= Z_a + f(Z_a) \left\{ S(X)^{(i)} \right\}_{i=1}^v + D_f(Z_a) f(Z_a) \left\{ S(X)^{(i,j)} \right\}_{i,j=1}^v. \end{aligned}$$

This gives a Taylor expansion of the solution, whose coefficients involve the terms of the signature. Higher order Taylor expansions result in corrections involving higher order signature terms. We refer the reader to section 7.1 of Friz & Victoir (2010) for further details.
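In one dimension this expansion can be sanity-checked directly: for dZ = Z dX we have f(Z) = Z and D_f(Z)f(Z) = Z, while S^{(1)} = ΔX and S^{(1,1)} = ΔX²/2, so the depth-2 approximation is Z_a(1 + ΔX + ΔX²/2) against the exact Z_a e^{ΔX}:

```python
import math

Z_a, dX = 1.0, 0.1
taylor2 = Z_a + Z_a * dX + Z_a * dX**2 / 2  # Z_a + f * S^(1) + D_f f * S^(1,1)
exact = Z_a * math.exp(dX)                  # exact solution of dZ = Z dX
print(exact - taylor2)                      # remainder is O(dX^3), about dX^3 / 6
```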

2.2. THE LOG-ODE METHOD

Recall that for X : [a, b] → R^v we have LogSig^N_{a,b}(X) ∈ R^{β(v,N)}. The log-ODE method states that Z_b ≈ \hat Z_b, where

$$\hat Z_u = \hat Z_a + \int_a^u \hat f(\hat Z_s) \, \frac{\mathrm{LogSig}^N_{a,b}(X)}{b - a} \, ds, \qquad \hat Z_a = Z_a, \tag{8}$$

where Z is as defined in equation (1), and the relationship between f and \hat f is given in Appendix A. That is, the solution of the CDE may be approximated by the solution of an ODE. In practice, we go further and pick points r_i such that a = r_0 < r_1 < ⋯ < r_m = b, split the CDE of equation (1) into an integral over [r_0, r_1], an integral over [r_1, r_2], and so on, and apply the log-ODE method to each interval separately. See Appendix A for more details and Appendix B for a proof of convergence. See also Janssen (2011); Lyons (2014); Boutaib et al. (2014) for other discussions of the log-ODE method, and Gaines & Lyons (1997); Gyurkó & Lyons (2008); Flint & Lyons (2015); Foster et al. (2020) for applications of the log-ODE method to stochastic differential equations (SDEs).
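This claim can be checked numerically in a self-contained way (a toy example with made-up 2×2 matrices, not from the paper). For linear vector fields f_i(z) = B_i z, the depth-2 log-ODE vector field over one step is (ΔX¹ B_1 + ΔX² B_2 + A (B_2 B_1 − B_1 B_2)) z, where A is the Lévy area; for a right-then-up path this should beat the increment-only depth-1 approximation:

```python
def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(2)) for r in range(2)]

def euler_linear(M, z, n=50_000):
    # Euler solve of dz/du = M z on [0, 1]
    h = 1.0 / n
    for _ in range(n):
        Mz = matvec(M, z)
        z = [z[k] + h * Mz[k] for k in range(2)]
    return z

B1 = [[0.0, 0.4], [0.0, 0.0]]  # made-up linear vector fields f_i(z) = B_i z
B2 = [[0.0, 0.0], [0.4, 0.0]]
xi = [1.0, 0.0]

# Reference: drive the CDE along the path (0,0) -> (1,0) -> (1,1),
# one linear ODE per segment.
ref = euler_linear(B2, euler_linear(B1, xi))

add = lambda A, B, s: [[A[r][c] + s * B[r][c] for c in range(2)] for r in range(2)]
mul = lambda A, B: [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)] for r in range(2)]

# Log-signature of the right-then-up path: increments (1, 1), Levy area A = 1/2.
M1 = add(B1, B2, 1.0)                                   # depth 1: increments only
M2 = add(M1, add(mul(B2, B1), mul(B1, B2), -1.0), 0.5)  # depth 2: + A * bracket

err = lambda z: max(abs(z[0] - ref[0]), abs(z[1] - ref[1]))
err1, err2 = err(euler_linear(M1, xi)), err(euler_linear(M2, xi))
print(err1, err2)  # the depth-2 correction shrinks the one-step error
```

The bracket appears as B_2 B_1 − B_1 B_2 because the Lie bracket of the vector fields z ↦ B_1 z and z ↦ B_2 z reverses the matrix commutator.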

3. METHOD

We move on to discussing the application of the log-ODE method to Neural CDEs. Recall that we observe a time series x = ((t_0, x_0), (t_1, x_1), ..., (t_n, x_n)), and have constructed a piecewise linear interpolation X : [t_0, t_n] → R^v such that X_{t_i} = (t_i, x_i). We now pick points r_i such that t_0 = r_0 < r_1 < ⋯ < r_m = t_n. In principle these can be variably spaced, but in practice we will typically space them equally far apart. The total number of points m should be much smaller than n. Section 2 introduced the log-signature transform. To recap: for X : [t_0, t_n] → R^v and t_0 ≤ r_i < r_{i+1} ≤ t_n, the depth-N log-signature of X over the interval [r_i, r_{i+1}] is some collection of statistics LogSig^N_{r_i,r_{i+1}}(X) ∈ R^{β(v,N)}. In particular, these statistics are precisely those most relevant to solving the CDE of equation (1).

3.1. UPDATING THE NEURAL CDE HIDDEN STATE EQUATION VIA THE LOG-ODE METHOD

Recall how the Neural CDE formulation of equation (2) was solved via equations (3) and (4). For the log-ODE approach we replace equation (3) with the piecewise definition

$$g_{\theta,X}(Z, s) = \hat f_\theta(Z) \, \frac{\mathrm{LogSig}^N_{r_i,r_{i+1}}(X)}{r_{i+1} - r_i} \quad \text{for } s \in [r_i, r_{i+1}),$$

where \hat f_θ : R^w → R^{w×β(v,N)} is an arbitrary neural network, and the right hand side denotes a matrix-vector product between \hat f_θ(Z) and the log-signature. Equation (4) then becomes

$$Z_t = Z_{t_0} + \int_{t_0}^t g_{\theta,X}(Z_s, s) \, ds,$$

which may now be solved as a (Neural) ODE using standard ODE solvers.
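The whole pipeline can be sketched end-to-end in plain Python. Everything here is illustrative: the synthetic series, the window size, the hidden dimension, and the fixed random matrix standing in for the network \hat f_θ are all invented, and a single Euler step is taken per interval:

```python
# Sketch of section 3: window a 2-channel series into m intervals, compute each
# window's depth-2 log-signature (increments + Levy area), then Euler-solve
# dZ/ds = fhat(Z) @ logsig_i / (r_{i+1} - r_i) interval by interval.
import math, random

random.seed(0)
n, step = 1000, 100                       # series length n, window size
xs = [(float(t), math.sin(0.1 * t) + 0.1 * random.random()) for t in range(n)]

def logsig2(points):
    (x0, y0), (xn, yn) = points[0], points[-1]
    area = 0.5 * sum((xa - x0) * (yb - ya) - (ya - y0) * (xb - xa)
                     for (xa, ya), (xb, yb) in zip(points, points[1:]))
    return [xn - x0, yn - y0, area]       # beta(2, 2) = 3 channels

logsigs = [logsig2(xs[i:i + step + 1]) for i in range(0, n - 1, step)]  # length m

w = 4  # hidden state dimension; fhat is a fixed stand-in for the network f_theta
fhat = [[0.01 * random.gauss(0, 1) for _ in range(3)] for _ in range(w)]

Z = [1.0] * w
for sig in logsigs:                       # one Euler step per interval:
    drive = [sum(fhat[r][c] * sig[c] for c in range(3)) for r in range(w)]
    Z = [Z[k] + drive[k] for k in range(w)]   # step size cancels the 1/(r_{i+1}-r_i)
print(len(logsigs), Z)
```

The original length-n series has become a length-m sequence of β(2, 2)-dimensional log-signatures, which is the length/channel trade-off discussed below.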

3.2. RELATIONSHIP TO THE ORIGINAL METHOD

Suppose we happened to choose r_i = t_i and r_{i+1} = t_{i+1}, with depth N = 1. The depth 1 log-signature is just the increment of the path over the interval, so the log-signature term becomes

$$\frac{\mathrm{LogSig}^1_{t_i,t_{i+1}}(X)}{t_{i+1} - t_i} = \frac{\Delta X_{[t_i,t_{i+1}]}}{t_{i+1} - t_i} = \frac{dX_{\mathrm{linear}}}{dt}(s) \quad \text{for } s \in [t_i, t_{i+1}),$$

that is to say, the same as obtained via the original method when using linear interpolation.

3.3. DISCUSSION

Ease of Implementation This method is straightforward to implement using pre-existing tools. There are standard libraries available for computing the log-signature transform; we use Signatory (Kidger & Lyons, 2020b). The equation of section 3.1 is an ODE, so it may be solved directly using tools such as torchdiffeq (Chen, 2018). As an alternative, we note that the vector field g_{θ,X} can be written in precisely the same form as equation (3), with the driving path taken to be piecewise linear in log-signature space. Computation of the log-signatures can therefore be considered a preprocessing step, producing a sequence of log-signatures. From this we may construct a path in log-signature space, and apply existing tools for Neural CDEs. This idea is summarised in figure 1. We make this approach available in the [redacted] open source project.

Structure of f

The description here aligns with the log-ODE scheme described in equation (8). There is one discrepancy: we do not attempt to model the specific structure of \hat f. This is in principle possible, but computationally expensive. Instead, we model \hat f as a neural network directly. This need not exhibit the requisite structure, but as neural networks are universal approximators (Pinkus, 1999; Kidger & Lyons, 2020a), this approach is at least as general from a modelling perspective.

Lossy Representation

The log-signature transform can be thought of as a lossy representation for time series. This is made rigorous in Diehl et al. (2020), where it is shown that the log-signature can be obtained by iterating an "area" operation between paths. For CDEs, these geometric features precisely encode the interaction between the data and the system.

Length/Channel Trade-Off

The sequence of log-signatures is now of length m, which was chosen to be much smaller than n. As such, it is much more slowly varying over the interval [t_0, t_n] than the original data, which was of length n. The differential equation it drives is better behaved, and so larger integration steps may be used in the numerical solver. This is the source of the speed-ups of this method; we observe typical speed-ups by a factor of about 100. Each element is a log-signature of size β(v, N) ≥ v; the additional channels are higher-order corrections that compensate for the larger integration steps.

Generality of the Log-ODE Method

If depth N = 1 and steps r_i = t_i are used, then the above formulation reduces exactly to the original Neural CDE formulation using linear interpolation. The log-ODE method thus in fact generalises the original approach.

Applications

In principle the log-ODE method may be applied to solve any Neural CDE. In practice, the reduction in length (from n to m), coupled with the loss of information (from using the log-signature as a summary statistic), makes it particularly useful for long time series.

Memory Efficiency

Long sequences need large amounts of memory to perform backpropagation-through-time (BPTT). As with the original Neural CDEs, the log-ODE approach supports memory-efficient backpropagation via the adjoint equations, alleviating this issue. See Kidger et al. (2020).

The Depth and Step Hyperparameters

To solve a Neural CDE accurately via the log-ODE method, we should be prepared to take the depth N suitably large, or the intervals r_{i+1} − r_i suitably small. Accomplishing this would realistically require that they are taken very large or very small, respectively. Instead, we treat these as hyperparameters. This makes the use of the log-ODE method a modelling choice rather than an implementation detail. Increasing the step size leads to faster (but less informative) training, by reducing the number of operations in the forward pass. Increasing the depth leads to slower (but more informative) training, as more information about each local interval is used in each update.

4. EXPERIMENTS

We investigate solving a Neural CDE with and without the log-ODE method on four real-world problems. Every problem was chosen for its long length. The lengths are in fact sufficiently long that adjoint-based backpropagation (Chen et al., 2018) was needed to avoid running out of memory at any reasonable batch size. Every problem is regularly sampled, so we take t_i = i. We will denote a Neural CDE model with the log-ODE method, using depth N and step s, as NCDE^s_N. Taking N = 1 (and any s) corresponds to not using the log-ODE method, with the data subsampled at rate 1/s, as per section 3.3. Thus we use NCDE^1_1 as our benchmark: no subsampling, no log-ODE method. In principle we would also compare against RNN variants, but we do not, for simple practical reasons: RNN-based models do not fit in the memory of the GPU resources we have available. (Avoiding this is one of the main advantages of using differential equation models in the first place, for which adjoint backpropagation is available, as per the first paragraph of this section.) Each model is run three times, and we report the mean and standard deviation of the test metrics, along with the mean training times and memory usage. For each task, the hyperparameters were selected by performing a grid search on the NCDE^s_1 model, where s was chosen so that the length of the sequence was 500 steps. This was found to strike a reasonable balance between training time and sequence length. (Doing hyperparameter optimisation on the baseline NCDE^1_1 model would have been more difficult due to its larger training times.) Precise details of the experiments can be found in Appendices C and D.

4.1. CLASSIFYING EIGENWORMS

Our first example uses the EigenWorms dataset from the UEA archive (Bagnall et al., 2017). This consists of time series of length 17 984 with 6 channels (including time), corresponding to the movement of a roundworm. The goal is to classify each worm as either wild-type or one of four mutant-type classes.

Table 2: Mean and standard deviation of the L² losses on the test set for each of the vital-signs prediction tasks (RR, HR, SpO₂) on the BIDMC dataset, across three repeats. Only mean times are shown, for space. Memory usage is given as the mean over all three tasks, as it was approximately the same for any task at a given depth and step. Bold values denote the algorithm with the lowest test set loss for a fixed step size on each task.

Figure 4: Heatmap of normalised losses on the three BIDMC datasets for differing step sizes and depths.

See Table 1. We see that the straightforward NCDE^1_1 model takes roughly a day to train. Using the log-ODE method (NCDE^s_2, NCDE^s_3) speeds this up to take roughly minutes. Doing so additionally improves model performance dramatically, and reduces memory usage. Naive subsampling approaches (NCDE^8_1, NCDE^32_1, NCDE^128_1) only achieve speed-ups without performance improvements; this can be seen in the NCDE_1 column, which corresponds to naive subsampling for any step size greater than 1. We notice that the NCDE^s_3 model has faster training times than the depth 2 model (and sometimes better than depth 1) at each step size. This is because we imposed a stopping criterion when the loss failed to decrease for 60 epochs, and the NCDE^s_3 models converged in fewer epochs (though the time per epoch is still larger). See also Figure 3, in which we summarise results for a larger range of step sizes.

4.2. ESTIMATING VITALS SIGNS FROM PPG AND ECG DATA

Next we consider the problem of estimating vital signs from PPG and ECG data. This comes from the TSR archive (Tan et al., 2020), using data from the Beth Israel Deaconess Medical Centre (BIDMC). We consider three separate tasks, in which we aim to predict a person's respiratory rate (RR), heart rate (HR), and oxygen saturation (SpO₂). The data is sampled at 125Hz, with each series having length 4 000. There are 7 949 training samples and 3 channels (including time). We train a model on each of the three vital-sign prediction tasks. The metric used to evaluate performance is the L² loss. The results over a range of step sizes are presented in Table 2. We also provide heatmaps in Figure 4 for each dataset, containing the loss values (normalised to [0, 1]) for each task. The full results over all step sizes may be found in Appendix D. We find that the depth 3 model is the top performer for every task at any step size. What's more, it does so with a significantly reduced training time. We attribute the improved performance to the log-ODE model being better able to learn long-term dependencies, due to the reduced sequence length. Note that the performance of the NCDE^s_2 and NCDE^s_3 models actually improves as the step size is increased. This is in contrast to NCDE^s_1, which sees a degradation in performance.

5. LIMITATIONS OF THE LOG-ODE METHOD

Number of hyperparameters Two new hyperparameters, the truncation depth and the step size, must now also be tuned, and both have substantial effects on training time and memory usage.

Number of input channels

The log-ODE method is most feasible for low numbers of input channels, as the number of log-signature channels β(v, N) grows rapidly in v (like v^N at depth N).
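The size β(v, N) can be computed from the Lyndon-word counting formula given in Appendix A; a short sketch (our own illustration, not the paper's code) makes the growth concrete:

```python
def mobius(n):
    # Mobius function via trial division
    result, p, m = 1, 2, n
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0        # squared prime factor
            result = -result
        p += 1
    if m > 1:
        result = -result
    return result

def beta(v, N):
    # number of log-signature channels: Lyndon words of length <= N over v letters
    return sum(
        sum(mobius(k // i) * v**i for i in range(1, k + 1) if k % i == 0) // k
        for k in range(1, N + 1)
    )

print([beta(v, 2) for v in (2, 3, 5, 10)])  # depth 2: v + v(v - 1)/2
print([beta(3, N) for N in (1, 2, 3, 4)])   # fixed v: rapid growth in depth
```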

6. RELATED WORK

There has been some work on long time series for classic RNN (GRU/LSTM) models; see for example (2015). There the data is split into windows, an RNN is run over each window, and then an additional RNN is run over the first RNN's outputs; we may describe this as an RNN/RNN pair. Liao et al. (2019) then perform the equivalent operation with a log-signature/RNN pair. In this context, our use of the log-ODE method may then be described as a log-signature/NCDE pair. In comparison to Liao et al. (2019), this means moving from an inspired choice of pre-processing to an actual implementation of the log-ODE method. In doing so the differential equation structure is preserved. Moreover, this takes advantage of the synergy between log-signatures (which extract statistics on how data drives differential equations) and the controlled differential equation they then drive. Broadly speaking these connections are natural: at least within the signature/CDE/rough path community, it is a well-known but poorly-published fact that (log-)signatures, RNNs and (Neural) CDEs are all related; see for example Kidger et al. (2020) for a little exposition on this. CNNs and Transformers have been shown to offer improvements over RNNs for modelling long-term dependencies (Bai et al., 2018; Li et al., 2019). However, both can be expensive in their own right; Transformers are famously O(L²) in the length L of the time series. Whilst several approaches have been introduced to reduce this, for example Li et al. (2019) reduce it to O(L(log L)²), this can still be difficult for long series. Extensions specifically to long sequences do exist (Sourkov, 2018), but these typically focus on language modelling rather than multivariate time series data.

7. CONCLUSION

We demonstrate how to effectively apply Neural CDEs to long (17k) time series, via the log-ODE method. The model may still be solved via ODE methods and thus retains adjoint backpropagation and continuous dynamics. In doing so we see significant training speed-ups, improvements in model performance, and reduced memory requirements.

Supplementary material

In Appendices A and B, we give a more thorough introduction to solving CDEs via the log-ODE method. In Appendix C we discuss experimental details such as the choice of network structure, computing infrastructure, and the hyperparameter selection approach. In Appendix D we give a full breakdown of every experimental result.

A AN INTRODUCTION TO THE LOG-ODE METHOD FOR CONTROLLED DIFFERENTIAL EQUATIONS

The log-ODE method is an effective method for approximating the controlled differential equation

$$dY_t = f(Y_t) \, dX_t, \qquad Y_0 = \xi, \tag{13}$$

where X : [0, T] → R^d has finite length, ξ ∈ R^n, and f : R^n → L(R^d, R^n) is a function with certain smoothness assumptions so that the CDE (13) is well posed. Throughout these appendices, L(U, V) denotes the space of linear maps between the vector spaces U and V. In rough path theory, the function f is referred to as the "vector field" of (13), and is usually assumed to have Lip(γ) regularity (see Definition 10.2 in Friz & Victoir (2010)). In this section, we assume one of the below conditions on the vector field:

1. f is bounded and has N bounded derivatives.
2. f is linear.

In order to define the log-ODE method, we will first consider the tensor algebra and path signature.

Definition A.1 The tensor algebra of R^d is T(R^d) := R ⊕ R^d ⊕ (R^d)^{⊗2} ⊕ ⋯, consisting of elements a = (a_0, a_1, ⋯) with a_k ∈ (R^d)^{⊗k}, equipped with the operations

$$a + b = (a_0 + b_0, a_1 + b_1, \cdots), \tag{14}$$
$$a \otimes b = (c_0, c_1, c_2, \cdots), \tag{15}$$

where for n ≥ 0, the n-th term c_n ∈ (R^d)^{⊗n} can be written using the usual tensor product as c_n := Σ_{k=0}^n a_k ⊗ b_{n−k}. The operation ⊗ given by (15) is often referred to as the "tensor product".

Definition A.2

The signature of a finite length path X : [0, T] → R^d over the interval [s, t] is defined as the following collection of iterated (Riemann–Stieltjes) integrals:

$$S_{s,t}(X) := \left( 1, X^{(1)}_{s,t}, X^{(2)}_{s,t}, X^{(3)}_{s,t}, \cdots \right) \in T(R^d), \tag{16}$$

where for n ≥ 1,

$$X^{(n)}_{s,t} := \underset{s < u_1 < \cdots < u_n < t}{\int \cdots \int} dX_{u_1} \otimes \cdots \otimes dX_{u_n} \in (R^d)^{\otimes n}.$$

Similarly, we can define the depth-N (or truncated) signature of the path X on [s, t] as

$$S^N_{s,t}(X) := \left( 1, \int_{s < u_1 < t} dX_{u_1}, \cdots, \underset{s < u_1 < \cdots < u_N < t}{\int \cdots \int} dX_{u_1} \otimes \cdots \otimes dX_{u_N} \right) \in T^N(R^d), \tag{17}$$

where T^N(R^d) := R ⊕ R^d ⊕ (R^d)^{⊗2} ⊕ ⋯ ⊕ (R^d)^{⊗N} denotes the truncated tensor algebra.

The (truncated) signature provides a natural feature set that describes the effects a path X has on systems that can be modelled by (13). That said, defining the log-ODE method actually requires the so-called "log-signature", which efficiently encodes the same integral information as the signature. The log-signature is obtained from the path's signature by removing certain algebraic redundancies, such as

$$\int_s^t \int_s^u dX^i_v \, dX^j_u + \int_s^t \int_s^u dX^j_v \, dX^i_u = X^i_{s,t} X^j_{s,t}, \quad \text{for } i, j \in \{1, \cdots, d\},$$

which follows from the integration-by-parts formula. To this end, we will define the logarithm map on the truncated tensor algebra T^N(R^d) := R ⊕ R^d ⊕ ⋯ ⊕ (R^d)^{⊗N}.

Definition A.3 (The logarithm of a formal series) For a = (a_0, a_1, ⋯) ∈ T(R^d) with a_0 > 0, define log(a) to be the element of T(R^d) given by the following series:

$$\log(a) := \log(a_0) + \sum_{n=1}^\infty \frac{(-1)^{n-1}}{n} \left( \frac{a}{a_0} - 1 \right)^{\otimes n}, \tag{18}$$

where 1 = (1, 0, ⋯) is the unit element of T(R^d) and log(a_0) is viewed as log(a_0)1.

Definition A.4 (The logarithm of a truncated series) For a = (a_0, a_1, ⋯, a_N) ∈ T^N(R^d) with a_0 > 0, define log_N(a) to be the element of T^N(R^d) defined from the logarithm map (18) as

$$\log_N(a) := P_N\!\left( \log(\tilde a) \right), \tag{19}$$

where ã := (a_0, a_1, ⋯, a_N, 0, ⋯) ∈ T(R^d) and P_N denotes the standard projection map from T(R^d) onto T^N(R^d).
Definition A.5 The log-signature of a finite length path X : [0, T] → R^d over the interval [s, t] is defined as LogSig_{s,t}(X) := log(S_{s,t}(X)), where S_{s,t}(X) denotes the path signature of X given by Definition A.2. Likewise, the depth-N (or truncated) log-signature of X is defined for each N ≥ 1 as LogSig^N_{s,t}(X) := log_N(S^N_{s,t}(X)).

The depth-N log-signature is a map from paths X : [0, T] → R^d into R^{β(d,N)}. The exact form of β(d, N) is given by

$$\beta(d, N) = \sum_{k=1}^N \frac{1}{k} \sum_{i \mid k} \mu\!\left(\frac{k}{i}\right) d^i,$$

with μ the Möbius function. We note that the order of this remains an open question.

The final ingredient we use to define the log-ODE method is the collection of derivatives of the vector field f. It is worth noting that these derivatives also naturally appear in the Taylor expansion of (13).

Definition A.6 (Vector field derivatives) We define f^{∘k} : R^n → L((R^d)^{⊗k}, R^n) recursively by

$$f^{\circ 0}(y) := y, \qquad f^{\circ 1}(y) := f(y), \qquad f^{\circ (k+1)}(y) := D\!\left(f^{\circ k}\right)(y) \, f(y),$$

for y ∈ R^n, where D(f^{∘k}) denotes the Fréchet derivative of f^{∘k}.

Using these definitions, we can describe two closely related numerical methods for the CDE (13).

Definition A.7 (The Taylor method) Given the CDE (13), we can use the path signature of X to approximate the solution Y on an interval [s, t] via its truncated Taylor expansion. That is, we use

$$\mathrm{Taylor}\!\left(Y_s, f, S^N_{s,t}(X)\right) := \sum_{k=0}^N f^{\circ k}(Y_s) \, \pi_k\!\left(S^N_{s,t}(X)\right) \tag{20}$$

as an approximation for Y_t, where each π_k : T^N(R^d) → (R^d)^{⊗k} is the projection map onto (R^d)^{⊗k}.

Definition A.8 (The log-ODE method) Using the Taylor method (20), we define the function \hat f : R^n → L(T^N(R^d), R^n) by \hat f(z) := Taylor(z, f, ·). Whereas the Taylor method directly applies \hat f to the signature of X, giving Y^{Taylor}_t := Taylor(Y_s, f, S^N_{s,t}(X)), the log-ODE method applies \hat f to the log-signature of X and solves an ODE over [0, 1].
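These definitions can be exercised in the simplest possible setting, d = 1, where the truncated tensor algebra reduces to truncated power series in one variable and everything commutes. The signature of a 1-dimensional path with total increment ΔX is then the exponential series (ΔX^n / n!)_n, and the truncated logarithm recovers ΔX at level 1 and zero at every higher level, illustrating why for d = 1 only depth 1 carries information. A small self-contained sketch:

```python
import math

N = 5  # truncation depth

def tensor_mul(a, b):
    # truncated product, eq (15): c_n = sum_k a_k b_{n-k} (scalars, so commutative)
    return [sum(a[k] * b[n - k] for k in range(n + 1)) for n in range(N + 1)]

def tensor_log(a):
    # logarithm of Definitions A.3/A.4, assuming a[0] == 1
    u = [a[0] - 1.0] + a[1:]               # u = a - 1
    out, power = [0.0] * (N + 1), [1.0] + [0.0] * N
    for n in range(1, N + 1):
        power = tensor_mul(power, u)       # u^{(tensor) n}
        out = [o + (-1) ** (n - 1) / n * p for o, p in zip(out, power)]
    return out

dX = 0.7
signature = [dX**n / math.factorial(n) for n in range(N + 1)]  # exp series
logsig = tensor_log(signature)
print(logsig)  # ~ [0, 0.7, 0, 0, 0, 0] up to float rounding
```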
By applying \hat f to the truncated log-signature of the path X over an interval [s, t], we can define the following ODE on [0, 1]:

$$\frac{dz}{du} = \hat f(z) \, \mathrm{LogSig}^N_{s,t}(X), \qquad z(0) = Y_s. \tag{21}$$

Then the log-ODE approximation of Y_t (given Y_s and LogSig^N_{s,t}(X)) is defined as

$$\mathrm{LogODE}\!\left(Y_s, f, \mathrm{LogSig}^N_{s,t}(X)\right) := z(1).$$

Remark A.9 Our assumptions on f ensure that z ↦ \hat f(z) LogSig^N_{s,t}(X) is either globally bounded and Lipschitz continuous, or linear. Hence both the Taylor and log-ODE methods are well defined.

Remark A.10 It is well known that the log-signature of a path X lies in a certain free Lie algebra (this is detailed in section 2.2.4 of Lyons et al. (2007)). Furthermore, it is also a theorem that the Lie bracket of two vector fields is itself a vector field, which does not depend on choices of basis. By expressing LogSig^N_{s,t}(X) in a basis of the free Lie algebra, it can be shown that only the vector field f and its (iterated) Lie brackets are required to construct the log-ODE vector field \hat f(z) LogSig^N_{s,t}(X). In particular, this leads to our construction of the log-ODE (8) using the Lyndon basis of the free Lie algebra (see Reizenstein (2017) for a precise description of the Lyndon basis). We direct the reader to Lyons (2014) and Boutaib et al. (2014) for further details on this Lie theory.

To illustrate the log-ODE method, we give two examples:

Example A.11 (The "increment-only" log-ODE method) When N = 1, the ODE (21) becomes

$$\frac{dz}{du} = f(z) \, X_{s,t}, \qquad z(0) = Y_s,$$

where X_{s,t} := X_t − X_s. Therefore we see that this "increment-only" log-ODE method is equivalent to driving the original CDE (13) by a piecewise linear approximation of the control path X. This is a classical approach for stochastic differential equations (i.e. when X_t = (t, W_t) with W denoting a Brownian motion) and is an example of a Wong–Zakai approximation (see Wong & Zakai (1965) for further details).

Example A.12 (An application to SDE simulation) Consider the following affine SDE:

$$dy_t = a(b - y_t) \, dt + \sigma y_t \circ dW_t, \qquad y(0) = y_0 \in R_{\geq 0}, \tag{23}$$

where a, b ≥ 0 are the mean reversion parameters, σ ≥ 0 is the volatility, and W denotes a standard real-valued Brownian motion. The ∘ means that this SDE is understood in the Stratonovich sense. The SDE (23) is known in the literature as Inhomogeneous Geometric Brownian Motion (IGBM). Using the control path X = {(t, W_t)}_{t≥0} and setting N = 3, the log-ODE (21) becomes

$$\frac{dz}{du} = a(b - z_u) h + \sigma z_u W_{s,t} - ab\sigma A_{s,t} + ab\sigma^2 L^{(1)}_{s,t} + a^2 b\sigma L^{(2)}_{s,t}, \qquad z(0) = Y_s,$$

where h := t − s denotes the step size and the random variables A_{s,t}, L^{(1)}_{s,t}, L^{(2)}_{s,t} are given by

$$A_{s,t} := \int_s^t W_{s,r} \, dr - \tfrac{1}{2} h W_{s,t},$$
$$L^{(1)}_{s,t} := \int_s^t \int_s^r W_{s,v} \circ dW_v \, dr - \tfrac{1}{2} W_{s,t} A_{s,t} - \tfrac{1}{6} h W^2_{s,t},$$
$$L^{(2)}_{s,t} := \int_s^t \int_s^r W_{s,v} \, dv \, dr - \tfrac{1}{2} h A_{s,t} - \tfrac{1}{6} h^2 W_{s,t}.$$

In Foster et al. (2020), the depth-3 log-signature of X = {(t, W_t)}_{t≥0} was approximated so that the above log-ODE method became practical, and this numerical scheme exhibited state-of-the-art convergence rates. For example, the approximation error produced by 25 steps of the high order log-ODE method was similar to the error of the "increment-only" log-ODE method with 1000 steps.

B THE LOG-ODE METHOD FOR ROUGH DIFFERENTIAL EQUATIONS

In this section, we shall present "rough path" error estimates for the log-ODE method. In addition, we will discuss the case when the vector fields governing the rough differential equation are linear. We begin by stating the main result of Boutaib et al. (2014), which quantifies the approximation error of the log-ODE method in terms of the regularity of the system's vector field f and control path X. Since this section uses a number of technical definitions from rough path theory, we recommend Lyons et al. (2007) as an introduction to the subject. For T > 0, we will use the notation Δ_T := {(s, t) ∈ [0, T]² : s < t} to denote a rescaled 2-simplex.

Theorem B.1 (Lemma 15 in Boutaib et al. (2014)) Consider the rough differential equation

dY_t = f(Y_t) dX_t, Y_0 = ξ, (24)

where we make the following assumptions:

• X is a geometric p-rough path in R^d; that is, X : Δ_T → T^⌊p⌋(R^d) is a continuous path in the truncated tensor algebra T^⌊p⌋(R^d) := R ⊕ R^d ⊕ (R^d)^⊗2 ⊕ ··· ⊕ (R^d)^⊗⌊p⌋ with increments

X_{s,t} = (1, X^(1)_{s,t}, X^(2)_{s,t}, ···, X^(⌊p⌋)_{s,t}), X^(k)_{s,t} := π_k(X_{s,t}),

where π_k : T^⌊p⌋(R^d) → (R^d)^⊗k is the projection map onto (R^d)^⊗k, such that there exists a sequence of continuous finite-variation paths x^n : [0, T] → R^d whose truncated signatures converge to X in the p-variation metric:

d_p(S_p(x^n), X) → 0, as n → ∞,

where the p-variation distance between two continuous paths Z¹ and Z² in T^⌊p⌋(R^d) is

d_p(Z¹, Z²) := max_{1≤k≤⌊p⌋} sup_D ( Σ_{t_i∈D} ||π_k(Z¹_{t_i,t_{i+1}}) - π_k(Z²_{t_i,t_{i+1}})||^{p/k} )^{k/p},

where the supremum is taken over all partitions D of [0, T] and the norms ||·|| must satisfy (up to some constant) ||a ⊗ b|| ≤ ||a|| ||b|| for a ∈ (R^d)^⊗n and b ∈ (R^d)^⊗m. For example, we can take ||·|| to be the projective or injective tensor norms (see Propositions 2.1 and 3.1 in Ryan (2002)).

• The solution Y and its initial value ξ both take their values in R^n.
• The collection of vector fields {f_1, ···, f_d} on R^n is denoted by f : R^n → L(R^d, R^n), where L(R^d, R^n) is the space of linear maps from R^d to R^n. We will assume that f has Lip(γ) regularity with γ > p; that is, f is bounded with ⌊γ⌋ bounded derivatives, the last being Hölder continuous with exponent (γ - ⌊γ⌋). Hence the following norm is finite:

||f||_{Lip(γ)} := max_{0≤k≤⌊γ⌋} ||D^k f||_∞ ∨ ||D^⌊γ⌋ f||_{(γ-⌊γ⌋)-Höl},

where D^k f is the k-th (Fréchet) derivative of f and ||·||_{α-Höl} is the standard α-Hölder norm with α ∈ (0, 1).

• The RDE (24) is defined in the Lyons sense. Therefore by the Universal Limit Theorem (see Theorem 5.3 in Lyons et al. (2007)), there exists a unique solution Y : [0, T] → R^n.

We define the log-ODE for approximating the solution Y over an interval [s, t] ⊂ [0, T] as follows:

1. Compute the depth-⌊γ⌋ log-signature of the control path X over [s, t]; that is, we obtain LogSig^⌊γ⌋_{s,t}(X) ∈ T^⌊γ⌋(R^d), where log_⌊γ⌋(·) is defined by projecting the standard tensor logarithm map onto {a ∈ T^⌊γ⌋(R^d) : π_0(a) > 0}.

2. Construct the following (well-posed) ODE on the interval [0, 1]:

dz^{s,t}/du = F(z^{s,t}), z^{s,t}_0 = Y_s, (29)

where the vector field F : R^n → R^n is defined from the log-signature as

F(z) := Σ_{k=1}^{⌊γ⌋} f^{∘k}(z) π_k(LogSig^⌊γ⌋_{s,t}(X)). (30)

Recall that f^{∘k} : R^n → L((R^d)^⊗k, R^n) was defined previously in Definition A.6. Then we can approximate Y_t using the u = 1 solution of (29). Moreover, there exists a universal constant C_{p,γ} depending only on p and γ such that

||Y_t - z^{s,t}_1|| ≤ C_{p,γ} ||f||^γ_{Lip(γ)} ||X||^γ_{p-var;[s,t]}. (31)

Although the above theorem requires some sophisticated theory, it has a simple conclusion: namely, that log-ODEs can approximate controlled differential equations. That said, the estimate (31) does not directly apply when the vector fields {f_i} are linear, as they would then be unbounded. Fortunately, it is well known that linear RDEs are well posed and the growth of their solutions can be estimated.

Remark B.8

The above error estimate also holds when the vector field f is linear (by Remark B.6). Since γ is the truncation depth of the log-signatures used to construct each log-ODE vector field, we see that high convergence rates can be achieved by using more terms in each log-signature. It is also unsurprising that the error estimate (36) increases with the "roughness" of the control path. So, just as in our experiments, we see that the performance of the log-ODE method can be improved by choosing an appropriate step size and log-signature depth.
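The depth/accuracy trade-off can be checked directly on a small linear example. For a linear CDE driven by a two-segment piecewise-linear path in R², the exact solution is a product of two matrix exponentials, and the depth-2 log-signature is given exactly by the level-2 truncation of the Baker-Campbell-Hausdorff series, whose commutator term is the Lie-bracket (Lévy area) correction discussed in Remark A.10. The matrices, increments, and initial condition below are toy choices, not values from the paper:

```python
import numpy as np
from scipy.linalg import expm

# Two non-commuting linear vector fields f_i(y) = A_i y (toy example)
A1 = np.array([[0.0, 1.0], [0.0, 0.0]])
A2 = np.array([[0.0, 0.0], [1.0, 0.0]])

# Control path: two linear segments in R^2 with increments v1 then v2
v1 = np.array([0.05, -0.03])
v2 = np.array([-0.02, 0.04])
M1 = v1[0] * A1 + v1[1] * A2        # generator of segment 1
M2 = v2[0] * A1 + v2[1] * A2        # generator of segment 2

# Exact linear-CDE solution: compose the two segment flows
y0 = np.array([1.0, 1.0])
y_exact = expm(M2) @ expm(M1) @ y0

# Log-ODE vector fields: BCH series of (M1, M2) truncated at levels 1 and 2;
# the commutator term is exactly the depth-2 Lie-bracket correction
F1 = M1 + M2                                  # depth 1 ("increment only")
F2 = M1 + M2 + 0.5 * (M2 @ M1 - M1 @ M2)      # depth 2
y_depth1 = expm(F1) @ y0
y_depth2 = expm(F2) @ y0

err1 = np.linalg.norm(y_depth1 - y_exact)
err2 = np.linalg.norm(y_depth2 - y_exact)
```

With these small increments, the depth-2 error comes only from level-3 and higher BCH terms, so it is orders of magnitude below the depth-1 error, mirroring the convergence-rate improvement described above.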

C EXPERIMENTAL DETAILS

Code The code to reproduce the experiments is available at [redacted; see supplementary material].

Data splits Each dataset was split into a training, validation, and testing dataset with relative sizes 70%/15%/15%.

Normalisation

The training split of each dataset was normalised to zero mean and unit variance. The statistics from the training set were then used to normalise the validation and testing datasets.

Architecture We give a graphical description of the architecture used for updating the Neural CDE hidden state in figure 6. The input is first run through a multilayer perceptron with n layers of size h, with n, h being hyperparameters. ReLU nonlinearities are used at each layer except the final one, where we instead use a tanh nonlinearity. The goal of this is to help prevent term blow-up over the long sequences. Note that this is a small inconsistency between this work and the original model proposed in Kidger et al. (2020): here, we apply the tanh function as the final hidden layer nonlinearity, whilst in the original paper the tanh nonlinearity is applied after the final linear map. Both methods are used to constrain the rate of change of the hidden state; we do not know of a reason to prefer one over the other. Note that the final linear layer in the multilayer perceptron is reshaped to produce a matrix-valued output, of shape v × p (as f_θ is matrix-valued). A matrix-vector multiplication with the log-signature then produces the vector field for the ODE solver.

ODE solver All problems used the 'rk4' solver as implemented by torchdiffeq (Chen, 2018) version 0.0.1.
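The hidden-state update just described can be sketched in a dependency-free way as follows. The real model is a learned PyTorch module; here the weights are random placeholders and the dimensions (v hidden channels, p log-signature channels) are toy values, so this only illustrates the shapes and the placement of the nonlinearities:

```python
import numpy as np

rng = np.random.default_rng(0)
v, p, h, n_layers = 8, 5, 16, 2     # hidden size, logsig channels, MLP width/depth (toy)

# Hypothetical weights standing in for the learned parameters of f_theta
Ws = [rng.normal(scale=0.1, size=(h, v))]
Ws += [rng.normal(scale=0.1, size=(h, h)) for _ in range(n_layers - 1)]
W_out = rng.normal(scale=0.1, size=(v * p, h))

def vector_field(z):
    """MLP with ReLU hidden layers and tanh at the final hidden layer,
    then a linear map reshaped to a v x p matrix (figure 6)."""
    x = z
    for i, W in enumerate(Ws):
        x = W @ x
        x = np.tanh(x) if i == len(Ws) - 1 else np.maximum(x, 0.0)
    return (W_out @ x).reshape(v, p)

def hidden_state_update(z, logsig):
    """The ODE right-hand side: matrix-vector product of f_theta(Z) with
    the log-signature of the current subinterval."""
    return vector_field(z) @ logsig

z = rng.normal(size=v)
logsig = rng.normal(size=p)
dz = hidden_state_update(z, logsig)
```

The tanh before the final linear map bounds the pre-output activations, which (as described above) limits the rate of change of the hidden state over long sequences; the output itself is then passed to the ODE solver.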

Table 6: Mean and standard deviation of the L² losses on the test set for each of the vital-sign prediction tasks (RR, HR, SpO₂) on the BIDMC dataset, across three repeats. Only mean times are shown for space. The memory usage is given as the mean over all three tasks, as it was approximately the same for any task at a given depth and step. The bold values denote the algorithm with the lowest test set loss for a fixed step size for each task.



For the reader familiar with numerical methods for SDEs, this is akin to the additional correction term in Milstein's method as compared to Euler-Maruyama.

This is a slightly simplified definition; the signature is often instead defined using the notation of stochastic calculus, see Definition A.2.

Similar terminology, such as "step-N log-signature", is also used in the literature.



Figure 3: Left: Heatmap of normalised accuracies on the EigenWorms dataset for differing step sizes and depths. Right: Log-log plot of the elapsed time of the algorithm against the step size.

Brouwer et al. (2019); Lechner & Hasani (2020), amongst others, consider continuous-time analogues of GRUs and LSTMs, going some way to improving the learning of long-term dependencies. Voelker et al. (2019); Gu et al. (2020) consider links with ODEs and approximation theory, to improve the long-term memory capacity of RNNs.

T((R^d)) := {a = (a_0, a_1, ···) : a_k ∈ (R^d)^⊗k ∀k ≥ 0} is the set of formal series of tensors of R^d. Moreover, T((R^d)) and T(R^d) can be endowed with the operations of addition and multiplication: given a = (a_0, a_1, ···) and b = (b_0, b_1, ···), we have a + b = (a_0 + b_0, a_1 + b_1, ···) and a ⊗ b = (c_0, c_1, ···), where c_k := Σ_{i=0}^{k} a_i ⊗ b_{k-i}.

Figure 5: Illustration of the log-ODE and Taylor methods for controlled differential equations.


Figure 6: Overview of the hidden state update network structure. We give the dimensions at each layer in the top right hand corner of each box.

Mean and standard deviation of test set accuracy (in %) over three repeats, as well as memory usage and training time, on the EigenWorms dataset for depths 1-3 and a small selection of step sizes. The bold values denote that the model was the top performer for that step size.

where ||·||_{p-var;[s,t]} is the p-variation norm defined for paths in T^⌊p⌋(R^d) by

||X||_{p-var;[s,t]} := max_{1≤k≤⌊p⌋} ( sup_D Σ_{t_i∈D} ||X^(k)_{t_i,t_{i+1}}||^{p/k} )^{k/p},

where the supremum is taken over all partitions D of [s, t].



Theorem B.3 (Theorem 10.57 in Friz & Victoir (2010)) Consider the linear RDE on [0, T],

dY_t = f(Y_t) dX_t, Y_0 = ξ,

where X is a geometric p-rough path in R^d, ξ ∈ R^n, and the vector fields {f_i}_{1≤i≤d} take the form f_i(y) = A_i y + B_i, where {A_i} and {B_i} are n × n matrices. Let K denote an upper bound on max_i(||A_i|| + ||B_i||). Then a unique solution Y : [0, T] → R^n exists. Moreover, it is bounded, and there exists a constant C_p depending only on p such that the growth estimate (33) holds for all 0 ≤ s ≤ t ≤ T.

When the vector fields of the RDE (24) are linear, the log-ODE (29) also becomes linear. Therefore the log-ODE solution exists and is explicitly given as the exponential of the matrix F.

Theorem B.4 Consider the same linear RDE on [0, T] as in Theorem B.3. Then the log-ODE vector field F given by (30) is linear, and the solution of the associated ODE (29) exists and satisfies

||z^{s,t}_u|| ≤ exp(u ||F||) ||Y_s||, (34)

for u ∈ [0, 1] and all 0 ≤ s ≤ t ≤ T.

Proof B.5 Since F is a linear vector field on R^n, we can view it as an n × n matrix, and so for u ∈ [0, 1],

z^{s,t}_u = exp(uF) z^{s,t}_0,

where exp denotes the matrix exponential. The result now follows from the standard estimate ||exp(F)|| ≤ exp(||F||).

Remark B.6 Due to the boundedness of linear RDEs (33) and log-ODEs (34), the arguments that established Theorem B.1 hold in the linear setting, as ||f||_{Lip(γ)} is finite when f is restricted to the domains in which the solutions Y and z lie.

Given the local error estimate (31) for the log-ODE method, we can now consider the approximation error exhibited by a log-ODE numerical solution to the RDE (24).
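The content of Theorem B.4 and Proof B.5 is easy to verify numerically: for a linear log-ODE field the solution is a matrix exponential, and its norm obeys the exp(u||F||) bound. The matrix F and initial condition below are arbitrary toy choices standing in for a field built via (30):

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

# Hypothetical linear log-ODE vector field F (n x n); any matrix works here
F = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.5],
              [0.0, -0.5, 0.0]])
z0 = np.array([1.0, 0.0, -1.0])
u = 0.7

# Closed form from Proof B.5: z_u = exp(uF) z_0
z_closed = expm(u * F) @ z0

# Numerical integration of dz/du = F z agrees with the matrix exponential
sol = solve_ivp(lambda s, z: F @ z, (0.0, u), z0, rtol=1e-10, atol=1e-12)
z_num = sol.y[:, -1]

# The bound of Theorem B.4: ||z_u|| <= exp(u ||F||) ||z_0||
bound = np.exp(u * np.linalg.norm(F, 2)) * np.linalg.norm(z0)
```

Here `np.linalg.norm(F, 2)` is the spectral norm; any submultiplicative matrix norm gives a valid (if looser) bound, since ||exp(A)|| ≤ exp(||A||) holds for all of them.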
Thankfully, the analysis required to derive such global error estimates was developed by Greg Gyurkó in his PhD thesis. Thus the following result is a straightforward application of Theorem 3.2.1 from Gyurkó (2008).

Theorem B.7 Let X, f and Y satisfy the assumptions of Theorem B.1, let {t_k}_{0≤k≤N} be a partition of [0, T], and define the numerical solution {Y_k} by setting Y_0 := ξ and taking Y_{k+1} to be the solution at u = 1 of the ODE (29) started at Y_k, where the vector field F is constructed from the log-signature of X over the interval [t_k, t_{k+1}] according to (30). Then there exists a constant C depending only on p, γ and ||f||_{Lip(γ)} such that the global error estimate (36) holds for 0 ≤ k ≤ N.

Computing infrastructure All EigenWorms experiments were run on a computer equipped with three GeForce RTX 2080 Tis. All BIDMC experiments were run on a computer with two GeForce RTX 2080 Tis and two Quadro GP100s.

Optimiser All experiments used the Adam optimiser. The learning rate was initialised at 0.032 divided by the batch size. The batch size was 1024 for EigenWorms and 512 for the BIDMC problems. If the validation loss failed to decrease after 15 epochs, the learning rate was reduced by a factor of 10. If the validation loss did not decrease after 60 epochs, training was terminated and the model was rolled back to the point at which it achieved the lowest loss on the validation set.

Hyperparameter selection Hyperparameters were selected to optimise the score of the NCDE₁ model on the validation set. For each dataset, the search was performed with a step size such that the total number of hidden state updates was equal to 500, as this represented a good balance between length and speed that allowed us to complete the search in a reasonable time-frame. In particular, this was short enough that we could train using the non-adjoint training method, which helped to speed up this part of the work.
The hyperparameters that were considered were:

• Hidden dimension: [16, 32, 64]. The dimension of the hidden state Z_t.
• Number of layers: [2, 3, 4]. The number of hidden state layers.
• Hidden hidden multiplier: [1, 2, 3]. Multiplication factor for the 'hidden hidden' state, this being 'Hidden layer k' in figure 6; the dimension of each of these 'hidden hidden' layers is this value multiplied by the hidden dimension.

We ran each of these 27 combinations for every dataset, and the best-performing parameters were then used when training over the full depth and step grid. The full results from the hyperparameter search are listed in tables 3 and 4, with bolded values showing which configuration was eventually selected.
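The optimiser schedule described above (plateau-based learning-rate reduction, early termination, and rollback to the best validation checkpoint) can be sketched as the following training-loop skeleton. The `train_step` and `validate` callables are placeholders for the actual PyTorch training and evaluation code; only the schedule constants come from the text:

```python
import copy

def train(model, train_step, validate, batch_size, max_epochs=1000):
    """Training-loop skeleton for the schedule described above (a sketch;
    `model`, `train_step` and `validate` are hypothetical placeholders)."""
    lr = 0.032 / batch_size             # initialisation used in the experiments
    best_loss, best_state, since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_step(model, lr)           # one epoch of optimisation at rate lr
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, best_state, since_best = val_loss, copy.deepcopy(model), 0
        else:
            since_best += 1
        if since_best == 15:            # plateau: reduce the learning rate by 10x
            lr /= 10
        if since_best >= 60:            # terminate and roll back to the best model
            break
    return (best_state if best_state is not None else model), best_loss
```

For example, feeding in a validation-loss sequence that improves for three epochs and then plateaus terminates after 60 further epochs and returns the epoch-three checkpoint.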

D EXPERIMENTAL RESULTS

Here we include the full breakdown of all experimental results. Tables 5 and 6 include all results from the EigenWorms and BIDMC datasets respectively. 

