NEURAL CDES FOR LONG TIME SERIES VIA THE LOG-ODE METHOD

Abstract

Neural Controlled Differential Equations (Neural CDEs) are the continuous-time analogue of an RNN, just as Neural ODEs are analogous to ResNets. However, just like RNNs, training Neural CDEs can be difficult for long time series. Here, we propose to apply a technique drawn from stochastic analysis, namely the log-ODE method. Instead of using the original input sequence, our procedure summarises the information over local time intervals via the log-signature map, and uses the resulting shorter stream of log-signatures as the new input. This represents a length/channel trade-off. In doing so we demonstrate efficacy on problems of length up to 17k observations, and observe significant training speed-ups, improvements in model performance, and reduced memory requirements compared to the existing algorithm.

1. INTRODUCTION

Neural controlled differential equations (Neural CDEs) (Kidger et al., 2020) are the continuous-time analogue of a recurrent neural network (RNN), and provide a natural method for modelling temporal dynamics with neural networks. Neural CDEs are similar to neural ordinary differential equations (Neural ODEs), as popularised by Chen et al. (2018). A Neural ODE is determined by its initial condition, with no direct mechanism for modifying the trajectory given subsequent observations. In contrast, the vector field of a Neural CDE depends upon the time-varying data, so that the trajectory of the system is driven by the sequence of observations.

1.1. CONTROLLED DIFFERENTIAL EQUATIONS

We begin by stating the definition of a CDE. Let $a, b \in \mathbb{R}$ with $a < b$, and let $v, w \in \mathbb{N}$. Let $\xi \in \mathbb{R}^w$. Let $X : [a, b] \to \mathbb{R}^v$ be a continuous function of bounded variation (which is, for example, implied by $X$ being Lipschitz), and let $f : \mathbb{R}^w \to \mathbb{R}^{w \times v}$ be continuous. Then we may define $Z : [a, b] \to \mathbb{R}^w$ as the unique solution of the controlled differential equation

$$Z_a = \xi, \qquad Z_t = Z_a + \int_a^t f(Z_s)\,\mathrm{d}X_s \quad \text{for } t \in (a, b]. \tag{1}$$

The notation "$f(Z_s)\,\mathrm{d}X_s$" denotes a matrix-vector product, and if $X$ is differentiable then

$$\int_a^t f(Z_s)\,\mathrm{d}X_s = \int_a^t f(Z_s)\,\frac{\mathrm{d}X}{\mathrm{d}s}(s)\,\mathrm{d}s.$$

If in equation (1) $\mathrm{d}X_s$ were replaced with $\mathrm{d}s$, then the equation would just be an ODE. Using $\mathrm{d}X_s$ causes the solution to depend continuously on the evolution of $X$. We say that the solution is "driven by the control, $X$".
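As a concrete illustration of equation (1), when $X$ is piecewise linear the integral can be approximated by a simple Euler scheme that steps with the increments of $X$. This is a minimal sketch under that assumption; the function names are our own, not from the paper.

```python
import numpy as np

def solve_cde_euler(f, xi, X, ts):
    """Euler scheme for Z_t = Z_a + int_a^t f(Z_s) dX_s.

    f  : maps R^w -> R^(w x v), the vector field
    xi : initial condition in R^w
    X  : array of shape (n, v), samples of the control path at times ts
    ts : array of shape (n,), increasing time-stamps
    """
    Z = np.asarray(xi, dtype=float)
    path = [Z]
    for k in range(len(ts) - 1):
        dX = X[k + 1] - X[k]   # increment of the control over [t_k, t_{k+1}]
        Z = Z + f(Z) @ dX      # matrix-vector product f(Z_{t_k}) dX
        path.append(Z)
    return np.stack(path)

# Toy example with v = w = 1 and f(z) = [[1.0]], so that dZ = dX and the
# solution simply tracks the control: Z_t = xi + X_t - X_a.
ts = np.linspace(0.0, 1.0, 11)
X = np.sin(2 * np.pi * ts)[:, None]
Z = solve_cde_euler(lambda z: np.array([[1.0]]), np.array([0.0]), X, ts)
```

For this constant vector field the Euler scheme is exact, which makes the "driven by the control" behaviour easy to see: the hidden state reproduces the path of $X$.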

1.2. NEURAL CONTROLLED DIFFERENTIAL EQUATIONS

We recall the definition of a Neural CDE as introduced in Kidger et al. (2020). Consider a time series $x$ as a collection of points $x_i \in \mathbb{R}^v$ with corresponding time-stamps $t_i \in \mathbb{R}$, such that $x = ((t_0, x_0), (t_1, x_1), \ldots, (t_n, x_n))$ and $t_0 < \cdots < t_n$. Let $X : [t_0, t_n] \to \mathbb{R}^{v+1}$ be some interpolation of the data such that $X_{t_i} = (t_i, x_i)$. Kidger et al. (2020) use natural cubic splines. Here we will actually end up finding piecewise linear interpolation to be a more convenient choice. (We avoid issues with adaptive solvers, as discussed in Kidger et al. (2020, Appendix A), simply by using fixed solvers.) Let $\xi_\theta : \mathbb{R}^{v+1} \to \mathbb{R}^w$ and $f_\theta : \mathbb{R}^w \to \mathbb{R}^{w \times (v+1)}$ be neural networks, and let $\ell_\theta : \mathbb{R}^w \to \mathbb{R}^q$ be linear, for some output dimension $q \in \mathbb{N}$. Here $\theta$ is used to denote dependence on learnable parameters. We define $Z$ as the hidden state and $Y$ as the output of a neural controlled differential equation driven by $X$ if $Z_{t_0} = \xi_\theta(t_0, x_0)$, with

$$Z_t = Z_{t_0} + \int_{t_0}^t f_\theta(Z_s)\,\mathrm{d}X_s \quad \text{and} \quad Y_t = \ell_\theta(Z_t) \quad \text{for } t \in (t_0, t_n]. \tag{2}$$

That is, just like an RNN, we have an evolving hidden state $Z$, from which a linear map produces an output. This formulation is a universal approximator (Kidger et al., 2020, Appendix B). The output may be either the time-evolving $Y_t$ or just the final $Y_{t_n}$. This is then fed into a loss function ($L^2$, cross entropy, ...) and trained via stochastic gradient descent in the usual way.

The question remains how to compute the integral of equation (2). Kidger et al. (2020) let $g_{\theta,X}(Z, s) = f_\theta(Z)\,\frac{\mathrm{d}X}{\mathrm{d}s}(s)$, where the right hand side denotes a matrix-vector product, and then note that the integral can be written as $Z_t = Z_{t_0} + \int_{t_0}^t g_{\theta,X}(Z_s, s)\,\mathrm{d}s$. This reduces the CDE to an ODE, so that existing tools for Neural ODEs may be used to evaluate it, and to backpropagate.
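A minimal sketch of this CDE-to-ODE reduction, assuming piecewise-linear interpolation (so that $\mathrm{d}X/\mathrm{d}s$ is constant on each data interval) and a fixed-step Euler solver standing in for a full Neural ODE solver; all function names here are illustrative, not from the paper.

```python
import numpy as np

def make_ode_vector_field(f_theta, ts, xs):
    """Build g(z, s) = f_theta(z) @ dX/ds(s), where X is the piecewise-linear
    interpolation of the data (ts, xs)."""
    ts, xs = np.asarray(ts, dtype=float), np.asarray(xs, dtype=float)

    def g(z, s):
        # Locate the data interval containing s; dX/ds is constant on it.
        k = int(np.clip(np.searchsorted(ts, s, side="right") - 1, 0, len(ts) - 2))
        dXds = (xs[k + 1] - xs[k]) / (ts[k + 1] - ts[k])
        return f_theta(z) @ dXds

    return g

def euler_solve(g, z0, s_grid):
    """Fixed-step Euler scheme for dz/ds = g(z, s) over the grid s_grid."""
    z = np.asarray(z0, dtype=float)
    for s0, s1 in zip(s_grid[:-1], s_grid[1:]):
        z = z + (s1 - s0) * g(z, s0)
    return z

# With f_theta the identity, the hidden state tracks the control:
# the data goes from 0 to 2, so z should end near 2.
g = make_ode_vector_field(lambda z: np.eye(1), [0.0, 1.0], [[0.0], [2.0]])
z_final = euler_solve(g, [0.0], np.linspace(0.0, 1.0, 101))
```

In practice $f_\theta$ would be a neural network and the ODE would be solved (and differentiated through) with a Neural ODE library, but the structure of the reduction is exactly as above.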
By moving from the discrete-time formulation of an RNN to the continuous-time formulation of a Neural CDE, every kind of time series data is put on the same footing, whether it is regularly or irregularly sampled, whether or not it has missing values, and whether or not the input sequences are of consistent length. Besides this, the continuous-time, differential-equation formulation may be useful in applications where such models are explicitly desired, as when modelling physics.

1.3. CONTRIBUTIONS

Neural CDEs, as with RNNs, begin to break down for long time series. Training loss/accuracy worsens, and training time becomes prohibitive due to the sheer number of forward operations within each training epoch. Here, we apply the log-ODE method, a numerical method from stochastic analysis and rough path theory. It is a method for converting a CDE to an ODE, which may in turn be solved via standard ODE solvers, and so it acts as a drop-in replacement for the original procedure that uses the derivative of the control path. We find this method to be especially beneficial for long time series (and, incidentally, it does not require differentiability of the control path). With this method both training time and model performance of Neural CDEs are improved, and memory requirements are reduced. The resulting scheme has two very neat interpretations. In terms of numerical differential equation solvers, it corresponds to taking integration steps larger than the discretisation of the data, whilst incorporating substep information through additional terms.¹ In terms of machine learning, it corresponds to binning the data prior to running a Neural CDE, with bin statistics carefully chosen to extract precisely the information most relevant to solving a CDE.
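To make the "bin statistics" interpretation concrete, here is a sketch of the length/channel trade-off at log-signature depth 2, where each window of the input is summarised by its total increment plus its Lévy areas. This is our own illustrative implementation under that depth-2 assumption; the paper's pipeline would typically use a dedicated library (e.g. Signatory) and possibly higher truncation depths.

```python
import numpy as np

def logsignature_depth2(window):
    """Depth-2 log-signature of the piecewise-linear path through `window`,
    an array of shape (m, v): the increment (v terms) concatenated with the
    Levy areas (v*(v-1)/2 antisymmetric level-2 terms)."""
    x = np.asarray(window, dtype=float)
    deltas = np.diff(x, axis=0)       # increment of each linear segment
    C = x[:-1] - x[0]                 # displacement at the start of each segment
    increment = x[-1] - x[0]          # level-1 term
    # Level-2 signature S^{ij} = int (X^i - X^i_a) dX^j, segment by segment.
    S = C.T @ deltas + 0.5 * deltas.T @ deltas
    A = 0.5 * (S - S.T)               # antisymmetrise: Levy area terms
    iu = np.triu_indices(x.shape[1], k=1)
    return np.concatenate([increment, A[iu]])

def binned_logsignatures(xs, window_size):
    """Length/channel trade-off: replace a long stream with one depth-2
    log-signature per window (adjacent windows share an endpoint)."""
    feats = []
    for start in range(0, len(xs) - 1, window_size):
        feats.append(logsignature_depth2(xs[start:start + window_size + 1]))
    return np.stack(feats)

# Toy usage: a straight line in R^2 has zero Levy area, and each window's
# increment is its share of the total displacement.
xs = np.stack([np.linspace(0.0, 1.0, 9), np.linspace(0.0, 2.0, 9)], axis=1)
feats = binned_logsignatures(xs, window_size=4)  # 2 windows, 2 + 1 = 3 channels
```

The 9-step stream becomes a 2-step stream with more channels: exactly the shorter, wider input that the abstract's length/channel trade-off refers to, with the extra channels carrying the substep information.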



¹ For the reader familiar with numerical methods for SDEs, this is akin to the additional correction term in Milstein's method as compared to Euler-Maruyama.

