CHARACTERISTIC NEURAL ORDINARY DIFFERENTIAL EQUATIONS

Abstract

We propose Characteristic Neural Ordinary Differential Equations (C-NODEs), a framework that extends Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODE models the evolution of latent variables as the solution to an ODE, C-NODE models it as the solution of a family of first-order partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This reduction along characteristic curves allows PDEs to be analyzed with standard techniques for ODEs, in particular the adjoint sensitivity method. We also derive C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate improvements over existing NODE methods, given a similar computational budget, for irregularly sampled time series prediction on the MuJoCo, PhysioNet, and Human Activity datasets, and for classification and density estimation on the CIFAR-10, SVHN, and MNIST datasets. The results also provide empirical evidence that the learned curves improve efficiency, requiring fewer parameters and function evaluations than the baselines.

1. INTRODUCTION

Deep learning and differential equations share many connections, and techniques at their intersection have led to insights in both fields. One predominant connection is that certain neural network architectures resemble numerical integration schemes, which led to the development of Neural Ordinary Differential Equations (NODEs) (Chen et al., 2019b). NODEs use a neural network parameterization of an ODE to learn a mapping from observed variables to a latent variable that is the solution of the learned ODE. A central benefit of NODEs is their constant memory cost when backward passes are computed with the adjoint sensitivity method rather than by backpropagating through individual forward solver steps; backpropagating through adaptive differential equation solvers often requires extensive memory, as noted in Chen et al. (2019b). Moreover, NODEs provide a flexible probability density representation often referred to as continuous normalizing flows (CNFs). However, since NODEs can only represent solutions to ODEs, their function class is somewhat limited and may not apply to more general problems that lack smooth and one-to-one mappings. To address this limitation, a series of techniques from the theory of differential equations have been employed to enhance the representational capabilities of NODEs, such as controlled differential equations (Kidger et al., 2020), learning higher-order ODEs (Massaroli et al., 2021), augmenting the dynamics (Dupont et al., 2019), and considering dynamics with delay terms (Zhu et al., 2021). Additionally, some works generalize the ODE case to partial differential equations (PDEs), such as Ruthotto & Haber (2020) and Sun et al. (2019). These PDE-based methods, however, do not use the adjoint method, forgoing its primary advantage of constant memory cost.
This leads to the central question motivating this work: can we combine the rich function class of PDEs with the efficiency of the adjoint method? To do so, we propose a continuous-depth neural network that solves a PDE over parametric curves that reduce the PDE to an ODE. Such curves are known as characteristics, and they define the solution of the PDE in terms of ODEs (Griffiths et al., 2015). The proposed Characteristic Neural Ordinary Differential Equations (C-NODEs) learn both the characteristics and the ODE along the characteristics to solve the PDE over the data space. This allows for a richer class of models while retaining the memory efficiency of the adjoint method. C-NODE also extends existing methods, improving their empirical accuracy on classification and time series prediction tasks and their sample quality on generation tasks. In contrast, NODE represents a single ODE and can only describe u(x, t) along one dimension, for example u(x = 0, t).

2. RELATED WORK

NODE is often motivated as a continuous form of a Residual Network (ResNet) (He et al., 2015), since a ResNet can be interpreted as a forward Euler integration scheme on the latent state (Weinan, 2017). Specifically, a ResNet is composed of multiple blocks, each of which can be represented as $u_{t+1} = u_t + f(u_t, \theta)$, where $u_t$ is the evolving hidden state at time $t$ and $f(u_t, \theta)$ is interpreted as the gradient at time $t$, namely $du/dt(u_t)$. Generalizing the model to a step size $\Delta t$ results in $u_{t+\Delta t} = u_t + f(u_t, \theta)\Delta t$. To adapt this model to a continuous setting, we let $\Delta t \to 0$ and obtain
$$\frac{du}{dt} = \lim_{\Delta t \to 0} \frac{u_{t+\Delta t} - u_t}{\Delta t} = f(u(t), t, \theta).$$
Numerical integration can then be treated as a black box, using schemes beyond forward Euler to achieve higher numerical precision. However, since black-box integrators can take an arbitrary number of intermediate steps, backpropagating through the individual steps would require too much memory, since each step must be saved. Chen et al. (2019b) addressed this problem with adjoint backpropagation, which has constant memory usage. For a given loss function $L(u(t_1))$ on the terminal hidden state at $t = 1$, the adjoint $a(t)$ is governed by another ODE,
$$\frac{da(t)}{dt} = -a(t)^\top \frac{\partial f(u(t), t, \theta)}{\partial u}, \qquad a(t_1) = \frac{\partial L}{\partial u(t_1)},$$
which dictates the gradient with respect to the parameters. The gradient of $L(u(t_1))$ can then be calculated by solving this adjoint ODE rather than backpropagating through the calculations involved in the numerical integration. However, assuming that the hidden state is governed by an ODE limits the expressiveness of the mapping. For example, Dupont et al. (2019) describe a notable limitation of NODEs: the inability to represent dynamical systems with intersecting trajectories. In response to such limitations, many works have tried to increase the expressiveness of the mapping. Dupont et al.
(2019) proposed to solve the intersecting trajectories problem by augmenting the vector space, lifting the points into additional dimensions; Zhu et al. (2021) included time delay in the equation to represent dynamical systems of greater complexity; Massaroli et al. (2021) proposed conditioning the vector field on the inputs, allowing the integration limits to be conditioned on the input; Massaroli et al. (2021) and Norcliffe et al. (2020) additionally proposed, and proved, that a second-order ODE system can efficiently solve the intersecting trajectories problem. We note, however, that the interpretation of NODE as a continuous form of ResNet is itself problematic, since the empirical behavior of ResNets does not match the theoretical properties (Krishnapriyan et al., 2022; Ott et al., 2021). As such, alternative interpretations of the process represented by the ODE have been considered. Zhang et al. (2019) considered an augmentation in which the augmented state corresponds to the parameters of the network governing the latent state. Queiruga et al. (2021) described the latent evolution through a series of basis functions, thereby allowing important concepts such as BatchNorm to be translated effectively into the continuous setting and achieving state-of-the-art performance on a variety of image classification tasks. Further performance improvements have been made by considering different numerical integrators (Matsubara et al., 2021; Zhuang et al., 2020; 2021). On a related front, multiple works have attempted to expand NODE to types of differential equations beyond ODEs. Sun et al. (2019) employed a dictionary method and expanded NODEs to a PDE setting, achieving high accuracy both in approximating PDEs and in classifying real-world image datasets. However, Sun et al. (2019) reported that the method is unstable when trained with the adjoint method and therefore cannot exploit the benefits that come with adjoint training.
Zhang et al. (2018) proposed a density transform approach based on the Monge-Ampère PDE, but did not consider adjoint-based training. Multiple works have extended the framework to stochastic differential equations and developed efficient optimization methods for them (Güler et al., 2019; Jia & Benson, 2019; Kidger et al., 2021a;b; Li et al., 2020; Liu et al., 2019; Xu et al., 2022).
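The ResNet-as-forward-Euler correspondence discussed above can be sketched numerically. The toy vector field, shapes, and step counts below are illustrative assumptions, not the architecture of any cited work:

```python
import numpy as np

def f(u, theta):
    # toy vector field: one dense layer with a tanh nonlinearity (assumption)
    W, b = theta
    return np.tanh(W @ u + b)

def resnet_forward(u0, theta, n_blocks, dt):
    # Stacking residual blocks u_{t+1} = u_t + dt * f(u_t, theta)
    # is exactly forward-Euler integration of du/dt = f(u, theta).
    u = u0
    for _ in range(n_blocks):
        u = u + dt * f(u, theta)
    return u

rng = np.random.default_rng(0)
theta = (0.1 * rng.standard_normal((4, 4)), np.zeros(4))
u0 = rng.standard_normal(4)

# Halving the step while doubling the depth approximates the same ODE flow,
# so a deep ResNet and a fine Euler solve of the ODE nearly coincide.
coarse = resnet_forward(u0, theta, n_blocks=16, dt=1.0 / 16)
fine = resnet_forward(u0, theta, n_blocks=128, dt=1.0 / 128)
print(np.linalg.norm(coarse - fine))  # small discretization gap
```

In this view, a black-box adaptive solver simply chooses the step sizes automatically, which is what makes backpropagating through its internal steps memory-hungry.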

3. METHOD

We describe the proposed C-NODE method in this section by first providing a brief introduction to the method of characteristics (MoC) for solving PDEs with an illustrative example. We then discuss the types of PDEs we can describe using this method. We finally discuss how we apply the MoC to our C-NODE framework.

3.1. METHOD OF CHARACTERISTICS

The MoC provides a procedure for transforming certain PDEs into ODEs along paths known as characteristics. In the most general sense, the method applies to hyperbolic differential equations. We introduce MoC with a canonical example involving the inviscid Burgers equation, and defer to Griffiths et al. (2015, Chapter 9) for a more complete introduction to the topic. Let $u(x, t): \mathbb{R} \times \mathbb{R}^+ \to \mathbb{R}$ satisfy the inviscid Burgers equation
$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = 0, \tag{1}$$
where we drop the dependence on $x$ and $t$ for ease of notation. We are interested in the solution of $u$ over a bounded domain $\Omega \subset \mathbb{R} \times \mathbb{R}^+$. Consider parametric forms for the spatial component $x(s): [0, T] \to \mathbb{R}$ and the temporal component $t(s): [0, T] \to \mathbb{R}^+$ over the fictitious variable $s \in [0, T]$. Intuitively, this allows us to solve the equation on curves parameterized by $s$, which we denote $(x(s), t(s))$ and call the characteristic. Expanding with the total derivative, we get
$$\frac{d}{ds} u(x(s), t(s)) = \frac{\partial u}{\partial x}\frac{dx}{ds} + \frac{\partial u}{\partial t}\frac{dt}{ds}. \tag{2}$$
Substituting $dx/ds = u$, $dt/ds = 1$, and $du/ds = 0$ into equation 2, we recover the original PDE in equation 1. Note that we now have a system of three ODEs, which we can solve to obtain the characteristics $x(s) = us + x_0$ and $t(s) = s + t_0$ as functions of the initial conditions $x_0, t_0$. Finally, by solving over a grid of initial conditions $\{x_0^{(i)}\}_{i=1}^{\infty} \in \partial\Omega$, we obtain the solution over $\Omega$ as
$$u(x(T), T) = u(x_0, t_0) + \int_0^T \frac{d}{ds} u(us + x_0, s)\, ds,$$
using the adjoint method with boundary conditions $x_0, t_0$. This contrasts with the usual direct integration over the variable $t$ done in NODE; we now jointly couple the integration through the characteristics. An example of solving this equation over multiple initial conditions, contrasted with standard NODE integration, is given in Figure 1.
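The characteristic construction above can be checked numerically. The sketch below, with an assumed initial condition g(x) = sin(x), uses the closed-form solution of the three characteristic ODEs for the inviscid Burgers equation and verifies the implicit relation u = g(x - ut):

```python
import numpy as np

def burgers_characteristics(g, x0_grid, s):
    # Along each characteristic of u_t + u u_x = 0 we have
    #   dx/ds = u, dt/ds = 1, du/ds = 0,
    # so u keeps its initial value g(x0) and the curve is the straight line
    # x(s) = x0 + g(x0) * s, t(s) = s.
    u0 = g(x0_grid)
    x = x0_grid + u0 * s
    t = np.full_like(x0_grid, s)
    return x, t, u0

g = lambda x: np.sin(x)            # assumed smooth initial condition
x0 = np.linspace(0.0, np.pi, 50)   # grid of initial conditions on the boundary
x, t, u = burgers_characteristics(g, x0, s=0.3)

# The exact solution satisfies the implicit relation u = g(x - u t);
# it holds along the characteristics before they cross.
print(np.max(np.abs(u - g(x - u * t))))
```

Here the characteristic ODEs happen to have closed-form solutions; in C-NODE they are instead integrated numerically with learned right-hand sides.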

Hyperbolic PDEs in Machine Learning

To motivate using MoC for machine learning problems such as classification or density estimation, we note again that MoC applies most generally to hyperbolic PDEs. These PDEs roughly describe the propagation of physical quantities through time. Such equations may be appropriate for deep learning tasks due to their ability to transport data into different regions of the state space. For instance, in a classification task, we consider the problem of transporting high-dimensional data points that are not linearly separable to spaces where they are linearly separable. Similarly, in generative modeling, we transport a base distribution to a data distribution.

3.2. NEURAL REPRESENTATION OF CHARACTERISTICS

In the proposed method, we learn the components involved in the MoC, namely the characteristics and the function coefficients. We now generalize the example given in Section 3.1, which involved two variables, to a $k$-dimensional system. Specifically, consider the following nonhomogeneous boundary value problem (BVP):
$$\frac{\partial u}{\partial t} + \sum_{i=1}^{k} a_i(x_1, \ldots, x_k, u) \frac{\partial u}{\partial x_i} = c(x_1, \ldots, x_k, u) \quad \text{on } (x, t) \in \mathbb{R}^k \times [0, \infty), \qquad u(x(0), 0) = u_0 \quad \text{on } x \in \mathbb{R}^k. \tag{3}$$
Here, $u: \mathbb{R}^k \times \mathbb{R} \to \mathbb{R}^n$ is a multivariate map, and $a_i: \mathbb{R}^{k+n} \to \mathbb{R}$ and $c: \mathbb{R}^{k+n} \to \mathbb{R}^n$ are functions of $u$ and the $x_i$'s. This problem is well defined and has a solution as long as $\sum_{i=1}^{k} a_i\, \partial u/\partial x_i$ is continuous (Evans, 2010). MoC is generally used in a scalar context, but the correspondence to the vector case is relatively straightforward; a proof can be found in Appendix C.1. To begin, we decompose the PDE in equation 3 into the following system of ODEs:
$$\frac{dx_i}{ds} = a_i(x_1, \ldots, x_k, u), \qquad \frac{dt}{ds} = 1, \qquad \frac{du}{ds} = \sum_{i=1}^{k} \frac{\partial u}{\partial x_i}\frac{dx_i}{ds} = c(x_1, \ldots, x_k, u). \tag{4--6}$$
We represent this ODE system by parameterizing $dx_i/ds$ and $\partial u/\partial x_i$ with neural networks; consequently, $du/ds$ evolves according to equation 6. Following this expansion, we arrive at
$$u(x(T), T) = u(x(0), 0) + \int_0^T \frac{du}{ds}(x, u)\, ds = u(x(0), 0) + \int_0^T [J_x u](x, u; \Theta_2)\, \frac{dx}{ds}(x, u; \Theta_2)\, ds, \tag{7}$$
where we drop $u$'s dependency on $x(s)$ and $x$'s dependency on $s$ to simplify notation. In equation 7, the functions $J_x u$ and $dx/ds$ are represented as neural networks with inputs $x, u$ and parameters $\Theta_2$.

Conditioning on data. Previous works primarily modeled the task of classifying a set of data points with a fixed differential equation, neglecting possible structural variations in the data. Here, we condition C-NODE on each data point, resulting in a PDE with a different initial condition for each data point, together with the hyperbolic PDE interpretation of the latent variables. Consider the term given by the integrand in equation 7.
The neural network representing the characteristic $dx/ds$ is conditioned on the input data $z \in \mathbb{R}^w$. Defining a mapping $g(\cdot): \mathbb{R}^w \to \mathbb{R}^n$, we have
$$\frac{dx_i}{ds} = a_i(x_1, \ldots, x_k, u; g(z)). \tag{8}$$
By introducing $g(z)$ in equation 8, the equation describing the characteristics changes depending on the current data point. The classification task is thus modeled with a family of differential equations rather than a single one, which allows the C-NODE system to model dynamical systems with intersecting trajectories. This property becomes helpful in Proposition 4.1 for proving that C-NODE can represent intersecting dynamics.
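A minimal numerical sketch of this parameterization, with tiny random MLPs standing in for the learned networks for dx/ds and J_x u (all sizes, the Euler step, and the conditioning vector are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(sizes):
    # tiny random MLP used as a stand-in for the learned networks (assumption)
    return [(0.5 * rng.standard_normal((m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(params, v):
    for W, b in params[:-1]:
        v = np.tanh(W @ v + b)
    W, b = params[-1]
    return W @ v + b

k, n = 3, 4  # number of spatial variables x_i, dimension of u
dxds_net = mlp([k + n + n, 16, k])   # dx/ds = a(x, u; g(z)), conditioned on data
jac_net = mlp([k + n, 16, n * k])    # J_x u = (du/dx_1, ..., du/dx_k)

def cnode_rhs(x, u, gz):
    dxds = forward(dxds_net, np.concatenate([x, u, gz]))
    Jxu = forward(jac_net, np.concatenate([x, u])).reshape(n, k)
    return dxds, Jxu @ dxds          # du/ds = J_x u . dx/ds (chain rule)

# Euler-integrate the coupled (x, u) system along s for one data point.
x, u = np.zeros(k), rng.standard_normal(n)
gz = rng.standard_normal(n)          # conditioning vector g(z)
ds = 0.05
for _ in range(20):
    dxds, duds = cnode_rhs(x, u, gz)
    x, u = x + ds * dxds, u + ds * duds
print(u.shape)
```

Note that the conditioning vector g(z) enters only the characteristic network dx/ds, matching equation 8: each data point selects its own curve through the same family of PDE dynamics.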

3.3. TRAINING C-NODES

Having introduced the main components of C-NODEs, we can now integrate them into a unified algorithm. To motivate this section, and to be consistent with part of the empirical evaluation, we consider classification tasks with data $\{(z_j, y_j)\}_{j=1}^N$, $z_j \in \mathbb{R}^w$, $y_j \in \mathbb{Z}^+$. For instance, $z_j$ may be an image and $y_j$ its class label. In the approach pursued here, the image $z_j$ is first passed through a feature extractor $g(\cdot; \Theta_1): \mathbb{R}^w \to \mathbb{R}^n$ with parameters $\Theta_1$. The output of $g$ is the feature $u_0^{(j)} = g(z_j; \Theta_1)$, which provides the boundary condition for the PDE on $u^{(j)}$. We integrate along different characteristic curves indexed by $s \in [0, T]$ with boundary condition $u^{(j)}(x(0), 0) = u_0^{(j)}$, and compute the terminal values as given by equation 7 in Section 3.2:
$$u^{(j)}(x(T), T) = u_0^{(j)} + \int_0^T [J_x u^{(j)}](x, u^{(j)}; \Theta_2)\, \frac{dx}{ds}(x, u^{(j)}; u_0^{(j)}; \Theta_2)\, ds.$$
Finally, $u^{(j)}(x(T), T)$ is passed through another neural network $\Phi(\cdot; \Theta_3)$ with parameters $\Theta_3$, whose outputs are the probabilities of the class labels for image $z_j$. The entire learning process reduces to finding optimal weights $(\Theta_1, \Theta_2, \Theta_3)$, which can be achieved by minimizing the loss $\mathcal{L} = \sum_{j=1}^N L(\Phi(u^{(j)}(x(T), T); \Theta_3), y_j)$, where $L(\cdot)$ is the corresponding loss function (e.g., cross entropy for classification). In Algorithm 2, we illustrate the implementation procedure with the forward Euler method for simplicity, but note that any ODE solver can be used.

Combining MoC with existing NODE modifications. As mentioned in Section 2, the proposed C-NODE method can be used as an extension of existing NODE frameworks. In all NODE modifications, the underlying expression $\int_a^b f(t, u; \Theta)\, dt$ remains the same. Modifying this expression to $\int_a^b J_x u(x, u; \Theta)\, \frac{dx}{ds}(x, u; u_0; \Theta)\, ds$ results in the proposed C-NODE architecture, with the size of $x$ being a hyperparameter.
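The classification pipeline described above (feature extractor g, C-NODE integration, classifier Φ) can be sketched end to end with forward Euler. All component functions below are toy stand-ins with assumed shapes, not the architectures used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder components standing in for g(.; Theta_1), the C-NODE fields
# (Theta_2), and the classifier Phi(.; Theta_3); all names and shapes are
# assumptions for illustration.
def g(z):                  # feature extractor: boundary condition u0
    return np.tanh(z[:4])

def dxds(x, u, u0):        # learned characteristic direction, conditioned on u0
    return np.tanh(x + u[:3] + u0[:3])

def jac_u(x, u):           # learned Jacobian J_x u, shape (n, k)
    return 0.1 * np.outer(np.tanh(u), np.cos(x))

def phi(u):                # classifier head: class logits
    return np.array([u.sum(), -u.sum()])

def cnode_classify(z, n_steps=10, T=1.0):
    # Forward-Euler integration of the coupled (x, u) system along s.
    u0 = g(z)
    x, u, ds = np.zeros(3), u0.copy(), T / n_steps
    for _ in range(n_steps):
        v = dxds(x, u, u0)
        u = u + ds * (jac_u(x, u) @ v)   # du/ds = J_x u . dx/ds
        x = x + ds * v
    return phi(u)

logits = cnode_classify(rng.standard_normal(8))
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over class labels
loss = -np.log(probs[0])                        # cross-entropy for label 0
print(loss)
```

In practice the Euler loop is replaced by a black-box (possibly adaptive) solver, and the gradient with respect to all three parameter sets is obtained with the adjoint method.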

4. PROPERTIES OF C-NODES

C-NODE has a number of theoretical properties that contribute to its expressiveness. We provide theoretical results on these properties in the following sections. We also define continuous normalizing flows (CNFs) with C-NODEs, extending the CNFs originally defined with NODEs.

Intersecting trajectories. As mentioned in Dupont et al. (2019), one limitation of NODE is that its mappings cannot represent intersecting dynamics. We prove by construction that conditioning on initial conditions allows C-NODEs to represent some dynamical systems with intersecting trajectories:

Proposition 4.1. C-NODE can represent a dynamical system on $u(s)$, $du/ds = G(s, u): \mathbb{R}^+ \times \mathbb{R} \to \mathbb{R}$, where if $u(0) = 1$, then $u(1) = u(0) + \int_0^1 G(s, u)\, ds = 0$, and if $u(0) = 0$, then $u(1) = u(0) + \int_0^1 G(s, u)\, ds = 1$.

Proof. See Appendix C.2.

Density estimation with C-NODEs. C-NODEs can also be used to define a continuous density flow that models the density of a variable over space, subject to the variable satisfying a PDE, extending the continuous normalizing flows defined with NODEs. For NODEs, if $u(t) \in \mathbb{R}^n$ follows the ODE $du(t)/dt = f(u(t))$ with $f(u(t)) \in \mathbb{R}^n$, then its log-likelihood, from Chen et al. (2019b, Appendix A), is given by
$$\frac{\partial \log p(u(t))}{\partial t} = -\mathrm{tr}\left(\frac{df}{du(t)}\right). \tag{10}$$
Analogously to the change of log probability of NODEs in equation 10, we provide the following proposition for C-NODEs:

Proposition 4.2. Let $u(s)$ be a finite continuous random variable with probability density function $p(u(s))$, and let $u(s)$ satisfy $\frac{du(s)}{ds} = \sum_{i=1}^{k} \frac{\partial u}{\partial x_i}\frac{dx_i}{ds}$. Assuming $\frac{\partial u}{\partial x_i}$ and $\frac{dx_i}{ds}$ are uniformly Lipschitz continuous in $u$ and continuous in $s$, the evolution of the log probability of $u$ follows
$$\frac{\partial \log p(u(s))}{\partial s} = -\mathrm{tr}\left(\frac{\partial}{\partial u} \sum_{i=1}^{k} \frac{\partial u}{\partial x_i}\frac{dx_i}{ds}\right).$$

Proof. See Appendix C.3.

CNFs are continuous and invertible one-to-one mappings onto themselves, i.e., homeomorphisms. Zhang et al. (2020) proved that vanilla NODEs are not universal estimators of homeomorphisms, but that augmented neural ODEs (ANODEs) are. We demonstrate that C-NODEs are pointwise estimators of homeomorphisms, which we formalize in the following proposition:

Proposition 4.3. Given any homeomorphism $h: \Upsilon \to \Upsilon$, $\Upsilon \subset \mathbb{R}^p$, initial condition $u_0$, and time $T > 0$, there exists a flow $u(s, u_0) \in \mathbb{R}^n$ following $\frac{du}{ds} = \frac{\partial u}{\partial x}\frac{dx}{ds} + \frac{\partial u}{\partial t}\frac{dt}{ds}$ such that $u(T, u_0) = h(u_0)$.

Proof. See Appendix C.4.
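The instantaneous change-of-variables formula underlying these propositions can be sanity-checked in the linear special case du/ds = Au, where the trace formula predicts a total log-density change of -s tr(A). The sketch below compares this against the discrete change of variables of the Euler map u -> (I + ds A)u (the matrix A and step sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
A = 0.3 * rng.standard_normal((n, n))

# Instantaneous change of variables: for du/ds = A u,
#   d log p / ds = -tr(d(Au)/du) = -tr(A).
# The Euler step u -> (I + ds A) u changes log-density by -log|det(I + ds A)|,
# and the two accumulations should agree as ds -> 0.
ds, steps = 1e-4, 1000
logp_ode = -ds * steps * np.trace(A)                 # continuous trace formula
logp_discrete = -steps * np.log(abs(np.linalg.det(np.eye(n) + ds * A)))
print(abs(logp_ode - logp_discrete))                 # agreement up to O(ds)
```

For the C-NODE case of Proposition 4.2, A is replaced by the Jacobian of the full right-hand side sum, but the trace structure of the formula is the same.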

5. EXPERIMENTS

We present experiments on image classification tasks, time series prediction tasks, image generation tasks on benchmark datasets, and a synthetic PDE regression task.

5.1. CLASSIFICATION EXPERIMENTS WITH IMAGE DATASETS

We first conduct experiments on classification tasks with high-dimensional image datasets, including MNIST, CIFAR-10, and SVHN. We provide results for C-NODE and also combine the framework with existing methods, including ANODEs (Dupont et al., 2019), Input Layer NODEs (IL-NODEs) (Massaroli et al., 2021), and 2nd-Order NODEs (Massaroli et al., 2021). For all classification experiments, we set the encoder of input images for conditioning to be the identity, i.e., g(z) = z, making the input to C-NODE the original image; this way we focus exclusively on the performance of C-NODE. The results for the experiments using the adjoint method are reported in Table 1. We investigate the models' classification accuracy and the number of function evaluations (NFE) taken in the adaptive numerical integration. NFE is an indicator of the model's computational complexity and can be interpreted as the network depth for the continuous NODE (Chen et al., 2019b). Using a similar number of parameters, combining C-NODEs with different models consistently results in higher accuracy and mostly uses a smaller number of NFEs, indicating better parameter efficiency. An ablation study on C-NODEs' and NODEs' parameters can be found in Appendix E.2. While the average number of function evaluations tends to be lower for C-NODE, we note that, compared to ANODE, training C-NODE with the adjoint method can sometimes have decreased stability. We define instability as having an NFE > 1000. To get a rough idea of the differences in stability: when training NODE, C-NODE, and ANODE for image classification on the SVHN dataset over forty runs, NODE was unstable six times, C-NODE three times, and ANODE never. We note that this was only apparent in the SVHN experiment and when considering C-NODE by itself; the average NFE decreases when adding C-NODE to ANODE, and instability was never experienced in the ANODE+C-NODE experiments.
5.2. TIME SERIES PREDICTION

The MuJoCo dataset readily fits the modeling frameworks. On the other hand, the PhysioNet and Human Activity datasets require augmenting the dynamics of the NODE models with stochasticity to model the arrival of events. We follow the experimental setup for interpolation tasks in Rubanova et al. (2019), where we define an autoregressive model with an ODE-RNN encoder and a latent differential equation decoder. The main purpose of this experiment is to compare the ODE-RNN when using C-NODE versus NODE. ODE-RNN is a standard method for including ODE modeling in time series tasks, as described in Chen et al. (2019b) and Rubanova et al. (2019). We consider interpolation tasks by first encoding the time series $\{x_i, t_i\}_{i=0}^N$ of length $N$ and computing the approximate posterior $q(z_0 \mid \{x_i, t_i\}_{i=0}^N) = \mathcal{N}(\mu_{z_0}, \sigma_{z_0})$ as done in Rubanova et al. (2019). Then, $\mu_{z_0}, \sigma_{z_0}$ are computed as $\varrho(\text{ODE-RNN}_\phi(\{x_i, t_i\}_{i=0}^N))$, where $\varrho$ is a function that encodes the terminal hidden states into the mean and variance of the latent variable $z_0$. ODE-RNN is a model whose states obey an ODE between observations and are updated according to new observations, as described in Rubanova et al. (2019). To predict the state at an observation time $t_i$, we sample initial states $z_0$ from the posterior, which are then decoded using another neural network. We finally average the generated observations at each observation time to compute the test errors. The results are presented in Table 2 and suggest that C-NODE-based models use slightly fewer parameters while achieving lower error rates than NODE models. C-NODE models the latent dynamics as a first-order PDE, a natural extension of the ODE model that NODE uses.

5.3. CONTINUOUS NORMALIZING FLOW WITH C-NODES

We compare the performance of CNFs defined with NODEs to flows defined with C-NODEs on MNIST and CIFAR-10. We use the Hutchinson trace estimator to calculate the trace and use multiscale convolutional architectures to model the density transformation, as done in Dinh et al. (2017) and Grathwohl et al. (2019). Differential equations are solved with the adaptive fifth-order Dormand-Prince Runge-Kutta solver and trained with the adjoint method. Although the forward Euler method is faster, experimental results show that its fixed step size often leads to negative Bits/Dim, indicating the importance of adaptive solvers. As shown in Table 3 and Figure 7, using a similar number of parameters, CNFs defined with C-NODEs perform better than CNFs defined with NODEs in terms of Bits/Dim, while having lower variance and using a lower NFE on both MNIST and CIFAR-10.
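The Hutchinson trace estimator mentioned above replaces the exact trace with a Monte Carlo average of quadratic forms, so only Jacobian-vector products are needed. A minimal sketch with a fixed stand-in Jacobian (the matrix and sample count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
J = rng.standard_normal((n, n))   # stand-in for the Jacobian of the dynamics

# Hutchinson estimator: tr(J) = E[eps^T J eps] for noise with E[eps eps^T] = I.
# Only the map v -> J v is needed, so the full Jacobian is never materialized.
def hutchinson_trace(jvp, n, n_samples, rng):
    total = 0.0
    for _ in range(n_samples):
        eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher noise
        total += eps @ jvp(eps)
    return total / n_samples

est = hutchinson_trace(lambda v: J @ v, n, n_samples=20000, rng=rng)
print(abs(est - np.trace(J)))   # error shrinks as n_samples grows
```

In the CNF setting, the jvp is supplied by automatic differentiation through the network, and a single noise sample per solver step is typically used during training.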

5.4. PDE MODELING WITH C-NODES

We consider a regression example for a hyperbolic PDE with a known analytical solution. Since NODEs assume that the latent state depends only on a scalar (namely time), they cannot model dependencies that vary over the multiple spatial variables required by most PDEs. We modify the assumptions used in the classification and density estimation experiments, where the boundary conditions were constant as in equation 3. We approximate the following BVP:
$$u \frac{\partial u}{\partial x} + \frac{\partial u}{\partial t} = u, \qquad u(x, 0) = 2x, \quad 1 \le x \le 2, \tag{11}$$
which has the analytical solution $u(x, t) = \frac{2x e^t}{2 e^t - 1}$. We generate a training dataset by randomly sampling 200 points $(x, t)$, $x \in [1, 2]$, $t \in [0, 1]$, together with the values $u(x, t)$ at those points. We test C-NODE and NODE on 200 points randomly sampled from $(x, t) \in [1, 2] \times [0, 1]$. For this experiment, C-NODE uses 809 parameters while NODE uses 1185 parameters. We quantify the differences in representation capability by examining how well each method can represent the PDE. C-NODE deviates 8.05% from the test set, while NODE deviates 30.52%. Further experimental details can be found in Appendix A.4.1.
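As a sanity check on this BVP, the sketch below verifies by finite differences that u(x, t) = 2x e^t / (2e^t - 1) solves u u_x + u_t = u with u(x, 0) = 2x; this is the form of the solution consistent with that boundary condition (the denominator sign is reconstructed here from the characteristics):

```python
import numpy as np

# Candidate solution of u u_x + u_t = u with u(x, 0) = 2x on 1 <= x <= 2.
u = lambda x, t: 2 * x * np.exp(t) / (2 * np.exp(t) - 1)

x = np.linspace(1.0, 2.0, 21)
t = np.linspace(0.1, 1.0, 19)
X, T = np.meshgrid(x, t)

# Central finite differences for the partial derivatives.
h = 1e-6
u_x = (u(X + h, T) - u(X - h, T)) / (2 * h)
u_t = (u(X, T + h) - u(X, T - h)) / (2 * h)

residual = u(X, T) * u_x + u_t - u(X, T)
print(np.max(np.abs(residual)))          # PDE residual, near machine precision
print(np.max(np.abs(u(x, 0) - 2 * x)))   # boundary condition, exactly zero
```

The same finite-difference residual can be used to sanity-check the network outputs of C-NODE and NODE on the sampled test points.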

6. DISCUSSION

We describe an approach for extending NODEs to PDEs by solving a series of ODEs along the characteristics of a PDE. The approach applies to any black-box ODE solver and can be combined with existing NODE-based frameworks. We empirically showcase its efficacy on classification tasks and demonstrate improved convergence with the forward Euler method without the adjoint method. C-NODEs also consistently achieve lower test MSEs on different time series prediction datasets, with lower standard errors. Additionally, C-NODEs empirically achieve better performance on density estimation tasks while using fewer parameters and lower NFEs. C-NODE's efficiency in physical modeling is also highlighted with additional experiments. A discussion of limitations can be found in Appendix B.

A EXPERIMENTAL DETAILS

Implementation details of this paper can be found at https://github.com/XingziXu/NeuralPDE.git.

A.1 EXPERIMENTAL DETAILS OF CLASSIFICATION TASKS

We report the average performance over five independent training runs; the models are trained for 100 epochs on all three datasets. We also report the training dynamics of C-NODE and NODE using the adjoint sensitivity method and Euler backpropagation, as shown in Figure 2. Using the Euler solver, C-NODEs appear to converge faster than the vanilla NODEs (usually within one epoch) while generally having a more stable training process with smaller variance. Additionally, in experiments on MNIST, C-NODEs converge in only one epoch, while NODEs converge in roughly 15 epochs. This provides additional empirical evidence of the benefits of training using the characteristics. The inputs for 2nd-Ord, NODE, and C-NODE are the original images. In the IL-NODE, we transform the input to a latent space before the integration; that is, we raise the $\mathbb{R}^{c \times h \times w}$-dimensional input image into the $\mathbb{R}^{(c+p) \times h \times w}$-dimensional latent feature space. For SVHN and CIFAR-10, we assume $x \in \mathbb{R}^3$, i.e., the Jacobian is $J_x u = (\partial u/\partial x_1, \partial u/\partial x_2, \partial u/\partial x_3)$. We model each partial derivative $\partial u/\partial x_i$ with a separate convolutional network. The architectures of the networks modeling the partial derivatives and dx/ds are shown in Tables 5 and 6 and are used for both CIFAR-10 and SVHN; the network architecture for MNIST differs slightly due to the lower dimensionality of MNIST. The hyperparameters used are shown in Table 4. (Figure 2 shows the training process with the adjoint method in Fig. 2a and with Euler in Fig. 2b, averaged over five runs; the columns correspond to SVHN, CIFAR-10, and MNIST. By incorporating the C-NODE method, we achieve a more stable training process on both CIFAR-10 and SVHN while achieving higher accuracy. A full-sized figure is in the supplementary materials.)
We decode the result after performing the continuous transformations along the characteristic curves back to the $\mathbb{R}^{c \times h \times w}$-dimensional object space. Combining this with the C-NODE can be seen as solving a PDE on the latent features of the images rather than on the images directly. Unlike ODEs, we take derivatives with respect to different variables in PDEs. For a PDE with $k$ variables, this results in the constraint of the balance equations
$$\frac{\partial^2 u}{\partial x_i \partial x_j} = \frac{\partial^2 u}{\partial x_j \partial x_i}, \quad i, j \in \{1, 2, \ldots, k\},\ i \neq j.$$
This can be satisfied by defining the $k$-th derivative with a neural network and integrating $k - 1$ times to obtain the first-order derivatives. Another way of satisfying the balance equations is to drop the dependency on the variables, i.e., $\forall i \in \{1, 2, \ldots, k\}$, $\partial u/\partial x_i = f_i(u; \theta)$. When we drop the dependency, all higher-order derivatives are zero, and the balance equations are satisfied. All experiments were performed on NVIDIA RTX 3090 GPUs on a cloud cluster.

A.2 EXPERIMENTAL DETAILS OF TIME SERIES PREDICTION

We provide a more detailed explanation of the time-series experiments, which are based on the ODE-RNN framework described in Rubanova et al. (2019). Our experiments test the effectiveness of NODE, C-NODE, ANODE, and their combinations under the ODE-RNN framework by computing:

Algorithm 1: ODE-RNN model (Rubanova et al., 2019)
Input: data points and their timestamps {(x_i, t_i)}_{i=1,...,N}
h_0 = 0
for i in 1, 2, ..., N do
    h'_i = ODESolve(f_θ, h_{i-1}, (t_{i-1}, t_i))
    h_i = RNNCell(h'_i, x_i)
end for
o_i = OutputNN(h_i) for all i = 1, ..., N

We consider the task of predicting $u(x, t) = \frac{2x e^t}{2e^t - 1}$ at different times $t$, over $x \in [1, 2]$. We specify the initial condition $u(1, 0)$. We use an 8-dimensional C-NODE network. The C-NODE result is calculated as $u(x, t) = u(1, 0) + \int_0^t \sum_{i=1}^{8} \frac{\partial u}{\partial z_i}\frac{dz_i}{ds}\, ds$, and the NODE result as $u(x, t) = u(1, 0) + \int_0^t \frac{\partial u}{\partial t}\, dt$. The experimental results are given in Table 11. In our experiments, C-NODEs use 1221 parameters, ANODEs use 1270 parameters, and NODEs use 1290 parameters. All experiments were performed on NVIDIA RTX 3080 Ti GPUs on a local machine.
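Algorithm 1 can be sketched as follows, with forward Euler standing in for the black-box ODESolve and small random matrices standing in for f_θ and the RNN cell (all dimensions and parameter values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4  # hidden-state dimension (assumption)

Wf = 0.2 * rng.standard_normal((d, d))      # stand-in parameters for f_theta
Wr = 0.2 * rng.standard_normal((d, 2 * d))  # stand-in RNN cell parameters

def ode_solve(h, t0, t1, n_steps=10):
    # forward-Euler stand-in for the black-box ODESolve in Algorithm 1
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * np.tanh(Wf @ h)
    return h

def rnn_cell(h, x):
    return np.tanh(Wr @ np.concatenate([h, x]))

# ODE-RNN: evolve h by the ODE between observation times, then update it
# with each new (possibly irregularly sampled) observation.
times = np.array([0.0, 0.3, 0.45, 1.0, 1.7])
obs = rng.standard_normal((len(times), d))
h, t_prev, hidden = np.zeros(d), 0.0, []
for t_i, x_i in zip(times, obs):
    h = ode_solve(h, t_prev, t_i)
    h = rnn_cell(h, x_i)
    hidden.append(h)
    t_prev = t_i
print(len(hidden), hidden[-1].shape)   # one hidden state per observation
```

Replacing the ODE inside `ode_solve` with the C-NODE characteristic system yields the C-NODE variant of the encoder compared in Table 2.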

A.3 EXPERIMENTAL DETAILS OF CONTINUOUS NORMALIZING FLOWS

We report the average performance over four independent training runs. As shown in Figure 8, compared to NODE, using a C-NODE structure improves the stability of training as well as the final performance. Specifically, the standard errors for C-NODEs on MNIST, SVHN, and CIFAR-10 are 0.37%, 0.51%, and 0.24%, respectively, while for NODEs the standard errors on MNIST, SVHN, and CIFAR-10 are 1.07%, 0.32%, and 0. The network structures are provided in Tables 12 and 13, and the training hyperparameters in Table 14. The experiments were developed using code adapted from that provided by the authors of Grathwohl et al. (2019) at https://github.com/rtqichen/ffjord. All experiments were performed on NVIDIA RTX 3090 GPUs on a cloud cluster.

A.4.1 2-DIMENSIONAL BURGERS' EQUATION

We want to solve the initial value problem
$$u \frac{\partial u}{\partial x} + \frac{\partial u}{\partial t} = u, \qquad u(x, 0) = 2x, \quad 1 \le x \le 2,$$
whose exact solution is $u(x, t) = \frac{2x e^t}{2e^t - 1}$. The dataset's inputs are 200 randomly sampled points $(x, t)$, $x \in [1, 2]$, $t \in [0, 1]$, and the dataset's outputs are the exact solutions at those points. For the C-NODE architecture, we define four networks: $NN_1(x, t)$ for $\partial u/\partial x$, $NN_2(x, t)$ for $\partial u/\partial t$, $NN_3(t)$ for the characteristic path $(x(s), t(s))$, and $NN_4(x)$ for the initial condition. The result is calculated in four steps:
1. Integrate $\Delta u = \int_0^t \frac{du(x(s), t(s))}{ds}\, ds = \int_0^t \left(\frac{\partial u}{\partial t}\frac{dt}{ds} + \frac{\partial u}{\partial x}\frac{dx}{ds}\right) ds = \int_0^t \left(NN_2 \cdot NN_3[0] + NN_1 \cdot NN_3[1]\right) ds$ as before.
2. Given $x, t$, solve the equation $\iota + NN_3(NN_4(\iota))[0] \cdot t = x$ for $\iota$ iteratively, with $\iota_{n+1} = x - NN_3(NN_4(\iota_n))[0] \cdot t$ and $\iota_0$ initialized to $x$.
3. Calculate the initial value $u(x(0), t(0)) = NN_4(\iota)$.
4. $u(x, t) = \Delta u + u(x(0), t(0))$.
For the NODE architecture, we define one network, $NN_1(x, t)$, for $\partial u/\partial t$. The result is calculated as $u(x, t) = \int_0^t \frac{\partial u}{\partial t}\, dt = \int_0^t NN_1\, dt$. All experiments were performed on NVIDIA RTX 3080 Ti GPUs on a local machine.
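Step 2 above is a fixed-point iteration for the characteristic foot ι. A minimal sketch, with a toy contraction standing in for the composed network NN_3(NN_4(·))[0] (the function and its scale are assumptions chosen so the iteration contracts):

```python
import numpy as np

# Solve  iota + v(iota) * t = x  for iota by fixed-point iteration, where
# v is a stand-in for NN_3(NN_4(.))[0]; 0.3 * tanh keeps |v'| <= 0.3 so the
# map iota -> x - v(iota) * t is a contraction for t < 1 / 0.3.
v = lambda iota: 0.3 * np.tanh(iota)

def solve_foot(x, t, n_iter=50):
    iota = x                      # iota_0 initialized to x, as in step 2
    for _ in range(n_iter):
        iota = x - v(iota) * t    # iota_{n+1} = x - v(iota_n) * t
    return iota

x, t = 1.5, 0.8
iota = solve_foot(x, t)
print(abs(iota + v(iota) * t - x))   # residual of the implicit equation
```

Convergence is geometric here because the iteration map is a contraction; with a learned network, the contraction property is not guaranteed and may require a small t or damping.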

A.4.2 100-DIMENSIONAL CONVECTION EQUATION

We additionally experiment on solving a 100-dimensional convection equation given by
$$\frac{\partial u}{\partial t} = -\mathrm{div}\left(\mu'(t)\, u(x, t)\right), \qquad u(x, 0) = \exp\left(-\tfrac{1}{2}\|x - \mu(0)\|^2\right), \qquad \mu(t) = \sin\left((0,\ 0.01t,\ \ldots,\ t)^\top\right),$$
where $x \in \mathbb{R}^{100}$, $u: \mathbb{R}^{100} \times \mathbb{R}^+ \to \mathbb{R}$, and $\mu: \mathbb{R}^+ \to \mathbb{R}^{100}$. This equation has the analytical solution $u(x, t) = \exp(-\tfrac{1}{2}\|x - \mu(t)\|^2)$. We generate a training dataset with 1000 points and a testing dataset with 1000 points. We uniformly sample $x$, with each $x_i \in [0, 0.5]$, and $t \in [0, 10]$, and calculate the outputs using the analytical solution. We define three networks for C-NODE: the first models $dx/ds: \mathbb{R}^{101} \to \mathbb{R}^{101}$, the second models $J_x u: \mathbb{R}^{101} \to \mathbb{R}^{101}$, and the third, $\Gamma: \mathbb{R}^{101} \to \mathbb{R}$, models the initial condition $u(x, 0)$. We first integrate $(x, t)$ using the first two networks; the output is then input to the third network to obtain the value of the solution. The total number of parameters for the networks is 11611. We define two networks for NODE: the first models $du/ds: \mathbb{R}^{101} \to \mathbb{R}^{101}$, and the second, $\Gamma: \mathbb{R}^{101} \to \mathbb{R}$, models the initial condition $u(x, 0)$. We first integrate $(x, t)$ using the first network; its $\mathbb{R}^{101}$ output is passed to the second network to produce the result. The number of parameters used here is 14214. We test C-NODE and NODE on the dataset with Gaussian noise at different magnitudes. As shown in Table 15, C-NODE performs 14.2% better than NODE when there is no noise, 44.4% better with 10% noise, 39.18% better with 20% noise, and 40.1% better with 30% noise. The architectures of C-NODE and NODE are given in Tables 16-19. (Figure 9, red: NODE, blue: C-NODE, shows the computation time per epoch using the adjoint method and forward Euler integration; C-NODE uses slightly more computation time with forward Euler and the same amount of time with the adjoint method.)
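The analytical solution above can be verified numerically: since μ'(t) does not depend on x, div(μ'(t)u) = μ'(t)·∇u, and the sketch below checks the resulting identity by finite differences (the exact indexing of μ's components is an assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 100

# mu(t) = sin of the vector (0, 0.01t, ..., t); entries assumed as 0.01*i*t.
mu = lambda t: np.sin(0.01 * t * np.arange(d))
u = lambda x, t: np.exp(-0.5 * np.sum((x - mu(t)) ** 2))

# For an x-independent velocity mu'(t), the convection equation reads
#   du/dt = -mu'(t) . grad_x u,
# which the Gaussian bump u(x, t) = exp(-||x - mu(t)||^2 / 2) satisfies.
x = rng.uniform(0.0, 0.5, size=d)
t = 3.0
h = 1e-5
du_dt = (u(x, t + h) - u(x, t - h)) / (2 * h)          # finite-diff in time
grad_u = -(x - mu(t)) * u(x, t)                        # analytic gradient
mu_prime = (mu(t + h) - mu(t - h)) / (2 * h)           # finite-diff mu'(t)
print(abs(du_dt + mu_prime @ grad_u))                  # residual, near zero
```

The same analytical solution is what generates the training and testing outputs described above.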
The models are optimized with the AdamW optimizer, with a learning rate of 5 × 10^-3 and a weight decay of 5 × 10^-4. We also report the computation time of NODE and C-NODE using the adjoint method and Euler forward integration. As shown in Figure 9, C-NODE uses slightly more computation time for Euler forward integration, and the same amount of time when using the adjoint method. Although each call of C-NODE takes slightly longer, NODE uses a larger number of function evaluations, resulting in roughly similar total computation times for NODE and C-NODE when integrating with the adjoint method.

A note on the difference from physics-informed neural networks (PINNs). PINNs (Raissi et al., 2019) solve a PDE through a regularization approach, minimizing a penalty term that enforces the PDE. Compared with C-NODE, there are a few apparent differences. First, PINNs form a very general framework that can accommodate almost any type of PDE, whereas C-NODE only computes solutions to hyperbolic PDEs. On the other hand, PINNs generally scale poorly with dimension due to the difficulty of computing high-dimensional derivatives. Moreover, PINNs require the exact form of the PDE being solved, whereas C-NODE does not require this information.

B LIMITATIONS

There are several limitations to the proposed method. The MoC only applies to hyperbolic PDEs, and we only consider first-order semilinear PDEs in this paper. This may be a limitation since this is a specific class of PDEs that does not model all data. We also did not enforce any particular structure to prevent characteristics from intersecting, which may result in shock waves and rarefactions; however, we believe this is unlikely to happen due to the high dimensionality of the ambient space. As noted in the experiments section, C-NODE can have decreased stability compared to ANODE when the number of function evaluations exceeds an extreme threshold.

C APPROXIMATION CAPABILITIES OF C-NODE

Proposition C.1 (Method of Characteristics for Vector-Valued PDEs). Let u(x_1, ..., x_k) : R^k → R^n be the solution of a first-order semilinear PDE on a bounded domain Ω ⊂ R^k of the form

    Σ_{i=1}^k a_i(x_1, ..., x_k, u) ∂u/∂x_i = c(x_1, ..., x_k, u)

on (x_1, ..., x_k) = x ∈ Ω. Additionally, let a = (a_1, ..., a_k)^⊺ : R^{k+n} → R^k and c : R^{k+n} → R^n be Lipschitz continuous functions. Define a system of ODEs

    dx/ds (s) = a(x(s), U(s)),
    dU/ds (s) = c(x(s), U(s)),
    x(0) := x_0, x_0 ∈ ∂Ω,
    U(0) := u_0 := u(x_0),

where x_0 and u_0 define the initial condition and ∂Ω is the boundary of the domain Ω. Given initial conditions x_0, u_0, the solution U(s) : [a, b] → R^n of this system of ODEs is equal to the solution of the PDE in Equation 12 along the characteristic curve defined by x(s), i.e., u(x(s)) = U(s). The union of the solutions U(s) over all x_0 ∈ ∂Ω is equal to the solution of the original PDE in Equation 12 for all x ∈ Ω.

Lemma C.2 (Gronwall's Lemma (Howard, 1998)). Let U ⊂ R^n be an open set, let f : U × [0, T] → R^n be a continuous function, and let h_1, h_2 : [0, T] → U satisfy the initial value problems

    dh_1(t)/dt = f(h_1(t), t), h_1(0) = x_1,
    dh_2(t)/dt = f(h_2(t), t), h_2(0) = x_2.

If there exists a non-negative constant C such that for all t ∈ [0, T],

    ‖f(h_2(t), t) − f(h_1(t), t)‖ ≤ C ‖h_2(t) − h_1(t)‖,

where ‖·‖ is the Euclidean norm, then for all t ∈ [0, T],

    ‖h_2(t) − h_1(t)‖ ≤ e^{Ct} ‖x_2 − x_1‖.

C.1 PROOF OF PROPOSITION C.1

This proof extends the proof for the univariate case, given in the Wikipedia article on the method of characteristics (see footnote), to the vector-valued case.

Proof. For a PDE on u with k inputs and an n-dimensional output, we have a_i : R^{k+n} → R, ∂u/∂x_i ∈ R^n, and c : R^{k+n} → R^n. In Proposition C.1, we consider PDEs of the form

    Σ_{i=1}^k a_i(x_1, ..., x_k, u) ∂u/∂x_i = c(x_1, ..., x_k, u).    (14)

Defining and substituting x = (x_1, ..., x_k)^⊺, a = (a_1, ..., a_k)^⊺, and the Jacobian J(u(x)) = (∂u/∂x_1, ..., ∂u/∂x_k) ∈ R^{n×k} into Equation 12 results in

    J(u(x)) a(x, u) = c(x, u).

From Proposition C.1, the characteristic curves are given by dx_i/ds = a_i(x_1, ..., x_k, u), and the ODE system is

    dx/ds (s) = a(x(s), U(s)),    (15)
    dU/ds (s) = c(x(s), U(s)).    (16)

Define the difference between the solution to Equation 16 and the PDE in Equation 12 as

    ∆(s) = ‖u(x(s)) − U(s)‖² = (u(x(s)) − U(s))^⊺ (u(x(s)) − U(s)).

Differentiating ∆(s) with respect to s and plugging in Equation 15, we get

    ∆′(s) := d∆(s)/ds = 2 (u(x(s)) − U(s)) · (J(u) x′(s) − U′(s))
           = 2 [u(x(s)) − U(s)] · [J(u) a(x(s), U(s)) − c(x(s), U(s))].    (17)

Equation 14 gives us Σ_{i=1}^k a_i(x_1, ..., x_k, u) ∂u/∂x_i − c(x_1, ..., x_k, u) = 0, that is, J(u) a(x(s), u(x(s))) − c(x(s), u(x(s))) = 0. Plugging this equality into Equation 17 and rearranging terms, we have

    ∆′(s) = 2 [u(x(s)) − U(s)] · {[J(u) a(x(s), U(s)) − c(x(s), U(s))] − [J(u) a(x(s), u(x(s))) − c(x(s), u(x(s)))]}.

Combining terms (and suppressing the argument x(s)), we have

    ∆′ = 2 (u − U) · ([J(u) a(U) − c(U)] − [J(u) a(u) − c(u)])
       = 2 (u − U) · (J(u) [a(U) − a(u)] − [c(U) − c(u)]).

Applying the Cauchy–Schwarz and triangle inequalities, we have

    ‖∆′‖ ≤ 2 ‖u − U‖ (‖J(u)‖ ‖a(U) − a(u)‖ + ‖c(U) − c(u)‖).

By the assumption in Proposition C.1, a and c are Lipschitz continuous, so ‖a(U) − a(u)‖ ≤ A ‖u − U‖ and ‖c(U) − c(u)‖ ≤ B ‖u − U‖ for some constants A, B ∈ R_+. Also, on a compact interval [0, s_0], s_0 < ∞, since both u and the Jacobian J are continuous mappings, the image of J(u) is compact; since a subspace of R^n is compact if and only if it is closed and bounded (Strichartz, 2000), J(u) is bounded. Thus ‖J(u)‖ ≤ M for some constant M ∈ R_+. Defining C = 2(AM + B), we have

    ‖∆′(s)‖ ≤ 2 (AM ‖u − U‖ + B ‖u − U‖) ‖u − U‖ = C ‖u − U‖² = C ∆(s).

Since ∆(0) = ‖u(x_0) − u_0‖² = 0, Gronwall's Lemma (Lemma C.2) implies ∆(s) = 0 for all s, and therefore u(x(s)) = U(s). Taking the union over all x_0 ∈ ∂Ω yields the claim.

As an example of intersecting trajectories, consider a C-NODE system whose solution along each characteristic satisfies du/ds = 1 − 2u_0, so that

    u(s; u_0) = u_0 + (1 − 2u_0) s,

and in particular u(s; 0) = s and u(s; 1) = 1 − s, two trajectories that intersect at s = 1/2. To be specific, we can represent this system with the following family of PDEs:

    ∂u/∂x + u_0 ∂u/∂t = 1 − 2u_0.

We can solve this system to obtain a function G that has intersecting trajectories.
The solution is visualized in Figure 10, which shows that C-NODE can be used to learn and represent this function G. It should be noted that this is not the only possible way to represent G: when ∂t/∂s = 0, we fall back to a NODE system with the dynamics conditioned on the input data. In this conditioned setting, we can then represent G by stopping the dynamics at different times t, as in Massaroli et al. (2021).

C.3 PROOF OF PROPOSITION 4.2

The proof uses the change of variables formula for a particle that depends on a vector rather than a scalar, and it follows directly from the proof given in (Chen et al., 2019b, Appendix A). We provide the full proof for completeness.

Proof. We initially assume that Σ_{i=1}^k (∂u/∂x_i)(dx_i/ds) is Lipschitz continuous in u and continuous in s, so that every initial value problem has a unique solution (Evans, 2010). We additionally assume u(s) is bounded. We want to show that the probability flow satisfies

    ∂ log p(u(s))/∂s = −tr( ∂/∂u [ Σ_{i=1}^k (∂u/∂x_i)(dx_i/ds) ] ).

Define T_ϵ = u(s + ϵ). The discrete change of variables states that for u_1 = f(u_0),

    log p(u_1) = log p(u_0) − log |det ∂f/∂u_0|

(Rezende & Mohamed, 2015). Taking the limit of the time difference between u_0 and u_1, and using the definition of the derivative,

    ∂ log p(u(s))/∂s = lim_{ϵ→0+} [log p(u(s + ϵ)) − log p(u(s))]/ϵ
    = lim_{ϵ→0+} [log p(u(s)) − log |det ∂T_ϵ(u(s))/∂u| − log p(u(s))]/ϵ
    = −lim_{ϵ→0+} [log |det ∂T_ϵ(u(s))/∂u|]/ϵ
    = −lim_{ϵ→0+} (∂/∂ϵ) log |det ∂T_ϵ(u(s))/∂u|    (by L'Hôpital's rule)
    = −lim_{ϵ→0+} (1/|det ∂T_ϵ(u(s))/∂u|) (∂/∂ϵ) |det ∂T_ϵ(u(s))/∂u|
    = −[lim_{ϵ→0+} (∂/∂ϵ) |det ∂T_ϵ(u(s))/∂u|] / [lim_{ϵ→0+} |det ∂T_ϵ(u(s))/∂u|]
    = −lim_{ϵ→0+} (∂/∂ϵ) |det ∂T_ϵ(u(s))/∂u|,

where the last step uses lim_{ϵ→0+} |det ∂T_ϵ(u(s))/∂u| = |det I| = 1. Jacobi's formula states that if A is a differentiable map from the real numbers to n × n matrices, then

    d/dt det A(t) = tr( adj(A(t)) dA(t)/dt ),

where adj denotes the adjugate matrix.
Applying this relation, we obtain

    ∂ log p(u(s))/∂s = −lim_{ϵ→0+} tr( adj(∂T_ϵ(u(s))/∂u) (∂/∂ϵ)(∂T_ϵ(u(s))/∂u) )
    = −tr( [lim_{ϵ→0+} adj(∂T_ϵ(u(s))/∂u)] [lim_{ϵ→0+} (∂/∂ϵ)(∂T_ϵ(u(s))/∂u)] )
    = −tr( adj(I) lim_{ϵ→0+} (∂/∂ϵ)(∂T_ϵ(u(s))/∂u) )
    = −tr( lim_{ϵ→0+} (∂/∂ϵ)(∂T_ϵ(u(s))/∂u) ).

Substituting T_ϵ with its Taylor series expansion and taking the limit, we have the desired result:

    ∂ log p(u(s))/∂s = −tr( lim_{ϵ→0+} (∂/∂ϵ)(∂/∂u)[ u + ϵ du/ds + O(ϵ²) + O(ϵ³) + ... ] )
    = −tr( lim_{ϵ→0+} (∂/∂ϵ)(∂/∂u)[ u + ϵ Σ_{i=1}^k (∂u/∂x_i)(dx_i/ds) + O(ϵ²) + O(ϵ³) + ... ] )
    = −tr( lim_{ϵ→0+} (∂/∂ϵ)[ I + ϵ (∂/∂u) Σ_{i=1}^k (∂u/∂x_i)(dx_i/ds) + O(ϵ²) + O(ϵ³) + ... ] )
    = −tr( lim_{ϵ→0+} [ (∂/∂u) Σ_{i=1}^k (∂u/∂x_i)(dx_i/ds) + O(ϵ) + O(ϵ²) + ... ] )
    = −tr( (∂/∂u) Σ_{i=1}^k (∂u/∂x_i)(dx_i/ds) ).

C.4 PROOF OF PROPOSITION 4.3

Proof. To prove Proposition 4.3, we need to show that for any homeomorphism h(·), there exists a u(s, u_0) ∈ R^n following a C-NODE system such that u(s = T, u_0) = h(u_0). Without loss of generality, suppose T = 1. Define a C-NODE system as

    du/ds = (∂u/∂x)(dx/ds) + (∂u/∂t)(dt/ds),
    dx/ds (s, u_0) = 1,      ∂u/∂x (u(x, t)) = h(u_0),
    dt/ds (s, u_0) = u_0,    ∂u/∂t (u(x, t)) = −1.

Then du/ds = h(u_0) − u_0. At s = 1, we have

    u(s = 1, u_0) = u(s = 0, u_0) + ∫_0^1 (du/ds) ds
                  = u_0 + ∫_0^1 [(∂u/∂x)(dx/ds) + (∂u/∂t)(dt/ds)] ds
                  = u_0 + ∫_0^1 [h(u_0) · 1 + (−1) · u_0] ds
                  = u_0 + h(u_0) − u_0 = h(u_0).

The inverse map is defined by backward integration. Specifically, we have

    u(s = 0, u_0) = u(s = 1, u_0) + ∫_1^0 (du/ds) ds
                  = h(u_0) − ∫_0^1 [(∂u/∂x)(dx/ds) + (∂u/∂t)(dt/ds)] ds
                  = h(u_0) − ∫_0^1 [h(u_0) · 1 + (−1) · u_0] ds
                  = h(u_0) − h(u_0) + u_0 = u_0.

Thus, for any homeomorphism h(·), there exists a C-NODE system such that forward integration for time s = 1 is equivalent to applying h(·), and backward integration for time s = 1 is equivalent to applying h^{−1}(·).
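As a quick numerical sanity check of this construction (our sketch; the homeomorphism h below is an arbitrary illustrative choice), Euler integration of du/ds = h(u_0) − u_0 from s = 0 to s = 1 reproduces h(u_0), and integrating the same field backward recovers u_0:

```python
# Numerical check of the construction in C.4 (illustrative sketch):
# with dx/ds = 1, dt/ds = u0, du/dx = h(u0), du/dt = -1, we get
# du/ds = h(u0) - u0, so integrating from s = 0 to s = 1 maps u0 to h(u0).
def h(u):
    # Any homeomorphism works; an affine map is used as an example.
    return 2.0 * u + 1.0

def cnode_forward(u0, steps=1000):
    u, ds = u0, 1.0 / steps
    for _ in range(steps):
        du_ds = h(u0) * 1.0 + (-1.0) * u0   # du/dx * dx/ds + du/dt * dt/ds
        u += du_ds * ds
    return u

def cnode_backward(u1, u0, steps=1000):
    # Integrate the same vector field from s = 1 back to s = 0.
    u, ds = u1, 1.0 / steps
    for _ in range(steps):
        u -= (h(u0) - u0) * ds
    return u

u0 = 0.7
u1 = cnode_forward(u0)
print(u1, h(u0))                    # forward pass reproduces h(u0)
print(cnode_backward(u1, u0), u0)   # backward pass recovers u0
```

Because du/ds is constant in s for this construction, even the Euler scheme recovers the map up to floating-point error.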

D INCLUDING INITIAL STATE IN NODE'S INPUT

Conditioning on the initial condition allows NODE to model intersecting trajectories. We compare NODE conditioned on initial values against C-NODE on image classification tasks on the MNIST, SVHN, and CIFAR-10 datasets, using the forward Euler solver and the adjoint solver. As shown in Figure 11, conditioning on initial values improves NODE's performance on all datasets when using adjoint integration, and on SVHN when using Euler forward integration. When using Euler forward integration, C-NODE performs better than both NODE and conditioned-NODE on all three datasets; C-NODE reaches higher accuracy within fewer epochs and has lower variance throughout the training process. All methods use a comparable number of parameters, with C-NODE using the fewest, as reported in Table 20. When using the adjoint method, C-NODE performs better than NODE on all datasets, and better than conditioned-NODE on SVHN and CIFAR-10, while achieving performance similar to conditioned-NODE on MNIST. All experiments were performed on NVIDIA RTX 3090 GPUs on a cloud cluster.
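A minimal illustration of why conditioning helps (a toy construction of ours, not the paper's experiment): the map u_0 ↦ −u_0 cannot be realized by an autonomous ODE du/dt = f(u), because its solution trajectories would have to cross, but a vector field conditioned on the initial value realizes it exactly.

```python
# Toy illustration: conditioning the vector field on the initial value
# lets trajectories cross.  The field f(u, u0) = -2 * u0 sends
# u0 to u0 + (-2 * u0) * 1 = -u0 after unit time.
def conditioned_flow(u0, steps=100):
    u, dt = u0, 1.0 / steps
    for _ in range(steps):
        u += (-2.0 * u0) * dt   # field depends on the initial value u0
    return u

print(conditioned_flow(1.5))    # approximately -1.5
print(conditioned_flow(-0.4))   # approximately 0.4
```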

E ABLATION STUDY E.1 ABLATION STUDY ON DIMENSION OF C-NODE

We perform an ablation study on the impact of the number of dimensions used in C-NODE. This study allows us to evaluate the relationship between model performance and the model's mathematical approximation power. Empirical results show that as we increase the number of dimensions in the C-NODE model, performance first improves and then declines due to overfitting. We found that information criteria such as AIC and BIC can be successfully applied for dimension selection in this scenario. In previous experiments, we represented each ∂u/∂x_i with a separate, independent neural network c_i(u, θ). Here, we represent all k functions as a single vector-valued function [∂u/∂x_1, ..., ∂u/∂x_k]^⊺, approximated by one neural network c(u, θ). The model is trained using the Euler solver to have better training stability when the neural network has a large number of parameters. Experiment details for the ablation study are shown in Figures 12, 13, and 14.
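A minimal sketch of how such information criteria can rank candidate C-NODE dimensions (the parameter counts and validation MSEs below are made-up placeholders, not the paper's numbers), using the Gaussian-error log-likelihood, which equals −(n/2) log(MSE) up to an additive constant:

```python
import math

# Placeholder candidates: dimension -> (parameter count, validation MSE).
candidates = {
    2: (300, 0.120),
    4: (400, 0.052),
    8: (750, 0.049),
    16: (1350, 0.048),
}
n = 1000  # number of validation points (also a placeholder)

def aic_bic(k_params, mse):
    # Gaussian log-likelihood up to a constant: -(n/2) * log(MSE).
    log_l = -0.5 * n * math.log(mse)
    aic = 2 * k_params - 2 * log_l
    bic = k_params * math.log(n) - 2 * log_l
    return aic, bic

scores = {d: aic_bic(p, m) for d, (p, m) in candidates.items()}
best_aic = min(scores, key=lambda d: scores[d][0])
best_bic = min(scores, key=lambda d: scores[d][1])
print(best_aic, best_bic)  # both criteria select dimension 4 here
```

With these placeholder numbers, both criteria select an intermediate dimension: the likelihood gain from dimension 2 to 4 outweighs the parameter penalty, while further growth does not.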



This is based on the code of Rubanova et al. (2019) provided at https://github.com/YuliaRubanova/latent_ode
This is based on the code of Grathwohl et al. (2019) provided at https://github.com/rtqichen/ffjord
This is based on the code of Massaroli et al. (2021) provided at https://github.com/DiffEqML/torchdyn
Proof for the quasilinear case: https://en.wikipedia.org/wiki/Method_of_characteristics#Proof_for_quasilinear_Case



Figure 1: Comparison of traditional NODE (left) and proposed C-NODE (right). The solution to NODE is the solution to a single ODE, whereas C-NODE represents a series of ODEs that form the solution to a PDE. Each color in C-NODE represents the solution to an ODE with a different initial condition. NODE represents a single ODE, and can only represent u(x, t) along one dimension, for example, u(x = 0, t).

(u_{t+∆t} − u_t)/∆t ≈ du(t)/dt. The model can then be evaluated through existing numerical integration techniques, as proposed by Chen et al. (2019b):

Figure 2: Red: NODE. Blue: C-NODE. Training dynamics on different datasets, with the adjoint method in Fig. 2a and with Euler in Fig. 2b, averaged over five runs. The first column is the training process on SVHN, the second column on CIFAR-10, and the third column on MNIST. By incorporating the C-NODE method, we achieve a more stable training process on both CIFAR-10 and SVHN, while achieving higher accuracy. Full-sized figure in supplementary materials.

Figure 3: Red: NODE. Blue: C-NODE. Training dynamics of C-NODE, NODE, and their augmented versions on the PhysioNet dataset. C-NODE achieves lower testing MSE and more stable training dynamics.

Figure 4: Red: NODE. Blue: C-NODE. Training dynamics of C-NODE, NODE, and their augmented versions on the Human Activity dataset. C-NODE achieves lower testing MSE and more stable training dynamics.

Figure 7: Red: NODE. Blue: C-NODE. Training dynamics of CNFs on MNIST dataset with adjoint method. We present Bits/dim of the first 50 training epochs.

Figure 8: The training process averaged over four runs of C-NODE and NODE. The first row shows the results on MNIST, the second row on SVHN, and the third row on CIFAR-10.


Figure 10: Comparison of C-NODEs and NODEs. C-NODEs (solid blue) learn a family of integration paths conditioned on the input value, avoiding intersecting dynamics. NODEs (dashed red) integrate along a 1D line that is not conditioned on the input value and cannot represent functions requiring intersecting dynamics.

Figure 11: Red: NODE. Blue: C-NODE. Orange: Conditional NODE. Training dynamics on different datasets, with the adjoint method in Fig. 11a and with Euler in Fig. 11b, averaged over four runs. The first column is the training process on SVHN, the second column on CIFAR-10, and the third column on MNIST. By conditioning on the initial values, NODE's performance is improved on all datasets when using adjoint integration, and improved on SVHN when using Euler forward integration.

Figure 15: The training process averaged over four runs of C-NODE with 95071, 55855, and 17379 parameters on the CIFAR-10 dataset, and NODE with 96044, 55855, and 17379 parameters. The first row is the prediction accuracy, the second row is the testing error, and the third row is the training error. Blue lines are the results for C-NODE, and red lines are the results for NODE.

Mean test results over 5 runs of different NODE models over SVHN, CIFAR-10, and MNIST. Accuracy and NFE at convergence are reported. Applying C-NODE always increases models' accuracy and usually reduces models' NFE as well as the standard error.

Mean test results over 4 runs of different NODE models over the PhysioNet, Human Activity, and MuJoCo datasets.

Experimental results on generation tasks with NODE, C-NODE, and other models, including Real NVP (Dinh et al., 2017). Columns report parameter count (Param. ↓), NFE ↓, and Bits/dim (B/D ↓). Using a similar number of parameters, C-NODE outperforms NODE on all three datasets, and has a significantly lower NFE when training on CIFAR-10.

Training hyperparameters for image classification.

Return: {o_i}_{i=1,...,N}; h_N

Network Structure of dx/ds when using ODE dataset

Training hyperparameters for time series analysis.

A.2.2 EXPERIMENT RESULTS OF TIME SERIES PREDICTIONS ON SYNTHETIC DATASET

We test C-NODEs, ANODEs, and NODEs on a synthetic time series prediction problem. We define a function u(x, t) = 2x e^t / (2e^t + 1), and we sample ũ = u(x, t) + 0.1 ϵ_t, where ϵ_t ∼ N(0, 1), over x ∈ [1, 2], t ∈ [0, 1] to generate the training dataset. We test the performance on t ∈ [n, n + 1] with n ∈ {0, 1, ..., 5}.
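The sampling procedure above can be sketched as follows (our reading of the setup; the authors' exact sampling scheme may differ):

```python
import math
import random

# Synthetic target function from the text: u(x, t) = 2x e^t / (2 e^t + 1).
def u(x, t):
    return 2.0 * x * math.exp(t) / (2.0 * math.exp(t) + 1.0)

random.seed(0)
train = []
for _ in range(1000):                      # dataset size is our choice
    x = random.uniform(1.0, 2.0)           # x in [1, 2]
    t = random.uniform(0.0, 1.0)           # t in [0, 1]
    noisy = u(x, t) + 0.1 * random.gauss(0.0, 1.0)  # additive 0.1 * N(0, 1)
    train.append(((x, t), noisy))

# Evaluation uses later horizons t in [n, n + 1] for n = 0, ..., 5.
test_horizons = [(n, n + 1) for n in range(6)]
print(len(train), test_horizons[-1])
```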

Time series prediction results for NODE, ANODE, and C-NODE at different time intervals. Errors are testing mean squared errors. Across all time intervals, C-NODE outperforms NODE and ANODE. We also test C-NODEs, NODEs, and ANODEs on time series prediction with different levels of noise. Specifically, using the same function as above, we form training and testing datasets with ϵ_t ∼ N(0, m), m ∈ {0, 1, ..., 5}. We test the performance on the time period t ∈ [0, 1].

Time series prediction results for NODE, ANODE, and C-NODE at different noise levels. Errors are testing mean squared errors.

C-NODE performs better than NODE at all noise levels. C-NODE uses 11611 parameters, while NODE uses 14215 parameters.

Network structure of J x u

Network structure of dx/ds

Network structure of NODE

It is worth noting that the characteristic curves of the C-NODE also depend on the initial values, which allows C-NODE to model a dynamical system with intersecting trajectories. C-NODE and conditioned NODE are two different methods of including the initial values. Empirically, C-NODE achieves better results and is more parameter-efficient, as suggested in Figure 11 and Table 20.

Table 20: Parameter counts and classification accuracy for different models and integration schemes in the conditional-NODE experiment.

ACKNOWLEDGMENTS

This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-21-1-2590. AH was supported by NSF-GRFP.

annex

Table 6: Network architecture of the network modeling the characteristic curve dx/ds.

We compare evaluation metrics under the different ODE models. The ODE-RNN framework combines the strengths of a neural ODE and an RNN by first embedding the time series history into a latent distribution parameterized by an RNN and then decoding the predicted latent distribution into the original data space. The main components of the method are illustrated in Algorithm 1. Forecasts are computed by integrating the latent space in time according to the neural ODE model, with initial condition z_0 distributed according to the parameterization by the RNN. We define the approximate posterior of z_0 as q(z_0 | {x_i, t_i}_{i=0}^N) = N(µ_{z_0}, σ_{z_0}), where µ_{z_0}, σ_{z_0} = ϱ(ODE-RNN_ϕ({x_i, t_i}_{i=0}^N)). To bypass the RNN's requirement of a fixed observation rate, the ODE-RNN model uses states that obey an ODE between observations and are updated at new observations. Given a set of time series data {x_i, t_i}_{i=0}^N, we embed the time series using the ODE-RNN and pass its output through a function ϱ(·) to get the mean µ_{z_0} and the variance σ_{z_0} of the posterior of z_0. To predict the value at timestamp T, we sample K initial conditions z_0 from the posterior q(z_0 | {x_i, t_i}_{i=0}^N), integrate the latent ODE model until timestamp T, and use the average of the K integrations as the result. For all experiments, we follow the experimental setup described at https://github.com/YuliaRubanova/latent_ode. We experiment with the ODE-RNN framework using NODE, C-NODE, ANODE, and their combinations as the ODE backbone of both the ODE-RNN model and the latent ODE.
Training NODE follows the original setup: the dimension of the ODE model in the ODE-RNN is 20, the number of units per layer in each GRU update network is 100, the number of units per layer in the ODE function is around 100, and the number of layers in the ODE function is 1 in both the generative and the recognition ODE. We use a C-NODE with a dimensionality of 8. The network architecture details are given in Tables 7 and 8, and the training hyperparameters are given in Table 9. The number of units per layer in the network describing dx/ds is 12. For the network describing ∂u/∂x_i, the dimension of the ODE model in the ODE-RNN is 20, the number of units per layer in each GRU update network is 100, the number of units per layer in the ODE function is tuned to match the number of parameters in the NODE models, and the number of layers in the ODE function is 1 in both the generative and the recognition ODE.
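The prediction loop described above can be sketched as follows (a hedged sketch of ours: the encoder, latent dynamics, and data are stand-in closures and toy values, not the paper's networks or datasets):

```python
import math
import random

random.seed(0)

def encoder(observations):
    # Stand-in for rho(ODE-RNN(...)): returns mean/std of q(z0 | data).
    m = sum(x for x, _ in observations) / len(observations)
    return m, 0.1

def dynamics(z):
    # Stand-in latent vector field dz/ds.
    return -0.5 * z

def integrate(z0, T, steps=200):
    # Euler integration of the latent ODE until time T.
    z, h = z0, T / steps
    for _ in range(steps):
        z += h * dynamics(z)
    return z

def predict(observations, T, K=50):
    # Sample K initial conditions from the posterior, integrate each
    # to time T, and average the K integrations.
    mu, sigma = encoder(observations)
    samples = [integrate(random.gauss(mu, sigma), T) for _ in range(K)]
    return sum(samples) / K

obs = [(1.0, 0.0), (0.8, 0.3), (0.7, 0.9)]
print(predict(obs, T=1.0))
```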

Operation             Input Channels   Output Channels   Kernel Size   Stride   Padding
Convolutional Layer   4                8                 (3, 3)        (1, 1)   (1, 1)

Algorithm 3 Algorithm for training CNFs defined with C-NODE
for each input data z_j do
    extract the image feature u(s = 0) = g(z_j; Θ_1) with a feature-extractor neural network
    procedure Integrate along s = 0 → 1
        for each time step s_m do
            calculate dx/ds (x, u; g(z_j; Θ_1); Θ_2) and J_x u(x, u; Θ_2)
            calculate du/ds = J_x u dx/ds
            calculate u(s_{m+1}) = u(s_m) + (du/ds)(s_{m+1} − s_m)
            calculate x(s_{m+1}) = x(s_m) + (dx/ds)(s_{m+1} − s_m)
        end for
    end procedure
    classify u(s = 1) with the neural network Φ(u(x(s = 1)), Θ_3)
    procedure Integrate from 1 → 0 to get u(0) and log p(z_j) − log p(u(0))
        for each time step s_m do
            calculate dx/ds (x, u; g(z_j; Θ_1); Θ_2) and J_x u(x, u; Θ_2)
            calculate du/ds = J_x u dx/ds
            calculate −tr(∂/∂u (J_x u dx/ds)) with the Hutchinson trace estimator (Grathwohl et al., 2019)
        end for
    end procedure
    evaluate p_0(u(0))
    calculate log p(z_j) = (log p(z_j) − log p(u(0))) + log p_0(u(0))
    optimize log p(z_j) with an optimization algorithm (stochastic gradient descent, etc.)
end for

Algorithm 4 Algorithm for sampling CNFs defined with C-NODE
sample u(s = 0) from the base distribution p_0(·)
procedure Integrate from 0 → 1 to get u(s = 1)
    for each time step s_m do
        calculate dx/ds (x, u; g(z_j; Θ_1); Θ_2) and J_x u(x, u; Θ_2)
        calculate du/ds = J_x u dx/ds
        calculate u(s_{m+1}) = u(s_m) + (du/ds)(s_{m+1} − s_m)
    end for
end procedure
u(s = 1) is our sample from the CNF

