NEURAL DELAY DIFFERENTIAL EQUATIONS

Abstract

Neural Ordinary Differential Equations (NODEs), a framework of continuous-depth neural networks, have been widely applied, showing exceptional efficacy in coping with some representative datasets. Recently, an augmented framework has been successfully developed to overcome some limitations that emerged in the application of the original framework. Here we propose a new class of continuous-depth neural networks with delay, named Neural Delay Differential Equations (NDDEs), and, for computing the corresponding gradients, we use the adjoint sensitivity method to obtain the delayed dynamics of the adjoint. Since differential equations with delays are usually seen as dynamical systems of infinite dimension possessing more fruitful dynamics, the NDDEs, compared to the NODEs, have a stronger capacity for nonlinear representation. Indeed, we analytically validate that the NDDEs are universal approximators, and further articulate an extension of the NDDEs, where the initial function of the NDDEs is supposed to satisfy ODEs. More importantly, we use several illustrative examples to demonstrate the outstanding capacities of the NDDEs and of the NDDEs with ODEs' initial value. Specifically, (1) we successfully model the delayed dynamics where the trajectories in the lower-dimensional phase space could be mutually intersected, while the traditional NODEs without any augmentation are not directly applicable to such modeling, and (2) we achieve lower loss and higher accuracy not only for the data produced synthetically by complex models but also for real-world image datasets, i.e., CIFAR10, MNIST, and SVHN. Our results on the NDDEs reveal that appropriately articulating the elements of dynamical systems into the network design is truly beneficial to promoting network performance.

1. INTRODUCTION

A series of recent works have revealed a close connection between neural networks and dynamical systems (E, 2017; Li et al., 2017; Haber & Ruthotto, 2017; Chang et al., 2017; Li & Hao, 2018; Lu et al., 2018; E et al., 2019; Chang et al., 2019; Ruthotto & Haber, 2019; Zhang et al., 2019a; Pathak et al., 2018; Fang et al., 2018; Zhu et al., 2019; Tang et al., 2020). On one hand, deep neural networks can be used to solve ordinary/partial differential equations that cannot be easily computed using traditional algorithms. On the other hand, the elements of dynamical systems can be useful for establishing novel and efficient frameworks of neural networks. A typical example is the Neural Ordinary Differential Equations (NODEs), where the infinitesimal time of ordinary differential equations is regarded as the "depth" of the considered neural network (Chen et al., 2018). Though the advantages of the NODEs were demonstrated through modeling continuous-time datasets and continuous normalizing flows with constant memory cost (Chen et al., 2018), their limited capability of representing some functions was also studied (Dupont et al., 2019). Indeed, the NODEs cannot be directly used to describe dynamical systems whose trajectories in the lower-dimensional phase space are mutually intersected. Also, the NODEs cannot model, using only a few observed variables, those physical or/and physiological systems where the effect of time delay is inevitably present. From the viewpoint of dynamical systems theory, all these are attributed to the finite-dimensional character of the NODEs. In this article, we propose a novel framework of continuous-depth neural networks with delay, named Neural Delay Differential Equations (NDDEs). We apply the adjoint sensitivity method to compute the corresponding gradients, where the obtained adjoint systems are also in the form of delay differential equations.
The main virtues of the NDDEs include:
• feasible and computable algorithms for computing the gradients of the loss function based on the adjoint systems,
• the capability of representing vector fields that allow the trajectories to intersect in the lower-dimensional phase space, and
• accurate reconstruction of complex dynamical systems with time-delay effects from the observed time-series data.

2. RELATED WORKS

NODEs. Inspired by the residual neural networks (He et al., 2016) and other analogous frameworks, the NODEs were established. A residual network can be represented by multiple residual blocks as h_{t+1} = h_t + f(h_t; w_t), where h_t is the hidden state of the t-th layer, f(h_t; w_t) is a differentiable function preserving the dimension of h_t, and w_t is the parameter pending for learning. The evolution of h_t can be viewed as a special case of the equation h_{t+Δt} = h_t + Δt · f(h_t; w_t) with Δt = 1. As suggested in (Chen et al., 2018), all the parameters w_t are unified into w for achieving parameter efficiency of the NODEs. This unified operation was also employed in other neural networks, such as the recurrent neural networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) and ALBERT (Lan et al., 2019). Letting Δt → 0 and using the unified parameter w instead of w_t, we obtain the continuous evolution of the hidden state h_t as
$$\lim_{\Delta t\to 0}\frac{h_{t+\Delta t}-h_t}{\Delta t}=\frac{dh_t}{dt}=f(h_t,t;w),$$
which is in the form of an ordinary differential equation. Actually, the NODEs can act as a feature extractor, mapping an input to a point in the feature space by computing the forward path of a NODE as
$$h(T)=h(0)+\int_0^T f(h_t,t;w)\,dt,\qquad h(0)=\text{input},$$
where h(0) is the original data point or its transformation, and T is the integration time (assuming that the system starts at t = 0). Under a predefined loss function L(h(T)), (Chen et al., 2018) employed the adjoint sensitivity method to compute memory-efficient gradients of the parameters along with the ODE solvers. More precisely, they defined the adjoint variable λ(t) = ∂L(h(T))/∂h(t), whose dynamics obey another ODE, i.e.,
$$\frac{d\lambda(t)}{dt}=-\lambda(t)\frac{\partial f(h_t,t;w)}{\partial h_t}.$$
The gradients are then computed by the integral
$$\frac{dL}{dw}=\int_T^0-\lambda(t)\frac{\partial f(h_t,t;w)}{\partial w}\,dt.$$
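As a minimal illustration of this adjoint computation (our own sketch, not the implementation of (Chen et al., 2018)), consider the scalar ODE ḣ = w·h with loss L = h(T): the adjoint obeys λ̇ = −λw with λ(T) = 1, and dL/dw = ∫_0^T λ(t)h(t)dt, whose analytic value is T·h0·e^{wT}. The function name and discretization are illustrative choices.

```python
import numpy as np

def node_adjoint_grad(w=0.3, h0=1.0, T=1.0, n=1000):
    """Adjoint sensitivity sketch for the scalar NODE dh/dt = w*h with L = h(T).
    Forward Euler for h, then the adjoint d(lambda)/dt = -lambda * df/dh = -lambda*w
    integrated backward, accumulating dL/dw = integral of lambda(t)*df/dw = lambda*h."""
    dt = T / n
    hs = [h0]
    for _ in range(n):                      # forward pass, storing the trajectory
        hs.append(hs[-1] + w * hs[-1] * dt)
    lam, grad = 1.0, 0.0                    # lambda(T) = dL/dh(T) = 1
    for k in range(n, 0, -1):               # backward pass
        grad += lam * hs[k] * dt            # accumulate lambda(t) * h(t) * dt
        lam += lam * w * dt                 # step d(lambda)/dt = -lambda*w backward
    return grad

g = node_adjoint_grad()
print(g)   # ≈ T*h0*exp(w*T) = exp(0.3) ≈ 1.35
```

The Euler discretization introduces an O(Δt) error, so the result matches the analytic gradient only up to the step size.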
(Chen et al., 2018) calculated the gradients by calling an ODE solver with extended ODEs (i.e., concatenating the original state, the adjoint, and the other partial derivatives for the parameters at each time point into a single vector). Notably, for the regression task of time series, the loss function may depend on the states at multiple observational times, such as in the form L(h(t_0), h(t_1), ..., h(t_n)). In such a case, we must update the adjoint state instantly by adding the partial derivative of the loss at each observational time point. As emphasized in (Dupont et al., 2019), the flow of the NODEs cannot represent some functions that frequently arise in applications. A typical example is the following two-valued function of one argument: g_{1-D}(1) = -1 and g_{1-D}(-1) = 1. Our framework aims to overcome the representation limitations observed in applying the NODEs.

Optimal control. As mentioned above, a close connection between deep neural networks and dynamical systems has been emphasized in the literature and, correspondingly, theories, methods, and tools of dynamical systems have been employed, e.g., the theory of optimal control (E, 2017; Li et al., 2017; Haber & Ruthotto, 2017; Chang et al., 2017; Li & Hao, 2018; E et al., 2019; Chang et al., 2019; Ruthotto & Haber, 2019; Zhang et al., 2019a). Generally, we model a typical task using a deep neural network and then train the network parameters such that the given loss function can be reduced by some learning algorithm. In fact, training a network can be seen as solving an optimal control problem on difference or differential equations (E et al., 2019). The parameters act as a controller with the goal of finding an optimal control to minimize/maximize some objective function. Clearly, the framework of the NODEs can be formulated as a typical problem of optimal control on ODEs.
Additionally, the framework of NODEs has been generalized to other dynamical systems, such as the Partial Differential Equations (PDEs) (Han et al., 2018; Long et al., 2018; 2019; Ruthotto & Haber, 2019; Sun et al., 2020) and the Stochastic Differential Equations (SDEs) (Lu et al., 2018; Sun et al., 2018; Liu et al., 2019), where the theory of optimal control has been completely established. It is worthwhile to mention that optimal control theory is tightly connected with and benefits from the classical calculus of variations (Liberzon, 2011). We will also transform our framework into an optimal control problem and finally solve it using the calculus of variations.

3. THE FRAMEWORK OF NDDES

In this section, we establish a framework of continuous-depth neural networks with delay. To this end, we first introduce the concept of delay differential equations (DDEs).

3.1. DELAY DIFFERENTIAL EQUATIONS

A DDE is always written in a form where the derivative of a given variable at time t is affected not only by the current state of this variable but also by its states at some previous time instants or over previous time durations (Erneux, 2009). Such delayed dynamics play an important role in the description of the complex phenomena emergent in many real-world systems, such as physical, chemical, ecological, and physiological systems.

[Figure 1: Sketchy diagrams of the NODEs and the NDDEs, respectively, with the initial value h(0) and the initial function φ(t). The NODEs and the NDDEs act as the feature extractors, and the following layer processes the features with a predefined loss function.]

In this article, we consider a system of DDEs with a single time delay:
$$\frac{dh_t}{dt}=f(h_t,h_{t-\tau},t;w),\quad t\ge 0,\qquad h(t)=\varphi(t),\quad t\le 0,$$
where the positive constant τ is the time delay and h(t) = φ(t) is the initial function before the time t = 0.
Clearly, for the initial-value problem of an ODE, we only need to initialize the state of the variable at t = 0, while we initialize a DDE with a continuous function. Here, to highlight the difference between ODEs and DDEs, we provide a simple example:
$$\frac{dx_t}{dt}=-2x_{t-\tau},\quad t\ge 0,\qquad x(t)=x_0,\quad t\le 0,$$
where τ = 1 (with time delay) or τ = 0 (without time delay). As shown in Figure 2, the DDE flow can map -1 to 1 and 1 to -1; nevertheless, this cannot be achieved by the ODE, whose trajectories do not intersect with each other in the t-x space in Figure 2.
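The mapping x0 ↦ −x0 realized by this DDE at T = τ = 1 can be verified with a short numerical sketch (our own illustrative code, not the paper's): on [0, τ] the delayed state equals the constant history x0, so the solution is x(t) = x0(1 − 2t) and forward Euler reproduces x(T) = −x0 up to rounding.

```python
import numpy as np

def euler_dde(x0, tau=1.0, T=1.0, dt=0.001):
    """Integrate dx/dt = -2*x(t - tau) with constant history x(t) = x0 for t <= 0."""
    n_steps = int(round(T / dt))
    n_hist = int(round(tau / dt))
    # grid values of x at times -tau, ..., 0, dt, ..., T
    xs = np.empty(n_hist + n_steps + 1)
    xs[: n_hist + 1] = x0                   # constant initial function on [-tau, 0]
    for k in range(n_steps):
        x_delayed = xs[k]                   # x(k*dt - tau)
        xs[n_hist + k + 1] = xs[n_hist + k] - 2.0 * x_delayed * dt
    return xs[-1]

print(euler_dde(1.0))    # ≈ -1.0: the DDE flow maps 1 to -1
print(euler_dde(-1.0))   # ≈  1.0: ... and -1 to 1
```

An ODE flow cannot realize this sign-swapping map, since its trajectories would have to cross in the t-x plane.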

3.2. ADJOINT METHOD FOR NDDES

Assume that the forward pass of the DDE is complete. Then, we need to compute the gradients in reverse-mode differentiation by using the adjoint sensitivity method (Chen et al., 2018; Pontryagin et al., 1962). We consider an augmented variable, named the adjoint and defined as λ(t) = ∂L(h(T))/∂h(t), where L(·) is the loss function pending for optimization. Notably, the resulting system for the adjoint is in the form of a DDE as well.

Theorem 1 (Adjoint method for NDDEs). Consider the loss function L(·). Then, the dynamics of the adjoint can be written as
$$\frac{d\lambda(t)}{dt}=-\lambda(t)\frac{\partial f(h_t,h_{t-\tau},t;w)}{\partial h_t}-\lambda(t+\tau)\frac{\partial f(h_{t+\tau},h_t,t+\tau;w)}{\partial h_t}\chi_{[0,T-\tau]}(t),\quad t\le T,\qquad\lambda(T)=\frac{\partial L(h(T))}{\partial h(T)},\qquad(4)$$
where χ_{[0,T-τ]}(·) is a typical characteristic function.

We provide two ways to prove Theorem 1, which are, respectively, shown in the Appendix. Using h(t) and λ(t), we compute the gradients with respect to the parameters w as
$$\frac{dL}{dw}=\int_T^0-\lambda(t)\frac{\partial f(h_t,h_{t-\tau},t;w)}{\partial w}\,dt.$$
Clearly, when the delay τ approaches zero, the adjoint dynamics degenerate to the conventional case of the NODEs (Chen et al., 2018). We solve the forward pass of h and the backward pass for h, λ, and dL/dw by a piece-wise ODE solver, which is shown in Algorithm 1. For simplicity, we denote by f(t) and g(t) the vector fields of h and λ, respectively. Moreover, in this paper, we only consider an initial function φ(t) that is a constant function, i.e., φ(t) = h_0. Assume that T = n·τ and denote f_k(t) = f(k·τ + t), g_k(t) = g(k·τ + t), and λ_k(t) = λ(k·τ + t). In the traditional framework of the NODEs, one can calculate the gradients of the loss function and recompute the hidden states by solving another augmented ODE backward in time.
However, to achieve the reverse mode of the NDDEs in Algorithm 1, we need to store the checkpoints of the forward hidden states h(i·τ) for i = 0, 1, ..., n, which, together with the adjoint λ(t), help us recompute h(t) backwards in every time period. The main idea of Algorithm 1 is to convert the DDE into piece-wise ODEs such that one can naturally employ the framework of the NODEs to solve it. The complexity of Algorithm 1 is analyzed in the Appendix.

Algorithm 1: Piece-wise reverse-mode derivative of a DDE initial-function problem
Input: dynamics parameters w, time delay τ, start time 0, stop time T = n·τ, final state h(T), loss gradient ∂L/∂h(T)
∂L/∂w = 0_{|w|}
for i in range(n-1, -1, -1):
    s_0 = [h(T), h(T-τ), ..., h(τ), ∂L/∂h(T), ..., ∂L/∂h((i+1)·τ), ∂L/∂w]
    def aug_dynamics([h_{n-1}(t), ..., h_0(t), λ_{n-1}(t), ..., λ_i(t), ·], t, w):
        return [f_{n-1}(t), f_{n-2}(t), ..., f_0(t), g_{n-1}(t), ..., g_i(t), -λ_i(t) ∂f_i(t)/∂w]
    [∂L/∂h(i·τ), ∂L/∂w] = ODESolve(s_0, aug_dynamics, τ, 0, w)
return ∂L/∂h(0), ∂L/∂w

4. ILLUSTRATIVE EXPERIMENTS

4.1. EXPERIMENTS ON SYNTHETIC DATASETS

Here, we use some synthetic datasets produced by typical examples to compare the performance of the NODEs and the NDDEs. In (Dupont et al., 2019), it is proved that the NODEs cannot represent the function g : R^d → R defined by
$$g(x)=\begin{cases}1,&\|x\|\le r_1,\\-1,&r_2\le\|x\|\le r_3,\end{cases}$$
where 0 < r_1 < r_2 < r_3 and ‖·‖ is the Euclidean norm. The following proposition shows that the NDDEs have a stronger capability of representation.

Proposition 1. The NDDEs can represent the function g(x) specified above.

To validate this proposition, we construct a special form of the NDDEs by
$$\frac{dh_i(t)}{dt}=\|h_{t-\tau}\|-r,\quad t\ge 0\text{ and }i=1,\qquad\frac{dh_i(t)}{dt}=0,\quad t\ge 0\text{ and }i=2,...,d,\qquad h(t)=x,\quad t\le 0,$$
where r := (r_1 + r_2)/2 is a constant and the final time point T is supposed to be equal to the time delay τ with some sufficiently large value.
Under such configurations, we can linearly separate the two clusters by some hyperplane. The training losses and the flows of the NODEs and the NDDEs are depicted, respectively, in Fig. 5. Particularly, the NDDEs achieve lower losses at a faster speed and directly separate the two clusters in the original 2-D space; however, the NODEs achieve this only by increasing the dimension of the data and separating the clusters in a higher-dimensional space. In general, we have the following theoretical result for the NDDEs, whose proof is provided in the Appendix.

[Figure 4: Evolutions of the NDDEs (top) and the NODEs (bottom) in the feature space during the training procedure. Here, the evolution of the NODEs is directly produced by the code provided in (Dupont et al., 2019).]

4.2. EXPERIMENTS ON IMAGE DATASETS

Here, we apply the NODEs and the NDDEs to the image datasets. For the NODEs, we model the vector field f(h(t)) with the convolutional architectures together with the same hyperparameter setups as in (Dupont et al., 2019). The initial hidden state is h(0) ∈ R^{c×h×w} with respect to an image, where c, h, w are, respectively, the number of channels, the height, and the width of the image. For the NDDEs, we design the vector field as f(concat(h(t), h(t-τ))), mapping R^{2c×h×w} to R^{c×h×w}, where concat(·,·) executes the concatenation of two tensors along the channel dimension. Its initial function is designed as a constant function, i.e., h(t) is the input/image for t < 0.

Theorem 2 (Universal approximating capability of the NDDEs). For any given continuous function F : R^n → R^n, if one can construct a neural network approximating the map G(x) = (1/T)[F(x) - x], then there exists an NDDE of dimension n that can model the map x → F(x), that is, h(T) ≈ F(x) with the initial function φ(t) = x for t ≤ 0.

Additionally, the NDDEs are suitable for fitting time series with delay effects in the original systems, which cannot be easily achieved by using the NODEs. To illustrate this, we use a model of 2-D DDEs, written as ẋ = A tanh(x(t) + x(t-τ)) with x(t) = x_0 for t < 0. Given the time series generated by the system, we use the NDDEs and the NODEs to fit it, respectively. Figure 6 shows that the NDDEs approach a much lower loss compared to the NODEs. More interestingly, the NODEs prefer to fit the dimension of x_2, and their loss always sustains at a larger value, e.g., 0.25 in Figure 6. The main reason is that two different trajectories generated by autonomous ODEs cannot intersect with each other in the phase space, due to the uniqueness of the solution of ODEs. We perform experiments on another two classical DDEs, i.e., the population dynamics and the Mackey-Glass system (Erneux, 2009).
Specifically, the equation of the dimensionless population dynamics is ẋ = r x(t)(1 - x(t-τ)), where x(t) is the ratio of the population to the carrying capacity, r is the growth rate, and τ is the time delay. The Mackey-Glass system is written as ẋ = β x(t-τ)/(1 + x^n(t-τ)) - γ x(t), where x(t) is the number of blood cells and β, n, τ, γ are parameters of biological significance. The NODEs and the NDDEs are tested on these two dynamics. As shown in Figure 7, a very low training loss is achieved for the NDDEs, while the loss of the NODEs does not go down, always sustaining at some larger value; over the prediction durations, the NDDEs thereby achieve better performance than the NODEs.

For the image datasets, we not only use the NODEs and the NDDEs but also an extension of the NDDEs, called NODE+NDDE, which treats the initial function as an ODE. In our experiments, such a model exhibits the best performance, revealing strong capabilities in modeling and feature representation. Moreover, inspired by the idea of the augmented NODEs (Dupont et al., 2019), we extend the NDDEs and the NODE+NDDE to A+NDDE and A+NODE+NDDE, respectively. Precisely, for the augmented models, we augment the original image space to a higher-dimensional space, i.e., R^{c×h×w} → R^{(c+p)×h×w}, where c, h, w, and p are, respectively, the number of channels, the height, the width of the image, and the augmented dimension. With such configurations of the same augmented dimension and approximately the same number of model parameters, comparison studies on the image datasets using different models are reasonable. For the NODEs, we model the vector field f(h(t)) with convolutional architectures together with slightly different hyperparameter setups from (Dupont et al., 2019). The initial hidden state is set as h(0) ∈ R^{c×h×w} with respect to an image. For the NDDEs, we design the vector field as f(concat(h(t), h(t-τ))), mapping R^{2c×h×w} to R^{c×h×w}, where concat(·,·) executes the concatenation of two tensors along the channel dimension. Its initial function is designed as a constant function, i.e., h(t) is identical to the input/image for t < 0.
For the NODE+NDDE, we model the initial function as an ODE, which follows the same model structure as the NODEs. For the augmented models, the augmented dimension is chosen from the set {1, 2, 4}. Moreover, the training details can be found in the Appendix, including the training settings and the number of function evaluations for each model on the image datasets. The training processes on MNIST, CIFAR10, and SVHN are shown in Fig. 8. Overall, the training losses of the NDDEs and their extensions decrease faster than those of the NODEs/ANODEs, and they achieve lower training and test losses. Also, their test accuracies are much higher than those of the NODEs/ANODEs (refer to Tab. 1). Naturally, the better performance of the NDDEs is attributed to the integration of the information not only on the hidden states at the current time t but also at the previous time t-τ. This kind of framework is akin to the key idea proposed in (Huang et al., 2017), where the information is processed on many hidden states. Here, we run all the experiments 5 times independently.

5. DISCUSSION

In this section, we present the limitations of the NDDEs and further suggest several potential directions for future work. We add the delay effect to the NDDEs, which renders the model absolutely irreversible. Algorithm 1 thus requires storing checkpoints of the hidden state h(t) at every multiple of τ. Actually, solving the DDEs is transformed into solving ODEs of an increasingly high dimension with respect to the ratio of the final time T to the time delay τ. This definitely indicates a high computational cost. To further apply and improve the framework of the NDDEs, a few potential directions for future work are suggested, including the following.

[Table 1: The test accuracies with their standard deviations over 5 realizations on the three image datasets. In the first column, p (= 1, 2, or 4) in Ap means the number of channels of zeros appended to the input image during the augmentation of the image space R^{c×h×w} → R^{(c+p)×h×w} (Dupont et al., 2019). For each model, the initial (resp. final) time is set as 0 (resp. 1), and the delays of the NDDEs and their extensions are all set as 1, simply equal to the final time.]

Applications to more real-world datasets. In real-world systems such as physical, chemical, biological, and ecological systems, delay effects are inevitably omnipresent, truly affecting the dynamics of the produced time series (Bocharov & Rihan, 2000; Kajiwara et al., 2012). The NDDEs are undoubtedly suitable for realizing model-free and accurate prediction (Quaglino et al., 2019). The framework of the NDDEs can probably be applied to analogous areas as well, where the delay effects should be ubiquitous in streaming data.

Extension of the NDDEs. The single constant time delay in the NDDEs can be further generalized to the case of multiple or/and distributed time delays (Shampine & Thompson, 2001).
As such, the model is likely to have a much stronger capability of feature extraction, because the model leverages the information at different time points to make timely decisions. All these extensions could be potentially suitable for some complex tasks. However, such complex models may require a tremendous computational cost.

Time-dependent controllers. From the viewpoint of control theory, the parameters in the NODEs/NDDEs can be regarded as time-independent controllers, viz., constant controllers. A natural generalization is to model the parameters as time-dependent controllers. In fact, such controllers were proposed in (Zhang et al., 2019b), where the parameters w(t) were treated as another learnable ODE, ẇ = q(w(t), p), with q(·,·) a different neural network and the parameters p and the initial state w(0) pending for optimization. Also, the idea of using one neural network to generate another was initially conceived in some earlier works, including the study of hypernetworks (Ha et al., 2016).

6. CONCLUSION

In this article, we establish the framework of NDDEs, whose vector fields are determined not only by the current state but also by the states at previous times. We employ the adjoint sensitivity method to compute the gradients of the loss function. The obtained backward adjoint dynamics follow another DDE, coupled with the forward hidden states. We show that the NDDEs can represent some typical functions that cannot be represented by the original framework of NODEs. Moreover, we have validated analytically that the NDDEs possess universal approximating capability. We also demonstrate the exceptional efficacy of the proposed framework on both synthetic and real-world datasets. All these reveal that integrating the elements of dynamical systems into the architecture of neural networks could be potentially beneficial to promoting network performance.

A APPENDIX

A.1 THE FIRST PROOF OF THEOREM 1

Here, we present a direct proof of the adjoint method for the NDDEs. For neatness, the following notations are slightly different from those in the main text. Let x(t) obey the DDE
$$\dot{x}(t)=f(x(t),y(t),\theta(t)),\quad y(t)=x(t-\tau),\quad t\in[0,T],\qquad x(t)=x_0,\quad t\in[-\tau,0],$$
whose adjoint state is defined as
$$\lambda(t):=\frac{\partial L}{\partial x(t)},$$
where L is the loss function, that is, L := L(x(T)). After discretizing the above DDE, we have
$$x(t+\Delta t)=x(t)+\Delta t\cdot f(x(t),y(t),\theta(t))=x(t)+\Delta t\cdot f(x(t),x(t-\tau),\theta(t)),$$
$$x(t+\tau+\Delta t)=x(t+\tau)+\Delta t\cdot f(x(t+\tau),y(t+\tau),\theta(t+\tau))=x(t+\tau)+\Delta t\cdot f(x(t+\tau),x(t),\theta(t+\tau)).$$
According to the definition of λ(t) and applying the chain rule, we have
$$\begin{aligned}
\lambda(t)&=\frac{\partial L}{\partial x(t+\Delta t)}\frac{\partial x(t+\Delta t)}{\partial x(t)}+\frac{\partial L}{\partial x(t+\tau+\Delta t)}\frac{\partial x(t+\tau+\Delta t)}{\partial x(t)}\cdot\chi_{[0,T-\tau]}(t)\\
&=\lambda(t+\Delta t)\frac{\partial x(t+\Delta t)}{\partial x(t)}+\lambda(t+\tau+\Delta t)\frac{\partial x(t+\tau+\Delta t)}{\partial x(t)}\cdot\chi_{[0,T-\tau]}(t)\\
&=\lambda(t+\Delta t)\big(I+\Delta t\cdot f_x(x(t),y(t),\theta(t))\big)+\lambda(t+\tau+\Delta t)\,\Delta t\cdot f_y(x(t+\tau),y(t+\tau),\theta(t+\tau))\cdot\chi_{[0,T-\tau]}(t)\\
&=\lambda(t+\Delta t)+\Delta t\big[\lambda(t+\Delta t)\cdot f_x(t)+\lambda(t+\tau+\Delta t)\cdot f_y(t+\tau)\cdot\chi_{[0,T-\tau]}(t)\big],
\end{aligned}$$
which implies
$$\dot{\lambda}(t)=\lim_{\Delta t\to 0}\frac{\lambda(t+\Delta t)-\lambda(t)}{\Delta t}=-\lambda(t)\cdot f_x(t)-\lambda(t+\tau)\cdot f_y(t+\tau)\cdot\chi_{[0,T-\tau]}(t),$$
where χ_{[0,T-τ]}(·) is a characteristic function. For the parameter θ(t), the result can be analogously derived as
$$\frac{\partial L}{\partial\theta(t)}=\frac{\partial L}{\partial x(t+\Delta t)}\frac{\partial x(t+\Delta t)}{\partial\theta(t)}=\Delta t\cdot\lambda(t+\Delta t)\cdot f_\theta(t).\qquad(12)$$
In this article, θ(t) is considered to be a constant variable, i.e., θ(t) ≡ θ, which yields
$$\frac{\partial L}{\partial\theta}=\lim_{\Delta t\to 0}\sum_t\Delta t\cdot\lambda(t+\Delta t)\cdot f_\theta(t)=\int_0^T\lambda(t)\cdot f_\theta(t)\,dt=\int_T^0-\lambda(t)\cdot f_\theta(t)\,dt.$$
To summarize, we obtain the gradients with respect to x(0) and θ in the form of augmented DDEs. These DDEs are backward in time and written in the integral form
$$x(0)=x(T)+\int_T^0 f(x,y,\theta)\,dt,$$
$$\frac{\partial L}{\partial x(0)}=\frac{\partial L}{\partial x(T)}+\int_T^0-\lambda(t)\cdot f_x(t)-\lambda(t+\tau)\cdot f_y(t+\tau)\cdot\chi_{[0,T-\tau]}(t)\,dt,$$
$$\frac{\partial L}{\partial\theta}=\int_T^0-\lambda(t)\cdot f_\theta(t)\,dt.$$
A.2 THE SECOND PROOF OF THEOREM 1

Here, we mainly employ the Lagrange multiplier method and the calculus of variations to derive the adjoint method for the NDDEs. First, we define the Lagrangian by
$$\mathcal{L}:=L(x(T))+\int_0^T\lambda(t)\big(f(x(t),y(t),\theta)-\dot{x}(t)\big)\,dt.$$
We thereby need to find the Karush-Kuhn-Tucker (KKT) conditions for the Lagrangian, which are necessary for finding the optimal solution of θ. To obtain the KKT conditions, the calculus of variations is applied by taking variations with respect to λ(t) and x(t). For λ(t), let λ̂(t) be a continuous and differentiable function and ε a scalar. We add the perturbation ελ̂(t) to the Lagrangian, which results in the new Lagrangian
$$\mathcal{L}(\varepsilon):=L(x(T))+\int_0^T\big(\lambda(t)+\varepsilon\hat{\lambda}(t)\big)\big(f(x(t),y(t),\theta)-\dot{x}(t)\big)\,dt.$$
In order to obey the optimality condition for λ(t), we require
$$\frac{d\mathcal{L}(\varepsilon)}{d\varepsilon}=\int_0^T\hat{\lambda}(t)\big(f(x(t),y(t),\theta)-\dot{x}(t)\big)\,dt$$
to be zero. Due to the arbitrariness of λ̂(t), we obtain
$$\dot{x}(t)-f(x(t),y(t),\theta)=0,\quad\forall t\in[0,T],$$
which is exactly the DDE forward in time. Analogously, we take the variation with respect to x(t): let x̂(t) be a continuous and differentiable function and ε a scalar, with x̂(t) = 0 for t ∈ [-τ, 0]. We also denote ŷ(t) := x̂(t-τ) for t ∈ [0, T]. The new Lagrangian under the perturbation εx̂(t) becomes
$$\mathcal{L}(\varepsilon):=L\big(x(T)+\varepsilon\hat{x}(T)\big)+\int_0^T\lambda(t)\Big[f\big(x(t)+\varepsilon\hat{x}(t),\,y(t)+\varepsilon\hat{y}(t),\,\theta\big)-\frac{d\big(x(t)+\varepsilon\hat{x}(t)\big)}{dt}\Big]\,dt.$$
We then compute dL(ε)/dε, which gives
$$\begin{aligned}
\left.\frac{d\mathcal{L}(\varepsilon)}{d\varepsilon}\right|_{\varepsilon=0}
&=\frac{\partial L}{\partial x(T)}\hat{x}(T)+\int_0^T\lambda(t)\Big[f_x(x(t),y(t),\theta)\hat{x}(t)+f_y(x(t),y(t),\theta)\hat{y}(t)-\frac{d\hat{x}(t)}{dt}\Big]\,dt\\
&=\frac{\partial L}{\partial x(T)}\hat{x}(T)\underbrace{-\lambda(t)\hat{x}(t)\Big|_0^T+\int_0^T\hat{x}(t)\frac{d\lambda(t)}{dt}\,dt}_{\text{integration by parts}}+\int_0^T\lambda(t)f_x(x(t),y(t),\theta)\hat{x}(t)+\lambda(t)f_y(x(t),y(t),\theta)\hat{y}(t)\,dt\\
&=\Big[\frac{\partial L}{\partial x(T)}-\lambda(T)\Big]\hat{x}(T)+\int_0^T\hat{x}(t)\Big[\frac{d\lambda(t)}{dt}+\lambda(t)f_x(x(t),y(t),\theta)\Big]\,dt+\int_0^T\lambda(t)f_y(x(t),y(t),\theta)\hat{x}(t-\tau)\,dt\\
&=\Big[\frac{\partial L}{\partial x(T)}-\lambda(T)\Big]\hat{x}(T)+\int_0^T\hat{x}(t)\Big[\frac{d\lambda(t)}{dt}+\lambda(t)f_x(x(t),y(t),\theta)\Big]\,dt+\underbrace{\int_0^{T-\tau}\lambda(t'+\tau)f_y(x(t'+\tau),y(t'+\tau),\theta)\hat{x}(t')\,dt'}_{\text{no variation on the interval }[-\tau,0]}\\
&=\Big[\frac{\partial L}{\partial x(T)}-\lambda(T)\Big]\hat{x}(T)+\int_0^T\hat{x}(t)\Big[\frac{d\lambda(t)}{dt}+\lambda(t)f_x(x(t),y(t),\theta)+\lambda(t+\tau)f_y(x(t+\tau),y(t+\tau),\theta)\chi_{[0,T-\tau]}(t)\Big]\,dt.\qquad(20)
\end{aligned}$$
(In the third line we used x̂(0) = 0; in the fourth line we substituted t' = t - τ.) Notice that dL(ε)/dε|_{ε=0} = 0 must hold for all continuous and differentiable x̂(t) at the optimum. Thus, we have
$$\frac{d\lambda(t)}{dt}=-\lambda(t)f_x(x(t),y(t),\theta)-\lambda(t+\tau)f_y(x(t+\tau),y(t+\tau),\theta)\chi_{[0,T-\tau]}(t),\qquad\lambda(T)=\frac{\partial L}{\partial x(T)}.\qquad(21)$$
Therefore, the adjoint state follows a DDE as well.
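The adjoint formula derived above can be sanity-checked numerically on a scalar linear NDDE. The sketch below (our own illustrative code, not the paper's) takes ẋ = w·x(t-τ) with constant history x0 and T = τ, so f_x = 0 and the χ_{[0,T-τ]} term vanishes; the adjoint is then constant, λ(t) ≡ ∂L/∂x(T) = 1 for L = x(T), and the gradient integral reduces to ∫_0^T λ(t)·x(t-τ)dt = τ·x0, which we compare against a finite-difference estimate.

```python
def nddes_grad_check(x0=1.5, w=0.7, tau=1.0, dt=0.001):
    """Scalar NDDE dx/dt = w * x(t - tau), x(t) = x0 for t <= 0, T = tau, L = x(T).
    By Theorem 1, d(lambda)/dt = 0 here, so lambda(t) = 1 on [0, T]."""
    n = int(round(tau / dt))

    def forward(w):
        h = x0
        for _ in range(n):          # delayed state is the constant history x0
            h = h + w * x0 * dt
        return h

    # adjoint gradient: dL/dw = integral of lambda(t) * df/dw = 1 * x0 over [0, T]
    lam = 1.0
    grad_adjoint = sum(lam * x0 * dt for _ in range(n))

    # finite-difference reference
    eps = 1e-6
    grad_fd = (forward(w + eps) - forward(w - eps)) / (2 * eps)
    return grad_adjoint, grad_fd

ga, gf = nddes_grad_check()
print(ga, gf)   # both ≈ tau * x0 = 1.5
```

The two estimates agree because the forward map is exactly linear in w for this example.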

A.3 THE PROOF OF THEOREM 2

The validation of Theorem 2 is straightforward. Consider the NDDEs in the following form:
$$\frac{dh_t}{dt}=f(h_{t-\tau};w),\quad t\ge 0,\qquad h(t)=x,\quad t\le 0,$$
where τ equals the final time T. Since h(t) = x for t ≤ 0, the vector field of the NDDEs on the interval [0, T] is constant. This implies h(T) = x + T · f(x; w). Assume that the neural network f(x; w) is able to approximate the map G(x) = (1/T)[F(x) - x]. Then, we have h(T) = x + T · (1/T)[F(x) - x] = F(x).

A.4 THE COMPLEXITY OF ALGORITHM 1

Here, we intend to analyze the complexity of Algorithm 1. It should be noted that the state of the art for DDE software is not as advanced as that for ODE software; hence, solving a DDE is much more difficult than solving an ODE. There exist several DDE solvers, such as the popular dde23 provided by MATLAB. However, these DDE solvers usually need to store the history states to help the solvers access the past state h(t-τ). Hence, the memory cost of DDE solvers is O(H), where H is the number of history states. This is the major difference between the DDEs and the ODEs, as solving the ODEs is memory efficient, i.e., O(1). In Algorithm 1, we propose a piece-wise ODE solver to solve the DDEs. The underlying idea is to transform the DDEs into piece-wise ODEs on every τ time period such that one can naturally employ the framework of the NODEs. More precisely, we compute the state at time kτ by solving an augmented ODE with the augmented initial state, i.e., concatenating the states at times -τ, 0, τ, ..., (k-1)τ into a single vector. Such a method has several strengths and weaknesses. The strengths include:
• one can easily implement the algorithm using the framework of the NODEs, and
• Algorithm 1 is quite memory efficient, O(n), where we only need to store a small number of forward states, h(0), ..., h(nτ), to help the algorithm compute the adjoint and the gradients in reverse mode.
Here, the final time T is assumed to be not very large compared with the time delay τ, for example, T = nτ with a small integer n. Notably, we chose n = 1 (i.e., T = τ = 1.0) for the NDDEs in the experiments on the image datasets, resulting in quite memory-efficient computation. Additionally, the weaknesses involve:
• Algorithm 1 may suffer from a high memory cost if the final time T is extremely large compared with τ (i.e., many forward states, h(0), ..., h(nτ), are required to be stored), and
• the increasing augmented dimension of the ODEs may lead to a heavy time cost for each evaluation in the ODE solver.
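The constant-field construction in the proof of Theorem 2 above can be mirrored in a few lines of code (a sketch assuming the network realizes G exactly; the function names and the target map F are our own illustrative choices): with τ = T and constant history x, the delayed state is always x, so Euler integration of the constant field recovers h(T) = x + T·G(x) = F(x).

```python
import numpy as np

def nddes_represent(F, x, T=1.0, n_steps=100):
    """Theorem 2 construction: an NDDE dh/dt = G(h(t - tau)) with tau = T and
    constant initial function h(t) = x, where G(z) = (F(z) - z) / T is assumed
    to be realized exactly by the network."""
    G = lambda z: (F(z) - z) / T
    dt = T / n_steps
    h = x.astype(float).copy()
    for _ in range(n_steps):
        # for t in [0, T], t - tau <= 0, so the delayed state is the history x
        h = h + G(x) * dt
    return h

F = lambda z: np.array([z[1] ** 2, -z[0]])   # any continuous target map
x = np.array([2.0, 3.0])
print(nddes_represent(F, x))                  # ≈ F(x) = [9., -2.]
```

Since the vector field is constant along the solution, the result is exact up to floating-point rounding, independent of the step count.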

B EXPERIMENTAL DETAILS

B.1 SYNTHETIC DATASETS

For the classification of the dataset of concentric spheres, we almost follow the experimental configurations used in (Dupont et al., 2019). The dataset contains 2000 data points in the outer annulus and 1000 data points in the inner sphere. We choose r_1 = 0.5, r_2 = 1.0, and r_3 = 1.5. We solve the classification problem using the NODE and the NDDE, whose structures are described, respectively, as:
1. Structure of the NODE: the vector field is modeled as f(x) = W_out ReLU(W ReLU(W_in x)), where W_in ∈ R^{32×2}, W ∈ R^{32×32}, and W_out ∈ R^{2×32};
2. Structure of the NDDE: the vector field is modeled as f(x(t-τ)) = W_out ReLU(W ReLU(W_in x(t-τ))), where W_in ∈ R^{32×2}, W ∈ R^{32×32}, and W_out ∈ R^{2×32}.
For each model, we choose the Adam optimizer with 1e-3 as the learning rate, 64 as the batch size, and 5 as the number of training epochs. We also utilize the synthetic datasets to address the regression problem of time series. The optimizer for these datasets is chosen as Adam with 0.01 as the learning rate and the mean absolute error (MAE) as the loss function. In this article, we choose the true time delays of the underlying systems for the NDDEs. We describe the structures of the models for the synthetic datasets as follows. For the DDE ẋ = A tanh(x(t) + x(t-τ)), we generate a trajectory by choosing its parameters as τ = 0.5, A = [[-1, 1], [-1, -1]], x(t) = [0, 1] for t ≤ 0, the final time T = 2.5, and the sampling time period equal to 0.1. We model it using the NODE and the NDDE, whose structures are illustrated, respectively, as:
1. Structure of the NODE: the vector field is modeled as f(x) = W_out tanh(W_in x), where W_in ∈ R^{10×2} and W_out ∈ R^{2×10};
2. Structure of the NDDE: the vector field is modeled as f(x(t), x(t-τ)) = W_out tanh(W_in(x(t) + x(t-τ))), where W_in ∈ R^{10×2} and W_out ∈ R^{2×10}.
We use the whole trajectory to train the models with the total number of iterations equal to 5000.
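As a concrete illustration, a trajectory of the 2-D DDE above can be generated with a simple history-buffer Euler integrator (our own sketch under the stated parameters; the function name and the internal step size dt are illustrative choices, not the paper's code):

```python
import numpy as np

def generate_dde_trajectory(tau=0.5, T=2.5, dt_sample=0.1, dt=0.001):
    """Generate a trajectory of dx/dt = A tanh(x(t) + x(t - tau)) with
    A = [[-1, 1], [-1, -1]] and x(t) = [0, 1] for t <= 0 (method of steps)."""
    A = np.array([[-1.0, 1.0], [-1.0, -1.0]])
    x0 = np.array([0.0, 1.0])
    n_hist = int(round(tau / dt))
    n_steps = int(round(T / dt))
    xs = np.empty((n_hist + n_steps + 1, 2))
    xs[: n_hist + 1] = x0                       # constant initial function
    for k in range(n_steps):
        x_now = xs[n_hist + k]
        x_del = xs[k]                            # x(t - tau)
        xs[n_hist + k + 1] = x_now + A @ np.tanh(x_now + x_del) * dt
    stride = int(round(dt_sample / dt))
    return xs[n_hist::stride]                    # samples at t = 0, 0.1, ..., 2.5

traj = generate_dde_trajectory()
print(traj.shape)   # (26, 2): 26 samples over [0, 2.5]
```

The resulting samples play the role of the training trajectory used for the regression experiments.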
For the population dynamics ẋ = r x(t)(1 - x(t - τ_1)) with x(t) = x_0 for t ≤ 0 and the Mackey-Glass system ẋ = β x(t - τ_2)/(1 + x^n(t - τ_2)) - γ x(t) with x(t) = x_0 for t ≤ 0, we choose the parameters as τ_1 = 1, τ_2 = 1, r = 1.8, β = 4, n = 9.65, and γ = 2. We then generate 100 different trajectories within the time interval [0, 8] with 0.05 as the sampling time period for both the population dynamics and the Mackey-Glass system. We split the trajectories into two parts: the part within the time interval [0, 3] is used as the training data, and the remaining part is used for testing. We model both systems using the NODE, the NDDE, and the ANODE, whose structures are, respectively:

1. Structure of the NODE: the vector field is modeled as f(x) = W_out tanh(W tanh(W_in x)), where W_in ∈ R^{10×1}, W ∈ R^{10×10}, and W_out ∈ R^{1×10};
2. Structure of the NDDE: the vector field is modeled as f(x(t), x(t - τ)) = W_out tanh(W tanh(W_in concat(x(t), x(t - τ)))), where W_in ∈ R^{10×2}, W ∈ R^{10×10}, and W_out ∈ R^{1×10};
3. Structure of the ANODE: the vector field is modeled as f(x_aug(t)) = W_out tanh(W tanh(W_in x_aug)), where W_in ∈ R^{10×2}, W ∈ R^{10×10}, W_out ∈ R^{2×10}, and the augmented dimension is equal to 1.

Notably, we need to align the augmented trajectories of the ANODE with the target trajectories to be regressed. To do so, we keep the data in the first component and exclude the data in the augmented component, i.e., we simply project the augmented data into the space of the target data. We train each model for a total of 3000 iterations.
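A Mackey-Glass trajectory with the constant initial function x(t) = x_0 for t ≤ 0 can be sketched the same way (this is not the authors' data-generation code; explicit Euler at the 0.05 sampling period is an assumption, and the population model is generated analogously by swapping the right-hand side):

```python
def mackey_glass(x0, beta=4.0, n=9.65, gamma=2.0, tau=1.0, T=8.0, h=0.05):
    """Euler integration of x'(t) = beta*x(t-tau)/(1 + x(t-tau)^n) - gamma*x(t)."""
    lag = int(round(tau / h))          # delay expressed in steps
    xs = [x0]
    for k in range(int(round(T / h))):
        x_del = xs[k - lag] if k >= lag else x0   # constant initial function
        dx = beta * x_del / (1.0 + x_del ** n) - gamma * xs[-1]
        xs.append(xs[-1] + h * dx)
    return xs

series = mackey_glass(x0=0.5)
# split at t = 3 as in the paper: [0, 3] for training, the rest for testing
train, test = series[:61], series[61:]
print(len(series), len(train))  # 161 samples on [0, 8]; 61 training samples
```

Repeating this for 100 random initial values x_0 yields the 100 trajectories used for training and testing.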

B.2 IMAGE DATASETS

The image experiments shown in this article mainly depend on the code provided in (Dupont et al., 2019). Also, the structure of each model almost follows the structures proposed in (Dupont et al., 2019). More precisely, we apply the convolutional block with the following structure and dimensions:

• 1 × 1 conv, k filters, 0 padding,
• ReLU activation function,
• 3 × 3 conv, k filters, 1 padding,
• ReLU activation function,
• 1 × 1 conv, c filters, 0 padding,

where k is different for each model and c is the number of channels of the input images. In Tab. 2, we specify the information for each model. As can be seen, we fix approximately the same number of parameters for each model. We select the hyperparameters for the Adam optimizer as 1e-3 for the learning rate, 256 for the batch size, and 30 for the number of training epochs.
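With stride-1 convolutions, the padding choices above make every layer of the block preserve the spatial size, so only the channel count changes: c → k (1×1, pad 0) → k (3×3, pad 1) → c (1×1, pad 0). A small sanity check of this, using the standard output-size formula (the value k = 92 below is purely illustrative, not one of the paper's settings):

```python
def conv_out(size, kernel, padding, stride=1):
    """Standard convolution output size: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def block_shapes(c, h, w, k):
    """Trace (channels, height, width) through the 1x1 -> 3x3 -> 1x1 block."""
    shapes = [(c, h, w)]
    for (kernel, padding, out_ch) in [(1, 0, k), (3, 1, k), (1, 0, c)]:
        h = conv_out(h, kernel, padding)
        w = conv_out(w, kernel, padding)
        shapes.append((out_ch, h, w))
    return shapes

# e.g. a CIFAR10 input of shape (3, 32, 32) with a hypothetical k = 92
print(block_shapes(3, 32, 32, k=92))
```

The output shape equals the input shape, which is what allows the block to serve as the vector field of a NODE/NDDE, whose state must keep a fixed dimension.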



Figure 2: (Right) Two continuous trajectories generated by the DDEs intersect, mapping -1 (resp., 1) to 1 (resp., -1), while (Left) the ODEs cannot represent such a mapping.

Figure 3: (Left) The data at the time t = 0 and (Right) the transformed data at a sufficiently large final time T of the DDEs (6). Here, the transformed data are linearly separable.

In Fig. 3, we show, respectively, the original data of g(x) for the dimension d = 2 and the data transformed by the DDEs (6) with the parameters r_1 = 1, r_2 = 2, r_3 = 3, T = 10, and τ = 10. Clearly, the data transformed by the DDEs are linearly separable. We also train the NDDEs to represent g(x) for d = 2; the evolutions during the training procedure are shown in Fig. 4, which also includes the evolution of the NODEs. Notably, while the NODEs struggle to break apart the annulus, the NDDEs easily separate the two clusters. The training losses and the flows of the NODEs and the NDDEs are depicted, respectively, in Fig. 5. In particular, the NDDEs achieve lower losses at a faster speed and directly separate the two clusters in the original 2-D space, whereas the NODEs achieve the separation only by increasing the dimension of the data and separating them in a higher-dimensional space.

Figure 5: Presented are the training losses (a) of the NODEs and the NDDEs on fitting the function g(x) for d = 2. Also presented are the flows, from the data at the initial time point (b), of the NODEs (d) and the NDDEs (c) after training. The flow of the NODEs is generated by the code provided in (Dupont et al., 2019).

Figure 6: Comparison of the NDDEs (top) versus the NODEs (bottom) in fitting a 2-D time series with the delay effect in the original system. From left to right: the true and the fitted time series, the true and the fitted trajectories in the phase space, the dynamics of the parameters of the neural networks, and the dynamics of the losses during the training process.

Figure 8: The training loss (left column), the test loss (middle column), and the accuracy (right column) over 5 realizations for the three image datasets, i.e., CIFAR10 (top row), MNIST (middle row), and SVHN (bottom row).

Additionally, since the NODEs have been generalized to the areas of Computer Vision and Natural Language Processing He et al. (2019); Yang et al. (2019); Liu et al. (

The proof is completed.

A.4 COMPLEXITY ANALYSIS OF ALGORITHM 1

As shown in (Chen et al., 2018), the memory and time costs for solving the NODEs are O(1) and O(L), respectively, where L is the number of function evaluations in the time interval [0, T]. More precisely, solving the NODEs is memory efficient because no intermediate states at the evaluated time points need to be stored.
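This trade-off can be made concrete with a toy fixed-step forward solver that keeps only the current state and counts vector-field evaluations (an illustrative sketch, not Algorithm 1 itself, which additionally integrates the adjoint backward in time):

```python
def solve_forward(f, x0, T, h):
    """Fixed-step Euler solve of x'(t) = f(t, x), keeping only the current state."""
    x, t, n_evals = x0, 0.0, 0
    while t < T - 1e-12:
        x = x + h * f(t, x)   # only the current state is stored: O(1) memory
        t += h
        n_evals += 1          # one evaluation per step: O(L) time overall
    return x, n_evals

# e.g. x' = -x on [0, 1] with step 0.1: L = 10 evaluations, one state in memory
x_T, L = solve_forward(lambda t, x: -x, x0=1.0, T=1.0, h=0.1)
print(L)
```

For the NDDEs the picture changes: the delayed term x(t - τ) forces the solver to retain the states over a sliding window of length τ, so the O(1) memory bound of the NODEs no longer applies verbatim.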

Figure 9: The phase portraits of the population dynamics in the training and the test stages. We show only 10 of the total 100 phase portraits for the training and the testing time series. Here, the solid lines correspond to the true dynamics, while the dashed lines correspond to the trajectories generated by the trained models in the training duration and in the test duration, respectively.

Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4541-4550, 2019.

Table 2: The number of filters and the total number of parameters in each model used for CIFAR10, MNIST, and SVHN. In the first column, A_p with p = 1, 2, 4 indicates the augmentation of the image space R^{c×h×w} → R^{(c+p)×h×w}.

ACKNOWLEDGMENTS

WL was supported by the National Key R&D Program of China (Grant No. 2018YFC0116600), the National Natural Science Foundation of China (Grant No. 11925103), and the STCSM (Grant Nos. 18DZ1201000, 19511101404, and 19511132000).

AUTHOR CONTRIBUTIONS

WL conceived the idea, QZ, YG & WL performed the research, QZ & YG performed the mathematical arguments, QZ analyzed the data, and QZ & WL wrote the paper.

