NEURAL DELAY DIFFERENTIAL EQUATIONS

Abstract

Neural Ordinary Differential Equations (NODEs), a framework of continuous-depth neural networks, have been widely applied, showing exceptional efficacy on several representative datasets. Recently, an augmented framework was developed to overcome some limitations that emerged in applications of the original framework. Here we propose a new class of continuous-depth neural networks with delay, named Neural Delay Differential Equations (NDDEs), and, to compute the corresponding gradients, we use the adjoint sensitivity method, which yields adjoint dynamics that are themselves delayed. Since differential equations with delays are usually regarded as infinite-dimensional dynamical systems possessing richer dynamics, the NDDEs, compared to the NODEs, have a stronger capacity for nonlinear representation. Indeed, we analytically validate that the NDDEs are universal approximators, and we further articulate an extension of the NDDEs in which the initial function of the NDDEs is assumed to satisfy ODEs. More importantly, we use several illustrative examples to demonstrate the outstanding capacities of the NDDEs and of the NDDEs with ODE initial functions. Specifically, (1) we successfully model delayed dynamics in which the trajectories in the lower-dimensional phase space intersect one another, whereas the traditional NODEs without any augmentation are not directly applicable to such modeling, and (2) we achieve lower loss and higher accuracy not only on data produced synthetically by complex models but also on real-world image datasets, i.e., CIFAR10, MNIST, and SVHN. Our results on the NDDEs reveal that appropriately incorporating the elements of dynamical systems into network design is truly beneficial to network performance.

1. INTRODUCTION

A series of recent works have revealed a close connection between neural networks and dynamical systems (E, 2017; Li et al., 2017; Haber & Ruthotto, 2017; Chang et al., 2017; Li & Hao, 2018; Lu et al., 2018; E et al., 2019; Chang et al., 2019; Ruthotto & Haber, 2019; Zhang et al., 2019a; Pathak et al., 2018; Fang et al., 2018; Zhu et al., 2019; Tang et al., 2020). On the one hand, deep neural networks can be used to solve ordinary/partial differential equations that cannot be easily computed by traditional algorithms. On the other hand, the elements of dynamical systems can be useful for establishing novel and efficient frameworks of neural networks. A typical example is the Neural Ordinary Differential Equations (NODEs), where the infinitesimal time of ordinary differential equations is regarded as the "depth" of the neural network (Chen et al., 2018). Though the advantages of the NODEs were demonstrated through modeling continuous-time datasets and continuous normalizing flows with constant memory cost (Chen et al., 2018), their limited capability of representing certain functions was also identified (Dupont et al., 2019). Indeed, the NODEs cannot be directly used to describe dynamical systems whose trajectories in the lower-dimensional phase space intersect one another. Nor can the NODEs model only a few observed variables of physical or/and physiological systems in which the effect of time delay is inevitably present. From the viewpoint of dynamical systems theory, all these limitations are attributable to the finite-dimensional nature of the NODEs. In this article, we propose a novel framework of continuous-depth neural networks with delay, named Neural Delay Differential Equations (NDDEs). We apply the adjoint sensitivity method to compute the corresponding gradients, where the resulting adjoint systems are also in the form of delay differential equations.
The main virtues of the NDDEs include:
• feasible and computable algorithms for computing the gradients of the loss function based on the adjoint systems,
• the capability of representing vector fields whose trajectories intersect in the lower-dimensional phase space, and
• accurate reconstruction of complex dynamical systems with time-delay effects from observed time-series data.
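To make the delay mechanism concrete, the following is a minimal numerical sketch of an NDDE forward pass: a fixed-step Euler scheme with a history buffer integrates $dh/dt = f(h(t), h(t-\tau))$ under a constant initial function. All names here (`nddes_forward`, `tau`, the example vector field) are illustrative and not from the paper; an actual implementation would use an adaptive DDE solver.

```python
# Toy forward pass of a delay differential equation, the continuous-depth
# view behind an NDDE. A fixed-step Euler scheme with a history buffer
# stands in for the adaptive solvers a real implementation would use.

def nddes_forward(f, h0, tau, T, dt):
    """Integrate dh/dt = f(h(t), h(t - tau)) with constant initial
    function h(t) = h0 for t <= 0, and return h(T)."""
    n_delay = round(tau / dt)          # delay expressed in Euler steps
    n_steps = round(T / dt)
    traj = [h0] * (n_delay + 1)        # history buffer: h on [-tau, 0]
    for _ in range(n_steps):
        h_now = traj[-1]
        h_lag = traj[-1 - n_delay]     # h(t - tau), looked up from buffer
        traj.append(h_now + dt * f(h_now, h_lag))
    return traj[-1]

# Example vector field with an explicit delay term (illustrative only).
f = lambda h, h_lag: -0.5 * h_lag
h_T = nddes_forward(f, h0=1.0, tau=1.0, T=2.0, dt=0.01)
```

Note that the state at time $t$ depends on the whole segment $h(\cdot)$ on $[t-\tau, t]$, not on a single point; this is the infinite-dimensionality that distinguishes DDEs from ODEs.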

2. RELATED WORKS

NODEs. Inspired by the residual neural networks (He et al., 2016) and other analogous frameworks, the NODEs were established; a residual network can be represented by multiple residual blocks as $h_{t+1} = h_t + f(h_t; w_t)$, where $h_t$ is the hidden state of the $t$-th layer, $f(h_t; w_t)$ is a differentiable function preserving the dimension of $h_t$, and $w_t$ is the parameter to be learned. The evolution of $h_t$ can be viewed as a special case of the equation $h_{t+\Delta t} = h_t + \Delta t \cdot f(h_t; w_t)$ with $\Delta t = 1$. As suggested in (Chen et al., 2018), all the parameters $w_t$ are unified into a single $w$ for parameter efficiency of the NODEs. This unified operation was also employed in other neural networks, such as the recurrent neural networks (RNNs) (Rumelhart et al., 1986; Elman, 1990) and ALBERT (Lan et al., 2019). Letting $\Delta t \to 0$ and using the unified parameter $w$ instead of $w_t$, we obtain the continuous evolution of the hidden state $h_t$ as
$$\lim_{\Delta t \to 0} \frac{h_{t+\Delta t} - h_t}{\Delta t} = \frac{dh_t}{dt} = f(h_t, t; w),$$
which is in the form of an ordinary differential equation. Actually, the NODEs can act as a feature extractor, mapping an input to a point in the feature space by computing the forward path of a NODE:
$$h(T) = h(0) + \int_0^T f(h_t, t; w)\,dt, \quad h(0) = \text{input},$$
where $h(0)$ is the original data point or its transformation, and $T$ is the integration time (assuming that the system starts at $t = 0$). Under a predefined loss function $L(h(T))$, (Chen et al., 2018) employed the adjoint sensitivity method to compute memory-efficient gradients of the parameters along with the ODE solvers. More precisely, they defined the adjoint variable $\lambda(t) = \partial L(h(T))/\partial h(t)$, whose dynamics is another ODE, i.e.,
$$\frac{d\lambda(t)}{dt} = -\lambda(t)\frac{\partial f(h_t, t; w)}{\partial h_t}.$$
The gradients are then computed by the integral
$$\frac{dL}{dw} = \int_T^0 -\lambda(t)\frac{\partial f(h_t, t; w)}{\partial w}\,dt.$$
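The adjoint computation above can be checked numerically on a scalar toy problem. The sketch below, with illustrative names and a fixed-step Euler scheme (Chen et al. (2018) use adaptive solvers on a concatenated augmented system), takes $f(h; w) = wh$ and $L = h(T)$: it integrates the state forward, then integrates $\lambda$ backward from $\lambda(T) = 1$ while accumulating the gradient integral, and compares the result with a central finite difference through the same solver.

```python
import math

# Minimal numerical sketch of the adjoint sensitivity method for the
# scalar NODE dh/dt = f(h; w) = w*h with loss L = h(T).
# Names and the fixed-step Euler scheme are illustrative.

def forward(w, h0, T, n):
    """Euler-integrate dh/dt = w*h from 0 to T; return the trajectory."""
    dt = T / n
    traj = [h0]
    for _ in range(n):
        traj.append(traj[-1] + dt * w * traj[-1])
    return traj

def adjoint_grad(w, h0, T, n):
    """Gradient dL/dw via the adjoint: integrate lambda backward from
    lambda(T) = dL/dh(T) = 1, accumulating lambda(t) * df/dw along the way."""
    dt = T / n
    traj = forward(w, h0, T, n)
    lam, grad = 1.0, 0.0
    for k in range(n, 0, -1):
        grad += dt * lam * traj[k]   # df/dw = h, so accumulate lambda * h
        lam += dt * lam * w          # backward Euler step of dlam/dt = -lam*w
    return grad

# Compare with a central finite difference through the same Euler solver;
# both approximate the analytic value T * h0 * exp(w*T).
w, h0, T, n = 0.5, 1.0, 1.0, 1000
g_adj = adjoint_grad(w, h0, T, n)
eps = 1e-6
g_fd = (forward(w + eps, h0, T, n)[-1]
        - forward(w - eps, h0, T, n)[-1]) / (2 * eps)
```

Unlike backpropagating through the solver's stored intermediate states, the adjoint approach only needs the state trajectory to evaluate $\partial f/\partial h$ and $\partial f/\partial w$, which is what makes it memory-efficient.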
(Chen et al., 2018) calculated the gradients by calling an ODE solver on an extended ODE (i.e., concatenating the original state, the adjoint, and the other partial derivatives with respect to the parameters at each time point into a single vector). Notably, for time-series regression tasks, the loss function may depend on the states at multiple observational times, such as the form of

