OSCILLATION NEURAL ORDINARY DIFFERENTIAL EQUATIONS

Abstract

Neural ordinary differential equations (NODEs) have received much attention in recent years due to their memory efficiency. Unlike traditional deep learning, NODEs define a continuous deep learning architecture based on the theory of ordinary differential equations (ODEs), which also improves the interpretability of deep learning. However, NODEs have several notable limitations: a NODE is not a universal approximator, it requires a large number of function evaluations (NFEs), and it converges slowly. We address these drawbacks by modeling an oscillator and adding it to the NODE framework. The oscillator enables the trajectories of our model to cross each other. We prove that our model is a universal approximator, even in the original input space. Because of the oscillator, the flows learned by the model are simpler, so our model needs fewer NFEs and converges faster. We apply our model to various tasks, including classification and time series extrapolation, and compare several metrics, including accuracy, NFEs, and convergence speed. The experiments show that our model achieves better results than the existing baselines.

1. INTRODUCTION

Neural Ordinary Differential Equations (NODEs) (Chen et al., 2018) are continuous deep learning architectures whose ideas were first developed in the context of continuous recurrent networks (Cohen & Grossberg, 1983). This continuous architecture provides a new perspective that theoretically bridges the gap between deep learning and dynamical systems. It can be trained efficiently with backpropagation and has shown great promise on several tasks, including modeling continuous-time data, classification, and building normalizing flows. The core idea of a NODE is to use a neural network to parameterize the vector field (Chen et al., 2018; Kidger, 2022). Typically, a simple neural network suffices to represent the vector field, which is optimized during training. Trajectories are then obtained from the learned vector field and serve as the estimated functions.

However, this architecture has several limitations. First, NODEs cannot learn any crossover mapping functions (Dupont et al., 2019), which means they are not universal approximators. Second, optimizing the vector field requires many function evaluations during both the forward evaluation and the backpropagation stages of training. Third, training converges relatively slowly. The first limitation stems from the continuity of vector-field-based trajectories: trajectories in NODEs cannot cross each other at the same time (Massaroli et al., 2020; Norcliffe et al., 2020). This property makes NODEs powerless against certain topologies, such as the concentric circles and intersecting lines discussed by Dupont et al. (2019). We conjecture that the second limitation is caused by the straightforward optimization of the vector field: there is no guarantee that learning the vector field is easier than learning the estimated functions directly, and optimizing it can require many function evaluations, so the difficulty goes beyond that of learning the estimated functions themselves. The third limitation arises from the accuracy-speed trade-off of the ordinary differential equation solver (ODE solver). NODEs perform forward evaluation and backpropagation via ODE solvers, which can be treated as black boxes; ensuring the accuracy of an ODE solver comes at the cost of speed.

To address the first problem, we add discrete elements to the process of learning the trajectories. A discrete element is realized as a "jump" in the trajectory. This jump allows the trajectories of the NODE to cross at the same time and resolves the cases mentioned above. We also prove in Section 4.1 that our approach is a universal approximator. To address the second and third problems, we introduce a function g that directly optimizes the trajectories while the vector field is optimized at the same time. This shifts part of the burden from optimizing the vector field to optimizing the original estimated function. With g, the final optimized vector field is less complicated, so a "simple" vector field can estimate the function well; this yields fewer NFEs, less training time, and faster convergence. Based on these two points, we design an oscillator to enhance neural ordinary differential equations, yielding Oscillation Neural Ordinary Differential Equations (ONODE).
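To make the baseline setup concrete, the following is a minimal sketch of a plain NODE, in which a small neural network parameterizes the vector field and an ODE solver produces the trajectory. It assumes PyTorch and the torchdiffeq package, which are illustrative choices rather than requirements of the method; the names VectorField and hidden_dim are ours.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class VectorField(nn.Module):
    """Small MLP f_theta(t, z) that parameterizes the vector field dz/dt."""

    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, t, z):
        # torchdiffeq passes (t, z); this simple autonomous field ignores t.
        return self.net(z)


f = VectorField(dim=2)
z0 = torch.randn(16, 2)              # a batch of initial states z(0)
t = torch.linspace(0.0, 1.0, 2)      # integrate from t = 0 to t = 1
z1 = odeint(f, z0, t)[-1]            # z(1); gradients flow back through the solve
```

Because z1 is produced by integrating a continuous field, trajectories from this model cannot cross at the same time, which is exactly the limitation the oscillator is designed to remove.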
The architecture is shown in Figure 1. The oscillator is designed both to realize the "jumping" of trajectories and to optimize the estimated function while the vector field is optimized simultaneously. This design solves the three problems mentioned above to some extent: our proposed method is not only a universal approximator, but it also reduces the NFEs and improves the training convergence speed. Specifically, the vector field is parameterized by a simple neural network, the same as in the NODEs proposed by Chen et al. (2018). The vector field can be optimized with standard ODE solvers, such as the Runge-Kutta methods (Runge, 1895; Kutta, 1901) and the solvers described by Hairer et al. (1993). The oscillator we design is parameterized by a shallow neural network with only one hidden layer. The oscillator can be located before the vector field modeling or after the ODE solver. Its structure may vary slightly across tasks: in the usual classification and time-series extrapolation tasks, it is a perceptron with one hidden layer, while in the image classification task it consists of two convolutional layers with an activation function. It is worth noting that since our oscillator is parameterized by a shallow neural network that takes little space, the model can still be considered memory efficient; we illustrate this in Section 5. Moreover, the oscillator simplifies the learning complexity of the vector field, which leads to a reduction in NFEs and faster convergence. A simple example is shown on the right of Figure 1: the final trained vector field is a very simple one.
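To make the placement concrete, below is a hedged sketch of an ONODE with the oscillator located after the ODE solver (the alternative placement, before the vector field modeling, is analogous). It reuses the PyTorch/torchdiffeq assumptions from the sketch above; the Oscillator here is a generic one-hidden-layer network standing in for the oscillator formulation specified later in the paper, not our exact implementation.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class VectorField(nn.Module):
    """MLP f_theta(t, z) parameterizing dz/dt, as in the previous sketch."""

    def __init__(self, dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, t, z):
        return self.net(z)


class Oscillator(nn.Module):
    """Shallow network g with one hidden layer, so its memory cost is small."""

    def __init__(self, dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, z):
        return self.net(z)


class ONODE(nn.Module):
    """Continuous flow followed by the oscillator applied to the solver output."""

    def __init__(self, dim):
        super().__init__()
        self.f = VectorField(dim)    # optimized through the ODE solver
        self.g = Oscillator(dim)     # directly reshapes the trajectory endpoint

    def forward(self, z0):
        t = torch.linspace(0.0, 1.0, 2)
        z1 = odeint(self.f, z0, t)[-1]   # continuous (non-crossing) trajectories
        return self.g(z1)                # discrete "jump": outputs may now cross


model = ONODE(dim=2)
out = model(torch.randn(16, 2))          # trained end to end with backpropagation
```

Since g carries part of the fit directly, the field f can remain simple, which is the mechanism behind the reduced NFEs and faster convergence claimed above.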

2. RELATED WORK

The basic ideas of neural ordinary differential equations were originally considered in Rico-Martinez et al. (1992), Rico-Martinez & Kevrekidis (1993), and Rico-Martinez et al. (1994). For example, Rico-Martinez et al. (1992) proposed a vector field that can be trained using a Multilayer Perceptron (MLP), and Rico-Martinez & Kevrekidis (1993) used an implicit integrator and recurrent networks for continuous-time modeling of nonlinear systems. Chen et al. (2018) specified the architecture of NODEs and applied it to several tasks with good results, including image classification, continuous normalizing flows, and latent ODEs, leading to an explosion of interest in NODEs.



Figure 1: The diagram on the left shows the structure of our proposed model; the diagram on the right shows, by example, how the oscillator helps our model learn crossing trajectories.

