DISE: DYNAMIC INTEGRATOR SELECTION TO MINIMIZE FORWARD-PASS TIME IN NEURAL ODES

Abstract

Neural ordinary differential equations (Neural ODEs) are appreciated for their ability to significantly reduce the number of parameters when constructing a neural network. On the other hand, they are sometimes criticized for their long forward-pass inference time, which is incurred by solving integral problems. To improve model accuracy, they rely on advanced solvers, such as the Dormand-Prince (DOPRI) method. Solving an integral problem, however, requires at least tens (or sometimes thousands) of steps in many Neural ODE experiments. In this work, we propose to i) directly regularize the step size of DOPRI to make the forward pass faster and ii) dynamically choose a simpler integrator than DOPRI for a carefully selected subset of inputs. Because not every input requires the advanced integrator, we design an auxiliary neural network that chooses an appropriate integrator for each input to decrease the overall inference time without significantly sacrificing accuracy. We consider the Euler method, the fourth-order Runge-Kutta (RK4) method, and DOPRI as selection candidates. In our experiments, 10-30% of cases can be solved with the simpler integrators. As a result, the overall number of function evaluations (NFE) decreases by up to 78% with improved accuracy.

1. INTRODUCTION

Neural ordinary differential equations (Neural ODEs) learn time-dependent physical dynamics that describe continuous residual networks (Chen et al., 2018). It is well known that residual connections are numerically similar to the explicit Euler method, the simplest integrator for solving ODEs. In this regard, Neural ODEs are considered a generalization of residual networks. It is generally agreed that Neural ODEs have two advantages and one disadvantage: i) Neural ODEs can sometimes reduce the required number of neural network parameters, e.g., (Pinckaers & Litjens, 2019); ii) Neural ODEs interpret the neural network layer (or time) as a continuous variable, so a hidden vector at an arbitrary layer can be calculated; iii) however, Neural ODEs' forward-pass inference can sometimes be numerically unstable (i.e., the underflow error of DOPRI's adaptive step size) and/or slow to solve an integral problem (i.e., too many steps in DOPRI) (Zhuang et al., 2020b; Finlay et al., 2020; Daulbaev et al., 2020; Quaglino et al., 2020). Much work has been devoted to addressing the numerically unstable nature of solving integral problems. In this work, however, we are interested in the problem of long forward-pass inference time. To overcome this challenge, we i) directly regularize the numerical errors of the Dormand-Prince (DOPRI) method (Dormand & Prince, 1980), which means we try to learn an ODE that can be quickly solved by DOPRI, and ii) dynamically select an appropriate integrator for each sample rather than relying on only one integrator. In many cases, Neural ODEs use DOPRI, one of the most advanced adaptive-step integrators, for its accuracy. Our method, however, allows us to rely on simpler integrators, such as the Euler method or the fourth-order Runge-Kutta (RK4) method (Ixaru & Vanden Berghe, 2004), for carefully selected inputs.
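The correspondence between a residual connection and an explicit Euler step can be illustrated with a minimal sketch. The dynamics function f below is a toy, hypothetical stand-in for the trained network, not the paper's model:

```python
import math

def f(h, t):
    # Toy dynamics dh/dt = f(h, t); a hypothetical stand-in for the
    # neural network that parameterizes the ODE in a Neural ODE.
    return -h + math.sin(t)

def euler_step(h, t, dt):
    # One explicit Euler step: h(t + dt) ≈ h(t) + dt * f(h, t).
    # With dt = 1, this is exactly a residual update h + f(h),
    # which is why residual networks resemble Euler discretizations.
    return h + dt * f(h, t)

h_next = euler_step(1.0, 0.0, 1.0)  # residual-style update with dt = 1
```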
Table 1 shows an experimental result in which our proposed regularization not only reduces the number of function evaluations (NFE), to which the inference time of Neural ODEs is linearly proportional, but also increases the inference accuracy in the MNIST classification task. We can reduce the inference time by reducing the average number of steps (and thus, the average NFE) of DOPRI, which is achieved when, under a proper regularization, the learned ODE is trained into a form that DOPRI can solve quickly. However, the NFE of DOPRI in a single step is 6, whereas RK4 requires 4 and the Euler method requires 1. Thus, the Euler method is six times faster than DOPRI even when their step sizes are identical, and the automatic step size adjustment of DOPRI alone is not enough to minimize the NFE of forward-pass inference (see Section B in Appendix for a more detailed description with a concrete example). To this end, we design an auxiliary network that chooses an appropriate integrator for each sample. The combination of our regularization and the proposed Dynamic Integrator SElection (DISE) shows the best performance in the table.

We conduct experiments for three different tasks and datasets: MNIST image classification, PhysioNet mortality prediction, and continuous normalizing flows. Our method shows the best (or close to the best) accuracy with a much smaller NFE than state-of-the-art methods. Our contributions can be summarized as follows:

1. We design an effective regularization to reduce the number of function evaluations (NFE) of Neural ODEs.
2. We design a sample-wise dynamic integrator selection (DISE) method to further accelerate Neural ODEs without significantly sacrificing model accuracy.
3. We conduct in-depth analyses with three popular tasks of Neural ODEs.
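The per-step NFE gap can be verified with a small sketch. The `count_nfe` helper and the linear toy dynamics below are hypothetical, used only to count calls to f under fixed-step Euler and RK4; DOPRI (not shown) evaluates f six times per step, as stated above:

```python
def euler(f, h, t, dt):
    # Explicit Euler: 1 evaluation of f per step.
    return h + dt * f(h, t)

def rk4(f, h, t, dt):
    # Classic fourth-order Runge-Kutta: 4 evaluations of f per step.
    k1 = f(h, t)
    k2 = f(h + dt * k1 / 2, t + dt / 2)
    k3 = f(h + dt * k2 / 2, t + dt / 2)
    k4 = f(h + dt * k3, t + dt)
    return h + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def count_nfe(stepper, n_steps):
    # Wrap a toy dynamics function to count how often it is called.
    calls = {"n": 0}
    def f(h, t):
        calls["n"] += 1
        return -h
    h, t, dt = 1.0, 0.0, 0.1
    for _ in range(n_steps):
        h = stepper(f, h, t, dt)
        t += dt
    return calls["n"]

print(count_nfe(euler, 10), count_nfe(rk4, 10))  # prints: 10 40
```

With identical step sizes, the NFE ratio is exactly the per-step evaluation count, which is why integrator choice matters independently of step-size adaptation.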

2. RELATED WORK

In this section, we review the literature on Neural ODEs. In particular, we review recent regularization designs for Neural ODEs and numerical methods to solve ODEs.

2.1. NEURAL ODES

Several researchers have attempted to model neural networks as differential equations (Weinan, 2017; Ruthotto & Haber, 2019; Lu et al., 2018; Ciccone et al., 2018; Chen et al., 2018; Gholami et al., 2019). Among them, the seminal neural ordinary differential equations (Neural ODEs), as shown in Fig. 1, consist of three parts in general: a feature extractor, an ODE, and a classifier (Chen et al., 2018; Zhuang et al., 2020a). Given an input x, the feature extractor produces an input to the ODE, denoted h(0). Let h(t) be a hidden vector at layer (or time) t in the ODE part. In Neural ODEs, a neural network f with a set of parameters, denoted θ, approximates ∂h(t)/∂t, and h(t₁) becomes h(0) + ∫_{t₀}^{t₁} f(h(t), t; θ) dt, where f(h(t), t; θ) = ∂h(t)/∂t. In other words, the internal dynamics of the hidden vector evolution are described by an ODE. One key advantage of Neural ODEs is that we can reduce the number of parameters without sacrificing model accuracy. For instance, one recent work based on a Neural ODE marked the best accuracy for medical image segmentation with an order of magnitude fewer parameters (Pinckaers & Litjens, 2019). In general, we calculate



Figure 1: The general architecture of Neural ODEs. We assume a classification task in this figure.
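The hidden-state evolution h(t₁) = h(0) + ∫_{t₀}^{t₁} f(h(t), t; θ) dt can be approximated by any fixed-step integrator. Below is a minimal RK4 sketch; the toy dynamics f (whose exact solution is h(t) = h(0)·e^{-t}) is a hypothetical stand-in for the learned network:

```python
import math

def f(h, t):
    # Hypothetical dynamics dh/dt = f(h, t; theta); here f = -h,
    # so the exact solution is h(t) = h(0) * exp(-t).
    return -h

def odeint_rk4(f, h0, t0, t1, n_steps):
    # Approximates h(t1) = h(0) + integral of f from t0 to t1
    # using n_steps fixed-size RK4 steps.
    h, t = h0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        k1 = f(h, t)
        k2 = f(h + dt * k1 / 2, t + dt / 2)
        k3 = f(h + dt * k2 / 2, t + dt / 2)
        k4 = f(h + dt * k3, t + dt)
        h += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return h

h1 = odeint_rk4(f, 1.0, 0.0, 1.0, 20)  # close to math.exp(-1)
```

In a Neural ODE, this loop is the forward pass: h0 comes from the feature extractor, and the returned h(t₁) is fed to the classifier.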



