RESNET AFTER ALL? NEURAL ODES AND THEIR NUMERICAL SOLUTION

Abstract

A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained model is supposed to be a flow generated from an ODE, it should be possible to choose another numerical solver with equal or smaller numerical error without loss of performance. We observe that if training relies on a solver with overly coarse discretization, then testing with another solver of equal or smaller numerical error results in a sharp drop in accuracy. In such cases, the combination of vector field and numerical method cannot be interpreted as a flow generated from an ODE, which arguably poses a fatal breakdown of the Neural ODE concept. We observe, however, that there exists a critical step size beyond which the training yields a valid ODE vector field. We propose a method that monitors the behavior of the ODE solver during training and adapts its step size, aiming to ensure a valid ODE without unnecessarily increasing computational cost. We verify this adaptation algorithm on a common benchmark dataset as well as a synthetic dataset.

1. INTRODUCTION

The choice of neural network architecture is an important consideration in the deep learning community. Among a plethora of options, Residual Neural Networks (ResNets) (He et al., 2016) have emerged as an important subclass of models, as they mitigate the gradient issues (Balduzzi et al., 2017) arising when training deep neural networks by adding skip connections between successive layers. Besides the architectural advancements inspired by the original scheme (Zagoruyko & Komodakis, 2016; Xie et al., 2017), Neural Ordinary Differential Equation (Neural ODE) models (Chen et al., 2018; E, 2017; Lu et al., 2018; Haber & Ruthotto, 2017) have recently been proposed as an analog of continuous-depth ResNets. While Neural ODEs do not necessarily improve upon the sheer predictive performance of ResNets, they allow the vast knowledge of ODE theory to be applied to deep learning research. For instance, Yan et al. (2020) discovered that for specific perturbations, Neural ODEs are more robust than convolutional neural networks. Moreover, inspired by the theoretical properties of the solution curves, they propose a regularizer which improves the robustness of Neural ODE models even further. However, if Neural ODEs are chosen for their theoretical advantages, it is essential that the effective model (the combination of the ODE problem and its solution via a particular numerical method) is a close approximation of the true analytical, but practically inaccessible, ODE solution.

In this work, we study the empirical risk minimization (ERM) problem

$$\mathcal{L}_{\mathcal{D}} = \frac{1}{|\mathcal{D}|} \sum_{(x,y)\in\mathcal{D}} l(f(x; w), y) \qquad (1)$$

where $\mathcal{D} = \{(x_n, y_n) \mid x_n \in \mathbb{R}^{D_x},\ y_n \in \mathbb{R}^{D_y},\ n = 1, \dots, N\}$ is a set of training data, $l : \mathbb{R}^{D_y} \times \mathbb{R}^{D_y} \to \mathbb{R}$ is a (non-negative) loss function, and $f$ is a Neural ODE model with weights $w$, i.e.,

$$f = f_d \circ \varphi_T^{f_v} \circ f_u \qquad (2)$$

where $f_x$, $x \in \{d, v, u\}$, are neural networks and $u$ and $d$ denote the upstream and downstream layers, respectively.
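To make the composition in Eq. (2) concrete, the following minimal NumPy sketch builds the model $f = f_d \circ \Psi_T \circ f_u$ with the flow approximated by a fixed-step explicit Euler scheme. All names, dimensions, and the choice of Euler are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of Eq. (2): f = f_d ∘ Ψ_T ∘ f_u, with the analytical
# flow ϕ replaced by a fixed-step Euler solver Ψ (assumed for illustration).

rng = np.random.default_rng(0)
D_x, D_h, D_y = 4, 8, 3                   # input, ODE-state, and output dims (assumed)
W_u = rng.normal(size=(D_h, D_x)) * 0.1   # upstream layer f_u
W_v = rng.normal(size=(D_h, D_h)) * 0.1   # vector field f_v
W_d = rng.normal(size=(D_y, D_h)) * 0.1   # downstream layer f_d

def f_u(x):
    # Upstream layer: lift the input into the ODE state space.
    return np.tanh(W_u @ x)

def f_v(z):
    # Vector field of the dynamical system, dz/dt = f_v(z; w_v).
    return np.tanh(W_v @ z)

def Psi_T(z0, T=1.0, h=0.1):
    # Fixed-step explicit Euler; number of steps = T / h.
    z = z0
    for _ in range(int(round(T / h))):
        z = z + h * f_v(z)
    return z

def f_d(z):
    # Downstream layer: map the final ODE state to class logits.
    return W_d @ z

def f(x, h=0.1):
    # Full model: f = f_d ∘ Ψ_T ∘ f_u.
    return f_d(Psi_T(f_u(x), h=h))

x = rng.normal(size=D_x)
print(f(x, h=0.1).shape)   # (3,)
```

Note that the step size `h` enters the model itself: a trained weight set is only meaningful together with the discretization used during training, which is precisely the issue studied in this paper.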
$\varphi$ is defined to be the (analytical) flow of the dynamical system

$$\frac{dz}{dt} = f_v(z; w_v), \qquad z(t) = \varphi_t^{f_v}(z(0)). \qquad (3)$$

As the vector field $f_v$ of the dynamical system is itself defined by a neural network, evaluating $\varphi_T^{f_v}$ is intractable and we have to resort to a numerical scheme $\Psi_t$ to compute an approximation of $\varphi_t$. $\Psi$ belongs either to a class of fixed-step methods or is an adaptive step size solver as proposed in Chen et al. (2018). For fixed-step solvers with step size $h$, one can directly compute the number of steps taken by the solver, $\#\text{steps} = T h^{-1}$. We set the final time $T = 1$ for all our experiments. The global numerical error $e_\text{train}$ of the model is the difference between the true (unknown) analytical solution of the model and the numerical solution at time $T$, $e_\text{train} = \|\varphi_T(z(0)) - \Psi_T(z(0))\|$. The global numerical error for a given problem can be controlled by adjusting either the step size or the local error tolerance.

Since numerical solvers play an essential role in approximating the solutions of an ODE, it is natural to ask: how does the choice of the numerical method affect the training of a Neural ODE model? Specifically, does the discretization of the numerical solver impact the resulting flow of the ODE? To test the effect of the numerical solver on a Neural ODE model, we first train a Neural ODE on a synthetic classification task consisting of three concentric spheres, where the outer and inner spheres correspond to the same class (for more information see Section 2.4). For this problem there are no true underlying dynamics; the model only has to find some dynamics which solve the problem. We train the Neural ODE model using a fixed-step solver with a small step size and with a large step size (see Figure 1 (a) and (b), respectively). If the model is trained with the large step size, the numerically computed trajectories for the individual Initial Value Problems (IVPs) cross in phase space (see Figure 1 (b)); specifically, trajectories of IVPs belonging to different classes cross. This crossing behavior contradicts the expected behavior of autonomous ODE solutions, since by the Picard-Lindelöf theorem the solutions of the IVPs are unique. The trajectories cross because the discretization error of the solver is so large that the resulting numerical solutions no longer retain the properties of ODE solutions. We observe that both the model trained with the small step size and the model trained with the large step size achieve very high accuracy. This leads us to conclude that the step size is not like other hyperparameters: its chosen value often does not affect the performance of the model, but it determines whether the trained model has a valid ODE interpretation.

Code: https://github.com/boschresearch/numerics_independent_neural_odes

Figure 1: The Neural ODE was trained on a classification task with a small (a) and large (b) step size. Panels (a) and (b) show the trajectories for the two solvers; the colors of the trajectories indicate the label of each IVP. Panels (c) and (d) show the test accuracy of the Neural ODE for different step sizes used at test time. One marker indicates that the number of steps used for testing matches the number used for training; the other marker indicates that they differ.
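For a Neural ODE the global error $e_\text{train}$ is inaccessible because the analytical flow $\varphi_T$ is unknown, but its dependence on the step size can be checked on a toy linear ODE with a known flow. The sketch below (our illustration, not from the paper) uses $dz/dt = -z$, whose exact solution is $\varphi_T(z_0) = z_0 e^{-T}$, with explicit Euler as the fixed-step scheme $\Psi$:

```python
import numpy as np

# Toy ODE with known flow: dz/dt = -z, so ϕ_T(z0) = z0 * exp(-T).
def Psi_T(z0, T=1.0, h=0.1):
    # Explicit Euler as the fixed-step scheme Ψ; #steps = T / h.
    z = z0
    for _ in range(int(round(T / h))):
        z = z + h * (-z)
    return z

T, z0 = 1.0, 1.0
phi_T = z0 * np.exp(-T)                   # analytical solution at time T
errs = []
for h in (0.1, 0.05, 0.025):
    e = abs(phi_T - Psi_T(z0, T, h))      # global numerical error e_train
    errs.append(e)
    print(f"h={h:<6} #steps={int(T / h):<3} error={e:.5f}")
```

Halving the step size roughly halves the error, consistent with Euler's first-order convergence; adaptive solvers instead control this error indirectly through the local error tolerance, as noted above.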


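The abstract proposes monitoring the solver during training to adapt its step size. The paper's actual criterion is specified later; as a rough illustration of the general idea only (our simplification, with hypothetical names and tolerances), one can compare the numerical solution at step size h with the one at h/2 and refine until they agree:

```python
import numpy as np

def euler(f_v, z0, T=1.0, h=0.1):
    # Fixed-step explicit Euler solve of dz/dt = f_v(z) up to time T.
    z = z0
    for _ in range(int(round(T / h))):
        z = z + h * f_v(z)
    return z

def adapt_step_size(f_v, z0, h=0.5, tol=1e-2, h_min=1e-3):
    # Halve h until the solutions at h and h/2 agree within tol
    # (a Richardson-style self-consistency check; our simplification,
    # not the authors' exact adaptation criterion).
    while h > h_min:
        d = np.linalg.norm(euler(f_v, z0, h=h) - euler(f_v, z0, h=h / 2))
        if d < tol:
            return h
        h /= 2
    return h

f_v = lambda z: -z                     # toy linear vector field
h = adapt_step_size(f_v, np.ones(2))
print(h)
```

The appeal of such a check is that it needs no access to the analytical flow: a step size whose halving no longer changes the solution is evidence that the discretization resolves the dynamics, i.e., that the trained model retains a valid ODE interpretation.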