PDE-REGULARIZED NEURAL NETWORKS FOR IMAGE CLASSIFICATION

Abstract

Neural ordinary differential equations (neural ODEs) introduced an approach that approximates a neural network as a system of ODEs by treating the layer index as a continuous variable and discretizing the hidden dimension. Despite several desirable characteristics, neural ODEs are known to be numerically unstable and slow in solving their integral problems, which leads to errors and/or heavy computation during forward-pass inference. In this work, we present a novel partial differential equation (PDE)-based approach that removes the need to solve integral problems and treats both the layer index and the hidden dimension as continuous variables. Owing to recent advances in learning PDEs, the proposed concept, called PR-Net, can be implemented. Our method shows comparable (or better) accuracy and robustness with a much shorter forward-pass inference time across various datasets and tasks, in comparison with neural ODEs and Isometric MobileNet V3. Owing to its efficient nature, PR-Net is well suited for deployment in resource-scarce environments, e.g., as a replacement for MobileNet.

1. INTRODUCTION

Several independent research groups have discovered that neural networks can be interpreted as differential equations (Weinan, 2017; Ruthotto & Haber, 2019; Lu et al., 2018; Ciccone et al., 2018; Chen et al., 2018; Gholami et al., 2019). Among them, the seminal neural ordinary differential equation (neural ODE) work, which considers the general architecture in Figure 1(a), learns a neural network approximating $\partial h(t)/\partial t$, where $h(t)$ is a hidden vector at layer (or time) $t$ (Chen et al., 2018). As such, a neural network is described by a system of ODEs, each of which describes the dynamics of one hidden element. While neural ODEs have many good characteristics, they also have limitations, which are listed as follows:

Pros. Neural ODEs interpret $t$ as a continuous variable, so we can obtain the hidden vector at any layer (or time) $l$ by $h(l) = h(0) + \int_0^l o(h(t), t; \theta_o)\, dt$, where $o(h(t), t; \theta_o) = \partial h(t)/\partial t$ is a neural network parameterized by $\theta_o$.

Pros. Neural ODEs sometimes have fewer parameters than other conventional neural network designs, e.g., (Pinckaers & Litjens, 2019).

Cons. Neural ODEs with an adaptive step-size ODE solver sometimes exhibit numerical instability (i.e., an underflow error of the step size), or their forward-pass inference can take a long time (i.e., too many steps) in solving integral problems, e.g., a forward-pass time of 37.6 seconds for ODE-Net vs. 9.8 seconds for PR-Net in Table 2. Several countermeasures have been proposed, but solving integral problems remains unavoidable (Zhuang et al., 2020; Finlay et al., 2020; Daulbaev et al., 2020).

To tackle this limitation, we propose the concept of the partial differential equation-regularized neural network (PR-Net), which directly learns a hidden element, denoted $h(d, t)$, at layer (or time) $t \in [0, T]$ and dimension $d \in \mathbb{R}^m$.
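The forward pass of a neural ODE above can be sketched with a fixed-step Euler discretization of $h(l) = h(0) + \int_0^l o(h(t), t; \theta_o)\, dt$. This is a minimal illustration, not the adaptive solver used by ODE-Net: the dynamics function `o` below is a hypothetical fixed linear map standing in for a trained network, and `odeint_euler` is an illustrative name, not an API from the paper.

```python
import numpy as np

def o(h, t):
    """Hypothetical stand-in for the learned dynamics o(h(t), t; theta_o).

    In a real neural ODE this is a trained network approximating dh/dt;
    here a skew-symmetric linear map (rotation dynamics) is used so the
    true solution is known in closed form.
    """
    A = np.array([[0.0, -1.0],
                  [1.0,  0.0]])
    return A @ h

def odeint_euler(h0, t_end, n_steps=1000):
    """Fixed-step Euler approximation of h(l) = h(0) + int_0^l o(h, t) dt."""
    h = np.asarray(h0, dtype=float)
    t, dt = 0.0, t_end / n_steps
    for _ in range(n_steps):
        h = h + dt * o(h, t)  # one explicit Euler step
        t += dt
    return h

# Rotating [1, 0] by an angle of pi should give approximately [-1, 0].
h1 = odeint_euler(np.array([1.0, 0.0]), np.pi)
```

The number of steps is where the cost/accuracy trade-off discussed above appears: an adaptive solver chooses `dt` automatically, which can underflow (instability) or require very many steps (slow inference), motivating the integral-free formulation of PR-Net.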
In general, a PDE consists of i) an initial condition at $t = 0$, ii) a boundary condition at the boundary of the spatial domain $\mathbb{R}^m$, and iii) a governing equation describing $\partial h(d, t)/\partial t$. As such, learning a PDE from data reduces to a regression-like problem: predicting $h(d, t)$ such that it satisfies its initial/boundary conditions and governing equation. In training our proposed PR-Net, $h(0)$ is provided by an earlier feature-extraction layer, as in neural ODEs. However, an appropriate governing equation is unknown for downstream

