PDE-REGULARIZED NEURAL NETWORKS FOR IMAGE CLASSIFICATION

Abstract

Neural ordinary differential equations (neural ODEs) introduced an approach that approximates a neural network as a system of ODEs by treating its layer as a continuous variable and discretizing its hidden dimension. While they have several desirable characteristics, neural ODEs are known to be numerically unstable and slow when solving their integral problems, which leads to errors and/or heavy computation during forward-pass inference. In this work, we present a novel partial differential equation (PDE)-based approach that removes the need to solve integral problems and treats both the layer and the hidden dimension as continuous variables. Owing to recent advances in learning PDEs, the presented novel concept, called PR-Net, can be implemented. Our method shows comparable (or better) accuracy and robustness with much shorter forward-pass inference time than neural ODEs and Isometric MobileNet V3 across various datasets and tasks. Given its efficient nature, PR-Net is well suited for deployment in resource-scarce environments, e.g., as a replacement for MobileNet.

1. INTRODUCTION

It has been discovered by several independent research groups that neural networks can be interpreted as differential equations (Weinan, 2017; Ruthotto & Haber, 2019; Lu et al., 2018; Ciccone et al., 2018; Chen et al., 2018; Gholami et al., 2019). Among them, the seminal neural ordinary differential equation (neural ODE) work, which considers the general architecture in Figure 1(a), learns a neural network approximating ∂h(t)/∂t, where h(t) is a hidden vector at layer (or time) t (Chen et al., 2018). As such, a neural network is described by a system of ODEs, each of which describes the dynamics of one hidden element. While neural ODEs have many good characteristics, they also have the following limitations:

Pros. Neural ODEs interpret t as a continuous variable, so we can obtain a hidden vector at any layer (or time) l by h(l) = h(0) + ∫_0^l o(h(t), t; θ_o) dt, where o(h(t), t; θ_o) = ∂h(t)/∂t is a neural network parameterized by θ_o.

Pros. Neural ODEs sometimes have fewer parameters than other conventional neural network designs, e.g., (Pinckaers & Litjens, 2019).

Cons. Neural ODEs that use an adaptive step-size ODE solver sometimes show numerical instability (i.e., underflow of the step size), or their forward-pass inference can take a long time (i.e., too many steps) while solving integral problems, e.g., a forward-pass time of 37.6 seconds for ODE-Net vs. 9.8 seconds for PR-Net in Table 2. Several countermeasures have been proposed, but solving integral problems remains unavoidable (Zhuang et al., 2020; Finlay et al., 2020; Daulbaev et al., 2020).

To tackle this limitation, we propose the concept of the partial differential equation-regularized neural network (PR-Net) to directly learn a hidden element, denoted h(d, t), at layer (or time) t ∈ [0, T] and dimension d ∈ R^m.
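The fixed-step Euler sketch below illustrates why a neural ODE must solve an integral problem at inference time: h(l) is only available after stepping the learned dynamics from 0 to l. The dynamics function `o` and its parameters are hypothetical stand-ins for a trained network, not part of the proposed method.

```python
import numpy as np

def o(h, t, theta):
    # Hypothetical learned dynamics o(h(t), t; theta) = dh/dt: a tiny linear layer.
    W, b = theta
    return np.tanh(W @ h + b * t)

def neural_ode_forward(h0, l, theta, n_steps=100):
    """Fixed-step Euler approximation of h(l) = h(0) + integral_0^l o(h(t), t) dt."""
    h, dt = h0.astype(float), l / n_steps
    for i in range(n_steps):
        t = i * dt
        h = h + dt * o(h, t, theta)  # one explicit Euler step
    return h

rng = np.random.default_rng(0)
theta = (rng.standard_normal((4, 4)) * 0.1, rng.standard_normal(4) * 0.1)
h0 = rng.standard_normal(4)
h1 = neural_ode_forward(h0, 1.0, theta)  # hidden vector at layer l = 1
```

An adaptive solver replaces the fixed `n_steps` with error-controlled steps, which is exactly where the step-size underflow and long inference times mentioned above can arise.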
Under general contexts, a PDE consists of i) an initial condition at t = 0, ii) a boundary condition at a boundary location of the spatial domain R^m, and iii) a governing equation describing ∂h(d, t)/∂t. As such, learning a PDE from data can be reduced to a regression-like problem: predict h(d, t) such that it meets its initial/boundary conditions and governing equation. In training our proposed PR-Net, h(0) is provided by an earlier feature extraction layer, as in neural ODEs. However, an appropriate governing equation is unknown for downstream tasks. In other words, neural ODEs directly learn a governing equation (i.e., ∂h(t)/∂t), whereas PR-Net learns a governing equation in conjunction with a regression model that conforms to the learned governing equation. The key advantage of our approach is that we can eliminate the necessity of solving integral problems; in neural ODEs, where only a governing equation is learned, solving integral problems is mandatory. Such forward and inverse problems (i.e., solving PDEs for h(d, t) and identifying governing equations, respectively) arise in many important computational science problems, and there have been many efforts applying machine learning/deep learning techniques to them (e.g., in earth science (Reichstein et al., 2019; Bergen et al., 2019) and climate science (Rolnick et al., 2019)). Recently, physics-informed or physics-aware approaches (Battaglia et al., 2016; Chang et al., 2017; de Bezenac et al., 2018; Raissi et al., 2019; Sanchez-Gonzalez et al., 2018; Long et al., 2018) have demonstrated that designing neural networks to incorporate prior scientific knowledge (e.g., by enforcing physical laws described in governing equations (Raissi et al., 2019)) greatly helps to avoid over-fitting and to improve the generalizability of neural networks.
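As a minimal sketch of this regression view, a training loss can combine an initial-condition term with the residual of a governing equation, estimated here by finite differences so that no integral is ever solved. Both the one-output model `f` and the governing equation `g` below are hypothetical toy examples, not the paper's actual architecture.

```python
import numpy as np

def f(d, t, w):
    """Hypothetical regression model h(d, t) ~ f(d, t; w): a one-hidden-layer MLP."""
    x = np.array([d, t])
    hidden = np.tanh(w["W1"] @ x + w["b1"])
    return float(w["w2"] @ hidden + w["b2"])

def pr_net_loss(w, g, h0, d_pts, t_pts, eps=1e-4):
    """Regression-like PDE loss: fit the initial condition h(d, 0) = h0(d) and
    penalize the residual of a governing equation dh/dt = g(d, t, h)."""
    init_loss = sum((f(d, 0.0, w) - h0(d)) ** 2 for d in d_pts)
    pde_loss = 0.0
    for d in d_pts:
        for t in t_pts:
            # Central finite difference approximates dh/dt without any ODE solve.
            dh_dt = (f(d, t + eps, w) - f(d, t - eps, w)) / (2 * eps)
            pde_loss += (dh_dt - g(d, t, f(d, t, w))) ** 2
    return init_loss + pde_loss

rng = np.random.default_rng(1)
w = {"W1": rng.standard_normal((8, 2)), "b1": rng.standard_normal(8),
     "w2": rng.standard_normal(8), "b2": 0.0}
loss = pr_net_loss(w,
                   g=lambda d, t, h: -h,     # toy governing equation dh/dt = -h
                   h0=lambda d: np.sin(d),   # toy initial condition
                   d_pts=np.linspace(0, 1, 5), t_pts=np.linspace(0.1, 1, 5))
```

In PR-Net, `g` is itself learned jointly with `f`, so minimizing a loss of this shape solves the forward and inverse problems simultaneously.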
There also exist several approaches that incorporate ideas from classical mechanics in designing neural-ODE-type networks (Greydanus et al., 2019; Chen et al., 2020; Cranmer et al., 2020; Zhong et al., 2020; Lee & Parish, 2020). However, all of these works are interested in solving either forward or inverse problems, whereas we solve the two problem types at the same time for downstream tasks. The most similar existing work to ours is (Long et al., 2018). However, that work studied scientific PDEs and does not consider t as a continuous variable but uses a set of discretized points of t. Compared to previous approaches, the proposed method has a distinct feature: forward and inverse problems are solved simultaneously with a continuous variable t. Due to this unique feature, the method can be applied to general machine learning downstream tasks, where we do not have a priori knowledge of governing equations, such as image classification. Our proposed PR-Net has the following characteristics:

Pros. PR-Net trains a regression model that outputs a scalar element h(d, t) (without solving any integral problems), and both d and t can be treated as continuous variables. Therefore, it is possible to construct flexible hidden dimension vectors.

Pros. PR-Net does not require solving integral problems. As such, there is no numerical instability, and its forward-pass time is much shorter than that of neural ODEs.

Pros. By learning a governing equation, we can regularize the overall behavior of PR-Net.

Cons. PR-Net sometimes requires a larger number of parameters than neural ODEs or conventional neural networks.

2. PARTIAL DIFFERENTIAL EQUATIONS

The key difference between ODEs and PDEs is that a PDE can contain derivatives with respect to multiple variables, whereas an ODE contains derivatives with respect to only one variable. Therefore, our PDE-based method interprets both the layer of a neural network and the dimension of its hidden vector as continuous variables, which cannot be done with neural ODEs. In our context, h(d, t) denotes a hidden scalar element at layer t ∈ R and dimension d ∈ R^m, e.g., m = 1 if h(t) is a vector, m = 3 if h(t) is a convolutional feature map, and so on.
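Because both d and t are continuous, a trained regression model can in principle be queried on any dimension grid to assemble hidden vectors of arbitrary width. The closed-form `h` below is only a hypothetical stand-in for such a trained model, used to illustrate the querying pattern:

```python
import numpy as np

def h(d, t):
    # Hypothetical trained regression model h(d, t); a closed-form stand-in here.
    return np.sin(np.pi * d) * np.exp(-t)

def hidden_vector(t, width):
    """Assemble a hidden vector of arbitrary width by sampling d on a grid."""
    ds = np.linspace(0.0, 1.0, width)
    return h(ds, t)  # vectorized evaluation over the dimension grid

v8 = hidden_vector(0.5, 8)    # an 8-dimensional hidden vector at layer t = 0.5
v16 = hidden_vector(0.5, 16)  # the same layer queried on a finer dimension grid
```

A neural ODE, by contrast, fixes the width of h(t) at training time; only t is continuous.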



Figure 1: The proposed PR-Net avoids solving integral problems by learning a regression model that conforms with a learned governing equation.

