PDE-REGULARIZED NEURAL NETWORKS FOR IMAGE CLASSIFICATION

Abstract

Neural ordinary differential equations (neural ODEs) introduced an approach that approximates a neural network as a system of ODEs by treating its layer index as a continuous variable while keeping its hidden dimension discretized. Despite several appealing properties, neural ODEs are known to be numerically unstable and slow in solving their integral problems, which causes errors and/or heavy computation in forward-pass inference. In this work, we present a novel partial differential equation (PDE)-based approach that removes the need to solve integral problems and treats both the layer and the hidden dimension as continuous variables. Owing to recent advances in learning PDEs, the presented concept, called PR-Net, can be implemented. Our method shows comparable (or better) accuracy and robustness with much shorter forward-pass inference time on various datasets and tasks in comparison with neural ODEs and Isometric MobileNet V3. Given its efficiency, PR-Net is well suited for deployment in resource-scarce environments, e.g., in place of MobileNet.

1. INTRODUCTION

It has been discovered by several independent research groups that neural networks can be interpreted as differential equations (Weinan, 2017; Ruthotto & Haber, 2019; Lu et al., 2018; Ciccone et al., 2018; Chen et al., 2018; Gholami et al., 2019). Among them, the seminal neural ordinary differential equation (neural ODE) work, which considers the general architecture in Figure 1(a), learns a neural network approximating ∂h(t)/∂t, where h(t) is a hidden vector at layer (or time) t (Chen et al., 2018). A neural network is thereby described by a system of ODEs, each of which describes the dynamics of one hidden element. While neural ODEs have many good characteristics, they also have limitations:

- Pros: Neural ODEs interpret t as a continuous variable, so we can obtain hidden vectors at any layer (or time) l by h(l) = h(0) + ∫_0^l o(h(t), t; θ_o) dt, where o(h(t), t; θ_o) = ∂h(t)/∂t is a neural network parameterized by θ_o.
- Pros: Neural ODEs sometimes have fewer parameters than other conventional neural network designs, e.g., (Pinckaers & Litjens, 2019).
- Cons: Neural ODEs with an adaptive step-size ODE solver sometimes show numerical instability (i.e., underflow of the step size), or their forward-pass inference can take a long time (i.e., too many steps) while solving integral problems, e.g., a forward-pass time of 37.6 seconds for ODE-Net vs. 9.8 seconds for PR-Net in Table 2. Several countermeasures have been proposed, but solving integral problems remains unavoidable (Zhuang et al., 2020; Finlay et al., 2020; Daulbaev et al., 2020).

To tackle this limitation, we propose the concept of partial differential equation-regularized neural networks (PR-Net) to directly learn a hidden element, denoted h(d, t), at layer (or time) t ∈ [0, T] and dimension d ∈ R^m.
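To illustrate what solving the integral entails, the following is a minimal, hypothetical sketch of a neural ODE forward pass with a fixed-step Euler solver. Actual neural ODEs use adaptive solvers such as Dormand-Prince; the network `ODEFunc`, its layer sizes, and the step count here are illustrative only.

```python
import torch
import torch.nn as nn

# Hypothetical o(h(t), t; theta_o): a small network approximating dh/dt.
class ODEFunc(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h, t):
        # Concatenate t as an extra input so the dynamics can depend on time.
        t_col = torch.full((h.shape[0], 1), float(t))
        return self.net(torch.cat([h, t_col], dim=1))

def euler_solve(func, h0, t1, n_steps=20):
    """Fixed-step Euler approximation of h(t1) = h(0) + int_0^{t1} o(h(t), t) dt."""
    h, dt = h0, t1 / n_steps
    for i in range(n_steps):
        h = h + dt * func(h, i * dt)
    return h

func = ODEFunc(dim=8)
h0 = torch.randn(4, 8)               # a batch of 4 initial hidden vectors
h1 = euler_solve(func, h0, t1=1.0)   # hidden vectors at layer (time) t = 1
```

Note how the cost of the forward pass scales with the number of solver steps; adaptive solvers can require many more steps than the fixed 20 used here, which is exactly the inefficiency PR-Net avoids.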
Under general contexts, a PDE consists of i) an initial condition at t = 0, ii) a boundary condition at a boundary location of the spatial domain R^m, and iii) a governing equation describing ∂h(d, t)/∂t. As such, learning a PDE from data reduces to a regression-like problem: predicting h(d, t) such that it meets its initial/boundary conditions and governing equation. In training our proposed PR-Net, h(0) is provided by an earlier feature extraction layer, as in neural ODEs. However, an appropriate governing equation is unknown for downstream tasks, so we learn one from data. In other words, neural ODEs directly learn a governing equation (i.e., ∂h(t)/∂t), whereas PR-Net learns a governing equation in conjunction with a regression model that conforms with it. The key advantage of our approach is that it eliminates the need to solve integral problems; in neural ODEs, which learn a governing equation only, solving integral problems is mandatory. Such forward and inverse problems (i.e., solving PDEs for h(d, t) and identifying governing equations, respectively) arise in many important computational science problems, and there have been many efforts applying machine learning/deep learning techniques to them (e.g., in earth science (Reichstein et al., 2019; Bergen et al., 2019) and climate science (Rolnick et al., 2019)). Recently, physics-informed or physics-aware approaches (Battaglia et al., 2016; Chang et al., 2017; de Bezenac et al., 2018; Raissi et al., 2019; Sanchez-Gonzalez et al., 2018; Long et al., 2018) have demonstrated that designing neural networks to incorporate prior scientific knowledge (e.g., by enforcing physical laws described in governing equations (Raissi et al., 2019)) greatly helps avoid over-fitting and improves the generalizability of neural networks.
There also exist several approaches that incorporate ideas from classical mechanics into the design of neural-ODE-type networks (Greydanus et al., 2019; Chen et al., 2020; Cranmer et al., 2020; Zhong et al., 2020; Lee & Parish, 2020). However, all of these works solve either forward or inverse problems, whereas we solve both problem types at the same time for downstream tasks. The most similar existing work to ours is (Long et al., 2018); however, that work studies scientific PDEs and does not consider t as a continuous variable but uses a set of discretized points of t. Compared to previous approaches, the proposed method has the distinct feature that forward and inverse problems are solved simultaneously with a continuous variable t. Due to this unique feature, the method can be applied to general machine learning downstream tasks where we have no a priori knowledge of governing equations, such as image classification. Our proposed PR-Net has the following characteristics:

- Pros: PR-Net trains a regression model that outputs a scalar element h(d, t) without solving any integral problems, and both d and t can be treated as continuous variables. Therefore, it is possible to construct flexible hidden dimension vectors.
- Pros: Because PR-Net does not require solving integral problems, there is no numerical instability, and its forward-pass time is much shorter than that of neural ODEs.
- Pros: By learning a governing equation, we can regularize the overall behavior of PR-Net.
- Cons: PR-Net sometimes requires a larger number of parameters than neural ODEs or conventional neural networks.

2. PARTIAL DIFFERENTIAL EQUATIONS

The key difference between ODEs and PDEs is that PDEs can contain derivatives with respect to multiple variables, whereas ODEs contain derivatives with respect to only one variable. Therefore, our PDE-based method interprets both the layer of a neural network and the dimension of its hidden vector as continuous variables, which cannot be done with neural ODEs. In this section, we first introduce the forward and inverse problems of PDEs in general contexts (see Table 1). Then, we extend them to design our proposed method in deep-learning contexts.

2.1. FORWARD PROBLEM OF PDES IN GENERAL CONTEXTS

The forward PDE problem in general contexts is to find a solution h(d, t), where d is in a spatial domain R^m and t is in a time domain [0, T], given i) an initial condition h(d, 0), ii) a boundary condition h(d_bc, t), where d_bc is a boundary location of the spatial domain R^m, and iii) a governing equation g (Raissi et al., 2019). We note that the boundary condition can be missing in some cases (Kim, 2018). The governing equation is typically in the following form with particular choices of α_{i,j} (Raissi, 2018; Peng et al., 2020):

g(d, t; h) := h_t − (α_{0,0} + α_{1,0}h + α_{2,0}h² + α_{3,0}h³
            + α_{0,1}h_d + α_{1,1}hh_d + α_{2,1}h²h_d + α_{3,1}h³h_d
            + α_{0,2}h_dd + α_{1,2}hh_dd + α_{2,2}h²h_dd + α_{3,2}h³h_dd
            + α_{0,3}h_ddd + α_{1,3}hh_ddd + α_{2,3}h²h_ddd + α_{3,3}h³h_ddd),   (1)

where h_t = ∂h(d, t)/∂t, h_d = ∂h(d, t)/∂d, h_dd = ∂²h(d, t)/∂d², and h_ddd = ∂³h(d, t)/∂d³. We also note that g is identically zero for the true solution of a PDE, i.e., g(d, t; h) = 0. In many cases, the forward problem is hard to solve, and hence general-purpose PDE solvers do not exist. Nevertheless, one can use the following optimization to train a neural network f(d, t; θ) to approximate the solution function h(d, t), as shown in Figure 2 (Raissi et al., 2019):

arg min_θ L_I + L_B + L_G,   (2)
L_I := (1/N_I) Σ_d (f(d, 0; θ) − h(d, 0))²,   (3)
L_B := (1/N_B) Σ_{(d_bc, t)} (f(d_bc, t; θ) − h(d_bc, t))²,   (4)
L_G := (1/N_G) Σ_{(d, t)} g(d, t; f, θ)²,   (5)

where N_I, N_B, N_G are the numbers of training samples, L_I trains θ for the initial condition, L_B for the boundary condition, and L_G for the governing equation. Because the governing equation is always zero, we simply minimize its squared term. Note that i) f_t, f_d, f_dd, f_ddd can be easily constructed using the automatic differentiation implemented in TensorFlow or PyTorch, and ii) we only need h(d, 0) and h(d_bc, t), which are known a priori, to train the parameters θ.
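To make this optimization concrete, here is a hedged PyTorch sketch (not the paper's actual implementation) that builds L_I and L_G via automatic differentiation for a hypothetical 1-D governing equation h_t = α_{0,1} h_d + α_{0,2} h_dd with fixed, assumed coefficients; the boundary term L_B is omitted, as the text notes it can be. The network, the initial condition, and all sample counts are illustrative.

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))  # f(d, t; theta)

def derivatives(f, d, t):
    """Return h, h_t, h_d, h_dd at the given collocation points via autograd."""
    d = d.requires_grad_(True)
    t = t.requires_grad_(True)
    h = f(torch.cat([d, t], dim=1))
    h_t = torch.autograd.grad(h.sum(), t, create_graph=True)[0]
    h_d = torch.autograd.grad(h.sum(), d, create_graph=True)[0]
    h_dd = torch.autograd.grad(h_d.sum(), d, create_graph=True)[0]
    return h, h_t, h_d, h_dd

# Collocation points for L_G and initial-condition points for L_I.
d_g, t_g = torch.rand(128, 1), torch.rand(128, 1)
d_i = torch.rand(64, 1)
h0 = torch.sin(3.1416 * d_i)          # hypothetical initial condition h(d, 0)
alpha = torch.tensor([1.0, 0.1])      # assumed fixed alpha_{0,1}, alpha_{0,2}

h, h_t, h_d, h_dd = derivatives(f, d_g, t_g)
L_G = ((h_t - (alpha[0] * h_d + alpha[1] * h_dd)) ** 2).mean()
L_I = ((f(torch.cat([d_i, torch.zeros_like(d_i)], dim=1)) - h0) ** 2).mean()
loss = L_I + L_G
loss.backward()                        # gradients flow into theta
```

In an actual training loop, this loss would be minimized with a stochastic optimizer; `create_graph=True` is what allows the squared-residual loss, which itself contains derivatives of f, to be differentiated with respect to θ.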

2.2. INVERSE PROBLEM OF PDES IN GENERAL CONTEXTS

The inverse problem is to find a governing equation given i) an initial condition h(d, 0) and ii) a solution function h(d, t) (Raissi, 2018). It learns α_{i,j} in Eq. 1 with the following loss (if possible, reference solutions are used as well):

arg min_{α_{i,j}} (1/N_G) Σ_{(d, t)} g(d, t; h)².

Given a solution function h and its partial derivative terms, we train α_{i,j} by minimizing this objective. Note that we know h in this case; therefore, the objective is defined with h rather than with f, unlike Eq. 5. The optimal solution of α_{i,j} is sometimes not unique. However, no trivial solutions, e.g., α_{i,j} = 0 for all i, j, exist for the inverse problem.
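The inverse problem can be illustrated with a toy case where the solution is known analytically. Assuming h(d, t) = e^{−t} sin(d), which satisfies h_t = −h (so the true coefficients are α_{1,0} = −1 and zero elsewhere), the following hedged sketch recovers the coefficients by minimizing the squared governing-equation residual over a small candidate dictionary (a subset of the terms in Eq. 1):

```python
import torch

# Known closed-form solution evaluated at random (d, t) samples.
d = torch.rand(256, 1, requires_grad=True)
t = torch.rand(256, 1, requires_grad=True)
h = torch.exp(-t) * torch.sin(d)
h_t = torch.autograd.grad(h.sum(), t, retain_graph=True)[0]
h_d = torch.autograd.grad(h.sum(), d)[0]
h = h.detach()

# Candidate dictionary of right-hand-side terms: [alpha_00, alpha_10, alpha_01].
library = torch.cat([torch.ones_like(h), h, h_d], dim=1)
alpha = torch.zeros(3, 1, requires_grad=True)

opt = torch.optim.Adam([alpha], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    residual = h_t - library @ alpha       # g(d, t; h) with the learned alpha
    (residual ** 2).mean().backward()
    opt.step()
# alpha[1] (i.e. alpha_{1,0}) should approach -1, the others near 0.
```

Because h is known here, the residual is built directly from h and its derivatives rather than from a network f, exactly as the objective above prescribes.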

3. PDE-REGULARIZED NEURAL NETWORKS

Our goal in this work is to replace a system of ODEs (cf. Figure 1(a)) with a PDE. Assuming that a target task-specific PDE is known a priori, given an initial condition h(0) extracted by the feature extractor from a sample x, a forward problem can be solved via the method described in Section 2.1. However, a target task-specific PDE is not known a priori in general, and thus the governing equation should be learned from data by solving the inverse problem. Unfortunately, the solution function h(d, t) is also not known a priori in our setting. Therefore, we assume that the governing equation consists of the most common partial derivative terms (cf. Eq. 1) and propose to solve the forward and the inverse problems alternately: to train θ, we fix the governing equation g (more precisely, α_{i,j} for all i, j), and to train α_{i,j} for all i, j, we fix θ.

How to Solve Forward Problem. We customize the method presented in Section 2.1 by i) adding a task-specific loss, e.g., the cross-entropy loss for image classification, ii) parameterizing the neural network f by the initial condition h(0), and iii) dropping the boundary condition. Let f(h(0), d, t; θ) be our neural network approximating h(d, t) given the varying initial condition h(0).¹ The definition of the governing equation is also extended to g(d, t; f, h(0), θ). We use the following loss definition to train θ:

arg min_θ L_T + L_I + L_G,   (6)
L_I := (1/N_X) Σ_{x∈X} (1/dim(h)) Σ_d (f(h(0), d, 0; θ) − h(d, 0))²,
L_G := (1/N_X) Σ_{x∈X} (1/N_H) Σ_{(d,t)∈H} g(d, t; f, h(0), θ)²,

where L_T is a task-specific loss, X is a training set, and H is a set of (d, t) pairs, with d ∈ R_{≥0} and t ∈ R_{≥0}, with which we construct the hidden vector used for downstream tasks, denoted h_task (see Figure 3). We query f(h(0), d, t; θ) with the (d, t) pairs in H to construct h_task.
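The construction of h_task can be sketched as follows. This is a hypothetical illustration (the class name `PRNetRegressor`, layer sizes, and shapes are assumptions, not the paper's architecture): f consumes the flattened initial condition h(0) together with a scalar (d, t) query and returns one hidden element, and evaluating all pairs in H yields h_task.

```python
import torch
import torch.nn as nn

class PRNetRegressor(nn.Module):
    """Hypothetical f(h(0), d, t; theta): one scalar hidden element per query."""
    def __init__(self, h0_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(h0_dim + 2, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, h0, d, t):
        # h0: (batch, h0_dim); d, t: scalar tensors broadcast over the batch.
        q = torch.stack([d, t]).expand(h0.shape[0], 2)
        return self.net(torch.cat([h0, q], dim=1)).squeeze(-1)

h0 = torch.randn(4, 16)                  # initial condition from the feature extractor
H = nn.Parameter(torch.rand(10, 2))      # learnable (d, t) query pairs in H
f = PRNetRegressor(h0_dim=16)

# One forward-pass evaluation per pair in H; no integral is solved.
h_task = torch.stack([f(h0, dt[0], dt[1]) for dt in H], dim=1)  # (4, 10)
```

Because H is an `nn.Parameter`, gradients of the task loss flow into the (d, t) pairs themselves, which is what allows the query locations to be trained.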
One more important point: to better construct h_task, we can also train the pairs in H as follows: arg min_{(d,t)∈H} L_T (line 7 in Alg. 1). Thus, the elements of h_task can be collected from different dimensions and layers. A similar approach that optimizes the end time of the integral was attempted for neural ODEs in (Massaroli et al., 2020).

How to Solve Inverse Problem. After fixing θ, we train α_{i,j} for all i, j by using the following L1-regularized loss with a coefficient w:

arg min_{α_{i,j}} L_G + R_G,   R_G := w Σ_{i,j} |α_{i,j}|.   (7)

We minimize the sum of |α_{i,j}| to induce a sparse governing equation, following Occam's razor and the observation that the governing equations of many PDEs are sparse. This optimization allows us to choose a sparse solution among many possible governing equations. In many cases, therefore, our regularized inverse problem can be uniquely solved.
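The alternating scheme can be summarized with a control-flow skeleton. All tensors and the placeholder losses here are hypothetical stand-ins; only the alternation (fix α to train θ and the (d, t) pairs, then fix θ to train α) and the L1 penalty R_G mirror the text.

```python
import torch

theta = torch.randn(10, requires_grad=True)      # stands in for the network weights
alpha = torch.randn(4, requires_grad=True)       # governing-equation coefficients
dt_pairs = torch.rand(6, 2, requires_grad=True)  # learnable (d, t) pairs in H
w = 1e-2                                         # L1 coefficient for R_G

opt_theta = torch.optim.Adam([theta, dt_pairs], lr=1e-3)
opt_alpha = torch.optim.Adam([alpha], lr=1e-3)

def L_G(theta, alpha):
    # Placeholder residual loss; a real implementation evaluates g(d, t; f, h(0), theta).
    return (theta[:4] * alpha).sum() ** 2

for step in range(100):
    # (i) Fix alpha; train theta and the (d, t) pairs (L_T and L_I omitted here).
    opt_theta.zero_grad()
    (L_G(theta, alpha.detach()) + dt_pairs.pow(2).mean()).backward()
    opt_theta.step()
    # (ii) Fix theta; train alpha with the L1 sparsity penalty R_G.
    opt_alpha.zero_grad()
    (L_G(theta.detach(), alpha) + w * alpha.abs().sum()).backward()
    opt_alpha.step()
```

The `detach()` calls are what implement "fixing" one group of parameters while the other is updated; after training, `alpha` could be discarded, as the text notes.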


Training Algorithm. Our overall training algorithm is in Alg. 1. We alternately train θ, (d, t) ∈ H, and α_{i,j} for all i, j. The forward problem to train θ becomes a well-posed problem (i.e., its solution always exists and is unique) if the neural network f is analytical or, equivalently, uniformly Lipschitz continuous (Chen et al., 2018). Many neural network operators, such as softplus, fully-connected, and exponential, are analytical. Under the mild condition of analytical neural networks, therefore, well-posedness is fulfilled. The inverse problem can also be uniquely solved in many cases due to the sparseness requirement. As a result, our proposed training algorithm can converge to a cooperative equilibrium. Note that θ, (d, t) ∈ H, and α_{i,j} for all i, j cooperate to minimize L_T + L_I + L_G + R_G; the proposed training method can therefore be seen as a cooperative game (Mas-Colell, 1989). After the training process finishes, α_{i,j} for all i, j are no longer needed (because θ already conforms with the learned governing equation at this point) and can be discarded during testing. For complicated downstream tasks, training for L_T should be done earlier than the others (line 5 of Alg. 1). Then, we carefully update the PDE parameters (line 6), and the other training procedures follow. The proposed sequence in Alg. 1 produces the best outcomes in our experiments; however, this sequence can vary for other datasets or downstream tasks.

Complexity Analyses. The adjoint sensitivity method of neural ODEs enables O(1) space complexity while calculating gradients. However, its forward-pass inference time is O(1/s), where s is the (average) step size of the underlying ODE solver. Because s can sometimes be very small, forward-pass inference can take a long time.
Our PR-Net uses the standard backpropagation method for training, and its gradient computation complexity is the same as that of conventional neural networks. In addition, its forward-pass inference time is O(1) given a fixed network f, because we do not solve integral problems.

4. EXPERIMENTS

In this section, we introduce our experimental evaluations with various datasets and tasks. All experiments were conducted in the following software and hardware environments: UBUNTU 18.04 LTS, PYTHON 3.6.6, NUMPY 1.18.5, SCIPY 1.5, MATPLOTLIB 3.3.1, PYTORCH 1.2.0, CUDA 10.0, NVIDIA Driver 417.22, an i9 CPU, and an NVIDIA RTX TITAN. In Section J of the Appendix, we summarize detailed dataset information and additional experiments.

4.1. IMAGE CLASSIFICATION WITH MNIST AND SVHN

See the Appendix for the architecture and the hyperparameters of the network f in PR-Net for this experiment. We reuse the codes of Chen et al. (2018) and strictly follow their experimental environments. Detailed results are summarized in Table 2. We compare with ResNet, RK-Net, and ODE-Net. In ResNet, we have a downsampling layer followed by 6 standard residual blocks (He et al., 2016). For RK-Net and ODE-Net, we replace the residual blocks with an ODE; they differ in the choice of ODE solver. RK-Net uses the fourth-order Runge-Kutta method and ODE-Net uses the adaptive Dormand-Prince method for forward-pass inference; both are trained with the adjoint sensitivity method, a standard backward-pass gradient computation method. Our PR-Net, which does not require solving integral problems, shows the best performance in all aspects for MNIST. In particular, PR-Net is much more efficient than ResNet considering their numbers of parameters, i.e., 0.60M for ResNet vs. 0.21M for PR-Net. Comparing ODE-Net and PR-Net on inference time, our method is much faster, i.e., 24.8355 seconds for ODE-Net vs. 6.5023 seconds for PR-Net to classify a batch of 1,000 images. For SVHN, considering its short inference time, PR-Net's efficiency is still better than that of ODE-Net. One interesting point is that the fourth-order Runge-Kutta method in RK-Net produces better accuracy and inference time than ODE-Net in our experiments, which differs slightly from the original neural ODE paper (Chen et al., 2018).
We tested more hyperparameters for them.

4.2. IMAGE CLASSIFICATION WITH TINY IMAGENET

We use one more convolutional neural network to test with Tiny ImageNet. Tiny ImageNet is a modified subset of ImageNet with image resolution downscaled to 64 × 64. It consists of 200 classes with 100,000 training images and 10,000 validation images. Our baseline model is Isometric MobileNet V3 (Sandler et al., 2019). Given the efficient nature of ODE-Net and PR-Net, we consider resource-scarce environments, for which MobileNet was designed, to be one of their best application areas. The isometric architecture of Isometric MobileNet V3 maintains a constant resolution throughout all layers; therefore, pooling layers are not needed and computational efficiency is high, according to their experiments. In addition, neural ODEs require an isometric architecture, i.e., the dimensionality of h(t), t ≥ 0, cannot be varied. Our PR-Net has no such restriction; for fair comparison, however, we have decided to use Isometric MobileNet V3. We replace some of its MobileNet V3 blocks with ODEs or PDEs, denoted ODE-Net and PR-Net in Table 3, respectively. We train our models from scratch without any pretrained network, with a synchronous training setup. Table 3 summarizes the results. We report both the top-1 and the top-5 accuracy, which is common practice for (Tiny) ImageNet. In general, our PR-Net shows the best accuracy. PR-Net achieves a top-1 accuracy of 0.6157 with 4.56M parameters. The full Isometric MobileNet V3 marks a top-1 accuracy of 0.6578 with 20M parameters, and the reduced Isometric MobileNet V3 with 4.30M parameters shows a top-1 accuracy of 0.6076. Considering the large difference in the number of parameters, PR-Net's efficiency is high. In particular, it outperforms the others in top-5 accuracy by non-trivial margins, e.g., 0.7911 for ODE-Net vs. 0.8115 for Isometric MobileNet V3 vs. 0.8357 for PR-Net. In addition, PR-Net shows faster forward-pass inference time in comparison with ODE-Net.
The inference time is the time to classify a batch of 1,000 images.

4.3. EXPERIMENTS ON ROBUSTNESS WITH TINY IMAGENET

To check the efficacy of learning a governing equation, we conduct three additional experiments with Tiny ImageNet: i) out-of-distribution image classification, ii) adversarial attacks, and iii) transfer learning to other image datasets. In the first and second experiments, we apply various augmentation/perturbation techniques to generate out-of-distribution/adversarial images and check how each model responds to them. Inspired by the observation that robust models transfer better to other datasets (Engstrom et al., 2019a; Allen-Zhu & Li, 2020; Salman et al., 2020), in the third experiment we check the transfer learning accuracy on other image datasets. According to our hypothesis, PR-Net, which knows the governing equation for classifying Tiny ImageNet, should show better robustness than the others (as seen in Figure 4 for a scientific PDE problem in the Appendix). Neural networks are typically vulnerable to out-of-distribution and adversarial samples (Shen et al., 2016; Azulay & Weiss, 2019; Engstrom et al., 2019b); as they fit the training data more closely, they typically show lower robustness to such samples. However, PR-Net's processing of them should follow its learned governing equation. Therefore, learning a governing equation can be understood as a form of regularization that prevents overfitting and implants knowledge governing the classification process.

Out-of-Distribution Image Classification. We use four image augmentation methods: i) adding Gaussian noise of N(0, 0.1), ii) cropping a center area of size 56 × 56 and resizing it to the original size, iii) rotating in a random direction by 30 degrees, and iv) perturbing colors by randomly jittering the brightness, contrast, saturation, and hue with a strength coefficient of 0.2. All of these are popular out-of-distribution augmentation methods (Shen et al., 2016; Azulay & Weiss, 2019; Engstrom et al., 2019b).
Our PR-Net shows the best accuracy (i.e., robustness) in all cases. In comparison with ODE-Net, it shows much better robustness, e.g., 0.3812 for ODE-Net vs. 0.4429 for PR-Net under the color jittering augmentation. One interesting point is that all methods are more vulnerable to the random rotation and color jittering augmentations than to the other two.

Adversarial Attack Robustness. It is well known that neural networks are vulnerable to adversarial attacks. Because the governing equation regularizes PR-Net's behavior, PR-Net can be robust to unknown adversarial samples. We use FGSM (Goodfellow et al., 2015) and PGD (Madry et al., 2018) to find adversarial samples, and the robustness to them is reported in Table 4. We generate adversarial samples with various settings of the key parameter that controls the degree of adversarial perturbation. The configuration that doubles the number of channels in each layer, denoted "Width Multiplier 2", showed better performance in Table 3, and we use only that configuration here.

Transfer Learning. As reported in (Engstrom et al., 2019a; Allen-Zhu & Li, 2020; Salman et al., 2020), robust models tend to produce feature maps more suitable for transfer learning than regular models. In this regard, we checked the transferability of the PR-Net pretrained on Tiny ImageNet to other datasets: CIFAR100 (Krizhevsky, 2009), CIFAR10 (Krizhevsky, 2009), FGVC Aircraft (Maji et al., 2013), Food-101 (Bossard et al., 2014), DTD (Cimpoi et al., 2014), and Cars (Yang et al., 2015). As shown in Table 5, PR-Net shows the best transfer learning accuracy in all cases except Cars. The improvements over M.Net V3 and ODE-Net are significant for Aircraft and DTD.
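The FGSM attack used in these robustness experiments admits a compact sketch. The tiny model, the random data, and the perturbation budget below are placeholders, not the experimental setup; only the one-step signed-gradient perturbation is FGSM itself (Goodfellow et al., 2015).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder classifier and a batch of "images" in [0, 1].
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
eps = 8 / 255                          # perturbation budget (assumed value)

# FGSM: one step in the direction of the sign of the input gradient.
x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()
x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
```

PGD, the second attack used, can be viewed as iterating this step several times with a projection back into the eps-ball after each step.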

5. DISCUSSIONS & CONCLUSIONS

It has recently become popular to design neural networks based on differential equations; in most cases, ODEs are used. In this work, on the other hand, we presented a PDE-based approach to designing neural networks. Our method simultaneously learns a regression model and a governing equation that conform with each other; therefore, the internal processing mechanism of the learned regression model follows the learned governing equation. One can view this mechanism as a way of implanting domain knowledge into the regression model. The main challenge in our problem definition is that we need to discover a governing equation from data while training a regression model, so we adopt a joint training method for the regression model and the governing equation. To show the efficacy, we conducted five experiments: i) MNIST/SVHN classification, ii) Tiny ImageNet classification, iii) classification with out-of-distribution samples, iv) adversarial attack robustness, and v) transfer learning. Our method shows the best accuracy and robustness (or close to the best) in all cases except SVHN. In particular, the challenging robustness experiments empirically demonstrate why learning an appropriate governing equation is important. One limitation of this method is that it is sometimes hard to achieve a good trade-off among the various loss and regularization terms. Our method intrinsically involves several such terms, and we found that tuning hyperparameters (especially the coefficients and learning rates) is important for achieving reliable performance. In particular, α_{i,j}, for all i, j, are important for learning reliable governing equations. Because the trained network f is greatly influenced by the governing equation, hyperparameters should be tuned to learn meaningful equations. We also plan to study the proposed concept for many other classification/regression tasks.



¹ Therefore, one can consider that our neural network f approximates a general solution rather than a particular solution. A general solution is a solution of a PDE with no specified initial condition, and a particular solution is a solution of a PDE given an initial condition. Both neural ODEs and PR-Net approximate general solutions because initial conditions are varied.



Figure 1: The proposed PR-Net avoids solving integral problems by learning a regression model that conforms with a learned governing equation.

Figure 3: The general architecture and the training algorithm of PR-Net

Algorithm 1: How to train PR-Net
Input: training data X, validation data V, max iteration number max_iter
Output: θ, (d, t) ∈ H, and α_{i,j} for all i, j
1: Initialize θ, (d, t) ∈ H, and α_{i,j} for all i, j;
2: k ← 0;
3: L_sum ← ∞;
4: while L_sum has not converged and k < max_iter do
5:   Train θ, the feature extractor, and the classifier with L_T;
6:   Train θ with L_T + L_I + L_G;
7:   Train (d, t) ∈ H with L_T;
8:   Train α_{i,j} for all i, j with L_G + R_G;
9:   k ← k + 1;

Table 2: Image classification on MNIST and SVHN. The inference time is the time in seconds to classify a batch of 1,000 images. In general, PR-Net shows the best efficiency. We reuse the convolutional neural network, called ODE-Net, in the work by Chen et al. (2018) to classify MNIST and SVHN and replace its ODE part with our proposed PDE, denoted PR-Net in Table 2.

Table 3: Image classification on Tiny ImageNet. PR-Net shows better efficiency than ODE-Net.

Table 4: Adversarial attacks on Tiny ImageNet. PR-Net shows better robustness than ODE-Net.

Table 5: Transfer learning from Tiny ImageNet. PR-Net shows better transferability than ODE-Net.

