ROBUST NEURAL ODES VIA CONTRACTIVITY-PROMOTING REGULARIZATION

Abstract

Neural networks can be fragile to input noise and adversarial attacks. In this work, we consider Neural Ordinary Differential Equations (NODEs), a family of continuous-depth neural networks represented by dynamical systems, and propose to use contraction theory to improve their robustness. A dynamical system is contractive if two trajectories starting from different initial conditions converge to each other exponentially fast. Contractive NODEs can enjoy increased robustness as slight perturbations of the features do not cause a significant change in the output. Contractivity can be induced during training by using a regularization term involving the Jacobian of the system dynamics. To reduce the computational burden, we show that it can also be promoted using carefully selected weight regularization terms for a class of NODEs with slope-restricted activation functions, including convolutional networks commonly used in image classification. The performance of the proposed regularizers is illustrated through benchmark image classification tasks on MNIST and FashionMNIST datasets, where images are corrupted by different kinds of noise and attacks.

1. INTRODUCTION

Neural networks (NNs) have demonstrated outstanding performance in image classification, natural language processing, and speech recognition tasks. However, they can be sensitive to input noise or meticulously crafted adversarial attacks (Xu et al., 2020; Carlini & Wagner, 2017; Athalye et al., 2018; Szegedy et al., 2013). The customary remedies are either heuristic, such as feature obfuscation (Miller et al., 2020), adversarial training (Goodfellow et al., 2014; Allen-Zhu & Li, 2022), and defensive distillation (Papernot et al., 2016), or certificate-based, such as Lipschitz regularization (Xu et al., 2020; Fazlyab et al., 2019; Pauli et al., 2021; Aquino et al., 2022; Virmaux & Scaman, 2018; Combettes & Pesquet, 2020). The overall intent of certificate-based approaches is to penalize the input-to-output sensitivity of NNs to improve robustness.

Recently, the connections between NNs and dynamical systems have been extensively explored. Representative results include classes of NNs stemming from the discretization of dynamical systems (Haber & Ruthotto, 2017) and NODEs (Chen et al., 2018), which transform the input through a continuous-time ODE embedding training parameters. The continuous-time nature of NODEs makes them particularly suitable for learning complex dynamical systems (Rubanova et al., 2019; Greydanus et al., 2019) and allows borrowing tools from dynamical system theory to analyze their properties (Fazlyab et al., 2022; Galimberti et al., 2021).

In this paper, we employ contraction theory to improve the robustness of NODEs. A dynamical system is contractive if all trajectories converge exponentially fast to each other (Lohmiller & Slotine, 1998; Tsukamoto et al., 2021). Through the lens of contraction, slight perturbations of initial conditions have a diminishing impact over time on the NODE state. With the above considerations, we propose a class of regularizers that promote contractivity of NODEs during training. In the most general case, the regularizers require the Jacobian matrix of the NODE, which might be computationally challenging to obtain for deep networks. Nevertheless, for a wide class of NODEs with slope-restricted activation functions, we show that contractivity can be promoted by directly penalizing the weights during training. Moreover, by leveraging the linearity of convolution operations, we demonstrate that contractivity can be promoted for convolutional NODEs by regularizing the convolution filters only.

1.1. RELATED WORK

Several works have focused on improving the robustness of general NNs against input noise and adversarial attacks using dynamical system theory. For example, the notion of incremental dissipativity is used to provide robustness certificates for NNs in the form of a linear matrix inequality (Aquino et al., 2022). The works Chen et al. (2021; 2022) address the robustness issue of NNs by using a closed-loop control method from the perspective of dynamical systems. A control process is added to a trained NN to generate control signals that mitigate the perturbations in the input data. Nevertheless, the method requires solving an optimal control problem for the inference of each input sample, which increases the computational burden.

A detailed study on the robustness of NODEs has been done by Hanshu et al. (2019), where the authors show that NODEs can be more robust against random perturbations than common convolutional NNs. Moreover, they study time-invariant NODEs and propose to regularize their flows to further enhance the robustness. To bolster the defense against adversarial attacks, NODEs equipped with Lyapunov-stable equilibrium points have been proposed (Kang et al., 2021). Likewise, Rodriguez et al. (2022) introduced a loss function to promote robustness based on a control-theoretic Lyapunov condition. Both methods have shown promising performance against adversarial attacks. Finally, Massaroli et al. (2020) design provably stable NODEs and argue that stability can reduce the sensitivity to small perturbations of the input data. Nevertheless, this claim is not supported by theoretical analysis or numerical validation. In comparison to all the aforementioned works, in this paper we employ contraction theory to regularize the trajectories of NODEs and improve robustness.

Recently, contraction theory has been employed in the framework of NNs for various purposes. For instance, contractivity is exploited to improve the well-posedness and robustness of implicit NNs (Jafarpour et al., 2021), the trainability of recurrent NNs (Revay & Manchester, 2020; Jafarpour et al., 2022), and the analysis of Hopfield NNs with Hebbian learning (Centorrino et al., 2022). In Zakwan et al. (2022), the authors propose a Hamiltonian NODE that is contractive by design to improve robustness. However, the extension to different classes of NODEs, including convolutional NODEs, is not straightforward. Besides the robustification of NNs and NODEs, contractivity has also been exploited for learning NN-based dynamical models from data. For instance, Singh et al. (2021) and Revay et al. (2021a; b) utilize contraction theory to learn stabilizable nonlinear NN models from available data.

1.2. CONTRIBUTIONS

The contribution of this paper is fourfold.

• We show that contractivity can be used to improve the robustness of NODEs, and demonstrate how to promote contractivity for general NODEs during training by including regularization terms in the cost function.

• The regularization terms involve optimizing the Jacobian matrix in NODEs, which might be computationally expensive. Interestingly, for a wide class of NODEs with slope-restricted activation functions, we prove that contractivity can be promoted by carefully penalizing weight matrices and without optimizing the Jacobian matrix.

• By exploiting the linearity of convolution operations and the above results for NODEs with slope-restricted activation functions, we show that contractivity for convolutional NODEs can be induced by suitably penalizing the convolutional filters.

• We conduct experiments on MNIST and FashionMNIST datasets with test images perturbed by different kinds of noise and adversarial attacks. Compared to vanilla NODEs, the contractivity-promoting regularization terms improve the average test accuracy by up to 34% in the presence of input noise and by up to 30% in the case of adversarial attacks.

1.3. ORGANIZATION AND NOTATION

The paper is organized as follows: Section 2 provides preliminaries on NODEs and contraction theory. In Section 3, we propose several regularization approaches for NODEs to promote contractivity. Numerical experiments are described in Section 4, and Section 5 concludes the paper.

The set of real numbers is $\mathbb{R}$. $\frac{\partial f(x)}{\partial x}$ represents the Jacobian matrix of a continuously differentiable function $f(\cdot)$. The minimal eigenvalue of a symmetric matrix $A$ is denoted as $\lambda_{\min}(A)$. $\mathrm{diag}(x)$ represents a diagonal matrix with the entries of the vector $x$ on the diagonal. For symmetric matrices $A$ and $B$, $A \succ (\succeq) B$ means that $A - B$ is positive (semi)definite. $I$ denotes the identity matrix. The 2-norm is denoted as $\|\cdot\|$.

2. PRELIMINARIES

2.1. NEURAL ORDINARY DIFFERENTIAL EQUATION

A NODE is represented by the dynamical system

$$\dot{x}_t = f(x_t, \theta_t, t), \quad t \in [0, T], \tag{1}$$

where $x_t \in \mathbb{R}^n$ is the state of the NODE and $f(x_t, \theta_t, t)$ is a generic smooth function with parameters $\theta_t \in \mathbb{R}^m$. When used in machine learning tasks, the NODE is usually pre- and post-pended with additional layers, e.g., $x_0 = h_\alpha(z)$ and $y = g_\beta(x_T)$, where $h_\alpha$, $g_\beta$ are NNs with parameters $\alpha \in \mathbb{R}^{n_\alpha}$, $\beta \in \mathbb{R}^{n_\beta}$, respectively, $z \in \mathbb{R}^{n_z}$ is the input feature, $y \in \mathbb{R}^p$ represents the output, and $x_0$, $x_T$ are the states of the NODE (1) at time $t = 0$ and $t = T$, respectively.

Several methods have been proposed for training NODEs, such as the adjoint sensitivity method (Chen et al., 2018) and the auto-differentiation technique (Paszke et al., 2017). In this paper, we use the most straightforward approach, that is, the time-discretization of (1) (Haber & Ruthotto, 2017). Consider a classification task, and suppose the training dataset is $\{z^i, c^i\}_{i=1}^{s}$, where $z^i$ are the input features (e.g., images), $c^i$ are the corresponding labels, and $s$ is the number of training samples. Before training, the NODE (1) is discretized and the resulting discrete-time equations define each of the network layers. For instance, by using the Forward Euler (FE) method one obtains

$$x_{k+1} = x_k + h f(x_k, \theta_k, k), \quad k = 0, \dots, \tfrac{T}{h} - 1, \tag{2}$$

where $h > 0$ is the sampling period and, for simplicity, $T/h$ is assumed to be an integer. Then, the NODE is trained by solving the optimization problem

$$\min_{\alpha,\, \{\theta_k\}_{k=0}^{T/h-1},\, \beta} \ \sum_{i=1}^{s} l(y^i, c^i) + \gamma\, \mathrm{reg}\big(\alpha, \{\theta_k\}_{k=0}^{T/h-1}, \beta\big) \tag{3}$$

$$\text{s.t.} \quad x_0^i = h_\alpha(z^i), \quad i = 1, \dots, s,$$

$$x_{k+1}^i = x_k^i + h f(x_k^i, \theta_k, k), \quad k = 0, \dots, \tfrac{T}{h} - 1,$$

$$y^i = g_\beta\big(x_{T/h}^i\big),$$

where $l(\cdot, \cdot)$ denotes the loss function, and $\mathrm{reg}(\cdot)$ is a regularization term suitably scaled by the regularization parameter $\gamma > 0$. For brevity, throughout the paper, we omit the pre- and post-pended layers $h_\alpha(\cdot)$ and $g_\beta(\cdot)$, which usually depend on the specific learning task (Chen et al., 2018).
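The Forward Euler recursion (2) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code used in the paper: the function name `forward_euler`, the callable signature `f(x, k)` (with the parameters $\theta_k$ assumed to be captured inside `f`), and the toy dynamics `linear_decay` are all assumptions made for the example.

```python
def forward_euler(f, x0, T, h):
    """Apply x_{k+1} = x_k + h * f(x_k, k) for k = 0, ..., T/h - 1.

    Returns the list of states [x_0, ..., x_{T/h}].
    Assumes T/h is an integer, as in the text.
    """
    steps = round(T / h)
    states = [list(x0)]
    for k in range(steps):
        x = states[-1]
        dx = f(x, k)
        states.append([xi + h * dxi for xi, dxi in zip(x, dx)])
    return states


def linear_decay(x, k):
    # Hypothetical toy dynamics x' = -x, used only to exercise the integrator.
    return [-xi for xi in x]


# T = 0.1 and h = 0.01 mirror the values reported in Section 4.2.
trajectory = forward_euler(linear_decay, [1.0, -2.0], T=0.1, h=0.01)
```

Each list entry plays the role of one network layer output, so `trajectory[-1]` corresponds to $x_{T/h}$, the state passed to $g_\beta$.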

2.2. CONTRACTIVITY

Contractivity is a property of dynamical systems, and it implies that the trajectories of the dynamical system converge to each other asymptotically. The formal definition is given below.

Definition 1. The dynamics (1) is contractive with a contraction rate $\rho > 0$ if

$$\|\tilde{x}_t - x_t\| \le e^{-\rho t} \|\tilde{x}_0 - x_0\|, \quad \forall t \in [0, T], \tag{5}$$

for all $x_0, \tilde{x}_0 \in \mathbb{R}^n$, where $\tilde{x}_t$ and $x_t$ are the solutions of (1) with initial conditions $\tilde{x}_0$ and $x_0$, respectively.

Therefore, if a NODE is contractive, the Lipschitz constant between the input and the output is smaller than 1, that is, $\frac{\|\tilde{x}_T - x_T\|}{\|\tilde{x}_0 - x_0\|} < 1$ for any distinct $\tilde{x}_0$, $x_0$. As a result, contractive NODEs are robust in the sense that a slight perturbation in the input features $x_0$ would not result in a large deviation in the output $x_T$. Moreover, we have that the NODE (1) is contractive with a contraction rate $\rho$ if and only if (Tsukamoto et al., 2021)

$$-\rho I - \left( \frac{\partial f}{\partial x} + \frac{\partial f}{\partial x}^{\top} \right) \succ 0, \quad \forall t \in [0, T],\ x \in \mathbb{R}^n, \tag{6}$$

where $\frac{\partial f}{\partial x}$ is the Jacobian matrix of $f$.

Remark 1. The notion of asymptotic stability used in Massaroli et al. (2020) might not be appropriate for promoting robustness of NNs. Indeed, as shown in Rüffer et al. (2013), although for convergent dynamics the perturbed states eventually converge to a unique trajectory, after a finite time the distance between trajectories can be arbitrarily large, which can result in poor robustness of the NODE (1). In contrast, contractive dynamics do not suffer from this problem, and we will show in Section 4 that contractivity can considerably improve the robustness of NODEs.

Remark 2. Contractivity implies that all the state trajectories of (1) converge exponentially fast to an equilibrium (Tsukamoto et al., 2021), which may limit the representation power of NODEs. However, a loss of expressivity might be unavoidable for increasing robustness, as discussed in Tsipras et al. (2019).

Remark 3. When training NODEs with a global contractivity requirement, the time horizon $T$ is finite, and we can also tune the contraction rate $\rho$, which is a hyper-parameter. As a result, the NODE trajectories would neither diverge nor converge to the same point during training, which ensures good learning and robustness performance. The reader can also refer to Figure 1 in Zakwan et al. (2022) for an illustration showing that global contraction can still ensure good learning results.
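Definition 1 can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not part of the paper: the scalar dynamics $f(x) = -x + 0.5\tanh(x)$ and the helper names `simulate` and `satisfies_bound` are hypothetical. The slope of this $f$ lies in $[-1, -0.5]$, so two trajectories approach each other at least as fast as $e^{-0.5t}$, and the Forward Euler iterates inherit the bound because each step shrinks distances by a factor of at most $1 - 0.5h \le e^{-0.5h}$.

```python
import math


def f(x):
    # Hypothetical scalar dynamics; the derivative is
    # -1 + 0.5 * (1 - tanh(x)**2), which lies in [-1, -0.5].
    return -x + 0.5 * math.tanh(x)


def simulate(x0, T, h):
    """Forward Euler trajectory [x_0, ..., x_{T/h}] of x' = f(x)."""
    steps = round(T / h)
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return xs


def satisfies_bound(x0, x0_tilde, rho, T, h, tol=1e-12):
    """Check |x~_t - x_t| <= exp(-rho * t) * |x~_0 - x_0| at every grid time."""
    xs = simulate(x0, T, h)
    xs_tilde = simulate(x0_tilde, T, h)
    d0 = abs(x0_tilde - x0)
    return all(
        abs(b - a) <= math.exp(-rho * k * h) * d0 + tol
        for k, (a, b) in enumerate(zip(xs, xs_tilde))
    )


holds_for_half = satisfies_bound(1.0, 1.3, rho=0.5, T=1.0, h=0.01)
holds_for_two = satisfies_bound(1.0, 1.3, rho=2.0, T=1.0, h=0.01)
```

With these choices the bound holds for $\rho = 0.5$ and fails for $\rho = 2$, since the trajectories cannot approach each other faster than $e^{-t}$.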

3. CONTRACTIVITY-PROMOTING REGULARIZATION

To promote the robustness of the NODE (1), one can leverage a regularization term penalizing the violation of (6). Contractivity requires the inequality (6) to hold for all $t \in [0, T]$, $x \in \mathbb{R}^n$. However, during training, we only have access to the discretized states $x_k^i$ and hence, we can promote the fulfillment of the condition (6) by using the following regularization term in (3)

$$\mathrm{reg}\big(\{\theta_k\}_{k=0}^{T/h-1}\big) = \sum_{i=1}^{s} \sum_{k=0}^{T/h} \mathrm{ReLU}\left( -\lambda_{\min}\left( -\rho I - \left( \frac{\partial f}{\partial x} + \frac{\partial f}{\partial x}^{\top} \right)\bigg|_{x_k^i,\, k} \right) \right), \tag{7}$$

where $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function.

Remark 4. Although the regularizer (7) stems from (6), there are some differences. The condition (6) implies that all the trajectories converge to each other exponentially fast. In contrast, the regularizer (7) only penalizes the violation of contractivity locally at the sampled states $x_k^i$, which is weaker than (6) and therefore imposes fewer constraints on NODEs. Due to the smoothness property of NODEs, one can show that the learned trajectories $\{x_0^i, x_1^i, \dots, x_{T/h}^i\}$ are locally contractive in the sense that the relation (5) holds only in the neighborhood of $x_k^i$, $k = 0, \dots, T/h$. We refer the reader to Section 4.1 for an illustration showing the benefits of using (7).
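To see what a single summand of (7) looks like, it may help to work through the scalar case. The example below is an illustration under an assumption not made at this point of the text: a one-dimensional NODE with $f(x, \theta) = \sigma(wx + b)$, where $w$, $b$ are scalar parameters and $\sigma$ is differentiable.

```latex
% Scalar example (n = 1), assuming f(x, \theta) = \sigma(w x + b).
\begin{align*}
\frac{\partial f}{\partial x} &= \sigma'(w x + b)\, w, \\
-\rho I - \Big( \frac{\partial f}{\partial x} + \frac{\partial f}{\partial x}^{\top} \Big)
  &= -\rho - 2\, \sigma'(w x + b)\, w, \\
\mathrm{ReLU}\Big( -\lambda_{\min}\big( -\rho - 2\, \sigma'(w x + b)\, w \big) \Big)\Big|_{x = x_k^i}
  &= \mathrm{ReLU}\big( \rho + 2\, \sigma'(w x_k^i + b)\, w \big).
\end{align*}
% The minimal eigenvalue of a 1x1 matrix is its single entry, so the summand
% vanishes exactly when \sigma'(w x_k^i + b)\, w \le -\rho / 2.
```

In this case the penalty is inactive only at sampled states where $w$ is sufficiently negative relative to the local slope of $\sigma$, which is the scalar counterpart of requiring negative diagonal entries of the weight matrix.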

3.1. WEIGHT REGULARIZATION FOR IMPROVING TRAINING COMPLEXITY

Since the regularization term (7) involves the Jacobian matrices $\frac{\partial f}{\partial x}\big|_{x_k^i, k}$ for all $i$, $k$, it might be computationally expensive to obtain. In this section, we focus on a family of NODEs with slope-restricted activation functions and show that one can directly regularize their trainable parameters to promote contractivity. Consider the following NODE

$$\dot{x}_t = \sigma(W_t x_t + b_t), \quad t \in [0, T], \tag{8}$$

where $x_t \in \mathbb{R}^n$ is the state, $W_t \in \mathbb{R}^{n \times n}$, $b_t \in \mathbb{R}^n$ are NN parameters, and $\sigma(\cdot)$ is the activation function. The following theorem provides a sufficient condition on the weights $W_t$ guaranteeing that (8) is contractive.

Theorem 1. Assume $\sigma'(\cdot) \in [\underline{\kappa}, \bar{\kappa}]$, where $\sigma'(\cdot)$ denotes any sub-derivative of $\sigma$, and $\bar{\kappa} > \underline{\kappa} > 0$. Moreover, for $\rho > 0$, let the following condition hold

$$-\rho - 2\underline{\kappa} W_{t,ii} - \bar{\kappa} \sum_{j=1, j \neq i}^{n} \big( |W_{t,ij}| + |W_{t,ji}| \big) > 0, \quad i = 1, \dots, n, \tag{9}$$

for $t \in [0, T]$, where $W_{t,ij}$ is the $ij$-th element of $W_t$. Then, the NODE (8) is contractive with a contraction rate $\rho$.

Proof. From (6), the NODE (8) is contractive with a contraction rate $\rho$ if

$$-\rho I - J_t W_t - W_t^{\top} J_t \succ 0, \quad \forall x \in \mathbb{R}^n,\ t \in [0, T], \tag{10}$$

where $J_t$ is the Jacobian matrix of $\sigma(W_t x_t + b_t)$ with respect to the input $W_t x_t + b_t$. It follows that $J_t$ is a diagonal matrix with the $i$-th diagonal entry equal to $\sigma'([W_t x_t + b_t]_i)$, where $[W_t x_t + b_t]_i$ denotes the $i$-th element of $W_t x_t + b_t$. According to the Gershgorin disk theorem (Horn & Johnson, 1985), any symmetric matrix $S \in \mathbb{R}^{n \times n}$ that satisfies the conditions

$$S_{ii} > \sum_{j=1, j \neq i}^{n} |S_{ij}|, \quad i = 1, \dots, n,$$

is positive definite (i.e., $S \succ 0$). The diagonal elements of the matrix $-\rho I - JW - W^{\top} J$ (where the subscript $t$ is dropped for simplicity) are $-\rho - 2 J_{ii} W_{ii}$, where $J_{ii}$, $W_{ii}$ are the $ii$-th elements of the matrices $J$ and $W$, respectively. Moreover, the $ij$-th ($i \neq j$) elements of the matrix $-\rho I - JW - W^{\top} J$ are $-J_{ii} W_{ij} - J_{jj} W_{ji}$. Therefore, in view of the Gershgorin disk theorem, the matrix $-\rho I - JW - W^{\top} J$ is positive definite if

$$-\rho - 2 J_{ii} W_{ii} > \sum_{j=1, j \neq i}^{n} |J_{ii} W_{ij} + J_{jj} W_{ji}|, \quad i = 1, \dots, n. \tag{11}$$

A sufficient condition for the feasibility of (11) is that the lower bound of the LHS is greater than the upper bound of the RHS. Consequently, it is necessary that $W_{ii} \le 0$. Since $\bar{\kappa} \ge \sigma'(\cdot) \ge \underline{\kappa}$, a lower bound of the LHS of (11) is $-\rho - 2\underline{\kappa} W_{ii}$, and an upper bound of the RHS of (11) is

$$\bar{\kappa} \sum_{j=1, j \neq i}^{n} \big( |W_{ij}| + |W_{ji}| \big).$$

Hence, if the condition (9) holds for all $i$ and $t \in [0, T]$, the inequality (10) is verified, and the NODE (8) is contractive with the contraction rate $\rho$.

Inspired by the above result, we can use the following regularization term in (3) during training to promote contractivity of the NODE (8)

$$\mathrm{reg}\big(\{W_k\}_{k=0}^{T/h-1}\big) = \sum_{k=0}^{T/h-1} \sum_{i=1}^{n} \mathrm{ReLU}\left( \rho + 2(\underline{\kappa} + \bar{\kappa}) W_{k,ii} + \bar{\kappa} \sum_{j=1}^{n} \big( |W_{k,ij}| + |W_{k,ji}| \big) \right), \tag{12}$$

where $W_k$ is the discretized counterpart of $W_t$ during training. Note that the inner sum in (12) runs over all $j$, including $j = i$: when $W_{k,ii} \le 0$, the extra term $2\bar{\kappa}|W_{k,ii}| = -2\bar{\kappa} W_{k,ii}$ is cancelled by the coefficient $2(\underline{\kappa} + \bar{\kappa})$, so the argument of the ReLU equals the negative of the left-hand side of (9).

Remark 5. Similar to the Hamiltonian NODEs in Zakwan et al. (2022), which ensure contractivity by design, in view of Theorem 1 one can parameterize a subset of the weight matrices of the NODE (8) that satisfy the condition (9) by design. The main idea is to modify the diagonal elements of $W_t$ such that the resulting weight matrices $\tilde{W}_t$ satisfy (9) automatically. These matrices can be written as $\tilde{W}_t = W_t + H_t$, where $H_t = \mathrm{diag}(H_{t,1}, \dots, H_{t,n})$ with

$$2\underline{\kappa} H_{t,i} = -\rho - 2\underline{\kappa} W_{t,ii} - \bar{\kappa} \sum_{j=1, j \neq i}^{n} \big( |W_{t,ij}| + |W_{t,ji}| \big) - \tau$$

for any $\tau > 0$.
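The condition (9), the regularizer (12), and the diagonal shift of Remark 5 can be sketched for a single weight matrix as follows. This is an illustrative plain-Python sketch, not the authors' implementation: the function names, the nested-list matrix representation, and the example matrix `W_bad` are assumptions, while `k_lo` and `k_hi` stand for the slope bounds $\underline{\kappa}$ and $\bar{\kappa}$.

```python
def relu(v):
    return max(0.0, v)


def off_diagonal_sum(W, i):
    """Sum over j != i of |W_ij| + |W_ji|."""
    return sum(abs(W[i][j]) + abs(W[j][i]) for j in range(len(W)) if j != i)


def condition_9_margins(W, rho, k_lo, k_hi):
    """Left-hand side of (9) for every row i; (9) holds iff all entries are > 0."""
    return [
        -rho - 2.0 * k_lo * W[i][i] - k_hi * off_diagonal_sum(W, i)
        for i in range(len(W))
    ]


def weight_regularizer(W, rho, k_lo, k_hi):
    """Summand of (12) for one discretized weight matrix W_k."""
    n = len(W)
    total = 0.0
    for i in range(n):
        # The sum runs over all j, including j == i, as in (12).
        full = sum(abs(W[i][j]) + abs(W[j][i]) for j in range(n))
        total += relu(rho + 2.0 * (k_lo + k_hi) * W[i][i] + k_hi * full)
    return total


def shift_diagonal(W, rho, k_lo, k_hi, tau):
    """Remark 5: return W + H, with H diagonal, so that (9) holds with margin tau."""
    shifted = [row[:] for row in W]
    for i in range(len(W)):
        h_i = (-rho - 2.0 * k_lo * W[i][i]
               - k_hi * off_diagonal_sum(W, i) - tau) / (2.0 * k_lo)
        shifted[i][i] = W[i][i] + h_i
    return shifted


# Hypothetical 2x2 example with rho = 2 and slope bounds 0.1 and 1.
W_bad = [[-1.0, 1.0], [0.5, -1.0]]
penalty = weight_regularizer(W_bad, rho=2.0, k_lo=0.1, k_hi=1.0)
W_fixed = shift_diagonal(W_bad, rho=2.0, k_lo=0.1, k_hi=1.0, tau=0.1)
margins = condition_9_margins(W_fixed, rho=2.0, k_lo=0.1, k_hi=1.0)
```

For `W_bad` the condition (9) is violated in both rows and the penalty is positive, whereas the shifted matrix satisfies (9) with margin `tau` and incurs no penalty.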

3.2. EFFICIENT IMPLEMENTATION OF REGULARIZERS FOR CONVOLUTIONAL LAYERS

NNs are widely employed to perform image classification tasks, and convolutional layers have proved to be effective for image processing. However, convolution operations on inputs $x_t$ are usually not given in the form $W_t x_t + b_t$ appearing in (8), which hampers the direct use of the regularizers described in Section 3.1. Although the convolution operation can be represented as $W_t x_t + b_t$ due to its linearity, it might be burdensome to obtain $W_t$. Hence, we propose a new regularizer directly defined on the convolution filters to avoid computing $W_t$.

By construction, the input $x_0$ and the output $x_T$ of the NODE (1) have the same size. Therefore, we consider convolution operations (Goodfellow et al., 2016) that preserve the dimension of the input. Suppose the size of the input $X$ and the output $Y$ of the convolution operation is $D \times P \times H$, where $P$ and $H$ are the width and height, respectively, of the image, and $D$ is the number of channels. Let $X_i$ be the $i$-th channel of the input $X$, and $Y_j$ be the $j$-th channel of the output $Y$. Both channels have size $P \times H$. Furthermore, let the filters of the convolution operations be $C_i^j$, $i, j = 1, \dots, D$, where $C_i^j$ represents the filter map from the $i$-th input channel to the $j$-th output channel. Since inputs and outputs of NODEs have the same size, the convolution operations must satisfy additional conditions. For example, if the filter $C_i^j$ is of size $3 \times 3$, the input size can be preserved by adding a zero-padding of 1 to the input and by applying a stride of 1 (Ciccone et al., 2018). The convolution operation can be written as

$$Y_j = \sum_{i=1}^{D} C_i^j * X_i, \quad \forall j \in \{1, \dots, D\}, \tag{13}$$

where $*$ denotes the convolution operator. Let $\mathrm{Vec}(X)$ be the column vector concatenating the transpose of all the rows of $X_i$ for all $i$. Then, (13) can be written as $\mathrm{Vec}(Y) = W\, \mathrm{Vec}(X)$ for some weight matrix $W \in \mathbb{R}^{n \times n}$, where $n = D \times P \times H$. From (13), we can see that every element of $W$ is a linear function of $C_i^j$. However, computing $W$ from $C_i^j$ can be time-consuming. The following lemma reveals important connections between the matrix $W$ and the filters $C_i^j$ that can be leveraged to directly regularize the filters $C_i^j$ for imposing contractivity.

Lemma 1 (Ciccone et al. (2018)). Suppose the size of $C_i^j$ is $3 \times 3$, and the convolution operation is applied with a zero-padding of 1 and a stride of 1. Then the following results hold.

(1) Let $\{C_d^d\}_{\mathrm{center}}$ denote the center element of $C_d^d$, $d = 1, \dots, D$. Then

$$W_{ii} = \{C_d^d\}_{\mathrm{center}}, \quad i = P \times H \times (d-1) + 1, \dots, P \times H \times d.$$

(2) Let $\{C_j^d\}_{kl}$ denote the $kl$-th element of $C_j^d$. Then, for $i = P \times H \times (d-1) + 1, \dots, P \times H \times d$,

$$\sum_{j=1}^{n} |W_{ij}| \le \sum_{j=1}^{D} \sum_{k,l} |\{C_j^d\}_{kl}|, \qquad \sum_{j=1}^{n} |W_{ji}| \le \sum_{j=1}^{D} \sum_{k,l} |\{C_d^j\}_{kl}|.$$

Remark 6. Although Lemma 1 only considers convolution operations with a filter size $3 \times 3$, a zero-padding of 1 and a stride of 1, the result can also be extended to other convolution operations that preserve the size of the input, for example, the convolution operation with a filter size $5 \times 5$, a zero-padding of 2 and a stride of 1. For more details, please refer to Ciccone et al. (2018).

In view of Lemma 1, for $i = P \times H \times (d-1) + 1, \dots, P \times H \times d$, we have

$$\sum_{j=1, j \neq i}^{n} \big( |W_{ij}| + |W_{ji}| \big) = \sum_{j=1}^{n} \big( |W_{ij}| + |W_{ji}| \big) - 2|W_{ii}| \le \sum_{j=1}^{D} \left( \sum_{k,l} |\{C_j^d\}_{kl}| + \sum_{k,l} |\{C_d^j\}_{kl}| \right) - 2|\{C_d^d\}_{\mathrm{center}}|. \tag{14}$$

Therefore, if the NN in (8) contains a convolutional layer with filters $C_i^j$, one can use the following regularization term

$$\mathrm{reg}\big(\{C_i^j\}_{i,j=1}^{D}\big) = \sum_{d=1}^{D} P \times H \times \mathrm{ReLU}\left( \rho + 2(\underline{\kappa} + \bar{\kappa}) \{C_d^d\}_{\mathrm{center}} + \bar{\kappa} \sum_{j=1}^{D} \left( \sum_{k,l} |\{C_j^d\}_{kl}| + \sum_{k,l} |\{C_d^j\}_{kl}| \right) \right), \tag{15}$$

which is based on the expression of $W_{ii}$ in Lemma 1, the upper bound (14), the contractivity requirement (9), and the constraint $W_{ii} < 0$.

Remark 7. The regularizer (15) includes the coefficient $P \times H$, which is usually very large for image classification tasks. In the experiments of Section 4, we omit the term $P \times H$ in (15) and embed it into the regularization parameter $\gamma$.
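The filter-level regularizer (15) only needs the center elements and the absolute sums of the filters, which the following plain-Python sketch makes explicit. It is an illustration under assumptions: the function names are hypothetical, filters are stored as nested lists with `C[j][i]` holding the square filter $C_i^j$ (input channel `i`, output channel `j`), and the argument `pixels` plays the role of $P \times H$, with the default 1 reflecting Remark 7.

```python
def relu(v):
    return max(0.0, v)


def abs_sum(filt):
    """Sum of absolute values of all entries of one square filter."""
    return sum(abs(v) for row in filt for v in row)


def conv_regularizer(C, rho, k_lo, k_hi, pixels=1):
    """Regularizer (15); C[j][i] is the filter from input channel i to output channel j."""
    D = len(C)
    total = 0.0
    for d in range(D):
        size = len(C[d][d])
        center = C[d][d][size // 2][size // 2]
        # Filters producing output channel d (bound on the row sums of W).
        rows = sum(abs_sum(C[d][j]) for j in range(D))
        # Filters reading input channel d (bound on the column sums of W).
        cols = sum(abs_sum(C[j][d]) for j in range(D))
        total += pixels * relu(
            rho + 2.0 * (k_lo + k_hi) * center + k_hi * (rows + cols)
        )
    return total


# Hypothetical single-channel example with a 3x3 filter.
single = [[[[0.0, 0.5, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.0]]]]
penalty_single = conv_regularizer(single, rho=2.0, k_lo=0.1, k_hi=1.0)

# Two channels with all-zero filters: each channel contributes ReLU(rho).
zero = [[0.0] * 3 for _ in range(3)]
zeros = [[zero, zero], [zero, zero]]
penalty_zeros = conv_regularizer(zeros, rho=2.0, k_lo=0.1, k_hi=1.0)
```

Because only $D^2$ small filters are visited, the cost is independent of the image size, in contrast to forming the $n \times n$ matrix $W$ with $n = D \times P \times H$.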

4. EXPERIMENTS

In this section, first, we compare the pros and cons of the proposed regularizers, and second, we empirically validate the improvement in the robustness of convolutional NODEs against different forms of input noise and adversarial attacks by using the contractivity-promoting regularizers on MNIST and FashionMNIST classification tasks.

4.1. COMPARISONS OF DIFFERENT REGULARIZATION TERMS

In Section 3, we proposed two different approaches for promoting contraction through regularization. The first one exploits the regularization term (7), whereas the second one focuses on the NODEs (8) and utilizes the regularization term (12). Both methods have pros and cons. The regularizer (12) is more computationally efficient, but might affect the representation power of the NODEs. Indeed, (12) aims to make (9), and hence (6), hold for all $x \in \mathbb{R}^n$. In contrast, the regularizer (7) only promotes contractivity at the sampled states $x_k^i$. As a result, the trained NODE is expected to be contractive only in the neighborhood of the points $x_k^i$. This property may be beneficial for some learning problems, as illustrated in the following example.

Consider the learning task shown in Figure 1a, where the goal is to train a NODE to learn a map associating $x_0^1$ to $x_T^1$ and $x_0^2$ to $x_T^2$. Since the distance between $x_0^1$ and $x_0^2$ is smaller than the distance between $x_T^1$ and $x_T^2$, we cannot obtain a satisfactory globally contractive NODE, that is, a NODE satisfying the contractivity condition (6). Instead, good performance can be achieved by a NODE that fulfills the contractivity condition only at the sampled states, as promoted by (7). The learned trajectory and the flow of the NODE trained with the regularization term (7) are shown in Fig. 1b. To demonstrate the contractivity properties of the trained NODE, we sample the blue circles around $x_0^1$ and $x_0^2$ (see Fig. 1b), and plot the sets corresponding to the NODE outputs in red. We can observe that the area of the red regions is smaller than the area of the input circles, which is expected from local contractivity.
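The reason why no globally contractive NODE can solve the task of Figure 1a follows in one line from Definition 1. The short derivation below uses only (5) and the notation of the example.

```latex
% Suppose a NODE satisfying (5) with some rate \rho > 0 mapped
% x_0^1 to x_T^1 and x_0^2 to x_T^2 exactly. Then
\begin{equation*}
\| x_T^1 - x_T^2 \| \;\le\; e^{-\rho T}\, \| x_0^1 - x_0^2 \| \;<\; \| x_0^1 - x_0^2 \| ,
\end{equation*}
% which contradicts the requirement of the task that
% \| x_T^1 - x_T^2 \| > \| x_0^1 - x_0^2 \| .
```

A regularizer acting only at the sampled states does not impose this inequality between different trajectories, which is why it leaves room for fitting the two targets.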

4.2. MNIST AND FASHIONMNIST CLASSIFICATION TASKS

We evaluate the performance of the proposed regularization schemes on image classification tasks for the MNIST and FashionMNIST datasets, which are based on images of size $28 \times 28$. In both cases, we use the NODE (8) with convolutional layers. We train both a vanilla NODE (i.e., using $\gamma = 0$ in (3)) and the NODE with the regularization term (15), which we refer to as contractive NODE (CNODE), for ten different seeds, thereby obtaining 10 versions of each model.

The NODE structure is described as follows, where, unless otherwise specified, the same parameters are used for both the MNIST and the FashionMNIST datasets. First, the image is processed by $h_\alpha(\cdot)$, which is a convolution operation with a filter size $3 \times 3$, a stride of 1, and a channel number of 8 and 16 for the MNIST and FashionMNIST datasets, respectively. Second, it is processed by the NODE (8) for $T = 0.1$, where the NN is also a convolution operation with a filter size $3 \times 3$, a zero-padding of 1, and a stride of 1. We use FE discretization with step size $h = 0.01$ for training the NODEs. Finally, the output of the NODE is followed by a fully connected layer $g_\beta(\cdot)$ with output dimension 10. Due to the smoothness requirement of $f$ in (1) and the slope restrictions, we select the activation function in (8) to be the smooth leaky ReLU function, given by $\sigma(x) = 0.1x + 0.9 \log(1 + e^x)$, which satisfies $0.1 \le \sigma'(\cdot) \le 1$.

We use the Adam optimizer to minimize the cross-entropy loss. The initial learning rate for the Adam optimizer is 0.05, and the learning rate is reduced by a factor of 0.7 after every training epoch. The maximal number of training epochs is 20. For the regularizer (15), we use $\rho = 2$, and the weight $\gamma$ for the regularization term (15) is set to 1. The contraction rate $\rho$ and the regularization parameter $\gamma$ are selected using grid search. We show in Appendix A.2 and Appendix A.3 that the average test accuracy is quite insensitive to the choice of $\rho$ and $\gamma$.
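The slope bounds of the smooth leaky ReLU can be verified directly, since its derivative is $\sigma'(x) = 0.1 + 0.9\, e^x / (1 + e^x)$, that is, 0.1 plus 0.9 times the logistic sigmoid. The sketch below is a small illustration in plain Python; the function names and the evaluation grid are assumptions, and the numerically stable formulas are an implementation choice rather than something prescribed by the paper.

```python
import math


def smooth_leaky_relu(x):
    """sigma(x) = 0.1 * x + 0.9 * log(1 + exp(x)), evaluated stably."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return 0.1 * x + 0.9 * softplus


def smooth_leaky_relu_slope(x):
    """sigma'(x) = 0.1 + 0.9 * sigmoid(x), evaluated stably."""
    if x >= 0:
        sig = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        sig = e / (1.0 + e)
    return 0.1 + 0.9 * sig


# Slopes on a grid from -20 to 20; they stay within the bounds 0.1 and 1.
slopes = [smooth_leaky_relu_slope(0.5 * k) for k in range(-40, 41)]
```

These bounds are what the values $\underline{\kappa} = 0.1$ and $\bar{\kappa} = 1$ correspond to when the regularizer (15) is evaluated for this activation.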
Moreover, we change the convolution parameters in the NODE and repeat the experiment. The results are shown in Appendix A.4, and they imply that, with different convolution parameters, we can still achieve improved robustness performance with contractivity regularization.

We test the performance of the vanilla NODE and CNODE against noisy test datasets, where the images are perturbed by zero-mean Gaussian noise and salt&pepper noise (Schott et al., 2019). For each kind of noise, we generate several noisy test datasets with different noise strengths. Moreover, we test the adversarial robustness of the NODEs with respect to fast-gradient-sign-method (FGSM) (Goodfellow et al., 2014) and projected gradient descent (PGD) attacks (Madry et al., 2017). Tables 3 and 4 summarize the mean and standard deviations of the classification accuracy over all test sets. In Table 3, $\sigma$ is the standard deviation of the Gaussian noise and $\epsilon$ denotes the proportion of image pixels corrupted by the impulse noise. The results of robustness against adversarial attacks are reported in Table 4, where $\delta$ represents the $\ell_\infty$ amplitude of perturbations in FGSM and PGD attacks. The best performance in each column appears in bold. To give an idea of the intensity of perturbations, we provide samples of test images in Appendix A.1.
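The two noise models can be sketched as follows. This is an illustration under explicit assumptions rather than the corruption pipeline used in the experiments: pixel values are assumed to lie in $[0, 1]$, Gaussian-perturbed pixels are clipped back to that range, and salt&pepper noise is modelled by replacing each pixel independently with probability $\epsilon$ by 0 or 1 with equal chance. The function names are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel.

    Assumption: pixels live in [0, 1] and are clipped after perturbation.
    """
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def add_salt_and_pepper(image, eps, rng):
    """Replace each pixel, with probability eps, by 0.0 (pepper) or 1.0 (salt).

    Assumption: pixels are corrupted independently, so eps is the expected
    proportion of corrupted pixels.
    """
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            if rng.random() < eps:
                new_row.append(1.0 if rng.random() < 0.5 else 0.0)
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


rng = random.Random(0)  # fixed seed for reproducibility
image = [[0.5] * 28 for _ in range(28)]  # placeholder 28 x 28 grey image
gaussian = add_gaussian_noise(image, sigma=0.1, rng=rng)
impulse = add_salt_and_pepper(image, eps=0.2, rng=rng)
```

Setting `sigma=0` or `eps=0` returns the clean image, which matches the "no noise" column $\sigma = \epsilon = 0$ of the results table.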

Table 1: Classification accuracy over noisy test images (mean ± standard deviation).

                No Noise    Gaussian                        Salt&Pepper
MNIST           σ = ϵ = 0   σ = 0.1   σ = 0.2   σ = 0.3     ϵ = 0.1   ϵ = 0.2   ϵ = 0.3
Vanilla NODE    98±0.3      65±23     45±21     37±16       76±9      54±11     42±8
CNODE           98±0.1      94±4      79±8      62±12       88±4      68±8      48±8

FashionMNIST    σ = ϵ = 0   σ = 0.1   σ = 0.2   σ = 0.3     ϵ = 0.1   ϵ = 0.2   ϵ = 0.3
Vanilla NODE    88±0.1      75±4      47±4      35±4        69±2      51±4      38±5
CNODE           88±0.2      85±1      72±2      55±4        75±2      57±5      42±5

From the tables, we can observe that the CNODEs achieve higher mean classification accuracy than the vanilla NODEs in the presence of image perturbations. In some cases, the performance improvements are very significant (up to 34% for the case of Gaussian noise). Moreover, the standard deviations with CNODEs are either the same or less than those with vanilla NODEs in almost all the experiments, which means that CNODEs are less sensitive than vanilla NODEs to the selection of initialization seeds.

A.2 CONTRACTION RATE VS CLASSIFICATION ACCURACY

In this appendix, we analyze how the contraction rate affects the classification accuracy. For this purpose, we use the MNIST dataset and images perturbed by Gaussian noise or FGSM attacks. We use contraction rates $\rho$ in the set {0.1, 2, 5, 7, 10, 12, 15}, train the CNODEs, and obtain 10 models for each $\rho$ by using different seeds. Then, we calculate the classification accuracy of these models on the clean test dataset, the test dataset perturbed by Gaussian noise, and the test dataset attacked by FGSM. The mean and the standard deviations of the classification accuracy are plotted in Figure 3, Figure 4, and Figure 5, where the solid line represents the mean and the shaded region spans one standard deviation on each side of the mean. We can observe that the average classification accuracy does not vary significantly for different contraction rates. This suggests that the choice of the contraction rate is not critical for the MNIST experiments discussed in Section 4.

A.3 REGULARIZER WEIGHT VS CLASSIFICATION ACCURACY

In this appendix, we analyze how the regularizer weight affects the classification accuracy. For this purpose, we use the MNIST dataset and images perturbed by Gaussian noises or FGSM attacks. We use regularization parameter γ in the set {0.1, 1, 5, 10, 20, 30, 40, 50}, train the CNODEs, and obtain 10 models for each γ by using different seeds. Then, we calculate the classification accuracy of these models on the clean test dataset, the test dataset perturbed by Gaussian noises, and the test dataset attacked by FGSM. The mean and the standard deviations of the classification accuracy are plotted in Figure 6 , Figure 7 , and Figure 8 , where the solid line represents the mean and the shaded region spans one standard deviation on each side of the mean. We can observe that the average classification accuracy does not vary significantly with different values of γ. Therefore, the average robustness performance does not rely heavily on the selection of the regularization parameter γ in the MNIST experiment in Section 4. 



For simplicity, we assume that $T/h$ is an integer.



Learned NODE trajectory and its flow.

Figure 1: Example of a simple NODE trained with the contractivity-promoting regularizer (7).


FashionMNIST samples perturbed by PGD attacks.

Figure 2: Examples of perturbed images in MNIST and FashionMNIST classification tasks

Figure 3: Classification accuracy on the clean test dataset with respect to different contraction rates ρ.

Figure 4: Classification accuracy on test dataset perturbed by Gaussian noise with respect to different contraction rates ρ.

Figure 5: Classification accuracy on test dataset perturbed by FGSM attacks with respect to different contraction rates ρ.

Figure 6: Classification accuracy on the clean test dataset with respect to different regularization parameters γ.

[Table header residue: perturbation amplitudes δ = 0.01, 0.02, 0.03, repeated; the table entries are not recoverable from the source.]

Classification accuracy over adversarial attacks (mean ± standard deviation). almost all the experiments, which means, CNODEs are less sensitive than vanilla NODEs to the selection of initialization seeds.


5. CONCLUSIONS

In this paper, we use contraction from dynamical system theory to improve the robustness of NODEs. We propose regularizers with different degrees of flexibility and different computational requirements to promote contractivity. The good performance of the resulting NNs is illustrated on image classification tasks. Future work will focus on the development of easy-to-compute regularizers for classes of NODEs stemming from specific choices of f in (1).

A.4 CONVOLUTION PARAMETERS VS CLASSIFICATION ACCURACY

In this appendix, we perform an ablation study by selecting different convolution parameters and demonstrate that CNODEs can still achieve improved robustness performance. We set the parameters of the convolution operation in the NODE to the following two groups:

• Group 1: a filter size 5 × 5, a zero-padding of 2, and a stride of 1.

• Group 2: a filter size 7 × 7, a zero-padding of 3, and a stride of 1.

Then we repeat the experiment. The maximal number of training epochs is 30. The other training parameters are the same as those in Section 4.2. The average test accuracy and the standard deviation data are shown in the following table, where CNODE(5) and CNODE(7) represent the CNODE with convolution parameter Group 1 and Group 2, respectively. We can observe that, with different convolution parameters, the CNODE can still achieve improved performance.
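Both parameter groups preserve the image size, which is what the NODE requires. This can be checked with the standard output-size formula for a convolution along one dimension, $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$. The helper name and the dictionary labels below are illustrative assumptions.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel size, zero-padding) for the setting of Section 4.2 and the two groups.
settings = {"Section 4.2": (3, 1), "Group 1": (5, 2), "Group 2": (7, 3)}
sizes = {
    name: conv_output_size(28, kernel, padding)
    for name, (kernel, padding) in settings.items()
}
```

For 28 × 28 inputs, every setting returns 28, so the state of the NODE keeps the same dimension at every layer.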

