FINDE: NEURAL DIFFERENTIAL EQUATIONS FOR FINDING AND PRESERVING INVARIANT QUANTITIES

Abstract

Many real-world dynamical systems are associated with first integrals (a.k.a. invariant quantities), which are quantities that remain unchanged over time. The discovery and understanding of first integrals are fundamental and important topics both in the natural sciences and in industrial applications. First integrals arise from the conservation laws of system energy, momentum, and mass, and from constraints on states; these are typically related to specific geometric structures of the governing equations. Existing neural networks designed to ensure such first integrals have shown excellent accuracy in modeling from data. However, these models incorporate the underlying structures, and in most situations where neural networks learn unknown systems, these structures are also unknown. This limitation needs to be overcome for scientific discovery and modeling of unknown systems. To this end, we propose first integral-preserving neural differential equation (FINDE). By leveraging the projection method and the discrete gradient method, FINDE finds and preserves first integrals from data, even in the absence of prior knowledge about underlying structures. Experimental results demonstrate that FINDE can predict future states of target systems much longer and find various quantities consistent with well-known first integrals in a unified manner.

1. INTRODUCTION

Modeling and predicting real-world systems are fundamental aspects of understanding the world in natural science and improving computer simulations in industry. Target systems include chemical dynamics for discovering new drugs (Raff et al., 2012) , climate dynamics for climate change prediction and weather forecasting (Rasp et al., 2020; Trigo & Palutikof, 1999) , and physical dynamics of vehicles and robots for optimal control (Nelles, 2001) . In addition to image processing and natural language processing (Devlin et al., 2018; He et al., 2016) , neural networks have been actively studied for modeling dynamical systems (Nelles, 2001) . Their history dates back to at least the 1990s (see Chen et al. (1990) ; Clouse et al. (1997) ; Levin & Narendra (1995) ; Narendra & Parthasarathy (1990) ; Sjöberg et al. (1994) ; Wang & Lin (1998) for examples). Recently, two notable but distinct families have been proposed. Physics-informed neural networks (PINNs) directly solve partial differential equations (PDEs) given as symbolic equations (Raissi et al., 2019) . Neural ordinary differential equations (NODEs) learn ordinary differential equations (ODEs) from observed data and solve them using numerical integrators (Chen et al., 2018) . Our focus this time is on NODEs. Most real-world systems are associated with first integrals (a.k.a. invariant quantities), which are quantities that remain unchanged over time (Hairer et al., 2006) . First integrals arise from intrinsic geometric structures of systems and are sometimes more important than superficial dynamics in understanding systems (see Appendix A for details). Many previous studies have extended NODEs by incorporating prior knowledge about first integrals and attempted to accurately learn a target system. Greydanus et al. (2019) proposed the Hamiltonian neural network (HNN), which employs a neural network to approximate Hamilton's equation, thereby conserving the system energy called the Hamiltonian. Finzi et al. 
(2020a) proposed neural network architectures that conserve linear and angular momenta by utilizing the graph structure. Finzi et al. (2020b) also extended an HNN to a system with holonomic constraints, which lead to first integrals such as a pendulum length. Matsubara et al. (2020) proposed a model that preserves the total mass of a discretized PDE. These studies have demonstrated that the more prior knowledge a neural network has about first integrals, the more accurate its dynamics prediction. See Table 1 for comparisons.

[Table 1: Comparison of NODE (Chen et al., 2018), HNN (Greydanus et al., 2019), LieConv (Finzi et al., 2020a), DGNet (Matsubara et al., 2020), CHNN (Finzi et al., 2020b), NPM (Yang et al., 2020), continuous FINDE (proposed), and discrete FINDE (proposed).]

Previous studies have mainly attempted to preserve known first integrals for better computer simulations. However, in situations where a neural network learns a target system, it is naturally expected that first integrals associated with the target system are unknown, and it is not clear which of the above methods are applicable. Therefore, this study proposes the first integral-preserving neural differential equation (FINDE) to find and preserve unknown first integrals from data in a unified manner. FINDE has two versions, for continuous and discrete time; these have the following advantages.

Finding First Integrals Many studies have designed architectures or operations of neural networks to model continuous-time dynamics with known types of first integrals. However, the underlying geometric structures of a target system are generally unknown in practice. In contrast, FINDE finds various types of first integrals from data in a unified manner and preserves them in predictions. For example, from an energy-dissipating system, FINDE can find first integrals other than energy. FINDE can find not only known first integrals, but also unknown ones. Hence, FINDE can lead to scientific discoveries.

Combination with Known First Integrals FINDE can be combined with previously proposed neural networks designed to preserve known first integrals, such as HNNs. In addition, when some first integrals are known in advance, they can also be incorporated into FINDE to avoid rediscovery. Therefore, FINDE is available in various situations.

Exact Preservation of First Integrals

The first integral associated with a continuous-time system is destroyed after the dynamics is temporally discretized for computer simulations. By leveraging the discrete gradient, the discrete-time version of FINDE preserves first integrals exactly (up to rounding errors) in discrete time and further improves the prediction performance.

2. BACKGROUND AND RELATED WORK

First Integrals Let us consider a time-invariant differential system $\frac{d}{dt}u = f(u)$ on an $N$-dimensional manifold $\mathcal{M}$, where $u$ denotes the system state and $f : \mathcal{M} \to T_u\mathcal{M}$ represents a vector field on $\mathcal{M}$. For simplicity, we suppose the manifold $\mathcal{M}$ to be a Euclidean space $\mathbb{R}^N$.

Definition 1 (first integral). A quantity $V : \mathcal{M} \to \mathbb{R}$ is referred to as a first integral of a system $\frac{d}{dt}u = f(u)$ if it remains constant along any solution $u(t)$, i.e., $\frac{d}{dt}V(u) = 0$.

If a differential system $\frac{d}{dt}u = f(u)$ has $K$ functionally independent first integrals $V_1, \dots, V_K$, the solution $u(t)$ given an initial value $u_0$ stays on the $(N-K)$-dimensional submanifold

$$\mathcal{M}' = \{u \in \mathcal{M} : V_1(u) = V_1(u_0), \dots, V_K(u) = V_K(u_0)\}. \tag{1}$$

The tangent space $T_u\mathcal{M}' \subset T_u\mathcal{M}$ of the submanifold $\mathcal{M}' \subset \mathcal{M}$ at a point $u$ is the orthogonal complement to the space spanned by the gradients $\nabla V_k(u)$ of the first integrals $V_k$ for $k = 1, \dots, K$;

$$T_u\mathcal{M}' = \{w \in T_u\mathcal{M} : \nabla V_k(u)^\top w = 0 \text{ for } k = 1, \dots, K\}. \tag{2}$$

Conversely, if the time-derivative $f$ at point $u$ is on the tangent space $T_u\mathcal{M}'$ for certain functions $V_k$, the quantities $V_k$ are first integrals of the system $\frac{d}{dt}u = f(u)$; it holds that $\frac{d}{dt}V_k(u) = \nabla V_k(u)^\top \frac{d}{dt}u = \nabla V_k(u)^\top f(u) = 0$.

One of the most well-known first integrals is the Hamiltonian $H$, which represents the system energy of a Hamiltonian system. Noether's theorem states that a continuous symmetry of a system leads to a conservation law (and hence a first integral) (Hairer et al., 2006). A Hamiltonian system is symmetric to translation in time, and the corresponding first integral is the Hamiltonian. Symmetries to translation and rotation in space lead to the conservation of linear and angular momenta. However, not all first integrals are related to symmetries. A pendulum can be expressed in Cartesian coordinates, and then the rod length constrains the mass position. This type of constraint is called a holonomic constraint and leads to first integrals.
Models of disease spread and chemical reactions have the total mass (population) as the first integral. Also, for a system described by a PDE, the total mass is sometimes a first integral (Furihata & Matsuo, 2010). See Appendix A for the classes of dynamics, their geometric structures, and related studies to find or preserve first integrals.

First Integrals in Numerical Analysis For computer simulations, differential systems are discretized in time and solved by numerical integration, causing numerical errors (which are composed of temporal discretization errors and rounding errors). Moreover, the geometric structures of the system are often destroyed, and the corresponding first integrals are no longer preserved. A common remedy is a symplectic integrator, which preserves the symplectic structure and accurately integrates Hamiltonian systems (Hairer et al., 2006). However, the Ge-Marsden theorem states that a symplectic integrator only approximately conserves the Hamiltonian (Zhong & Marsden, 1988). Hence, many numerical schemes have also been investigated to preserve first integrals exactly, although these schemes cannot preserve the symplectic structure. Some examples are shown below.

Let the superscript $s \in \{0, 1, \dots, S\}$ denote the state $u^s$ or time $t^s$ at the $s$-th time step, and let $\Delta t^s = t^{s+1} - t^s$ denote a time-step size. A projection method uses a numerical integrator to predict the next state $\tilde{u}^{s+1}$ from the current state $u^s$ and then projects the state $\tilde{u}^{s+1}$ onto the submanifold $\mathcal{M}'$ (Gear, 1986; Hairer et al., 2006, Section IV.4). The projected state $u^{s+1}$ preserves the first integrals $V_k$. In particular, the projected state $u^{s+1}$ is obtained by solving the optimization problem

$$\arg\min_{u^{s+1}} \|u^{s+1} - \tilde{u}^{s+1}\| \quad \text{subject to } V_k(u^{s+1}) - V_k(u^s) = 0 \text{ for } k = 1, \dots, K. \tag{3}$$

The local coordinate method defines a coordinate system on the neighborhood of the current state u s and integrates a differential equation on it (Potra & Yen, 1991; Hairer et al., 2006, Section IV.5 ). The discrete gradient method defines a discrete analogue to a differential system and integrates it in discrete time, thereby preserving the Hamiltonian exactly (up to rounding errors) in discrete time (Furihata & Matsuo, 2010; Gonzalez, 1996; Hong et al., 2011) . Neural Networks to Preserve First Integrals NODE defines the right-hand side f of a differential system d dt u = f (u) using a neural network in the most general way with no associated first integrals (Chen et al., 2018) . NODE is a universal approximator to ODEs and can approximate any ODE with arbitrary accuracy if there is an infinite amount of training data (Teshima et al., 2020) . In practice, the amount of training data is limited, and prior knowledge about the target system is helpful for learning (see Sannai et al. (2021) for the case with convolutional neural networks (CNNs)). HNN (Greydanus et al., 2019 ) assumes the target system to be a Hamiltonian system in the canonical form, thereby guaranteeing various properties of Hamiltonian systems by definition, including the conservation of energy and preservation of the symplectic structure in continuous time (Hairer et al., 2006) . Some studies have employed a symplectic integrator for HNN to preserve the energy and symplectic structure with smaller numerical errors (Chen et al., 2020) . LieConv and EMLP-HNN employ neural network architectures with translational and rotational symmetries to preserve momenta (Finzi et al., 2020a; 2021) . CHNN incorporates a known holonomic constraint in the dynamics (Finzi et al., 2020b) . Deep conservation extracts latent dynamics of a PDE system and preserves a quantity of interest by forcing its flux to be zero (Lee & Carlberg, 2021) . 
HNN++ also guarantees the conservation of mass in PDE systems by using a coefficient matrix derived from differential operators (Matsubara et al., 2020) . These methods preserve known types of first integrals and suffer from temporal discretization errors. In contrast, FINDE learns any types of first integrals from data and preserves them even after temporal discretization. The neural projection method (NPM) learns fixed holonomic constraints using the projection (and inequality constraints) (Yang et al., 2020) . DGNet employed discrete gradient methods to guarantee the energy conservation in Hamiltonian systems (and the energy dissipation in friction systems) (Matsubara et al., 2020) . While these methods preserve the aforementioned first integrals exactly in discrete time, their formulations are not available for other first integrals. Several studies have proposed neural networks to learn Lyapunov functions, which are expected to be non-increasing over time, in contrast to first integrals (Manek & Kolter, 2019; Takeishi & Kawahara, 2020) . If the state moves in the direction of increasing the function, it is projected onto or moved inside the contour line of the Lyapunov function. This concept is similar to that of the continuous-time version of FINDE but focuses on a single non-increasing quantity in continuous time; FINDE preserves multiple quantities in both continuous and discrete time.
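
To make the projection method of Eq. (3) concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses a pendulum with a known energy as the single first integral, an explicit Euler step as the numerical integrator, and the common simplification of moving the predicted state along the gradient evaluated at the predicted state, with the scalar Lagrange multiplier found by a simplified Newton iteration. The names `project` and `simulate` are hypothetical, and the target value is fixed to the initial energy, which is equivalent to the constraint in Eq. (3) up to rounding.

```python
import math

def V(u):
    """Pendulum energy, used here as the (known) first integral."""
    q, p = u
    return 0.5 * p * p - math.cos(q)

def grad_V(u):
    q, p = u
    return (math.sin(q), p)

def f(u):
    """Pendulum vector field: dq/dt = p, dp/dt = -sin(q)."""
    q, p = u
    return (p, -math.sin(q))

def euler_step(u, dt):
    du = f(u)
    return (u[0] + dt * du[0], u[1] + dt * du[1])

def project(u_tilde, v_target, iters=20):
    """Move u_tilde along grad V(u_tilde) until V equals v_target.

    Simplified Newton iteration on the scalar Lagrange multiplier lam.
    """
    g = grad_V(u_tilde)
    gg = g[0] * g[0] + g[1] * g[1]
    lam = 0.0
    for _ in range(iters):
        u = (u_tilde[0] + lam * g[0], u_tilde[1] + lam * g[1])
        lam -= (V(u) - v_target) / gg
    return (u_tilde[0] + lam * g[0], u_tilde[1] + lam * g[1])

def simulate(u0, dt, steps, use_projection):
    """Return the largest deviation of V from its initial value."""
    u, v0, drift = u0, V(u0), 0.0
    for _ in range(steps):
        u = euler_step(u, dt)
        if use_projection:
            u = project(u, v0)
        drift = max(drift, abs(V(u) - v0))
    return drift

drift_plain = simulate((1.0, 0.0), 0.1, 100, use_projection=False)
drift_projected = simulate((1.0, 0.0), 0.1, 100, use_projection=True)
```

In this sketch the plain Euler trajectory drifts away from the initial energy level, whereas the projected trajectory stays on it up to rounding errors. The first integral must be known in advance here, which is the limitation FINDE is designed to remove.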

3. FIRST INTEGRAL-PRESERVING NEURAL DIFFERENTIAL EQUATION

We suppose that a target system has at least $K$ unknown functionally independent first integrals. When a neural network learns the dynamics of the target system, it is not guaranteed to learn these first integrals. We suppose that a certain neural network $\hat{f}$ for modeling the target dynamics is given, and in addition to this model $\hat{f}$, we introduce a neural network that outputs a $K$-dimensional vector $V(u) = (V_1(u)\ V_2(u)\ \dots\ V_K(u))^\top$. Each element is expected to learn one of the first integrals as $V_k : \mathbb{R}^N \to \mathbb{R}$ for $k = 1, \dots, K$. Then, the submanifold $\mathcal{M}'$ is defined as in Eq. (1).

3.1. CONTINUOUS FINDE: TIME-DERIVATIVE PROJECTION METHOD

We propose a time-derivative projection method called continuous FINDE (cFINDE). The cFINDE projects the time-derivative onto the tangent space $T_u\mathcal{M}'$. Roughly speaking, the cFINDE projects the dynamics onto the space of directions in which the first integrals do not change. In this way, the method can learn dynamics while preserving first integrals $V$, thereby finding unknown first integrals from data.

We refer to the neural network that defines the time-derivative $\hat{f} : \mathbb{R}^N \to \mathbb{R}^N$ as the base model. Applying the method of Lagrange multipliers to the projection method in Eq. (3), and taking the limit as the time-step size approaches zero, we have

$$\frac{d}{dt}u = f(u), \quad f(u) = \hat{f}(u) - M(u)^\top \lambda(u), \quad \frac{d}{dt}V(u) = 0,$$

where $M = \frac{\partial V}{\partial u}$ and $\lambda \in \mathbb{R}^K$ is the Lagrange multiplier (see Appendix B.1 for detailed derivation). We transform the constraint $\frac{d}{dt}V(u) = 0$ to obtain

$$0 = \frac{d}{dt}V(u(t)) = \frac{\partial V}{\partial u}\frac{d}{dt}u = M(u) f(u) = M(u)\left(\hat{f}(u) - M(u)^\top \lambda(u)\right),$$

from which we obtain the Lagrange multiplier $\lambda(u) = (M(u)M(u)^\top)^{-1} M(u)\hat{f}(u)$. By eliminating $\lambda(u)$, we define the cFINDE as

$$\frac{d}{dt}u = f(u) = (I - Y(u))\hat{f}(u) \quad \text{for } Y(u) = M(u)^\top (M(u)M(u)^\top)^{-1} M(u). \tag{6}$$

Theorem 1 (continuous-time first integral preservation). The cFINDE $\frac{d}{dt}u = f(u)$ preserves all first integrals $V_k$ for $k = 1, \dots, K$ in continuous time, that is, $\frac{d}{dt}V_k = 0$.

See Appendix B.1 for proof. The base model $\hat{f}$ can be a NODE, an HNN, or any other model depending on the available prior knowledge. Additionally, if a first integral is already known, it can be directly used as one of the first integrals $V_k$ instead of being found by the neural network. Note that even if the base model $\hat{f}$ is an HNN, due to the projection, the cFINDE $f$ is no longer a Hamiltonian system in the strict sense. Compared to the base model $\hat{f}$, the cFINDE requires the additional computation of the neural network $V$, several matrix multiplications, and an inverse operation.
The inverse operation has a computational cost of $O(K^3)$, which is not costly if the number $K$ of first integrals is small. Many previous models also need the inverse operation to satisfy the constraints and geometric structures, such as the Lagrangian neural network (LNN) (Cranmer et al., 2020), neural symplectic form (Chen et al., 2021), and CHNN (Finzi et al., 2020b).
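
The projection in Eq. (6) can be sketched in a few lines. The example below is an illustration under assumptions rather than the released implementation: the first integrals are hand-written energies of two uncoupled oscillators instead of a neural network, the base vector field `f_hat` is an arbitrary perturbed field, and `solve`, `cfinde_field`, and `jac_V` are hypothetical helper names. The linear solve replaces the explicit inverse of the $K \times K$ matrix.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve a small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def cfinde_field(f_hat, jac_V, u):
    """Return (I - Y(u)) f_hat(u) with Y = M^T (M M^T)^{-1} M."""
    fh = f_hat(u)                                   # base time-derivative, length N
    M = jac_V(u)                                    # K x N Jacobian of V
    MMt = [[dot(r1, r2) for r2 in M] for r1 in M]   # K x K Gram matrix
    lam = solve(MMt, [dot(r, fh) for r in M])       # Lagrange multiplier
    return [fh[i] - sum(M[k][i] * lam[k] for k in range(len(M)))
            for i in range(len(u))]

# State u = (q1, p1, q2, p2); assumed first integrals V1 = (q1^2 + p1^2) / 2
# and V2 = (q2^2 + p2^2) / 2, so the rows of M are their gradients.
def jac_V(u):
    q1, p1, q2, p2 = u
    return [[q1, p1, 0.0, 0.0], [0.0, 0.0, q2, p2]]

# A deliberately imperfect base model that does not preserve V1 or V2.
def f_hat(u):
    q1, p1, q2, p2 = u
    return [p1 + 0.1 * q1, -q1 + 0.05 * q2, p2, -q2 + 0.2 * p2 * p1]

u = [1.0, 0.5, -0.3, 0.8]
f_proj = cfinde_field(f_hat, jac_V, u)
residual_base = [dot(row, f_hat(u)) for row in jac_V(u)]   # dV/dt under f_hat
residual_proj = [dot(row, f_proj) for row in jac_V(u)]     # dV/dt under cFINDE
```

The entries of `residual_proj` vanish up to rounding errors, which is the statement of Theorem 1 at a single point, while `residual_base` is visibly nonzero. In FINDE the Jacobian would come from automatic differentiation of the learned network $V$.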

3.2. DISCRETE FINDE: DISCRETE-TIME DERIVATIVE PROJECTION METHOD

The cFINDE is still an ODE and hence needs to be solved using a numerical integrator, which causes temporal discretization errors in the first integrals. In order to eliminate these errors, it is necessary to constrain the destination (i.e., finite difference) rather than the direction (i.e., time-derivative). For this purpose, we propose discrete FINDE (dFINDE) by employing discrete gradients to define discrete tangent spaces, which are needed to constrain the state variables on the submanifold $\mathcal{M}'$.

A discrete gradient $\overline{\nabla}V$ is a discrete analogue to a gradient $\nabla V$ (Furihata & Matsuo, 2010; Gonzalez, 1996; Hong et al., 2011). Recall that a gradient $\nabla V$ of a function $V : \mathbb{R}^N \to \mathbb{R}$ can be regarded as a function $\mathbb{R}^N \to \mathbb{R}^N$ that satisfies the chain rule $\frac{d}{dt}V(u) = \nabla V(u)^\top \frac{d}{dt}u$. Analogously, a discrete gradient $\overline{\nabla}V$ is defined as follows:

Definition 2 (discrete gradient). A discrete gradient $\overline{\nabla}V$ of a function $V : \mathbb{R}^N \to \mathbb{R}$ is a function $\mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}^N$ that satisfies $V(v) - V(u) = \overline{\nabla}V(v, u)^\top (v - u)$ and $\overline{\nabla}V(u, u) = \nabla V(u)$.

The first condition is a discrete analogue to the chain rule when replacing the time-derivatives $\frac{d}{dt}V$ and $\frac{d}{dt}u$ with finite differences $(V(v) - V(u))$ and $(v - u)$, respectively, and the second condition ensures consistency with the ordinary gradient $\nabla V$. A discrete gradient $\overline{\nabla}V$ is not uniquely determined and has conventionally been obtained manually. Recently, the automatic discrete differentiation algorithm (ADDA) has been proposed by Matsubara et al. (2020), which obtains a discrete gradient of a neural network in a manner similar to the automatic differentiation algorithm (Abadi et al., 2016; Paszke et al., 2017). The discrete gradient is defined in discrete time; hence, the prediction using the discrete gradient is free from temporal discretization errors. See Appendix B.2 and the references Furihata & Matsuo (2010); Matsubara et al. (2020) for more details.

Following Christiansen et al. (2011); Dahlby et al. (2011), we introduce a discrete analogue to the tangent space $T_u\mathcal{M}'$ called the discrete tangent space $\overline{T}_{(v,u)}\mathcal{M}'$. In particular, for a pair of points $(v, u)$ on $\mathcal{M}'$, the discrete tangent space is defined as

$$\overline{T}_{(v,u)}\mathcal{M}' = \{w \in \mathbb{R}^N : \overline{\nabla}V_k(v, u)^\top w = 0 \text{ for } k = 1, \dots, K\}.$$

If the finite difference $(u^{s+1} - u^s)$ between the predicted and current states is on the discrete tangent space $\overline{T}_{(u^{s+1},u^s)}\mathcal{M}'$, the first integrals $V_k$ are preserved because $V_k(u^{s+1}) - V_k(u^s) = \overline{\nabla}V_k(u^{s+1}, u^s)^\top (u^{s+1} - u^s) = 0$. Note that similar concepts defined in different ways are also referred to as discrete tangent spaces (Cuell & Patrick, 2009; Dehmamy et al., 2021).

We suppose that a neural network (e.g., NODE) $\hat{f}$ defines an ODE and a numerical integrator predicts the next state $\tilde{u}^{s+1}$ from a given state $u^s$. We call this process a discrete-time base model $\hat{\psi}$, which satisfies $\frac{\tilde{u}^{s+1} - u^s}{\Delta t^s} = \hat{\psi}(u^s; \Delta t^s)$. Subsequently, we consider the model

$$\frac{u^{s+1} - u^s}{\Delta t^s} = \psi(u^{s+1}, u^s; \Delta t^s), \quad \psi(u^{s+1}, u^s; \Delta t^s) = \hat{\psi}(u^s; \Delta t^s) - \overline{M}(u^{s+1}, u^s)^\top \lambda(u^{s+1}, u^s), \quad V(u^{s+1}) - V(u^s) = 0, \tag{8}$$

where $\overline{M}(u^{s+1}, u^s) = (\overline{\nabla}V_1(u^{s+1}, u^s)\ \dots\ \overline{\nabla}V_K(u^{s+1}, u^s))^\top$. As shown in Appendix B.1, this formulation is also derived from the projection method in Eq. (3). Using the chain rule of the discrete gradient, we have

$$0 = \frac{V(u^{s+1}) - V(u^s)}{\Delta t^s} = \overline{M}(u^{s+1}, u^s)\frac{u^{s+1} - u^s}{\Delta t^s} = \overline{M}(u^{s+1}, u^s)\,\psi(u^{s+1}, u^s; \Delta t^s).$$

Substituting this into Eq. (8) and eliminating the Lagrange multiplier $\lambda$, we define the dFINDE as

$$\frac{u^{s+1} - u^s}{\Delta t^s} = \psi(u^{s+1}, u^s; \Delta t^s) = (I - \overline{Y}(u^{s+1}, u^s))\,\hat{\psi}(u^s; \Delta t^s) \quad \text{for } \overline{Y} = \overline{M}^\top (\overline{M}\,\overline{M}^\top)^{-1} \overline{M}, \tag{10}$$

where we have abbreviated $\overline{M}(u^{s+1}, u^s)$ and $\overline{Y}(u^{s+1}, u^s)$ to $\overline{M}$ and $\overline{Y}$, respectively.

Theorem 2 (discrete-time first integral preservation). The dFINDE $\frac{u^{s+1} - u^s}{\Delta t^s} = \psi(u^{s+1}, u^s; \Delta t^s)$ preserves all first integrals $V_k$ for $k = 1, \dots, K$ in discrete time, that is, $V_k(u^{s+1}) - V_k(u^s) = 0$.

See Appendix B.1 for proof.
Intuitively, dFINDE projects the finite difference (discrete-time derivative) $\hat{\psi}$ onto the discrete tangent space $\overline{T}_{(u^{s+1},u^s)}\mathcal{M}'$ after the numerical integration for each step, whereas cFINDE projects the time-derivative $\hat{f}$ onto the tangent space $T_u\mathcal{M}'$ at every substep inside a numerical integrator. In the discrete-time base model $\hat{\psi}$, the ODE $\hat{f}$ can be defined by any model, such as NODE or HNN, and the numerical integrator can be implemented by any method, such as the Runge-Kutta method or the leapfrog integrator. The projection method in Eq. (3), the method in Eq. (8), and the dFINDE in Eq. (10) are implicit methods and hence relatively computationally expensive. However, only the dFINDE can be trained non-iteratively by standard backpropagation algorithms. As explained in Appendix B.3, this is because the next state $u^{s+1}$ is given during training and the ADDA can explicitly obtain the discrete gradient and its computational graph.
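
To illustrate Definition 2, the sketch below checks both conditions for one particular discrete gradient. It is an assumption-laden example: it uses the midpoint discrete gradient of Gonzalez (1996) applied to a hand-written pendulum energy, not the ADDA applied to a neural network as in FINDE, and the function names are hypothetical.

```python
import math

def V(u):
    """Pendulum energy, standing in for a learned first integral."""
    q, p = u
    return 0.5 * p * p - math.cos(q)

def grad_V(u):
    q, p = u
    return [math.sin(q), p]

def discrete_gradient(v, u):
    """Midpoint discrete gradient (Gonzalez, 1996) of V at the pair (v, u)."""
    diff = [vi - ui for vi, ui in zip(v, u)]
    norm2 = sum(d * d for d in diff)
    if norm2 == 0.0:
        # Consistency condition: reduces to the ordinary gradient when v == u.
        return grad_V(u)
    mid = [(vi + ui) / 2 for vi, ui in zip(v, u)]
    g = grad_V(mid)
    # Correct the midpoint gradient along (v - u) so the discrete chain rule is exact.
    correction = (V(v) - V(u) - sum(gi * di for gi, di in zip(g, diff))) / norm2
    return [gi + correction * di for gi, di in zip(g, diff)]

u = [1.0, 0.0]
v = [0.7, -0.4]
dg = discrete_gradient(v, u)
lhs = V(v) - V(u)                                          # finite difference of V
rhs = sum(gi * (vi - ui) for gi, vi, ui in zip(dg, v, u))  # discrete chain rule
```

Here `lhs` and `rhs` agree up to rounding errors for any pair of points, not only for nearby ones, which is what lets dFINDE constrain the finite difference between two states rather than the direction at a single state.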

4.1. EXPERIMENTAL SETTINGS

Target Systems We evaluated FINDE and base models using datasets associated with first integrals; these are summarized in Table 2. A gravitational two-body problem (2-body) on a 2-dimensional configuration space is a typical Hamiltonian system in the canonical form. In addition to the total energy, the system has first integrals related to symmetries in space, namely, the linear and angular momenta. The Korteweg-de Vries (KdV) equation is a PDE model of shallow water waves. This equation is a Hamiltonian system in a non-canonical form and has the Hamiltonian, total mass, and many other quantities as first integrals. We discretized the KdV equation in space, obtaining a fifty-dimensional state $u$. A double pendulum (2-pend) is a Hamiltonian system in polar coordinates. However, we transformed it to Cartesian coordinates; hence, it became a Poisson system. The lengths of the two rods work as holonomic constraints and lead to four first integrals in addition to the Hamiltonian. The FitzHugh-Nagumo model is a biological neuron model expressed as an electric circuit, which exhibits a rapid and transient change of voltage called a spike. As an electric circuit, the currents through and voltages applied to the inductor and capacitor can be regarded as system states, which are constrained by the circuit topology and Kirchhoff's current and voltage laws. Then, this system has a state of four elements and two first integrals. Because the resistor dissipates the energy, the system is not a Poisson system, but a Dirac structure can be found (van der Schaft & Jeltsema, 2014). We generated a time-series set of each dataset with different initial conditions (hence, different values of first integrals). See Appendix C for more details.

Implementation We implemented the proposed FINDE and evaluated it under the following settings. We implemented all code by modifying the officially released code of HNN (Greydanus et al., 2019) and DGNet (Matsubara et al., 2020).
We used Python v. 3.8.12 with packages scipy v. 1.7.3, pytorch v. 1.10.2, torchdiffeq v. 0.1.1, functorch v. 1.10 preview, and gplearn v. 0.4.2. We used the Dormand-Prince method (dopri5) (Dormand & Prince, 1986) as the numerical integrator, except in Section 4.2. All experiments were performed on a single NVIDIA A100.

Following HNN (Greydanus et al., 2019) and DGNet (Matsubara et al., 2020), we used fully-connected neural networks with two hidden layers. The input was the state $u$, and the output represented the first integrals $V$ for FINDE, the time-derivative $\hat{f}$ for NODE, or the Hamiltonian $H$ for HNN. Each hidden layer had 200 units and was followed by a hyperbolic tangent activation function. Each weight matrix was initialized as an orthogonal matrix. For the KdV dataset, we used a 1-dimensional CNN, wherein the kernel size of each layer was 3. The double pendulum is a second-order system, implying that the time-derivative $\frac{d}{dt}q$ of the position $q$ is known to be the velocity $v$. Hence, we treated only the acceleration $\frac{d}{dt}v$ as the output to learn in the 2-pend dataset. This assumption slightly improved the absolute performances but did not change the relative trends.

As the loss function for the cFINDE, we used the mean squared error (MSE) between the ground truth future state $u^{s+1}_{\mathrm{GT}}$ and the future state $u^{s+1}_{\mathrm{pred.}}$ predicted from the current state $u^s_{\mathrm{GT}}$, normalized by the time-step size $\Delta t^s$; we named this the 1-step error. For the dFINDE, we used the MSE between the left- and right-hand sides of Eq. (10) because the ground truth states $u^s_{\mathrm{GT}}$ and $u^{s+1}_{\mathrm{GT}}$ are available during the training phase. The base model and FINDE were jointly trained using the Adam optimizer (Kingma & Ba, 2015) with the parameters $(\beta_1, \beta_2) = (0.9, 0.999)$ and a batch size of 200. The learning rate was initialized to $10^{-3}$ and decayed to zero with cosine annealing (Loshchilov & Hutter, 2017). See Appendix B.3 and the enclosed source code for details about implementations.

Evaluation Metric We used the 1-step error as an evaluation metric, which is identical to the loss function for the cFINDE, and displayed it on a scale of $\times 10^{-9}$. The lower this indicator, the better, as indicated by ↓. The MSEs of the state or system energy over a long period are misleading indicators, as suggested in prior studies (Botev et al., 2021; Jin et al., 2020b; Vlachas et al., 2020). For example, a periodic orbit that is correctly learned except for a slight difference in angular velocity would have the same MSE as an orbit that never moves from its initial position. Instead, we used the valid prediction time (VPT) (Botev et al., 2021; Jin et al., 2020b; Vlachas et al., 2020). VPT denotes the time point $s$ divided by the length $S$ of the time-series at which the MSE of the predicted state $u^s_{\mathrm{pred.}}$ first exceeds a given threshold $\theta$ in an initial value problem, that is,

$$\mathrm{VPT}(u_{\mathrm{pred.}}; u_{\mathrm{GT}}) = \frac{1}{S}\max\{s_f \mid \mathrm{MSE}(u^s_{\mathrm{pred.}}, u^s_{\mathrm{GT}}) < \theta \text{ for all } s \le s_f,\ 0 \le s_f \le S\}.$$

The higher this indicator, the better, as indicated by ↑. To obtain VPTs, we normalized each element of the state to have zero mean and unit variance in the training data and set $\theta$ to 0.01. For systems with "spiking" behaviors, a small error in phase may be regarded as a significant error in the state; for the FitzHugh-Nagumo model, we obtained the VPTs by allowing for a delay and advance of up to 5 steps.

Before learning first integrals from data, we demonstrate that dFINDE can preserve first integrals without temporal discretization errors. We used a mass-spring system, which had the state $u = (q\ v)^\top$, dynamics $\frac{d}{dt}q = v$ and $\frac{d}{dt}v = -q$, and system energy $E(q, v) = \frac{1}{2}(q^2 + v^2)$. Using an initial value of $(1.0\ 0.0)^\top$ and a time-step size of $\Delta t = 0.2$, we solved the initial value problem of the true ODE using the leapfrog integrator with or without FINDE, with the true system energy $E$ as the first integral $V$. Notably, neither neural networks nor training were involved.
Figure 1 shows the results, along with the analytical solution. The states predicted by the compared methods overlap and are apparently identical. However, the energy obtained by the leapfrog integrator fluctuates, and the same is true for cFINDE. This is because the leapfrog integrator and cFINDE suffer from temporal discretization errors in first integrals. In contrast, dFINDE preserves the energy accurately, the same as the analytical solution. This is because dFINDE projects the state $(q\ v)^\top$ onto the discrete tangent space $\overline{T}_{(v,u)}\mathcal{M}'$ at every step. Although a smaller time-step size reduces temporal discretization errors, this result demonstrates the advantage of dFINDE. See Appendix D.1 for the case with the Dormand-Prince integrator.
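
The mass-spring demonstration can be approximated in a few lines of pure Python. This is a sketch under assumptions, not the code behind Figure 1: the leapfrog integrator is written in its velocity-Verlet form, the discrete gradient of the quadratic energy is taken by hand as the midpoint $(u^{s+1} + u^s)/2$ rather than obtained by the ADDA, and the implicit update of Eq. (10) is solved by a plain fixed-point iteration. The function names are hypothetical.

```python
def energy(u):
    q, v = u
    return 0.5 * (q * q + v * v)

def leapfrog_step(u, dt):
    """One velocity-Verlet (leapfrog) step for dq/dt = v, dv/dt = -q."""
    q, v = u
    v_half = v - 0.5 * dt * q
    q_new = q + dt * v_half
    v_new = v_half - 0.5 * dt * q_new
    return (q_new, v_new)

def dfinde_step(u, dt, iters=50):
    """Leapfrog prediction followed by the discrete-time projection (K = 1).

    For the quadratic energy, (u_next + u) / 2 satisfies both conditions of a
    discrete gradient, so the implicit update is iterated to a fixed point.
    """
    u_tilde = leapfrog_step(u, dt)
    d = (u_tilde[0] - u[0], u_tilde[1] - u[1])      # dt times the base finite difference
    u_next = u_tilde
    for _ in range(iters):
        m = (0.5 * (u_next[0] + u[0]), 0.5 * (u_next[1] + u[1]))
        coef = (m[0] * d[0] + m[1] * d[1]) / (m[0] * m[0] + m[1] * m[1])
        u_next = (u[0] + d[0] - coef * m[0], u[1] + d[1] - coef * m[1])
    return u_next

def max_energy_error(step, u0, dt, steps):
    """Largest deviation of the energy from its initial value along a trajectory."""
    u, e0, worst = u0, energy(u0), 0.0
    for _ in range(steps):
        u = step(u, dt)
        worst = max(worst, abs(energy(u) - e0))
    return worst

err_leapfrog = max_energy_error(leapfrog_step, (1.0, 0.0), 0.2, 100)
err_dfinde = max_energy_error(dfinde_step, (1.0, 0.0), 0.2, 100)
```

With these settings the leapfrog energy oscillates around its initial value, while the projected update keeps it constant up to rounding errors. The fixed-point iteration is chosen only for brevity; any nonlinear solver for the implicit equation would serve the same purpose.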

4.3. FINDING NON-HAMILTONIAN FIRST INTEGRALS OF HAMILTONIAN SYSTEMS

We evaluated cFINDE and dFINDE on learning from the 2-body dataset. We used HNN as the base model $\hat{f}$. We found that cFINDE and dFINDE obtained better performances if they did not treat the Hamiltonian $H$ of the HNN as one of the first integrals $V_k$. The medians and standard deviations of five trials are summarized in the leftmost column of Table 3. The cFINDE achieved better VPTs than the original HNN with $K = 1$ to $2$, but its performance degraded abruptly with $K = 3$. The dFINDE showed a similar trend with slightly better performances; there is a trade-off between performance and computational cost. The HNN with either cFINDE or dFINDE found two first integrals in addition to the Hamiltonian $H$ of the HNN. Even though a two-body problem is a Hamiltonian system that an HNN can learn, the prior knowledge that there exist first integrals other than the Hamiltonian $H$ can be a clue that enables better learning. Despite their better long-term prediction performance, the HNN with either cFINDE or dFINDE yielded 1-step errors worse than the HNN, indicating that the 1-step error is misleading as an evaluation criterion.

Example results are depicted in Fig. 2. In the absence of FINDE, the mass positions $(x_1, y_1)$ and $(x_2, y_2)$ became inaccurate in a short time and the center-of-gravity position $(x_c, y_c) = \left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$ deviated rapidly. The HNN with cFINDE accurately predicted the state for a longer period. Even after errors in the mass positions became non-negligible, errors in the center-of-gravity position were still small. Figure 3 shows the absolute errors averaged over all trials, which demonstrate how the trend changes with cFINDE. In both the x- and y-directions, the HNN without FINDE produced errors in the center-of-gravity position $x_c$ (or $y_c$) and those in the mass positions $x_1, x_2$ (or $y_1, y_2$) at a similar level.
In contrast, with the cFINDE, errors in the center-of-gravity position were much smaller than those in the mass positions, implying that errors in one mass position canceled out errors in the other. We performed a symbolic regression of first integrals $V$ found by the neural network. For $K = 2$, the found first integrals $V$ were identical to the linear momenta in the x- and y-directions up to affine transformation in most cases. See Appendix D.2 for detailed results. Therefore, we conclude that FINDE not only had better prediction accuracy but also found and preserved linear momenta (which are related to symmetries in space) more accurately despite not having prior knowledge about symmetries.
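
The VPT values compared in this section follow the definition given in Section 4.1. As a reading aid, that definition can be sketched as below. The function name and the list-of-lists data layout are assumptions, the states are assumed to be normalized already, and the sketch omits the delay and advance allowance used for the FitzHugh-Nagumo model.

```python
def valid_prediction_time(pred, truth, theta=0.01):
    """Fraction of the series predicted before the MSE first reaches theta.

    pred and truth are sequences of states for steps s = 0, ..., S,
    each state being a sequence of floats.
    """
    S = len(truth) - 1
    s_f = 0
    for s in range(S + 1):
        mse = sum((p - t) ** 2 for p, t in zip(pred[s], truth[s])) / len(truth[s])
        if mse >= theta:
            break
        s_f = s          # every step up to s is still below the threshold
    return s_f / S

# Toy series: the prediction matches the ground truth for steps 0..5,
# then is off by 1.0 from step 6 onward.
truth = [[float(s)] for s in range(11)]
pred = [[float(s)] if s <= 5 else [float(s) + 1.0] for s in range(11)]
vpt = valid_prediction_time(pred, truth)
```

For this toy series the last valid step is 5 out of $S = 10$, so `vpt` equals 0.5. A prediction that stays below the threshold for the whole series yields 1.0.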

4.4. FINDING FIRST INTEGRALS OF UNKNOWN SYSTEMS

It is often unclear whether a target system is a Hamiltonian system or not, but one can expect that it has several first integrals. We evaluated cFINDE and dFINDE using NODE as the base model and display the results in Table 3 . For the KdV dataset, the NODE with either cFINDE or dFINDE obtained improved VPTs for a wide range of K. Figure 4 shows an example result. The prediction states were apparently similar. In the absence of FINDE, the NODE increased all of its errors in proportion to time. With cFINDE, the error in total mass increased at the point where the two solitons collided, but then returned to the original level. Although the calculation is slightly inaccurate, the cFINDE learned to preserve the total mass. The error in energy continued to increase for K = 2, but remained within a small range for K = 3. These results suggest that the first or second quantity learned by the cFINDE was total mass, the third quantity was system energy, and the remaining quantity may correspond to one of the many first integrals of the KdV equation. For the 2-pend dataset, the NODE with either cFINDE or dFINDE obtained improved VPTs with K = 1 to 5. In addition to the system energy, the double pendulum has two holonomic constraints on the position, which lead to two additional constraints involving the velocity (see Appendix C for details). Thus, it is reasonable that the NODE with either cFINDE or dFINDE obtained the best VPT for K = 5 first integrals and completely failed for K > 5 first integrals. As exemplified in Fig. 5 , the NODE without FINDE did not preserve the lengths of rods, making the states deviate gradually. See Appendix D.3 for the case when actual constraints are known. For the FitzHugh-Nagumo dataset, the NODE with either cFINDE or dFINDE obtained improved VPTs for K = 2. As exemplified in Fig. 6 , the ground truth state converged to a periodic orbit, and only the NODE with cFINDE for K = 2 reproduced similar dynamics. 
Without FINDE, the state did not remain in a bounded region. For K = 1, the state converged to a wrong equilibrium; the sole quantity $V_1$ may have attempted, and failed, to learn both first integrals. We conclude that both cFINDE and dFINDE found all first integrals of the 2-pend and FitzHugh-Nagumo datasets, namely K = 5 and K = 2 first integrals, respectively.

5. CONCLUSION

This study proposed the first integral-preserving neural differential equation (FINDE), which can find and preserve any type of first integral from data in a unified manner. FINDE projects the time evolution onto the submanifold defined using the (discrete) gradients of first integrals represented by a neural network. We experimentally demonstrated that FINDE found and preserved first integrals that come from the energy and mass conservation laws, symmetries in space, and constraints, thereby predicting the dynamics for far longer. FINDE is applicable even to an energy-dissipating system. When FINDE obtains the best prediction accuracy with K = K′, it suggests that the target system has at least K′ first integrals. Hence, FINDE has the potential to make scientific discoveries by revealing geometric structures of dynamical systems. See Appendix D.4 for more discussion on K.

The numerical error tolerance of $10^{-9}$ was negligible compared to the 1-step errors (which were $10^{-5}$ to $10^{-4}$ in absolute error). Nevertheless, the dFINDE tended to obtain much better VPTs than the cFINDE. This result suggests that a method leading to smaller numerical errors produces a model with smaller modeling errors, as observed in previous works (Chen et al., 2020; Matsubara et al., 2020). These results may form a new frontier for integrating numerical and modeling errors.

REPRODUCIBILITY STATEMENT

See Section 4.1 for experimental settings. More detailed descriptions can be found in Appendix B.3 for the training procedure and Appendix C for the datasets. The authors have enclosed the source code for generating the datasets and running the experiments as supplementary material.

Constrained Hamiltonian System A constraint C(q) = 0 on the position q is called a holonomic constraint. Holonomic constraints appear, for example, when the arm's length restricts the position of a robot's hand. Differentiating a holonomic constraint C(q) = 0 with respect to time yields a constraint involving the velocity, $G(q, v) = \frac{\partial C}{\partial q} v = 0$, which is simply called a velocity constraint. Hence, each holonomic constraint leads to two first integrals, C and G. A Hamiltonian system with holonomic constraints is also a Poisson system; in particular, it is a constrained Hamiltonian system. A CHNN incorporates the known holonomic constraints C(q) and corresponding velocity constraints G(q, v) of a Hamiltonian system in the canonical form (Finzi et al., 2020b). The original study suggested that CHNN may learn holonomic constraints from data, but this has not been tested. For modeling a constrained Hamiltonian system, it is sufficient to incorporate only velocity constraints G(q, v) because a holonomic constraint C(q) is implicitly satisfied if the corresponding velocity constraint G(q, v) is satisfied. Celledoni et al. (2022) used such a formulation and extended HNN and CHNN to systems on non-Euclidean spaces. A neural projection method learns fixed holonomic constraints, as well as inequality constraints, which are outside the scope of this study (Yang et al., 2020). This method updates the state by solving an optimization problem similar to Eq. (3) iteratively using the gradient descent method at every training step. Subsequently, it applies the backpropagation algorithm to all the optimization iterations. Thus, it has high computational and memory costs.
These studies mainly focused on physically-induced holonomic constraints and may not work for other first integrals, as shown in Appendices D.3 and D.5. However, the purpose of FINDE is to find and preserve general first integrals, including energy and mass not limited by constraints. Dirac Structure A Dirac structure is named after a Dirac bracket, a generalization of the Poisson bracket (van der Schaft & Jeltsema, 2014), and can be found in various systems. For a rolling disk, the direction in which the disk can move forward without slipping is limited by the disk's orientation. This constraint is called a non-holonomic constraint. In an electric circuit, when elements are connected in series, the current flow through each element is always the same. This constraint is called Kirchhoff's current law. One can find Dirac structures in these systems. The dissipative SymODEN was proposed to model a port-Hamiltonian system in the canonical form (Zhong et al., 2020b) , which is a special case of the Dirac structure. To the best of our knowledge, a neural network model for a general Dirac structure has not yet been proposed. FINDE is the first neural network method to learn Dirac structures better than NODE can, even though it is not specialized for Dirac structures.

PDE with Mass Conservation

The total mass of a PDE system is sometimes preserved (Furihata & Matsuo, 2010). The KdV equation is a Hamiltonian system that describes shallow water waves, in which the energy and total mass are preserved. The Cahn-Hilliard equation is a model of phase separation of copolymer melts, in which the total mass is preserved, but the energy is dissipated. In general, a quantity in an area is preserved if its flux entering minus its flux leaving is zero. Deep conservation extracts latent dynamics of a PDE system and preserves a quantity of interest by forcing its flux to be zero (Lee & Carlberg, 2021). HNN++ also ensures mass conservation by designing a coefficient matrix that determines local interaction (Matsubara et al., 2020).

General First Integrals A concurrent study, "Constants-of-motion network," introduced a penalty loss function so that NODEs learn to preserve first integrals (Kasim & Lim, 2022); however, unlike other related methods, this method does not guarantee preservation. A Noether network was proposed to model videos that do not always capture physical phenomena (Alet et al., 2021). A subset of the latent variables is assumed to represent image features that do not change during a video, such as the appearance of objects. For prediction, these features are forced not to change. The Noether network is potentially useful for learning physical phenomena from videos, but is more similar to semantic manipulation of latent variables (Shen et al., 2020).

Some studies have investigated methods that do not predict dynamics but specialize in finding first integrals (Fukunaga & Olsen, 1971; Liu & Tegmark, 2021). These methods can be used to help FINDE determine the hyperparameter K. They commonly estimate the number (N − K) of dimensions of the tangent space $T_u\mathcal{M}'$ of the submanifold $\mathcal{M}'$ at point u using its neighbors.
For example, AI Poincaré proposed by Liu & Tegmark (2021) assumes that all data points share the submanifold $\mathcal{M}'$ and uses an autoencoder to reconstruct the tangent space $T_u\mathcal{M}'$. Hence, it can only process a

an external current source I. Let $I_R$ denote the current through the resistor R, and $V_R$ denote the applied voltage. Ohm's law and other properties of the elements give

$$V_R = I_R R, \quad C \frac{d}{dt} V_C = I_C, \quad L \frac{d}{dt} I_L = V_L.$$

Consider a situation where the current through and the voltage applied to stateful elements (capacitors and inductors) are measurable, but the connections between the elements are unknown. We treated $I_C$, $I_L$, $V_C$, $V_L$ as the system state u. Because the state is in 4-dimensional space and the dynamics is intrinsically 2-dimensional, there exist two first integrals; for example, but not limited to, $I = I_C + D(V_C) + I_L$ and $E = V_C - I_L R - V_L$. This type of electric circuit is an example of a Dirac structure because the state variables are constrained by circuit topology and Kirchhoff's current and voltage laws (van der Schaft & Jeltsema, 2014). From the viewpoint of generalized Hamiltonian systems, $(I_L, V_C)$ corresponds to the position, and $(V_L, I_C)$ corresponds to the momentum. The electric circuit can be described as a port-Hamiltonian system in a non-canonical form. Because of the non-canonical form, the FitzHugh-Nagumo model is outside the scope of CHNN and dissipative SymODEN (Finzi et al., 2020b; Zhong et al., 2020b).

We set the external current source I to follow U(0.7, 1.1), set the initial values of V and W to follow U(-1.5, 1.5) and U(0.0, 2.0), and transformed them to the state. We set the time-step size Δt to 0.1 and generated 1,000 time-series of S = 500 steps for training and 10 time-series of S = 2,000 steps for evaluation. We trained each model for 30,000 iterations.

In Fig. 1, we examined a mass-spring system and FINDE using the leapfrog integrator.
We also examined the case with the Dormand-Prince integrator (dopri5), as shown in Fig. A3. We increased the number of steps to $10^5$ and displayed the MSEs of the state instead of the state itself. First, we focus on the energy. Even using the Dormand-Prince integrator, a fourth-order method, the energy is slightly decreased. The cFINDE with the Dormand-Prince integrator shows the same tendency. This phenomenon is due to temporal discretization errors and is called energy drift. The dFINDE with the Dormand-Prince integrator significantly suppresses the error in energy. The remaining error is caused by rounding errors.

D ADDITIONAL RESULTS

When the focus is on the MSEs of the state, the trend is different: the dFINDE with the Dormand-Prince integrator suffers from the most significant errors in state. Although the dFINDE is designed to eliminate temporal discretization errors in energy, it does not necessarily reduce those in state. In contrast, the Dormand-Prince integrator is designed to suppress temporal discretization errors in state. Therefore, there is no guarantee that the dFINDE improves the prediction performance when it is measured by errors in state.

Nevertheless, the experimental results in Table 3 demonstrate that the dFINDE is superior to the base model and cFINDE in VPT. This is because the dFINDE reduces the modeling errors rather than the temporal discretization errors. For the mass-spring system, the governing equation is already known as an ODE and is discretized by the dFINDE, leading to temporal discretization errors. However, when the dFINDE learns dynamics from data, the training data points are already sampled in discrete time, and the dFINDE predicts future states in discrete time. Therefore, no temporal discretization error occurs, and we obtain only the advantages of exactly preserving the first integral.

This type of paradox has been repeatedly discovered in previous studies. For example, the leapfrog integrator and discrete gradient method are second-order methods. However, they are superior to the Dormand-Prince integrator when combined with neural networks and learning dynamics from data (Matsubara et al., 2020). For better learning (i.e., smaller modeling errors), the preservation of specific properties of target systems is more important than the order of accuracy.
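The energy drift discussed above can be reproduced in a few lines. The following minimal sketch integrates a unit mass-spring system with the classical fourth-order Runge-Kutta method, used here only as a simple stand-in for the Dormand-Prince integrator; it is not the setup behind Fig. A3, and the function names, step size, and number of steps are illustrative assumptions.

```python
def mass_spring(u):
    """Vector field of a unit mass-spring system: dq/dt = p, dp/dt = -q."""
    q, p = u
    return (p, -q)


def energy(u):
    """Total energy H = (q^2 + p^2) / 2 of the unit mass-spring system."""
    q, p = u
    return 0.5 * (q * q + p * p)


def rk4_step(f, u, dt):
    """One step of the classical fourth-order Runge-Kutta method."""
    def axpy(a, x, y):
        # Element-wise a * x + y for tuples.
        return tuple(a * xi + yi for xi, yi in zip(x, y))

    k1 = f(u)
    k2 = f(axpy(0.5 * dt, k1, u))
    k3 = f(axpy(0.5 * dt, k2, u))
    k4 = f(axpy(dt, k3, u))
    return tuple(
        ui + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for ui, a, b, c, d in zip(u, k1, k2, k3, k4)
    )


u = (1.0, 0.0)
e_start = energy(u)
for _ in range(10_000):
    u = rk4_step(mass_spring, u, 0.1)
e_end = energy(u)
drift = e_start - e_end  # positive: the energy slowly decays over time
```

The loss per step is tiny, but it accumulates over long rollouts; this is the kind of error that a scheme preserving the first integral exactly is designed to remove.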

D.2 SYMBOLIC REGRESSION OF FOUND FIRST INTEGRALS

Using gplearn (based on genetic programming), we performed a symbolic regression of the first integrals V found by the neural network. We prepared addition, subtraction, multiplication, and division as candidate operations, used Pearson's correlation coefficient as the evaluation criterion, set the early stopping threshold to 0.9, and set the population size to 10,000. We set the other hyperparameters to their default values, e.g., the maximum number of generations was 20.

We summarize the regression results of the HNN with cFINDE for K = 2 trained using the two-body dataset in Table A1. Note that Pearson's correlation coefficient is invariant to biases and scale factors. FINDE is also invariant because it only uses the directions of the gradients of first integrals. Hence, we removed biases and scale factors from the regression results. When the focus is on the symbolic regression of the training data, $V_1$, $V_1$, $V_2$, and $V_2$ for trials 0, 1, 2, and 3, respectively, are identical to the linear momentum in the x-direction up to scale factors; recall that we set $m_1 = m_2 = 1.0$ and see Eq. (A17). Likewise, $V_2$, $V_2$, $V_1$, and $V_1$ for trials 0, 1, 2, and 3, respectively, are identical to the linear momentum in the y-direction. $V_1$ and $V_2$ for trial 4 are weighted sums of the linear momenta in the x- and y-directions; in particular, they can be regarded as the linear momenta in the (1, -1)- and (1, 1)-directions, respectively.

When the quantities $V_1(u)$ and $V_2(u)$ are first integrals, any function of only $V_1(u)$, $V_2(u)$, and arbitrary constants is a first integral functionally dependent on $V_1(u)$ and $V_2(u)$. Thus, it is in principle impossible to guarantee that a first integral is re-discovered as a well-known symbolic expression, and a failure in symbolic regression is not a problem in any way. Previous studies introduced certain constraints (such as "gauge fixing") for symbolic regression (Liu & Tegmark, 2021); a combination with such methods may improve the results.
However, recent studies on neural networks have revealed that typical initialization and training procedures tend to learn simple functions (Barrett & Dherin, 2021; Cao et al., 2021). Additionally, the symbolic regression limited the depth of the computation graph, biasing the results toward simple functions; hence, the found first integrals were identical to the well-known forms and were separated in the x- and y-directions in most cases. The same is true for the symbolic regression of the test data, except for $V_1$ for trial 0, which had a small perturbation α. Because of the limited extrapolation ability, neural networks cannot always accurately represent functions outside the training data range. Once first integrals are found by FINDE and identified as equations by symbolic regression, one can use the equations instead of neural networks, ensuring the preservation of first integrals in the entire domain. From these results, we can conclude that cFINDE identified the linear momenta.

The state of the KdV dataset has 50 elements, which is too large to apply a symbolic regression. For the 2-pend and FitzHugh-Nagumo datasets, we did not find consistent equations of first integrals. For example, the symbolic regression identified a quantity $x_1^2 - y_1$ as a first integral in the 2-pend dataset, which is not directly related to well-known first integrals. When the angle $\theta_1$ of the upper rod is small, $y_1$ takes a value close to -1, and the quantity $x_1^2 - y_1$ is close to $x_1^2 + y_1^2$, which is a well-known first integral, namely the square $l_1^2$ of the upper rod length $l_1$. It is difficult to determine whether this inaccuracy is because of the training of FINDE or symbolic regression. There may still be room for improvement in the training of FINDE or symbolic regression.

We removed biases and scale factors. $\alpha = 0.003 (y_1 + y_2)(v_{x2} + x_1 + y_1 (v_{x2} + y_1 + y_2) + 1.402)$.

The double pendulum (2-pend) is classified as a constrained Hamiltonian system.
CHNN was proposed for cases when holonomic constraints are known (Finzi et al., 2020b) . We evaluated comparison methods under the assumption that the holonomic constraints were known. We summarized the results in Table A2 . The HNN, without constraints, completely failed to learn the dynamics. This is unsurprising because the dynamics of the double pendulum is outside the scope of the HNN. The two known holonomic constraints lead to two constraints involving the velocity; the CHNN took into account all four known constraints and worked remarkably. The HNN with cFINDE was given all four known constraints as the first integrals, but did not work properly. The original purpose of projection methods is to eliminate temporal discretization errors of first integrals but not to change the class to which the dynamics belong. Therefore, when a target system is not a subject of the base model, the base model with FINDE does not work. The NODE learns an ODE in a general way, and thus constrained Hamiltonian systems are included in its subjects. Given all four known constraints, the NODE with cFINDE worked better but never surpassed the CHNN. However, the CHNN works only for Hamiltonian systems in the canonical form with holonomic constraints. We also evaluated comparison methods using the 2-body dataset under the assumption that the linear momenta were known as first integrals. The CHNN attempted to obtain the inverse of a singular matrix and could not learn the dynamics. In contrast, the cFINDE improved the performances of both NODE and HNN. Existing methods (e.g., HNN and CHNN) assume geometric structures (e.g., Hamiltonian structure) described in Appendix A in order to guarantee conservation laws. When multiple structures are assumed at the same time, they must be integrated using appropriate prior knowledge. If it is possible, it would achieve extremely high performance. 
Otherwise, the geometric structures would conflict with each other and would not produce an appropriate model. This is the reason why CHNN failed to learn the 2-body dataset and HNN+FINDE failed to learn the 2-pend dataset. In contrast, NODE+FINDE does not assume any geometric structure and assumes first integrals in the most general way, being available to any situation. Hence, FINDE can assume one or more first integrals without changing anything. When the detailed properties of target systems are known, one can choose the best models. If the chosen model is inappropriate, the training procedure totally fails. FINDE provides a better alternative when prior knowledge is limited. Moreover, a constrained Hamiltonian system can have first integrals other than holonomic constraints and the Hamiltonian. In this case, the CHNN with FINDE is potentially the best choice.



https://github.com/greydanus/hamiltonian-nn
https://github.com/tksmatsubara/discrete-autograd

> 10^3 0.080 ±0.014 4.68 ±0.430 0.601 ±0.069 0.80 ±0.070 0.585 ±0.097 -
> 10^3 0.070 ±0.019 7.79 ±0.510 0.425 ±0.067 12.53 ±0.000 0.005 ±0.000 * -
+ dFINDE
1 7.01 ±1.060 0.379 ±0.040 11.61 ±6.600 0.288 ±0.083 0.75 ±0.100 0.152 ±0.017 47.07 ±8.030 0.117 ±0.122
2 7.03 ±1.000 0.475 ±0.022 2.70 ±0.260 0.598 ±0.059 0.74 ±0.050 0.271 ±0.111 33.24 ±3.400 0.455 ±0.032
3 54.78 ±36.39 0.309 ±0.024 3.78 ±0.270 0.636 ±0.024 0.69 ±0.050 0.447 ±0.081 319.70 ±91.11 0.049 ±0.007
4 > 10^3 0.102 ±0.015 3.48 ±0.320 0.780 ±0.059 0.71 ±0.030 0.454 ±0.060 -
5 > 10^3 0.086 ±0.011 * 5.26 ±0.150 0.718 ±0.038 0.86 ±0.090 0.591 ±0.087 -
6 > 10^3 0.059 ±0.017 9.60 ±3.610 0.573 ±0.121 58.88 ±22.98 0.037 ±0.039 -



Figure 1: Integration of a known mass-spring system by the leapfrog integrator. (top) States predicted by comparison methods. (bottom) Energy calculated from the states predicted.

Figure 2: Example results of 2-body dataset. (left) Ground truth. (middle) HNN. (right) HNN with cFINDE.

Figure 4: Example results of KdV dataset. (top) Predicted states. Red belts denote moving solitons. (bottom) Mean absolute errors in states u, total mass $\sum_{k=1}^{N} u_k$, and energy, from left to right.

Figure A2: Circuit diagram of FitzHugh-Nagumo model (Izhikevich & FitzHugh, 2006).

Figure A3: Integration of a known mass-spring system by Dormand-Prince integrator. (top) Mean squared errors in states predicted by comparison methods. (bottom) Energy calculated from the states predicted.

Comparison of Related Studies on Preservation of First Integrals.

Datasets, Dynamics, and First Integrals.

Results of cFINDE and dFINDE.

Symbolic Regression of First Integrals Found in Two-Body Problem
$v_{x1} + v_{x2} - v_{y1} - v_{y2}$, $v_{x1} + v_{x2} + v_{y1} + v_{y2}$, $v_{x1} + v_{x2} - v_{y1} - v_{y2}$, $v_{x1} + v_{x2} + v_{y1} + v_{y2}$

Results with Known Holonomic Constraints.

ACKNOWLEDGEMENT

This study was partially supported by JST CREST (JPMJCR1914), JST PRESTO (JPMJPR21C7), and JSPS KAKENHI (19K20344, 20K11693).

A HAMILTONIAN SYSTEM, ITS GENERALIZATION, AND FIRST INTEGRALS

Preliminary In this section, we briefly introduce potential target systems and related works. Methods proposed by related works use specific prior knowledge about target systems, such as constraints. In contrast, our proposed FINDE assumes a situation where neural networks learn systems with unknown properties. See, for example, Hairer et al. (2006); van der Schaft & Jeltsema (2014) for more details about geometric mechanics.

On an N-dimensional manifold $\mathcal{M}$, an ODE is defined using a vector field $f : \mathcal{M} \to T_u\mathcal{M}$, which maps a point u on the manifold $\mathcal{M}$ to a tangent vector f(u) on the tangent space $T_u\mathcal{M}$. The NODE defines an ODE in this way (Chen et al., 2018). Given a scalar-valued function $H : \mathcal{M} \to \mathbb{R}$ on the manifold $\mathcal{M}$, its differential $dH : \mathcal{M} \to T^*_u\mathcal{M}$ is a cotangent vector field (a.k.a. a differential 1-form), which maps a point u on the manifold $\mathcal{M}$ to a cotangent vector dH(u) on the cotangent space $T^*_u\mathcal{M}$.

Hamiltonian System A Hamiltonian system is defined using a non-degenerate closed differential 2-form ω called a symplectic form, which is a skew-symmetric bilinear map $\omega_u : T_u\mathcal{M} \times T_u\mathcal{M} \to \mathbb{R}$ at point u. A symplectic form assigned to a manifold is called the symplectic structure. The coordinate-free form of Hamilton's equation is

$$\frac{d}{dt} u = X_H(u), \quad \omega_u(X_H(u), w) = \langle dH(u), w \rangle \ \text{for any } w \in T_u\mathcal{M},$$

where $X_H$ is the Hamiltonian vector field. The symplectic form ω gives rise to a bundle map $\omega^\flat_u : T_u\mathcal{M} \to T^*_u\mathcal{M}$, with which Hamilton's equation is rewritten as $\frac{d}{dt} u = X_H(u) = (\omega^\flat_u)^{-1}(dH(u))$. The right-hand side is locally equivalent to the product of a coefficient matrix S and the gradient ∇H of the Hamiltonian H. Then, Hamilton's equation is obtained as $\frac{d}{dt} u = S \nabla H(u)$. Hamiltonian systems are often expressed in the canonical form; in other words, they are defined on Darboux coordinates, on which the state u is the paired generalized position q and generalized momentum p. The corresponding coefficient matrix is

$$S = \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix}$$

for 2n = N and the n-dimensional identity matrix $I_n$. The HNN was developed to model Hamiltonian systems in the canonical form (Greydanus et al., 2019).

An Euler-Lagrange equation with a hyperregular Lagrangian and a Lotka-Volterra equation are also Hamiltonian systems; however, their coordinate systems are not Darboux coordinates. A neural symplectic form (NSF) handles this class of equations (Chen et al., 2021). The KdV equation is also a Hamiltonian system not on Darboux coordinates. For Hamiltonian PDE systems, HNN++ was proposed (Matsubara et al., 2020). According to Darboux's theorem, any Hamiltonian system on an even-dimensional manifold can be transformed into the canonical form.

Noether's theorem states that a continuous symmetry of a system leads to a conservation law. A Hamiltonian system is symmetric (invariant) to translation in time and conserves the Hamiltonian H. A two-body problem is symmetric to translation and rotation in space and conserves linear and angular momenta. These quantities are first integrals. LieConv and EMLP-HNN had such symmetries implemented in their architectures (Finzi et al., 2020a; 2021). A pendulum is not symmetric to translation and rotation in space and does not conserve linear and angular momenta, but does exchange them with the base to which it is fixed.

Poisson System A Poisson system is named after a Poisson bracket {•, •}, but it is convenient to refer to it as a degenerate Hamiltonian system. A Poisson bracket is defined using a Poisson 2-vector B, which is a skew-symmetric bilinear map $B_u$ :

A Poisson neural network (PNN) learns to transform a given Poisson system into a canonical form (Jin et al., 2020a).

single long time series with fixed first integrals. In contrast, our proposed FINDE can leverage a dataset of multiple time series with different values of the first integrals.

B.1 DERIVATION OF FINDE

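Continuous FINDE (cFINDE) Let $u^s$ denote a current state and f denote a vector field. After a time interval Δt, the state transitions to $\tilde{u}^{s+1}$. A typical projection method projects the state $\tilde{u}^{s+1}$ onto a submanifold $\mathcal{M}'$ and obtains a state $u^{s+1}$, which preserves the first integrals $V = (V_1 \ldots V_K)^\top$. This procedure is defined as an optimization problem in Eq. (3);

$$u^{s+1} = \arg\min_{u} \|u - \tilde{u}^{s+1}\| \quad \text{subject to } V(u) - V(u^s) = 0. \tag{A1}$$

One can solve the problem using the method of Lagrange multipliers. A Lagrangian function is

$$\mathcal{L}(u, \lambda') = \tfrac{1}{2} \|u - \tilde{u}^{s+1}\|^2 + \lambda'^\top (V(u) - V(u^s)), \tag{A2}$$

where λ′ is the Lagrange multiplier. The stationary point satisfies

$$u^{s+1} - \tilde{u}^{s+1} + \frac{\partial V}{\partial u}(u^{s+1})^\top \lambda' = 0, \quad V(u^{s+1}) - V(u^s) = 0. \tag{A3}$$

Subsequently, a projection method can be redefined as

$$u^{s+1} = \tilde{u}^{s+1} - \frac{\partial V}{\partial u}(u^{s+1})^\top \lambda', \quad V(u^{s+1}) - V(u^s) = 0. \tag{A4}$$

We transform Eq. (A4) into

$$\frac{u^{s+1} - u^s}{\Delta t} = \frac{\tilde{u}^{s+1} - u^s}{\Delta t} - \frac{\partial V}{\partial u}(u^{s+1})^\top \lambda, \quad \frac{V(u^{s+1}) - V(u^s)}{\Delta t} = 0, \tag{A5}$$

where λ = λ′/Δt. Taking the limit as Δt → +0, we obtain Eq. (4);

$$\frac{d}{dt} u = \hat{f}(u) = f(u) - \frac{\partial V}{\partial u}(u)^\top \lambda(u), \quad \frac{d}{dt} V(u) = 0. \tag{A6}$$

The second equation ensures that a state transition following the new vector field $\hat{f}$ preserves the first integrals V. By eliminating the Lagrange multiplier λ(u), we define the cFINDE as in Eq. (6), that is,

$$\frac{d}{dt} u = \hat{f}(u) = (I - Y(u)) f(u), \quad Y = M^\top (M M^\top)^{-1} M, \tag{A7}$$

where $M = \frac{\partial V}{\partial u}$. Because of the above derivation, the cFINDE can be considered a continuous-time version of a projection method. The preservation of first integrals can be proved as follows.

Proof of Theorem 1.

$$\frac{d}{dt} V(u) = M \hat{f}(u) = M (I - M^\top (M M^\top)^{-1} M) f(u) = (M - M) f(u) = 0.$$

Hence, it holds that $\frac{d}{dt} V_k(u) = 0$ for k = 1, ..., K, indicating that the cFINDE $\frac{d}{dt} u = \hat{f}(u)$ preserves all first integrals $V_k$ in continuous time.

Discrete FINDE (dFINDE) For dFINDE, we take the discrete gradient of the Lagrangian equation in Eq. (A2) and obtain the discrete version of the necessary conditions for the stationary point;

$$u^{s+1} - \tilde{u}^{s+1} + \overline{M}(u^{s+1}, u^s)^\top \lambda' = 0, \quad V(u^{s+1}) - V(u^s) = \overline{M}(u^{s+1}, u^s)(u^{s+1} - u^s) = 0, \tag{A8}$$

where the discrete Jacobian $\overline{M}(u^{s+1}, u^s)$ corresponds to the Jacobian $\frac{\partial V}{\partial u}$. By substituting the base model $\frac{\tilde{u}^{s+1} - u^s}{\Delta t^s} = \psi(u^s; \Delta t^s)$ and the dFINDE $\frac{u^{s+1} - u^s}{\Delta t^s} = \hat{\psi}(u^{s+1}, u^s; \Delta t^s)$ into the above equation and dividing the first equation by $\Delta t^s$, we obtain Eq. (8);

$$\hat{\psi}(u^{s+1}, u^s; \Delta t^s) = \psi(u^s; \Delta t^s) - \overline{M}(u^{s+1}, u^s)^\top \lambda, \quad V(u^{s+1}) - V(u^s) = \Delta t^s \, \overline{M}(u^{s+1}, u^s) \, \hat{\psi}(u^{s+1}, u^s; \Delta t^s) = 0, \tag{A9}$$

where $\lambda = \lambda' / \Delta t^s$. By eliminating the Lagrange multiplier λ, we define the dFINDE as in Eq. (10), that is,

$$\frac{u^{s+1} - u^s}{\Delta t^s} = \hat{\psi}(u^{s+1}, u^s; \Delta t^s) = (I - Y(u^{s+1}, u^s)) \, \psi(u^s; \Delta t^s), \quad Y = \overline{M}^\top (\overline{M} \, \overline{M}^\top)^{-1} \overline{M}. \tag{A10}$$

The preservation of first integrals can be proved as follows.

Proof of Theorem 2.

$$V(u^{s+1}) - V(u^s) = \overline{M} (u^{s+1} - u^s) = \Delta t^s \, \overline{M} (I - \overline{M}^\top (\overline{M} \, \overline{M}^\top)^{-1} \overline{M}) \, \psi(u^s; \Delta t^s) = 0.$$

Hence, it holds that $V_k(u^{s+1}) = V_k(u^s)$ for k = 1, ..., K, indicating that the dFINDE $\frac{u^{s+1} - u^s}{\Delta t^s} = \hat{\psi}(u^{s+1}, u^s; \Delta t^s)$ preserves all first integrals $V_k$ in discrete time.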
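As a rough numerical illustration of the continuous-time projection, the sketch below assumes that the cFINDE vector field takes the form (I - Mᵀ(MMᵀ)⁻¹M) f with M = ∂V/∂u, which is what eliminating the Lagrange multiplier yields. The base field, the first integral, and all function names are hypothetical choices made for this example; in FINDE both f and V would be neural networks.

```python
import numpy as np


def cfinde_field(f, jac_v, u):
    """Return (I - M^T (M M^T)^{-1} M) f(u), with M = dV/du of shape (K, N)."""
    m = np.atleast_2d(jac_v(u))
    fu = f(u)
    lam = np.linalg.solve(m @ m.T, m @ fu)  # Lagrange multiplier lambda(u)
    return fu - m.T @ lam


def base_field(u):
    """Hypothetical base field: a rotation plus a spurious outward drift."""
    q, p = u
    return np.array([p, -q]) + 0.1 * u


def jac_energy(u):
    """Jacobian of the assumed first integral V(u) = (q^2 + p^2) / 2."""
    return u.reshape(1, -1)


u = np.array([0.6, -0.8])
f_hat = cfinde_field(base_field, jac_energy, u)
dv_dt = jac_energy(u) @ f_hat  # time derivative of V along the projected field
```

In this toy case the projection removes exactly the radial drift, so the projected field reduces to the pure rotation, and V no longer changes along it. Solving the K-by-K linear system instead of forming the inverse explicitly is a design choice of the sketch.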

B.2 DISCRETE GRADIENT

A discrete gradient is a discrete analogue to a gradient (Furihata & Matsuo, 2010; Gonzalez, 1996; Hong et al., 2011). Discrete gradients that satisfy Definition 2 are not unique, and many variations have been proposed. For a neural network, Matsubara et al. (2020) proposed the automatic discrete differentiation algorithm (ADDA). We briefly introduce the algorithm in the case of finite-dimensional Euclidean spaces.

The differential dg of a function $g : \mathbb{R}^N \to \mathbb{R}^M$ is a linear operator $dg_u : \mathbb{R}^N \to \mathbb{R}^M$ at point u and satisfies

$$\lim_{\|w\| \to +0} \frac{\|g(u + w) - g(u) - dg_u(w)\|}{\|w\|} = 0. \tag{A11}$$

The differential dg acting on a vector w is equivalent to the product of the vector w with the Jacobian $J_g(u)$ of the function g at point u: $dg_u(w) = J_g(u) w$. Similarly, according to the chain rule, the differential $d(h \circ g)$ of a composition $h \circ g$ of functions g, h is equivalent to the multiplication with a series $J_h(g(u)) J_g(u)$ of Jacobians. Therefore, the automatic differentiation algorithm obtains the differential of a neural network. The differential dg of a function $g : \mathbb{R}^N \to \mathbb{R}$ is a row vector, and the gradient ∇g of the function g is a column vector dual to the differential. Therefore, the gradient ∇g is obtained by transposing the differential dg.

The ADDA replaces each Jacobian with its discrete analogue. For linear layers, such as fully-connected and convolution layers, the discrete Jacobian is identical to the ordinary Jacobian. For element-wise nonlinear layers, such as activation functions, a diagonal matrix composed of the slopes between two inputs can act as the discrete Jacobian. A discrete gradient obtained by the above steps satisfies Definition 2.
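The layer-wise recipe above can be sketched for a toy network. The example below is not the ADDA implementation of Matsubara et al. (2020); it is a hand-written illustration for one hidden tanh layer, with hypothetical weights `W1`, `b1`, `w2`, that checks the property usually required of a discrete gradient: the difference of V between two points equals the inner product of the discrete gradient with the displacement.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)  # hidden linear layer
w2 = rng.normal(size=8)                               # linear read-out


def v(u):
    """Toy scalar network V(u) = w2 . tanh(W1 u + b1)."""
    return w2 @ np.tanh(W1 @ u + b1)


def discrete_gradient(u_next, u):
    """Discrete gradient of v between u and u_next (ADDA-style sketch)."""
    z_next, z = W1 @ u_next + b1, W1 @ u + b1
    dz = z_next - z
    same = np.abs(dz) < 1e-12
    # Slope of tanh between the two pre-activations; fall back to the
    # ordinary derivative where the two inputs coincide.
    slope = np.where(
        same,
        1.0 - np.tanh(z) ** 2,
        (np.tanh(z_next) - np.tanh(z)) / np.where(same, 1.0, dz),
    )
    # Linear layers keep their ordinary Jacobians (W1 and w2).
    return W1.T @ (slope * w2)


u0 = rng.normal(size=3)
u1 = u0 + 0.3 * rng.normal(size=3)
lhs = v(u1) - v(u0)
rhs = discrete_gradient(u1, u0) @ (u1 - u0)
```

Here `lhs` and `rhs` agree up to rounding, whereas the ordinary gradient evaluated at either endpoint would satisfy this identity only to first order in the displacement.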

B.3 PREDICTION AND TRAINING PROCEDURES

For ODEs modeled by neural networks, various training and prediction strategies have been proposed to date (Chen et al., 2018; 2020; Course et al., 2020; Matsubara et al., 2020; Zhong et al., 2020a); FINDE can adopt any of these. In our experiments, we used the following simple strategies.

In the case of the cFINDE and base models, taking a state $u^s_{\mathrm{GT}}$ from the dataset, a numerical integrator solves the ODE $\frac{d}{dt} u = f(u)$ and predicts the next state $u^{s+1}_{\mathrm{pred.}}$. This process can be informally expressed as

$$u^{s+1}_{\mathrm{pred.}} = u^s_{\mathrm{GT}} + \int_{t^s}^{t^s + \Delta t^s} f(u) \, dt. \tag{A12}$$

We solved this integration using torchdiffeq.odeint. The prediction accuracy can be evaluated using the difference between the predicted state $u^{s+1}_{\mathrm{pred.}}$ and ground truth $u^{s+1}_{\mathrm{GT}}$ taken from the dataset. We normalized the difference by the time-step size $\Delta t^s$ and defined the 1-step error $L_{\text{1-step}}$ as

The cFINDE and base models were trained to minimize the 1-step error $L_{\text{1-step}}$.

In the case of the dFINDE, the next state $u^{s+1}_{\mathrm{pred.}}$ is predicted by solving Eq. (10) as an implicit scheme; in particular, arg min

Therefore, prediction by the dFINDE is implicit. For evaluation, we solved this scheme using scipy.optimize.fsolve and obtained the 1-step error in Eq. (A13). However, during the training phase, the ground truth $u^{s+1}_{\mathrm{GT}}$ of the next state is known. Hence, we substituted this into Eq. (10), and then used the difference between the left- and right-hand sides of the dFINDE as the loss function:

The discrete Jacobian $\overline{M}$ (and hence Y) can be obtained explicitly, and an explicit numerical integrator can be used for the base model ψ. Hence, the process to obtain the value of the loss function is explicit, and the dFINDE can be trained in an explicit way, whereas the prediction is still implicit.

Some previous studies have proposed alternative strategies. For example, a loss function can be defined as the sum of the errors at multiple time points during a long-term prediction. The cFINDE can naturally adopt such a training strategy, and the dFINDE can adopt it after a minor modification. While it is helpful to pursue absolute performance, it requires additional hyperparameters, such as the length of prediction time, and additional effort to adjust them. We used the 1-step error in the present study for simplicity and fair comparisons.

The function V(u) learning a first integral may become a constant function during training; subsequently, its Jacobian matrix vanishes ($\frac{\partial V(u)}{\partial u} \equiv 0$). In this case, our algorithm returns a division-by-zero error because it requires the inverse of the matrix $\frac{\partial V(u)}{\partial u} \frac{\partial V(u)}{\partial u}^\top$ for the projection. We have not taken any special measures to prevent such errors, but no errors occurred in any experiments with proper settings. The division-by-zero errors have occurred only when FINDE assumes an unreasonable number of first integrals (e.g., K = 6 for the double pendulum, which has five first integrals). FINDE works correctly even when the functions f(u) and V(u) learn the same first integrals; we verified such a case in Section 4.2, where both functions are known. FINDE learns first integrals point-by-point, and the found first integral is not always consistent over the domain. The same can be said about the energy function of HNN, and this type of problem is an open problem for neural network models of dynamical systems.
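The implicit prediction step of the dFINDE can be sketched on a toy problem. The example below assumes the scheme (u_next - u)/Δt = (I - Y(u_next, u)) ψ(u; Δt) with Y built from a discrete Jacobian, and solves it with a plain fixed-point iteration rather than scipy.optimize.fsolve, purely to keep the sketch free of extra dependencies. The base model, the quadratic first integral, and all names are hypothetical.

```python
import numpy as np


def v(u):
    """Assumed first integral: V(u) = (q^2 + p^2) / 2."""
    return 0.5 * float(u @ u)


def discrete_jacobian(u_next, u):
    """Discrete Jacobian of V; the midpoint is exact for a quadratic V."""
    return (0.5 * (u_next + u)).reshape(1, -1)


def base_psi(u, dt):
    """Hypothetical base model: forward Euler on a field with spurious drift."""
    q, p = u
    return np.array([p, -q]) + 0.1 * u


def dfinde_step(u, dt, tol=1e-14, max_iter=200):
    """Solve (u_next - u) / dt = (I - Y(u_next, u)) psi(u; dt) iteratively."""
    psi = base_psi(u, dt)
    u_next = u + dt * psi  # initial guess: the unprojected base prediction
    for _ in range(max_iter):
        m = discrete_jacobian(u_next, u)
        lam = np.linalg.solve(m @ m.T, m @ psi)
        candidate = u + dt * (psi - m.T @ lam)
        if np.linalg.norm(candidate - u_next) < tol:
            return candidate
        u_next = candidate
    return u_next


u0 = np.array([1.0, 0.0])
u1 = dfinde_step(u0, dt=0.1)
u1_base = u0 + 0.1 * base_psi(u0, 0.1)  # base model alone, for comparison
```

The projected step keeps V unchanged up to rounding, while the base step alone changes it noticeably. During training the next state is known, so the residual of the same equation can be evaluated explicitly without any iteration.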

C DETAILS OF DATASETS

To generate each dataset, we used scipy package and the Dormand-Prince method (dopri5) with the default relative tolerance of 10 -9 , unless otherwise stated. Experiments on the KdV dataset were performed with double precision, and all other experiments were performed with single precision.Hamiltonian System in Canonical Form: Two-Body Problem A gravitational two-body problem on a 2-dimensional configuration space has a state u composed of the 4-dimensional position q = (x 1 y 1 x 2 y 2 ) ⊤ and 4-dimensional velocity v = (v x1 v y1 v x2 v y2 ) ⊤ . This is a secondorder ODE, indicating that d dt q = v. The momentum p x1 of x 1 equals m 1 v x1 . The timederivative d dt v of the velocity v is called the acceleration. The acceleration of x 1 is given by, where G, m 1 , and m 2 denote the constant of gravity and masses of two bodies, respectively. The same process applies for the remaining positions.The total energy of the two-body problem is given byThe first and second terms denote the kinetic and potential energies, respectively. The two-body problem is a Hamiltonian system, and the dynamics mentioned above can be rewritten as Hamilton's equation. The Hamiltonian H is one of the first integrals; the two-body problem has other first integrals, such as the linear momenta in the xand y-directionsand angular momentum (Hairer et al., 2006) .We set G, m 1 , and m 2 to 1.0. The initial distance r 1 = x 2 1 + y 2 1 of a mass m 1 from the origin was set to r 1 ∼ U(0.5, 1.0), and the initial angle θ 1 = tan -1 ( y1 x1 ) was set to θ 1 ∼ U(0, 2π). The initial speed |v 1 | = v 2 x1 + v 2 y1 was set to 1 2r 2 ϵ v , where ϵ v ∼ N (1, 0.05). The initial angle of the velocity was set to θ ± 0.5π + ϵ θ π, where ϵ θ ∼ N (0, 0.05). The initial condition of the other mass m 2 was set to the opposite of the mass m 1 . Subsequently, the two masses trace elliptical orbits, and when ϵ v = ϵ θ = 0, they trace exactly circular orbits. 
In addition, we added a perturbation following $\mathcal{N}(0, 0.01)$ to the velocities of both masses, which corresponds to the center-of-gravity velocity.

We set the time-step size $\Delta t$ to 0.01 and generated 1,000 time-series of $S = 500$ steps for training and 10 time-series of $S = 10,000$ steps for evaluation. We trained each model for 100,000 iterations.

Hamiltonian System in Non-Canonical Form: KdV Equation. The KdV equation is a model of shallow water waves and is known to have soliton solutions (Furihata, 2001). The dynamics is given by

$$u_t = -\alpha u u_x + \beta u_{xxx}, \quad \text{(A18)}$$

where $x$ denotes the spatial position and the subscripts denote partial derivatives; for example, $u_t = \frac{\partial u}{\partial t}$. The Hamiltonian is given by

$$H = \int \left(-\frac{\alpha}{6} u^3 - \frac{\beta}{2} u_x^2\right) dx.$$

In Hamilton's equation $\frac{d}{dt} u = S \nabla H$, the partial differential operator $\frac{\partial}{\partial x}$ acts as the coefficient matrix $S$. This system is Liouville integrable and has infinitely many first integrals, including the Hamiltonian $H$, the total mass $I_1 = \int u \, dx$, and $I_2 = \int u^2 \, dx$ (Miura et al., 1968). Other first integrals are defined using higher-order partial derivatives.

For PDEs, PINNs are known to provide solutions when symbolic equations and boundary conditions are given (Raissi et al., 2019). We, in contrast, consider learning spatially discretized PDEs as ODEs from observed data and solving them using numerical integrators, in the same context as NODEs and HNNs; this topic has also been studied extensively (Long et al., 2018; Matsubara et al., 2020; Sun et al., 2020; Holl et al., 2020). Following the experiments in a previous study (Matsubara et al., 2020), we discretized the KdV equation in space; it no longer has infinitely many first integrals. We set $\alpha = -6$, $\beta = 1$, the spatial size to 10 space units, and the space mesh size to 0.2; the system state $u$ had 50 elements.
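To make the structure $\frac{d}{dt} u = S \nabla H$ concrete after spatial discretization, the following minimal sketch builds one possible semi-discrete KdV system and checks that it conserves the discrete total mass and the discrete energy. Several choices here are assumptions, not taken from the dataset code: periodic boundaries, a central difference for $S$, a Hamiltonian discretized as $\sum_i \left(-\frac{\alpha}{6} u_i^3 - \frac{\beta}{2} (\delta^+ u_i)^2\right) \Delta x$ with a forward difference $\delta^+$, and the test state.

```python
import math

ALPHA, BETA = -6.0, 1.0  # coefficients used for the dataset
N, DX = 50, 0.2          # 50 grid points covering 10 space units


def central_diff(g):
    """Periodic central difference, a discrete stand-in for d/dx (matrix S)."""
    return [(g[(i + 1) % N] - g[i - 1]) / (2 * DX) for i in range(N)]


def grad_hamiltonian(u):
    """Gradient of the discretized Hamiltonian with respect to u."""
    second = [(u[(i + 1) % N] - 2 * u[i] + u[i - 1]) / DX**2
              for i in range(N)]
    return [-ALPHA / 2 * ui**2 + BETA * si for ui, si in zip(u, second)]


def rhs(u):
    """Semi-discrete KdV written as du/dt = S grad H."""
    return central_diff(grad_hamiltonian(u))


# Hypothetical smooth state: a single sech^2 bump centered in the domain.
u = [2.0 / math.cosh(i * DX - 5.0) ** 2 for i in range(N)]
g = grad_hamiltonian(u)
du = rhs(u)

# Time derivatives of the discrete total mass and the discrete energy.
mass_rate = sum(du) * DX
energy_rate = sum(gi * di for gi, di in zip(g, du)) * DX
```

Both rates vanish up to rounding: the mass rate because the periodic difference telescopes, and the energy rate because the central-difference matrix is skew-symmetric.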
We generated two solitons as the initial condition; each was expressed as

$$-\frac{12}{\alpha} \kappa^2 \operatorname{sech}^2\left(\kappa (x - d)\right),$$

where the size $\kappa$ followed $\mathcal{U}(0.5, 2)$ and the initial position $d$ of one soliton was set to be at least 2.0 from that of the other.

We set the time-step size $\Delta t$ to 0.001 and generated 1,000 time-series of $S = 500$ steps for training and 10 time-series of $S = 10,000$ steps for evaluation, using the discrete gradient method to ensure energy conservation (Furihata, 2001). We trained each model for 30,000 iterations.

Due to the spatial discretization, the KdV dataset contains spatial truncation errors. When the neural network learns this dataset, no spatial truncation errors are additionally introduced. An evaluation using the analytical solution as a dataset, or datasets created with different spatial resolutions, is left for future work.

Poisson System: Double Pendulum. A double pendulum (2-pend) is depicted in Fig. A1. In polar coordinates, this is a Hamiltonian system. The state is composed of the angles $(\theta_1, \theta_2)$ of the two rods and their angular velocities $(\omega_1, \omega_2)$. This is also a second-order ODE, indicating that $\frac{d}{dt} \theta_1 = \omega_1$ and $\frac{d}{dt} \theta_2 = \omega_2$. Let $l_1, l_2$ denote the lengths of the two rods, $m_1, m_2$ denote the masses of the two weights, and $g$ denote the gravitational acceleration. The acceleration is given by

where $\Delta = \theta_1 - \theta_2$. In 2-dimensional Cartesian coordinates, the state is composed of the positions $(x_1, y_1, x_2, y_2)$ of the two masses and the corresponding velocities $(v_{x1}, v_{y1}, v_{x2}, v_{y2})$. The position is transformed by $x_1 = l_1 \sin \theta_1$, $y_1 = l_1 \cos \theta_1$, $x_2 = x_1 + l_2 \sin \theta_2$, and $y_2 = y_1 + l_2 \cos \theta_2$, and the velocity is transformed accordingly. The total energy $H$ is given by

The first and second terms denote the kinetic and potential energies, respectively. The double pendulum is no longer a Hamiltonian system in Cartesian coordinates.
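The coordinate transformation of the double pendulum can be sketched as follows. The velocity formulas are an assumption about what transforming the velocity "accordingly" means, namely differentiating the position formulas in time; the function name `polar_to_cartesian` and the sample values are hypothetical.

```python
import math


def polar_to_cartesian(theta1, theta2, omega1, omega2, l1, l2):
    """Map angles and angular velocities to positions and velocities."""
    x1, y1 = l1 * math.sin(theta1), l1 * math.cos(theta1)
    x2, y2 = x1 + l2 * math.sin(theta2), y1 + l2 * math.cos(theta2)
    # Velocities are the time derivatives of the positions (chain rule).
    vx1 = l1 * omega1 * math.cos(theta1)
    vy1 = -l1 * omega1 * math.sin(theta1)
    vx2 = vx1 + l2 * omega2 * math.cos(theta2)
    vy2 = vy1 - l2 * omega2 * math.sin(theta2)
    return (x1, y1, x2, y2), (vx1, vy1, vx2, vy2)


# Hypothetical sample state and rod lengths.
l1, l2 = 0.95, 1.05
pos, vel = polar_to_cartesian(0.3, -0.2, 0.05, -0.08, l1, l2)
x1, y1, x2, y2 = pos
vx1, vy1, vx2, vy2 = vel

rod1 = math.hypot(x1, y1)            # recovers l1
rod2 = math.hypot(x2 - x1, y2 - y1)  # recovers l2
# The relative velocity has no component along either rod.
radial1 = x1 * vx1 + y1 * vy1
radial2 = (x2 - x1) * (vx2 - vx1) + (y2 - y1) * (vy2 - vy1)
```

Any state produced by this map keeps both rod lengths fixed and has zero velocity along each rod, so these quantities constrain the Cartesian state.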
Because the lengths of the two rods are constant, the double pendulum has two constraints on the position: $l_1^2 = x_1^2 + y_1^2$ and $l_2^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$. These constraints are holonomic constraints, and they lead to constraints involving the velocity, namely $0 = x_1 v_{x1} + y_1 v_{y1}$ and $0 = (x_2 - x_1)(v_{x2} - v_{x1}) + (y_2 - y_1)(v_{y2} - v_{y1})$. When the constraints involving the velocity are satisfied, the holonomic constraints are implicitly satisfied. Therefore, the number of first integrals is five; however, three first integrals are sufficient to determine the dynamics. The dynamics is degenerate and classified as a constrained Hamiltonian system, or a Poisson system in a more general case.

We set the masses of the two weights to $m_1 = m_2 = 1.0$ and the gravitational acceleration $g$ to 9.8. We set the lengths $l_1, l_2$ of the two rods to follow $\mathcal{U}(0.9, 1.1)$, the initial angles $\theta_1, \theta_2$ to follow $\mathcal{U}(-0.5, 0.5)$, and the initial angular velocities $\dot{\theta}_1, \dot{\theta}_2$ to follow $\mathcal{U}(-0.1, 0.1)$.

We set the time-step size $\Delta t$ to 0.1 and generated 1,000 time-series of $S = 500$ steps for training and 10 time-series of $S = 5,000$ steps for evaluation. We trained each model for 100,000 iterations.

Dirac Structure: FitzHugh-Nagumo Model. R. FitzHugh proposed a model of the electrical dynamics of a biological neuron, and J. Nagumo created an equivalent electric circuit. This model is called the FitzHugh-Nagumo model (Izhikevich & FitzHugh, 2006) and is a modified version of the van der Pol oscillator; the state oscillates when the magnitude of the external current source $I$ is within an appropriate range. The circuit comprises a resistor $R$, inductor $L$, capacitor $C$, tunnel diode $D$, and voltage source $E$ connected as shown in Fig. A2. The whole circuit is connected to

Published as a conference paper at ICLR 2023

The theoretical explanation for the high performance of neural networks (e.g., HNN) that assume first integrals for physical phenomena is an open question. Sannai et al.
(2021) have theoretically shown that neural networks with symmetry (e.g., CNNs and GNNs) have faster learning convergence, and we consider that this approach can be applied to the above question. At least for cFINDE and dFINDE, we have an intuitive but not rigorous explanation: assuming one more first integral (i.e., increasing $K$ by 1) reduces the number of degrees of freedom in the dynamics by 1, narrows the hypothesis space, accelerates learning convergence, and suppresses generalization errors.

As shown in Table 3, the performance of cFINDE and dFINDE is sensitive to the assumed number $K$ of first integrals. Because $K$ is a hyperparameter, it should in principle be tuned through evaluations on a validation set. With an inappropriately large $K$, the performance of both cFINDE and dFINDE dropped significantly; see the results of the 2-pend and FitzHugh-Nagumo datasets for $K = 6$ and $K = 3$, respectively.

However, the performance drop can be found even with the training set. Table A3 summarizes the prediction performance on the training set of the 2-pend dataset. As was the case with the test set, the performance dropped significantly at $K = 6$. This is because NODE with cFINDE for $K = 6$ assumes the submanifold $\mathcal{M}'$ to be 2-dimensional. The submanifold $\mathcal{M}'$ is in fact 3-dimensional, so NODE with cFINDE for $K = 6$ is incapable of learning the dynamics and performs poorly even on the training set. Hence, the training set is enough to avoid a fatally inappropriate $K$.

Alternatively, $K$ can be determined by using other methods (e.g., Fukunaga & Olsen (1971); Liu & Tegmark (2021)). Although these methods have some drawbacks, as discussed in Appendix A, they may be complementary to FINDE.

D.5 COMPARISON WITH MODIFIED NEURAL PROJECTION METHOD

The neural projection method (NPM) also employs a projection method (Yang et al., 2020). In a manner similar to Newton's method, it enforces the constraint $C(u) = 0$ by projecting the state $u$, under the assumption that the quantity $C(u)$ is always zero. This assumption holds for some cases (e.g., holonomic constraints in a fixed environment), but not for most first integrals, whose values depend on initial conditions.

For example, the linear momentum in the x-direction of the two-body problem is the first integral expressed as $V(u) = m_1 v_{x1} + m_2 v_{x2}$. This quantity $V$ is constant within a trial (i.e., $V(u(t)) = V(u(0))$) and varies between trials depending on the initial speeds $v_{x1}(0)$ and $v_{x2}(0)$. The total energy, the total mass, and many other first integrals depend on the initial condition in the same manner; hence, they are outside the scope of the NPM. In contrast, by imposing the constraint on the gradient $\nabla V = 0$ or the discrete gradient $\overline{\nabla} V = 0$, our proposed FINDE keeps the quantity $V$ constant and can handle any first integrals.

For comparison, we replaced the constraint $C(u) = 0$ with $C(u_{s+1}, u_s) = V(u_{s+1}) - V(u_s) = 0$ and adapted the NPM to first integrals varying from trial to trial. We evaluated the modified NPM using the 2-pend dataset. Because the modified NPM is a discrete-time projection method, we compared it with the discrete-time version of the proposed FINDE (dFINDE). The results are summarized in Table A4.

The dFINDE successfully learned the dynamics in all trials, but the modified NPM failed to learn the dynamics in half the trials (see the rightmost column for the numbers of successful trials out of 5). The modified NPM often encountered an underflow of the time-step size or a division by the zero gradient of the first integral. Even when the learning was successful, the performance of the NPM was inferior to that of the dFINDE. The modified NPM solved the optimization problem in Eq.
(3) at every step, but it sometimes diverged or failed to converge, especially in the early phase of learning.

The NPM was successful for fixed environments but might be unsuited for general first integrals varying from trial to trial. In contrast, the dFINDE does not require solving an optimization problem during training, making the learning process robust against randomness such as initialization.
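The modified constraint $C(u_{s+1}, u_s) = V(u_{s+1}) - V(u_s) = 0$ can be illustrated with a minimal sketch of a Newton-like projection. This is not the NPM implementation of Yang et al. (2020): the update rule, the function name `project_to_level_set`, the iteration count, the tolerance, and the choice $V(u) = u_0^2 + u_1^2$ are all assumptions made for illustration.

```python
def project_to_level_set(u_pred, u_prev, v, grad_v, iters=20, tol=1e-12):
    """Move u_pred along grad V until V(u) matches V(u_prev)."""
    target = v(u_prev)
    u = list(u_pred)
    for _ in range(iters):
        residual = v(u) - target  # C(u_{s+1}, u_s)
        if abs(residual) < tol:
            break
        g = grad_v(u)
        # A vanishing gradient makes gram zero and the update undefined.
        gram = sum(gi * gi for gi in g)
        u = [ui - residual / gram * gi for ui, gi in zip(u, g)]
    return u


def v_circle(u):
    # Hypothetical first integral V(u) = u0^2 + u1^2.
    return u[0] ** 2 + u[1] ** 2


def grad_v_circle(u):
    return [2.0 * u[0], 2.0 * u[1]]


u_prev = [1.0, 0.0]  # previous state, V = 1
u_pred = [1.1, 0.2]  # unconstrained prediction, V = 1.25
u_next = project_to_level_set(u_pred, u_prev, v_circle, grad_v_circle)
```

The loop makes the two failure modes mentioned above visible: the iteration needs a nonzero gradient at every step, and it is only guaranteed to converge when the prediction starts close enough to the level set.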

