CONTINUOUS-TIME IDENTIFICATION OF DYNAMIC STATE-SPACE MODELS BY DEEP SUBSPACE ENCODING

Abstract

Continuous-time (CT) modeling has proven to provide improved sample efficiency and interpretability in learning the dynamical behavior of physical systems compared to discrete-time (DT) models. However, even with numerous recent developments, the CT nonlinear state-space (NL-SS) model identification problem remains to be solved in full when considering common experimental aspects such as the presence of external inputs, measurement noise, latent states, and general robustness. This paper presents a novel estimation method that addresses all these aspects and obtains state-of-the-art results on multiple benchmarks with compact fully connected neural networks capturing the CT dynamics. The proposed estimation method, called the subspace encoder approach (SUBNET), achieves these results by efficiently approximating the complete simulation loss with short simulations on subsections of the data, by using an encoder function to estimate the initial state of each subsection, and by a novel state-derivative normalization that ensures stability and good numerical conditioning of the training process. We prove that the use of subsections increases cost function smoothness, derive the necessary requirements for the existence of the encoder function, and show that the proposed state-derivative normalization is essential for the reliable estimation of CT NL-SS models.

1. INTRODUCTION

Dynamical systems described by nonlinear state-space models with a state vector x(t) ∈ ℝ^{n_x} are powerful tools in many modern sciences and engineering disciplines for understanding potentially complex dynamical behavior. One can distinguish between Discrete-Time (DT) models x_{k+1} = f(x_k, u_k) and Continuous-Time (CT) models dx(t)/dt = f(x(t), u(t)). In general, obtaining DT dynamical models from data is easier than obtaining CT models, since data in computers is represented as discrete elements (e.g., arrays). However, the additional implementation complexity and computational cost associated with identifying CT models can be justified in many cases. First and foremost, we know from the natural sciences that many systems are compactly described by CT dynamics, which makes the continuity prior of CT models a well-motivated regularization (De Brouwer et al., 2019). It has been observed that this regularization can be beneficial for sample efficiency (De Brouwer et al., 2019), which is a common observation when "including physics" in learning approaches (Karniadakis et al., 2021). Furthermore, the analysis of ODEs is a well-regarded field of study with many powerful results and methods which could further improve model interpretability (Fan et al., 2021), as applied in Bai et al. (2019). Another inherent advantage is that these models can be used with irregularly sampled or missing data (Rudy et al., 2019). Lastly, in the control community, CT models are generally regarded as desirable for control synthesis tasks, as shaping the behavior of the controller is much more intuitive in CT (Garcia et al., 1989). Hence, developing robust and general CT models and estimation methods would be greatly beneficial. In the identification of physical CT systems, it is common to encounter challenges such as: external inputs u(t), noisy measurements, latent states, an unknown measurement function/distribution (e.g.
y(t) = h(x(t))), the need for accurate long-term predictions, and the need for a sufficiently low computational cost. For instance, all of these aspects need to be considered for the cascaded tanks benchmark problem (Schoukens & Noël, 2017). These aspects and the considered CT state-space model are summarized in Figure 1. Many of these aspects have been studied independently: for instance, Brajard et al. (2020); Rudy et al. (2019) explicitly address the presence of noise on the measurement data, Maulik et al. (2020); Chen et al. (2018) provide methods for modeling dynamics with latent states, Zhong et al. (2020) considers the presence of known external inputs, and Zhou et al. (2021a) provides a computationally tractable method for accurate long-term sequence modeling. However, formulating models and estimation methods for the combination of multiple or all of these aspects is comparatively underdeveloped, with only a few attempts such as Forgione & Piga (2021a). In contrast to previous work, we present a CT encoder-based method that is a general, robust, and well-performing estimation method for CT state-space model identification. That is, the formulation addresses noise assumptions, external inputs, latent states, and an unknown output function, and provides state-of-the-art results on multiple benchmarks of real systems. The presented subspace encoder method is summarized in Figure 2. The proposed method evaluates the cost function on only short subsections of the available dataset, which reduces the computational complexity. Furthermore, we show theoretically that considering subsections enhances cost function smoothness and thus optimization stability. The initial states of these subsections are estimated using an encoder function, for which we present necessary requirements for its existence. Lastly, we introduce a normalization of the state and state-derivative and show that it is required for proper CT estimation.
Moreover, these results are obtained without imposing a specific structure on the state-space (such as in Greydanus et al. (2019); Cranmer et al. (2020)), yielding a practically widely applicable method. Our main contributions are the following:
• We formally derive the problem of CT state-space model estimation with latent states, external inputs, and measurement noise.
• We reduce the computational load by proposing a subspace encoder-based identification algorithm that employs short subsections, an encoder function that estimates the initial latent states of these subsections, and a state-derivative normalization term for robustness.
• We make multiple theoretical contributions: (i) we prove that the use of short subsections increases cost function smoothness by a Lipschitz continuity analysis, (ii) we derive necessary conditions for the encoder function to exist, and (iii) we show that a state-derivative normalization term is required for proper CT model estimation.
• We demonstrate that the proposed estimation method obtains state-of-the-art results on multiple benchmarks.

2. RELATED WORK

One of the most influential papers in CT model estimation is the introduction of neural ODEs (Chen et al., 2018), which showed that residual networks can be interpreted as an Euler discretization of a continuous-in-depth neural network. Moreover, they also showed that one can employ numerical integrators to integrate through the depth in a computationally efficient manner. This depth can be interpreted as the time direction in order to model dynamical systems. The ideas in the neural ODE contribution have been extended to and used in, for instance, normalizing flows to efficiently model arbitrary probability distributions (Papamakarios et al., 2021; Grathwohl et al., 2019) and enhancing the understanding and interpretability of neural networks (Fan et al., 2021). However, the neural ODE does not scale well for long sequences, nor does it consider external inputs or noise, and the optimization process is often unstable. An adjacent research direction comprises methods and models which consider CT dynamics and directly use the state derivatives, and often even the noiseless states, to formulate structured and interpretable models such as Hamiltonian Neural Networks (HNN) (Greydanus et al., 2019), Lagrangian Neural Networks (LNN) (Cranmer et al., 2020), and Sparse Identification of Nonlinear Dynamics (SINDy) (Brunton et al., 2016). In contrast, the proposed method is formulated for an unstructured state-space and does not require the system state or the state derivatives to be known. Our method is in part related to Ayed et al. (2019), which concerns the estimation of CT models with latent variables. They also employ an encoder function to estimate initial states; however, this encoder depends only on the past outputs, contains a partially known state, and there is no theoretical support for the method.
Furthermore, in that work, only a fixed output function is considered and the involved optimization problem is solved as an optimal control problem, whereas our formulation alters the simulation loss function to obtain a computationally desirable form. Furthermore, Forgione & Piga (2021a), to which we compare in this work, considers CT models with latent variables, subsections, and an additional loss term for the integration error. However, they include the initial states of these subsections as free optimization parameters, which increases the model complexity with the number of subsections. In contrast, our proposed method uses an encoder to estimate the initial states, resulting in a fixed model complexity. Furthermore, we employ only a single loss function and a novel state-derivative normalization term. Additionally, we provide theoretical insights into these existing elements and extend them to the considered setting in a robust and computationally efficient manner.

3. PROBLEM STATEMENT

Consider a system represented by a continuous-time nonlinear state-space (CT NL-SS) description, sampled at a fixed interval ∆t for simplicity:

ẋ_s(t) = f(x_s(t), u(t)),    y_k = h(x_{s,k}, u_k) + w_k,    (1)

where the subscript notation denotes sampling as x_{s,k} = x_s(k∆t), x_s(t) ∈ ℝ^{n_{x_s}} is the system state variable, u(t) ∈ ℝ^{n_u} is the input, y_k ∈ ℝ^{n_y} is the output, f represents the system dynamics and h the output function, while w_k ∈ ℝ^{n_y} is an i.i.d. zero-mean white noise process with finite variance Σ_w. For this system, the CT model estimation problem can be expressed for a given dataset of measurements D_N = {(u_0, y_0), (u_1, y_1), ..., (u_{N−1}, y_{N−1})}, with unknown w_k, x_s(t), ẋ_s(t), ẏ_k and initial state x_s(0), as the following optimization problem (a.k.a. simulation loss minimization):

min_{θ, x(0)}  (1/N) Σ_{k=0}^{N−1} ∥y(k∆t) − ŷ(k∆t)∥₂²,    s.t.
ŷ(t) = h_θ(x(t)),    ẋ(t) = f_θ(x(t), u(t)),    (2)

where x(t) ∈ ℝ^{n_x} is the model state, and h_θ and f_θ are the output and state-derivative functions, parameterized by θ and Lipschitz continuous in their inputs and parameterization. These two functions are formulated as multi-layer feedforward neural networks in our experiments. To obtain the simulated output ŷ(k∆t), one can integrate ẋ(t) = f_θ(x(t), u(t)) starting from the initial state x(0). This integration can be performed with any ODE solver that allows for backpropagation, such as Euler (x(t + ∆t) = x(t) + ∆t f_θ(x(t), u(t))), RK4, or numerous adaptive-step methods (Chen et al., 2018; Ribeiro et al., 2020).¹ To make this a well-posed optimization problem, additional information or an assumption on the inter-sample behavior of u(t) is required, since, for example, u(∆t/2) is not present in D_N. This behavior is often chosen to be Zero-Order Hold (ZOH) (Ljung, 1999), as depicted in Figure 1. Multiple major issues are encountered when solving optimization Problem (2) with a gradient-descent-based method. The first issue is that computing the value of the loss function requires a forward pass over the whole length of the dataset (Ayed et al., 2019). Hence, the computational complexity grows linearly with the length of the dataset. Furthermore, a common occurrence is that the value of x(t) or its gradient grows exponentially, which results in non-smooth loss functions or gradients. This causes gradient-based optimization algorithms to become unreliable, since the optimization process might be unstable or converge to a poor local minimum (Ribeiro et al., 2020). All these issues are addressed in the proposed method.
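As an illustration of such a solver step, an RK4 update under the ZOH input assumption can be sketched as follows. This is a minimal NumPy sketch; `f_theta` here is a linear toy stand-in for the state-derivative network, not the paper's implementation:

```python
import numpy as np

def rk4_step(f, x, u, dt):
    """One RK4 step of dx/dt = f(x, u), with the input u held
    constant over the interval (zero-order hold)."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy stand-in for the state-derivative network: f(x, u) = A x + B u
A = np.array([[0.0, 1.0], [-1.0, -0.2]])
B = np.array([[0.0], [1.0]])
f_theta = lambda x, u: A @ x + B @ u

x = np.zeros((2, 1))
for u_k in [1.0, 1.0, 0.0, 0.0]:  # ZOH input sequence u_0, ..., u_3
    x = rk4_step(f_theta, x, np.array([[u_k]]), dt=0.1)
```

A single RK4 step per sample, as used later in the experiments, keeps the cost per simulated sample fixed while retaining fourth-order local accuracy.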

4. PROPOSED METHOD

We propose to consider multiple overlapping short subsections of length T∆t to form a truncated simulation loss, instead of simulating over the entire length of the dataset. We express this in the following optimization problem (note that we use discrete-time notation u_k := u(k∆t) for brevity):

minimize_θ  (1/(N − T − max(n_a, n_b) + 1)) Σ_{n=max(n_a,n_b)}^{N−T} (1/T) Σ_{k=0}^{T−1} ∥y_{n+k} − ŷ_{n+k|n}∥₂²,
s.t.  ŷ_{n+k|n} = h_θ(x_{n+k|n}),
      x_{n+k+1|n} = ODEsolve[(1/τ) f_θ, x_{n+k|n}, u_{n+k}, ∆t],
      x_{n|n} = ψ_θ(u_{n−1}, ..., u_{n−n_b}, y_{n−1}, ..., y_{n−n_a}).    (3)

Here, the pipe (|) notation indicates the current index and the starting index as (current index | start index) to differentiate between subsections. This pipe notation is similar to the notation used in Kalman filtering and conditional probability distributions (Chui et al., 2017). Furthermore, ODEsolve indicates a numerical scheme which integrates (1/τ) f_θ(x, u) from the initial state x_{n+k|n} over a length ∆t given the input u_{n+k}. Lastly, we introduce an encoder function ψ_θ with encoder lengths n_a and n_b for the past output and input samples, respectively, which estimates the initial state of the considered subsection. This encoder function is also parameterized as a feedforward neural network in our experiments. A graphical summary of the proposed method, called the CT subspace encoder approach (abbreviated as SUBNET), is given in Figure 2. The first observation is that optimization Problem (3) is a generalisation of (2), since if T = N and n_a = n_b = 0, the original optimization Problem (2) is recovered. However, this optimization problem is less computationally challenging to solve if T < N, since the first sum can be computed in parallel. In other words, the computational cost scales as O(T) for (3) and O(N) for (2).
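To make the index bookkeeping of Eq. (3) concrete, the overlapping subsections and the encoder's past windows can be constructed as follows. This is a NumPy sketch under the conventions above; the names `make_subsections`, `u_past`, etc. are our own and not taken from the paper's code:

```python
import numpy as np

def make_subsections(u, y, T, na, nb):
    """Slice a dataset into all overlapping subsections of length T.
    For each start index n, return the encoder inputs (na past outputs
    and nb past inputs, most recent first) and the T samples used in
    the truncated loss of Eq. (3)."""
    N = len(y)
    starts = np.arange(max(na, nb), N - T + 1)
    u_past = np.stack([u[n - nb:n][::-1] for n in starts])  # u_{n-1},...,u_{n-nb}
    y_past = np.stack([y[n - na:n][::-1] for n in starts])  # y_{n-1},...,y_{n-na}
    u_fut  = np.stack([u[n:n + T] for n in starts])          # inputs over the subsection
    y_fut  = np.stack([y[n:n + T] for n in starts])          # targets over the subsection
    return u_past, y_past, u_fut, y_fut

N, T, na, nb = 100, 10, 5, 5
rng = np.random.default_rng(0)
u, y = rng.standard_normal(N), rng.standard_normal(N)
up, yp, uf, yf = make_subsections(u, y, T, na, nb)
```

The number of rows equals the normalization constant N − T − max(n_a, n_b) + 1 of Eq. (3), and each row can be simulated independently, which is what allows batched, parallel evaluation of the loss.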
Moreover, the smoothness of the encoder cost function is also enhanced, since the associated Lipschitz constant L_{V_enc} can scale exponentially with the subsection length T∆t, as shown in Theorem 1. The enhanced smoothness is reflected in the ease of optimization (Ribeiro et al., 2020).

Theorem 1. The Lipschitz constant L_{V_enc} of the cost function (3), defined by

∥V_enc(θ₁) − V_enc(θ₂)∥₂ ≤ L_{V_enc} ∥θ₁ − θ₂∥₂,    (4)

scales as

L_{V_enc} = O(exp(2T∆t L_f/τ)),    (5)

where L_f is the Lipschitz constant of f_θ.

Proof. See Appendix 8.1.

An error in the initial state x_{n|n} can significantly bias the estimate due to the short nature of the subsections. To counter this, we formulate an encoder function ψ_θ which estimates the initial state of each subsection. We do not add any additional loss term, since an improved initial state estimate also minimizes the transient error (Forgione & Piga, 2021a), which is present in the encoder loss. A natural question to ask is under which conditions there exists an encoder function that can map the past inputs and outputs to this initial state. In Appendix 8.2, we formally derive necessary conditions for its existence.
These necessary conditions are, among others, that the state derivative f_θ is Lipschitz continuous in x, and that, if the numbers of considered past outputs n_a and inputs n_b are equal, then n_a ≥ n_x/n_y needs to be satisfied. It is widely known that input and output normalization is essential for obtaining competitive models throughout deep learning, in terms of respecting the prior assumptions made in, for instance, Xavier weight initialization (Glorot & Bengio, 2010). However, input and output normalization is insufficient when considering CT state-space models, due to the presence of the hidden state x and the state derivative f_θ(x, u). As shown in Theorem 2, any CT system can be transformed to become normalized by the introduction of a state transformation and a positive normalization factor 1/τ.

Theorem 2. Given ẋ(t) = f(x(t), u(t)) and y(t) = h(x(t), u(t)) that define the dynamics of a system. For any bounded non-zero state trajectory x(t) ∈ ℝ^{n_x} and input signal u(t) ∈ ℝ^{n_u} that satisfy ẋ(t) = f(x(t), u(t)) for all t ∈ ℝ, there exist a τ and a scalar state transformation γ x̄(t) = x(t) such that both the equivalent state trajectory x̄ and the state-derivative function f̄(x̄, u) of the transformed system ẋ̄ = (1/τ) f̄(x̄(t), u(t)) are normalized on the time interval [0, L] as

RMS(x̄) = sqrt( (1/L) ∫₀^L (1/n_x) ∥x̄(t)∥₂² dt ) = 1   and   RMS(f̄(x̄, u)) = 1,    (6)

if RMS(f(x, u)) ≠ 0.

Proof. With γ = RMS(x) and 1/τ = RMS(ẋ)/RMS(x), the normalization conditions are satisfied, as shown below:

RMS(x̄) = RMS(x)/γ = 1,    (8a)
RMS(f̄(x̄, u)) = RMS(τ ẋ̄) = τ RMS(ẋ)/γ = τ RMS(ẋ)/RMS(x) = 1.    (8b)

Hence, to ensure the existence of a properly normalized model, where both the state-derivative function f̄ and the state x̄ are normalized, it is sufficient to include a state and a state-derivative normalization factor. Furthermore, this proof also guides the choice of τ, since the value of RMS(x)/RMS(ẋ) might be known from physical insight or from an approximate model.
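The proof of Theorem 2 is constructive, so the normalization can be checked numerically on any sampled trajectory. A minimal sketch, where the finite-difference estimate of ẋ via `np.gradient` is our own choice:

```python
import numpy as np

def rms(z):
    # RMS over time and state dimensions: discrete analogue of Eq. (6)
    return np.sqrt(np.mean(z ** 2))

# Example trajectory: x(t) = 10 sin(t), sampled on [0, 20]
dt = 1e-3
t = np.arange(0.0, 20.0, dt)
x = 10.0 * np.sin(t)
xdot = np.gradient(x, dt)        # finite-difference estimate of dx/dt

gamma = rms(x)                    # state normalization: gamma = RMS(x)
tau = rms(x) / rms(xdot)          # 1/tau = RMS(xdot)/RMS(x)

x_bar = x / gamma                 # normalized state trajectory
f_bar = tau * np.gradient(x_bar, dt)  # normalized state derivative
```

Both `rms(x_bar)` and `rms(f_bar)` come out as 1 by construction, mirroring Eqs. (8a) and (8b); in practice such a rough estimate of RMS(x)/RMS(ẋ) from data or an approximate model can be used to pick τ.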

5.1. BENCHMARK DESCRIPTIONS

The Cascade Tank with overflow (CCT) benchmark (Schoukens & Noël, 2017; Schoukens et al., 2017) consists of measurements taken from a two-tank fluid system with a pump. The input signal controls a water pump that delivers water from a reservoir to the upper tank. Through a small opening in the upper tank, the water enters the lower tank, where the water level is recorded. Lastly, through a small opening in the lower tank, the water re-enters the reservoir. This benchmark is nonlinear, as the flow rates are governed by square-root relations and the water can overflow either tank, which is a hard saturation nonlinearity. The benchmark consists of two datasets of 1024 measured samples each with a sampling time of ∆t = 4 s. The first dataset is used for training; the first 512 samples of the second set are used for validation (used only for early stopping) and the entire second set for testing. Most of the other methods to which we compare our approach use the entire second set as both validation and test set, as no explicit test set is provided in this benchmark description. The Coupled Electric Drive (CED) benchmark (Wigren & Schoukens, 2017) consists of measurements from a belt-and-pulley system with two motors, where both clockwise and counter-clockwise movement is permitted. The motors are actuated by the given inputs and the measured output comes from a pulse transducer that only measures the absolute velocity of the belt (i.e. it is insensitive to the sign of the velocity). The system has approximately three states: the velocity of the belt, the position of the pulley, and the velocity of the pulley. The benchmark consists of two datasets of 500 measured samples each with a sampling time of ∆t = 20 ms. For both datasets, the first 300 samples are used for training and the remaining 200 samples for testing; of those, the first 100 samples are also used for validation.
Similar to the previous benchmark, even with this overlap it is still a fair comparison, as most of the other methods to which we compare use the entire second set as validation and test set. The Electro-Mechanical Positioning System (EMPS) benchmark (Janot et al., 2019) consists of measured signals from a one-dimensional drive system used to drive the prismatic joint of robots or machine tools. The provided position measurements are obtained in closed loop and no direct velocity measurements are available. The main source of nonlinearity is the nonlinear friction effects (e.g. static and dynamic friction). The benchmark consists of two sequences of 24841 samples with a sampling time of ∆t = 1 ms. As prescribed by the benchmark, the first sequence is used for training and validation and the second sequence for testing (i.e. the validation set and test set are completely disjoint). Specifically for the CT SUBNET implementation, we utilize 17885 samples of the first set for training and the last 6956 samples for validation, while the entire second set is used for testing.

Table 1: The test RMSE of simulation on two benchmarks for the CT SUBNET method using an ensemble of models. The value given is the best simulation RMSE over all estimated models and the value between parentheses is the mean performance of the models. Note that we are unable to report the results for the neural ODE without state-derivative normalization 1/τ for CCT since the optimization was unstable.

(a) CCT benchmark
  Method                                                        RMSE
  BLA (Relan et al., 2017)                                      0.75
  Volterra model (Birpoutsoukis et al., 2018)                   0.54
  State-space with GP-inspired prior (Svensson & Schön, 2017)   0.45
  SCI (Forgione & Piga, 2021a)                                  0.40
  IO stable CT ANN (Weigand et al., 2021)                       0.39
  NL-SS + NLSS2 (Relan et al., 2017)                            0.34
  TSEM (Forgione & Piga, 2021a)                                 0.33
  Tensor B-splines (Karagoz & Batselier, 2020)                  0

5.2. RESULTS

Using the SUBNET method, we estimate models where the three functions h_θ, f_θ and ψ_θ are implemented as neural networks with 2 hidden layers of 64 nodes each, tanh activation, and a linear bypass from the input to the output for CCT and CED, and with 1 hidden layer of 30 nodes for EMPS. As an ODE solver, we use a single RK4 step between samples and assume that the input signal has been applied in a zero-order hold sense. For the implementation of the CT subspace encoder-based method, the following hyperparameters are considered: n_x = 2, n_a = n_b = 5 and T = 30 for CCT; n_x = 3, n_a = n_b = 4 and T = 60 for CED; and n_x = 3, n_a = n_b = 20 and T = 200 for EMPS. These hyperparameters are chosen based on the hyperparameter analysis shown in Beintema et al. (2021) for discrete time; we observed similar effects of the hyperparameters in continuous time. The training is done using the Adam optimizer with default settings (Kingma & Ba, 2015), with a batch size of 32 for CED, 64 for CCT, and 1024 for EMPS, and using a simulation on the validation dataset for early stopping to reduce overfitting. We also directly compare our method with a reproduction of the neural ODE on both benchmarks. We adapt the code of the example ("latent ODE.py") available online (Chen et al., 2018) to include ZOH inputs and an RK4 integrator, leaving the neural network unaltered. We observed that the initial model was unstable for CCT and underperforming for CED; hence, the neural ODE method alone was unable to provide state-of-the-art results. To stabilize and improve the neural ODE method, we also introduce a state-derivative normalization term 1/τ, motivated by Theorem 2. The value of 1/τ for CCT and CED for the neural ODE was initially chosen to be the optimal value found in the SUBNET approach; however, in the CED case, it was lowered due to optimization instabilities.² We compare our obtained models to the literature in Table 1 for CCT and CED. We report both the mean and the minimum over an ensemble of models estimated differing only in parameter initialization. This ensemble consists of 17 SUBNET models for both CCT and CED, and 24 and 8 neural ODE models for CCT and CED, respectively.
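For reference, a forward pass of the network architecture described above (2 hidden tanh layers with a linear input-to-output bypass) might look as follows. This is an illustrative NumPy sketch of the structure only, not the paper's PyTorch implementation; training would of course be done with an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_with_bypass(n_in, n_out, n_hidden=64, n_layers=2):
    """A tanh MLP with a linear bypass from input to output, the
    structure used here for h_theta, f_theta and psi_theta
    (forward pass only)."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    Ws = [rng.standard_normal((a, b)) / np.sqrt(a)  # Xavier-style scaling
          for a, b in zip(sizes, sizes[1:])]
    bs = [np.zeros(b) for b in sizes[1:]]
    W_bypass = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

    def forward(z):
        h = z
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = np.tanh(h @ W + b)
        return h @ Ws[-1] + bs[-1] + z @ W_bypass  # linear bypass term
    return forward

# f_theta for CCT: input is the concatenated state (n_x = 2) and input (n_u = 1)
f_theta = mlp_with_bypass(n_in=2 + 1, n_out=2)
dx = f_theta(np.ones((5, 3)))  # batch of 5 (state, input) pairs
```

The linear bypass lets the network represent an affine (linearized) model exactly even when the tanh layers contribute little, which tends to help early in training.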
The table also includes the discrete-time (DT) subspace encoder, which has the same network structure and loss function as the CT subspace encoder, but where the ODE solver is replaced by f_θ. The table shows that the models obtained with the CT subspace encoder method provide state-of-the-art results. The obtained models are the best known with a black-box modeling approach on both benchmarks. Furthermore, we use an unrestricted state-space and fully connected neural networks as model elements. Remarkably, the resulting performance is close to the performance of a grey-box model. Furthermore, Figure 4 illustrates that the resulting models have been able to capture the nonlinear behavior present in both benchmarks. Table 1 also contains the results of the modified neural ODE with normalization. One observation is that the difference between the best and mean performance is significantly larger than for SUBNET. We attribute this to the availability of only a single sequence in the training set for the CCT benchmark (and two sequences for CED), which results in extensive overfitting. In comparison, the subspace encoder method is less prone to overfitting since it uses many subsections of the available sequence(s). Moreover, the subspace encoder method only requires about 20 minutes to train a model to the lowest validation loss, whereas the neural ODE requires about 2 hours for CED and 5 hours for CCT. We compare the CT SUBNET method applied to the EMPS benchmark with the existing methods in Table 2 and show the simulated response on the test set in Figure 5. The obtained results show remarkable accuracy over the entire 24841 samples, demonstrating that the CT SUBNET method is able to make accurate long-term predictions while using relatively short subsections of only T = 200 samples during training. In the table, dynoNET is significantly better than the proposed method; however, this method utilizes grey-box (i.e.
physics-based) knowledge in the network structure, but this is system specific and not easily generalizable. Lastly, IO stable CT ANN performs slightly better than the proposed method since it enforces stability; the position integrator present in the system can otherwise be a problem during estimation. Furthermore, since the validation and test set are disjoint, we also show that overfitting does not play a role in the reported results. When we introduced the state-derivative normalization factor 1/τ, we argued that it would normalize f_θ and that it would increase optimization stability and, hence, the quality of the obtained models. Here, we provide empirical support for these two statements by performing a parameter sweep over ∆t/τ. To eliminate variations due to different initial parameters, we trained an ensemble of models, which yields box plots of the mean state amplitude, defined by RMS(x) ≜ sqrt((1/(N n_x)) Σ_k ∥x(k∆t)∥₂²), the mean state-derivative amplitude RMS(f), and the simulation RMSE. These box plots, shown in Figure 6, indeed illustrate that there exists a 1/τ such that RMS(f) ≈ RMS(x) ≈ 1 and that the best performing models are close to that value of 1/τ. Moreover, to illustrate that improper normalization (i.e. τ = 1) can diminish the performance for both CCT and CED, observe that the simulation RMSE on the test set(s) is 2.0 and 0.3 [ticks/s] for ∆t/τ = ∆t = 4 s and ∆t/τ = ∆t = 0.02 s, respectively.

6. CONCLUSION

In this paper, we have introduced the CT subspace encoder approach to identify nonlinear dynamical systems in the presence of latent states, external inputs, and measurement noise. We have shown that the proposed method can obtain highly accurate CT models consisting only of fully connected neural networks. The approach improves computational cost and stability by considering multiple subsections, where the initial state is estimated with an encoder function, and by a state-derivative normalization term that improves optimization stability. We provided multiple theoretical results which give additional insight and motivation for the method: increased cost function smoothness, necessary conditions for the existence of the encoder function, and the fact that proper normalization in CT modeling requires the state-derivative normalization term. Furthermore, we obtain state-of-the-art results on all three considered benchmarks.

• Datasets: (i) The CCT dataset is described in Schoukens & Noël (2017); Schoukens et al. (2017) and is available for download at https://data.4tu.nl/articles/dataset/Cascaded_Tanks_Benchmark_Combining_Soft_and_Hard_Nonlinearities/12960104, (ii) the CED dataset is described in Wigren & Schoukens (2017) and is available for download at http://www.it.uu.se/research/publications/reports/2017-024/, (iii) the EMPS dataset is described in (Janot et al., 2019).

8.1. PROOF OF THEOREM 1

Recall that the subspace encoder loss function as in Eq. (3) can be expressed in the following form:

V_enc(θ) = (1/(N − T − max(n_a, n_b) + 1)) Σ_{n=max(n_a,n_b)}^{N−T} (1/T) Σ_{k=0}^{T−1} ∥y_{n+k} − ŷ_{n+k|n}∥₂²,

with

ŷ_{n+k|n} = h_θ(x_{n+k|n}) = h_θ(x((n + k)∆t | n∆t)),
ẋ(t | n∆t) = (1/τ) f_θ(x(t | n∆t), u(t)),
x(n∆t | n∆t) = ψ_θ(u_{n−1}, ..., u_{n−n_b}, y_{n−1}, ..., y_{n−n_a}).

Our aim is to derive the scaling in T of the Lipschitz constant L_enc as defined in

|V_enc(θ₁) − V_enc(θ₂)|² ≤ L²_enc ∥θ₁ − θ₂∥₂²,   ∀θ₁, θ₂ ∈ Θ ⊂ ℝ^{n_θ}.
We aim to express L_enc in terms of the following Lipschitz constants:

∥h_{θ₁}(x₁) − h_{θ₂}(x₂)∥₂² ≤ L_h² (∥x₁ − x₂∥₂² + ∥θ₁ − θ₂∥₂²),    (11a)
∥f_{θ₁}(x₁, u) − f_{θ₂}(x₂, u)∥₂² ≤ L_f² (∥x₁ − x₂∥₂² + ∥θ₁ − θ₂∥₂²),    (11b)
∥ψ_{θ₁}(u_past, y_past) − ψ_{θ₂}(u_past, y_past)∥₂² ≤ L_ψ² ∥θ₁ − θ₂∥₂²,    (11c)

for all x₁, x₂ ∈ X ⊂ ℝ^{n_x} and all θ₁, θ₂ ∈ Θ ⊂ ℝ^{n_θ}. A sufficient condition for these Lipschitz constants to be finite is that the derivatives are finite on a compact set of inputs, which is often the case for feedforward neural networks. The encoder loss function as in Eq. (3) can be written as a sum:

V_enc(θ) = (1/(N − T − max(n_a, n_b) + 1)) Σ_{n=max(n_a,n_b)}^{N−T} V_sec(n, θ),    (12a)
V_sec(n, θ) = (1/T) Σ_{k=0}^{T−1} ∥y_{n+k} − ŷ_{n+k|n}∥₂²,    (12b)

where

|V_sec(n, θ₁) − V_sec(n, θ₂)|² ≤ L_sec² ∥θ₁ − θ₂∥₂².

Then, by the sum property, this implies that L_enc = L_sec, since

L_enc = (1/(N − T − max(n_a, n_b) + 1)) Σ_{n=max(n_a,n_b)}^{N−T} L_sec.

Hence, it is sufficient to consider only a single subsection. Take n = 0 and drop the bar notation for simplicity. By the sum and multiplication properties, we get

|V_sec(θ₁) − V_sec(θ₂)| ≤ (2/T) Σ_{k=0}^{T−1} (M_y + M_k) ∥ŷ_{1,k} − ŷ_{2,k}∥₂,

where M_y is the bound on ∥y(t)∥₂, assuming a stable system, and M_k the bound on ∥ŷ_k∥₂. The bound M_k scales the same as ∥ŷ_{1,k} − ŷ_{2,k}∥₂, as shown in Ribeiro et al. (2020). The expression ∥ŷ_{1,k} − ŷ_{2,k}∥₂ can be expanded using Eq. (11a) as

∥ŷ_{1,k} − ŷ_{2,k}∥₂² ≤ L_h² (∥x₁(k∆t) − x₂(k∆t)∥₂² + ∥θ₁ − θ₂∥₂²).

Next, we aim to derive an expression for the Lipschitz constant L_x(t) given by

∥x₁(t) − x₂(t)∥₂² ≤ L_x(t)² ∥θ₁ − θ₂∥₂²,    (16)

where, by Eq. (11c), L_x(0) = L_ψ. By considering a small increment in time of length h, we can express L_x(t + h) in terms of L_x(t).
Using the fact that h is small, we can use an Euler step and discard higher-order terms of h:

∥x₁(t + h) − x₂(t + h)∥₂² = ∥(x₁(t) − x₂(t)) + (h/τ)(f_{θ₁}(x₁(t), u(t)) − f_{θ₂}(x₂(t), u(t)))∥₂²
  ≤ ∥x₁(t) − x₂(t)∥₂² + (2h/τ) ∥x₁(t) − x₂(t)∥₂ ∥f_{θ₁}(x₁(t), u(t)) − f_{θ₂}(x₂(t), u(t))∥₂

by the triangle inequality. Next, we can bound the f terms by Eq. (11b) and the x terms by Eq. (16) to derive an expression for L_x(t + h):

∥x₁(t + h) − x₂(t + h)∥₂² ≤ ∥x₁(t) − x₂(t)∥₂² + (2h/τ) ∥x₁(t) − x₂(t)∥₂ L_f sqrt(∥x₁(t) − x₂(t)∥₂² + ∥θ₁ − θ₂∥₂²)
  ≤ ( L_x(t)² + (2h/τ) L_x(t) L_f sqrt(L_x(t)² + 1) ) ∥θ₁ − θ₂∥₂²,

hence

L_x(t + h) = sqrt( L_x(t)² + 2h L_x(t) sqrt(L_x(t)² + 1) L_f/τ ).

This expression allows us to derive that the derivative of L_x(t) is given by

L̇_x(t) = lim_{h→0} (L_x(t + h) − L_x(t))/h = sqrt(1 + L_x(t)²) L_f/τ.
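For completeness, this scalar ODE can be solved in closed form, which makes the exponential scaling claimed in Theorem 1 explicit. This is a worked step we add for illustration; it is consistent with, but not part of, the derivation above:

```latex
\dot{L}_x(t) = \sqrt{1 + L_x(t)^2}\,\frac{L_f}{\tau}, \quad L_x(0) = L_\psi
\;\Longrightarrow\;
\int \frac{\mathrm{d}L_x}{\sqrt{1 + L_x^2}} = \int \frac{L_f}{\tau}\,\mathrm{d}t
\;\Longrightarrow\;
L_x(t) = \sinh\!\left(\frac{L_f}{\tau}\,t + \operatorname{asinh} L_\psi\right).
```

Since sinh(z) grows as e^z/2 for large z, we obtain L_x(T∆t) = O(exp(T∆t L_f/τ)); the squared norms in the cost function then yield the O(exp(2T∆t L_f/τ)) scaling of Eq. (5), so shrinking T (or increasing τ) directly reduces this constant.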



¹ The adjoint method for gradient computation is not within the scope of this research.
² The code used for both the SUBNET and neural ODE experiments is available at https://github.com/GerbenBeintema/CT-subnet



Figure 1: In this work, we consider the problem of estimating continuous-time (CT) state-space models with long-term prediction capabilities from noisy observations (additive noise), in the presence of hidden states and external signals, in a computationally efficient and robust manner.

Figure 2: The CT subspace encoder (SUBNET) method applied to a subsection of the data of length TΔt starting at time t. The encoder ψ_θ estimates the initial state, h_θ provides the output predictions, and (1/τ) f_θ governs the state dynamics. All three functions are parameterized by fully connected neural networks. The factor 1/τ is the novel state-derivative normalization, which significantly increases optimization stability. All three functions are optimized together by minimizing the mean squared difference between the model outputs and the measurements over multiple subsections of the available training data, as in Eq. (3), which both reduces the computational cost and enhances cost function smoothness, as shown in Theorem 1.
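The structure described in the Figure 2 caption can be summarized in a minimal numpy sketch. The network sizes and random weights below are illustrative stand-ins (in practice the networks are trained), and a simple forward-Euler integrator is used for brevity; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu, ny, na, nb, T, dt, tau = 2, 1, 1, 3, 3, 10, 0.05, 0.5

def mlp(sizes):
    """Random small fully connected net (illustrative stand-in for a trained net)."""
    return [(rng.normal(scale=0.3, size=(m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def apply(net, x):
    for i, (W, b) in enumerate(net):
        x = W @ x + b
        if i < len(net) - 1:
            x = np.tanh(x)       # hidden-layer nonlinearity
    return x

f_theta   = mlp([nx + nu, 16, nx])            # state-derivative network
h_theta   = mlp([nx, 16, ny])                 # output network
psi_theta = mlp([na * ny + nb * nu, 16, nx])  # encoder network

def subnet_forward(u_past, y_past, u_future):
    """One sub-section: encoder -> normalized Euler integration -> outputs."""
    x = apply(psi_theta, np.concatenate([u_past, y_past]))  # initial state estimate
    y_hat = []
    for k in range(T):
        y_hat.append(apply(h_theta, x))
        xdot = apply(f_theta, np.concatenate([x, u_future[k]])) / tau  # 1/tau normalization
        x = x + dt * xdot                                   # forward-Euler step
    return np.array(y_hat)

y_hat = subnet_forward(rng.normal(size=nb * nu), rng.normal(size=na * ny),
                       rng.normal(size=(T, nu)))
print(y_hat.shape)  # (10, 1)
```

Training would minimize the mean squared difference between `y_hat` and the measured outputs, averaged over many such sub-sections.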

Figure 3: A photo of the Cascade Tank with overflow (CCT) system (Schoukens & Noël, 2017) and a graphical depiction of the Coupled Electric Drive (CED) system (Wigren & Schoukens, 2017). These two systems and the EMPS benchmark form the basis of the benchmarks used in the analysis and comparison of the SUBNET method.

n_x = 2, n_a = n_b = 5 and T = 30 for CCT; n_x = 3, n_a = n_b = 4 and T = 60 for CED; and n_x = 3, n_a = n_b = 20 and T = 200 for EMPS. These hyperparameters are chosen based on the hyperparameter analysis shown in Beintema et al. (

Figure 4: Time-domain simulation on the test set for both the (a) CCT and (b) CED benchmarks of the models obtained by the CT SUBNET method. Since the CED benchmark contains two separate test sequences, these are shown in two separate figures.

Figure 6: The influence of the state-derivative normalization hyperparameter Δt/τ, as in ẋ = (1/τ) f_θ(x, u), on different model properties for both the (a) CCT and (b) CED benchmarks. The shown model properties are the mean state amplitude RMS(x), the mean state-derivative amplitude RMS(f), and the simulation RMSE on the test set(s). These two figures show that there exists a range of Δt/τ where RMS(f) ≈ RMS(x) ≈ 1, which numerically validates Theorem 2. Furthermore, values of Δt/τ with this property also result in a significantly lower simulation RMSE, as was argued when this normalization was introduced.

Furthermore, to derive the scaling of L_enc, we use known properties of Lipschitz constants:
• The sum property: c(x) = a(x) + b(x) has a Lipschitz constant of L_c = L_a + L_b.
• The multiplication property: c(x) = a(x)b(x) has a Lipschitz constant of L_c = M_a L_b + M_b L_a, where M_a is the maximal value of a on a closed set of inputs x ∈ X and M_b is defined similarly.
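These two properties can be spot-checked numerically on simple scalar functions; the choices a(x) = sin(x) and b(x) = x on X = [−π, π] below are illustrative.

```python
import numpy as np

# For a(x) = sin(x): L_a = 1, M_a = 1 on X = [-pi, pi].
# For b(x) = x:      L_b = 1, M_b = pi.
L_a, M_a = 1.0, 1.0
L_b, M_b = 1.0, np.pi

xs = np.linspace(-np.pi, np.pi, 2001)

def empirical_lipschitz(vals):
    # Largest secant slope over the grid; never exceeds the true constant.
    return np.max(np.abs(np.diff(vals)) / np.diff(xs))

L_sum  = empirical_lipschitz(np.sin(xs) + xs)   # bounded by L_a + L_b = 2
L_prod = empirical_lipschitz(np.sin(xs) * xs)   # bounded by M_a*L_b + M_b*L_a = 1 + pi

print(L_sum, L_prod)
```

Since the empirical estimate is a maximum of secant slopes, the mean value theorem guarantees it never exceeds the true Lipschitz constant, so both bounds hold on the grid.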

The resulting simulation RMSE on the test set for the EMPS benchmark, compared with results from the literature.

and is available for download at https://www.nonlinearbenchmark.org/benchmarks/emps.
• Code: Both the implementation and the experiments of CT SUBNET and neural ODE are available at https://github.com/GerbenBeintema/CT-subnet.
• Hardware: It takes about 15 minutes to estimate a single CT SUBNET model and 2 hours for a single neural ODE model on a consumer laptop. A notable exception is CT SUBNET for EMPS, which took about 10 hours due to the increased size and difficulty of the dataset.


which shows that L̇_x(t) is continuous in t and that L_x(t) has the closed-form solution

L_x(t) = sinh(L_f t/τ + arcsinh(L_ψ)). (21)

Now, by substituting Eq. (21) into Eqs. (16), (15) and (14), and using Eq. (13), we arrive at an expression for L_enc which, in the limit of large T, scales as

L_enc ∼ exp(2TΔt L_f/τ),

since L_f > 0 and Δt > 0. Note that the factor 2 in the exponent stems from the multiplication with M_k, which scales in the same way as the ŷ term, as previously mentioned. Furthermore, this bound cannot be lowered, since already for the linear system ẋ(t) = x(t) L_f/τ the scaling L_enc ∼ exp(2TΔt L_f/τ) is obtained.
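One closed-form solution consistent with L̇_x(t) = √(1 + L_x(t)²) L_f/τ and L_x(0) = L_ψ is L_x(t) = sinh(L_f t/τ + arcsinh(L_ψ)). The following minimal sketch (with illustrative constants, not values from the paper) verifies this and the resulting exponential growth:

```python
import numpy as np

# Illustrative constants.
L_f, tau, L_psi = 1.0, 0.5, 0.3

def L_x(t):
    # Candidate closed form for dL/dt = (L_f/tau)*sqrt(1 + L^2), L(0) = L_psi.
    return np.sinh((L_f / tau) * t + np.arcsinh(L_psi))

# Check the initial condition.
print(L_x(0.0))  # equals L_psi

# Check the ODE by central finite differences at a few points.
for t in [0.1, 0.5, 1.0, 2.0]:
    eps = 1e-6
    dLdt = (L_x(t + eps) - L_x(t - eps)) / (2 * eps)
    rhs = (L_f / tau) * np.sqrt(1 + L_x(t)**2)
    print(t, dLdt - rhs)  # near zero

# For large t, L_x(t)/exp((L_f/tau)*t) settles at (L_psi + sqrt(1 + L_psi^2))/2,
# confirming exp((L_f/tau)*t) growth of the state-Lipschitz bound.
print(L_x(5.0) / np.exp((L_f / tau) * 5.0))
```

The exponential growth of L_x(t) is what, after the multiplication with M_k, produces the exp(2TΔt L_f/τ) scaling of L_enc.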

8.2. RECONSTRUCTABILITY OF THE INITIAL STATE FROM PAST INPUTS AND OUTPUTS

To derive the conditions for the existence of the encoder function, suppose that we have a system given by ẋ(t) = f(x(t), u(t)), y_n = h(x_n), where x_n = x(nΔt). For this system, if a state x(t_0) is given along with the input trajectory u(t), one would in principle be able to compute x(t) for all t > t_0. However, since we aim to construct the state from past outputs, we need x(t) for t < t_0, which requires backward-in-time integration. This backward integration of f is guaranteed to be unique if f is Lipschitz continuous in x for all u, by the Picard–Lindelöf theorem (Murray & Miller, 2013). Hence, since f is assumed to be Lipschitz continuous, we can construct an operator f_d which can integrate the state backwards or forwards over one sample period, where u is subject to a zero-order hold (ZOH). This operator allows us to express the past outputs in terms of the state x_n, with the stacked past inputs and outputs defined similarly. To construct the initial state x_n, we need to invert Eq. (26). This inverse is also known as a reconstructability map (Katayama, 2005). For this inverse to exist, several necessary requirements can be given. One such requirement is that a small perturbation of a solution x_n should change (H ∘ F − z_d); otherwise, these solutions are indistinguishable from the output. This is formalized by stating that the corresponding matrix has a trivial null space, which is also known as the local observability condition. This is the same as the condition that the matrix has full column rank, i.e., column rank equal to n_x. A necessary requirement for this column-rank condition is that the number of rows is equal to or greater than the number of columns, i.e., z n_y ≥ n_x. Hence, under the right conditions, it may be possible to solve Eq. (26) for a single x_n, since this equation is a nonlinear fixed-point problem if W − z_n is known.
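The local observability / column-rank condition z n_y ≥ n_x can be illustrated on a toy discrete-time linear system, for which the Jacobian of the stacked outputs with respect to x_n reduces to the classical observability matrix. The system matrices below are hypothetical.

```python
import numpy as np

# Toy DT linear system x_{k+1} = A x_k, y_k = C x_k with n_x = 2, n_y = 1.
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])

def stacked_jacobian(z):
    """Jacobian of the z stacked outputs [y_n; ...; y_{n+z-1}] w.r.t. x_n.
    For a linear system this is the observability matrix [C; CA; CA^2; ...]."""
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(z)])

# z = 1: z*n_y = 1 < n_x = 2 -> rank deficient, the state is not reconstructable.
print(np.linalg.matrix_rank(stacked_jacobian(1)))  # 1

# z = 2: z*n_y = 2 >= n_x -> full column rank, trivial null space.
print(np.linalg.matrix_rank(stacked_jacobian(2)))  # 2
```

With too few stacked outputs the Jacobian necessarily has a nontrivial null space, matching the requirement z n_y ≥ n_x above.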
For the case that W − z_n is unknown, one can estimate the state x̂_n ≈ x_n by solving the nonlinear regression problem of Eq. (27). Hence, both f being uniformly Lipschitz continuous and (∇_x L)^T ∇_x L being full rank at x̂_n are necessary conditions for the existence of a unique reconstructability map. Computing the reconstructability map for our model thus requires solving an optimization problem, which becomes computationally infeasible during training. Hence, the encoder function aims to approximate the solution of Problem (27).
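For intuition, in the linear case this nonlinear regression reduces to an ordinary least-squares problem: the stacked output map is linear in x_n. A minimal sketch with a hypothetical toy system:

```python
import numpy as np

# Toy linear system x_{k+1} = A x_k, y_k = C x_k; recovering x_n from stacked
# outputs amounts to  x_hat = argmin_x sum_k ||y_k - C A^k x||^2,
# an ordinary least-squares problem.
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])

x_true = np.array([1.5, -0.7])
z = 4                               # number of stacked outputs, z*n_y >= n_x
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(z)])
y = O @ x_true                      # noiseless stacked measurements

x_hat, *_ = np.linalg.lstsq(O, y, rcond=None)
print(x_hat)  # recovers x_true when O has full column rank
```

In the nonlinear case of Eq. (27) the same role is played by an iterative solver, which is exactly the optimization the encoder network is trained to approximate in a single forward pass.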

