IDENTIFYING NONLINEAR DYNAMICAL SYSTEMS WITH MULTIPLE TIME SCALES AND LONG-RANGE DEPENDENCIES

Abstract

A main theoretical interest in biology and physics is to identify the nonlinear dynamical system (DS) that generated observed time series. Recurrent Neural Networks (RNNs) are, in principle, powerful enough to approximate any underlying DS, but in their vanilla form suffer from the exploding vs. vanishing gradients problem. Previous attempts to alleviate this problem resulted either in more complicated, mathematically less tractable RNN architectures, or strongly limited the dynamical expressiveness of the RNN. Here we address this issue by suggesting a simple regularization scheme for vanilla RNNs with ReLU activation which enables them to solve long-range dependency problems and express slow time scales, while retaining a simple mathematical structure which makes their DS properties partly analytically accessible. We prove two theorems that establish a tight connection between the regularized RNN dynamics and its gradients, illustrate on DS benchmarks that our regularization approach strongly eases the reconstruction of DS which harbor widely differing time scales, and show that our method is also on par with other long-range architectures like LSTMs on several tasks.

1. INTRODUCTION

Theories in the natural sciences are often formulated in terms of sets of stochastic differential or difference equations, i.e. as stochastic dynamical systems (DS). Such systems exhibit a range of common phenomena, like (limit) cycles, chaotic attractors, or specific bifurcations, which are the subject of nonlinear dynamical systems theory (DST; Strogatz (2015); Ott (2002)). A long-standing desire is to retrieve the generating dynamical equations directly from observed time series data (Kantz & Schreiber, 2004), and thus to 'automatize' the laborious process of scientific theory building to some degree. A variety of machine and deep learning methodologies toward this goal have been introduced in recent years (Chen et al., 2017; Champion et al., 2019; Ayed et al., 2019; Koppe et al., 2019; Hamilton et al., 2017; Razaghi & Paninski, 2019; Hernandez et al., 2020). Often these are based on sufficiently expressive series expansions for approximating the unknown system of generative equations, such as polynomial basis expansions (Brunton et al., 2016; Champion et al., 2019) or recurrent neural networks (RNNs) (Vlachas et al., 2018; Hernandez et al., 2020; Durstewitz, 2017; Koppe et al., 2019). Formally, RNNs are (usually discrete-time) nonlinear DS that are dynamically universal in the sense that they can approximate to arbitrary precision the flow field of any other DS on compact sets of the real space (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Hanson & Raginsky, 2020). Hence, RNNs seem like a good choice for reconstructing, in this sense of dynamically equivalent behavior, the set of governing equations underlying real time series data.
However, RNNs in their vanilla form suffer from the 'vanishing or exploding gradients' problem (Hochreiter & Schmidhuber, 1997; Bengio et al., 1994): During training, error gradients tend to either exponentially explode or decay away across successive time steps, and hence vanilla RNNs face severe problems in capturing long time scales or long-range dependencies in the data. Specially designed RNN architectures equipped with gating mechanisms and linear memory cells have been proposed for mitigating this issue (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). However, from a DST perspective, simpler models that can be more easily analyzed and interpreted in DS terms (Monfared & Durstewitz, 2020a; b), and for which more efficient inference algorithms exist that emphasize approximation of the true underlying DS (Koppe et al., 2019; Hernandez et al., 2020; Zhao & Park, 2020), would be preferable. More recent solutions to the vanishing vs. exploding gradient problem attempt to retain the simplicity of vanilla RNNs by initializing or constraining the recurrent weight matrix to be the identity (Le et al., 2015), orthogonal (Henaff et al., 2016; Helfrich et al., 2018) or unitary (Arjovsky et al., 2016). While merely initialization-based solutions may be unstable and quickly dissolve during training, orthogonal or unitary constraints, on the other hand, are too restrictive for reconstructing DS, and more generally from a computational perspective as well (Kerg et al., 2019): For instance, neither chaotic behavior (which requires diverging directions) nor multi-stability, that is the coexistence of several distinct attractors, are possible.
Here we therefore suggest a different solution to the problem which takes inspiration from computational neuroscience: Supported by experimental evidence (Daie et al., 2015; Brody et al., 2003), line or plane attractors have been suggested as a dynamical mechanism for maintaining arbitrary information in working memory (Seung, 1996; Machens et al., 2005), a goal-related active form of short-term memory. A line or plane attractor is a continuous set of marginally stable fixed points to which the system's state converges from some neighborhood, while along the line itself there is neither convergence nor divergence (Fig. 1A). Hence, a line attractor will perform a perfect integration of inputs and retain updated states indefinitely, while a slightly detuned line attractor will equip the system with arbitrarily slow time constants (Fig. 1B). This latter configuration has been suggested as a dynamical basis for neural interval timing (Durstewitz, 2003; 2004). The present idea is to exploit this dynamical setup for long short-term memory and arbitrarily slow time scales by forcing part of the RNN's subspace toward a plane (line) attractor configuration through specifically designed regularization terms. Specifically, our goal here is not so much to beat the state of the art on long short-term memory tasks, but rather to address the exploding vs. vanishing gradient problem within a simple, dynamically tractable RNN, optimized for DS reconstruction and interpretation. For this we build on piecewise-linear RNNs (PLRNNs) (Koppe et al., 2019; Monfared & Durstewitz, 2020b) which employ ReLU activation functions. PLRNNs have a simple mathematical structure (see eq. 1) which makes them dynamically interpretable in the sense that many geometric properties of the system's state space can in principle be computed analytically, including fixed points, cycles, and their stability (Suppl. 6.1.2; Koppe et al. (2019); Monfared & Durstewitz (2020a)), i.e.
do not require numerical techniques (Sussillo & Barak, 2013). Moreover, PLRNNs constitute a type of piecewise linear (PWL) map for which many important bifurcations have been comparatively well characterized (Monfared & Durstewitz, 2020a; Avrutin et al., 2019). PLRNNs can furthermore be translated into equivalent continuous-time ordinary differential equation (ODE) systems (Monfared & Durstewitz, 2020b), which comes with further advantages for analysis, e.g. continuous flow fields (Fig. 1A, B). We retain the PLRNN's structural simplicity and analytical tractability while mitigating the exploding vs. vanishing gradient problem by adding special regularization terms for a subset of PLRNN units to the loss function. These terms are designed to push the system toward line attractor configurations, without strictly enforcing them, along some, but not all, directions in state space. We further establish a tight mathematical relationship between the PLRNN dynamics and the behavior of its gradients during training. Finally, we demonstrate that our approach outperforms LSTM and other, initialization-based, methods on a number of 'classical' machine learning benchmarks (Hochreiter & Schmidhuber, 1997). Much more importantly in the present DST context, we demonstrate that our new regularization-supported inference efficiently captures all relevant time scales when reconstructing challenging nonlinear DS with multiple short- and long-range phenomena.

2. RELATED WORK

Dynamical systems reconstruction. From a natural science perspective, the goal of reconstructing or identifying the underlying DS is substantially more ambitious than (and different from) building a system that 'merely' yields good ahead predictions: In DS identification we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties (see section 3.5, Fig. S2; Kantz & Schreiber (2004)).

Figure 1: A)-B) Illustration of the state space of a 2-unit RNN with flow field (grey) and nullclines (sets of points at which the flow of one of the variables vanishes, in blue and red). Insets: time graphs of z_1 for T = 30000. A) Perfect line attractor. The flow converges to the line attractor, thus retaining states indefinitely in the absence of perturbations, as illustrated for 3 example trajectories (green). B) Slightly detuned line attractor. The system's state still converges toward the "attractor ghost", but then very slowly crawls up within the 'attractor tunnel' (green trajectory) until it hits the stable fixed point at the intersection of the nullclines. Within the tunnel, flow velocity is smoothly regulated by the gap between the nullclines, thus enabling arbitrary time constants. C) Simple 2-unit solution to the addition problem exploiting the line attractor properties of ReLUs, with inputs s_1 ∈ [0, 1] ('add') and s_2 ∈ {0, 1} ('idx') and parameters A_{1,1} = 1, B_{1,1} = 1, W_{1,2} = 1, C_{2,1} = C_{2,2} = 1, h_2 = -1. The output unit serves as a perfect integrator (see Suppl. 6.1.1 for complete parameters).

Consider, for instance, the recent work on variational inference of DS (Duncker et al., 2019; Zhao & Park, 2020; Hernandez et al., 2020). Although this enables insight into the dynamics along the empirically observed trajectories, both posterior inference and good ahead predictions do not per se guarantee that the inferred models can generate the underlying attractor geometries on their own (see Fig. S2, Koppe et al. (2019)). In contrast, if fully generative reconstruction of the underlying DS in this latter sense were achieved, formal analysis or simulation of the resulting RNN equations could provide a much deeper understanding of the dynamical mechanisms underlying empirical observations (Fig. 1C). Some approaches geared toward this latter goal of full DS reconstruction make specific structural assumptions about the form of the DS equations ('white box approach'; Meeds et al. (2019); Raissi (2018); Gorbach et al. (2017)), e.g. based on physical or biological domain knowledge, and focus on estimating the system's latent states and parameters, rather than approximating an unknown DS based on the observed time series information alone ('black box approach'). Others (Trischler & D'Eleuterio, 2016; Brunton et al., 2016; Champion et al., 2019) attempt to approximate the flow field, obtained e.g. by numerical differentiation, directly through basis expansions or neural networks. However, numerical derivatives are problematic due to their high variance and other numerical issues (Raissi, 2018; Baydin et al., 2018; Chen et al., 2017). Another factor to consider is that in many biological systems, like the brain, the intrinsic dynamics are highly stochastic with many noise sources, such as probabilistic synaptic release (Stevens, 2003). Models that do not explicitly account for dynamical process noise (Ayed et al., 2019; Champion et al., 2019; Rudy et al., 2019) are therefore less suited and more vulnerable to model misspecification. Finally, some fully probabilistic models for DS reconstruction based on GRU (Fraccaro et al., 2016), LSTM (Zheng et al., 2017; Vlachas et al., 2018), or radial basis function (Zhao & Park, 2020) networks are not easily interpretable or amenable to DS analysis in the sense defined in sect. 3.3. Most importantly, none of these previous approaches consider the long-range dependency problem within more easily tractable RNNs for DS.
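The 2-unit solution of Fig. 1C can be simulated in a few lines. The parameters are those given in the figure (A_{1,1} = 1, B_{1,1} = 1, W_{1,2} = 1, C_{2,1} = C_{2,2} = 1, h_2 = -1, all other entries zero); the function name and loop layout are ours:

```python
def run_addition(s1, s2):
    """Simulate the 2-unit ReLU network of Fig. 1C on the addition problem.
    s1: stream of values in [0, 1]; s2: 0/1 indicator marking the values to add."""
    z1, z2 = 0.0, 0.0
    for a, idx in zip(s1, s2):
        z1 = z1 + max(0.0, z2)   # z1_t = A_11 z1_{t-1} + W_12 relu(z2_{t-1}): perfect integrator
        z2 = a + idx - 1.0       # z2_t = C_21 s1_t + C_22 s2_t + h_2
    z1 = z1 + max(0.0, z2)       # integrate the final step
    return z1                    # output x = B_11 * z1

# Gating works because s1 <= 1: if the indicator is 1, z2 = s1 >= 0 passes
# through the ReLU; if it is 0, z2 = s1 - 1 <= 0 is cut off.
print(run_addition([0.3, 0.7, 0.2, 0.9], [1, 0, 0, 1]))  # ~1.2 = 0.3 + 0.9
```

The first unit realizes exactly the line attractor integration described above: with A_{1,1} = 1 and no self-decay, its state is preserved indefinitely between marked inputs.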
Long-range dependency problems in RNNs. Error gradients in vanilla RNNs tend to either explode or vanish due to the large product of derivative terms that results from recursive application of the chain rule over time steps (Hochreiter, 1991; Bengio et al., 1994; Hochreiter & Schmidhuber, 1997). To address this issue, RNNs with gated memory cells (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) have been specifically designed, but their more complicated mathematical structure makes them less amenable to a systematic DS analysis. Even simple objects like fixed points of these systems have to be found by numerical techniques (Sussillo & Barak, 2013; Jordan et al., 2019). Thus, approaches which retain the simplicity of vanilla RNNs while solving the exploding vs. vanishing gradients problem would be desirable. Recently, Le et al. (2015) observed that initialization of the recurrent weight matrix W to the identity in ReLU-based RNNs may yield performance on par with LSTMs on standard machine learning benchmarks. Talathi & Vartak (2016) expanded on this idea by initializing the recurrence matrix such that its largest absolute eigenvalue is 1. Later work enforced orthogonal (Henaff et al., 2016; Helfrich et al., 2018; Jing et al., 2019) or unitary (Arjovsky et al., 2016) constraints on the recurrent weight matrix during training. While this appears to yield long-term memory performance sometimes superior to that of LSTMs (but see Henaff et al. (2016)), these networks are limited in their computational power (Kerg et al., 2019). This may be a consequence of the fact that RNNs with an orthogonal recurrence matrix are quite restricted in the range of dynamical phenomena they can produce; e.g., chaotic attractors are not possible since (locally) diverging eigen-directions are disabled.
Our approach therefore is to establish line/plane attractors only along some but not all directions in state space, and to only push the RNN toward these configurations but not strictly enforce them, such that convergence or (local) divergence of RNN dynamics is still possible. We furthermore implement these concepts through regularization terms in the loss functions, rather than through mere initialization. This way plane attractors are encouraged throughout training without fading away.

3.1. BASIC MODEL FORMULATION

Assume we are given two multivariate time series S = {s_t} and X = {x_t}, one of which we will denote as 'inputs' (S) and the other as 'outputs' (X). In the 'classical' (supervised) machine learning setting, we usually wish to map S on X through an RNN with latent state equation z_t = F_θ(z_{t-1}, s_t) and outputs x_t ∼ p_λ(x_t | z_t), as for instance in the 'addition problem' (Hochreiter & Schmidhuber, 1997). In DS reconstruction, in contrast, we usually have a dense time series X from which we wish to infer (unsupervised) the underlying DS, where S may provide an additional forcing function or sparse experimental inputs or perturbations. While our focus in this paper is on the latter task, DS reconstruction, we will demonstrate that our approach brings benefits in both settings. Here we consider for the latent model a PLRNN (Koppe et al., 2019) which takes the form

z_t = A z_{t-1} + W φ(z_{t-1}) + C s_t + h + ε_t,  ε_t ∼ N(0, Σ),   (1)

where z_t ∈ R^{M×1} is the hidden state (column) vector of dimension M, A ∈ R^{M×M} a diagonal and W ∈ R^{M×M} an off-diagonal matrix, s_t ∈ R^{K×1} the external input of dimension K, C ∈ R^{M×K} the input mapping, h ∈ R^{M×1} a bias, and ε_t a Gaussian noise term with diagonal covariance matrix diag(Σ) ∈ R^M_+. The nonlinearity φ(z) is a ReLU, φ(z)_i = max(0, z_i), i ∈ {1, . . . , M}. This specific formulation represents a discrete-time version of firing rate (population) models as used in computational neuroscience (Song et al., 2016; Durstewitz, 2017; Engelken et al., 2020).
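A minimal sketch of one latent step of eq. 1 in NumPy (the helper name is ours; process noise is optional):

```python
import numpy as np

def plrnn_step(z, s, A, W, C, h, Sigma=None, rng=None):
    """One step of the PLRNN latent dynamics (eq. 1):
    z_t = A z_{t-1} + W relu(z_{t-1}) + C s_t + h + eps_t."""
    z_new = A @ z + W @ np.maximum(z, 0.0) + C @ s + h
    if Sigma is not None:                     # optional diagonal process noise
        rng = rng or np.random.default_rng()
        z_new = z_new + rng.normal(0.0, np.sqrt(Sigma))
    return z_new

# A diagonal, W off-diagonal, as required by the model formulation.
M, K = 3, 1
A = np.diag([0.9, 0.5, 0.5])
W = np.array([[0.0, 0.3, 0.0],
              [0.2, 0.0, 0.1],
              [0.0, 0.4, 0.0]])
C = np.ones((M, K)); h = np.zeros(M)
z = plrnn_step(np.array([1.0, -1.0, 0.5]), np.array([0.0]), A, W, C, h)
```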
We will assume that the latent RNN states z_t are coupled to the actual observations x_t through a simple observation model of the form

x_t = B g(z_t) + η_t,  η_t ∼ N(0, Γ),   (2)

in the case of observations x_t ∈ R^{N×1}, where B ∈ R^{N×M} is a factor loading matrix, g some (usually monotonic) nonlinear transfer function (e.g., ReLU), and diag(Γ) ∈ R^N_+ the diagonal covariance matrix of the Gaussian observation noise; or through a softmax function in the case of categorical observations x_{i,t} ∈ {0, 1} (see Suppl. 6.1.7 for details).
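The Gaussian observation model eq. 2 amounts to a single affine readout of the (transformed) latent states; a sketch with our own naming:

```python
import numpy as np

def observe(z, B, Gamma=None, g=lambda z: np.maximum(z, 0.0), rng=None):
    """Gaussian observation model (eq. 2): x_t = B g(z_t) + eta_t.
    Default transfer g is the ReLU; pass g=lambda z: z for an identity readout."""
    x = B @ g(z)
    if Gamma is not None:                     # optional diagonal observation noise
        rng = rng or np.random.default_rng()
        x = x + rng.normal(0.0, np.sqrt(Gamma))
    return x

B = np.array([[1.0, 0.0],
              [0.0, 2.0]])
x = observe(np.array([1.0, -1.0]), B)   # relu(z) = [1, 0]  ->  x = [1, 0]
```

For the supervised benchmarks in sect. 3.4, g is the identity and Γ = I_N.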

3.2. REGULARIZATION APPROACH

First note that by letting A = I, W = 0, and h = 0 in eq. 1, every point in z space becomes a marginally stable fixed point of the system, leading it to perform a perfect integration of external inputs as in parametric working memory (Machens et al., 2005; Brody et al., 2003). This is similar in spirit to Le et al. (2015), who initialized RNN parameters such that the network performs an identity mapping for z_{i,t} ≥ 0. However, here 1) we use a neuroscientifically motivated network architecture (eq. 1) that enables the identity mapping across the variables' entire support, z_{i,t} ∈ (-∞, +∞), which we conjecture will be advantageous for establishing long short-term memory properties, 2) we encourage this mapping only for a subset M_reg ≤ M of units (Fig. S1), leaving the others free to perform arbitrary computations, and 3) we stabilize this configuration throughout training by introducing a specific L2 regularization for the parameters A, W, and h in eq. 1. When embedded into a larger, (locally) convergent system, we will call this configuration more generally a manifold attractor. In this way, we divide the units into two types: the regularized units serve as a memory that tends to decay very slowly (depending on the size of the regularization term), while the remaining units retain the flexibility to approximate any underlying DS, preserving the simplicity of the original PLRNN (eq. 1). Specifically, the following penalty is added to the loss function (Fig. S1):

L_reg = τ_A Σ_{i=1}^{M_reg} (A_{i,i} - 1)^2 + τ_W Σ_{i=1}^{M_reg} Σ_{j=1, j≠i}^{M} W_{i,j}^2 + τ_h Σ_{i=1}^{M_reg} h_i^2   (3)

(Recall from sect. 3.1 that A is a diagonal and W an off-diagonal matrix.) While this formulation allows us to trade off, for instance, the tendency toward a manifold attractor (A → I, h → 0) against the sensitivity to other units' inputs (W → 0), for all experiments performed here a common value, τ_A = τ_W = τ_h = τ, was assumed for the three regularization factors. We will refer to (z_1 . . . z_{M_reg}) as the regularized ('memory') subsystem, and to (z_{M_reg+1} . . . z_M) as the non-regularized ('computational') subsystem. Note that in the limit τ → ∞ exact manifold attractors would be enforced.
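A direct transcription of penalty eq. 3 might look as follows (the helper name and layout are ours; a common factor τ is used, as in all experiments of the paper):

```python
import numpy as np

def reg_penalty(A, W, h, M_reg, tau):
    """L2-type penalty of eq. 3, pushing the first M_reg units toward a
    manifold attractor configuration: A_ii -> 1, off-diagonal W rows -> 0, h_i -> 0."""
    d = np.diag(A)[:M_reg]                    # diagonal of A for regularized units
    W_rows = W[:M_reg].copy()                 # rows of W for regularized units
    np.fill_diagonal(W_rows[:, :M_reg], 0.0)  # exclude j = i (W is off-diagonal anyway)
    return tau * (np.sum((d - 1.0) ** 2)
                  + np.sum(W_rows ** 2)
                  + np.sum(h[:M_reg] ** 2))
```

During training this term is simply added to the task loss, so gradient descent trades task error against attractor tuning of the memory subsystem.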

3.3. THEORETICAL ANALYSIS

We will now establish a tight connection between the PLRNN dynamics and its error gradients. Similar ideas appeared in Chang et al. (2019), but these authors focused only on fixed point dynamics, while here we will consider the more general case including cycles of any order. First, note that by interpretability of model eq. 1 we mean that it is easily amenable to a rigorous DS analysis: As shown in Suppl. 6.1.2, we can explicitly determine all the system's fixed points and cycles and their stability. Moreover, as shown in Monfared & Durstewitz (2020b), we can, under certain conditions, transform the PLRNN into an equivalent continuous-time (ODE) piecewise-linear system, which brings further advantages for DS analysis. Let us rewrite eq. 1 in the form

z_t = F(z_{t-1}) = (A + W D_{Ω(t-1)}) z_{t-1} + h =: W_{Ω(t-1)} z_{t-1} + h,   (4)

where D_{Ω(t-1)} is the diagonal matrix of outer derivatives of the ReLU function evaluated at z_{t-1} (see Suppl. 6.1.2), and we ignore external inputs and noise terms for now. Starting from some initial condition z_1, we can recursively develop z_T as (see Suppl. 6.1.2 for more details)

z_T = F^{T-1}(z_1) = (∏_{i=1}^{T-1} W_{Ω(T-i)}) z_1 + (Σ_{j=2}^{T-1} ∏_{i=1}^{j-1} W_{Ω(T-i)} + I) h.   (5)

Likewise, for some common loss function L(A, W, h) = Σ_{t=2}^{T} L_t, we can recursively develop the derivatives w.r.t. the weights w_mk (and similarly for the components of A and h) as ∂L/∂w_mk = Σ_{t=2}^{T} (∂L_t/∂z_t)(∂z_t/∂w_mk), with

∂z_t/∂w_mk = 1_(m,k) D_{Ω(t-1)} z_{t-1} + Σ_{j=2}^{t-2} (∏_{i=1}^{j-1} W_{Ω(t-i)}) 1_(m,k) D_{Ω(t-j)} z_{t-j} + (∏_{i=1}^{t-2} W_{Ω(t-i)}) ∂z_2/∂w_mk,   (6)

where 1_(m,k) is an M × M indicator matrix with a 1 for the (m,k)'th entry and 0 everywhere else. Observing that eqs. 5 and 6 contain similar product terms which determine the system's long-term behavior, our first theorem links the PLRNN dynamics to its total error gradients:

Theorem 1. Consider a PLRNN given by eq. 4, and assume that it converges to a stable fixed point, say z_{t*_1} := z*_1, or a k-cycle (k > 1) with the periodic points {z_{t*_k}, z_{t*_k - 1}, . . . , z_{t*_k - (k-1)}}, for T → ∞. Suppose that, for k ≥ 1 and i ∈ {0, 1, . . . , k-1}, σ_max(W_{Ω(t*_k - i)}) = ||W_{Ω(t*_k - i)}||_2 < 1, where W_{Ω(t*_k - i)} denotes the Jacobian of the system at z_{t*_k - i} and σ_max indicates the largest singular value of a matrix. Then the 2-norms of the tensors collecting all derivatives, ||∂z_T/∂W||_2, ||∂z_T/∂A||_2, ||∂z_T/∂h||_2, will be bounded from above, i.e. will not diverge for T → ∞.

Proof. See Suppl. sect. 6.1 (subsection 6.1.3).

While Theorem 1 is a general statement about PLRNN dynamics and total gradients, our next theorem more specifically provides conditions under which Jacobians linking temporally distant states z_T and z_t, T ≫ t, will neither vanish nor explode in the regularized PLRNN:

Theorem 2. Assume a PLRNN with matrix A + W partitioned as in Fig. S1, i.e. with the first M_reg rows corresponding to those of an M × M identity matrix. Suppose that the non-regularized subsystem (z_{M_reg+1} . . . z_M), if considered in isolation, satisfies Theorem 1, i.e. converges to a k-cycle with k ≥ 1. Then, for the full system (z_1 . . . z_M), the 2-norm of the Jacobians connecting temporally distal states z_T and z_t will be bounded from above and below for all T > t, i.e.

∞ > ρ_up ≥ ||∂z_T/∂z_t||_2 = ||∏_{t<k≤T} W_{Ω(k)}||_2 ≥ ρ_low > 0.

In particular, for state variables z_{i,T} and z_{j,t} such that i ∈ {M_reg+1, . . . , M} and j ∈ {1, . . . , M_reg}, i.e. that connect states from the 'memory' to those of the 'computational' subsystem, one also has ∞ > λ_up ≥ |∂z_{i,T}/∂z_{j,t}| ≥ λ_low > 0 as T - t → ∞, i.e. these derivatives will never vanish nor explode.

Proof. See Suppl. sect. 6.1 (subsection 6.1.4).

The bounds ρ_up, ρ_low, λ_up, λ_low are given in Suppl. sect. 6.1.4. We remark that when the regularization conditions are not exactly met, i.e.
when parameters A and W slightly deviate from those in Fig. S1 , memory (and gradients) may ultimately dissipate, but only very slowly, as actually required for temporal processes with very slow yet not infinite time constants (Fig. 1B ).
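The setting of Theorem 2 can be illustrated numerically: with the regularized rows of A + W fixed to identity rows and a contractive non-regularized subsystem, the norm of the product of Jacobians W_Ω(t) stays bounded away from zero and infinity over arbitrarily many time steps. The sketch below uses hypothetical random parameters and random ReLU on/off patterns D in place of the actually visited regions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, M_reg = 6, 3

# Regularized rows of A + W are rows of the identity (Fig. S1 partition);
# the non-regularized subsystem is made contractive (spectral norm < 1).
A = np.diag(np.concatenate([np.ones(M_reg), 0.5 * np.ones(M - M_reg)]))
W = 0.1 * rng.standard_normal((M, M))
np.fill_diagonal(W, 0.0)        # W is off-diagonal
W[:M_reg] = 0.0                 # regularized rows receive no recurrent input

# Product of Jacobians W_Omega(t) = A + W D_Omega(t) over many time steps,
# with random 0/1 diagonal matrices D standing in for the visited regions.
J = np.eye(M)
for _ in range(200):
    D = np.diag(rng.integers(0, 2, size=M).astype(float))
    J = (A + W @ D) @ J

sigma = np.linalg.norm(J, 2)    # spectral norm of the long Jacobian product
print(sigma)                    # bounded; >= 1 here, since the memory rows of
                                # every factor are identity rows
```

Because every factor carries identity rows for the memory units, the product retains those rows exactly, which is what keeps the lower bound strictly positive; the contractive computational block keeps the upper bound finite.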

3.4. TRAINING PROCEDURES

For the (supervised) machine learning problems, all networks were trained by stochastic gradient descent (SGD) to minimize the squared-error loss between estimated and actual outputs for the addition and multiplication problems, and the cross-entropy loss for sequential MNIST (see Suppl. 6.1.7). Adam (Kingma & Ba, 2014) from the PyTorch package (Paszke et al., 2017) was used as the optimizer, with a learning rate of 0.001, a gradient clipping parameter of 10, and a batch size of 500. SGD was stopped after 100 epochs and the fit with the lowest loss across all epochs was taken, except for the LSTM, which was allowed to run for up to 200 epochs as it took longer to converge (Fig. S10). For comparability, the PLRNN latent state dynamics eq. 1 was assumed to be deterministic in this setting (i.e., Σ = 0), with g(z_t) = z_t and Γ = I_N in eq. 2. For the regularized PLRNN (rPLRNN), penalty eq. 3 was added to the loss function. For the (unsupervised) DS reconstruction problems, the fully probabilistic, generative RNN eq. 1 was considered. Together with eq. 2 (where we take g(z_t) = φ(z_t)), this gives the typical form of a nonlinear state space model (Durbin & Koopman, 2012) with observation and process noise, and an Expectation-Maximization (EM) algorithm that efficiently exploits the model's piecewise linear structure (Durstewitz, 2017; Koppe et al., 2019) was used to solve for the parameters by maximum likelihood. Details are given in Suppl. 6.1.5. All code used here will be made openly available at https://github.com/DurstewitzLab/reg-PLRNN.

3.5. PERFORMANCE MEASURES

For the machine learning benchmarks we employed the same criteria as used for optimization (MSE or cross-entropy, Suppl. 6.1.7) as performance metrics, evaluated across left-out test sets. In addition, we report the relative frequency P_correct of correctly predicted trials across the test set (see Suppl. 6.1.7 for details). For DS reconstruction problems, it is not sufficient or even sensible to judge a method's ability to infer the underlying DS purely based on some form of (ahead-)prediction error like the MSE defined on the time series itself (Ch. 12 in Kantz & Schreiber (2004)). Rather, we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties. This is not automatically guaranteed for a model that yields agreeable ahead predictions on a time series (Fig. S2A; cf. Koppe et al. (2019); Wood (2010)). We therefore followed Koppe et al. (2019) and used the Kullback-Leibler divergence between true and reproduced probability distributions across states in state space to quantify how well an inferred PLRNN captured the underlying dynamics, thus assessing the agreement in attractor geometries (cf. Takens (1981); Sauer et al. (1991)) (see Suppl. 6.1.6 for more details).
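As one concrete (simplified) instance of such a measure, the state-space distributions can be estimated by binning the observed and freely generated trajectories on a common grid and computing the discrete KL divergence. The exact estimator used in the paper is given in Suppl. 6.1.6; the version below is only an illustrative sketch with our own naming:

```python
import numpy as np

def state_space_kl(x_true, x_gen, n_bins=30, eps=1e-10):
    """KL divergence between binned occupancy distributions of two trajectories
    (arrays of shape (T, N)), evaluated on a common binning grid."""
    lo = np.minimum(x_true.min(0), x_gen.min(0))
    hi = np.maximum(x_true.max(0), x_gen.max(0))
    edges = [np.linspace(l, u, n_bins + 1) for l, u in zip(lo, hi)]
    p, _ = np.histogramdd(x_true, bins=edges)
    q, _ = np.histogramdd(x_gen, bins=edges)
    p = p.ravel() / p.sum()
    q = q.ravel() / q.sum()
    mask = p > 0                  # sum only over bins visited by the ground truth
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
```

A value near 0 indicates that the freely running model visits state space with the same frequency profile as the true system; a large value signals mismatched attractor geometries even if short-term predictions look fine.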

4.1. MACHINE LEARNING BENCHMARKS

Although not our prime interest here, we first examined how the rPLRNN would fare on supervised machine learning benchmarks where inputs (S) are to be mapped onto target outputs (X) across long time spans (i.e., requiring long short-term maintenance of information), namely the addition and multiplication problems (Talathi & Vartak, 2016; Hochreiter & Schmidhuber, 1997) and sequential MNIST (LeCun et al., 2010). Details of these experimental setups are in Suppl. 6.1.7. Performance of the rPLRNN (eq. 1, eq. 3) on all 3 benchmarks was compared to several other models summarized in Suppl. Table 1. To achieve a meaningful comparison, all models had the same number M = 40 (based on Fig. S3) of hidden states (which gives LSTMs overall about 4 times as many trainable parameters). On all three problems the rPLRNN outperforms all other tested methods, including LSTM, iRNN (RNN initialized with the identity matrix as in Le et al. (2015)), and a version of the orthogonal RNN (oRNN; Vorontsov et al. (2017)); similar results were obtained for other settings of M and batch size. The LSTM performs even worse than the iRNN and iPLRNN (PLRNN initialized with the identity, like the iRNN), although it had 4 times as many parameters and was given twice as many epochs (and thus opportunities) for training, as it also took longer to converge (Fig. S10). In addition, the iPLRNN tends to perform slightly better than the iRNN on all three problems, suggesting that the specific structure eq. 1 of the PLRNN, which allows for a manifold attractor across the variables' full range, may be advantageous to begin with, while the regularization further improves performance.

4.2. NUMERICAL EXPERIMENTS ON DYNAMICAL SYSTEMS WITH DIFFERENT TIME SCALES

While it is encouraging that the rPLRNN may perform even better than several previous approaches to the vanishing vs. exploding gradients problem, our major goal here was to examine whether our regularization scheme would help with the (unsupervised) identification of DS that harbor widely different time scales. To test this, we used a biophysical, bursting cortical neuron model with one voltage (V) and two conductance recovery variables (see Durstewitz (2009)), one slow (h) and one fast (n; Suppl. 6.1.8). Reproduction of this DS is challenging since it produces very fast spikes on top of a slow nonlinear oscillation (Fig. 3D). Only short time series (as in scientific data) of length T = 1500 from this model were provided for training. rPLRNNs with M ∈ {8 . . . 18} states were trained, with the regularization factor varied within τ ∈ {0, 10^1, 10^2, 10^3, 10^4, 10^5}/T. Note that for τ = 0 (no regularization), the approach reduces to the standard PLRNN (Koppe et al., 2019). Fig. 3A confirms our intuition that stronger regularization leads to better DS reconstruction as assessed by the KL divergence between true and generated state distributions (similar results were obtained with ahead-prediction errors as a metric, Fig. S4A), accompanied by a corresponding decrease in the MSE between the power spectra of true (suppl. eq. 55) and generated (rPLRNN) voltage traces (Fig. 3B). Fig. 3D gives an example of voltage traces (V) and the slower of the two gating variables (h; see Fig. S5A for variable n) freely simulated (i.e., sampled) from the autonomously running rPLRNN. This illustrates that our model is in principle capable of capturing both the stiff spike dynamics and the slower oscillations in the second gating variable at the same time. Fig. 3C provides more insight into how the regularization worked: While the high-frequency components (> 50 Hz) related to the repetitive spiking activity hardly benefited from increasing τ, there was a strong reduction in the MSE computed on the power spectrum for the lower frequency range (≤ 50 Hz), suggesting that increased regularization helps to map slowly evolving components of the dynamics. This result is more general, as shown in Fig. S6 for another DS example. In contrast, an orthogonality (Vorontsov et al., 2017) or plain L2 constraint on the weight matrices did not help at all on this problem (Fig. S4B). Further insight into the dynamical mechanisms by which the rPLRNN solves the problem can be obtained by examining the latent dynamics: As shown in Fig. 3E (see also Fig. S5), regularized states indeed help to map the slow components of the dynamics, while non-regularized states focus on the fast spikes. These observations further corroborate the findings in Fig. 3C and Fig. S6C.
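The band-resolved power-spectrum comparison described above can be sketched as follows; the split at 50 Hz mirrors the analysis in the text, while the function name and normalization choices are our own:

```python
import numpy as np

def power_spectrum_mse(x_true, x_gen, dt, f_split=50.0):
    """MSE between normalized power spectra of two univariate traces of equal
    length, reported separately for low (<= f_split) and high frequency bands."""
    def spectrum(x):
        p = np.abs(np.fft.rfft(x - x.mean())) ** 2
        return p / p.sum()                    # normalize total power to 1
    f = np.fft.rfftfreq(len(x_true), d=dt)    # frequency axis in Hz
    pt, pg = spectrum(np.asarray(x_true)), spectrum(np.asarray(x_gen))
    low = f <= f_split
    return (np.mean((pt[low] - pg[low]) ** 2),   # slow components
            np.mean((pt[~low] - pg[~low]) ** 2)) # fast (spiking) components
```

Evaluating the low-frequency band separately makes visible exactly the effect reported in Fig. 3C: improvements from regularization concentrate in the slow part of the spectrum.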

4.3. REGULARIZATION PROPERTIES AND MANIFOLD ATTRACTORS

In Figs. 2 and 3 we demonstrated that the rPLRNN is able to solve problems and reconstruct dynamics that involve long-range dependencies. Figs. 3A,B furthermore directly confirm that solutions improve with stronger regularization, while Figs. 3C,E give insight into the mechanism by which the regularization works. To further verify empirically that our specific form of regularization, eq. 3, is important, Fig. 2 also shows results for a PLRNN with a standard L2 penalty on a fraction M_reg/M = 0.5 of its states (L2pPLRNN). Fig. S7 provides additional results for PLRNNs with an L2 penalty on all weights and for vanilla L2-regularized RNNs. All of these systems fell far behind the performance of the rPLRNN on all tasks tested. Moreover, Fig. 4 reveals that the proposed regularization indeed encourages manifold attractors, and that this is not achieved by a standard L2 regularization: in contrast to the L2PLRNN, as the regularization factor τ is increased, more and more of the maximum absolute eigenvalues around the system's fixed points (computed according to eq. 8, sect. 6.1.2) cluster on or near 1, indicating directions of marginal stability in state space. The deviations from 1 also become smaller for strongly regularized PLRNNs (Fig. 4B,D), indicating a higher precision in attractor tuning. Fig. S9 in addition confirms that rPLRNN parameters are driven increasingly toward values that support manifold attractors as the regularization becomes stronger. Fig. 3E furthermore suggests that both regularized and non-regularized states are utilized to map out the full dynamics. But how should the ratio M_reg/M be chosen in practice? While for the problems here this meta-parameter was determined through 'classical' grid search and cross-validation, Figs. S3C-E suggest that the precise setting of M_reg/M is actually not overly important: nearly optimal performance is achieved for the broader range M_reg/M ∈ [0.3, 0.6] on all problems tested. Hence, in practice, setting M_reg/M = 0.5 should mostly work fine.
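As a concrete illustration, the 'manifold attractor regularization' (eq. 3) can be read as an L2 penalty that pulls the auto-regression weights of the first M_reg states toward 1 and their coupling weights and biases toward 0. A minimal NumPy sketch of such a penalty follows; the function and argument names are our own, and the exact weighting of the three terms in eq. 3 may differ:

```python
import numpy as np

def manifold_attractor_penalty(A_diag, W, h, M_reg, tau):
    """L2 penalty pushing the first M_reg units toward a manifold-attractor
    configuration: auto-regression (diagonal) weights -> 1, coupling
    weights -> 0, biases -> 0 (cf. eq. 3 and Fig. S1).
    A_diag: (M,) diagonal of A; W: (M, M) coupling matrix; h: (M,) bias;
    tau: regularization strength."""
    reg = (np.sum((A_diag[:M_reg] - 1.0) ** 2)
           + np.sum(W[:M_reg, :] ** 2)
           + np.sum(h[:M_reg] ** 2))
    return tau * reg
```

In training, this term would simply be added to the task loss; at the perfect manifold-attractor configuration the penalty vanishes.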

5. CONCLUSIONS

In this work we introduced a simple solution to the long short-term memory problem in RNNs that retains the simplicity and tractability of PLRNNs, yet does not curtail their universal computational capabilities (Koiran et al., 1994; Siegelmann & Sontag, 1995) or their ability to approximate arbitrary DS (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Trischler & D'Eleuterio, 2016). We achieved this by adding regularization terms to the loss function that encourage the system to form a 'memory subspace' (Seung, 1996; Durstewitz, 2003) that can store arbitrary values for, if unperturbed, arbitrarily long periods. At the same time, we did not rigorously enforce this constraint, which allows the system to capture slow time scales by slightly departing from a perfect manifold attractor. In neuroscience, this has been discussed as a dynamical mechanism for regulating the speed of flow in DS and for learning arbitrary time constants not naturally included qua RNN design (Durstewitz, 2003; 2004) (Fig. 1B). While other RNN architectures, including vanilla RNNs, can in principle also develop line attractors to solve specific tasks (Maheswaranathan et al., 2019), they are generally much harder to train to achieve this and may exhibit less precise attractor tuning (cf. Fig. 4), which is needed to bridge long time scales (Durstewitz, 2003). Moreover, part of the PLRNN's latent space was not regularized at all, leaving the system enough degrees of freedom for realizing arbitrary computations or dynamics (see also Fig. S11 for a chaotic example). We showed that the rPLRNN is on par with or outperforms initialization-based approaches, orthogonal RNNs, and LSTMs on a number of classical benchmarks. More importantly, however, the regularization strongly facilitates the identification of challenging DS with widely differing time scales in PLRNN-based algorithms for DS reconstruction. Similar regularization schemes as proposed here (eq. 3) may, in principle, also be designed for other architectures, but the convenient mathematical form of the PLRNN makes their implementation particularly straightforward and powerful.

6. APPENDIX

6.1 SUPPLEMENTARY TEXT

6.1.1. Simple exact PLRNN solution for addition problem

The exact PLRNN parameter settings (cf. eq. 1, eq. 2) for solving the addition problem with 2 units (cf. Fig. 1C) are as follows:

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad W = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad h = \begin{pmatrix} 0 \\ -1 \end{pmatrix}, \quad C = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 0 \end{pmatrix}.$$

6.1.2. Computation of fixed points and cycles in PLRNN

Consider the PLRNN in the form of eq. 4. For clarity, let us define $d_{\Omega(t)} := (d_1, d_2, \ldots, d_M)$ as an indicator vector with $d_m(z_{m,t}) := d_m = 1$ for all states $z_{m,t} > 0$ and zeros otherwise, and $D_{\Omega(t)} := \mathrm{diag}(d_{\Omega(t)})$ as the diagonal matrix formed from this vector. Note that there are at most $2^M$ distinct matrices $W_{\Omega(t)}$ as defined in eq. 4, depending on the signs of the components of $z_t$. If $h = 0$ and $W_{\Omega(t)}$ is the identity matrix, then the map $F$ becomes the identity map and so every point $z$ is a fixed point of $F$. Otherwise, the fixed points of $F$ can be found by solving the equation $F(z^*_1) = z^*_1$ as

$$z^*_1 = (I - W_{\Omega(t^*_1)})^{-1} h = H^*_1 h,$$

where $z^*_1 = z_{t^*_1} = z_{t^*_1 - 1}$, provided that $\det(I - W_{\Omega(t^*_1)}) = P_{W_{\Omega(t^*_1)}}(1) \neq 0$, i.e. $W_{\Omega(t^*_1)}$ has no eigenvalue equal to 1. Stability and type of the fixed points (node, saddle, spiral) can then be determined from the eigenvalues of the Jacobian $A + W D_{\Omega(t^*_1)} = W_{\Omega(t^*_1)}$ (Strogatz, 2015).

For $k > 1$, solving $F^k(z^*_k) = z^*_k$, one can obtain a $k$-cycle of the map $F$ with the periodic points $\{z^*_k, F(z^*_k), F^2(z^*_k), \ldots, F^{k-1}(z^*_k)\}$. For this, we first compute $F^k$ as follows:

$$z_t = F(z_{t-1}) = W_{\Omega(t-1)} z_{t-1} + h,$$
$$z_{t+1} = F^2(z_{t-1}) = W_{\Omega(t)} W_{\Omega(t-1)} z_{t-1} + \left( W_{\Omega(t)} + I \right) h,$$
$$z_{t+2} = F^3(z_{t-1}) = W_{\Omega(t+1)} W_{\Omega(t)} W_{\Omega(t-1)} z_{t-1} + \left( W_{\Omega(t+1)} W_{\Omega(t)} + W_{\Omega(t+1)} + I \right) h,$$
$$\vdots$$
$$z_{t+(k-1)} = F^k(z_{t-1}) = \prod_{i=2}^{k+1} W_{\Omega(t+(k-i))} \, z_{t-1} + \left( \sum_{j=2}^{k} \prod_{i=2}^{k-j+2} W_{\Omega(t+(k-i))} + I \right) h,$$

in which $\prod_{i=2}^{k+1} W_{\Omega(t+(k-i))} = W_{\Omega(t+(k-2))} W_{\Omega(t+(k-3))} \cdots W_{\Omega(t-1)}$. Setting $t + (k-1) := t^*_k$, the $k$-cycle is given by the fixed point of the $k$-times iterated map $F^k$ as

$$z^*_k = \left( I - \prod_{i=1}^{k} W_{\Omega(t^*_k - i)} \right)^{-1} \left( \sum_{j=2}^{k} \prod_{i=1}^{k-j+1} W_{\Omega(t^*_k - i)} + I \right) h = H^*_k h,$$

where $z^*_k = z_{t^*_k} = z_{t^*_k - k}$, provided that $I - \prod_{i=1}^{k} W_{\Omega(t^*_k - i)}$ is invertible, that is, $\det\left( I - \prod_{i=1}^{k} W_{\Omega(t^*_k - i)} \right) = P_{\prod_{i=1}^{k} W_{\Omega(t^*_k - i)}}(1) \neq 0$ and $\prod_{i=1}^{k} W_{\Omega(t^*_k - i)} := W_{\Omega^*_k}$ has no eigenvalue equal to 1.

As for the fixed points, we can determine stability of the $k$-cycle from the eigenvalues of the Jacobian $\prod_{i=1}^{k} W_{\Omega(t^*_k - i)}$.

It may also be helpful to spell out the recursions in eq. 5 and eq. 6 in section 3.3 in a bit more detail. Analogously to the derivations above, for $t = 1, 2, \ldots, T$ we can recursively compute $z_2, z_3, \ldots, z_T$ ($T \in \mathbb{N}$) as

$$z_2 = F(z_1) = W_{\Omega(1)} z_1 + h,$$
$$z_3 = F^2(z_1) = W_{\Omega(2)} W_{\Omega(1)} z_1 + \left( W_{\Omega(2)} + I \right) h,$$
$$\vdots$$
$$z_T = F^{T-1}(z_1) = \prod_{i=1}^{T-1} W_{\Omega(T-i)} \, z_1 + \left( \sum_{j=2}^{T-1} \prod_{i=1}^{j-1} W_{\Omega(T-i)} + I \right) h.$$

Likewise, we can write out the derivatives, eq. 6, more explicitly as

$$\frac{\partial z_t}{\partial w_{mk}} = \mathbb{1}_{(m,k)} D_{\Omega(t-1)} z_{t-1} + \left( A + W D_{\Omega(t-1)} \right) \frac{\partial z_{t-1}}{\partial w_{mk}} = \cdots = \mathbb{1}_{(m,k)} D_{\Omega(t-1)} z_{t-1} + \sum_{j=2}^{t-2} \prod_{i=1}^{j-1} W_{\Omega(t-i)} \, \mathbb{1}_{(m,k)} D_{\Omega(t-j)} z_{t-j} + \prod_{i=1}^{t-2} W_{\Omega(t-i)} \frac{\partial z_2}{\partial w_{mk}}, \quad (12)$$

where $\frac{\partial z_2}{\partial w_{mk}} = \left( \frac{\partial z_{1,2}}{\partial w_{mk}} \cdots \frac{\partial z_{M,2}}{\partial w_{mk}} \right)^T$ with $\frac{\partial z_{l,2}}{\partial w_{mk}} = 0$ for all $l \neq m$ and $\frac{\partial z_{m,2}}{\partial w_{mk}} = d_k z_{k,1}$. The derivatives w.r.t. the elements of $A$ and $h$ can be expanded in a similar way, only that the terms $D_{\Omega(t)} z_t$ on the last line of eq. 12 need to be replaced by just $z_t$ for $\partial z_t / \partial a_{mm}$, and by a vector of 1's for $\partial z_t / \partial h_m$ (also, in these cases, the indicator matrix will be the diagonal matrix $\mathbb{1}_{(m,m)}$).
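The fixed-point computation just described can be carried out numerically by enumerating the at most $2^M$ sign configurations and checking each candidate for consistency with its region. A small illustrative sketch (not the authors' code; practical for small $M$ only, since the enumeration is exponential):

```python
import numpy as np
from itertools import product

def plrnn_fixed_points(A, W, h):
    """Enumerate candidate fixed points of z_t = A z_{t-1} + W max(z_{t-1},0) + h.
    For each of the 2^M sign patterns Omega, solve z* = (I - W_Omega)^{-1} h
    with W_Omega = A + W D_Omega, and keep z* only if its own sign pattern
    matches Omega. Returns a list of (fixed point, Jacobian eigenvalues)."""
    M = len(h)
    fps = []
    for d in product([0, 1], repeat=M):
        D = np.diag(d)
        W_om = A + W @ D
        if abs(np.linalg.det(np.eye(M) - W_om)) < 1e-12:
            continue  # W_Omega has an eigenvalue (numerically) equal to 1
        z = np.linalg.solve(np.eye(M) - W_om, h)
        if np.all((z > 0) == np.array(d, dtype=bool)):
            fps.append((z, np.linalg.eigvals(W_om)))
    return fps
```

Note that for the addition-problem parameters above, $I - W_{\Omega(t)}$ is singular in every region (an eigenvalue equal to 1), reflecting a continuum of fixed points (the manifold attractor), so the solver correctly skips those regions.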

6.1.3. Proof of Theorem 1

To state the proof, let us rewrite the derivatives of the loss function $L(W, A, h) = \sum_{t=1}^{T} L_t$ in the following tensor form:

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W}, \quad \text{where} \quad \frac{\partial L_t}{\partial W} = \frac{\partial L_t}{\partial z_t} \frac{\partial z_t}{\partial W},$$

for which the 3D tensor

$$\frac{\partial z_t}{\partial W} = \left( \frac{\partial z_{1,t}}{\partial W}, \frac{\partial z_{2,t}}{\partial W}, \ldots, \frac{\partial z_{M,t}}{\partial W} \right) \quad (14)$$

of dimension $M \times M \times M$ consists of all the gradient matrices

$$\frac{\partial z_{i,t}}{\partial W} = \left[ \frac{\partial z_{i,t}}{\partial w_{jk}} \right]_{j,k=1,\ldots,M} = \begin{pmatrix} \partial z_{i,t}/\partial w_{1*} \\ \vdots \\ \partial z_{i,t}/\partial w_{M*} \end{pmatrix}, \quad i = 1, 2, \ldots, M, \quad (15)$$

where $w_{i*} \in \mathbb{R}^M$ is a row vector. Now, suppose that $\{z_1, z_2, z_3, \ldots\}$ is an orbit of the system which converges to a stable fixed point, i.e. $\lim_{T \to \infty} z_T = z^*_1$. Then

$$\lim_{T \to \infty} z_T = \lim_{T \to \infty} \left( W_{\Omega(T-1)} z_{T-1} + h \right) = z^*_1 = W_{\Omega(t^*_1)} z^*_1 + h,$$

and so

$$\lim_{T \to \infty} W_{\Omega(T-1)} z^*_1 = W_{\Omega(t^*_1)} z^*_1. \quad (17)$$

Assume that $\lim_{T \to \infty} W_{\Omega(T-1)} = L$. Since eq. 17 holds for every $z^*_1$, substituting $z^*_1 = e_1 = (1, 0, \ldots, 0)^T$ in eq. 17 shows that the first column of $L$ equals the first column of $W_{\Omega(t^*_1)}$. Performing the same procedure for $z^*_1 = e_i$, $i = 2, 3, \ldots, M$, yields

$$\lim_{T \to \infty} W_{\Omega(T-1)} = W_{\Omega(t^*_1)}. \quad (18)$$

Also, for every $i \in \mathbb{N}$ ($1 < i < \infty$), $\lim_{T \to \infty} W_{\Omega(T-i)} = W_{\Omega(t^*_1)}$, i.e.

$$\forall \varepsilon > 0 \;\; \exists N \in \mathbb{N} \text{ s.t. } T - i \geq N \implies \left\| W_{\Omega(T-i)} - W_{\Omega(t^*_1)} \right\| \leq \varepsilon. \quad (20)$$

Thus, $\| W_{\Omega(T-i)} \| - \| W_{\Omega(t^*_1)} \| \leq \| W_{\Omega(T-i)} - W_{\Omega(t^*_1)} \|$ gives

$$\forall \varepsilon > 0 \;\; \exists N \in \mathbb{N} \text{ s.t. } T - i \geq N \implies \| W_{\Omega(T-i)} \| \leq \| W_{\Omega(t^*_1)} \| + \varepsilon. \quad (21)$$

Since $T - 1 > T - 2 > \cdots > T - i \geq N$,

$$\forall \varepsilon > 0: \quad \| W_{\Omega(T-i)} \| \leq \| W_{\Omega(t^*_1)} \| + \varepsilon, \quad i = 1, 2, \ldots, T - N. \quad (22)$$

Hence

$$\forall \varepsilon > 0: \quad \left\| \prod_{i=1}^{T-N} W_{\Omega(T-i)} \right\| \leq \prod_{i=1}^{T-N} \left\| W_{\Omega(T-i)} \right\| \leq \left( \| W_{\Omega(t^*_1)} \| + \varepsilon \right)^{T-N}. \quad (23)$$

If $\| W_{\Omega(t^*_1)} \| < 1$, then, choosing $\bar{\varepsilon}$ such that $\| W_{\Omega(t^*_1)} \| + \bar{\varepsilon} < 1$, it is concluded that

$$\lim_{T \to \infty} \left\| \prod_{i=1}^{T-N} W_{\Omega(T-i)} \right\| \leq \lim_{T \to \infty} \left( \| W_{\Omega(t^*_1)} \| + \bar{\varepsilon} \right)^{T-N} = 0. \quad (24)$$

Therefore

$$\lim_{T \to \infty} \prod_{i=1}^{T-1} W_{\Omega(T-i)} = 0. \quad (25)$$

If the orbit $\{z_1, z_2, z_3, \ldots\}$ tends to a stable $k$-cycle ($k > 1$) with the periodic points $\{F^k(z^*_k), F^{k-1}(z^*_k), \ldots, F(z^*_k)\} = \{z_{t^*_k}, z_{t^*_k - 1}, \ldots, z_{t^*_k - (k-1)}\}$, then, denoting the stable $k$-cycle by

$$\Gamma_k = \{ z_{t^*_k}, z_{t^*_k - 1}, \ldots, z_{t^*_k - (k-1)}, z_{t^*_k}, z_{t^*_k - 1}, \ldots, z_{t^*_k - (k-1)}, \ldots \}, \quad (26)$$

we have

$$\lim_{T \to \infty} d(z_T, \Gamma_k) = 0. \quad (27)$$

Hence, there exist a neighborhood $U$ of $\Gamma_k$ and $k$ sub-sequences $\{z_{T_{kn}}\}_{n=1}^{\infty}, \{z_{T_{kn+1}}\}_{n=1}^{\infty}, \ldots, \{z_{T_{kn+(k-1)}}\}_{n=1}^{\infty}$ of the sequence $\{z_T\}_{T=1}^{\infty}$ such that these sub-sequences belong to $U$ and

(i) $z_{T_{kn+s}} = F^k(z_{T_{k(n-1)+s}})$, $s = 0, 1, \ldots, k-1$,
(ii) $\lim_{T \to \infty} z_{T_{kn+s}} = z_{t^*_k - s}$, $s = 0, 1, \ldots, k-1$,
(iii) for every $z_T \in U$ there is some $s \in \{0, 1, \ldots, k-1\}$ such that $z_T \in \{z_{T_{kn+s}}\}_{n=1}^{\infty}$.

In this case, for every $z_T \in U$ with $z_T \in \{z_{T_{kn+s}}\}_{n=1}^{\infty}$ we have $\lim_{T \to \infty} z_T = z_{t^*_k - s}$ for some $s \in \{0, 1, \ldots, k-1\}$. Therefore, continuity of $F$ implies that $\lim_{T \to \infty} F(z_T) = F(z_{t^*_k - s})$ and so $\lim_{T \to \infty} \left( W_{\Omega(T)} z_T + h \right) = W_{\Omega(t^*_k - s)} z_{t^*_k - s} + h$. Thus, similarly as above, we can prove that

$$\exists s \in \{0, 1, \ldots, k-1\} \text{ s.t. } \lim_{T \to \infty} W_{\Omega(T)} = W_{\Omega(t^*_k - s)}.$$

Analogously, for every $i \in \mathbb{N}$ ($1 < i < \infty$),

$$\exists s_i \in \{0, 1, \ldots, k-1\} \text{ s.t. } \lim_{T \to \infty} W_{\Omega(T-i)} = W_{\Omega(t^*_k - s_i)}.$$

On the other hand, $\| W_{\Omega(t^*_k - s_i)} \| < 1$ for all $s_i \in \{0, 1, \ldots, k-1\}$. So, without loss of generality, assuming $\max_{0 \leq s_i \leq k-1} \| W_{\Omega(t^*_k - s_i)} \| = \| W_{\Omega(t^*_k)} \| < 1$, we can again obtain relations analogous to eq. 23-eq. 25 for $t^*_k$, $k \geq 1$. Since $\{z_{T-1}\}_{T=1}^{\infty}$ is a convergent sequence, it is bounded, i.e. there exists a real number $q > 0$ such that $\| z_{T-1} \| \leq q$ for all $T \in \mathbb{N}$. Furthermore, $\| D_{\Omega(T-1)} \| \leq 1$ for all $T$. Therefore, by eq. 12 and eq. 23 (for $t^*_k$, $k \geq 1$),

$$\left\| \frac{\partial z_T}{\partial w_{mk}} \right\| = \left\| \mathbb{1}_{(m,k)} D_{\Omega(T-1)} z_{T-1} + \sum_{j=2}^{T-1} \prod_{i=1}^{j-1} W_{\Omega(T-i)} \, \mathbb{1}_{(m,k)} D_{\Omega(T-j)} z_{T-j} + \prod_{i=1}^{T-1} W_{\Omega(T-i)} D_{\Omega(1)} z_1 \right\| \quad (32)$$
$$\leq \| z_{T-1} \| + \sum_{j=2}^{T-1} \left\| \prod_{i=1}^{j-1} W_{\Omega(T-i)} \right\| \| z_{T-j} \| + \left\| \prod_{i=1}^{T-1} W_{\Omega(T-i)} \right\| \| z_1 \|$$
$$\leq q \left( 1 + \sum_{j=2}^{T-1} \left( \| W_{\Omega(t^*_k)} \| + \bar{\varepsilon} \right)^{j-1} \right) + \left( \| W_{\Omega(t^*_k)} \| + \bar{\varepsilon} \right)^{T-1} \| z_1 \|. \quad (33)$$

Thus, by $\| W_{\Omega(t^*_k)} \| + \bar{\varepsilon} < 1$, we have

$$\lim_{T \to \infty} \left\| \frac{\partial z_T}{\partial w_{mk}} \right\| \leq q \left( 1 + \frac{\| W_{\Omega(t^*_k)} \| + \bar{\varepsilon}}{1 - \| W_{\Omega(t^*_k)} \| - \bar{\varepsilon}} \right) = \mathcal{M} < \infty, \quad (34)$$

i.e., by eq. 14 and eq. 15, the 2-norm of the total gradient matrices, and hence $\| \partial z_t / \partial W \|_2$, will not diverge (explode) under the assumptions of Theorem 1. Analogously, one can show that $\| \partial z_T / \partial A \|_2$ and $\| \partial z_T / \partial h \|_2$ will not diverge either: relation eq. 34 holds for $\partial z_T / \partial a_{mm}$ with $q$ replaced by $\tilde{q}$, the upper bound of $\| z_T \|$ (which exists since $\{z_T\}_{T=1}^{\infty}$ is convergent), and for $\partial z_T / \partial h_m$ with $q = 1$.

Remark 2.1. By eq. 24, the Jacobians $\| \partial z_T / \partial z_t \|_2$ connecting any two states $z_T$ and $z_t$, $T > t$, will not diverge either.

Corollary 2.1. The results of Theorem 1 also hold if $W_{\Omega(t^*_k)}$ is a normal matrix with spectral radius $\rho(W_{\Omega(t^*_k)}) < 1$. Proof: If $W_{\Omega(t^*_k)}$ is normal, then $\| W_{\Omega(t^*_k)} \|_2 = \rho(W_{\Omega(t^*_k)}) < 1$, which satisfies the conditions of Theorem 1.
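The central estimate of the proof, that long products of Jacobians $W_{\Omega(t)}$ with 2-norm below 1 contract to zero, is easy to check numerically. A small sketch with randomly drawn region indicators $D_{\Omega(t)}$ (matrix sizes and scales are arbitrary choices for illustration):

```python
import numpy as np

def jacobian_product_norm(T=200, M=5, seed=0):
    """Multiply T PLRNN Jacobians W_Omega(t) = A + W D_Omega(t) with random
    region indicators D_Omega(t) and return the 2-norm of the product,
    which corresponds to ||dz_T/dz_1||. With ||W_Omega(t)|| < 1 for all t,
    this norm should vanish as T grows (eqs. 23-25)."""
    rng = np.random.default_rng(seed)
    A = np.diag(rng.uniform(0.1, 0.5, M))   # stable diagonal part
    W = 0.05 * rng.standard_normal((M, M))  # weak coupling keeps norms < 1
    prod = np.eye(M)
    for _ in range(T):
        D = np.diag(rng.integers(0, 2, M))  # random region indicator
        prod = (A + W @ D) @ prod
    return np.linalg.norm(prod, 2)
```

For these scales the product norm is numerically indistinguishable from zero after a few hundred steps, which is exactly the vanishing-gradient behavior the regularization of sect. 4.3 is designed to counteract.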

6.1.4. Proof of Theorem 2

Let $A$, $W$ and $D_{\Omega(k)}$, $t < k \leq T$, be partitioned as follows (cf. Fig. S1):

$$A = \begin{pmatrix} I_{\mathrm{reg}} & O^T \\ O & A_{\mathrm{nreg}} \end{pmatrix}, \quad W = \begin{pmatrix} O_{\mathrm{reg}} & O^T \\ S & W_{\mathrm{nreg}} \end{pmatrix}, \quad D_{\Omega(k)} = \begin{pmatrix} D^k_{\mathrm{reg}} & O^T \\ O & D^k_{\mathrm{nreg}} \end{pmatrix},$$

where $I_{\mathrm{reg}} \in \mathbb{R}^{M_{\mathrm{reg}} \times M_{\mathrm{reg}}}$ is the identity matrix, $O_{\mathrm{reg}} \in \mathbb{R}^{M_{\mathrm{reg}} \times M_{\mathrm{reg}}}$ the zero matrix, $O, S \in \mathbb{R}^{(M - M_{\mathrm{reg}}) \times M_{\mathrm{reg}}}$, $A_{\mathrm{nreg}} := A_{\{M_{\mathrm{reg}}+1:M, \, M_{\mathrm{reg}}+1:M\}} \in \mathbb{R}^{(M - M_{\mathrm{reg}}) \times (M - M_{\mathrm{reg}})}$ is a diagonal sub-matrix, and $W_{\mathrm{nreg}} := W_{\{M_{\mathrm{reg}}+1:M, \, M_{\mathrm{reg}}+1:M\}} \in \mathbb{R}^{(M - M_{\mathrm{reg}}) \times (M - M_{\mathrm{reg}})}$ is an off-diagonal sub-matrix. Moreover, $D^k_{\mathrm{reg}} \in \mathbb{R}^{M_{\mathrm{reg}} \times M_{\mathrm{reg}}}$ and $D^k_{\mathrm{nreg}} \in \mathbb{R}^{(M - M_{\mathrm{reg}}) \times (M - M_{\mathrm{reg}})}$ are diagonal sub-matrices. Then we have

$$\prod_{t < k \leq T} W_{\Omega(k)} = \prod_{t < k \leq T} \begin{pmatrix} I_{\mathrm{reg}} & O^T \\ S D^k_{\mathrm{reg}} & A_{\mathrm{nreg}} + W_{\mathrm{nreg}} D^k_{\mathrm{nreg}} \end{pmatrix} =: \prod_{t < k \leq T} \begin{pmatrix} I_{\mathrm{reg}} & O^T \\ S D^k_{\mathrm{reg}} & W^k_{\mathrm{nreg}} \end{pmatrix} = \begin{pmatrix} I_{\mathrm{reg}} & O^T \\ S D^{t+1}_{\mathrm{reg}} + \sum_{j=2}^{T-t} \prod_{t < k \leq t+j-1} W^k_{\mathrm{nreg}} \, S D^{t+j}_{\mathrm{reg}} & \prod_{t < k \leq T} W^k_{\mathrm{nreg}} \end{pmatrix}. \quad (36)$$

Therefore, considering the 2-norm, we obtain

$$\left\| \frac{\partial z_T}{\partial z_t} \right\| = \left\| \prod_{t < k \leq T} W_{\Omega(k)} \right\| < \infty. \quad (37)$$

Moreover,

$$1 \leq \max\{1, \rho(\widetilde{W}_{T-t})\} = \rho\left( \prod_{t < k \leq T} W_{\Omega(k)} \right) \leq \left\| \prod_{t < k \leq T} W_{\Omega(k)} \right\| = \left\| \frac{\partial z_T}{\partial z_t} \right\|, \quad (38)$$

where $\widetilde{W}_{T-t} := \prod_{t < k \leq T} W^k_{\mathrm{nreg}}$. Therefore, eq. 37 and eq. 38 yield

$$1 \leq \rho_{\mathrm{low}} \leq \left\| \frac{\partial z_T}{\partial z_t} \right\| \leq \rho_{\mathrm{up}} < \infty. \quad (39)$$

Furthermore, we assumed that the non-regularized subsystem $(z_{M_{\mathrm{reg}}+1}, \ldots, z_M)$, if considered in isolation, satisfies Theorem 1. Hence, similar to the proof of Theorem 1, it is concluded that

$$\lim_{T \to \infty} \prod_{k=t}^{T} W^k_{\mathrm{nreg}} = O_{\mathrm{nreg}}. \quad (40)$$

On the other hand, by definition of $D_{\Omega(k)}$, we have $\| D^k_{\mathrm{reg}} \| \leq 1$ for every $t < k \leq T$, and so $\| S D^k_{\mathrm{reg}} \| \leq \| S \| \| D^k_{\mathrm{reg}} \| \leq \| S \|$, which, in accordance with the assumptions of Theorem 1, by convergence of $\sum_{j=2}^{\infty} \left\| \prod_{k=t+1}^{t+j-1} W^k_{\mathrm{nreg}} \right\|$ implies

$$\lim_{T \to \infty} \left\| S D^{t+1}_{\mathrm{reg}} + \sum_{j=2}^{T-t} \prod_{k=t+1}^{t+j-1} W^k_{\mathrm{nreg}} \, S D^{t+j}_{\mathrm{reg}} \right\| \leq \| S \| \left( 1 + \lim_{T \to \infty} \sum_{j=2}^{T-t} \left\| \prod_{k=t+1}^{t+j-1} W^k_{\mathrm{nreg}} \right\| \right) \leq \| S \| \, \mathcal{M}_{\mathrm{nreg}}. \quad (41)$$

Thus, denoting $Q := S D^{t+1}_{\mathrm{reg}} + \sum_{j=2}^{T-t} \prod_{t < k \leq t+j-1} W^k_{\mathrm{nreg}} \, S D^{t+j}_{\mathrm{reg}}$, from eq. 41 we deduce that

$$\lambda_{\max}\left( \lim_{T \to \infty} (Q^T Q) \right) = \lim_{T \to \infty} \rho(Q^T Q) \leq \lim_{T \to \infty} \| Q^T Q \| = \lim_{T \to \infty} \| Q \|^2 \leq \left( \| S \| \, \mathcal{M}_{\mathrm{nreg}} \right)^2. \quad (42)$$

Now, if $T - t$ tends to $\infty$, then eq. 37, eq. 39 and eq. 42 result in

$$1 = \rho_{\mathrm{low}} \leq \left\| \frac{\partial z_T}{\partial z_t} \right\| = \sigma_{\max} \begin{pmatrix} I_{\mathrm{reg}} & O^T \\ Q & O_{\mathrm{nreg}} \end{pmatrix} = \sqrt{\lambda_{\max}\left( I_{\mathrm{reg}} + \lim_{T \to \infty} (Q^T Q) \right)} = \rho_{\mathrm{up}} < \infty. \quad (43)$$

Remark 2.2. If $S = 0$, then $\| \partial z_T / \partial z_t \| \to 1$ as $T - t \to \infty$.
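The block structure exploited in the proof can likewise be illustrated numerically: for a perfectly regularized subsystem ($A_{\mathrm{reg}} = I$, the regularized rows of $W$ set to zero), the norm of the Jacobian product stays pinned at (or just above) 1 instead of vanishing. A sketch under these idealized assumptions:

```python
import numpy as np

def regularized_jacobian_norms(T=200, M=6, M_reg=3, seed=1):
    """Jacobian products for a PLRNN whose first M_reg units form a perfect
    manifold attractor (A_reg = I, regularized rows of W = 0). Returns
    ||prod_k W_Omega(k)||_2 over T steps, which by Theorem 2 stays within
    [1, rho_up] instead of decaying to zero."""
    rng = np.random.default_rng(seed)
    A = np.diag(np.concatenate([np.ones(M_reg),              # A_reg = I
                                rng.uniform(0.1, 0.4, M - M_reg)]))
    W = 0.05 * rng.standard_normal((M, M))
    W[:M_reg, :] = 0.0  # regularized units receive no recurrent input
    prod = np.eye(M)
    for _ in range(T):
        D = np.diag(rng.integers(0, 2, M))  # random region indicator
        prod = (A + W @ D) @ prod
    return np.linalg.norm(prod, 2)
```

Compare with the previous sketch for the fully contracting case: there the product norm vanishes, here the identity block $I_{\mathrm{reg}}$ keeps it bounded below by 1, realizing the gradient-preserving 'memory subspace'.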

6.1.5. Details on EM algorithm and DS reconstruction

For DS reconstruction we require that the latent RNN approximates the true generating system of equations, which is a taller order than learning the mapping $S \to X$ or predicting future values in a time series (cf. sect. 3.5).² This point has important implications for the design of models, inference algorithms and performance metrics if the primary goal is DS reconstruction rather than 'mere' time series forecasting.³ In this context we consider the fully probabilistic, generative RNN eq. 1. Together with eq. 2 (where we take $g(z_t) = \phi(z_t)$), this gives the typical form of a nonlinear state space model (Durbin & Koopman, 2012) with observation and process noise. We solve for the parameters $\theta = \{A, W, C, h, \mu_0, \Sigma, B, \Gamma\}$ by maximum likelihood, for which an efficient Expectation-Maximization (EM) algorithm has recently been suggested (Durstewitz, 2017; Koppe et al., 2019), which we summarize here. Since the involved integrals are not tractable, we start off from the evidence lower bound (ELBO) to the log-likelihood, which can be rewritten in various useful ways:

$$\log p(X|\theta) \geq E_{Z \sim q}[\log p_\theta(X, Z)] + H(q(Z|X)) = \log p(X|\theta) - D_{\mathrm{KL}}\left( q(Z|X) \,\|\, p_\theta(Z|X) \right) =: \mathcal{L}(\theta, q).$$

In the E-step, given a current estimate $\theta^*$ for the parameters, we seek to determine the posterior $p_\theta(Z|X)$, which we approximate by a global Gaussian $q(Z|X)$ instantiated by the maximizer (mode) $Z^*$ of $p_\theta(Z|X)$ as an estimator of the mean, and the negative inverse Hessian around this maximizer as an estimator of the state covariance, i.e.

$$E[Z|X] \approx Z^* = \arg\max_Z \log p_\theta(Z|X) = \arg\max_Z \left[ \log p_\theta(X|Z) + \log p_\theta(Z) - \log p_\theta(X) \right] = \arg\max_Z \left[ \log p_\theta(X|Z) + \log p_\theta(Z) \right], \quad (45)$$

since $Z$ integrates out in $p_\theta(X)$ (equivalently, this result can be derived from a Laplace approximation to the log-likelihood, $\log p(X|\theta) \approx \log p_\theta(X|Z^*) + \log p_\theta(Z^*) - \frac{1}{2} \log |{-}\mathcal{L}^*| + \mathrm{const}$, where $\mathcal{L}^*$ is the Hessian evaluated at the maximizer).
We solve this optimization problem by a fixed-point iteration scheme that efficiently exploits the model's piecewise-linear structure, as detailed below. Using this approximate posterior for $p_\theta(Z|X)$, most of the expectation values $E_{z \sim q}[\phi(z)]$, $E_{z \sim q}[\phi(z) z^T]$, and $E_{z \sim q}[\phi(z) \phi(z)^T]$ can, again thanks to the model's piecewise-linear structure, be solved for (semi-)analytically (where $z$ is the concatenated vector form of $Z$, see below). In the M-step, we seek $\theta^* := \arg\max_\theta \mathcal{L}(\theta, q^*)$, assuming the proposal density $q^*$ to be given from the E-step, which for a Gaussian observation model amounts to a simple linear regression problem (see Suppl. eq. 49). To force the PLRNN to really capture the underlying DS in its governing equations, we use a previously suggested (Koppe et al., 2019) stepwise annealing protocol that gradually shifts the burden of fitting the observations $X$ from the observation model eq. 2 to the latent RNN model eq. 1 during training; the idea is to first establish a mapping from latent states $Z$ to observations $X$, fix this, and then enforce the temporal consistency constraints implied by eq. 1 while accounting for the actual observations. We now briefly outline the fixed-point-iteration algorithm for solving the maximization problem in eq. 45 (for more details see Durstewitz (2017); Koppe et al. (2019)). Given a Gaussian latent PLRNN and a Gaussian observation model, the joint density $p(X, Z)$ will be piecewise Gaussian, and hence eq. 45 piecewise quadratic in $Z$. Let us concatenate all state variables across $m$ and $t$ into one long column vector $z = (z_{1,1}, \ldots, z_{M,1}, \ldots, z_{1,T}, \ldots, z_{M,T})^T$, arrange the matrices $A$, $W$ into large $MT \times MT$ block tri-diagonal matrices, define $d_\Omega := (\mathbb{1}_{z_{1,1} > 0}, \mathbb{1}_{z_{2,1} > 0}, \ldots, \mathbb{1}_{z_{M,T} > 0})^T$ as an indicator vector with a 1 for all states $z_{m,t} > 0$ and zeros otherwise, and $D_\Omega := \mathrm{diag}(d_\Omega)$ as the diagonal matrix formed from this vector.
Collecting all terms quadratic, linear, or constant in $z$, we can then write the optimization criterion in the form

$$Q^*_\Omega(z) = -\frac{1}{2} \left[ z^T \left( U_0 + D_\Omega U_1 + U_1^T D_\Omega + D_\Omega U_2 D_\Omega \right) z - z^T (v_0 + D_\Omega v_1) - (v_0 + D_\Omega v_1)^T z \right] + \mathrm{const}.$$

In essence, the algorithm now iterates between the two steps:

1. Given fixed $D_\Omega$, solve $z^* = \left( U_0 + D_\Omega U_1 + U_1^T D_\Omega + D_\Omega U_2 D_\Omega \right)^{-1} (v_0 + D_\Omega v_1)$.
2. Given fixed $z^*$, recompute $D_\Omega$,

until either convergence or one of several stopping criteria (partly likelihood-based, partly to avoid loops) is reached. The solution may afterwards be refined by one quadratic programming step. Numerical experiments showed this algorithm to be very fast and efficient (Durstewitz, 2017; Koppe et al., 2019). At $z^*$, an estimate of the state covariance is then obtained as the inverse negative Hessian,

$$V = \left( U_0 + D_\Omega U_1 + U_1^T D_\Omega + D_\Omega U_2 D_\Omega \right)^{-1}. \quad (48)$$

In the M-step, using the proposal density $q^*$ from the E-step, the solution to the maximization problem $\theta^* := \arg\max_\theta \mathcal{L}(\theta, q^*)$ can generally be expressed in the form

$$\theta^* = \left( \sum_t E[\alpha_t \beta_t^T] \right) \left( \sum_t E[\beta_t \beta_t^T] \right)^{-1}, \quad (49)$$

where, for the latent model eq. 1, $\alpha_t = z_t$ and $\beta_t := \left( z_{t-1}^T, \phi(z_{t-1})^T, s_t^T, 1 \right)^T \in \mathbb{R}^{2M+K+1}$, and for the observation model eq. 2, $\alpha_t = x_t$ and $\beta_t = g(z_t)$.

6.1.6. More details on DS performance measure

As argued before (Koppe et al., 2019; Wood, 2010), in DS reconstruction we require that the RNN captures the underlying attractor geometries and state space properties. This does not necessarily entail that the reconstructed system could predict future time series observations more than a few time steps ahead, and vice versa. For instance, if the underlying attractor is chaotic, even if we had the exact true system available, with a tiny bit of noise trajectories starting from the same initial condition will quickly diverge, and ahead-prediction errors become essentially meaningless as a DS performance metric (Fig. S2B).
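The two-step E-step iteration above can be sketched as follows, with generic $U_0$, $U_1$, $U_2$, $v_0$, $v_1$ standing in for the model-derived quadratic-form matrices (a simplified illustration without the likelihood-based stopping heuristics and the quadratic-programming refinement):

```python
import numpy as np

def mode_fixed_point_iteration(U0, U1, U2, v0, v1, max_iter=100):
    """Iterate between (1) solving the piecewise-quadratic criterion for z*
    given the current region indicator D_Omega, and (2) recomputing D_Omega
    from the sign pattern of z*, until the indicator no longer changes."""
    MT = len(v0)
    d = np.zeros(MT)  # initial region indicator d_Omega
    for _ in range(max_iter):
        D = np.diag(d)
        H = U0 + D @ U1 + U1.T @ D + D @ U2 @ D
        z = np.linalg.solve(H, v0 + D @ v1)   # step 1: solve for z*
        d_new = (z > 0).astype(float)          # step 2: recompute D_Omega
        if np.array_equal(d_new, d):
            break  # converged: indicator is self-consistent
        d = d_new
    return z, d
```

In the real algorithm the matrices $U_0$, $U_1$, $U_2$ and vectors $v_0$, $v_1$ are assembled from the model parameters and data as block tri-diagonal quantities, and additional stopping criteria guard against cycling between region configurations.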
To quantify how well an inferred PLRNN captures the underlying dynamics, we therefore followed Koppe et al. (2019) and used the Kullback-Leibler divergence between the true and reproduced probability distributions across states in state space, thus assessing the agreement in attractor geometries (cf. Takens (1981); Sauer et al. (1991)) rather than the precise matching of time series:

$$D_{\mathrm{KL}}\left( p_{\mathrm{true}}(x) \,\|\, p_{\mathrm{gen}}(x|z) \right) \approx \sum_{k=1}^{K} \hat{p}^{(k)}_{\mathrm{true}}(x) \log \frac{\hat{p}^{(k)}_{\mathrm{true}}(x)}{\hat{p}^{(k)}_{\mathrm{gen}}(x|z)},$$

where $p_{\mathrm{true}}(x)$ is the true distribution of observations across state space (not time!), $p_{\mathrm{gen}}(x|z)$ is the distribution of observations generated by running the inferred PLRNN, and the sum indicates a spatial discretization (binning) of the observed state space. We emphasize that $\hat{p}^{(k)}_{\mathrm{gen}}(x|z)$ is obtained from freely simulated trajectories, i.e. drawn from the prior $p(z)$ specified by eq. 1, not from the inferred posteriors $p(z|x_{\mathrm{train}})$. In addition, to assess the reproduction of time scales by the inferred PLRNN, the average MSE between the power spectra of the true and generated time series was computed, as displayed in Fig. 3B-C. The measure $D_{\mathrm{KL}}$ introduced above only works for situations where the ground truth $p_{\mathrm{true}}(X)$ is known. Following Koppe et al. (2019), we next briefly indicate how a proxy for $D_{\mathrm{KL}}$ may be obtained in empirical situations where no ground truth is available.
Reasoning that for a well reconstructed DS the inferred posterior $p_{\mathrm{inf}}(z|x)$ given the observations should be a good representative of the prior generative dynamics $p_{\mathrm{gen}}(z)$, one may use the Kullback-Leibler divergence between the distribution over latent states obtained by sampling from the prior density $p_{\mathrm{gen}}(z)$ and the (data-constrained) posterior distribution $p_{\mathrm{inf}}(z|x)$ (where $z \in \mathbb{R}^{M \times 1}$ and $x \in \mathbb{R}^{N \times 1}$), taken across the system's state space:

$$D_{\mathrm{KL}}\left( p_{\mathrm{inf}}(z|x) \,\|\, p_{\mathrm{gen}}(z) \right) = \int_{z \in \mathbb{R}^{M \times 1}} p_{\mathrm{inf}}(z|x) \log \frac{p_{\mathrm{inf}}(z|x)}{p_{\mathrm{gen}}(z)} \, dz. \quad (51)$$

As evaluating this integral is difficult, one could further approximate $p_{\mathrm{inf}}(z|x)$ and $p_{\mathrm{gen}}(z)$ by Gaussian mixtures across trajectories, i.e. $p_{\mathrm{inf}}(z|x) \approx \frac{1}{T} \sum_{t=1}^{T} p(z_t | x_{1:T})$ and $p_{\mathrm{gen}}(z) \approx \frac{1}{L} \sum_{l=1}^{L} p(z_l | z_{l-1})$, where the mean and covariance of $p(z_t | x_{1:T})$ and $p(z_l | z_{l-1})$ are obtained by marginalizing over the multivariate distributions $p(Z|X)$ and $p_{\mathrm{gen}}(Z)$, respectively, yielding $E[z_t | x_{1:T}]$, $E[z_l | z_{l-1}]$, and the covariance matrices $\mathrm{Var}(z_t | x_{1:T})$ and $\mathrm{Var}(z_l | z_{l-1})$. Supplementary eq. 51 may then be numerically approximated through Monte Carlo sampling (Hershey & Olsen, 2007) by

$$D_{\mathrm{KL}}\left( p_{\mathrm{inf}}(z|x) \,\|\, p_{\mathrm{gen}}(z) \right) \approx \frac{1}{n} \sum_{i=1}^{n} \log \frac{p_{\mathrm{inf}}(z^{(i)}|x)}{p_{\mathrm{gen}}(z^{(i)})}, \quad z^{(i)} \sim p_{\mathrm{inf}}(z|x).$$

Alternatively, there is also a variational approximation of eq. 51 available (Hershey & Olsen, 2007):

$$D^{\mathrm{variational}}_{\mathrm{KL}}\left( p_{\mathrm{inf}}(z|x) \,\|\, p_{\mathrm{gen}}(z) \right) \approx \frac{1}{T} \sum_{t=1}^{T} \log \frac{\sum_{j=1}^{T} e^{-D_{\mathrm{KL}}(p(z_t|x_{1:T}) \,\|\, p(z_j|x_{1:T}))}}{\sum_{k=1}^{T} e^{-D_{\mathrm{KL}}(p(z_t|x_{1:T}) \,\|\, p(z_k|z_{k-1}))}},$$

where the KL divergences in the exponentials are between Gaussians, for which we have an analytical expression.

6.1.7. More details on benchmark tasks and model comparisons

We compared the performance of our rPLRNN to the other models summarized in Suppl. Table 1.
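A straightforward implementation of the binned state-space $D_{\mathrm{KL}}$ described above (spatial discretization of the observed state space; the bin count and the smoothing constant are arbitrary illustrative choices):

```python
import numpy as np

def state_space_kl(x_true, x_gen, n_bins=30, eps=1e-8):
    """Approximate D_KL between the distributions of true and generated
    observations across (binned) state space, ignoring temporal order.
    x_true, x_gen: (T, N) arrays of observed / freely simulated points."""
    lo = np.minimum(x_true.min(0), x_gen.min(0))
    hi = np.maximum(x_true.max(0), x_gen.max(0))
    bins = [np.linspace(lo[i], hi[i], n_bins + 1)
            for i in range(x_true.shape[1])]
    p, _ = np.histogramdd(x_true, bins=bins)
    q, _ = np.histogramdd(x_gen, bins=bins)
    p = p.ravel() / p.sum()
    q = q.ravel() / q.sum()
    mask = p > 0  # 0 * log(0) := 0
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))
```

As emphasized in the text, `x_gen` must come from freely simulated trajectories of the fitted model, not from posterior (data-constrained) estimates, for this measure to reflect the generative dynamics.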
2) The multiplication problem is the same as the addition problem, only that the product instead of the sum has to be produced by the RNN as an output at time $T$: $x^{\mathrm{target}}_T = s_{1,t_1} \cdot s_{1,t_2}$. 3) The MNIST dataset (LeCun et al., 2010) consists of 60,000 training and 10,000 test images ($28 \times 28$ pixels) of handwritten digits. To make this a time series problem, in sequential MNIST the images are presented sequentially, pixel by pixel, scanning lines from upper left to bottom right, resulting in time series of fixed length $T = 784$. For training on the addition and multiplication problems, the mean squared-error loss across $R$ samples, $L = \frac{1}{R} \sum_{n=1}^{R} \left( x^{\mathrm{target},(n)}_T - \hat{x}^{(n)}_T \right)^2$, was used, while a categorical cross-entropy loss, with $x_{i,t} \in \{0, 1\}$ and $\sum_i x_{i,t} = 1$, was employed for sequential MNIST. We remark that as long as the observation model takes the form of a generalized linear model (Fahrmeir & Tutz, 2001), as assumed here, meaning may be assigned to the latent states $z_m$ by virtue of their association with specific sets of observations $x_n$ through the factor loading matrix $B$. This adds another layer of model interpretability (besides its accessibility in DS terms). The large error bars in Fig. 2 at the transition from good to bad performance result from the fact that the networks mostly learn these tasks in an all-or-none fashion. While the rPLRNN in general outperformed the purely initialization-based models (iRNN, npRNN, iPLRNN), confirming that a manifold attractor subspace present at initialization may be lost throughout training, we conjecture that this difference in performance will become even more pronounced as noise levels or task complexity increase.
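For concreteness, addition-problem input series as described in this section can be generated along the following lines (a sketch; the exact sampling constraints on $t_1$ and $t_2$ may differ in detail from the original setup):

```python
import numpy as np

def make_addition_batch(R, T, seed=0):
    """Generate R samples of the addition problem: inputs S of shape
    (R, 2, T), where row 1 holds uniform random values in [0, 1] and row 2
    is zero except for two indicator bits at t1 < 10 and t2 < T/2.
    Targets are the sums of the two indicated row-1 entries."""
    rng = np.random.default_rng(seed)
    S = np.zeros((R, 2, T))
    S[:, 0, :] = rng.uniform(0.0, 1.0, (R, T))
    targets = np.zeros(R)
    for n in range(R):
        t1 = rng.integers(0, 10)        # first marker within first 10 steps
        t2 = rng.integers(10, T // 2)   # second marker within first half
        S[n, 1, [t1, t2]] = 1.0
        targets[n] = S[n, 0, t1] + S[n, 0, t2]
    return S, targets
```

Placing both markers in the first half of the series forces the network to retain the marked values over at least $T/2$ steps before producing the output at time $T$; the multiplication variant only changes the target to the product.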
6.1.8. Single neuron model

The neuron model used in section 4.2 is described by

$$C_m \dot{V} = -\left( g_L (V - E_L) + g_{Na} m_\infty(V)(V - E_{Na}) + g_K n (V - E_K) + g_M h (V - E_K) + g_{NMDA} \sigma(V)(V - E_{NMDA}) \right) \quad (55)$$
$$\dot{h} = \frac{h_\infty(V) - h}{\tau_h} \quad (56)$$
$$\dot{n} = \frac{n_\infty(V) - n}{\tau_n} \quad (57)$$
$$\sigma(V) = \left( 1 + 0.33 \, e^{-0.0625 V} \right)^{-1} \quad (58)$$

where $C_m$ refers to the neuron's membrane capacitance, the $g_\bullet$ to the different membrane conductances, the $E_\bullet$ to the respective reversal potentials, and $m$, $h$, and $n$ are gating variables with limiting values given by

$$\{m_\infty, n_\infty, h_\infty\} = \left( 1 + e^{(\{V_{hNa}, V_{hK}, V_{hM}\} - V)/\{k_{Na}, k_K, k_M\}} \right)^{-1}. \quad (59)$$

Different parameter settings in this model lead to different dynamical phenomena, including regular spiking, slow bursting, or chaos (see Durstewitz (2009) for details).

Figure S4: A) 20-step-ahead prediction error between true and generated observations for the rPLRNN as a function of regularization strength τ. B) KL divergence (D_KL) between true and generated state space distributions for the orthogonal PLRNN (oPLRNN; i.e., the PLRNN with the 'manifold attractor regularization' replaced by an orthogonality regularization, (A + W)(A + W)^T → I), as well as for the partially (L2p) and fully (L2f) standard-L2-regularized PLRNNs (i.e., with all weight parameters (L2f) or only a fraction M_reg/M of states (L2p) driven to 0). Note that the quality of the DS reconstruction does not significantly depend on the strength of regularization τ, or even becomes slightly worse, for the oPLRNN, L2pPLRNN and L2fPLRNN. Globally diverging estimates were removed.



¹ Note that this very property of marginal stability required for input integration also makes the system sensitive to noise perturbations directly on the manifold attractor. Interestingly, this property has indeed been observed experimentally for real neural integrator systems (Major et al., 2004; Mizumori & Williams, 1993).

² By reconstructing the governing equations we mean their approximation in the sense of the universal approximation theorems for DS (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998), i.e. such that the behavior of the reconstructed system becomes dynamically equivalent to that of the true underlying system.

³ In this context we also remark that models which include longer histories of hidden activations (Yu et al., 2019), as in many statistical time series models (Fan & Yao, 2003), are not formally valid DS models anymore, since they violate the uniqueness of flow in state space (Strogatz, 2015).



Figure 2: Comparison of rPLRNN (τ = 5, M_reg/M = 0.5, cf. Fig. S3) to other methods for A) the addition problem, B) the multiplication problem, and C) sequential MNIST. Top row gives loss as a function of time series length T (error bars = SEM, n ≥ 5), bottom row shows the relative frequency of correct trials. Note that better performance (lower values in top row, higher values in bottom row) is reflected in a more rightward shift of the curves. Dashed lines indicate chance level; black dots in C indicate individual repetitions.

Figure 3: Reconstruction of a 2-time-scale DS in the limit cycle regime. A) KL divergence (D_KL) between true and generated state space distributions. Globally diverging system estimates were removed. B) Average MSE between power spectra of true and reconstructed DS, and C) split according to low (≤ 50 Hz) and high (> 50 Hz) frequency components. Error bars = SEM (n = 33). D) Example of (best) generated time series (red = reconstruction with τ = 2/3). See Fig. S5A for variable n. E) Dynamics of regularized and non-regularized latent states for the example in D.

Figure 4: A) Distribution of maximum absolute eigenvalues λ of Jacobians around fixed points for the rPLRNN for different τ, and for the L2PLRNN, trained on the bursting neuron DS. B) Absolute deviations of max. |λ| from 1 (using for each system the one eigenvalue with smallest deviation). C) Same as A for the addition problem, for the rPLRNN (τ = 5) vs. the standard, fully L2- (L2f), and partially L2- (L2p) regularized PLRNN. D) Same as B for the models from C. Error bars = stdv. See also Fig. S8.


For sequential MNIST, the categorical cross-entropy loss was employed, with $p_{i,t} := p_t(x_{i,t} = 1 \mid z_t) = e^{B_{i,:} z_t} \big/ \sum_j e^{B_{j,:} z_t}$.

Parameter settings used here were: C_m = 6 µF, g_L = 8 mS, E_L = -80 mV, g_Na = 20 mS, E_Na = 60 mV, V_hNa = -20 mV, k_Na = 15, g_K = 10 mS, E_K = -90 mV, V_hK = -25 mV, k_K = 5, τ_n = 1 ms, g_M = 25 mS, V_hM = -15 mV, k_M = 5, τ_h = 200 ms, g_NMDA = 10.2 mS.
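A simple forward-Euler integration of eqs. 55-59 with the parameter values above can be sketched as follows. Note that E_NMDA is not listed in the parameter settings, so the common choice of 0 mV is assumed here; initial conditions and step size are likewise our own illustrative choices:

```python
import numpy as np

def simulate_bursting_neuron(T_ms=500.0, dt=0.01):
    """Forward-Euler integration of the single-neuron model (eqs. 55-59)
    with the parameters of sect. 6.1.8. Returns the voltage trace in mV.
    Assumption: E_NMDA = 0 mV (not listed in the parameter settings)."""
    Cm, gL, EL = 6.0, 8.0, -80.0
    gNa, ENa, VhNa, kNa = 20.0, 60.0, -20.0, 15.0
    gK, EK, VhK, kK, tau_n = 10.0, -90.0, -25.0, 5.0, 1.0
    gM, VhM, kM, tau_h = 25.0, -15.0, 5.0, 200.0
    gNMDA, ENMDA = 10.2, 0.0  # assumed reversal potential

    def inf(V, Vh, k):  # steady-state gating function, eq. 59
        return 1.0 / (1.0 + np.exp((Vh - V) / k))

    def sigma(V):       # NMDA voltage dependence, eq. 58
        return 1.0 / (1.0 + 0.33 * np.exp(-0.0625 * V))

    V, h, n = -70.0, 0.0, 0.0
    trace = []
    for _ in range(int(T_ms / dt)):
        I = (gL * (V - EL) + gNa * inf(V, VhNa, kNa) * (V - ENa)
             + gK * n * (V - EK) + gM * h * (V - EK)
             + gNMDA * sigma(V) * (V - ENMDA))
        V += dt * (-I / Cm)                       # eq. 55
        h += dt * (inf(V, VhM, kM) - h) / tau_h   # eq. 56 (slow, tau_h = 200 ms)
        n += dt * (inf(V, VhK, kK) - n) / tau_n   # eq. 57 (fast, tau_n = 1 ms)
        trace.append(V)
    return np.array(trace)
```

The widely separated time constants (τ_n = 1 ms vs. τ_h = 200 ms) are what make this system a demanding multi-time-scale reconstruction target.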

Figure S3: Performance of the rPLRNN for different A) numbers of latent states M , B) values of τ , and C-E) proportions M reg /M of regularized states. A-C are for the addition problem, D for the multiplication problem, and E for sequential MNIST. Dashed lines denote the values used for the results reported in section 4.1.

Figure S5: A) Reconstruction of the fast gating variable n (rightmost) not shown in Fig. 3D. For completeness and comparison, the other variables have been re-plotted from Fig. 3D as well. B) Example reconstruction of voltage (V, left) and slow gating (h, center) observations, and underlying latent state dynamics (right), for the oPLRNN (with orthogonality regularization on A + W, see figure legend). C) Example of V (left) and h (center) observations for the standard PLRNN, and underlying latent state dynamics (right). Both the standard PLRNN and the oPLRNN tended to produce many fixed point solutions. In those cases where this was not the case, the standard PLRNN tended to reproduce only the fast components of the dynamics, as in the example in C (in agreement with the results in Figs. 3C & 3E), while the oPLRNN tended to capture only the slow components, as in the example in B (as expected from the fact that the orthogonality constraint tends to produce solutions similar to those obtained for the regularized states only, cf. Fig. 3E).

Figure S6: Reconstruction of a DS with multiple time scales like fast spikes and slow T-waves (simulated ECG signal, see McSharry et al. (2003)). A) KL divergence (D KL ) between true and generated state space distributions as a function of τ . Unstable (globally diverging) system estimates were removed. B) Average MSE between power spectra (slightly smoothed) of true and reconstructed DS. C) Average normalized MSE between power spectra of true and reconstructed DS split according to low (≤ 2.5 Hz) and high (> 2.5 Hz) frequency components. Error bars = SEM in all graphs. D) Example of (best) generated time series (standardized, red=reconstruction with τ = 1000/3600).

Figure S7: Same as Fig. 2, illustrating performance for L2RNN (vanilla RNN with L2 regularization on all weights) and L2fPLRNN (PLRNN with L2 regularization on all weights) on the three problems shown in Fig. 2. Note that the L2fPLRNN is essentially not able to learn any of the tasks, likely because a conventional L2 norm drives the PLRNN parameters away from a manifold attractor configuration (as supported by Fig. 4 and Fig. S8). Results for rPLRNN, vanilla RNN, L2pPLRNN, and LSTM have been re-plotted from Fig. 2 for comparison.

Figure S10: Cross-entropy loss as a function of training epochs for the best model fits on the sequential MNIST task. Note that the LSTM takes longer to converge than the other models. LSTM training was therefore allowed to proceed for 200 epochs, after which convergence was usually reached, while training for all other models was stopped after 100 epochs. Also note that although for the best test performance on sequential MNIST shown here the LSTM slightly surpasses the rPLRNN, on average the rPLRNN performed better than the LSTM (as shown in Fig. 2C), despite having far fewer trainable parameters (when the LSTM was given about the same number of parameters as the rPLRNN, i.e. M/4 units, its performance fell behind even more).

Figure S11: Example reconstruction of a chaotic system, the well-known 3D Lorenz equations, by the rPLRNN. Left: true state space trajectory of the Lorenz system; right: trajectory simulated by the rPLRNN (τ = 100/T, M = 14) after training on a time series of length T = 1000 from the Lorenz system.
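Training data for such a reconstruction can be produced by integrating the Lorenz system with its standard chaotic parameters (σ = 10, ρ = 28, β = 8/3). A minimal Euler-integration sketch; the step size, initial condition, and integration scheme are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def simulate_lorenz(T=1000, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Forward-Euler integration of the 3D Lorenz system in its
    standard chaotic regime (illustrative sketch)."""
    x = np.empty((T, 3))
    x[0] = (1.0, 1.0, 1.0)
    for t in range(1, T):
        X, Y, Z = x[t - 1]
        # Lorenz vector field
        dX = sigma * (Y - X)
        dY = X * (rho - Z) - Y
        dZ = X * Y - beta * Z
        x[t] = x[t - 1] + dt * np.array([dX, dY, dZ])
    return x
```

A higher-order integrator (e.g. Runge-Kutta) would be preferable for longer simulations; Euler with a small step suffices for a short illustrative trajectory.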

We compared the models listed in Table 1 on the following three benchmarks requiring long short-term maintenance of information (Talathi & Vartak (2016); Hochreiter & Schmidhuber (1997)): 1) The addition problem of time length T consists of 100 000 training and 10 000 test samples of 2 × T input series S = {s_1, ..., s_T}, where entries s_{1,:} ∈ [0, 1] are drawn from a uniform random distribution and s_{2,:} ∈ {0, 1} contains zeros except for two indicator bits placed randomly at times t_1 < 10 and t_2 < T/2. The constraints on t_1 and t_2 are chosen such that every trial requires a memory of at least T/2 time steps. At the last time step T, the target output of the network is the sum of the two inputs in s_{1,:} indicated by the 1-entries in s_{2,:}: x_T^target = s_{1,t_1} + s_{1,t_2}.
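The sampling procedure for the addition problem described above can be sketched as follows; the 0-based index convention and the choice of RNG are illustrative assumptions:

```python
import numpy as np

def make_addition_sample(T, seed=None):
    """Generate one addition-problem trial of length T: a 2 x T input
    array and its scalar target (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    s1 = rng.uniform(0.0, 1.0, size=T)      # values to be summed
    s2 = np.zeros(T)                         # indicator channel
    t1 = int(rng.integers(0, 10))            # first bit: t1 < 10
    t2 = int(rng.integers(10, T // 2))       # second bit: t1 < t2 < T/2
    s2[t1] = s2[t2] = 1.0
    target = s1[t1] + s1[t2]                 # x_T^target
    return np.stack([s1, s2]), target
```

Because t2 < T/2 while the target is only read out at time T, every trial forces the network to retain the two marked values over at least T/2 steps.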

Table 1: Overview of the different models used for comparison.

ACKNOWLEDGEMENTS

This work was funded by grants from the German Research Foundation (DFG) to DD (Du 354/10-1, Du 354/8-2 within SPP 1665) and to GK (TRR265: A06 & B08), and under Germany's Excellence Strategy -EXC-2181 -390900948 ('Structures').


Figure S1: Illustration of the 'manifold-attractor-regularization' for the PLRNN's auto-regression matrix A, coupling matrix W, and bias terms h. Regularized values are indicated in red, crosses mark arbitrary values (all other values are set to 0 as indicated).

Figure S2: MSE evaluated between time series is not a good measure for DS reconstruction. A) Time graph (top) and state space (bottom) for the single neuron model (see section 4.2 and Suppl. 6.1.8) with parameters in the chaotic regime (blue curves) and with simple fixed point dynamics in the limit (red line). Although the system has vastly different limiting behaviors (attractor geometries) in these two cases, as visualized in the state space, the agreement in the time series initially seems to indicate a perfect fit. B) Same as in A) for two trajectories drawn from exactly the same DS (i.e., same parameters) with slightly different initial conditions. Despite identical dynamics, the trajectories immediately diverge, resulting in a high MSE. Dash-dotted grey lines in the top graphs indicate the point from which onward the state space trajectories were depicted.

Figure S9: Effect of regularization strength τ on rPLRNN network parameters (cf. eq. 1) (regularized parameters for states m ≤ M_reg, eq. 1, in red). Note that some of the non-regularized network parameters (in blue) also appear to change systematically as τ is varied.
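The manifold-attractor regularization illustrated in Fig. S1 penalizes deviations of the first M_reg units from A_mm = 1, zero coupling rows in W, and zero biases h_m. A minimal sketch of such a penalty; the exact normalization and relative weighting of the three terms are assumptions:

```python
import numpy as np

def manifold_attractor_reg(A, W, h, M_reg, tau):
    """Quadratic penalty pushing the first M_reg units toward a
    manifold-attractor configuration: diagonal of A -> 1, rows of
    W -> 0, biases h -> 0 (illustrative sketch)."""
    a_term = np.sum((np.diag(A)[:M_reg] - 1.0) ** 2)
    w_term = np.sum(W[:M_reg] ** 2)
    h_term = np.sum(h[:M_reg] ** 2)
    return tau * (a_term + w_term + h_term)
```

At the regularized configuration the penalty vanishes, so gradient descent on the total loss only trades off this term against the fit to the data for the regularized units.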

