IDENTIFYING NONLINEAR DYNAMICAL SYSTEMS WITH MULTIPLE TIME SCALES AND LONG-RANGE DEPENDENCIES

Abstract

A main theoretical interest in biology and physics is to identify the nonlinear dynamical system (DS) that generated observed time series. Recurrent Neural Networks (RNNs) are, in principle, powerful enough to approximate any underlying DS, but in their vanilla form suffer from the exploding vs. vanishing gradients problem. Previous attempts to alleviate this problem resulted either in more complicated, mathematically less tractable RNN architectures, or strongly limited the dynamical expressiveness of the RNN. Here we address this issue by suggesting a simple regularization scheme for vanilla RNNs with ReLU activation which enables them to solve long-range dependency problems and express slow time scales, while retaining a simple mathematical structure which makes their DS properties partly analytically accessible. We prove two theorems that establish a tight connection between the regularized RNN dynamics and its gradients, illustrate on DS benchmarks that our regularization approach strongly eases the reconstruction of DS which harbor widely differing time scales, and show that our method is also on par with other long-range architectures like LSTMs on several tasks.

1. INTRODUCTION

Theories in the natural sciences are often formulated in terms of sets of stochastic differential or difference equations, i.e. as stochastic dynamical systems (DS). Such systems exhibit a range of common phenomena, like (limit) cycles, chaotic attractors, or specific bifurcations, which are the subject of nonlinear dynamical systems theory (DST; Strogatz (2015); Ott (2002)). A long-standing desire is to retrieve the generating dynamical equations directly from observed time series data (Kantz & Schreiber, 2004), and thus to 'automatize' the laborious process of scientific theory building to some degree. A variety of machine and deep learning methodologies toward this goal have been introduced in recent years (Chen et al., 2017; Champion et al., 2019; Ayed et al., 2019; Koppe et al., 2019; Hamilton et al., 2017; Razaghi & Paninski, 2019; Hernandez et al., 2020). Often these are based on sufficiently expressive series expansions for approximating the unknown system of generative equations, such as polynomial basis expansions (Brunton et al., 2016; Champion et al., 2019) or recurrent neural networks (RNNs) (Vlachas et al., 2018; Hernandez et al., 2020; Durstewitz, 2017; Koppe et al., 2019). Formally, RNNs are (usually discrete-time) nonlinear DS that are dynamically universal in the sense that they can approximate to arbitrary precision the flow field of any other DS on compact sets of the real space (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Hanson & Raginsky, 2020). Hence, RNNs seem like a good choice for reconstructing, in this sense of dynamically equivalent behavior, the set of governing equations underlying real time series data.
However, RNNs in their vanilla form suffer from the 'vanishing or exploding gradients' problem (Hochreiter & Schmidhuber, 1997; Bengio et al., 1994): During training, error gradients tend to either exponentially explode or decay away across successive time steps, and hence vanilla RNNs face severe problems in capturing long time scales or long-range dependencies in the data. Specially designed RNN architectures equipped with gating mechanisms and linear memory cells have been proposed for mitigating this issue (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). However, from a DST perspective, simpler models that can be more easily analyzed and interpreted in DS terms (Monfared & Durstewitz, 2020a;b), and for which more efficient inference algorithms exist that emphasize approximation of the true underlying DS (Koppe et al., 2019; Hernandez et al., 2020; Zhao & Park, 2020), would be preferable. More recent solutions to the vanishing vs. exploding gradient problem attempt to retain the simplicity of vanilla RNNs by initializing or constraining the recurrent weight matrix to be the identity (Le et al., 2015), orthogonal (Henaff et al., 2016; Helfrich et al., 2018) or unitary (Arjovsky et al., 2016). However, merely initialization-based solutions may be unstable and quickly dissolve during training, while orthogonal or unitary constraints are too restrictive for reconstructing DS, and problematic from a computational perspective more generally (Kerg et al., 2019): For instance, neither chaotic behavior (which requires diverging directions) nor multi-stability, that is, the coexistence of several distinct attractors, is possible.
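The exponential decay or growth of backpropagated gradients can be seen directly in the product of Jacobians that the chain rule accumulates across time steps. The following minimal NumPy sketch illustrates the general phenomenon for a vanilla ReLU RNN (this is a toy demonstration, not code from the paper; the network size, weight scale, and time horizon are arbitrary choices):

```python
import numpy as np

def bptt_jacobian_norm(scale, T=100, n=32, seed=0):
    """Norm of the Jacobian product d z_T / d z_0 for a vanilla ReLU RNN
    z_t = relu(W z_{t-1}).  The weight scale controls whether the product
    decays or grows exponentially with T."""
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    z = rng.standard_normal(n)
    J = np.eye(n)
    for _ in range(T):
        pre = W @ z
        z = np.maximum(0.0, pre)
        D = np.diag((pre > 0).astype(float))  # ReLU derivative
        J = (D @ W) @ J                       # chain rule across one time step
    return np.linalg.norm(J)

print("small weights:", bptt_jacobian_norm(0.5))  # gradient norm vanishes
print("large weights:", bptt_jacobian_norm(2.0))  # gradient norm explodes
```

With sub-unit effective weight scale the gradient norm shrinks by a roughly constant factor per step and vanishes over long horizons; with a larger scale it blows up, which is exactly the dilemma the regularization scheme proposed here is meant to resolve.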
Here we therefore suggest a different solution to the problem which takes inspiration from computational neuroscience: Supported by experimental evidence (Daie et al., 2015; Brody et al., 2003), line or plane attractors have been suggested as a dynamical mechanism for maintaining arbitrary information in working memory (Seung, 1996; Machens et al., 2005), a goal-related active form of short-term memory. A line or plane attractor is a continuous set of marginally stable fixed points to which the system's state converges from some neighborhood, while along the line itself there is neither contraction nor divergence (Fig. 1A). Hence, a line attractor will perform a perfect integration of inputs and retain updated states indefinitely, while a slightly detuned line attractor will equip the system with arbitrarily slow time constants (Fig. 1B). This latter configuration has been suggested as a dynamical basis for neural interval timing (Durstewitz, 2003; 2004). The present idea is to exploit this dynamical setup for long short-term memory and arbitrarily slow time scales by forcing part of the RNN's subspace toward a plane (line) attractor configuration through specifically designed regularization terms. Specifically, our goal here is not so much to beat the state of the art on long short-term memory tasks, but rather to address the exploding vs. vanishing gradient problem within a simple, dynamically tractable RNN, optimized for DS reconstruction and interpretation. For this we build on piecewise-linear RNNs (PLRNNs) (Koppe et al., 2019; Monfared & Durstewitz, 2020b) which employ ReLU activation functions. PLRNNs have a simple mathematical structure (see eq. 1) which makes them dynamically interpretable in the sense that many geometric properties of the system's state space can in principle be computed analytically, including fixed points, cycles, and their stability (Suppl. 6.1.2; Koppe et al. (2019); Monfared & Durstewitz (2020a)), i.e. do not require numerical techniques (Sussillo & Barak, 2013). Moreover, PLRNNs constitute a type of piecewise linear (PWL) map for which many important bifurcations have been comparatively well characterized (Monfared & Durstewitz, 2020a; Avrutin et al., 2019). PLRNNs can furthermore be translated into equivalent continuous-time ordinary differential equation (ODE) systems (Monfared & Durstewitz, 2020b), which comes with further advantages for analysis, e.g. continuous flow fields (Fig. 1A, B). We retain the PLRNN's structural simplicity and analytical tractability while mitigating the exploding vs. vanishing gradient problem by adding special regularization terms for a subset of PLRNN units to the loss function. These terms are designed to push the system toward line attractor configurations, without strictly enforcing them, along some, but not all, directions in state space. We further establish a tight mathematical relationship between the PLRNN dynamics and the behavior of its gradients during training. Finally, we demonstrate that our approach outperforms LSTM and other, initialization-based, methods on a number of 'classical' machine learning benchmarks (Hochreiter & Schmidhuber, 1997). Much more importantly in the present DST context, we demonstrate that our new regularization-supported inference efficiently captures all relevant time scales when reconstructing challenging nonlinear DS with multiple short- and long-range phenomena.
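To illustrate this analytical tractability, the sketch below computes fixed points of a small PLRNN of the commonly used form z_t = A z_{t-1} + W relu(z_{t-1}) + h (the notation follows the PLRNN literature; the parameter values are arbitrary toy choices, and details of eq. 1 in the paper may differ). Within each region of state space where the ReLU activation pattern is fixed, the map is affine, so candidate fixed points are found by solving a linear system and checking that the assumed pattern is self-consistent:

```python
import numpy as np

def plrnn_step(z, A, W, h):
    """One step of a PLRNN: z_t = A z_{t-1} + W relu(z_{t-1}) + h."""
    return A @ z + W @ np.maximum(0.0, z) + h

def fixed_point(A, W, h, pattern):
    """Candidate fixed point for a given 0/1 ReLU activation pattern.
    In that region the map is affine, so z* solves (I - A - W D) z* = h;
    the candidate is valid only if its sign pattern matches the assumption."""
    n = len(h)
    D = np.diag(pattern.astype(float))
    z_star = np.linalg.solve(np.eye(n) - A - W @ D, h)
    consistent = np.array_equal((z_star > 0).astype(int), pattern)
    return z_star, consistent

rng = np.random.default_rng(1)
n = 3
A = np.diag(rng.uniform(0.2, 0.6, n))   # diagonal self-connections (stable)
W = 0.05 * rng.standard_normal((n, n))  # weak off-diagonal coupling
np.fill_diagonal(W, 0.0)
h = rng.standard_normal(n)

# Enumerate all 2^n activation patterns; keep the self-consistent ones.
for k in range(2 ** n):
    pattern = np.array([(k >> i) & 1 for i in range(n)])
    z_star, ok = fixed_point(A, W, h, pattern)
    if ok:
        assert np.allclose(plrnn_step(z_star, A, W, h), z_star)
        print("fixed point", z_star, "with active units", pattern)
```

Note that the brute-force enumeration over 2^n activation patterns is only feasible for small systems; the point here is merely that each candidate reduces to a linear solve rather than a numerical root search.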

2. RELATED WORK

Dynamical systems reconstruction. From a natural science perspective, the goal of reconstructing or identifying the underlying DS is substantially more ambitious than (and different from) building a system that 'merely' yields good ahead predictions: In DS identification we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties (see section 3.5, Fig. S2; Kantz & Schreiber (2004)). Earlier work using RNNs for DS reconstruction (Roweis & Ghahramani, 2002; Yu et al., 2005) mainly focused on inferring the posterior over latent trajectories Z = {z_1, . . . , z_T} given time series data X = {x_1, . . . , x_T}, p(Z|X), and on ahead predictions (Lu et al., 2017), as does much of

