COUPLED OSCILLATORY RECURRENT NEURAL NETWORK (CORNN): AN ACCURATE AND (GRADIENT) STABLE ARCHITECTURE FOR LEARNING LONG TIME DEPENDENCIES

Abstract

Circuits of biological neurons, such as those in the functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data.

1. INTRODUCTION

Recurrent neural networks (RNNs) have achieved tremendous success in a variety of tasks involving sequential (time series) inputs and outputs, ranging from speech recognition to computer vision and natural language processing, among others. However, it is well known that training RNNs to process inputs over long time scales (input sequences) is notoriously hard on account of the so-called exploding and vanishing gradient problem (EVGP) (Pascanu et al., 2013), which stems from the fact that the well-established backpropagation through time (BPTT) algorithm for training RNNs requires computing products of gradients (Jacobians) of the underlying hidden states over very long time scales. Consequently, the overall gradient can grow (to infinity) or decay (to zero) exponentially fast with respect to the number of recurrent interactions.

A variety of approaches have been suggested to mitigate the exploding and vanishing gradient problem. These include adding gating mechanisms to the RNN in order to control the flow of information in the network, leading to architectures such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014), which can overcome the vanishing gradient problem on account of the underlying additive structure. However, the gradients might still explode, and learning very long-term dependencies remains a challenge (Li et al., 2018). Another popular approach for handling the EVGP is to constrain the structure of the underlying recurrent weight matrices by requiring them to be orthogonal (unitary), leading to the so-called orthogonal RNNs (Henaff et al., 2016; Arjovsky et al., 2016; Wisdom et al., 2016; Kerg et al., 2019) and references therein. By construction, the resulting Jacobians have eigen- and singular-spectra with unit norm, alleviating the EVGP. However, as pointed out by Kerg et al. (2019), imposing such constraints on the recurrent matrices may lead to a significant loss of expressivity of the RNN, resulting in inadequate performance on realistic tasks.

In this article, we adopt a different approach, based on the observation that coupled networks of controlled non-linear forced and damped oscillators, which arise in many physical, engineering and biological systems, such as networks of biological neurons, do seem to ensure expressive representations while constraining the dynamics of state variables and their gradients. This motivates us to propose a novel architecture for RNNs, based on time-discretizations of second-order systems of non-linear ordinary differential equations (ODEs) (1) that model coupled oscillators. Under verifiable hypotheses, we are able to rigorously prove precise bounds on the hidden states of these RNNs and their gradients, enabling a possible solution of the exploding and vanishing gradient problem. At the same time, we demonstrate through benchmark numerical experiments that the resulting system still retains sufficient expressivity, i.e. the ability to process complex inputs, with a performance that is competitive with the state of the art on a variety of sequential learning tasks.

2. THE PROPOSED RNN

Our proposed RNN is based on the following second-order system of ODEs,

$$ y'' = \sigma\left(W y + \mathcal{W} y' + V u + b\right) - \gamma y - \epsilon y'. \tag{1} $$

Here, $t \in [0, 1]$ is the (continuous) time variable, $u = u(t) \in \mathbb{R}^d$ is the time-dependent input signal, $y = y(t) \in \mathbb{R}^m$ is the hidden state of the RNN, $W, \mathcal{W} \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{m \times d}$ are weight matrices, $b \in \mathbb{R}^m$ is the bias vector and $\gamma, \epsilon > 0$ are parameters representing the oscillation frequency and the amount of damping (friction) in the system, respectively. $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function, set to $\sigma(u) = \tanh(u)$ here. By introducing the so-called velocity variable $z = y'(t) \in \mathbb{R}^m$, we rewrite (1) as the first-order system:

$$ y' = z, \qquad z' = \sigma\left(W y + \mathcal{W} z + V u + b\right) - \gamma y - \epsilon z. \tag{2} $$

We fix a timestep $0 < \Delta t < 1$ and define our proposed RNN hidden states at time $t_n = n\Delta t \in [0, 1]$ (while omitting the affine output state) as the following IMEX (implicit-explicit) discretization of the first-order system (2):

$$ y_n = y_{n-1} + \Delta t\, z_n, \qquad z_n = z_{n-1} + \Delta t\, \sigma\left(W y_{n-1} + \mathcal{W} z_{n-1} + V u_n + b\right) - \Delta t\, \gamma y_{n-1} - \Delta t\, \epsilon z_{\bar{n}}, \tag{3} $$

with either $\bar{n} = n$ or $\bar{n} = n - 1$. Note that the only difference between the two versions of the RNN (3) lies in the implicit ($\bar{n} = n$) or explicit ($\bar{n} = n - 1$) treatment of the damping term $\epsilon z$ in (2), whereas both versions retain the implicit treatment of the first equation in (2).

Motivation and background. To see that the underlying ODE (2) models a coupled network of controlled forced and damped nonlinear oscillators, we start with the single neuron (scalar) case by setting $d = m = 1$ in (1) and assume an identity activation function $\sigma(x) = x$. Setting $W = \mathcal{W} = V = b = \epsilon = 0$ leads to the simple ODE $y'' + \gamma y = 0$, which exactly models simple harmonic motion whose frequency is controlled by $\gamma$ (the angular frequency is $\sqrt{\gamma}$), for instance that of a mass attached to a spring (Guckenheimer & Holmes, 1990). Letting $\epsilon > 0$ in (1) adds damping or friction to the system (Guckenheimer & Holmes, 1990). Then, by introducing non-zero $V$ in (1), we drive the system with a driving force proportional to the input signal $u(t)$.
The parameters $V, b$ modulate the effect of the driving force, $W$ controls the frequency of oscillations and $\mathcal{W}$ the amount of damping in the system. Finally, the tanh activation mediates a non-linear response in the oscillator. In the coupled network (2) with $m > 1$, each neuron updates its hidden state based on the input signal as well as information from other neurons. The diagonal entries of $W$ (and the scalar hyperparameter $\gamma$) control the frequency, whereas the diagonal entries of $\mathcal{W}$ (and the hyperparameter $\epsilon$) determine the amount of damping for each neuron, respectively, and the non-diagonal entries of these matrices modulate interactions between neurons. Hence, given this behavior of the underlying ODE (2), we term the RNN (3) a coupled oscillatory Recurrent Neural Network (coRNN).

The dynamics of the ODE (2) (and the RNN (3)) for a single neuron are relatively straightforward. As we illustrate in Fig. 6 of supplementary material SM §C, input signals drive the generation of (superpositions of) oscillatory wave-forms, whose amplitude and (multiple) frequencies are controlled by the tunable parameters $W, \mathcal{W}, V, b$. Adding a tanh activation does not change these dynamics much. This is in contrast to truncating tanh to leading non-linear order by setting $\sigma(x) = x - x^3/3$, which yields a Duffing-type oscillator that is characterized by chaotic behavior (Guckenheimer & Holmes, 1990). Adding interactions between neurons leads to further accentuation of this generation of superposed wave forms (see Fig. 6 in SM §C) and even with very simple network topologies, one sees the emergence of non-trivial non-oscillatory hidden states from oscillatory inputs. In practice, a network of a large number of neurons is used and can lead to extremely rich global dynamics.
Hence, we argue that the ability of a network of (forced, driven) oscillators to access a very rich set of output states may lead to high expressivity of the system, allowing it to approximate outputs from complicated sequential inputs.

Oscillator networks are ubiquitous in nature and in engineering systems (Guckenheimer & Holmes, 1990; Strogatz, 2015), with canonical examples being pendulums (classical mechanics), business cycles (economics) and heartbeat (biology) for single oscillators, and electrical circuits for networks of oscillators. Our motivating examples arise in neurobiology, where individual biological neurons can be viewed as oscillators with periodic spiking and firing of the action potential. Moreover, functional circuits of the brain, such as cortical columns and prefrontal-striatal-hippocampal circuits, are being increasingly interpreted as networks of oscillatory neurons; see Stiefel & Ermentrout (2016) for an overview. Following well-established paths in machine learning, such as for convolutional neural networks (LeCun et al., 2015), our focus here is to abstract the essence of functional brain circuits being networks of oscillators and to design an RNN based on much simpler mechanistic systems, such as those modeled by (2), while ignoring the complicated biological details of neural function.
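As a rough illustration of the IMEX update (3) introduced at the start of this section, the following minimal Python sketch advances the hidden states by one step for both choices $\bar{n} = n$ and $\bar{n} = n - 1$. The names `matvec`, `cornn_step`, `run` and `implicit_damping` are hypothetical, `Wz` stands for the second recurrent matrix $\mathcal{W}$, and plain lists are used only so that the sketch runs with the standard library. It is a sketch under these assumptions, not the authors' implementation.

```python
import math
import random


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def cornn_step(y, z, u, W, Wz, V, b, dt, gamma, eps, implicit_damping=False):
    """One step of the IMEX scheme (3); returns (y_n, z_n).

    implicit_damping=True  treats the damping term at step n,
    implicit_damping=False treats it at step n-1.
    """
    Wy, Wzz, Vu = matvec(W, y), matvec(Wz, z), matvec(V, u)
    force = [math.tanh(p + q + r + s) for p, q, r, s in zip(Wy, Wzz, Vu, b)]
    if implicit_damping:
        # Solve z_n = z_{n-1} + dt*(force - gamma*y_{n-1} - eps*z_n) for z_n.
        z_new = [(zi + dt * (f - gamma * yi)) / (1.0 + dt * eps)
                 for zi, f, yi in zip(z, force, y)]
    else:
        z_new = [zi + dt * (f - gamma * yi - eps * zi)
                 for zi, f, yi in zip(z, force, y)]
    # The position update always uses the new velocity z_n.
    y_new = [yi + dt * zn for yi, zn in zip(y, z_new)]
    return y_new, z_new


def run(inputs, W, Wz, V, b, dt, gamma, eps, implicit_damping=False):
    """Unroll the cell over an input sequence, starting from y = z = 0."""
    m = len(b)
    y, z = [0.0] * m, [0.0] * m
    for u in inputs:
        y, z = cornn_step(y, z, u, W, Wz, V, b, dt, gamma, eps, implicit_damping)
    return y, z


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    m, d, T = 4, 2, 50  # illustrative sizes, not taken from the paper
    W, Wz, V = random_matrix(m, m, rng), random_matrix(m, m, rng), random_matrix(m, d, rng)
    inputs = [[rng.random() for _ in range(d)] for _ in range(T)]
    print(run(inputs, W, Wz, V, [0.0] * m, dt=0.05, gamma=1.0, eps=1.0))
```

With all weights set to zero and $\epsilon = 0$, the step reduces to a symplectic Euler update of the harmonic oscillator $y'' + \gamma y = 0$, which is a convenient sanity check.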

Related work.

There is an increasing trend of basing RNN architectures on ODEs and dynamical systems. These approaches can roughly be classified into two branches, namely RNNs based on discretized ODEs and continuous-time RNNs. Examples of continuous-time approaches include neural ODEs (Chen et al., 2018), with ODE-RNNs (Rubanova et al., 2019) as their recurrent extension, as well as E (2017) and references therein, to name just a few. We focus, however, in this article on an ODE-inspired discrete-time RNN, as the proposed coRNN is derived from a discretization of the ODE (1). A good example of a discrete-time ODE-based RNN is the so-called anti-symmetric RNN of Chang et al. (2019), where the RNN architecture is based on a stable ODE resulting from a skew-symmetric hidden weight matrix, which constrains the network to stable (gradient) dynamics. This approach has much in common with the previously mentioned unitary/orthogonal/non-normal RNNs in constraining the structure of the hidden-to-hidden layer weight matrices. However, adding such strong constraints might reduce the expressivity of the resulting RNN and might lead to inadequate performance on complex tasks. In contrast to these approaches, our proposed coRNN does not explicitly constrain the weight matrices but relies on the dynamics of the underlying ODE (and the IMEX discretization (3)) to provide gradient stability. Moreover, no gating mechanisms as in LSTMs/GRUs are used in the current version of coRNN. There is also an increasing interest in designing hybrid methods, which use a discretization of an ODE (in particular a Hamiltonian system) in order to learn the continuous representation of the data; see for instance Greydanus et al. (2019); Chen et al. (2020). Overall, our approach here differs from these papers in our use of networks of oscillators to build the RNN.

3. RIGOROUS ANALYSIS OF THE PROPOSED RNN

An attractive feature of the underlying ODE system (2) lies in the fact that the resulting hidden states (and their gradients) are bounded (see SM §D for precise statements and proofs). Hence, one can expect that a suitable discretization of the ODE (2) that preserves these bounds will not have exploding gradients. We claim that one such structure-preserving discretization is given by the IMEX discretization that results in the RNN (3) and proceed to derive bounds on this RNN below. Following standard practice, we set $y(0) = z(0) = 0$ and, purely for simplicity of exposition, we set the control parameters $\epsilon = \gamma = 1$ and $\bar{n} = n$ in (3), leading to

$$ y_n = y_{n-1} + \Delta t\, z_n, \qquad z_n = \frac{z_{n-1}}{1+\Delta t} + \frac{\Delta t}{1+\Delta t}\,\sigma(A_{n-1}) - \frac{\Delta t}{1+\Delta t}\, y_{n-1}, \qquad A_{n-1} := W y_{n-1} + \mathcal{W} z_{n-1} + V u_n + b. \tag{4} $$

Analogous results and proofs for the case where $\bar{n} = n - 1$ and for general values of $\epsilon, \gamma$ are provided in SM §F.

Bounds on the hidden states. As with the underlying ODE (2), the hidden states of the RNN (3) are bounded, i.e.

Proposition 3.1 Let $y_n, z_n$ be the hidden states of the RNN (4) for $1 \le n \le N$. Then the hidden states satisfy the following (energy) bounds:

$$ y_n^\top y_n + z_n^\top z_n \le n m \Delta t = m t_n \le m. \tag{5} $$

The proof of the energy bound (5) is provided in SM §E.1 and a straightforward variant of the proof (see SM §E.2) yields an estimate on the sensitivity of the hidden states to changing inputs. As with the underlying ODE (see SM §D), this bound rules out chaotic behavior of hidden states.

Bounds on hidden state gradients. We train the RNN (3) to minimize the loss function

$$ E := \frac{1}{N} \sum_{n=1}^{N} E_n, \qquad E_n = \frac{1}{2} \left\| y_n - \bar{y}_n \right\|_2^2, \tag{6} $$

with $\bar{y}$ being the underlying ground truth (training data). During training, we compute gradients of the loss function (6) with respect to the weights and biases $\Theta = [W, \mathcal{W}, V, b]$, i.e.

$$ \frac{\partial E}{\partial \theta} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial E_n}{\partial \theta}, \qquad \forall\, \theta \in \Theta. \tag{7} $$

Proposition 3.2 Let $y_n, z_n$ be the hidden states generated by the RNN (4). We assume that the time step $\Delta t \ll 1$ can be chosen such that

$$ \max\left\{ \frac{\Delta t\,(1 + \|W\|_\infty)}{1 + \Delta t},\; \frac{\Delta t\, \|\mathcal{W}\|_\infty}{1 + \Delta t} \right\} = \eta \le \Delta t^{r}, \qquad \frac{1}{2} \le r \le 1. \tag{8} $$

Denoting $\delta = \frac{1}{1+\Delta t}$, the gradient of the loss function $E$ (6) with respect to any parameter $\theta \in \Theta$ is bounded as

$$ \left| \frac{\partial E}{\partial \theta} \right| \le \frac{3}{2} \left( m + \bar{Y} \sqrt{m} \right), \tag{9} $$

with $\bar{Y} = \max_{1 \le n \le N} \|\bar{y}_n\|_\infty$ being a bound on the underlying training data.

Sketch of the proof. Denoting $X_n = [y_n, z_n]$, we can apply the chain rule repeatedly (for instance as in Pascanu et al. (2013)) to obtain

$$ \frac{\partial E_n}{\partial \theta} = \sum_{1 \le k \le n} \underbrace{\frac{\partial E_n}{\partial X_n} \frac{\partial X_n}{\partial X_k} \frac{\partial^{+} X_k}{\partial \theta}}_{\frac{\partial E_n^{(k)}}{\partial \theta}}. \tag{10} $$

Here, the notation $\frac{\partial^{+} X_k}{\partial \theta}$ refers to taking the partial derivative of $X_k$ with respect to the parameter $\theta$, while keeping the other arguments constant. This quantity can be readily calculated from the structure of the RNN (4) and is presented in the detailed proof provided in SM §E.3. From (6), we can directly compute that $\frac{\partial E_n}{\partial X_n} = [y_n - \bar{y}_n, 0]$. Repeated application of the chain rule and a direct calculation with (4) yields

$$ \frac{\partial X_n}{\partial X_k} = \prod_{k < i \le n} \frac{\partial X_i}{\partial X_{i-1}}, \qquad \frac{\partial X_i}{\partial X_{i-1}} = \begin{bmatrix} I + \Delta t\, B_{i-1} & \Delta t\, C_{i-1} \\ B_{i-1} & C_{i-1} \end{bmatrix}, \tag{11} $$

where $I$ is the identity matrix and

$$ B_{i-1} = \delta \Delta t \left( \operatorname{diag}(\sigma'(A_{i-1}))\, W - I \right), \qquad C_{i-1} = \delta \left( I + \Delta t \operatorname{diag}(\sigma'(A_{i-1}))\, \mathcal{W} \right). \tag{12} $$

It is straightforward to calculate using the assumption (8) that $\|B_{i-1}\|_\infty < \eta$ and $\|C_{i-1}\|_\infty \le \eta + \delta$. Using the definitions of matrix norms and (8), we obtain:

$$ \left\| \frac{\partial X_i}{\partial X_{i-1}} \right\|_\infty \le \max\left( 1 + \Delta t \left( \|B_{i-1}\|_\infty + \|C_{i-1}\|_\infty \right),\; \|B_{i-1}\|_\infty + \|C_{i-1}\|_\infty \right) \le \max\left( 1 + \Delta t (\delta + 2\eta),\; \delta + 2\eta \right) \le 1 + 3 \Delta t^{r}. \tag{13} $$

Therefore, using (11), we have

$$ \left\| \frac{\partial X_n}{\partial X_k} \right\|_\infty \le \prod_{k < i \le n} \left\| \frac{\partial X_i}{\partial X_{i-1}} \right\|_\infty \le \left( 1 + 3 \Delta t^{r} \right)^{n-k} \approx 1 + 3 (n - k) \Delta t^{r}. \tag{14} $$

Note that we have used an expansion around 1 and neglected terms of $O(\Delta t^{2r})$ as $\Delta t \ll 1$. We remark that the bound (13) is the crux of our argument about gradient control, as we see from the structure of the RNN that the recurrent matrices have close to unit norm. The detailed proof is presented in SM §E.3.
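To make the one-step estimate (13) concrete, the following Python sketch assembles the block Jacobian (11) from the matrices $B$ and $C$ in (12) for a small random example, checks the assumption (8), and compares the infinity norm of the Jacobian with $1 + 3\Delta t^{r}$. The helper names (`inf_norm`, `eta`, `step_jacobian`), the sizes, the weight range and the pre-activation vector `A` are illustrative assumptions, not values from the paper; `Wz` stands for $\mathcal{W}$.

```python
import math
import random


def inf_norm(M):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def eta(W, Wz, dt):
    """Left-hand side of assumption (8)."""
    return max(dt * (1.0 + inf_norm(W)) / (1.0 + dt),
               dt * inf_norm(Wz) / (1.0 + dt))


def step_jacobian(W, Wz, A, dt):
    """Assemble the 2m x 2m block matrix in (11) from B and C in (12)."""
    m = len(A)
    delta = 1.0 / (1.0 + dt)
    ds = [1.0 - math.tanh(a) ** 2 for a in A]  # derivative of tanh at A

    def eye(i, j):
        return 1.0 if i == j else 0.0

    B = [[delta * dt * (ds[i] * W[i][j] - eye(i, j)) for j in range(m)]
         for i in range(m)]
    C = [[delta * (eye(i, j) + dt * ds[i] * Wz[i][j]) for j in range(m)]
         for i in range(m)]
    top = [[eye(i, j) + dt * B[i][j] for j in range(m)]
           + [dt * C[i][j] for j in range(m)] for i in range(m)]
    bottom = [B[i] + C[i] for i in range(m)]
    return top + bottom


rng = random.Random(0)
m, dt, r = 4, 0.01, 0.5  # illustrative choices
W = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(m)]
Wz = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(m)]
A = [rng.uniform(-2.0, 2.0) for _ in range(m)]  # stand-in for A_{i-1}

J = step_jacobian(W, Wz, A, dt)
print("assumption (8) holds:", eta(W, Wz, dt) <= dt ** r)
print("||J||_inf =", inf_norm(J), "bound (13) =", 1.0 + 3.0 * dt ** r)
```

For weights of this size the bound is far from tight; the point of the sketch is only that both sides of (8) and (13) are cheap to evaluate, so the assumption can be monitored during training.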
As the entire gradient of the loss function (6), with respect to the weights and biases of the network, is bounded above in (9), the exploding gradient problem is mitigated for this RNN.

On the vanishing gradient problem. The vanishing gradient problem (Pascanu et al., 2013) arises if $\frac{\partial E_n^{(k)}}{\partial \theta}$, defined in (10), tends to 0 exponentially fast in $k$, for $k \ll n$ (long-term dependencies). In that case, the RNN does not have long-term memory, as the contribution of the $k$-th hidden state to the error at time step $t_n$ is infinitesimally small. We already see from (14) that $\left\| \frac{\partial X_n}{\partial X_k} \right\|_\infty \approx 1$ (independently of $k$). Thus, we should not expect the products in (10) to decay fast. In fact, we will provide a much more precise characterization of this gradient. To this end, we introduce the following order-notation:

$$ \begin{aligned} \beta &= O(\alpha), \text{ for } \alpha, \beta \in \mathbb{R}_{+}, \text{ if there exist constants } \underline{C}, \overline{C} \text{ such that } \underline{C}\alpha \le \beta \le \overline{C}\alpha, \\ M &= O(\alpha), \text{ for } M \in \mathbb{R}^{d_1 \times d_2},\ \alpha \in \mathbb{R}_{+}, \text{ if there exists a constant } \overline{C} \text{ such that } \|M\| \le \overline{C}\alpha. \end{aligned} \tag{15} $$

For simplicity of notation, we will also set $\bar{y}_n = u_n \equiv 0$ for all $n$, $b = 0$ and $r = 1$ in (8), and we will only consider $\theta = W_{i,j}$ for some $1 \le i, j \le m$ in the following proposition.

Proposition 3.3 Let $y_n$ be the hidden states generated by the RNN (4). Under the assumption that $y_n^{i} = O(\sqrt{t_n})$, for all $1 \le i \le m$, and (8), the gradient for long-term dependencies satisfies

$$ \frac{\partial E_n^{(k)}}{\partial \theta} = O\left( \hat{c}\, \delta\, \Delta t^{\frac{3}{2}} \right) + O\left( \hat{c}\, \delta (1 + \delta)\, \Delta t^{\frac{5}{2}} \right) + O(\Delta t^{3}), \qquad \hat{c} = \operatorname{sech}^2\left( \sqrt{k \Delta t}\,(1 + \Delta t) \right), \qquad k \ll n. \tag{16} $$

This precise bound (16) on the gradient shows that although the gradient can be small, i.e. $O(\Delta t^{\frac{3}{2}})$, it is in fact independent of $k$, ensuring that long-term dependencies contribute to gradients at much later steps and mitigating the vanishing gradient problem. The detailed proof is presented in SM §E.5.

Summarizing, we see that the RNN (3) indeed satisfies bounds similar to those of the underlying ODE (2), which result in upper bounds on the hidden states and their gradients. However, the lower bound on the gradient (16) is due to the specific choice of this discretization and does not appear to have a continuous analogue, making the specific choice of discretization of (2) crucial for mitigating the vanishing gradient problem.
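The energy bound (5) of Proposition 3.1 can also be checked numerically. The sketch below runs the RNN (4) from $y_0 = z_0 = 0$ with random weights and inputs and records $y_n^\top y_n + z_n^\top z_n$ after every step, to be compared with $m t_n$. The function name `energy_trace`, the sizes and the deliberately large weight range are assumptions made for illustration; `Wz` again stands for $\mathcal{W}$.

```python
import math
import random


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def energy_trace(W, Wz, V, b, inputs, dt):
    """Run RNN (4) from y = z = 0 and return y.y + z.z after each step."""
    m = len(b)
    y, z = [0.0] * m, [0.0] * m
    delta = 1.0 / (1.0 + dt)
    energies = []
    for u in inputs:
        A = [p + q + r + s
             for p, q, r, s in zip(matvec(W, y), matvec(Wz, z), matvec(V, u), b)]
        # z_n uses the old y_{n-1}; y_n then uses the new z_n, as in (4).
        z = [delta * (zi + dt * (math.tanh(a) - yi)) for zi, a, yi in zip(z, A, y)]
        y = [yi + dt * zi for yi, zi in zip(y, z)]
        energies.append(sum(v * v for v in y) + sum(v * v for v in z))
    return energies


rng = random.Random(0)
m, d, N = 8, 3, 100  # illustrative sizes
dt = 1.0 / N         # so that t_N = 1


def rand_matrix(rows, cols):
    return [[rng.uniform(-3.0, 3.0) for _ in range(cols)] for _ in range(rows)]


W, Wz, V = rand_matrix(m, m), rand_matrix(m, m), rand_matrix(m, d)
b = [rng.uniform(-3.0, 3.0) for _ in range(m)]
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(N)]

energies = energy_trace(W, Wz, V, b, inputs, dt)
worst_ratio = max(e / (m * (n + 1) * dt) for n, e in enumerate(energies))
print("largest energy / (m * t_n):", worst_ratio)
```

Because the bound only relies on $|\tanh| \le 1$ and $\Delta t < 1$, the printed ratio should stay at or below 1 regardless of how large the random weights are.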

4. EXPERIMENTS

We present results on a variety of learning tasks with coRNN (3) with $\bar{n} = n - 1$, as this version resulted in marginally better performance than the version with $\bar{n} = n$. Details of the training procedure for each experiment can be found in SM §B. We wish to clarify here that we use a straightforward hyperparameter tuning protocol based on a validation set and do not use additional performance-enhancing tools, such as dropout (Srivastava et al., 2014), gradient clipping (Pascanu et al., 2013) or batch normalization (Ioffe & Szegedy, 2015), which might further improve the performance of coRNNs.

Adding problem. We start with the well-known adding problem (Hochreiter & Schmidhuber, 1997), proposed to test the ability of an RNN to learn (very) long-term dependencies (LTDs). The input is a two-dimensional sequence of length $T$, with the first dimension consisting of random numbers drawn from $\mathcal{U}([0, 1])$ and with two non-zero entries (both set to 1) in the second dimension, chosen at random locations, but one in each half of the sequence. The output is the sum of the two numbers of the first dimension at the positions corresponding to the two 1 entries in the second dimension. We compare the proposed coRNN to three recently proposed RNNs, which were explicitly designed to learn LTDs, namely the FastRNN (Kusupati et al., 2018), the antisymmetric (anti.sym.) RNN (Chang et al., 2019) and the expRNN (Lezcano-Casado & Martínez-Rubio, 2019), and to a plain vanilla tanh RNN, with the goal of beating the baseline mean square error (MSE) of 0.167 (the variance of the target, which is the MSE of the constant baseline output 1). All methods have 128 hidden units (dimensionality of the hidden state $y$) and the same training protocol is used in all cases. Fig. 1 shows the results for different lengths $T$ of the input sequences. We can see that while the tanh RNN is not able to beat the baseline for any sequence length, the other methods successfully learn the adding task for $T = 500$.
However, in this case, coRNN converges significantly faster and reaches a lower test MSE than the other tested methods. When setting the length to the much more challenging case of $T = 2000$, we see that only coRNN and the expRNN beat the baseline. However, the expRNN fails to reach a desired test MSE of 0.01 within training time. In order to further demonstrate the superiority of coRNN over recently proposed RNN architectures for learning LTDs, we consider the adding problem for $T = 5000$ and observe that coRNN converges very quickly even in this case, while expRNN fails to consistently beat the baseline. We thus conclude that the coRNN mitigates the vanishing/exploding gradient problem even for very long sequences.

Table 1: Test accuracies on sMNIST and psMNIST (we provide our own psMNIST result for the FastGRNN, as no official result for this task has been published so far).

| Model | sMNIST | psMNIST | # units | # params |
|---|---|---|---|---|
| uRNN (Arjovsky et al., 2016) | 95.1% | 91.4% | 512 | 9k |
| LSTM (Helfrich et al., 2018) | 98.9% | 92.9% | 256 | 270k |
| GRU (Chang et al., 2017) | 99.1% | 94.1% | 256 | 200k |
| anti.sym. RNN (Chang et al., 2019) | [...] | [...] | [...] | [...] |

[...], 1998) digit one pixel at a time, leading to a classification task with a sequence length of $T = 784$. In permuted sequential MNIST (psMNIST), a fixed random permutation is applied in order to increase the time-delay between interdependent pixels and to make the problem harder. In Table 1, we compare the test accuracy for coRNN on sMNIST and psMNIST with recently published best case results for other recurrent models, which were explicitly designed to solve long-term dependencies, together with baselines corresponding to gated and unitary RNNs. To the best of our knowledge, the proposed coRNN outperforms all single-layer recurrent architectures published in the literature, for both sMNIST and psMNIST. Moreover, in Fig. 2, we present the performance (with respect to number of epochs) of different RNN architectures for psMNIST with the same fixed random permutation and the same number of hidden units, i.e. 128. As seen from this figure, coRNN clearly outperforms the other architectures for this permutation, some of which were explicitly designed to learn LTDs.

Noise padded CIFAR-10. Another challenging test problem for learning LTDs is the recently proposed noise padded CIFAR-10 experiment by Chang et al. (2019), in which CIFAR-10 data points (Krizhevsky et al., 2009) are fed to the RNN row-wise and flattened along the channels, resulting in sequences of length 32. To test the long-term memory, entries of uniform random numbers are added such that the resulting sequences have a length of 1000, i.e. the last 968 entries of each sequence are only noise to distract the network. Table 2 shows the result for coRNN together with other recently published best case results. We observe that coRNN readily outperforms other RNN architectures on this benchmark, while requiring only 128 hidden units.

| Model | Test accuracy | # units | # params |
|---|---|---|---|
| [...] (Kag et al., 2020) | 54.5% | 128 | 12k |
| FastRNN (Kag et al., 2020) | 45.8% | 128 | 16k |
| anti.sym. RNN (Chang et al., 2019) | 48.3% | 256 | 36k |
| Gated anti.sym. RNN (Chang et al., 2019) | [...] | [...] | [...] |

Human activity recognition. This experiment is based on the human activity recognition data set provided by Anguita et al. (2012). The data set is a collection of tracked human activities, which were measured by an accelerometer and gyroscope on a Samsung Galaxy S3 smartphone. Six activities were binarized to obtain two merged classes {Sitting, Laying, Walking_Upstairs} and {Standing, Walking, Walking_Downstairs}, leading to the HAR-2 data set, which was first proposed in Kusupati et al. (2018). Table 3 shows the result for coRNN together with other very recently published best case results on the same data set. We can see that coRNN readily outperforms all other methods.
We also ran this experiment on a tiny coRNN with very few parameters, i.e. only 1k. We can see that even in this case, the tiny coRNN beats all baselines. We thus conclude that coRNN can efficiently be used on resource-constrained IoT micro-controllers.

IMDB sentiment analysis. The IMDB data set (Maas et al., 2011) is a collection of 50k movie reviews, where 25k reviews are used for training (with 7.5k of these reviews used for validation) and 25k reviews are used for testing. The aim of this binary sentiment classification task is to decide whether a movie review is positive or negative. We follow the standard procedure by initializing the word embedding with pretrained 100d GloVe (Pennington et al., 2014) vectors and restrict the [...]

| Model | Test accuracy | # units | # params |
|---|---|---|---|
| [...] (Kag et al., 2020) | 93.7% | 64 | 16k |
| FastRNN (Kusupati et al., 2018) | 94.5% | 80 | 7k |
| FastGRNN (Kusupati et al., 2018) | 95.6% | 80 | 7k |
| anti.sym. RNN (Kag et al., 2020) | 93.2% | 120 | 8k |
| incremental RNN (Kag et al., 2020) | [...] | [...] | [...] |

| Model | Test accuracy | # units | # params |
|---|---|---|---|
| [...] (Campos et al., 2018) | 86.6% | 128 | 220k |
| GRU (Campos et al., 2018) | 86.2% | 128 | 164k |
| Skip GRU (Campos et al., 2018) | 86.6% | 128 | 164k |
| ReLU GRU (Dey & Salemt, 2017) | [...] | [...] | [...] |

Further experimental results. To shed further light on the performance of coRNN, we consider the following issues. First, the theory suggests that coRNN mitigates the exploding/vanishing gradient problem as long as the assumptions (8) on the time step $\Delta t$ and weight matrices $W, \mathcal{W}$ hold. Clearly, one can choose a suitable $\Delta t$ to enforce (8) before training, but do these assumptions remain valid during training? In SM §E.4, we argue, based on worst-case estimates, that the assumptions will remain valid for possibly a large number of training steps. More pertinently, we can verify experimentally that (8) holds during training. This is demonstrated in Fig. 3, where we show that (8) holds for all LTD tasks during training. Thus, the presented theory applies and one can expect control over hidden state gradients with coRNN.
Next, we recall that the frequency parameter $\gamma$ and the damping parameter $\epsilon$ play a role for coRNNs (see SM §F for the theoretical dependence and Table 8 for the best performing values of $\epsilon, \gamma$ for each numerical experiment within the range considered in Table 7). How sensitive is the performance of coRNN to the choice of these two parameters? To investigate this dependence, we focus on the noise padded CIFAR-10 experiment and show the results of an ablation study in Fig. 4, where the test accuracy for different coRNNs based on a two-dimensional hyperparameter grid $(\epsilon, \gamma) \in [0.8, 1.8] \times [5.7, 17.7]$ (i.e., sufficiently large intervals around the best performing values of $\epsilon, \gamma$ from Table 8) is plotted. We observe from the figure that although there are reductions in test accuracy for non-optimal values of $(\epsilon, \gamma)$, there is no large variation and the performance is rather robust with respect to these hyperparameters.

Finally, note that we follow standard practice and present best reported results with coRNN as well as other competing RNNs in order to compare the relative performance. However, it is natural to investigate the dependence of these best results on the random initial (before training) values of the weight matrices. To this end, in Table 5 of SM, we report the mean and standard deviation (over 10 retrainings) of the test accuracy with coRNN on various learning tasks and find that the mean value is comparable to the best reported value, with low standard deviations. This indicates further robustness of the performance of coRNNs.
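The adding problem described at the start of this section, and the origin of its baseline MSE of 0.167, can be sketched in a few lines of Python. The sum of two independent $\mathcal{U}([0,1])$ numbers has mean 1 and variance $2 \cdot \frac{1}{12} = \frac{1}{6} \approx 0.167$, which is exactly the MSE of always predicting 1. The function names `adding_sample` and `constant_baseline_mse` and the sample sizes below are assumptions for illustration; the sketch is not the data pipeline used in the experiments.

```python
import random


def adding_sample(T, rng):
    """Return (sequence, target); sequence is a list of (value, marker) pairs."""
    values = [rng.random() for _ in range(T)]
    i = rng.randrange(0, T // 2)   # marked position in the first half
    j = rng.randrange(T // 2, T)   # marked position in the second half
    markers = [0.0] * T
    markers[i] = markers[j] = 1.0
    return list(zip(values, markers)), values[i] + values[j]


def constant_baseline_mse(num_samples, T, rng):
    """Empirical MSE of always predicting 1, the mean of the target."""
    errors = [(adding_sample(T, rng)[1] - 1.0) ** 2 for _ in range(num_samples)]
    return sum(errors) / len(errors)


rng = random.Random(0)
seq, target = adding_sample(10, rng)

exact = 2.0 / 12.0  # Var(U1 + U2) for independent U(0, 1) variables
estimate = constant_baseline_mse(20000, 20, rng)
print("exact baseline MSE:", exact)
print("Monte Carlo estimate:", estimate)
```

A model only demonstrates that it has learned the long-term dependency once its test MSE drops clearly below this constant-prediction baseline.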

5. DISCUSSION

Inspired by many models in physics, biology and engineering, we proposed a novel RNN architecture (3) based on a model (1) of a network of controlled forced and damped oscillators. For this RNN, we rigorously showed that under verifiable hypotheses on the time step and weight matrices, the hidden states are bounded (5) and obtained precise bounds on the gradients (Jacobians) of the hidden states, (9) and (16). Thus by design, this architecture can mitigate the exploding and vanishing gradient problem (EVGP) for RNNs. We present a series of numerical experiments that include sequential image classification, activity recognition and sentiment analysis, to demonstrate that the proposed coRNN keeps hidden states and their gradients under control, while retaining sufficient expressivity to perform complex tasks. Thus, we provide a novel and promising strategy for designing RNN architectures that are motivated by the functioning of natural systems, have rigorous bounds on hidden state gradients and are robust, accurate, straightforward to train and cheap to evaluate.

This work can be extended in different directions. For instance, in this article we have mainly focused on the learning of tasks with long-term dependencies and observed that coRNNs are comparable in performance to the best published results in the literature. Given that coRNNs are built with networks of oscillators, it is natural to expect that they will perform very well on tasks with oscillatory inputs/outputs, such as the time series analysis of high-resolution biomedical data, for instance EEG (electroencephalography) and EMG (electromyography) data, and seismic activity data from geoscience. This will be pursued in a follow-up article. Similarly, applications of coRNN to language modeling will be covered in future work.

However, it is essential to point out that coRNNs might not be suitable for every learning task involving sequential inputs/outputs.
As a concrete example, we consider the problem of predicting time series corresponding to a chaotic dynamical system. We recall that by construction, the underlying ODE (2) (and the discretization (3)) do not allow for super-linear (in time) separation of trajectories for nearby inputs. Thus, we cannot expect coRNNs to be effective at predicting chaotic time series. This is indeed demonstrated for a Lorenz-96 ODE in SM §A, where we observe that the coRNN is outperformed by LSTMs in the chaotic regime.

Our main theoretical focus in this paper was to demonstrate the possible mitigation of the exploding and vanishing gradient problem. On the other hand, we only provided some heuristics and numerical evidence on why the proposed RNN still has sufficient expressivity. A priori, it is natural to think that the proposed RNN architecture might introduce a strong bias towards oscillatory functions. However, as we argue in SM §C, the proposed coRNN can be significantly more expressive, as the damping, forcing and coupling of several oscillators modulate the nonlinear response to yield a very rich and diverse set of output states. This is also evidenced by the ability of coRNNs to deal with many tasks in our numerical experiments which do not have an explicit oscillatory structure. This sets the stage for a rigorous investigation of universality of the proposed coRNN architecture, as in the case of echo state networks in Grigoryeva & Ortega (2018). A possible approach would be to leverage the ability of the proposed RNN to convert general inputs into a rich set of superpositions of harmonics (oscillatory wave forms). Moreover, the proposed RNN was based on the simplest model of coupled oscillators (1). Much more detailed models of oscillators are available, particularly those that arise in the modeling of biological neurons; see Stiefel & Ermentrout (2016) and references therein.
An interesting variant of our proposed RNN would be to base the RNN architecture on these more elaborate models, resulting in analogues of the spiking neurons model of Maass (2001) for RNNs.

Supplementary Material for: Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies

A CHAOTIC TIME-SERIES PREDICTION

According to proposition E.1, coRNN does not exhibit chaotic behavior by design. While this property is highly desirable for learning long-term dependencies (a slight perturbation of the input should not result in an unbounded perturbation of the prediction), it impairs the performance on tasks where the network has to learn actual chaotic dynamics. To test this numerically, we consider the following version of the Lorenz 96 system (Lorenz, 1996):

$$ x_j' = (x_{j+1} - x_{j-2})\, x_{j-1} - x_j + F, \tag{17} $$

where $x_j \in \mathbb{R}$ for all $j = 1, \dots, 5$ and $F$ is an external force controlling the level of chaos in the system. Fig. 5 shows a trajectory of the system (17) plotted on the $x_1 x_2$-plane for a small external force of $F = 0.9$ as well as a trajectory for a large external force of $F = 8$. We can see that while for $F = 0.9$ the system does not exhibit chaotic behavior, the dynamics for $F = 8$ are already highly chaotic. Our task consists of predicting the 25-th next state of a trajectory of the system (17). We provide 128 trajectories of length 2000 for each of the training, validation and test sets. The trajectories are generated by numerically solving the system (17) and evaluating it at 2000 equidistantly distributed discrete time points with distance 0.01. The initial value for each trajectory is chosen uniformly at random on $[F - 1/2, F + 1/2]^5$ around the equilibrium point $(F, \dots, F)$ of the system (17).
Since LSTMs are known to be able to produce chaotic dynamics, even in the autonomous (zero-entry) case (Laurent & von Brecht, 2017), we expect them to perform significantly better than coRNN if the underlying system exhibits strong chaotic behavior. Table 6 shows the normalized root mean square error (NRMSE) (RMSE divided by the root mean square of the target trajectory) on the test set for coRNN and LSTM. We can see that, indeed, for the non-chaotic case with an external force of $F = 0.9$, LSTM and coRNN perform similarly. However, when the dynamics become chaotic (in this case with an external force of $F = 8$), the LSTM clearly outperforms coRNN.
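The data generation for this experiment can be sketched as follows: the Lorenz 96 system (17) with five variables and cyclic indices, an initial value drawn uniformly from $[F - 1/2, F + 1/2]^5$, 2000 points with spacing 0.01, and the NRMSE as defined above. The text does not state which ODE solver was used, so the classical Runge-Kutta (RK4) scheme with one step per sample is an assumption here, as are all function names.

```python
import math
import random


def lorenz96_rhs(x, F):
    """Right-hand side of (17); negative indices wrap around cyclically."""
    n = len(x)
    return [(x[(j + 1) % n] - x[j - 2]) * x[j - 1] - x[j] + F for j in range(n)]


def rk4_step(x, F, h):
    """One classical Runge-Kutta step (solver choice is an assumption)."""
    def shifted(k, s):
        return [xi + s * ki for xi, ki in zip(x, k)]

    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(shifted(k1, h / 2), F)
    k3 = lorenz96_rhs(shifted(k2, h / 2), F)
    k4 = lorenz96_rhs(shifted(k3, h), F)
    return [xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def trajectory(F, rng, length=2000, h=0.01, dim=5):
    """Trajectory started uniformly at random near the equilibrium (F, ..., F)."""
    x = [rng.uniform(F - 0.5, F + 0.5) for _ in range(dim)]
    states = [x]
    for _ in range(length - 1):
        x = rk4_step(x, F, h)
        states.append(x)
    return states


def nrmse(pred, target):
    """RMSE divided by the root mean square of the target trajectory."""
    squared_error = sum((p - t) ** 2
                        for ps, ts in zip(pred, target) for p, t in zip(ps, ts))
    squared_target = sum(t ** 2 for ts in target for t in ts)
    return math.sqrt(squared_error / squared_target)


rng = random.Random(0)
for F in (0.9, 8.0):
    traj = trajectory(F, rng)
    inputs, targets = traj[:-25], traj[25:]  # predict the 25-th next state
    # Persistence baseline: predict that the state does not change.
    print("F =", F, "persistence NRMSE:", nrmse(inputs, targets))
```

The persistence baseline printed at the end is not part of the experiment; it only gives a reference scale against which NRMSE values for a learned model can be read.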

B TRAINING DETAILS

The IMDB task was conducted on an NVIDIA GeForce GTX 1080 Ti GPU, while all other experiments were run on an Intel Xeon E3-1585Lv5 CPU. The weights and biases of coRNN are randomly initialized according to $\mathcal{U}\left(-\frac{1}{\sqrt{n_{in}}}, \frac{1}{\sqrt{n_{in}}}\right)$, where $n_{in}$ denotes the input dimension of each affine transformation. Instead of treating the parameters $\Delta t$, $\gamma$ and $\epsilon$ as fixed hyperparameters, we can also treat them as trainable network parameters by constraining $\Delta t$ to $[0, 1]$ with a sigmoidal activation function and $\epsilon, \gamma > 0$ with, for instance, a ReLU. However, in this case no major difference in performance is obtained.

The hyperparameters are optimized with a random search algorithm, where the results of the best performing coRNN (based on the validation set) are reported. The ranges of the hyperparameters for the random search algorithm are provided in Table 7. Table 8 shows the rounded hyperparameters of the best performing coRNN architecture resulting from the random search algorithm for each learning task. We used 100 training epochs for sMNIST, psMNIST and noise padded CIFAR-10 with additional 20 epochs in which the learning rate was reduced by a factor of 10. Additionally, we used 100 epochs for the IMDB task and 250 epochs for the HAR-2 task.

Table 7: Setting for the hyperparameter optimization of coRNN. Intervals denote ranges of the corresponding hyperparameter for the random search algorithm, while fixed numbers mean that no hyperparameter optimization was done in this case.
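| task | learning rate | batch size | $\Delta t$ | $\gamma$ | $\epsilon$ |
|---|---|---|---|---|---|
| Adding | $2 \times 10^{-2}$ | 50 | $[10^{-2}, 10^{-1}]$ | $[1, 100]$ | $[1, 100]$ |
| sMNIST ($n_{hid} = 128$) | $[10^{-4}, 10^{-1}]$ | 120 | $[10^{-2}, 10^{-1}]$ | $[10^{-1}, 10]$ | $[10^{-1}, 10]$ |
| sMNIST ($n_{hid} = 256$) | $[10^{-4}, 10^{-1}]$ | 120 | $[10^{-2}, 10^{-1}]$ | $[10^{-1}, 10]$ | $[10^{-1}, 10]$ |
| psMNIST ($n_{hid} = 128$) | $[10^{-4}, 10^{-1}]$ | 120 | $[10^{-2}, 10^{-1}]$ | $[10^{-1}, 10]$ | $[10^{-1}, 10]$ |
| psMNIST ($n_{hid} = 256$) | $[10^{-4}, 10^{-1}]$ | 120 | $[10^{-2}, 10^{-1}]$ | $[10^{-1}, 10]$ | $[10^{-1}, 10]$ |
| Noise padded CIFAR-10 | $[10^{-4}, 10^{-1}]$ | 100 | $[10^{-2}, 10^{-1}]$ | $[1, 100]$ | $[1, 100]$ |
| HAR-2 | $[10^{-4}, 10^{-1}]$ | 64 | $[10^{-2}, 10^{-1}]$ | $[10^{-1}, 10]$ | $[10^{-1}, 10]$ |
| IMDB | $[10^{-4}, 10^{-1}]$ | 64 | $[10^{-2}, 10^{-1}]$ | $[10^{-1}, 10]$ | $[10^{-1}, 10]$ |

Table 8 (rounded hyperparameters of the best performing coRNN):

| task | learning rate | batch size | $\Delta t$ | $\gamma$ | $\epsilon$ |
|---|---|---|---|---|---|
| … | $5.4 \times 10^{-3}$ | 120 | $7.6 \times 10^{-2}$ | $4 \times 10^{-1}$ | 8.0 |
| Noise padded CIFAR-10 | $7.5 \times 10^{-3}$ | 100 | $3.4 \times 10^{-2}$ | 1.3 | 12.7 |
| HAR-2 | $1.7 \times 10^{-2}$ | 64 | $10^{-1}$ | $2 \times 10^{-1}$ | 6.4 |
| IMDB | $6.0 \times 10^{-4}$ | 64 | $5.4 \times 10^{-2}$ | 4.9 | 4.8 |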
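The initialization and the optional parameter constraints described in this section can be sketched in a few lines. The function names and the use of unconstrained "raw" parameters are illustrative assumptions, not the authors' code. Note that a plain ReLU only guarantees non-negative (not strictly positive) values of $\gamma$ and $\epsilon$.

```python
import numpy as np

def init_affine(n_in, n_out, rng):
    """Draw weights and biases from U(-1/sqrt(n_in), 1/sqrt(n_in))."""
    bound = 1.0 / np.sqrt(n_in)
    weight = rng.uniform(-bound, bound, size=(n_out, n_in))
    bias = rng.uniform(-bound, bound, size=n_out)
    return weight, bias

def constrain_parameters(raw_dt, raw_gamma, raw_eps):
    """Map unconstrained trainable scalars to admissible values.

    dt is squashed into (0, 1) by a sigmoid; gamma and eps are passed
    through a ReLU (hypothetical parametrization for illustration).
    """
    dt = 1.0 / (1.0 + np.exp(-raw_dt))
    gamma = np.maximum(raw_gamma, 0.0)
    eps = np.maximum(raw_eps, 0.0)
    return dt, gamma, eps

rng = np.random.default_rng(0)
weight, bias = init_affine(n_in=16, n_out=8, rng=rng)
dt, gamma, eps = constrain_parameters(raw_dt=-3.0, raw_gamma=2.0, raw_eps=-1.0)
```

In an automatic differentiation framework, the raw scalars would be the trainable quantities and the constrained values would be recomputed in every forward pass.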

C HEURISTICS OF NETWORK FUNCTION

At the level of a single neuron, the dynamics of the RNN is relatively straightforward. We start with the scalar case, i.e. $m = d = 1$, and illustrate different hidden states $y$ as a function of time, for different input signals, in Fig. 6. In this figure, we consider two different input signals: an oscillatory signal given by $u(t) = \cos(4t)$ and a combination of step functions. First, we plot the solution $y(t)$ of (1) with the parameters $V, b, W, \mathcal{W}, \epsilon = 0$ and $\gamma = 1$. This corresponds to a simple harmonic oscillator (SHO), and the solution is a sine wave with the natural frequency of the oscillator. Next, we introduce forcing by the input signal by setting $V = 1$ and choosing the identity $\sigma(x) = x$ as activation function, leading to a forced harmonic oscillator (FHO). As seen from Fig. 6, in the case of an oscillatory signal, this leads to a very minor change over the SHO, whereas for the step function, the change is only in the amplitude of the wave. Next, we add damping by setting $\epsilon = 0.25$ and see that the resulting forced damped oscillator (FDO) merely damps the amplitude of the waves, without changing their frequency. Then, we consider the case of a controlled oscillator (CFDO) by setting $W = -2$, $V = 2$, $b = 0.25$, $\mathcal{W} = 0.75$. As seen from Fig. 6, this leads to a significant change in the wave form in both cases. For the oscillatory input, the output is now a superposition of many different forms, with different amplitudes and frequencies (phases), whereas for the step function input, the phase is shifted. Already for a linear controlled oscillator, the output can thus be very complicated, with a superposition of different waves. This also holds true when the activation function is set to $\sigma(x) = \tanh(x)$ (which is our proposed coRNN). For both inputs, the output is a modulated version of the one generated by CFDO, expressed as a superposition of waves.
On the other hand, we also plot the solution with a Duffing type oscillator (DUFF) by setting the activation function as,

$$\sigma(x) = x - \frac{x^3}{3}.$$

In this case, the solution is very different from the CFDO and coRNN solutions and is heavily damped (either in the output or its derivative). Moreover, given the chaotic nature of the dynamical system in this case, a slight change in the parameters led to the output blowing up. Thus, a bounded nonlinearity seems essential in this context.

Coupling neurons together further accentuates this generation of superpositions of different waveforms, as seen even with the simplest case of a network with two neurons, shown in Fig. 6 (Bottom row). For this figure, we consider two neurons, i.e. $m = 2$, and two different network topologies. For the first, we only allow the first neuron to influence the second one and not vice versa. This is enforced with the weight matrices,

$$W = \begin{pmatrix} -2 & 0 \\ 3 & -2 \end{pmatrix}, \qquad \mathcal{W} = \begin{pmatrix} 0.75 & 0 \\ -1 & 0.75 \end{pmatrix}.$$

We also set $V = [2, 2]$, $b = [0.25, 0.25]$. Note that in this case, which we name ORD (for ordered connections), the output of the first neuron should be exactly the same as in the uncoupled (UC) case, whereas there is a distinct change in the output of the second neuron: the first neuron has induced a sharp change in the resulting output wave form. This is well illustrated by the emergence of an approximation to the step function (Bottom Right of Fig. 6), even though the input signal is oscillatory. Next, we consider the case of fully connected (FC) neurons by setting the weight matrices as,

$$W = \begin{pmatrix} -2 & 1 \\ 3 & -2 \end{pmatrix}, \qquad \mathcal{W} = \begin{pmatrix} 0.75 & 0.3 \\ -1 & 0.75 \end{pmatrix}.$$

The resulting outputs for the first neuron are now slightly different from the uncoupled case. On the other hand, the approximation of the step function output for the second neuron is further accentuated. Even these simple examples illustrate the functioning of a network of controlled oscillators well.
The input signal is converted into a superposition of waves with different frequencies and amplitudes, with these quantities being controlled by the weights and biases in (1). Thus, very complicated outputs can be generated by modulating the number, frequencies and amplitudes of the waves. In practice, a network of a large number of neurons is used and can lead to extremely rich global dynamics, along the lines of emergence of synchronization or bistable heterogeneous behavior seen in systems of idealized oscillators and explained by their mean field limit, see H. Sakaguchi & Kuramoto (1987) ; Winfree (1967) ; Strogatz (2001) . Thus, we argue that the ability of the network of (forced, driven) oscillators to access a very rich set of output states can lead to high expressivity of the system. The training process selects the weights that modulate frequencies, phases and amplitudes of individual neurons and their interaction to guide the system to its target output.
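The two-neuron experiment with ordered connections described in this section can be reproduced qualitatively with a short simulation. The sketch below rests on several assumptions that are not stated in the text: the time stepping (a semi-implicit Euler scheme with step `h`), zero initial data, $\gamma = 1$, $\epsilon = 0.25$, and diagonal weight matrices for the uncoupled case. It checks the claim that, for ordered connections, the first neuron behaves exactly as in the uncoupled case.

```python
import numpy as np

def simulate_network(W, W_cal, V, b, gamma=1.0, eps=0.25, t_end=10.0, h=1e-3):
    """Integrate y' = z, z' = tanh(W y + W_cal z + V u + b) - gamma y - eps z
    for the oscillatory input u(t) = cos(4t), starting from y = z = 0."""
    m = len(b)
    y, z = np.zeros(m), np.zeros(m)
    n_steps = int(round(t_end / h))
    ys = np.empty((n_steps + 1, m))
    ys[0] = y
    for n in range(n_steps):
        u = np.cos(4.0 * n * h)
        # Update the velocity first, then the position (semi-implicit Euler).
        z = z + h * (np.tanh(W @ y + W_cal @ z + V * u + b) - gamma * y - eps * z)
        y = y + h * z
        ys[n + 1] = y
    return ys

V = np.array([2.0, 2.0])
b = np.array([0.25, 0.25])

# Uncoupled (UC): only the diagonal entries are kept (assumption).
y_uc = simulate_network(np.diag([-2.0, -2.0]), np.diag([0.75, 0.75]), V, b)

# Ordered connections (ORD): the first neuron drives the second one only.
W_ord = np.array([[-2.0, 0.0], [3.0, -2.0]])
W_cal_ord = np.array([[0.75, 0.0], [-1.0, 0.75]])
y_ord = simulate_network(W_ord, W_cal_ord, V, b)

first_neuron_unchanged = np.allclose(y_ord[:, 0], y_uc[:, 0])
second_neuron_changed = not np.allclose(y_ord[:, 1], y_uc[:, 1])
```

Replacing the zero off-diagonal entries by the fully connected matrices given above makes `first_neuron_unchanged` fail as well, in line with the discussion of the FC case.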

D BOUNDS ON THE DYNAMICS OF THE ORDINARY DIFFERENTIAL EQUATION (1)

In this section, we present bounds that show how the continuous time dynamics of the ordinary differential equation (2), modeling non-linear damped and forced networks of oscillators, is constrained. We start with the following estimate on the energy of the solutions of the system (2).

Proposition D.1 Let $y(t), z(t)$ be the solutions of the ODE system (2) at any time $t \in [0, T]$ and assume that the damping parameter $\epsilon \geq \frac{1}{2}$ and the initial data for (2) is given by $y(0) = z(0) \equiv 0$. Then, the solutions are bounded as,

$$y(t)^\top y(t) \leq \frac{mt}{\gamma}, \qquad z(t)^\top z(t) \leq mt, \qquad \forall t \in (0, T]. \qquad (19)$$

To prove this proposition, we multiply the first equation in (2) with $y(t)^\top$ and the second equation in (2) with $\frac{1}{\gamma} z(t)^\top$ to obtain,

$$\frac{d}{dt}\left(\frac{y(t)^\top y(t)}{2} + \frac{z(t)^\top z(t)}{2\gamma}\right) = \frac{z(t)^\top \sigma(A(t))}{\gamma} - \frac{\epsilon}{\gamma}\, z(t)^\top z(t), \qquad (20)$$

with $A(t) = W y(t) + \mathcal{W} z(t) + V u(t) + b$. Using the elementary Cauchy's inequality repeatedly in (20) results in,

$$\frac{d}{dt}\left(\frac{y(t)^\top y(t)}{2} + \frac{z(t)^\top z(t)}{2\gamma}\right) \leq \frac{\sigma(A)^\top \sigma(A)}{2\gamma} + \frac{1}{\gamma}\left(\frac{1}{2} - \epsilon\right) z^\top z \leq \frac{m}{2\gamma} \qquad \left(\text{as } |\sigma| \leq 1 \text{ and } \epsilon \geq \tfrac{1}{2}\right).$$

Integrating the above inequality over the time interval $[0, t]$ and using the fact that the initial data are $y(0) = z(0) \equiv 0$, we obtain the bounds (19).

The above proposition and estimate (19) clearly demonstrate that the dynamics of the network of coupled non-linear oscillators (1) is bounded. The fact that the nonlinear activation function $\sigma = \tanh$ is uniformly bounded in its arguments played a crucial role in deriving the energy bound (19). A straightforward adaptation of this argument leads to the following proposition about the sensitivity of the system to inputs.

Proposition D.2 Let $y(t), z(t)$ be the solutions of the ODE system (2) with respect to the input signal $u(t)$. Let $\bar{y}(t), \bar{z}(t)$ be the solutions of the ODE system (2), but with respect to the input signal $\bar{u}(t)$. Assume that the damping parameter $\epsilon \geq \frac{1}{2}$ and the initial data are given by $y(0) = z(0) = \bar{y}(0) = \bar{z}(0) \equiv 0$. Then we have the following bound,

$$(y(t) - \bar{y}(t))^\top (y(t) - \bar{y}(t)) \leq \frac{4mt}{\gamma}, \qquad (z(t) - \bar{z}(t))^\top (z(t) - \bar{z}(t)) \leq 4mt, \qquad \forall t \in (0, T]. \qquad (21)$$
Thus, from the bound (21), there can be at most linear separation (in time) of the trajectories of the ODE (2) for different input signals. Hence, chaotic behavior, which is characterized by the (super-)exponential separation of trajectories, is ruled out by the structure of the ODE system (2). Note that this property of the ODE system is primarily a result of the uniform boundedness of the activation function $\sigma$. Using an unbounded activation function such as ReLU might permit an exponential separation of trajectories, which is a prerequisite for a chaotic dynamical system.
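The bounds of propositions D.1 and D.2 can be checked numerically on a small random network. The following sketch assumes that the ODE system (2) has the first-order form $y' = z$, $z' = \tanh(Wy + \mathcal{W}z + Vu + b) - \gamma y - \epsilon z$, which is how it enters the proof above; the RK4 integrator, the random weights and the two input signals are illustrative choices.

```python
import numpy as np

def solve_oscillators(params, u_fn, gamma, eps, t_end=5.0, h=1e-3):
    """RK4 integration of y' = z, z' = tanh(W y + W_cal z + V u + b) - gamma y - eps z."""
    W, W_cal, V, b = params
    m = len(b)

    def rhs(t, s):
        y, z = s[:m], s[m:]
        dz = np.tanh(W @ y + W_cal @ z + V @ u_fn(t) + b) - gamma * y - eps * z
        return np.concatenate([z, dz])

    n_steps = int(round(t_end / h))
    s = np.zeros(2 * m)                      # zero initial data y(0) = z(0) = 0
    states = np.empty((n_steps, 2 * m))
    for n in range(n_steps):
        t = n * h
        k1 = rhs(t, s)
        k2 = rhs(t + 0.5 * h, s + 0.5 * h * k1)
        k3 = rhs(t + 0.5 * h, s + 0.5 * h * k2)
        k4 = rhs(t + h, s + h * k3)
        s = s + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        states[n] = s
    times = h * np.arange(1, n_steps + 1)
    return times, states[:, :m], states[:, m:]

rng = np.random.default_rng(0)
m, d, gamma, eps = 4, 1, 2.0, 0.5            # eps >= 1/2 as required
params = (rng.normal(size=(m, m)), rng.normal(size=(m, m)),
          rng.normal(size=(m, d)), rng.normal(size=m))

u = lambda t: np.array([np.cos(4.0 * t)])                      # oscillatory input
u_bar = lambda t: np.array([1.0 if int(t) % 2 == 0 else -1.0])  # step input

t, y, z = solve_oscillators(params, u, gamma, eps)
_, y_bar, z_bar = solve_oscillators(params, u_bar, gamma, eps)

tol = 1e-9  # slack for floating point and discretization error
energy_ok = (np.all((y ** 2).sum(1) <= m * t / gamma + tol)
             and np.all((z ** 2).sum(1) <= m * t + tol))
separation_ok = (np.all(((y - y_bar) ** 2).sum(1) <= 4 * m * t / gamma + tol)
                 and np.all(((z - z_bar) ** 2).sum(1) <= 4 * m * t + tol))
```

Both flags come out `True` for any choice of weights, since the proofs only use $|\tanh| \leq 1$ and $\epsilon \geq \frac{1}{2}$.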

D.1 GRADIENT DYNAMICS FOR THE ODE SYSTEM (2)

Let $\theta$ denote the $(i, j)$-th entry of the weight matrices $W$, $\mathcal{W}$, $V$ or the $i$-th entry of the bias vector $b$. We are interested in finding out how the gradients of the hidden state $y$ (and the auxiliary hidden state $z$) with respect to the parameter $\theta$ vary with time. Note that these gradients are precisely the objects of interest in the training of an RNN based on a discretization of the ODE system (2). To this end, we differentiate (2) with respect to the parameter $\theta$ and denote

$$y_\theta(t) = \frac{\partial y}{\partial \theta}(t), \qquad z_\theta(t) = \frac{\partial z}{\partial \theta}(t),$$

to obtain,

$$y_\theta' = z_\theta, \qquad z_\theta' = \mathrm{diag}(\sigma'(A))\left[W y_\theta + \mathcal{W} z_\theta\right] + Z^{i,j}_{m,\bar{m}}(A)\,\rho - \gamma y_\theta - \epsilon z_\theta. \qquad (22)$$

As introduced before, $Z^{i,j}_{m,\bar{m}}(A) \in \mathbb{R}^{m \times \bar{m}}$ is a matrix whose elements are all zero except for the $(i, j)$-th entry, which is set to $(\sigma'(A(t)))_i$, i.e. the $i$-th entry of $\sigma'(A)$, and we have,

$$\begin{aligned}
\rho &= y, & \bar{m} &= m, & &\text{if } \theta = W_{i,j}, \\
\rho &= z, & \bar{m} &= m, & &\text{if } \theta = \mathcal{W}_{i,j}, \\
\rho &= u, & \bar{m} &= d, & &\text{if } \theta = V_{i,j}, \\
\rho &= 1, & \bar{m} &= 1, & &\text{if } \theta = b_i.
\end{aligned}$$

We see from (22) that the ODEs governing the gradients with respect to the parameter $\theta$ also represent a system of oscillators, but with additional coupling and forcing terms proportional to the hidden states $y$, $z$ or the input signal $u$. As we have already proved with estimate (19) that the hidden states are always bounded, and the input signal is assumed to be bounded, it is natural to expect that the gradients of the states with respect to $\theta$ are also bounded. We make this statement explicit in the following proposition, where, for simplicity of exposition, we consider the case of $\theta = W_{i,j}$, as the other values of $\theta$ are very similar in their behavior.

Proposition D.3 Let $\theta = W_{i,j}$ and $y, z$ be the solutions of the ODE system (2). Assume that the weights and the damping parameter satisfy $\|W\|_\infty + \|\mathcal{W}\|_\infty \leq \epsilon$. Then we have the following bounds on the gradients,

$$y_\theta(t)^\top y_\theta(t) + \frac{1}{\gamma} z_\theta(t)^\top z_\theta(t) \leq \left[y_\theta(0)^\top y_\theta(0) + \frac{1}{\gamma} z_\theta(0)^\top z_\theta(0)\right] e^{Ct} + \frac{mt^2}{2\gamma^2}, \quad t \in (0, T], \qquad C = \max\left\{\frac{\|W\|_1}{\gamma},\, 1 + \|\mathcal{W}\|_1\right\}. \qquad (23)$$

E.3 PROOF OF PROPOSITION 3.2

From (6), we readily calculate that,

$$\frac{\partial E_n}{\partial X_n} = \left[y_n - \bar{y}_n,\, 0\right]. \qquad (27)$$
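Similarly, from (3), we calculate,

$$\frac{\partial^+ X_k}{\partial \theta} = \begin{cases}
\left[\frac{\Delta t^2}{1+\Delta t} Z^{i,j}_{m,m}(A_{k-1})\, y_{k-1},\ \frac{\Delta t}{1+\Delta t} Z^{i,j}_{m,m}(A_{k-1})\, y_{k-1}\right] & \text{if } \theta = (i,j)\text{-th entry of } W, \\[4pt]
\left[\frac{\Delta t^2}{1+\Delta t} Z^{i,j}_{m,m}(A_{k-1})\, z_{k-1},\ \frac{\Delta t}{1+\Delta t} Z^{i,j}_{m,m}(A_{k-1})\, z_{k-1}\right] & \text{if } \theta = (i,j)\text{-th entry of } \mathcal{W}, \\[4pt]
\left[\frac{\Delta t^2}{1+\Delta t} Z^{i,j}_{m,d}(A_{k-1})\, u_k,\ \frac{\Delta t}{1+\Delta t} Z^{i,j}_{m,d}(A_{k-1})\, u_k\right] & \text{if } \theta = (i,j)\text{-th entry of } V, \\[4pt]
\left[\frac{\Delta t^2}{1+\Delta t} Z^{i,1}_{m,1}(A_{k-1}),\ \frac{\Delta t}{1+\Delta t} Z^{i,1}_{m,1}(A_{k-1})\right] & \text{if } \theta = i\text{-th entry of } b,
\end{cases} \qquad (28)$$

where $Z^{i,j}_{m,\bar{m}}(A_{k-1}) \in \mathbb{R}^{m \times \bar{m}}$ is a matrix whose elements are all zero except for the $(i, j)$-th entry, which is set to $(\sigma'(A_{k-1}))_i$, i.e. the $i$-th entry of $\sigma'(A_{k-1})$. We easily see that $\|Z^{i,j}_{m,\bar{m}}(A_{k-1})\|_\infty \leq 1$ for all $i, j, m, \bar{m}$ and all choices of $A_{k-1}$. Now, using definitions of matrix and vector norms and applying (14) in (10), together with (27) and (28), we obtain the following estimate on the norm:

$$\left|\frac{\partial E^{(k)}_n}{\partial \theta}\right| \leq \begin{cases}
(\|y_n\|_\infty + \|\bar{y}_n\|_\infty)(1 + 3(n-k)\Delta t^r)\,\delta \Delta t\, \|y_{k-1}\|_\infty, & \text{if } \theta \text{ is an entry of } W, \\
(\|y_n\|_\infty + \|\bar{y}_n\|_\infty)(1 + 3(n-k)\Delta t^r)\,\delta \Delta t\, \|z_{k-1}\|_\infty, & \text{if } \theta \text{ is an entry of } \mathcal{W}, \\
(\|y_n\|_\infty + \|\bar{y}_n\|_\infty)(1 + 3(n-k)\Delta t^r)\,\delta \Delta t\, \|u_k\|_\infty, & \text{if } \theta \text{ is an entry of } V, \\
(\|y_n\|_\infty + \|\bar{y}_n\|_\infty)(1 + 3(n-k)\Delta t^r)\,\delta \Delta t, & \text{if } \theta \text{ is an entry of } b.
\end{cases}$$

We will estimate the above term just for the case where $\theta$ is an entry of $W$; the remaining cases are estimated in a very similar way. For simplicity of notation, we let $k - 1 \approx k$ and aim to estimate the term,

$$\begin{aligned}
\left|\frac{\partial E^{(k)}_n}{\partial \theta}\right| &\leq \|y_n\|_\infty \|y_k\|_\infty (1 + 3(n-k)\Delta t^r)\,\delta \Delta t + \|\bar{y}_n\|_\infty \|y_k\|_\infty (1 + 3(n-k)\Delta t^r)\,\delta \Delta t \\
&\leq m\sqrt{nk}\,\Delta t\,(1 + 3(n-k)\Delta t^r)\,\delta \Delta t + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{k}\sqrt{\Delta t}\,(1 + 3(n-k)\Delta t^r)\,\delta \Delta t \quad (\text{by (5)}) \\
&\leq m\sqrt{nk}\,\delta \Delta t^2 + 3m\sqrt{nk}\,(n-k)\,\delta \Delta t^{r+2} + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{k}\sqrt{\Delta t}\,(1 + 3(n-k)\Delta t^r)\,\delta \Delta t.
\end{aligned} \qquad (30)$$

To further analyze the above estimate, we recall that $n\Delta t = t_n \leq 1$ and consider two different regimes. Let us start by considering short-term dependencies by letting $k \approx n$, i.e. $n - k = c$ with a constant $c \sim O(1)$, independent of $n, k$.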
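In this case, a straightforward application of the above assumptions in the bound (30) yields,

$$\begin{aligned}
\left|\frac{\partial E^{(k)}_n}{\partial \theta}\right| &\leq m\sqrt{nk}\,\delta \Delta t^2 + 3m\sqrt{nk}\,(n-k)\,\delta \Delta t^{r+2} + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{t_n}\,\delta \Delta t + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{t_n}\,c\,\delta \Delta t^{r+1} \\
&\leq m t_n \delta \Delta t + m c\, t_n \delta \Delta t^{r+1} + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{t_n}\,\delta \Delta t + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{t_n}\,c\,\delta \Delta t^{r+1} \\
&\leq t_n m \delta \Delta t + \|\bar{y}_n\|_\infty \sqrt{m}\sqrt{t_n}\,\delta \Delta t \quad (\text{for } \Delta t \ll 1 \text{ as } r \geq 1/2) \\
&\leq m \delta \Delta t + \|\bar{y}_n\|_\infty \sqrt{m}\,\delta \Delta t.
\end{aligned}$$

Next, we consider long-term dependencies by setting $k \ll n$ and estimating,

$$\begin{aligned}
\left|\frac{\partial E^{(k)}_n}{\partial \theta}\right| &\leq m\sqrt{nk}\,\delta \Delta t^2 + 3m\sqrt{nk}\,(n-k)\,\delta \Delta t^{r+2} + \|\bar{y}_n\|_\infty \sqrt{m}\,\delta \Delta t^{\frac{3}{2}} + 3\|\bar{y}_n\|_\infty \sqrt{m}\, n\, \delta \Delta t^{r+\frac{3}{2}} \\
&\leq m\sqrt{t_n}\,\delta \Delta t^{\frac{3}{2}} + 3m\, t_n^{\frac{3}{2}}\,\delta \Delta t^{r+\frac{1}{2}} + \|\bar{y}_n\|_\infty \sqrt{m}\,\delta \Delta t^{\frac{3}{2}} + 3\|\bar{y}_n\|_\infty \sqrt{m}\, t_n\, \delta \Delta t^{r+\frac{1}{2}} \\
&\leq m \delta \Delta t^{\frac{3}{2}} + 3m \delta \Delta t^{r+\frac{1}{2}} + \|\bar{y}_n\|_\infty \sqrt{m}\,\delta \Delta t^{\frac{3}{2}} + 3\|\bar{y}_n\|_\infty \sqrt{m}\,\delta \Delta t^{r+\frac{1}{2}} \quad (\text{as } t_n < 1) \\
&\leq 3m \delta \Delta t^{r+\frac{1}{2}} + 3\|\bar{y}_n\|_\infty \sqrt{m}\,\delta \Delta t^{r+\frac{1}{2}} \quad (\text{as } r \leq 1 \text{ and } \Delta t \ll 1).
\end{aligned} \qquad (32)$$

Thus, in all cases, we have that,

$$\left|\frac{\partial E^{(k)}_n}{\partial \theta}\right| \leq 3\delta \Delta t \left(m + \sqrt{m}\,\|\bar{y}_n\|_\infty\right) \quad (\text{as } r \geq 1/2).$$

Applying the above estimate in (10) allows us to bound the gradient by,

$$\left|\frac{\partial E_n}{\partial \theta}\right| \leq \sum_{1 \leq k \leq n} \left|\frac{\partial E^{(k)}_n}{\partial \theta}\right| \leq 3\delta\, t_n \left(m + \sqrt{m}\,\|\bar{y}_n\|_\infty\right).$$

Therefore, the gradient of the loss function (6) can be bounded as,

$$\begin{aligned}
\left|\frac{\partial E}{\partial \theta}\right| &\leq \frac{1}{N}\sum_{n=1}^{N} \left|\frac{\partial E_n}{\partial \theta}\right| \leq 3\delta \left[\frac{m \Delta t}{N}\sum_{n=1}^{N} n + \frac{\sqrt{m}\,\Delta t}{N}\sum_{n=1}^{N} \|\bar{y}_n\|_\infty\, n\right] \\
&\leq 3\delta \left[\frac{m \Delta t}{N}\sum_{n=1}^{N} n + \frac{\sqrt{m}\,\bar{Y} \Delta t}{N}\sum_{n=1}^{N} n\right] \\
&\leq \frac{3}{2}\,\delta (N+1)\Delta t \left(m + \bar{Y}\sqrt{m}\right) \leq \frac{3}{2}\,\delta (t_N + \Delta t)\left(m + \bar{Y}\sqrt{m}\right) \\
&\leq \frac{3}{2}\,\delta (1 + \Delta t)\left(m + \bar{Y}\sqrt{m}\right) \quad (\text{as } t_N = 1) \\
&\leq \frac{3}{2}\left(m + \bar{Y}\sqrt{m}\right),
\end{aligned}$$

which is the desired estimate (9).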
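The last steps of the estimate above rest on two elementary facts, written out here for convenience. The identification $\delta = \frac{1}{1+\Delta t}$ is an assumption read off from the prefactors $\frac{\Delta t}{1+\Delta t}$ in the expression for $\frac{\partial^+ X_k}{\partial \theta}$; its definition is not restated in this section.

```latex
% Arithmetic sum used to pass from the sum over n to (N+1)\Delta t:
\frac{\Delta t}{N}\sum_{n=1}^{N} n
  = \frac{\Delta t}{N}\cdot\frac{N(N+1)}{2}
  = \frac{(N+1)\Delta t}{2}
  = \frac{t_N + \Delta t}{2},
  \qquad t_N = N\Delta t .

% Final step, assuming \delta = 1/(1+\Delta t):
\frac{3}{2}\,\delta\,(1+\Delta t)\left(m + \bar{Y}\sqrt{m}\right)
  = \frac{3}{2}\left(m + \bar{Y}\sqrt{m}\right).
```

The factor $\frac{1}{2}$ from the arithmetic sum is what turns the prefactor $3$ of the per-step estimate into the $\frac{3}{2}$ of the final bound.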

E.4 ON THE ASSUMPTION (8) AND TRAINING

Note that all the estimates were based on the fact that we were able to choose a time step $\Delta t$ in (3) that enforces the condition (8). For any fixed weights $W, \mathcal{W}$, we can indeed choose such a value of $\Delta t$ to satisfy (8). However, we train the RNN to find the weights that minimize the loss function (6). Can we find a hyperparameter $\Delta t$ such that (8) is satisfied at every step of the stochastic gradient descent method for training? To investigate this issue, we consider a simple gradient descent method of the form:

$$\theta_{\ell+1} = \theta_\ell - \zeta\, \frac{\partial E}{\partial \theta}(\theta_\ell). \qquad (36)$$

Note that $\zeta$ is the constant (non-adapted) learning rate. We assume for simplicity that $\theta_0 = 0$ (other choices lead to the addition of a constant). Then, using the gradient bound (9) at every step, a straightforward estimate on the weight is given by,

$$|\theta_\ell| \leq \ell\, \zeta\, \frac{3}{2}\left(m + \bar{Y}\sqrt{m}\right). \qquad (37)$$

In order to calculate the minimum number of steps $L$ in the gradient descent method (36) such that the condition (8) is satisfied, we set $\ell = L$ in (37), and applying it to the condition (8) leads to the straightforward estimate,

$$L \geq \frac{1}{\zeta\, \frac{3}{2}\left(m + \bar{Y}\sqrt{m}\right) m\, \Delta t^{1-r}\, \delta}. \qquad (38)$$

Note that the parameter $\delta < 1$, while in general, the learning rate $\zeta \ll 1$. Thus, as long as $r \leq 1$, we see that the assumption (8) holds for a large number of steps of the gradient descent method. We remark that the above estimate (38) is a large underestimate on $L$. In the experiments presented in this article, we are able to take a very large number of training steps, while the gradients remain within a range (see Fig. 3).
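The weight estimate used in this argument follows by unrolling the gradient descent iteration. The short derivation below is a sketch under the assumptions that $\theta_0 = 0$ and that the gradient bound (9) holds at every iterate, which in turn presupposes that (8) is satisfied along the way.

```latex
% Unrolling \theta_{q+1} = \theta_q - \zeta\,\partial_\theta E(\theta_q) from \theta_0 = 0:
\theta_\ell = -\zeta \sum_{q=0}^{\ell-1} \frac{\partial E}{\partial \theta}(\theta_q)
\quad\Longrightarrow\quad
|\theta_\ell|
  \le \zeta \sum_{q=0}^{\ell-1} \left|\frac{\partial E}{\partial \theta}(\theta_q)\right|
  \le \ell\,\zeta\,\frac{3}{2}\left(m + \bar{Y}\sqrt{m}\right).
```

Each weight can therefore grow at most linearly in the number of gradient steps, with a slope proportional to the learning rate $\zeta$.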

E.5 PROOF OF PROPOSITION 3.3

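We start with the following decomposition of the recurrent matrices:

$$\frac{\partial X_i}{\partial X_{i-1}} = M_{i-1} + \Delta t\, \tilde{M}_{i-1}, \qquad M_{i-1} := \begin{pmatrix} I & \Delta t\, C_{i-1} \\ B_{i-1} & C_{i-1} \end{pmatrix}, \qquad \tilde{M}_{i-1} := \begin{pmatrix} B_{i-1} & 0 \\ 0 & 0 \end{pmatrix},$$

with $B, C$ defined in (12). By the assumption (8), one can readily check that $\|\tilde{M}_{i-1}\|_\infty \leq \Delta t$, for all $k \leq i \leq n - 1$.

We will use an induction argument to show the following representation formula for the product of Jacobians,

$$\frac{\partial X_n}{\partial X_k} = \prod_{k < i \leq n} \frac{\partial X_i}{\partial X_{i-1}} = \begin{pmatrix} I & \Delta t \sum\limits_{j=k}^{n-1} \prod\limits_{i=j}^{k} C_i \\ B_{n-1} + \sum\limits_{j=n-2}^{k} \left(\prod\limits_{i=n-1}^{j+1} C_i\right) B_j & \prod\limits_{i=n-1}^{k} C_i \end{pmatrix} + O(\Delta t). \qquad (39)$$

We start with the outermost product and calculate,

$$\frac{\partial X_n}{\partial X_{n-1}} \frac{\partial X_{n-1}}{\partial X_{n-2}} = \left(M_{n-1} + \Delta t\, \tilde{M}_{n-1}\right)\left(M_{n-2} + \Delta t\, \tilde{M}_{n-2}\right) = M_{n-1} M_{n-2} + \Delta t \left(\tilde{M}_{n-1} M_{n-2} + M_{n-1} \tilde{M}_{n-2}\right) + O(\Delta t^2).$$

By direct multiplication, we obtain,

$$M_{n-1} M_{n-2} = \begin{pmatrix} I & \Delta t\, (C_{n-2} + C_{n-1} C_{n-2}) \\ B_{n-1} + C_{n-1} B_{n-2} & C_{n-1} C_{n-2} \end{pmatrix} + \Delta t \begin{pmatrix} C_{n-1} B_{n-2} & 0 \\ 0 & B_{n-1} C_{n-2} \end{pmatrix}.$$

Using the definitions in (12) and (8), we can easily see that

$$\begin{pmatrix} C_{n-1} B_{n-2} & 0 \\ 0 & B_{n-1} C_{n-2} \end{pmatrix} = O(\Delta t).$$

Similarly, it is easy to show that $\tilde{M}_{n-1} M_{n-2},\ M_{n-1} \tilde{M}_{n-2} \sim O(\Delta t)$. Plugging in all the above estimates yields,

$$\frac{\partial X_n}{\partial X_{n-1}} \frac{\partial X_{n-1}}{\partial X_{n-2}} = \begin{pmatrix} I & \Delta t\, (C_{n-2} + C_{n-1} C_{n-2}) \\ B_{n-1} + C_{n-1} B_{n-2} & C_{n-1} C_{n-2} \end{pmatrix} + O(\Delta t^2),$$

which is exactly the form of the leading term in (39). Iterating the above calculations $(n - k)$ times and realizing that $(n - k)\Delta t^2 \approx n \Delta t^2 = t_n \Delta t$ yields the formula (39).

Recall that we have set $\theta = W_{i,j}$, for some $1 \leq i, j \leq m$ in proposition 3.3. Directly calculating with (27), (28) and the representation formula (39) yields the formula,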

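$$\frac{\partial E^{(k)}_n}{\partial \theta} = y_n^\top \Delta t^2 \delta\, Z^{i,j}_{m,m}(A_{k-1})\, y_{k-1} + y_n^\top \Delta t^2 \delta\, C^* Z^{i,j}_{m,m}(A_{k-1})\, y_{k-1} + O(\Delta t^3), \qquad (40)$$

with the matrix $C^*$ defined as,

$$C^* := \sum_{j=k}^{n-1} \prod_{i=j}^{k} C_i,$$

and $Z^{i,j}_{m,m}(A_{k-1}) \in \mathbb{R}^{m \times m}$ is a matrix whose elements are all zero except for the $(i, j)$-th entry, which is set to $\sigma'(a^i_{k-1})$, i.e. the $i$-th entry of $\sigma'(A_{k-1})$. Note that the formula (40) can be explicitly written as,

$$\frac{\partial E^{(k)}_n}{\partial \theta} = \delta \Delta t^2 \sigma'(a^i_{k-1})\, y^i_n\, y^j_{k-1} + \delta \Delta t^2 \sigma'(a^i_{k-1}) \sum_{\ell=1}^{m} C^*_{\ell i}\, y^\ell_n\, y^j_{k-1} + O(\Delta t^3), \qquad (41)$$

with $y^j_n$ denoting the $j$-th element of the vector $y_n$, and

$$a^i_{k-1} := \sum_{\ell=1}^{m} W_{i\ell}\, y^\ell_{k-1} + \sum_{\ell=1}^{m} \mathcal{W}_{i\ell}\, z^\ell_{k-1}. \qquad (42)$$

By the assumption (8), we can readily see that $\|W\|_\infty, \|\mathcal{W}\|_\infty \leq 1 + \Delta t$. Therefore, by the fact that $\sigma' = \mathrm{sech}^2$, the assumption $y^i_k = O(\sqrt{t_k})$ and (42), we obtain,

$$\hat{c} = \mathrm{sech}^2\left(\sqrt{k \Delta t}\,(1 + \Delta t)\right) \leq \sigma'(a^i_{k-1}) \leq 1. \qquad (43)$$

Using (43) in (41), we obtain,

$$\delta \Delta t^2 \sigma'(a^i_{k-1})\, y^i_n\, y^j_{k-1} = O\left(\hat{c}\, \delta \Delta t^{\frac{5}{2}}\right). \qquad (44)$$

Summing over $j$ and using the fact that $k \ll n$, we obtain that

$$C^* = \left(O(n) + O(\delta \Delta t^0)\right) I. \qquad (45)$$

Plugging (45) and (43) into (41) leads to,

$$\delta \Delta t^2 \sigma'(a^i_{k-1}) \sum_{\ell=1}^{m} C^*_{\ell i}\, y^\ell_n\, y^j_{k-1} = O\left(\hat{c}\, \delta \Delta t^{\frac{3}{2}}\right) + O\left(\hat{c}\, \delta^2 \Delta t^{\frac{5}{2}}\right). \qquad (46)$$

Combining (44) and (46) yields the desired estimate (16).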
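The block-matrix product used in the induction step of this proof is a purely algebraic identity, so it can be checked numerically with arbitrary blocks. In the sketch below, the matrices `B1, C1` stand for $B_{n-1}, C_{n-1}$ and `B2, C2` for $B_{n-2}, C_{n-2}$; they are filled with random numbers because the definitions in (12) are not needed for the identity itself.

```python
import numpy as np

def block_m(B, C, dt):
    """Assemble M = [[I, dt*C], [B, C]] for square blocks B and C."""
    identity = np.eye(B.shape[0])
    return np.block([[identity, dt * C], [B, C]])

def two_step_product_terms(B1, C1, B2, C2, dt):
    """Return the leading term and the dt-remainder of M_{n-1} M_{n-2}."""
    identity = np.eye(B1.shape[0])
    zero = np.zeros_like(B1)
    leading = np.block([[identity, dt * (C2 + C1 @ C2)],
                        [B1 + C1 @ B2, C1 @ C2]])
    remainder = dt * np.block([[C1 @ B2, zero],
                               [zero, B1 @ C2]])
    return leading, remainder

rng = np.random.default_rng(0)
m, dt = 3, 0.01
B1, C1, B2, C2 = (rng.normal(size=(m, m)) for _ in range(4))

product = block_m(B1, C1, dt) @ block_m(B2, C2, dt)
leading, remainder = two_step_product_terms(B1, C1, B2, C2, dt)
identity_holds = np.allclose(product, leading + remainder)
```

The identity holds exactly for any blocks; the statement that the remainder is of higher order additionally requires the smallness of $B$ implied by (12) and (8).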



Figure 1: Results of the adding problem for coRNN, expRNN, FastRNN, anti.sym. RNN and tanh RNN based on three different sequence lengths T , i.e. T = 500, T = 2000 and T = 5000.

Figure 2: Performance on psMNIST for different models, all with 128 hidden units and the same fixed random permutation.

Figure 4: Ablation study on the hyperparameters $\epsilon, \gamma$ in (3) using the noise padded CIFAR-10 experiment.

Figure 5: Exemplary $(x_1, x_2)$-trajectories of the Lorenz 96 system (17) for different forces $F$.

Figure 6: Illustration of the hidden state $y$ of coRNN (3) with a scalar input signal $u$ (Top, Middle, Left) with one neuron with state $y$ (Top and Middle, Right) and two neurons with states $y_1$ (Bottom left) and $y_2$ (Bottom right), corresponding to the scalar input signal shown in Top Left. Legend is SHO (simple harmonic oscillator), FHO (forced oscillator), FDO (forced and damped oscillator), CFDO (controlled forced and damped oscillator), DUFF (Duffing type), UC (uncoupled), ORD (ordered coupling) and FC (fully coupled). The legend is explained in the text.

of $C_i$, we can expand the product in $C^*$ and neglect terms of order $O(\Delta t^4)$, to obtain $\prod_{i=j}^{k} C_i = \left(O(1) + O((j - k + 1)\,\delta \Delta t^2)\right) I$.

Test accuracies on noise padded CIFAR-10.

Test accuracies on HAR-2.

Table 4 shows the results for coRNN and other recently published models, which are trained similarly and have the same number of hidden units, i.e. 128. We can see that coRNN compares favorably with gated baselines (which are known to perform very well on this task), while at the same time requiring significantly fewer parameters.

Test accuracies on IMDB.

Distributional information (mean and standard deviation) on the results for each classification experiment presented in the paper based on 10 re-trainings of the best performing coRNN using random initialization of the trainable parameters.

Test NRMSE on the Lorenz 96 system (17) for coRNN and LSTM.

Rounded hyperparameters of the best performing coRNN architecture.


The proof of this proposition follows exactly along the same lines as the proof of proposition D.1 and we skip the details, while noting the crucial role played by the energy bound (19). We remark that the bound (23) indicates that as long as the initial gradients with respect to $\theta$ are bounded and the weights are controlled by the damping parameter, the hidden state gradients remain bounded in time.

E SUPPLEMENT TO THE RIGOROUS ANALYSIS OF CORNN

In this section, we supplement the section on the rigorous analysis of the proposed RNN (4). We start with the proof of proposition 3.1.

E.1 PROOF OF PROPOSITION 3.1

We multiply $(y_{n-1}^\top, z_n^\top)$ to (3) and use the elementary identities, to obtain the following, Iterating the above inequality $n$ times leads to the energy bound, as $y_0 = z_0 = 0$.

E.2 SENSITIVITY TO INPUTS

Next, we examine how changes in the input signal $u$ affect the dynamics. We have the following proposition:

Proposition E.1 Let $y_n, z_n$ be the hidden states of the trained RNN (4) with respect to the input $u = \{u_n\}_{n=1}^{N}$ and let $\bar{y}_n, \bar{z}_n$ be the hidden states of the same RNN (4), but with respect to the input $\bar{u} = \{\bar{u}_n\}_{n=1}^{N}$, then the differences in the hidden states are bounded by,

The proof of this proposition is completely analogous to the proof of proposition 3.1: we subtract (26) from (4) and multiply $(y_n - \bar{y}_n)^\top, (z_n - \bar{z}_n)^\top$ to the difference. The estimate (25) follows identically to the proof of (5) (presented above) by realizing that $|\sigma(A_{n-1}) - \sigma(\bar{A}_{n-1})| \leq 2$.

Note that the bound (25) ensures that the hidden states can only separate linearly in time for changes in the input. Thus, chaotic behavior, such as for Duffing type oscillators, characterized by at least exponential separation of trajectories, is ruled out for this proposed RNN, showing that it is stable with respect to changes in the input. This is largely on account of the fact that the activation function $\sigma$ in (3) is globally bounded.

Remark. A careful examination of the above proof reveals that the constants hidden in the prefactors of the leading term $O\left(\hat{c}\, \delta \Delta t^{\frac{3}{2}}\right)$ of (16) stem from the formula (46). Here, we have used the assumption that $y^i_k = O(\sqrt{t_k})$. Note that this assumption implicitly assumes that the energy bound (5) is equidistributed among all the elements of the vector $y_k$ and results in the obfuscation of the constants in the leading term of (16). Given that the energy bound (5) is too coarse to allow for precise upper and lower bounds on each individual element of the hidden state vector $y_k$, we do not see any other way of determining, in general, the distribution of energy among individual entries of the hidden state vector. Thus, assuming equidistribution seems reasonable.
On the other hand, in practice, one has access to all the terms in formula (46) for each numerical experiment and if one is interested, then one can directly evaluate the precise bound on the leading term of the formula (16).

F RIGOROUS ESTIMATES FOR THE RNN (3) WITH $\bar{n} = n - 1$ AND GENERAL VALUES OF $\epsilon, \gamma$

In this section, we will provide rigorous estimates, similar to those of propositions 3.1, E.1 and 3.2, for the version of coRNN (3) that results by setting $\bar{n} = n - 1$ in (3) leading to, Note that (47) can be equivalently written as, We will also consider the case of non-unit values of the control parameters $\gamma$ and $\epsilon$ below.

Bounds on Hidden states. We start with the following bound on the hidden states of (47).

Proposition F.1 Let the damping parameter $\epsilon > \frac{1}{2}$ and the time step $\Delta t$ in the RNN (47) satisfy the following condition, Let $y_n, z_n$ be the hidden states of the RNN (47) for $1 \leq n \leq N$, then the hidden states satisfy the following (energy) bounds:

We set $A_{n-1} = W y_{n-1} + \mathcal{W} z_{n-1} + V u_{n-1} + b$ and, as in the proof of proposition 3.1, we multiply $(y_{n-1}^\top, \frac{1}{\gamma} z_n^\top)$ to (47), use elementary identities and rearrange terms to obtain, We use a rescaled version of the well-known Cauchy's inequality for a constant $c > 0$ to be determined, to rewrite the above identity as, Using the first equation in (47), the above inequality reduces to, As long as, we can easily check that, Iterating the above bound till $n = 0$ and using the zero initial data yields the desired (50) as long as we find a $c$ such that the condition (51) is satisfied. To do so, we equalize the two terms on the right hand side of (51) to obtain, From the assumption (49) and the fact that $\epsilon > \frac{1}{2}$, we see that such a $c > 0$ always exists for any value of $\gamma > 0$ and (51) is satisfied, which completes the proof.

We remark that the same bound on the hidden states is obtained for both versions of coRNN, i.e. (3) with $\bar{n} = n$ and (47). However, the difference lies in the constraint on the time step $\Delta t$. In contrast to (49), a careful examination of the proof of proposition 3.1 reveals that the condition on the time step for the stability of (3) with $\bar{n} = n$ is given by, and is clearly less stringent than the condition (51) for the stability of (47).
For instance, in the prototypical case of $\gamma = \epsilon = 1$, the stability of (3) with $\bar{n} = n$ is ensured for any $\Delta t < 1$. On the other hand, the stability of (47) is ensured as long as $\Delta t < \frac{1}{2}$. However, it is essential to recall that these conditions are only sufficient to ensure stability and are by no means necessary. Thus, in practice, the coRNN version (47) is found to be stable in the same range of time steps as the version (3) with $\bar{n} = n$.

On the exploding and vanishing gradient problems for coRNN (47). Next, we have the following upper bound on the hidden state gradients for the version (47) of coRNN.

Proposition F.2 Let $y_n, z_n$ be the hidden states generated by the RNN (47). We assume that the damping parameter $\epsilon > \frac{1}{2}$ and the time step $\Delta t$ can be chosen such that in addition to (51) it also satisfies, and with the constant $C$ independent of the other parameters of the RNN (47). Then the gradient of the loss function $E$ (6) with respect to any parameter $\theta \in \Theta$ is bounded as, with the constant $C$ defined in (53) and $\bar{Y} = \max_{1 \leq n \leq N} \|\bar{y}_n\|_\infty$ a bound on the underlying training data.

Published as a conference paper at ICLR 2021

The proof of this proposition is completely analogous to the proof of proposition 3.2 and we omit the details here.

Note that the bound (54) enforces that hidden state gradients cannot explode for version (47) of coRNN. A similar statement for the vanishing gradient problem is inferred from the proposition below.

Proposition F.3 Let $y_n$ be the hidden states generated by the RNN (47). Under the assumption that $y^i_n = O\left(\sqrt{\frac{t_n}{\gamma}}\right)$, for all $1 \leq i \leq m$ and (53), the gradient for long-term dependencies satisfies, $+\, O(\Delta t^3)$, $\hat{c} = \mathrm{sech}^2\left(\sqrt{k \Delta t}\,(1 + \Delta t)\right)$, $k \ll n$. (55)

The proof is a repetition of the steps of the proof of proposition 3.3, with suitable modifications for the structure of the RNN and non-unit $\epsilon, \gamma$, and we omit the tedious calculations here. Note that (55) rules out the vanishing gradient problem for the coRNN version (47).
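The difference between the two versions of coRNN discussed in this section can be illustrated with a minimal forward pass. The update rules below are an assumption: they are one natural discretization of $y' = z$, $z' = \tanh(Wy + \mathcal{W}z + Vu + b) - \gamma y - \epsilon z$ in which the damping term is evaluated either at $z_n$ (implicit, $\bar{n} = n$) or at $z_{n-1}$ (explicit, $\bar{n} = n - 1$). The sketch then checks the energy bound $y_n^\top y_n \leq m\, t_n$ in the prototypical case $\gamma = \epsilon = 1$, which is the form in which (5) is used in the proof of proposition 3.2.

```python
import numpy as np

def cornn_step(y, z, u, params, dt, gamma, eps, implicit_damping):
    """One hidden state update (assumed form of the two coRNN versions)."""
    W, W_cal, V, b = params
    force = np.tanh(W @ y + W_cal @ z + V @ u + b) - gamma * y
    if implicit_damping:
        # Damping evaluated at the new state z_n (version with n_bar = n).
        z_new = (z + dt * force) / (1.0 + dt * eps)
    else:
        # Damping evaluated at the old state z_{n-1} (version with n_bar = n - 1).
        z_new = z + dt * (force - eps * z)
    y_new = y + dt * z_new
    return y_new, z_new

def run(inputs, params, dt, gamma=1.0, eps=1.0, implicit_damping=False):
    """Unroll the recurrence from zero initial data and collect all y_n."""
    m = len(params[3])
    y, z = np.zeros(m), np.zeros(m)
    ys = []
    for u in inputs:
        y, z = cornn_step(y, z, u, params, dt, gamma, eps, implicit_damping)
        ys.append(y)
    return np.array(ys)

rng = np.random.default_rng(1)
m, d, n_steps = 8, 2, 20
dt = 1.0 / n_steps                      # so that t_N = 1, and dt < 1/2
params = (rng.normal(size=(m, m)), rng.normal(size=(m, m)),
          rng.normal(size=(m, d)), rng.normal(size=m))
inputs = rng.normal(size=(n_steps, d))
t_n = dt * np.arange(1, n_steps + 1)

bound_holds = {}
for implicit in (True, False):
    ys = run(inputs, params, dt, implicit_damping=implicit)
    bound_holds[implicit] = bool(np.all((ys ** 2).sum(axis=1) <= m * t_n))
```

For this small time step both versions satisfy the bound with a large margin, consistent with the remark that the explicit version is stable in practice over the same range of time steps.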

