COUPLED OSCILLATORY RECURRENT NEURAL NETWORK (CORNN): AN ACCURATE AND (GRADIENT) STABLE ARCHITECTURE FOR LEARNING LONG TIME DEPENDENCIES

Abstract

Circuits of biological neurons, such as those in functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (the gradients of) the state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, mitigating the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data.

1. INTRODUCTION

Recurrent neural networks (RNNs) have achieved tremendous success in a variety of tasks involving sequential (time-series) inputs and outputs, ranging from speech recognition to computer vision and natural language processing. However, training RNNs to process inputs over long time scales (long input sequences) is notoriously hard on account of the so-called exploding and vanishing gradient problem (EVGP) (Pascanu et al., 2013), which stems from the fact that the well-established backpropagation through time (BPTT) algorithm for training RNNs requires computing products of gradients (Jacobians) of the underlying hidden states over very long time scales. Consequently, the overall gradient can grow (to infinity) or decay (to zero) exponentially fast in the number of recurrent interactions. A variety of approaches have been suggested to mitigate the EVGP. These include adding gating mechanisms to the RNN in order to control the flow of information in the network, leading to architectures such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014), which can overcome the vanishing gradient problem on account of their underlying additive structure. However, the gradients might still explode, and learning very long-term dependencies remains a challenge (Li et al., 2018). Another popular approach for handling the EVGP is to constrain the structure of the underlying recurrent weight matrices by requiring them to be orthogonal (unitary), leading to so-called orthogonal RNNs (Henaff et al., 2016; Arjovsky et al., 2016; Wisdom et al., 2016; Kerg et al., 2019, and references therein). By construction, the resulting Jacobians have eigen- and singular spectra of unit norm, alleviating the EVGP. However, as pointed out by Kerg et al. (2019), imposing such constraints on the recurrent matrices may lead to a significant loss of expressivity of the RNN, resulting in inadequate performance on realistic tasks.

In this article, we adopt a different approach, based on the observation that coupled networks of controlled nonlinear forced and damped oscillators, which arise in many physical, engineering, and biological systems such as networks of biological neurons, appear to provide expressive representations while constraining the dynamics of the state variables and their gradients. This motivates us to propose a novel architecture for RNNs, based on time-discretizations of a second-order system of nonlinear ordinary differential equations (ODEs) (1) that models coupled oscillators. Under verifiable hypotheses, we rigorously prove precise bounds on the hidden states of these RNNs and their gradients, enabling a possible solution of the EVGP. Moreover, we demonstrate through benchmark numerical experiments that the resulting system retains sufficient expressivity, i.e. the ability to process complex inputs, with performance competitive with the state of the art on a variety of sequential learning tasks.
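The exponential growth or decay of such products of Jacobians is easy to observe numerically. The following sketch is purely illustrative (it is not from the paper; the dimension, horizon, and scaling are arbitrary choices): it multiplies T copies of a fixed random "recurrent Jacobian" and measures the spectral norm of the product, which explodes or vanishes depending on the spectral radius of the matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
m, T = 32, 100  # hidden dimension and number of recurrent steps (arbitrary)

def product_norm(scale):
    """Spectral norm of A^T for a random matrix A with entries scaled by `scale`.

    This mimics the product of T identical Jacobians arising in BPTT with
    tied recurrent weights; the 1/sqrt(m) factor normalizes the spectral
    radius of the Gaussian matrix to be close to 1.
    """
    A = scale * rng.standard_normal((m, m)) / np.sqrt(m)
    J = np.eye(m)
    for _ in range(T):
        J = A @ J
    return np.linalg.norm(J, 2)  # ord=2 gives the spectral (operator) norm

# A spectral radius well below 1 drives the product to zero (vanishing
# gradients); well above 1 it blows up (exploding gradients).
vanishing = product_norm(0.3)
exploding = product_norm(2.0)
```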

2. THE PROPOSED RNN

Our proposed RNN is based on the following second-order system of ODEs,

$$y'' = \sigma\left(W y + \mathcal{W} y' + V u + b\right) - \gamma y - \epsilon y'. \qquad (1)$$

Here, $t \in [0, 1]$ is the (continuous) time variable, $u = u(t) \in \mathbb{R}^d$ is the time-dependent input signal, $y = y(t) \in \mathbb{R}^m$ is the hidden state of the RNN, $W, \mathcal{W} \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{m \times d}$ are weight matrices, $b \in \mathbb{R}^m$ is the bias vector, and $0 < \gamma, \epsilon$ are parameters representing the oscillation frequency and the amount of damping (friction) in the system, respectively. $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function, set to $\sigma(u) = \tanh(u)$ here. By introducing the so-called velocity variable $z = y'(t) \in \mathbb{R}^m$, we rewrite (1) as the first-order system:

$$y' = z, \qquad z' = \sigma\left(W y + \mathcal{W} z + V u + b\right) - \gamma y - \epsilon z. \qquad (2)$$

We fix a timestep $0 < \Delta t < 1$ and define our proposed RNN hidden states at time $t_n = n \Delta t \in [0, 1]$ (while omitting the affine output state) as the following IMEX (implicit-explicit) discretization of the first-order system (2):

$$y_n = y_{n-1} + \Delta t\, z_n, \qquad z_n = z_{n-1} + \Delta t\, \sigma\left(W y_{n-1} + \mathcal{W} z_{n-1} + V u_n + b\right) - \Delta t\, \gamma y_{n-1} - \Delta t\, \epsilon z_{\bar{n}}, \qquad (3)$$

with either $\bar{n} = n$ or $\bar{n} = n - 1$. Note that the only difference between the two versions of the RNN (3) lies in the implicit ($\bar{n} = n$) or explicit ($\bar{n} = n - 1$) treatment of the damping term $\epsilon z$ in (2), whereas both versions retain the implicit treatment of the first equation in (2).

Motivation and background. To see that the underlying ODE (2) models a coupled network of controlled, forced, and damped nonlinear oscillators, we start with the single-neuron (scalar) case by setting $d = m = 1$ in (1) and assuming an identity activation function $\sigma(x) = x$. Setting $W = \mathcal{W} = V = b = \epsilon = 0$ leads to the simple ODE $y'' + \gamma y = 0$, which exactly models simple harmonic motion with frequency $\sqrt{\gamma}$, for instance that of a mass attached to a spring (Guckenheimer & Holmes, 1990). Letting $\epsilon > 0$ in (1) adds damping or friction to the system (Guckenheimer & Holmes, 1990). Then, by introducing a non-zero $V$ in (1), we drive the system with a driving force proportional to the input signal $u(t)$. The parameters $V, b$ modulate the effect of the driving force, $W$ controls the frequency of the oscillations, and $\mathcal{W}$ the amount of damping in the system.
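As a concrete illustration, one step of the discretization (3) can be written in a few lines. The sketch below is our own minimal NumPy rendering with hypothetical parameter names (it is not the authors' reference implementation); it advances the hidden state and velocity by one step and covers both the implicit ($\bar{n} = n$) and explicit ($\bar{n} = n - 1$) treatment of the damping term.

```python
import numpy as np

def cornn_step(y, z, u, W, Wc, V, b, dt=0.01, gamma=1.0, eps=0.01,
               implicit_damping=True):
    """One step of the coRNN recurrence (3).

    y, z : hidden state and velocity at step n-1 (shape (m,))
    u    : input at step n (shape (d,))
    W, Wc: recurrent weight matrices (W and the script-W of the paper)
    V, b : input weights and bias
    """
    drive = np.tanh(W @ y + Wc @ z + V @ u + b)
    if implicit_damping:
        # implicit damping (n_bar = n): solve
        # z_n = z_{n-1} + dt*(drive - gamma*y_{n-1}) - dt*eps*z_n
        z_new = (z + dt * (drive - gamma * y)) / (1.0 + dt * eps)
    else:
        # explicit damping (n_bar = n-1)
        z_new = z + dt * (drive - gamma * y - eps * z)
    # implicit update of the hidden state: y_n = y_{n-1} + dt * z_n
    y_new = y + dt * z_new
    return y_new, z_new
```

Note the order of operations: because the first equation of (3) uses $z_n$ implicitly, the velocity must be updated before the hidden state.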
Finally, the tanh activation mediates a nonlinear response in the oscillator. In the coupled network (2) with $m > 1$, each neuron updates its hidden state based on the input signal as well as on information from the other neurons. The diagonal entries of $W$ (together with the scalar hyperparameter $\gamma$) control the frequency, and the diagonal entries of $\mathcal{W}$ (together with the hyperparameter $\epsilon$) the amount of damping, for each neuron, whereas the off-diagonal entries of these matrices modulate the interactions between neurons. Given this behavior of the underlying ODE (2), we term the RNN (3) a coupled oscillatory Recurrent Neural Network (coRNN).

The dynamics of the ODE (2) (and of the RNN (3)) for a single neuron are relatively straightforward. As we illustrate in Fig. 6 of the supplementary material SM §C, input signals drive the generation of (superpositions of) oscillatory wave forms, whose amplitude and (multiple) frequencies are controlled by the tunable parameters $W, \mathcal{W}, V, b$. Adding a tanh activation does not change these dynamics much. This is in contrast to truncating tanh to leading nonlinear order by setting $\sigma(x) = x - x^3/3$, which yields a Duffing-type oscillator that is characterized by chaotic behavior (Guckenheimer & Holmes, 1990). Adding interactions between neurons further accentuates this generation of superposed wave forms (see Fig. 6 in SM §C), and even with very simple network topologies, one

