COUPLED OSCILLATORY RECURRENT NEURAL NETWORK (CORNN): AN ACCURATE AND (GRADIENT) STABLE ARCHITECTURE FOR LEARNING LONG TIME DEPENDENCIES

Abstract

Circuits of biological neurons, such as those in the functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data.

1. INTRODUCTION

Recurrent neural networks (RNNs) have achieved tremendous success in a variety of tasks involving sequential (time series) inputs and outputs, ranging from speech recognition to computer vision and natural language processing. However, training RNNs to process inputs over long time scales (long input sequences) is notoriously hard on account of the so-called exploding and vanishing gradient problem (EVGP) (Pascanu et al., 2013), which stems from the fact that the well-established backpropagation through time (BPTT) algorithm for training RNNs requires computing products of gradients (Jacobians) of the underlying hidden states over very long time scales. Consequently, the overall gradient can grow (to infinity) or decay (to zero) exponentially fast with the number of recurrent interactions; a numerical sketch of this effect is given at the end of this section.

A variety of approaches have been suggested to mitigate the EVGP. These include adding gating mechanisms to the RNN in order to control the flow of information in the network, leading to architectures such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014), which can overcome the vanishing gradient problem on account of their underlying additive structure. However, the gradients might still explode, and learning very long-term dependencies remains a challenge (Li et al., 2018). Another popular approach for handling the EVGP is to constrain the structure of the underlying recurrent weight matrices by requiring them to be orthogonal (unitary), leading to the so-called orthogonal RNNs (Henaff et al., 2016; Arjovsky et al., 2016; Wisdom et al., 2016; Kerg et al., 2019, and references therein). By construction, the resulting Jacobians have eigenvalues and singular values of unit norm, alleviating the EVGP. However, as pointed out by Kerg et al. (2019), imposing such constraints on the recurrent matrices may lead to a significant loss of expressivity of the RNN, resulting in inadequate performance on realistic tasks.

In this article, we adopt a different approach, based on the observation that coupled networks of controlled, forced, and damped nonlinear oscillators, which arise in many physical, engineering, and biological systems, can express rich dynamics while keeping their state variables, and hence the associated gradients, bounded.
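To make the EVGP mechanism described above concrete, the following minimal sketch (our illustration, not code from the paper) shows how the norm of a product of recurrent Jacobians scales with the sequence length. For a linear RNN with recurrent matrix W, the Jacobian of each hidden state with respect to the previous one is W itself, so BPTT over T steps involves the matrix power W^T; a scaling factor slightly below or above one leads to vanishing or exploding gradients, respectively.

```python
# Minimal illustration (not from the paper) of the exploding and vanishing
# gradient problem: BPTT multiplies T Jacobians; for a linear RNN with
# recurrent matrix W, this product is simply W^T.
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 200

for scale in (0.9, 1.1):  # spectral radius just below / just above 1
    # scaled orthogonal matrix, so all singular values equal `scale`
    W = scale * np.linalg.qr(rng.standard_normal((n, n)))[0]
    grad = np.eye(n)
    for _ in range(T):
        grad = W @ grad  # accumulate the Jacobian product from BPTT
    print(f"scale={scale}: ||Jacobian product|| = {np.linalg.norm(grad):.2e}")
# scale=0.9 -> norm on the order of 1e-9 (vanishing gradient)
# scale=1.1 -> norm on the order of 1e+9 (exploding gradient)
```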

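As a complementary illustration of the oscillator-based approach, the sketch below discretizes a network of forced, damped nonlinear oscillators with an explicit Euler step. This is our own minimal rendering under stated assumptions (a tanh nonlinearity, an explicit scheme, and hypothetical parameter names W, V, U, b), not necessarily the exact architecture or discretization of the proposed coRNN.

```python
# A minimal sketch (assumptions: tanh nonlinearity, explicit Euler step,
# hypothetical parameter names) of an RNN derived from the second-order ODE
#   y'' = tanh(W y + V y' + U u + b) - gamma * y - eps * y',
# rewritten as the first-order system y' = z, z' = rhs.
import numpy as np

def oscillator_rnn(inputs, W, V, U, b, dt=0.05, gamma=1.0, eps=1.0):
    """Run the coupled-oscillator recurrence over `inputs` of shape (T, d_in).

    Returns the hidden states y_1, ..., y_T of shape (T, d_hidden).
    """
    m = W.shape[0]
    y = np.zeros(m)  # oscillator "positions" (the hidden state)
    z = np.zeros(m)  # oscillator "velocities"
    states = []
    for u in inputs:
        # forced, damped, coupled oscillator dynamics
        z = z + dt * (np.tanh(W @ y + V @ z + U @ u + b) - gamma * y - eps * z)
        y = y + dt * z
        states.append(y.copy())
    return np.stack(states)
```

Here W and V are (m, m) recurrent weight matrices, U is an (m, d_in) input matrix, and b is a length-m bias; the restoring term -gamma * y and the damping term -eps * z are what keep the oscillator states, and hence the associated gradients, bounded.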
