ON THE CURSE OF MEMORY IN RECURRENT NEURAL NETWORKS: APPROXIMATION AND OPTIMIZATION ANALYSIS

Abstract

We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem for such linear functionals and characterize the approximation rate. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs by gradient methods. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on both approximation and optimization: when the target has long-term memory, a large number of neurons is required to approximate it, and the training process suffers from slowdowns. In particular, both effects become exponentially more pronounced with increasing memory, a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomena that may arise in learning temporal relationships using recurrent architectures.
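To make this setting concrete, the following is a minimal sketch of the objects involved; the notation ($h_t$, $W$, $U$, $c$, $\rho$) is illustrative here, and the precise definitions and hypotheses are stated in the body of the paper. A continuous-time linear RNN with hidden state $h_t \in \mathbb{R}^m$ maps an input signal $x = \{x_t\}$ to an output $\hat{y}_t$ via

$$\frac{d}{dt} h_t = W h_t + U x_t, \qquad h_{-\infty} = 0, \qquad \hat{y}_t = c^\top h_t,$$

while the target relationships form a family of linear functionals of the input history, which (under suitable regularity) can be written in terms of a memory kernel $\rho$ as

$$y_t = H_t(x) = \int_0^{\infty} \rho(s)^\top x_{t-s}\, ds.$$

Loosely speaking, the "memory" of the target measures how slowly $\rho$ decays: the heavier its tail, the further back in time the inputs influence the current output.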

1. INTRODUCTION

Recurrent neural networks (RNNs) (Rumelhart et al., 1986) are among the most frequently employed methods to build machine learning models on temporal data. Despite their ubiquitous application (Baldi et al., 1999; Graves & Schmidhuber, 2009; Graves, 2013; Graves et al., 2013; Graves & Jaitly, 2014; Gregor et al., 2015), some fundamental theoretical questions remain to be answered. These come in several flavors. First, one may pose the approximation problem, which asks what kinds of temporal input-output relationships RNNs can model to arbitrary precision. Second, one may consider the optimization problem, which concerns the dynamics of training the RNN (say, by gradient descent). While such questions can be posed for any machine learning model, the crux of the problem for RNNs is how the recurrent structure of the model and the dynamical nature of the data shape the answers to these problems. For example, it is often observed that when there are long-term dependencies in the data (Bengio et al., 1994; Hochreiter et al., 2001), RNNs may encounter problems in learning, but such statements have rarely been put on a precise footing. In this paper, we make a step in this direction by studying the approximation and optimization properties of RNNs. Compared with the static feed-forward setting, the key distinguishing feature

