ON THE CURSE OF MEMORY IN RECURRENT NEURAL NETWORKS: APPROXIMATION AND OPTIMIZATION ANALYSIS

Abstract

We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem for such linear functionals and characterize the approximation rate. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs by gradient methods. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on both approximation and optimization: when there is long-term memory in the target, it takes a large number of neurons to approximate it, and the training process suffers from slowdowns. In particular, both of these effects become exponentially more pronounced with increasing memory, a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomena that may arise in learning temporal relationships using recurrent architectures.

1. INTRODUCTION

Recurrent neural networks (RNNs) (Rumelhart et al., 1986) are among the most frequently employed methods for building machine learning models on temporal data. Despite their ubiquitous applications (Baldi et al., 1999; Graves & Schmidhuber, 2009; Graves, 2013; Graves et al., 2013; Graves & Jaitly, 2014; Gregor et al., 2015), some fundamental theoretical questions remain to be answered. These come in several flavors. First, one may pose the approximation problem, which asks what kinds of temporal input-output relationships RNNs can model to arbitrary precision. Second, one may consider the optimization problem, which concerns the dynamics of training the RNN (say, by gradient descent). While such questions can be posed for any machine learning model, the crux of the problem for RNNs is how the recurrent structure of the model and the dynamical nature of the data shape the answers. For example, it is often observed that when there are long-term dependencies in the data (Bengio et al., 1994; Hochreiter et al., 2001), RNNs may encounter difficulties in learning, but such statements have rarely been put on a precise footing. In this paper, we take a step in this direction by studying the approximation and optimization properties of RNNs. Compared with the static feed-forward setting, the key distinguishing feature here is the presence of temporal dynamics, both in the recurrent architecture of the model and in the dynamical structure of the data. Hence, understanding the influence of dynamics on learning is of fundamental importance. As is often the case, the key effects of dynamics can already be revealed in the simplest linear setting. For this reason, we focus our analysis on linear RNNs, i.e. those with linear activations.
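To fix ideas, the continuous-time linear RNN described above can be sketched in a few lines of code. The following simulation is an illustration of ours, not a construction from the paper: it integrates the linear hidden dynamics dh/dt = W h + U x with a forward-Euler scheme and reads out y = c^T h. All concrete choices (the matrices W, U, the readout c, the input signal, and the step size) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a continuous-time linear RNN,
#   dh/dt = W h(t) + U x(t),   y(t) = c^T h(t),
# simulated with forward Euler. The shift by -2I makes the hidden
# dynamics stable for this illustration.
rng = np.random.default_rng(0)
m, d = 8, 2          # hidden width, input dimension
dt, T = 0.01, 500    # Euler step size, number of steps

W = rng.standard_normal((m, m)) / np.sqrt(m) - 2.0 * np.eye(m)
U = rng.standard_normal((m, d))
c = rng.standard_normal(m)

h = np.zeros(m)
ys = []
for t in range(T):
    x = np.array([np.sin(0.05 * t), np.cos(0.03 * t)])  # example input
    h = h + dt * (W @ h + U @ x)   # Euler step of the linear ODE
    ys.append(float(c @ h))

print(len(ys))
```

Since all maps are linear, the output at time t is itself a linear functional of the input path, which is precisely the class of targets studied in the paper.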
Further, we employ a continuous-time analysis, initially studied in the context of feed-forward architectures (E, 2017; Haber & Ruthotto, 2017; Li et al., 2017) and recently in recurrent settings (Ceni et al., 2019; Chang et al., 2019; Lim, 2020; Sherstinsky, 2018; Niu et al., 2019; Herrera et al., 2020; Rubanova et al., 2019), and idealize the RNN as a continuous-time dynamical system. This allows us to phrase the problems under investigation in a convenient analytical setting that accentuates the effect of dynamics. In this case, the RNNs serve to approximate relationships represented by sequences of linear functionals. At first glance the setting appears simple, but we show that it yields representative results that underlie key differences between the dynamical setting and static supervised learning problems. In fact, we show that memory, which can be made precise via the decay rates of the target linear functionals, affects both approximation rates and optimization dynamics in a non-trivial way. Our main results are:

1. We give a systematic analysis of the approximation of linear functionals by continuous-time linear RNNs, including a precise characterization of the approximation rates in terms of the regularity and memory of the target functional.

2. We give a fine-grained analysis of the optimization dynamics when training linear RNNs, and show that training efficiency is adversely affected by the presence of long-term memory.

These results together paint a comprehensive picture of the interaction between learning and dynamics, and make concrete the heuristic observation that the presence of long-term memory affects RNN learning negatively (Bengio et al., 1994; Hochreiter et al., 2001).
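The notion of a target linear functional with decaying memory can likewise be illustrated numerically. The sketch below is our illustration, not an example from the paper: it evaluates a functional of the form H_t(x) = ∫_0^∞ ρ(s) x(t − s) ds with an exponential kernel ρ(s) = e^{−λs}, where a smaller decay rate λ corresponds to longer memory. The kernel and the input are illustrative assumptions.

```python
import numpy as np

# A linear functional with memory: H_t(x) = ∫_0^∞ rho(s) x(t - s) ds,
# with rho(s) = exp(-lam * s). Small lam = slow decay = long memory.
def functional_output(x_hist, lam, dt):
    """Approximate the integral by a Riemann sum over the stored history.
    x_hist[0] holds the most recent input value."""
    s = dt * np.arange(len(x_hist))
    rho = np.exp(-lam * s)
    return dt * float(rho @ x_hist)

dt = 0.01
x_hist = np.ones(10_000)        # constant input x ≡ 1 over the history
y_long  = functional_output(x_hist, lam=0.1, dt=dt)   # long memory
y_short = functional_output(x_hist, lam=10.0, dt=dt)  # short memory
# For x ≡ 1 the exact value of the integral is 1/lam,
# so y_long ≈ 10 while y_short ≈ 0.1.
print(y_long, y_short)
```

A long-memory functional thus aggregates inputs over a much longer horizon, which is the quantitative sense in which memory enters the approximation and optimization results below.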
In particular, mirroring the classical curse of dimensionality (Bellman, 1957), we introduce the concept of the curse of memory, which captures the new phenomena that arise from learning temporal relationships: when there is long-term memory in the data, an exponentially large number of neurons is required for approximation, and the learning dynamics suffer from exponential slowdowns. These results form a basic step towards a mathematical understanding of the recurrent structure and its effects on learning from temporal data.

2. RELATED WORK

We discuss related work on RNNs on three fronts corresponding to the central results of this paper: approximation theory, optimization analysis, and the role of memory in learning.

A number of universal approximation results for RNNs have been obtained in discrete (Matthews, 1993; Doya, 1993; Schäfer & Zimmermann, 2006; 2007) and continuous time (Funahashi & Nakamura, 1993; Chow & Li, 2000; Li et al., 2005; Maass et al., 2007; Nakamura & Nakagawa, 2009). Most of these focus on the case where the target relationship is generated by a hidden dynamical system in the form of difference or differential equations. The formulation of functional approximation here is more general, although our results are currently limited to the linear setting. Nevertheless, this is already sufficient to reveal new phenomena involving the interaction of learning and dynamics, as will be especially apparent when we discuss approximation rates and optimization dynamics. We also note that functional/operator approximation using neural networks has been explored for non-recurrent structures (Chen & Chen, 1993; 1995; Lu et al., 2019) and for reservoir systems, for which approximation results similar to those for random feature models have been derived (Gonon et al., 2020). The main difference here is that we explicitly study the effect of memory in the target functionals on learning with recurrent structures.

On the optimization side, there are a number of recent results concerning the training of RNNs using gradient methods, and they are mostly positive in the sense that trainability is proved in specific settings. These include recovering linear dynamics (Hardt et al., 2018) and training in over-parameterized settings (Allen-Zhu et al., 2019). Our result, in contrast, concerns the general setting of learning linear functionals that need not come from underlying differential/difference equations, and is also away from the over-parameterized regime.
In our case, we discover on the contrary

