ON THE CURSE OF MEMORY IN RECURRENT NEURAL NETWORKS: APPROXIMATION AND OPTIMIZATION ANALYSIS

Abstract

We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem for such linear functionals and characterize the approximation rate. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs by gradient methods. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on both approximation and optimization: when there is long-term memory in the target, it takes a large number of neurons to approximate it, and the training process suffers from slowdowns. In particular, both of these effects become exponentially more pronounced with increasing memory, a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomena that may arise when learning temporal relationships using recurrent architectures.

1. INTRODUCTION

Recurrent neural networks (RNNs) (Rumelhart et al., 1986) are among the most frequently employed methods for building machine learning models on temporal data. Despite their ubiquitous applications (Baldi et al., 1999; Graves & Schmidhuber, 2009; Graves, 2013; Graves et al., 2013; Graves & Jaitly, 2014; Gregor et al., 2015), some fundamental theoretical questions remain to be answered. These come in several flavors. First, one may pose the approximation problem, which asks what kinds of temporal input-output relationships RNNs can model to arbitrary precision. Second, one may consider the optimization problem, which concerns the dynamics of training the RNN (say, by gradient descent). While such questions can be posed for any machine learning model, the crux of the problem for RNNs is how the recurrent structure of the model and the dynamical nature of the data shape the answers. For example, it is often observed that when there are long-term dependencies in the data (Bengio et al., 1994; Hochreiter et al., 2001), RNNs may encounter problems in learning, but such statements have rarely been put on a precise footing. In this paper, we make a step in this direction by studying the approximation and optimization properties of RNNs. Compared with the static feed-forward setting, the key distinguishing feature here is the presence of temporal dynamics, both in the recurrent architecture of the model and in the dynamical structure of the data. Understanding the influence of dynamics on learning is therefore of fundamental importance. As is often the case, the key effects of dynamics can already be revealed in the simplest linear setting. For this reason, we focus our analysis on linear RNNs, i.e. those with linear activations.
Further, we employ a continuous-time analysis, initially studied in the context of feed-forward architectures (E, 2017; Haber & Ruthotto, 2017; Li et al., 2017) and recently in recurrent settings (Ceni et al., 2019; Chang et al., 2019; Lim, 2020; Sherstinsky, 2018; Niu et al., 2019; Herrera et al., 2020; Rubanova et al., 2019), and idealize the RNN as a continuous-time dynamical system. This allows us to phrase the problems under investigation in a convenient analytical setting that accentuates the effect of dynamics. In this case, RNNs serve to approximate relationships represented by sequences of linear functionals. At first glance the setting appears simple, but we show that it yields representative results that underlie key differences between the dynamical setting and static supervised learning problems. In fact, we show that memory, which can be made precise through the decay rates of the target linear functionals, affects both approximation rates and optimization dynamics in a non-trivial way. Our main results are:

1. We give a systematic analysis of the approximation of linear functionals by continuous-time linear RNNs, including a precise characterization of the approximation rates in terms of the regularity and memory of the target functional.

2. We give a fine-grained analysis of the optimization dynamics when training linear RNNs, and show that training efficiency is adversely affected by the presence of long-term memory.

Together, these results paint a comprehensive picture of the interaction of learning and dynamics, and make concrete the heuristic observation that the presence of long-term memory affects RNN learning negatively (Bengio et al., 1994; Hochreiter et al., 2001).
In particular, mirroring the classical curse of dimensionality (Bellman, 1957), we introduce the concept of the curse of memory to capture the new phenomena that arise when learning temporal relationships: when there is long-term memory in the data, one requires an exponentially large number of neurons for approximation, and the learning dynamics suffer from exponential slowdowns. These results form a basic step towards a mathematical understanding of the recurrent structure and its effects on learning from temporal data.

2. RELATED WORK

We discuss related work on RNNs on three fronts corresponding to the central results of this paper: approximation theory, optimization analysis, and the role of memory in learning. A number of universal approximation results for RNNs have been obtained in discrete time (Matthews, 1993; Doya, 1993; Schäfer & Zimmermann, 2006; 2007) and in continuous time (Funahashi & Nakamura, 1993; Chow & Xiao-Dong Li, 2000; Li et al., 2005; Maass et al., 2007; Nakamura & Nakagawa, 2009). Most of these focus on the case where the target relationship is generated by a hidden dynamical system in the form of difference or differential equations. The formulation of functional approximation here is more general, although our results are currently limited to the linear setting. Nevertheless, this is already sufficient to reveal new phenomena involving the interaction of learning and dynamics, as will be especially apparent when we discuss approximation rates and optimization dynamics. We also note that functional/operator approximation using neural networks has been explored in Chen & Chen (1993); Tianping Chen & Hong Chen (1995); Lu et al. (2019) for non-recurrent structures, and for reservoir systems, for which approximation results similar to those for random-feature models have been derived (Gonon et al., 2020). The main difference here is that we explicitly study the effect of memory in the target functionals on learning with recurrent structures. On the optimization side, there are a number of recent results concerning the training of RNNs by gradient methods, and they are mostly positive in the sense that trainability is proved in specific settings, including recovering linear dynamics (Hardt et al., 2018) and training in over-parameterized settings (Allen-Zhu et al., 2019). Here, our result concerns the general setting of learning linear functionals that need not come from an underlying differential/difference equation, and is also away from the over-parameterized regime.
In our case, we discover on the contrary that training can become very difficult even in the linear case, and this can be understood quantitatively in relation to long-term memory in the target functionals. This connects to the practical literature on memory and learning. The dynamical analysis here puts the ubiquitous but heuristic observation that long-term memory negatively impacts training efficiency (Bengio et al., 1994; Hochreiter et al., 2001) on concrete theoretical footing, at least in idealized settings. This may serve to justify or improve current heuristic methods (Tseng et al., 2016; Dieng et al., 2017; Trinh et al., 2018) developed in applications to deal with the difficulty of training with long-term memory. At the same time, we complement general results on the "vanishing and exploding gradients" phenomenon (Pascanu et al., 2013; Hanin & Rolnick, 2018; Hanin, 2018), which are typically restricted to the initialization, with more precise characterizations in the dynamical regime during training. Long-range dependence within temporal data has long been studied in the time-series literature, although its effect on learning input-output relationships is rarely covered. For example, the Hurst exponent (Hurst, 1951) is often used as a measure of long-term memory in time series, e.g. fractional Brownian motion (Mandelbrot & Ness, 1968). In contrast with the setting in this paper, where memory concerns the dependence of the output time series on the input, the Hurst exponent measures temporal variation and dependence within the input time series itself. Much of the time-series literature investigates statistical properties and estimation methods for data with long-range dependence (Samorodnitsky, 2006; Taqqu et al., 1995; Beran, 1992; Doukhan et al., 2003).
One can also combine these classical statistical methodologies with RNN-like architectures to design hybrid models for various applications (Loukas & Oke, 2007; Diaconescu, 2008; Mohan & Gaitonde, 2018; Bukhari et al., 2020).

3. PROBLEM FORMULATION

The basic problem of supervised learning on time-series data is to learn a mapping from an input temporal sequence to an output sequence. Formally, one can think of the output at each time as being produced from the input via an unknown function that depends on the entire input sequence, or at least on the part up to the time at which the prediction is made. In the discrete-time case, one can write the data-generation process as

y_k = H_k(x_0, ..., x_{k-1}),   k = 0, 1, ...,   (1)

where x_k, y_k denote respectively the input data and the output response, and {H_k : k = 0, 1, ...} is a sequence of ground-truth functions of increasing input dimension, accounting for temporal evolution. The goal of supervised learning is to learn an approximation of the sequence of functions {H_k} given observation data. Recurrent neural networks (RNNs) (Rumelhart et al., 1986) give a natural way to parameterize such a sequence of functions. In the simplest case, the one-layer RNN is

h_{k+1} = σ(W h_k + U x_k),   ŷ_k = c^T h_k.   (2)

Here, {h_k} are the hidden/latent states, whose evolution is governed by the recursive application of a feed-forward layer with activation σ, and ŷ_k is called the observation or readout. We omit the bias term and only consider a linear readout (output) layer. For each time step k, the mapping {x_0, ..., x_{k-1}} → ŷ_k parameterizes a function Ĥ_k(·) through the adjustable parameters (c, W, U). Hence, for a particular choice of these parameters, a sequence of functions {Ĥ_k} is constructed at the same time. To understand the working principles of RNNs, we need to characterize how {Ĥ_k} approximates {H_k}. We will work in continuous time, and for the space of input sequences we take

X = C_0(R, R^d),   (3)

which is the linear space of continuous functions from R (time) to R^d that vanish at infinity. Here d is the dimension of each point in the time series. We denote an element of X by x := {x_t ∈ R^d : t ∈ R} and equip X with the supremum norm ‖x‖_X := sup_{t∈R} ‖x_t‖_∞. For the space of outputs we take scalar time series, i.e.
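The discrete-time recursion (2) can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the parameters are chosen at random (and kept small so the recursion is stable), and σ is the identity, matching the linear setting studied later.

```python
import numpy as np

def linear_rnn(x, W, U, c, h0=None):
    """Run h_{k+1} = W h_k + U x_k, yhat_k = c^T h_k over a (K, d) input x."""
    m = W.shape[0]
    h = np.zeros(m) if h0 is None else h0
    ys = []
    for xk in x:
        ys.append(c @ h)        # readout at step k: yhat_k = c^T h_k
        h = W @ h + U @ xk      # linear activation (sigma = identity)
    return np.array(ys)

rng = np.random.default_rng(0)
m, d, K = 4, 2, 50
W = 0.5 * rng.standard_normal((m, m)) / np.sqrt(m)   # small for stability
U = rng.standard_normal((m, d))
c = rng.standard_normal(m)
x = rng.standard_normal((K, d))

y = linear_rnn(x, W, U, c)

# With sigma = identity, the map x -> yhat is linear in the input sequence.
assert np.allclose(linear_rnn(2.0 * x, W, U, c), 2.0 * y)
```

For a fixed (c, W, U), each ŷ_k is indeed a function Ĥ_k of the inputs x_0, ..., x_{k-1} only, which is the parameterized family discussed above.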
the space of bounded continuous functions from R to R:

Y = C_b(R, R).   (4)

Vector-valued outputs can be handled by considering each output component separately. In continuous time, the target relationship (ground truth) to be learned is

y_t = H_t(x),   t ∈ R,   (5)

where for each t ∈ R, H_t is a functional H_t : X → R. Correspondingly, we define a continuous version of (2) as a hypothesis space to model continuous-time functionals:

(d/dt) h_t = σ(W h_t + U x_t),   ŷ_t = c^T h_t,   (6)

whose Euler discretization corresponds to a discrete-time residual RNN. The dynamics then naturally define a sequence of functionals {Ĥ_t(x) = ŷ_t : t ∈ R}, which can be used to approximate the target functionals {H_t} by adjusting (c, W, U).

Linear RNNs in continuous time. In this paper we mainly investigate the approximation and optimization properties of linear RNNs, which already reveal the essential effect of dynamics. The linear RNN obeys (6) with σ being the identity map. Notice that in the theoretical setup, the initial time of the system goes back to -∞ with lim_{t→-∞} x_t = 0 for all x ∈ X; thus by linearity (H_t(0) = 0) we specify the initial condition of the hidden state as h_{-∞} = 0 for consistency. In this case, (6) has the solution

ŷ_t = ∫_0^∞ c^T e^{Ws} U x_{t-s} ds.   (7)

Since we will investigate uniform approximations over large time intervals, we consider stable RNNs, where W ∈ W_m with

W_m = {W ∈ R^{m×m} : all eigenvalues of W have negative real parts}.   (8)

Owing to the representation of solutions in (7), the linear RNN defines a family of functionals Ĥ := ∪_{m≥1} Ĥ_m, where

Ĥ_m := { {Ĥ_t(x) : t ∈ R} : Ĥ_t(x) = ∫_0^∞ c^T e^{Ws} U x_{t-s} ds, W ∈ W_m, U ∈ R^{m×d}, c ∈ R^m }.   (9)

Here, m is the width of the network and controls the complexity of the hypothesis space. Clearly, the family of functionals the RNN can represent is not arbitrary, and must possess some structure. Let us now introduce some definitions of functionals that make these structures precise.

Definition 3.1.
Let {H_t : t ∈ R} be a sequence of functionals.

1. H_t is causal if it does not depend on future values of the input: for every pair x, x′ ∈ X such that x_s = x′_s for all s ≤ t, we have H_t(x) = H_t(x′).

2. H_t is linear and continuous if H_t(λx + λ′x′) = λH_t(x) + λ′H_t(x′) for any x, x′ ∈ X and λ, λ′ ∈ R, and sup_{x∈X, ‖x‖_X ≤ 1} |H_t(x)| < ∞, in which case the induced norm is defined as ‖H_t‖ := sup_{x∈X, ‖x‖_X ≤ 1} |H_t(x)|.

3. H_t is regular if for any sequence {x(n) ∈ X : n ∈ N} such that x(n)_t → 0 for Lebesgue-almost every t ∈ R, we have lim_{n→∞} H_t(x(n)) = 0.

4. {H_t : t ∈ R} is time-homogeneous if H_t(x) = H_{t+τ}(x^{(τ)}) for any t, τ ∈ R, where x^{(τ)}_s = x_{s-τ} for all s ∈ R, i.e. x^{(τ)} is x with its time index shifted to the right by τ.

Linearity, continuity and causality are standard notions. One can think of regular functionals as those that are not determined by the values of the input on an arbitrarily small time interval, e.g. an infinitely thin spike input. Time-homogeneous functionals, on the other hand, are those with no special reference point in time: if the time indices of both the input sequence and the functional are shifted in coordination, the output value remains the same. Given these definitions, the following observation can be verified directly; its proof is immediate and hence omitted.

Proposition 3.1. Let {Ĥ_t : t ∈ R} be a sequence of functionals in the RNN hypothesis space Ĥ (see (9)). Then for each t ∈ R, Ĥ_t is a causal, continuous, linear and regular functional. Moreover, the sequence of functionals {Ĥ_t : t ∈ R} is time-homogeneous.
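The properties in Proposition 3.1 can be sanity-checked numerically. The sketch below (an illustration with assumed parameter values; W is taken diagonal so that e^{Ws} is explicit) evaluates the convolution form (7) on a grid and checks time-homogeneity: delaying the input delays the output by the same amount, and causality: the output vanishes before the input's support.

```python
import numpy as np

dt = 0.01
s = np.arange(0, 20, dt)               # truncated integration grid for (7)
w = np.array([1.0, 2.0, 3.0])          # -W eigenvalues (positive => stable)
a = np.array([0.5, -1.0, 2.0])         # a_i = c_i U_i, taking d = 1
kernel = (a[:, None] * np.exp(-w[:, None] * s)).sum(0)   # c^T e^{Ws} U

t = np.arange(0, 40, dt)
x = np.exp(-((t - 10.0) ** 2))         # smooth input supported near t = 10

# Causal convolution on the grid: yhat_t ~ sum_s kernel(s) x_{t-s} dt
y = np.convolve(x, kernel)[: len(t)] * dt

# Causality: the input is (numerically) zero near t = 0, hence so is yhat.
assert abs(y[0]) < 1e-30

# Time-homogeneity: delaying x by tau delays yhat by tau.
tau_steps = 500                        # tau = 5.0
x_shift = np.roll(x, tau_steps); x_shift[:tau_steps] = 0.0
y_shift = np.convolve(x_shift, kernel)[: len(t)] * dt
assert np.allclose(y_shift[tau_steps:], y[: len(t) - tau_steps], atol=1e-8)
```

Linearity and regularity hold by the same convolution structure: the output depends on x only through an integral against an integrable kernel.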

4. APPROXIMATION THEORY

The most basic approximation problem for RNNs is as follows: given a sequence of target functionals {H_t : t ∈ R} satisfying appropriate conditions, does there always exist a sequence of RNN functionals {Ĥ_t : t ∈ R} in Ĥ such that H_t ≈ Ĥ_t for all t ∈ R? We first make an important remark on how the current formulation differs from previous investigations of RNN approximation: we do not assume that the target functionals {H_t : t ∈ R} are themselves generated by an underlying dynamical system of the form H_t(x) = y_t with

ḣ_t = f(h_t, x_t),   y_t = g(h_t),

for some linear or nonlinear functions f, g. This differs from previous work, where it is assumed that the sequence of target functionals is indeed generated by such a system; in that case, the approximation problem reduces to approximating the functions f, g, and the obtained results often resemble those for feed-forward networks. In our case, however, we consider general input-output relationships given by temporal sequences of functionals, with no recourse to the mechanism from which these relationships are generated. This is more flexible and natural, since in applications it is often unclear how to describe the data-generation process. Moreover, notice that in the linear case, if the target functionals {H_t} were generated by a linear ODE system, the approximation question would be trivial: as long as the dimension of h_t in the approximating RNN is at least that of the system generating the data, we have perfect approximation. In the more general setting considered here, however, the question becomes much more interesting, even in the linear regime. In fact, we now prove precise approximation theorems and characterize approximation rates that reveal intricate connections with memory effects, which may otherwise be obscured in more limited settings.
Our first main result is a converse of Proposition 3.1, in the form of a universal approximation theorem for certain classes of linear functionals. The proof is found in Appendix A.

Theorem 4.1 (Universal approximation of linear functionals). Let {H_t : t ∈ R} be a family of continuous, linear, causal, regular and time-homogeneous functionals on X. Then, for any ε > 0 there exists {Ĥ_t : t ∈ R} ∈ Ĥ such that

sup_{t∈R} ‖H_t - Ĥ_t‖ ≡ sup_{t∈R} sup_{‖x‖_X ≤ 1} |H_t(x) - Ĥ_t(x)| ≤ ε.

The proof relies on the classical Riesz-Markov-Kakutani representation theorem, which says that each linear functional H_t can be uniquely associated with a signed measure μ_t such that H_t(x) = ∫_R x_s^T dμ_t(s). Owing to the assumptions of Theorem 4.1, we can further show that the sequence of representations {μ_t} is induced by an integrable function ρ : [0, ∞) → R^d such that {H_t} admits the common representation

H_t(x) = ∫_0^∞ x_{t-s}^T ρ(s) ds,   t ∈ R, x ∈ X.   (12)

Comparing this representation with the solution (7) of the continuous-time RNN, we find that the approximation property of linear RNNs is closely related to how well ρ(t) can be approximated by exponential sums of the form (c^T e^{Wt} U)^T. Intuitively, (12) says that each output y_t = H_t(x) is simply a convolution of the input signal with the kernel ρ; the smoothness and decay of the input-output relationship are therefore characterized by the convolution kernel ρ. Due to this observation, we will hereafter refer to {H_t} and ρ interchangeably.

Approximation rates. While the previous result establishes the universal approximation property of linear RNNs for suitable classes of linear functionals, it does not reveal which functionals can be efficiently approximated. In the practical literature, it is often observed that when there is long-term memory in the inputs and outputs, the RNN becomes quite ill-behaved (Bengio et al., 1994; Hochreiter et al., 2001).
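Via (12), approximating {H_t} by a width-m linear RNN amounts to approximating the kernel ρ by an exponential sum. The sketch below (an illustrative choice of kernel and rates, not from the paper) fixes the decay rates w_i on a nested geometric grid and solves only for the weights a_i by least squares in L^2([0, T]):

```python
import numpy as np

dt, T = 0.01, 20.0
t = np.arange(0.0, T, dt)
rho = np.exp(-t) * np.sin(2.0 * t)        # an illustrative smooth, decaying kernel

def exp_sum_error(m):
    # Nested rates: the grid for width m is contained in that for larger m,
    # so the least-squares error is guaranteed non-increasing in m.
    w = 0.2 * 1.5 ** np.arange(m)
    basis = np.exp(-np.outer(t, w))       # design matrix of exponentials
    a, *_ = np.linalg.lstsq(basis, rho, rcond=None)
    return np.sqrt(np.sum((basis @ a - rho) ** 2) * dt)   # L^2([0,T]) error

errors = [exp_sum_error(m) for m in (2, 4, 8, 16)]
assert all(e2 <= e1 + 1e-8 for e1, e2 in zip(errors, errors[1:]))
```

How fast `errors` shrinks with m is exactly the approximation-rate question addressed next; fixing the rates in advance is a simplification, since the RNN can also adapt the spectrum of W.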
It is the purpose of this section to establish results that make these heuristic statements precise. In particular, we will show that the rate at which linear functionals can be approximated by linear RNNs depends on the former's smoothness and memory properties. We note that this is a much less explored area in the approximation theory of RNNs. To characterize the smoothness and memory of linear functionals, we may pass to investigating their action on constant input signals: for i = 1, ..., d, let e_i denote the input signal equal to the i-th unit vector for s ≥ 0 and zero otherwise, and set y_i(t) := H_t(e_i).

Theorem 4.2 (Approximation rate). Let {H_t : t ∈ R} satisfy the assumptions of Theorem 4.1, and suppose that each y_i is (α+1)-times differentiable for some α ∈ N_+, with derivatives decaying exponentially: there exist β > 0 and γ > 0 such that

e^{βt} |y_i^{(k)}(t)| ≤ γ,   k = 1, ..., α+1,  i = 1, ..., d,   (13)

where y_i^{(k)}(t) denotes the k-th derivative of y_i(t). Then, there exists a universal constant C(α), depending only on α, such that for any m ∈ N_+ there exists a sequence of width-m RNN functionals {Ĥ_t : t ∈ R} ∈ Ĥ_m such that

sup_{t∈R} ‖H_t - Ĥ_t‖ ≡ sup_{t∈R} sup_{‖x‖_X ≤ 1} |H_t(x) - Ĥ_t(x)| ≤ C(α) γ d / (β m^α).   (14)

The curse of memory in approximation. For the approximation of non-linear functions by linear combinations of basis functions, one often suffers from the "curse of dimensionality" (Bellman, 1957), in that the number of basis functions required to achieve a given accuracy increases exponentially with the dimension d of the input space. In Theorem 4.2, by contrast, the bound scales linearly with d. This is because the target functional possesses a linear structure, hence each dimension can be approximated independently of the others, resulting in an additive error estimate. Nevertheless, the presence of the temporal dimension introduces another type of challenge, which we coin the curse of memory. Let us now discuss this point in detail. The key observation is that the rate result requires exponential decay of the derivatives of H_t(e_i), whereas the density result (Theorem 4.1) makes no such assumption. The natural question is thus: what happens when no exponential decay is present?
We assume d = 1 and consider an example in which the target functional's representation satisfies ρ ∈ C^{(1)}(R) and ρ(t) ~ t^{-(1+ω)} as t → +∞. Here ω > 0 indicates the decay rate of the memory in the target functional family: the smaller its value, the slower the decay and the longer the system memory. For any ω > 0, the system's memory vanishes more slowly than any exponential decay. Notice that y^{(1)}(t) = ρ(t), so there exists no β > 0 making (13) true, and no rate estimate can be deduced from it. A natural way to circumvent this obstacle is to introduce a truncation in time. Given T ≫ 1, we can define ρ̃ ∈ C^{(1)}(R) such that ρ̃(t) ≡ ρ(t) for t ≤ T, ρ̃(t) ≡ 0 for t ≥ T + 1, and ρ̃ monotonically decreasing on [T, T+1]. With the auxiliary linear functional H̃_t(x) := ∫_0^t x_{t-s} ρ̃(s) ds, we obtain the error estimate (technical details in Appendix B)

sup_{t∈R} ‖H_t - Ĥ_t‖ ≤ sup_{t∈R} ‖H_t - H̃_t‖ + sup_{t∈R} ‖H̃_t - Ĥ_t‖ ≤ C ( T^{-ω} + (ω/m) T^{1-ω} ).   (15)

To achieve an error tolerance ε, the first term requires T ~ ε^{-1/ω}, and then the second term gives

m = O( ω T^{1-ω} / ε ) = O( ω ε^{-1/ω} ).   (16)

This estimate gives a quantitative relationship between the number of degrees of freedom needed and the decay speed. With ρ(t) ~ t^{-(1+ω)}, the system has long memory when ω is small. Denote by m(ω, ε) the minimum number of terms needed to achieve an L^1 error of ε. The above estimate shows that an upper bound on m(ω, ε) goes to infinity exponentially fast as ω → 0^+ with ε fixed. This is akin to the curse of dimensionality, but this time in memory, and it manifests itself even in the simplest linear settings. A stronger result would be that a lower bound on m(ω, ε) also diverges exponentially fast as ω → 0^+ with ε fixed; this is a point of future work.
Note that this kind of estimate differs from previous results in the literature (Kammler, 1979b; Braess & Hackbusch, 2005), which concern the order of m(ω, ε) ~ log(1/ε) as ε → 0 with fixed ω = 1 or 2, in the L^∞ or L^1 sense. Note also that the corresponding L^1 result has not been proved.

5. FINE-GRAINED ANALYSIS OF OPTIMIZATION DYNAMICS

According to Section 4, memory plays an important role in determining approximation rates. The result therein depends only on the model architecture and does not concern the actual training dynamics. In this section, we perform a fine-grained analysis of the latter, which again reveals an interesting interaction between memory and learning. The loss function for training is

E_x J_T(x; c, W, U) := E_x |Ĥ_T(x) - H_T(x)|^2 = E_x ( ∫_0^T [c^T e^{Wt} U - ρ(t)]^T x_{T-t} dt )^2.   (17)

Without loss of generality, the input time series x is assumed to be cut off at zero, i.e. x_t = 0 for all t ≤ 0 almost surely. Training the RNN amounts to optimizing E_x J_T with respect to the parameters (c, W, U). The most commonly applied method is gradient descent (GD) or its stochastic variants (say, SGD), which update the parameters in the steepest-descent direction.

We first show that the training dynamics of E_x J_T exhibit very different behaviors depending on the form of the target functionals. Take d = 1 and consider learning different target functionals with white-noise input x. We investigate two choices of ρ: a simple decaying exponential sum, and a scaled Airy function ρ(t) = Ai(s_0 [t - t_0]), where Ai is the Airy function of the first kind, given by the improper integral

Ai(t) = (1/π) lim_{ξ→∞} ∫_0^ξ cos( u^3/3 + t u ) du.

The effective rate of decay is controlled by the parameter t_0: for t ≤ t_0 the Airy function is oscillatory, hence for large t_0 a large amount of memory is present in the target. Observe from Figure 1 that training proceeds efficiently in the exponential-sum case. In the Airy case, however, there is an interesting "plateauing" behavior in the training loss, where the loss decrement slows down significantly after some time in training. The plateau is sustained for a long time before an eventual reduction is observed.
As a further demonstration that this behavior may be generic, we also consider a nonlinear forced dynamical system, the Lorenz 96 system (Lorenz, 1996), where a similar plateauing behavior is observed even for a non-linear RNN model trained with the Adam optimizer (Kingma & Ba, 2015). All experimental details are found in Appendix C.3.1.

[Figure 1: training-loss curves. The shaded region depicts the mean ± the standard deviation over 10 independent runs with randomized initialization. Observe that learning complex functionals (Airy, Lorenz) suffers from slowdowns in the form of long plateaus.]

The results in Figure 1 hint at the fact that some functionals are much harder to learn than others, and it is the purpose of the remaining analyses to understand precisely when and why such difficulties occur. In particular, we will again relate this to the memory effects in the target functional, which shows yet another facet of the curse of memory, this time in optimization.

Dynamical analysis. To make the analysis amenable, we simplify the loss function (17) by assuming that x is white noise, d = 1, T → ∞, and the recurrent kernel W is diagonal. This allows us to write (see Appendix C.1 for details) the optimization problem as

min_{a ∈ R^m, w ∈ R^m_+} J(a, w) := ∫_0^∞ ( Σ_{i=1}^m a_i e^{-w_i t} - ρ(t) )^2 dt.   (18)

We will subsequently see that these simplifications do not lose the key features of the training dynamics, such as the plateauing behavior. We start with an informal discussion of a probable reason behind the plateauing. A straightforward computation shows that, for k = 1, 2, ..., m,

∂J/∂w_k (a, w) = 2 a_k ∫_0^∞ (-t) e^{-w_k t} ( Σ_{i=1}^m a_i e^{-w_i t} - ρ(t) ) dt,   (19)

and a similar expression holds for ∂J/∂a_k. Write the (simplified) linear-functional representation of the linear RNN as ρ̂(t; a, w) := Σ_{i=1}^m a_i e^{-w_i t}, which serves to learn the target ρ. Observe that plateauing under the GD dynamics occurs if the gradient ∇J is small but the loss J is large.
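Formula (19) can be checked against a finite difference on a discretized version of (18); the target kernel, the parameter values, and the truncation horizon below are arbitrary illustrative choices.

```python
import numpy as np

dt, T = 0.001, 30.0
t = np.arange(0.0, T, dt)
rho = np.exp(-0.5 * t)                  # an illustrative target kernel

def residual(a, w):
    # rhohat(t; a, w) - rho(t) on the grid
    return (a[:, None] * np.exp(-np.outer(w, t))).sum(0) - rho

def J(a, w):
    # discretized version of (18)
    return np.sum(residual(a, w) ** 2) * dt

def dJ_dw(a, w, k):
    # discretized version of (19)
    r = residual(a, w)
    return 2.0 * a[k] * np.sum(-t * np.exp(-w[k] * t) * r) * dt

a = np.array([0.8, -0.3])
w = np.array([1.0, 2.0])
k, eps = 0, 1e-6
w_p = w.copy(); w_p[k] += eps
w_m = w.copy(); w_m[k] -= eps
fd = (J(a, w_p) - J(a, w_m)) / (2 * eps)   # central finite difference
assert abs(dJ_dw(a, w, k) - fd) < 1e-6
```

The same check applies to ∂J/∂a_k, whose integrand lacks the factor (-t) a_k but carries the same exponentially damped residual, which is the quantity driving the analysis that follows.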
A sufficient condition is that the residual ρ̂(t; a, w) - ρ(t) is large only at large t, where the exponential multiplier of the residual in (19) is small. That is, the learned functional differs from the target functional only at large times. This again relates to long-term memory in the target. Based on this observation, we build the memory effect explicitly into the target functional by considering ρ of the parameterized form

ρ_ω(t) := ρ̄(t) + ρ_{0,ω}(t),   (20)

where ρ̄ is a part that can be well approximated by the model ρ̂, e.g. the exponential sum ρ̄(t) = Σ_{j=1}^{m*} a*_j e^{-w*_j t} (with w*_j > 0, j = 1, ..., m*). The other part, ρ_{0,ω}(t) := ρ_0(t - 1/ω), controls the target memory, with ρ_0 any bounded template function in L^2(R) ∩ C^2(R) with sub-Gaussian tails. As ω → 0^+, the support of ρ_{0,ω} shifts towards large times, modelling the dominance of long-term memory. In this case, if the initialization satisfies ρ̂ ≈ ρ̄, the sufficient condition informally discussed above is satisfied as ω → 0^+, which heuristically leads to plateauing. A simple example of (20) is

ρ_ω(t) = a* e^{-w* t} + c_0 e^{-(t - 1/ω)^2 / (2σ^2)},

where a*, c_0, σ ≠ 0 and w* > 0 are fixed constants; this corresponds to the simple case m* = 1 with ρ_0 a Gaussian density. Observe that as ω → 0^+, the memory of the sequence of functionals represented by ρ_ω increases. It can be numerically verified that this simple target gives rise to the plateauing behavior, which becomes significantly worse as ω → 0^+ (see Figure 2 in Appendix C.3.2). Our main result on training dynamics quantifies the plateauing behavior theoretically for general functionals possessing the decomposition (20). For rigorous statements and detailed proofs, see Theorem C.1 in Appendix C.2.

Theorem 5.1. Define the loss function J_ω as in (18) with the target ρ = ρ_ω as defined in (20).
Consider the gradient-flow training dynamics

d/dτ θ_ω(τ) = -∇J_ω(θ_ω(τ)),   θ_ω(0) = θ_0,

where θ_ω(τ) := (a_ω(τ), w_ω(τ)) ∈ R^{2m} for any τ ≥ 0, and θ_0 := (a_0, w_0). For any ω > 0, m ∈ N_+, θ_0 ∈ R^m × R^m_+ and 0 < δ ≪ 1, define the hitting time

τ_0 = τ_0(δ; ω, m, θ_0) := inf { τ ≥ 0 : |J_ω(θ_ω(τ)) - J_ω(θ_0)| > δ }.   (22)

Assume that m > m*, and that the initialization is bounded and satisfies ρ̂(t; θ_0) ≈ ρ̄(t). Then

τ_0(δ; ω, m, θ_0) ≳ ω^2 e^{c_0/ω} min { δ/√m, ln(1 + δ) }   (23)

for any sufficiently small ω > 0, where c_0 and ≳ hide universal positive constants independent of ω, m and θ_0.

Let us sketch the intuition behind Theorem 5.1. Suppose that we currently have a good approximation ρ̂ of the short-term memory part ρ̄. Then the loss is large (J = O(1)), since the long-term memory part ρ_{0,ω} is not accounted for; the gradient, however, is small (∇J = o(1)), since the gradient contribution of the long-term memory part is concentrated at large t and thus modulated by exponentially decaying multipliers (see (19)). This implies a slowdown of the training dynamics in the region ρ̂ ≈ ρ̄. It remains to estimate a lower bound on the timescale of escaping from this region, which depends on the curvature of the loss function. In particular, we show that ∇^2 J is positive semi-definite when ω = 0, but has O(1) positive eigenvalues and multiple o(1) (possibly exponentially small) eigenvalues for any 0 < ω ≪ 1. Hence, a local linearization analysis implies an exponentially increasing escape timescale, as indicated in (23). While the target form (20) may appear restrictive, we emphasize that some restriction on the type of functionals is necessary, since plateauing does not always occur (see Figure 1). In fact, a goal of the preceding analysis is to exhibit a family of functionals for which exponential slowdowns in training provably occur, related to the memory of the target functionals in a precise way.

The curse of memory in optimization.
The timescale proved in Theorem 5.1 is verified numerically in Figure 3 in Appendix C.3.3, where we also show that the analytical setting here is representative of more general cases: plateauing occurs even for non-linear RNNs trained with accelerated optimizers, as long as the target functional has the memory structure imposed in (20). Theorem 5.1 reveals another aspect of the curse of memory, this time in optimization. When ω → 0^+, the influence of the target functional H_t does not decay, much like the case considered in the curse of memory in approximation. But whereas in approximation an exponentially large number of hidden states is required to achieve a given tolerance, in optimization the adverse effect of memory appears as exponentially pronounced slowdowns of the gradient-flow training dynamics. While this is theoretically proved under sensible but restrictive settings, we show numerically in Appendix C.3.3 (Figure 4) that it is representative of more general cases. In the literature, a number of results have been obtained on the training dynamics of RNNs. A positive result for training by GD is established in Hardt et al. (2018), but in the setting of identifying hidden systems, i.e. the target functional comes from a linear dynamical system, hence it possesses good decay properties provided stability. On the other hand, convergence can also be ensured if the RNN is sufficiently over-parameterized (large m; Allen-Zhu et al. (2019)). However, both of these settings may not be sufficient in reality. Here we provide an alternative analysis of a setting that is representative of the difficulties one may encounter in practice. In particular, the curse of memory established here is consistent with the difficulty of RNN training often observed in applications, where heuristic attributions to memory are often alluded to: Hu et al. (2018); Campos et al.
(2017); Talathi & Vartak (2015) ; Li et al. (2018) . The analysis here makes the connection between memories and optimization difficulties precise, and may form a basis for future developments to overcome such difficulties in applications.
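The mechanism underlying Theorem 5.1 — an O(1) loss coexisting with a vanishing gradient near the short-term part of the target — can be checked numerically. Below is a minimal sketch under illustrative assumptions of our own (not the paper's experimental configuration): a target of the form (20) with short-term part e^{−t} and Gaussian template ρ_0(t) = e^{−t²}, which satisfies the sub-Gaussian tail condition (79); the loss (77) and its gradient are evaluated by midpoint quadrature.

```python
import math

def loss_and_grad(omega, a, w, n=8000):
    # J_omega and its gradient for the model sum_i a_i e^{-w_i t} against the
    # target rho_omega(t) = e^{-t} + rho_0(t - 1/omega), rho_0(t) = e^{-t^2},
    # via midpoint quadrature on [0, T].
    T = 1.0 / omega + 12.0
    dt = T / n
    J, ga, gw = 0.0, [0.0] * len(a), [0.0] * len(a)
    for k in range(n):
        t = (k + 0.5) * dt
        model = sum(ai * math.exp(-wi * t) for ai, wi in zip(a, w))
        r = model - (math.exp(-t) + math.exp(-(t - 1.0 / omega) ** 2))
        J += r * r * dt
        for i in range(len(a)):
            e = math.exp(-w[i] * t)
            ga[i] += 2.0 * r * e * dt             # dJ/da_i
            gw[i] -= 2.0 * r * a[i] * t * e * dt  # dJ/dw_i
    return J, math.sqrt(sum(g * g for g in ga + gw))

# Initialize so that rho(t; a, w) matches the short-term part e^{-t} exactly
# (m = 2 > m* = 1, with a degenerate second unit).
a, w = [1.0, 0.0], [1.0, 2.0]
results = [(om,) + loss_and_grad(om, a, w) for om in (0.5, 0.25, 0.125)]
for om, J, g in results:
    print(f"omega={om}: loss={J:.4f}  |grad|={g:.3e}")
```

The loss stays near ‖ρ_0‖²_{L²} ≈ 1.25 for every ω, while the gradient norm collapses roughly like e^{−w_min/ω} as ω decreases — exactly the separation between Propositions C.1 and C.2 that drives the slowdown quantified in (23).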

6. CONCLUSION

In this paper, we analyzed the basic approximation and optimization aspects of using RNNs to learn input-output relationships involving temporal sequences in the linear, continuous-time setting. In particular, we coined the concept of the curse of memory and revealed two of its facets: when the target relationship has long-term memory, both approximation and optimization become exceedingly difficult. These analyses make concrete the heuristic observations on the adverse effect of memory on learning with RNNs. Moreover, they quantify the interaction between the structure of the model (RNN functionals) and the structure of the data (target functionals), a much less studied topic. Here, we adopt a continuous-time approach in order to gain access to more quantitative tools, including classical results in approximation theory and stochastic analysis, which help us derive precise results on approximation rates and optimization dynamics. The extension of these results to discrete time may be performed via numerical analysis in subsequent work. More broadly, this approach may be a basic starting point for understanding learning from partially observed time series data in general, including gated RNN variants (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and other methods such as transformers and convolution-based approaches (Vaswani et al., 2017; Oord et al., 2016). These are certainly worthy of future exploration.

A UNIVERSAL APPROXIMATION THEOREM OF LINEAR FUNCTIONALS BY LINEAR RNNS

A key simplification of considering linear functionals is due to the classical representation result below, which allows us to pass from the approximation of functionals to the approximation of functions. In short, this theorem says that while for each measure µ, x → R x s dµ(s) defines a linear functional, this is in fact the only way to define a linear functional: for any linear functional H there exists a unique measure µ such that H(x) = R x s dµ(s). Theorem A.1 (Riesz-Markov-Kakutani representation theorem). Let H : X → R be a continuous linear functional. Then, there exists a unique, vector-valued, regular, countably additive signed measure µ on R such that H(x) = R x s dµ(s) = d i=1 R x s,i dµ i (s). ( ) Moreover, we have H := sup x X ≤1 |H(x)| = µ 1 (R) := i |µ i |(R). Proof. Well-known, see e.g. Bogachev ( 2007), CH 7.10.4. We will use the representation theorem to prove Theorem 4.1. First, we prove some lemmas. The first shows that there is in fact a common representation of a sequence of linear functionals satisfying the assumptions of Theorem 4.1. Lemma A.1. Let {H t } be a family of continuous, linear, regular, causal and time homogeneous functionals on X . Then, there exists a measurable function ρ : [0, ∞) → R d that is integrable, i.e. ρ L 1 ([0,∞)) := d i=1 ∞ 0 |ρ i (s)|ds < ∞ (26) and H t (x) = ∞ 0 x t-s ρ(s)ds, t ∈ R. In particular, {H t } is uniformly bounded with sup t H t = ρ L 1 ([0,∞)) and t → H t (x) is continuous for all x ∈ X . Proof. By the Riesz-Markov-Kakutani representation theorem (Theorem A.1), for each t there is a unique regular signed Borel measure µ t such that H t (x) = R x s dµ t (s), and i |µ t,i |(R) = H t . Since {H t } is causal, we must have ∞ t x s dµ t (s) = 0 for any x and thus H t (x) = t -∞ x s dµ t (s). (29) Now, by time homogeneity we have t -∞ x s dµ t (s) = H t (x) = H t+τ (x (τ ) ) = t+τ -∞ x s-τ dµ t+τ (s). Take τ = -t and set µ = -µ 0 to get H t (x) = ∞ 0 x t-s dµ(s). 
Note that we have µ 1 ([0, ∞)) = µ 0 1 ([0, ∞)) = H 0 = H t , and continuity follows from the fact that |H t+δ (x) -H t (x)| = ∞ 0 (x t+δ-s -x t-s ) dµ(s) ≤ i ∞ 0 x t+δ-s -x t-s ∞ d|µ i |(s), which converges to 0 as δ → 0 by dominated convergence theorem. Finally, we will show that each µ i is absolutely continuous with respect to λ (Lebesgue measure). Take a measurable E ⊂ [0, ∞) such that λ(E) = 0 and set E = [0, ∞) \ E. For each n ≥ 0 set K n ⊂ E, K n ⊂ E where K n , K n are closed and µ i (E \ K n ) ≤ 1/n, µ i (E \ K n ) ≤ 1/n. For a fixed i ∈ {1, . . . , d}, define x (n) to be such that x (n) t-s,j = 0 for all j = i and all s. For j = i, we set x (n) t-s,i = 1 if s ∈ K n and 0 if s ∈ K n , which can then be continuously extended to [0, ∞). Observe that by construction, x (n) t-s → 0 for λ-a.e. s, thus by dominated convergence theorem 0 = lim n→∞ H t (x (n) ) = µ i (E). ( ) This shows that µ i is absolutely continuous with respect to λ, and by the Radon-Nikodym theorem there exists a measurable function ρ i : [0, ∞) → R such that for any measurable A ⊂ R we have A dµ i (s) = A ρ i (s)ds, for i = 1, . . . , d. Hence, we have H t (x) = ∞ 0 x t-s ρ(s)ds (35) with ρ L 1 ([0,∞)) = i ∞ 0 |ρ i (s)|ds = µ 1 ([0, ∞)) < ∞. Lemma A.2. Let ρ : [0, ∞) → R a Lebesgue integrable function, i.e. ρ L 1 ([0,∞)) < ∞. Then, for any > 0, there exists a polynomial p with p(0) = 0 such that ρ -p(e -• ) L 1 ([0,∞)) = ∞ 0 |ρ(t) -p(e -t )|dt ≤ . ( ) Proof. The approach here is similar to that of the approximation of functions using exponential sums (Kammler, 1976; Braess, 1986) . Alternatively, one may also appeal to the density of phase type distributions (He & Zhang, 2007; O'Cinneide, 1990) in the space of positive distributions, and generalizing them to signed measures. Fix > 0. Define R(u) = 1 u ρ(-log u), u ∈ (0, 1], 0, u = 0. Then, we can check that R L 1 ([0,1]) = ρ L 1 ([0,∞)) < ∞. 
( ) By density of continuous functions in L 1 there exists a continuous function R on [0, 1] with R(0) = 0 such that R -R L 1 ([0,1]) ≤ /2. ( ) By Müntz-Szász theorem (Müntz, 1914; Szász, 1916) , there exists a polynomial p with p(0 ) = 0 such that q -R L 1 ([0,1]) ≤ /2, ( ) and q(u) := p(u)/u is also a polynomial. Therefore, we have ρ -p(e -• ) L 1 ([0,∞)) = 1 0 |R(u) -p(u)/u|du ≤ 1 0 |R(u) -R(u)|du + 1 0 | R(u) -p(u)/u|du ≤ . We are now ready to present the proof of Theorem 4.1. Proof of Theorem 4.1. By (7), for each { Ĥt } ∈ Ĥ we can write Ĥt (x) = ∞ 0 x t-s (U [e W s ] c)ds. By Lemma A.1, we can write H t (x) = ∞ 0 x t-s ρ(s)ds, ( ) where ρ is integrable. Thus, we can apply Lemma A.2 to conclude that there exists polynomials p i , i = 1, . . . , d with p i (0) = 0 such that i ρ i -p i (e -• ) L 1 ([0,∞)) ≤ . ( ) Notice that we can write each p i (u) = m j=1 α ij u j for some m equaling the maximal order of {p i }. Taking W = diag(-1, . . . , -m), c = (1, . . . , 1) and U ij = α ji , we have (U [e W s ] c) i = p i (e -s ), i = 1, . . . , d. Consequently, we have for any x with x ∞ ≤ 1, |H t (x) -Ĥt (x)| = ∞ 0 x t-s ρ(s)ds - ∞ 0 x t-s p(e -s )ds ≤ i ∞ 0 |x t-s,i | ρ i (s) -p i (e -s ) ds ≤ i ρ i -p i (e -• ) L 1 ([0,∞)) ≤ .
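The substitution at the heart of Lemma A.2 can be illustrated numerically. The sketch below uses an example of our own, ρ(t) = t e^{−2t}, for which R(u) = ρ(−log u)/u = −u log u on (0, 1]. We approximate R by Chebyshev interpolation rather than by the density argument of the proof (any polynomial approximation scheme suffices for illustration), set p(u) = u q(u) so that p(0) = 0, and verify that the L¹ errors in the u-domain and the t-domain coincide, as the change of variables predicts.

```python
import math

def rho(t):
    return t * math.exp(-2.0 * t)

def R(u):
    # R(u) = rho(-log u)/u = -u log u, extended by 0 at u = 0.
    return -u * math.log(u) if u > 0.0 else 0.0

def chebyshev_interpolant(f, n):
    # Barycentric Lagrange interpolation at n+1 Chebyshev points on [0, 1].
    xs = [(1.0 + math.cos(k * math.pi / n)) / 2.0 for k in range(n + 1)]
    ws = [((-1.0) ** k) * (0.5 if k in (0, n) else 1.0) for k in range(n + 1)]
    fs = [f(x) for x in xs]
    def q(u):
        num = den = 0.0
        for x, wt, fv in zip(xs, ws, fs):
            if u == x:
                return fv
            num += wt * fv / (u - x)
            den += wt / (u - x)
        return num / den
    return q

def l1_error_u(q, n=20000):
    # ∫_0^1 |R(u) - q(u)| du  (midpoint rule)
    du = 1.0 / n
    return sum(abs(R((k + 0.5) * du) - q((k + 0.5) * du)) for k in range(n)) * du

def l1_error_t(q, T=40.0, n=20000):
    # ∫_0^∞ |rho(t) - p(e^{-t})| dt with p(u) = u q(u); the substitution
    # u = e^{-t} shows this equals the u-domain error above.
    dt = T / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        u = math.exp(-t)
        total += abs(rho(t) - u * q(u)) * dt
    return total

q4, q16 = chebyshev_interpolant(R, 4), chebyshev_interpolant(R, 16)
e4, e16 = l1_error_u(q4), l1_error_u(q16)
print(e4, e16, l1_error_t(q16))
```

Raising the degree drives the L¹ error down, and the t-domain error of p(e^{−t}) matches the u-domain error of q, which is precisely the mechanism the proof exploits.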

B APPROXIMATION RATES AND THE CURSE OF MEMORY IN APPROXIMATION

We first give the proof of Theorem 4.2. Proof of Theorem 4.2. We fix i ∈ {1, . . . , d} below until the last part of the proof. By Lemma A.1, there exists ρ i (t) ∈ C α [0, ∞) such that y i (t) = H t (e i ) = t 0 ρ i (r)dr, t ≥ 0. ( ) By the assumption, ρ (k) i (t) = o(e -βt ) as t → ∞, k = 0, . . . , α. Consider the transform q i (s) = 0 s = 0, ρi( -(α+1) log s β ) s s ∈ (0, 1]. For k = 0, . . . , α, one can prove by induction that q (k) i (s) = (-1) k k j=0 c(j, k) α + 1 β j ρ (j) i -(α+1) log s β s k+1 , ( ) where c(j, k) are some integer constants. Together with the assumption, we have q (k) i (e -β α+1 t ) = k j=0 c(j, k) α + 1 β j ρ (j) i (t) e -(k+1)β α+1 t ≤ k j=0 c(j, k)|(α + 1) j γ ≤ C(α)γ, (51) where C(α) is a universal constant only depending on α. Note that for j = 0, . . . , α, lim s→0 + ρ (j) i -(α+1) log s β s k+1 = lim t→∞ ρ (j) i (t) e -(k+1)β α+1 t = lim t→∞ ρ (j) i (t) e -βt e -(α-k)β α+1 t = 0, hence q i (s) ∈ C α [0, 1] with q i (0) = q (1) i (0) = • • • = q (α) i (0) = 0. By Jackson's theorem (Jackson, 1930) , for m = 1, 2, . . . , there exists a polynomial Q i,m of degree m -1 such that q i -Q i,m L ∞ ([0,1]) ≤ C(α)γ m α . ( ) Denote the polynomial Q i,m as Q i,m (s) = m-1 j=0 α i,j s j , and define φ i,m (t) = e -β α+1 t Q i,m (e -β α+1 t ). (55) Then we have φ i,m (t) = c e W t u i , c = (1, 1, . . . , 1), W =       -β α+1 -2β α+1 . . . -mβ α+1       , ( ) u i = (α i,0 , α i,1 , . . . , α i,m-1 ). ( ) With a change of variable s = e -β α+1 t , we have the estimate ρ i -φ i,m L 1 ([0,∞)) = ∞ 0 |ρ i (t) -φ i,m (t)|dt = 1 0 ρ i -(α + 1) log s β -sQ i,m (s) α + 1 βs ds = α + 1 β 1 0 |q i (s) -Q i,m (s)|ds ≤ C(α)γ βm α . (60) Finally we define U = [u 1 , . . . , u d ] ∈ R m×d and have c e W t U = (φ 1,m (t), . . . , φ d,m (t)). (61) Recall the dynamical system (6) d dt h t = σ(W h t + U x t ), ŷt = c h t , which is together determined by the parameters c, W, U . 
Similar to the argument in the proof of Theorem 4.1, for any x with x ∞ ≤ 1 and t, we have |H t (x) -Ĥt (x)| ≤ i ρ i -φ i,m L 1 ([0,∞)) ≤ C(α)γd βm α . ( ) The curse of memory in approximation. Now we explain more technical details of why Theorem 4.2 implies the curse of memory in approximation, as pointed out in the main text. We assume d = 1 and consider an example H t (x) := t 0 x t-s ρ(s)ds, t ≥ 0, in which the density satisfies ρ(t) ∈ C (1) (R) and ρ(t) ∼ t -(1+ω) as t → +∞. ( ) Here ω > 0 indicates the decay rate of the memory effects in our target functional family {H t }. The smaller its value, the slower the decay and the longer the system memory. Notice that y (1) (t) = ρ(t), and in this case there exists no β > 0 making the following condition ((13) in the main text) true: e βt y (k) i (t) = o(1) as t → +∞, sup t≥0 β -k |e βt y (k) i (t)| ≤ γ, and no rate estimate can be deduced from it. A natural way to circumvent this obstacle is to introduce a truncation in time. With T ( 1) we can define ρ(t) ∈ C (1) (R) such that ρ(t) ≡ ρ(t) for t ≤ T , ρ(t) ≡ 0 for t ≥ T + 1, and ρ(t) is monotonically decreasing for T ≤ t ≤ T + 1. Considering the linear functional Ht (x) := t 0 x t-s ρ(s)ds, we have the truncation error estimate |H t (x) -Ht (x)| ≤ x X ∞ T |ρ(s)|ds ∼ x X T -ω . Now the conclusion of Theorem 4.2 (i.e. ( 63)) is applicable to the truncated { Ht } (with α = 1), and we have for ∀β > 0, there is a linear RNN (U, W, c) such that the associated functionals { Ĥt } ∈ Ĥm satisfy sup t∈R Ht -Ĥt ≤ Cγ βm := C βm sup t≥0 |e βt y (1) (t)| β = Cω m e βT β 2 T ω+1 . ( ) It is straightforward to verify that when β = 2/T , the right-hand side of (68) achieves the minimum, which gives us sup t∈R Ht -Ĥt ≤ Cω m T 1-ω . ( ) Combining ( 67) and ( 69) gives us sup t∈R H t -Ĥt ≤ C T -ω + ω m T 1-ω . In order to achieve an error tolerance , according to the first term above we require T ∼ -1 ω , and then according to second term we have m = O ωT 1-ω = O ω -1 ω . 
This estimate gives a quantitative relationship between the number of degrees of freedom needed and the decay rate of the memory: when ω is small, i.e. the system has long-term memory, the required RNN size m grows like ω ε^{-1/ω}, that is, exponentially in 1/ω.
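Plugging numbers into this estimate shows how quickly the required width blows up. A back-of-the-envelope evaluation with all constants set to 1 (so the values below are orders of magnitude only), taking the tolerance ε = 0.1:

```python
# T ~ eps^{-1/omega} from the truncation term, then m ~ omega * T^{1-omega} / eps.
eps = 0.1
scaling = {}
for omega in (0.5, 0.25, 0.1):
    T = eps ** (-1.0 / omega)
    scaling[omega] = omega * T ** (1.0 - omega) / eps
    print(f"omega={omega}: T ~ {T:.3g}, m ~ {scaling[omega]:.3g}")
# omega = 0.5 gives m ~ 50; omega = 0.25 gives m ~ 2.5e3; omega = 0.1 gives m ~ 1e9.
```

Halving ω from 0.5 to 0.25 already multiplies the required width by fifty, and at ω = 0.1 the bound is astronomically large.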

C FINE-GRAINED ANALYSIS OF OPTIMIZATION DYNAMICS

C.1 SIMPLIFICATIONS ON THE LOSS FUNCTION

Recall the loss function E x J T (x; c, W, U ) = E x T 0 [c e W t U -ρ(t) ]x T -t dt 2 . ( ) The simplifications on E x J T are listed as follows. 1. Take the input data x to be the white noise, so that x T -t dt in distribution = dB t , where B t is the standard d-dimensional Wiener process. As a consequence, simplifying (72) via Itô's isometry gives J T (c, W, U ) := E x J T (x; c, W, U ) = T 0 c e W t U -ρ(t) 2 2 dt. 2. We focus on the temporal dimension and take the spatial dimension d = 1 in (74). 1 Moreover, to investigate the effect of long-term memory, it is necessary to consider the training on large time horizons. Hence, we take T → ∞ to get J ∞ (c, W, b) := ∞ 0 c e W t b -ρ(t) 2 dt, ( ) where b is the sole column of U in (74) and ρ(t) becomes a scalar-valued target. This corresponds to the so-called single-input-single-output (SISO) system. 3. Due to the difficulty of directly analyzing ∇ W e W t and ∇ 2 W e W t , we consider a further simplified ansatz. Assume that W is a diagonal matrix with negative entries (to guarantee the stability of the model). That is, W = -diag(w) with w ∈ R m + . Then we can combine a = b • c (where • denotes the Hadamard product) and rewrite the model as ρ(t; c, W, b) := c e W t b = m i=1 a i e -wit a e -wt ρ(t; a, w). (76) The optimization problem (75) becomes min a∈R m ,w∈R m + J(a, w) := ∞ 0 m i=1 a i e -wit -ρ(t) 2 dt. ( ) Here we omit the subscript ∞. 2 4. We apply a continuous-time idealization of the gradient descent dynamics by considering the gradient flow with respect to J(a, w). That is, a (τ ) = -∇ a J(a(τ ), w(τ )), w (τ ) = -∇ w J(a(τ ), w(τ )), with some initial value a(0 ) = a 0 ∈ R m , w(0) = w 0 ∈ R m + . 
As we will show later, applying the training dynamics (78) to optimize (77) serves as a starting point for the fine-grained dynamical analysis, since it still preserves the plateauing behavior observed in the optimization process (see Figure 1), provided the additional memory structure discussed next.
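A convenient consequence of these simplifications: when the target density ρ is itself a finite exponential sum, the loss (77) has a closed form through the identity ∫_0^∞ e^{−(p+q)t} dt = 1/(p+q), which is the same identity behind the gradient and Hessian formulas used in C.2 below. A minimal check, with illustrative parameters of our own:

```python
import math

def J_closed(a, w, a_star, w_star):
    # Closed-form value of (77) when rho(t) = sum_j a*_j e^{-w*_j t}, using
    # ∫_0^∞ e^{-(p+q)t} dt = 1/(p+q) for each pair of exponentials.
    s = 0.0
    for ai, wi in zip(a, w):
        for aj, wj in zip(a, w):
            s += ai * aj / (wi + wj)
    for aj, wj in zip(a_star, w_star):
        for ak, wk in zip(a_star, w_star):
            s += aj * ak / (wj + wk)
    for ai, wi in zip(a, w):
        for aj, wj in zip(a_star, w_star):
            s -= 2.0 * ai * aj / (wi + wj)
    return s

def J_quad(a, w, a_star, w_star, T=60.0, n=60000):
    # The same quantity by direct midpoint quadrature, as a sanity check.
    dt = T / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        r = (sum(ai * math.exp(-wi * t) for ai, wi in zip(a, w))
             - sum(aj * math.exp(-wj * t) for aj, wj in zip(a_star, w_star)))
        total += r * r * dt
    return total

a, w = [1.0, -0.5], [1.0, 3.0]      # model parameters (illustrative)
a_s, w_s = [0.7], [2.0]             # target exponential sum (illustrative)
print(J_closed(a, w, a_s, w_s), J_quad(a, w, a_s, w_s))
```

The two evaluations agree to quadrature accuracy; the closed form is what makes the explicit gradient and Hessian computations in the next subsection tractable.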

C.2 CONCRETE DYNAMICAL ANALYSIS AND THE CURSE OF MEMORY IN OPTIMIZATION

We prove Theorem 5.1 in this section. The basic insight is, by adding long-term memories in targets, one can increase the loss with little effect on the gradient and Hessian, which leads to a significant slow down of the training dynamics near the short-term memory parts of the targets. Therefore, Theorem 5.1 is proved subsequently in the following procedure. 1. We prove that J ω has a large value but small gradient when ρ(t; a, w) ≡ ρ(t). 1 One can see that the spatial dimension plays little role in the previous approximation analysis (see the proof of Theorem 4.2), since each spatial dimension can be handled separately. 2 The time horizon is always taken as ∞ in the whole analysis. Note that here we also omit an index m (width of the network, relating to the model capacity), since it remains unchanged in the following content if not specified. 2. We prove that when ρ(t; a, w) ≡ ρ(t), the Hessian ∇ 2 J ω is positive semi-definite for ω = 0, but for finite, small ω > 0, ∇ 2 J ω has O(1) positive eigenvalues and multiple o(1) eigenvalues. 3. Based on these results, we perform a local linearization analysis on the gradient flow (21) initialized by ρ(t; a 0 , w 0 ) ≡ ρ(t), from which and by continuity the timescale of plateauing is derived. (1) Preliminary Results As discussed around (20), we consider the target functional with a parametrized representation ρ ω (t) = ρ(t) + ρ 0,ω (t) = m * j=1 a * j e -w * j t + ρ 0 (t -1/ω). Here, a * j := a * (w * j ) = 0, w * j > 0 and w * i = w * j for any i = j, i, j ∈ [m * ],foot_0 and m * < m. The former requirements are just non-degenerate conditions, and the last requirement ensures that the model can perfectly represent the well-approximated part of the target, ρ(t). The memory in the target is controlled by ρ 0,ω (t) = ρ 0 (t -1/ω), with ρ 0 as a fixed template function which satisfies the following assumptions. Assumptions on ρ 0 . (i) ρ 0 (t) ≡ 0; (ii) ρ 0 ∈ L 2 (R) ∩ C 2 (R); (iii) ρ 0 is bounded on R, i.e. 
ρ 0 L ∞ (R) < ∞; (iv) lim t→-∞ ρ 0 (t) = 0. Remark C.1. The above assumptions (i)(ii)(iii) are rather natural, and (iv) only restricts the single side tail of ρ 0 to be zero. In the following analysis, we further focus on ρ 0 with light tails, e.g. the sub-Gaussian tail |ρ 0 (t)| ≤ c 0 e -c1t 2 , ∀t : |t| ≥ t 0 for some fixed positive constants c 0 , c 1 , t 0 . Obviously, Gaussian densities and continuous functions with compact supports are sub-Gaussian functions. We begin by the following preliminary estimate that is used throughout the subsequent analysis. Lemma C.1. For any n ∈ N, ω > 0 and w > 0, let ∆ n,ω (w) := ∞ 0 t n e -wt |ρ 0,ω (t)|dt. (80)

Then

• ∆ n,ω (w) is monotonically decreasing on (0, ∞); • lim ω→0 + ∆ n,ω (w) = 0; • In particular, if ρ 0 is sub-Gaussian, we further have ∆ n,ω (w) ω -n e -w/ω c w 2 2 + c w 3 , ω ∈ (0, min{1/2, 1/t 0 , 2c 1 /w}). ( ) Here c 2 = e 1 4c 1 > 1, c 3 = e t0 > 1, and hides universal constants only depending on n and ρ 0 , t 0 , c 0 , c 1 . Proof. (i) Obviously ∆ n,ω (w 1 ) ≤ ∆ n,ω (w 2 ) for any w 1 > w 2 > 0. (ii) By the assumptions on ρ 0 , we get lim ω→0 + |ρ 0,ω (t)| = lim ω→0 + |ρ 0 (t -1/ω)| = lim s→-∞ |ρ 0 (s)| = 0, ∀t ≥ 0, and M 0 := ρ 0 L ∞ (R) < ∞, which gives t n e -wt |ρ 0,ω (t)| ≤ M 0 t n e -wt ∈ L 1 ([0, ∞)) for any n ∈ N, ω > 0 and w > 0. By Lebesgue's dominant convergence theorem, we have lim ω→0 + ∆ n,ω (w) = ∞ 0 t n e -wt • lim ω→0 + |ρ 0,ω (t)|dt = 0, ∀n ∈ N, ∀w > 0. (iii) Now we estimate ∆ n,ω (w) under the sub-Gaussian condition (79). Suppose 0 < ω < 1/t 0 , we have |∆ n,ω (w)| = ∞ 0 t n e -wt |ρ 0 (t -1/ω)|dt = 1/ω+t0 1/ω-t0 t n e -wt |ρ 0 (t -1/ω)|dt + 1/ω-t0 0 t n e -wt |ρ 0 (t -1/ω)|dt + ∞ 1/ω+t0 t n e -wt |ρ 0 (t -1/ω)|dt I 1 + I 2 + I 3 . Then we bound I 1 , I 2 and I 3 respectively: I 1 ≤ M 0 1/ω+t0 1/ω-t0 t n e -wt dt ≤ M 0 e -w(1/ω-t0) 1/ω+t0 1/ω-t0 t n dt = M 0 e wt0 • e -w/ω • (1 + ωt 0 ) n+1 -(1 -ωt 0 ) n+1 (n + 1)ω n+1 M 0 e wt0 ω -n e -w/ω (t 0 + ω), where ω ∈ (0, 1/2), and hides universal constants only related to n and t 0 . Let 1/c 1 := 2σ 2 , we have I 2 = e -w/ω -t0 -1/ω (s + 1/ω) n e -ws |ρ 0 (s)|ds ≤ e -w/ω -t0 -1/ω (s + 1/ω) n e -ws • c 0 e -c1s 2 ds, where -t0 -1/ω (s + 1/ω) n e -ws e -c1s 2 ds = e w 2 4c 1 -t0 -1/ω (s + 1/ω) n e -c1(s+ w 2c 1 ) 2 ds ≤ e σ 2 w 2 /2 R (|t| + |1/ω -σ 2 w|) n • e -t 2 2σ 2 dt = e σ 2 w 2 /2 n k=0 C k n (1/ω -σ 2 w) n-k • 2 ∞ 0 t k e -t 2 2σ 2 dt ≤ e σ 2 w 2 /2 n k=0 C k n (1/ω) n-k √ 2σ k+1 Γ k + 1 2 holds for any ω ∈ (0, 2c 1 /w). Here the last inequality is due to the Mellin Transform of absolute moments of the Gaussian density (see Proposition 1 in Li et al. (2005) ). 
The argument is similar for I 3 , which gives the same bound as I 2 . Combining all the estimates gives the desired conclusion. The proof is completed. The main idea to analyze plateauing behaviors is to investigate the local dynamics of the gradient flow ( 21) when ρ = ρ, then extend the results to the setting ρ ≈ ρ by continuity. Recall that both of them are exponential sums, we can obtain the relation of parameters between ρ and ρ, according to the following lemma. Lemma C.2. For any m ∈ N + , let λ = (λ 1 , • • • , λ m ) with λ i = λ j for any i = j, i, j ∈ [m]. Then the series of functions e λit m i=1 is linear independent on any interval I ⊂ R. Proof. The aim is to show m i=1 c i e λit = 0, t ∈ I ⇒ c i = 0, ∀i ∈ [m]. (83) holds trivially for m = 1. Assume that (83) holds for m -1, then m i=1 c i e λit = 0, t ∈ I ⇒ m-1 i=1 c i e (λi-λm)t + c m = 0, t ∈ I (84) ⇒ m-1 i=1 c i (λ i -λ m )e (λi-λm)t = 0, t ∈ I. By induction, we get c i (λ i -λ m ) = 0 for any i = 1, • • • , m -1. Since λ 1 , • • • , λ m are distinct, we have c i = 0 for any i = 1, • • • , m -1. Together with (84), we get c m = 0, which completes the proof. Definition C.1. Let m ≥ m * . For any partition P: [m] = ∪ m * j=0 I j with I j1 ∩ I j2 = ∅ for any j 1 = j 2 , j 1 , j 2 ∈ {0} ∪ [m * ], and I 0 = ∪ i0 r=1 I 0,r with I 0,r1 ∩ I 0,r2 = ∅ for any r 1 = r 2 , r 1 , r 2 ∈ [i 0 ], where I j = ∅ for any j ∈ [m * ] and I 0,r = ∅ for any r ∈ [i 0 ] (if I 0 = ∅), define the affine space (with respect to P): M * P := (a, w) ∈ R m × R m + : i∈Ij a i = a * j , w i = w * j for any i ∈ I j , j ∈ [m * ]; i∈I0,r a i 0, w i = v r = w * j for any i ∈ I 0,r , r ∈ [i 0 ] and j ∈ [m * ] . Denote the collection of all such affine spaces by M * := P M * P . The following lemma characterizes the relation of parameters, by showing that M * is exactly the set of equivalent points to (a * , w * ) for the purpose of representation via exponential sums. (2) Loss and Gradient Lemma C.3. 
For any (a, w) ∈ R m × R m + , ρ(t; a, w) ≡ ρ(t) ⇔ (a, w) ∈ M * . Proof. (i) (⇐) Since (a, w) ∈ M * , Now we show that as ω → 0 + , the loss J ω remains lower bounded, while ∇J ω 2 converges to 0. This implies that the loss will saturate at a high value when learning long-term memories. Proposition C.1. For any (a, w) ∈ R m × R m + satisfying ρ(t; a, w) ≡ ρ(t), we have J ω (a, w) ≥ C(ρ 0 ) > 0, ∀ω ∈ (0, C (ρ 0 )), where C(ρ 0 ), C (ρ 0 ) > 0 are constants only depending on ρ 0 . That is, the loss is lower bounded uniformly in ω. Proof. Recall the assumptions on ρ 0 , we have ρ 0 (t) ≡ 0 and ρ 0 ∈ C(R). Let t 1 ∈ R such that ρ 0 (t 1 ) = 0. By continuity, there exists δ 0 > 0 such that |ρ 0 (t)| ≥ |ρ 0 (t 1 )|/2 for any t ∈ [t 1 -δ 0 , t 1 + δ 0 ]. Hence, for any ω > 0 sufficiently small such that -1/ω < t 1 -δ 0 , we have ρ 0,ω 2 L 2 [0,∞) = ∞ 0 ρ 2 0 (t -1/ω)dt = ∞ -1 ω ρ 2 0 (t)dt ≥ t1+δ0 t1-δ0 ρ 2 0 (t)dt ≥ 1 2 δ 0 |ρ 0 (t 1 )| 2 > 0. Fix any (â, ŵ) ∈ R m × R m + such that ρ(t; â, ŵ) ≡ ρ(t). Then J ω (â, ŵ) = ρ(t; â, ŵ) -ρ(t) -ρ 0,ω (t) 2 L 2 [0,∞) = ρ 0,ω 2 L 2 [0,∞) ≥ 1 2 δ 0 |ρ 0 (t 1 )| 2 > 0, which completes the proof. Proposition C.2. For any (a, w) ∈ R m × R m + satisfying ρ(t; a, w) ≡ ρ(t), we have lim ω→0 + ∇J ω (a, w) 2 = 0. ( ) In particular, if ρ 0 has the sub-Gaussian tail (79), the estimate ∇J ω (a, w) 2 √ mω -1 e -wmin/ω c w 2 min 2 + c wmin 3 (1 + a ∞ ) holds for any ω ∈ (0, min{1/2, 1/t 0 , 2c 1 /w min }). Here w min := min i∈[m] w i > 0, c 2 , c 3 > 1 are constants only related to c 1 , t 0 , and hides universal constants only depending on ρ 0 , t 0 , c 0 , c 1 . Proof. A straightforward computation shows, for k = 1, 2, • • • , m, ∂J ω ∂a k (a, w) = 2   m i=1 a i w k + w i - m * j=1 a * j w k + w * j   -2∆ 0,ω (w k ), ∂J ω ∂w k (a, w) = -2a k   m i=1 a i (w k + w i ) 2 - m * j=1 a * j (w k + w * j ) 2   + 2a k ∆ 1,ω (w k ). Fix any (â, ŵ) ∈ R m × R m + satisfying ρ(t; â, ŵ) ≡ ρ(t). By Lemma C.3, we have (â, ŵ) ∈ M * . 
Recall Definition C.1, there exists a partition P: [m] = ∪ m * j=0 I j with I 0 = ∪ i0 r=1 I 0,r , where I j = ∅ for any j ∈ [m * ] and I 0,r = ∅ for any r ∈ [i 0 ] (if I 0 = ∅), such that (â, ŵ) ∈ M * P , which gives that i∈Ij âi = a * j , ŵi = w * j for any i ∈ I j and j ∈ [m * ]; i∈I0,r âi = 0, ŵi = v r = w * j for any i ∈ I 0,r , r ∈ [i 0 ] and j ∈ [m * ]. Therefore, for any n ∈ N + , we have m i=1 âi ( ŵk + ŵi ) n - m * j=1 a * j ( ŵk + w * j ) n = i0 r=1 i∈I0,r âi ( ŵk + ŵi ) n + m * j=1 i∈Ij âi ( ŵk + ŵi ) n - m * j=1 a * j ( ŵk + w * j ) n = i0 r=1 i∈I0,r âi ( ŵk + v r ) n + m * j=1 i∈Ij âi ( ŵk + w * j ) n - m * j=1 a * j ( ŵk + w * j ) n = 0. This yields ∂J ω ∂a k (â, ŵ) = -2∆ 0,ω ( ŵk ), ∂J ω ∂w k (â, ŵ) = 2â k ∆ 1,ω ( ŵk ), and hence ∇J ω (â, ŵ) 2 2 = 4 m k=1 ∆ 2 0,ω ( ŵk ) + â2 k ∆ 2 1,ω ( ŵk ) . By Lemma C.1, we get lim ω→0 + ∇J ω (â, ŵ) 2 = 0. If ρ 0 has the sub-Gaussian tail (79), again by Lemma C.1, the estimate Here ŵmin := min i∈[m] ŵi > 0, c 2 , c 3 > 1 are constants only related to c 1 , t 0 , and hides universal constants only depending on n and ρ 0 , t 0 , c 0 , c 1 . Therefore ∇J ω (â, ŵ) 2 √ mω -1 e -ŵmin/ω c ŵ2 min 2 + c ŵmin 3 (1 + â ∞ ) , ω ∈ (0, 1]. The proof is completed. (3) Eigenvalues of Hessian Now we show that minimal eigenvalues of ∇ 2 J ω also converges to 0 as ω → 0 + . Proposition C.3. For any (a, w) ∈ R m × R m + satisfying ρ(t; a, w) ≡ ρ(t), denote the eigenvalues of ∇ 2 J ω (a, w) by λ 1 (ω) ≥ λ 2 (ω) ≥ • • • ≥ λ 2m (ω). If m > m * , we have λ k (ω) > 0, k = 1, 2, • • • , m , (92) lim ω→0 + λ k (ω) = 0, k = m + 1, m + 2, • • • , 2m for ω > 0 sufficiently small, where m ≤ 2m * + |I 0 | ≤ m + m * . In particular, if ρ 0 has the sub-Gaussian tail (79), the estimate |λ k (ω)| ω -2 e -wmin/ω c w 2 min 2 + c wmin 3 (1 + a ∞ ) k = m + 1, m + 2, • • • , 2m holds for any ω ∈ (0, min{1/2, 1/t 0 , 2c 1 /w min }). 
Here w min := min i∈[m] w i > 0, c 2 , c 3 > 1 are constants only related to c 1 , t 0 , and hides universal constants only depending on ρ 0 , t 0 , c 0 , c 1 . Proof. A straightforward computation shows, for k , j = 1, 2, • • • , m, ∂ 2 J ω ∂a k ∂a j (a, w) = 2 w k + w j , ∂ 2 J ω ∂a k ∂w j (a, w) = -2a j (w k + w j ) 2 , k = j, ∂ 2 J ω ∂a k ∂w k (a, w) = -2   m i=1 a i (w k + w i ) 2 - m * j =1 a * j (w k + w * j ) 2   - a k 2w 2 k + 2∆ 1,ω (w k ), ∂ 2 J ω ∂w k ∂w j (a, w) = 4a k a j (w k + w j ) 3 , k = j, ∂ 2 J ω ∂w k ∂w k (a, w) = 4a k   m i=1 a i (w k + w i ) 3 - m * j =1 a * j (w k + w * j ) 3   + a 2 k 2w 3 k -2a k ∆ 2,ω (w k ). (99) Fix any (â, ŵ) ∈ R m × R m + satisfying ρ(t; â, ŵ) ≡ ρ(t) . By (90), we have ∂ 2 J ω ∂a k ∂a j (â, ŵ) = 2 ŵk + ŵj , ∂ 2 J ω ∂a k ∂w j (â, ŵ) = -2â j ( ŵk + ŵj ) 2 (k = j), ∂ 2 J ω ∂a k ∂w k (â, ŵ) = - âk 2 ŵ2 k + 2∆ 1,ω ( ŵk ), ∂ 2 J ω ∂w k ∂w j (â, ŵ) = 4â k âj ( ŵk + ŵj ) 3 (k = j), ∂ 2 J ω ∂w k ∂w k (â, ŵ) = â2 k 2 ŵ3 k -2â k ∆ 2,ω ( ŵk ). Let J(a, w) := ρ(t; a, w) -ρ(t) 2 L 2 [0,∞) (100) = m i=1 a i e -wit - m * j=1 a * j e -w * j t 2 L 2 [0,∞) , and E ω (a, w) := O m×m Diag(∆ 1,ω (w)) Diag(∆ 1,ω (w)) -Diag(a • ∆ 2,ω (w)) , where ∆ n,ω (w) (n = 1, 2) is performed element-wisely. One can verify that ∇ 2 J ω (a, w) = ∇ 2 J(a, w) + 2E ω (a, w). Then we analyze ∇ 2 J(â, ŵ) and E ω (â, ŵ) respectively. (i) ∇ 2 J(â, ŵ). Obviously (â, ŵ) is a global minimizer of J(a, w) due to J(â, ŵ) = 0. Hence ∇ J(â, ŵ) = 0 and ∇ 2 J(â, ŵ) is positive semi-definite. We further show that ∇ 2 J(â, ŵ) has multiple zero eigenvalues when m > m * . In fact, since ∂ 2 J ∂a k ∂a j (â, ŵ) = 2 ŵk + ŵj , ∂ 2 J ∂a k ∂w j (â, ŵ) = -2â j ( ŵk + ŵj ) 2 , ∂ 2 J ∂w k ∂w j (â, ŵ) = 4â k âj ( ŵk + ŵj ) 3 , it is straightforward to verify that for any i, j ∈ I p , p ∈ [m * ] and any i, j ∈ I 0,r , r ∈ [i 0 ], ∇ 2 J(â, ŵ) i,: = ∇ 2 J(â, ŵ) j,: , âj • ∇ 2 J(â, ŵ) m+i,: = âi • ∇ 2 J(â, ŵ) m+j,: , where A i,: denotes the i-th row of matrix A. 
Notice that i∈I0,r ∇ 2 J(â, ŵ) m+i,: = 0 for any r ∈ [i 0 ], we conclude that the Hessian ∇ 2 J(â, ŵ) has at most 2m * + i 0 + i 2 ≤ 2m * + |I 0 | ≤ m + m * different rows, 6 which yields rank(∇ 2 J(â, ŵ)) ≤ 2m * + |I 0 | ≤ m + m * . Therefore, the number of zero eigenvalues of ∇ 2 J(â, ŵ) ≥ dim x ∈ R 2m : ∇ 2 J(â, ŵ) • x = 0 = 2m - rank(∇ 2 J(â, ŵ)) ≥ 2(m -m * ) -|I 0 | ≥ m -m * . Since ∇ 2 J(â, ŵ ) is positive semi-definite, all the non-zero eigenvalues must be positive. (ii) E ω (â, ŵ). Let G (1) k := {y ∈ R : |y| ≤ |∆ 1,ω ( ŵk )|} , G (2) k := {y ∈ R : |y + âk ∆ 2,ω ( ŵk )| ≤ |∆ 1,ω ( ŵk )|} . By Gershgorin's circle theorem, for any eigenvalue of E ω (â, ŵ), say λ(ω), we have λ(ω) ∈ m k=1 (G (1) k ∪ G (2) k ). Combining with Lemma C.1, we get |λ(ω)| ≤ max k∈[m] |â k ||∆ 2,ω ( ŵk )| + |∆ 1,ω ( ŵk )| → 0, ω → 0 + . ( ) If ρ 0 has the sub-Gaussian tail (79), by ( 91), we further have |λ(ω)| ≤ max k∈[m] |â k ||∆ 2,ω ( ŵk )| + |∆ 1,ω ( ŵk )| ω -2 e -ŵmin/ω c ŵ2 min 2 + c ŵmin 3 (1 + â ∞ ), ω ∈ (0, 1], where ω ∈ (0, min{1/2, 1/t 0 , 2c 1 / ŵmin }), ŵmin := min i∈[m] ŵi > 0, c 2 , c 3 > 1 are constants only related to c 1 , t 0 , and hides universal constants only depending on ρ 0 , t 0 , c 0 , c 1 . Combining (i), (ii) and applying Weyl's theorem gives the desired result. (
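The eigenvalue structure in Proposition C.3 can be observed directly in a small instance. The sketch below assembles the Hessian from the closed-form entries appearing in the proof above, with the ∆_{n,ω} integrals evaluated by quadrature; the concrete choices (Gaussian template ρ_0(t) = e^{−t²}, m = 2, m* = 1, a* = w* = 1, and the M*-point (a, w) = ((1, 0), (1, 1/2))) are our own, and the symmetric Hessian is diagonalized with a hand-rolled cyclic Jacobi iteration.

```python
import math

def delta(n, omega, wv, N=20000):
    # Delta_{n,omega}(w) = ∫_0^∞ t^n e^{-w t} |rho_0(t - 1/omega)| dt,
    # with rho_0(t) = e^{-t^2} (midpoint rule on [0, T]).
    T = 1.0 / omega + 12.0
    dt = T / N
    total = 0.0
    for k in range(N):
        t = (k + 0.5) * dt
        total += (t ** n) * math.exp(-wv * t) * math.exp(-(t - 1.0 / omega) ** 2) * dt
    return total

def hessian(omega, a, w):
    # Hessian of J_omega at a point with rho(t; a, w) ≡ rho(t), using the
    # closed-form entries from the proof of Proposition C.3.
    m = len(a)
    H = [[0.0] * (2 * m) for _ in range(2 * m)]
    for k in range(m):
        for j in range(m):
            H[k][j] = 2.0 / (w[k] + w[j])                           # a_k, a_j
            if k != j:
                H[k][m + j] = -2.0 * a[j] / (w[k] + w[j]) ** 2      # a_k, w_j
                H[m + k][m + j] = 4.0 * a[k] * a[j] / (w[k] + w[j]) ** 3
        H[k][m + k] = -a[k] / (2.0 * w[k] ** 2) + 2.0 * delta(1, omega, w[k])
        H[m + k][m + k] = (a[k] ** 2 / (2.0 * w[k] ** 3)
                           - 2.0 * a[k] * delta(2, omega, w[k]))
    for k in range(m):                       # mirror the a-w coupling block
        for j in range(m):
            H[m + j][k] = H[k][m + j]
    return H

def jacobi_eigenvalues(A, sweeps=60):
    # Cyclic Jacobi diagonalization of a symmetric matrix (pure Python).
    A = [row[:] for row in A]
    n = len(A)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                th = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(th), math.sin(th)
                for k in range(n):           # rotate rows p and q
                    Apk, Aqk = A[p][k], A[q][k]
                    A[p][k] = c * Apk - s * Aqk
                    A[q][k] = s * Apk + c * Aqk
                for k in range(n):           # rotate columns p and q
                    Akp, Akq = A[k][p], A[k][q]
                    A[k][p] = c * Akp - s * Akq
                    A[k][q] = s * Akp + c * Akq
    return sorted(A[i][i] for i in range(n))

# m = 2 > m* = 1, a* = w* = 1; the point ((1, 0), (1, 1/2)) lies in M*
# (second unit degenerate: a_2 = 0, w_2 = 1/2 != w*).
a, w = [1.0, 0.0], [1.0, 0.5]
eigs = {om: jacobi_eigenvalues(hessian(om, a, w)) for om in (0.125, 0.04)}
for om, ev in eigs.items():
    print(f"omega={om}:", ["%.2e" % x for x in ev])
```

A few eigenvalues remain O(1), while the smallest (in magnitude) shrinks rapidly as ω decreases, and may dip slightly below zero: for ω > 0 the point is no longer a minimizer, consistent with the perturbation E_ω in the proof.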

4) Local Linearization Analysis

The previous analysis can now be tied directly to a quantitative dynamics via linearization arguments. It is shown that under mild assumptions, the gradient flow ( 21) can become trapped in plateaus with an exponentially large timescale. That is, the curse of memory occurs, this time in optimization dynamics instead of approximation rates. Theorem C.1 (Restatement of Theorem 5.1). For any ω > 0, m ∈ N + and θ 0 = (a 0 , w 0 ) ∈ R m × R m + , 0 < δ 1, define the hitting time τ 0 = τ 0 (δ; ω, m, θ 0 ) := inf {τ ≥ 0 : θ ω (τ ) -θ 0 2 > δ} , ( ) τ 0 = τ 0 (δ; ω, m, θ 0 ) := inf {τ ≥ 0 : |J ω (θ ω (τ )) -J ω (θ 0 )| > δ} . ( ) Assume that m > m * , and the initialization satisfies ρ(t; θ 0 ) ≈ ρ(t). Then we have lim ω→0 + τ 0 (δ; ω, m, θ 0 ) = lim ω→0 + τ 0 (δ; ω, m, θ 0 ) = +∞. ( ) In particular, if ρ 0 has the sub-Gaussian tail (79), and the initialization is bounded as (a 0 , w 0 ) ∈ [a 0 l , a 0 r ] m × [w 0 l , w 0 r ] m with constants a 0 l < a 0 r , 0 < w 0 l < w 0 r , we further have τ 0 (δ; ω, m, θ 0 ) ≥ τ 0 (δ; ω, m, θ 0 ) ω 2 e w 0 l /ω min δ √ m , ln(1 + δ) ( ) for any ω ∈ (0, min{1/2, 1/t 0 , 2c 1 /w 0 r }) sufficiently small, where hides universal constants only depending on ρ 0 , t 0 , c 0 , c 1 , w 0 r , a 0 l and a 0 r . Proof. Consider the asymptotic expansion with the form θ ω (τ ) = θ 0 ω (τ ) + ∞ i=1 δ i θ i ω (τ ) = θ 0 ω (τ ) + δθ 1 ω (τ ) + δ 2 θ 2 ω (τ ) + o(δ 2 ), for some δ ∈ (0, 1) (with δ 1) and θ i ω (τ ) = O(1) (τ ≥ 0, i = 0, 1, • • • ). 7 For consistency, we have θ 0 ω (0) = θ 0 and θ i ω (0) = 0 for i = 1, 2, • • • . By continuity, τ 0 > 0 and θ ω (τ ) -θ 0 2 ≤ δ for any τ ∈ [0, τ 0 ]. The aim is to quantify the scale of τ 0 . Let g 0 := ∇J ω (θ 0 ) and H 0 := ∇ 2 J ω (θ 0 ). The local linearization on (21) shows d dτ θ ω (τ ) = -g 0 -H 0 (θ ω (τ ) -θ 0 ) + O(δ 2 ), τ ∈ [0, τ 0 ]. 
Combining with (108), we have d dτ θ 0 ω (τ ) = -g 0 -H 0 (θ 0 ω (τ ) -θ 0 ), θ 0 ω (0) = θ 0 , at O(1) scale, d dτ θ 1 ω (τ ) = -H 0 θ 1 ω (τ ), θ 1 ω (0) = 0, at O(δ) scale, d dτ θ 2 ω (τ ) = -H 0 θ 2 ω (τ ) + O(1), θ 2 ω (0) = 0, at O(δ 2 ) scale. Therefore θ 0 ω (τ ) = θ 0 - τ 0 e -H0s ds g 0 , θ 1 ω (τ ) = e -H0τ θ 1 ω (0) = 0, which gives θ ω (τ ) = θ 0 - τ 0 e -H0s ds g 0 + O(δ 2 ), τ ∈ [0, τ 0 ]. ( ) To achieve a parameter separation gap δ 0 , i.e. θ ω (τ ) -θ 0 2 = δ 0 with δ 0 = cδ, c ∈ (0, 1], we need to take τ such that τ 0 e -H0s ds g 0 2 ≥ δ 0 2 . ( ) Let H 0 = P ΛP be the eigenvalue decomposition with P orthogonal and  Λ = diag(λ 1 , • • • λ 2m ) consisting of the eigenvalues of H 0 with λ 1 ≥ • • • ≥ λ 2m . Then ≤ g 0 2 • max max i∈[2m],λi =0 1 |λ i | |e -λiτ -1|, τ . It is straightforward to verify that h(τ ; λ) := 1 |λ| |e -λτ -1|, τ ≥ 0 monotonically decreases on λ ∈ R for any τ ≥ 0.foot_5 Hence τ 0 e -H0s ds g 0 2 ≤ g 0 2 • 1 -λ2m (e -λ2mτ -1), λ 2m < 0, τ, λ 2m ≥ 0, and the right hand side monotonically increases on τ ≥ 0. Combining ( 110), ( 111) gives δ 0 2 ≤ g 0 2 • 1 -λ2m (e -λ2mτ -1), λ 2m < 0, τ, λ 2m ≥ 0. ( ) We discuss for different cases: (i) g 0 2 = 0. Obviously the inequality (112) fails since (110) fails for any τ ≥ 0, which gives τ 0 = +∞; (ii) g 0 2 = 0 and λ 2m ≥ 0. By (112), we get τ ≥ δ0 2 g0 2 ; (iii) g 0 2 = 0 and λ 2m < 0. By (112), we get τ ≥ 1 -λ 2m ln 1 + δ 0 -λ 2m 2 g 0 2 . If -λ 2m ≤ 2 g 0 2 , we have τ ≥ 1 -λ 2m • δ 0 -λ 2m 2 g 0 2 • ln 1 + δ 0 -λ2m 2 g0 2 δ 0 -λ2m 2 g0 2 = δ 0 2 g 0 2 1 + O δ 0 -λ 2m 2 g 0 2 = δ 0 2 g 0 2 (1 + O(δ 0 )); if -λ 2m > 2 g 0 2 , we have τ ≥ ln(1+δ0) -λ2m . Combining (i), (ii), (iii) gives τ 0 = τ 0 (δ; ω, m, θ 0 ) min δ g 0 2 , ln(1 + δ) |λ 2m | . ( ) Let the initialization satisfy ρ(t; θ 0 ) ≡ ρ(t), and assume m > m * . According to Propositions C.2 C.3, we have lim ω→0 + g 0 2 = 0, lim ω→0 + λ 2m = 0 ⇒ lim ω→0 + τ 0 (δ; ω, m, θ 0 ) = +∞. 
( ) If ρ 0 has the sub-Gaussian tail (79), again by Propositions C.2 and C.3, we further have τ 0 (δ; ω, m, θ 0 ) ω 2 e w0,min/ω min δ √ m , ln(1 + δ) 1 c w 2 0,min 2 + c w0,min 3 (1 + a 0 ∞ ) for any ω ∈ (0, min{1/2, 1/t 0 , 2c 1 /w 0,min }), where w 0,min := min i∈[m] w 0,i > 0, c 2 , c 3 > 1 are constants only related to c 1 , t 0 , and hides universal constants only depending on ρ 0 , t 0 , c 0 , c 1 . Since the initialization is bounded as (a 0 , w 0 ) ∈ [a 0 l , a 0 r ] m ×[w 0 l , w 0 r ] m with a 0 l < a 0 r , 0 < w 0 l < w 0 r , let c 0 a = max{|a 0 l |, |a 0 r |}, we get τ 0 (δ; ω, m, θ 0 ) ω 2 e w 0 l /ω min δ √ m , ln(1 + δ) 1 c (w 0 r ) 2 2 + c w 0 r 3 (1 + c 0 a ) ω 2 e w 0 l /ω min δ √ m , ln(1 + δ) , where hides universal constants only related to w 0 r , a 0 l and a 0 r . The last task is to show the dynamics of the loss is much slower than the parameter separation when there is plateauing. The argument is trivial since for any τ ∈ [0, τ 0 ], J ω (θ ω (τ )) -J ω (θ 0 ) = g 0 (θ ω (τ ) -θ 0 ) + (θ ω (τ ) -θ 0 ) H 0 (θ ω (τ ) -θ 0 ) + o(δ 2 ) ≥ -g 0 2 θ ω (τ ) -θ 0 2 + λ 2m θ ω (τ ) -θ 0 2 2 + o(δ 2 ) = o(1)O(δ) + o(1)O(δ 2 ) + o(δ 2 ) = o(δ 2 ), ω → 0 + . By continuity, the proof is completed. Remark C.3. The estimate in Theorem C.1 shows a lower bound on the escape time, hence it does not appear to preclude the situation that the plateauing lasts forever. However, in the proof above, if one supposes τ 0 = +∞ in (104), i.e. the hypothetical situation where the parameters are trapped forever, and write g0 : = P g 0 = (g 0,1 , • • • , g0,2m ), we have τ 0 e -H0s ds g 0 2 2 = g 0 τ 0 e -Λs ds 2 g0 = 2m i=1 (g 0,i ) 2 (h(τ ; λ i )) 2 ≥ (g 0,j ) 2 (h(τ ; λ j )) 2 for any j such that λ j < 0. If g0,j = 0, (109) gives θ ω (τ ) -θ 0 2 ≥ τ 0 e -H0s ds g 0 2 + O(δ 2 ) ≥ |g 0,j | -λ j (e -λj τ -1) + O(δ 2 ) → +∞, τ → ∞, which is a contradiction. 
That is to say, the parameter separation has to achieve the gap δ within a finite time, even if that time is exponentially large. Remark C.4. Recall from Lemma C.3 that ρ(t; a_0, w_0) ≡ ρ(t) if and only if (a_0, w_0) ∈ M* = ∪_P M*_P, where P is a partition over [m] as defined in Definition C.1. That is, as a union of affine spaces, M* is in fact an equivalence set for qualified initializations. As discussed in Remark C.2, when there is no degeneracy, the cardinality of M* is m*!·S(m, m*) (i.e. the number of partitions P, with S the Stirling number of the second kind), with each M*_P an (m − m*)-dimensional affine space; when there is degeneracy in some M*_P, it becomes an uncountable set. By continuity, initializations sufficiently near M* are also qualified. Remark C.5. Motivated by the idea of weight degeneracy (see Definition C.1), we can further apply similar methods to a (global) landscape analysis of the loss function J_ω. The results there show that plateaus occur all over the landscape, even for general targets (without memory structures). See Appendix D for details.
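The two regimes in the escape-time bound τ_0 ≳ min{δ/‖g_0‖_2, ln(1 + δ)/|λ_{2m}|} can be probed on a one-dimensional caricature of the leading-order dynamics. The following is a minimal sketch; the values of g0, lam and delta0 are illustrative choices, not quantities from the paper.

```python
import numpy as np

# Toy probe of the escape-time scales: integrate the leading-order dynamics
# theta' = -g0 - lam*theta (theta measured relative to theta_0) with forward
# Euler, and record when the separation |theta| reaches the gap delta0.
def escape_time(g0, lam, delta0, dt=1e-4, t_max=1e3):
    theta, t = 0.0, 0.0
    while abs(theta) < delta0 and t < t_max:
        theta += dt * (-g0 - lam * theta)
        t += dt
    return t

delta0 = 0.1
# lam >= 0 (flat case): theta(t) = -g0*t, so the time is exactly delta0/g0 = 10.
t_flat = escape_time(g0=0.01, lam=0.0, delta0=delta0)
# lam < 0 (unstable case): |theta(t)| = (g0/|lam|)(e^{|lam| t} - 1), so the time
# is ln(1 + delta0*|lam|/g0)/|lam| = ln(11), about 2.40.
t_unstable = escape_time(g0=0.01, lam=-1.0, delta0=delta0)
print(t_flat, t_unstable)
```

A small gradient alone forces a plateau of length of order δ_0/‖g_0‖, while an unstable eigenvalue shortens the wait to a logarithm, exactly the two branches of the bound.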

C.3 NUMERICAL EXPERIMENTS

C.3.1 MOTIVATING TESTS

In this section we give the details of Figure 1. (1) Learning linear functionals using linear RNNs (with the GD optimizer). The target functional is H_T(x) = ∫_0^T ρ(t) x_{T−t} dt with white-noise input x, while the representation ρ is selected as an exponential sum or a scaled Airy function: 1. Exponential sum: ρ(t) = (c*)ᵀ e^{W* t} b*, where c*, b* are standard normal random vectors of dimension m*, and W* = −I − ZᵀZ with Z ∈ R^{m*×m*} a Gaussian random matrix with i.i.d. entries of variance 1/m*. 2. Airy function: ρ(t) = Ai(s_0(t − t_0)), where Ai(t) is the Airy function of the first kind, given by the improper integral Ai(t) = (1/π) lim_{ξ→∞} ∫_0^ξ cos(u³/3 + tu) du. Note that in the first example the memory of the target functional decays quickly. In the second example, by contrast, the effective rate of decay is controlled by the parameter t_0: the Airy function is oscillatory for t ≤ t_0, hence for large t_0 a large amount of memory is present in the target. In Figure 1 we set m* = 8 for the exponential sum, and t_0 = 3, s_0 = 2.25 for the Airy function. In Figure 1 (a) and (b), we plot the gradient descent dynamics of training the linear RNNs (discretized using the Euler method, hence equivalent to residual RNNs). We observe an efficient training process in the exponential-sum case, but "plateauing" behavior in the Airy-function case, which causes a severe slow-down of training. In addition, we find that the plateauing effect gets worse as t_0 (or s_0) is increased, which corresponds to more complex Airy functions in the sense of stronger memory effects. That is, long-term memory adversely affects the optimization process via gradient descent.
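A minimal sketch of the fast-decaying exponential-sum target is below. The seed and evaluation points are arbitrary illustrative choices; the Airy target would be evaluated with scipy.special.airy instead, which we omit to stay dependency-free.

```python
import numpy as np

# Exponential-sum target rho(t) = c^T exp(W t) b with W = -I - Z^T Z:
# every eigenvalue of W is <= -1, so the memory decays at least as fast as e^{-t}.
rng = np.random.default_rng(0)
m_star = 8
c = rng.standard_normal(m_star)
b = rng.standard_normal(m_star)
Z = rng.standard_normal((m_star, m_star)) / np.sqrt(m_star)  # entries of variance 1/m*
W = -np.eye(m_star) - Z.T @ Z                                # symmetric negative definite

# W is symmetric, so exp(W t) follows from the eigendecomposition W = V diag(evals) V^T.
evals, V = np.linalg.eigh(W)

def rho(t):
    return c @ (V * np.exp(evals * t)) @ (V.T @ b)

# Fast decay: |rho(t)| <= ||c|| ||b|| e^{-t}.
bound = np.linalg.norm(c) * np.linalg.norm(b)
print(rho(0.0), rho(8.0), bound * np.exp(-8.0))
```

Because every eigenvalue of W* sits below −1, this target carries only short-term memory, which is why GD trains it without plateaus in Figure 1(a).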
(2) Learning nonlinear functionals using nonlinear RNNs (with the Adam optimizer). To show that the plateauing behavior may be generic, we also consider a nonlinear forced dynamical system as target, the Lorenz 96 system (Lorenz, 1996): ẏ = −y + x + (1/K) Σ_{k=1}^K z_k, ż_k = 2[z_{k+1}(z_{k−1} − z_{k+2}) − z_k + y], k = 1, ..., K, with cyclic indices z_{k+K} = z_k, where x is an external stochastic noise. When the unresolved variables z_k are unknown, the dynamics of the resolved variable y driven by x is a nonlinear dynamical system with memory effects (but not a linear functional). We use a standard nonlinear RNN (with the tanh activation) to learn the sequence-to-sequence mapping x_{0:T} → y_{0:T} with the Adam optimizer. Figure 1 (c) shows that training on the Lorenz 96 system, in the presence of memory, also exhibits the plateauing phenomenon. We further verify the plateauing behavior under more general settings: with non-diagonal recurrent kernels, with the non-linear (tanh) activation, and with the Adam optimizer. Furthermore, we do not take the Itô isometry simplification and instead use actual input sample paths of finite time horizons, just as one would do when training RNNs in practice. We observe that the plateauing behavior is present in all cases. Moreover, in the last case of Adam (which can be thought of as a momentum-based optimization method), the plateauing behavior is somewhat alleviated, although the separation of timescales is still present. This is consistent with our supplemental analysis in Appendix E, where we show that momentum-based methods can speed up training, based on our dynamical (Appendix C.2) and landscape (Appendix D) analyses of plateauing. The learning rate is 1.0 for GD and 0.001 for Adam. For each experiment, 10 initializations are sampled and trained.
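The forced Lorenz-96-type target can be generated with a simple forward-Euler scheme. In this sketch the step size, dimension K, horizon and noise scale are illustrative choices, not the exact experimental settings.

```python
import numpy as np

# Forward-Euler simulation of the forced Lorenz-96-type system described above:
#   y' = -y + x + (1/K) sum_k z_k,
#   z_k' = 2 [ z_{k+1}(z_{k-1} - z_{k+2}) - z_k + y ],  cyclic in k.
def simulate_lorenz96(x, dt=0.01, K=8, seed=0):
    rng = np.random.default_rng(seed)
    y = 0.0
    z = 0.01 * rng.standard_normal(K)          # small random unresolved state
    ys = []
    for xt in x:
        # np.roll implements the cyclic indices: roll(z,-1)[k] = z[k+1], etc.
        dz = 2.0 * (np.roll(z, -1) * (np.roll(z, 1) - np.roll(z, -2)) - z + y)
        dy = -y + xt + z.mean()
        y, z = y + dt * dy, z + dt * dz        # update resolved and unresolved states
        ys.append(y)
    return np.array(ys)

x = np.random.default_rng(1).standard_normal(1000)   # external stochastic forcing
y_traj = simulate_lorenz96(x)
print(y_traj.shape, float(np.max(np.abs(y_traj))))
```

Training an RNN on pairs (x, y_traj) then reproduces the sequence-to-sequence task of Figure 1(c): the hidden z_k variables induce the memory that the model must represent.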
We consider two possible input distributions: a) white-noise inputs; b) inputs of the form x_t = Σ_{j=1}^J α_j cos(λ_j t), where λ_j ∼ U[0, 10] and α_j ∼ N(0, 1). We observe that plateaus occur in all cases; momentum generally improves the situation but still does not resolve the difficulty.

D LANDSCAPE ANALYSIS

As mentioned in Remark C.5, we can perform a global landscape analysis of the loss function based on the idea of weight degeneracy, which arises from Definition C.1. Recall that the loss function reads min_{a∈R^m, w∈R^m_+} J_m(a, w) := ∫_0^∞ ( Σ_{i=1}^m a_i e^{−w_i t} − ρ(t) )² dt. The main results of this appendix are summarized as follows. • In Theorem D.1, we prove that the loss function has infinitely many critical points, which form a factorial number of affine spaces; • In Theorem D.2, we prove that such (critical) affine spaces far outnumber the global minimizers when the target is an exponential sum;foot_6 • In Theorem D.3, we prove that on such (critical) affine spaces, the Hessian is singular in the sense of possessing multiple zero eigenvalues; • In Proposition D.1, we prove that the (critical) affine spaces contain both saddles and degenerate stable points which are not globally optimal. Instead of the local dynamical analysis of Appendix C.2, we generalize similar methods to a global landscape analysis here, and the results hold for the loss function associated with general targets. More specifically, these results complement our main results (see Theorem 5.1 or Theorem C.1) in the following aspects. • Weight degeneracy is quite common in the whole landscape of the loss function; unfortunately, it often worsens the landscape to a large extent; • Weight degeneracy leads to a large number of stable areas (i.e. critical affine spaces), but most of them correspond to non-global minimizers; • These stable areas can also be quite flat, and they often connect with local plateaus; • Structurally, these stable areas contain both saddles and degenerate critical points that are not globally optimal; in certain regimes, even the saddles can be rather difficult to escape from (see Theorem 5.1 or Theorem C.1).
As a consequence, the optimization problem for linear RNNs is difficult in an essential, global sense.
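When the target ρ is itself an exponential sum, the loss J_m above admits a closed form via ∫_0^∞ e^{−pt} e^{−qt} dt = 1/(p + q), which makes the landscape easy to probe numerically. A minimal sketch with illustrative coefficients:

```python
import numpy as np

# Closed-form J_m(a, w) = || sum_i a_i e^{-w_i t} - rho(t) ||^2_{L^2[0,inf)} for an
# exponential-sum target: stack model and (negated) target coefficients and
# evaluate a single quadratic form with the Cauchy-type Gram matrix 1/(w_i + w_j).
def loss_exp_sum(a, w, a_star, w_star):
    a_all = np.concatenate([np.asarray(a), -np.asarray(a_star)])
    w_all = np.concatenate([np.asarray(w), np.asarray(w_star)])
    G = 1.0 / (w_all[:, None] + w_all[None, :])   # Gram matrix of the exponentials
    return float(a_all @ G @ a_all)

a_star, w_star = np.array([1.0, -0.5]), np.array([1.0, 2.0])
# The loss vanishes at the target and (cf. the proof of Theorem D.2) at any
# permutation of it, while it is positive elsewhere:
print(loss_exp_sum(a_star, w_star, a_star, w_star))
print(loss_exp_sum(a_star[::-1], w_star[::-1], a_star, w_star))
print(loss_exp_sum([1.0, 0.0], [1.0, 2.0], a_star, w_star))
```

This closed form is what the symmetry analysis below exploits: J_m depends on (a, w) only through the Gram structure, which is invariant under permutations and under splitting coefficients among coincided weights.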

D.1 SYMMETRY ANALYSIS ON THE LANDSCAPE

This subsection consists of two parts: in Appendix D.1.1, we give the main results assuming the existence of weight degeneracy; in Appendix D.1.2, we give sufficient conditions that guarantee this existence. Since the key to exploiting weight degeneracy is the permutation symmetry of the coordinates of the gradients, we also call this approach "symmetry analysis".

D.1.1 GENERIC THEORY

We begin with the following definition, which describes the concept of weight degeneracy in a natural and rigorous way. Definition D.1. (coincided critical solutions and affine spaces) Let d ∈ N_+ and 1 ≤ d ≤ m. We say (a, w) is a d-coincided critical solution of J_m if ∇J_m(a, w) = 0 and w = (w_i) ∈ R^m_+ has d different components. Coincided critical affine spaces are defined as coincided critical solutions that form affine spaces. To guarantee the existence of such solutions, we need the following definition. Definition D.2. (â, ŵ) ∈ R^m ⊗ R^m_+ is called a non-degenerate global minimizer of J_m if and only if J_m(â, ŵ) = inf_{a∈R^m, w∈R^m_+} J_m(a, w), and (â, ŵ) takes a non-degenerate form: â_i ≠ 0 and ŵ_i ≠ ŵ_j for i ≠ j, i, j = 1, 2, ..., m. For convenience, we also define the index set N := {n ∈ N_+ : J_m has non-degenerate global minimizers for any m ≤ n}, which is used frequently in the following analysis. For any f ∈ L²[0, ∞), let L[f] be the Laplace transform of f, i.e. L[f](s) = ∫_0^∞ e^{−st} f(t) dt, s > 0. We begin with the following lemma. Lemma D.1. Assume that ρ is smooth and √w·|L[ρ](w)| → 0 as w → 0+ and as w → ∞. Then 1 ∈ N and thus N ≠ ∅. Proof. We aim to show that there exist â ≠ 0 and ŵ > 0 such that J_1(â, ŵ) = inf_{a∈R, w∈R_+} J_1(a, w). The basic idea is to restrict the unbounded domain a ∈ R, w ∈ R_+ to a compact set without affecting the minimization of J_1(a, w). We have

min_{a, w>0} J_1(a, w) = min_{w>0} min_a [ (1/(2w))·a² − 2L[ρ](w)·a + ‖ρ‖²_{L²[0,∞)} ]
= min_{w>0} min_a [ (1/(2w))(a − 2wL[ρ](w))² + ‖ρ‖²_{L²[0,∞)} − 2w(L[ρ](w))² ]
= min_{w>0} [ ‖ρ‖²_{L²[0,∞)} − 2w(L[ρ](w))² ] = min_{w>0} J_1(a(w), w),

where a(w) := 2wL[ρ](w). Write h(w) := J_1(a(w), w); then h(0+) = h(∞) = ‖ρ‖²_{L²[0,∞)}, while h(w) < ‖ρ‖²_{L²[0,∞)} for any w > 0. That is to say, the minimization of J_1(a, w) can be equivalently performed on the 2-dimensional smooth curve {(w, a(w))}_{w∈[w_lb, w_ub]}, which is certainly a compact set.
By continuity, J_1(a, w) attains its global minimum on this curve, say at (â, ŵ). Obviously ŵ > 0 and â = a(ŵ) ≠ 0 (since â = 0 would imply J_1(â, w) = ‖ρ‖²_{L²[0,∞)}, certainly not a minimum), which completes the proof. Proof (of Theorem D.1). (i) Existence. The key observation is the permutation symmetry of ∇J_m: by (19), if a_i = a_j and w_i = w_j for some i ≠ j, then ∂J_m/∂a_i = ∂J_m/∂a_j and ∂J_m/∂w_i = ∂J_m/∂w_j. For any m, d ∈ N_+ with 1 ≤ d ≤ m, suppose that w = (w_i) ∈ R^m_+ has d different components. Then for any partition P: {1, ..., m} = ∪_{j=1}^d I_j with I_{j₁} ∩ I_{j₂} = ∅ for any j₁ ≠ j₂, j₁, j₂ = 1, ..., d, define the affine space

M_{P,(b,v),(m,d)} := { (a, w) ∈ R^m ⊗ R^m_+ : w_i = v_j for any i ∈ I_j, Σ_{i∈I_j} a_i = b_j, j = 1, ..., d }

for some (b, v) ∈ R^d ⊗ R^d_+, where v has exactly d different components. Therefore, for any (a, w) ∈ M_{P,(b,v),(m,d)}, we have

J_m(a, w) = ‖ Σ_{j=1}^d Σ_{i∈I_j} a_i e^{−w_i t} − ρ(t) ‖²_{L²[0,∞)} = ‖ Σ_{j=1}^d b_j e^{−v_j t} − ρ(t) ‖²_{L²[0,∞)} = J_d(b, v), (126)

and similarly

∂J_m/∂a_k (a, w) = 2 ∫_0^∞ e^{−v_s t} ( Σ_{j=1}^d b_j e^{−v_j t} − ρ(t) ) dt, k ∈ I_s, s = 1, 2, ..., d,
∂J_m/∂w_k (a, w) = 2a_k ∫_0^∞ (−t) e^{−v_s t} ( Σ_{j=1}^d b_j e^{−v_j t} − ρ(t) ) dt, k ∈ I_s, s = 1, 2, ..., d.

Noticing that

∂J_d/∂b_s (b, v) = 2 ∫_0^∞ e^{−v_s t} ( Σ_{j=1}^d b_j e^{−v_j t} − ρ(t) ) dt, s = 1, 2, ..., d,
∂J_d/∂v_s (b, v) = 2b_s ∫_0^∞ (−t) e^{−v_s t} ( Σ_{j=1}^d b_j e^{−v_j t} − ρ(t) ) dt, s = 1, 2, ..., d,

we have

∂J_m/∂a_k (a, w) = ∂J_d/∂b_s (b, v), b_s · ∂J_m/∂w_k (a, w) = a_k · ∂J_d/∂v_s (b, v), k ∈ I_s, s = 1, 2, ..., d. (127)foot_8

Since d ≤ min{m, M}, we have d ∈ N. In fact, for any k ∈ N_+, if k ∉ N, there exists i ≤ k such that J_i has no non-degenerate global minimizers; then j ∉ N for any j ≥ i, hence M ≤ i − 1 ≤ k − 1. Thus M = ∞ implies N = N_+, and M < ∞ implies M ∈ N, and both lead to d ∈ N. Therefore, J_d has non-degenerate global minimizers, i.e. there exists (b̃, ṽ) ∈ R^d ⊗ R^d_+ such that J_d(b̃, ṽ) = inf_{b∈R^d, v∈R^d_+} J_d(b, v). Proof (of Theorem D.2). (i) Global minimizers.
Since the target is an exponential sum, we have J_m(a, w) ≥ 0 and J_m(ā*, w̄*) = 0, where ā* = Pa* and w̄* = Pw* with P ∈ R^{m×m} any permutation matrix. Next we show that J_m has no other global minimizers. Suppose J_m(a, w) = 0; then

Σ_{i=1}^m a_i e^{−w_i t} − Σ_{j=1}^m a*_j e^{−w*_j t} = 0, t ≥ 0. (129)

It is easy to see that for any j = 1, ..., m there exists i(j) such that w_{i(j)} = w*_j. Otherwise, if w_i ≠ w*_j for i = 1, ..., m, then by (83) or Lemma C.2 we would have a*_j = 0, a contradiction. Noticing that w*_i ≠ w*_j for any i ≠ j, different w*_j's correspond to different w_i's, hence the correspondence is one-to-one. Therefore, writing w_i = w*_{j(i)}, (129) can be rewritten as

0 = Σ_{i=1}^m a_i e^{−w*_{j(i)} t} − Σ_{i=1}^m a*_{j(i)} e^{−w*_{j(i)} t} = Σ_{i=1}^m (a_i − a*_{j(i)}) e^{−w*_{j(i)} t}, t ≥ 0.

Again by Lemma C.2, we have a_i = a*_{j(i)}. That is to say, J_m(a, w) = 0 implies a = Pa* and w = Pw* with P ∈ R^{m×m} some permutation matrix. This gives m! global minimizers. For the counting below, recall the recurrence for Stirling numbers of the second kind: S(m, d) = d·S(m−1, d) + S(m−1, d−1).

• For d = m − 1, let p_m := S(m, m−1); then p_m = (m−1)·S(m−1, m−1) + S(m−1, m−2) = (m−1) + p_{m−1} = ... = m(m−1)/2.
• For d = m − 2, let q_m := S(m, m−2); then q_m = (m−2)·S(m−1, m−2) + S(m−1, m−3) = (m−2)p_{m−1} + q_{m−1} = ... = (1/24)[2(m−2)(m−1)(2m−3) + 3(m−2)²(m−1)²].

Combining the above gives

(1/m!) Σ_{d=1}^{m−1} d!·S(m, d) > (1/m!) [(m−1)!·p_m + (m−2)!·q_m] = (m+1)(3m−2)/24,

which is a quadratic polynomial in m. The proof is completed. Remark D.2. We only take the last two terms of (130) for the lower bound, which is obviously rather loose. In principle, a Poly(m) bound of higher degree can be obtained similarly. For the proof of Theorem D.3, direct computation gives, for k, l = 1, 2, ..., m,

∂²J_m/∂a_k∂a_l (a, w) = 2/(w_k + w_l), (131)
∂²J_m/∂a_k∂w_l (a, w) = −2a_l/(w_k + w_l)², k ≠ l, (132)
∂²J_m/∂a_k∂w_k (a, w) = −a_k/(2w_k²) + 2 ∫_0^∞ (−t) e^{−w_k t} ( Σ_{i=1}^m a_i e^{−w_i t} − ρ(t) ) dt. (133)
Let the induced d-coincided critical affine space be M_{P,(b̃,ṽ),(m,d)}, as derived in the proof of Theorem D.1. Since (b̃, ṽ) is a non-degenerate global minimizer of J_d, we have

∫_0^∞ (−t) e^{−ŵ_k t} ( Σ_{i=1}^m â_i e^{−ŵ_i t} − ρ(t) ) dt = ∫_0^∞ (−t) e^{−ṽ_s t} ( Σ_{j=1}^d b̃_j e^{−ṽ_j t} − ρ(t) ) dt = (1/(2b̃_s)) ∂J_d/∂v_s (b̃, ṽ) = 0, k ∈ I_s, s = 1, 2, ..., d,

for any (â, ŵ) ∈ M_{P,(b̃,ṽ),(m,d)}. This gives

∂²J_m/∂a_k∂w_k (â, ŵ) = −â_k/(2ŵ_k²). (134)

Now we show that, for any i, j ∈ I_s with i ≠ j, s = 1, 2, ..., d, the rows ∇²J_m(â, ŵ)_{i,:} and ∇²J_m(â, ŵ)_{j,:} coincide. In fact, for any k = 1, ..., m, let k ∈ I_{s′}; then by (131),

∂²J_m/∂a_i∂a_k (â, ŵ) = 2/(ŵ_i + ŵ_k) = 2/(ṽ_s + ṽ_{s′}), ∂²J_m/∂a_j∂a_k (â, ŵ) = 2/(ŵ_j + ŵ_k) = 2/(ṽ_s + ṽ_{s′}).

For k ≠ i and k ≠ j, (132) gives

∂²J_m/∂a_i∂w_k (â, ŵ) = −2â_k/(ŵ_i + ŵ_k)² = −2â_k/(ṽ_s + ṽ_{s′})², ∂²J_m/∂a_j∂w_k (â, ŵ) = −2â_k/(ŵ_j + ŵ_k)² = −2â_k/(ṽ_s + ṽ_{s′})².

By (134), for k = i (≠ j),

∂²J_m/∂a_i∂w_k (â, ŵ) = −â_i/(2ŵ_i²) = −â_i/(2ṽ_s²), ∂²J_m/∂a_j∂w_k (â, ŵ) = −2â_i/(ŵ_j + ŵ_i)² = −â_i/(2ṽ_s²),

and similarly for k = j (≠ i),

∂²J_m/∂a_i∂w_k (â, ŵ) = −2â_j/(ŵ_i + ŵ_j)² = −â_j/(2ṽ_s²), ∂²J_m/∂a_j∂w_k (â, ŵ) = −â_j/(2ŵ_j²) = −â_j/(2ṽ_s²).

That is to say, there are at most m + d different rows in the symmetric matrix ∇²J_m(â, ŵ) ∈ R^{2m×2m}, hence rank ∇²J_m(â, ŵ) ≤ m + d. Therefore, the number of zero eigenvalues of ∇²J_m(â, ŵ) is at least dim{x ∈ R^{2m} : ∇²J_m(â, ŵ)·x = 0} = 2m − rank(∇²J_m(â, ŵ)) ≥ m − d. The proof is completed. Remark D.4. The bound in Theorem D.3 is not sharp, since the estimate on rank ∇²J_m here is loose: only rows with identical elements are considered. In practice (numerical tests), it is often observed that ∇²J_m has even more zero eigenvalues on the coincided critical affine space M_{P,(b̃,ṽ),(m,d)}. Remark D.5. Theorem D.3 shows that there are local plateaus around the d-coincided critical affine spaces M_{P,(b̃,ṽ),(m,d)} for d ≤ m − 1.
In addition, the 0-eigenspace of ∇²J_m is higher-dimensional for smaller d, which suggests that one can get stuck on plateaus more easily.
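The zero eigenvalues above come from concrete flat directions: when two weights coincide, shifting mass between their coefficients while keeping the block sum fixed leaves the model function, and hence the loss, exactly unchanged. A small numerical check (all coefficients below are arbitrary illustrative choices):

```python
import numpy as np

# Flat direction on a coincided-weight configuration: with w_1 = w_2, moving
# (a_1, a_2) -> (a_1 + s, a_2 - s) does not change sum_i a_i e^{-w_i t}, so
# J_m is constant along this line (a zero-curvature direction of the Hessian).
a_star, w_star = np.array([1.0, 0.5]), np.array([1.0, 3.0])

def J(a, w):
    e = lambda p, q: 1.0 / (p[:, None] + q[None, :])   # int_0^inf e^{-pt} e^{-qt} dt
    return float(a @ e(w, w) @ a - 2.0 * a @ e(w, w_star) @ a_star
                 + a_star @ e(w_star, w_star) @ a_star)

w = np.array([2.0, 2.0, 0.8])                 # w_1 = w_2: one coincided weight pair
a = np.array([0.4, 0.1, -0.3])
direction = np.array([1.0, -1.0, 0.0])        # in-block shift with fixed block sum
vals = [J(a + s * direction, w) for s in (0.0, 0.5, -1.7)]
print(vals)                                   # identical up to rounding
```

Each independent coincided pair contributes one such direction, matching the m − d lower bound of Theorem D.3.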

D.1.2 SUFFICIENT CONDITIONS

There is still a gap in connecting Theorem D.1 and Theorem D.2. Namely, it is necessary to guarantee that sup N is relatively large, i.e. that J_1, J_2, ..., J_d all have non-degenerate global minimizers for d as large as possible. Motivated by Kammler (1979a), we can give sufficient conditions by restricting the target ρ to a smaller function space, the so-called completely monotonic functions. Definition D.3. F ∈ C[0, ∞] ∩ C^∞(0, ∞) is called completely monotonic if and only if (−1)^n F^(n)(t) ≥ 0 for 0 < t < ∞, n = 0, 1, ..., and F(∞) = 0. Remark D.6. Several examples of completely monotonic functions: • ρ(t) = 1/(1 + t)^α for any α > 0; • the non-degenerate exponential sum with positive coefficients ρ(t) = Σ_{j=1}^{m*} a*_j e^{−w*_j t}, 0 ≤ w*_1 < ... < w*_{m*}, a*_j > 0, j = 1, 2, ..., m*. Since the space of exponential sums is not closed, we turn to the problem of finding a best approximation to a given ρ ∈ L²[0, ∞) from the set

V_d(R_+) := { ρ̃ ∈ C^d[0, ∞) : [(D + w_1) ··· (D + w_d)]ρ̃ = 0 for some w_1, ..., w_d ∈ R_+ }, (135)

where D denotes the common differential operator. Obviously V_d(R_+) ⊂ L²[0, ∞) and V_d(R_+) ⊊ V_{d+1}(R_+) for any d ∈ N_+. Kammler (1979a) proves the following theorem.

Published as a conference paper at ICLR 2021

Theorem D.4. Assume ρ ∈ L²[0, ∞) to be completely monotonic. Then there exists a best approximation ρ̃_0 to ρ in V_d(R_+) with respect to the common L²-norm, i.e. ‖ρ̃_0 − ρ‖_{L²[0,∞)} = inf_{ρ̃∈V_d(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)}. When ρ ∉ V_d(R_+), any such best approximation admits the non-degenerate form

ρ̃_0(t) = Σ_{j=1}^d b̃_j e^{−ṽ_j t}, 0 < ṽ_1 < ... < ṽ_d, b̃_j > 0, j = 1, 2, ..., d, (136)

and satisfies the generalized Aigrain-Williams equations

L[ρ̃_0](ṽ_j) = L[ρ](ṽ_j), j = 1, 2, ..., d; (d/ds) L[ρ̃_0](s)|_{s=ṽ_j} = (d/ds) L[ρ](s)|_{s=ṽ_j}, j = 1, 2, ..., d. (137)
Note that (136) and (137) closely parallel Definition D.2, except for a different choice of hypothesis function space. We now show a connection between the two problems. Theorem D.5. Assume ρ ∈ L²[0, ∞) to be completely monotonic, and ρ ∉ V_d(R_+) for some d ∈ N_+. Then J_d has non-degenerate global minimizers (b̃, ṽ) ∈ R^d ⊗ R^d_+. Proof. According to Theorem D.4, there exists a non-degenerate best approximation ρ̃_0 to ρ from V_d(R_+), i.e.

‖ρ̃_0 − ρ‖_{L²[0,∞)} = inf_{ρ̃∈V_d(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)}, ρ̃_0(t) = Σ_{j=1}^d b̃_j e^{−ṽ_j t}, 0 < ṽ_1 < ... < ṽ_d, b̃_j > 0, j = 1, 2, ..., d.

We aim to prove J_d(b̃, ṽ) = inf_{b∈R^d, v∈R^d_+} J_d(b, v). Define the following subsets of exponential sums:

V̄_d(R_+) := { ρ̃ : ρ̃(t) = Σ_{i=1}^d a_i e^{−w_i t}, a_i ∈ R, w_i > 0 },
V̄_{d,k}(R_+) := { ρ̃ ∈ V̄_d(R_+) : w = (w_i) has k different components }, 1 ≤ k ≤ d;

then we have inf_{b,v} J_d(b, v) = inf_{ρ̃∈V̄_d(R_+)} ‖ρ̃ − ρ‖²_{L²[0,∞)}. (140) It is straightforward to verify that V̄_d(R_+) = ∪_{k=1}^d V̄_{d,k}(R_+), and V̄_{d,k}(R_+) = V̄_{k,k}(R_+) ⊂ V_k(R_+) for k = 1, ..., d. By (140), we get

‖ρ̃_0 − ρ‖²_{L²[0,∞)} = inf_{ρ̃∈V_d(R_+)} ‖ρ̃ − ρ‖²_{L²[0,∞)} ≤ inf_{ρ̃∈V̄_{d,d}(R_+)} ‖ρ̃ − ρ‖²_{L²[0,∞)}.

Since ρ̃_0 ∈ V̄_{d,d}(R_+), we have J_d(b̃, ṽ) = ‖ρ̃_0 − ρ‖²_{L²[0,∞)} = inf_{ρ̃∈V̄_{d,d}(R_+)} ‖ρ̃ − ρ‖²_{L²[0,∞)}. The last task is to show inf_{ρ̃∈V̄_d(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)} = inf_{ρ̃∈V̄_{d,d}(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)}. In fact, for any ρ̃ ∈ V̄_{k,k} with ρ̃(t) = Σ_{i=1}^k a_i e^{−w_i t}, let ã := (a_1, ..., a_k, 0) and w̃ := (w_1, ..., w_k, 1 + max_{1≤i≤k} w_i); then ρ̄(t) := Σ_{i=1}^{k+1} ã_i e^{−w̃_i t} ∈ V̄_{k+1,k+1}, which implies V̄_{k,k} ⊂ V̄_{k+1,k+1}. Therefore,

inf_{ρ̃∈V̄_d(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)} = inf_{ρ̃∈∪_{k=1}^d V̄_{d,k}(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)} = min_{1≤k≤d} inf_{ρ̃∈V̄_{k,k}(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)} ≥ inf_{ρ̃∈V̄_{d,d}(R_+)} ‖ρ̃ − ρ‖_{L²[0,∞)},

which completes the proof. Combining Theorem D.1 and Theorem D.5 immediately gives the following result. Theorem D.6. Assume ρ ∈ L²[0, ∞) to be completely monotonic, and ρ ∉ V_1(R_+).
Let D := {d ∈ N_+ : ρ ∉ V_d(R_+)} and D_0 := sup D. Then D ≠ ∅, D_0 ≥ 1, and sup N ≥ D_0. Proof. We have 1 ∈ D, and thus D ≠ ∅ with D_0 ≥ 1. Since V_d(R_+) ⊊ V_{d+1}(R_+) for any d ∈ N_+, we have D = {1, 2, ..., D_0} if D_0 < ∞, and D = N_+ if D_0 = ∞. In both cases, ρ ∉ V_k(R_+) for any k ≤ D_0, so by Theorem D.5 each J_k with k ≤ D_0 has non-degenerate global minimizers, i.e. sup N ≥ D_0. • For instance, suppose the target is a non-degenerate exponential sum with positive coefficients: ρ(t) = Σ_{j=1}^m a*_j e^{−w*_j t}, where a*_j > 0 and w*_i ≠ w*_j for any i ≠ j, i, j = 1, ..., m. Then D = {1, 2, ..., m − 1} and D_0 = m − 1. The total number of coincided critical affine spaces of the corresponding J_m is then at least Σ_{d=1}^{m−1} d!·S(m, d), which is exactly (130). A complement to Theorem D.2 is as follows. Theorem D.7. Fix any m ∈ N_+ relatively large. Consider the loss J_m associated with a target that is a non-degenerate exponential sum with positive coefficients, i.e. ρ(t) = Σ_{j=1}^m a*_j e^{−w*_j t}, where a*_j > 0 and w*_i ≠ w*_j for any i ≠ j, i, j = 1, ..., m. Then in the landscape of J_m, the number of coincided critical affine spaces is at least Poly(m) times larger than the number of global minimizers. Proof. By Theorem D.2, we only need to show m ∈ N. Since ρ ∈ L²[0, ∞) is completely monotonic and ρ ∉ V_k(R_+) for any k ≤ m − 1, Theorem D.5 shows that J_k has non-degenerate global minimizers for any k ≤ m − 1, i.e. m − 1 ∈ N. The proof is completed by noticing that J_m obviously has non-degenerate global minimizers, e.g. (a*, w*).
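Complete monotonicity (Definition D.3) can be probed numerically: by the mean value theorem, a completely monotonic F satisfies (−1)^n Δ_h^n F(t) = h^n (−1)^n F^(n)(ξ) ≥ 0 for every step h > 0, so the alternating forward differences must all be nonnegative. A sketch for the first example of Remark D.6 (grid choices arbitrary):

```python
import numpy as np

# Alternating finite differences of rho(t) = 1/(1+t)^alpha: nonnegativity of
# (-1)^n Delta_h^n rho(t) is consistent with complete monotonicity.
def alt_diff(f, t, n, h):
    """(-1)^n times the n-th forward difference of f at t with step h."""
    vals = np.array([f(t + k * h) for k in range(n + 1)])
    for _ in range(n):
        vals = vals[1:] - vals[:-1]   # one forward-difference pass
    return (-1) ** n * vals[0]

rho = lambda t, alpha=1.5: (1.0 + t) ** (-alpha)
checks = [alt_diff(rho, t, n, h)
          for t in (0.0, 1.0, 5.0) for n in range(6) for h in (0.1, 1.0)]
print(min(checks))   # nonnegative, as complete monotonicity requires
```

For ρ(t) = 1/(1+t)^α the check can also be done exactly: (−1)^n ρ^(n)(t) = α(α+1)···(α+n−1)/(1+t)^{α+n} > 0.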

D.1.3 A LOW-DIMENSIONAL EXAMPLE

To further understand the structure of coincided critical affine spaces, we focus on a specific lowdimensional example here. That is min a∈R 2 ,w∈R 2 + J 2 (a, w) = 2 i=1 a i e -wit -ρ(t) 2 L 2 [0,∞) , with the target to be a non-degenerate exponential sum ρ(t) = m * j=1 a * j e -w * j t , where a * j = 0 and w * i = w * j for any i = j, i, j = 1, • • • , m * . As we will show later, the coincided critical affine spaces of J 2 contain both saddles and degenerate stable points which are not global optimal. By Lemma D.1, Remark D.1 and Theorem D.1, we know the 1-coincided critical affine space of J 2 exists, and it can be constructed by taking the non-degenerate global minimizer of J 1 , say (â, ŵ) with â = 0 and ŵ > 0. Then M (â, ŵ),(2,1) := {(a 1 , â -a 1 , ŵ, ŵ) : a 1 ∈ R} ∈ R 4 is a linefoot_11 , and ∇J 2 (a 1 , â -a 1 , ŵ, ŵ) = 0 for any a 1 ∈ R. Denote the Hessian of J 2 on the line M (â, ŵ),(2,1) by A (â, ŵ) (a 1 ), i.e. A (â, ŵ) (a 1 ) := ∇ 2 J 2 (a 1 , â -a 1 , ŵ, ŵ). We investigate the landscape of J 2 on the line M (â, ŵ),(2,1) by analyzing the eigenvalue distribution of A (â, ŵ) (a 1 ). Proposition D.1. Suppose m = m * = 2, and 0 < w * 1 < w * 2 . Let I 1 := [0, â] and I 2 := (-∞, 0) ∪ (â, +∞)foot_12 . Then 1. If a * 1 a * 2 < 0, the minimal eigenvalue of A (â, ŵ) (a 1 ) is 0 for any a 1 ∈ I 1 , and negative for any a 1 ∈ I 2 ; 2. If a * 1 a * 2 > 0 and w * 2 /w * 1 < 2 + √ 3, the minimal eigenvalue of A (â, ŵ) (a 1 ) is negative for any a 1 ∈ I 1 , and 0 for any a 1 ∈ I 2 . Considering the congruent transformation of A (â, ŵ) (a 1 ), which does not affect the index of inertia:  A (â, ŵ) (a 1 ) =      1 ŵ 1 ŵ -a1 2 ŵ2 -a2 2 ŵ2 1 ŵ 1 ŵ -a1 2 ŵ2 -a2 2 ŵ2 -a1 2 ŵ2 -a1 2 ŵ2 a 2 1 2 ŵ3 + 4c( ŵ)a 1 a1a2 2 ŵ3 -a2 2 ŵ2 -a2 2 ŵ2 (1 + v j ) 3 = -16a * 1 a * 2 (v 1 -v 2 ) 2 (1 + v 1 ) 3 (1 + v 2 ) 3 > 0, which gives âc( ŵ) > 0 and â2 + 16 ŵ3 âc( ŵ) > 8 ŵ3 âc( ŵ) > 0. 
(ii) a*_1 a*_2 > 0 and w*_2/w*_1 < 2 + √3. Since 4s²(s² − 3s + 1)² − 4s⁴ = 4s²(s − 1)²[(s − 2)² − 3], and 1 < s = u_2/u_1 = (ŵ + w*_2)/(ŵ + w*_1) < w*_2/w*_1 < 2 + √3, we get ∆_c < 0. This implies c² − 2s(s² − 3s + 1)c + s⁴ > 0 and â² + 16ŵ³·â·c(ŵ) > 0. In both (i) and (ii), â² + 16ŵ³·â·c(ŵ) > 0, which implies that there is at least one positive diagonal element of A_{(â,ŵ)}(a_1) in a sufficiently small neighborhood of a_1 = 0 and of a_1 = â. By the Rayleigh-Ritz theorem and Weyl's theorem, A_{(â,ŵ)}(a_1) has at least one positive eigenvalue in this neighborhood. However, by (142), det(A_{(â,ŵ)}(a_1)) only changes sign at a_1 = 0 and a_1 = â. This implies that another eigenvalue of A_{(â,ŵ)}(a_1) changes sign at a_1 = 0 and a_1 = â accordingly. Combining the different signs of â·c(ŵ) derived in (i) and (ii) with (142) completes the proof. Remark D.8. From Proposition D.1, we see that there are both saddles and degenerate stable points of J_2 on the critical affine space (line) M_{(â,ŵ),(2,1)}; each type in fact forms an affine space (line) of its own, but they are certainly not global minimizers. Therefore, gradient-based algorithms can get stuck around this affine space, unless they meet saddles with negative eigenvalues of large magnitude.
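The 1-coincided critical line of J_2 can be exhibited numerically: find a critical point (â, ŵ) of J_1 (for the target below, its minimizer) and check that ∇J_2 vanishes along the entire line (a_1, â − a_1, ŵ, ŵ). The target coefficients and the bisection bracket are illustrative choices.

```python
import numpy as np

# Critical point of h(w) = ||rho||^2 - 2w (L[rho](w))^2 satisfies L(w)/(2w) = L2(w),
# where L is the Laplace transform of rho and L2(w) = -dL/dw.
a_star, w_star = np.array([1.0, 0.5]), np.array([1.0, 3.0])
L = lambda w: float(np.sum(a_star / (w + w_star)))
L2 = lambda w: float(np.sum(a_star / (w + w_star) ** 2))

g = lambda w: L(w) / (2.0 * w) - L2(w)
lo, hi = 1e-3, 50.0                      # bracket with g(lo) > 0 > g(hi)
for _ in range(200):                     # plain bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
w_hat = 0.5 * (lo + hi)
a_hat = 2.0 * w_hat * L(w_hat)           # a(w) = 2w L[rho](w)

def gradJ2(a, w):
    e = lambda p, q: 1.0 / (p[:, None] + q[None, :])
    G, C = e(w, w), e(w, w_star)
    ga = 2.0 * (G @ a - C @ a_star)                      # dJ/da in closed form
    gw = -2.0 * a * ((G ** 2) @ a - (C ** 2) @ a_star)   # dJ/dw in closed form
    return np.concatenate([ga, gw])

grads = [float(np.abs(gradJ2(np.array([a1, a_hat - a1]),
                             np.array([w_hat, w_hat]))).max())
         for a1 in (-2.0, 0.3, 5.0)]
print(w_hat, a_hat, grads)               # an entire line of critical points
```

Every point on the line is critical, so a gradient method that lands near it sees vanishing gradients regardless of a_1, the plateau mechanism analyzed in Proposition D.1.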

E MOMENTUM HELPS TRAINING: QUADRATIC EXAMPLES

In practice, it is often the case that training gets trapped in some very flat region (plateau), where the loss function has rather small gradients and small negative eigenvalues of the Hessian. We now illustrate the escape dynamics (from plateaus) via a simple quadratic example. Consider the loss function f(x) = (x_1² − εx_2²)/2 with 0 < ε ≪ 1. We check the escaping performance for continuous-in-time analogs of two optimization algorithms: gradient descent (GD) and the momentum (heavy ball) method. (1) Gradient descent. Consider the gradient flow of f(x) with initial value x_0 = (δ, 1)ᵀ, where 0 < δ ≪ 1 and δ = O(ε). Thus ‖∇f(x_0)‖_2 = O(ε), and

x_1′(τ) = −x_1(τ), x_1(0) = δ; x_2′(τ) = εx_2(τ), x_2(0) = 1
⇒ x_1(τ) = δe^{−τ}, x_2(τ) = e^{ετ}
⇒ f(x(τ)) = (δ²e^{−2τ} − εe^{2ετ})/2 =: ℓ_1(τ).

It is easy to see that ℓ_1(τ) exhibits different timescales. In fact, when τ = O(1/ε), ℓ_1(τ) = O(ε²)e^{−O(1/ε)} − εe^{O(1)} = O(ε). However, when τ continues to increase, say τ ≥ (1/(2ε)) ln(δ_0/ε) =: τ_1, where δ_0 > 0 denotes a gap satisfying ε = o(δ_0), we get

ℓ_1(τ) ≤ ℓ_1(τ_1) = O(ε²) − (ε/2)·e^{2ε·(1/(2ε)) ln(δ_0/ε)} = O(ε²) − δ_0/2 < −δ_0/4

for any τ ≥ τ_1. (2) Momentum. The momentum algorithm has the update rule x_{k+1} = x_k − η∇f(x_k) + ρ(x_k − x_{k−1}), where ρ ∈ R and η > 0 is the learning rate. A continuous-in-time analog can be derived (similarly to the arguments in Su et al. (2014)) as

0 = ρ·(x_{k+1} − 2x_k + x_{k−1})/η + ((1 − ρ)/√η)·(x_{k+1} − x_k)/√η + ∇f(x_k) ≈ ρx″(t) + ((1 − ρ)/√η)x′(t) + ∇f(x(t)),

with x_k := x(k√η) and step size √η in the simple finite differences.foot_13 Letting x_1 = x_0 − η∇f(x_0), we also get x′(0) = −√η∇f(x(0)). To facilitate a comparison with GD, we take η = 1foot_14 and ρ = 1foot_15. Plugging in the expression of f, we can solve the ODE x″(t) + ∇f(x(t)) = 0:

x_1″(τ) + x_1(τ) = 0, x_1(0) = δ, x_1′(0) = −δ;
x_2″(τ) − εx_2(τ) = 0, x_2(0) = 1, x_2′(0) = ε
⇒ x_1(τ) = δ(cos τ − sin τ), x_2(τ) = ((1 + √ε)/2)e^{√ε τ} + ((1 − √ε)/2)e^{−√ε τ}
⇒ f(x(τ)) = (1/2)[ δ²(cos τ − sin τ)² − ε( ((1 + √ε)/2)e^{√ε τ} + ((1 − √ε)/2)e^{−√ε τ} )² ].

Combining (1) and (2), we have the following conclusions. • For both training dynamics, there are different timescales in the loss; that is, a relatively long time is needed to escape from the plateau. • Comparing (146) and (148), we get different escape timescales: O((1/ε)·ln(1/ε)) for GD and O((1/√ε)·ln(1/ε)) for momentum. Just as in the convex case, where momentum improves the convergence rate by weakening the dependence on the condition number, momentum here also helps to escape rather flat saddles.
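The two closed-form loss curves can be compared directly by measuring the first time the loss drops below −δ_0/4. The values of eps, delta, delta0 and the grids below are illustrative, and the momentum curve neglects the O(δ) oscillation of x_1.

```python
import numpy as np

# Escape times for f(x) = (x1^2 - eps*x2^2)/2 along the closed-form trajectories:
# gradient flow gives f = (delta^2 e^{-2 tau} - eps e^{2 eps tau})/2, while the
# momentum dynamics x'' = -grad f grows x2 at rate sqrt(eps) instead of eps.
eps, delta, delta0 = 1e-4, 1e-4, 0.1

taus_gd = np.linspace(0.0, 10.0 / eps, 200001)
f_gd = 0.5 * (delta ** 2 * np.exp(-2 * taus_gd) - eps * np.exp(2 * eps * taus_gd))
t_gd = taus_gd[np.argmax(f_gd <= -delta0 / 4)]     # first crossing of -delta0/4

taus_m = np.linspace(0.0, 10.0 / np.sqrt(eps), 200001)
x2 = np.cosh(np.sqrt(eps) * taus_m) + np.sqrt(eps) * np.sinh(np.sqrt(eps) * taus_m)
f_m = 0.5 * (delta ** 2 - eps * x2 ** 2)           # x1 stays O(delta); ignored
t_mom = taus_m[np.argmax(f_m <= -delta0 / 4)]

print(t_gd, t_mom)   # GD needs ~ (1/eps) log(1/eps); momentum ~ (1/sqrt(eps)) log(1/eps)
```

With eps = 1e-4, the GD escape time is roughly two orders of magnitude larger than the momentum escape time, matching the 1/ε versus 1/√ε scaling derived above.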



For any n ∈ N+, [n] := {1, 2, • • • , n}. That is, the non-degenerate case. Obviously, I0 = ∅ implies an uncountable M * P , but they are all degenerate. The result follows from basic knowledge of combinatorics. See details in the proof of Theorem D.1. Here i2 := |{r ∈ [i0] : |I0,r| ≥ 2}|. When I0 = ∅, the upper bound is 2m * ; when I0 = ∅, since I0,r = ∅ for any r ∈ [i0], let i1 := |{r ∈ [i0] : |I0,r| = 1}| and i2 defined as before. Then i0 = i1 + i2, |I0| = i 0 r=1 |I0,r| ≥ i1 + 2i2 = i0 + i2. The last inequality follows from |I0| = m -m * j=1 |Ij| ≤ m -m * . Here θ i ω (τ ) denotes the i-th term in the asymptotic expansion of θω(τ ), not the i-th power. With the convention that h(τ ; 0) = τ for any τ > 0, and h(0; λ) ≡ 0 for any λ ∈ R. The global minimizers are distinct when the target is an exponential sum. Here we compare the number of (critical) affine spaces with the number of global minimizers (both of them are finite). When the target is not an exponential sum, the same conclusion holds if there are still finite number of global minimizers. See Remark D.3 in Appendix D.1.1 for details. The affine spaces degenerate to distinct points when d = m. For sufficient conditions to guarantee M > 1 (to avoid vacuous results), see Theorem D.6 and Remark D.7 in Appendix D.1.2. By considering the gradient flow dynamic of J d instead of Jm, a model reduction (from m-dimensional to d-dimensional) is almost completed on M P,(b,v),(m,d) , except for some trivial degenerate cases (e.g. a k = 0 or bs = 0). Although the assumption m ∈ N seems strong, we will provide sufficient conditions to guarantee its validity in Appendix D.1.2. See an complement in Theorem D.7. That is, the affine space M P,( b,v),(m,d) . See details in the proof of Theorem D.1 Here we omit the corresponding partition P since it is unique. Suppose â > 0 here without loss of generality. If â < 0, we let I1 := [â, 0] and I2 := (-∞, â) ∪ (0, +∞) and the same conclusions hold. 
It is easy to check that the error of discretization is of order O( √ η). In the continuous-in-time analog of GD, i.e. the gradient flow, the step size is taken as 1. As is seen later, ρ = 1 not only simplifies the analysis, but also helps to obtain the best acceleration.



Figure 1: Comparison of training dynamics on different types of functionals. (a) and (b): using the linear RNN model with the GD optimizer; (c): using the nonlinear RNN model (with tanh activation) with the Adam optimizer. The shaded region depicts the mean ± the standard deviation over 10 independent runs with randomized initializations. Observe that learning complex functionals (Airy, Lorenz) suffers from slow-downs in the form of long plateaus.

there exists P such that (a, w) ∈ M * P . Then for any t ≥ 0Let I j = i ∈ [m] : w i = w * j for any j ∈ [m * ], andI 0 = i ∈ [m] : w i = w * j for any j ∈ [m * ] . Recall that ρ(t) = m * j=1 a * j e -w * j t is non-degenerate: a * j = 0, w * j > 0 and w * i = w * j for any i = j, i, j ∈ [m * ], we get [m] = ∪ m * j=0 I j , I j1 ∩ I j2 = ∅ for any j 1 = j 2 , j 1 , j 2 ∈ {0} ∪ [m * ]. Combining Lemma C.2 and the non-degeneracy of ρ , I j = ∅ for any j ∈ [m * ]. Assume that there are i 0 different components in (w i ) i∈I0 , say v 1 , • • • , v i0 , then v r = w * j for any r ∈ [i 0 ] and j ∈ [m * ]. Let I 0,r = i ∈ I 0 : w i = v r for any r ∈ [i 0 ], we get I 0,r = ∅ for any r ∈ [i 0 ] (if I 0 = ∅), and I 0 = ∪ i0 r=1 I 0,r , and I 0,r1 ∩ I 0,r2 = ∅ for anyr 1 = r 2 , r 1 , r 2 ∈ [i 0 ]. Hence [m] = ∪ m * j=0 I j with I 0 = ∪ i0 r=1 I 0,r forms a P defined in Definition C.1, and 0 ≡ ρ(t; a, w) -ρ(ta i e -vrt + m * j=1 i∈Ij a i -a * j e -w * j t .Again by Lemma C.2, we have i∈Ij a i = a * j for any j ∈ [m * ] and i∈I0,r a i = 0 for any r ∈ [i 0 ], which gives (a, w) ∈ M * P . The proof is completed. Remark C.2. Let I 0 = ∅. 4 It is straightforward to check that for any partition P, the dimension of M * P is m * j=1 (|I j | -1) = m -m * . In addition, it can be verified that the cardinality of M * is m * ! m m * , where m m * is the Stirling number of the second kind. 5

n,ω ( ŵk ) ≤ ∆ n,ω ( ŵmin ) ω -n e -ŵmin/ω c n ∈ N, ω ∈ (0, min{1/2, 1/t 0 , 2c 1 / ŵmin }) and k ∈ [m].

Figure 3: The timescale of plateauing and parameter separation. Here the model and target are selected the same as in Figure 2, but with a larger width m = 10. We see that the logarithm of the time of plateauing and parameter separation is almost linear in the memory 1/ω.

Figure 4: Numerical verification of the plateauing behavior under general settings, with non-diagonal recurrent kernels, the non-linear activation (tanh), and the Adam (momentum-based) optimizer. Here we use the same target functional as in Figure 2 with memory 1/ω = 20. The time horizon is chosen as T = 32, and 128 input samples are generated from a standard white noise. The learning rate is 1.0 for GD and 0.001 for Adam. 10 initializations are sampled and trained for each experiment. We consider two possible input distributions: a) white-noise inputs; b) inputs of the form x_t = Σ_{j=1}^J α_j cos(λ_j t), with λ_j ∼ U[0, 10] and α_j ∼ N(0, 1).

min_{w>0} h(w) = min_{w∈[w_lb, w_ub]} h(w) for some 0 < w_lb < w_ub < ∞, which implies min_{a,w>0} J_1(a, w) = min_{w>0} J_1(a(w), w) = min_{w∈[w_lb, w_ub]} J_1(a(w), w).
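The reduction used in Lemma D.1 can be checked numerically: for fixed w, the inner minimization over a is an exact quadratic with minimizer a(w) = 2w·L[ρ](w) and value h(w) = ‖ρ‖² − 2w(L[ρ](w))². The exponential-sum target below is an arbitrary illustrative choice.

```python
import numpy as np

# One-neuron loss J_1(a, w) = a^2/(2w) - 2 a L[rho](w) + ||rho||^2 for an
# exponential-sum target, with L[rho] and ||rho||^2 in closed form.
a_star, w_star = np.array([1.0, 0.5]), np.array([1.0, 3.0])

def laplace_rho(w):
    return float(np.sum(a_star / (w + w_star)))

rho_norm2 = float(np.sum(np.outer(a_star, a_star)
                         / (w_star[:, None] + w_star[None, :])))

def J1(a, w):
    return a * a / (2.0 * w) - 2.0 * a * laplace_rho(w) + rho_norm2

w = 0.7
a_opt = 2.0 * w * laplace_rho(w)                    # a(w) = 2w L[rho](w)
h = rho_norm2 - 2.0 * w * laplace_rho(w) ** 2       # h(w) = J_1(a(w), w)
print(J1(a_opt, w), h)                              # the two values coincide
print(J1(a_opt + 0.1, w) > J1(a_opt, w), J1(a_opt - 0.1, w) > J1(a_opt, w))
```

Minimizing J_1 over both variables thus reduces to the scalar problem min_w h(w), which is how the compact-set argument of the lemma proceeds.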

If the target is an exponential sum, i.e. ρ(t) = Σ_{j=1}^{m*} a*_j e^{−w*_j t}, then ρ is smooth and √w·|L[ρ](w)| → 0 as w → 0+ and as w → +∞; in fact, L[ρ](w) = O(1) as w → 0+ and L[ρ](w) = O(1/w) as w → +∞. Hence 1 ∈ N by Lemma D.1, and thus N ≠ ∅. Theorem D.1. Assume that N ≠ ∅, with N defined in (124). Let M := sup N. Then for any m, d ∈ N_+ with 1 ≤ d ≤ min{m, M}, there exist at least d!·S(m, d) d-coincided critical affine spaces of J_m, where S(m, d) ∈ N_+ is the Stirling number of the second kind.

and $(\bar b, \bar v)$ takes a non-degenerate form: $\bar b_i \neq 0$, $\bar v_i \neq \bar v_j$ for any $i \neq j$, $i, j = 1, 2, \dots, d$. (128) By (127), we get $\nabla J_d(\bar b, \bar v) = 0$. Combining with (126) and (128), we obtain $\nabla J_m(\hat a, \hat w) = 0$ for any $(\hat a, \hat w) \in M_{\mathcal{P}, (\bar b, \bar v), (m, d)}$, i.e. $(\hat a, \hat w)$ belongs to a $d$-coincided critical affine space. Note that this affine space has dimension $\sum_{j=1}^{d} (|I_j| - 1) = m - d$, since there are $d$ linear equality constraints on the $m$-dimensional vector $a$.

ii) Counting. By the structure of the affine spaces discussed above, we can distinguish different affine spaces by their partitions $\mathcal{P}$. Counting the number of different partitions $\mathcal{P}: \{1, \dots, m\} = \cup_{j=1}^{d} I_j$ decomposes into two steps. First, partition a set of $m$ labelled objects into $d$ nonempty unlabelled subsets; by definition, the number of ways is the Stirling number of the second kind $\left\{ {m \atop d} \right\}$. Second, assign each subset to one of $I_1, \dots, I_d$; there are $d!$ ways in total. Therefore, the number of $d$-coincided critical affine spaces is at least $d!\left\{ {m \atop d} \right\}$. The proof is completed. Combining Lemma D.1, Remark D.1 and Theorem D.1 gives the following theorem, which states that the landscape contains far more saddles and degenerate stationary points that are not globally optimal than global minimizers (provided the target is an exponential sum). Theorem D.2. Fix any relatively large $m \in \mathbb{N}_+$. Consider the loss $J_m$ associated with a target that is a non-degenerate exponential sum, i.e. $\rho(t) = \sum_{j=1}^{m} a_j^* e^{-w_j^* t}$, where $a_j^* \neq 0$ and $w_i^* \neq w_j^*$ for any $i \neq j$, $i, j = 1, \dots, m$. Assume that $m \in \mathcal{N}$ with $\mathcal{N}$ defined in (124). Then in the landscape of $J_m$, the number of coincided critical affine spaces is at least $\mathrm{Poly}(m)$ times larger than the number of global minimizers.

ii) Coincided critical affine spaces. Obviously $\mathcal{N} \neq \emptyset$, and $M = \sup \mathcal{N} \ge m$. According to Theorem D.1, for any $d \in \mathbb{N}_+$ with $1 \le d \le \min\{m, M\} = m$, there are at least $d!\left\{ {m \atop d} \right\}$ $d$-coincided critical affine spaces of $J_m$. By (i), for any $d \le m - 1$, there are no global minimizers in these affine spaces. We then count the total number. Comparison. To bound the ratio between (130) and $m!$, we need an elementary recurrence

$D_0 := \sup \mathcal{D}$, and write $\bar m := \min\{m, D_0\}$. Then the total number of coincided critical affine spaces of $J_m$ is at least

for any $k \le \bar m$. By Theorem D.5, $J^{(k)}$ has non-degenerate global minimizers for any $k \le \bar m$, i.e. $\bar m \in \mathcal{N}$. According to Theorem D.1, for any $d \in \mathbb{N}_+$ with $1 \le d \le \bar m = \min\{m, \bar m\} \le \min\{m, M\}$, there exist at least $d!\left\{ {m \atop d} \right\}$ $d$-coincided critical affine spaces of $J_m$. Summing over $d$ gives the total number. Suppose the target is $\rho(t) = 1/(1+t)^{\alpha}$, $\alpha > 0$; then $\mathcal{D} = \mathbb{N}_+$ and $D_0 = \infty$. The total number of coincided critical affine spaces of the corresponding $J_m$ is at least

In fact, $V_d(\mathbb{R}_+) \subsetneq V_{d+1}(\mathbb{R}_+)$ for any $d \in \mathbb{N}_+$ implies that if $\rho \notin V_d(\mathbb{R}_+)$, then $\rho \notin V_k(\mathbb{R}_+)$ for any $k \le d$, i.e. $d \in \mathcal{D} \Rightarrow k \in \mathcal{D}$ for any $k \le d$; conversely, if $\rho \in V_d(\mathbb{R}_+)$, then $\rho \in V_l(\mathbb{R}_+)$ for any $l \ge d$, i.e. $d \notin \mathcal{D} \Rightarrow l \notin \mathcal{D}$ for any $l \ge d$.
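The total count $\sum_{d=1}^{\bar m} d!\left\{ {m \atop d} \right\}$ discussed above can be computed directly; a minimal sketch (the function names are ours):

```python
from math import factorial

def stirling2(m, k):
    """Stirling number of the second kind via the standard recurrence
    S(m, k) = k * S(m-1, k) + S(m-1, k-1)."""
    S = [[0] * (k + 1) for _ in range(m + 1)]
    S[0][0] = 1
    for i in range(1, m + 1):
        for j in range(1, min(i, k) + 1):
            S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
    return S[m][k]

def total_affine_spaces(m, m_bar):
    """Lower bound on the number of coincided critical affine spaces of J_m:
    sum over d = 1..m_bar of d! * S(m, d)."""
    return sum(factorial(d) * stirling2(m, d) for d in range(1, m_bar + 1))

# When m_bar = m (e.g. rho(t) = 1/(1+t)^alpha, so that D_0 = infinity), the sum
# equals the number of ordered set partitions of [m] (the ordered Bell number).
print(total_affine_spaces(3, 3))  # 13
print(total_affine_spaces(4, 4))  # 75
```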

$(\cdot + w_j^*)^3$, and $a_2 := \hat a - a_1$. A straightforward computation shows that $A_{(\hat a, \hat w)}(a_1) =$

$A_{(\hat a, \hat w)}(a_1)$ has one positive eigenvalue $1/\hat w$ and one zero eigenvalue. What remains are the eigenvalues of the remaining block $\tilde A_{(\hat a, \hat w)}(a_1)$. To determine their signs, we compute
$$\det\big(\tilde A_{(\hat a, \hat w)}(a_1)\big) = a_1(\hat a - a_1) \cdot 4c(\hat w) \cdot \hat a\, c(\hat w) \cdot \big(\hat a^2 + 16 \hat w^3 \hat a\, c(\hat w)\big). \tag{143}$$
So we need to analyze the signs of $\hat a\, c(\hat w)$ and $\hat a^2 + 16 \hat w^3 \hat a\, c(\hat w)$ under different assumptions on $(a^*, w^*)$.

By the optimality condition of $(\hat a, \hat w)$ for $J_1$, we have $\cdots (\hat w + w_j^*)^3$. Write $v_j := w_j^*/\hat w$ for $j = 1, 2$; then $0 < v_1 < v_2$, and

By (145), $\hat a\, c(\hat w) < 0$. Writing $s = u_2/u_1 > 1$, a direct expansion of $\hat a^2 + 16 \hat w^3 \hat a\, c(\hat w)$ yields a polynomial expression in $s$ and $c$ (involving terms such as $3\hat a^2$, $cs^2$, $cs^3$, $12s(s^2 - 3s + 1)c$ and $s^4$), whose sign we analyze next.

Write $\ell_2(\tau) := f(x(\tau))$. It is not hard to show that $\ell_2(\tau)$ still exhibits different timescales. In fact, when $\tau = O(1/\sqrt{\epsilon})$,
$$\ell_2(\tau) = O(\epsilon^2)\,|O(1)| - |O(1)|\big(e^{|O(1)|} + e^{-|O(1)|}\big)^{-2} = O(\epsilon).$$
However, when $\tau$ continues to increase, say $\tau \ge \tau_2$, we have $\ell_2(\tau) = \cdots + O(\epsilon) = O(\epsilon) - \delta_0 < -\delta_0/2$; hence $\ell_2(\tau) \le O(\epsilon) - \delta_0 < -\delta_0/2$ for any $\tau \ge \tau_2$.

Concretely, let us denote by $e_i$ ($i = 1, \dots, d$) the standard basis vectors in $\mathbb{R}^d$, and let $\boldsymbol{e}_i$ denote the constant signal with $\boldsymbol{e}_{i,t} = e_i \mathbf{1}_{\{t \ge 0\}}$. Then, smoothness and memory are characterized by the regularity and decay rate of the maps $t \mapsto H_t(\boldsymbol{e}_i)$,

That is to say, on the one hand, there are infinitely many critical points forming affine spaces in the landscape of $J_m$; on the other hand, even counting only the number of affine spaces, there are still far fewer global minimizers (provided the width $m$ is relatively large). Remark D.3. When the target $\rho$ is not an exponential sum, it is straightforward to see that Theorem D.2 still holds as long as the number of global minimizers is finite (with scale no more than factorial). We now investigate $\nabla^2 J_m$ on the above coincided critical affine spaces. It is shown that $\nabla^2 J_m$ is singular and can have multiple zero eigenvalues. Theorem D.3. Fix any $m, d \in \mathbb{N}_+$ with $1 \le d \le m$. On the $d$-coincided critical affine spaces of $J_m$ (induced by non-degenerate global minimizers of $J_d$), $\nabla^2 J_m$ has rank at most $m + d$, and hence has at least $m - d$ zero eigenvalues. Proof. A straightforward computation shows that, for $k$


Here we take the width m = 2 in the model. The corresponding gradient flow (21) is numerically solved by the Adams-Bashforth-Moulton method. Observe that the plateauing time increases rapidly as the memory becomes longer (ω decreases). In all cases, the trained RNN model has a hidden dimension of 16 and the total length of the path is T = 6.4. The continuous-time RNN is discretized using the Euler method with step size 0.1.
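The Euler discretization mentioned above can be sketched as follows. This is a hedged reconstruction of the forward pass only: the weights are random rather than trained, and the exact placement of the tanh nonlinearity is our assumption.

```python
import numpy as np

def rnn_forward(W, U, c, x, dt=0.1):
    """Forward-Euler discretization of a continuous-time RNN
       dh/dt = tanh(W h + U x_t),  y_t = c^T h_t,  h_0 = 0.
    x has shape (steps, input_dim); returns the output sequence (steps,)."""
    h = np.zeros(W.shape[0])
    y = np.empty(len(x))
    for k, xk in enumerate(x):
        h = h + dt * np.tanh(W @ h + U @ xk)  # one Euler step of size dt
        y[k] = c @ h
    return y

rng = np.random.default_rng(1)
T, dt, hidden = 6.4, 0.1, 16              # matches the setting in the text
steps = round(T / dt)                     # 64 time steps
x = rng.standard_normal((steps, 1))       # a white-noise input path
W = -np.eye(hidden) + 0.1 * rng.standard_normal((hidden, hidden))
U = rng.standard_normal((hidden, 1))
c = rng.standard_normal(hidden) / hidden
y = rnn_forward(W, U, c, x, dt)
```

The stable diagonal part of W (here −I plus a small perturbation) keeps the hidden state bounded over the path.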

C.3.2 LONG-TERM MEMORY SIGNIFICANTLY CONTRIBUTES TO PLATEAUS

Recall the simple example of a target with memory (119). We aim to learn $\rho$ with an exponential sum, i.e. the simplified linear RNN model with a diagonal recurrent kernel (see (76) and Appendix C.1). The optimization is performed by the gradient flow (21), a continuous-time idealization of the gradient descent dynamics. The ODE (21) is numerically solved by the Adams-Bashforth-Moulton method. The results are illustrated in Figure 2. They show that the plateauing time increases rapidly as the memory $1/\omega$ becomes longer.

C.3.3 NUMERICAL VERIFICATIONS

(1) The timescale estimate

We first numerically verify the timescale proved in Theorem 5.1 (or Theorem C.1): the time of plateauing (and also of parameter separation $\|\theta(\tau) - \theta(0)\|_2$) grows exponentially as the memory $1/\omega \to +\infty$. The results are shown in Figure 3, where we observe good agreement with the predicted scaling.

(2) General cases

To facilitate mathematical tractability, the analysis so far has been restricted to the case of a diagonal recurrent kernel $W$ with negative entries, linear activations and gradient-flow training dynamics. However, we show here that the plateauing behavior, which we now understand as a generic feature of the long-term memory of the target functional and its interaction with the optimization dynamics, is present even in general settings; hence our simplified analytical setting is representative of the general situation. In Figure 4, we again take the target functional defined in (119), but apply more general models to learn it, including full (non-diagonal) recurrent kernels of the RNN with no restrictions on

