LEARNING LOW DIMENSIONAL STATE SPACES WITH OVERPARAMETERIZED RECURRENT NEURAL NETS

Abstract

Overparameterization in deep learning typically refers to settings where a trained neural network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of Recurrent Neural Networks (RNNs), there exists an additional layer of overparameterization, in the sense that a model may exhibit many solutions that generalize well for sequence lengths seen in training, some of which extrapolate to longer sequences, while others do not. Numerous works have studied the tendency of Gradient Descent (GD) to fit overparameterized NNs with solutions that generalize well. On the other hand, its tendency to fit overparameterized RNNs with solutions that extrapolate has been discovered only recently and is far less understood. In this paper, we analyze the extrapolation properties of GD when applied to overparameterized linear RNNs. In contrast to recent arguments suggesting an implicit bias towards short-term memory, we provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory. Our result relies on a dynamical characterization which shows that GD (with small step size and near-zero initialization) strives to maintain a certain form of balancedness, as well as on tools developed in the context of the moment problem from statistics (recovery of a probability distribution from its moments). Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.

1. INTRODUCTION

Neural Networks (NNs) are often overparameterized, in the sense that their representational capacity far exceeds what is necessary for fitting training data. Surprisingly, training overparameterized NNs via (variants of) Gradient Descent (GD) tends to produce solutions that generalize well, despite the existence of many solutions that do not. This implicit generalization phenomenon has attracted considerable scientific interest, resulting in various theoretical explanations (see, e.g., Woodworth et al. (2020); Yun et al. (2020); Zhang et al. (2017); Li et al. (2020); Ji & Telgarsky (2018); Lyu & Li (2019)).

Recent studies have surfaced a new form of implicit bias that arises in Recurrent Neural Networks (RNNs) and their variants, e.g., Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (Chung et al., 2014). For such models, the length of sequences in training is often shorter than in testing, and it is not clear to what extent a learned solution will be able to extrapolate beyond the sequence lengths seen in training. In the overparameterized regime, where the representational capacity of the learned model exceeds what is necessary for fitting short sequences, there may exist solutions that generalize but do not extrapolate, meaning that their accuracy is high over short sequences but arbitrarily poor over long ones (see Cohen-Karlik et al. (2022)). In practice, however, when training RNNs using GD, accurate extrapolation is often observed. We refer to this phenomenon as the implicit extrapolation of GD. As opposed to the implicit generalization of GD, little is formally known about its implicit extrapolation. Existing theoretical analyses of the latter focus on linear RNNs, also known as Linear Dynamical Systems (LDS), and either treat infinitely wide models (Emami et al., 2021) or models of finite width that learn from a memoryless teacher (Cohen-Karlik et al., 2022). In these regimes, GD has been argued to exhibit an implicit bias towards short-term memory.
While such results are informative, their generality remains in question, particularly since infinitely wide NNs are known to differ substantially from their finite-width counterparts, and since a memoryless teacher neglects the main characteristic of RNNs (memory). In this paper, we theoretically investigate the implicit extrapolation of GD when applied to overparameterized finite-width linear RNNs learning from a teacher with memory. We consider models with symmetric transition matrices, in the case where a student (learned model) with state space dimension d is trained on sequences of length k generated by a teacher with state space dimension d̂. Our interest lies in the overparameterized regime, where d is greater than both k and d̂, meaning that the student's state space dimension is large enough for it to fully agree with the teacher on sequences of length k, while potentially disagreeing with it on longer sequences. As a necessary assumption on initialization, we follow prior work and focus on a certain balancedness condition, which is known (see the experiments in Cohen-Karlik et al. (2022), as well as our theoretical analysis) to capture the near-zero initialization commonly employed in practice. Our main theoretical result states that GD originating from a balanced initialization leads the student to extrapolate, irrespective of how large its state space dimension is. Key to the result is a surprising connection to a moment matching theorem from Cohen & Yeredor (2011), whose proof relies on ideas from compressed sensing (Elad, 2010; Eldar & Kutyniok, 2012) and neighborly polytopes (Gale, 1963). This connection may be of independent interest, and in particular may prove useful in deriving other results concerning the implicit properties of GD. We corroborate our theory with experiments, which demonstrate extrapolation via learning low-dimensional state spaces both in the analyzed setting and in settings involving non-linear RNNs.
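To make the teacher-student setting concrete, the following numerical sketch (not from the paper; the dimensions, eigenvalues, and the construction of the student are illustrative assumptions) uses the fact that a symmetric LDS diagonalizes, so its impulse response is a sum of geometric sequences, y_t = Σ_i w_i λ_i^t. An overparameterized student with more modes than the teacher can match the teacher's first k impulse-response terms, i.e., generalize for length-k sequences, while disagreeing on later terms, i.e., fail to extrapolate:

```python
# Illustrative sketch: an overparameterized student LDS that agrees with a
# low-dimensional teacher on the first k impulse-response terms (the "moments"
# seen in training) need not agree beyond them.
import numpy as np

rng = np.random.default_rng(0)

k = 4                                   # sequence length seen in training
teacher_lam = np.array([0.9, -0.5])     # teacher eigenvalues (dimension 2)
teacher_w = np.array([1.0, 1.0])        # teacher mode weights

def impulse(lam, w, T):
    """Impulse response y_t = sum_i w_i * lam_i^t for t = 0, ..., T-1."""
    t = np.arange(T)
    return (w[:, None] * lam[:, None] ** t).sum(axis=0)

# Overparameterized student: 6 fixed random eigenvalues; pick weights solving
# the underdetermined k x 6 Vandermonde system that matches the first k terms.
student_lam = rng.uniform(-1, 1, size=6)
V = student_lam[None, :] ** np.arange(k)[:, None]
student_w, *_ = np.linalg.lstsq(V, impulse(teacher_lam, teacher_w, k), rcond=None)

T = 10
y_teacher = impulse(teacher_lam, teacher_w, T)
y_student = impulse(student_lam, student_w, T)
print(np.abs(y_teacher[:k] - y_student[:k]).max())  # ~0: agrees up to length k
print(np.abs(y_teacher[k:] - y_student[k:]).max())  # generally nonzero beyond
```

The claim analyzed in the paper is that GD from balanced near-zero initialization avoids such spurious solutions, effectively selecting weights concentrated on a low-dimensional state space rather than an arbitrary exact fit like the minimum-norm one above.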
The implicit extrapolation of GD is an emerging and exciting area of inquiry. Our results suggest that, contrary to prior belief, short-term memory is not enough to explain it. We hope the techniques developed in this paper will contribute to a further understanding of this phenomenon.

2. RELATED WORK

The study of linear RNNs, or LDS, has a rich history dating back to at least the early works of Kalman (Kalman, 1960; 1963). An extensively studied question relevant to extrapolation is that of system identification, which explores when the parameters of a teacher LDS can be recovered (see Ljung (1999)). Another related topic concerns finding compact realizations of systems, i.e. realizations of the same input-output mapping as a given LDS with a lower state space dimension (see Antoulas (2005)). Despite the relation, our focus is fundamentally different from the above: we ask what happens when one learns an LDS using GD. Since GD is not explicitly designed to find a low-dimensional state space, it is not clear that applying GD to an overparameterized student amounts to system identification through a compact realization. The fact that it does relates to the implicit properties of GD and, to our knowledge, has not been investigated in the classic LDS literature.

The implicit generalization of GD in training RNNs has been a subject of theoretical study for at least several years (see, e.g., Hardt et al. (2016); Allen-Zhu et al. (2019); Lim et al. (2021)). In contrast, works analyzing the implicit extrapolation of GD have surfaced only recently, specifically Emami et al. (2021) and Cohen-Karlik et al. (2022). Emami et al. (2021) analyzes linear RNNs in the infinite width regime, suggesting that in this case GD is implicitly biased towards impulse responses corresponding to short-term memory. Cohen-Karlik et al. (2022) studies finite-width linear RNNs (as we do), showing that when the teacher is memoryless (has state space dimension zero), GD emanating from a balanced initialization successfully extrapolates. Our work tackles an arguably more realistic and challenging setting: we analyze the regime in which the teacher has memory. Our results suggest that the implicit extrapolation of GD does not originate from a bias towards short-term memory, but rather from a tendency to learn low-dimensional state spaces. We note that there have been works studying extrapolation in the context of non-recurrent NNs, e.g. Xu et al. (2020).

There is no formal contradiction between our results and those of Emami et al. (2021) and Cohen-Karlik et al. (2022). These works make restrictive assumptions (namely, the former assumes that the teacher is stable and its impulse response decays exponentially fast, and the latter assumes that the teacher is memoryless) under which implicit extrapolation via learning low-dimensional state spaces leads to solutions with short-term memory. Our work, on the other hand, is not limited by these assumptions, and we show that in cases where they are violated, learning yields solutions with low-dimensional state spaces that do not exhibit short-term memory.
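The distinction between a low-dimensional state space and short-term memory can be seen in a toy computation (illustrative, not from the paper; the eigenvalues are arbitrary choices): a one-dimensional LDS, as low-dimensional as possible, retains long-term memory whenever its transition eigenvalue is close to 1, since its impulse response λ^t decays slowly.

```python
# Illustrative sketch: low state space dimension does not imply short-term
# memory. Both systems below are one-dimensional; only the eigenvalue differs.
import numpy as np

t = np.arange(200)
slow = 0.99 ** t   # eigenvalue near 1: impulse response decays slowly (long-term memory)
fast = 0.30 ** t   # small eigenvalue: impulse response vanishes quickly (short-term memory)

print(slow[100])   # roughly 0.37 after 100 steps
print(fast[100])   # vanishingly small after 100 steps
```

This is precisely why a bias towards low-dimensional state spaces, as argued in this work, is not the same as a bias towards short-term memory.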

