LEARNING LOW DIMENSIONAL STATE SPACES WITH OVERPARAMETERIZED RECURRENT NEURAL NETS

Abstract

Overparameterization in deep learning typically refers to settings where a trained neural network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of Recurrent Neural Networks (RNNs), there exists an additional layer of overparameterization, in the sense that a model may exhibit many solutions that generalize well for sequence lengths seen in training, some of which extrapolate to longer sequences, while others do not. Numerous works have studied the tendency of Gradient Descent (GD) to fit overparameterized NNs with solutions that generalize well. On the other hand, its tendency to fit overparameterized RNNs with solutions that extrapolate has been discovered only recently and is far less understood. In this paper, we analyze the extrapolation properties of GD when applied to overparameterized linear RNNs. In contrast to recent arguments suggesting an implicit bias towards short-term memory, we provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory. Our result relies on a dynamical characterization which shows that GD (with small step size and near-zero initialization) strives to maintain a certain form of balancedness, as well as on tools developed in the context of the moment problem from statistics (recovery of a probability distribution from its moments). Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.

1. INTRODUCTION

Neural Networks (NNs) are often overparameterized, in the sense that their representational capacity far exceeds what is necessary for fitting training data. Surprisingly, training overparameterized NNs via (variants of) Gradient Descent (GD) tends to produce solutions that generalize well, despite the existence of many solutions that do not. This implicit generalization phenomenon has attracted considerable scientific interest, resulting in various theoretical explanations (see, e.g., Woodworth et al. (2020); Yun et al. (2020); Zhang et al. (2017); Li et al. (2020); Ji & Telgarsky (2018); Lyu & Li (2019)). Recent studies have surfaced a new form of implicit bias that arises in Recurrent Neural Networks (RNNs) and their variants (e.g., Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (Chung et al., 2014)). For such models, the length of sequences in training is often shorter than in testing, and it is not clear to what extent a learned solution will be able to extrapolate beyond the sequence lengths seen in training. In the overparameterized regime, where the representational capacity of the learned model exceeds what is necessary for fitting short sequences, there may exist solutions that generalize but do not extrapolate, meaning that their accuracy is high over short sequences but arbitrarily poor over long ones (see Cohen-Karlik et al. (2022)). In practice, however, when training RNNs using GD, accurate extrapolation is often observed. We refer to this phenomenon as the implicit extrapolation of GD. As opposed to the implicit generalization of GD, little is formally known about its implicit extrapolation. Existing theoretical analyses of the latter focus on linear RNNs, also known as Linear Dynamical Systems (LDS), and either treat infinitely wide models (Emami et al., 2021) or models of finite width that learn from a memoryless teacher (Cohen-Karlik et al., 2022). In these regimes,
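The distinction between generalizing and extrapolating solutions can be made concrete with a minimal sketch (not from the paper; the teacher parameters and the nilpotent student construction below are illustrative assumptions). A teacher LDS with a one-dimensional state has impulse response 0.5^t. A low-dimensional student matching that state space reproduces the response for every t, whereas an overparameterized student with a nilpotent state matrix can fit the first three impulse-response coefficients exactly (and hence any training sequence of length at most 3) while its response vanishes afterwards: it generalizes over short sequences but fails to extrapolate.

```python
import numpy as np

# Teacher: 1-dimensional LDS  h_{t+1} = a*h_t + b*x_t,  y_t = c*h_t,
# with a = 0.5, b = c = 1; its impulse response is c * a^t * b = 0.5^t.
def teacher_ir(t):
    return 0.5 ** t

# Overparameterized student (illustrative construction): a 3-dimensional
# system (A, B, C) with a nilpotent state matrix A.  Its impulse response
# C @ A^t @ B equals 0.5^t for t = 0, 1, 2, and is identically zero for
# t >= 3 -- accurate on short sequences, arbitrarily poor on long ones.
A = np.diag([1.0, 1.0], k=1)       # 3x3 shift matrix (nilpotent: A^3 = 0)
B = np.array([1.0, 0.5, 0.25])     # stores the first three coefficients
C = np.array([1.0, 0.0, 0.0])      # reads out the first state coordinate

def student_ir(t):
    return C @ np.linalg.matrix_power(A, t) @ B

for t in range(6):
    print(f"t={t}: teacher={teacher_ir(t):.4f}  student={student_ir(t):.4f}")
```

Both responses agree for t = 0, 1, 2; from t = 3 onward the student's output is zero while the teacher's remains positive, which is exactly the "generalizes but does not extrapolate" failure mode described above.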

