THE RECURRENT NEURAL TANGENT KERNEL

Abstract

The study of deep neural networks (DNNs) in the infinite-width limit, via the so-called neural tangent kernel (NTK) approach, has provided new insights into the dynamics of learning, generalization, and the impact of initialization. One key DNN architecture remains to be kernelized, namely, the recurrent neural network (RNN). In this paper we introduce and study the Recurrent Neural Tangent Kernel (RNTK), which provides new insights into the behavior of overparametrized RNNs. A key property of the RNTK that should greatly benefit practitioners is its ability to compare inputs of different lengths. To this end, we characterize how the RNTK weights different time steps to form its output under different initialization parameters and nonlinearity choices. Experiments on synthetic data and 56 real-world data sets demonstrate that the RNTK offers significant performance gains over other kernels, including standard NTKs, across a wide array of tasks.

1. INTRODUCTION

The overparameterization of modern deep neural networks (DNNs) has resulted in not only remarkably good generalization performance on unseen data (Novak et al., 2018; Neyshabur et al., 2019; Belkin et al., 2019) but also guarantees that gradient descent learning can find the global minimum of their highly nonconvex loss functions (Du et al., 2019b; Allen-Zhu et al., 2019b;a; Zou et al., 2018; Arora et al., 2019b). From these successes, a natural question arises: What happens when we take overparameterization to the limit by allowing the width of a DNN's hidden layers to go to infinity? Surprisingly, the analysis of such an (impractical) DNN becomes analytically tractable. Indeed, recent work has shown that the training dynamics of (infinite-width) DNNs under gradient flow are captured by a constant kernel called the Neural Tangent Kernel (NTK) that evolves according to a linear ordinary differential equation (ODE) (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019a). Every DNN architecture and parameter initialization produces a distinct NTK. The original NTK was derived from the Multilayer Perceptron (MLP) (Jacot et al., 2018) and was soon followed by kernels derived from Convolutional Neural Networks (CNTK) (Arora et al., 2019a; Yang, 2019a), Residual DNNs (Huang et al., 2020), and Graph Convolutional Neural Networks (GNTK) (Du et al., 2019a). Yang (2020a) provides a general strategy for obtaining the NTK of any architecture. In this paper, we extend the NTK concept to the important class of overparametrized Recurrent Neural Networks (RNNs), a fundamental DNN architecture for processing sequential data. We show that an RNN in its infinite-width limit converges to a kernel that we dub the Recurrent Neural Tangent Kernel (RNTK). The RNTK provides high performance for various machine learning tasks, and an analysis of the properties of the kernel provides useful insights into the behavior of RNNs in the overparametrized regime.
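To make the NTK concrete: for a network f with parameters θ, the (empirical, finite-width) tangent kernel is Θ(x, x') = ⟨∇_θ f(x; θ), ∇_θ f(x'; θ)⟩; the infinite-width results above say this quantity converges to a deterministic kernel that stays constant during gradient-flow training. The following sketch (not from the paper; network size, nonlinearity, and scaling are illustrative choices) computes this inner product by hand for a one-hidden-layer MLP with ReLU activation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer network f(x) = v . phi(W x) with ReLU phi,
# using 1/sqrt(width) scaling as in the NTK parameterization.
n, d = 512, 3  # hidden width, input dimension (illustrative sizes)
W = rng.normal(size=(n, d)) / np.sqrt(d)
v = rng.normal(size=n) / np.sqrt(n)

def grads(x):
    """Gradient of f(x) with respect to all parameters (W and v), flattened."""
    pre = W @ x                       # pre-activations, shape (n,)
    act = np.maximum(pre, 0.0)        # ReLU
    dv = act                          # df/dv = phi(W x)
    dW = np.outer(v * (pre > 0), x)   # df/dW via the chain rule
    return np.concatenate([dW.ravel(), dv])

def empirical_ntk(x1, x2):
    """Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>."""
    return grads(x1) @ grads(x2)

x, xp = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x, xp))
```

As the width n grows, repeated draws of W and v yield empirical kernel values that concentrate around the deterministic infinite-width NTK.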
In particular, we derive and study the RNTK to answer the following theoretical questions: Q: Can the RNTK extract long-term dependencies between two data sequences? RNNs are known to underperform at learning long-term dependencies due to vanishing or exploding gradients (Bengio et al., 1994). Attempted ameliorations have included orthogonal weights (Arjovsky et al., 2016; Jing et al., 2017; Henaff et al., 2016) and gating such as in Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) RNNs. We demonstrate that the RNTK can detect long-term dependencies with proper initialization of the hyperparameters, and moreover, we show how the dependencies are extracted through time via different hyperparameter choices.
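The ability to compare sequences of different lengths follows from weight sharing: an RNN applies the same parameters at every time step, so the gradient of its output with respect to those parameters has a fixed shape regardless of how many steps are unrolled, and the tangent-kernel inner product is always defined. The sketch below (an illustrative finite-width construction, not the paper's closed-form RNTK; sizes and the ReLU nonlinearity are assumptions) computes this empirical kernel for a vanilla RNN via backpropagation through time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Vanilla RNN with shared weights: h_t = phi(W h_{t-1} + U x_t + b),
# scalar output f = v . h_T read from the final hidden state.
n, m = 64, 2  # hidden width, input dimension (illustrative sizes)
W = rng.normal(size=(n, n)) / np.sqrt(n)
U = rng.normal(size=(n, m)) / np.sqrt(m)
b = rng.normal(size=n)
v = rng.normal(size=n) / np.sqrt(n)

def rnn_grads(xs):
    """Gradient of f w.r.t. (W, U, b, v) for one input sequence, via BPTT."""
    T = len(xs)
    hs, pres = [np.zeros(n)], []        # hs[0] is the zero initial state
    for t in range(T):                  # forward pass
        pre = W @ hs[-1] + U @ xs[t] + b
        pres.append(pre)
        hs.append(np.maximum(pre, 0.0))  # ReLU
    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dh = v.copy()                       # df/dh_T
    for t in reversed(range(T)):        # backward pass through time
        g = dh * (pres[t] > 0)          # df/dpre_t (ReLU derivative)
        dW += np.outer(g, hs[t])        # hs[t] is h_{t-1}
        dU += np.outer(g, xs[t])
        db += g
        dh = W.T @ g
    return np.concatenate([dW.ravel(), dU.ravel(), db, hs[-1]])

def empirical_rntk(xs1, xs2):
    """Gradient inner product; defined even when len(xs1) != len(xs2)."""
    return rnn_grads(xs1) @ rnn_grads(xs2)

seq_a = rng.normal(size=(5, m))  # length-5 sequence
seq_b = rng.normal(size=(8, m))  # length-8 sequence
print(empirical_rntk(seq_a, seq_b))
```

Because both gradient vectors live in the same parameter space, the kernel compares the length-5 and length-8 sequences directly, with no padding or truncation.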

