LIPSCHITZ RECURRENT NEURAL NETWORKS

Abstract

Viewing recurrent neural networks (RNNs) as continuous-time dynamical systems, we propose a recurrent unit that describes the hidden state's evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity. This particular functional form facilitates stability analysis of the long-term behavior of the recurrent unit using tools from nonlinear systems theory. In turn, this enables architectural design decisions before experimentation. Sufficient conditions for global stability of the recurrent unit are obtained, motivating a novel scheme for constructing hidden-to-hidden matrices. Our experiments demonstrate that the Lipschitz RNN can outperform existing recurrent units on a range of benchmark tasks, including computer vision, language modeling and speech prediction tasks. Finally, through Hessian-based analysis we demonstrate that our Lipschitz recurrent unit is more robust with respect to input and parameter perturbations as compared to other continuous-time RNNs.

1. INTRODUCTION

Many interesting problems exhibit temporal structures that can be modeled with recurrent neural networks (RNNs), including problems in robotics, system identification, natural language processing, and machine learning control. In contrast to feed-forward neural networks, RNNs consist of one or more recurrent units that are designed to have dynamical (recurrent) properties, thereby enabling them to acquire some form of internal memory. This equips RNNs with the ability to discover and exploit spatiotemporal patterns, such as symmetries and periodic structures (Hinton, 1986). However, RNNs are known to have stability issues and are notoriously difficult to train, most notably due to the vanishing and exploding gradients problem (Bengio et al., 1994; Pascanu et al., 2013).
Several recurrent models deal with the vanishing and exploding gradients issue by restricting the hidden-to-hidden weight matrix to be an element of the orthogonal group (Arjovsky et al., 2016; Wisdom et al., 2016; Mhammedi et al., 2017; Vorontsov et al., 2017; Lezcano-Casado & Martinez-Rubio, 2019). While such an approach is advantageous in maintaining long-range memory, it limits the expressivity of the model. To address this issue, recent work has suggested constructing hidden-to-hidden weights that have unit-norm eigenvalues and can be nonnormal (Kerg et al., 2019). Another approach for resolving the exploding/vanishing gradient problem has recently been proposed by Kag et al. (2020), who formulate the recurrent units as a differential equation and update the hidden states based on the difference between predicted and previous states.
In this work, we address these challenges by viewing RNNs as dynamical systems whose temporal evolution is governed by an abstract system of differential equations with an external input. The data are formulated in continuous time, where the external input is defined by the function $x = x(t) \in \mathbb{R}^p$, and the target signal is defined as $y = y(t) \in \mathbb{R}^d$.
Based on insights from dynamical systems theory, we propose a continuous-time Lipschitz recurrent neural network with the functional form
$$\dot{h} = A_{\beta_A,\gamma_A} h + \tanh(W_{\beta_W,\gamma_W} h + U x + b), \tag{1a}$$
$$y = D h, \tag{1b}$$
with
$$A_{\beta_A,\gamma_A} = (1 - \beta_A)(M_A + M_A^T) + \beta_A (M_A - M_A^T) - \gamma_A I,$$
$$W_{\beta_W,\gamma_W} = (1 - \beta_W)(M_W + M_W^T) + \beta_W (M_W - M_W^T) - \gamma_W I,$$
where $\beta_A, \beta_W \in [0, 1]$ and $\gamma_A, \gamma_W > 0$ are tunable parameters and $M_A, M_W \in \mathbb{R}^{N \times N}$ are trainable matrices. Here, $h = h(t) \in \mathbb{R}^N$ is a function of time $t$ that represents an internal (hidden) state, and $\dot{h} = \partial h(t) / \partial t$ is its time derivative. The hidden state represents the memory that the system has of its past. The function in Eq. (1a) is parameterized by the hidden-to-hidden weight matrices $A \in \mathbb{R}^{N \times N}$ and $W \in \mathbb{R}^{N \times N}$, the input-to-hidden encoder matrix $U \in \mathbb{R}^{N \times p}$, and an offset $b$. The function in Eq. (1b) is parameterized by the hidden-to-output decoder matrix $D \in \mathbb{R}^{d \times N}$. Nonlinearity is introduced via the 1-Lipschitz tanh activation function. While RNNs that are governed by differential equations with an additive structure have been studied before (Zhang et al., 2014), the specific formulation that we propose in Eq. (1) and our theoretical analysis are distinct.
Treating RNNs as dynamical systems enables studying the long-term behavior of the hidden state with tools from stability analysis. From this point of view, an unstable unit presents an exploding gradient problem, while a stable unit has well-behaved gradients over time (Miller & Hardt, 2019). However, a stable recurrent unit can suffer from vanishing gradients, leading to catastrophic forgetting (Hochreiter & Schmidhuber, 1997b). Thus, we opt for a stable model whose dynamics do not decay, or decay only slowly, over time. Importantly, stability is also a statement about the robustness of neural units with respect to input perturbations, i.e., stable models are less sensitive to small perturbations compared to unstable models. Recently, Chang et al. (2019) explored the stability of linearized RNNs and provided a local stability guarantee based on the Jacobian. In contrast, the particular structure of our unit (1) allows us to obtain guarantees of global exponential stability using control theoretical arguments. In turn, the sufficient conditions for global stability motivate a novel symmetric-skew decomposition based scheme for constructing hidden-to-hidden matrices. This scheme alleviates exploding and vanishing gradients, while remaining highly expressive.
In summary, the main contributions of this work are as follows:
• First, in Section 3, using control theoretical arguments in a direct Lyapunov approach, we provide sufficient conditions for global exponential stability of the Lipschitz RNN unit (Theorem 1). Global stability is advantageous over local stability results since it guarantees non-exploding gradients regardless of the state. In the special case where A is symmetric, we find that these conditions agree with those in classical theoretical analyses (Lemma 1).
• Next, in Section 4, drawing from our stability analysis, we propose a novel scheme based on the symmetric-skew decomposition for constructing hidden-to-hidden matrices. This scheme mitigates the vanishing and exploding gradients problem, while obtaining highly expressive hidden-to-hidden matrices.
• In Section 6, we show that our Lipschitz RNN can outperform state-of-the-art recurrent units on computer vision, language modeling and speech prediction tasks. Further, our results show that the higher-order explicit midpoint time integrator improves the predictive accuracy as compared to using the simpler one-step forward Euler scheme.
• Finally, in Section 7, we study our Lipschitz RNN through the lens of the Hessian and show that it is robust with respect to parameter perturbations; we also show that our model is more robust with respect to input perturbations, compared to other continuous-time RNNs.
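To make the functional form in Eq. (1) concrete, the following is a minimal NumPy sketch of the recurrent unit, not the authors' implementation. It builds the hidden-to-hidden matrices from the symmetric-skew formula and advances the hidden state with the two integrators mentioned above (forward Euler and explicit midpoint). The function names, the dimensions, the random initialization, the values of β, γ and the step size, and the choice to hold the input constant within a step are all assumptions made for illustration.

```python
import numpy as np


def sym_skew(M, beta, gamma):
    """Return (1 - beta)(M + M^T) + beta(M - M^T) - gamma * I."""
    n = M.shape[0]
    return (1.0 - beta) * (M + M.T) + beta * (M - M.T) - gamma * np.eye(n)


def rhs(h, x, A, W, U, b):
    """Right-hand side of Eq. (1a): A h + tanh(W h + U x + b)."""
    return A @ h + np.tanh(W @ h + U @ x + b)


def euler_step(h, x, dt, A, W, U, b):
    """One forward Euler update of the hidden state."""
    return h + dt * rhs(h, x, A, W, U, b)


def midpoint_step(h, x, dt, A, W, U, b):
    """One explicit midpoint update; x is held constant over the step (assumption)."""
    h_half = h + 0.5 * dt * rhs(h, x, A, W, U, b)
    return h + dt * rhs(h_half, x, A, W, U, b)


rng = np.random.default_rng(0)
N, p, d = 8, 3, 2  # hidden, input, and output sizes (illustrative)

# Stand-ins for the trainable matrices M_A, M_W and for U, b, D.
M_A = rng.normal(size=(N, N)) / np.sqrt(N)
M_W = rng.normal(size=(N, N)) / np.sqrt(N)
U = rng.normal(size=(N, p))
b = np.zeros(N)
D = rng.normal(size=(d, N))

# beta, gamma, and dt below are illustrative values, not tuned settings.
A = sym_skew(M_A, beta=0.75, gamma=0.1)
W = sym_skew(M_W, beta=0.75, gamma=0.1)

h = np.zeros(N)
xs = rng.normal(size=(20, p))  # a short random input sequence
for x in xs:
    h = midpoint_step(h, x, 0.1, A, W, U, b)

y = D @ h  # Eq. (1b): decode the hidden state
```

In this construction the skew term β(M − Mᵀ) cancels when the matrix is symmetrized, so the symmetric part of the result is (1 − β)(M + Mᵀ) − γI. For β = 1 the matrix is skew-symmetric minus γI, and all of its eigenvalues have real part −γ.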

2. RELATED WORK

The problem of vanishing and exploding gradients (and stability) has a storied history in the study of RNNs. Below, we summarize two particular approaches to the problem (constructing unitary/orthogonal RNNs and the dynamical systems viewpoint) that have gained significant attention.
Unitary and orthogonal RNNs. Unitary recurrent units have received attention recently, largely due to Arjovsky et al. (2016) showing that unitary hidden-to-hidden matrices alleviate the vanishing and exploding gradients problem. Several other unitary and orthogonal models have also been proposed (Wisdom et al., 2016; Mhammedi et al., 2017; Jing et al., 2017; Vorontsov et al., 2017; Jose et al., 2018). While these approaches stabilize the training process of RNNs considerably, they also

