SEQUENCE-LEVEL FEATURES: HOW GRU AND LSTM CELLS CAPTURE N-GRAMS

Abstract

Modern recurrent neural networks (RNNs) such as Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) have demonstrated impressive results on tasks involving sequential data. Despite continuous efforts to interpret their behavior, the exact mechanism underlying their success in capturing sequence-level information is not yet thoroughly understood. In this work, we present a study of the essential features captured by GRU/LSTM cells, obtained by mathematically expanding and unrolling their hidden states. Based on the expanded and unrolled hidden states, we find that the gating mechanism introduces a type of sequence-level representation, which enables the cells to encode sequence-level features alongside token-level features. Specifically, we show that the cells contain sequence-level features similar to those of N-grams. Building on this finding, we further observe that replacing the hidden states of the standard cells with N-gram representations does not necessarily degrade performance on sentiment analysis and language modeling tasks, indicating that such features may play a significant role in GRU/LSTM cells.

1. INTRODUCTION

Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Chung et al., 2014) are widely used and studied for tasks that involve sequential data. They are generally believed to be capable of capturing long-range dependencies while alleviating vanishing- and exploding-gradient issues (Hochreiter & Schmidhuber, 1997; Karpathy et al., 2015; Sutskever et al., 2014). While such models have empirically proven successful across a range of tasks, certain fundamental questions, such as "what essential features are GRU or LSTM cells able to capture?", have not yet been fully addressed. Lacking answers to them may limit our ability to design better architectures. One obstacle can be attributed to the non-linear activations used in the cells, which prevent us from obtaining explicit closed-form expressions for the hidden states. A possible solution is to expand the non-linear functions using the Taylor series (Arfken & Mullin, 1985) and represent each hidden state with explicit input terms. Each hidden state can then be viewed as a combination of constituent terms capturing features of different levels of complexity. However, a prohibitively large number of polynomial terms is involved, and they can be difficult to manage; it is nonetheless possible that certain terms are more significant than others. Through a series of mathematical transformations, we found sequence-level representations, in the form of matrix-vector multiplications, among the expanded and unrolled hidden states of the GRU/LSTM cell. Such representations can encode sequence-level features that are, in theory, sensitive to the order of tokens and distinct from both the token-level features of the individual tokens and the sequence-level features of sub-sequences, making them able to represent N-grams. We assessed the significance of such sequence-level representations on sentiment analysis and language modeling tasks.
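The Taylor-expansion idea above can be illustrated on a vanilla RNN: the first-order expansion tanh(u) ≈ u around 0 linearizes the recurrence h_t = tanh(W_i x_t + W_h h_{t-1}), so the unrolled hidden state becomes an explicit sum of per-token terms, h_T ≈ Σ_k W_h^(T-1-k) W_i x_k. The sketch below (illustrative only; small weights are chosen so pre-activations stay near zero and the approximation holds) compares the exact recurrence against this unrolled form:

```python
import numpy as np

rng = np.random.default_rng(1)
d, dx, T = 4, 3, 5
# Small weights keep pre-activations near 0, where tanh(u) ~ u.
W_i = rng.standard_normal((d, dx)) * 0.05
W_h = rng.standard_normal((d, d)) * 0.05
xs = rng.standard_normal((T, dx))

# Exact recurrence: h_t = tanh(W_i x_t + W_h h_{t-1}), h_0 = 0.
h = np.zeros(d)
for x in xs:
    h = np.tanh(W_i @ x + W_h @ h)

# First-order unrolling: each input token appears as an explicit term.
h_lin = sum(np.linalg.matrix_power(W_h, T - 1 - k) @ (W_i @ xs[k])
            for k in range(T))

gap = np.max(np.abs(h - h_lin))  # small when activations stay near zero
```

Higher-order terms of the expansion contribute the polynomial (cross-token) terms discussed in the text, which is where the sequence-level features arise.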
We observed that the sequence-level representations derived from a GRU or LSTM cell reflect the desired properties on sentiment analysis tasks. Furthermore, on both the sentiment analysis and language modeling tasks, we directly replaced the GRU or LSTM cell with the corresponding sequence-level representations (along with token-level representations) during training, and found that such models behaved similarly to the standard GRU- or LSTM-based models. This indicates that sequence-level features may be significant for GRU and LSTM cells.

2. RELATED WORK

There have been plenty of prior works aiming to explain the behavior of RNNs and their variants. Early efforts focused on exploring the empirical behavior of recurrent neural networks (RNNs). Li et al. (2015) proposed a visualization approach to analyze intermediate representations of LSTM-based models, in which certain interesting patterns could be observed; however, it may not extend easily to models with high-dimensional representations. Greff et al. (2016) explored the performance of LSTM variants on representative tasks such as speech recognition and handwriting recognition, and argued that none of the proposed variants could significantly improve upon the standard LSTM architecture. Karpathy et al. (2015) studied the existence of interpretable cells that capture long-range dependencies such as line lengths, quotes, and brackets. However, those works did not address the internal mechanism of GRUs or LSTMs. Krause et al. (2017) found that creating richer interactions between contexts and inputs on top of standard LSTMs could yield improvements. Their efforts pointed out the significance of rich interactions between inputs and contexts for LSTMs, but did not study what features such interactions could produce to account for good performance. Arras et al. (2017) applied an extended Layer-wise Relevance Propagation (LRP) technique to a bidirectional LSTM for sentiment analysis, producing reliable explanations of which words are responsible for attributing sentiment in individual texts. Murdoch et al. (2018) leveraged contextual decomposition to analyze term interactions in LSTMs, producing importance scores for words, phrases, and word interactions. An RNN unrolling technique was proposed by Sherstinsky (2018) based on signal-processing concepts, transforming the RNN into the "Vanilla LSTM" network through a series of logical arguments, and Kanai et al. (2017) discussed conditions that prevent gradient explosions by examining the dynamics of GRUs. Merrill et al. (2020) examined the properties of saturated RNNs and linked their update behaviors to weighted finite-state machines. These ideas inspired us to explore the internal behavior of LSTM and GRU cells further. In this work, we sought to explore and study such significant underlying features.

3. MODEL DEFINITIONS

Vanilla RNN The representation of a vanilla RNN cell can be written as:

h_t = tanh(W_i x_t + W_h h_{t-1}),

where h_t ∈ R^d and x_t ∈ R^{d_x} are the hidden state and input at time step t respectively, h_{t-1} is the hidden state of the layer at time (t-1) or the initial hidden state, and W_i and W_h are weight matrices. Bias is suppressed here as well.

GRU The representation of a GRU cell can be written as¹:

r_t = σ(W_ir x_t + W_hr h_{t-1}),
z_t = σ(W_iz x_t + W_hz h_{t-1}),
n_t = tanh(W_in x_t + r_t ⊙ (W_hn h_{t-1})),
h_t = (1 - z_t) ⊙ n_t + z_t ⊙ h_{t-1},

where h_t ∈ R^d and x_t ∈ R^{d_x} are the hidden state and input at time step t respectively, h_{t-1} is the hidden state of the layer at time (t-1) or the initial hidden state, and r_t, z_t, n_t ∈ R^d are the reset, update, and new gates respectively. W refers to a weight matrix, σ is the element-wise sigmoid function, and ⊙ is the element-wise Hadamard product.

LSTM The representation of an LSTM cell can be written as:

i_t = σ(W_ii x_t + W_hi h_{t-1}),
f_t = σ(W_if x_t + W_hf h_{t-1}),
o_t = σ(W_io x_t + W_ho h_{t-1}),
c̃_t = tanh(W_ic x_t + W_hc h_{t-1}),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t),

¹ For brevity, we suppressed the bias for both GRU and LSTM cells here.
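As a sanity check on these definitions, the GRU and LSTM updates can be sketched in NumPy. This is an illustrative implementation only (the names gru_cell, lstm_cell, and the weight-dictionary keys are our own); biases are omitted, matching the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W):
    """One GRU step; W maps gate names to weight matrices (biases omitted)."""
    r = sigmoid(W["ir"] @ x_t + W["hr"] @ h_prev)        # reset gate r_t
    z = sigmoid(W["iz"] @ x_t + W["hz"] @ h_prev)        # update gate z_t
    n = np.tanh(W["in"] @ x_t + r * (W["hn"] @ h_prev))  # new gate n_t
    return (1.0 - z) * n + z * h_prev                    # h_t

def lstm_cell(x_t, h_prev, c_prev, W):
    """One LSTM step; returns (h_t, c_t)."""
    i = sigmoid(W["ii"] @ x_t + W["hi"] @ h_prev)        # input gate i_t
    f = sigmoid(W["if"] @ x_t + W["hf"] @ h_prev)        # forget gate f_t
    o = sigmoid(W["io"] @ x_t + W["ho"] @ h_prev)        # output gate o_t
    g = np.tanh(W["ic"] @ x_t + W["hc"] @ h_prev)        # candidate cell c̃_t
    c = f * c_prev + i * g                               # c_t
    return o * np.tanh(c), c                             # h_t, c_t

# Shape check with d = 4 hidden units and d_x = 3 input features.
rng = np.random.default_rng(0)
d, dx = 4, 3
W_gru = {k: rng.standard_normal((d, dx)) for k in ("ir", "iz", "in")}
W_gru.update({k: rng.standard_normal((d, d)) for k in ("hr", "hz", "hn")})
W_lstm = {k: rng.standard_normal((d, dx)) for k in ("ii", "if", "io", "ic")}
W_lstm.update({k: rng.standard_normal((d, d)) for k in ("hi", "hf", "ho", "hc")})

x_t = rng.standard_normal(dx)
h_gru = gru_cell(x_t, np.zeros(d), W_gru)
h_lstm, c_lstm = lstm_cell(x_t, np.zeros(d), np.zeros(d), W_lstm)
```

Note that the elementwise `*` plays the role of the Hadamard product ⊙, and that h_t of the LSTM is bounded in (-1, 1) since it is a gated tanh of the cell state.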

