SEQUENCE-LEVEL FEATURES: HOW GRU AND LSTM CELLS CAPTURE N-GRAMS

Abstract

Modern recurrent neural networks (RNNs) such as Gated Recurrent Units (GRU) and Long Short-term Memory (LSTM) have demonstrated impressive results on tasks involving sequential data. Despite continuous efforts to interpret their behavior, the exact mechanism underlying their success in capturing sequence-level information has not been thoroughly understood. In this work, we study the essential features captured by GRU/LSTM cells by mathematically expanding and unrolling the hidden states. Based on the expanded and unrolled hidden states, we find a type of sequence-level representation brought in by the gating mechanism, which enables the cells to encode sequence-level features along with token-level features. Specifically, we show that the cells contain sequence-level features similar to those of N-grams. Based on this finding, we also find that replacing the hidden states of the standard cells with N-gram representations does not necessarily degrade performance on sentiment analysis and language modeling tasks, indicating that such features may play a significant role in GRU/LSTM cells.

1. INTRODUCTION

Long Short-term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Chung et al., 2014) are widely used and investigated for tasks that involve sequential data. They are generally believed to be capable of capturing long-range dependencies while alleviating gradient vanishing or explosion issues (Hochreiter & Schmidhuber, 1997; Karpathy et al., 2015; Sutskever et al., 2014). While such models have been shown empirically to be successful across a range of tasks, certain fundamental questions such as "what essential features are GRU or LSTM cells able to capture?" have not yet been fully addressed. Lacking answers to these questions may limit our ability to design better architectures.

One obstacle is that the non-linear activations used in the cells prevent us from obtaining explicit closed-form expressions for the hidden states. A possible solution is to expand the non-linear functions using the Taylor series (Arfken & Mullin, 1985) and represent the hidden states with explicit input terms. Each hidden state can then be viewed as a combination of constituent terms that capture features of different levels of complexity. However, the expansion involves a prohibitively large number of polynomial terms, which are difficult to manage, although some terms may be more significant than others. Through a series of mathematical transformations, we found sequence-level representations in the form of matrix-vector multiplications among the expanded and unrolled hidden states of the GRU/LSTM cell. Such representations can encode sequence-level features that are, in theory, sensitive to the order of tokens and distinct from both the token-level features of the constituent tokens and the sequence-level features of the sub-sequences, which makes them able to represent N-grams. We assessed the significance of such sequence-level representations on sentiment analysis and language modeling tasks.
We observed that the sequence-level representations derived from a GRU or LSTM cell reflected the desired properties on sentiment analysis tasks. Furthermore, in both the sentiment analysis and language modeling tasks, we replaced the GRU or LSTM cell directly during training with the corresponding sequence-level representations (along with token-level representations), and found that such models behaved similarly to the standard GRU- or LSTM-based models. This indicates that the sequence-level features may be significant for GRU or LSTM cells.
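
To make the two ideas above concrete, the following is a minimal sketch, not the derivation used in this work. It first shows low-order Taylor (Maclaurin) expansions of the sigmoid and tanh activations. It then shows why a feature built from chained matrix-vector multiplications is sensitive to token order, whereas a sum of token-level features is not. The token-dependent matrices `M` and vectors `u` are hypothetical random stand-ins chosen for illustration, and the function names are assumptions of this sketch; they are not the expressions derived from the GRU/LSTM cell.

```python
import numpy as np


def sigmoid_taylor(x):
    # Third-order Maclaurin expansion: sigma(x) ~ 1/2 + x/4 - x^3/48.
    return 0.5 + x / 4.0 - x**3 / 48.0


def tanh_taylor(x):
    # Third-order Maclaurin expansion: tanh(x) ~ x - x^3/3.
    return x - x**3 / 3.0


# Hypothetical token-dependent parameters (random, for illustration only).
rng = np.random.default_rng(0)
d = 4
vocab = ["not", "good", "very"]
M = {w: rng.normal(size=(d, d)) for w in vocab}  # one matrix per token
u = {w: rng.normal(size=d) for w in vocab}       # one vector per token


def sequence_feature(tokens):
    # Start from the first token's vector, then apply each later token's
    # matrix in order. Matrix products do not commute in general, so the
    # result depends on the order of the tokens.
    v = u[tokens[0]]
    for w in tokens[1:]:
        v = M[w] @ v
    return v


def token_level_sum(tokens):
    # Order-insensitive baseline: a plain sum of per-token vectors.
    return sum(u[w] for w in tokens)
```

Under these assumptions, `sequence_feature(["not", "good"])` and `sequence_feature(["good", "not"])` generally differ, while `token_level_sum` returns the same vector for both orderings. The expansions are accurate only near zero, which is one reason higher-order terms multiply quickly in a full expansion.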

