SEQUENCE-LEVEL FEATURES: HOW GRU AND LSTM CELLS CAPTURE N-GRAMS

Abstract

Modern recurrent neural networks (RNNs) such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) have demonstrated impressive results on tasks involving sequential data. Despite continuous efforts to interpret their behaviors, the exact mechanism underlying their success in capturing sequence-level information has not been thoroughly understood. In this work, we present a study of the essential features captured by GRU/LSTM cells, obtained by mathematically expanding and unrolling the hidden states. Based on the expanded and unrolled hidden states, we find that the gating mechanism brings in a type of sequence-level representation, which enables the cells to encode sequence-level features along with token-level features. Specifically, we show that the cells contain sequence-level features similar to those of N-grams. Based on this finding, we further show that replacing the hidden states of the standard cells with such N-gram representations does not necessarily degrade performance on sentiment analysis and language modeling tasks, indicating that such features may play a significant role in GRU/LSTM cells.

1. INTRODUCTION

Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Chung et al., 2014) are widely used and investigated for tasks that involve sequential data. They are generally believed to be capable of capturing long-range dependencies while alleviating gradient vanishing or explosion issues (Hochreiter & Schmidhuber, 1997; Karpathy et al., 2015; Sutskever et al., 2014). While such models have been empirically shown to be successful across a range of tasks, certain fundamental questions, such as "what essential features are GRU or LSTM cells able to capture?", have not yet been fully addressed. Lacking answers to these questions may limit our ability to design better architectures. One obstacle can be attributed to the non-linear activations used in the cells, which prevent us from obtaining explicit closed-form expressions for hidden states. A possible solution is to expand the non-linear functions using the Taylor series (Arfken & Mullin, 1985) and represent hidden states with explicit input terms. In principle, each hidden state can then be viewed as a combination of constituent terms capturing features of different levels of complexity. However, a prohibitively large number of polynomial terms is involved and they can be difficult to manage; it is possible, though, that certain terms are more significant than others. Through a series of mathematical transformations, we find that there are sequence-level representations, in the form of matrix-vector multiplications, among the expanded and unrolled hidden states of GRU/LSTM cells. Such representations can encode sequence-level features that are theoretically sensitive to the order of tokens and that differ both from the token-level features of the individual tokens and from the sequence-level features of the sub-sequences, making them able to represent N-grams. We assess the significance of such sequence-level representations on sentiment analysis and language modeling tasks.
We observed that the sequence-level representations derived from a GRU or LSTM cell were able to reflect desired properties on sentiment analysis tasks. Furthermore, in both the sentiment analysis and language modeling tasks, we replaced the GRU or LSTM cell with corresponding sequence-level representations (along with token-level representations) directly during training, and found that such models behaved similarly to the standard GRU or LSTM based models. This indicated that the sequence-level features might be significant for GRU or LSTM cells.

2. RELATED WORK

There have been many prior works aiming to explain the behaviors of RNNs and their variants. Early efforts focused on exploring the empirical behaviors of recurrent neural networks (RNNs). Li et al. (2015) proposed a visualization approach to analyze intermediate representations of LSTM-based models, in which certain interesting patterns could be observed; however, it may not easily extend to models with high-dimensional representations. Greff et al. (2016) explored the performance of LSTM variants on representative tasks such as speech recognition and handwriting recognition, and argued that none of the proposed variants could significantly improve upon the standard LSTM architecture. Karpathy et al. (2015) studied the existence of interpretable cells that could capture long-range dependencies such as line lengths, quotes and brackets. However, those works did not involve the internal mechanism of GRUs or LSTMs. Melis et al. (2020) and Krause et al. (2017) found that creating richer interaction between contexts and inputs on top of standard LSTMs could result in improvements. Their efforts pointed out the significance of rich interactions between inputs and contexts for LSTMs, but did not study what features such interactions could produce to yield good performance. Arras et al. (2017) applied an extended Layer-wise Relevance Propagation (LRP) technique to a bidirectional LSTM for sentiment analysis and produced reliable explanations of which words are responsible for attributing sentiment in individual texts. Murdoch et al. (2018) leveraged contextual decomposition methods to analyze the interactions of terms in LSTMs, producing importance scores for words, phrases and word interactions. An RNN unrolling technique was proposed by Sherstinsky (2018) based on signal processing concepts, transforming the RNN into the "Vanilla LSTM" network through a series of logical arguments, and Kanai et al. (2017) discussed the conditions that could prevent gradient explosions by looking into the dynamics of GRUs. Merrill et al. (2020) examined the properties of saturated RNNs and linked their update behaviors to weighted finite-state machines. These ideas inspired further exploration of the internal behaviors of LSTM or GRU cells. In this work, we seek to explore and study such significant underlying features.

3. MODEL DEFINITIONS

Vanilla RNN. The representation of a vanilla RNN cell can be written as:

$h_t = \tanh(W_i x_t + W_h h_{t-1})$,

where $h_t \in \mathbb{R}^d$ and $x_t \in \mathbb{R}^{d_x}$ are the hidden state and input at time step $t$ respectively, and $h_{t-1}$ is the hidden state of the layer at time $(t-1)$ or the initial hidden state. $W_i$ and $W_h$ are weight matrices. Biases are suppressed here as well.

GRU. The representation of a GRU cell can be written as[foot_0]:

$r_t = \sigma(W_{ir} x_t + W_{hr} h_{t-1})$, $\quad z_t = \sigma(W_{iz} x_t + W_{hz} h_{t-1})$, $\quad n_t = \tanh(W_{in} x_t + r_t \odot (W_{hn} h_{t-1}))$, $\quad h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$,

where $h_t \in \mathbb{R}^d$ and $x_t \in \mathbb{R}^{d_x}$ are the hidden state and input at time step $t$ respectively, and $h_{t-1}$ is the hidden state of the layer at time $(t-1)$ or the initial hidden state. $r_t, z_t, n_t \in \mathbb{R}^d$ are the reset, update, and new gates respectively. Each $W$ refers to a weight matrix, $\sigma$ is the element-wise sigmoid function, and $\odot$ is the element-wise Hadamard product.

LSTM. The representation of an LSTM cell can be written as:

$i_t = \sigma(W_{ii} x_t + W_{hi} h_{t-1})$, $\quad f_t = \sigma(W_{if} x_t + W_{hf} h_{t-1})$, $\quad o_t = \sigma(W_{io} x_t + W_{ho} h_{t-1})$, $\quad \tilde{c}_t = \tanh(W_{ic} x_t + W_{hc} h_{t-1})$, $\quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, $\quad h_t = o_t \odot \tanh(c_t)$,

where $h_t \in \mathbb{R}^d$ and $x_t \in \mathbb{R}^{d_x}$ are the hidden state and input at time step $t$ respectively, $c_t \in \mathbb{R}^d$ is the memory cell, and $i_t, f_t, o_t \in \mathbb{R}^d$ are the input, forget, and output gates respectively.
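The cell definitions above can be sketched numerically. The following is a minimal NumPy sketch of the bias-free GRU and LSTM steps; the weight-dictionary keys and the small random initialization are our own conventions for illustration, not the paper's trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dx = 4, 3  # hidden and input sizes (small, for illustration)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One bias-free weight matrix per gate ("i*" maps inputs, "h*" maps hidden states).
W = {k: 0.1 * rng.standard_normal((d, dx if k[0] == "i" else d))
     for k in ["ir", "iz", "in", "hr", "hz", "hn",
               "ii", "if", "io", "ic", "hi", "hf", "ho", "hc"]}

def gru_cell(x_t, h_prev):
    """One bias-free GRU step, following the equations above."""
    r = sigmoid(W["ir"] @ x_t + W["hr"] @ h_prev)        # reset gate
    z = sigmoid(W["iz"] @ x_t + W["hz"] @ h_prev)        # update gate
    n = np.tanh(W["in"] @ x_t + r * (W["hn"] @ h_prev))  # new gate
    return (1.0 - z) * n + z * h_prev

def lstm_cell(x_t, h_prev, c_prev):
    """One bias-free LSTM step, returning (hidden state, memory cell)."""
    i = sigmoid(W["ii"] @ x_t + W["hi"] @ h_prev)        # input gate
    f = sigmoid(W["if"] @ x_t + W["hf"] @ h_prev)        # forget gate
    o = sigmoid(W["io"] @ x_t + W["ho"] @ h_prev)        # output gate
    c_tilde = np.tanh(W["ic"] @ x_t + W["hc"] @ h_prev)  # candidate cell
    c = f * c_prev + i * c_tilde
    return o * np.tanh(c), c
```

Note that with zero input and zero state, the GRU step returns zero, since the new gate is $\tanh(0) = 0$ and the update gate merely interpolates between two zeros.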

4. UNROLLING RNNS

Using the Taylor series, the activations $\tanh(x)$ and $\sigma(x)$ can be expanded (at 0) as:

$\tanh(x) = x + O(x^3) \quad (|x| < \tfrac{\pi}{2})$, $\qquad \sigma(x) = \tfrac{1}{2} + \tfrac{1}{4}x + O(x^3) \quad (|x| < \pi)$.

In this work, we do not seek to approximate the GRU or LSTM cells precisely, but to explore what features the cells could capture.
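These first-order expansions can be checked numerically; a quick sketch, where the tolerance $|x|^3$ reflects the $O(x^3)$ remainder (the actual remainders are roughly $x^3/3$ for tanh and $x^3/48$ for sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tanh(x) = x + O(x^3) for |x| < pi/2; sigma(x) = 1/2 + x/4 + O(x^3) for |x| < pi.
for x in [0.05, -0.1, 0.2]:
    assert abs(np.tanh(x) - x) <= abs(x) ** 3
    assert abs(sigmoid(x) - (0.5 + 0.25 * x)) <= abs(x) ** 3
```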

4.1. VANILLA RNN

We can expand the vanilla RNN hidden state using the Taylor series as: $h_t = x^n_t + W_h h_{t-1} + O((x^n_t + W_h h_{t-1})^3)$, where $x^n_t = W_i x_t$. Let us unroll it as:

$h_t = x^n_t + \sum_{i=1}^{t-1} W_h^{t-i} x^n_i + r'(x_1, x_2, ..., x_t)$,

where $r'$ is the unrolled representation produced by the higher-order terms. It can be seen that the vanilla RNN cell can capture the input information at each time step.
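The linearized unrolling above can be verified numerically; a sketch with small random weights and inputs (our choice, so that the cubic remainder is negligible):

```python
import numpy as np

rng = np.random.default_rng(0)
d, dx, T = 4, 3, 5
Wi = 0.05 * rng.standard_normal((d, dx))
Wh = 0.05 * rng.standard_normal((d, d))
xs = 0.1 * rng.standard_normal((T, dx))

# Exact vanilla RNN from a zero initial state.
h = np.zeros(d)
for x in xs:
    h = np.tanh(Wi @ x + Wh @ h)

# First-order unrolled form: h_T ~ x~_T + sum_i Wh^(T-i) x~_i, with x~_t = Wi x_t.
xn = xs @ Wi.T                      # x~_t for every step
approx = xn[-1].copy()
for i in range(T - 1):              # 0-based index i corresponds to time step i+1
    approx += np.linalg.matrix_power(Wh, T - 1 - i) @ xn[i]

assert np.allclose(h, approx, atol=1e-4)
```

In this small-activation regime the tanh remainder is tiny, so the linear unrolling tracks the exact recurrence closely.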

4.2. GRU

Let us write: $x^r_t = W_{ir} x_t$, $x^z_t = W_{iz} x_t$, $x^n_t = W_{in} x_t$, $h^r_{t-1} = W_{hr} h_{t-1}$, $h^z_{t-1} = W_{hz} h_{t-1}$, $h^n_{t-1} = W_{hn} h_{t-1}$. Plugging these substitutions into the GRU cell equations, we can expand the hidden state at time step $t$ and combine like terms with respect to the order of $h_{t-1}$:

$h_t = \underbrace{\tfrac{1}{2} x^n_t - \tfrac{1}{4} x^n_t \odot x^z_t}_{\text{zeroth-order}} + \underbrace{\tfrac{1}{2} h_{t-1} + \tfrac{1}{4} h^n_{t-1} + \tfrac{1}{4} x^z_t \odot h_{t-1} - \tfrac{1}{4} x^n_t \odot h^z_{t-1} + \tfrac{1}{8}(x^r_t - x^z_t) \odot h^n_{t-1} - \tfrac{1}{16} x^r_t \odot x^z_t \odot h^n_{t-1}}_{\text{first-order}} + \underbrace{\tfrac{1}{4} h^z_{t-1} \odot h_{t-1} + \tfrac{1}{8}(h^r_{t-1} - h^z_{t-1}) \odot h^n_{t-1} - \tfrac{1}{16} x^r_t \odot h^z_{t-1} \odot h^n_{t-1} - \tfrac{1}{16} x^z_t \odot h^r_{t-1} \odot h^n_{t-1}}_{\text{second-order}} \underbrace{- \tfrac{1}{16} h^z_{t-1} \odot h^r_{t-1} \odot h^n_{t-1}}_{\text{third-order}} + \xi(x_t, h_{t-1})$,

where $\xi(x_t, h_{t-1})$ refers to the higher-order terms of $x_t$ and $h_{t-1}$ as well as their interactions. We will focus on the terms involving zeroth-order and first-order terms of $h_{t-1}$ and explore the features they can possibly produce. The hidden state at time step $t$ can then be written as:

$h_t = \tfrac{1}{2} x^n_t - \tfrac{1}{4} x^n_t \odot x^z_t + \tfrac{1}{2} h_{t-1} + \tfrac{1}{4} h^n_{t-1} + \tfrac{1}{4} x^z_t \odot h_{t-1} - \tfrac{1}{4} x^n_t \odot h^z_{t-1} + \tfrac{1}{8}(x^r_t - x^z_t - \tfrac{1}{2} x^z_t \odot x^r_t) \odot h^n_{t-1} + \xi'(x_t, h_{t-1})$,

where $\xi'(x_t, h_{t-1})$ refers to the higher-order terms of $h_{t-1}$ plus $\xi(x_t, h_{t-1})$. Note that the Hadamard products can be transformed into matrix-vector multiplications ($a \odot b = \mathrm{diag}(a)\, b$), so we obtain:

$h_t = \tfrac{1}{2} x^n_t - \tfrac{1}{4} x^n_t \odot x^z_t + \left[ \tfrac{1}{2} I + \tfrac{1}{4} W_{hn} + \tfrac{1}{4}\mathrm{diag}(x^z_t) - \tfrac{1}{4}\mathrm{diag}(x^n_t) W_{hz} + \tfrac{1}{8}\mathrm{diag}(x^r_t - x^z_t - \tfrac{1}{2} x^z_t \odot x^r_t) W_{hn} \right] h_{t-1} + \xi'(x_t, h_{t-1})$. (9)

For brevity, let us define two functions of $x_t$:

$g(x_t) = \tfrac{1}{2} x^n_t - \tfrac{1}{4} x^n_t \odot x^z_t$, $\qquad A(x_t) = \tfrac{1}{2} I + \tfrac{1}{4} W_{hn} + \tfrac{1}{4}\mathrm{diag}(x^z_t) - \tfrac{1}{4}\mathrm{diag}(x^n_t) W_{hz} + \tfrac{1}{8}\mathrm{diag}(x^r_t - x^z_t - \tfrac{1}{2} x^z_t \odot x^r_t) W_{hn}$. (10)

Both $g(x_t)$ and $A(x_t)$ are functions of $x_t$ only. Then we can rewrite Equation 9 as:

$h_t = g(x_t) + A(x_t) h_{t-1} + \xi'(x_t, h_{t-1})$. (11)
Throughout all the previous time steps (assuming the initial state is all zeros), the hidden state at time step $t$ can finally be unrolled as:

$h_t = g(x_t) + \sum_{i=1}^{t-1} \underbrace{A(x_t) A(x_{t-1}) \cdots A(x_{i+1})}_{M_{(i+1):t}}\, g(x_i) + g'(x_1, x_2, ..., x_t) = g(x_t) + \sum_{i=1}^{t-1} \underbrace{M_{(i+1):t}\, g(x_i)}_{\Phi_{i:t}} + g'(x_1, x_2, ..., x_t)$, (12)

where $M_{(i+1):t} = \prod_{k=t}^{i+1} A(x_k) \in \mathbb{R}^{d \times d}$ is the matrix-matrix product from time step $t$ down to $i+1$, and $g'(x_1, x_2, ..., x_t)$ collects the unrolled representations from the higher-order terms in Equation 11. The function $g(x_t)$ encodes only the current input, namely token-level features, and thus we call it the token-level representation. The matrix-vector product $\Phi_{i:t} = M_{(i+1):t}\, g(x_i)$ encodes the tokens starting from time step $i$ and ending at time step $t$. If the matrices are different and not diagonal, any change in the order of tokens will result in a different product; therefore, $\Phi_{i:t}$ is able to capture features of the token sequence between time steps $i$ and $t$ in an order-sensitive manner. We call it a sequence-level representation. Such representations are calculated sequentially from left to right through a sequence of vector/matrix multiplications, leading to features reminiscent of the classical N-grams commonly used in natural language processing (NLP). Let us use $\hat{h}_t$ to denote the first two terms in Equation 12:

$\hat{h}_t = g(x_t) + \sum_{i=1}^{t-1} \Phi_{i:t}$. (13)

$\hat{h}_t$ can be called an N-gram representation ($N \geq 1$). At time step $t$, it encodes the current token input and all the token sequences starting from time step $i \in \{1, 2, ..., t-1\}$ and ending at time step $t$ for a given instance. In other words, it is a linear combination of the current token-level input feature (which can be understood as the unigram feature) and the sequence-level features of all possible N-grams ending at time step $t$. Bidirectional GRUs would be able to capture sequence-level features from both directions.
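The token-level representation, the transition matrices, and the resulting N-gram representation can be computed directly. The following sketch uses small random weights and inputs (our choice, keeping activations in the near-linear regime; the tolerance is also our choice) and checks that $\hat{h}_t$ closely tracks the exact GRU hidden state:

```python
import numpy as np

rng = np.random.default_rng(1)
d, dx, T = 4, 3, 4
W = {k: 0.1 * rng.standard_normal((d, dx if k[0] == "i" else d))
     for k in ["ir", "iz", "in", "hr", "hz", "hn"]}
xs = 0.1 * rng.standard_normal((T, dx))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def g(x):   # token-level representation, Eq. (10)
    xn, xz = W["in"] @ x, W["iz"] @ x
    return 0.5 * xn - 0.25 * xn * xz

def A(x):   # first-order transition matrix, Eq. (10)
    xr, xz, xn = W["ir"] @ x, W["iz"] @ x, W["in"] @ x
    return (0.5 * np.eye(d) + 0.25 * W["hn"] + 0.25 * np.diag(xz)
            - 0.25 * np.diag(xn) @ W["hz"]
            + 0.125 * np.diag(xr - xz - 0.5 * xz * xr) @ W["hn"])

# N-gram representation h^_T = g(x_T) + sum_i M_{(i+1):T} g(x_i).
h_hat = g(xs[-1])
for i in range(T - 1):
    M = np.eye(d)
    for k in range(T - 1, i, -1):   # A(x_T) ... A(x_{i+1})
        M = M @ A(xs[k])
    h_hat += M @ g(xs[i])

# Exact GRU hidden state for comparison.
h = np.zeros(d)
for x in xs:
    r = sigmoid(W["ir"] @ x + W["hr"] @ h)
    z = sigmoid(W["iz"] @ x + W["hz"] @ h)
    n = np.tanh(W["in"] @ x + r * (W["hn"] @ h))
    h = (1 - z) * n + z * h

assert np.allclose(h, h_hat, atol=1e-3)
```

The residual (the dropped higher-order terms $g'$) is small here only because inputs and weights are small; with trained weights the gap can be larger, which is exactly what the experiments in Section 5.2 probe.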
Comparing with the unrolled vanilla RNN cell discussed above, we can see that the sequence-level representation $A(x_t) A(x_{t-1}) \cdots A(x_{i+1})\, g(x_i)$ is more expressive than $W_h^{t-i} x^n_i$ ($i = 1, ..., t-1$) when capturing sequence-level features. Specifically, the sequence-level representation in the GRU explicitly models interactions among input tokens while capturing the useful order information they convey. This may also be a reason why the gating mechanism brings improved effectiveness over vanilla RNNs, apart from alleviating the gradient vanishing or explosion problems.

4.3. LSTM

Let us write $x^i_t = W_{ii} x_t$, $x^f_t = W_{if} x_t$, $x^o_t = W_{io} x_t$, $x^c_t = W_{ic} x_t$. For an LSTM cell, we can expand the memory cell and the hidden state in a similar way. We again focus on the terms that involve the zeroth-order and first-order $c_{t-1}$ and $h_{t-1}$, and write the memory cell and hidden state together as:

$\begin{bmatrix} c_t \\ h_t \end{bmatrix} = \begin{bmatrix} g_c(x_t) \\ g_h(x_t) \end{bmatrix} + \begin{bmatrix} B(x_t) & D(x_t) \\ E(x_t) & F(x_t) \end{bmatrix} \begin{bmatrix} c_{t-1} \\ h_{t-1} \end{bmatrix} + \begin{bmatrix} \xi_c(x_t, h_{t-1}, c_{t-1}) \\ \xi_h(x_t, h_{t-1}, c_{t-1}) \end{bmatrix}$, (14)

where:

$g_c(x_t) = \tfrac{1}{4}(x^i_t + 2) \odot x^c_t$, $\quad B(x_t) = \tfrac{1}{4}\mathrm{diag}(x^f_t + 2)$, $\quad D(x_t) = \tfrac{1}{4}\mathrm{diag}(x^i_t + 2) W_{hc} + \tfrac{1}{4}\mathrm{diag}(x^c_t) W_{hi}$, $\quad g_h(x_t) = \tfrac{1}{4}(x^o_t + 2) \odot g_c(x_t)$, $\quad E(x_t) = \tfrac{1}{4}\mathrm{diag}(x^o_t + 2) B(x_t)$, $\quad F(x_t) = \tfrac{1}{4}\mathrm{diag}(x^o_t + 2) D(x_t) + \tfrac{1}{4}\mathrm{diag}(g_c(x_t)) W_{ho}$. (15)

$\xi_c(x_t, h_{t-1}, c_{t-1})$ and $\xi_h(x_t, h_{t-1}, c_{t-1})$ are the higher-order terms. Let the matrix $A(x_t) \in \mathbb{R}^{2d \times 2d}$ denote $\begin{bmatrix} B(x_t) & D(x_t) \\ E(x_t) & F(x_t) \end{bmatrix}$; then the equation above can be written as:

$\begin{bmatrix} c_t \\ h_t \end{bmatrix} = \begin{bmatrix} g_c(x_t) \\ g_h(x_t) \end{bmatrix} + A(x_t) \begin{bmatrix} c_{t-1} \\ h_{t-1} \end{bmatrix} + \begin{bmatrix} \xi_c(x_t, h_{t-1}, c_{t-1}) \\ \xi_h(x_t, h_{t-1}, c_{t-1}) \end{bmatrix}$. (16)

The memory cell and hidden state can then be unrolled as:

$\begin{bmatrix} c_t \\ h_t \end{bmatrix} = \begin{bmatrix} g_c(x_t) \\ g_h(x_t) \end{bmatrix} + \sum_{i=1}^{t-1} \underbrace{\left( \prod_{k=t}^{i+1} A(x_k) \right) \begin{bmatrix} g_c(x_i) \\ g_h(x_i) \end{bmatrix}}_{\Phi_{i:t}} + l'(x_1, x_2, ..., x_t)$, (17)

where $l'(x_1, x_2, ..., x_t)$ collects the unrolled representations from the higher-order terms. The matrix-vector product $\Phi_{i:t} \in \mathbb{R}^{2d}$ for the token sequence between time steps $i$ and $t$ is viewed as the sequence-level representation, and similar properties can be inferred. Analogously, we will use $\hat{c}_t$ and $\hat{h}_t$ to denote the first two terms in Equation 17 respectively.
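The block-matrix step in Equations 14-15 can be checked against one exact LSTM step; a sketch with small random weights and states (our choice, as is the tolerance):

```python
import numpy as np

rng = np.random.default_rng(2)
d, dx = 4, 3
W = {k: 0.1 * rng.standard_normal((d, dx if k[0] == "i" else d))
     for k in ["ii", "if", "io", "ic", "hi", "hf", "ho", "hc"]}
x = 0.1 * rng.standard_normal(dx)
c0 = 0.01 * rng.standard_normal(d)
h0 = 0.01 * rng.standard_normal(d)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Exact LSTM step.
i = sigmoid(W["ii"] @ x + W["hi"] @ h0)
f = sigmoid(W["if"] @ x + W["hf"] @ h0)
o = sigmoid(W["io"] @ x + W["ho"] @ h0)
c_tilde = np.tanh(W["ic"] @ x + W["hc"] @ h0)
c1 = f * c0 + i * c_tilde
h1 = o * np.tanh(c1)

# First-order blocks from Eq. (15).
xi, xf, xo, xc = (W[k] @ x for k in ["ii", "if", "io", "ic"])
gc = 0.25 * (xi + 2) * xc
B = 0.25 * np.diag(xf + 2)
D = 0.25 * np.diag(xi + 2) @ W["hc"] + 0.25 * np.diag(xc) @ W["hi"]
gh = 0.25 * (xo + 2) * gc
E = 0.25 * np.diag(xo + 2) @ B
F = 0.25 * np.diag(xo + 2) @ D + 0.25 * np.diag(gc) @ W["ho"]

approx = (np.concatenate([gc, gh])
          + np.block([[B, D], [E, F]]) @ np.concatenate([c0, h0]))
assert np.allclose(np.concatenate([c1, h1]), approx, atol=1e-3)
```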

5. EXPERIMENTS

We assess the significance of the sequence-level representations on downstream tasks. Negation is a common linguistic phenomenon that negates part or all of the meaning of a linguistic unit with negation words or phrases. In sentiment analysis in particular, the polarity of certain N-grams can be negated by adding negation words or phrases, which makes this task a good testing ground for verifying the effectiveness of the learned sequence-level features. We therefore examine whether the sequence-level representations can capture the negation information for N-grams, which is crucial for sentiment analysis tasks. Language modeling tasks are often used to examine how capable an encoder is at extracting features from text; we use them to verify whether the sequence-level representations, along with the token-level representations, can capture sufficient features during training and produce performance on par with standard GRU or LSTM cells. The statistics of the datasets are shown in Appendix A.1.

5.1. INTERPRETING SEQUENCE-LEVEL REPRESENTATIONS

We first trained the model with the standard GRU or LSTM cell using the Adagrad optimizer (Duchi et al., 2011), then used the learned parameters to calculate and examine the token-level and sequence-level features on sentiment analysis tasks. Final hidden states were used for classification, and L2 regularization was adopted. The model consisted of three layers: an embedding layer, a GRU/LSTM layer, and a fully-connected layer with a sigmoid/softmax function for binary/multi-class sentiment analysis.

5.1.1. POLARITY SCORE

We use the metric polarity score (Sun & Lu, 2020) to help us understand the properties of the token-level and sequence-level features. Following the work of Sun & Lu (2020), we define two types of polarity scores to quantify such polarity information: the token-level polarity score and the sequence-level polarity score. Such scores capture the degree of association between a token (or a sequence) and a specific label. For binary sentiment analysis, each polarity score is a scalar; for multi-class sentiment analysis, each polarity score is a vector over the labels. For a GRU cell, the two types of scores are calculated as:

$s^g_t = w^\top g(x_t)$, $\qquad s^{\Phi}_{i:t} = w^\top \Phi_{i:t}$. (18)

For an LSTM cell, the sequence-level representation can be split into two parts, $\Phi_{i:t} = [\Phi^c_{i:t}, \Phi^h_{i:t}]$ ($\Phi^c_{i:t}, \Phi^h_{i:t} \in \mathbb{R}^d$), and the polarity scores are calculated as:

$s^g_t = w^\top g_h(x_t)$, $\qquad s^{\Phi}_{i:t} = w^\top \Phi^h_{i:t}$. (19)

The overall polarity score at time step $t$ can then be viewed as the sum of the token-level polarity score, the sequence-level polarity scores, and the remaining polarity scores:

$s_t = w^\top h_t = s^g_t + \sum_{i=1}^{t-1} s^{\Phi}_{i:t} + s'_t$, (20)

where $w$ is the fully-connected layer weight, $s^g_t$ is the token-level polarity score at time step $t$, $s^{\Phi}_{i:t}$ is the sequence-level polarity score for the sequence between time steps $i$ and $t$, and $s'_t$ is the polarity score produced by the higher-order terms in $(x_1, x_2, ..., x_t)$. For binary sentiment analysis, $w \in \mathbb{R}^d$ and $s^g_t, s^{\Phi}_{i:t} \in \mathbb{R}$. For multi-class sentiment analysis, $w \in \mathbb{R}^{d \times k}$ and $s^g_t, s^{\Phi}_{i:t} \in \mathbb{R}^k$, where $k$ is the label size. The overall polarity scores are used to make decisions for sentiment analysis. We examined sequence-level representations on the binary and 3-class Stanford Sentiment Treebank (SST) datasets with sub-phrase labels (Socher et al., 2013) respectively. Final models were selected based on validation performance, with embeddings randomly initialized.
The embedding size and hidden size were set as 300 and 1024 respectively.
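The decomposition of the overall score in Equation 20 is just linearity of the dot product; a minimal NumPy sketch, with random placeholder vectors standing in for $g(x_t)$ and $\Phi_{i:t}$ (not learned features):

```python
import numpy as np

rng = np.random.default_rng(4)
d, t = 8, 5
w = rng.standard_normal(d)               # fully-connected layer weight (binary case)
g_t = rng.standard_normal(d)             # token-level representation at step t (placeholder)
phis = rng.standard_normal((t - 1, d))   # sequence-level representations Phi_{i:t} (placeholders)

s_token = w @ g_t                        # token-level polarity score s^g_t
s_seq = phis @ w                         # one sequence-level score s^Phi_{i:t} per start position i
h_hat = g_t + phis.sum(axis=0)           # N-gram representation h^_t

# Scoring the N-gram representation equals summing the individual scores.
assert np.isclose(w @ h_hat, s_token + s_seq.sum())
```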

5.1.2. SEPARATING PHRASES

We extracted all the short labeled phrases (2-5 words; 18,490 positive and 12,199 negative) from the binary training set, and calculated the sum of the token-level polarity score and the sequence-level polarity scores for each phrase using the first two terms in Equation 20. We call this sum the phrase polarity score. Figure 1 shows that the two types of phrases can generally be separated by the phrase polarity scores. This set of experiments shows that the learned sequence-level features capture information useful for discriminating between positive and negative phrases.

5.1.3. NEGATING ADJECTIVES

We extracted 65 positive adjectives and 42 negative adjectives[foot_1] following the criterion in the work of Sun & Lu (2020) from the vocabulary of the binary SST training set. We calculated the token-level polarity scores for those adjectives and the sequence-level polarity scores for their corresponding negation bigrams (formed by adding the negation words "not" and "never"). Table 1 shows that the model can likely learn to infer negation through the sequence-level polarity scores: the negation bigrams generally have sequence-level polarity scores of signs opposite to those of the corresponding adjectives. For example, "outstanding" has a large positive token-level polarity score while "not outstanding" has a large negative sequence-level polarity score. We also examined whether the sequence-level representations could play a dissenting role in negating the polarity of sub-phrases. We searched for labeled phrases (3-6 tokens) that start with the negation words "not", "never" and "hardly" and have corresponding labeled sub-phrases without negation words (those sub-phrases have opposite labels). For example, the phrase "hardly seems worth the effort" was labeled as "negative" while the sub-phrase "seems worth the effort" was labeled as "positive". Based on such conditions, we automatically extracted 14 positive and 36 negative phrases along with their corresponding sub-phrases, then calculated the polarity scores with pretrained models. We would like to see whether the polarity scores assigned to such linguistic units by our models are consistent with the labels. Ideally, based on our analysis, the longest N-grams of the phrases will be assigned polarity scores consistent with their labels, offsetting the impact of their sub-phrases. Table 2 shows that the sequence-level representations of the N-grams are generally assigned polarity scores with signs opposite to those of the sub-phrases, likely playing a dissenting role.
Figure 2 shows that the sequence-level polarity score of the four-gram "hardly an objective documentary" was large and negative, which could help reverse the polarity of the sub-phrase "an objective documentary" and make the overall polarity of the phrase negative. Such negation can also be observed in models with bidirectional GRU or LSTM cells (Appendix A.2). We noticed that in order for cells such as GRU/LSTM to learn complex compositional language structures, it may be essential for the model to have enough exposure to the relevant structures during the training phase. We ran a simple controlled experiment on two sets of labeled instances. In the first set, the six training instances are "good", "not good", "not not good", "not not not good", "not not not not good" and "not not not not not good" with alternating labels "positive" and "negative". The second set consists of only two labeled instances: the positive phrase "good" and the negative phrase "not not not not not good". We trained the GRU model on each training set, then applied the resulting models to a dataset obtained by extending the first training set with two additional phrases, "not not not not not not good" and "not not not not not not not good". As we can see from Figure 3, the model trained on the first set can infer multiple negation correctly for the given cases and generalizes well to the unseen phrases; the model trained on the second set fails to do so. This indicates that proper supervision is needed for the models to capture the compositional nature of the semantics conveyed by N-grams.

5.2. TRAINING WITH SEQUENCE-LEVEL REPRESENTATIONS

We examined whether the sequence-level representations along with the token-level representations could capture sufficient features during training and perform on par with the standard cells. We trained models in which the standard GRU or LSTM cell was replaced with the corresponding N-gram representations ($\hat{h}_t$ for a GRU or LSTM cell). We evaluated them on both sentiment analysis and language modeling tasks and compared them with the standard models. Additionally, we created a baseline model named "Simplified" by removing all the terms involving $x_t$ from $A(x_t)$ in Equations 11 and 16; the resulting representations do not capture sequence-level information. On the binary SST dataset (with sub-phrases) and the Movie Review dataset (Pang & Lee, 2004), both the standard cells and our N-gram representations behaved similarly, as shown in Table 3, while the "Simplified" models did not perform well. GloVe (Pennington et al., 2014) embeddings were used. We ran language modeling tasks on the Penn Treebank (PTB) dataset (Marcus et al., 1993), the Wikitext-2 dataset, and the Wikitext-103 dataset (Merity et al., 2016) respectively[foot_2]. The embedding size and hidden size were both set to 128 for PTB and Wikitext-2, and to 256 for Wikitext-103. Adaptive softmax (Joulin et al., 2017) was used for Wikitext-103. As shown in Table 4, using such representations yields results comparable to the standard GRU or LSTM cells, whereas the performance of the "Simplified" representations dropped sharply, which implies the significance of the sequence-level representations. We noticed that the intermediate outputs could grow to very large values on Wikitext-103; we therefore clamped the elements of the hidden states to the range (-3, 3) at each time step. Overall, this demonstrates that the sequence-level features might be a significant contributor to the performance of a GRU or LSTM cell apart from the token-level features.
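One property implicit in the unrolling makes this replacement practical: $\hat{h}_t$ obeys the linear recurrence $\hat{h}_t = g(x_t) + A(x_t)\hat{h}_{t-1}$, so it can be computed step by step like a cell. A sketch with random weights (ours, not trained), including the clamping used for Wikitext-103, and a check that the recurrence reproduces the explicit sum of $\Phi_{i:t}$ terms:

```python
import numpy as np

rng = np.random.default_rng(3)
d, dx, T = 4, 3, 6
W = {k: 0.1 * rng.standard_normal((d, dx if k[0] == "i" else d))
     for k in ["ir", "iz", "in", "hz", "hn"]}
xs = 0.1 * rng.standard_normal((T, dx))

def g(x):   # token-level representation (Eq. 10)
    xn, xz = W["in"] @ x, W["iz"] @ x
    return 0.5 * xn - 0.25 * xn * xz

def A(x):   # first-order transition matrix (Eq. 10)
    xr, xz, xn = W["ir"] @ x, W["iz"] @ x, W["in"] @ x
    return (0.5 * np.eye(d) + 0.25 * W["hn"] + 0.25 * np.diag(xz)
            - 0.25 * np.diag(xn) @ W["hz"]
            + 0.125 * np.diag(xr - xz - 0.5 * xz * xr) @ W["hn"])

def ngram_states(xs, clamp=3.0):
    """h^_t = g(x_t) + A(x_t) h^_{t-1}: drop-in recurrent replacement for the cell."""
    h = np.zeros(d)
    states = []
    for x in xs:
        h = np.clip(g(x) + A(x) @ h, -clamp, clamp)  # clamping as used for Wikitext-103
        states.append(h)
    return np.stack(states)

# The recurrence reproduces the explicit N-gram sum g(x_T) + sum_i Phi_{i:T}.
h_rec = ngram_states(xs)[-1]
h_sum = g(xs[-1])
for i in range(T - 1):
    M = np.eye(d)
    for k in range(T - 1, i, -1):
        M = M @ A(xs[i + 1 + (T - 2 - i) - (T - 2 - i)]) if False else M @ A(xs[k])
    h_sum += M @ g(xs[i])
assert np.allclose(h_rec, h_sum)
```

(The clip is inactive at these small magnitudes, so the identity holds exactly up to floating-point error.)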
Although the N-gram representations perform well on both tasks, we cannot rule out the contributions of other complex underlying features possibly captured by the standard cells. This can be seen from the test perplexities obtained with the N-gram representations, which are generally slightly higher than those obtained with the standard GRU or LSTM cells. Nevertheless, the N-gram representations even outperformed the standard GRU cell on Wikitext-103.

6. CONCLUSION

In this work, we explored the underlying features captured by GRU and LSTM cells. We expanded and unrolled the internal states of a GRU or LSTM cell and, through a series of mathematical transformations, found special representations among the resulting terms. Theoretically, we showed that those representations are able to encode sequence-level features, and we established their close connection with the N-gram information captured by classical sequence models. Empirically, we examined the use of such representations based on our finding, and showed that they can capture linguistic phenomena such as negation on sentiment analysis tasks. We also found that models using only such representations can behave similarly to the standard GRU or LSTM models on both sentiment analysis and language modeling tasks. Our results confirm the importance of the sequence-level features captured by GRUs and LSTMs; at the same time, we note that we cannot rule out the contributions of other, more complex features captured by the standard models. Several future directions are worth exploring. One is to explore possible significant features captured by the higher-order terms in a GRU/LSTM cell and to understand how they contribute to performance.

A APPENDIX

A.1 DATASET STATISTICS

We list the statistics of the datasets used in our experiments. We extracted adjectives (shown in Table 7) based on their frequency ratio in the positive and negative instances: if an adjective appeared mostly in positive (negative) instances, we regarded it as a positive (negative) adjective. The textblob package[foot_3] was used to detect adjectives. The labeled phrases (shown in Table 8) were selected from the SST dataset if they had labeled sub-phrases of opposite signs obtained by removing the negation words.

A.2 NEGATION WITH BIDIRECTIONAL CELLS

A.2.1 BINARY SST DATASET

We also conducted experiments using bidirectional GRU and LSTM cells on the binary SST dataset. Figures 4 and 5 show that the models can capture such negation from both directions. For example, the four-gram "to feel contradictory things" has a negative sequence-level polarity score while the five-gram "freedom to feel contradictory things" has a positive one. Similarly, in the backward direction, the bigram "to freedom" has a positive sequence-level polarity score while the five-gram "things contradictory feel to freedom" has a negative one. GloVe (Pennington et al., 2014) embeddings were used. The embedding size and hidden size were set as 300 and 1024 respectively.
Table 8: Selected labeled phrases from the SST dataset.

Positive: never fails him; never fails to fascinate .; not a bad way; never wanted to leave .; never feels derivative; not to be dismissed; never becomes claustrophobic; never fails to entertain; never feels draggy; never veers from its comic course; never growing old; not a bad premise; not without merit; not mean -spirited

Negative: not life -affirming; not for every taste; hardly an objective documentary; not a must -own; never takes hold .; not a classic; not the best herzog; never rises to a higher level; not exactly assured in its execution; not as good as the original; not be a breakthrough in filmmaking; not always for the better; not well -acted; not very compelling or much fun; never reach satisfying conclusions; not as sharp; never rises above superficiality .; never seems fresh and vital .; not a good movie; not a movie make; not the great american comedy; not smart and; hardly seems worth the effort .; not enough of interest onscreen; not funny performers; not well enough; not in a good way; hardly a nuanced portrait; not good enough; not number 1; not very amusing; never gaining much momentum; not well enough; not one clever line; not a good movie; never comes together

A.2.2 3-CLASS SST DATASET

We also considered the dissenting scenarios for 3-class sentiment analysis. We extracted 160 pairs of labeled phrases starting with negation words and their sub-phrases with different labels from the 3-class SST dataset, and trained the model on the extracted pairs directly until all instances were classified correctly. Table 9 shows that the sequence-level polarity scores of the longest N-grams in the phrases generally capture the differences between the pairs and dominate in the dimensions corresponding to the labels.

Table 9: Results on the sampled 3-class SST dataset. "neu", "pos" and "neg" indicate that the sequence-level polarity scores have their largest value in the dimension corresponding to the label "neutral", "positive" or "negative" respectively.

A.2.3 IMPACT OF N -GRAM LENGTHS

To understand the impact of sequences of different lengths, we selected positive and negative N-grams (N = 1-4) ending with the last token of the instances, based on their association with positive and negative labels respectively. As shown in Figure 6, the token-level polarity scores for unigrams and the sequence-level polarity scores for bigrams reflect their association with the labels better than those for trigrams and four-grams. This demonstrates that although the sequence-level features can be well captured and are important for sentiment analysis tasks, shorter sequences in general may play more crucial roles than longer ones.

A.3 NEGATION ON A SYNTHETIC DATASET

We created synthetic instances from words following simple grammatical rules, with a vocabulary of 85 words including nouns, verbs, adjectives, adverbs, articles and negation words. Negation phrases and double negation phrases are also incorporated in these instances. We let positive adjectives such as "inspiring" and the double negation expression "not not inspiring" appear in instances labeled as positive, e.g., "her dramas were indeed inspiring" and "her movies are not not inspiring", and let the negation expression "not inspiring" appear in instances labeled as negative, e.g., "his movie is not inspiring". We did similarly for negative adjectives. Table 10 shows that the sequence-level representations generally have polarity scores matching the roles of the negation and double negation N-grams: for the positive (negative) adjectives, their negation N-grams have negative (positive) sequence-level polarity scores while their double negation N-grams have positive (negative) ones. The negation tokens are "not" and "never". Models were trained until all the negation expressions were classified correctly. The key tokens, negation and double negation phrases are shown in Table 11.

A.4 PERFORMANCE DURING TRAINING

We compare the performance of the standard GRU/LSTM cells and the N-gram representations during training. As can be seen from Figures 7 and 8, the N-gram representations perform similarly to the standard GRU/LSTM cells on the PTB, Wikitext-2 and Wikitext-103 datasets. Adam optimizers (Kingma & Ba, 2014) were used.

A.5 MULTIPLE NEGATION ON A SIMPLE SET

To scrutinize the sequence-level features under controlled conditions, we created a training set consisting of the phrases "good", "not good", "not not good", "not not not good", "not not not not good" and "not not not not not good" with alternating labels "positive" and "negative". We trained a standard GRU or LSTM cell on this training set until the loss converged (less than 10e-6), with embedding size 256 and hidden size 1024, then calculated the corresponding polarity scores for the N-grams. Figure 9 shows that the sequence-level features derived from the pre-trained GRU or LSTM cell were able to detect multiple negation, implying those features were likely significant for classification decisions. For example, the four-gram "not not not good" generally has a large negative sequence-level polarity score, the five-gram "not not not not good" generally has a large positive one, and the six-gram "not not not not not good" has a large negative one again. These polarity scores help the phrases reverse the polarity of their sub-phrases with opposite labels. Figure 10 shows examples from a pre-trained GRU and an LSTM cell with random initializations: with each "not", the sequence-level polarity score reverses the polarity. As per the aforementioned analysis, it may be essential for the models to have enough exposure to the relevant structures during the training phase.

Figure 9: Each box represents a polarity score distribution for either the token "good" or the i-times negation N-grams (shown as i-not, i = 1, 2, ..., 5). Circles refer to outliers. Results from 30 trials with random initializations. An LSTM cell is used.
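The toy training set above (and the extended evaluation set used in Section 5.1.3) can be generated programmatically; a minimal sketch:

```python
# Training set: multiple negations of "good" with alternating labels.
phrases = ["not " * i + "good" for i in range(6)]
labels = ["positive" if i % 2 == 0 else "negative" for i in range(6)]

# Evaluation set: the same pattern extended by two unseen phrases (6 and 7 "not"s).
eval_phrases = ["not " * i + "good" for i in range(8)]
```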



For brevity, we suppressed the bias for both GRU and LSTM cells here. The adjectives are listed in Table 7 in the appendix. Our model is a word-level language model; we used the torchtext package to obtain and process the data. https://textblob.readthedocs.io/en/dev/

Extracted adjectives: dull, boring, tedious, horrible, terrible, pathetic, mediocre, shallow, pointless, unfunny, gross, poor, dreadful, dire, useless

Negation examples: not horrible, not pointless; not dire, not mediocre; not terrible, not unfunny; not bad, not dull; never boring, never horrible; never seem terrible, never seem pointless; never seemed unfunny



Figure 1: Phrase polarity score distribution for short phrases in binary SST. Left, GRU; right, LSTM.

Figure 2: Example of dissenting a sub-phrase. Polarity scores are listed for N -grams in the phrase "hardly an objective documentary". Each vertical bar represents either a token or a sequence that starts at its left side and ends at its right side. Left, GRU; right, LSTM.

Figure 3: Distributions of polarity scores for negation N -grams (N = 1-8) in the phrase "not not not not not good". Each box represents a polarity score distribution for either the token "good" or the i times negation N -grams (e.g., "3-not" refers to "not not not good"). Circles refer to outliers. Left, model trained on six labeled phrases. Right, model trained on two labeled phrases. Results from 30 trials with random initializations. A GRU cell is used.

Figure 4: Polarity scores for N -grams (N = 1-5) in the phrase "freedom to feel contradictory things" from bidirectional GRU cells. Left, forward cell; right, backward cell. Each bar represents an N -gram. Red refers to negative polarity scores while blue refers to positive ones.

Figure 5: Polarity scores for N -grams (N = 1-5) in the phrase "freedom to feel contradictory things" from bidirectional LSTM cells. Each bar represents an N -gram. Red refers to negative polarity scores while blue refers to positive ones.

Figure 6: Polarity score distribution for the N -grams (N = 1-4) that have a strong association with a specific label. Result from a GRU cell on the SST dataset.

Figure 7: Top, standard GRU cell; bottom, approximate hidden state representation. From left to right: PTB, Wikitext-2, Wikitext-103. Training and validation losses for language modeling tasks.

Figure 8: Top, standard LSTM cell; bottom, approximate hidden state representation. From left to right: PTB, Wikitext-2, Wikitext-103. Training and validation losses for language modeling tasks.

Figure 9: Distributions of polarity scores for negation N -grams (N = 1-6). Each box represents a polarity score distribution for either the token "good" or the i times negation N -grams (shown as i-not, i = 1, 2...5). Circles refer to outliers. Results from 30 trials with random initializations. An LSTM cell is used.

Figure 10: Polarity scores for N -grams (N = 1-6) in "not not not not not good". Left, GRU; right, LSTM. Each vertical bar represents the polarity score (token-level or sequence-level) for the N -gram that it covers. Red bars refer to negative polarity scores and blue bars refer to positive scores.

h_t ∈ R^d and x_t ∈ R^{d_x} are the hidden state and input at time step t respectively; i_t, f_t, o_t ∈ R^d are the input gate, forget gate, and output gate respectively; c̃_t ∈ R^d is the new memory and c_t is the final memory. W refers to a weight matrix.
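For reference, this notation corresponds to the standard LSTM update equations (with bias terms suppressed, as noted; we write U for the recurrent weight matrices, whereas the paper's W covers both input and recurrent weights):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1}) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1}) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1}) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here σ is the logistic sigmoid and ⊙ denotes element-wise multiplication.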

Statistics of token-level polarity scores for positive and negative adjectives and sequence-level polarity scores for negation bigrams. Negation words are "not" and "never". Models trained on the binary SST dataset.

Dissenting Sub-phrases: "Polarity score" refers to the sequence-level polarity scores for the longest N -gram in the phrases. ">0" and "<0" refer to the number of positive and negative polarity scores respectively. "# Ph" refers to the number of phrases. "Example" refers to an example phrase (with its sequence-level polarity score) for each type of extracted phrase. The sub-phrases have labels opposite to those of the phrases. Models trained on the binary SST dataset.

Accuracy (%) on sentiment analysis datasets

Perplexities on language modeling datasets. All parameters were initialized randomly.

Statistics of sentiment analysis datasets. "Label" refers to the size of the positive labels and negative labels in the training set, "positive/negative" for binary classification, "positive/neutral/negative" for 3-class classification.

Statistics of language modeling dataset, quoted from Einstein.ai

Extracted adjectives from the SST dataset

Polarity score distribution for tokens, negation phrases and double negation phrases. Results from 10 trials with random initializations.

Adjectives, negation and double negation examples for the synthetic dataset

