THE SURPRISING COMPUTATIONAL POWER OF NONDETERMINISTIC STACK RNNS

Abstract

Traditional recurrent neural networks (RNNs) have a fixed, finite number of memory cells. In theory (assuming bounded range and precision), this limits their formal language recognition power to regular languages, and in practice, RNNs have been shown to be unable to learn many context-free languages (CFLs). In order to expand the class of languages RNNs recognize, prior work has augmented RNNs with a nondeterministic stack data structure, putting them on par with pushdown automata and increasing their language recognition power to CFLs. Nondeterminism is needed for recognizing all CFLs (not just deterministic CFLs), but in this paper, we show that nondeterminism and the neural controller interact to produce two more unexpected abilities. First, the nondeterministic stack RNN can recognize not only CFLs, but also many non-context-free languages. Second, it can recognize languages with much larger alphabet sizes than one might expect given the size of its stack alphabet. Finally, to increase the information capacity in the stack and allow it to solve more complicated tasks with large alphabet sizes, we propose a new version of the nondeterministic stack that simulates stacks of vectors rather than discrete symbols. We demonstrate perplexity improvements with this new model on the Penn Treebank language modeling benchmark.

1. INTRODUCTION

Standard recurrent neural networks (RNNs), including simple RNNs (Elman, 1990), GRUs (Cho et al., 2014), and LSTMs (Hochreiter & Schmidhuber, 1997), rely on a fixed, finite number of neurons to remember information across timesteps. When implemented with finite precision, they are theoretically just very large finite automata, restricting the class of formal languages they recognize to regular languages (Kleene, 1951). In practice, too, LSTMs cannot learn simple non-regular languages such as {w#w^R | w ∈ {0, 1}*} (DuSell & Chiang, 2020). To increase the theoretical and practical computational power of RNNs, past work has proposed augmenting RNNs with stack data structures (Sun et al., 1995; Grefenstette et al., 2015; Joulin & Mikolov, 2015; DuSell & Chiang, 2020), inspired by the fact that adding a stack to a finite automaton makes it a pushdown automaton (PDA), raising its recognition power to context-free languages (CFLs). Recently, we proposed the nondeterministic stack RNN (NS-RNN) (DuSell & Chiang, 2020) and renormalizing NS-RNN (RNS-RNN) (DuSell & Chiang, 2022), augmenting an LSTM with a differentiable data structure that simulates a real-time nondeterministic PDA. (The PDA is real-time in that it executes exactly one transition per input symbol scanned, and it is nondeterministic in that it executes all possible sequences of transitions.) This was in contrast to prior work on stack RNNs, which exclusively modeled deterministic stacks, theoretically limiting such models to deterministic CFLs (DCFLs), which are a proper subset of CFLs (Sipser, 2013). The RNS-RNN proved more effective than deterministic stack RNNs at learning both nondeterministic and deterministic CFLs. In practical terms, giving RNNs the ability to recognize context-free patterns may be beneficial for modeling natural language, as syntax exhibits hierarchical structure; nondeterminism in particular is necessary for handling the very common phenomenon of syntactic ambiguity.
However, the RNS-RNN's reliance on a PDA may still render it inadequate for practical use. For one, not all phenomena in human language are context-free, such as cross-serial dependencies. Secondly, the RNS-RNN's computational cost restricts it to small stack alphabet sizes, which is likely insufficient for storing detailed lexical information. In this paper, we show that the RNS-RNN is surprisingly good at overcoming both difficulties. Whereas an ordinary weighted PDA must use the same transition weights for all timesteps, the RNS-RNN can update them based on the status of ongoing nondeterministic branches of the PDA. This means it can coordinate multiple branches in a way an ordinary PDA cannot: for example, to simulate multiple stacks, or to encode information in the distribution over stacks.

Our contributions in this paper are as follows. We first prove that the RNS-RNN can recognize all CFLs and intersections of CFLs despite restrictions on its PDA transitions. We show empirically that the RNS-RNN can model some non-CFLs; in fact, it is the only stack RNN able to learn {w#w | w ∈ {0, 1}*}, whereas a deterministic multi-stack RNN cannot. We then show that, surprisingly, an RNS-RNN with only 3 stack symbol types can learn to simulate a stack of no fewer than 200 symbol types, by encoding them as points in a vector space related to the distribution over stacks. Finally, in order to combine the benefits of nondeterminism and vector representations, we propose a new model that simulates a PDA with a stack of vectors instead of discrete symbols. We show that the new vector RNS-RNN outperforms the original on the Dyck language and achieves better perplexity than other stack RNNs on the Penn Treebank. Our code is publicly available.

2. STACK RNNS

In this paper, we examine two styles of stack-augmented RNN, using the same architectural framework as in our previous work (DuSell & Chiang, 2022) (Fig. 1). In both cases, the model consists of an LSTM, called the controller, connected to a differentiable stack. At each timestep, the stack receives actions from the controller (e.g. to push and pop elements). The stack simulates those actions and produces a reading vector, which represents the updated top element of the stack. The reading is fed as an extra input to the controller at the next timestep. The actions and reading consist of continuous and differentiable weights so the whole model can be trained end-to-end with backpropagation; their form and meaning vary depending on the particular style of stack. We assume the input w = w_1 ⋯ w_n is encoded as a sequence of vectors x_1, …, x_n. The LSTM's memory consists of a hidden state h_t and a memory cell c_t (we set h_0 = c_0 = 0). The controller computes the next state (h_t, c_t) given the previous state, the input vector x_t, and the stack reading r_{t-1}:

(h_t, c_t) = LSTM((h_{t-1}, c_{t-1}), [x_t; r_{t-1}]).

The hidden state generates the stack actions a_t and the logits y_t for predicting the next word w_{t+1}; the previous stack and the new actions generate a new stack s_t, which produces a new reading r_t:

a_t = ACTIONS(h_t)
y_t = W_hy h_t + b_hy
s_t = STACK(s_{t-1}, a_t)
r_t = READING(s_t).

Each style of stack differs only in the definitions of ACTIONS, STACK, and READING.
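The recurrence above can be sketched in a few lines of code. The following is a minimal, illustrative NumPy implementation of the controller-stack loop, with a deliberately trivial push-only stack standing in for the model-specific STACK and READING definitions; all parameter names, dimensions, and initializations here are our own illustrative choices, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_HID, D_READ, D_OUT = 4, 8, 3, 5

def linear(m, n):
    """Illustrative random initialization of a weight matrix and bias."""
    return rng.standard_normal((n, m)) * 0.1, np.zeros(n)

# The controller's input at each step is [x_t; r_{t-1}] concatenated
# with the previous hidden state.
W_f, b_f = linear(D_IN + D_READ + D_HID, D_HID)  # forget gate
W_i, b_i = linear(D_IN + D_READ + D_HID, D_HID)  # input gate
W_o, b_o = linear(D_IN + D_READ + D_HID, D_HID)  # output gate
W_g, b_g = linear(D_IN + D_READ + D_HID, D_HID)  # candidate cell
W_a, b_a = linear(D_HID, D_READ)                 # ACTIONS head
W_hy, b_hy = linear(D_HID, D_OUT)                # next-word logits

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(h, c, x, r):
    """One LSTM step; the controller sees the input and the stack reading."""
    z = np.concatenate([x, r, h])
    f = sigmoid(W_f @ z + b_f)
    i = sigmoid(W_i @ z + b_i)
    o = sigmoid(W_o @ z + b_o)
    g = np.tanh(W_g @ z + b_g)
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

# A deliberately trivial differentiable "stack": push the action vector,
# read back the top element. A real stack (superposition, nondeterministic)
# replaces stack_update and reading with its own definitions.
def actions(h):
    return np.tanh(W_a @ h + b_a)      # a_t = ACTIONS(h_t)

def stack_update(s, a):
    return s + [a]                     # s_t = STACK(s_{t-1}, a_t)

def reading(s):
    return s[-1]                       # r_t = READING(s_t)

def run(xs):
    h, c = np.zeros(D_HID), np.zeros(D_HID)
    s, r = [], np.zeros(D_READ)
    logits = []
    for x in xs:
        h, c = lstm_cell(h, c, x, r)   # (h_t, c_t)
        logits.append(W_hy @ h + b_hy) # y_t = W_hy h_t + b_hy
        s = stack_update(s, actions(h))
        r = reading(s)                 # fed back at the next timestep
    return np.stack(logits)

ys = run(rng.standard_normal((6, D_IN)))  # 6 timesteps of logits
```

Swapping in a different ACTIONS/STACK/READING triple, while keeping this loop fixed, is exactly how the two stack styles below differ.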

2.1. SUPERPOSITION STACK RNN

We start by describing a stack with deterministic actions, namely the superposition stack of Joulin & Mikolov (2015), which we include because it was one of the best-performing stack RNNs we



https://github.com/bdusell/nondeterministic-stack-rnn



Figure 1: Conceptual diagram of the RNN controller-stack interface, unrolled across a portion of time. The LSTM memory cell c_t is not shown.

