THE SURPRISING COMPUTATIONAL POWER OF NONDETERMINISTIC STACK RNNS

Abstract

Traditional recurrent neural networks (RNNs) have a fixed, finite number of memory cells. In theory (assuming bounded range and precision), this limits their formal language recognition power to regular languages, and in practice, RNNs have been shown to be unable to learn many context-free languages (CFLs). To expand the class of languages RNNs can recognize, prior work has augmented RNNs with a nondeterministic stack data structure, putting them on par with pushdown automata and increasing their language recognition power to CFLs. Nondeterminism is needed for recognizing all CFLs (not just deterministic CFLs), but in this paper, we show that nondeterminism and the neural controller interact to produce two more unexpected abilities. First, the nondeterministic stack RNN can recognize not only CFLs, but also many non-context-free languages. Second, it can recognize languages with much larger alphabet sizes than one might expect given the size of its stack alphabet. Finally, to increase the information capacity of the stack and allow it to solve more complicated tasks with large alphabet sizes, we propose a new version of the nondeterministic stack that simulates stacks of vectors rather than of discrete symbols. We demonstrate perplexity improvements with this new model on the Penn Treebank language modeling benchmark.

1. INTRODUCTION

Standard recurrent neural networks (RNNs), including simple RNNs (Elman, 1990), GRUs (Cho et al., 2014), and LSTMs (Hochreiter & Schmidhuber, 1997), rely on a fixed, finite number of neurons to remember information across timesteps. When implemented with finite precision, they are theoretically just very large finite automata, restricting the class of formal languages they recognize to regular languages (Kleene, 1951). In practice, too, LSTMs cannot learn simple non-regular languages such as {w#w^R | w ∈ {0, 1}*} (DuSell & Chiang, 2020). To increase the theoretical and practical computational power of RNNs, past work has proposed augmenting RNNs with stack data structures (Sun et al., 1995; Grefenstette et al., 2015; Joulin & Mikolov, 2015; DuSell & Chiang, 2020), inspired by the fact that adding a stack to a finite automaton makes it a pushdown automaton (PDA), raising its recognition power to context-free languages (CFLs). Recently, we proposed the nondeterministic stack RNN (NS-RNN) (DuSell & Chiang, 2020) and renormalizing NS-RNN (RNS-RNN) (DuSell & Chiang, 2022), augmenting an LSTM with a differentiable data structure that simulates a real-time nondeterministic PDA. (The PDA is real-time in that it executes exactly one transition per input symbol scanned, and it is nondeterministic in that it executes all possible sequences of transitions.) This was in contrast to prior work on stack RNNs, which exclusively modeled deterministic stacks, theoretically limiting such models to deterministic CFLs (DCFLs), which are a proper subset of CFLs (Sipser, 2013). The RNS-RNN proved more effective than deterministic stack RNNs at learning both nondeterministic and deterministic CFLs. In practical terms, giving RNNs the ability to recognize context-free patterns may be beneficial for modeling natural language, as syntax exhibits hierarchical structure; nondeterminism in particular is necessary for handling the very common phenomenon of syntactic ambiguity.
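To make the role of the stack and of nondeterminism concrete, the following is a minimal illustrative sketch (the function names are ours, and this is of course not the paper's neural model). The first recognizer handles {w#w^R}, where the marker # deterministically signals when to switch from pushing to popping; the second handles {ww^R}, where no marker exists, so the recognizer must guess the midpoint, which we simulate by trying every split, analogous to a nondeterministic PDA running all transition sequences at once.

```python
def recognize_marked(s: str) -> bool:
    """Deterministic stack recognizer for {w#w^R | w in {0,1}*}.

    The marker # tells us exactly when to stop pushing and start
    popping, so a deterministic PDA suffices (the language is a DCFL).
    """
    stack = []
    seen_marker = False
    for ch in s:
        if ch == "#":
            if seen_marker:
                return False  # at most one marker allowed
            seen_marker = True
        elif ch in "01":
            if not seen_marker:
                stack.append(ch)  # push the first half
            elif not stack or stack.pop() != ch:
                return False  # mismatch with the reversed half
        else:
            return False  # symbol outside the alphabet
    return seen_marker and not stack


def recognize_unmarked(s: str) -> bool:
    """Recognizer for {ww^R | w in {0,1}*} (no midpoint marker).

    With no marker, a deterministic one-pass stack machine cannot know
    when to switch from pushing to popping; a nondeterministic PDA
    guesses the midpoint. Here we simulate that nondeterminism by
    trying every possible split point.
    """
    if any(ch not in "01" for ch in s):
        return False
    # Each split i is one nondeterministic "guess" of the midpoint.
    return any(s[:i] == s[i:][::-1] for i in range(len(s) + 1))
```

A finite automaton cannot recognize either language, since w may be arbitrarily long; the stack supplies the unbounded memory, and trying all splits supplies the nondeterminism.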
However, the RNS-RNN's reliance on a PDA may still render it inadequate for practical use. For one, not all phenomena in human language are context-free; cross-serial dependencies, for example, are not. Secondly, the RNS-RNN's
