MINIMUM DESCRIPTION LENGTH RECURRENT NEURAL NETWORKS

Abstract

Recurrent neural networks (RNNs) face two well-known challenges: (a) the difficulty such networks have in generalizing appropriately rather than memorizing, especially from very short input sequences (generalization); and (b) the difficulty of understanding the knowledge that the network has attained (transparency). We explore the implications for these challenges of employing a general search through neural architectures using a genetic algorithm with Minimum Description Length (MDL) as an objective function. We find that MDL leads the networks to reach adequate levels of generalization from very small corpora, improving over backpropagation-based alternatives. We demonstrate this approach by evolving networks which perform tasks of increasing complexity with absolute correctness. The resulting networks are small, easily interpretable, and unlike classical RNNs, are provably appropriate for sequences of arbitrary length even when trained on very limited corpora. One case study is addition, for which our system grows a network with just four cells, reaching 100% accuracy (and at least .999 certainty) for arbitrarily large numbers.

1. INTRODUCTION

The modeling of sequential knowledge and learning requires making appropriate generalizations from input sequences that are often quite short. This holds both for language capabilities and for other sequential tasks such as counting. Moreover, it is often helpful for the modeler to inspect the acquired knowledge and reason about its properties. Neural networks, despite their impressive results and popularity in a wide range of domains, still face some challenges in these respects: they tend to overfit the learning data and require regularization or other special measures, as well as very large training corpora, to avoid this problem. In terms of knowledge, networks are often very big, and it is generally very hard to inspect a given network and determine what it is that it actually knows (see Papernot & McDaniel, 2018, among others, for a recent attempt to probe this knowledge). Some of the challenges above arise from the reliance of common connectionist approaches on backpropagation as a training method, and in this paper we explore the implications for sequential modeling of well-known alternative perspectives on neural network design. Specifically, we consider replacing backpropagation with a general search, using a genetic algorithm, through a large space of possible networks, with Minimum Description Length (MDL; Rissanen, 1978) as an objective function. In essence, this amounts to minimizing error as usual, while at the same time trying to minimize the size of the network. We find that MDL helps the networks reach adequate levels of generalization from very small corpora, avoiding overfitting and performing significantly better than backpropagation-based alternatives. The MDL search converges on networks that are often small, transparent, and provably correct. We illustrate this across a range of sequential tasks.

2. PREVIOUS WORK

Our work follows several lines of work in the literature. Evolutionary programming has been used to evolve neural networks in a range of studies. Early work that uses genetic algorithms for various aspects of neural network optimization includes Miller et al. (1989), Montana & Davis (1989), Whitley et al. (1990), and Zhang & Mühlenbein (1993; 1995). These works focus on feed-forward architectures, but Angeline et al. (1994) present an evolutionary algorithm that discovers recurrent neural networks and test it on a range of sequential tasks that are very relevant to the goals of the current paper. Evolutionary programming for neural networks remains an active area of research (see Schmidhuber, 2015 and Gaier & Ha, 2019, among others, for relevant references). In terms of objective function, Zhang & Mühlenbein (1993; 1995) use a simplicity metric that is essentially the same as the MDL metric that we use (and describe below). Schmidhuber (1997) presents an algorithm for discovering networks that optimize a simplicity metric that is closely related to MDL. Simplicity criteria have been used in a range of works on neural networks, including recent contributions (e.g., Ahmadizar et al., 2015 and Gaier & Ha, 2019). Our paper connects also with the literature on using recurrent neural networks for grammar induction and on the interpretation of such networks in terms of symbolic knowledge (often formal-language theoretic objects). These challenges were already taken up by early work on recurrent neural networks (see Giles et al., 1990 and Elman, 1990, among others), and they remain the focus of recent work (see, e.g., Wang et al., 2018 and Weiss et al., 2018). See Jacobsson (2005) and Wang et al. (2018) for discussion and further references. In a continuation of these efforts, our contribution is twofold.
First, we put together a minimal, out-of-the-box combination of the core of these ideas: we evaluate the performance that can be achieved by a learner that seeks to optimize an MDL measure. The search and optimization itself is done through a standard genetic algorithm. From there, we benchmark the performance obtained through MDL optimization against the performance obtained by three classic RNN architectures of different sizes. Second, we show the benefit of optimizing networks not only for performance but also for their own architecture size, in that it makes the black box much more permeable; for the current tasks, we are able to provide full proofs of accuracy (above and beyond a test set).
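The search loop can be sketched as a standard genetic algorithm whose fitness is the (negated) MDL score. The sketch below is purely illustrative: `random_network`, `mutate`, `crossover`, and `mdl_score` are hypothetical placeholders for the representations and encoding described in the next sections, and the use of tournament selection is an assumption, not the paper's exact search procedure.

```python
import random

def evolve(random_network, mutate, crossover, mdl_score,
           population_size=100, generations=50, tournament=3):
    """Minimize mdl_score over networks via a simple genetic algorithm."""
    population = [random_network() for _ in range(population_size)]
    for _ in range(generations):
        new_population = []
        for _ in range(population_size):
            # Tournament selection: from a random sample, keep the
            # candidate with the lowest MDL score.
            parents = [min(random.sample(population, tournament), key=mdl_score)
                       for _ in range(2)]
            # Recombine the two parents, then apply a random mutation.
            child = mutate(crossover(parents[0], parents[1]))
            new_population.append(child)
        population = new_population
    return min(population, key=mdl_score)
```

The same loop works for any genome representation, as long as `mutate` and `crossover` are closed over the space of valid networks.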

3.1. MDL

Consider a hypothesis space G of possible grammars and a corpus of input data D. In our case, G is the set of all possible network architectures expressible using our representations, and D is a set of input sequences. For a given G ∈ G we may consider the ways in which we can encode the data D given G. The MDL principle (Rissanen, 1978), a computable approximation of Kolmogorov Complexity (Solomonoff, 1964; Kolmogorov, 1965; Chaitin, 1966), aims at the G that minimizes |G| + |D : G|, where |G| is the size of G and |D : G| is the length of the shortest encoding of D given G (with both components typically measured in bits). Minimizing |G| favors small, general grammars that often fit the data poorly. Minimizing |D : G| favors large, overly specific grammars that overfit the data. By minimizing the sum, MDL aims at an intermediate level of generalization: reasonably small grammars that fit the data reasonably well. MDL - and the closely related Bayesian approach to induction - has been used in a wide range of models of linguistic phenomena, in which one is often required to generalize from very limited data (see Horning, 1969, Berwick, 1982, Stolcke, 1994, Grünwald, 1996, and de Marcken, 1996, among others). The term |D : G| corresponds to the surprisal of the data D according to the probability distribution defined by G and is closely related to the cross-entropy between the distribution defined by G and the true distribution that generated D. The term |G| depends on an encoding scheme for a network. We provide the details of such an encoding scheme in Appendix A and now turn to describe the space of networks under consideration.
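As a toy illustration of the two-part score (not the encoding scheme of Appendix A), the sketch below computes |G| + |D : G| for a "grammar" given simply as a probability distribution over symbols, with |D : G| taken as the surprisal of the data under that distribution. The grammar sizes in bits are made-up values chosen only to show the trade-off.

```python
import math

def mdl_score(grammar_bits, probs, data):
    """Two-part MDL score: |G| plus the surprisal -sum(log2 p(x)) of the data."""
    data_bits = -sum(math.log2(probs[symbol]) for symbol in data)
    return grammar_bits + data_bits

# A uniform "grammar" over two symbols costs 1 bit per data symbol:
uniform = {'a': 0.5, 'b': 0.5}
# A skewed grammar encodes 'a' cheaply but 'b' expensively; we assume
# it needs more bits to describe (a stand-in for a larger network):
skewed = {'a': 0.9, 'b': 0.1}
```

On a short mixed corpus the simpler uniform grammar can win, while on a long corpus dominated by 'a' the larger skewed grammar pays for itself through cheaper data encoding; this is exactly the balance the sum |G| + |D : G| arbitrates.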

3.2. REPRESENTATIONS

A network is represented as a directed graph containing nodes, weighted edges, and an activation function for each node. Since we do not use backpropagation to train the network, the set of possible networks is larger here than what is usually allowed; for example, output units can have outgoing edges that feed hidden units, and input units can feed into other input units, thus saving intermediate hidden units. Beyond this topological flexibility, the activation functions also allow for more diversity in the possible networks: they can vary freely from one unit to the next, and they can be chosen from any set of possible activation functions, including non-differentiable ones.
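A minimal sketch of such a representation is given below: a directed graph with weighted edges and a per-unit activation function, including a non-differentiable step function. The unit names and the evaluation scheme are illustrative assumptions, not the paper's exact encoding; for clarity the sketch evaluates a single pass in a fixed order, whereas recurrent edges would reuse the previous time step's values.

```python
import math

# Non-differentiable activations are allowed, since no gradients are taken.
step = lambda x: 1.0 if x > 0 else 0.0

# Each unit stores its activation function and its weighted incoming edges.
network = {
    'in':  {'activation': lambda x: x, 'incoming': []},
    'h1':  {'activation': math.tanh,   'incoming': [('in', 0.5)]},
    'out': {'activation': step,        'incoming': [('h1', 2.0), ('in', -0.5)]},
}

def forward(network, inputs, order=('in', 'h1', 'out')):
    """Evaluate one pass over the graph; input units keep their given values."""
    values = dict(inputs)
    for unit in order:
        if unit in values:
            continue
        total = sum(values[src] * w for src, w in network[unit]['incoming'])
        values[unit] = network[unit]['activation'](total)
    return values
```

Because the genome is just this graph, mutations can rewire edges, perturb weights, or swap a unit's activation function without any differentiability constraint.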

