IT'S HARD FOR NEURAL NETWORKS TO LEARN THE GAME OF LIFE

Anonymous

Abstract

Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods rather than on weight initializations. Recent findings, however, suggest that neural networks rely on lucky random initial weights of subnetworks called "lottery tickets" that converge quickly to a solution (Frankle & Carbin, 2018). To investigate how weight initializations affect performance, we examine small convolutional networks that are trained to predict n steps of the two-dimensional cellular automaton Conway's Game of Life, the update rules of which can be implemented efficiently in a small CNN. We find that networks of this architecture trained on this task rarely converge. Rather, networks require substantially more parameters to consistently converge. Furthermore, we find that the initialization parameters from which gradient descent converges to a solution are sensitive to small perturbations, such as a single sign change. Finally, we observe a critical value d_0 such that training minimal networks on examples in which cells are alive with probability d_0 dramatically increases the chance of convergence to a solution. Our results are consistent with the lottery ticket hypothesis (Frankle & Carbin, 2018).

1. INTRODUCTION

Recent findings suggest that neural networks can be "pruned" by 90% or more, eliminating unnecessary weights while maintaining performance similar to that of the original network. Similarly, the lottery ticket hypothesis (Frankle & Carbin, 2018) proposes that neural networks contain subnetworks, called winning tickets, that can be trained in isolation to reach the performance of the original. These results suggest that neural networks may rely on such lucky initializations to learn a good solution. Rather than extensively exploring weight space, networks trained with gradient-based optimizers may converge quickly to local minima near the initialization, many of which are poor estimators of the dataset distribution. If some subset of the weights must be in a winning configuration for a neural network to learn a good solution to a problem, then networks initialized with random weights must be significantly larger than the minimal network that could solve the problem, in order to maximize the chance of having a winning initialization. Furthermore, small networks with winning initial configurations may be sensitive to small perturbations.

Gradient-based optimizers also estimate the gradient of the loss function with respect to the weights by averaging the gradient over a few elements of the dataset. A biased training dataset may therefore bias the gradient in a way that is detrimental to the success of the network. Here we examine how the distribution of the training dataset affects the network's ability to learn.

In this paper, we explore how effectively small neural networks learn to take as input a configuration of Conway's Game of Life (Life) and output the configuration n steps in the future. Since this task can be implemented minimally in a convolutional neural network with 2n + 1 layers and 23n + 2 trainable parameters, a neural network with identical architecture should, in principle, be able to learn a similar solution.
Nonetheless, we find that networks of this architecture rarely find solutions. We show that the number of weights necessary for networks to reliably converge on a solution increases quickly with n. Additionally, we show that the probability of convergence is highly sensitive to small perturbations of the initial weights. Finally, we explore properties of the training data that significantly increase the probability that a network will converge to a correct solution. While Life is a toy problem, we believe that these studies give insight into more general issues with training neural networks. In particular, we expect that other neural network architectures and problems exhibit similar issues: networks likely require a large number of parameters to learn any domain, and small networks likely exhibit similar sensitivities to small perturbations of their weights. Furthermore, optimal training datasets may be highly particular to certain parameters. Thus, with the growing interest in efficient neural networks (Han et al., 2015; Hassibi & Stork, 1993; Hinton et al., 2015; LeCun et al., 1990; Li et al., 2016), these results serve as an important step toward developing ideal training conditions.
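To make the implementability claim above concrete, the following NumPy sketch gives one explicit hand construction of a single Life step as a two-layer ReLU convolutional network. This is an illustrative construction chosen for clarity, not the minimal 2n + 1-layer, 23n + 2-parameter architecture counted above; the filter pattern, biases, and combination weights are our own choices, not learned values.

```python
import numpy as np

def life_step_cnn(grid):
    """One exact Life step expressed as a two-layer ReLU CNN.

    Layer 1: seven 3x3 filters sharing one spatial pattern (weight 1 on the
    eight neighbors, weight 9 on the center) and differing only in bias, so
    channel k computes relu(neighbors + 9*center - b_k).  Layer 2: a 1x1
    convolution combining the channels into an exact 0/1 output.
    """
    n, m = grid.shape
    padded = np.pad(grid.astype(float), 1)  # cells outside the grid are dead
    kernel = np.ones((3, 3))
    kernel[1, 1] = 9.0
    # x[i, j] = (live neighbors of cell (i, j)) + 9 * (cell (i, j) itself),
    # computed as an explicit 3x3 convolution over the padded board.
    x = np.zeros((n, m))
    for di in range(3):
        for dj in range(3):
            x += kernel[di, dj] * padded[di:di + n, dj:dj + m]
    # The ReLU channels and 1x1 weights form piecewise-linear "hats" that
    # equal 1 exactly when x == 3 (dead cell, three live neighbors) or
    # x in {11, 12} (live cell, two or three live neighbors), and 0 otherwise.
    biases = np.array([2, 3, 4, 10, 11, 12, 13], dtype=float)
    h = np.maximum(x[..., None] - biases, 0.0)   # layer-1 activations
    w2 = np.array([1, -2, 1, 1, -1, -1, 1], dtype=float)
    return (h @ w2).astype(int)                  # exact 0/1 output
```

The center weight of 9 separates the dead-cell case (x in 0..8) from the live-cell case (x in 9..17), so a single linear readout over ReLU features can encode both halves of the update rule exactly.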

1.1. CONWAY'S GAME OF LIFE

Prior studies have shown interest in applying neural networks to model physical phenomena in applications including weather simulation and fluid dynamics (Baboo & Shereef, 2010; Maqsood et al., 2004; Mohan & Gaitonde, 2018; Shrivastava et al., 2012). Similarly, neural networks have been trained to learn computational tasks, such as adding and multiplying (Kaiser & Sutskever, 2015; Graves et al., 2014; Joulin & Mikolov, 2015; Trask et al., 2018). All of these tasks require neural networks to learn hidden-step processes, in which the network must learn some update rule that can be generalized to perform multi-step computation. Conway's Life is a two-dimensional cellular automaton with a simple local update rule that can produce complex global behavior. In a Life configuration, cells in an n × m grid can be either alive or dead (represented by 1 or 0, respectively). To determine the state of a given cell on the next step, Life considers the 3 × 3 grid of neighbors around the cell. At every step, cells with exactly two alive neighbors will maintain their state, cells with exactly three alive neighbors will become alive, and cells with any other number of alive neighbors will die (Figure 1). We consider a variant of Life in which cells outside of the n × m grid are always considered to be dead. Despite the simplicity of the update rule, Life can produce complex output over time, and thus can serve as an idealized problem for modeling hidden-step behavior.
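The update rule above can be written down directly. The following NumPy sketch computes one step under our variant, in which cells outside the grid are always dead; the neighbor count is a sum of the eight shifted copies of the board, i.e. a 3 × 3 convolution with a zero at the center.

```python
import numpy as np

def life_step(grid):
    """Advance a Life configuration by one step (out-of-bounds cells are dead)."""
    n, m = grid.shape
    padded = np.pad(grid, 1)  # surround the board with dead cells
    # Count live neighbors of every cell by summing the eight shifted boards.
    neighbors = sum(
        padded[1 + di:1 + di + n, 1 + dj:1 + dj + m]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    # Exactly three live neighbors: cell becomes alive; exactly two: cell
    # keeps its current state; any other count: cell dies.
    return ((neighbors == 3) | ((neighbors == 2) & (grid == 1))).astype(grid.dtype)
```

Iterating `life_step` n times produces the n-step target that the networks in this paper are trained to predict from the initial configuration.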

2. RELATED WORK

Convolutional models of cellular automata, including the Game of Life, have been studied by Gilpin (2019), who classifies structural representations of the learned solutions to different cellular automata. Furthermore, Gilpin notes that narrow networks do not often converge, and that for consistent convergence the networks must be sufficiently wide. Our work quantifies this result and, in addition, explores the sensitivity of convergence to perturbations of the initial conditions and to different dataset distributions. Prior research has shown interest in whether neural networks can learn particular tasks. Joulin & Mikolov (2015) argue that certain recurrent neural networks cannot learn addition in a way that generalizes to an arbitrary number of bits. Theoretical work has shown that sufficiently overparameterized neural networks converge to global minima (Oymak & Soltanolkotabi, 2020; Du et al., 2018). Further theoretical work has found methods to minimize local minima (Kawaguchi & Kaelbling, 2019; Nguyen & Hein, 2017; Kawaguchi, 2016). Nye & Saxe (2018) show that minimal networks for the parity function and fast Fourier transform do not converge to a solution unless they are initialized close to one. Increasing the depth and number of parameters of neural networks has been shown to increase the speed at which networks converge and their testing performance (Arora et al., 2018; Park et al., 2019). Similarly, Frankle & Carbin (2018) find that increasing parameter count can increase the chance of convergence to a good solution, and Li et al. (2018) and Neyshabur et al. (2018) find that training near-minimal networks leads to poor performance. Choromanska et al. (2015) provide some theoretical insight into why small networks are more likely to find poor local minima. Weight initialization has also been shown to matter in training deep neural networks.
Glorot & Bengio (2010) find that initial weights should be normalized with respect to the size of each layer. Dauphin & Schoenholz (2019) find that tuning weight norms prior to training can increase training performance. Similarly, Mishkin & Matas (2016) propose a method for finding a good weight initialization for learning. Zhou et al. (2020) find that the sign of initial weights can determine if a particular subnetwork will converge to a good solution.

