IT'S HARD FOR NEURAL NETWORKS TO LEARN THE GAME OF LIFE

Anonymous

Abstract

Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods rather than on weight initializations. Recent findings, however, suggest that neural networks rely on lucky random initial weights of subnetworks called "lottery tickets" that converge quickly to a solution (Frankle & Carbin, 2018). To investigate how weight initializations affect performance, we examine small convolutional networks that are trained to predict n steps of the two-dimensional cellular automaton Conway's Game of Life, the update rules of which can be implemented efficiently in a small CNN. We find that networks of this architecture trained on this task rarely converge; instead, networks require substantially more parameters to converge consistently. Furthermore, we find that the initial parameters from which gradient descent converges to a solution are sensitive to small perturbations, such as a single sign change. Finally, we observe a critical value d_0 such that training minimal networks with examples in which cells are alive with probability d_0 dramatically increases the chance of convergence to a solution. Our results are consistent with the lottery ticket hypothesis (Frankle & Carbin, 2018).

1. INTRODUCTION

Recent findings suggest that neural networks can be "pruned" by 90% or more, eliminating unnecessary weights while maintaining performance similar to that of the original network. Similarly, the lottery ticket hypothesis (Frankle & Carbin, 2018) proposes that neural networks contain subnetworks, called winning tickets, that can be trained in isolation to reach the performance of the original network. These results suggest that neural networks may rely on such lucky initializations to learn a good solution. Rather than extensively exploring weight space, networks trained with gradient-based optimizers may converge quickly to local minima near the initialization, many of which are poor estimators of the dataset distribution. If some subset of the weights must be in a winning configuration for a neural network to learn a good solution to a problem, then randomly initialized networks must be significantly larger than the minimal network that would solve the problem in order to maximize the chance of containing a winning initialization. Furthermore, small networks with winning initial configurations may be sensitive to small perturbations.

Similarly, gradient-based optimizers estimate the gradient of the loss function with respect to the weights by averaging the gradient over a few elements of the dataset. A biased training dataset may therefore bias the gradient in a way that is detrimental to the success of the network. Here we examine how the distribution of the training dataset affects the network's ability to learn.

In this paper, we explore how effectively small neural networks learn to take as input a configuration of Conway's Game of Life (Life) and then output the configuration n steps in the future. Since this task can be implemented minimally in a convolutional neural network with 2n + 1 layers and 23n + 2 trainable parameters, a neural network with identical architecture should, in principle, be able to learn a similar solution.
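To make the learning target concrete, a single Life update can be phrased as a 3x3 convolution (a neighbor count with a kernel of ones and a zero center) followed by a cell-wise nonlinearity, which is exactly the kind of computation a small CNN must discover. The sketch below illustrates this structure in NumPy; the function name `life_step`, the blinker example, and the dead-boundary convention are ours for illustration and are not taken from the paper.

```python
import numpy as np

def life_step(board):
    """One Game of Life update: a 3x3 neighbor-count convolution
    followed by a cell-wise rule. board is a 2D 0/1 array with
    dead cells assumed beyond the boundary."""
    padded = np.pad(board, 1)
    h, w = board.shape
    # Neighbor count = convolution with a 3x3 kernel of ones, center 0,
    # implemented here as a sum of eight shifted views of the board.
    neighbors = sum(
        padded[1 + di : 1 + di + h, 1 + dj : 1 + dj + w]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    # Life rule: alive next step iff exactly 3 neighbors,
    # or alive now with exactly 2 neighbors.
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(board.dtype)

# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1  # horizontal bar of three live cells
```

Because the rule is a thresholded function of one linear neighborhood sum, it can be expressed with two small convolutional layers per step, which is where the 2n + 1 layer count for n steps comes from.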
Nonetheless, we find that networks of this architecture rarely find solutions. We show that the number of weights necessary for networks to reliably converge on a solution increases quickly with n. Additionally, we show that the probability of convergence is highly sensitive to small perturbations of initial weights. Finally, we explore properties of the training data that significantly increase the probability that a network will converge to a correct solution. While Life is a toy problem, we believe that these studies give insight into more general issues with training

