SLOT MACHINES: DISCOVERING WINNING COMBINATIONS OF RANDOM WEIGHTS IN NEURAL NETWORKS

Abstract

In contrast to traditional weight optimization in a continuous space, we demonstrate the existence of effective random networks whose weights are never updated. By selecting a weight among a fixed set of random values for each individual connection, our method uncovers combinations of random weights that match the performance of traditionally-trained networks of the same capacity. We refer to our networks as "slot machines" where each reel (connection) contains a fixed set of symbols (random values). Our backpropagation algorithm "spins" the reels to seek "winning" combinations, i.e., selections of random weight values that minimize the given loss. Quite surprisingly, we find that allocating just a few random values to each connection (e.g., 8 values per connection) yields highly competitive combinations despite being dramatically more constrained compared to traditionally learned weights. Moreover, finetuning these combinations often improves performance over the trained baselines. A randomly initialized VGG-19 with 8 values per connection contains a combination that achieves 90% test accuracy on CIFAR-10. Our method also achieves an impressive performance of 98.1% on MNIST for neural networks containing only random weights.

1. INTRODUCTION

Innovations in how deep networks are trained have played an important role in the remarkable success deep learning has produced in a variety of application areas, including image recognition (He et al., 2016), object detection (Ren et al., 2015; He et al., 2017), machine translation (Vaswani et al., 2017) and language modeling (Brown et al., 2020). Learning typically involves either optimizing a network from scratch (Krizhevsky et al., 2012), finetuning a pre-trained model (Yosinski et al., 2014) or jointly optimizing the architecture and weights (Zoph & Le, 2017). Against this predominant background, we pose the following question: can a network instantiated with only random weights achieve competitive results compared to the same model using optimized weights? For a given task, an untrained, randomly initialized network is unlikely to produce good performance. However, we demonstrate that given sufficient random weight options for each connection, there exist selections of these random weight values that have generalization performance comparable to that of a traditionally-trained network with the same architecture. More importantly, we introduce a method that can find these high-performing randomly weighted configurations consistently and efficiently. Furthermore, we show empirically that a small number of random weight options (e.g., 2-8 values per connection) is sufficient to obtain accuracy comparable to that of the traditionally-trained network. Instead of updating the weights, the algorithm simply selects for each connection a weight value from a fixed set of random weights.

We use the analogy of "slot machines" to describe how our method operates. Each reel in a Slot Machine has a fixed set of symbols. The reels are jointly spun in an attempt to find winning combinations. In our context, each connection has a fixed set of random weight values.
Our algorithm "spins the reels" in order to find a winning combination of symbols, i.e., it selects a weight value for each connection so as to produce an instantiation of the network that yields strong performance. While in physical Slot Machines the spinning of the reels is governed by a fully random process, in our Slot Machines the selection of the weights is guided by a method that optimizes the given loss at each spinning iteration. More formally, we allocate K fixed random weight values to each connection. Our algorithm assigns a quality score to each of these K possible values. In the forward pass, a weight value is selected for each connection based on the scores. The scores are then updated in the backward pass via stochastic gradient descent. However, the weights themselves are never changed. By evaluating different combinations of fixed randomly generated values, this extremely simple procedure finds weight configurations that yield high accuracy.

We demonstrate the efficacy of our algorithm through experiments on MNIST and CIFAR-10. On MNIST, our randomly weighted Lenet-300-100 (Lecun et al., 1998) obtains a 97.0% test set accuracy when using K = 2 options per connection and 98.1% with K = 128. On CIFAR-10 (Krizhevsky, 2009), our six-layer convolutional network matches the test set performance of the traditionally-trained network when selecting from K = 64 fixed random values at each connection. Finetuning the models obtained by our procedure generally boosts performance over networks with optimized weights, albeit at an additional compute cost (see Figure 4). Also, compared to traditional networks, our networks are less memory efficient due to the inclusion of scores. That said, our work casts light on some intriguing phenomena about neural networks:

• First, our results suggest that selecting from multiple random weights per connection can match traditional training by continuous weight optimization. This underscores the effectiveness of strong initializations.

• Second, this paper further highlights the enormous expressive capacity of neural networks. Maennel et al. (2020) show that contemporary neural networks are so powerful that they can memorize randomly generated labels. This work builds on that revelation and demonstrates that current networks can model challenging non-linear mappings extremely well even by simple selection from random weights.

• This work also connects to recent observations (Malach et al., 2020; Frankle & Carbin, 2018) suggesting that strong performance can be obtained by utilizing gradient descent to uncover effective subnetworks.

• Finally, we are hopeful that our novel model, consisting in the introduction of multiple weight options for each edge, will inspire other initialization and optimization strategies.

2. RELATED WORK

Supermasks and the Strong Lottery Ticket Conjecture. The lottery ticket hypothesis was articulated in (Frankle & Carbin, 2018) and states that a randomly initialized neural network contains sparse subnetworks which, when trained in isolation from scratch, can achieve accuracy similar to that of the trained dense network. Inspired by this result, Zhou et al. (2019) present a method for identifying subnetworks of randomly initialized neural networks that achieve better than chance performance without training. These subnetworks (named "supermasks") are found by assigning a probability value to each connection. These probabilities are used to sample the connections to use and are updated via stochastic gradient descent. Without ever modifying the weights, Zhou et al. (2019) find subnetworks that perform impressively across multiple datasets. These empirical results, as well as recent theoretical ones (Malach et al., 2020; Pensia et al., 2020), suggest that pruning a randomly initialized network is just as good as optimizing the weights, provided a good pruning mechanism is used. Our work corroborates this intriguing phenomenon but differs from these prior methods in a significant aspect: we eliminate pruning completely and instead introduce multiple weight values per connection. Thus, rather than selecting connections to define a subnetwork, our method selects weights for all connections in a network of fixed structure.
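The supermask mechanism, i.e., learning a keep-probability per connection while the random weights stay frozen, can be illustrated with a toy numpy sketch. This is our own simplified rendering of the idea rather than the implementation of Zhou et al. (2019): a single linear layer, a sigmoid over per-connection logits, Bernoulli sampling of the mask, and a straight-through update of the logits; the toy task, where the target is itself a masked subnetwork of the frozen weights, is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy layer: frozen random weights W; only a keep-probability per connection is learned.
D, C = 10, 2
W = rng.normal(0.0, 1.0 / np.sqrt(D), size=(D, C))   # fixed random weights
logits = np.zeros((D, C))                            # logits of the keep probabilities

def forward(x, sample=True):
    p = 1.0 / (1.0 + np.exp(-logits))                # keep probability per connection
    mask = (rng.random(p.shape) < p) if sample else (p > 0.5)
    return x @ (W * mask)

def step(x, y, lr=1.0):
    """Update the keep probabilities (never the weights) with a straight-through gradient."""
    global logits
    out = forward(x)
    grad_out = 2.0 * (out - y) / len(x)              # dL/d(out) for mean squared error
    grad_masked_w = np.einsum("bc,bd->dc", grad_out, x)   # dL/d(W * mask)
    p = 1.0 / (1.0 + np.exp(-logits))
    # Straight-through: pass the gradient through the Bernoulli sample to the logits.
    logits -= lr * grad_masked_w * W * p * (1.0 - p)

# Target is a genuine subnetwork of W, so a good mask can recover it exactly.
true_mask = rng.random((D, C)) < 0.5
x = rng.normal(size=(256, D))
y = x @ (W * true_mask)

before = float(((forward(x, sample=False) - y) ** 2).mean())
for _ in range(500):
    step(x, y)
after = float(((forward(x, sample=False) - y) ** 2).mean())
print(f"deterministic loss {before:.4f} -> {after:.4f}")
```

In contrast, the slot-machine formulation keeps every connection active and chooses among K candidate values per connection instead of choosing which connections to keep.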

