ARCHITECTURE AGNOSTIC NEURAL NETWORKS

Abstract

In this paper, we explore an alternative method for synthesizing neural network architectures, inspired by the brain's stochastic synaptic pruning. During a person's lifetime, numerous distinct neuronal architectures are responsible for performing the same tasks. This indicates that biological neural networks are, to some degree, architecture agnostic. Artificial networks, by contrast, rely on fine-tuned weights and hand-crafted architectures for their remarkable performance. This contrast raises the question: can we build artificial architecture agnostic neural networks? To ground this study we use sparse, binary neural networks that parallel the brain's circuits. Within this sparse, binary paradigm we sample many binary architectures to create families of architecture agnostic neural networks that are not trained via backpropagation. These high-performing network families share the same sparsity and distribution of binary weights, and succeed in both static and dynamic tasks. In summary, we create an architecture manifold search procedure to discover families of architecture agnostic neural networks.

1. INTRODUCTION

Fascinated by the developmental algorithms and stochasticity inherent in developmental synaptic pruning, in this paper we explore architecture agnostic neural networks (AANNs) through the lens of binary, sparse networks. We ground our study in sparse binary neural networks because these networks capture many of the most salient aspects of biological networks:

• distinct neuronal units implementing non-linear functions that constrain their outputs to [-1, +1]
• synaptic connections restricted to {-1, +1}
• inhibitory and excitatory connections represented by -1 and +1, respectively

In this paper we demonstrate that (i) AANNs exist in silico, (ii) high-performance sparse binary neural networks exist for static (MNIST classification) and dynamic (imitation learning on car-racing) tasks, and (iii) our stochastic search and succeed (SENSE) algorithm explores the architecture manifold.
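The paper gives no pseudocode for the search procedure in this section, so the following is only a minimal sketch of the sample-and-select idea behind a stochastic architecture search: repeatedly draw sparse binary weight vectors and retain the highest-scoring ones as a candidate family. All names here (`sample_binary_architecture`, `stochastic_search`, `evaluate`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_binary_architecture(n_weights, sparsity, rng):
    """Sample one sparse binary weight vector: a fraction `sparsity` of
    entries are zeroed, the rest drawn uniformly from {-1, +1}."""
    w = rng.choice([-1.0, 1.0], size=n_weights)
    n_zero = int(round(sparsity * n_weights))
    zero_idx = rng.choice(n_weights, size=n_zero, replace=False)
    w[zero_idx] = 0.0
    return w

def stochastic_search(evaluate, n_weights, sparsity,
                      n_samples=100, keep=5, seed=0):
    """Sample many architectures (no backpropagation) and keep the
    `keep` highest-scoring ones -- a candidate network family."""
    rng = np.random.default_rng(seed)
    scored = []
    for _ in range(n_samples):
        w = sample_binary_architecture(n_weights, sparsity, rng)
        scored.append((evaluate(w), w))
    scored.sort(key=lambda s: s[0], reverse=True)  # sort by score only
    return [w for _, w in scored[:keep]]
```

In practice `evaluate` would be task performance (e.g., classification accuracy); a toy scoring function suffices to exercise the loop.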

2. RELATED WORK

Biological neural networks endow organisms with the ability to perform a multitude of tasks, ranging from sensory processing (Glickfeld & Olsen, 2017; Peirce, 2015), to memory storage and retrieval (Tan et al., 2017; Denny et al., 2017), to decision making (Hanks & Summerfield, 2017; Padoa-Schioppa & Conen, 2017). Remarkably, these complex tasks persist throughout our lives despite neuronal pruning and synapse deletion up until adulthood. This partially stochastic process of neuronal refinement is known as developmental synaptic pruning. Developmental synaptic pruning occurs when the physical connection between one neuron's dendrite and another neuron's axon is eliminated (Riccomagno & Kolodkin, 2015), preventing any further relay of information. Interestingly, between infancy and adulthood mammals lose roughly 50% of their neuronal synapses (Chechik et al., 1999). A study in humans estimated that prefrontal cortex dendritic spine density, a proxy for synaptic density, is on average more than two times higher in childhood than in adulthood (Petanjek et al., 2011). This evolved process is also partially stochastic (Vogt, 2015). One of the main manifestations of stochastic developmental variation in the brain occurs at the circuit level (Clarke, 2012), suggesting that many similar neural architectures would have sufficed in place of your brain's current architecture. Given the ubiquity, extent, and stochastic nature of developmental synaptic pruning, there are many theories for why this process exists: to increase information transfer efficiency (Horn et al., 1998), or to derive optimal synaptic architectures (Chechik et al., 1999). Previous work in machine learning has developed several methodologies to search the architecture manifold.
Neural architecture search methods enable traversing the architecture space to discover high-performance networks via neuro-evolution strategies (Stanley & Miikkulainen, 2002; Real et al., 2017; 2018), reinforcement learning (Zoph & Le, 2016), and multi-objective searches (Elsken et al., 2018; Zhou & Diamos, 2018). For example, Gaier & Ha (2019) described an elegant architecture search that de-emphasizes the importance of weights. By utilizing a shared weight parameter, they were able to develop ever-growing networks that acquired skills based on their interactions with the environment. However, given the brain's excitatory and inhibitory connections, there is a rigidity to the weights that biological neural networks actually use. Despite this difference, the principle of minimizing parameter count that Gaier & Ha (2019) championed is productive when conceiving of biologically inspired artificial neural networks. In practice, neural networks tend to be over-parameterized, making them highly energy and memory inefficient. There has been substantial work in machine learning on sparsity and low-precision weights to alleviate these issues. Sparsity can be introduced prior to training, as shown by SqueezeNet (Iandola et al., 2016) and MobileNet (Howard et al., 2017). These networks were carefully engineered to have an order of magnitude fewer parameters than standard architectures while performing image recognition. Sparsity can also be introduced during training, as shown by Louizos et al. (2017) and Srinivas & Babu (2015), who explicitly prune and sparsify networks during training as the dropout probabilities of some weights reach 1. Additionally, sparsity can be added after training is complete.
In this paper, we leverage prior work in neuroscience, architecture search, sparse networks, and binary networks to demonstrate the existence of architecture agnostic neural networks and architecture agnostic neural network families, and the stochastic search and succeed algorithm's ability to navigate the architecture manifold.

3.1. SPARSE BINARIZED NEURAL NETWORKS

Preliminaries. We represent a feed-forward neural network as a function f(x, w) that maps an input vector x ∈ R^k to an output vector f(x, w) = y ∈ R^m. The function f(x, w) is parameterized by a vector of weights, w ∈ R^n, that are typically set during training to solve a specific task. We refer to W = R^n as the weight space (W) of the network. Here, k is the input dimension, m is the output dimension, and n is the total number of parameters in the neural network.

We use two different neural network architectures for the static and dynamic tasks, respectively. For the static task (MNIST classification), the network has 2 convolutional layers (16 filters, 5 x 5), 2 max-pooling layers, and 1 fully-connected layer (1568 x 10). For the dynamic task (imitation learning for car-racing), the network has 2 convolutional layers (32 filters 7 x 7, 64 filters 5 x 5), 2 max-pooling layers, and 2 fully-connected layers (576 x 100, 100 x 3).

Sparse Binarized Neural Network. Throughout this paper, a binarized neural network refers to a network with weights constrained to {-1, 0, +1}. We also constrain the output of every neuronal unit in the network to the range [-1, +1] by applying a binarized activation function. We use a "HardTanh" function defined as follows:

HardTanh(x) = +1 if x > 1; -1 if x < -1; x otherwise.

A p-sparse binary network (w_b), which is a network with p percent sparsity, is defined as follows:
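The HardTanh activation and a {-1, 0, +1}-weighted layer are straightforward to express numerically. The sketch below assumes NumPy; the function names (`hard_tanh`, `sparse_binary_layer`) are illustrative, not from the paper.

```python
import numpy as np

def hard_tanh(x):
    """HardTanh(x) = +1 if x > 1, -1 if x < -1, x otherwise."""
    return np.clip(x, -1.0, 1.0)

def sparse_binary_layer(x, w_b):
    """One fully-connected layer with weights in {-1, 0, +1} and
    outputs constrained to [-1, +1] by the HardTanh activation."""
    return hard_tanh(x @ w_b)
```

Zero entries in w_b correspond to pruned synapses; -1 and +1 entries play the role of inhibitory and excitatory connections.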

