NOISE TRANSFORMS FEED-FORWARD NETWORKS INTO SPARSE CODING NETWORKS

Abstract

A hallmark of biological neural networks, which distinguishes them from their artificial counterparts, is the high degree of sparsity in their activations. Here, we show that by simply injecting symmetric, zero-centered random noise during training on reconstruction or classification tasks, artificial neural networks with ReLU activation functions eliminate this difference: the neurons converge to a sparse coding solution in which only a small fraction are active for any input. The resulting network learns receptive fields like those of primary visual cortex and remains sparse even when the noise is removed in later stages of learning.

1. INTRODUCTION

The brain is highly sparse, with an estimated 15% of neurons firing at any given time (Attwell & Laughlin, 2001). The most immediate explanation is metabolic efficiency: action potentials consume ∼20% of the brain's energy (Sterling & Laughlin, 2015; Attwell & Laughlin, 2001; Sengupta et al., 2010). However, sparsity confers further advantages (Olshausen & Field, 2004). One significant advantage is improving the signal-to-noise ratio (SNR) of neural signals. Sparsity improves SNR by (i) turning off weakly firing neurons activated by noise and (ii) increasing the separability of data points (Ahmad & Scheinkman, 2019; Xie et al., 2022). Inhibitory interneurons that suppress all but the most active neurons from firing are an important mechanism for enforcing this sparsity (Haider et al., 2010). Theoretical and empirical results support their involvement in both silencing noise and separating neural representations; examples include horizontal interneurons in the retina (Sterling & Laughlin, 2015) and Golgi interneurons in cerebellar-like structures (Fleming et al., 2022; Lin et al., 2014; Xie et al., 2022). Biologically, as depicted in Fig. 1, these inhibitory interneurons implement a negative feedback loop: the more active the excitatory neurons are, the more active the interneuron becomes, and hence the more it inhibits the excitatory neurons. Simplified models of this circuit, written as ordinary differential equations (ODEs), show convergence to a state in which approximately the k most active neurons remain on (Gozel & Gerstner, 2021). We refer to this as a Top-K activation function (also known as k-winners-take-all). There is empirical support for a number of interneuron circuits approximating the Top-K operation (Sterling & Laughlin, 2015; Fleming et al., 2022; Lin et al., 2014).
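The Top-K operation described above can be sketched in a few lines. This is a minimal NumPy illustration, not the ODE circuit model itself; the function name and example values are our own:

```python
import numpy as np

def top_k_activation(x, k):
    """Zero out all but the k largest pre-activations (k-winners-take-all).

    A simplified stand-in for the inhibitory feedback circuit: the
    effective threshold rises with overall activity, so only the k most
    active units survive (ties may keep slightly more than k on).
    """
    x = np.asarray(x, dtype=float)
    if k >= x.size:
        return np.maximum(x, 0.0)
    threshold = np.partition(x, -k)[-k]  # value of the k-th largest entry
    return np.maximum(np.where(x >= threshold, x, 0.0), 0.0)

acts = np.array([0.1, 2.0, -0.5, 1.5, 0.3])
sparse = top_k_activation(acts, k=2)  # -> [0.0, 2.0, 0.0, 1.5, 0.0]
```

Note that, unlike ReLU, the number of surviving units here is fixed at k regardless of the input's overall magnitude, mirroring the negative feedback loop's self-adjusting threshold.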
By contrast, in the field of deep learning, while inhibition is possible, analogous interneuron circuits that enforce sparsity across a layer have not been widely adopted. The only truly sparse activation function is the ReLU (Glorot et al., 2011). Moreover, mechanisms that enforce sparse neuronal activity are rarely used, and when given the choice, networks prefer to be dense. This is because sparsity can limit model capacity, resulting in information bottlenecks that harm performance (Goodfellow et al., 2015). Here, we find that by simply introducing isotropic, symmetric noise centered about zero during training, a layer of artificial neurons converges to a sparse coding solution. This solution mimics a simplified version of the biological inhibitory interneuron circuit. Letting the network gradually implement this inhibitory interneuron also yields better performance than explicitly enforcing inhibition from the start of training. Concretely, the network synchronizes every neuron's bias term, setting them all to approximately the same negative value, and every neuron's weight vector, setting them all to the same L2 norm. As a result, every neuron's activity falls within a particular range that, combined with the shared negative bias, leaves only a sparse set of approximately k neurons active. Because we assume that k is small, Top-K networks are a special class of sparse coding networks that always have approximately k neurons on for any input; they implement this sparsity by approximating the functionality of an inhibitory interneuron. The sparsity that comes from inhibition, which allows only a small subset of the neurons that would otherwise fire to do so, is distinct from sparsity where many of the neurons in the network are dead and never fire for any input.
This latter form of sparsity is misleading: it is equivalent to a pruned, smaller network that fires densely, and we indicate when it occurs. The degree to which our network becomes a sparse coding, Top-K network is proportional to the amount of noise applied, up to a noise limit. To validate that the network approximates an inhibitory interneuron, we replace each neuron's bias term with a single, shared bias term. This results in identical model performance and very similar levels of sparsity. Further investigating the degree to which this network uses sparse coding, we find that it learns receptive fields similar to those of mammalian V1, with Gabor filters, and of retinal ganglion cells, with on/off center-surround structure (Sterling & Laughlin, 2015; Olshausen & Field, 1997). We find this Top-K network formation is particularly evident for reconstruction tasks but also present for classification tasks and the intermediate MLP layers of a Transformer architecture. Our results hold across a variety of datasets, noise distributions, and numbers of neurons. We also observe that the network retains its Top-K approximation even after the noise is removed, making noise injection an effective pre-training task for sparsifying activations. This increase in sparsity could reduce FLOPs on hardware that can exploit it (Wang, 2020; Gale et al., 2020; Davies et al., 2018). We first review related work (Section 2) before outlining our experimental setup and presenting empirical observations (Section 3). Finally, we analyze the network's learning dynamics, providing intuition for our results and highlighting avenues for future work (Section 4).

Figure 1: An inhibitory interneuron circuit implementing a negative feedback loop to silence all but the k most active neurons.
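The mechanism described in the introduction, equalized weight norms combined with a shared negative bias, can be illustrated numerically. In this toy sketch (the layer sizes and bias value are our own choices, not fitted quantities from the paper), random unit-norm weight rows plus one shared negative bias already leave only a small fraction of ReLU units active:

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, dim = 256, 64                         # sizes chosen for illustration only
W = rng.normal(size=(n_neurons, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # equalize every neuron's weight norm
b = -0.25                                        # one shared negative bias (hypothetical value)

x = rng.normal(size=dim)
x /= np.linalg.norm(x)

activations = np.maximum(W @ x + b, 0.0)         # ReLU with shared "inhibitory" bias
active = int(np.count_nonzero(activations))
print(f"{active}/{n_neurons} neurons active")    # only a small fraction fire
```

With unit-norm rows, each pre-activation is a dot product of two unit vectors, concentrated near zero in high dimensions, so a fixed negative bias thresholds out all but the few neurons best aligned with the input: a crude, static approximation of the Top-K behavior the trained network converges to.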

2. RELATED WORK

Injecting noise into training data for statistical models was first proposed by Sietsma & Dow (1991). Gaussian noise injection has been interpreted as a form of model regularization that helps avoid overfitting (Zur et al., 2009) and improves generalization (Sietsma & Dow, 1991; Matsuoka, 1992). It was later shown that noise injection is in fact equivalent to adding a regularization term that minimizes the L2 norm of the network's Jacobian (Bishop, 1995; Rifai et al., 2011; Alain & Bengio, 2014). From this perspective, it makes sense that sparsity can be used to turn off neurons and thereby shrink this Jacobian, providing a potential explanation for our results. Training with noise became particularly prominent in the form of de-noising autoencoders (Goodfellow et al., 2015). Before deep neural networks with many layers could be trained end-to-end, it was popular to train each layer one at a time with an unsupervised, de-noising reconstruction loss (Goodfellow et al., 2015). With small amounts of noise, these de-noising autoencoders were shown to be equivalent to contractive autoencoders regularized to be robust to local perturbations of the training data (Alain & Bengio, 2014). Approaches to introducing activation sparsity have included explicitly using Top-K activation functions (Ranzato et al., 2007; Makhzani & Frey, 2014; Ahmad & Scheinkman, 2019), novel regularization terms (Kurtz et al., 2020; Yang et al., 2020), and other approaches (Schwarz et al., 2021; Molchanov et al., 2017). However, we are unaware of any existing work which finds that noisy
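The de-noising objective discussed above can be sketched in a minimal single-layer form: corrupt the input with symmetric zero-mean noise, encode with a ReLU layer, decode with tied weights, and regress against the clean input. This is an illustrative toy, not the paper's setup; the layer sizes, noise scale, learning rate, and tied-weight choice are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

dim, hidden = 16, 32
W = rng.normal(scale=0.1, size=(hidden, dim))  # encoder weights (decoder is W.T)
b = np.zeros(hidden)
lr, sigma = 0.01, 0.3                          # learning rate and noise std (assumed)

def denoise_step(x):
    """One SGD step on ||W.T relu(W(x + n) + b) - x||^2 with tied weights."""
    global W, b
    noisy = x + rng.normal(scale=sigma, size=x.shape)  # symmetric zero-mean noise
    pre = W @ noisy + b
    h = relu(pre)
    x_hat = W.T @ h                      # reconstruct the *clean* input
    err = x_hat - x
    grad_h = (W @ err) * (pre > 0)       # backprop through decoder, then ReLU mask
    W -= lr * (np.outer(grad_h, noisy) + np.outer(h, err))  # encoder + decoder terms
    b -= lr * grad_h
    return float(np.mean(err ** 2))

x = rng.normal(size=dim)
losses = [denoise_step(x) for _ in range(500)]
```

Because the target is the clean input while the encoder sees the corrupted one, minimizing this loss pushes the layer to be insensitive to the injected perturbations, the Jacobian-shrinking effect noted above.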

