Surrogate Gradient Design for LIF networks

Abstract

Spiking Neuromorphic Computing uses binary activity to improve the energy efficiency of Artificial Intelligence. However, the non-smoothness of binary activity requires approximate gradients, known as Surrogate Gradients (SG), to close the performance gap with Deep Learning. Several SG have been proposed in the literature, but it remains unclear how to determine the best SG for a given task and network. Good performance can be achieved with most SG shapes, but only after a costly hyper-parameter search. We therefore aim to define, experimentally and theoretically, the best SG across different stress tests, to reduce the future need for grid search. We first show that more complex tasks and networks require a more careful choice of SG, and that, overall, the derivative of the fast sigmoid outperforms other SG across tasks and networks, for a wide range of learning rates. Second, we focus on the Leaky Integrate-and-Fire (LIF) spiking neuron model, and we note that high initial firing rates, combined with a sparsity-encouraging loss term, can lead to better generalization, depending on the SG shape. Finally, we provide a theoretical solution, inspired by the Glorot and He initializations, to find a SG and an initialization that experimentally result in improved accuracy. We show how it can be used to reduce the need for an extensive grid search over dampening, sharpness and tail-fatness.

1. Introduction

Spiking Neuromorphic Computing uses binary and sparse signals to construct learning algorithms with higher energy efficiency (Henderson et al., 2020; Blouw et al., 2019; Davies et al., 2021; Lapique, 1907; Izhikevich, 2003). However, a binary signal means that the true derivative is zero almost everywhere, so training with gradient descent will be poor at best. Research has shown that designing an approximate gradient, referred to as a Surrogate Gradient (SG) (Esser et al., 2016; Zenke and Ganguli, 2018; Bellec et al., 2018), significantly improves training success. However, this entails an additional hyper-parameter to choose: which SG to use. Additionally, the best SG can depend on the neural architecture, the task, the learning rate, the initialization, and so on, making it difficult to know a priori which to pick. Thus, finding the best SG for a particular setting requires a time-consuming grid search, and reducing that search time is desirable. To meet that need, we stress test a wide variety of SG, focusing on one specific neuron model, the Leaky Integrate-and-Fire (LIF) (Lapique, 1907; Gerstner et al., 2014), and provide a mathematical solution based on gradient stability methods (Glorot and Bengio, 2010; He et al., 2015) to design the best SG for a LIF. In contrast, it is standard practice to pick one SG for all experiments (Bohte, 2011; Hubara et al., 2016a; Bellec et al., 2018; Zenke and Ganguli, 2018; Zenke and Vogels, 2021; Yin et al., 2021), possibly exploring the effect of changing a width factor (sharpness) (Zenke and Ganguli, 2018) or a height factor (dampening) (Bellec et al., 2018). Only in the past five years has the possibility of choosing an optimal SG been considered (Neftci et al., 2019; Zenke and Vogels, 2021). Moreover, even if many SG can achieve good performance (Zenke and Vogels, 2021), some shapes are more likely to fail training or to achieve lower accuracy.
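To make the SG idea concrete, the sketch below shows the two quantities involved: the exact binary spike used in the forward pass, and a smooth pseudo-derivative used in the backward pass. The fast-sigmoid surrogate follows the SuperSpike form of Zenke and Ganguli (2018); the `dampening` and `sharpness` parameter names and default values here are illustrative, not taken from the paper.

```python
import numpy as np

def spike_forward(v, threshold=1.0):
    """Forward pass: exact binary spike, Heaviside of the membrane potential."""
    return (v >= threshold).astype(float)

def fast_sigmoid_sg(v, threshold=1.0, dampening=0.3, sharpness=10.0):
    """Backward pass: derivative of the fast sigmoid x/(1+|x|), which is
    1/(1+|x|)^2, centered at the threshold, scaled by a dampening (height)
    and a sharpness (width) factor."""
    x = sharpness * (v - threshold)
    return dampening / (1.0 + np.abs(x)) ** 2

v = np.array([0.2, 0.9, 1.0, 1.5])   # example membrane potentials
spikes = spike_forward(v)            # binary activity: 0/1
grad = fast_sigmoid_sg(v)            # smooth surrogate, peaks at threshold
```

The true derivative of `spike_forward` is zero everywhere except at the threshold, which is why `grad` replaces it during backpropagation.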
It therefore seems valuable to have a complete picture of when and where each SG works, and which ones are better left behind. For example, on more complex neural models and tasks, we measure an increase in sensitivity to the choice of SG with complexity, and observe that some degrade more gracefully, which stresses the need to pick the right SG in each setting. We then focus on arguably the simplest spiking neuron model, the LIF, and confirm that the initialization scheme has a different impact on each SG. Finally, to be able to propose our theoretical solution, we need to justify the use of high firing rates. Fortunately, we observe that low initial sparsity can help generalization with high final sparsity. We use this observation to justify that setting the network to a high firing rate at the beginning of training is not in conflict with a low firing rate at the end of training. Taking this finding into consideration, and in the spirit of the Glorot and He initializations (Glorot and Bengio, 2010; He et al., 2015), we propose four conditions that keep the representations and gradients stable over time. We show that these conditions provide hyper-parameters that result in improved performance without additional hyper-parameter grid search. When we look closely at the fine details of the SG shape, such as (1) its dampening, (2) its sharpness, and (3) how fast it decays to zero, i.e. its tail-fatness, we see that the theoretically justified choice tends to be close to the best experimental choice. Our contributions are therefore:
• We show how task and network complexity lead learning to be more sensitive to the choice of SG;
• We observe that the derivative of the fast sigmoid outperforms other SG across tasks and networks;
• We show that a high initial firing rate can promote generalization with a low final firing rate;
• We provide a theoretical method for SG choice, based on bounding representations, that improves experimental performance;
• Our method predicts dampening, sharpness and tail-fatness values that lead to high accuracy experimentally on the LIF network.

2. Preliminaries

2.1. Initialization Schemes

Our theoretical method for SG choice is based on techniques from the weight-initialization literature, which we use in an unorthodox way to design a SG. The initial values of the network parameters have a strong impact on training speed (Hanin and Rolnick, 2018) and peak performance (Glorot and Bengio, 2010; He et al., 2015). The theory often focuses on fully-connected feed-forward networks (FFN), given their mathematical tractability (Roberts et al., 2022). FFNs are defined as y_l = b_l + W_l σ(y_{l-1}), where y_0 is the data, y_L is the network output at depth L, σ(·) is an activation, and b_l ∈ R^{n_l}, W_l ∈ R^{n_l × n_{l-1}} are the layer biases and weights, with n_l the size of layer l. Typically, biases are initialized at zero, and weights such that Mean[W_l] = 0 and Var[W_l] = c_l. The general recommendation is c_l ∝ 1/n_l, to avoid an exploding variance of the representations (Glorot and Bengio, 2010; He et al., 2015). Glorot and Bengio (2010) find Var[W_l] = 2/(n_{l-1} + n_l) optimal for linear networks (σ(y) = y), known as the Glorot initialization, while He et al. (2015) find Var[W_l] = 2/n_{l-1} for ReLU networks (σ(y) = max(0, y)), known as the He initialization. Instead, Saxe et al. (2014) find a column-orthogonal W_l optimal for linear networks, known as the Orthogonal initialization. Usually the elements of W_l are drawn from a uniform or a normal distribution. We propose the BiGamma distribution, such that w_ij ∼ Gamma(w; α, β)/2 + Gamma(−w; α, β)/2, Fig. 1. The BiGamma keeps the optimal variance and orthogonality without sampling zeros. In contrast, theoretical justifications for recurrent-network initialization have been proposed for the LSTM (Mehdipour Ghazi et al., 2019) and for other non-spiking recurrent networks (Hochreiter et al., 2001; Arjovsky et al., 2016; Pascanu et al., 2013).
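A minimal sketch of sampling from a symmetric BiGamma-style mixture: magnitudes drawn from a Gamma(α, β), signs chosen uniformly at random, with β solved so that the mixture variance matches a target (here the He value 2/n_in). The α default and the variance-matching rule are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def bigamma_init(n_in, n_out, alpha=2.0, target_var=None, rng=None):
    """Symmetric Gamma mixture: w = s * m with s = ±1 equiprobable and
    m ~ Gamma(alpha, beta). Then E[w] = 0 and Var[w] = E[m^2]
    = alpha*(alpha+1)/beta^2, so beta is solved to hit target_var.
    Since the Gamma is continuous on (0, inf), no weight is exactly zero."""
    rng = rng or np.random.default_rng()
    if target_var is None:
        target_var = 2.0 / n_in  # He variance, assumed target for illustration
    beta = np.sqrt(alpha * (alpha + 1) / target_var)  # rate parameter
    magnitudes = rng.gamma(shape=alpha, scale=1.0 / beta, size=(n_out, n_in))
    signs = rng.choice([-1.0, 1.0], size=(n_out, n_in))
    return signs * magnitudes
```

Unlike a zero-mean uniform or normal draw, this mixture is bimodal: it concentrates mass away from zero while keeping the mean and variance of a standard initializer.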
However, for spiking recurrent networks an initialization theory is missing, since arguments such as the Echo State Network (Jaeger et al., 2007) do not apply to non-convex activations, or to activations without a slope-one regime. In practice, Zenke and Vogels (2021) sample from a uniform distribution with Var[W_l] = 1/(3 n_{l-1}), while Bellec et al. (2018) sample from a normal distribution with Var[W_l] = 1/n_{l-1}, for similar spiking models.
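The two initializations cited above can be sketched as follows; note that a uniform draw on (−1/√n, 1/√n) has variance (2a)²/12 = 1/(3n), matching the first scheme, while the second simply scales a standard normal by 1/√n. Function names are ours, chosen for illustration.

```python
import numpy as np

def uniform_recurrent_init(n_in, n_out, rng=None):
    """Uniform U(-1/sqrt(n_in), 1/sqrt(n_in)): Var = 1/(3 n_in),
    as sampled by Zenke and Vogels (2021)."""
    rng = rng or np.random.default_rng()
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def normal_recurrent_init(n_in, n_out, rng=None):
    """Normal N(0, 1/n_in): Var = 1/n_in,
    as sampled by Bellec et al. (2018)."""
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
```

Both keep the variance inversely proportional to the fan-in, but the normal scheme yields a variance three times larger than the uniform one for the same n_in.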

