Surrogate Gradient Design for LIF Networks

Abstract

Spiking Neuromorphic Computing uses binary activity to improve the energy efficiency of Artificial Intelligence. However, the non-smoothness of binary activity requires approximate gradients, known as Surrogate Gradients (SG), to close the performance gap with Deep Learning. Several SG have been proposed in the literature, but it remains unclear how to determine the best SG for a given task and network. Good performance can be achieved with most SG shapes, but only after a costly hyper-parameter search. Thus, we aim to define, experimentally and theoretically, the best SG across different stress tests, in order to reduce the future need for grid search. We first show that more complex tasks and networks require a more careful choice of SG, and that, overall, the derivative of the fast sigmoid outperforms other SG across tasks and networks for a wide range of learning rates. Second, we focus on the Leaky Integrate and Fire (LIF) spiking neuron model, and we note that high initial firing rates, combined with a sparsity-encouraging loss term, can lead to better generalization, depending on the SG shape. Finally, we provide a theoretical solution, inspired by the Glorot and He initializations, to find a SG and an initialization that experimentally result in improved accuracy. We show how it can be used to reduce the need for an extensive grid search over dampening, sharpness and tail-fatness.

1. Introduction

Spiking Neuromorphic Computing uses binary and sparse signals to construct learning algorithms with higher energy efficiency (Henderson et al., 2020; Blouw et al., 2019; Davies et al., 2021; Lapique, 1907; Izhikevich, 2003). However, a binary signal means that the true derivative is zero almost everywhere, so training with gradient descent will be very poor at best. Research has shown that designing an approximate gradient, referred to as a Surrogate Gradient (SG) (Esser et al., 2016; Zenke and Ganguli, 2018; Bellec et al., 2018), significantly improves training success. However, this introduces an additional hyper-parameter: which SG to use. Moreover, the best SG can depend on the neural architecture, the task, the learning rate, the initialization, and so on, making it difficult to know a priori which one to pick. Finding the best SG for a particular setting therefore requires a time-consuming grid search, and reducing that search time is desirable. To meet that need, we stress test a wide variety of SG, focusing on one specific neuron model, the Leaky Integrate and Fire (LIF) (Lapique, 1907; Gerstner et al., 2014), and provide a mathematical solution based on gradient stability methods (Glorot and Bengio, 2010; He et al., 2015) to design the best SG for a LIF. In contrast, it is standard practice to pick one SG for all experiments (Bohte, 2011; Hubara et al., 2016a; Bellec et al., 2018; Zenke and Ganguli, 2018; Zenke and Vogels, 2021; Yin et al., 2021), possibly exploring the effect of changing a width factor (sharpness) (Zenke and Ganguli, 2018) or a height factor (dampening) (Bellec et al., 2018). Only in the past five years has the possibility of choosing an optimal SG been considered (Neftci et al., 2019; Zenke and Vogels, 2021). Moreover, even if many SG can achieve good performance (Zenke and Vogels, 2021), some shapes are more likely to fail during training or to achieve lower accuracy.
It therefore seems valuable to have a complete picture of when and where each SG works, and which ones are better left behind.
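To make the forward/backward mismatch concrete, the following is a minimal NumPy sketch of the idea: the forward pass uses the binary Heaviside spike, whose true derivative is zero almost everywhere, while the backward pass substitutes a smooth surrogate, here the derivative of the fast sigmoid with a height (dampening) and width (sharpness) factor. The specific parameter values are illustrative, not the ones used in the experiments.

```python
import numpy as np

def spike(v_minus_thr):
    # Forward pass: Heaviside step on the membrane potential minus
    # threshold; the output is binary, so its true gradient is zero
    # (almost) everywhere.
    return (v_minus_thr > 0).astype(float)

def fast_sigmoid_sg(v_minus_thr, dampening=0.3, sharpness=1.0):
    # Backward pass: surrogate gradient, the derivative of the fast
    # sigmoid x / (1 + |x|). `dampening` scales the peak height and
    # `sharpness` controls the width of the bump around the threshold.
    return dampening / (1.0 + sharpness * np.abs(v_minus_thr)) ** 2

v = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # membrane potential - threshold
s = spike(v)             # binary spikes: [0, 0, 0, 1, 1]
g = fast_sigmoid_sg(v)   # surrogate gradient, peaked at the threshold
```

During backpropagation, `g` is used in place of the zero true derivative wherever the chain rule crosses the spiking nonlinearity; the grid search this paper aims to reduce is precisely over parameters such as `dampening` and `sharpness`.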

