EFFICIENT SURROGATE GRADIENTS FOR TRAINING SPIKING NEURAL NETWORKS

Abstract

Spiking Neural Networks (SNNs) are widely regarded as a candidate for next-generation neural network infrastructure, yet they suffer from an inherent non-differentiability problem that makes the traditional backpropagation (BP) method infeasible. Surrogate gradients (SG), which approximate the Dirac δ-function, can alleviate this issue to some extent. To our knowledge, however, the majority of existing work keeps a fixed surrogate gradient for all layers, ignoring the trade-off between the approximation to the δ-function and the effective domain of the gradients on a given dataset, which limits the efficiency of surrogate gradients and impairs overall model performance. To guide shape optimization when applying surrogate gradients to SNN training, we propose an indicator χ that represents the proportion of parameters with non-zero gradients in backpropagation. We further present a novel χ-based training pipeline that adaptively trades off the surrogate gradients' shapes against their effective domains, followed by a series of ablation experiments for verification. Our algorithm achieves 69.09% accuracy on the ImageNet dataset using SEW-ResNet34, a 2.05% absolute improvement over the baseline. Moreover, our method incurs only a very small additional cost and can be easily integrated into existing training procedures.

1. INTRODUCTION

Spiking Neural Networks (SNNs) have gained increasing attention in recent years due to their biological plausibility and potential energy efficiency compared to common real-valued Artificial Neural Networks (ANNs). SNNs communicate across layers through the accumulation of spiking signals. On the one hand, this spiking mechanism turns multiplicative operations into additive ones, increasing the efficiency of the inference procedure. On the other hand, it introduces an intrinsic non-differentiability, which makes training SNNs more challenging. At present, methods for obtaining practical SNNs can be roughly divided into three categories: converting a pretrained ANN to an SNN (Sengupta et al., 2019; Deng & Gu, 2020; Li et al., 2021a; Bu et al., 2021), training with biologically heuristic methods (Hao et al., 2020; Shrestha et al., 2017; Lee et al., 2018), and training with BP-like methods (Wu et al., 2018; Zheng et al., 2020; Li et al., 2021b; Yang et al., 2021). The conversion method may not improve inference efficiency in practice, since it requires a lengthy simulation period (high inference latency) to catch up to the accuracy of the source ANN (Sengupta et al., 2019; Rueckauer et al., 2017). Although biologically heuristic techniques require only local information to update network parameters, they are confined to small datasets due to their limited ability to represent global information (Wu et al., 2018; Shrestha et al., 2017). Compared to these two approaches, direct training with BP-like methods can handle complex models with a very short simulation duration while attaining adequate model performance (Zheng et al., 2020; Fang et al., 2021; Li et al., 2021b). With the help of surrogate gradients, an SNN can be trained directly through the BPTT algorithm on an ANN-based platform (Wu et al., 2018).
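As a concrete illustration of the surrogate-gradient idea described above, the following sketch contrasts the non-differentiable Heaviside spike function used in the forward pass with a surrogate used in the backward pass. The rectangular shape, threshold, and width here are illustrative assumptions for exposition, not the specific surrogate used in any of the cited works:

```python
import numpy as np

def heaviside(u, v_th=1.0):
    """Forward pass: a neuron emits a spike (1.0) when its membrane
    potential u crosses the firing threshold v_th, else 0.0.
    Its true derivative is the Dirac delta, which is unusable for BP."""
    return (u >= v_th).astype(np.float64)

def rect_surrogate_grad(u, v_th=1.0, width=1.0):
    """Backward pass: replace the Dirac delta d(spike)/du with a
    rectangular window of area 1 centred at the threshold.
    Shrinking `width` drives the shape toward the delta function,
    but also shrinks the effective domain where gradients are non-zero."""
    return (np.abs(u - v_th) <= width / 2).astype(np.float64) / width

# Membrane potentials far from and near the threshold:
u = np.array([0.2, 0.8, 1.1, 2.5])
spikes = heaviside(u)           # only u >= 1.0 fires
grads = rect_surrogate_grad(u)  # only |u - 1.0| <= 0.5 gets gradient
```

Narrowing `width` makes the surrogate approach the δ-function, but it also shrinks the range of membrane potentials that receive any gradient at all, which is precisely the trade-off this work targets.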
Nonetheless, there is a non-negligible performance gap between directly trained SNNs and ANNs, particularly on large and complicated datasets (Deng et al., 2020; Jin et al., 2018). This is because training an SNN with surrogate gradients yields only approximate gradients, and the final performance is highly affected by the surrogate gradient shape: a more suitable shape usually results in a better-performing SNN (Neftci et al., 2019). However, an appropriate surrogate gradient must strike a compromise between the approximation shape and the effective domain of the gradients. Simply reshaping the surrogate gradient to resemble the δ-function more closely may cause training to fail due to vanishing gradients, since the gradients of most membrane potentials become extremely small. Additionally, the optimal surrogate gradient shapes for different layers may differ and may change throughout the training process (Li et al., 2021b). As a result, keeping a fixed initial surrogate gradient shape (with an adequate effective domain) during the whole training phase always incurs a substantial gradient error, which degrades the final training result. The purpose of this work is to optimize the SNN training pipeline by adaptively altering the shape of the surrogate gradient so as to control its effective domain. We propose an indicator χ denoting the proportion of membrane potentials with non-zero gradients in backpropagation and present a technique to control the proportion of non-zero gradients (CPNG) in the network. The CPNG technique modifies the shape of the surrogate gradients during training, progressively approaching the δ-function while keeping the indicator χ within an effective range to ensure training stability.
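A minimal sketch of how the indicator χ and a CPNG-style shape update could be computed is given below. The rectangular surrogate, and the particular bounds and multipliers, are hypothetical choices for illustration only; the actual update rule in this work may differ:

```python
import numpy as np

def chi(u, v_th=1.0, width=1.0):
    """Proportion of membrane potentials u whose surrogate gradient is
    non-zero, i.e. that fall inside the rectangular window of the given
    width centred at the threshold v_th."""
    return float(np.mean(np.abs(u - v_th) <= width / 2))

def cpng_step(u, width, chi_low=0.1, chi_high=0.3, shrink=0.9, grow=1.1):
    """One hypothetical CPNG-style update: narrow the surrogate window
    (toward the delta function) while chi stays above chi_high; widen it
    again if chi falls below chi_low, so enough membrane potentials keep
    receiving non-zero gradients for stable training."""
    c = chi(u, width=width)
    if c > chi_high:
        return width * shrink
    if c < chi_low:
        return width * grow
    return width

# Membrane potentials spread around the threshold:
u = np.linspace(0.0, 2.0, 101)
new_width = cpng_step(u, width=1.0)  # chi is high, so the window narrows
```

With potentials spread over [0, 2] and width 1.0, roughly half of them receive a non-zero gradient, so this sketch narrows the window toward the δ-function; once χ drops below the lower bound, the window widens again to preserve trainability.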
Finally, each layer succeeds in finding a surrogate gradient shape that strikes a better balance between the approximation error to the δ-function and the size of the effective domain than fixed-shape surrogate gradients. It is worth mentioning that our strategy incurs only minor additional costs during the training phase and has no effect on the inference phase. We verify the compatibility of CPNG with existing mainstream SNN infrastructures such as VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), and SEW-ResNet (Fang et al., 2021). In all reported comparative experiments, training with CPNG yields more accurate models than training with vanilla surrogate gradients. Our main contributions can be summarized as follows:
• We identify and investigate the impact of the shape of surrogate gradients on SNN training. Our finding characterizes a special representative power of SNNs that can be utilized to improve their performance.
• We propose a statistical indicator χ for the domain efficiency of surrogate gradients and a χ-based training method, CPNG, that adjusts the shape of surrogate gradients throughout the training process, driving them close to the theoretical δ-function while ensuring trainability on sufficiently large domains.
• Our CPNG method improves classification accuracy on static image datasets, including CIFAR10, CIFAR100, and ImageNet, as well as on event-based datasets such as CIFAR10-DVS. We achieve an accuracy of 69.09% when training SEW-ResNet34 on ImageNet.

2. RELATED WORK

There are two primary branches of training a high-performing deep spiking neural network: converting a pretrained artificial neural network to its corresponding spiking neural network, and directly training a spiking neural network through BP-like methods.

ANN-SNN conversion takes advantage of the high performance of ANNs and converts the source ANN to the target SNN through weight normalization (Diehl et al., 2015; 2016) or threshold balancing (Sengupta et al., 2019). However, SNNs obtained by this method require a huge simulation length to catch up with the source ANN's performance. Numerous strategies have been proposed to shorten the simulation time, including the robust threshold (Rueckauer et al., 2016), SPIKE-NORM (Sengupta et al., 2019), and RMP (Han et al., 2020). One work (Deng & Gu, 2020) examines the conversion error theoretically, decomposes it layer by layer, and offers threshold ReLU and shift-bias procedures to decrease the error. Building on this, Li et al. (2021a) divide the conversion error into clip error and floor error and design adaptive threshold, bias correction, potential correction, and weight calibration to dramatically decrease the required simulation length. A recent work (Bu et al., 2021) further proposes the unevenness error, trains the ANN with a novel activation function, and reduces the simulation length.

BP-like Method. HM2-BP (Jin et al., 2018) enables SNNs to adjust the spike sequence rather than just the spike at a certain moment. TSSL-BP (Zhang & Li, 2020) decomposes the backpropagation error into inter- and intra-neuron interactions, calculating derivatives only at the spiking moments. The NA algorithm (Yang et al., 2021), which calculates the gradient of the non-differentiable part through

