EFFICIENT SURROGATE GRADIENTS FOR TRAINING SPIKING NEURAL NETWORKS

Abstract

Spiking Neural Networks (SNNs) are widely regarded as a next-generation neural network infrastructure, yet they suffer from an inherent non-differentiability problem that makes the traditional backpropagation (BP) method infeasible. Surrogate gradients (SG), which approximate the shape of the Dirac δ-function, can alleviate this issue to some extent. To our knowledge, however, the majority of prior work keeps a fixed surrogate gradient for all layers, ignoring the trade-off between the approximation to the δ-function and the effective domain of gradients on a given dataset; this limits the efficiency of surrogate gradients and impairs overall model performance. To guide shape optimization when applying surrogate gradients to train SNNs, we propose an indicator χ, defined as the proportion of parameters that receive non-zero gradients during backpropagation. We further present a novel χ-based training pipeline that adaptively trades off the surrogate gradient's shape against its effective domain, and verify it with a series of ablation experiments. Our algorithm achieves 69.09% accuracy on the ImageNet dataset with SEW-ResNet34, a 2.05% absolute improvement over the baseline. Moreover, our method incurs only a very small additional cost and can be easily integrated into existing training procedures.
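To make the indicator concrete, the following is a minimal sketch of how χ could be computed under a hypothetical rectangular surrogate of width w (the threshold, width, and function names here are illustrative assumptions, not the paper's implementation): a membrane potential contributes a non-zero gradient only if it falls inside the surrogate's non-zero support, so χ is the fraction of potentials inside that window.

```python
import numpy as np

def chi(membrane_potentials, threshold=1.0, width=1.0):
    """Fraction of membrane potentials inside the surrogate's non-zero support.

    With a rectangular surrogate centred at the firing threshold, a potential u
    receives a non-zero gradient iff |u - threshold| < width / 2, so this mean
    is the proportion of parameters with non-zero gradients (the indicator chi).
    """
    active = np.abs(membrane_potentials - threshold) < width / 2
    return float(active.mean())

u = np.array([0.1, 0.8, 1.2, 3.0])
chi(u, threshold=1.0, width=1.0)   # -> 0.5
chi(u, threshold=1.0, width=0.1)   # -> 0.0: a narrower surrogate shrinks chi
```

The sketch illustrates the trade-off the abstract refers to: narrowing the surrogate brings it closer to the true δ-function but drives χ toward zero, leaving most parameters without any gradient signal.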

1. INTRODUCTION

Spiking Neural Networks (SNNs) have gained increasing attention in recent years due to their biological plausibility and potential energy efficiency compared to common real-valued Artificial Neural Networks (ANNs). SNNs communicate across layers through the accumulation of spiking signals. On the one hand, this spiking mechanism turns multiplicative operations into additive ones, increasing the efficiency of the inference procedure. On the other hand, it introduces an intrinsic differentiability issue that makes training SNNs more challenging. At present, methods for obtaining practical SNNs can be roughly divided into three categories: converting a pretrained ANN to an SNN (Sengupta et al., 2019; Deng & Gu, 2020; Li et al., 2021a; Bu et al., 2021), training with biologically heuristic methods (Hao et al., 2020; Shrestha et al., 2017; Lee et al., 2018), and training with BP-like methods (Wu et al., 2018; Zheng et al., 2020; Li et al., 2021b; Yang et al., 2021). The conversion method may not improve inference efficiency in practice, since it requires a lengthy simulation period (high inference latency) to match the accuracy of the source ANN (Sengupta et al., 2019; Rueckauer et al., 2017). Although biologically heuristic techniques require only local information to update network parameters, they are confined to small datasets due to their limited ability to represent global information (Wu et al., 2018; Shrestha et al., 2017). Compared to these two approaches, direct training with BP-like methods can handle complex models with a very short simulation duration while attaining adequate model performance (Zheng et al., 2020; Fang et al., 2021; Li et al., 2021b). With the help of surrogate gradients, an SNN can be directly trained with the BPTT algorithm on an ANN-based platform (Wu et al., 2018).
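A minimal sketch of the surrogate-gradient idea described above (the threshold, the rectangular surrogate, and all names here are illustrative assumptions, not the specific choices of any cited work): the forward pass uses the non-differentiable Heaviside step, whose true derivative is a Dirac δ at the threshold, and the backward pass substitutes a finite-width approximation of that δ so BPTT can propagate gradients.

```python
import numpy as np

THRESHOLD = 1.0   # hypothetical firing threshold
WIDTH = 1.0       # hypothetical width of the rectangular surrogate

def spike_forward(u):
    """Forward pass: Heaviside step, fire (1) when the membrane potential
    reaches the threshold. Its true derivative is a Dirac delta."""
    return (u >= THRESHOLD).astype(np.float32)

def spike_backward(u, grad_out):
    """Backward pass: replace the Dirac delta with a rectangular surrogate
    of height 1/WIDTH centred at the threshold. Potentials outside the
    window receive exactly zero gradient."""
    surrogate = (np.abs(u - THRESHOLD) < WIDTH / 2).astype(np.float32) / WIDTH
    return grad_out * surrogate

u = np.array([0.2, 0.9, 1.1, 2.4], dtype=np.float32)
spikes = spike_forward(u)                    # [0., 0., 1., 1.]
grads = spike_backward(u, np.ones_like(u))   # [0., 1., 1., 0.]
```

Note that the neuron at u = 0.9 never fired yet still receives a gradient, while the one at u = 2.4 fired but gets none: the surrogate's effective domain, not the spike itself, decides which parameters are updated.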
Nonetheless, there is a non-negligible performance gap between directly trained SNNs and ANNs, particularly on large and complicated datasets (Deng et al., 2020; Jin et al., 2018). This is because training an SNN with surrogate gradients yields only approximate gradients, and the final performance is strongly affected by the surrogate gradient's shape: a more suitable shape usually results in a better-performing SNN (Neftci et al., 2019). However, an appropriate surrogate gradient must strike a compromise between the approximation shape and the effective domain of gradients. So just altering the shape of the

