ADAPTIVE SMOOTHING GRADIENT LEARNING FOR SPIKING NEURAL NETWORKS

Abstract

Spiking neural networks (SNNs) with biologically inspired spatio-temporal dynamics offer high energy efficiency on neuromorphic architectures. However, error backpropagation in SNNs is prohibited by the all-or-none nature of spikes. Existing solutions circumvent this problem by relaxing the gradient calculation with a continuous function of a fixed relaxation degree, the so-called surrogate gradient learning. Nevertheless, such a relaxation introduces an additional smoothing error into spike firing, which causes the gradients to be estimated inaccurately. Thus, adaptively adjusting the relaxation degree and progressively eliminating the smoothing error are crucial. Here, we propose a methodology in which training a prototype neural network gradually evolves into training an SNN, by fusing a learnable relaxation degree into the network together with random spike noise. In this way, the network adaptively learns accurate gradients of the loss landscape of the SNN. Our theoretical analysis further shows that optimization of such a noisy network progressively evolves into optimization of the embedded SNN with shared weights. Moreover, we conduct extensive experiments on static images, dynamic event streams, speech, and instrumental sounds. The results show that the proposed method achieves state-of-the-art performance across all the datasets, with remarkable robustness to different relaxation degrees.

1. INTRODUCTION

Spiking Neural Networks (SNNs), composed of biologically plausible spiking neurons, show great potential for fast inference and low power consumption on neuromorphic architectures (Akopyan et al., 2015; Davies et al., 2018; Pei et al., 2019). Instead of the expensive multiply-accumulate (MAC) operations used in ANNs, SNNs operate asynchronously with binary spikes and offer sparse accumulate (AC) operations with lower energy costs. Additionally, existing research has revealed that SNNs are promising for machine intelligence, especially on sparse spatio-temporal patterns (Roy et al., 2019). Nevertheless, such bio-mimicry, with the all-or-none firing characteristic of spikes, inevitably brings difficulties to supervised learning in SNNs. Error backpropagation is the most promising methodology for training deep neural networks; however, the non-differentiable spike firing prohibits its direct application to SNNs. To address this challenge, two families of gradient-based training methods have been developed: (1) surrogate gradient learning (Shrestha & Orchard, 2018; Wu et al., 2018; Neftci et al., 2019) and (2) time-based learning (Mostafa, 2017; Zhang & Li, 2020). Surrogate gradient learning adopts a smooth curve to estimate the ill-defined derivative of the Heaviside function in SNNs. In this way, backpropagation becomes tractable in both the spatial and temporal domains in an iterative manner. Meanwhile, surrogate gradient learning benefits substantially from the mature ecosystem of deep learning, and it has been widely used to solve complex pattern recognition tasks (Zenke & Vogels, 2021; Neftci et al., 2019). However, the smooth curve distributes the gradient of a single spike over a group of analog items in its temporal neighborhood (Zhang & Li, 2020), which is mismatched with the inherent dynamics of spiking neurons. We refer to this problem as gradient mismatching in this paper.
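The core of surrogate gradient learning can be illustrated with a minimal, self-contained sketch (a plain-Python toy, not any particular published implementation): the forward pass keeps the all-or-none Heaviside spike, while the backward pass substitutes the derivative of a sigmoid whose steepness `alpha` plays the role of the relaxation degree.

```python
import math

def heaviside(x):
    # Forward pass: all-or-none spike generation at the threshold.
    return 1.0 if x >= 0.0 else 0.0

def surrogate_grad(x, alpha=2.0):
    # Backward pass: replace the ill-defined derivative of the Heaviside
    # step with the derivative of a sigmoid of steepness alpha. Larger
    # alpha means a sharper (less relaxed) surrogate.
    s = 1.0 / (1.0 + math.exp(-alpha * x))
    return alpha * s * (1.0 - s)

# Membrane potential minus threshold for three example neurons.
u_minus_theta = [-0.5, 0.0, 0.8]
spikes = [heaviside(u) for u in u_minus_theta]       # [0.0, 1.0, 1.0]
grads = [surrogate_grad(u) for u in u_minus_theta]   # peaked near u = theta
```

Note that every neuron near the threshold receives a nonzero gradient, even at time steps where no spike boundary is actually crossed; this is precisely the relaxation that causes the gradient mismatching discussed above.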
As a result, most parameters are updated in a biased manner under surrogate gradient learning, which limits the performance of SNNs. Besides, the smoothness of the surrogate function can greatly affect network performance (Hagenaars et al., 2021; Li et al., 2021c). The time-based method is another appealing approach. By estimating the gradients at the exact spike times, the time-based method naturally circumvents the gradient mismatching issue of surrogate gradient learning. However, to obtain an exact expression of the spike time, most works (Mostafa, 2017; Zhang & Li, 2020) assume that the firing count of each neuron remains unchanged during training (Yang et al., 2021), which is difficult to guarantee in practice. Besides, the customized backward passes of time-based methods are difficult to implement in auto-differentiation frameworks such as PyTorch, MXNet, and TensorFlow. Moreover, special tricks are necessary to avoid the phenomenon of dead neurons (Bohte et al., 2002). Therefore, the time-based method is not flexible enough to obtain deep SNNs. To address these problems, this paper proposes adaptive smoothing gradient learning (ASGL) to train SNNs directly. In general, we inject spikes as noise during ANN training and force the error surface of the ANN toward that of the SNN. With the design of dual-mode forwarding, the smoothness factor can be incorporated into training without a dedicated, computationally expensive hyperparameter search, so most parameters can be updated adaptively against mismatched gradients. In addition, compared to the time-based method, ASGL backpropagates errors in both the spatial and temporal domains without special constraints or restart mechanisms. We analyze the evolution of the dual-mode noisy network from the perspective of iterative optimization.
As a result, the optimization of the noisy network can be converted into minimizing the loss of the embedded SNN with a penalty on the smoothness factors. Experiments show that the proposed method achieves state-of-the-art performance on static images, dynamic event streams, speech, and instrumental sounds. It is worth noting that the method shows extraordinary robustness to different choices of the smoothness factor. Finally, we investigate the evolution of such a hybrid network by visualizing activation similarities, network perturbations, and width updates.
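To make the dual-mode idea concrete, the following plain-Python sketch shows one way a noisy forward pass might mix the hard spike branch with the smooth analog branch; the function name `dual_mode_forward`, the mixing probability `p_spike`, and the sigmoid relaxation are illustrative assumptions, not the paper's exact formulation.

```python
import math

def dual_mode_forward(u_minus_theta, alpha, noise):
    # Dual-mode unit: if the injected spike noise is active for this
    # neuron, emit the all-or-none spike (SNN mode); otherwise emit the
    # smooth analog relaxation (ANN mode). Because alpha appears in the
    # differentiable analog branch, it can be learned end-to-end, letting
    # the network adapt its own relaxation degree during training.
    analog = 1.0 / (1.0 + math.exp(-alpha * u_minus_theta))
    spike = 1.0 if u_minus_theta >= 0.0 else 0.0
    return spike if noise else analog

# With noise fully on, the network behaves as the embedded SNN;
# with noise off, it behaves as the smooth prototype ANN.
snn_out = dual_mode_forward(0.8, alpha=2.0, noise=True)    # hard spike
ann_out = dual_mode_forward(0.8, alpha=2.0, noise=False)   # analog value
```

Raising the spike-noise rate over training then shifts the forward pass from the ANN regime toward the SNN regime, which is the sense in which optimizing the noisy network evolves into optimizing the embedded SNN.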

2. RELATED WORKS

Direct Training. To circumvent the difficulties caused by non-differentiable spikes, surrogate gradient learning approximates spike activities with a pre-defined curve (Wu et al., 2019; Shrestha & Orchard, 2018; Gu et al., 2019; Zenke & Vogels, 2021; Fang et al., 2021b). Wu et al. (2018) proposed to backpropagate errors in both the spatial and temporal domains to train SNNs directly with surrogate gradients. Similarly, Shrestha & Orchard (2018) solved the temporal credit assignment problem in SNNs using the smoothness of a custom probability density function. To suppress gradient vanishing and explosion in deep SNNs, Zheng et al. (2021) further proposed threshold-dependent batch normalization (tdBN) and an elaborated shortcut connection for standard ResNet architectures.

Gradient Mismatching. The mismatching problem in surrogate gradient learning has attracted considerable attention. Li et al. (2021c) optimized the shape of the surrogate gradient function with the finite difference method to compensate for this problem. Nevertheless, their method needs to initialize an update step for the finite difference empirically, and it is limited by the high computational complexity of the finite difference method; therefore, only information from the preceding layers is used to update the surrogate functions. Other works bypass the difficulty without using surrogate gradients. Zhang & Li (2020) handled error backpropagation across inter-neuron and intra-neuron dependencies based on the typical time-based scheme. Furthermore, a unification of the surrogate gradient and time-based methods was suggested by Kim et al. (2020) to fuse the gradients from both spike generation and time shift. In general, these methods are constrained by specific assumptions or coupled tricks during training. Wunderlich & Pehle (2021) first proposed to compute exact gradients in an event-driven manner, thereby avoiding smoothing operations, by solving ODEs for adjoint state variables.
Nevertheless, that approach has only been verified on simple datasets with shallow networks. Yang et al. (2021) developed a novel method to backpropagate errors with neighborhood aggregation and update weights in the desired direction; however, being based on finite differences, it is computationally expensive. Severa et al. (2019) proposed to progressively sharpen the bounded ReLU activation function of an ANN into the Heaviside function of an SNN. Although this yields an effect similar to ASGL, it depends entirely on hand-crafted sharpening schedules and cannot adapt the updates to the evolution of the whole network. Different from these previous works, ASGL directly incorporates learnable width factors into the end-to-end training of a noisy network.

