ADAPTIVE SMOOTHING GRADIENT LEARNING FOR SPIKING NEURAL NETWORKS

Abstract

Spiking neural networks (SNNs) with biologically inspired spatio-temporal dynamics offer higher energy efficiency on neuromorphic architectures. However, error backpropagation in SNNs is prohibited by the all-or-none nature of spikes. Existing solutions circumvent this problem by relaxing the gradient calculation with a continuous function of constant relaxation degree, so-called surrogate gradient learning. Nevertheless, such relaxation introduces an additional smoothness error on spike firing, which leads to inaccurately estimated gradients. Thus, how to adaptively adjust the relaxation degree and progressively eliminate the smoothness error is crucial. Here, we propose a methodology by which training a prototype neural network gradually evolves into training an SNN, by fusing a learnable relaxation degree into the network together with random spike noise. In this way, the network adaptively learns accurate gradients of the loss landscape of the SNN. Our theoretical analysis further shows that optimization of such a noisy network progressively evolves into optimization of the embedded SNN with shared weights. Moreover, we conduct extensive experiments on static images, dynamic event streams, speech, and instrumental sounds. The results show that the proposed method achieves state-of-the-art performance across all the datasets with remarkable robustness to different relaxation degrees.

1. INTRODUCTION

Spiking Neural Networks (SNNs), composed of biologically plausible spiking neurons, present high potential for fast inference and low power consumption on neuromorphic architectures (Akopyan et al., 2015; Davies et al., 2018; Pei et al., 2019). Instead of the expensive multiply-accumulate (MAC) operations used in ANNs, SNNs operate with binary spikes asynchronously and offer sparse accumulation (AC) operations with lower energy costs. Additionally, existing research has revealed that SNNs promise to realize machine intelligence especially on sparse spatio-temporal patterns (Roy et al., 2019). Nevertheless, such bio-mimicry, with the all-or-none firing characteristic of spikes, inevitably brings difficulties to supervised learning in SNNs. Error backpropagation is the most promising methodology for developing deep neural networks. However, the non-differentiable spike firing prohibits the direct application of backpropagation to SNNs. To address this challenge, two families of gradient-based training methods have been developed: (1) surrogate gradient learning (Shrestha & Orchard, 2018; Wu et al., 2018; Neftci et al., 2019) and (2) time-based learning (Mostafa, 2017; Zhang & Li, 2020). Surrogate gradient learning adopts a smooth curve to estimate the ill-defined derivative of the Heaviside function in SNNs. Backpropagation, in this way, becomes tractable in both the spatial and temporal domains in an iterative manner. Meanwhile, surrogate gradient learning benefits substantially from the complete ecology of deep learning, and it has been widely used to solve complex pattern recognition tasks (Zenke & Vogels, 2021; Neftci et al., 2019). However, the smooth curve distributes the gradient of a single spike over a group of analog items in its temporal neighborhood (Zhang & Li, 2020), which is mismatched with the inherent dynamics of spiking neurons. We identify this problem as gradient mismatching in this paper.
As a result, most parameters are updated in a biased manner in surrogate gradient learning, which limits the performance of SNNs. Besides, the different smoothness of surrogate functions may greatly affect network performance (Hagenaars et al., 2021; Li et al., 2021c). The time-based method is the other appealing approach. By estimating the gradients at the exact spike times, the time-based method naturally circumvents the gradient mismatching issue of surrogate gradient learning. However, to obtain the exact expression of spike time, most works (Mostafa, 2017; Zhang & Li, 2020) suppose that the firing count of each neuron remains unchanged during training (Yang et al., 2021), which is difficult to establish in practice. Besides, it is difficult to adapt the customized backward flows of the time-based method to auto-differentiation frameworks such as PyTorch, MXNet, and TensorFlow. Moreover, special tricks are necessary to avoid the phenomenon of dead neurons (Bohte et al., 2002). Therefore, it is not flexible to obtain deep SNNs with the time-based method. To solve these problems, this paper proposes adaptive smoothing gradient learning (ASGL) to train SNNs directly. In general, we inject spikes as noise during ANN training and force the error surface of the ANN toward that of the SNN. With the design of dual-mode forwarding, the smoothness factor can be incorporated into training without a specific hyperparameter search, which would be computationally expensive. Therefore, most parameters can be updated adaptively against gradient mismatch. In addition, compared to the time-based method, ASGL backpropagates errors in both the spatial and temporal domains without special constraints or restart mechanisms. We analyze the evolution of the noisy network with dual mode from the perspective of iterative optimization.
As a result, the optimization of the noisy network can be converted into minimizing the loss of the embedded SNN with a penalty on the smoothing factors. Experiments show that the proposed method achieves state-of-the-art performance on static images, dynamic event streams, speech, and instrumental sounds. It is worth noting that the method shows extraordinary robustness to different hyperparameter selections of the smoothing factors. Finally, we investigate the evolution of such a hybrid network by visualizing activation similarities, network perturbation, and the updated widths.

2. RELATED WORKS

Direct Training. To circumvent the difficulties arising from non-differentiable spikes, surrogate gradient learning approximates spike activities with a pre-defined curve (Wu et al., 2019; Shrestha & Orchard, 2018; Gu et al., 2019; Zenke & Vogels, 2021; Fang et al., 2021b). Wu et al. (2018) proposed to backpropagate errors in both the spatial and temporal domains to train SNNs directly with surrogate gradients. Similarly, Shrestha & Orchard (2018) solved the temporal credit assignment problem in SNNs with the smoothness of a custom probability density function. To suppress gradient vanishing or explosion in deep SNNs, Zheng et al. (2021) further proposed threshold-dependent batch normalization (tdBN) and an elaborated shortcut connection in standard ResNet architectures. Gradient Mismatching. The mismatching problem in surrogate gradient learning has attracted considerable attention. Li et al. (2021c) optimized the shape of the surrogate gradient function with the finite difference method to compensate for this problem. Nevertheless, their method needs to initialize an update step for the finite difference empirically. Meanwhile, it is limited by the high computational complexity of the finite difference method; therefore, only information from the preceding layers is used in the update of the surrogate functions. Other works bypass the difficulty without using surrogate gradients. Zhang & Li (2020) handled error backpropagation across inter-neuron and intra-neuron dependencies based on the typical time-based scheme. Furthermore, Kim et al. (2020) suggested unifying the surrogate gradient and time-based methods to fuse the gradients from both spike generation and time shift. In general, those methods are constrained by specific assumptions or coupled tricks during training. Wunderlich & Pehle (2021) first proposed to compute exact gradients in an event-driven manner, thereby avoiding smoothing operations, by solving ODEs over adjoint state variables.
Nevertheless, the approach has only been verified on simple datasets with shallow networks. Yang et al. (2021) developed a novel method that backpropagates errors with neighborhood aggregation and updates weights in the desired direction. However, the method, based on finite differences, is computationally expensive. Severa et al. (2019) proposed to progressively sharpen the bounded ReLU activation function of an ANN into the Heaviside function of an SNN. Although yielding a similar effect to ASGL, it depends entirely on hand-crafted sharpening schedulers, which are difficult to update adaptively in view of whole-network evolution. Different from these previous works, ASGL incorporates learnable width factors directly into the end-to-end training of a noisy network.

3. PRELIMINARY

Notation. We follow the convention of representing vectors and matrices with bold italic letters and bold capital letters respectively, such as $\boldsymbol{s}$ and $\boldsymbol{W}$. For matrix derivatives, we use a consistent numerator layout across the paper. For a function $f(x): \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, we write $D^k f[x]$ instead of $\frac{\partial f^{(k)}(x)}{\partial x}$ to represent the $k$-th derivative of $f$ with respect to the variable $x$ in the absence of ambiguity. Let $\langle M_1, M_2 \rangle$ represent the Frobenius inner product between two matrices. For two vectors $u_1$ and $u_2$, we use $u_1 \odot u_2$ to represent the entrywise product; similarly, $u^{\odot 2}$ abbreviates $u \odot u$. We use $f \circ g$ to denote the composition of $f$ with $g$.

Leaky Integrate-and-Fire (LIF) Model. To capture the explicit relation between the input current $c$ and the output spikes $s$, we adopt the iterative form of the LIF neuron model (Wu et al., 2018) in most experiments. At each time step $t$, the spiking neurons at the $l$-th layer integrate the postsynaptic current $c^l[t]$ and update their membrane potential $u^l[t]$:

$$u^l[t] = \gamma u^l[t-1] \odot \big(1 - s^l[t-1]\big) + c^l[t] \quad (1)$$

where $\gamma = 1 - 1/\tau_m$ is the leaky factor that acts as a constant forget gate through time. The term $(1 - s^l[t-1])$ indicates that the membrane potential is reset to zero when a spike was emitted at the last time step. As done in (Wu et al., 2018; Fang et al., 2021b), we simply use the dot product between the weights $W^l$ and the spikes from the preceding layer $s^{l-1}[t]$, with a shift $b^l$, to model the synaptic function:

$$c^l[t] = W^l s^{l-1}[t] + b^l \quad (2)$$

The neurons emit spikes $s^l[t]$ whenever $u^l[t]$ crosses the threshold $\vartheta$ with enough integration of postsynaptic currents:

$$s^l[t] = \Theta(\hat{u}^l[t]) = \Theta(u^l[t] - \vartheta) \quad (3)$$

where $\Theta(x)$ is the Heaviside function:

$$\Theta(x) = \begin{cases} 1, & \text{if } x \ge 0 \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

A C-LIF variant is also applied in our experiments to make a fair comparison, and we provide its iterative equations in Appendix A.2. Readout and Loss.
For classification tasks, we need to define a readout method matching the supervised signal $y$. As done in recent work (Li et al., 2021a; Rathi et al., 2020), we take the average postsynaptic current in the last layer, $\bar{c}^L = \frac{1}{N}\sum_t c^L[t]$, where $N = T/\Delta t$ is the number of discrete time steps. The SNN prediction is then naturally defined as the class with the maximum average postsynaptic current. Furthermore, we define a cross-entropy loss that removes the temporal randomness:

$$\mathcal{L}(\bar{c}^L, y) = -y^T \log\big(\mathrm{softmax}(\bar{c}^L)\big) \quad (5)$$

For simplification, we denote $\hat{y} = \mathrm{softmax}(\bar{c}^L)$ in the rest of the paper.
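The iterative LIF dynamics of Equations (1)-(4) and the average-current readout of Equation (5) can be sketched as follows. This is a minimal PyTorch sketch with assumed shapes and helper names (`lif_forward`, `readout_loss` are ours, not the paper's code); constants such as the leak and threshold are placeholders.

```python
import torch

def lif_forward(W, b, s_in, gamma=0.5, theta=1.0):
    """Run a fully-connected LIF layer over N time steps.

    s_in: input spikes of shape (N, batch, d_in); returns output spikes
    of shape (N, batch, d_out) and the postsynaptic currents.
    """
    N, batch, _ = s_in.shape
    d_out = W.shape[0]
    u = torch.zeros(batch, d_out)          # membrane potential u[t]
    s = torch.zeros(batch, d_out)          # spikes at the previous step
    spikes, currents = [], []
    for t in range(N):
        c = s_in[t] @ W.t() + b            # Eq. (2): synaptic current
        u = gamma * u * (1.0 - s) + c      # Eq. (1): leak + hard reset
        s = (u >= theta).float()           # Eqs. (3)-(4): Heaviside firing
        spikes.append(s)
        currents.append(c)
    return torch.stack(spikes), torch.stack(currents)

def readout_loss(c_last, y):
    """Cross-entropy on the time-averaged last-layer current, Eq. (5)."""
    c_bar = c_last.mean(dim=0)             # average over the N time steps
    return torch.nn.functional.cross_entropy(c_bar, y)
```

In practice the same loop is applied layer by layer; only the last layer's currents enter the readout.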

4.1. SPIKE-BASED BACKPROPAGATION

The all-or-none firing characteristic of spikes blocks the direct utilization of backpropagation, which is the key challenge in developing spike-based backpropagation. Formally, we need to backpropagate the credit of the last-layer current at a specified time step $t^*$, with $\delta^l[t] = \frac{\partial c^L[t^*]}{\partial u^l[t]}$, in both the spatial and temporal domains (detailed derivations are provided in Appendix A.1):

$$\delta^l[t] = \delta^l[t+1] \frac{\partial u^l[t+1]}{\partial u^l[t]} + \delta^{l+1}[t] \frac{\partial u^{l+1}[t]}{\partial u^l[t]}$$
$$\frac{\partial u^l[t+1]}{\partial u^l[t]} = \gamma \, \mathrm{diag}\big(1 - s^l[t] - u^l[t] \odot \Theta'(u^l[t])\big)$$
$$\frac{\partial u^{l+1}[t]}{\partial u^l[t]} = W^{l+1} \, \mathrm{diag}\big(\Theta'(u^l[t])\big) \quad (6)$$

In the above equations, the derivative of the Heaviside function $\Theta(x)$ is the Dirac delta function, which equals zero almost everywhere and goes to infinity at $x = 0$, shown with the green line in Figure 1. Such properties prohibit the direct backward flow in SNNs. Thus most researchers approximate the gradient of the Heaviside function with a predefined differentiable curve (Neftci et al., 2019; Shrestha & Orchard, 2018). Here, we use the rectangular function (Wu et al., 2018), one of the most popular approximation functions with low computational complexity, as an example to illustrate the proposed method; the idea applies to other approximation functions as well. Formally, the rectangular function is defined as:

$$\Theta'(u) \approx h_\alpha(u) = \frac{1}{\alpha} \, \mathrm{sign}\left(|u - \vartheta| < \frac{\alpha}{2}\right) \quad (7)$$

where the width $\alpha$ controls the smoothing degree of $h_\alpha(x)$ and the temperature $\kappa = 1/\alpha$ determines the relative steepness.
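The standard surrogate-gradient treatment of Equations (6)-(7) can be sketched as a custom autograd function: the forward pass emits true spikes, while the backward pass substitutes the rectangular $h_\alpha$ for the Dirac delta. This is a minimal PyTorch sketch; the class name is ours and `u_hat` denotes $u - \vartheta$.

```python
import torch

class RectSurrogateSpike(torch.autograd.Function):
    """Heaviside forward, rectangular surrogate (Eq. 7) backward."""

    @staticmethod
    def forward(ctx, u_hat, alpha):
        ctx.save_for_backward(u_hat)
        ctx.alpha = alpha
        return (u_hat >= 0).float()        # Theta(u - theta)

    @staticmethod
    def backward(ctx, grad_out):
        u_hat, = ctx.saved_tensors
        alpha = ctx.alpha
        # h_alpha(u) = (1/alpha) * 1[|u_hat| < alpha/2]
        h = (u_hat.abs() < alpha / 2).float() / alpha
        return grad_out * h, None          # no gradient for alpha here
```

Calling `RectSurrogateSpike.apply(u - theta, alpha)` inside the LIF loop reproduces the baseline that ASGL improves upon.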

4.2. DESIGN OF ASGL

In surrogate learning, $\Theta'(x)$ is estimated by a smooth function such as $h_\alpha(x)$, which predicts the change rate of the loss in a relatively larger neighborhood (Li et al., 2021c). However, such an estimation with constant width deviates progressively from the correct direction during training. It not only hinders network convergence but also affects the generalization ability through the disturbed underlying neuron dynamics. Essentially, this problem comes from the mismatch $\|\Theta'(x) - h_\alpha(x)\|$ shown in the red part of Figure 1. Hence, how to adjust the width $\alpha$ and adaptively eliminate this mismatch is an important problem for surrogate gradient learning. ASGL solves the problem without defining a surrogate gradient at all. The method is simple but rather effective. Firstly, we derive the antiderivative of the surrogate function $h_\alpha(x)$:

$$H_\alpha(x) = \int_{-\infty}^{x} h_\alpha(u)\, du = \mathrm{clip}\left(\frac{1}{\alpha} x + \frac{1}{2},\ 0,\ 1\right)$$

Whetstone (Severa et al., 2019) uses $H_\alpha(x)$ for the forward pass and $h_\alpha(x)$ for backpropagation. Although this avoids the mismatch problem, it is difficult to guarantee that the network dynamics finally evolve into those of an SNN. In contrast, surrogate gradient learning uses $\Theta(x)$ for the forward pass and $h_\alpha(x)$ for backpropagation, which guarantees fully spike-based communication but introduces gradient mismatch. It therefore makes sense to seek an approach that combines the advantages of both perspectives and trains deep SNNs with matching gradients. To implement this, the basic idea of ASGL is simply to couple the analog activation $H_\alpha(x)$ and the binary activation $\Theta(x)$ through a random mask $m$ during the forward pass:

$$\hat{H}_\alpha(x) = (1 - m) \odot H_\alpha(x) + m \odot \Phi(\Theta(x))$$

where $m \sim \mathrm{Bernoulli}(p)$ represents independent random masking, and $p$, which controls the proportion of analog activations and spikes, is referred to as the noise probability.
To avoid the gradient mismatch from the surrogate function, we use the function $\Phi$ to detach the gradients from spikes. Mathematically, $\Phi$ stands for the special identity mapping with derivative $\frac{\partial \Phi(x)}{\partial x} = 0$. In this way, the Heaviside function $\Theta(x)$ is treated as spike noise excluded from error backpropagation. To guarantee that the gradient still flows even when $p = 1$, we recouple both modes and only inject the difference $\Theta(x) - H_\alpha(x)$ as noise:

$$\hat{H}_\alpha(x) = H_\alpha(x) + m \odot \Phi\big(\Theta(x) - H_\alpha(x)\big)$$

The operator pipeline is visualized in Figure 2. Surprisingly, the core idea of ASGL is this simple: replace $\Theta(u^l[t] - \vartheta)$ with $\hat{H}_\alpha(u^l[t] - \vartheta)$ in Equation (3) during training, and still adopt $\Theta(u^l[t] - \vartheta)$ for validation. In practice, only minor alterations, shown in Algorithm 1, are needed compared to surrogate gradient learning. Notably, the hard reset given in Equation (1) transforms into a soft version with some probability when the analog activations $H_\alpha(x)$ are propagated, even though the equations describing the neuron dynamics remain unchanged. Assume $H_\alpha(x)$ models the mapping from the expected membrane potential to the spike probability (rate) in a short period (corresponding to one time step). The neurons then have a $(1 - H_\alpha(u[t]))$ chance of not emitting a spike and maintaining the previous potential state. In the sense of expectation, the soft reset should be performed as $u[t] = (1 - H_\alpha(u[t])) \odot u[t]$ rather than resetting to a fixed value.

Algorithm 1 Core function in ASGL
1: Require: the difference between membrane voltage and threshold $\hat{u}^l[t] = u^l[t] - \vartheta$; the flag $T$ indicating training or validation
2: Ensure: $\alpha$ is the learnable width parameter
3: if $T$ is true then
4:   generate mask $m$ with noise probability $p$
5:   $s^l[t] = H_\alpha(\hat{u}^l[t]) + m \odot \Phi\big(\Theta(\hat{u}^l[t]) - H_\alpha(\hat{u}^l[t])\big)$ {the only line that needs updating compared to surrogate learning}
6: else
7:   $s^l[t] = \Theta(\hat{u}^l[t])$.detach()
8: end if
9: return $s^l[t]$

The only remaining problem is to guarantee that the noisy network trained with $\hat{H}_\alpha(x)$ finally evolves into an SNN with $\Theta(x)$ as activation.
Fortunately, from the perspective of mixed feedforwarding, this can be achieved by simply setting the width $\alpha$ as learnable (see Section 4.3 for a detailed analysis):

$$\frac{\partial H_\alpha(x)}{\partial \alpha} = \begin{cases} -\frac{1}{\alpha^2} x, & \text{if } -\frac{1}{2}\alpha \le x \le \frac{1}{2}\alpha \\ 0, & \text{otherwise} \end{cases}$$

In this way, the gradient mismatch gradually diminishes as the adaptive $\alpha$ approaches 0 against the spike noise. Besides, it avoids tricky manual adjustment of the width $\alpha$, which usually has a significant impact on performance (Wu et al., 2018; Hagenaars et al., 2021). The idea also brings insights into surrogate gradient learning. For example, ASGL degenerates to surrogate gradient learning, while still enabling adaptive width learning, if we use full spikes ($p = 1$) for the forward computation. Moreover, ASGL benefits naturally from pretrained ANNs by gradually increasing $p$ from 0. Therefore, both ANN-SNN conversion and surrogate gradient learning can be implemented within the ASGL framework with different settings of the noise probability $p$.
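Algorithm 1 amounts to a few lines of PyTorch: $H_\alpha$ (a clipped line) carries the gradient, the spike term is injected through a detached difference, and the width $\alpha$ is an ordinary learnable parameter whose gradient $\partial H_\alpha/\partial\alpha$ autograd computes from the `clamp`. This is a minimal sketch with our own function name; shapes and the Bernoulli mask follow the text.

```python
import torch

def asgl_activation(u_hat, alpha, p, training=True):
    """ASGL dual-mode activation; u_hat = u - theta.

    Training: H_alpha(u_hat) + m * detach(Theta(u_hat) - H_alpha(u_hat)).
    Validation: pure spikes Theta(u_hat).
    """
    spike = (u_hat >= 0).float()
    if not training:
        return spike.detach()
    H = torch.clamp(u_hat / alpha + 0.5, 0.0, 1.0)   # antiderivative of h_alpha
    m = torch.bernoulli(torch.full_like(u_hat, p))   # Bernoulli(p) noise mask
    # Phi detaches the injected difference, so gradients flow only through H
    return H + m * (spike - H).detach()
```

With `alpha = torch.nn.Parameter(torch.tensor(1.0))`, backpropagation updates the width jointly with the weights, which is the adaptive mechanism of Section 4.2.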

4.3. THEORETICAL ANALYSIS

In this section, we show that the network dynamics of the noisy network with learnable $\alpha$ evolve into those of the embedded SNN. Suppose $F_{noise}$ represents the noisy network used in training, with $\hat{H}_\alpha(x)$ across all layers, while $F_{snn}$ denotes the target SNN embedded in $F_{noise}$ with fully spike-based calculation. The expectation over the mask matrix is adopted to estimate the real loss $\ell_{noise}(F, s)$ of the hybrid network $F_{noise}$:

$$\ell_{noise}(F, s) \triangleq \mathbb{E}_{\overline{m}}\big[\ell(F_{noise}(s))\big] = \mathbb{E}_{\overline{m}}\big[\ell(F_{snn}(s, \overline{m}))\big]$$

Here, $\overline{m} = m/p$ represents the normalized mask gathered from all layers and $s$ denotes the input spike pattern. With the perturbation analysis in Appendix A.9, we have the following proposition:

Proposition 4.1. Minimizing the loss of the noisy network $\ell_{noise}(F, s)$ can be approximated by minimizing the loss of the embedded SNN $\ell_{snn}(F, s)$ regularized by the layerwise distance between $\Theta(\hat{u}^l)$ and $H_\alpha(\hat{u}^l)$:

$$\ell_{noise}(F, s) \approx \ell_{snn}(F, s) + \frac{1-p}{2p} \sum_{l=1}^{L} \left\langle C^l,\ \mathrm{diag}\big(H_\alpha(\hat{u}^l) - \Theta(\hat{u}^l)\big)^{\odot 2} \right\rangle$$

where $C^l = D^2 \ell \circ \mathbb{E}_{\overline{m}}[G^l][s^l]$ is the second derivative of the loss function $\ell$ with respect to the $l$-th layer spike activation $s^l$ in the constructed network $G^l$, which can be treated as a constant (Nagel et al., 2020). $G^l$ denotes the network using mixed activations after the $l$-th layer while full spikes are adopted in the first $l$ layers. To explain the proposition intuitively, we analyze the non-trivial case of $p \neq 1$ from the perspective of iterative alternating optimization. There are two steps: (1) fix the weights $W$ and optimize the width $\alpha$; (2) fix the width $\alpha$ and optimize the weights $W$. In the first step, since $\ell_{snn}(F, s)$ is constant with fixed weights, the width $\alpha$ tends to minimize the distance between $H_\alpha(\hat{u}^l)$ and $\Theta(\hat{u}^l)$. So the penalty term diminishes and $\ell_{noise}(F, s)$ approaches $\ell_{snn}(F, s)$ in this step. In the second step, the noisy network with the global task-related loss $\ell_{noise}(F, s)$ is optimized under a constant smoothing degree.
Therefore, by applying the two steps iteratively and alternately, a high-performance SNN can be obtained by training a noisy network even if we do not explicitly increase $p$ during training. The theoretical results are further validated in Figure 3b by training with fixed $p$. Notably, both the trainable width and the random noise injection with spikes are important to guarantee that the first step holds. The spike noise can be converted into a penalty on layerwise activations, while the learnable $\alpha$ enables local optimization on it by forcing $H_\alpha(x)$ toward $\Theta(x)$.
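As a toy numeric check of the intuition behind Proposition 4.1 (ours, not from the paper), the layerwise squared gap $(H_\alpha(\hat{u}) - \Theta(\hat{u}))^{\odot 2}$ that drives the penalty term shrinks monotonically as the width $\alpha$ shrinks, so a learnable $\alpha$ has a descent direction toward the embedded SNN:

```python
import torch

def penalty_gap(u_hat, alpha):
    """Mean squared gap between the analog and spiking activations."""
    H = torch.clamp(u_hat / alpha + 0.5, 0.0, 1.0)
    theta = (u_hat >= 0).float()
    return ((H - theta) ** 2).mean().item()

# evaluate the gap on a dense grid of membrane-potential offsets
u = torch.linspace(-1.0, 1.0, 1001)
gaps = [penalty_gap(u, a) for a in (2.0, 1.0, 0.5, 0.1)]
```

The gap decreases roughly linearly in $\alpha$ for this grid, consistent with the penalty vanishing as $H_\alpha \to \Theta$.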

5. EXPERIMENTS

To validate the effectiveness of the proposed method, we conduct extensive experiments on static images and spatio-temporal patterns such as dynamic event streams, spoken digits, and instrumental music. Specifically, we study the evolution of the network dynamics and the effect of the noise probability to explore whether and how injecting noise with spikes yields a faithful observation of the loss landscape of the target SNNs. More implementation details and the energy estimation can be found in Appendix A.3 and Appendix A.6, respectively.

5.1. PERFORMANCE ON STATIC IMAGES

In Table 1, we compare our work with state-of-the-art methods on the CIFAR datasets. We use the widely-used CifarNet (Wu et al., 2019) and a modified ResNet-18 structure (Li et al., 2021c). As done in (Li et al., 2021c; Deng et al., 2022), AutoAugment (Cubuk et al., 2018) and Cutout (DeVries & Taylor, 2017) are used for data augmentation. However, we adopt neither a pretrained ANN (Li et al., 2021c; Rathi et al., 2020; Rathi & Roy, 2020) to initialize weights nor Time Inheritance Training (TIT) (Li et al., 2021c; Deng et al., 2022) to improve performance under low time steps. Even so, ASGL outperforms the state-of-the-art surrogate gradient and conversion methods with the same or fewer time steps on both datasets. Remarkably, we also compare with two special training methods, TSSL (Zhang & Li, 2020) and TL (Wu et al., 2021a), which work without defining surrogate gradient functions. Our method also achieves a remarkable tradeoff between accuracy and latency, which indicates the effectiveness of the adaptive learning employed in ASGL.

5.2. PERFORMANCE ON SPATIO-TEMPORAL PATTERNS

To validate that our method handles spatio-temporal error backpropagation properly, we conduct experiments on different datasets of spatio-temporal patterns such as DVS-CIFAR10 (Li et al., 2017) and the Spiking Heidelberg Dataset (SHD) (Cramer et al., 2020). More results on MedleyDB (Bittner et al., 2014) and DVS128 Gesture (Amir et al., 2017) can be found in Appendix A.4. Performance on DVS-CIFAR10. DVS-CIFAR10 (Li et al., 2017) is a challenging neuromorphic benchmark dataset, where each sample is a recording of a CIFAR10 image scanned with a repeated closed-loop motion in front of a DVS. DVS-CIFAR10 has the same number of categories (10) and samples per class (1k) as CIFAR10, but its recording process generates more noise, making classification more difficult.
To alleviate the overfitting caused by the data size and noise, we adopt the VGGSNN architecture and the data augmentation method of (Deng et al., 2022). As shown in Table 2, our method achieves state-of-the-art performance (78.90%) without a larger network (e.g., ResNet-19, DenseNet), outperforming existing surrogate-gradient-based approaches. Performance on Sound Datasets. The SHD dataset is a spiking dataset containing 10k spoken digits generated through an encoding model simulating auditory bushy cells in the cochlear nucleus. For training and evaluation, the dataset is split into a training set (8156 samples) and a test set (2264 samples). In this experiment, we train a three-layer SNN (700-240-20) with recurrent synaptic connections to identify the keywords in utterances (more details about the recurrent connections can be found in Appendix A.5). As shown in Table 3, the proposed method achieves at least a 2.5% accuracy improvement over the latest results without any of the data augmentation introduced in (Cramer et al., 2020). Remarkably, we use the standard LIF neurons of Equations (1) to (3), whereas the adaptive LIF model (Yin et al., 2020) and the heterogeneous LIF model (Perez-Nieves et al., 2021) were adopted to enhance the neuron dynamics in the respective baselines.
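The recurrent synaptic connections used for the SHD network can be sketched as Equation (2) extended with a recurrent term from the layer's own spikes at the previous time step. This is a minimal sketch under that assumed form (the helper name and the recurrent weight $V$ are ours; see Appendix A.5 for the paper's exact formulation):

```python
import torch

def recurrent_current(W, V, b, s_prev_layer, s_own_prev):
    """Synaptic current with recurrence:
    c[t] = W s^{l-1}[t] + V s^l[t-1] + b."""
    return s_prev_layer @ W.t() + s_own_prev @ V.t() + b
```

The rest of the LIF update is unchanged; only the current computation gains the recurrent term.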

5.3. ABLATION STUDY

In Table 4, we compare ASGL with surrogate gradient (SG) learning on CIFAR-10 with the ResNet-19 architecture (Zheng et al., 2021) under $N = 3$ for the ablation study. The rectangular function is adopted with the same optimizer setting, seed, and weight initialization for a fair comparison. Specifically, we train 100 epochs with the SGD optimizer and a weight decay of 5e-4. The results show that ASGL outperforms SG across a large range of width initializations. The damage is catastrophic for SG when the width $\alpha$ is selected inappropriately ($\alpha \ge 5$). In contrast, ASGL exhibits surprising robustness across different widths $\alpha$. Furthermore, image reconstruction, a challenging regression task for SNNs, is also conducted to verify the effectiveness of ASGL. Here, we use $h_\alpha(x) = \frac{1}{2}\tanh(\alpha x) + \frac{1}{2}$ as the surrogate to show that ASGL also applies to other functions. A fully-connected autoencoder with the architecture 784-128-64-32-64-128-784 is adopted for evaluation. Table 5 reports the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of reconstructed MNIST images under 8 time steps. We find that the adaptive mechanism in ASGL reduces the width sensitivity of SG and thus yields higher performance.

5.4. EFFECT OF NOISE PROBABILITY

In this section, we analyze how the noise probability affects the performance of the SNN. Firstly, we increase the noise probability $p$ from 0 to 0.8 in increments of 0.1 every 30 epochs during the training of ResNet-19 on the CIFAR-10 dataset. As shown in Figure 3a, the training accuracy of the noisy network is extremely stable, while the validation accuracy of the target SNN grows erratically in the first 30 epochs. This is reasonable, as the noise probability is zero in the first 30 epochs and the network is purely analog without spike injection. With the noise injection of 10% spikes, the validation accuracy of the SNN increases rapidly around the 30-th epoch at the expense of a drop in the training accuracy of the noisy network. That means the noisy network begins to transform into the target SNN. Interestingly, for the noise injections at the 60-th and 90-th epochs, the training accuracy actually improves, which indicates that the small amount of spikes in the first injection is enough for the dynamics of SNNs, and that the analog activations may instead become an obstacle to fast convergence. We then explore the effect of different choices of fixed $p$ in Figure 3b and Figure 3c. Generally, a low $p$ blocks the evolution into SNNs through excessive analog activations, while an excessive $p$ does not achieve the best generalization performance.

5.5. NETWORK EVOLUTION

To reveal the evolution of the noisy network, we visualize the accuracy changes of the noisy network and the embedded SNN with shared weights during training under $p = 0.8$ and $p = 0.5$. As shown in Figure 3d, the accuracy curves of the SNN and the noisy network are extremely close in both cases, showing that the noisy network is consistent in prediction with the embedded SNN. Furthermore, we record the layer-wise activations of the noisy network and the embedded SNN for each sample, and calculate the average cosine similarity $S$ over all layers after each training epoch with ASGL. Panel (e) reports the mean and standard deviation of $S$ across all samples in the training set of CIFAR-10. According to the results, even with shared weights, the hybrid network initially has relatively low overall similarities, but after training with ASGL, the hybrid network shifts toward the SNN and the similarities increase to about 0.8. The decreasing standard deviation also verifies the effectiveness of ASGL. We also evaluate the evolution of the noisy network by observing the change of the learnable width $\alpha$ (Figure 3f) in the image reconstruction task. The width $\alpha$ declines steadily and converges to respective values across all layers.

6. CONCLUSION

In this paper, we propose a novel training method called ASGL to develop deep SNNs. Different from typical surrogate gradient learning, our method naturally circumvents the gradient mismatch problem and updates weights adaptively under random spike noise injection. Specifically, we train a special hybrid network with a mixture of spike and analog signals, where only the analog part is involved in the calculation of gradients. In this way, the hybrid network learns the optimal shapes of the activation functions against the spike noise and evolves into an SNN. To validate the effectiveness and generalization of the proposed method, we theoretically analyze the evolution from hybrid networks to SNNs. Besides, we conduct extensive experiments on various benchmark datasets. The experimental results show that our method achieves state-of-the-art performance across all the tested datasets. Meanwhile, it exhibits surprising robustness to different width selections.

A.1 DETAILED DERIVATIVES

$$\nabla_{W^l} = \frac{\partial \mathcal{L}(\bar{c}^L, y)}{\partial \bar{c}^L} \frac{\partial \bar{c}^L}{\partial W^l} = -\frac{(y^T - \hat{y}^T)}{N} \sum_{t^*=1}^{N} \frac{\partial c^L[t^*]}{\partial W^l} \quad (14)$$

To obtain the expression of $\frac{\partial c^L[t^*]}{\partial W^l}$, we assign the credits for $c^L[t^*]$ to the membrane potential $u^l[t]$ at all time steps satisfying $t \le t^*$:

$$\frac{\partial c^L[t^*]}{\partial W^l} = \sum_{t=0}^{t^*} \frac{\partial c^L[t^*]}{\partial u^l[t]} \frac{\partial u^l[t]}{\partial c^l[t]} \frac{\partial c^l[t]}{\partial W^l} = \sum_{t=0}^{t^*} \frac{\partial c^L[t^*]}{\partial u^l[t]} \frac{\partial c^l[t]}{\partial W^l} \quad (15)$$

where $\frac{\partial c^l[t]}{\partial W^l}$ is a three-dimensional tensor involving the afferent spikes $s^{l-1}[t]$. For simplification, we denote $\frac{\partial c^L[t^*]}{\partial u^l[t]}$ as $\delta^l[t]$. When $t < t^*$ and $l < L-1$, $\delta^l[t]$ can be calculated as follows:

$$\delta^l[t] = \delta^l[t+1] \frac{\partial u^l[t+1]}{\partial u^l[t]} + \delta^{l+1}[t] \frac{\partial u^{l+1}[t]}{\partial u^l[t]}$$
$$\frac{\partial u^l[t+1]}{\partial u^l[t]} = \gamma \, \mathrm{diag}\big(1 - s^l[t] - u^l[t] \odot \Theta'(u^l[t])\big)$$
$$\frac{\partial u^{l+1}[t]}{\partial u^l[t]} = W^{l+1} \, \mathrm{diag}\big(\Theta'(u^l[t])\big) \quad (16)$$

where $\Theta'(x) = [\Theta'(x_1), \Theta'(x_2), \ldots, \Theta'(x_n)]^T$ represents the element-wise derivative on the column vector $x$.
For the boundary conditions at layer $L-1$ and time step $t^*$, we obtain the expression of $\delta^l[t]$:

$$\delta^l[t] = \begin{cases} \delta^L[t+1] \frac{\partial u^L[t+1]}{\partial u^L[t]} & \text{if } l = L-1 \text{ and } t < t^* \\ \delta^{l+1}[t^*] \frac{\partial u^{l+1}[t^*]}{\partial u^l[t^*]} & \text{if } t = t^* \text{ and } l < L-1 \\ W^L \, \mathrm{diag}\big(\Theta'(u^l[t])\big) & \text{if } t = t^* \text{ and } l = L-1 \end{cases} \quad (17)$$

Then the full backward flow through time of SNNs with the LIF model can be obtained from Equations (14) to (17).

A.2 C-LIF MODEL

For the instrument recognition task, we adopt the current-based LIF model (C-LIF) (Gütig, 2016) as the basic computational unit for a fair comparison. The iterative form of the C-LIF can be presented as:

$$s^l[t] = \Theta\big(u^l[t] - \vartheta\big)$$
$$c^l[t] = u_0 \odot W^l s^{l-1}[t]$$
$$u^l[t] = m^l[t] - v^l[t] - e^l[t]$$
$$v^l[t] = \beta_v v^l[t-1] + c^l[t]$$
$$m^l[t] = \beta_m m^l[t-1] + c^l[t]$$
$$e^l[t] = \beta_m e^l[t-1] + \vartheta s^l[t-1] \quad (18)$$

where $\Theta(x)$ is the Heaviside function of Equation (4), $u_0$ is the normalization factor, $m^l[t] - v^l[t]$ models the current integration of the synapse with a double exponential function, and $e^l[t]$ simulates the refractory period of spiking neurons. The other symbols are consistent with the standard LIF model.

A.3 IMPLEMENTATION DETAILS

We use ADAM with an initial learning rate of $\lambda = 0.1$ for CIFAR100 and SGD with an initial learning rate of $\lambda = 0.1$ for CIFAR10. As done in (Li et al., 2021c), we use AutoAugment (Cubuk et al., 2018) and Cutout (DeVries & Taylor, 2017) for data augmentation on both static image datasets. Meanwhile, a cyclic cosine annealing learning rate scheduler is adopted. For the SHD dataset, we discretize time into 250 time steps and decrease the noise probability starting from 0.2. The corresponding network architecture is 700-240-20, where the neurons in the middle layer are connected with recurrent synapses. For the MedleyDB dataset, we increase the noise probability at the 30-th, 70-th, 90-th, and 95-th epochs with a discretization of 500 time steps. In particular, we update $p$ as $1 - (1-p) \cdot \zeta$ at each milestone, attenuating the ratio of analog activations at the rate $\zeta$. All the $p$ and $\zeta$ used for each dataset are shown in Table 6 unless otherwise specified. For the results in Table 5, we provide detailed statistics and configurations in Table 11.
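One time step of the C-LIF update in Equation (18) can be sketched as follows. This is a minimal sketch with assumed constants ($u_0$, $\beta_v$, $\beta_m$, $\vartheta$) and our own helper name; the refractory trace is assumed to decay with $\beta_m$ as in the equations above.

```python
import torch

def clif_step(W, s_prev_layer, s_prev_t, state, u0=1.0,
              beta_v=0.7, beta_m=0.9, theta=1.0):
    """One C-LIF time step. state = (v, m, e) from the previous step;
    returns the new spikes and the updated state."""
    v, m, e = state
    c = u0 * (s_prev_layer @ W.t())        # normalized synaptic current
    v = beta_v * v + c                     # fast exponential trace
    m = beta_m * m + c                     # slow exponential trace
    e = beta_m * e + theta * s_prev_t      # refractory trace from own spikes
    u = m - v - e                          # double-exponential PSP minus refractory
    s = (u >= theta).float()
    return s, (v, m, e)
```

Note that at the first step $m$ and $v$ receive the same current, so $u = 0$ and the neuron cannot fire immediately; the double-exponential PSP builds up over subsequent steps.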

A.4 EXPERIMENTS ON MEDLEYDB, DVS128 GESTURE, AND TINY-IMAGENET

Performance on Tiny-ImageNet. Tiny-ImageNet contains 200 categories and 100,000 64×64 colored images for training, making it a more challenging static image dataset than the CIFAR datasets. Here, we use h_α(x) = ½tanh(αx) + ½ as the surrogate forwarding function. The initial width α and decay γ are set to 2.5 and 0.5, respectively. As shown in Table 8, ASGL still achieves competitive results compared to other methods using only 4 time steps, which further verifies the effectiveness of ASGL.

Performance on MedleyDB. In this experiment, we explore the instrument recognition task with various music pieces in different melodies and styles. Specifically, as done in (Gu et al., 2019), we extract the subset of MedleyDB that contains the monophonic stems of 10 instruments. To test our algorithm on sparse spike patterns, we adopt the efficient coding scheme based on sparse representation theory (Lewicki, 2002). Moreover, we use the same metric, the same network structure (384-700-10), and the same C-LIF neuron as the previous work for a fair comparison. The results show our method outperforms the others on most metrics except the recall rate of the specialized CNN (Pons et al., 2017), which designs special convolutional kernels.

Performance on DVS128 Gesture. DVS128 Gesture is a challenging neuromorphic dataset that records 11 gestures performed by 29 different participants under three lighting conditions. The dataset comprises 1,342 samples with an average duration of 6.5 ± 1.7 s, and all samples are split into a training set (1,208 samples) and a test set (134 samples). Considering the long sample duration and the limited sample size, we follow the RCS approach (Yao et al., 2021), which randomly selects the starting point of each sample to maximize the use of the dataset. The number of time steps N is set to 60 and the network receives only one slice at each step, where the temporal resolution of each slice is adjusted to 25 ms according to the tuning method in (He et al., 2020). In Table 7, our method achieves state-of-the-art performance (97.90%) without a larger network (e.g., ResNet17, DenseNet), outperforming the directly-trained approaches based on surrogate gradients. Even compared with the specially-designed DNN approaches for neuromorphic data, our model also performs better.

6. CONCLUSION

In this paper, we propose a novel training method called adaptive smoothing gradient learning (ASGL) to develop deep SNNs. Different from typical surrogate gradient learning, our method circumvents the gradient mismatching problem naturally and updates weights adaptively with random noise injection in spikes. Specifically, we train a special hybrid network with a mixture of spikes and analog activations, where only the analog part is involved in the calculation of gradients. In this way, the analog part of the hybrid network learns the optimal shapes of the activation functions against the spikes. As a result, the dynamics of the hybrid network evolve into those of the target SNN evaluated in network validation. To validate the effectiveness and generalization of the proposed method, we analyze theoretically the evolution from hybrid networks to SNNs. Besides, we conduct extensive experiments on static images, dynamic event streams, speech, and instrumental sounds in practice. The experimental results show our method achieves state-of-the-art performance across all the datasets. Generally, our method explores the potential power of hybrid architectures by coupling the dual modes of spikes and analog activations during training. We believe this idea brings solid insights for the training of deep SNNs.

A.5 RECURRENT CONNECTIONS

To enhance the memory capacity of SNNs in the temporal domain, synaptic recurrence is widely adopted, as distinguished from the internal dynamics with the decay mechanism in spiking neurons. The basic equation for such an external recurrence could be given by:

$$\frac{dc^{l}}{dt}=\underbrace{-\frac{c^{l}(t)}{\tau_{syn}}}_{\text{exp. decay}}+\underbrace{W^{l}s^{l-1}(t)}_{\text{feedforward}}+\underbrace{V^{l}s^{l}(t)}_{\text{recurrent}},$$

where the decay terms with τ_syn and τ_m in Equation (1) both contribute to the internal recurrence, and the synaptic recurrence with weight V^l enhances the temporal memory.
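Discretizing this current equation with a forward-Euler step gives a simple recurrent update. The step size `dt`, time constant `tau_syn`, and weight shapes below are illustrative assumptions.

```python
import numpy as np

def recurrent_current_step(c, s_prev_layer, s_same_layer, W, V,
                           tau_syn=2.0, dt=1.0):
    """One Euler step of dc/dt = -c/tau_syn + W s^{l-1}(t) + V s^l(t)."""
    decay = -c / tau_syn               # exponential decay of the current
    feedforward = W @ s_prev_layer     # input spikes from layer l-1
    recurrent = V @ s_same_layer       # lateral spikes from layer l itself
    return c + dt * (decay + feedforward + recurrent)

# toy step: 2 neurons, identity feedforward, scaled recurrent weights
c_new = recurrent_current_step(
    c=np.zeros(2),
    s_prev_layer=np.array([1.0, 0.0]),
    s_same_layer=np.array([0.0, 1.0]),
    W=np.eye(2),
    V=0.5 * np.eye(2),
)
```

The recurrent term V s^l(t) is what distinguishes this external recurrence from the purely internal leak governed by τ_syn and τ_m.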

A.6 ENERGY ESTIMATION

In this section, we visualize the spike counts of each layer in the spiking ResNet-18, as shown in Figure 4, and provide the estimated energy by counting synaptic operations (SOPs) compared to the ANN counterpart. We follow the convention of the neuromorphic computing community by counting the total synaptic operations to estimate the computation overhead of SNN models compared to their ANN counterparts (Merolla et al., 2014). Notably, the SOP count with MAC presented in ANNs is constant given a specified structure. However, the SOPs in SNNs are executed by AC with lower power consumption and vary with the spike sparsity. For SNNs, the total number of accumulation operations N_AC is defined as:

$$N_{AC}=\sum_{t=1}^{T}\sum_{l=1}^{L-1}\sum_{i=1}^{N^{l}}f_{i}^{l}\,s_{i}^{l}[t]$$

where the fan-out f_i^l is the number of outgoing connections to the subsequent layer and N^l is the neuron number of the l-th layer. For ANNs, the corresponding synaptic operation count N_MAC with the more expensive multiply-accumulate is defined as:

$$N_{MAC}=\sum_{l=1}^{L-1}\sum_{i=1}^{N^{l}}f_{i}^{l}\tag{21}$$

Specially, we use MAC to estimate the energy cost of the first layer, as direct current input without explicit encoding is adopted in our experiments on static images. Here, we select 1,024 samples randomly and estimate the average SOPs for SNNs. Meanwhile, we measure a 32-bit floating-point AC at 0.9 pJ per operation and a 32-bit floating-point MAC at 4.6 pJ per operation, as done in (Han et al., 2015). The experimental result shows the SNN achieves 94.11% classification accuracy under two time steps on CIFAR-10 with only 8.96% energy consumption compared to the ANN with the same architecture.

The settings follow (Li et al., 2021c); therefore, we train 100 epochs for each case considering the time cost. As shown in both tables, ASGL is surprisingly robust to different width initializations compared to the surrogate gradient with the rectangular function and the Dspike function. This is the main advantage of ASGL, which could save the cost of hyperparameter selection in practice.
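The synaptic-operation counts N_AC and N_MAC defined in this section reduce to straightforward bookkeeping over fan-outs and recorded spikes. The toy fan-outs and spike trains below are illustrative; only the 0.9 pJ / 4.6 pJ costs come from (Han et al., 2015).

```python
import numpy as np

E_AC, E_MAC = 0.9, 4.6   # pJ per 32-bit float AC / MAC (Han et al., 2015)

def snn_ac_ops(spikes, fanout):
    """N_AC: sum over time, layers, neurons of fanout_i^l * s_i^l[t]."""
    return sum(float(np.sum(s * f)) for s, f in zip(spikes, fanout))

def ann_mac_ops(fanout):
    """N_MAC: sum over layers and neurons of fanout_i^l (one pass, no time)."""
    return sum(float(np.sum(f)) for f in fanout)

# toy 2-layer network: per-neuron fan-outs and spikes over T = 2 steps
fanout = [np.array([3.0, 3.0]), np.array([2.0])]
spikes = [np.array([[1.0, 0.0], [1.0, 1.0]]),   # layer 1: T x N
          np.array([[1.0], [0.0]])]             # layer 2: T x N
n_ac = snn_ac_ops(spikes, fanout)
n_mac = ann_mac_ops(fanout)
energy_ratio = (n_ac * E_AC) / (n_mac * E_MAC)
```

Because only emitted spikes trigger accumulations, sparser firing directly lowers `n_ac` and hence the energy ratio against the fixed `n_mac` of the ANN.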
Comparing the results at α = 2.5 in the two tables, we could find that the Dspike function shows certain robustness relative to the rectangular function. However, it could be improved further by ASGL when α ≤ 0.5, as shown in Table 12.

A.7 ABLATION STUDY ON RESNET-19

Experimental Setting: We train each model in the ablation study for 100 epochs with an initial learning rate of 0.1. We use SGD with a momentum of 0.9 across all experiments. The weight decay is set to 5e-4.

A.9 THEORY ANALYSIS

The main symbols are summarized as follows (the detailed definitions and descriptions are provided in Table 13):

- s: the input spike pattern
- f^l: the l-th spiking layer with Θ(û^l)
- F^l: f^1 ∘ f^2 ∘ ⋯ ∘ f^l
- s^l: the output of F^l with fully-spike propagation
- g^l: the l-th noisy spiking layer with random mask m̂^l
- G^l: g^L ∘ g^{L-1} ∘ ⋯ ∘ g^{l+1}
- p: the noise probability controlling the percentage of spike mode

Proposition A.1 Minimizing the loss of the noisy network ℓ_noise(F, s) can be approximated by minimizing the loss of the embedded SNN ℓ_snn(F, s) regularized by the layerwise distance between Θ(û^l) and H_α(û^l):

$$\ell_{noise}(F,s)\approx\ell_{snn}(F,s)+\frac{1-p}{2p}\sum_{l=1}^{L}\left\langle C^{l},\operatorname{diag}\left(\left(H_{\alpha}(\hat{u}^{l})-\Theta(\hat{u}^{l})\right)^{\odot 2}\right)\right\rangle$$

Derivation. Suppose G^l ∘ F^l is the hybrid network using spike activations Θ(x) in the preceding l layers and hybrid activations H_α(x) after the l-th layer. Moreover, s^l denotes the output spikes of the l-th layer across all time steps and neurons. Then we adopt a Taylor expansion on s^l, inspired by (Wei et al., 2020), to analyze the effect of the perturbation of m at the l-th layer. From Equation (10) and Equation (3), we could obtain:

$$G^{l-1}(s^{l-1},m)=G^{l}\left((1-\hat{m}^{l})\odot H_{\alpha}(\hat{u}^{l})+\hat{m}^{l}\odot\Theta(\hat{u}^{l}),\,m\right)=G^{l}\left((1-\hat{m}^{l})\odot\left(H_{\alpha}(\hat{u}^{l})-\Theta(\hat{u}^{l})\right)+s^{l},\,m\right)$$

Here, we denote Δ = (1 - m̂^l) ⊙ (H_α(û^l) - Θ(û^l)) for simplification.
As the expectation of Δ is zero and |Δ| is bounded in [0, max((1-p)/(2p), 1/2)], we adopt a Taylor expansion around Δ = 0 and approximate R(G^l, s^l):

$$R(G^{l},s^{l})=\mathbb{E}_{m}\left[\ell(G^{l-1}(s^{l-1},m))-\ell(G^{l}(s^{l},m))\right]=\mathbb{E}_{m}\left[\ell(G^{l}(s^{l}+\Delta,m))-\ell(G^{l}(s^{l},m))\right]\approx\mathbb{E}_{m}\left[D(\ell\circ G^{l})[s^{l}]\,\Delta+\frac{1}{2}\Delta^{T}D^{2}(\ell\circ G^{l})[s^{l}]\,\Delta\right]$$

As Δ is a zero-mean vector, we discard the first-order term when taking the expectation:

$$R(G^{l},s^{l})\approx\mathbb{E}_{m}\left[\frac{1}{2}\Delta^{T}D^{2}(\ell\circ G^{l})[s^{l}]\,\Delta\right]$$

Then we could take the expectation over Δ:

$$R(G^{l},s^{l})\approx\frac{1}{2}\left\langle D^{2}\left(\ell\circ\mathbb{E}_{m}[G^{l}]\right)[s^{l}],\,\mathbb{E}_{\Delta}\left[\Delta\Delta^{T}\right]\right\rangle=\frac{1-p}{2p}\left\langle D^{2}\left(\ell\circ\mathbb{E}_{m}[G^{l}]\right)[s^{l}],\,\operatorname{diag}\left(\left(H_{\alpha}(\hat{u}^{l})-\Theta(\hat{u}^{l})\right)^{\odot 2}\right)\right\rangle$$

Here, only the diagonal elements of the covariance matrix E_Δ[ΔΔ^T] are non-zero because of the independent sampling strategy in m. For the second-order term, we could just take it as a constant as in (Nagel et al., 2020). Therefore, by substituting Equation (27) into Equation (23), we get:

$$\ell_{noise}(F,s)\approx\ell(F_{snn}(s))+\frac{1-p}{2p}\sum_{l=1}^{L}\left\langle C^{l},\operatorname{diag}\left(\left(H_{\alpha}(\hat{u}^{l})-\Theta(\hat{u}^{l})\right)^{\odot 2}\right)\right\rangle$$
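The noisy forward rule analyzed in this derivation mixes the two activation modes with an element-wise Bernoulli mask, out = (1 - m)⊙H_α(û) + m⊙Θ(û). The sketch below follows that rule without any mask normalization; the width `alpha`, the fixed seed, and the zero-centered threshold are illustrative assumptions.

```python
import numpy as np

def h_alpha(x, alpha=2.5):
    """Smooth surrogate forwarding function h_alpha(x) = 0.5*tanh(alpha*x) + 0.5."""
    return 0.5 * np.tanh(alpha * x) + 0.5

def hybrid_activation(u_hat, p, alpha=2.5, rng=None):
    """Element-wise mix: mask m_i = 1 with probability p selects the spike branch."""
    rng = np.random.default_rng(0) if rng is None else rng
    m = (rng.random(u_hat.shape) < p).astype(float)   # Bernoulli noise mask
    spike = (u_hat >= 0.0).astype(float)              # Theta(u_hat), non-differentiable
    analog = h_alpha(u_hat, alpha)                    # differentiable branch
    return (1.0 - m) * analog + m * spike

out_spike = hybrid_activation(np.array([-1.0, 0.2, 1.5]), p=1.0)   # pure-spike limit
out_analog = hybrid_activation(np.array([0.0]), p=0.0)             # pure-analog limit
```

At p = 1 the layer collapses to the embedded SNN, and at p = 0 it is the fully analog prototype network, matching the two limits between which the proposition interpolates.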



† Those results are reproduced by (Yang et al., 2022) through publicly available code.



Figure 1: The forward and backward passes of the Heaviside function (green solid line) and the surrogate clipping function (purple dashed line). The red area represents the mismatch between the Heaviside function and the surrogate function, considering only one spike emission.

Figure 2: Comparison of the computation flow between ASGL and surrogate gradient learning. The dashed lines in the purple area represent operations detached from the error backpropagation, while the solid lines represent the matched forward and backward flow.

Figure 3: (a) explores the evolution procedure by observing the accuracy change with increasing spike noise. (b) and (c) examine the accuracy for different fixed selections during testing and training, respectively. (d) and (e) show the similarities between the noisy network and the embedded SNN when training with fixed p. (f) manifests the width change in the image reconstruction task.


Figure 4: The average spike counts of each layer.

For the image reconstruction (Figure 5a), the width α declines steadily across all layers. It indicates the injection of spike noise forces the noisy network to evolve into the target SNN and then optimizes the target SNN in a coupled manner. Furthermore, Figure 5b shows the change of α in CIFAR-10 classification. Interestingly, the width α of the last layer in CIFARNet increases while the others descend consistently. This may result from the coupled training with the dual goals of reducing the loss of the SNN and minimizing the distance between the noisy network and the SNN. Besides, it indicates that the width update should change dynamically according to different layers of the network and different training epochs.

Figure 5: Fig. (a) and (b) visualize the width updates in image reconstruction and image recognition, respectively.

$$\begin{aligned}\ell_{noise}(F,s)&=\ell(F_{snn}(s))+\mathbb{E}_{m}\left[\ell(F_{snn}(s,m))-\ell(F_{snn}(s))\right]\\&=\ell(F_{snn}(s))+\mathbb{E}_{m}\left[\ell(G^{0}(s^{0},m))-\ell(G^{L}(s^{L}))\right]\\&=\ell(F_{snn}(s))+\mathbb{E}_{m}\left[\sum_{l=1}^{L}\ell(G^{l-1}(s^{l-1},m))-\ell(G^{l}(s^{l},m))\right]\\&=\ell(F_{snn}(s))+\sum_{l=1}^{L}\mathbb{E}_{m}\left[\ell(G^{l-1}(s^{l-1},m))-\ell(G^{l}(s^{l},m))\right]\end{aligned}$$

Classification Performance on Static Image Benchmarks.

Comparison on the DVS-CIFAR10 dataset.

Classification Performance on SHD

Comparison on Image Reconstruction.


Different hyperparameters related to p.

Classification Performance on DVS128 Gesture

Comparison on the Tiny-ImageNet dataset.

Classification Performance on Music Instrument Dataset

Comparison with Surrogate Gradient Learning.

As done in (Li et al., 2021c), we use AutoAugment (Cubuk et al., 2018) and Cutout (DeVries & Taylor, 2017) for data augmentation. Meanwhile, a cyclic cosine annealing learning rate scheduler is adopted. Besides, we use a batch size of 128 during training. Both the threshold and the decay are set to 0.5. Specially, we decrease the noise rate at the 20th, 40th, 60th, 80th, and 95th epochs with different ζ listed in both tables.

Comparison between ASGL and Surrogate Gradient with rectangular functions under different width initializations.

Comparison between ASGL and Surrogate Gradient with Dspike functions under different width initializations.

We evaluate the evolution of such a noisy network by observing the change of the learnable width α.

The symbols and corresponding definitions (explanations).

