A UNIFIED OPTIMIZATION FRAMEWORK OF ANN-SNN CONVERSION: TOWARDS OPTIMAL MAPPING FROM ACTIVATION VALUES TO FIRING RATES

Anonymous authors
Paper under double-blind review

Abstract

Spiking Neural Networks (SNNs) have attracted great attention as a primary candidate for running large-scale deep artificial neural networks (ANNs) in real time, owing to their distinctive properties of energy-efficient, event-driven fast computation. Training an SNN directly from scratch is usually difficult because of the discreteness of spikes. Converting an ANN to an SNN, i.e., ANN-SNN conversion, is an alternative way to obtain deep SNNs. The performance of the converted SNN is determined by both the ANN performance and the conversion error. Existing ANN-SNN conversion methods usually redesign the ANN with a new activation function in place of the regular ReLU, train the tailored ANN, and convert it to an SNN. The performance loss between the regular ANN with ReLU and the tailored ANN has never been considered, yet it is inherited by the converted SNN. In this work, we formulate ANN-SNN conversion as a unified optimization problem that simultaneously considers the performance loss between the regular ANN and the tailored ANN, as well as the conversion error. Following this unified optimization framework, we propose the SlipReLU activation function to replace the regular ReLU activation function in the tailored ANN. SlipReLU is a weighted sum of the threshold-ReLU and the step function, and it improves on either one used as an activation function alone. The SlipReLU method covers a family of activation functions mapping from activation values in source ANNs to firing rates in target SNNs; most state-of-the-art optimal ANN-SNN conversion methods are special cases of our proposed SlipReLU method. We demonstrate through two theorems that the expected conversion error between SNNs and ANNs can theoretically be zero on a range of shift values δ ∈ [−1/2, 1/2] rather than only at a fixed shift 1/2, enabling us to achieve converted SNNs with high accuracy and ultra-low latency.
We evaluate our proposed SlipReLU method on the CIFAR-10/100 and Tiny-ImageNet datasets, and the results show that SlipReLU outperforms state-of-the-art ANN-SNN conversion methods and directly trained SNNs in both accuracy and latency. To our knowledge, this is the first work to explore a high-performance ANN-SNN conversion method that considers the ANN performance and the conversion error simultaneously, with ultra-low latency, down to 1 time-step (T = 1).

1. INTRODUCTION

Spiking neural networks (SNNs) are biologically inspired neural networks based on biologically plausible spiking neuron models that process real-time signals (Hodgkin & Huxley, 1952; Izhikevich, 2003). With the significant advantages of low power consumption and fast inference on neuromorphic hardware (Roy et al., 2019), SNNs are becoming a primary candidate for running large-scale deep artificial neural networks (ANNs) in real time. The most commonly used neuron model in SNNs is the Integrate-and-Fire (IF) neuron model (Liu & Wang, 2001). Each neuron in an SNN emits a spike only when its accumulated membrane potential exceeds the threshold voltage; otherwise, it stays inactive in the current time-step. This makes SNNs more similar to biological neural networks. Compared to ANNs, event-driven SNNs have binarized/spiking activation values, resulting in low energy consumption when implemented on specialized neuromorphic hardware. Another significant property of SNNs is the pseudo-simultaneity of their inputs and outputs when making inferences in a spatio-temporal paradigm. Whereas a conventional ANN is presented with a whole input vector at once and processes it layer by layer to produce one output value, the forward pass of an SNN can efficiently process streaming, time-varying inputs. Generally, there are two distinct routes to obtain an SNN: (1) training an SNN from scratch (Wu et al., 2018; Neftci et al., 2019; Zenke & Vogels, 2021), and (2) ANN-SNN conversion (Cao et al., 2015; Diehl et al., 2015; Deng & Gu, 2021), i.e., converting an ANN to an SNN. Training from scratch uses gradient-based supervised optimization with back-propagation, pretending that SNNs are specialized ANNs. Due to the non-differentiability of the binary activation function in SNNs, surrogate gradients are usually used (Neftci et al., 2019), but this essentially optimizes different networks in the forward and backward passes.
This approach can only train SNNs on small- and moderate-size datasets (Li et al., 2021). ANN-SNN conversion is an effective method to obtain deep SNNs with performance comparable to ANNs on large-scale datasets. There are two main types of ANN-SNN conversion mechanisms: (1) one-step conversion, which converts a pre-trained ANN to an SNN without changing the architecture of the pre-trained ANN, e.g., Diehl et al. (2015); Li et al. (2021), and (2) two-step conversion, which involves redesigning the ANN, training it, and converting it to an SNN, e.g., Cao et al. (2015); Deng & Gu (2021); Bu et al. (2021). In this work, we investigate two-step ANN-SNN conversion methods, where the ANN is redesigned by replacing the regular ReLU activation function with a new activation function, the tailored ANN is trained, and the result is converted to an SNN. A tailored ANN that deviates too much from the regular ANN will degrade its performance, resulting in a performance loss that is inherited by the converted SNN. However, the performance degradation between the regular ANN and the tailored ANN has never been considered in existing ANN-SNN conversion studies. To achieve high-accuracy and low-latency SNNs (e.g., 1 or 2 time-steps), we are the first to consider the performance loss between the regular ANN with ReLU and the tailored ANN, as well as the conversion error, simultaneously. Our main contributions are summarized as follows: (1) We formulate ANN-SNN conversion as a unified optimization problem that considers the ANN performance as well as the conversion error simultaneously. (2) We propose the SlipReLU activation function for the tailored ANN, in order to minimize the layer-wise conversion error while keeping the tailored ANN's performance as good as the regular ANN's.
(3) The SlipReLU method covers a family of activation functions mapping from activation values in source ANNs to firing rates in target SNNs; most state-of-the-art optimal ANN-SNN conversion methods are special cases of our proposed SlipReLU method. (4) We demonstrate through two theorems that the expected conversion error between SNNs and ANNs can theoretically be zero on a range of shift values δ ∈ [−1/2, 1/2] rather than only at a fixed shift 1/2. Experimental results also demonstrate the effectiveness of the proposed SlipReLU method.

2. PRELIMINARIES

Given a classification problem on an image dataset (x, y) ∈ D, where y ∈ {1, ..., C} is the true class label for image x ∈ R^m, we train a neural network f : x → f(x) in the form of an ANN/SNN by optimizing the standard cross-entropy (CE) loss, L_CE(y, p) = −Σ_{c=1}^{C} y_c log(p_c), where y_c and p_c are the c-th elements of the label y and the network prediction p = f(x). Since the infrastructures of the source ANN and target SNN are the same, we use the same notation f when it is unambiguous, and f_ANN or f_SNN otherwise. For the notations, refer to Table S1.

ANN Neuron Model. In a conventional ANN, a whole input vector is presented to the network at once and processed layer by layer through continuous activations to produce one output value. The forward computation of the analog neurons is formulated as a^(ℓ) = F_ANN(z^(ℓ)) = F_ANN(W^(ℓ) a^(ℓ-1)), where z^(ℓ) and a^(ℓ) are the pre-activation and post-activation vectors of the ℓ-th layer, W^(ℓ) denotes the weight matrix, and F_ANN(·) is the activation function of the ANN.

SNN Neuron Model. Compared with an ANN, an SNN employs binary activations (i.e., spikes) in each layer. To compensate for the weak representation capacity of binary activations, the time dimension (or latency) is introduced: the inputs of the forward pass are presented as streams of events, and the forward pass is repeated for T time-steps to obtain the final result. Here we consider the Integrate-and-Fire (IF) neuron model (Cao et al., 2015; Bu et al., 2021; Deng & Gu, 2021) for SNNs. We derive the forward propagation of the PSP through the layers of the target SNN, which is equivalent to the forward computation of the analog neurons in the source ANN.
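As a concrete illustration of the analog forward computation above, here is a minimal NumPy sketch; the layer sizes and weight values are arbitrary toy choices, not taken from the paper.

```python
import numpy as np

def ann_forward(weights, x, act=lambda z: np.maximum(z, 0.0)):
    """Layer-by-layer analog forward pass: a^(l) = F_ANN(W^(l) a^(l-1)).

    `weights` is a list of weight matrices; `act` is the ANN activation
    F_ANN (the regular ReLU by default)."""
    a = x
    for W in weights:
        a = act(W @ a)  # pre-activation z^(l) = W^(l) a^(l-1), then activation
    return a

# Toy two-layer example with hand-picked weights.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
W2 = np.array([[1.0, 2.0]])
out = ann_forward([W1, W2], np.array([1.0, 2.0]))
```

Swapping `act` for a different function is all that is needed later to turn this regular ANN into a tailored ANN.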
Suppose at time-step t the IF neurons in the ℓ-th layer receive the binary input x^(ℓ-1)(t) from the previous layer; each neuron temporarily updates its membrane potential by

u^(ℓ)(t) = v^(ℓ)(t − 1) + W^(ℓ) x^(ℓ-1)(t), (2)

where v^(ℓ)(t) denotes the membrane potential at time-step t, and u^(ℓ)(t) is the temporary intermediate variable used to determine the update from v^(ℓ)(t − 1) to v^(ℓ)(t). For each IF neuron i, if the temporary intermediate potential u^(ℓ)_i(t) reaches the threshold V^(ℓ)_th, the neuron produces a spike output s^(ℓ)_i(t) = 1; otherwise it releases no spike, s^(ℓ)_i(t) = 0:

s^(ℓ)_i(t) = H(u^(ℓ)_i(t) − V^(ℓ)_th) = 1 if u^(ℓ)_i(t) ⩾ V^(ℓ)_th, and 0 otherwise. (3)

The vector s^(ℓ)(t) = {s^(ℓ)_i(t)} collects the spikes of all neurons of the ℓ-th layer at time t. Note that V^(ℓ)_th can be different in each layer. We update the membrane potential by the reset-by-subtraction mechanism (Rueckauer et al., 2017; Han et al., 2020), which subtracts the threshold from the temporary membrane potential whenever a spike is emitted (s^(ℓ)_i(t) = 1):

v^(ℓ)(t) = u^(ℓ)(t) − s^(ℓ)(t) V^(ℓ)_th. (4)

Similar to Deng & Gu (2021); Bu et al. (2021), if a neuron in the current layer ℓ fires a spike, it releases an unweighted PSP (postsynaptic potential) x^(ℓ)(t) = s^(ℓ)(t) V^(ℓ)_th as input to the next layer. As for the input to the first layer and the output of the last layer of the SNN, we do not employ any spiking mechanism, as in Li et al. (2021). We directly encode the static image into temporally dynamic inputs to the first layer, which prevents the undesired information loss introduced by Poisson encoding. For the last layer, we only integrate the pre-synaptic input and do not fire any spikes.
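The IF dynamics of Eqs. (2)-(4) can be sketched as a small simulation; the constant-per-step input, the threshold value, and T below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate_if_layer(z, v_th, T, v0=0.0):
    """Simulate a layer of IF neurons with reset-by-subtraction for T steps.

    `z` is the constant input current W^(l) x^(l-1)(t) fed at every step;
    returns the spike trains s^(l)(t) with shape (T, n_neurons)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    v = np.full_like(z, v0)                # membrane potential v^(l)(0)
    spikes = np.zeros((T, z.size))
    for t in range(T):
        u = v + z                          # Eq. (2): temporary potential
        s = (u >= v_th).astype(float)      # Eq. (3): fire when threshold reached
        v = u - s * v_th                   # Eq. (4): reset by subtraction
        spikes[t] = s
    return spikes

spikes = simulate_if_layer(z=[0.3, 0.8], v_th=1.0, T=4)
rate = spikes.mean(axis=0)                 # firing rate over the T steps
```

The residual potential is kept rather than zeroed, which is exactly what distinguishes reset-by-subtraction from reset-to-zero.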

3. UNIFIED OPTIMIZATION FRAMEWORK OF ANN-SNN CONVERSION

In this section, we formulate the ANN-SNN conversion problem as an optimization problem with two terms: one keeps the tailored ANN close to the regular ANN with ReLU, and the other controls the ANN-SNN conversion error. Starting from the first ANN-SNN conversion work (Cao et al., 2015), previous ANN-SNN conversion methods (Cao et al., 2015; Diehl et al., 2015; Deng & Gu, 2021) have focused on optimizing the ANN-SNN conversion error alone. The performance of the converted SNN, however, is determined by both the ANN performance and the conversion error, and no prior work has considered the ANN performance in two-step ANN-SNN conversion; our unified optimization framework does. In the two-step conversion method, we redesign the activation function of the regular ANN to get a tailored ANN, train the tailored ANN, and convert it to an SNN. Considering the performance loss between the tailored ANN and the regular ANN, the new activation function should not deviate too much from the regular ReLU.

3.1. ANN-SNN CONVERSION IN A UNIFIED OPTIMIZATION FRAMEWORK

Definition 1 (Unified Optimization Framework of ANN-SNN Conversion). The ANN-SNN conversion can be formulated as a unified optimization problem with an implicit variable T:

min_{F, T} { w E_z(|f(z; W, ReLU) − f(z; W, T, F_ANN)|) + (1 − w) E_z(|f(z; W, F_ANN) − f(z; W, T, F_SNN)|) }, (5)

where w ∈ [0, 1]. Specially, if F_ANN is designed by considering the deviation from the regular ReLU, the layer-wise conversion error becomes

E_z[Err^(ℓ)] = E_z[ F_ANN(a^(ℓ-1); W^(ℓ)) − F_SNN(x̄^(ℓ-1); W^(ℓ), T) ]. (6)

Here f denotes the neural network infrastructure shared by the source ANN and target SNN (see the description in Sect. 2), f(z; W, ReLU) denotes the regular ANN with ReLU, f(z; W, T, F_ANN) is the tailored ANN with activation function F_ANN, f(z; W, T, F_SNN) is the converted SNN, z is the input to the neural network, W = {W^(ℓ)} are the weight matrices trained in the tailored ANN and copied to the target SNN, F = F_ANN ∪ F_SNN is the space of activation functions of the tailored ANNs and the target SNNs, and the latency T (or number of time-steps) is an implicit variable inherently inherited from the target SNN. The latency also allows the flexibility of adjusting T to balance latency against accuracy of the converted SNN for different applications. When the conversion-error term E_z(|f(z; W, F_ANN) − f(z; W, T, F_SNN)|) achieves its minimum, the conversion is called an "optimal" ANN-SNN conversion. For example, Deng & Gu (2021) achieve an optimal minimum error of (V^(ℓ)_th)^2 / (4T), whereas Bu et al. (2021) can theoretically achieve an optimal minimum error of 0.
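The two-term objective in Eq. (5) can be estimated empirically by Monte Carlo once the three network outputs are available. The sketch below uses toy 1-D "networks" (a ReLU, a 4-level quantized ReLU, and a 3-step SNN-like map) purely as placeholders; none of these stand in for the paper's actual models.

```python
import numpy as np

def unified_objective(f_relu, f_tailored, f_snn, zs, w):
    """Monte Carlo estimate of Eq. (5):
    w * E_z|f_ReLU - f_tailored| + (1 - w) * E_z|f_tailored - f_SNN|."""
    a = np.array([f_relu(z) for z in zs])
    b = np.array([f_tailored(z) for z in zs])
    c = np.array([f_snn(z) for z in zs])
    return w * np.mean(np.abs(a - b)) + (1 - w) * np.mean(np.abs(b - c))

# Toy 1-D stand-ins for the three networks.
relu = lambda z: max(z, 0.0)
quant4 = lambda z: min(max(np.floor(4 * z) / 4, 0.0), 1.0)  # tailored ANN
snn3 = lambda z: min(max(np.floor(3 * z) / 3, 0.0), 1.0)    # converted SNN
zs = np.linspace(0.0, 1.0, 1001)
obj = unified_objective(relu, quant4, snn3, zs, w=0.5)
```

Varying `w` trades off the deviation-from-ReLU term against the conversion-error term, which is the design space the framework explores.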

3.2.1. FIRING RATES IN SNNS AND ACTIVATION VALUES IN ANNS

To make the layer-wise error as small as possible, the converted SNN is ideally expected to have approximately the same activation values as the source ANN in each layer, i.e.,

a^(ℓ) ≈ x̄^(ℓ) = (1/T) Σ_{t=1}^{T} x^(ℓ)(t) = (1/T) Σ_{t=1}^{T} s^(ℓ)(t) V^(ℓ)_th = V^(ℓ)_th s̄^(ℓ).

Here a^(ℓ) denotes the activation value of the ANN, and x̄^(ℓ) is the activation value of the SNN, which is the average postsynaptic potential (average PSP) released by the ℓ-th layer as input to the next layer; s̄^(ℓ) is the firing rate of the ℓ-th layer over latency T. Since the threshold V^(ℓ)_th of the SNN can be different from layer to layer, we make it a trainable parameter that is learned in the source ANN and copied to the target SNN. Any mismatch between the activation values a^(ℓ) and x̄^(ℓ) leads to conversion error.
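The average-PSP identity above can be checked directly on a spike train; the threshold, T, and the spike pattern below are toy values for a minimal sketch.

```python
import numpy as np

# Toy spike train of one layer: T = 4 time-steps, 3 neurons (illustrative values).
v_th = 0.9
s = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)  # s^(l)(t), one row per time-step
x = s * v_th                            # unweighted PSP x^(l)(t) = s^(l)(t) V_th
x_bar = x.mean(axis=0)                  # average PSP (1/T) sum_t x^(l)(t)
rate = s.mean(axis=0)                   # firing rate s_bar^(l)
# x_bar = V_th * rate can only take the discrete values {0, V_th/T, ..., V_th},
# which is exactly why a continuous ANN activation cannot always be matched.
```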

3.2.2. ACTIVATION FUNCTION IN SNNS

We follow the derivation in Deng & Gu (2021); Li et al. (2021) to deduce the SNN activation function F_SNN, which gives the relationship between the activation values x̄^(ℓ-1) and x̄^(ℓ) of successive layers of the SNN. Combining Eq. (2) and Eq. (4) and summing over time-steps 1 to T, we get

v^(ℓ)(T) − v^(ℓ)(0) = W^(ℓ) Σ_{t=1}^{T} x^(ℓ-1)(t) − Σ_{t=1}^{T} s^(ℓ)(t) V^(ℓ)_th.

The accumulated spikes are m = Σ_{t=1}^{T} s^(ℓ)(t) = {m_i}, where each m_i ∈ {0, 1, 2, ..., T} denotes the total number of spikes of neuron i. Further assume v^(ℓ)(T) ∈ [0, V^(ℓ)_th). Therefore, we have

(T W^(ℓ) x̄^(ℓ-1) − V^(ℓ)_th)/V^(ℓ)_th + δ < m ⩽ T W^(ℓ) x̄^(ℓ-1)/V^(ℓ)_th + δ, with shift δ = v^(ℓ)(0)/V^(ℓ)_th.

Then we use the clip and floor functions to determine m:

m = clip( ⌊T W^(ℓ) x̄^(ℓ-1)/V^(ℓ)_th + δ⌋, 0, T ).

We provide the detailed analysis of the activation function of SNNs in Appendix A. Here the clip(x, a, b) function sets the lower bound a and upper bound b, and the floor function ⌊x⌋ gives the greatest integer less than or equal to x. With x̄^(ℓ) = V^(ℓ)_th s̄^(ℓ) = m V^(ℓ)_th / T, the SNN activation function finally gives the relationship between the activation values x̄^(ℓ-1) and x̄^(ℓ) as

x̄^(ℓ) = F_SNN(W^(ℓ) x̄^(ℓ-1)) = V^(ℓ)_th clip( (1/T) ⌊T W^(ℓ) x̄^(ℓ-1)/V^(ℓ)_th + δ⌋, 0, 1 ). (8)

The SNN activation function F_SNN(·) is a step function on the interval [0, V^(ℓ)_th] with step size V^(ℓ)_th/T (see the green curve in Fig. 1). Since the SNN output is discrete while the ANN output is continuous, there is an intrinsic difference between a^(ℓ) and x̄^(ℓ), as shown in Fig. 1.
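Under the constant-per-step-input assumption used in the derivation, the closed-form F_SNN of Eq. (8) can be checked against a direct IF simulation. A minimal sketch; `v_th`, `T`, and the test inputs are arbitrary illustrative values.

```python
import numpy as np

def f_snn(z, v_th, T, delta=0.0):
    """Closed-form SNN activation, Eq. (8), for a scalar constant input z."""
    return v_th * np.clip(np.floor(T * z / v_th + delta) / T, 0.0, 1.0)

def if_average_psp(z, v_th, T, delta=0.0):
    """Average PSP of an IF neuron driven by constant input z for T steps,
    with initial potential v(0) = delta * v_th and reset-by-subtraction."""
    v, n_spikes = delta * v_th, 0
    for _ in range(T):
        v += z                      # Eq. (2)
        if v >= v_th:               # Eq. (3)
            v -= v_th               # Eq. (4)
            n_spikes += 1
    return v_th * n_spikes / T      # x_bar = V_th * firing rate

vals = [-0.2, 0.3, 0.55, 0.99, 1.2]
closed = [f_snn(z, 1.0, 4) for z in vals]
simulated = [if_average_psp(z, 1.0, 4) for z in vals]
```

The clip term covers the saturating cases (z < 0 never fires; z ⩾ V_th fires every step), which the simulation reproduces.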

[Figure 1: Activation functions of source ANNs: the shift-threshold-ReLU (blue curve) (Deng & Gu, 2021), the quantization clip-floor-shift (QCFS) activation (orange curve) (Bu et al., 2021), the SNN step function (green curve), and the proposed SlipReLU activation (red curve). Panels: (C1) pre-activation, N = T = 3, shift-0; (C2) pre-activation, N = T = 3, shift-1; (C3) pre-activation, N = 3, T = 6, shift-0.]

4. THE PROPOSED SLIPRELU METHOD

In this section, following our unified optimization framework, we exploit the two-step conversion mechanism. We redesign the ANN to get a tailored ANN, train the tailored ANN, and convert it to an SNN by copying the weights from the tailored ANN to the target SNN. Performance loss occurs if the new activation function of the tailored ANN deviates too much from the regular ReLU activation function, and we need to minimize the conversion error at the same time. With that in mind, an effective activation function F_ANN of the tailored ANN should not deviate too much from the regular ReLU while staying close to the step function. Therefore, we propose the SlipReLU activation function, a weighted sum of the threshold-ReLU and the step function, to balance the trade-off between the regular ReLU and the step function. Assume that both the ANN and the SNN receive the same input from the previous layer, a^(ℓ-1) = x̄^(ℓ-1), and denote z^(ℓ) = W^(ℓ) x̄^(ℓ-1) = W^(ℓ) a^(ℓ-1).

4.1. THE PROPOSED SLIPRELU ACTIVATION FUNCTION

Following the unified optimization framework of Sect. 3, new activation functions of the tailored ANNs are designed by minimizing both the mismatch to the step function of the target SNNs and the deviation from the regular ReLU. We propose the SlipReLU activation function for the tailored ANN (see the red curves in (C1)-(C3) of Fig. 1):

F_ANN(z^(ℓ)) = c θ^(ℓ) clip( z^(ℓ)/θ^(ℓ), 0, 1 ) + (1 − c) θ^(ℓ) clip( (1/N) ⌊N z^(ℓ)/θ^(ℓ)⌋, 0, 1 ). (9)

From the definition in Eq.
(9), the SlipReLU activation function is a weighted sum of the threshold-ReLU (first term) and the step function (second term), with the slope 0 ⩽ c ⩽ 1 balancing the two. With some linear algebra, SlipReLU can be formulated as a piece-wise linear function with a constant slope c (see the red curves in (C1)-(C3) of Fig. 1 and the detailed derivation in Appendix B):

SlipReLU(z^(ℓ)) = c z^(ℓ) + (1 − c) kθ^(ℓ)/N, for kθ^(ℓ)/N ⩽ z^(ℓ) < (k+1)θ^(ℓ)/N, k = 0, 1, ..., N − 1.

Here 0 ⩽ c ⩽ 1 is the constant slope of the piece-wise linear function. From this definition and the red curves in (C1)-(C3) of Fig. 1, the proposed function resembles a slippery step function with a slope, hence the name "SlipReLU".
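The equivalence between the weighted-sum form of Eq. (9) and the piece-wise linear form can be checked numerically; a sketch with illustrative values of θ, N, and c.

```python
import numpy as np

def slip_relu(z, theta, N, c):
    """SlipReLU, Eq. (9): weighted sum of threshold-ReLU and step function."""
    relu_part = theta * np.clip(z / theta, 0.0, 1.0)
    step_part = theta * np.clip(np.floor(N * z / theta) / N, 0.0, 1.0)
    return c * relu_part + (1.0 - c) * step_part

def slip_relu_piecewise(z, theta, N, c):
    """Equivalent piece-wise linear form on [0, theta):
    c*z + (1-c)*k*theta/N for z in [k*theta/N, (k+1)*theta/N)."""
    k = np.floor(N * z / theta)
    return c * z + (1.0 - c) * k * theta / N

theta, N, c = 1.0, 4, 0.3
zs = np.linspace(0.0, 0.999, 200)   # interior of [0, theta)
a = slip_relu(zs, theta, N, c)
b = slip_relu_piecewise(zs, theta, N, c)
```

Setting c near 1 makes the curve almost a (threshold-)ReLU; c near 0 makes it almost the step function, which is the trade-off the method tunes.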

SlipReLU extension

The SlipReLU extension with shift (see Appendix B) can be formulated as

F_ANN(z^(ℓ)) = c θ^(ℓ) clip( z^(ℓ)/θ^(ℓ) + δ_1, 0, 1 ) + (1 − c) θ^(ℓ) clip( (1/N) ⌊N z^(ℓ)/θ^(ℓ) + δ⌋, 0, 1 ). (10)

An application of the unified optimization framework. Recall the step activation function of the target SNNs in Eq. (8); by setting c = 0 the SlipReLU becomes the step function,

a^(ℓ) = F_ANN(z^(ℓ)) = θ^(ℓ) clip( (1/N) ⌊N z^(ℓ)/θ^(ℓ) + δ⌋, 0, 1 ). (11)

Note that Eq. (11) is exactly the step function in Eq. (8) with V^(ℓ)_th replaced by θ^(ℓ) and T by N. Because the latency T in Eq. (8) is an inherent property of the target SNN, it cannot be used in the source ANN; instead of T, we use quasi-time-steps N. As mentioned in Sect. 3.2.1, the threshold V^(ℓ)_th of the SNN can be different from layer to layer, and we make it a trainable value θ^(ℓ) in the ANN that is learned and copied to the target SNN. Coincidentally, Bu et al. (2021) use the function defined in Eq. (11) as the activation function of their source ANNs; they name it the quantization clip-floor-shift (QCFS) activation function and call N the quantization steps.

Special cases of SlipReLU. Here we list some related works that fall into our proposed unified optimization framework and are special cases of SlipReLU. (1) When c = 0 and δ = [1/2], the proposed SlipReLU activation function becomes the quantization clip-floor-shift activation of Bu et al. (2021). It only aims to be close to the SNN step function and neglects the deviation from the regular ReLU. (2) When c = 1 and δ_1 = [−1/(2N)], the proposed SlipReLU activation function becomes the shift-threshold-ReLU of Deng & Gu (2021). It considers the deviation from the regular ReLU but neglects the closeness to the step function of the target SNN. Our proposed SlipReLU balances the trade-off between the regular ReLU and the step function. Refer to Appendix B for details.
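The two special cases can be checked directly from the shifted form of Eq. (10); a sketch, with an arbitrary grid of test points.

```python
import numpy as np

def slip_relu_shift(z, theta, N, c, d1, d):
    """SlipReLU with shift, Eq. (10)."""
    relu_part = theta * np.clip(z / theta + d1, 0.0, 1.0)
    step_part = theta * np.clip(np.floor(N * z / theta + d) / N, 0.0, 1.0)
    return c * relu_part + (1.0 - c) * step_part

def qcfs(z, theta, N, d):
    """Quantization clip-floor-shift activation (Bu et al., 2021), Eq. (11)."""
    return theta * np.clip(np.floor(N * z / theta + d) / N, 0.0, 1.0)

def shift_threshold_relu(z, theta, d1):
    """Shift-threshold-ReLU (Deng & Gu, 2021)."""
    return theta * np.clip(z / theta + d1, 0.0, 1.0)

theta, N = 1.0, 4
zs = np.linspace(-0.5, 1.5, 401)
case_qcfs = slip_relu_shift(zs, theta, N, c=0.0, d1=0.0, d=0.5)      # case (1)
case_strelu = slip_relu_shift(zs, theta, N, c=1.0, d1=-1/(2*N), d=0.5)  # case (2)
```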

4.2. THEOREMS ON THE CONVERSION ERROR

The following two theorems give the conversion error of the proposed unified method.

Theorem 1. Suppose an ANN trained with the SlipReLU activation function in Eq. (10) is converted to an SNN with the same weights, and let V^(ℓ)_th = θ^(ℓ), v^(ℓ)(0) = V^(ℓ)_th δ, and c = 0. Then for arbitrary T and N, the expectation of the conversion error of the proposed unified method reaches 0, i.e.,

∀ T, ℓ: E_z[Err^(ℓ)] = 0,

for any shift term δ in the source ANN with δ ∈ [−1/2, 1/2].

Theorem 1 indicates that for c = 0 the expected conversion error reaches zero even when N ≠ T, provided that the shift term δ ∈ [−1/2, 1/2]. The proof is in Appendix C.

Theorem 2. Suppose an ANN trained with the SlipReLU activation function in Eq. (10) is converted to an SNN with the same weights, and let V^(ℓ)_th = θ^(ℓ), v^(ℓ)(0) = V^(ℓ)_th δ, and δ_1 = [(δ − 1/2)/T]. Then for arbitrary T and N and arbitrary c ∈ [0, 1], the expectation of the conversion error of the proposed unified method reaches the optimum c (V^(ℓ)_th)^2 / (4T), i.e.,

∀ T, ℓ: E_z[Err^(ℓ)] = c (V^(ℓ)_th)^2 / (4T),

for any shift term δ in the source ANN with δ ∈ [−1/2, 1/2].

Theorem 2 indicates that for any c ∈ [0, 1], the expectation of the conversion error reaches the minimum c (V^(ℓ)_th)^2 / (4T), provided that the shift term δ in the source ANN lies in the interval [−1/2, 1/2] and δ_1 = (δ − 1/2)/T. The proof is in Appendix C. These results indicate that we can achieve high-performance converted SNNs at ultra-low time-steps.
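Theorem 1's c = 0 claim can be probed numerically at one admissible shift. The Monte Carlo sketch below makes several illustrative assumptions: z is drawn uniformly on [0, θ] with θ = V_th = 1, the sizes N = 4 and T = 7 are arbitrary, and the shift is taken at the fixed value 1/2 inside the admissible range; it is an empirical probe, not the proof in Appendix C.

```python
import numpy as np

def clip_floor(z, v_th, steps, delta):
    """Step activation v_th * clip(floor(steps*z/v_th + delta)/steps, 0, 1).
    With steps=N this is the c=0 SlipReLU (ANN side); with steps=T, Eq. (8)."""
    return v_th * np.clip(np.floor(steps * z / v_th + delta) / steps, 0.0, 1.0)

rng = np.random.default_rng(0)
v_th, N, T, delta = 1.0, 4, 7, 0.5
z = rng.uniform(0.0, v_th, size=200_000)
err = clip_floor(z, v_th, N, delta) - clip_floor(z, v_th, T, delta)  # layer-wise error
mean_err = err.mean()   # close to 0 even though N != T
```

Note the error is zero only in expectation: individual samples still have nonzero error, which is why the result is stated as E_z[Err^(ℓ)] = 0.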

4.3. ALGORITHM FOR TRAINING SLIPRELU ACTIVATION FUNCTION IN BACKPROPAGATION

Training an ANN with the SlipReLU activation instead of ReLU is also a challenging problem. Although SlipReLU has the constant slope c in its piece-wise linear form from Eq. (9), small slope values c ∈ [0, 1] can cause a vanishing-gradient problem. Therefore, inspired by Bu et al. (2021); Bengio et al. (2013), we use the surrogate gradient d⌊x⌋/dx = 1 as the derivative of the floor function. The overall derivative rule is given by

dF_ANN(z^(ℓ))/dz^(ℓ)_i = 1 if z^(ℓ)_i ∈ D_1 ∪ D_2, and 0 otherwise,

where D_1 = [−δ_1 θ, θ − δ_1 θ], D_2 = [−δθ, θ − δθ], and z^(ℓ)_i is the i-th element of z^(ℓ). We can then train the ANN with the SlipReLU activation using the stochastic gradient descent algorithm and convert it to the SNN. Refer to Appendix D for our proposed ANN-SNN conversion algorithm.
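The derivative rule above can be sketched as a straight-through-style gradient mask. This mirrors the stated rule only; it is not a full autograd implementation, and the values of θ and the shifts are illustrative.

```python
import numpy as np

def slip_relu_grad(z, theta, d1, d):
    """Surrogate derivative of SlipReLU w.r.t. z: 1 on D1 ∪ D2, else 0,
    with D1 = [-d1*theta, theta - d1*theta] and D2 = [-d*theta, theta - d*theta]."""
    in_d1 = (z >= -d1 * theta) & (z <= theta - d1 * theta)
    in_d2 = (z >= -d * theta) & (z <= theta - d * theta)
    return (in_d1 | in_d2).astype(float)

# With theta = 1, d1 = 0, d = 0.5: D1 = [0, 1], D2 = [-0.5, 0.5].
theta, d1, d = 1.0, 0.0, 0.5
g = slip_relu_grad(np.array([-0.6, -0.3, 0.3, 0.8, 1.2]), theta, d1, d)
```

In a framework with autograd, this mask would be applied in the backward pass while the exact SlipReLU of Eq. (10) is used in the forward pass.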

5. RELATED WORK

The first study of ANN-SNN conversion was proposed by Cao et al. (2015), who converted ANNs with the ReLU activation function to SNNs. Afterwards, Diehl et al. (2015) proposed data-based and model-based weight-normalization methods to convert a three-layer CNN to an SNN. However, the converted SNN usually requires hundreds of time-steps to reach accurate results due to the error analyzed in Sect. 3. To address the potential information loss, the "reset-by-subtraction" mechanism (Rueckauer et al., 2017), also called "soft reset" (Han et al., 2020), was proposed in place of "reset-to-zero". Recently, many methods and algorithms have been proposed to eliminate the conversion error. Sengupta et al. (2019) proposed a novel weight-normalization technique that accounts for the actual SNN operation in the conversion step. For direct conversion from a pre-trained ANN to an SNN, Ding et al. (2021) proposed a Rate Norm Layer to replace the ReLU activation function in source-ANN training, and Li et al. (2021) proposed calibration of weights and biases using quantized fine-tuning to correct the error layer by layer. Our work shares similarity with Deng & Gu (2021); Bu et al. (2021), which also study optimal conversion. Deng & Gu (2021) minimized the layer-wise error with the shift-threshold-ReLU, which only considers the deviation from ReLU in the unified optimization framework of Sect. 3. Bu et al. (2021) proposed training ANNs with the quantization clip-floor-shift activation function, which only minimizes the conversion error and neglects the performance loss of the tailored ANN with the new activation function. Both obtain theoretically "optimal" results with some fixed shift term. In comparison, our proposed unified framework gives more flexibility for different application scenarios to convert an ANN into an SNN, eliminating the conversion error while keeping the ANN performance close to the regular ANN with ReLU.
Our SlipReLU balances the trade-off between the ANN performance and the conversion error.

6. EXPERIMENTS

In this section, we compare our SlipReLU method with existing state-of-the-art approaches on the image classification task on CIFAR-10 (LeCun et al., 1998) and CIFAR-100 (Krizhevsky & Hinton, 2009). We use SlipReLU with the shift setting δ_1 = 0, δ = 0.5; refer to Appendix H for ablation studies on the SlipReLU and SlipReLU-shift activations.

6.1. COMPARISON WITH THE STATE-OF-THE-ART METHODS

Table 1 shows the performance comparison of the proposed SlipReLU with state-of-the-art ANN-SNN conversion methods on CIFAR-10. For ultra-low-latency inference (T = 1 or T = 2), our proposed SlipReLU performs best among existing state-of-the-art ANN-SNN conversion methods. Specifically, at latency T = 1, our SlipReLU method achieves an accuracy of 93.11% for ResNet-18, a good margin over the next best baseline, QCFS (88.30%); for VGG-16, the accuracy is 85.40% with SlipReLU, while the next best is 75.51% with QCFS. For ResNet-20, we achieve an accuracy of 82.8% with 2 time-steps. Our proposed SlipReLU method indeed gives the best SNN accuracy for ultra-low-latency inference. For low-latency inference (T ⩽ 4), our model outperforms almost all other methods at the same time-step setting. Notably, our ultra-low-latency performance is comparable with other state-of-the-art supervised training methods, as shown in Table S2. The most competitive baseline to our SlipReLU is the QCFS method; however, it cannot provide as high performance as SlipReLU in terms of ANN accuracy, as seen from the ANN test accuracy. Here the ANN accuracy with ReLU activation is the baseline. The results show that SlipReLU yields higher ANN accuracy than QCFS (the step function). The reason is that QCFS only considers the conversion error and not the ANN performance, while our SlipReLU considers both. The highest ANN accuracy is sometimes achieved by the SNNC-AP method, a one-step conversion method; however, SNNC-AP usually fails to give even moderate accuracy for low-latency SNN inference.
Considering both ANN accuracy and SNN inference accuracy, SlipReLU performs best among the state-of-the-art models and stays closest to the regular ReLU activation function, since our SlipReLU method is designed to consider both the ANN accuracy and the conversion error simultaneously. We further test the performance of our method on larger-scale datasets. Experimental results on the CIFAR-100 and Tiny-ImageNet datasets are reported in Table S3 and Table S4 of Appendix G.

6.2. EFFECT OF THE SLOPE c AND EFFECT OF THE QUASI-LATENCY N

In our SlipReLU method, the slope c balances the weight between the threshold-ReLU and the step function, which affects the accuracy of the converted SNN. To analyze the effect of c and better determine its value, we evaluate different combinations of the slope c and the quasi-latency N. Results in Fig. 2 show that for small values of the quasi-latency N, the slope c has a large effect on SNN accuracy for ultra-low- and low-latency inference. In particular, for small quasi-latency N, different slope values c result in different SNN accuracy when the time-step T is small. For large values of quasi-latency N, the colored curves lie close to each other, and different values of the slope c give similar results whether the time-step T is small or large. This brings flexibility in applying our SlipReLU to different scenarios: when we need ultra-low/low-latency inference for the converted SNN, we choose a small quasi-latency N; when we do not care about the inference time (the time-step T can be large), we choose a large quasi-latency N. Refer to Appendix I for more detailed results.

7. DISCUSSION AND CONCLUSION

The performance of the converted SNN is determined by both the ANN performance and the conversion error. The performance loss between the regular ANN with ReLU and the tailored ANN had never been considered in existing ANN-SNN conversion methods, yet it is inherited by the converted SNN. In this work, we formulate ANN-SNN conversion as a unified optimization problem that simultaneously considers the performance loss between the regular ANN with ReLU and the tailored ANN, as well as the conversion error. Following this unified optimization framework, we propose the SlipReLU activation function to replace the regular ReLU activation function in the tailored ANN. SlipReLU is a weighted sum of the shift-threshold-ReLU and the step function, and it improves on either one used as an activation function alone. The SlipReLU method covers a family of activation functions mapping from activation values in source ANNs to firing rates in target SNNs; most state-of-the-art optimal ANN-SNN conversion methods are special cases of our proposed SlipReLU method. We demonstrate through two theorems that the expected conversion error between SNNs and ANNs can theoretically be zero on a range of shift values δ ∈ [−1/2, 1/2] rather than only at a fixed shift 1/2, enabling us to achieve converted SNNs with high accuracy and ultra-low latency. We evaluate our proposed SlipReLU method on the CIFAR-10 dataset, and the results show that SlipReLU outperforms state-of-the-art ANN-SNN conversion methods in both accuracy and latency. To our knowledge, this is the first work to explore a high-performance ANN-SNN conversion method that considers the ANN performance and the conversion error simultaneously, with ultra-low latency, down to 1 time-step (T = 1).

NOTATIONS IN THE APPENDIX

Throughout the paper and this Appendix, we use the notations in Table S1. Bold-face lower-case letters denote vectors, and normal-face letters denote scalars. Note that V^(ℓ)_th and θ^(ℓ) are vectors whose dimensions match the number of neurons in the layer of interest; we write V^(ℓ)_th = [V^(ℓ)_th] and θ^(ℓ) = [θ^(ℓ)], meaning that every element of the vector equals the same scalar V^(ℓ)_th (respectively θ^(ℓ)). Similarly, denote δ = [δ].

A ANALYSIS OF ACTIVATION FUNCTION IN SNNS

We derive the activation function of the SNN, F_SNN(·), in this section. The activation function of the SNN gives the relationship between the activation values x̄^(ℓ-1) and x̄^(ℓ) of successive layers of the SNN, which defines the input-output mapping for adjacent layers. Specifically, combining Eq. (2) and Eq. (4) gives the potential update equation

v^(ℓ)(t) = v^(ℓ)(t − 1) + W^(ℓ) x^(ℓ-1)(t) − s^(ℓ)(t) V^(ℓ)_th. (A.1)

Summing over time-steps 1 to T, we get

v^(ℓ)(T) − v^(ℓ)(0) = W^(ℓ) Σ_{t=1}^{T} x^(ℓ-1)(t) − Σ_{t=1}^{T} s^(ℓ)(t) V^(ℓ)_th. (A.2)

Due to the spike-in-spike-out property of the IF neurons in the SNN, the output at each time-step is either 0 or 1. For each neuron i, let m_i = Σ_{t=1}^{T} s^(ℓ)_i(t); each m_i ∈ {0, 1, 2, ..., T} denotes the total number of spikes of neuron i, and m = {m_i} is the vector collecting the spike counts of all neurons in the ℓ-th layer, i.e., the accumulated spikes m = Σ_{t=1}^{T} s^(ℓ)(t). According to the above equations, we have

v^(ℓ)(T) − v^(ℓ)(0) = W^(ℓ) T x̄^(ℓ-1) − m V^(ℓ)_th. (A.3)

Then we get

m V^(ℓ)_th = T W^(ℓ) x̄^(ℓ-1) − (v^(ℓ)(T) − v^(ℓ)(0)). (A.4)

A.1 ELEMENT-WISE VERSION OF THE DERIVATION

Denote z^(ℓ) = W^(ℓ) x̄^(ℓ-1). We use z^(ℓ)_i, v^(ℓ)_i(T), v^(ℓ)_i(0), and m_i to denote the i-th elements of the vectors z^(ℓ), v^(ℓ)(T), v^(ℓ)(0), and m, respectively; that is, z^(ℓ) = {z^(ℓ)_i}, v^(ℓ)(T) = {v^(ℓ)_i(T)}, v^(ℓ)(0) = {v^(ℓ)_i(0)}, and m = {m_i}. Then we have

m V^(ℓ)_th = T z^(ℓ) − (v^(ℓ)(T) − v^(ℓ)(0)) ⇐⇒ m_i V^(ℓ)_th = T z^(ℓ)_i − (v^(ℓ)_i(T) − v^(ℓ)_i(0)) for each neuron i.
Note that we assume the terminal membrane potential $v_i^{(\ell)}(T)$ lies within the range $[0, V_{th}^{(\ell)})$. Further assuming $v_i^{(\ell)}(0) = 0$, we get
$$0 \le v_i^{(\ell)}(T) < V_{th}^{(\ell)} \iff -V_{th}^{(\ell)} < -v_i^{(\ell)}(T) \le 0$$
$$\iff T z_i^{(\ell)} - V_{th}^{(\ell)} < T z_i^{(\ell)} - v_i^{(\ell)}(T) \le T z_i^{(\ell)} \qquad (\text{adding } T z_i^{(\ell)} \text{ to each term})$$
$$\iff T z_i^{(\ell)} - V_{th}^{(\ell)} < m_i V_{th}^{(\ell)} \le T z_i^{(\ell)} \qquad (m_i V_{th}^{(\ell)} = T z_i^{(\ell)} - v_i^{(\ell)}(T))$$
$$\iff \frac{T z_i^{(\ell)} - V_{th}^{(\ell)}}{V_{th}^{(\ell)}} < m_i \le \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} .$$
We then use the floor and clip operations to determine the total number of spikes $m_i$:
$$m_i = \mathrm{clip}\left( \left\lfloor \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} \right\rfloor, 0, T \right) \quad (\text{and } m_i = T \bar{s}_i^{(\ell)}) ,$$
$$\bar{s}_i^{(\ell)} = \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) \quad (\text{and } \bar{x}_i^{(\ell)} = V_{th}^{(\ell)} \bar{s}_i^{(\ell)}) ,$$
$$\bar{x}_i^{(\ell)} = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) .$$
The assumption $v_i^{(\ell)}(0) = 0$ may be too strong; without it, we get
$$T z_i^{(\ell)} - V_{th}^{(\ell)} + v_i^{(\ell)}(0) < m_i V_{th}^{(\ell)} \le T z_i^{(\ell)} + v_i^{(\ell)}(0)$$
$$\iff \frac{T z_i^{(\ell)} - V_{th}^{(\ell)} + v_i^{(\ell)}(0)}{V_{th}^{(\ell)}} < m_i \le \frac{T z_i^{(\ell)} + v_i^{(\ell)}(0)}{V_{th}^{(\ell)}}$$
$$\iff \frac{T z_i^{(\ell)} - V_{th}^{(\ell)}}{V_{th}^{(\ell)}} + \delta < m_i \le \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} + \delta \qquad \text{with } \delta = \frac{v_i^{(\ell)}(0)}{V_{th}^{(\ell)}} .$$
Denote $\delta = \frac{v_i^{(\ell)}(0)}{V_{th}^{(\ell)}}$. Then we have
$$m_i = \mathrm{clip}\left( \left\lfloor \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, T \right), \qquad \bar{s}_i^{(\ell)} = \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, 1 \right), \qquad \bar{x}_i^{(\ell)} = V_{th}^{(\ell)} \bar{s}_i^{(\ell)} .$$
The relationship between the activation values $\bar{x}^{(\ell-1)}$ and $\bar{x}^{(\ell)}$ of successive SNN layers can thus be formulated element-wise as
$$\bar{x}_i^{(\ell)} = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z_i^{(\ell)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) .$$

A.2 VECTOR VERSION DERIVATION

The accumulated spikes $m = \sum_{t=1}^{T} s^{(\ell)}(t)$ denote the total number of spikes, and $m = \{m_i\}$ is the vector collecting the spike counts of all neurons in the $\ell$-th layer; each $m_i \in \{0, 1, 2, \cdots, T\}$ is the total number of spikes of neuron $i$. According to the above equations, we have
$$v^{(\ell)}(T) - v^{(\ell)}(0) = W^{(\ell)} T \bar{x}^{(\ell-1)} - m V_{th}^{(\ell)} , \quad (A.5)$$
and hence
$$m V_{th}^{(\ell)} = T W^{(\ell)} \bar{x}^{(\ell-1)} - \left( v^{(\ell)}(T) - v^{(\ell)}(0) \right) . \quad (A.6)$$
Note that we assume the terminal membrane potential $v^{(\ell)}(T)$ lies within the range $[0, V_{th}^{(\ell)})$. Further assuming $v^{(\ell)}(0) = 0$, we get
$$0 \le v^{(\ell)}(T) < V_{th}^{(\ell)} \iff -V_{th}^{(\ell)} < -v^{(\ell)}(T) \le 0$$
$$\iff T W^{(\ell)} \bar{x}^{(\ell-1)} - V_{th}^{(\ell)} < T W^{(\ell)} \bar{x}^{(\ell-1)} - v^{(\ell)}(T) \le T W^{(\ell)} \bar{x}^{(\ell-1)}$$
$$\iff T W^{(\ell)} \bar{x}^{(\ell-1)} - V_{th}^{(\ell)} < m V_{th}^{(\ell)} \le T W^{(\ell)} \bar{x}^{(\ell-1)}$$
$$\iff \frac{T W^{(\ell)} \bar{x}^{(\ell-1)} - V_{th}^{(\ell)}}{V_{th}^{(\ell)}} < m \le \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} ,$$
where the inequalities and divisions are element-wise. We then use the floor and clip operations to determine the total number of spikes $m$:
$$m = \mathrm{clip}\left( \left\lfloor \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} \right\rfloor, 0, T \right) \quad (\text{and } m = T \bar{s}^{(\ell)}) ,$$
$$\bar{s}^{(\ell)} = \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) \quad (\text{and } \bar{x}^{(\ell)} = V_{th}^{(\ell)} \bar{s}^{(\ell)}) ,$$
$$\bar{x}^{(\ell)} = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) .$$
The assumption $v^{(\ell)}(0) = 0$ may be too strong; without it, we get
$$T W^{(\ell)} \bar{x}^{(\ell-1)} - V_{th}^{(\ell)} + v^{(\ell)}(0) < m V_{th}^{(\ell)} \le T W^{(\ell)} \bar{x}^{(\ell-1)} + v^{(\ell)}(0)$$
$$\iff \frac{T W^{(\ell)} \bar{x}^{(\ell-1)} - V_{th}^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} < m \le \frac{T W^{(\ell)} \bar{x}^{(\ell-1)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}}$$
$$\iff \frac{T W^{(\ell)} \bar{x}^{(\ell-1)} - V_{th}^{(\ell)}}{V_{th}^{(\ell)}} + \delta < m \le \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} + \delta \qquad \text{with } \delta = \frac{v^{(\ell)}(0)}{V_{th}^{(\ell)}} .$$
Denote $\delta = \frac{v^{(\ell)}(0)}{V_{th}^{(\ell)}}$; note $\delta$ is a vector whose dimension matches the number of neurons in that layer. Then we have
$$m = \mathrm{clip}\left( \left\lfloor \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, T \right), \qquad \bar{s}^{(\ell)} = \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, 1 \right), \qquad \bar{x}^{(\ell)} = V_{th}^{(\ell)} \bar{s}^{(\ell)} .$$
The relationship between the activation values $\bar{x}^{(\ell-1)}$ and $\bar{x}^{(\ell)}$ of successive SNN layers can be formulated as
$$\bar{x}^{(\ell)} = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T W^{(\ell)} \bar{x}^{(\ell-1)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) .$$
Note $V_{th}^{(\ell)}$ is a vector whose dimension matches the number of neurons in that layer, and $V_{th}^{(\ell)} = [V_{th}^{(\ell)}]$ means every element is the same $V_{th}^{(\ell)}$. With $z^{(\ell)} = W^{(\ell)} \bar{x}^{(\ell-1)}$,
$$\bar{x}^{(\ell)} = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z^{(\ell)}}{V_{th}^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) .$$
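As a sanity check (ours, not part of the paper's derivation), the closed-form spike count $m_i = \mathrm{clip}(\lfloor (T z_i + v_i(0)) / V_{th} \rfloor, 0, T)$ can be compared against a direct step-by-step simulation of the soft-reset IF dynamics of Eq. (A.1), assuming a constant per-step input $z_i$; the helper names below are illustrative:

```python
import math

def simulate_if_spikes(z, v0, v_th, T):
    """Step-by-step soft-reset IF neuron with constant input z at every time-step."""
    v, spikes = v0, 0
    for _ in range(T):
        v += z                      # charge: v(t) = v(t-1) + z
        if v >= v_th:               # fire when the threshold is reached
            v -= v_th               # soft reset: subtract the threshold
            spikes += 1
    return spikes

def closed_form_spikes(z, v0, v_th, T):
    """Spike count predicted by m = clip(floor((T*z + v(0)) / V_th), 0, T)."""
    return min(max(math.floor((T * z + v0) / v_th), 0), T)

# the two counts agree for inputs away from exact firing boundaries
for z in [-0.2, 0.0, 0.13, 0.26, 0.49, 0.73, 0.99, 1.17]:
    for v0 in [0.0, 0.5]:           # v0 = V_th * delta with delta in {0, 1/2}
        for T in [4, 8, 16]:
            assert simulate_if_spikes(z, v0, 1.0, T) == closed_form_spikes(z, v0, 1.0, T)
```

The clip to $[0, T]$ handles exactly the saturated cases (negative inputs and inputs above the threshold) for which the assumption $v^{(\ell)}(T) \in [0, V_{th}^{(\ell)})$ does not hold.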

B DERIVATION OF SLIPRELU ACTIVATION FUNCTION

In this section, we give a detailed derivation of the proposed SlipReLU activation function in Eq. (9) and its extension in Eq. (10) with different shift modes. In ANNs, denote $z^{(\ell)} = W^{(\ell)} x^{(\ell-1)}$. Then the forward propagation of activation values through the layers of the ANN is
$$a^{(\ell)} = F^{ANN}(z^{(\ell)}) = F^{ANN}(W^{(\ell)} x^{(\ell-1)}) .$$

B.1 DERIVATION OF SLIPRELU ACTIVATION FUNCTION

Derivation of the SlipReLU activation function in Eq. (9). We start with the initial piecewise definition of the SlipReLU function in Eq. (B.1),
$$\mathrm{SlipReLU}(z^{(\ell)}) = \begin{cases} 0 & \text{if } z^{(\ell)} < 0 , \\ c z^{(\ell)} + (1-c)\dfrac{k\theta^{(\ell)}}{N} & \text{if } \dfrac{k\theta^{(\ell)}}{N} \le z^{(\ell)} < \dfrac{(k+1)\theta^{(\ell)}}{N} , \\ \theta^{(\ell)} & \text{if } z^{(\ell)} \ge \theta^{(\ell)} , \end{cases} \quad (B.1)$$
with $k = 0, 1, \cdots, N-1$. Note that $\theta^{(\ell)}$ should be a vector whose dimension matches the number of neurons in that layer, $\theta^{(\ell)} = [\theta^{(\ell)}]$. We can rewrite Eq. (B.1) as
$$\mathrm{SlipReLU}(z^{(\ell)}) = y_{\mathrm{temp}} + c \left( z_{\mathrm{temp}} - y_{\mathrm{temp}} \right) = c\, z_{\mathrm{temp}} + (1-c)\, y_{\mathrm{temp}} ,$$
where
$$z_{\mathrm{temp}} = \theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}}, 0, 1 \right), \qquad y_{\mathrm{temp}} = \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z_{\mathrm{temp}}}{\theta^{(\ell)}} \right\rfloor .$$
Here
$$y_{\mathrm{temp}} = \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z_{\mathrm{temp}}}{\theta^{(\ell)}} \right\rfloor \iff y_{\mathrm{temp}} = \frac{\theta^{(\ell)}}{N} \left\lfloor N \,\mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}}, 0, 1 \right) \right\rfloor \iff y_{\mathrm{temp}} = \theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} \right\rfloor, 0, 1 \right) .$$
Then Eq. (B.1) can be written as
$$a^{(\ell)} = \mathrm{SlipReLU}(z^{(\ell)}) = c\, z_{\mathrm{temp}} + (1-c)\, y_{\mathrm{temp}} = c\theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}}, 0, 1 \right) + (1-c)\theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} \right\rfloor, 0, 1 \right) ,$$
which is the SlipReLU activation function in Eq. (9),
$$a^{(\ell)} = F^{ANN}(z^{(\ell)}) = c\theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}}, 0, 1 \right) + (1-c)\theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} \right\rfloor, 0, 1 \right) .$$
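As an illustrative check (function names are ours, not the paper's), the piecewise definition in Eq. (B.1) and the clip-floor closed form derived above can be implemented side by side and compared numerically:

```python
import math

def clip(x, lo, hi):
    return min(max(x, lo), hi)

def sliprelu_piecewise(z, c, theta, N):
    """SlipReLU as defined piecewise in Eq. (B.1)."""
    if z < 0:
        return 0.0
    if z >= theta:
        return theta
    k = math.floor(N * z / theta)   # index of the quantization interval, k in 0..N-1
    return c * z + (1 - c) * k * theta / N

def sliprelu_closed(z, c, theta, N):
    """SlipReLU in the closed clip-floor form of Eq. (9)."""
    return (c * theta * clip(z / theta, 0, 1)
            + (1 - c) * theta * clip(math.floor(N * z / theta) / N, 0, 1))

# the two forms agree on a grid of inputs (theta = 1 keeps float arithmetic exact)
for i in range(-50, 151):
    z = i / 100.0
    for c in [0.0, 0.3, 1.0]:
        for N in [1, 2, 4]:
            assert abs(sliprelu_piecewise(z, c, 1.0, N) - sliprelu_closed(z, c, 1.0, N)) < 1e-12
```

Note how the outer clip in the closed form absorbs the two saturated branches of Eq. (B.1): a negative floor is clipped to 0, and a floor of at least $N$ is clipped to 1.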

B.2 SLIPRELU ACTIVATION FUNCTION WITH DIFFERENT SHIFT MODES

Derivation of the SlipReLU extension in Eq. (10) with shift. As mentioned in Sect. 4, the SlipReLU activation function in Eq. (9) is a weighted combination of the threshold-ReLU (first part) and the step function (second part), with the slope $0 \le c \le 1$ balancing the weight; any shift applied to these two parts therefore shifts the SlipReLU activation function. The SlipReLU extension in Eq. (10) can be formulated as
$$a^{(\ell)} = F^{ANN}(z^{(\ell)}) = c\theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}} + \delta_1, 0, 1 \right) + (1-c)\theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) ,$$
where the shift terms satisfy $\delta_1 \in [-N, 0]$ and $\delta \in [-\frac{1}{2}, \frac{1}{2}]$ for the source ANN, and $\delta_1 = [\delta_1]$, $\delta = [\delta]$. Here we list two examples of the proposed SlipReLU with different shift modes.

Mode 0. We set $\delta_1 = \delta = 0$; then
$$a^{(\ell)} = F^{ANN}(z^{(\ell)}) = c\theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}}, 0, 1 \right) + (1-c)\theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} \right\rfloor, 0, 1 \right) .$$

Mode 1. We set $\delta_1 = 0$, $\delta = \frac{1}{2}$; then
$$a^{(\ell)} = F^{ANN}(z^{(\ell)}) = c\theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}}, 0, 1 \right) + (1-c)\theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \left[\tfrac{1}{2}\right] \right\rfloor, 0, 1 \right) .$$

B.3 SPECIAL CASES OF THE SLIPRELU ACTIVATION FUNCTION

Here we list four special cases of the proposed SlipReLU.

Threshold-ReLU. When $c = 1$ and $\delta_1 = 0$, the SlipReLU becomes the threshold-ReLU activation function studied in Deng & Gu (2021).

Shift-threshold-ReLU. When $c = 1$ and $\delta_1 = -1/(2N)$, the SlipReLU becomes the shift-threshold-ReLU activation function studied in Deng & Gu (2021).

Quantization clip-floor (QCF). When $c = 0$ and $\delta = 0$, the SlipReLU becomes the quantization clip-floor (QCF) activation function studied in Bu et al. (2021).

Quantization clip-floor-shift (QCFS). When $c = 0$ and $\delta = 1/2$, the SlipReLU becomes the quantization clip-floor-shift (QCFS) activation function studied in Bu et al. (2021).
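To make these reductions concrete, here is a small numerical check (with illustrative names of our own) that the shifted SlipReLU of Eq. (10) collapses to the threshold-ReLU when $c = 1$, $\delta_1 = 0$, and to QCFS when $c = 0$, $\delta = 1/2$:

```python
import math

def clip(x, lo, hi):
    return min(max(x, lo), hi)

def sliprelu_shift(z, c, theta, N, d1=0.0, d=0.0):
    """Shifted SlipReLU, Eq. (10), with shift terms d1 and d."""
    return (c * theta * clip(z / theta + d1, 0, 1)
            + (1 - c) * theta * clip(math.floor(N * z / theta + d) / N, 0, 1))

def threshold_relu(z, theta):
    """Threshold-ReLU (Deng & Gu, 2021): the c = 1, d1 = 0 special case."""
    return theta * clip(z / theta, 0, 1)

def qcfs(z, theta, N):
    """Quantization clip-floor-shift (Bu et al., 2021): the c = 0, d = 1/2 special case."""
    return theta * clip(math.floor(N * z / theta + 0.5) / N, 0, 1)

for i in range(-50, 151):
    z = i / 100.0
    assert abs(sliprelu_shift(z, 1.0, 1.0, 4) - threshold_relu(z, 1.0)) < 1e-12
    assert abs(sliprelu_shift(z, 0.0, 1.0, 4, d=0.5) - qcfs(z, 1.0, 4)) < 1e-12
```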

C PROOF OF THEOREMS

Before we prove Theorem 1 and Theorem 2, we first introduce an important lemma.

Lemma 1. If a random variable $x \in [0, \theta]$ is uniformly distributed in every small interval $(m_t, m_{t+1})$ with probability $p_t$ ($t = 0, 1, \cdots, T$), where $m_0 = 0$, $m_{T+1} = \theta$, $m_t = \frac{(2t-1)\theta}{2T}$ for $t = 1, 2, \cdots, T$, and $p_0 = p_T$, then for any value $\delta \in [-\frac{1}{2}, \frac{1}{2}]$ we can conclude that
$$\mathbb{E}_x\left[ x - \frac{\theta}{T} \left\lfloor \frac{Tx}{\theta} + \delta \right\rfloor \right] = 0 . \quad (C.1)$$

Proof. We consider $x$ in the different small intervals $(m_t, m_{t+1})$.
(1) For $x \in \left(0, \frac{\theta}{2T}\right)$:
$$0 < x < \frac{\theta}{2T} \iff \delta < \frac{Tx}{\theta} + \delta < \frac{1}{2} + \delta \implies \left\lfloor \frac{Tx}{\theta} + \delta \right\rfloor = 0 .$$
(2) For $x \in \left(\frac{(2t-1)\theta}{2T}, \frac{(2t+1)\theta}{2T}\right)$ and $t = 1, 2, \cdots, T-1$:
$$\frac{(2t-1)\theta}{2T} < x < \frac{(2t+1)\theta}{2T} \iff t - \frac{1}{2} + \delta < \frac{Tx}{\theta} + \delta < t + \frac{1}{2} + \delta \implies \left\lfloor \frac{Tx}{\theta} + \delta \right\rfloor = t .$$
(3) For $x \in \left(\frac{(2T-1)\theta}{2T}, \theta\right)$:
$$\frac{(2T-1)\theta}{2T} < x < \theta \implies \left\lfloor \frac{Tx}{\theta} + \delta \right\rfloor = T .$$
Therefore,
$$\mathbb{E}_x\left[ x - \frac{\theta}{T} \left\lfloor \frac{Tx}{\theta} + \delta \right\rfloor \right] = p_0 \int_0^{\theta/2T} x \,\mathrm{d}x + \sum_{t=1}^{T-1} p_t \int_{(2t-1)\theta/2T}^{(2t+1)\theta/2T} \left( x - \frac{t\theta}{T} \right) \mathrm{d}x + p_T \int_{(2T-1)\theta/2T}^{\theta} (x - \theta) \,\mathrm{d}x = p_0 \frac{\theta^2}{8T^2} + 0 - p_T \frac{\theta^2}{8T^2} = 0 ,$$
where the last equality uses $p_0 = p_T$ and the fact that each middle integral vanishes by symmetry of the interval around $\frac{t\theta}{T}$.

Lemma 2. Let $\mathcal{P}$ be a probability distribution on $\mathbb{R}^m$. If a random variable $z \in \mathbb{R}^m$ with $z \sim \mathcal{P}$, and a function $g : z \mapsto g(z) \in \mathbb{R}^n$ satisfies $g(z) \ge 0$ almost surely for all $z \in \mathcal{D}$ and $\mathbb{E}_z[|g(z)|] = 0$, then we have
$$\mathbb{E}_z\left[ \|g(z)\|_2 \right] = 0 .$$

Proof. By the definition of the $L_2$-norm, we have
$$\|g(z)\|_2 = \sqrt{g_1^2(z) + g_2^2(z) + \cdots + g_n^2(z)} \le |g_1(z)| + |g_2(z)| + \cdots + |g_n(z)| .$$
Then we can get
$$\mathbb{E}_z\left[ \|g(z)\|_2 \right] \le \mathbb{E}_z|g_1(z)| + \mathbb{E}_z|g_2(z)| + \cdots + \mathbb{E}_z|g_n(z)| = \mathbb{E}_z[g_1(z)] + \mathbb{E}_z[g_2(z)] + \cdots + \mathbb{E}_z[g_n(z)] = 0 ,$$
and since the norm is non-negative, $\mathbb{E}_z[\|g(z)\|_2] = 0$.
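Eq. (C.1) can be spot-checked numerically in the special case $\delta = 1/2$ (the QCFS shift), with $x$ uniform on $[0, \theta]$, a distribution satisfying the lemma's $p_0 = p_T$ assumption. This check and its function names are ours, not the paper's:

```python
import math

def expected_residual(theta, T, delta, n=200000):
    """Midpoint-grid estimate of E_x[x - (theta/T) * floor(T*x/theta + delta)]
    for x ~ Uniform[0, theta]."""
    total = 0.0
    for i in range(n):
        x = theta * (i + 0.5) / n
        total += x - (theta / T) * math.floor(T * x / theta + delta)
    return total / n

# the expectation vanishes at delta = 1/2 for any T
for T in [2, 4, 8]:
    assert abs(expected_residual(1.0, T, 0.5)) < 1e-3
```

Within each interval centered at $t\theta/T$ the residual is odd around the center, so positive and negative areas cancel; the two boundary intervals cancel against each other because they carry equal probability.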

C.1 PROOF OF THEOREM 1

For Theorem 1, we need to prove
$$\forall\, T, L: \quad \mathbb{E}_z\left[\mathrm{Err}^{(\ell)}\right]\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} = 0 .$$

Proof. The activation function of the SNN is
$$F^{SNN}(z^{(\ell)}) = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) .$$
For $c = 0$, the SlipReLU activation function used in the source ANN becomes
$$F^{ANN}(z^{(\ell)}) = \theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) .$$
With $V_{th}^{(\ell)} = \theta^{(\ell)}$, in the non-saturated range the error becomes
$$\mathrm{Err}^{(\ell)} = F^{ANN}(z^{(\ell)}) - F^{SNN}(z^{(\ell)}) = \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor .$$
Then
$$\mathbb{E}_z\left[\mathrm{Err}^{(\ell)}\right]\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} \le \mathbb{E}_z\left| \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - z^{(\ell)} \right|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} + \mathbb{E}_z\left| z^{(\ell)} - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor \right| .$$
Denote by $v_i^{(\ell)}(0)$ and $z_i^{(\ell)}$ the $i$-th elements of the vectors $v^{(\ell)}(0)$ and $z^{(\ell)}$, and $\delta = [\delta]$. We consider every element of the vector $z^{(\ell)}$:
$$\mathbb{E}_{z_i}\left| \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z_i^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z_i^{(\ell)} + v_i^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor \right| \le \mathbb{E}_{z_i}\left| \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z_i^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - z_i^{(\ell)} \right|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} + \mathbb{E}_{z_i}\left| z_i^{(\ell)} - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z_i^{(\ell)} + v_i^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor \right| . \quad (C.2)$$
Then, according to Lemma 1, we have
$$\mathbb{E}_{z_i}\left[ \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z_i^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - z_i^{(\ell)} \right]\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} = 0 , \qquad \mathbb{E}_{z_i}\left[ z_i^{(\ell)} - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z_i^{(\ell)} + v_i^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor \right]\Big|_{v_i^{(\ell)}(0) = \delta V_{th}^{(\ell)}} = 0 .$$
This holds for any shift value $\delta$ in the ANN with $-\frac{1}{2} \le \delta \le \frac{1}{2}$, which gives the conclusion of Theorem 1:
$$\forall\, T, L: \quad \mathbb{E}_z\left[\mathrm{Err}^{(\ell)}\right]\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} = 0 .$$

C.2 PROOF OF THEOREM 2

For Theorem 2, we need to prove
$$\forall\, T, L: \quad \mathbb{E}_z\left[\mathrm{Err}^{(\ell)}\right]\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} \le \frac{c \, (V_{th}^{(\ell)})^2}{4T} . \quad (C.3)$$

Proof. The activation function of the SNN is
$$F^{SNN}(z^{(\ell)}) = V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) .$$
For arbitrary $c \in [0, 1]$, the SlipReLU activation function used in the source ANN becomes
$$F^{ANN}(z^{(\ell)}) = c\theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}} + \delta_1, 0, 1 \right) + (1-c)\theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) .$$
With $V_{th}^{(\ell)} = \theta^{(\ell)}$, the error becomes
$$\mathrm{Err}^{(\ell)} = F^{ANN}(z^{(\ell)}) - F^{SNN}(z^{(\ell)}) = c\left[ \theta^{(\ell)} \mathrm{clip}\left( \frac{z^{(\ell)}}{\theta^{(\ell)}} + \delta_1, 0, 1 \right) - V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) \right] + (1-c)\left[ \theta^{(\ell)} \mathrm{clip}\left( \frac{1}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor, 0, 1 \right) - V_{th}^{(\ell)} \mathrm{clip}\left( \frac{1}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor, 0, 1 \right) \right] .$$
In the non-saturated range (where the clips are inactive), with $v^{(\ell)}(0) = V_{th}^{(\ell)}\delta$ and $V_{th}^{(\ell)} = \theta^{(\ell)}$, this becomes
$$\mathrm{Err}^{(\ell)} = \underbrace{c\left[ z^{(\ell)} + V_{th}^{(\ell)}\delta_1 - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)}}{V_{th}^{(\ell)}} + \delta \right\rfloor \right]}_{\triangleq \, c \cdot \mathrm{Err}_1 \quad (C.4)} + \underbrace{(1-c)\left[ \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor \right]}_{\triangleq \, (1-c) \cdot \mathrm{Err}_2 \quad (C.5)} .$$
Then
$$\mathrm{Err}^{(\ell)} = c \cdot \mathrm{Err}_1 + (1-c) \cdot \mathrm{Err}_2 \implies \left|\mathrm{Err}^{(\ell)}\right| = \left| c \cdot \mathrm{Err}_1 + (1-c) \cdot \mathrm{Err}_2 \right| \le c\,|\mathrm{Err}_1| + (1-c)\,|\mathrm{Err}_2| ,$$
so we can minimize the whole error by minimizing each of the two terms. Let $\delta_1 = \frac{\phi + \delta}{T}$. For Eq. (C.4), we have
$$|\mathrm{Err}_1| = \left| z^{(\ell)} + \frac{V_{th}^{(\ell)}}{T}(\phi + \delta) - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)}}{V_{th}^{(\ell)}} + \delta \right\rfloor \right| = \left| z^{(\ell)} + \frac{V_{th}^{(\ell)}}{T}\phi - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)}}{V_{th}^{(\ell)}} \right\rfloor \right| . \quad (C.6)$$
Here $z^{(\ell)} + \frac{V_{th}^{(\ell)}}{T}\phi$ is the activation function of the ANN, and $\frac{V_{th}^{(\ell)}}{T} \lfloor \frac{T z^{(\ell)}}{V_{th}^{(\ell)}} \rfloor$ is the step activation function of the SNN. Eq. (C.6) recovers the loss between the shift-threshold-ReLU (with a shift value $\phi$) and the step function, the same as in Deng & Gu (2021). As shown in (A) of Fig. 1, the conversion error is the shaded area: the error between the activation function (of the ANN) and the step function (of the SNN) is obtained by summing up all of the shaded areas, which is the ANN-SNN conversion error. The objective then becomes
$$\min_\phi \mathbb{E}_z|\mathrm{Err}_1| = \min_\phi \frac{T}{2}\left[ \left( \frac{V_{th}^{(\ell)}}{T} + \frac{V_{th}^{(\ell)}}{T}\phi \right)^2 + \left( \frac{V_{th}^{(\ell)}}{T}\phi \right)^2 \right] = \frac{(V_{th}^{(\ell)})^2}{4T} \implies \phi = -\frac{1}{2} .$$
Then
$$\delta_1 = \frac{-1/2 + \delta}{T} , \quad (C.7)$$
and the minimum of the first error term becomes $\mathbb{E}_z(|\mathrm{Err}_1|) = \frac{(V_{th}^{(\ell)})^2}{4T}$.

For Eq. (C.5), with $v^{(\ell)}(0) = V_{th}^{(\ell)}\delta$ and $V_{th}^{(\ell)} = \theta^{(\ell)}$, we have
$$|\mathrm{Err}_2| = \left| \frac{\theta^{(\ell)}}{N} \left\lfloor \frac{N z^{(\ell)}}{\theta^{(\ell)}} + \delta \right\rfloor - \frac{V_{th}^{(\ell)}}{T} \left\lfloor \frac{T z^{(\ell)} + v^{(\ell)}(0)}{V_{th}^{(\ell)}} \right\rfloor \right| .$$
From Lemma 1 and Theorem 1, we have
$$\mathbb{E}_z(|\mathrm{Err}_2|)\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} = 0 .$$
Then
$$\mathbb{E}_z\left[\mathrm{Err}^{(\ell)}\right] = \mathbb{E}_z\left[ F^{ANN}(z^{(\ell)}) - F^{SNN}(z^{(\ell)}) \right] \le c \cdot \mathbb{E}_z(|\mathrm{Err}_1|) + (1-c) \cdot \mathbb{E}_z(|\mathrm{Err}_2|) = c \cdot \frac{(V_{th}^{(\ell)})^2}{4T} + (1-c) \cdot 0 = \frac{c\,(V_{th}^{(\ell)})^2}{4T} .$$
This concludes Theorem 2:
$$\forall\, T, L: \quad \mathbb{E}_z\left[\mathrm{Err}^{(\ell)}\right]\Big|_{\delta \in [-\frac{1}{2}, \frac{1}{2}]} \le \frac{c\,(V_{th}^{(\ell)})^2}{4T} .$$
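The minimized first error term of Eqs. (C.6)-(C.7) can be spot-checked numerically. With $V_{th} = 1$ and $z$ uniform on $[0, 1]$ (assumptions of this check, not stated in the paper), the expected error at the optimal shift $\phi = -1/2$ evaluates to $V_{th}^2/(4T) = 1/(4T)$; the function name is ours:

```python
import math

def mean_abs_err1(T, phi, n=200000):
    """Midpoint-grid estimate of E_z|z + phi/T - (1/T)*floor(T*z)|
    for z ~ Uniform[0, 1] and V_th = 1."""
    total = 0.0
    for i in range(n):
        z = (i + 0.5) / n
        total += abs(z + phi / T - math.floor(T * z) / T)
    return total / n

# at the optimal shift phi = -1/2, the expected error is 1/(4T)
for T in [2, 4, 8]:
    assert abs(mean_abs_err1(T, -0.5) - 1.0 / (4 * T)) < 1e-3
```

Intuitively, with $\phi = -1/2$ the residual within each quantization cell is a centered sawtooth of height $1/(2T)$, whose mean absolute value is $1/(4T)$; any other $\phi$ shifts the sawtooth off center and increases the expectation.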

D PSEUDO-CODE FOR THE UNIFIED ANN-SNN CONVERSION ALGORITHM

Here is the pseudo-code for our proposed unified ANN-SNN conversion algorithm (continuing the listing from the algorithm box):

10        x^{(ℓ)} = SlipReLU(W^{(ℓ)} x^{(ℓ-1)}; N, θ^{(ℓ)})
11      Loss = CrossEntropy(x^{(ℓ)}, y)
12      for ℓ = 1 to f_ANN.layers do
13        W^{(ℓ)} ← W^{(ℓ)} − ϵ ∂Loss/∂W^{(ℓ)}
14        θ^{(ℓ)} ← θ^{(ℓ)} − ϵ ∂Loss/∂θ^{(ℓ)}
15  for ℓ = 1 to f_ANN.layers do
16    f_SNN.W^{(ℓ)} ← f_ANN.W^{(ℓ)}
17    f_SNN.V_th^{(ℓ)} ← f_ANN.θ^{(ℓ)}
18    f_SNN.v^{(ℓ)}(0) ← f_SNN.V_th^{(ℓ)} × δ
19  Return f_SNN

E EXPERIMENTS DETAILS

E.1 NETWORK STRUCTURE AND TRAINING SETUPS

There are three steps in our proposed ANN-SNN conversion: Step 1, tailor the ANN; Step 2, train the tailored ANN; Step 3, convert the trained ANN to an SNN. In the first step, we replace max-pooling with average-pooling and then replace the ReLU activation with the proposed SlipReLU activation function; the tailored ANN is also called the source ANN. In the second step, we train the tailored ANN. After training, we copy all weights from the trained source ANN to the converted SNN, and set the threshold V_th^{(ℓ)} in each layer of the converted SNN equal to the threshold value θ^{(ℓ)} of the source ANN in the same layer. Besides, we set the initial membrane potential v^{(ℓ)}(0) in the converted SNN to V_th^{(ℓ)}δ to match the shift δ of the SlipReLU activation in the tailored source ANN, where δ can be any value in the interval [−1/2, 1/2]. Common data normalization and data pre-processing techniques are used in the experiments. For example, we resize the images in the CIFAR-10/CIFAR-100 datasets to 32 × 32; in addition, random cropping, Cutout (DeVries & Taylor, 2017), and AutoAugment (Cubuk et al., 2019) are used for all datasets. The Stochastic Gradient Descent (SGD) optimizer (Bottou, 2012) is used with a momentum parameter of 0.9. We set the initial learning rate to ϵ = 0.1 for CIFAR-10 and CIFAR-100, and use a cosine decay scheduler (Loshchilov & Hutter, 2017) to adjust the learning rate, with a weight decay of 5 × 10⁻⁴ for the CIFAR-10/CIFAR-100 datasets. All models are trained for 300 epochs. For small quasi-latency N = 1 and N = 2, models that cannot be trained properly with learning rate ϵ = 0.1 use an initial learning rate of 0.05 on CIFAR-10/CIFAR-100.
We train all the networks on the CIFAR-10/CIFAR-100 datasets with two different settings: we set the quasi-latency N = 1 with slope c = 0.3, 0.4, 0.5 for low-latency inference (T ⩽ 8), and we set the quasi-latency N = 4 with slope c = 0.9 for latency T > 8. We set δ₁ = 0, δ = 1/2 for the SlipReLU activation for all models and all datasets. As for the input to the first layer and the output of the last layer of the SNN, we do not employ any spiking mechanism, as in Li et al. (2021). We directly encode the static image into temporally dynamic spikes as input to the first layer, which prevents the undesired information loss introduced by Poisson encoding. For the last-layer output, we only integrate the pre-synaptic input and do not fire any spikes. We use constant input when evaluating the converted SNNs.
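The conversion step described above (copy the weights, set V_th^{(ℓ)} = θ^{(ℓ)}, and initialize v^{(ℓ)}(0) = V_th^{(ℓ)}δ) can be sketched as follows. The dict-based layer representation is an illustrative assumption of ours, not the paper's actual code:

```python
def convert_ann_to_snn(ann_layers, delta):
    """Build SNN layer parameters from a trained tailored ANN.

    ann_layers: list of {'W': weights, 'theta': trained threshold} per layer
                (illustrative format, assumed for this sketch).
    delta: shift value in [-1/2, 1/2] used by the SlipReLU activation.
    """
    assert -0.5 <= delta <= 0.5
    snn_layers = []
    for layer in ann_layers:
        v_th = layer['theta']              # V_th^{(l)} <- theta^{(l)}
        snn_layers.append({
            'W': layer['W'],               # weights are copied unchanged
            'V_th': v_th,
            'v0': v_th * delta,            # initial membrane potential v^{(l)}(0) = V_th * delta
        })
    return snn_layers

snn = convert_ann_to_snn([{'W': [[1.0, -0.5]], 'theta': 2.0}], delta=0.5)
assert snn[0]['V_th'] == 2.0 and snn[0]['v0'] == 1.0
```

The only trainable quantities carried over are the weights and the per-layer thresholds; the shift δ enters the SNN solely through the initial membrane potential.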

E.2 INTRODUCTION OF DATASETS

CIFAR-10: The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 60,000 32 × 32 color images in 10 classes of objects such as airplanes, cars, and birds, with 6,000 images per class. There are 50,000 samples in the training set and 10,000 samples in the test set.

CIFAR-100:

The CIFAR-100 dataset (Krizhevsky & Hinton, 2009) consists of 60,000 32 × 32 color images in 100 classes, with 600 images per class. There are 50,000 samples in the training set and 10,000 samples in the test set.

Tiny-ImageNet: Tiny-ImageNet (Le & Yang, 2015) is a subset of ImageNet-1k (Russakovsky et al., 2015) with 200 classes. The training data contains a total of 100,000 images, 500 images per class; the test data contains a total of 10,000 images, 50 images per class. The dimension of the images is 64 × 64.

F COMPARISON WITH THE STATE-OF-THE-ART SUPERVISED TRAINING METHODS ON CIFAR-10 DATASET

Our proposed SlipReLU method is comparable with other state-of-the-art supervised training methods in terms of ultra-low-latency performance. As reported in Table S2, our approach for CIFARNet achieves an accuracy of 95.31% with time-step T = 4, higher than any other supervised trained model. Back-propagation requires a sufficient number of time-steps to train an SNN directly. The hybrid training method involves training the converted SNN model with back-propagation as a second step, yet it still requires 200 time-steps to achieve good accuracy, which is high compared to the time-steps required by our method. For VGG-16, the hybrid training method requires 200 time-steps to obtain 92.03% accuracy, whereas our method achieves 91.08% accuracy with 4 time-steps.

G RESULTS ON CIFAR-100 DATASET AND TINY-IMAGENET DATASET

We report the results on CIFAR-100 in Table S3 and the results on Tiny-ImageNet in Table S4. From Table S3, we see that our SlipReLU method also outperforms the others in terms of both high accuracy and ultra-low latency. For VGG-16, the proposed method achieves an accuracy of 64.21%, which is 29.1% higher than QCFS and 39.98% higher than SNNC-AP, when the time-step is only 1. For ResNet-18, when T = 1, we can still achieve an accuracy of 71.51%. These results demonstrate that our method outperforms the previous conversion methods. Training an SNN on a large-scale dataset such as Tiny-ImageNet is considered one of the challenges in the SNN literature. From Table S4, our SlipReLU method outperforms the other baselines in terms of SNN accuracy when the time-step T ⩽ 4. When the time-step is 1, for ResNet-34 our SlipReLU method achieves an accuracy of 40.55%, which is 6.94% higher than the baseline QCFS (33.61%).
For VGG-16, our SlipReLU method outperforms the other baseline methods with an accuracy of 43.73% when the time-step is 1, which is 12.14% higher than the baseline QCFS (31.59%) and 32.93% higher than the baseline SNNC-AP (10.80%).

H COMPARISON OF SLIPRELU AND SLIPRELU-SHIFT ACTIVATION

Here we further conduct ablation studies on SlipReLU and SlipReLU-shift, by comparing the performance of SNNs converted from ANNs with the SlipReLU activation and from ANNs with the SlipReLU-shift activation. In Sect. 4, we prove that for arbitrary T and N, the expectation of the conversion error reaches 0 with the SlipReLU-shift activation function when c = 0. We also prove that for arbitrary T and N and arbitrary c ∈ [0, 1], the expectation of the conversion error of the proposed unified method reaches the optimal c(V_th^{(ℓ)})²/(4T). To verify these results, we set N = 1, 2, 4, 8, 16, 32 and train ANNs with the SlipReLU activation and the SlipReLU-shift activation, respectively. Fig. S1 shows how the accuracy of the converted SNNs changes with the time-step T under different quasi-latency settings N. The accuracy of SNNs converted from ANNs with the SlipReLU activation (first and third columns) first increases or stays flat for time-steps T ⩽ 4, and then decreases rapidly as the time-step grows, because we cannot guarantee that the conversion error is zero when c ≠ 0; the best performance is still lower than that of the SlipReLU-shift activation. The non-shifted SlipReLU activation thus shows no advantage for ultra-low-latency inference (T ⩽ 4). In contrast, the accuracy of SNNs converted from ANNs with the SlipReLU-shift activation (second and fourth columns) increases with the time-step T and converges to the same accuracy once the time-step exceeds 16. The SlipReLU-shift activation shows clear advantages for ultra-low-latency inference (T ⩽ 4).

I EFFECT OF THE SLOPE c AND THE QUASI-LATENCY N

In our SlipReLU method, the slope c balances the weight between the threshold-ReLU and the step function, which affects the accuracy of the converted SNN. To analyze the effect of c and determine its optimal value, we train VGG-16/ResNet-20 networks with quasi-latency N = 1, 2, 4, 8, 16, 32, and then convert the trained networks to SNNs. The experimental results on the CIFAR-10/100 datasets are shown in Fig. S2, where each colored curve shows the effect of the slope c on SNN accuracy over different time-steps/latencies T, under different quasi-latency settings. Table S5, Table S6, and Table S7 contain the detailed data used to plot the curves.

J FUTURE STUDY

Remark 2. Our unified conversion framework exploits both the one-step and the two-step conversion mechanisms. The one-step conversion method uses a pre-trained source ANN, e.g., Li et al. (2021); the two-step conversion method instead redesigns the activation function of the ANN to obtain a tailored source ANN, trains it, and converts it to an SNN, e.g., Deng & Gu (2021); Bu et al. (2021).

Remark 3. Usually, implicit variables of an optimization problem are variables that do not need to be optimized but are used to model feasibility conditions (Mehlitz & Benko, 2021); they are often interpreted as explicit ones (Mehlitz & Benko, 2021) by using unions of image sets associated with given set-valued mappings to make the implicit variables explicit. This could be an interesting direction for future work, but it is not the focus of this paper. As mentioned in Sect. 3.1, the multi-step output feature of SNNs implies that higher-latency outputs depend on the outputs of all previous time-steps, which can be explored through multi-task learning. It is therefore reasonable to use multi-task learning for ANN-SNN conversion, where the different time-steps can be seen as different but related tasks.




Remark 1. (A) An effective activation function F^ANN of the tailored ANN should address the performance loss caused by the deviation from the regular ReLU. (B) When F^ANN is designed by considering the deviation from the regular ReLU, the layer-wise error in Eq. (6) can come from a mismatch in any of three parts: (1) different activation values in the source ANN and the target SNN, i.e. a^{(ℓ)} and x̄^{(ℓ)}; (2) different activation functions, i.e. F^ANN(·) and F^SNN(·); and (3) the latency variable T, which implicitly affects both the activation values and the activation functions. (C) Whenever the conversion error E_z(|Err^{(ℓ)}|)


Algorithm for ANN-SNN conversion.
Input: ANN model structure f_ANN(x; W) with initial weights W = {W^{(ℓ)}}; quasi-latency N; shift value δ from the interval [−1/2, 1/2]; initial dynamic thresholds θ = {θ^{(ℓ)}}; learning rate ϵ.
Output: SNN model f_SNN(x; W).
Data: Dataset D.
1  for ℓ = 1 to f_ANN.layers do
2    if is ReLU activation then
3      Replace ReLU(x) by SlipReLU(x; N, θ^{(ℓ)})
4    if is MaxPooling layer then
5      Replace MaxPooling layer by AvgPooling layer
6  for e = 1 to epochs do
7    for length of Dataset D do
8      Sample minibatch {(x^{(0)}, y)} from D
9      for ℓ = 1 to f_ANN.layers do

Table S2 compares the results of the proposed models against the state-of-the-art supervised training methods on the CIFAR-10 dataset. These state-of-the-art supervised training methods include Hybrid-Conversion (HC) from Rathi et al. (2020), STBP from Wu et al. (2018), TSSL from Zhang & Li (2020), and GDDP from Zheng et al. (2021), which are back-propagation or hybrid training methods.

Figure S2: Effect of different slopes c with different quasi-latency N on CIFAR-10 and CIFAR-100.

Comparison between the proposed SlipReLU method and previous works on CIFAR-10.


Summary of notations in this paper.

Comparison with state-of-the-art supervised training methods on CIFAR-10 dataset.

Comparison between the proposed SlipReLU method and previous works on CIFAR-100.

Comparison between the proposed SlipReLU and other methods on Tiny-ImageNet.

Influence of different slope c with the quasi-latency N = 1.

Influence of different slope c with the quasi-latency N = 2.

Influence of different slope c with the quasi-latency N = 4.

