HOYER REGULARIZER IS ALL YOU NEED FOR ULTRA LOW-LATENCY SPIKING NEURAL NETWORKS

Abstract

Spiking neural networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps, which hinders their deployment in real-time use cases, or increase the training complexity significantly. To mitigate this concern, we present a training framework (from scratch) for one-time-step SNNs that uses a novel variant of the recently proposed Hoyer regularizer. We estimate the threshold of each SNN layer as the Hoyer extremum of a clipped version of its activation map, where the clipping threshold is trained using gradient descent with our Hoyer regularizer. This approach not only down-scales the value of the trainable threshold, thereby emitting a large number of spikes for weight updates within the limited number of iterations (due to only one time step), but also shifts the membrane potential values away from the threshold, thereby mitigating the effect of noise that can degrade the SNN accuracy. Our approach outperforms existing spiking, binary, and adder neural networks in terms of the accuracy-FLOPs trade-off for complex image recognition tasks. Downstream experiments on object detection also demonstrate the efficacy of our approach. Code will be made publicly available.

1. INTRODUCTION & RELATED WORKS

Due to their high activation sparsity and use of cheap accumulate (AC) operations instead of energy-expensive multiply-and-accumulate (MAC) operations, spiking neural networks (SNNs) have emerged as a promising low-power alternative to compute- and memory-expensive deep neural networks (DNNs) (Indiveri et al., 2011; Pfeiffer et al., 2018; Cao et al., 2015). Because SNNs receive and transmit information via spikes, analog inputs have to be encoded as a sequence of spikes using techniques such as rate coding (Diehl et al., 2016), temporal coding (Comsa et al., 2020), direct encoding (Rathi et al., 2020a), and rank-order coding (Kheradpisheh et al., 2018). In addition to accommodating various forms of spike encoding, supervised training algorithms for SNNs have overcome various roadblocks associated with the discontinuous spike activation function (Lee et al., 2016; Kim et al., 2020). Moreover, previous SNN efforts propose batch normalization (BN) techniques (Kim et al., 2020; Zheng et al., 2021) that leverage the temporal dynamics with rate/direct encoding. However, most of these efforts require multiple time steps, which increases training and inference costs compared to non-spiking counterparts for static vision tasks. The training effort is high because backpropagation must integrate the gradients over an SNN that is unrolled once for each time step (Panda et al., 2020). Moreover, the multiple forward passes result in an increased number of spikes, which degrades the SNN's energy efficiency, both during training and inference, and possibly offsets the compute advantage of the ACs. The multiple time steps also increase the inference complexity because of the need for input encoding logic and the increased latency associated with requiring one forward pass per time step. To mitigate these concerns, we propose one-time-step SNNs that do not require any non-spiking DNN pre-training and are more compute-efficient than existing multi-time-step SNNs.
Without any temporal overhead, these SNNs are similar to vanilla feed-forward DNNs with Heaviside activation functions (McCulloch & Pitts, 1943). They are also similar to sparsity-induced or uni-polar binary neural networks (BNNs) (Wang et al., 2020b) that have 0 and 1 as their two states. However, these BNNs do not yield SOTA accuracy like the bi-polar BNNs (Diffenderfer & Kailkhura, 2021) that have 1 and -1 as their two states. A recent SNN work (Chowdhury et al., 2021) also proposed the use of one time step; however, it required CNN pre-training, followed by iterative SNN training from 5 down to 1 time steps, significantly increasing the training complexity, particularly for ImageNet-level tasks. Note that there have been significant efforts in the SNN community to reduce the number of time steps via optimal DNN-to-SNN conversion (Bu et al., 2022b; Deng et al., 2021), the lottery ticket hypothesis (Kim et al., 2022c), and neural architecture search (Kim et al., 2022b). However, none of these techniques has been shown to train one-time-step SNNs without significant accuracy loss. Our Contributions: Our training framework is based on a novel application of the Hoyer regularizer and a novel Hoyer spike layer. More specifically, our spike layer threshold is training-input-dependent and is set to the Hoyer extremum of a clipped version of the membrane potential tensor, where the clipping threshold (which existing SNNs use directly as the firing threshold) is trained using gradient descent with our Hoyer regularizer. In this way, compared to SOTA one-time-step non-iteratively trained SNNs, our threshold increases the rate of weight updates, and our Hoyer regularizer shifts the membrane potential distribution away from this threshold, improving convergence. We consistently surpass the accuracies obtained by SOTA one-time-step SNNs (Chowdhury et al., 2021) on diverse image recognition datasets with different convolutional architectures, while reducing the average training time by ∼19×.
Compared to binary neural network (BNN) and adder neural network (AddNN) models, our SNN models yield similar test accuracy with a ∼5.5× reduction in the floating-point operations (FLOPs) count, thanks to the extreme sparsity enabled by our training framework. Downstream tasks on object detection also demonstrate that our approach surpasses the test mAP of existing BNNs and SNNs.

2. PRELIMINARIES ON HOYER REGULARIZERS

Based on the interplay between the L1 and L2 norms, a new measure of sparsity was first introduced in (Hoyer, 2004), based on which (Yang et al., 2020) proposed a new regularizer, termed the Hoyer regularizer, for the trainable weights, which was incorporated into the loss term to train DNNs. We adopt the same form of Hoyer regularizer for the membrane potential to train our SNN models as (Kurtz et al., 2020). Here, $\|u_l\|_i$ denotes the $L_i$ norm of the tensor $u_l$, and the superscript $t$ for the time step is omitted for simplicity. Compared to the L1 and L2 regularizers, the Hoyer regularizer is scale-invariant (similar to the L0 regularizer). It is also differentiable almost everywhere, as shown in Eq. 1, where $|u_l|$ denotes the element-wise absolute value of the tensor $u_l$.

$$H(u_l) = \frac{\|u_l\|_1^2}{\|u_l\|_2^2}, \qquad \frac{\partial H(u_l)}{\partial u_l} = \frac{2\,\mathrm{sign}(u_l)\,\|u_l\|_1}{\|u_l\|_2^4}\left(\|u_l\|_2^2 - \|u_l\|_1\,|u_l|\right) \quad (1)$$

Setting the gradient $\frac{\partial H(u_l)}{\partial u_l} = 0$, we estimate the value of the Hoyer extremum as $\mathrm{Ext}(u_l) = \frac{\|u_l\|_2^2}{\|u_l\|_1}$. This extremum is in fact a minimum, because the second derivative is greater than zero for any value of the output element. Training with the Hoyer regularizer thus effectively pushes the activation values that are larger than the extremum ($u_l > \mathrm{Ext}(u_l)$) even larger, and those that are smaller than the extremum ($u_l < \mathrm{Ext}(u_l)$) even smaller.
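For concreteness, Eq. 1 and the extremum can be sketched in a few lines of NumPy (the function names are ours; the paper operates on PyTorch activation tensors, but the arithmetic is identical):

```python
import numpy as np

def hoyer_reg(u):
    """Hoyer regularizer H(u) = ||u||_1^2 / ||u||_2^2 (scale-invariant)."""
    l1 = np.abs(u).sum()
    l2_sq = (u ** 2).sum()
    return l1 ** 2 / l2_sq

def hoyer_grad(u):
    """dH/du = 2*sign(u)*||u||_1 / ||u||_2^4 * (||u||_2^2 - ||u||_1*|u|)."""
    l1 = np.abs(u).sum()
    l2_sq = (u ** 2).sum()
    return 2.0 * np.sign(u) * l1 / l2_sq ** 2 * (l2_sq - l1 * np.abs(u))

def hoyer_ext(u):
    """Stationary point of H: Ext(u) = ||u||_2^2 / ||u||_1."""
    return (u ** 2).sum() / np.abs(u).sum()
```

Note that the gradient is negative for elements above the extremum and positive for elements below it, so gradient descent pushes the two groups apart, exactly the bimodal shift the text describes.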

3. PROPOSED TRAINING FRAMEWORK

Our approach is inspired by the fact that Hoyer regularizers can shift the pre-activation distributions away from the Hoyer extremum in a non-spiking DNN (Yang et al., 2020) . Our principal insight is that setting the SNN threshold to this extremum shifts the distribution of the membrane potentials away from the threshold value, reducing noise and thereby improving convergence. To achieve this goal for one-time-step SNNs we present a novel Hoyer spike layer that sets the threshold based upon a Hoyer regularized training process, as described below.

3.1. HOYER SPIKE LAYER

In this work, we adopt a time-independent variant of the popular Leaky-Integrate-and-Fire (LIF) representation, as illustrated in Eq. 2, to model the spiking neuron with one time step.

$$u_l = w_l\,o_{l-1}, \qquad z_l = \frac{u_l}{v_{th}^l}, \qquad o_l = \begin{cases} 1, & \text{if } z_l \ge 1 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $z_l$ denotes the normalized membrane potential. Such a neuron model with a unit step activation function is difficult to optimize, even with the recently proposed surrogate gradient descent techniques for multi-time-step SNNs (Panda et al., 2020; Panda & Roy, 2016), which either approximate the spiking neuron functionality with a continuous differentiable model or use surrogate gradients to approximate the real gradients. This is because the average number of spikes with only one time step is too low to adjust the weights sufficiently using gradient descent, with only one iteration available per input. In particular, if a pre-synaptic neuron does not emit a spike, the synaptic weight connected to it cannot be updated, because the gradient from neuron $i$ to neuron $j$ is calculated as $g_{u_j} \times o_i$, where $g_{u_j}$ is the gradient of the membrane potential $u_j$ and $o_i$ is the output of neuron $i$. Therefore, it is crucial to reduce the value of the threshold to generate enough spikes for better network convergence. Note that a sufficiently low threshold can generate a spike for every neuron, but that would yield random outputs in the final classifier layer. Previous works (Datta et al., 2021; Rathi et al., 2020a) show that the number of SNN time steps can be reduced by training the threshold term $v_{th}^l$ using gradient descent. However, our experiments indicate that, for one-time-step SNNs, this approach still yields thresholds that produce significant drops in accuracy. In contrast, we propose to dynamically down-scale the threshold (see Fig. 1(a)) based on the membrane potential tensor using our proposed form of the Hoyer regularizer.
In particular, we clip the membrane potential tensor of each convolutional layer to the trainable threshold $v_{th}^l$ obtained from gradient descent with our Hoyer loss, as detailed later in Eq. 11. Unlike existing approaches (Datta & Beerel, 2022; Rathi et al., 2020a) that require $v_{th}^l$ to be initialized from a pre-trained non-spiking model, our approach can be used to train SNNs from scratch with a Kaiming uniform initialization (He et al., 2015) for both the weights and the thresholds. The down-scaled threshold value for each layer is computed as the Hoyer extremum of the clipped membrane potential tensor, as shown in Fig. 1(a) and mathematically defined as follows.

$$z_l^{clip} = \begin{cases} 1, & \text{if } z_l > 1 \\ z_l, & \text{if } 0 \le z_l \le 1 \\ 0, & \text{if } z_l < 0 \end{cases} \qquad o_l = h_s(z_l) = \begin{cases} 1, & \text{if } z_l \ge \mathrm{Ext}(z_l^{clip}) \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

Note that our threshold $\mathrm{Ext}(z_l^{clip})$ is indeed less than the trainable threshold $v_{th}^l$ used in earlier works (Datta & Beerel, 2022; Rathi et al., 2020a) for any output; the proof is given in Appendix A.1. Moreover, we observe that the Hoyer extremum of each layer changes only slightly during the later stages of training, which indicates that it is most likely an inherent attribute of the dataset and model architecture. Hence, to estimate the threshold during inference, we compute the exponential moving average of the Hoyer extremums during training (similar to BN), and use it during inference.
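A minimal NumPy sketch of the forward pass in Eqs. 2-3 (the function name and the `eps` guard are ours; the paper's implementation operates on full PyTorch activation tensors):

```python
import numpy as np

def hoyer_spike_forward(u, v_th, eps=1e-12):
    """One-time-step Hoyer spike layer (sketch of Eqs. 2-3).

    u    : membrane potential tensor of a layer
    v_th : trainable clipping threshold for this layer
    Returns the binary spike map and the Hoyer extremum used as the
    effective (down-scaled, normalized) firing threshold.
    """
    z = u / v_th                            # normalized membrane potential
    z_clip = np.clip(z, 0.0, 1.0)           # clip to [0, 1]
    # Hoyer extremum of the clipped tensor: ||z||_2^2 / ||z||_1
    ext = (z_clip ** 2).sum() / (np.abs(z_clip).sum() + eps)
    spikes = (z >= ext).astype(np.float32)  # fire where z >= Ext(z_clip)
    return spikes, ext
```

Since every entry of `z_clip` lies in [0, 1], each squared entry is at most the entry itself, so `ext` never exceeds 1 and the effective threshold `ext * v_th` never exceeds `v_th`, consistent with the bound proved in Appendix A.1.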

3.2. HOYER REGULARIZED TRAINING

The loss function ($L_{total}$) of our proposed approach is shown below in Eq. 4:

$$L_{total} = L_{CE} + L_H = L_{CE} + \lambda_H \sum_{l=1}^{L-1} H(u_l) \quad (4)$$

where $L_{CE}$ denotes the cross-entropy loss calculated on the softmax output of the last layer $L$, and $L_H$ represents the Hoyer regularizer calculated on the output of each convolutional and fully-connected layer, except the last layer. The weight update for the last layer is computed as

$$\Delta W_L = \frac{\partial L_{CE}}{\partial w_L} + \lambda_H \frac{\partial L_H}{\partial w_L} = \frac{\partial L_{CE}}{\partial u_L}\frac{\partial u_L}{\partial w_L} + \lambda_H \frac{\partial L_H}{\partial u_L}\frac{\partial u_L}{\partial w_L} = (s - y)\,o_{L-1} + \lambda_H \frac{\partial H(u_L)}{\partial u_L}\,o_{L-1} \quad (5)$$

$$\frac{\partial L_{CE}}{\partial o_{L-1}} = \frac{\partial L_{CE}}{\partial u_L}\frac{\partial u_L}{\partial o_{L-1}} = (s - y)\,w_L \quad (6)$$

where $s$ denotes the output softmax tensor, i.e., $s_i = \frac{e^{u_L^i}}{\sum_{k=1}^{N} e^{u_L^k}}$, where $u_L^i$ and $u_L^k$ denote the $i^{th}$ and $k^{th}$ elements of the membrane potential of the last layer $L$, and $N$ denotes the number of classes. Note that $y$ denotes the one-hot encoded tensor of the true label, and $\frac{\partial H(u_L)}{\partial u_L}$ is computed using Eq. 1. The last layer does not have any threshold and hence does not emit any spikes. For a hidden layer $l$, the weight update is computed as

$$\Delta W_l = \frac{\partial L_{CE}}{\partial w_l} + \lambda_H \frac{\partial L_H}{\partial w_l} = \frac{\partial L_{CE}}{\partial o_l}\frac{\partial o_l}{\partial z_l}\frac{\partial z_l}{\partial u_l}\frac{\partial u_l}{\partial w_l} + \lambda_H \frac{\partial L_H}{\partial u_l}\frac{\partial u_l}{\partial w_l} = \frac{\partial L_{CE}}{\partial o_l}\frac{\partial o_l}{\partial z_l}\frac{o_{l-1}}{v_{th}^l} + \lambda_H \frac{\partial L_H}{\partial u_l}\,o_{l-1} \quad (7)$$

where $\frac{\partial L_H}{\partial u_l}$ can be computed as

$$\frac{\partial L_H}{\partial u_l} = \frac{\partial L_H}{\partial u_{l+1}}\frac{\partial u_{l+1}}{\partial o_l}\frac{\partial o_l}{\partial z_l}\frac{\partial z_l}{\partial u_l} + \frac{\partial H(u_l)}{\partial u_l} = \frac{\partial L_H}{\partial u_{l+1}}\,w_{l+1}\,\frac{\partial o_l}{\partial z_l}\,\frac{1}{v_{th}^l} + \frac{\partial H(u_l)}{\partial u_l} \quad (8)$$

where $\frac{\partial L_H}{\partial u_{l+1}}$ is the gradient backpropagated from the $(l{+}1)^{th}$ layer, which is iteratively computed from the last layer $L$ (see Eqs. 6 and 9). Note that for any hidden layer $l$, two gradients contribute to the Hoyer loss with respect to the potential $u_l$: one from the subsequent layer $(l{+}1)$ and one directly from its own Hoyer regularizer. Similarly, $\frac{\partial L_{CE}}{\partial o_l}$ is computed iteratively, starting from the penultimate layer $(L{-}1)$ defined in Eq. 6, as follows:

$$\frac{\partial L_{CE}}{\partial o_l} = \frac{\partial L_{CE}}{\partial o_{l+1}}\frac{\partial o_{l+1}}{\partial z_{l+1}}\frac{\partial z_{l+1}}{\partial u_{l+1}}\frac{\partial u_{l+1}}{\partial o_l} = \frac{\partial L_{CE}}{\partial o_{l+1}}\frac{\partial o_{l+1}}{\partial z_{l+1}}\frac{1}{v_{th}^{l+1}}\,w_{l+1} \quad (9)$$

All the derivatives in Eqs.
8-11 can be computed by PyTorch autograd, except the spike derivative $\frac{\partial o_l}{\partial z_l}$, whose gradient is zero almost everywhere and undefined at the spiking threshold. We extend the existing idea of surrogate gradients (Neftci et al., 2019) to compute this derivative for one-time-step SNNs with Hoyer spike layers, as illustrated in Fig. 1(b) and mathematically defined as follows.

$$\frac{\partial o_l}{\partial z_l} = \begin{cases} scale \times 1, & \text{if } 0 < z_l < 2 \\ 0, & \text{otherwise} \end{cases} \quad (10)$$

where $scale$ denotes a hyperparameter that controls the dampening of the gradient. Finally, the threshold update for the hidden layer $l$ is computed as

$$\Delta v_{th}^l = \frac{\partial L_{CE}}{\partial v_{th}^l} + \lambda_H \frac{\partial L_H}{\partial v_{th}^l} = \frac{\partial L_{CE}}{\partial o_l}\frac{\partial o_l}{\partial z_l}\frac{\partial z_l}{\partial v_{th}^l} + \lambda_H \frac{\partial L_H}{\partial v_{th}^l} = \frac{\partial L_{CE}}{\partial o_l}\frac{\partial o_l}{\partial z_l}\frac{-u_l}{(v_{th}^l)^2} + \lambda_H \frac{\partial L_H}{\partial u_{l+1}}\frac{\partial u_{l+1}}{\partial v_{th}^l} \quad (11)$$

$$\frac{\partial u_{l+1}}{\partial v_{th}^l} = \frac{\partial u_{l+1}}{\partial o_l} \cdot \frac{\partial o_l}{\partial v_{th}^l} = w_{l+1} \cdot \frac{\partial o_l}{\partial z_l} \cdot \frac{-u_l}{(v_{th}^l)^2} \quad (12)$$

Note that we use this $v_{th}^l$, which is updated in each iteration, to estimate the threshold value in our spiking model using Eq. 3.
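The boxcar surrogate of Eq. 10 can be sketched as follows (a NumPy sketch; the function name is ours, and in practice this would be the backward of a custom autograd function):

```python
import numpy as np

def spike_surrogate_grad(z, scale=1.0):
    """Surrogate derivative do/dz for the one-time-step Hoyer spike layer
    (sketch of Eq. 10): equal to `scale` on the interval 0 < z < 2 and
    zero elsewhere, replacing the ill-defined derivative of the step."""
    return np.where((z > 0.0) & (z < 2.0), scale, 0.0)
```

The gradient window is centered on the normalized operating range of $z_l$, so only neurons whose membrane potential is near the threshold receive a weight update, while saturated neurons are left untouched.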

3.3. NETWORK STRUCTURE

We propose a series of architectural modifications to existing SNNs (Datta & Beerel, 2022; Chowdhury et al., 2021; Rathi et al., 2020a) for our one-time-step models. As shown in Fig. 2(a), for the VGG variant, we place the max pooling layer immediately after the convolutional layer, as is common in many BNN architectures (Rastegari et al., 2016), and introduce the BN layer after max pooling. Similar to recently developed multi-time-step SNN models (Zheng et al., 2021; Li et al., 2021b; Deng et al., 2022; Meng et al., 2022), we observe that BN helps increase the test accuracy with one time step. In contrast, for the ResNet variants, inspired by (Liu et al., 2018), we observe that models with shortcuts that bypass every block can further improve the performance of the SNN. We also observe that the sequence of BN layer, Hoyer spike layer, and convolutional layer outperforms the original bottleneck in ResNet. More details are shown in Fig. 2(b).

3.4. POSSIBLE TRAINING STRATEGIES

Based on the existing SNN literature, we hypothesize two training strategies, other than our proposed approach, that can be used to train one-time-step SNNs. Pre-trained DNN, followed by SNN fine-tuning: Similar to the hybrid training proposed in (Rathi et al., 2020b), we pre-train a non-spiking DNN model and copy its weights to the SNN model. Initialized with these weights, we train a one-time-step SNN with the normal cross-entropy loss. Iteratively convert ReLU neurons to spiking neurons: We first train a DNN model that uses the thresholded ReLU function as the activation function, and then iteratively reduce the number of ReLU neurons whose output activation values are multi-bit. Specifically, we force the neurons with values in the top N percentile to spike (set the output to 1) and those in the bottom N percentile to die (set the output to 0), and gradually increase N until either there is a significant drop in accuracy or all neuron outputs are either 1 or 0.
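The percentile-based forcing step of the second strategy could be sketched as follows (a hypothetical helper; the schedule for growing N and the retraining loop between iterations are not specified here):

```python
import numpy as np

def force_spikes(act, n_percentile):
    """Force the top-N-percentile activations to spike (1) and the
    bottom-N-percentile to die (0), leaving the rest multi-bit.
    `act` holds the outputs of a thresholded-ReLU layer."""
    hi = np.percentile(act, 100 - n_percentile)   # spiking cut-off
    lo = np.percentile(act, n_percentile)         # dying cut-off
    out = act.copy()
    out[act >= hi] = 1.0
    out[act <= lo] = 0.0
    return out
```

Increasing `n_percentile` toward 50 drives the layer toward a fully binary (spiking) output, which is the stopping condition described above.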

4. EXPERIMENTAL RESULTS

Datasets & Models: Similar to existing SNN works (Rathi et al., 2020b;a), we perform object recognition experiments on the CIFAR10 (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009) datasets using VGG16 (Simonyan & Zisserman, 2014) and several variants of ResNet (He et al., 2016). For object detection, we use the MMDetection framework (Chen et al., 2019) with PASCAL VOC2007 and VOC2012 (Everingham et al., 2010) as the training datasets, and benchmark our SNN models and the baselines on the VOC2007 test set. We use the Faster R-CNN (Ren et al., 2015) and RetinaNet (Lin et al., 2017) frameworks, and substitute the original backbone with our SNN models pretrained on ImageNet1K. Object Recognition Results: For training the recognition models, we use the Adam (Kingma & Ba, 2014) optimizer for VGG16 and the SGD optimizer for the ResNet models. As shown in Table 2, we obtain a SOTA accuracy of 93.44% on CIFAR10 with VGG16 with only one time step; the accuracy of our ResNet-based SNN models on ImageNet also surpasses the existing works. On ImageNet, we obtain a 68.00% top-1 accuracy with VGG16, which is only ∼2% lower than the non-spiking counterpart. All our SNN models yield a spiking activity of ∼25% or lower on both CIFAR10 and ImageNet, which is significantly lower than that of existing multi-time-step SNN models, as shown in Fig. 3. Object Detection Results: For object detection on VOC2007, we compare the performance of our spiking models with non-spiking DNNs and BNNs in Table 3. For two-stage architectures, such as Faster R-CNN, the mAP of our one-time-step SNN models surpasses that of existing BNNs by >0.6%. For one-stage architectures, such as RetinaNet (chosen because of its SOTA performance), our one-time-step SNN model with a ResNet50 backbone yields an mAP of 70.5% (the highest among existing BNNs, SNNs, and AddNNs).
Note that our spiking VGG- and ResNet-based backbones lead to a significant drop in mAP with the YOLO framework, which is more compatible with the DarkNet backbone (even existing DarkNet-based SNNs yield very low mAP with YOLO, as shown in Table 3). However, our models suffer a 5.8-6.8% drop in mAP compared to the non-spiking DNNs, which may be due to the significant sparsity and loss in precision. Accuracy Comparison: We compare our results with various SOTA ultra low-latency SNNs for image recognition tasks. Energy Efficiency: The compute-efficiency of SNNs stems from two factors: 1) sparsity, which reduces the number of floating-point operations in convolutional and linear layers compared to non-spiking DNNs according to $SNN_{flops}^l = S_l \times DNN_{flops}^l$ (Chowdhury et al., 2021), where $S_l$ denotes the average number of spikes per neuron per inference over all time steps in layer $l$; and 2) the use of only AC (0.9 pJ) operations, each of which consumes 5.1× lower energy than a MAC (4.6 pJ) operation in 45nm CMOS technology (Horowitz, 2014) for floating-point (FP) representation. Note that the binary activations can replace the FP multiplications with logical operations, i.e., conditional assignment to 0 with a bank of AND gates. These replacements can be realized on existing hardware (e.g., standard GPUs) depending on the compiler and the details of their data paths. Building a custom accelerator that efficiently implements these reduced operations is also possible (Wang et al., 2020a; Frenkel et al., 2019; Lee & Li, 2020). In fact, in neuromorphic accelerators such as Loihi (Davies et al., 2018), FP multiplications are typically avoided using message passing between processors that model multiple neurons. The total compute energy (CE) of a multi-time-step SNN ($SNN_{CE}$) can be estimated as

$$SNN_{CE} = DNN_{flops}^1 \cdot 4.6 + DNN_{com}^1 \cdot 0.4 + \sum_{l=2}^{L} \left( S_l \cdot DNN_{flops}^l \cdot 0.9 + DNN_{com}^l \cdot 0.7 \right) \quad (13)$$

because the direct-encoded SNN receives analog input in the first layer ($l{=}1$) without any sparsity (Chowdhury et al., 2021; Datta & Beerel, 2022; Rathi et al., 2020a). Here, $DNN_{com}^l$ denotes the total number of comparison operations in layer $l$, with each comparison consuming 0.4 pJ.
The CE of the non-spiking DNN ($DNN_{CE}$) is estimated as $DNN_{CE} = \sum_{l=1}^{L} DNN_{flops}^l \cdot 4.6$, where we ignore the energy consumed by the ReLU operation, since it involves only checking the sign bit of the input. We compare the layer-wise spiking activities $S_l$ for time steps ranging from 5 to 1 in Fig. 3(a-b), which represent existing low-latency SNN works, including ours. Note that the spike rates decrease significantly as the number of time steps is reduced from 5 to 1, leading to considerably lower FLOPs in our one-time-step SNNs. These lower FLOPs, coupled with the 5.1× reduction for AC operations, lead to a 22.9× and 32.1× reduction in energy on CIFAR10 and ImageNet, respectively, with VGG16. Though we focus on compute energies for our comparison, multi-time-step SNNs also incur a large number of memory accesses, as the membrane potentials and weights need to be fetched from and written to the on-/off-chip memory in each time step. Our one-time-step models avoid these repetitive read/write operations, as they do not involve any state, and lead to a ∼T× reduction in the number of memory accesses compared to a T-time-step SNN model. Considering this memory cost and the overhead of sparsity (Yin et al., 2022), as shown in Fig. 3(c), our one-time-step SNNs yield a 2.08-14.74× and 22.5-31.4× reduction in total energy compared to multi-time-step SNNs and non-spiking DNNs, respectively, on a systolic array accelerator. Ablation Study: We report ablations in Table 5, where the model without the Hoyer spike layer sets the threshold to $v_{th}^l$, similar to existing works (Datta & Beerel, 2022; Rathi et al., 2020a), rather than our proposed Hoyer extremum. With VGG16, our optimal network modifications lead to a 1.9% increase in accuracy. Furthermore, adding only the Hoyer regularizer leads to negligible accuracy and spiking activity improvements. This might be because the regularizer alone may not sufficiently down-scale the threshold for optimal convergence with one time step.
However, with our Hoyer spike layer, the accuracy improves by 2.68% to 93.13%, while also yielding a 2.09% increase in spiking activity. We observe a similar trend for our network modifications and Hoyer spike layer with ResNet18. However, the Hoyer regularizer substantially reduces the spiking activity from 27.62% to 20.50%, while only negligibly reducing the accuracy. Note that the Hoyer regularizer alone contributes a 0.20% increase in test accuracy on average. In summary, while our network modifications significantly increase the test accuracy compared to SOTA SNN training with one time step, the combination of our Hoyer regularizer and Hoyer spike layer yields the SOTA SNN performance. Effect of Quantization: To further improve the compute-efficiency of our one-time-step SNNs, we perform quantization-aware training of the weights in our models down to 2-6 bits. We also compare our SNN models with SOTA BNNs in Table 7, which replace the costly MAC operations with cheaper pop-count counterparts, thanks to binary weights and activations. Both our full-precision and 2-bit quantized one-time-step SNN models yield higher accuracies than BNNs at iso-architecture on both CIFAR10 and ImageNet. Additionally, our 2-bit quantized SNN models consume 3.4× lower energy than the bi-polar networks (see (Diffenderfer & Kailkhura, 2021) in Table 7) due to the improved trade-off between the low spiking activity (∼22%, as shown in Table 7) provided by our one-time-step SNN models and the lower energy of XOR operations compared to quantized ACs. On the other hand, our one-time-step SNNs consume similar energy to unipolar BNNs (see (Sakr et al., 2018; Wang et al., 2020b) in Table 7) while yielding 3.2% higher accuracy on CIFAR10 at iso-architecture. The energy consumption is similar because the ∼20% advantage of the pop-count operations is mitigated by the ∼22% higher spiking activity of the unipolar BNNs compared to our one-time-step SNNs.
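The compute-energy model above (Eq. 13 and the DNN baseline) can be sketched as a small calculator (function names are ours; the per-operation energies in pJ are the 45nm figures quoted in the text):

```python
def snn_compute_energy(dnn_flops, dnn_comps, spike_rates,
                       e_mac=4.6, e_ac=0.9, e_cmp_first=0.4, e_cmp=0.7):
    """Sketch of Eq. 13: the direct-encoded first layer is dense and
    MAC-based; hidden layers use sparse AC operations gated by the
    per-layer spike rate S_l. All energies in pJ."""
    e = dnn_flops[0] * e_mac + dnn_comps[0] * e_cmp_first
    for fl, cm, s in zip(dnn_flops[1:], dnn_comps[1:], spike_rates[1:]):
        e += s * fl * e_ac + cm * e_cmp
    return e

def dnn_compute_energy(dnn_flops, e_mac=4.6):
    """DNN_CE = sum_l FLOPs_l * 4.6 pJ (ReLU cost ignored)."""
    return sum(dnn_flops) * e_mac
```

For example, with two equal-FLOP layers and a 20% spike rate in the hidden layer, the SNN energy is already roughly half the DNN energy; the paper's 22.9-32.1× figures follow from the much deeper VGG16 with ∼25% or lower activity in every hidden layer.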

5. DISCUSSIONS & FUTURE IMPACT

Existing SNN training works choose ANN-SNN conversion methods to yield high accuracy, SNN fine-tuning to yield low latency, or a hybrid of both for a balanced accuracy-latency trade-off. However, none of the existing works can discard the temporal dimension completely, which would enable the deployment of SNN models in multiple real-time applications without significantly increasing the training cost. This paper presents an SNN training framework from scratch involving a novel combination of a Hoyer regularizer and a Hoyer spike layer for one time step. Our SNN models incur similar training time as non-spiking DNN models and achieve SOTA accuracy, outperforming the existing SNN, BNN, and AddNN models. However, our work can also enable cheap and real-time computer vision systems that might be susceptible to adversarial attacks. Preventing the abusive usage of this technology is an important and interesting area of future work.

A.5 FURTHER INSIGHTS ON HOYER REGULARIZED TRAINING

Since existing works (Panda et al., 2020) use surrogate gradients (and not real gradients) to update the thresholds with appropriate initializations, it is difficult to estimate the optimal value of the IF thresholds. On the other hand, our Hoyer extremums dynamically change with the activation maps, particularly during the early stages of training (coupled with the distribution shift enabled by Hoyer regularized training), which enables our Hoyer extremum-based scaled thresholds to be closer to optimal. In fact, as shown by our ablation studies in Table 5, our Hoyer extremum-based spike layer is more effective than the Hoyer regularizer, which further justifies the importance of combining the Hoyer extremum with the trainable threshold. Additionally, we apply the clip function to the membrane potential before computing the Hoyer extremum. This removes a few outlier values in the activation map that may otherwise unnecessarily increase the value of the Hoyer extremum, i.e., the threshold value, thereby reducing the accuracy without any noticeable increase in energy efficiency. In fact, the test accuracy with VGG16 on CIFAR10 drops by more than 1.4% (from the 93.13% obtained by our training framework to 91.7%) without the clip function.

A.6 TUNING SPIKING ACTIVITY WITH THE HOYER REGULARIZER COEFFICIENT λ_H

We conduct experiments with different coefficients λ_H of the Hoyer regularizer to demonstrate its impact on the trade-off between accuracy and spiking activity. As shown in Table 10, a larger λ_H decreases the spike activity rate, while a smaller one increases it. In fact, the spiking activity can be precisely tuned using λ_H to yield a range of accuracies. Interestingly, Hoyer-regularized training on ResNet18 yields a wider range of spiking activities and a narrower range of accuracies compared to VGG16. This might be because each architecture has different optimization headroom.



We were unable to find existing SNN works for two-stage object detection architectures.



Figure 1: (a) Comparison of our Hoyer spike activation function with existing activation functions where the blue distribution denotes the shifting of the membrane potential away from the threshold using Hoyer regularized training, (b) Proposed derivative of our Hoyer activation function.

Figure 2: Spiking network architectures corresponding to (a) VGG and (b) ResNet based models.

Figure 3: Layerwise spiking activities for a VGG16 across time steps ranging from 5 to 1 (average spiking activity denoted as S in parenthesis) representing existing low-latency SNNs including our work on (a) CIFAR10, (b) ImageNet, (c) Comparison of the total energy consumption between SNNs with different time steps and non-spiking DNNs.


Figure 4: Normalized training and inference time per epoch under iso-batch (256) and iso-hardware (RTX 3090 with 24 GB memory) conditions for (a) CIFAR10 and (b) ImageNet with VGG16. Training & Inference Time Requirements: Because SOTA SNNs require iteration over multiple time steps and storage of the membrane potentials for each neuron, their training and inference times can be substantially higher than those of their DNN counterparts. However, reducing the latency to one time step bridges this gap significantly, as shown in Figure 4. On average, our low-latency, one-time-step SNNs yield a 2.38× and 2.33× reduction in training and inference time per epoch, respectively, compared to the multi-time-step training approaches (Datta & Beerel, 2022; Rathi et al., 2020a) under iso-batch and iso-hardware conditions. Compared to the existing one-time-step SNNs (Chowdhury et al., 2021), we yield a 19× and 1.25× reduction in training and inference time, respectively. Such significant savings in training time, which translate to power savings in big data centers, can potentially reduce AI's environmental impact.

Accuracies from different strategies to train one-time-step SNNs on CIFAR10.

Our results indicate that DNN pre-training followed by SNN fine-tuning fails to yield SOTA accuracy for spiking models with one time step. One possible reason for this might be the difference in the distribution of the pre-activation values between the DNN and SNN models (Datta & Beerel, 2022). It is also intuitive to obtain a one-time-step SNN model by iteratively reducing the proportion of ReLU neurons in a pretrained full-precision DNN model. However, our results indicate that this method also fails to generate the number of spikes at one time step required to yield close-to-SOTA accuracy. Finally, with our network structure modifications to existing SNN works, our Hoyer spike layer, and our Hoyer regularizer, we can train a one-time-step SNN model with SOTA accuracy from scratch.

Comparison of the test accuracy of our one-time-step SNN models with the non-spiking DNN models for object recognition. Model * indicates that we remove the first max pooling layer.





Comparison of our one-time-step SNN models to existing low-latency counterparts. SGD and hybrid denote surrogate gradient descent and pre-trained DNN followed by SNN fine-tuning respectively. (qC, dL) denotes an architecture with q convolutional and d linear layers.

Ablation study of the different methods in our proposed training framework on CIFAR10.

Accuracies of weight quantized onetime-step SNN models based on VGG16 on CIFAR10 where FP is 32-bit floating point.

Comparison of our one-time-step SNN models to AddNNs and BNNs that also incur AC-only operations for improved energy-efficiency, where CE denotes compute energy.

Test accuracy obtained by our approach with multiple time steps on CIFAR10.

Comparison of our one-and multi-time-step SNN models to existing SNN models on DVS-CIFAR10 dataset.

APPENDIX

So the Hoyer extremum of $z_l^{clip}$ is always less than or equal to one, and our effective threshold for every layer $l$, which is the product of $v_{th}^l$ and $\mathrm{Ext}(z_l^{clip})$, is always less than or equal to $v_{th}^l$.

A.2 EXPERIMENTAL SETUP

For training the VGG16 models, we use the Adam optimizer with an initial learning rate of 0.0001, weight decay of 0.0001, dropout of 0.1, and batch size of 128 on CIFAR10 for 600 epochs, and the Adam optimizer with weight decay of 5e-6 and batch size 64 on ImageNet for 180 epochs. For training the ResNet models, we use the SGD optimizer with an initial learning rate of 0.1, weight decay of 0.0001, and batch size of 128 on CIFAR10 for 400 epochs, and the Adam optimizer with weight decay of 5e-6 and batch size 64 on ImageNet for 120 epochs. We divide the learning rate by 5 at 60%, 80%, and 90% of the total number of epochs. When calculating the Hoyer extremum, we implement two versions: one that calculates the Hoyer extremum over the whole batch, and another that calculates it channel-wise. Our experiments show that the channel-wise version brings a 0.1-0.3% increase in accuracy. All the experimental results reported in this paper use the channel-wise version. For Faster R-CNN, we use the SGD optimizer with an initial learning rate of 0.01 for 50 epochs, and divide the learning rate by 10 after 25 and 40 epochs. For RetinaNet, we use the SGD optimizer with an initial learning rate of 0.001 and the same learning rate scheduler as Faster R-CNN.
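The channel-wise variant of the Hoyer extremum described above can be sketched as follows (assuming NCHW activation tensors; the function name and the `eps` guard are ours):

```python
import numpy as np

def hoyer_ext_channelwise(z_clip, eps=1e-12):
    """Channel-wise Hoyer extremum: one normalized threshold per channel
    of an (N, C, H, W) clipped-potential tensor, instead of a single
    scalar for the whole batch."""
    sq = (z_clip ** 2).sum(axis=(0, 2, 3))          # ||z_c||_2^2 per channel
    ab = np.abs(z_clip).sum(axis=(0, 2, 3)) + eps   # ||z_c||_1 per channel
    return sq / ab                                  # shape (C,)
```

Broadcasting the resulting `(C,)` vector against the `(N, C, H, W)` activations then gives each channel its own firing threshold during the spike comparison.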

A.3 EXTENSION TO MULTIPLE TIME-STEPS

We extend our proposed approach to multi-time-step SNN models. As shown in Table 8, as the number of time steps increases from 1 to 4, the accuracy of the model also increases, from 93.44% to 94.14%, which validates the effectiveness of our method. However, this accuracy increase comes at the cost of a significant increase in spiking activity (see Table 8), which increases the compute energy, and of a growing temporal overhead, which increases the memory cost due to the repetitive access of the membrane potentials and weights across the different time steps.

A.4 EXTENSION TO DYNAMIC VISION SENSOR (DVS) DATASETS

The inherent temporal dynamics in SNNs may be better leveraged in DVS or event-based tasks (Deng et al., 2022; Li et al., 2022; Kim & Panda, 2021; Kim et al., 2022a) than in the standard static vision tasks studied in this work. Hence, we have evaluated our framework on the DVS-CIFAR10 dataset, which provides each label with only 0.9k training samples and is considered challenging (Deng et al., 2022). As illustrated in Table 9, we surpass the test accuracy of existing works (Li et al., 2022; Kim & Panda, 2021) by 1.30% on average at iso-time-step and architecture. Note that the VGGSNN architecture employed in our work and (Deng et al., 2022) is based on VGG11 with two fully connected layers removed, since (Deng et al., 2022) found that additional fully connected layers were unnecessary for neuromorphic datasets. In fact, our accuracy gain is more significant at low time steps, implying the portability of our approach to DVS tasks. Note that, similar to static datasets, a large number of time steps increases the temporal overhead in SNNs, resulting in a large memory footprint and spiking activity.

The Hoyer spike layer, when used with the Hoyer regularizer (with the optimal value of the coefficient that yields the best test accuracy), increases the spiking activity for both VGG16 and ResNet18: a 2.08% (from 20.48% to 22.57%) increase for VGG16 and a 5.33% (from 20.50% to 25.83%) increase for ResNet18. This is because the Hoyer spike layer down-scales the threshold value, enabling more neurons to spike. Note that the Hoyer spike layer, when used without the Hoyer regularizer, may be unable to tune the trade-off between spiking activity and accuracy. This is because there is no explicit regularizer coefficient, and the Hoyer extremum may not always lower the threshold value, since it is computed from the SGL-based trainable threshold which, without the Hoyer regularizer, may be updated randomly (i.e., not in a systematic manner that encourages sparsity). This is why we believe we do not observe any definitive trend in the trade-off between accuracy and spiking activity in this case.

