HOYER REGULARIZER IS ALL YOU NEED FOR ULTRA LOW-LATENCY SPIKING NEURAL NETWORKS

Abstract

Spiking neural networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps, which hinders their deployment in real-time use cases, or significantly increase the training complexity. To mitigate this concern, we present a training framework (from scratch) for one-time-step SNNs that uses a novel variant of the recently proposed Hoyer regularizer. We estimate the threshold of each SNN layer as the Hoyer extremum of a clipped version of its activation map, where the clipping threshold is trained using gradient descent with our Hoyer regularizer. This approach not only downscales the value of the trainable threshold, thereby emitting a large number of spikes for weight updates within a limited number of iterations (due to only one time step), but also shifts the membrane potential values away from the threshold, thereby mitigating the effect of noise that can degrade the SNN accuracy. Our approach outperforms existing spiking, binary, and adder neural networks in terms of the accuracy-FLOPs trade-off for complex image recognition tasks. Downstream experiments on object detection further demonstrate the efficacy of our approach. Code will be made publicly available.

1. INTRODUCTION & RELATED WORKS

Owing to their high activation sparsity and use of cheap accumulate (AC) operations instead of energy-expensive multiply-and-accumulate (MAC) operations, SNNs have emerged as a promising low-power alternative to compute- and memory-expensive deep neural networks (DNNs) (Indiveri et al., 2011; Pfeiffer et al., 2018; Cao et al., 2015). Because SNNs receive and transmit information via spikes, analog inputs must be encoded as a sequence of spikes using techniques such as rate coding (Diehl et al., 2016), temporal coding (Comsa et al., 2020), direct encoding (Rathi et al., 2020a), and rank-order coding (Kheradpisheh et al., 2018). In addition to accommodating various forms of spike encoding, supervised training algorithms for SNNs have overcome various roadblocks associated with the discontinuous spike activation function (Lee et al., 2016; Kim et al., 2020). Moreover, previous SNN efforts propose batch normalization (BN) techniques (Kim et al., 2020; Zheng et al., 2021) that leverage the temporal dynamics with rate/direct encoding. However, most of these efforts require multiple time steps, which increases training and inference costs compared to non-spiking counterparts for static vision tasks. The training effort is high because backpropagation must integrate the gradients over an SNN that is unrolled once for each time step (Panda et al., 2020). Moreover, the multiple forward passes result in an increased number of spikes, which degrades the SNN's energy efficiency during both training and inference, and can offset the compute advantage of the ACs. The multiple time steps also increase the inference complexity because of the need for input encoding logic and the increased latency associated with requiring one forward pass per time step. To mitigate these concerns, we propose one-time-step SNNs that do not require any non-spiking DNN pre-training and are more compute-efficient than existing multi-time-step SNNs.
Without any temporal overhead, these SNNs resemble vanilla feed-forward DNNs with Heaviside activation functions (McCulloch & Pitts, 1943). They are also similar to sparsity-induced, or uni-polar, binary neural networks (BNNs) (Wang et al., 2020b), whose two states are 0 and 1. However, such BNNs do not yield SOTA accuracy, unlike the bi-polar BNNs (Diffenderfer & Kailkhura, 2021), whose two states are 1 and -1. A recent SNN work (Chowdhury et al., 2021) also proposed the use of one time step; however, it required CNN pre-training followed by iterative SNN training from 5 down to 1 time steps, significantly increasing the training complexity, particularly for ImageNet-level tasks. Note that there have been significant efforts in the SNN community to reduce the number of time steps via optimal DNN-to-SNN conversion (Bu et al., 2022b; Deng et al., 2021), the lottery ticket hypothesis (Kim et al., 2022c), and neural architecture search (Kim et al., 2022b). However, none of these techniques have been shown to train one-time-step SNNs without significant accuracy loss.

Our Contributions. Our training framework is based on a novel application of the Hoyer regularizer and a novel Hoyer spike layer. More specifically, our spike layer threshold is training-input-dependent and is set to the Hoyer extremum of a clipped version of the membrane potential tensor, where the clipping threshold (which existing SNNs use directly as the firing threshold) is trained using gradient descent with our Hoyer regularizer. In this way, compared to SOTA one-time-step non-iteratively trained SNNs, our threshold increases the rate of weight updates, and our Hoyer regularizer shifts the membrane potential distribution away from this threshold, improving convergence. We consistently surpass the accuracies obtained by SOTA one-time-step SNNs (Chowdhury et al., 2021) on diverse image recognition datasets with different convolutional architectures, while reducing the average training time by ∼19×.
Compared to binary neural network (BNN) and adder neural network (AddNN) models, our SNN models yield similar test accuracy with a ∼5.5× reduction in the floating-point operations (FLOPs) count, thanks to the extreme sparsity enabled by our training framework. Downstream tasks on object detection also demonstrate that our approach surpasses the test mAP of existing BNNs and SNNs.

2. PRELIMINARIES ON HOYER REGULARIZERS

A measure of sparsity based on the interplay between the L1 and L2 norms was first introduced in (Hoyer, 2004); building on this measure, Yang et al. (2020) proposed the Hoyer regularizer for the trainable weights, which is incorporated into the loss term to train DNNs. We adopt the same form of Hoyer regularizer, applied to the membrane potential $u^l$, to train our SNN models (Kurtz et al., 2020):

$$H(u^l) = \frac{\|u^l\|_1^2}{\|u^l\|_2^2}$$

Here, $\|u^l\|_i$ denotes the $L_i$ norm of the tensor $u^l$, and the superscript $t$ for the time step is omitted for simplicity. Compared to the L1 and L2 regularizers, the Hoyer regularizer is scale-invariant (similar to the L0 regularizer). It is also differentiable almost everywhere, with the gradient shown in Equation 1, where $|u^l|$ denotes the element-wise absolute value of the tensor $u^l$:

$$\frac{\partial H(u^l)}{\partial u^l} = 2\,\mathrm{sign}(u^l)\,\frac{\|u^l\|_1}{\|u^l\|_2^4}\left(\|u^l\|_2^2 - \|u^l\|_1\,|u^l|\right) \quad (1)$$

Setting the gradient $\frac{\partial H(u^l)}{\partial u^l} = 0$, we estimate the value of the Hoyer extremum as $Ext(u^l) = \frac{\|u^l\|_2^2}{\|u^l\|_1}$. This extremum is in fact a minimum, because the second derivative is greater than zero for any value of the output element. Training with the Hoyer regularizer therefore effectively pushes the activation values that are larger than the extremum ($u^l > Ext(u^l)$) even larger, and those that are smaller than the extremum ($u^l < Ext(u^l)$) even smaller.
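As a concrete illustration, the regularizer, its gradient, and the extremum can be sketched in a few lines of NumPy (the function names below are ours, for illustration only):

```python
import numpy as np

def hoyer_regularizer(u):
    """Hoyer regularizer H(u) = ||u||_1^2 / ||u||_2^2 (scale-invariant)."""
    return np.abs(u).sum() ** 2 / (u ** 2).sum()

def hoyer_gradient(u):
    """Analytical gradient of H(u) from Eq. 1:
    dH/du = 2*sign(u) * ||u||_1 / ||u||_2^4 * (||u||_2^2 - ||u||_1 * |u|)."""
    l1 = np.abs(u).sum()
    l2_sq = (u ** 2).sum()
    return 2 * np.sign(u) * l1 / l2_sq ** 2 * (l2_sq - l1 * np.abs(u))

def hoyer_extremum(u):
    """Hoyer extremum Ext(u) = ||u||_2^2 / ||u||_1, where Eq. 1 vanishes."""
    return (u ** 2).sum() / np.abs(u).sum()

u = np.array([0.2, 0.5, 1.5, 3.0])

# Scale invariance: H(c*u) == H(u) for any c != 0
assert np.isclose(hoyer_regularizer(3.0 * u), hoyer_regularizer(u))

# The gradient vanishes when every |u_i| equals the extremum
ext = hoyer_extremum(u)
assert np.allclose(hoyer_gradient(np.full(4, ext)), 0.0)
```

The second assertion shows why $Ext(u^l)$ is the stationary point: for a tensor whose elements all equal the extremum, the term $\|u^l\|_2^2 - \|u^l\|_1 |u^l|$ in Eq. 1 is zero element-wise.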

3. PROPOSED TRAINING FRAMEWORK

Our approach is inspired by the fact that Hoyer regularizers can shift the pre-activation distributions away from the Hoyer extremum in a non-spiking DNN (Yang et al., 2020). Our principal insight is that setting the SNN threshold to this extremum shifts the distribution of the membrane potentials away from the threshold value, reducing noise and thereby improving convergence. To achieve this goal for one-time-step SNNs, we present a novel Hoyer spike layer that sets the threshold based on a Hoyer-regularized training process, as described below.

3.1. HOYER SPIKE LAYER

In this work, we adopt a time-independent variant of the popular Leaky-Integrate-and-Fire (LIF) representation, as illustrated in Eq. 2, to model the spiking neuron with one time step:

$$u^l = w^l o^{l-1}, \qquad z^l = \frac{u^l}{v_{th}^l}, \qquad o^l = \begin{cases} 1, & \text{if } z^l \geq 1 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

where $z^l$ denotes the normalized membrane potential. Such a neuron model with a unit step activation function is difficult to optimize even with the recently proposed surrogate gradient descent
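To make the neuron model concrete, a minimal NumPy sketch of the forward pass is shown below, with the firing threshold set to the Hoyer extremum of the clipped membrane potential as described in Section 1. This is an illustrative assumption of how the pieces fit together, not the authors' released code, and gradients (surrogate or otherwise) are omitted:

```python
import numpy as np

def hoyer_spike_layer(u, v_th):
    """One-time-step spiking activation (forward pass only; a sketch).
    u:    membrane potential tensor, u^l = w^l @ o^{l-1}
    v_th: trainable clipping threshold, trained with the Hoyer regularizer."""
    # Clip the membrane potential; the clipping threshold is what
    # conventional SNNs would use directly as the firing threshold.
    u_clip = np.clip(u, 0.0, v_th)
    # The Hoyer extremum of the clipped map serves as the actual spike
    # threshold; it is always <= v_th, i.e., a downscaled threshold.
    ext = (u_clip ** 2).sum() / np.abs(u_clip).sum()
    z = u / ext                        # normalized membrane potential z^l
    return (z >= 1.0).astype(u.dtype)  # Heaviside: o^l = 1 if z^l >= 1, else 0

w = np.array([[0.5, -0.2], [0.8, 0.3], [0.1, 0.9]])
o_prev = np.array([1.0, 1.0])          # binary spikes from the previous layer
u = w @ o_prev                         # membrane potentials [0.3, 1.1, 1.0]
spikes = hoyer_spike_layer(u, v_th=1.0)
```

Because $Ext(u^l) \leq \max |u^l| \leq v_{th}^l$ after clipping, this threshold is never larger than the trainable one, which is consistent with the downscaling effect described in the abstract: more neurons cross the threshold and emit spikes, enabling weight updates despite the single time step.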

