SPIKFORMER: WHEN SPIKING NEURAL NETWORK MEETS TRANSFORMER

Abstract

We consider two biologically plausible structures, the Spiking Neural Network (SNN) and the self-attention mechanism. The former offers an energy-efficient and event-driven paradigm for deep learning, while the latter has the ability to capture feature dependencies, enabling Transformer to achieve good performance. Exploring the marriage between them is intuitively promising. In this paper, we leverage both the self-attention capability and the biological properties of SNNs, and propose a novel Spiking Self Attention (SSA) as well as a powerful framework, named Spiking Transformer (Spikformer). The SSA mechanism in Spikformer models sparse visual features using spike-form Query, Key, and Value without softmax. Since its computation is sparse and avoids multiplication, SSA is efficient and has low computational energy consumption. We show that Spikformer with SSA can outperform state-of-the-art SNN frameworks in image classification on both neuromorphic and static datasets. Spikformer (66.3M parameters), comparable in size to SEW-ResNet-152 (60.2M, 69.26%), achieves 74.81% top-1 accuracy on ImageNet using 4 time steps, which is the state of the art among directly trained SNN models. Code is available at Spikformer.

1. INTRODUCTION

As the third generation of neural networks (Maass, 1997), the Spiking Neural Network (SNN) is promising for its low power consumption, event-driven characteristics, and biological plausibility (Roy et al., 2019). With the development of artificial neural networks (ANNs), SNNs have been able to improve performance by borrowing advanced architectures from ANNs, such as ResNet-like SNNs (Hu et al., 2021a; Fang et al., 2021a; Zheng et al., 2021; Hu et al., 2021b), Spiking Recurrent Neural Networks (Lotfi Rezaabad & Vishwanath, 2020), and Spiking Graph Neural Networks (Zhu et al., 2022). Transformer, originally designed for natural language processing (Vaswani et al., 2017), has flourished across computer vision tasks, including image classification (Dosovitskiy et al., 2020; Yuan et al., 2021a), object detection (Carion et al., 2020; Zhu et al., 2020; Liu et al., 2021), semantic segmentation (Wang et al., 2021; Yuan et al., 2021b), and low-level image processing (Chen et al., 2021). Self-attention, the key part of Transformer, selectively focuses on information of interest and is also an important feature of the human biological system (Whittington et al., 2022; Caucheteux & King, 2022). Intuitively, it is intriguing to explore applying self-attention in SNNs for more advanced deep learning, considering the biological properties of both mechanisms. It is, however, non-trivial to port the self-attention mechanism into SNNs. In vanilla self-attention (VSA) (Vaswani et al., 2017), there are three components: Query, Key, and Value. As shown in Figure 1 (a), standard inference of VSA first obtains a matrix by computing the dot product of the float-point-form Query and Key; then softmax, which involves exponentiation and division, normalizes this matrix into the attention map used to weigh the Value.
The above steps in VSA do not conform to the calculation characteristics of SNNs, i.e., avoiding multiplication. Moreover, the heavy computational overhead of VSA almost prohibits applying it directly to SNNs. Therefore, in order to develop Transformer on SNNs, we need to design a new effective and computation-efficient self-attention variant that avoids multiplications. We thus present Spiking Self Attention (SSA), as illustrated in Figure 1 (b). SSA introduces the self-attention mechanism to SNNs for the first time, modeling interdependence using spike sequences. In SSA, the Query, Key, and Value are in spike form, which contains only 0 and 1.
The obstacles to applying self-attention in SNNs are mainly caused by softmax. 1) As shown in Figure 1, the attention map calculated from spike-form Query and Key is naturally non-negative, ignoring irrelevant features. Thus, we do not need softmax to keep the attention matrix non-negative, which is its most important role in VSA (Qin et al., 2022). 2) The input and the Value of SSA are in spike form, consisting only of 0 and 1 and containing less fine-grained features than the float-point input and Value of VSA in ANNs. Hence the float-point Query and Key and the softmax function are redundant for modeling such spike sequences. Tab. 1 shows that our SSA is competitive with VSA in processing spike sequences. Based on the above insights, we discard softmax normalization of the attention map in SSA. Some previous Transformer variants also discard softmax or replace it with a linear function: in Performer (Choromanski et al., 2020), positive random features are adopted to approximate softmax; CosFormer (Qin et al., 2022) replaces softmax with ReLU and a cosine function. With such a design, the calculation over spike-form Query, Key, and Value avoids multiplications and can be done with logical AND operations and additions, making the computation very efficient. Due to the sparse spike-form Query, Key, and Value (shown in appendix D.1) and the simple computation, the number of operations in SSA is small, so its energy consumption is very low. Moreover, SSA is decomposable after the removal of softmax, which further reduces its computational complexity when the sequence length is greater than the feature dimension of one head, as depicted in Figure 1 (b) ① ②. Based on the proposed SSA, which well suits the calculation characteristics of SNNs, we develop the Spiking Transformer (Spikformer). An overview of Spikformer is shown in Figure 2.
Spikformer boosts performance on both static and neuromorphic datasets. To the best of our knowledge, this is the first exploration of the self-attention mechanism and of a directly trained Transformer in SNNs. To sum up, the contributions of our work are three-fold:
• We design a novel spike-form self-attention named Spiking Self Attention (SSA), tailored to the properties of SNNs. Using sparse spike-form Query, Key, and Value without softmax, the calculation of SSA avoids multiplications and is efficient.
• We develop the Spiking Transformer (Spikformer) based on the proposed SSA. To the best of our knowledge, this is the first implementation of self-attention and Transformer in SNNs.
• Extensive experiments show that the proposed architecture outperforms the state-of-the-art SNNs on both static and neuromorphic datasets. Notably, we achieve more than 74% accuracy on ImageNet with 4 time steps using a directly trained SNN model for the first time.

2. RELATED WORK

Vision Transformers. For the image classification task, a standard vision transformer (ViT) includes a patch splitting module, transformer encoder layer(s), and a linear classification head. The transformer encoder layer consists of a self-attention layer and a multi-layer perceptron (MLP) block. Self-attention is the core component that makes ViT successful. By weighting image-patch features through the dot product of Query and Key followed by softmax, self-attention captures global dependencies and representations of interest (Katharopoulos et al., 2020; Qin et al., 2022). Several works have improved the structure of ViTs. Using convolution layers for patch splitting has been shown to accelerate convergence and alleviate the data-hungry problem of ViT (Xiao et al., 2021b; Hassani et al., 2021). Other methods aim to reduce the computational complexity of self-attention or improve its ability to model visual dependencies (Song, 2021; Yang et al., 2021; Rao et al., 2021; Choromanski et al., 2020). This paper focuses on exploring the effectiveness of self-attention in SNNs and developing a powerful spiking transformer model for image classification.

Spiking Neural Networks. Unlike traditional deep learning models that convey information with continuous decimal values, SNNs use discrete spike sequences to compute and transmit information. Spiking neurons receive continuous values and convert them into spike sequences; examples include the Leaky Integrate-and-Fire (LIF) neuron (Wu et al., 2018) and PLIF (Fang et al., 2021b). There are two ways to obtain deep SNN models: ANN-to-SNN conversion and direct training. In ANN-to-SNN conversion (Cao et al., 2015; Hunsberger & Eliasmith, 2015; Rueckauer et al., 2017; Bu et al., 2021; Meng et al., 2022; Wang et al., 2022), a high-performance pre-trained ANN is converted to an SNN by replacing the ReLU activation layers with spiking neurons.
The converted SNN requires large time steps to accurately approximate the ReLU activation, which causes large latency (Han et al., 2020). In the area of direct training, SNNs are unfolded over the simulation time steps and trained by backpropagation through time (Lee et al., 2016; Shrestha & Orchard, 2018). Because the event-triggered mechanism in spiking neurons is non-differentiable, surrogate gradients are used for backpropagation (Lee et al., 2020; Neftci et al., 2019). Xiao et al. (2021a) adopt implicit differentiation on the equilibrium state to train SNNs. Various models from ANNs have been ported to SNNs. However, the study of self-attention in SNNs is currently blank. Yao et al. (2021) propose temporal attention to reduce redundant time steps. Zhang et al. (2022a;b) both use an ANN-Transformer to process spike data, although they have 'Spiking Transformer' in the title. Mueller et al. (2021) provide an ANN-SNN conversion Transformer, but retain vanilla self-attention, which does not conform to the characteristics of SNNs. In this paper, we explore the feasibility of implementing self-attention and Transformer in SNNs. As the fundamental unit of SNNs, the spiking neuron receives the resultant current and accumulates membrane potential, which is compared with a threshold to determine whether to generate a spike. We uniformly use LIF spiking neurons in our work. The dynamics of LIF are described as:

H[t] = V[t−1] + (1/τ)(X[t] − (V[t−1] − V_reset)),   (1)
S[t] = Θ(H[t] − V_th),   (2)
V[t] = H[t](1 − S[t]) + V_reset · S[t],   (3)

where τ is the membrane time constant and X[t] is the input current at time step t. When the membrane potential H[t] exceeds the firing threshold V_th, the spiking neuron triggers a spike S[t]. Θ(v) is the Heaviside step function, which equals 1 for v ≥ 0 and 0 otherwise.
V[t] represents the membrane potential after the trigger event: it equals H[t] if no spike is generated, and equals the reset potential V_reset otherwise.
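As a concrete illustration, the charge-fire-reset cycle above can be sketched in a few lines of NumPy. The constants here (τ = 2, V_th = 1, V_reset = 0, and a constant input current of 1.5) are illustrative choices, not the paper's settings:

```python
import numpy as np

def lif_step(x_t, v_prev, tau=2.0, v_th=1.0, v_reset=0.0):
    """One LIF update: charge, fire (Heaviside), then reset."""
    h_t = v_prev + (x_t - (v_prev - v_reset)) / tau      # charging of the membrane
    s_t = (h_t >= v_th).astype(np.float64)               # Heaviside spike generation
    v_t = h_t * (1.0 - s_t) + v_reset * s_t              # reset only where a spike fired
    return s_t, v_t

# drive one neuron with a constant supra-threshold current over 4 time steps
v = np.zeros(1)
spikes = []
for t in range(4):
    s, v = lif_step(np.array([1.5]), v)
    spikes.append(int(s[0]))
# spikes == [0, 1, 0, 1]
```

With a constant supra-threshold drive, the neuron alternately integrates toward the threshold and resets after each spike, which is exactly the behavior the equations above describe.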

3. METHOD

We propose Spiking Transformer (Spikformer), which incorporates the self-attention mechanism and Transformer into spiking neural networks (SNNs) for enhanced learning capability. Spikformer consists of a Spiking Patch Splitting (SPS) module, a Spikformer encoder, and a linear classification head; we explain each component in turn below. We empirically find that layer normalization (LN) does not suit SNNs, so we use batch normalization (BN) instead.

3.1. OVERALL ARCHITECTURE

An overview of Spikformer is depicted in Figure 2. Given a 2D image sequence I ∈ R^{T×C×H×W}, the Spiking Patch Splitting (SPS) module linearly projects it to a D-dimensional spike-form feature and splits it into a sequence of N flattened spike-form patches x. Float-point-form position embeddings cannot be used in SNNs. We instead employ a conditional position embedding generator (Chu et al., 2021) to produce a spike-form relative position embedding (RPE), which is added to the patch sequence x to get X_0. The conditional position embedding generator contains a 2D convolution layer (Conv2d) with kernel size 3, batch normalization (BN), and a spike neuron layer (SN). We then pass X_0 to the L-block Spikformer encoder. Similar to the standard ViT encoder block, a Spikformer encoder block consists of a Spiking Self Attention (SSA) block and an MLP block, with residual connections applied to both. As the main component of the Spikformer encoder block, SSA offers an efficient method to model the local-global information of images using spike-form Query (Q), Key (K), and Value (V) without softmax, which is analyzed in detail in Sec. 3.3. A global average-pooling (GAP) is applied to the feature processed by the Spikformer encoder, and the resulting D-dimensional feature is sent to the fully-connected classification head (CH) to output the prediction Y. Spikformer can be written as follows:

x = SPS(I), I ∈ R^{T×C×H×W}, x ∈ R^{T×N×D},   (4)
RPE = SN(BN(Conv2d(x))), RPE ∈ R^{T×N×D},   (5)
X_0 = x + RPE, X_0 ∈ R^{T×N×D},   (6)
X'_l = SSA(X_{l−1}) + X_{l−1}, X'_l ∈ R^{T×N×D}, l = 1...L,   (7)
X_l = MLP(X'_l) + X'_l, X_l ∈ R^{T×N×D}, l = 1...L,   (8)
Y = CH(GAP(X_L)).   (9)

3.2. SPIKING PATCH SPLITTING

As shown in Figure 2, the Spiking Patch Splitting (SPS) module aims to linearly project an image to a D-dimensional spike-form feature and split the feature into patches of a fixed size. SPS can contain multiple blocks.
Similar to the convolutional stem in Vision Transformers (Xiao et al., 2021b; Hassani et al., 2021), we apply a convolution layer in each SPS block to introduce inductive bias into Spikformer. Specifically, given an image sequence I ∈ R^{T×C×H×W}:

x = MP(SN(BN(Conv2d(I)))),   (10)

where Conv2d and MP denote the 2D convolution layer (stride 1, 3 × 3 kernel) and max-pooling, respectively. The number of SPS blocks can be more than 1. When using multiple SPS blocks, the number of output channels in these convolution layers is gradually increased until it matches the embedding dimension of the patches. For example, given an output embedding dimension D and a four-block SPS module, the numbers of output channels in the four convolution layers are D/8, D/4, D/2, and D. A 2D max-pooling layer down-samples the feature map after each SPS block. After SPS, I is split into an image patch sequence x ∈ R^{T×N×D}.
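To make the channel and resolution progression concrete, the following sketch traces the shapes through a four-block SPS, assuming a max-pooling in every block; the 224 × 224 input and D = 384 are illustrative values, not a specific configuration from the paper:

```python
def sps_shapes(C=3, H=224, W=224, D=384, blocks=4):
    """Channel/spatial progression through a four-block SPS where each block
    is Conv2d(3x3, stride 1) -> BN -> SN -> MaxPool(2x2): channels go
    C -> D/8 -> D/4 -> D/2 -> D while the spatial size halves per block."""
    shapes = []
    ch, h, w = C, H, W
    for b in range(blocks):
        ch = D // (2 ** (blocks - 1 - b))   # D/8, D/4, D/2, D
        h, w = h // 2, w // 2               # max-pooling halves each spatial dim
        shapes.append((ch, h, w))
    return shapes

# with a 224x224 input and D = 384, the final feature is (384, 14, 14),
# i.e. N = 14 * 14 = 196 flattened patches of dimension D
final_ch, final_h, final_w = sps_shapes()[-1]
```

Four halvings correspond to an effective 16 × 16 patch size, matching the patch sizes quoted in the experiments.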

3.3. SPIKING SELF ATTENTION MECHANISM

The Spikformer encoder is the main component of the whole architecture, containing the Spiking Self Attention (SSA) mechanism and an MLP block. In this section we focus on SSA, starting with a review of vanilla self-attention (VSA). Given an input feature sequence X ∈ R^{T×N×D}, the VSA in ViT has three float-point key components, namely Query (Q_F), Key (K_F), and Value (V_F), computed from X by learnable linear matrices W_Q, W_K, W_V ∈ R^{D×D}:

Q_F = XW_Q, K_F = XW_K, V_F = XW_V,   (11)

where F denotes the float-point form. The output of vanilla self-attention is:

VSA(Q_F, K_F, V_F) = Softmax(Q_F K_F^T / √d) V_F,   (12)

where d = D/H is the feature dimension of one head and H is the number of heads. Converting the float-point-form Value (V_F) into spike form (V) would allow a direct application of VSA in SNNs:

VSA(Q_F, K_F, V) = Softmax(Q_F K_F^T / √d) V.   (13)

However, the calculation of VSA is not applicable in SNNs for two reasons. 1) The float-point matrix multiplication of Q_F and K_F and the softmax function, which contains exponentiation and division, do not comply with the calculation rules of SNNs. 2) The quadratic space and time complexity of VSA in the sequence length does not meet the efficiency requirements of SNNs. We propose Spiking Self-Attention (SSA), which is more suitable for SNNs than VSA, as shown in Figure 1 (b) and the bottom of Figure 2. The Query (Q), Key (K), and Value (V) are first computed through learnable matrices and then become spiking sequences via separate spike neuron layers:

Q = SN_Q(BN(XW_Q)), K = SN_K(BN(XW_K)), V = SN_V(BN(XW_V)),   (14)

where Q, K, V ∈ R^{T×N×D}. We believe the attention matrix should be computed from pure spike-form Query and Key (containing only 0 and 1). Inspired by vanilla self-attention (Vaswani et al., 2017), we add a scaling factor s to control the magnitude of the matrix multiplication result.
s does not affect the properties of SSA. As shown in Figure 2, the spike-friendly SSA is defined as:

SSA'(Q, K, V) = SN(QK^T V · s),   (15)
SSA(Q, K, V) = SN(BN(Linear(SSA'(Q, K, V)))).   (16)

The single-head SSA introduced here can easily be extended to multi-head SSA, as detailed in appendix A. SSA is conducted independently at each time step; see appendix B for details. As shown in Eq. (15), SSA drops the softmax normalization of the attention matrix in Eq. (12) and directly multiplies Q, K, and V. An intuitive calculation example is shown in Figure 1 (b). Softmax is unnecessary in our SSA, and it even hinders the implementation of self-attention in SNNs. Formally, based on Eq. (14), the spike sequences Q and K, output by the spike neuron layers SN_Q and SN_K respectively, are naturally non-negative (0 or 1), resulting in a non-negative attention map. SSA only aggregates relevant features and ignores irrelevant information, hence it does not need softmax to ensure the non-negativeness of the attention map. Moreover, compared to the float-point-form X_F and V_F in ANNs, the input X and the Value V of self-attention in SNNs are in spike form and carry limited information. Vanilla self-attention (VSA), with float-point-form Q_F, K_F and softmax, is redundant for modeling the spike-form X and V, and cannot extract more information from them than SSA. That is, SSA is more suitable for SNNs than VSA. We conduct experiments to validate these insights by comparing the proposed SSA with four alternative ways of computing the attention map, as shown in Tab. 1. A_I denotes multiplying the float-point Q and K directly, which preserves both positive and negative correlations. A_ReLU multiplies ReLU(Q) and ReLU(K): it keeps the positive values of Q, K and zeroes the negative ones, while A_LeakyReLU still retains the negative values.
A_softmax means the attention map is generated as in VSA. All four methods use the same Spikformer framework and weight the spike-form V. From Tab. 1, the superior performance of A_SSA over A_I and A_LeakyReLU demonstrates the benefit of SN. The reason A_SSA outperforms A_ReLU may be that A_SSA provides better non-linearity in self-attention. Compared with A_softmax, A_SSA is competitive and even surpasses it on CIFAR10-DVS and CIFAR10, which can be attributed to SSA being more suitable than VSA for spike sequences (X and V) with limited information. Furthermore, the number of operations and the theoretical energy consumption required by A_SSA to compute over Q, K, V are much lower than those of the other methods. SSA is specially designed for modeling spike sequences. Since Q, K, and V are all in spike form, the matrix dot-product degrades to logical AND operations and summation. Taking a row of the Query, q, and a column of the Key, k, as an example: Σ_{i=1}^{d} q_i k_i = Σ_{i: q_i=1} k_i. Also, as shown in Tab. 1, SSA has a low computation burden and energy consumption due to the sparse spike-form Q, K, and V (Figure 4) and the simplified calculation. In addition, the order of computation between Q, K, and V is changeable: QK^T first and then V, or K^T V first and then Q. When the sequence length N is larger than the dimension d of one head, the second order incurs lower computational complexity (O(Nd^2)) than the first (O(N^2 d)). SSA maintains biological plausibility and computational efficiency throughout the whole calculation process.
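The properties discussed above, non-negativity of the attention map, the reduction of binary dot products to gated sums, and the interchangeable multiplication order, can be checked with a minimal NumPy sketch. The sizes (N = 16, d = 4), the scaling factor s = 0.25, and the plain threshold standing in for the spike neuron layer SN are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s = 16, 4, 0.25              # illustrative: patches, head dim, scaling factor

# spike-form Q, K, V as produced by the spike neuron layers (only 0s and 1s)
Q = (rng.random((N, d)) < 0.3).astype(np.float64)
K = (rng.random((N, d)) < 0.3).astype(np.float64)
V = (rng.random((N, d)) < 0.3).astype(np.float64)

attn = Q @ K.T                     # non-negative by construction: no softmax needed
out = attn @ V * s                 # Eq.-style QK^T V * s before the output neuron
spike_out = (out >= 1.0).astype(np.float64)   # plain threshold standing in for SN

# binary dot product reduces to summing k over the positions where q fires
q, k = Q[0], K[0]
assert q @ k == k[q == 1].sum()

# associativity: Q(K^T V) equals (QK^T)V but costs O(N d^2) instead of O(N^2 d)
assert np.allclose((Q @ K.T) @ V, Q @ (K.T @ V))
```

Because every entry of `attn` is a count of co-firing positions, it is non-negative and integer-valued without any normalization, which is exactly why softmax can be dropped.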

4. EXPERIMENTS

We conduct experiments on the static datasets CIFAR and ImageNet (Deng et al., 2009) and the neuromorphic datasets CIFAR10-DVS and DVS128 Gesture (Amir et al., 2017) to evaluate the performance of Spikformer. The models are implemented with PyTorch (Paszke et al., 2019), SpikingJelly, and the PyTorch image models library (Timm). We train Spikformer from scratch and compare it with current SNN models in Sec. 4.1 and 4.2, and conduct ablation studies on the SSA module and Spikformer in Sec. 4.3.

4.1. STATIC DATASETS CLASSIFICATION

ImageNet contains around 1.3 million 1,000-class images for training and 50,000 images for validation. The input size of our model on ImageNet is set to the default 224 × 224. On CIFAR10, Spikformer surpasses the ResNet-19 ANN (94.97%). The performance improves as the dimensions or number of blocks increase. Specifically, Spikformer-4-384 improves by 1.25% over Spikformer-4-256 and by 0.39% over Spikformer-2-384. We also find that extending the number of training epochs to 400 improves performance (Spikformer-4-384 400E gains 0.32% and 0.35% over Spikformer-4-384 on CIFAR10 and CIFAR100, respectively). The improvement of the proposed Spikformer on complex datasets such as CIFAR100 is even higher: Spikformer-4-384 (77.86%, 9.32M) obtains a significant improvement of 2.51% over the ResNet-19 ANN (75.35%, 12.63M). The ANN-Transformer models are 1.54% and 3.16% higher than Spikformer-4-384, respectively. As shown in appendix D.5, transfer learning based on pre-trained Spikformer achieves even higher performance on CIFAR, demonstrating strong transferability.

4.2. NEUROMORPHIC DATASETS CLASSIFICATION

DVS128 Gesture is a gesture recognition dataset containing 11 hand gesture categories from 29 individuals under 3 illumination conditions. CIFAR10-DVS is a neuromorphic dataset converted from the static image dataset by shifting image samples captured by a DVS camera, providing 9,000 training samples and 1,000 test samples. For these two datasets of image size 128 × 128, we adopt a four-block SPS. The patch embedding dimension is 256 and the patch size is 16 × 16. We use a shallow Spikformer with 2 transformer encoder blocks. The SSA contains 8 and 16 heads for DVS128 Gesture and CIFAR10-DVS, respectively. The time step of the spiking neuron is 10 or 16. The training epochs are 200 for DVS128 Gesture and 106 for CIFAR10-DVS. The optimizer is AdamW and the batch size is set to 16. The learning rate is initialized to 0.1 and reduced with cosine decay. We apply data augmentation on CIFAR10-DVS following (Li et al., 2022). We use a learnable parameter as the scaling factor to control the QK^T V result. The classification performance of Spikformer and the compared state-of-the-art models on neuromorphic datasets is shown in Tab. 4. Our model achieves good performance on both datasets with only 2.59M parameters. On DVS128 Gesture, we obtain 98.2% accuracy with 16 time steps, higher than SEW-ResNet (97.9%). Our result is also competitive with TA-SNN (98.6%, 60 time steps) (Yao et al., 2021), which uses floating-point spikes in the forward propagation. On CIFAR10-DVS, we achieve 1.6% and 3.6% better accuracy than the SOTA method DSR (77.3%), which uses binary spikes, with 10 and 16 time steps, respectively. TET is not an architecture-based but a loss-based method that achieves 83.2% using long training (300 epochs) and a 9.27M VGGSNN, so we do not compare with it in the table.

Time step

The accuracy for different simulation time steps of the spiking neuron is shown in Tab. 5. When the time step is 1, our method is 1.87% lower than the network with T = 4 on CIFAR10.

SSA

We conduct ablation studies on SSA to further identify its advantages. We first test its effect by replacing SSA with standard vanilla self-attention, in two cases: Value in floating-point form (Spikformer-L-D w VSA_VF) and in spike form (Spikformer-L-D w VSA).

REPRODUCIBILITY STATEMENT

Our code is based on SpikingJelly (Fang et al., 2020), an open-source SNN framework, and the PyTorch image models library (Timm) (Wightman, 2019). The experimental results in this paper are reproducible. We explain the details of model training and dataset augmentation in the main text and supplement them in the appendix. Our code for the Spikformer models is uploaded as supplementary material and will be made available on GitHub after review.

D.4 ANALYSIS OF SELF-ATTENTION VARIANTS NOT CONVERGING ON IMAGENET

The reason that the three models do not converge in Tab. 5 is explained as follows. As shown in Figure 6 (a), the gradient of the sigmoid surrogate function vanishes when the difference between the average input value V_i and the firing threshold V_th is too large or too small. We collect the output values of QK^T V · s, which are sent to the spike neuron layer as the input value V_i as in Eq. (15), after one training epoch of Spikformer-8-512 w I, Spikformer-8-512 w ReLU, Spikformer-8-512 w LeakyReLU, and Spikformer-8-512 w SSA, as shown in Figure 6 (b).



In the neuromorphic datasets, the data shape is I ∈ R^{T×C×H×W}, where T, C, H, and W denote time step, channel, height, and width, respectively. A 2D image I_s ∈ R^{C×H×W} from a static dataset needs to be repeated T times to form an image sequence.
https://github.com/fangwei123456/spikingjelly
https://github.com/rwightman/pytorch-image-models




Figure 1: Illustration of vanilla self-attention (VSA) and our Spiking Self Attention (SSA). A red spike indicates a value of 1 at that location. The blue dashed boxes provide examples of matrix dot-product operations. For convenience, we choose one of the heads of SSA, where N is the number of input patches and d is the feature dimension of one head. FLOPs denotes floating point operations and SOPs denotes theoretical synaptic operations. The theoretical energy consumption to perform one calculation between Query, Key, and Value in one time step is obtained from the 8-encoder-block, 512-embedding-dimension Spikformer on the ImageNet test set according to (Kundu et al., 2021b; Hu et al., 2021a). More details about the calculation of theoretical SOPs and energy consumption are included in appendix C.2. (a) In VSA, Q_F, K_F, V_F are in float-point form. After the dot product of Q_F and K_F, the softmax function regularizes negative values in the attention map to positive values. (b) In SSA, all values in the attention map are non-negative and the computation is sparse using spike-form Q, K, V (5.5 × 10^6 vs. 77 × 10^6 in VSA). Therefore, the computation in SSA consumes less energy than VSA (354.2µJ). In addition, SSA is decomposable (the calculation order of Q, K, and V is changeable).

Figure 2: The overview of Spiking Transformer (Spikformer), which consists of a spiking patch splitting module (SPS), a Spikformer encoder, and a linear classification head.

Figure 3: Attention map examples of SSA. The black region is 0.

Figure 5: Training loss, testing loss and test accuracy on ImageNet.

As shown in Figure 6 (b), the value of QK^T V · s in Spikformer-8-512 w SSA is controlled in a suitable range. Therefore, SSA has stable surrogate gradients during training and converges easily.

Figure 6: (a) the sigmoid surrogate and its gradient curve. (b) the value of QK T V .

Analysis of the SSA's rationality. We replace SSA with other attention variants and keep the remaining network structure in Spikformer unchanged. We show the accuracy (Acc) on CIFAR10-DVS (Li et al., 2017) and CIFAR10/100 (Krizhevsky, 2009). OPs (M) is the number of operations (for A_I, A_LeakyReLU, A_ReLU, and A_softmax, OPs is FLOPs and SOPs is ignored; for A_SSA, it is SOPs) and P (µJ) is the theoretical energy consumption to perform one calculation among Q, K, V.

Evaluation on ImageNet. Param refers to the number of parameters. Power is the average theoretical energy consumption when predicting an image from ImageNet test set, whose calculation detail is shown in Eq. 22. Spikformer-L-D represents a Spikformer model with L Spikformer encoder blocks and D feature embedding dimensions. The train loss, test loss and test accuracy curves are shown in appendix D.2. OPs refers to SOPs in SNN and FLOPs in ANN-ViT.

Performance comparison of our method with existing methods on CIFAR10/100. Our method improves network performance across all tasks. * denotes self-implementation results by Deng et al. (2021). Note that Hybrid training (Rathi et al., 2020) adopts ResNet-20 for CIFAR10 and VGG-11 for CIFAR100.

Performance comparison to the state-of-the-art (SOTA) methods on two neuromorphic datasets. Bold font means the best; * denotes with Data Augmentation.

Spikformer-8-512 with 1 time step still achieves 70.14%. The above results show that Spikformer is robust under low-latency (fewer time steps) conditions.

Ablation study results on SSA and time step. The reason that the models with A_I, A_ReLU, and A_LeakyReLU do not converge is that the dot-product values of Query, Key, and Value are large, which makes the surrogate gradient of the output spike neuron layer vanish. More details are in appendix D.4. In comparison, the dot-product value of the designed SSA lies in a controllable range, determined by the sparse spike-form Q, K, and V, which makes Spikformer w SSA easy to converge.

Fire rate of Query, Key and Value of blocks in Spikformer-8-512 on ImageNet test set.

Additional result on CIFAR10/100. Spikformer-4-384 w IF uses the Integrate-and-Fire neuron.

Transfer Learning on CIFAR10/100.

ACKNOWLEDGEMENTS

This work is supported by Nature Science Foundation of China (No.62202014 and No.62006007), Shenzhen Basic Research Program (No.JCYJ20220813151736001), and the National Innovation 2030 Major ST Project of China (No.2020AAA0104203).

APPENDIX A MULTIHEAD SPIKING SELF ATTENTION

In practice, we reshape Q, K, V ∈ R^{T×N×D} into the multi-head form R^{T×H×N×d}, where D = H × d. We then split Q, K, V into H parts and run H SSA operations in parallel, which we call H-head SSA, i.e., Multihead Spiking Self Attention (MSSA).

B SPIKING SELF ATTENTION AND TIME STEP

In practice, T is an independent dimension for the spike neuron layer. In other layers, it is merged with the batch size.
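The head split in appendix A and the time-step handling in appendix B can be sketched together in a few reshapes; the batch dimension B is added here for concreteness, and all sizes are illustrative:

```python
import numpy as np

T, B, N, D, H = 4, 2, 8, 16, 4     # time steps, batch, patches, embed dim, heads
d = D // H                          # per-head feature dimension

Q = np.zeros((T, B, N, D))          # a spike-form tensor with an explicit batch axis

# split the embedding dimension into H heads: (T, B, N, D) -> (T, B, H, N, d)
Q_heads = Q.reshape(T, B, N, H, d).transpose(0, 1, 3, 2, 4)

# outside the spike neuron layers, T is merged with the batch dimension,
# so linear/BN layers see an ordinary (T*B, N, D) batch
Q_flat = Q.reshape(T * B, N, D)
```

Each of the H slices of `Q_heads` is processed by an independent SSA, and the time axis only needs to be kept separate where the spiking neuron's state evolves over t.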

C EXPERIMENT DETAILS

C.1 TRAINING

Unlike the standard ViT, Dropout and DropPath are not applied in Spikformer. We remove the layer norm before each self-attention and MLP block and add batch norm after each linear layer instead. In all Spikformer models, the hidden dimension of the MLP blocks is 4 × D, where D is the embedding dimension. As in Eq. (20), we select the sigmoid function as the surrogate function with α = 4. For DVS128 Gesture, we place a 1D max-pooling layer after Q and K to increase the density of the data, which improves the accuracy from 97.9% to 98.3% with 16 time steps. We set the threshold voltage V_th of the spike neuron layer after QK^T V · s to 0.5, while the others are set to 1.
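For reference, a sketch of the sigmoid surrogate gradient with α = 4; the functional form σ(α(v − V_th)) is an assumption based on the standard sigmoid surrogate, since the referenced Eq. (20) is not reproduced here:

```python
import numpy as np

ALPHA = 4.0   # surrogate sharpness alpha = 4 as stated in the text

def sigmoid_surrogate_grad(v, v_th=1.0, alpha=ALPHA):
    """Gradient of the sigmoid surrogate sigma(alpha * (v - v_th)), used in
    place of the non-differentiable Heaviside step during backpropagation."""
    s = 1.0 / (1.0 + np.exp(-alpha * (v - v_th)))
    return alpha * s * (1.0 - s)

# the gradient peaks at v = v_th and vanishes far from the threshold, which is
# why unbounded QK^T V values stall training (see the analysis in appendix D.4)
```

This shape is what makes controlling the magnitude of QK^T V · s with the scaling factor s important: inputs far above or below V_th receive almost no gradient.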

C.2 THEORETICAL SYNAPTIC OPERATION AND ENERGY CONSUMPTION CALCULATION

The calculation of theoretical energy consumption requires first calculating the synaptic operations:

SOPs(l) = fr × T × FLOPs(l),   (21)

where l is a block/layer in Spikformer, fr is the firing rate of the input spike train of the block/layer, and T is the simulation time step of the spiking neuron. FLOPs(l) refers to the floating point operations of l, i.e., the number of multiply-and-accumulate (MAC) operations, and SOPs is the number of spike-based accumulate (AC) operations. We estimate the theoretical energy consumption of Spikformer according to (Kundu et al., 2021b; Hu et al., 2021b; Horowitz, 2014; Kundu et al., 2021a; Yin et al., 2021; Panda et al., 2020; Yao et al., 2022). We assume that the MAC and AC operations are implemented on 45nm hardware (Horowitz, 2014), where E_MAC = 4.6pJ and E_AC = 0.9pJ. The theoretical energy consumption of Spikformer is calculated as:

E_Spikformer = E_MAC × FLOPs(FL^1_SNNConv) + E_AC × (Σ_m SOPs(SNNConv_m) + Σ_n SOPs(SNNFC_n) + Σ_l SOPs(SSA_l)),   (22)

where FL^1_SNNConv is the first layer, which encodes static RGB images into spike form; the SOPs of the m SNN Conv layers, n SNN fully-connected (FC) layers, and l SSA layers are added together and multiplied by E_AC. For an ANN, the theoretical energy consumption of block b is:

Power(b) = 4.6pJ × FLOPs(b),   (23)

and for an SNN:

Power(b) = 0.9pJ × SOPs(b).   (24)
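The SOP and energy formulas above can be exercised with a toy computation; only the per-operation energy constants come from the text, while the layer count, FLOPs, firing rate, and time steps below are made-up illustrative numbers:

```python
E_MAC = 4.6e-12   # joules per multiply-and-accumulate (45nm, from the text)
E_AC = 0.9e-12    # joules per spike-based accumulate (45nm, from the text)

def sops(flops, fire_rate, T):
    """Synaptic operations of a spiking layer: fr * T * FLOPs."""
    return fire_rate * T * flops

def spikformer_energy(first_conv_flops, snn_layer_sops):
    """Total energy: the first conv encodes the RGB input with MACs, and
    every downstream spiking layer (Conv, FC, SSA) only accumulates (ACs)."""
    return E_MAC * first_conv_flops + E_AC * sum(snn_layer_sops)

# hypothetical network: 1e8 FLOPs in the encoding conv plus three spiking
# layers of 5e7 FLOPs each, firing at 20% over T = 4 time steps
layers = [sops(5e7, 0.2, 4) for _ in range(3)]
energy_uj = spikformer_energy(1e8, layers) * 1e6   # joules -> microjoules
# energy_uj == 568.0 (460 uJ from the MAC-based first conv, 108 uJ from ACs)
```

Note how the MAC-based first layer dominates here: sparse firing makes every AC-based layer roughly an order of magnitude cheaper per operation, which is the source of the energy figures quoted in Tab. 1.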

