SPIKFORMER: WHEN SPIKING NEURAL NETWORK MEETS TRANSFORMER

Abstract

We consider two biologically plausible structures, the Spiking Neural Network (SNN) and the self-attention mechanism. The former offers an energy-efficient and event-driven paradigm for deep learning, while the latter has the ability to capture feature dependencies, enabling Transformer to achieve good performance. It is intuitively promising to explore the marriage between them. In this paper, we leverage both the self-attention capability and the biological properties of SNNs, and propose a novel Spiking Self Attention (SSA) as well as a powerful framework, named Spiking Transformer (Spikformer). The SSA mechanism in Spikformer models sparse visual features using spike-form Query, Key, and Value without softmax. Since its computation is sparse and avoids multiplication, SSA is efficient and has low computational energy consumption. We show that Spikformer with SSA can outperform state-of-the-art SNN frameworks in image classification on both neuromorphic and static datasets. Spikformer (66.3M parameters), comparable in size to SEW-ResNet-152 (60.2M, 69.26%), achieves 74.81% top-1 accuracy on ImageNet using 4 time steps, the state of the art among directly trained SNN models. Code is available at Spikformer.

1. INTRODUCTION

As the third generation of neural network (Maass, 1997), the Spiking Neural Network (SNN) is very promising for its low power consumption, event-driven characteristic, and biological plausibility (Roy et al., 2019). With the development of artificial neural networks (ANNs), SNNs have been able to lift performance by borrowing advanced architectures from ANNs, such as ResNet-like SNNs (Hu et al., 2021a; Fang et al., 2021a; Zheng et al., 2021; Hu et al., 2021b), Spiking Recurrent Neural Networks (Lotfi Rezaabad & Vishwanath, 2020), and Spiking Graph Neural Networks (Zhu et al., 2022). Transformer, originally designed for natural language processing (Vaswani et al., 2017), has flourished in various computer vision tasks, including image classification (Dosovitskiy et al., 2020; Yuan et al., 2021a), object detection (Carion et al., 2020; Zhu et al., 2020; Liu et al., 2021), semantic segmentation (Wang et al., 2021; Yuan et al., 2021b), and low-level image processing (Chen et al., 2021). Self-attention, the key component of Transformer, selectively focuses on information of interest and is also an important feature of the human biological system (Whittington et al., 2022; Caucheteux & King, 2022). Intuitively, given the biological properties of the two mechanisms, it is intriguing to explore applying self-attention in SNNs for more advanced deep learning.

It is, however, non-trivial to port the self-attention mechanism into SNNs. In vanilla self-attention (VSA) (Vaswani et al., 2017), there are three components: Query, Key, and Value. As shown in Figure 1(a), standard inference in VSA first obtains a matrix by computing the dot product of the float-point-form Query and Key; then softmax, which involves exponential calculations and division operations, is adopted to normalize this matrix into the attention map that weighs the Value.
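The VSA computation described above can be sketched in a few lines of NumPy (a minimal single-head sketch for illustration, not the paper's code; shapes follow the N × d convention of Figure 1):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: exponentiate, then divide by the row sum.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_self_attention(Q, K, V):
    """Single-head VSA: float-point Q, K, V of shape (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # N x N dot-product matrix; entries may be negative
    attn = softmax(scores, axis=-1)  # exponentiation + division yield non-negative weights
    return attn @ V                  # attention map weighs the Value

rng = np.random.default_rng(0)
N, d = 4, 8
Q, K, V = rng.standard_normal((3, N, d))
out = vanilla_self_attention(Q, K, V)
assert out.shape == (N, d)
# Each row of the attention map is a probability distribution (sums to 1):
assert np.allclose(softmax(Q @ K.T / np.sqrt(d)).sum(-1), 1.0)
```

Note the two SNN-unfriendly ingredients: the float-valued matrix products, and the exponential/division operations inside softmax.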
The above steps in VSA do not conform to the calculation characteristics of SNNs, i.e., avoiding multiplication. Moreover, the heavy computational overhead of VSA almost prohibits applying it directly to SNNs. Therefore, in order to develop Transformer on SNNs, we need to design a new effective and computation-efficient self-attention variant that avoids multiplications. We thus present Spiking Self Attention (SSA), as illustrated in Figure 1(b). SSA introduces the self-attention mechanism to SNNs for the first time, modeling interdependence using spike sequences. In SSA, the Query, Key, and Value are in spike form, containing only 0s and 1s. The obstacles to applying self-attention in SNNs are mainly caused by softmax. 1) As shown in Figure 1, the attention map calculated from the spike-form Query and Key is naturally non-negative and ignores irrelevant features. Thus, we do not need softmax to keep the attention matrix non-negative, which is its most important role in VSA (Qin et al., 2022).
2) The input and the Value of SSA are in spike form, consisting only of 0s and 1s and carrying less fine-grained features than the float-point input and Value of VSA in ANNs. Float-point Query and Key and the softmax function are therefore redundant for modeling such spike sequences. Tab. 1 illustrates that our SSA is competitive with VSA in processing spike sequences. Based on the above insights, we discard softmax normalization of the attention map in SSA. Some previous Transformer variants also discard softmax or replace it with a linear function. For example, in Performer (Choromanski et al., 2020), positive random features are adopted to approximate softmax; CosFormer (Qin et al., 2022) replaces softmax with ReLU and a cosine function. With such a design, the calculation over spike-form Query, Key, and Value avoids multiplications and can be done with logical AND operations and additions, making the computation very efficient. Due to the sparse spike-form Query, Key, and Value (shown in appendix D.1) and the simple computation, the number of operations in SSA is small, so its energy consumption is very low. Moreover, SSA becomes decomposable once softmax is discarded, which further reduces its computational complexity when the sequence length is greater than the feature dimension of one head, as depicted in Figure 1(b) ① ②. Based on the proposed SSA, which well suits the calculation characteristics of SNNs, we develop the Spiking Transformer (Spikformer). An overview of Spikformer is shown in Figure 2. It boosts performance when trained on both static and neuromorphic datasets. To the best of our knowledge, this is the first work to explore the self-attention mechanism and a directly-trained Transformer in SNNs. To sum up, the contributions of our work are three-fold:

• We design a novel spike-form self-attention named Spiking Self Attention (SSA) for the properties of SNNs.
Using sparse spike-form Query, Key, and Value without softmax, the calculation of SSA avoids multiplications and is efficient.

• We develop the Spiking Transformer (Spikformer) based on the proposed SSA. To the best of our knowledge, this is the first implementation of self-attention and Transformer in SNNs.
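The two properties of SSA discussed above, a naturally non-negative attention map computable with logical AND plus addition, and decomposability once softmax is removed, can be checked with binary NumPy tensors (an illustrative sketch, not the authors' implementation; the spike rate of 0.2 is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 64, 16  # sequence length N larger than the per-head dimension d
# Spike-form Query, Key, Value: binary (0/1) tensors, as emitted by spiking neurons
Q = (rng.random((N, d)) < 0.2).astype(np.int64)
K = (rng.random((N, d)) < 0.2).astype(np.int64)
V = (rng.random((N, d)) < 0.2).astype(np.int64)

# Property 1: the attention map Q K^T is inherently non-negative -- no softmax needed.
attn = Q @ K.T
assert (attn >= 0).all()
# With 0/1 entries, each dot product reduces to logical AND followed by addition:
attn_and = np.logical_and(Q[:, None, :], K[None, :, :]).sum(-1)
assert (attn == attn_and).all()

# Property 2: without softmax, the computation is decomposable (matrix associativity).
out1 = (Q @ K.T) @ V  # order (1): build the N x N map first, O(N^2 d)
out2 = Q @ (K.T @ V)  # order (2): build the d x d product first, O(N d^2)
assert (out1 == out2).all()  # identical results; order (2) is cheaper when N > d
```

With softmax in between, the two orders would not be interchangeable, which is why discarding it is what unlocks the cheaper O(N d^2) ordering.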



Figure 1: Illustration of vanilla self-attention (VSA) and our Spiking Self Attention (SSA). A spike indicates a value of 1 at that location. The blue dashed boxes provide examples of matrix dot-product operations. For convenience, we show one head of SSA, where N is the number of input patches and d is the feature dimension of one head. FLOPs denotes floating-point operations and SOPs denotes theoretical synaptic operations. The theoretical energy consumption of one calculation between Query, Key, and Value in one time step is obtained from an 8-encoder-block, 512-embedding-dimension Spikformer on the ImageNet test set, following (Kundu et al., 2021b; Hu et al., 2021a). More details about the calculation of theoretical SOPs and energy consumption are included in appendix C.2. (a) In VSA, Q_F, K_F, V_F are in float-point form. After the dot product of Q_F and K_F, the softmax function regularizes negative values in the attention map to positive values. (b) In SSA, all values in the attention map are non-negative and the computation is sparse with spike-form Q, K, V (5.5 × 10^6 SOPs vs. 77 × 10^6 FLOPs in VSA). Therefore, the computation in SSA consumes less energy (4.95µJ) compared with VSA (354.2µJ). In addition, SSA is decomposable (the calculation order of Q, K, and V is changeable).
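The caption's energy figures can be reproduced with the per-operation energy model commonly used in SNN work: a float multiply-accumulate (MAC) and an accumulate-only synaptic operation (AC) are assigned fixed energy costs. The 4.6 pJ/MAC and 0.9 pJ/AC constants below are the widely cited 45 nm CMOS estimates and are our assumption here; they match the caption's numbers exactly:

```python
# Assumed per-operation energies (45 nm CMOS estimates, not taken from this paper's text)
E_MAC = 4.6e-12  # joules per float multiply-accumulate (counts VSA FLOPs)
E_AC = 0.9e-12   # joules per accumulate-only synaptic op (counts SSA SOPs)

vsa_flops = 77e6   # VSA operation count from Figure 1
ssa_sops = 5.5e6   # SSA operation count from Figure 1

vsa_energy_uJ = vsa_flops * E_MAC * 1e6  # convert J -> microjoules
ssa_energy_uJ = ssa_sops * E_AC * 1e6

assert abs(vsa_energy_uJ - 354.2) < 1e-3  # matches the caption's VSA energy
assert abs(ssa_energy_uJ - 4.95) < 1e-3   # matches the caption's SSA energy
```

Under this model the roughly 70x energy gap comes from two multiplicative factors: fewer operations (sparse binary tensors) and a cheaper operation type (addition-only instead of MAC).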

