DCT-SNN: USING DCT TO DISTRIBUTE SPATIAL INFORMATION OVER TIME FOR LEARNING LOW-LATENCY SPIKING NEURAL NETWORKS

Anonymous

Abstract

Spiking Neural Networks (SNNs) offer a promising alternative to traditional deep learning frameworks, since they provide higher computational efficiency due to event-driven information processing. SNNs distribute the analog values of pixel intensities into binary spikes over time. However, the most widely used input coding schemes, such as Poisson-based rate coding, do not leverage the additional temporal learning capability of SNNs effectively. Moreover, these SNNs suffer from high inference latency, which is a major bottleneck to their deployment. To overcome this, we propose a scalable time-based encoding scheme that utilizes the Discrete Cosine Transform (DCT) to reduce the number of timesteps required for inference. DCT decomposes an image into a weighted sum of sinusoidal basis images. At each timestep, a single frequency basis, taken in order and modulated by its corresponding DCT coefficient, is input to an accumulator that generates spikes upon crossing a threshold. We use the proposed scheme to learn DCT-SNN, a low-latency deep SNN with leaky-integrate-and-fire neurons, trained using surrogate-gradient-based backpropagation. We achieve top-1 accuracy of 89.94%, 68.3% and 52.43% on CIFAR-10, CIFAR-100 and TinyImageNet, respectively, using VGG architectures. Notably, DCT-SNN performs inference with 2-14X reduced latency compared to other state-of-the-art SNNs, while achieving comparable accuracy to their standard deep learning counterparts. The dimension of the transform allows us to control the number of timesteps required for inference. Additionally, we can trade off accuracy with latency in a principled manner by dropping the highest frequency components during inference.

1. INTRODUCTION

Deep learning networks have tremendously improved state-of-the-art performance on many tasks such as object detection, classification and natural language processing (Krizhevsky et al., 2012; Hinton et al., 2012; Deng & Liu, 2018). However, such architectures are extremely energy-intensive (Li et al., 2016) and hence require custom architectures and training methodologies for edge deployment (Howard et al., 2017). To address this, Spiking Neural Networks (SNNs) have emerged as a promising alternative to traditional deep learning architectures (Maass, 1997; Roy et al., 2019). SNNs are bio-plausible networks inspired by the learning mechanisms observed in mammalian brains. They are analogous in structure to standard networks, but perform computation in the form of spikes instead of fully analog values. For the rest of this paper, we refer to standard networks as Analog Neural Networks (ANNs) to distinguish them from their spiking counterparts with digital (spiking) inputs. The input and the correspondingly generated activations in SNNs are all binary spikes, and inference is performed by accumulating the spikes over time. This can be visualized as distributing the one-step inference of ANNs into a multi-step, very sparse inference scheme in the SNN. The primary source of energy efficiency in SNNs is that very few neurons spike at any given timestep. This event-driven computation, together with the replacement of every multiply-accumulate (MAC) operation in the ANN by an addition in the SNN, allows SNNs to infer with less energy. This benefit can be further enhanced using custom SNN implementations with architectural modifications (Ju et al., 2020). Li et al. (2017) have released a spiking version of the CIFAR-10 dataset based on inputs from neuromorphic sensors. IBM has designed a non-commercial processor, 'TrueNorth' (Akopyan et al., 2015), and Intel has designed its equivalent, 'Loihi' (Davies et al., 2018), that can train and infer on SNNs; Blouw et al. (2019) have shown SNNs implemented on Loihi to be two orders of magnitude more efficient than an equivalent ANN running on a GPU for keyword spotting.
However, a major challenge still to be addressed is that the accumulation of spikes over timesteps results in high inference latency in SNNs. Energy efficiency at the cost of too high a latency would still hamper real-time deployment. Consequently, reducing the number of timesteps required for inference in SNNs is an active field of research. One factor that significantly affects the number of timesteps needed is the encoding scheme that converts pixels into spikes over the timesteps. Currently, the most common encoding scheme is Poisson spike generation (Rueckauer et al., 2017), where the spikes at the input are generated as a Poisson spike train with the mean spiking rate proportional to the pixel intensity. This scheme does not encode anything meaningful along the temporal axis, and each timestep is the same as any other. Moreover, networks trained using this scheme suffer from high inference latency (Rueckauer et al., 2017). Temporal coding schemes such as phase (Kim et al., 2018) or burst (Park et al., 2019) coding have been introduced to better encode temporal information into the spike trains, but they still incur high latency and require a large number of spikes for inference. Another related temporal method is time-to-first-spike (TTFS) coding (Zhang et al., 2019; Park et al., 2020); such schemes limit the number of spikes per neuron, but the high-latency problem persists. Relative timing of spikes to encode information has been used in Comsa et al. (2020), but the results are only reported for simple tasks like MNIST, and its scalability to deeper architectures such as VGG and more complex datasets like CIFAR remains unclear.
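The Poisson rate-coding baseline described here can be sketched in a few lines (a minimal NumPy illustration; the function name and the normalization of intensities to [0, 1] are our assumptions, not a reference implementation):

```python
import numpy as np

def poisson_encode(image, timesteps, rng=None):
    """Rate coding: at every timestep, each pixel emits a spike with
    probability equal to its normalized intensity, so the mean firing
    rate is proportional to the pixel value."""
    rng = np.random.default_rng(0) if rng is None else rng
    p = np.clip(image, 0.0, 1.0)  # assume intensities normalized to [0, 1]
    # Independent Bernoulli draws per pixel per timestep (Poisson-like train).
    return (rng.random((timesteps,) + image.shape) < p).astype(np.float32)

# A mid-gray image fires on roughly half of the timesteps at every pixel.
spikes = poisson_encode(np.full((4, 4), 0.5), timesteps=1000)
```

Note that the resulting spike train is statistically stationary: every timestep is identical in distribution, which is precisely the property the DCT-based scheme proposed here avoids.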
In this paper, we propose a novel encoding scheme to convert pixels into spikes over time. The proposed scheme utilizes a block-wise matrix multiplication to decompose spatial information into a weighted sum of basis images, and then reverses the transform to allow reconstruction of the input over multiple timesteps. These bases, taken one per timestep and modulated by the weights from the forward transform, are presented to the spike-generating layer. The spike generator sums the contributions of all bases seen until the current timestep, as shown in Figure 1. Though any invertible matrix can be utilized as the transform, the ideal transform follows the properties of energy compaction and orthonormality of bases, as outlined in Section 3.1. We motivate the Discrete Cosine Transform (DCT) as the ideal choice, since it is data-independent, with orthogonal bases ordered by their contribution to spectral energy. Each timestep receives the information corresponding to a single basis, starting from the zero-frequency component at the first timestep, and each subsequent step refines the input representation progressively. At the end of the cycle, the entire pixel value has passed through the spike-generating neuron. Thus, this methodology distributes the pixel value over all the timesteps in a meaningful manner. Choosing the appropriate dimensions of the transform provides fine-grained control over the number of timesteps used for inference. We use the proposed scheme to learn DCT-SNN, a spiking version of an ANN, and show that it cuts down the timesteps needed to infer an image from the CIFAR-10, CIFAR-100 and TinyImageNet datasets from 100 to 48, 125 to 48 and 250 to 48, respectively, while achieving comparable accuracy to state-of-the-art Poisson-encoded SNNs.
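The encoding described above can be sketched as follows, with simplifications that are ours rather than the paper's: a single n×n block, plain row-major traversal of the bases instead of an energy-ordered one, and a soft-reset integrate-and-fire spike generator:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix: row k holds the k-th cosine basis vector.
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def dct_temporal_encode(block, v_th=1.0):
    """Encode an n x n block into binary spikes over n*n timesteps.

    At each timestep one 2-D basis image, scaled by its DCT coefficient,
    is added to an accumulator that spikes (and soft-resets) wherever it
    crosses v_th; summed over all steps, the inputs reconstruct the block."""
    n = block.shape[0]
    C = dct_matrix(n)
    coeffs = C @ block @ C.T                  # forward 2-D DCT
    mem = np.zeros_like(block, dtype=float)   # accumulator membrane
    spikes = []
    for k in range(n):                        # row-major basis order (simplified)
        for l in range(n):
            basis = np.outer(C[k], C[l])      # (k, l)-th basis image
            mem += coeffs[k, l] * basis       # one term of the inverse DCT
            s = (mem >= v_th).astype(float)   # spike where threshold crossed
            mem -= s * v_th                   # soft reset by subtraction
            spikes.append(s)
    return np.stack(spikes)                   # shape: (n*n, n, n)
```

Because the per-timestep inputs sum to the inverse DCT of the coefficients, the accumulator has seen the exact pixel values by the final timestep, matching the reconstruction property described above.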
Additionally, ordering the frequency bases input at each timestep provides a principled way of trading off accuracy for a reduced number of timesteps during inference, if desired, by dropping the least important (highest frequency) components. To summarize, the main contributions of this work are as follows:
• A novel input encoding scheme for SNNs is introduced wherein each timestep of computation encodes distinct information, unlike other rate-encoding methods.
• The proposed encoding scheme is used to learn DCT-SNN, which is able to infer with 2-14X fewer timesteps compared to other state-of-the-art SNNs, while achieving comparable accuracy.
• The proposed technique is, to the best of our knowledge, the first work that leverages frequency-domain learning for SNNs on vision applications.
• To the best of our knowledge, this is also the first work that orders timesteps by significance to reconstruction. This provides an option to trade off accuracy for faster inference by trimming some of the later frequency components, which is non-trivial in other SNNs.
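The accuracy-latency trade-off of the final bullet rests on the DCT's energy compaction, which a toy example can illustrate (our construction: an 8×8 linear-ramp block, with the index sum k + l used as a crude stand-in for the zig-zag frequency ordering):

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix (repeated here to keep the example self-contained).
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

n = 8
block = np.linspace(0.0, 1.0, n * n).reshape(n, n)   # smooth ramp block
C = dct_matrix(n)
coeffs = C @ block @ C.T                             # forward 2-D DCT

# Drop the highest-frequency components: keep only coefficients whose
# frequency index sum k + l falls under a cutoff, then invert the transform.
cutoff = 4
k, l = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
kept = np.where(k + l <= cutoff, coeffs, 0.0)
recon = C.T @ kept @ C                               # inverse 2-D DCT

frac_kept = (k + l <= cutoff).mean()                 # fraction of coefficients kept
err = np.abs(recon - block).mean()                   # mean reconstruction error
```

For a smooth block, keeping under a quarter of the coefficients changes the reconstruction only slightly; in DCT-SNN, dropping these components corresponds simply to stopping inference after fewer timesteps.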

2. RELATED WORKS

Learning ANNs in the frequency domain. Successful learning for vision tasks in the frequency domain has been demonstrated in ANNs in several works. These utilize the DCT coefficients directly available from the JPEG compression standard (Wallace, 1992) without performing the decompression

