ROBUST AND ACCELERATED SINGLE-SPIKE SPIKING NEURAL NETWORK TRAINING WITH APPLICABILITY TO CHALLENGING TEMPORAL TASKS

Abstract

Spiking neural networks (SNNs), particularly the single-spike variant in which neurons spike at most once, are considerably more energy efficient than standard artificial neural networks (ANNs). However, single-spike SNNs are difficult to train due to their dynamic and non-differentiable nature, and current solutions are either slow or suffer from training instabilities. These networks have also been critiqued for their limited computational applicability, such as being unsuitable for time-series datasets. We propose a new model for training single-spike SNNs which mitigates the aforementioned training issues and obtains competitive results across various image and neuromorphic datasets, with up to a 13.98× training speedup and up to an 81% reduction in spikes compared to the multi-spike SNN. Notably, our model performs on par with multi-spike SNNs on challenging tasks involving neuromorphic time-series datasets, demonstrating a broader computational role for single-spike SNNs than previously believed.

1. INTRODUCTION

Artificial neural networks (ANNs) have achieved impressive feats over recent years, obtaining human-level performance on visual and auditory tasks (Hinton et al., 2012; He et al., 2016), natural language processing (Brown et al., 2020) and challenging games (Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019). However, as the difficulty and complexity of the tasks have increased, so has the size of the networks required to solve them, demanding a substantial and unsustainable amount of energy (Strubell et al., 2019; Schwartz et al., 2020). Inspired by the extreme energy efficiency of the brain (Sokoloff, 1960), spiking neural networks (SNNs) emulated on neuromorphic computers attempt to solve this dilemma, requiring significantly less energy than ANNs (Wunderlich et al., 2019). These networks are of growing interest, obtaining noteworthy results on visual (Fang et al., 2021; Zhou & Li, 2021), auditory (Yin et al., 2020; Yao et al., 2021) and reinforcement learning problems (Patel et al., 2019; Tang et al., 2020; Bellec et al., 2020). A particular class of SNNs, in which individual neurons respond with at most one spike, aims to further amplify the energy and scaling advantages over standard SNNs and ANNs. Inspired by the sparse spike processing shown to exist, at least for certain stimuli, in the auditory and visual systems (Heil, 2004; Gollisch & Meister, 2008), and forming a class of universal function approximators (Comsa et al., 2020), these networks obtain extreme energy efficiency due to their single-spike nature (Oh et al., 2021; Liang et al., 2021). Although providing a promising path toward building very large and energy-efficient networks, we are yet to understand how to properly train these SNNs. The success of the backprop training algorithm in ANNs does not naturally transfer to single- and multi-spike SNNs due to their non-differentiable activation function.
Current attempts at training are either slow (as time is sequentially simulated) or suffer from training instabilities (e.g. the dead neuron problem) and idiosyncrasies (e.g. requiring particular regularisation) (Eshraghian et al., 2021). Additionally, it has been argued that single-spike networks have limited applicability and are not suited for temporal problems, as recently pointed out by Eshraghian et al. (2021): "[...] it enforces stringent priors upon the network (e.g., each neuron must fire only once) that are incompatible with dynamically changing input data" and Zenke et al. (2021): "[...] only using single spikes in each neuron has its limits and is less suitable for processing temporal stimuli, such as electroencephalogram (EEG) signals, speech, or videos". In this work we address these shortcomings by proposing a new model for training single-spike networks, whose main contributions are summarised as follows.

1. Our model for training single-spike SNNs eschews all sequential dependence on time and relies exclusively on GPU-parallelisable non-sequential operations. We experimentally validate that this yields faster training times over sequentially trained control models on synthetic benchmarks (up to 16.77× speedup) and real datasets (up to 13.98× speedup).

2. We obtain competitive accuracies on various image and neuromorphic datasets with extreme spike sparsity (up to 81% fewer spikes than standard multi-spike SNNs). Our model is insensitive to the dead neuron problem, which in other single-spike training methods tends to halt learning due to reduced network activity, and does not require careful network regularisation.

3. We showcase our model's applicability in deeper and convolutional networks, and, through the inclusion of trainable membrane time constants, solve difficult temporal problems otherwise thought to be unsolvable by single-spike networks.

2.1. SINGLE-SPIKE MODEL

A spiking neural network (SNN) consists of artificial neurons which output binary signals known as spikes (Figure 1a). Assume a feedforward network architecture of $L$ fully connected layers, where each layer $l$ consists of $N^{(l)}$ neurons that are fully connected to the next layer $l+1$ via synaptic weights $W^{(l+1)} \in \mathbb{R}^{N^{(l+1)} \times N^{(l)}}$. Neuron $i$ in layer $l$ emits a spike $S_i^{(l)}[t] \in \{0, 1\}$ at time $t$ if its membrane potential $V_i^{(l)}[t] \in \mathbb{R}$ reaches firing threshold $V_{th}$.

$$S_i^{(l)}[t] = f(V_i^{(l)}[t]) = \begin{cases} 1, & \text{if } V_i^{(l)}[t] > V_{th} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

Membrane potentials evolve according to the leaky integrate and fire (LIF) model

$$\tau \frac{dV_i^{(l)}(t)}{dt} = -V_i^{(l)}(t) + V_{rest} + R I_i^{(l)}(t) \quad (2)$$

where $\tau \in \mathbb{R}$ is the membrane time constant and $R \in \mathbb{R}$ is the input resistance (Gerstner et al., 2014). Without loss of generality the LIF model is normalised ($V_i^{(l)}(t) \in [0, 1]$ by $V_{rest} = 0$, $V_{th} = 1$, $R = 1$; see Appendix) and discretised using the forward Euler method (see Appendix), from which the membrane potential can be computed at every discrete simulation time step $t \in \{1, \dots, T\}$ for $T \in \mathbb{N}$ using the difference equation below.

$$V_i^{(l)}[t+1] = \beta V_i^{(l)}[t] + (1 - \beta)\underbrace{\Big(b_i^{(l)} + \sum_{j=1}^{N^{(l-1)}} W_{ij}^{(l)} S_j^{(l-1)}[t+1]\Big)}_{\text{Input current } I_i^{(l)}[t+1]} - \underbrace{S_i^{(l)}[t]}_{\text{Spike reset}} \quad (3)$$

The membrane potential is charged by the current induced by the incoming presynaptic spikes $S^{(l-1)}[t] \in \mathbb{R}^{N^{(l-1)}}$ and by the constant bias current source $b_i^{(l)}$. Over time, this potential dissipates, and the degree of dissipation is captured by $0 \le \beta = \exp(-\frac{\Delta t}{\tau}) \le 1$ (for simulation time-step size $\Delta t \in \mathbb{R}$). The neuron's membrane potential is at resting state $V_{rest} = 0$ in the absence of any input current, and the neuron emits a spike $S_i^{(l)}[t] = 1$ if the potential rises above firing threshold $V_{th} = 1$ (after which it is reduced back close to resting state).
To enforce the single-spike constraint, we keep track of whether a neuron has spiked prior to time $t$ using the variable $d_i^{(l)}[t] = \max(S_i^{(l)}[t-1], d_i^{(l)}[t-1])$, which is zero before the first spike and one thereafter ($d_i^{(l)}[t=0] = 0$). We then redefine the output spikes as $S_i^{(l)}[t] = (1 - d_i^{(l)}[t]) \cdot S_i^{(l)}[t]$, thus ensuring that no more than a single spike is emitted during simulation (Figure 1b).
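As a reference point for the speedup discussion that follows, the sequential dynamics of Equations 1–3 together with the single-spike gate $d$ can be sketched in PyTorch as below (a minimal sketch under our own naming; the paper's released code may differ):

```python
import torch

def sequential_single_spike_lif(S_in, W, b, beta, v_th=1.0):
    """Reference (sequential) simulation of a single-spike LIF layer.

    S_in: presynaptic spikes, shape (batch, N_pre, T)
    W:    weights, shape (N_post, N_pre); b: bias, shape (N_post,)
    beta: membrane decay in [0, 1]
    Returns output spikes of shape (batch, N_post, T).
    """
    batch, _, T = S_in.shape
    n_post = W.shape[0]
    v = torch.zeros(batch, n_post)       # membrane potential, V[0] = 0
    d = torch.zeros(batch, n_post)       # 1 once a neuron has spiked
    s_prev = torch.zeros(batch, n_post)  # previous-step spikes (reset term)
    out = []
    for t in range(T):
        i_t = S_in[:, :, t] @ W.T + b                 # input current (Eq. 4)
        v = beta * v + (1.0 - beta) * i_t - s_prev    # difference eq. (Eq. 3)
        s = (v > v_th).float()                        # spike function (Eq. 1)
        s = (1.0 - d) * s                             # at most one spike
        d = torch.maximum(d, s)
        s_prev = s
        out.append(s)
    return torch.stack(out, dim=-1)
```

Each time step depends on the previous one, which is exactly the sequential bottleneck the model proposed in Section 3 removes.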

2.2. SINGLE-SPIKE TRAINING TECHNIQUES

The main problem with training single-and multi-spike SNNs is the non-differentiable nature of their activation function. This precludes the direct use of the backprop algorithm (Rumelhart et al., 1986) , which has underpinned the successful training of ANNs. Various SNN training solutions have been proposed, which we group into three categories.

Shadow training

Instead of directly training a SNN, an already trained ANN is mapped to a SNN. This approach has actively been explored in the multi-spike setting (O'Connor et al., 2013; Esser et al., 2015; Rueckauer et al., 2016; 2017), with recent work extending this to single-spike networks (Stöckl & Maass, 2019; Park et al., 2020). Although these approaches permit the training of large networks, they come with various shortcomings. Some shortcomings are method specific, such as in Stöckl & Maass (2019), who outline how a single ANN unit can be represented as a network of spiking units; however, this leads to an undesirable blowup of network parameters in their conversion process (which our approach avoids). Other shortcomings are more general, such as the lack of support for training neural parameters besides synaptic weights (which our approach permits) or inference accuracy being lost in the conversion process, where mapped SNNs perform worse than the original ANNs (which we avoid).

Training using the spike times

An approach used to directly train SNNs using backprop involves passing gradients through the time of spiking, which sidesteps the aforementioned non-differentiability issue (Bohte et al., 2002; Mostafa, 2017; Comsa et al., 2020; Kheradpisheh & Masquelier, 2020; Zhang et al., 2021; Zhou & Li, 2021; Zhou et al., 2021). Although commonly used for training single-spike SNNs, this approach suffers from various shortcomings, such as 1. the dead neuron problem, where a lack of spiking activity halts the learning process (which we overcome), 2. being usually constrained to integrate and fire (IF) neurons (where we support both the IF and LIF model), 3. having performance dependent on the computationally costly processing of presynaptic spikes using postsynaptic potential (PSP) kernels (which we show not to be necessary) and 4. requiring careful network regularisation (which we avoid).
Training using the membrane potentials

Another approach to directly training SNNs using backprop is to replace the undefined gradient of the non-differentiable spike function with a surrogate gradient (Esser et al., 2016; Hunsberger & Eliasmith, 2015; Zenke & Ganguli, 2018; Lee et al., 2016), which permits the flow of gradient through every membrane potential in time (Bellec et al., 2018; Shrestha & Orchard, 2018; Neftci et al., 2019). This method has been shown to circumvent the dead neuron problem and to permit the training of other neural parameters besides synaptic connectivity (such as membrane time constants) that have been shown to improve network performance (Perez-Nieves et al., 2021). However, these results have not previously been replicated in the single-spike setting (which we do here). A shortcoming of this method is its slow training speed, as the network needs to be sequentially simulated at every point in time (which we overcome).

3. PROPOSED TRAINING SPEEDUP ALGORITHM

We propose a new model for training SNNs in which individual neurons spike at most once. Our solution overcomes the slow training speeds of prior training algorithms by eschewing all sequential dependence and recasting the standard single-spike model to rely exclusively on non-sequential operations. Although our model performs more calculations than the standard single-spike model, all of these calculations are highly parallelisable and thus substantially faster to train (see Appendix). Our model comprises three main steps which are readily implementable in modern auto-differentiation frameworks (Abadi et al., 2016; Paszke et al., 2017; Bradbury et al., 2018). For illustration purposes, we provide a diagram of the model's computational graph (Figure 2a) and an example of how input spikes are transformed throughout the model's different layers of processing (Figure 2b).

1. Convert presynaptic spikes to input current

As in the standard model, we map the time series of presynaptic spikes $S_j^{(l-1)}$ to a time series of input currents $I_i^{(l)}$, which is achieved using a tensor multiplication.

$$I_i^{(l)}[t] = \sum_{j=1}^{N^{(l-1)}} W_{ij}^{(l)} S_j^{(l-1)}[t] \quad (4)$$
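This first step is a single batched tensor contraction; a minimal PyTorch sketch (function name ours, not from the paper's released code) might look like:

```python
import torch

def spikes_to_current(S_prev, W, b):
    """Map presynaptic spike trains to input currents for a whole layer (Eq. 4).

    S_prev: spikes of layer l-1, shape (batch, N_prev, T)
    W:      weights, shape (N_post, N_prev); b: bias, shape (N_post,)
    Returns currents of shape (batch, N_post, T) in one batched contraction.
    """
    I = torch.einsum("ij,bjt->bit", W, S_prev)
    return I + b[None, :, None]  # bias current added at every time step
```

Because every time step is independent here, this step already contains no sequential dependence.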

2. Calculate membrane potentials without reset

In contrast to the standard model, we calculate modified membrane potentials $\tilde{V}_i^{(l)}$ from the input current $I_i^{(l)}$ by excluding the reset mechanism. By dropping the reset term $-S_i^{(l)}[t]$ in Equation 3 and unrolling this altered equation (see Appendix), we obtain a convolutional form allowing us to calculate these no-reset membrane potentials $\tilde{V}_i^{(l)}$ without any sequential operations (where $\boldsymbol{\beta} = [\beta^0, \beta^1, \dots, \beta^{T-1}]$).

$$\tilde{V}_i^{(l)}[t] = \beta^t V_i^{(l)}[0] + (1 - \beta)\sum_{k=1}^{t} \beta^{t-k} I_i^{(l)}[k] = \beta^t V_i^{(l)}[0] + (1 - \beta)\big(I_i^{(l)} \circledast \boldsymbol{\beta}\big)[t] \quad (5)$$
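Because Equation 5 is a causal convolution of the current with the decay kernel, the no-reset potentials for all $T$ steps can be computed in one parallel operation. A sketch, assuming a scalar $\beta$ and currents of shape (batch, N, T) (our own naming, not the paper's code):

```python
import torch
import torch.nn.functional as F

def no_reset_membrane(I, beta, v0=0.0):
    """Compute no-reset membrane potentials in parallel (Eq. 5).

    I:    input currents, shape (batch, N, T); beta: scalar decay factor.
    A single causal convolution with the kernel [beta^(T-1), ..., beta, 1]
    evaluates (1 - beta) * (I ⊛ beta)[t] for every t at once.
    """
    batch, n, T = I.shape
    t = torch.arange(T, dtype=I.dtype)
    kernel = (beta ** (T - 1 - t)).view(1, 1, T)   # [beta^(T-1), ..., 1]
    x = I.reshape(batch * n, 1, T)
    v = F.conv1d(F.pad(x, (T - 1, 0)), kernel)     # causal cross-correlation
    v = (1.0 - beta) * v.reshape(batch, n, T)
    decay = beta ** torch.arange(1, T + 1, dtype=I.dtype)
    return v + v0 * decay                          # beta^t * V[0] term
```

The result matches the sequential recursion $V[t+1] = \beta V[t] + (1-\beta)I[t+1]$ without the reset term, step for step.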

3. Map no-reset membrane potentials to output spikes

We map the time series of no-reset membrane potentials $\tilde{V}_i^{(l)}$ to output spikes $S_i^{(l)}$ (which contains at most one spike). We obtain a time series of erroneous output spikes $\tilde{S}_i^{(l)}$ by passing the no-reset membrane potentials $\tilde{V}_i^{(l)}$ through the spike function $f$ (Equation 1).

$$\tilde{S}_i^{(l)}[t] = f(\tilde{V}_i^{(l)}[t]) \quad (6)$$

Due to the removal of the spike reset mechanism, only the first spike occurrence in $\tilde{S}_i^{(l)}$ follows the dynamics set out by the LIF model (Equation 3), and thus all spikes succeeding the first spike occurrence must be removed (compliant with the single-spike assumption). We achieve this by constructing correct output spikes $S_i^{(l)}$ with $S_i^{(l)}[t] = 0$ for $t \in \{1, 2, \dots, T\}$ except $S_i^{(l)}[t] = 1$ for the smallest $t$ satisfying $\tilde{V}_i^{(l)}[t] > 1$ (if such $\tilde{V}_i^{(l)}[t]$ exists). A straightforward solution would be to iterate over all elements in $\tilde{S}_i^{(l)}$ and set all spikes succeeding the first to zero, but such a sequential calculation is the very problem we set out to remediate. We propose a vectorised solution comprised of two steps: 1. Map the erroneous output spikes $\tilde{S}_i^{(l)}$ to a latent representation $z_i^{(l)} = \phi(\tilde{S}_i^{(l)})$, where every element therein encodes an ordering of the spikes. This is achieved by passing the erroneous output spikes through the proposed function $\phi$ (Proposition 1), which maps all elements besides the first spike occurrence to a value other than one ($z_i^{(l)}[t] \neq 1$ for all $t$ except for the smallest $t$ satisfying $\tilde{S}_i^{(l)}[t] = 1$, if such $t$ exists). 2. Obtain the correct output spikes $S_i^{(l)} = g(z_i^{(l)})$ by passing the latent representation $z_i^{(l)}$ through function $g$, which uses the encoded spike ordering to produce the correct output spikes by mapping every value besides one to zero.

$$g(z_i^{(l)})[t] = \begin{cases} 1, & \text{if } z_i^{(l)}[t] = 1 \\ 0, & \text{otherwise} \end{cases} \quad (7)$$

Proposition 1.
Function $\phi(\tilde{S}_i^{(l)})[t] = \sum_{k=1}^{t} \tilde{S}_i^{(l)}[k](t - k + 1)$ acting on $\tilde{S}_i^{(l)} \in \{0, 1\}^T$ contains at most one element equal to one: $\phi(\tilde{S}_i^{(l)})[t] = 1$ for the smallest $t$ satisfying $\tilde{S}_i^{(l)}[t] = 1$ (if such $t$ exists).

Proof. Firstly, if $\tilde{S}_i^{(l)}[t] = 0$ for all $t \in [1, T]$ then $\phi(\tilde{S}_i^{(l)})[t] = 0$ for all $t \in [1, T]$ (follows from substitution). Secondly, if $\tilde{S}_i^{(l)}[t_1] = 1$ for smallest $t_1 \in [1, T]$ then $\phi(\tilde{S}_i^{(l)})[t_1] = 1$ (follows from substitution), and there can exist no $t_2 > t_1$ such that $\phi(\tilde{S}_i^{(l)})[t_2] = 1$, as

$$\begin{aligned}
\phi(\tilde{S}_i^{(l)})[t+1] &= \sum_{k=1}^{t+1} \tilde{S}_i^{(l)}[k]\big((t+1) - k + 1\big) \\
&= \sum_{k=1}^{t} \tilde{S}_i^{(l)}[k]\big((t+1) - k + 1\big) + \tilde{S}_i^{(l)}[t+1] \\
&= \sum_{k=1}^{t} \tilde{S}_i^{(l)}[k](t - k + 1) + \sum_{k=1}^{t} \tilde{S}_i^{(l)}[k] + \tilde{S}_i^{(l)}[t+1] \\
&= \phi(\tilde{S}_i^{(l)})[t] + \sum_{k=1}^{t+1} \tilde{S}_i^{(l)}[k] \quad (8)
\end{aligned}$$

Thus $\phi(\tilde{S}_i^{(l)})[t_2] > \phi(\tilde{S}_i^{(l)})[t_1]$ for all $t_2 > t_1$, as $\sum_{k=1}^{t_2} \tilde{S}_i^{(l)}[k] \ge \sum_{k=1}^{t_1} \tilde{S}_i^{(l)}[k] = 1 > 0$.
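A convenient consequence of Proposition 1 is that $\phi$ needs no explicit loop: the sum $\sum_{k \le t} \tilde{S}[k](t - k + 1)$ is exactly a double cumulative sum along time, so steps 3.1 and 3.2 reduce to two `cumsum` calls and a comparison (a sketch of the forward mapping only; the released code may implement this differently):

```python
import torch

def first_spike_only(S_tilde):
    """Vectorised removal of all spikes after the first (steps 3.1 and 3.2).

    S_tilde: erroneous spikes from the no-reset potentials, shape (..., T).
    phi reduces to a double cumulative sum along time, since
    sum_{k<=t} S[k] * (t - k + 1) == cumsum(cumsum(S))[t].
    """
    z = S_tilde.cumsum(dim=-1).cumsum(dim=-1)   # latent ordering phi(S)
    return (z == 1).to(S_tilde.dtype)           # g: keep only where z[t] == 1
```

Note that in training, gradients flow through the membrane potentials via the surrogate spike function rather than through this hard comparison.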

4. EXPERIMENTS AND RESULTS

We investigate our model's speedup advantages and performance on real datasets in comparison to prior work. All models were implemented using PyTorch (Paszke et al., 2017) with benchmarks and training conducted on a cluster of NVIDIA Tesla A100 GPUs. 

4.1. SPEEDUP BENCHMARKS

We evaluate the speedup advantages of our model over the standard single-spike model trained using surrogate gradients by simulating the forward and backward passes for different numbers of hidden units, layers, simulation steps and batch sizes on a synthetic spike dataset (see Appendix).

Robust speedup for different numbers of hidden units and simulation steps We observe a considerable training speedup across a range of hidden units and simulation steps in a single layer (Figure 3a). We obtain an optimal speedup of 16.77× for n = 100 units and t = 2^7 time steps, where our model takes 4.34 ± 0.9 ms compared to the 72.82 ± 2.7 ms it takes the standard model to complete a training pass (Figure 3b). Our model still obtains a reasonable speedup of 3.40× for the largest benchmarked n = 1000 units and t = 2^11 time steps (albeit with a smaller forward pass speedup). These speedups are even more pronounced when the membrane time constants are fixed (obtaining a maximum speedup of 17.42×) or when using smaller batch sizes (with batch sizes b = 32 and b = 64 obtaining a maximum speedup of 35.05× and 25.16×, respectively; see Appendix).

Applicability to deeper networks

We find our model to obtain substantial training speedups when using multiple layers (Figure 3c) and layers containing thousands of neurons (Figure 3d). The training speedups remain similar across an increasing number of layers for different numbers of hidden units (Figure 3c). Furthermore, we obtain a speedup of ∼3× when using a large number of neurons (up to 2 · 10^4 neurons) in a single layer (Figure 3d). Interestingly, these speedups remain approximately the same across the different numbers of neurons, even when the batch size is changed.

Speedup advantages and room for improvement Previous attempts at accelerating SNN training either speed up the backward pass (Perez-Nieves & Goodman, 2021) or remove it completely (Bellec et al., 2020). These methods however still sequentially compute the forward pass, which our model is able to accelerate (Figure 3e). Furthermore, we observe the backward pass to slow down relative to the forward pass for increasing time steps (Figure 3f). Further training speedup may therefore be achieved using sparse gradient descent, as auto-differentiation frameworks are not optimised for the sparse nature of SNNs (Perez-Nieves & Goodman, 2021).

4.2. PERFORMANCE ON REAL DATASETS

We investigate the applicability of our model to classifying real data from different domains and of varying complexity (Table 1). These include the Yin-Yang dataset (Kriener et al., 2022), in which the goal is to classify spatial coordinates belonging to different groups, and the MNIST (LeCun, 1998) and Fashion-MNIST (F-MNIST) (Xiao et al., 2017) image datasets, where the objective is to classify images of handwritten digits and fashion items. All these analog datasets were converted into a spike representation using the time-to-first-spike encoding (see Appendix). We also test performance on two neuromorphic datasets: the visual N-MNIST dataset (Orchard et al., 2015) and the more difficult auditory SHD dataset (Cramer et al., 2020). The N-MNIST dataset is the MNIST dataset mapped onto a spike code using a neuromorphic vision sensor, and the SHD dataset comprises spoken digit waveforms converted into spikes using a model of auditory bushy neurons in the cochlear nucleus.

Obtaining competitive results across different image and neuromorphic datasets

The results of our model across all datasets are comparable or superior to prior reported results using single-spike SNNs. We reach accuracies of 98.02%, 97.91% and 89.05% using a single-hidden-layer network on the Yin-Yang, MNIST and F-MNIST datasets respectively, where the best-performing prior work reported accuracies of 95.90%, 98.50% and 88.1% respectively. Furthermore, our single-spike model nearly matches the accuracies obtained by the standard multi-spike SNN on these datasets (Yin-Yang and MNIST < 0.3% difference; F-MNIST ∼ 2% difference; Figure 4a).

Single-spike neurons solve challenging temporal problems using neural heterogeneity It has been noted that single-spike SNNs are well suited for static datasets (such as spike-encoded images) and less suited for processing temporally complex stimuli (such as audio or video) due to the single-spike constraint (Zenke et al., 2021; Eshraghian et al., 2021). Prior single-spike SNN training techniques have attempted to optimise network connectivity without learning other neural parameters, such as membrane time constants, which have been shown to improve performance in multi-spike SNNs (Perez-Nieves et al., 2021). We explored the effect of learning the membrane time constants in our single-spike model. We obtained an accuracy of 44.50% using a network trained with fixed time constants on the temporally complex auditory SHD dataset. However, by including learnable time constants we were able to obtain a much higher accuracy of 70.32%, which is similar to the performance obtained by a standard SNN with trainable time constants (70.81%) or recurrent connections (71.40%).

Drastic speedup in training

We obtain over a four-fold training speedup across all datasets over the standard single-spike SNN, with a maximum speedup of 13.98× on the Yin-Yang dataset (Figure 4b). We observe similar training speedups over the multi-spike SNN (see Appendix). Differences in speedups are due to the different temporal lengths and input dimensions of the datasets, as well as the different network architectures employed (see section 4.1).

Increased spike sparsity Our single-spike SNN is able to solve various datasets with a large reduction in spikes compared to a standard multi-spike SNN (Figure 4c), with at least a 44% and up to an 81% reduction in spikes. This corroborates the value of obtaining more energy-efficient computations using single- rather than multi-spike neuromorphic systems (Liang et al., 2021; Oh et al., 2021; Zhou et al., 2021), as energy consumption scales approximately proportionally to the number of emitted spikes (Panda et al., 2020).

Training deeper convolutional architectures

We evaluate our model in deeper convolutional architectures, which to date remains largely unexplored in single-spike SNNs (Mirsadeghi et al., 2022) . We trained a multi-layer convolutional network on the MNIST and F-MNIST datasets, obtaining accuracies (MNIST: 99.32% and F-MNIST: 90.57%) similar to best performing prior work (MNIST: 99.4% and F-MNIST: 92.8%), whilst being faster to train in comparison to the control (MNIST-speedup ∼ 1.37× and F-MNIST-speedup ∼ 1.39×).

Robust learning and bypassing the dead neuron problem

A limitation of current single-spike SNN training methods is the dead neuron problem, referring to the hindrance to learning when neurons do not spike, as the learning signal depends on the occurrence of spikes (Eshraghian et al., 2021). Our model is able to overcome this problem as we use surrogate gradients for training, in which the learning signal is instead passed through the membrane potentials. We experimentally verified this by showing that networks instantiated with zero starting activity (fatal to other single-spike training methods) still manage to solve different datasets (Figure 4d).

5. DISCUSSION

SNNs emulated on neuromorphic hardware are a promising avenue towards addressing the energy and scaling constraints of ANNs (Wunderlich et al., 2019). Single-spike SNNs further amplify these energy improvements through extreme spike sparsity, as energy consumption scales approximately proportionally to the number of emitted spikes (Panda et al., 2020). To date, SNN training remains challenging due to the non-differentiable nature of the spike function, prohibiting the direct use of the backprop training algorithm which underpins the success of ANNs. Various extensions of backprop for SNNs have been proposed, but fall short in particular aspects. Gradients can be passed through the timing of spikes (Bohte et al., 2002; Mostafa, 2017; Kheradpisheh & Masquelier, 2020), yet this method suffers from the dead neuron problem, requires careful regularisation or imposes computationally expensive modelling constraints. Alternatively, gradients can be passed through the membrane potentials using surrogate gradients (Shrestha & Orchard, 2018; Neftci et al., 2019), and although this method improves upon the problems of passing gradients through the spike times, it is painfully slow. In this work, we address these problems by proposing a new general model (e.g. neurons can be IF or LIF) for training single-spike SNNs, which imposes no modelling constraints (e.g. requiring PSP kernels) or training constraints (e.g. requiring careful regularisation) and supports the training of neural parameters other than synaptic connectivity (e.g. membrane time constants). We mathematically show how training can be sped up by replacing the slow sequential operations with faster convolutional ones. We experimentally validate this speedup across various numbers of units, time steps, layers and batch sizes, obtaining up to a 16.77× speedup. We show that our model can be trained across different network architectures (e.g.
feedforward, hierarchical and convolutional) and obtains competitive results on different image and neuromorphic datasets. Our results compare well against multi-spike SNNs (< 2% accuracy difference on all datasets) and obtain up to an 81% reduction in spike counts. Furthermore, our method circumvents the dead neuron problem and, for the first time, we show how single-spike SNNs can solve temporally-complex datasets on a par with multi-spike SNNs by including trainable membrane time constants. Our findings therefore challenge the dogma that single-spike SNNs are only suited to non-temporal problems (Eshraghian et al., 2021; Zenke et al., 2021). We obtain training speedups on all datasets but find that the backward pass slows down relative to the forward pass for longer timespans. Future work could mitigate this bottleneck and accelerate training using sparse gradient descent, which has been shown to accelerate the backward pass in standard SNNs by taking advantage of spike sparsity (Perez-Nieves & Goodman, 2021). Currently, our single-spike model performs slightly worse than its multi-spike counterpart; better performance could be achieved by extending our model to the multi-spike setting and permitting recurrent connectivity. Finally, it remains an open question how the inclusion of trainable membrane time constants in our model boosts performance, requiring further theoretical analysis.

6. REPRODUCIBILITY STATEMENT

The theoretical construction and derivations of our model are outlined in section 3 and we provide accompanying derivations in the Appendix. All code is publicly available at https://github.com/webstorms/Block under the BSD 3-Clause Licence. This includes instructions on installation, data processing and running experiments to reproduce all results and figures portrayed in the paper. Training details are also provided in the Appendix.

A APPENDIX

A.1.1 NORMALISING THE LIF MODEL

Proof. This mapping from any LIF model to the normalised LIF model is achieved using the following transformation (taken from Hunsberger (2018)).

$$\tilde{V}(t) = \frac{V(t) - V_{rest}}{V_{th} - V_{rest}} \quad (9)$$

Rearranging this expression with respect to $V(t) = \tilde{V}(t)(V_{th} - V_{rest}) + V_{rest}$ and substituting this into the LIF model we obtain

$$\begin{aligned}
\tau \frac{dV(t)}{dt} &= -V(t) + V_{rest} + R I(t) \\
(V_{th} - V_{rest})\,\tau \frac{d\tilde{V}(t)}{dt} &= -\big(\tilde{V}(t)(V_{th} - V_{rest}) + V_{rest}\big) + V_{rest} + R I(t) \\
\tau \frac{d\tilde{V}(t)}{dt} &= -\tilde{V}(t) + \underbrace{\frac{R}{V_{th} - V_{rest}} I(t)}_{\text{Input current } \tilde{I}(t)} \quad (10)
\end{aligned}$$

This new LIF form has a resting potential $\tilde{V}_{rest} = 0$ and firing threshold $\tilde{V}_{th} = 1$ (obtained by substituting $V(t) = V_{rest}$ and $V(t) = V_{th}$ in Equation 9 respectively). Thus, without loss of generality, any LIF model can be mapped to a normalised form using the linear transformation in Equation 9.

A.1.2 DISCRETISING THE LIF MODEL

Proof. We proceed using the forward Euler method. Let $I(t) = I$ be constant with respect to time, for which the ordinary differential equation becomes separable.

$$\tau \frac{dV(t)}{dt} = -V(t) + I, \qquad \int \frac{dV}{V(t) - I} = -\frac{1}{\tau}\int dt, \qquad \ln(V(t) - I) = -\frac{t}{\tau} + \ln(k), \qquad V(t) = k\exp\Big(-\frac{t}{\tau}\Big) + I \quad (11)$$

For initial solution $V(t_0)$ at time $t_0$ we derive $k = (V(t_0) - I)\exp(\frac{t_0}{\tau})$. Then for constant $I$ and initial solution $V(t_0)$ we obtain the solution

$$V(t) = (V(t_0) - I)\exp\Big(-\frac{t - t_0}{\tau}\Big) + I = \exp\Big(-\frac{t - t_0}{\tau}\Big)V(t_0) + \Big(1 - \exp\Big(-\frac{t - t_0}{\tau}\Big)\Big)I \quad (12)$$

To obtain the discretised update equation, we define simulation update time step $\Delta t = t - t_0$, decay factor $\beta = \exp(-\frac{\Delta t}{\tau})$, assign continuous time points to discretised time steps $t \leftarrow t_0$ and $t + 1 \leftarrow t_0 + \Delta t$, and assume the input current to be approximately constant and equal to $I[t+1]$ between discretised update steps $t$ and $t+1$.

$$V[t+1] = \beta V[t] + (1 - \beta) I[t+1] \quad (13)$$

A.1.3 UNROLLING THE LEAKY INTEGRATE AND FIRE MODEL WITHOUT THE RESET TERM

Proposition 4.
Equation $V[t] = \beta^t V[0] + (1 - \beta)\sum_{i=1}^{t} \beta^{t-i} I[i]$ is equivalent to the difference equation $V[t] = \beta V[t-1] + (1 - \beta)I[t]$ for $t \ge 1$.

Proof. We prove equivalence by induction. For $t = 1$ we obtain

$$V[1] = \beta^1 V[0] + (1 - \beta)\sum_{i=1}^{1} \beta^{1-i} I[i] = \beta^1 V[0] + (1 - \beta)I[1]$$

Hence the relation holds for the base case $t = 1$. Assume the relation holds for $t = k \ge 1$; then for $t = k + 1$ we derive

$$\begin{aligned}
V[k+1] &= \beta V[k] + (1 - \beta)I[k+1] \\
&= \beta\Big(\beta^k V[0] + (1 - \beta)\sum_{i=1}^{k} \beta^{k-i} I[i]\Big) + (1 - \beta)I[k+1] \\
&= \beta^{k+1} V[0] + (1 - \beta)\sum_{i=1}^{k} \beta^{(k+1)-i} I[i] + (1 - \beta)I[k+1] \\
&= \beta^{k+1} V[0] + (1 - \beta)\sum_{i=1}^{k+1} \beta^{(k+1)-i} I[i]
\end{aligned}$$

This establishes the case $t = k + 1$ given $t = k$, and by the principle of induction the equivalence holds.

A.2 ADDITIONAL MODEL THEORY: WHY IS OUR MODEL FASTER?

To address why our model is faster than the standard single-spike model, we analyse their respective computational complexities. Consider a single neuron with $N$ presynaptic neurons simulated for $T$ time steps. Our model has a computational complexity of $O(NT^2)$, whereas that of the standard model is $O(NT)$. However, the sequential complexity of our model is constant time $O(1)$ (as our model eschews all sequential dependence), whereas that of the standard model is linear, $O(T)$. Our model thus performs more calculations than the standard single-spike model, yet obtains faster training speeds because, unlike in the standard model, all these calculations are highly parallelisable.

A.3.1 SYNTHETIC SPIKE DATASET FOR THE SPEED BENCHMARKS

We generated binary input spike tensors of shape B ×N ×T (B being the batch size, N the number of input neurons and T the number of simulation steps). For every batch dimension b a firing rate r b ∼ U(u min , u max ) was uniformly sampled (with u min = 0Hz and u max = 200Hz), from which a random binary spike matrix of shape N × T was constructed, such that every input neuron in this matrix had an expected firing rate of r b Hz.
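A sketch of this generation procedure (the 1 ms simulation step `dt` is an assumption of ours, as the text does not state it explicitly):

```python
import torch

def synthetic_spikes(B, N, T, dt=1e-3, r_min=0.0, r_max=200.0, seed=0):
    """Random binary spike tensor of shape (B, N, T) as in the speed benchmarks.

    For each batch element a firing rate r_b ~ U(r_min, r_max) Hz is drawn,
    and every neuron then spikes independently with probability r_b * dt
    per simulation step, giving an expected firing rate of r_b Hz.
    """
    g = torch.Generator().manual_seed(seed)
    rates = torch.empty(B).uniform_(r_min, r_max, generator=g)   # Hz
    p = (rates * dt).clamp(0.0, 1.0).view(B, 1, 1)               # per-step prob.
    return (torch.rand(B, N, T, generator=g) < p).float()
```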

A.3.2 TIME-TO-FIRST-SPIKE ENCODING

We encoded all analog non-spiking input data into a spike raster using the time-to-first-spike coding method (Kheradpisheh & Masquelier, 2020). Here, every scalar value $I_i \in [0, I_{max}]$ within an input tensor is converted into a spike train with a single spike, where the time of spike $t_i \in [0, T]$ is determined by the following equation.

$$t_i = \Big\lfloor \frac{I_{max} - I_i}{I_{max}} T \Big\rfloor$$

The readout of the network was taken as the membrane potential of the final layer summed over time, $\sum_t V_{b,c}^{(L)}[t]$, for batch element $b$ and class $c$ (see Table 2).
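A minimal sketch of this encoding (treating values that map to $t_i = T$, i.e. zero intensity, as non-spiking is an assumption of ours; the function name is also ours):

```python
import torch

def ttfs_encode(x, T, i_max=1.0):
    """Time-to-first-spike encoding of analog values into a spike raster.

    x: analog inputs in [0, i_max] (any shape); T: number of time steps.
    Larger values spike earlier: t_i = floor((i_max - x) / i_max * T).
    Values mapping to t_i = T (i.e. x == 0) emit no spike here (assumption).
    Returns a tensor of shape (*x.shape, T) with at most one spike per value.
    """
    t_idx = torch.floor((i_max - x) / i_max * T).long()
    raster = torch.zeros(*x.shape, T)
    fires = (t_idx < T).float().unsqueeze(-1)          # spike only if t_i < T
    raster.scatter_(-1, t_idx.clamp(max=T - 1).unsqueeze(-1), fires)
    return raster
```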

A.4.2 BETA CLIPPING

As the beta $\beta_i^{(l)}$ (a transformation of the membrane time constant) of every neuron was optimised, we had to enforce correct neuron dynamics by clipping the values into the range $[0, 1]$. Note that $\beta_i^{(l)} = 0$ implies no memory, i.e. a binary neuron; $0 < \beta_i^{(l)} < 1$ implies decaying memory, i.e. a LIF neuron; and $\beta_i^{(l)} = 1$ implies full memory, i.e. an IF neuron.

$$\beta_i^{(l)} \leftarrow \begin{cases} 1, & \text{if } \beta_i^{(l)} > 1 \\ 0, & \text{if } \beta_i^{(l)} < 0 \\ \beta_i^{(l)}, & \text{otherwise} \end{cases} \quad (17)$$

A.4.3 WEIGHT INITIALISATION

The network weights in a layer were sampled from a uniform distribution $U(-\sqrt{N^{-1}}, \sqrt{N^{-1}})$, except for the Yin-Yang dataset, for which the weights were sampled from $U(-\sqrt{2N^{-1}}, \sqrt{2N^{-1}})$. For the feedforward layers $N$ was set to the number of afferent connections to the layer, and for the convolutional layers $N = k^2$ for kernel shape $k \times k$. The bias terms were initialised to 0 in all networks. All neurons in the hidden layers were initialised with a membrane time constant $\tau = 10$ ms and the readout neurons with $\tau = 20$ ms.

A.4.5 SURROGATE GRADIENT

The backprop algorithm requires all nodes within the computational graph of optimisation to be differentiable. This requirement is however violated in a SNN due to the non-differentiable Heaviside step spike function $f$. To permit the use of backprop, we replaced the undefined derivative of the spike function with the fast sigmoid surrogate gradient $\frac{dS}{dV} \approx \big(\beta_{sur}|V - V_{th}| + 1\big)^{-2}$ (Zenke & Ganguli, 2018), which has been shown to work well in practice (Zenke & Vogels, 2021). Here hyperparameter $\beta_{sur}$ (which we set to 10 in all experiments) defines the slope of the gradient.
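A sketch of such a surrogate spike function as a `torch.autograd.Function` (the exact surrogate form is our reconstruction from the Zenke & Ganguli (2018) fast sigmoid cited above; the paper's released code may differ):

```python
import torch

class FastSigmoidSpike(torch.autograd.Function):
    """Heaviside spike function with a fast-sigmoid surrogate gradient.

    Forward: S = (V > V_th). Backward: dS/dV is replaced by
    1 / (beta_sur * |V - V_th| + 1)^2, with beta_sur = 10 as in the
    experiments (surrogate form assumed from Zenke & Ganguli, 2018).
    """
    beta_sur = 10.0
    v_th = 1.0

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > FastSigmoidSpike.v_th).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        u = FastSigmoidSpike.beta_sur * (v - FastSigmoidSpike.v_th).abs() + 1.0
        return grad_out / u.pow(2)

spike_fn = FastSigmoidSpike.apply
```

The forward pass stays a hard threshold, while the backward pass lets gradient flow through every membrane potential, which is what makes the method insensitive to the dead neuron problem.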



Notes: We use (·) to refer to continuous time and [·] to refer to discrete time. Boldface variables denote arrays as opposed to scalar values. We still permit gradients to flow through the points where g(z_i^(l))[t] = 0; this is due to the convolutional algorithm chosen by cuDNN (Chetlur et al., 2014). Results reported by Kheradpisheh et al. (2022). Note that, unlike the neuromorphic datasets, the training speedup for the image datasets is dependent on the selected number of simulation time steps for transforming an image into the temporal domain (see Appendix for chosen values).



Figure 1: Spiking neuron dynamics. a. Left: A multi-spike neuron emitting and receiving (per presynaptic terminal) multiple spikes. Right: Input and output activity of the neuron (bottom panel: input raster; middle panel: input current I; top panel: membrane potential V. The dotted line represents the firing threshold and a dot above denotes a spike). b. Left: A single-spike neuron emitting and receiving (per presynaptic terminal) at most one spike per stimulus. Right: Input and output activity of the neuron.

Figure 2: Illustration of our model. a. The computational graph of our model for 4 time steps. Input spikes S(0) induce currents I(1), which charge the membrane potential without reset Ṽ(1). These no-reset membrane potentials are mapped to erroneous output spikes S(1), which are then transformed to a latent representation z(1) encoding an ordering of spikes and finally mapped to the correct output spikes S(1) (same coloured edges denote output from the same source). b. Example activity of our model throughout the different stages of processing.

Figure 3: Training speedup of our model over the standard model. a. Total training speedup as a function of the number of hidden neurons n and simulation steps t (left), alongside the corresponding forward and backward pass speedups (right). b. Training durations of both models for fixed hidden neurons n = 100 and variable batch size b. c. Training speedup over different number of layers for fixed time steps t = 2 7 and batch size b = 128. d. Training speedup over large number of hidden neurons n for fixed time steps t = 2 7 and variable batch size b. e. Forward pass speedup for fixed time steps t = 2 7 and variable batch size b. f. Forward vs the backward pass speedup of our model for fixed time steps t = 2 7 and variable batch size b. b-f use a 10 sample average with the mean and s.d. plotted.

Figure 4: Analysis of our model's performance on real datasets. a. Difference in accuracy between the standard multi-spike model and our model. b. Training speedup of our model vs the standard single-spike model. c. Reduction in spikes of our single-spike model vs the standard multi-spike model (a-c use a 3 sample average with the mean and s.d. plotted). d. Training robustness of our model when solving different datasets starting from zero network activity, which is fatal to other single-spike training methods. Top panel: Normalised training loss over time. Bottom panel: Normalised network activity over time, where the red cross denotes the absence of any spikes.

A.4 TRAINING DETAILS AND HYPERPARAMETERS

A.4.1 READOUT NEURONS

The output layer L of every trained network contained the same number of neurons as the number of classes within the dataset being trained on. As suggested by Zenke & Vogels (2021), every readout neuron had its firing threshold set to infinity (i.e. the spiking and reset mechanism was removed), from which the output o_{b,c} of readout neuron c to input sample b was taken to be either the maximum membrane potential over time, o_{b,c} = max_t V^L_{b,c}[t], or the summated membrane potential over time, o_{b,c} = Σ_t V^L_{b,c}[t] (see Table 2).
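The two readout variants can be sketched in a few lines. This is an illustrative helper, not the authors' code; the function name `readout` and the (B, C, T) layout are assumptions.

```python
import numpy as np

def readout(V, mode="max"):
    """Readout outputs o[b, c] from no-reset membrane potentials V of shape
    (B, C, T): either the maximum or the summated potential over time."""
    return V.max(axis=-1) if mode == "max" else V.sum(axis=-1)
```

Either variant reduces the time axis, leaving one score per class that can be passed directly to a softmax.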

A.4.4 SUPERVISED TRAINING LOSS

All networks were trained to minimise a cross-entropy loss

L = −(1/B) Σ_{b=1}^{B} Σ_{c=1}^{C} y_{b,c} log(p_{b,c})  (18)

with B and C being the number of batch samples and dataset classes respectively, and y_b ∈ {0, 1}^C and p_{b,c} being the one-hot target vector and network prediction probabilities respectively. The prediction probabilities p_{b,c} were obtained by passing the readout neuron outputs o_{b,c} through the softmax function

p_{b,c} = exp(o_{b,c}) / Σ_{k=1}^{C} exp(o_{b,k})
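The loss and softmax above can be combined into one sketch. This is an illustrative implementation with a hypothetical name; the max-subtraction is a standard numerical-stability trick, not something stated in the text.

```python
import numpy as np

def softmax_cross_entropy(o, y):
    """Eq. (18): mean cross-entropy over the batch. o: readout outputs of
    shape (B, C); y: one-hot targets of shape (B, C)."""
    o = o - o.max(axis=1, keepdims=True)   # stabilise the exponentials
    p = np.exp(o) / np.exp(o).sum(axis=1, keepdims=True)
    return -(y * np.log(p)).sum(axis=1).mean()
```

With uniform outputs over C classes the loss reduces to log(C), a useful sanity check at initialisation.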

df_sur(V)/dV = (β_sur |V| + 1)⁻²  (20)

A.4.6 TRAINING PROCEDURE

All models were trained using the Adam optimiser with default parameters (Kingma & Ba, 2014). Training started with an initial learning rate, which was decayed by a factor of 10 every time the number of epochs reached a new milestone, after which the best-performing model (that which achieved the lowest training loss) was loaded and training continued.

A.4.7 TRAINING HYPERPARAMETERS
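The surrogate derivative of eq. (20) is a one-liner; the sketch below uses a hypothetical function name, with β_sur = 10 as stated in the text.

```python
import numpy as np

def surrogate_grad(V, beta_sur=10.0):
    """Eq. (20): surrogate derivative (beta_sur * |V| + 1)**-2, used in place
    of the undefined derivative of the step spike function during backprop."""
    return (beta_sur * np.abs(V) + 1.0) ** -2
```

The surrogate peaks at 1 when the membrane potential sits exactly at threshold (V = 0 in threshold-centred coordinates) and decays smoothly either side, so gradients still flow through near-threshold neurons.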

Figure 5: Total training speedup using smaller batch sizes as a function of the number of hidden neurons n and simulation steps t (left), alongside the corresponding forward and backward pass speedups (right). a. Speedups using batch size b = 32. b. Speedups using batch size b = 64.

Figure 6: Training speedup of our model over the standard model (using fixed membrane time constants). a. Total training speedup as a function of the number of hidden neurons n and simulation steps t (left), alongside the corresponding forward and backward pass speedups (right). b. Training durations of both models for fixed hidden neurons n = 100 and variable batch size b. c. Training speedup over different number of layers for fixed time steps t = 2 7 and batch size b = 128. d. Training speedup over large number of hidden neurons n for fixed time steps t = 2 7 and variable batch size b. e. Forward pass speedup for fixed time steps t = 2 7 and variable batch size b. f. Forward vs the backward pass speedup of our model for fixed time steps t = 2 7 and variable batch size b. b-f use a 10 sample average with the mean and s.d. plotted.

Performance comparison to existing literature (* denotes self-implementation, † denotes data augmentation and β denotes trainable time constants).

A.1 SPIKING NEURAL NETWORK DERIVATIONS

A.1.1 NORMALISING THE LEAKY INTEGRATE AND FIRE MODEL

A.1.2 DISCRETISING THE LEAKY INTEGRATE AND FIRE MODEL

Dataset and corresponding training parameters.

