HETEROGENEOUS NEURONAL AND SYNAPTIC DYNAMICS FOR SPIKE-EFFICIENT UNSUPERVISED LEARNING: THEORY AND DESIGN PRINCIPLES

Abstract

This paper shows that heterogeneity in neuronal and synaptic dynamics reduces the spiking activity of a Recurrent Spiking Neural Network (RSNN) while improving prediction performance, enabling spike-efficient (unsupervised) learning. We analytically show that diversity in neurons' integration/relaxation dynamics improves an RSNN's ability to learn more distinct input patterns (higher memory capacity), leading to improved classification and prediction performance. We further prove that heterogeneous Spike-Timing-Dependent Plasticity (STDP) dynamics of synapses reduce spiking activity but preserve memory capacity. These analytical results motivate a Heterogeneous RSNN (HRSNN) design that uses Bayesian optimization to determine the heterogeneity in neurons and synapses that maximizes E, defined as the ratio of memory capacity to spiking activity. Empirical results on time-series classification and prediction tasks show that the optimized HRSNN increases performance and reduces spiking activity compared to a homogeneous RSNN.

1. INTRODUCTION

Spiking neural networks (SNNs) (Ponulak & Kasinski, 2011) use bio-inspired neurons and synaptic connections, trainable with either biological learning rules such as spike-timing-dependent plasticity (STDP) (Gerstner & Kistler, 2002) or supervised statistical learning algorithms such as surrogate gradient (Neftci et al., 2019). Empirical results on standard SNNs show good performance for various tasks, including spatiotemporal data classification (Lee et al., 2017; Khoei et al., 2020), sequence-to-sequence mapping (Zhang & Li, 2020), object detection (Chakraborty et al., 2021; Kim et al., 2020), and universal function approximation (Gelenbe et al., 1999; Iannella & Back, 2001). An important motivation for applying SNNs in machine learning (ML) is the sparsity in the firing (activation) of the neurons, which reduces energy dissipation during inference (Wu et al., 2019). Many prior works have empirically shown that SNNs have lower firing activity than artificial neural networks and can improve energy efficiency (Kim et al., 2022; Srinivasan & Roy, 2019). However, there are very few analytical studies on how to reduce the spiking activity of an SNN while maintaining its learning performance. Understanding and optimizing the relation between spiking activity and performance will be key to designing energy-efficient SNNs for complex ML tasks. In this paper, we derive analytical results and present design principles for optimizing the spiking activity of a recurrent SNN (RSNN) while maintaining prediction performance. Most SNN research in ML considers a simplified network model with a homogeneous population of neurons and synapses (homogeneous RSNN, MRSNN), where all neurons have uniform integration/relaxation dynamics and all synapses use the same long-term potentiation (LTP) and long-term depression (LTD) dynamics in STDP learning rules.
In contrast, neurobiological studies have shown that the brain contains a wide variety of neurons and synapses with varying firing and plasticity dynamics, respectively (Destexhe & Marder, 2004; Gouwens et al., 2019; Hansel et al., 1995; Prescott et al., 2008). We show that optimizing neuronal and synaptic heterogeneity is key to simultaneously reducing spiking activity and improving performance. We define the spike efficiency E of an RSNN as the ratio of its memory capacity C to its average spiking activity S. Given a fixed number of neurons and synapses, a higher C implies a network can learn more patterns and hence perform better in classification or prediction tasks (Aceituno et al., 2020; Goldmann et al., 2020); a lower spiking rate implies that a network is less active and hence will consume less energy during inference (Sorbaro et al., 2020; Rathi et al., 2021). We analytically show that a Heterogeneous Recurrent SNN (HRSNN) model yields a more spike-efficient learning architecture, reducing spiking activity while improving C (i.e., performance). In particular, we make the following contributions to the theoretical understanding of an HRSNN:
• We prove that, for a finite number of neurons, models with heterogeneous neuronal dynamics have a higher memory capacity C.
• We prove that heterogeneity in the synaptic dynamics reduces the spiking activity of neurons while maintaining C. Hence, a model with heterogeneous synaptic dynamics has a lower firing rate than one with homogeneous synaptic dynamics.
• We connect the preceding results to prove that simultaneously using heterogeneity in neurons and synapses, as in an HRSNN, improves the spike efficiency of the network.
We empirically characterize HRSNN on the tasks of (a) classifying time series (Spoken Heidelberg Digits (SHD)) and (b) predicting the evolution of a dynamical system (a modified chaotic Lorenz system).
The theoretical results are used to develop an HRSNN architecture in which a modified Bayesian Optimization (BO) determines the optimal distribution of neuron and synaptic parameters to maximize E. HRSNN exhibits better performance (higher classification accuracy and lower NRMSE loss) with a lower average spike count S than MRSNN. Related Works: Inspired by biological observations, recent empirical studies showed potential for improving SNN performance with heterogeneous neuron dynamics (Perez-Nieves et al., 2021; Chakraborty & Mukhopadhyay, 2023). However, there is a lack of theoretical understanding of why heterogeneity improves SNN performance, which is critical for optimizing SNNs for complex tasks. She et al. (2022) analytically studied the universal sequence-approximation capabilities of a feedforward network of neurons with varying dynamics. However, they did not consider heterogeneity in plasticity dynamics, and their results apply only to a feedforward SNN and do not extend to recurrent SNNs (RSNNs). Recurrence is not only a fundamental component of the biological brain (Soures & Kudithipudi, 2019); as a machine learning (ML) model, an RSNN also shows good performance in modeling spatiotemporal and nonlinear dynamics (Pyle & Rosenbaum, 2017; Gilra & Gerstner, 2017). Hence, it is critical to understand whether heterogeneity can improve learning in an RSNN. To the best of our knowledge, this is the first work that analytically studies the impact of heterogeneity in synaptic and neuronal dynamics in an RSNN. This work shows that using only neuronal heterogeneity improves performance but does not reduce spiking activity, and the number of spikes required for the computation increases exponentially with the number of neurons. Therefore, simultaneously analyzing and optimizing neuronal and synaptic heterogeneity, as demonstrated in this work, is critical to designing an energy-efficient recurrent SNN.

2. PRELIMINARIES AND DEFINITIONS

We now define the key terms used in the paper. Table 1 summarizes the key notations. Figure 1 shows the general structure of the HRSNN model with heterogeneity in both the LIF neurons and the STDP dynamics. A few assumptions hold for the rest of the paper: First, the heterogeneous network hyperparameters are estimated before training and inference; after estimation, they are frozen and do not change during model evaluation. Second, this paper introduces heterogeneity in neuronal and synaptic dynamics through a distribution over specific parameters; other parameters could also be chosen, possibly leading to different performance or characteristics. For the analytical proofs, we assume a mean-field model in which the synaptic weights converge. In addition, LIF neurons have been shown to exhibit different dynamical states (Brunel, 2000); hence, for the analytical study of the network, we use mean-field theory to analyze the collective behavior of a dynamical system comprising many interacting units. Heterogeneous LIF Neurons: We use the Leaky Integrate-and-Fire (LIF) neuron model in all our simulations. In this model, the membrane potential of the i-th neuron v_i(t) evolves as

$$\tau_m \frac{dv_i(t)}{dt} = -(v_i(t) - v_{rest}) + I_i(t) \quad (1)$$

where τ_m is the membrane time constant, v_rest is the resting potential, and I_i is the input current. When the membrane potential reaches the threshold value v_th, a spike is emitted, v_i(t) resets to the reset potential v_r, and the neuron enters a refractory period during which it cannot spike. Spikes emitted by the i-th neuron at a finite set of times {t_i} can be formalized as a spike train $S_i(t) = \sum \delta(t - t_i)$. Let the recurrent layer of an RSNN be R. We incorporate heterogeneity in the LIF neurons by using a different membrane time constant τ_{m,i} and threshold voltage v_{th,i} for each LIF neuron i in R. This gives a distribution of time constants and threshold voltages of the LIF neurons in R.
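As a concrete illustration, the heterogeneous LIF dynamics of Eq. 1 can be simulated with a simple Euler integration. This is a minimal sketch, not the paper's implementation; the gamma and normal distributions used to draw τ_{m,i} and v_{th,i}, and all numeric constants, are illustrative assumptions.

```python
import numpy as np

def simulate_hlif(I, tau_m, v_th, dt=1.0, v_rest=0.0, v_reset=0.0, t_ref=2.0):
    """Euler simulation of Eq. 1 for N LIF neurons with per-neuron tau_m, v_th.

    I      : (T, N) input current per time step and neuron
    tau_m  : (N,) heterogeneous membrane time constants
    v_th   : (N,) heterogeneous firing thresholds
    returns: (T, N) binary spike trains
    """
    T, N = I.shape
    v = np.full(N, v_rest, dtype=float)
    refrac = np.zeros(N)                      # remaining refractory time per neuron
    spikes = np.zeros((T, N))
    for t in range(T):
        active = refrac <= 0
        # tau_m dv/dt = -(v - v_rest) + I   (Euler step, non-refractory neurons only)
        v[active] += dt / tau_m[active] * (-(v[active] - v_rest) + I[t, active])
        refrac[~active] -= dt
        fired = active & (v >= v_th)
        spikes[t, fired] = 1.0
        v[fired] = v_reset                    # reset and enter refractory period
        refrac[fired] = t_ref
    return spikes

# Heterogeneity: draw tau_m and v_th from distributions (illustrative choices).
rng = np.random.default_rng(0)
N, T = 50, 200
tau_m = 1.0 + rng.gamma(shape=4.0, scale=5.0, size=N)
v_th = rng.normal(1.0, 0.1, size=N)
spikes = simulate_hlif(rng.uniform(0.0, 2.0, size=(T, N)), tau_m, v_th)
```

Sampling the two parameters per neuron from a distribution, rather than fixing them network-wide, is exactly the difference between the HRSNN and MRSNN settings described above.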

Heterogeneous STDP:

The STDP rule for updating a synaptic weight (∆w) is defined by Pool & Mato (2011):

$$\Delta w(\Delta t) = \begin{cases} A_+(w)\, e^{-|\Delta t|/\tau_+} & \text{if } \Delta t \geq 0 \\ -A_-(w)\, e^{-|\Delta t|/\tau_-} & \text{if } \Delta t < 0 \end{cases} \quad \text{s.t.} \quad A_+(w) = \eta_+(w_{max} - w),\; A_-(w) = \eta_-(w - w_{min}) \quad (2)$$

where ∆t = t_post − t_pre is the time difference between the post-synaptic spike and the pre-synaptic one, with synaptic time constants τ_±. In heterogeneous STDP, we use an ensemble of values drawn from a distribution for τ_± and the scaling functions η_±.
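The update in Eq. 2 can be sketched as follows. The per-synapse lognormal distribution for τ_+ and the constant η_± values are illustrative assumptions, not the optimized distributions later found by BO.

```python
import numpy as np

def stdp_dw(dt_spike, w, tau_p, tau_m, eta_p, eta_m, w_min=0.0, w_max=1.0):
    """Weight change for one pre/post spike pair under Eq. 2.

    dt_spike = t_post - t_pre; positive gaps potentiate (LTP), negative depress (LTD).
    """
    if dt_spike >= 0:
        return eta_p * (w_max - w) * np.exp(-abs(dt_spike) / tau_p)   # A+(w) e^{-|dt|/tau+}
    return -eta_m * (w - w_min) * np.exp(-abs(dt_spike) / tau_m)      # -A-(w) e^{-|dt|/tau-}

# Heterogeneous STDP: per-synapse time constants drawn from a distribution.
rng = np.random.default_rng(1)
n_syn = 1000
tau_plus = rng.lognormal(mean=np.log(20.0), sigma=0.3, size=n_syn)
w = np.full(n_syn, 0.5)
dw = np.array([stdp_dw(5.0, w[i], tau_plus[i], 20.0, 0.01, 0.01) for i in range(n_syn)])
```

In the homogeneous (MRSNN) case every synapse would share one (τ_±, η_±) tuple; here each synapse applies the same functional form with its own sampled constants.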

Heterogeneity:

We define heterogeneity as a measure of the variability of the hyperparameters in an RSNN that gives rise to an ensemble of neuronal dynamics. Entropy is used to measure population diversity. Assuming that the random variable X of the hyperparameters follows a multivariate Gaussian distribution (X ∼ N(µ, Σ)), its differential entropy is

$$H(X) = \frac{n}{2}\ln(2\pi) + \frac{1}{2}\ln|\Sigma| + \frac{n}{2}.$$

Moreover, for any density q(x) satisfying ∫ q(x) x_i x_j dx = Σ_ij and p = N(0, Σ), we have H(q) ≤ H(p) (proof in Suppl. Sec. A): the Gaussian distribution maximizes the entropy for a given covariance. Hence, the log-determinant of the covariance matrix bounds the entropy, and for the rest of the paper we use the determinant of the covariance matrix to measure the heterogeneity of the network. Memory Capacity: Given an input signal x(t), the memory capacity C of a trained RSNN model is defined as a measure of the model's ability to store and recall previous inputs fed into the network (Jaeger, 2001; Jaeger et al., 2001). In this paper, we use C as a measure of model performance, based on the network's ability to retrieve past information (for various delays) from the reservoir using linear combinations of reservoir unit activations observed at the output. Intuitively, HRSNN can be interpreted as a set of coupled filters that extract features from the input signal; the final readout selects the right combination of those features for classification or prediction.
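The covariance-determinant measure of heterogeneity can be computed directly from a sampled parameter ensemble. This is a minimal sketch; the specific parameter columns and Gaussian widths below are illustrative assumptions.

```python
import numpy as np

def heterogeneity(params):
    """Heterogeneity H of a hyperparameter ensemble, measured by ln|Sigma|,
    the log-determinant of the sample covariance (the Gaussian entropy bound).

    params : (n_samples, n_params) array, e.g. one row of (tau_m, v_th) per neuron.
    """
    cov = np.atleast_2d(np.cov(params, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)   # slogdet avoids overflow/underflow of det
    return logdet

# A wider parameter distribution yields a larger H (illustrative Gaussians).
rng = np.random.default_rng(2)
narrow = np.column_stack([rng.normal(20.0, 0.5, 500), rng.normal(1.0, 0.01, 500)])
wide = np.column_stack([rng.normal(20.0, 5.0, 500), rng.normal(1.0, 0.10, 500)])
```

Because the log-determinant is monotone in the determinant, ranking networks by ln|Σ| is equivalent to ranking them by |Σ| itself, as used in the rest of the paper.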
First, the τ-delay capacity C(τ) measures the performance of the network on the task of reconstructing a delayed version of the model input x(t) at delay τ (i.e., x(t − τ)); it is defined as the squared correlation coefficient between the desired output (the τ-step delayed input signal x(t − τ)) and the observed network output y_τ(t):

$$C = \lim_{\tau_{max}\to\infty} \sum_{\tau=1}^{\tau_{max}} C(\tau) = \lim_{\tau_{max}\to\infty} \sum_{\tau=1}^{\tau_{max}} \frac{\mathrm{Cov}^2\!\left(x(t-\tau),\, y_\tau(t)\right)}{\mathrm{Var}\!\left(x(t)\right)\,\mathrm{Var}\!\left(y_\tau(t)\right)}, \quad \tau \in \mathbb{N},$$

where Cov(·) and Var(·) denote the covariance and variance functions, and y_τ(t) is the model output in this reconstruction task. C measures the ability of the RSNN to reconstruct the past information of the model input precisely; a larger C indicates that the network can learn a greater number of past input patterns, which in turn improves model performance. For the simulations, we use τ_max = 100. Spike-Efficiency: Given an input signal x(t), the spike efficiency E of a trained RSNN model is defined as the ratio of the memory capacity C to the average total spike count per neuron S. E is an analytical measure of how much C, and hence model performance, is obtained per unit of spike activity; ideally, we want a system with high C using few spikes. Hence we define E as the ratio of the memory capacity C(N_R) using N_R neurons to the average number of spike activations per neuron:

$$E = \frac{C(N_R)}{\frac{1}{N_R}\sum_{i=1}^{N_R} S_i}, \qquad S_i = \int_0^T s_i(t)\,dt \approx \frac{N_{post}}{T}\int_{t_{ref}}^{\infty} t\,\Phi_i\, dt \quad (4)$$

where N_post is the number of postsynaptic neurons, Φ_i is the inter-spike-interval spike frequency for neuron i, and T is the total time. Note that the total spike count S is obtained by counting all spikes of all neurons in the recurrent layer until the emission of the first spike at the readout layer.
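The τ-delay capacity can be estimated empirically by fitting one linear readout per delay, following the definition above. This sketch uses an ordinary least-squares readout with a small ridge term for numerical stability (an assumption on our part), and the delay-line sanity check at the end is an illustrative construction.

```python
import numpy as np

def memory_capacity(states, x, tau_max=100):
    """Empirical memory capacity: sum over delays tau of the squared correlation
    between the delayed input x(t - tau) and a linear readout of the states r(t).

    states : (T, N) reservoir unit activations
    x      : (T,)  input signal (assumed stationary, so Var(x(t - tau)) = Var(x(t)))
    """
    T, N = states.shape
    C = 0.0
    for tau in range(1, tau_max + 1):
        R = states[tau:]                     # r(t) for t >= tau
        target = x[:-tau]                    # x(t - tau)
        # least-squares readout, small ridge term for numerical stability
        W = np.linalg.solve(R.T @ R + 1e-6 * np.eye(N), R.T @ target)
        y = R @ W
        c = np.corrcoef(target, y)[0, 1]
        C += c ** 2                          # C(tau) = Cov^2 / (Var(x) Var(y))
    return C

# Sanity check: a delay line holding the last 5 inputs should score C close to 5.
rng = np.random.default_rng(3)
X = rng.normal(size=600)
S = np.zeros((600, 5))
for k in range(5):
    S[k + 1:, k] = X[:-(k + 1)]
C = memory_capacity(S[10:], X[10:], tau_max=10)
```

Delays 1 through 5 are perfectly reconstructable (each contributing ≈1), while larger delays contribute only a small regression-noise term, so the measured C approaches the number of stored delays.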

3. HETEROGENEOUS RSNN: ANALYTICAL RESULTS

We present three main analytical findings. First, heterogeneity in neuronal dynamics increases memory capacity by capturing more principal components of the input space, leading to better performance (improved C). Second, heterogeneity in STDP dynamics decreases spike activation without affecting C: it provides better orthogonalization among the recurrent network states and a more efficient representation of the input space, lowering higher-order correlations in the spike trains. This makes the model more spike-efficient, since higher-order correlations progressively decrease the information available through a neural population (Montani et al., 2009; Abbott & Dayan, 1999). Finally, incorporating heterogeneity in both neuronal and STDP dynamics boosts the ratio of C to spike activity, i.e., E, which enhances performance while reducing spike counts. Memory Capacity: The performance of an RSNN depends on its ability to retain the memory of previous inputs. To quantify the relationship between the recurrent-layer dynamics and C, we note that information is extracted from the recurrent layer through a combination of the neuronal states. Hence, more linearly independent neurons offer more variable states and, thus, longer memory. Lemma 3.1.1: The state of neuron i can be written as

$$r_i(t) = \sum_{k=0}^{\infty} \sum_{n=1}^{N_R} \lambda_n^k \langle v_n^{-1}, w_{in}\rangle (v_n)_i\, x(t-k),$$

where v_n and v_n^{-1} are, respectively, the right and left eigenvectors of W, w_in are the input weights, and λ_n ∈ Λ belongs to the diagonal matrix containing the eigenvalues of W; a_i = [a_{i,0}, a_{i,1}, ...] represents the coefficients that the previous inputs x_t = [x(t), x(t−1), ...] have on r_i(t). Short Proof: (See Suppl. Sec. B for the full proof.) As discussed by Aceituno et al. (2020), the state of the recurrent layer can be represented as r(t) = W r(t−1) + w_in x(t), where w_in are the input weights.
We can simplify this using the coefficients of the previous inputs and plug this term into the covariance between two neurons. Hence, writing the input coefficients a as a function of the eigenvalues of W,

$$r(t) = \sum_{k=0}^{\infty} W^k w_{in}\, x(t-k) = \sum_{k=0}^{\infty} \left(V \Lambda^k V^{-1}\right) w_{in}\, x(t-k) \;\Rightarrow\; r_i(t) = \sum_{k=0}^{\infty} \sum_{n=1}^{N_R} \lambda_n^k \langle v_n^{-1}, w_{in}\rangle (v_n)_i\, x(t-k). \;\blacksquare$$

Lemma 3.1.2: The heterogeneity H varies inversely with $\sum_{n=1}^{N_R}\sum_{m=1}^{N_R} \mathrm{Cov}^2(x_n(t), x_m(t))$, which in turn varies inversely with C. Intuitive Proof: (See Suppl. Sec. B for the full proof.) Aceituno et al. (2020) showed that C increases when the variances along the projections of the input into the recurrent layer are uniformly distributed. We show that this can be achieved efficiently by using heterogeneity in the LIF dynamics. More formally, let us express the projection in terms of the state space of the recurrent layer. We show that the raw variance in the neuronal states can be written as

$$J = \frac{\sum_{n=1}^{N_R} \lambda_n^2(\Sigma)}{\left(\sum_{n=1}^{N_R} \lambda_n(\Sigma)\right)^2},$$

where λ_n(Σ) is the n-th eigenvalue of Σ. We further show that with higher H, the magnitude of the eigenvalues of W decreases, leading to a higher J. Next, we project the inputs onto orthogonal directions of the network state space and model the system as

$$r(t) = \sum_{\tau=1}^{\infty} a_\tau x(t-\tau) + \varepsilon_r(t),$$

where the vectors a_τ ∈ R^{N_R} correspond to the linearly extractable effect of x(t−τ) on r(t), and ε_r(t) is the nonlinear contribution of all inputs to the state r(t). First, we show that C increases when the variance along the projections of the input into the recurrent layer is more uniform. Intuitively, the variances in directions a_τ must fit into the variances of the state space, and since the projections are orthogonal, the variances must lie along orthogonal directions. Hence, we show that increasing the correlation among the neuronal states increases the variance of the eigenvalues, which decreases our memory bound C*. We show that the heterogeneity H is inversely proportional to $\sum_{n=1}^{N_R}\sum_{m=1}^{N_R} \mathrm{Cov}^2(x_n(t), x_m(t))$.
We see that increasing the correlations between neuronal states decreases the heterogeneity of the eigenvalues, which reduces C. We show that the variance in the neuronal states is bounded by the determinant of the covariance between the states; hence, the covariance increases when the neurons become correlated, and as H increases, the neuronal correlation decreases. Lemma 3.2.1: Heterogeneity in the STDP dynamics reduces the spiking activity of the network while maintaining C. Short Proof: (See Suppl. Sec. B for the full proof.) Following a mean-field model of the excitatory and inhibitory populations, the firing intensities are

$$\lambda^{A,N}_t := \Phi_A\!\left(\frac{1}{N}\sum_{j\in A}\int_0^{t^-} h_1(t-u)\,dZ^j_u\right) - \Phi_{B\to A}\!\left(\frac{1}{N}\sum_{j\in B}\int_0^{t^-} h_2(t-u)\,dZ^j_u\right)$$

$$\lambda^{B,N}_t := \Phi_B\!\left(\frac{1}{N}\sum_{j\in B}\int_0^{t^-} h_3(t-u)\,dZ^j_u\right) + \Phi_{A\to B}\!\left(\frac{1}{N}\sum_{j\in A}\int_0^{t^-} h_4(t-u)\,dZ^j_u\right)$$

where A and B are the populations of excitatory and inhibitory neurons, respectively, λ^{i,N}_t is the spiking intensity of population i, Φ_i is a positive function denoting the firing rate, and h_{j→i}(t) is the synaptic kernel associated with the synapse from j to i. Hence, we show that the heterogeneous STDP dynamics increase the synaptic noise due to the heavy-tailed behavior of the system. This increased synaptic noise leads to a reduction in the number of spikes of the post-synaptic neuron. Intuitively, heterogeneous STDP leads to a non-uniform scaling of correlated spike trains, resulting in decorrelation. Hence, heterogeneous STDP models learn a better-orthogonalized subspace representation, encoding the input space with fewer spikes. ∎ Theorem 2: For a given number of neurons N_R, the spike efficiency E = C(N_R)/S of HRSNN (E_R) is greater than that of MRSNN (E_M), i.e., E_R ≥ E_M. Short Proof: (See Suppl. Sec. B for the full proof.) First, using Lemma 3.2.1, we show that the number of spikes decreases when we use heterogeneity in the LTP/LTD dynamics.
Hence, we compare the efficiency of HRSNN with that of MRSNN:

$$\frac{E_R}{E_M} = \frac{C_R(N_R) \times \bar{S}_M}{\bar{S}_R \times C_M(N_R)} = \frac{\sum_{\tau=1}^{N_R} \frac{\mathrm{Cov}^2\left(x(t-\tau),\, a^R_\tau r^R(t)\right)}{\mathrm{Var}\left(a^R_\tau r^R(t)\right)} \times \int_{t_{ref}}^{\infty} t\,\Phi_R\, dt}{\sum_{\tau=1}^{N_R} \frac{\mathrm{Cov}^2\left(x(t-\tau),\, a^M_\tau r^M(t)\right)}{\mathrm{Var}\left(a^M_\tau r^M(t)\right)} \times \int_{t_{ref}}^{\infty} t\,\Phi_M\, dt}$$

Since S_R ≤ S_M, and since the covariance increases when the neurons become correlated while neuronal correlation decreases as H increases (Theorem 1), we obtain E_R/E_M ≥ 1 ⇒ E_R ≥ E_M. ∎ Optimal Heterogeneity using Bayesian Optimization for Distributions: To obtain optimal heterogeneity in the neuron and STDP dynamics, we use a modified Bayesian Optimization (BO) technique. However, using BO for high-dimensional problems remains a significant challenge: optimizing the HRSNN model parameters for 5000 neurons requires two parameters per neuron and four parameters per STDP synapse, and standard BO fails to converge to an optimal solution. The parameters to be optimized are, however, correlated and can be drawn from a probability distribution, as shown by Perez-Nieves et al. (2021). Thus, we design a modified BO that estimates parameter distributions instead of individual parameters for the LIF neurons and the STDP synapses, for which we modify BO's surrogate model and acquisition function. This makes our modified BO highly scalable over all the variables (dimensions) used. The loss for the surrogate model's update is calculated using the Wasserstein distance between the parameter distributions, and we use a modified Matérn function on the Wasserstein metric space as the kernel function for the BO problem. The detailed BO method is discussed in Suppl. Sec. A. BO uses a Gaussian process to model the distribution of an objective function and an acquisition function to decide which points to evaluate. For data points x ∈ X and the corresponding outputs y ∈ Y, an SNN with network structure V and neuron parameters W acts as a function f_{V,W}(x) that maps input data x to y.
The optimization problem can be defined as

$$\min_{V,W} \sum_{x\in X,\, y\in Y} \mathcal{L}\left(y, f_{V,W}(x)\right),$$

where V is the set of hyperparameters of the neurons in R and W is the multivariate distribution comprising the distributions of (i) the membrane time constants τ_{m-E}, τ_{m-I} of the LIF neurons, (ii) the scaling-function constants (A_+, A_-), and (iii) the decay time constants τ_+, τ_- of the STDP learning rule in S_{RR}.
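The key ingredient of the modified BO, searching over parameter distributions with a Matérn kernel on a Wasserstein metric space, can be sketched for the simplest case. Restricting to one-dimensional Gaussian candidate distributions (which admit a closed-form 2-Wasserstein distance) and the unit length scale are simplifying assumptions here; the paper's full multivariate method is in Suppl. Sec. A.

```python
import numpy as np

def w2_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form 2-Wasserstein distance between two 1-D Gaussians."""
    return np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)

def matern52(d, length=1.0):
    """Matern-5/2 kernel evaluated on a (Wasserstein) distance d >= 0."""
    r = np.sqrt(5.0) * d / length
    return (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

# Surrogate-kernel similarity between two candidate tau_m distributions:
# nearby distributions get high similarity, distant ones low similarity.
k_close = matern52(w2_gauss(20.0, 5.0, 21.0, 5.0))
k_far = matern52(w2_gauss(20.0, 5.0, 40.0, 1.0))
```

Because each BO query point is now a small vector of distribution parameters rather than thousands of per-neuron values, the surrogate Gaussian process remains low-dimensional regardless of network size.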

4. EXPERIMENTAL RESULTS

Model and Architecture: We empirically verify our analytical results using HRSNN for classification and prediction tasks. Fig. 2 shows the overall architecture of the prediction model. Using a rate-encoding methodology, the time-series data is encoded into a series of spike trains. This high-dimensional spike train acts as the input to HRSNN. The output spike trains from HRSNN are fed to a decoder and a readout layer that produces the final prediction. For the classification task, we use a similar method, but without the decoding layer: the output spike signals from HRSNN feed directly into the fully connected layer. The complete details of the models and of the different modules shown in Fig. 2 are discussed in Suppl. Sec. A. Datasets: Classification: We use the Spoken Heidelberg Digits (SHD) spiking dataset to benchmark the HRSNN model against other standard spiking neural networks (Cramer et al., 2020). Prediction: We use a multiscale Lorenz 96 system (Lorenz, 1996), a set of coupled nonlinear ODEs extending Lorenz's original model for multiscale chaotic variability of weather and climate systems, as a testbed for the prediction capabilities of the HRSNN model (Thornes et al., 2017). Further details on both datasets are provided in Suppl. Sec. A. Bayesian Optimization Ablation Studies: First, we perform an ablation study of BO for three cases: (i) using memory capacity C as the objective function, (ii) using average spike count S as the objective function, and (iii) using E as the objective function. For each case, we optimize both the LIF neuron parameter distribution and the STDP dynamics distributions. We plot C, S, the empirical spike efficiency Ê, and the observed RMSE of the model obtained from BO with different numbers of neurons. The results for the classification and prediction problems are shown in Fig. 3(a) and (b), respectively.
Ideally, we want to design networks with high C and low spike count, i.e., models in the upper right corner of the graph. The observed results show that BO with E as the objective gives the best accuracy with the fewest spikes; such a model has learned a better-orthogonalized subspace representation, encoding the input space with fewer spikes. Hence, for the remainder of this paper, we focus on the BO model that uses E as the objective function. This Bayesian Optimization search for the optimal hyperparameters is performed before training and inference, and is broadly equivalent to the network-architecture-search process used in deep learning. Once we have the optimal hyperparameters, we freeze them and learn the network parameters with unsupervised STDP. Heterogeneity Parameter Importance: We use SAGE (Shapley Additive Global importancE) (Covert et al., 2020), a game-theoretic approach for interpreting black-box models, to calculate the significance of adding heterogeneity to each parameter for improving C and S. SAGE summarizes the importance of each feature based on the predictive power it contributes, accounting for complex feature interactions using the principles of the Shapley value; a higher SAGE value signifies a more important feature. We tested the HRSNN model with SAGE on the Lorenz 96 and SHD datasets. The results are shown in Fig. 4. We see that τ_m has the greatest SAGE value for C, signifying that adding heterogeneity to it has the greatest impact on improving C. Conversely, the heterogeneous STDP parameters (viz., τ_±, η_±) play a more critical role in determining the average neuronal spike activation. This confirms the results proved in Sec. 3: heterogeneity in neuronal dynamics improves C, while heterogeneity in STDP dynamics reduces the spike count. Thus, we need to optimize the heterogeneity of both to achieve maximum E.

Results:

We perform an ablation study to evaluate the performance of the HRSNN model and compare it to standard BP-based spiking models, on both the SHD dataset for classification and the Lorenz system for prediction. The results are shown in Table 2. We compare the Normalized Root Mean Squared Error (NRMSE) loss (prediction), accuracy (classification), average spike count S, and the application-level empirical spiking efficiency Ê, calculated as 1/(NRMSE × S) for prediction and Accuracy/S for classification. The experiments use 5000 neurons in R for both the classification and prediction datasets. The HRSNN model with heterogeneous LIF and heterogeneous STDP outperforms the other HRSNN and MRSNN models in terms of NRMSE while keeping S much lower than HRSNN with heterogeneous LIF and homogeneous STDP. From the experiments, we conclude that heterogeneous LIF neurons contribute most to improving the model's performance, whereas heterogeneity in STDP has the most significant impact on a spike-efficient representation of the data. HRSNN with both heterogeneous LIF and heterogeneous STDP leverages the best of both worlds, achieving the best NRMSE with low spike activation, as seen in Table 2. Further detailed results on limited training data are given in Suppl. Sec. A. We also compare the generalizability of the HRSNN and MRSNN models, empirically showing that heterogeneity in the STDP dynamics improves the overall generalizability of the model. In addition, we discuss how HRSNN reduces the effect of higher-order correlations, giving rise to a more efficient representation of the state space.

5. CONCLUSION

This paper analytically and empirically showed that heterogeneity in neuronal (LIF) and synaptic (STDP) dynamics leads to an unsupervised RSNN with more memory capacity, a reduced spiking count, and hence better spiking efficiency. We showed that HRSNN can achieve performance similar to an MRSNN but with sparser spiking, leading to improved energy efficiency of the network. Interestingly, the mathematical results of this paper also conform to recent neurobiological research suggesting that the brain has large variability between the types of neurons and learning methods. For example, intrinsic biophysical properties of neurons, such as the densities and properties of ionic channels, vary significantly between neurons, and the variance in synaptic learning rules supports reliable and efficient signal processing in several animals (Marder & Taylor, 2011; Douglass et al., 1993). Experiments in different brain regions and diverse neuronal types have revealed a wide range of STDP forms that vary in plasticity direction, temporal dependence, and the involvement of signaling pathways (Sjostrom et al., 2008; Korte & Schmitz, 2016). Thus, heterogeneity is essential for encoding and decoding stimuli in biological systems. In conclusion, this work connects the mathematical properties of an RSNN for neuromorphic machine-learning applications, such as time-series classification and prediction, with these neurobiological observations. There are some key limitations to the analyses in this paper. First, the properties discussed are derived independently; an important extension will be to consider all factors simultaneously.
Second, we assumed an idealized spiking network in which memory capacity measures performance and spike count measures energy. We also focused mainly on the properties of RSNNs trained using STDP; an interesting open question is the connection between synchronization and heterogeneous STDP, in particular whether the synchronization properties can be optimally engineered to improve model performance. Finally, the empirical evaluations were presented for the prediction task on a single dataset; more experimental evaluations, including other tasks and datasets, will strengthen the empirical validation.

A.1.3 SPIKE CODING

Encoding: For the RSNN to process our time series, the signal must be represented as spikes. We use a temporal encoding technique in this paper: spikes are generated only when the signal changes in value. The implementation is based on the Step-Forward (SF) algorithm (Petro et al., 2019). The percentage of neurons receiving input spikes (α) is also chosen to provide good recurrent-layer dynamics. Decoding: To represent the recurrent state, we use an exponentially decaying rate-decoding strategy, taking the sum of all spikes s over the last τ timesteps:

$$x^X_i(t) = \sum_{n=0}^{\tau} \gamma^n s_i(t-n) \quad \forall i \in E,$$

where X denotes the model representation. The parameters τ and γ are balanced to trade off the memory size of the stored data (e.g., τ ≤ 50) against its information content, which includes adjusting τ to the pace at which the temporal data is presented and processed. The state of the recurrent layer is based only on the output of excitatory neurons. The discount γ must not be too small, as that flattens older values in the window to 0, making part of the sliding window unusable; but when γ is set too high in combination with a large window size, recent spikes hardly affect the recurrent-layer state, causing the decoder to react too late to recent information and complicating the learning process of the readout layer.
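The decoding formula above can be sketched directly; clipping the window at the start of the recording is an implementation assumption of ours.

```python
import numpy as np

def decode_state(spikes, t, tau=50, gamma=0.9):
    """Exponentially discounted rate decoding of the recurrent state at time t:
    x_i(t) = sum_{n=0}^{tau} gamma^n * s_i(t - n), over the excitatory population.

    spikes : (T, N) binary spike matrix of the excitatory neurons
    """
    n = np.arange(min(tau, t) + 1)           # clip the window at the start of time
    window = spikes[t - n, :]                # rows s(t), s(t-1), ..., s(t-tau)
    return (gamma ** n) @ window

# A tonically active neuron decodes to roughly the geometric-series limit 1/(1-gamma).
state = decode_state(np.ones((100, 1)), 99, tau=50, gamma=0.5)
```

The trade-off described in the text is visible here: with γ near 1 the oldest rows of `window` dominate the sum, while with γ near 0 only the most recent rows contribute.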

A.1.4 READOUT

After the initialization of the recurrent layer, the readout is the only component of the LSM with trainable parameters. It consists of a single fully connected layer for regression or classification; the readout does not have to be any deeper, as the output of the recurrent layer is already a high-dimensional representation of the processed input. With x and y denoting the continuous signals of the time series T and the model representation X,

$$x^T(t+k) = y^T(t) \approx \hat{y}^X(t) = f_\theta\left(x^X(t)\right).$$

The mean squared error (MSE)

$$\mathcal{L}\left(y^T, \hat{y}^X\right) = \frac{1}{n}\sum_{i=0}^{n} \left(y^T_i - \hat{y}^X_i\right)^2$$

is used as the loss function to train the readout, and the network is trained using the stochastic optimizer Adam (Kingma & Ba, 2014).

A.1.5 DATASETS

Lorenz 96: (Lorenz, 1996) Our objective is most clearly demonstrated on the canonical chaotic system we use as a test bed for the prediction capabilities of the HRSNN model: a multiscale Lorenz 96 system, a set of coupled nonlinear ODEs extending Lorenz's original model (Thornes et al., 2017; Chattopadhyay et al., 2020):

$$\frac{dX_k}{dt} = X_{k-1}(X_{k+1} - X_{k-2}) - X_k + F - \frac{hc}{b}\sum_j Y_{j,k}$$
$$\frac{dY_{j,k}}{dt} = -cb\, Y_{j+1,k}(Y_{j+2,k} - Y_{j-1,k}) - c\, Y_{j,k} + \frac{hc}{b} X_k - \frac{he}{d}\sum_i Z_{i,j,k}$$
$$\frac{dZ_{i,j,k}}{dt} = ed\, Z_{i-1,j,k}(Z_{i+1,j,k} - Z_{i-2,j,k}) - ge\, Z_{i,j,k} + \frac{he}{d} Y_{j,k}$$

This set of coupled nonlinear ordinary differential equations (ODEs) is a three-tier extension of Lorenz's original model (Lorenz, 1963) and was proposed by Thornes et al. (2017) as a fitting prototype for the multiscale chaotic variability of the weather and climate system and a useful test bed for novel methods. In these equations, F = 20 is a large-scale forcing that makes the system highly chaotic, and b = c = e = d = g = 10 and h = 1 are tuned to produce appropriate spatiotemporal variability.
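The three-tier system can be integrated with a short explicit-Euler sketch, with `np.roll` handling the periodic indices. The grid sizes, time step, initial conditions, and the inclusion of the −X_k damping term per the standard multiscale Lorenz 96 formulation are assumptions of this sketch, not the paper's exact setup.

```python
import numpy as np

def lorenz96_step(X, Y, Z, dt, F=20.0, h=1.0, b=10.0, c=10.0, d=10.0, e=10.0, g=10.0):
    """One explicit-Euler step of the three-tier multiscale Lorenz 96 system.
    X: (K,), Y: (J, K), Z: (I, J, K); periodic index arithmetic via np.roll."""
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X + F
          - (h * c / b) * Y.sum(axis=0))
    dY = (-c * b * np.roll(Y, -1, 0) * (np.roll(Y, -2, 0) - np.roll(Y, 1, 0))
          - c * Y + (h * c / b) * X[None, :] - (h * e / d) * Z.sum(axis=0))
    dZ = (e * d * np.roll(Z, 1, 0) * (np.roll(Z, -1, 0) - np.roll(Z, 2, 0))
          - g * e * Z + (h * e / d) * Y[None, :, :])
    return X + dt * dX, Y + dt * dY, Z + dt * dZ

# Short integration on a small illustrative grid.
rng = np.random.default_rng(4)
K, J, I = 8, 4, 4
X = rng.normal(size=K)
Y = 0.1 * rng.normal(size=(J, K))
Z = 0.01 * rng.normal(size=(I, J, K))
for _ in range(500):
    X, Y, Z = lorenz96_step(X, Y, Z, dt=1e-4)
```

A production integrator would use a higher-order scheme (e.g., RK4), since the fast Y and Z tiers make the system stiff; explicit Euler with a small dt is used here only for brevity.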
For this paper, we focus on predicting the Y variables, which have relatively moderate amplitudes compared to X and Z and exhibit high-frequency variability and intermittency, which makes them challenging to predict.

To apply our RSNNs to the audio datasets, we converted all audio samples into 250-by-700 binary matrices. All samples were fit within a 1-second window: shorter samples were padded with zeros, and longer samples were cut by removing the tail. Spikes were then binned in time bins, of sizes 10 ms and 4 ms; for the RSNNs, the presence or absence of any spike in a time bin is noted as a single binary event.

A.1.6 HYPERPARAMETERS

The hyperparameters used in this paper are summarized in Table 4.

A.2 MAXIMUM ENTROPY DISTRIBUTION WITH FIXED COVARIANCE

This subsection proves that the maximum entropy distribution with a fixed covariance matrix is Gaussian.

Lemma: Let q(r) be any density satisfying ∫ q(r) r_i r_j dr = Σ_ij. Let p = N(0, Σ). Then h(q) ≤ h(p).

Proof.

0 ≤ KL(q∥p) = ∫ q(r) log [q(r)/p(r)] dr = −h(q) − ∫ q(r) log p(r) dr = −h(q) − ∫ p(r) log p(r) dr = −h(q) + h(p),

where the third equality holds because log p(r) is a quadratic form in r and q and p yield the same second moments. ∎

A.3 OPTIMAL HYPERPARAMETER SELECTION USING BAYESIAN OPTIMIZATION

Most recent research applying Bayesian Optimization (BO) is limited to low-dimensional problems, as BO fails catastrophically when generalized to high-dimensional problems (Frazier, 2018). In this paper, however, we aim to use BO to optimize the neuronal and synaptic parameters of a heterogeneous RSNN model. This BO problem entails a huge number of hyperparameters, so using standard BO algorithms remains a significant challenge. To overcome this issue, we use a novel BO algorithm based on the assumption that the hyperparameters to be optimized are not completely random and uncorrelated but can instead be thought of as draws from a probability distribution, as shown by Perez-Nieves et al. (2021). Thus, instead of searching for the individual parameters themselves, we use a modified BO to estimate parameter distributions for the LIF neurons and the STDP dynamics. After learning the optimal distributions, we simply sample from them to obtain the hyperparameters used in the model. To learn the probability distributions, we modify BO's surrogate model and acquisition function to operate on the parameter distributions instead of individual variables, which makes our modified BO highly scalable over all the variables (dimensions) used. The loss for the surrogate model's update is calculated using the Wasserstein distance between the parameter distributions. BO uses a Gaussian process to model the distribution of an objective function and an acquisition function to decide which points to evaluate. For data points in a target dataset x ∈ X with corresponding labels y ∈ Y, an SNN with network structure V and neuron parameters W acts as a function f_{V,W}(x) that maps input data x to a predicted label ỹ.
The optimization problem in this work is defined as

min_{V,W} Σ_{x∈X, y∈Y} L(y, f_{V,W}(x))   (9)

where V is the set of hyperparameters of the neurons in R (details of the hyperparameters are given in the Supplementary) and W is the multivariate distribution constituted by the distributions of (i) the membrane time constants τ_{m-E}, τ_{m-I} of the LIF neurons, (ii) the scaling-function constants (A_+, A_−), and (iii) the decay time constants τ_+, τ_− of the STDP learning rule in S_RR. BO needs a prior distribution of the objective function f(x⃗) on the given data D_{1:k} = {x⃗_{1:k}, f(x⃗_{1:k})}. In Gaussian Process (GP)-based BO, we assume that the prior distribution of f(x⃗_{1:k}) follows a multivariate Gaussian distribution, i.e., a GP with mean μ⃗_{D_{1:k}} and covariance Σ⃗_{D_{1:k}}. We estimate Σ⃗_{D_{1:k}} using a modified Matern kernel whose distance function d(x, x′) is the Wasserstein distance between the multivariate distributions of the different parameters: given two distributions of hyperparameters x_1, x_2, the distance d(x_1, x_2) between them is used in the Matern kernel of the modified BO. We want to learn the optimal distribution of hyperparameters x′ that maximizes the performance. Note that for higher-dimensional metric spaces, we use the Sinkhorn distance, a regularized version of the Wasserstein distance, as an approximation (Feydy et al., 2019). D_{1:k} are the points evaluated by the objective function; the GP estimates the mean μ⃗_{D_{k:n}} and variance σ⃗_{D_{k:n}} for the remaining unevaluated data D_{k:n}.
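A minimal sketch of the distance computation that drives the modified kernel: the distance between two "points" (each a set of sampled hyperparameters) is the 1-D Wasserstein distance, which is then fed to a Matern-type kernel. The Matern-5/2 form, sample sizes, and Gaussian parameter values below are illustrative assumptions, not the paper's fitted distributions:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def matern52(d, length_scale=1.0):
    """Matern-5/2 kernel evaluated on a precomputed distance d."""
    r = np.sqrt(5.0) * d / length_scale
    return (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

# Two candidate "points" of the modified BO: samples of the membrane time
# constant tau_m drawn from two Gaussians (illustrative values only).
rng = np.random.default_rng(0)
x1 = rng.normal(20.0, 5.0, size=2000)   # tau_m ~ N(20, 5^2) ms
x2 = rng.normal(25.0, 8.0, size=2000)   # tau_m ~ N(25, 8^2) ms

d = wasserstein_distance(x1, x2)        # distance between the distributions
k = matern52(d, length_scale=10.0)      # GP covariance entry for this pair
print(d > 0, 0 < k < 1)                 # -> True True
```

Because the kernel consumes only a scalar distance between whole distributions, the GP's effective dimensionality no longer grows with the number of individual neurons or synapses.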
The acquisition function used in this work is the expected improvement (EI) of the prediction fitness:

EI(x⃗_{k:n}) = (μ⃗_{D_{k:n}} − f(x_best)) Φ(Z⃗) + σ⃗_{D_{k:n}} ϕ(Z⃗), with Z⃗ = (μ⃗_{D_{k:n}} − f(x_best)) / σ⃗_{D_{k:n}}   (10)

where Φ(·) and ϕ(·) denote the cumulative distribution function and the probability density function of the standard normal distribution, respectively, and f(x_best) = max f(x⃗_{1:k}) is the best value evaluated so far. BO chooses the point x_j = argmax{EI(x⃗_{k:n}); x_j ⊆ x⃗_{k:n}} as the next point to be evaluated with the original objective function.
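The EI acquisition can be sketched as follows; the posterior means, standard deviations, and the optional exploration parameter ξ are illustrative:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """EI(x) = (mu - f_best - xi) * Phi(Z) + sigma * phi(Z),
    Z = (mu - f_best - xi) / sigma, with EI = 0 wherever sigma = 0."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    imp = mu - f_best - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(sigma > 0, imp / sigma, 0.0)
        ei = np.where(sigma > 0, imp * norm.cdf(z) + sigma * norm.pdf(z), 0.0)
    return ei

mu = np.array([0.5, 1.2, 0.9])     # GP posterior means (illustrative)
sigma = np.array([0.3, 0.1, 0.4])  # GP posterior standard deviations
ei = expected_improvement(mu, sigma, f_best=1.0)
next_idx = int(np.argmax(ei))      # point to evaluate next
print(next_idx)                    # -> 1
```

Note that EI is non-negative by construction (it is the expectation of max(0, f − f_best)), so candidates below the incumbent can still be selected when their posterior uncertainty is large.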

A.3.1 OPTIMIZED HYPERPARAMETERS

The list of the hyperparameters optimized using the Bayesian Optimization technique is shown in Table 5, along with their ranges and initial values. In addition, Table 6 lists the final optimized distributions of the STDP and LIF parameters obtained using BO.

A.3.2 CONVERGENCE ANALYSIS

We compare the convergence of the three Bayesian Optimization techniques; the results are shown in Fig. 6. Each experiment was repeated five times, and the mean and variance of the observations are shown in the figure. Note that since we define the BO as a minimization problem, we minimize 1/C, S, and 1/E.

A.4 COMPARING BAYESIAN OPTIMIZATION OBJECTIVE FUNCTIONS

We show the Bayesian Optimization results for the three cases considered in this paper, for both the classification and prediction problems. The results for the classification problem are shown in Table 7, where we tabulate the memory capacity, the average spike count, and the observed accuracy for the three BO cases. Similarly, the results for the prediction problem are shown in Table 8, where we tabulate the memory capacity, the average spike count, and the observed NRMSE. We rerun each experiment five times and report the mean and standard deviation of the results obtained.

A.5 COMPARING THE GENERALIZABILITY

We observed that increasing the neuronal heterogeneity increases the memory capacity of the network. However, this increase in memory capacity might lead to a model that overfits the training data. The heterogeneous STDP model with varying synaptic dynamics, in contrast, gives rise to a heavy-tailed Feller process. Recent works (Simsekli et al., 2020; Chakraborty & Mukhopadhyay, 2021) show that the Hausdorff dimension of the trajectories of the sample paths of the learning algorithm can control the generalization error; this is intimately linked to the tail behavior of the driving process. The authors showed that heavier-tailed processes achieve better generalization. Thus, the tail index of the process can be used as a capacity metric that estimates the generalization error and does not necessarily grow with the number of parameters. The authors discuss that the stochastic process for the synaptic weights behaves like a Lévy motion around a local point. Because of this locally regular behavior, the Hausdorff dimension can be bounded by the Blumenthal-Getoor (BG) index (Blumenthal & Getoor, 1960), which depends on the tail behavior of the Lévy process. Thus, we can use the BG index as a bound on the Hausdorff dimension of the trajectories of the STDP learning process. Since the Hausdorff dimension is a measure of the generalization error and is controlled by the tail behavior of the process, heavier tails imply lower generalization error. In this paper, we empirically study the generalization ability of the HRSNN network using the BG index as a metric. We performed the experiments on the four ablation-study models for the classification task on the SHD dataset, and the results are reported in Table 9. From the table, we see that heterogeneity in STDP improves the generalization error the most, while heterogeneity in the LIF neurons increases the training and testing accuracies.

A.6 RESULTS ON LIMITED TRAINING DATA

We have trained the models with limited training data. We observe that the HRSNN model with heterogeneous LIF and STDP dynamics not only has better testing accuracy but also shows better generalization behavior compared to the homogeneous RSNN and to the ablation models with heterogeneity in only one of the two. We also see that the HRSNN model with heterogeneous STDP shows distinctly better generalization ability than the HRSNN with heterogeneous LIF neurons, while the latter achieves significantly higher training and testing accuracy. This can be interpreted as follows: since heterogeneous LIF dynamics increase the memory capacity, they can lead to overfitting of the data, while heterogeneous STDP dynamics help in obtaining more generalizable solutions. Each has its own downsides; however, HRSNN with both heterogeneous LIF and STDP dynamics shows better performance and generalization abilities, as seen in Table 10.

A.7 FURTHER EVALUATIONS

In Section B, we argued that as the heterogeneity in the neuronal parameters increases, the covariance decreases; hence the neurons become less correlated. In this section, we give empirical results to support the theory. We tested the model on more complex datasets: (i) the Spiking Heidelberg Digits (SHD) dataset and (ii) the Spiking Speech Commands (SSC) dataset, both audio-based classification datasets for which input spikes and output labels are provided (Cramer et al., 2020), and (iii) the CIFAR10-DVS dataset (Li et al., 2017).

• Impact of Heterogeneity on Covariance: We plot the covariance matrices for different levels of heterogeneity J (Eq. 46) for a small network with 50 neurons. The covariance matrix is calculated by averaging the neuronal states before the appearance of the first spike in the final layer. We see that as the heterogeneity in the neuronal parameters increases, the correlation between the neurons decreases. The results are shown in Fig. 7.

• Impact of Heterogeneity on Principal Components: The covariance plots show that increasing J reduces the correlation between neurons. We also plot the probability density functions of the eigenvalues of the neurons' covariance matrix with increasing heterogeneity in the neuronal parameters. With higher heterogeneity in the neuronal parameters J, the distribution of the eigenvalues of the covariance becomes flatter, signifying that the covariance matrix has lower variance for higher J. A flatter distribution also indicates that a larger number of principal components are active. This supports our hypothesis that heterogeneity in the neuronal parameters increases the number of active principal components and helps increase the model's memory capacity. The result is shown in Fig. 8.

• Impact of Heterogeneity in STDP on Firing Rate: We plot the mean firing rate of the neurons for the four types of HRSNNs and for an MRSNN with homogeneous LIF and STDP dynamics.
We plot the results for a smaller network with 100 neurons and a Poisson input process. The MRSNN model shows a much higher firing rate, especially at higher frequencies, demonstrating that MRSNN requires significantly more spikes than the HRSNN model. The result is shown in Fig. 9.

• Coupling Strength: We note that in this paper we use (homogeneous or heterogeneous) STDP to learn the synaptic conductances connecting the neurons in the SNN. Therefore, we do not control the synaptic coupling strength as an independent variable and hence cannot perform control experiments with various extents of coupling strength. An interesting future extension of these results will be quantifying the coupling strength for HRSNN with heterogeneity in LIF and STDP dynamics; for this, we can leverage McKenzie et al. (2021), who proposed statistical tools to estimate synaptic coupling dynamics from spike-spike correlations.

We make several approximations and assumptions for this section's theoretical analysis of the heterogeneous RSNN networks. First, the analytical relations are derived by considering each form of heterogeneity individually: when we consider heterogeneity in the neuronal parameters, we assume homogeneous STDP dynamics, and vice versa. In addition, we assume the diffusion approximation: if a neuron receives Poissonian uncorrelated input spike trains and the contribution of a single synaptic connection is small compared to the distance between reset and threshold, w ≪ (V_Θ − V_0), the random input can be approximated by Gaussian white noise with mean μ and noise intensity σ². This approximation does not hold if the network features highly correlated activity or receives strong external input common to many neurons. We also assume a fast/slow synaptic regime in which the synaptic time constant τ_s is much shorter/longer than the membrane time constant τ_m.
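The diffusion approximation can be checked numerically: the superposition of many weak Poisson inputs has the mean and variance of the equivalent Gaussian white-noise drive. The rates, weights, and time step below are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 1e-4, 200_000            # 20 s of input at 0.1 ms resolution
N, w, rate = 1000, 0.02, 5.0     # synapses, weight (w << V_theta - V_0), Hz

# The superposition of N independent Poisson trains is again Poisson,
# so we can draw the summed spike count per time step directly.
counts = rng.poisson(N * rate * dt, size=T)
I = w * counts / dt              # total synaptic input (arbitrary units)

# Gaussian (diffusion) approximation:
mu_theory = N * w * rate                 # mean drive
var_theory = N * w ** 2 * rate / dt      # variance of the white-noise drive
print(abs(I.mean() - mu_theory) / mu_theory < 0.02,
      abs(I.var() - var_theory) / var_theory < 0.05)
```

The approximation degrades exactly as stated above: correlated inputs inflate the variance beyond N w² rate, and large w makes the discrete jumps visible against the Gaussian envelope.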
In this work, we consider a mean-field approximation of the HRSNN network, with heterogeneity in the LIF neuron parameters and in the STDP dynamics considered independently.

B.2 MEAN-FIELD REDUCTION MODEL OF HRSNN

In this section, we model the HRSNN network using heterogeneity in only the LIF neuron parameters. Following Ly (2015), the equations for the excitatory neurons, indexed by j ∈ {1, 2, …, N_e}, are:

τ_m dv_j/dt = −v_j − g_ei(t)(v_j − E_I) − g_ee(t)(v_j − E_E) + σ_E η_j(t)   (11)

v_j(t*) ≥ θ_j (refractory period) ⇒ v_j(t* + τ_ref) = 0   (12)

τ_n dη_j/dt = −η_j + √τ_n ξ_j(t)   (13)

g_ee(t) = (q_j γ_ee / (p_ee N_e)) Σ_{j′ ∈ presyn E cells} G_{j′}(t)   (14)

g_ei(t) = (γ_ei / (p_ei N_i)) Σ_{k′ ∈ presyn I cells} G_{k′}(t)   (15)

τ_d dG_j/dt = −G_j + A_j   (16)

τ_r dA_j/dt = −A_j + τ_r α Σ_l δ(t − t_l)   (17)

where E_I and E_E are the inhibitory and excitatory reversal potentials, respectively, with E_I < 0 < E_E; ξ_j(t) are uncorrelated white-noise processes; and p_xy is the proportion of neurons of type y (randomly chosen) that provide presynaptic input to neurons of type x (x, y ∈ {e, i}). Eq. 12 describes the refractory period at spike time t*: when the neuron's voltage crosses the threshold θ_j, the neuron enters a refractory period of duration τ_ref during which the voltage is undefined, after which the voltage is reset to 0. In Eq. 17, t_l denotes the spike times of the jth excitatory neuron. For the mean-field analysis, we use q_j to model the synaptic heterogeneity between pre- and postsynaptic neurons by modulating the synaptic conductance for both the excitatory and inhibitory neurons. The assumptions of the mean-field analysis are: 1. finite-size effects are negligible (N_{e/i} ≫ 1); 2. the firing rate of presynaptic neurons is governed by a Poisson process; 3. the population firing rate averaged over q and τ_m is a good approximation to the average presynaptic input rate; and 4. a single p.d.f. is sufficient to describe the population behavior (for finite N), with the heterogeneity driven by the pairs (q_j, τ_{m,j}). Similarly, for the inhibitory neurons, indexed by k ∈ {1, 2, …, N_i}, the equations are:

τ_m dv_k/dt = −v_k − g_ii(t)(v_k − E_I) − g_ie(t)(v_k − E_E) + σ_I η_k(t)   (18)

v_k(t*) ≥ 1 (refractory period) ⇒ v_k(t* + τ_ref) = 0   (19)

τ_n dη_k/dt = −η_k + √τ_n ξ_k(t)   (20)

g_ie(t) = (q_j γ_ie / (p_ie N_e)) Σ_{j′ ∈ presyn E cells} G_{j′}(t)   (21)

g_ii(t) = (γ_ii / (p_ii N_i)) Σ_{k′ ∈ presyn I cells} G_{k′}(t)   (22)

τ_d dG_k/dt = −G_k + A_k   (23)

τ_r dA_k/dt = −A_k + τ_r α Σ_l δ(t − t_l)   (24)

Please refer to Ly (2015) for details regarding these equations. Since the recurrently coupled stochastic network is difficult to describe theoretically, we use population density methods, in which an equation determines the probability of a neuron being in a particular state and the population variables are described by distribution functions. The two forms of heterogeneity introduce a large number of dimensions; for simplicity, one can track a family of probability density functions, one for each pair (q_j, τ_{m,j}). Under the assumptions listed above, the following equations are a good approximation to the HRSNN network. For each pair of values (q_j, τ_{m,j}), the probability density function ρ is defined by:

∫_Ω ρ(v_E, w_E, v_I, w_I, t) dv_E dw_E dv_I dw_I = Pr((v_E(t), w_E(t), v_I(t), w_I(t)) ∈ Ω)   (25)

where w_X denotes the other state variables of the corresponding neuron type X ∈ {E, I}, consisting of the conductances and the colored noise: w_X = (g_X, a_X, η_X).
The evolution of the p.d.f. is governed by a continuity equation with boundary conditions; the definitions of g_XY in the LIF neuron equations above result in a total conductance of γ_XY g_Y on average:

∂ρ/∂t = −∇ · J   (26)

J := (J_{v_E}, J_{g_E}, J_{a_E}, J_{η_E}, J_{v_I}, J_{g_I}, J_{a_I}, J_{η_I})   (27)

J_{v_E} := −(1/τ_m) [v_E + qγ_ei g_I (v_E − E_I) + qγ_ee g_E (v_E − E_E) + σ_E η_E] ρ   (28)

J_{v_I} := −(1/τ_m) [v_I + γ_ii g_I (v_I − E_I) + γ_ie g_E (v_I − E_E) + σ_I η_I] ρ   (29)

J_{g_X} := −(1/τ_d) [g_X − a_X] ρ   (30)

J_{a_X} := −(a_X/τ_r) ρ + ν_X(t) ∫_{a_X − α_X}^{a_X} ρ(…, a′_X, …) da′_X   (31)

J_{η_X} := −(1/τ_n) η_X ρ − (1/τ_n) ∂ρ/∂η_X   (32)

ν_X(t) := ∫ J_{v_X}|_{v_X = θ} dw_X dq dτ_m   (33)

J_{w_X}|_{∂w_X} = 0   (34)

We now describe an insightful analytic reduction that captures how the range of excitatory firing rates changes in different regimes. We focus only on the excitatory neurons, which have fewer state variables when the inhibitory population is ignored or assumed known. Let r denote the approximate excitatory firing rate. Defining ḡ_E := γ_ee g_E, ḡ_I := γ_ei g_I, and η̄_E := σ_E η_E, the deterministic dynamics

τ_m dv_E/dt = −v_E − q ḡ_I (v_E − E_I) − q ḡ_E (v_E − E_E) + η̄_E   (35)

have steady state v_∞ := (q(ḡ_E E_E + ḡ_I E_I) + η̄_E)/(1 + q(ḡ_E + ḡ_I)) and deterministic firing rate

r_0(q, τ_m; w̄_E) = 0 if v_∞ ≤ θ; r_0(q, τ_m; w̄_E) = (1 + q(ḡ_E + ḡ_I)) / (τ_m ln[v_∞/(v_∞ − θ)]) if v_∞ > θ   (36)

Finally, the state variables are integrated against their marginal density to get:

r(q, θ) = E[r_0/(1 + r_0 τ_ref)] = ∫ r_0/(1 + r_0 τ_ref) ρ(ḡ_E, ḡ_I, η̄_E) dw̄_E   (37)

There is a slight abuse of notation here because the auxiliary variables a_X affect the conductances but are not written in the previous equation; the emphasis is on how (ḡ_E, ḡ_I, η̄_E) directly affects r.
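A minimal sketch of the deterministic LIF rate used in this reduction, following Ly (2015); the reversal potentials, threshold, and function name are illustrative assumptions rather than the paper's values:

```python
import numpy as np

def r0(q, tau_m, gE, gI, etaE, EE=6.5, EI=-0.5, theta=1.0):
    """Deterministic LIF firing rate of the reduced excitatory equation:
    v_inf   = (q*(gE*EE + gI*EI) + etaE) / (1 + q*(gE + gI))
    tau_eff = tau_m / (1 + q*(gE + gI))
    r0      = 0                                        if v_inf <= theta
            = 1 / (tau_eff * log(v_inf/(v_inf-theta))) otherwise."""
    g_tot = 1.0 + q * (gE + gI)
    v_inf = (q * (gE * EE + gI * EI) + etaE) / g_tot
    if v_inf <= theta:
        return 0.0
    return g_tot / (tau_m * np.log(v_inf / (v_inf - theta)))

print(r0(q=1.0, tau_m=0.02, gE=0.5, gI=0.5, etaE=0.0))  # ~91 Hz
print(r0(q=0.1, tau_m=0.02, gE=0.5, gI=0.5, etaE=0.0))  # 0.0 (subthreshold)
```

Averaging r_0/(1 + r_0 τ_ref) over Monte Carlo samples of (ḡ_E, ḡ_I, η̄_E) then yields the population rate r(q, θ) of Eq. 37.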
Since the external noise is applied indiscriminately, η̄_E is independent of the other variables, and the marginal density factors as

ρ(ḡ_E, ḡ_I, η̄_E) = ρ(ḡ_E, ḡ_I) · e^{−(η̄_E/σ_E)²} / (σ_E √π)   (38)

However, ρ(ḡ_E, ḡ_I) is still not analytically tractable, so we rely on Monte Carlo simulations to estimate it numerically. We note that this is a reduction model of the HRSNN network with many simplifying assumptions, not a complete mean-field derivation; a full mean-field treatment of the HRSNN model with both heterogeneous LIF neurons and heterogeneous STDP dynamics is a fascinating research question but beyond the scope of this paper.

B.3 ANALYTICAL RESULTS OF MEMORY CAPACITY

Networks of spiking neurons are increasingly used to understand the mechanisms underlying phenomena observed in electrophysiological recordings. There are two complementary strategies for studying such a recurrent network of spiking neurons: (a) numerical simulations and (b) analytical methods using mean-field models. With numerical simulations, we can simulate any network model without approximation. However, this method typically operates in a high-dimensional parameter space and is thus hard to interpret; it is also generally hard to characterize the parameter regions where specific behaviors are found. With analytical calculations, on the other hand, we obtain deeper insights into the mechanisms underlying specific behaviors and can identify the critical parameters that control them. We therefore study how the estimated memory capacity varies with the heterogeneity of the neuronal parameters. We plot the change in the estimated memory capacity C, calculated using Eq. 3, against the neuronal heterogeneity H, measured using the entropy of the neuronal parameters of the HRSNN model. The result is plotted in Fig. 10(a). We use an HRSNN model with N_R = 1000 and sequences of 4,000 random inputs chosen from U[−1, 1]. We see that, as predicted, the memory capacity of the model increases linearly with heterogeneity within the limits of the application, as proved in Theorem 1 (the error bars in Fig. 10 show the variation over repeated experiments). Using heterogeneity in the STDP parameters reduces the average number of spiking activations while keeping the memory capacity almost equal. This result shows that heterogeneous STDP leads to sparse activation of neurons, as proved in Theorem 2.

Comparison with Neuroscience Works: We compare the analytical results obtained with some of the standard recurrent LIF network models in the literature.
Brunel (2000) analytically studies the dynamics of a sparsely connected network of excitatory and inhibitory integrate-and-fire neurons, showing the existence of a diverse set of states: synchronous states in which neurons fire regularly; asynchronous states with stationary global activity and very irregular individual-cell activity; and states in which the global activity oscillates but individual cells fire irregularly, typically at rates lower than the global oscillation frequency. In this paper, we use heterogeneity in the LIF neurons, which leads to a diverse set of neuronal states and consequently helps orthogonalize the state-space dynamics, increasing the information stored in the memory of the network. Denève & Machens (2016) discussed the inefficiency of irregular Poisson rate encoding in the brain. The authors argue that the Poisson point process, which we use to model the spike firing rate, is extremely inefficient, as it exponentially increases the number of spikes required to convey information. They further discuss that a continuum exists between loosely balanced and tightly balanced spike-coding networks in neuroscience. Loosely balanced networks are inefficient but cheap in terms of the number of connections per neuron and structure (Boerlin et al., 2013; Boerlin & Denève, 2011; Bourdoukan et al., 2012). Tightly balanced spike-coding networks, on the other hand, are highly efficient but require extremely structured, dense connections that the STDP rules must constantly maintain. Since we are engineering an artificial spiking neural network, the HRSNN model is highly structured and constantly updated using the heterogeneous STDP rules. Thus, we may regard HRSNN as a tightly balanced network that enables an efficient transfer of information.
This hypothesis is supported by the results shown in Table 2, where the HRSNN model achieves higher performance using a smaller number of spikes.

B.4 MEMORY CAPACITY

Let x(t) ∈ U (where −∞ < t < +∞ and U ⊂ R is a compact interval) be a single-channel stationary input signal. Assume that we have an RSNN specified by its internal weight matrix W, its input weight vector w_in, and the unit output functions f, f_out. The network receives x(t) at its input unit. For a given delay τ and an output unit y_τ with connection weight vector w_τ^out, we consider the determination coefficient

d[w_τ^out](x(t − τ), y_τ(t)) = Cov²(x(t − τ), y_τ(t)) / (σ²(x(t)) σ²(y_τ(t))), with y_τ(t) = w_τ^out (x(t), r(t)),

where Cov denotes covariance and σ² variance. The τ-delay memory capacity of the network is defined by C_τ = max_{w_τ^out} d[w_τ^out](x(t − τ), y_τ(t)), and the memory capacity of the network is C = Σ_{τ=1}^{∞} C_τ. The determination coefficient of two signals is the squared correlation coefficient; it ranges between 0 and 1 and represents the fraction of variance in one signal explainable by the other. Thus, the memory capacity measures how much variance of the delayed input signal can be recovered from optimally trained output units, summed over all delays. Note that the output units do not interfere: arbitrarily many output units y_τ can be attached to the same network. The performance of the heterogeneous network model derives from its ability to retain the memory of previous inputs. To quantify the relationship between the recurrent-layer dynamics and the memory capacity, we note that information is extracted from the recurrent layer through a linear combination of the neurons' states. Hence, more linearly independent neurons offer more variable states and, thus, more extended memory. For reservoir computing (RC), Jaeger (2002) shows that C is bounded by the reservoir network size for linear RC with the identity activation function and independent and identically distributed (i.i.d.) input. Memory capacity (C) is used to quantify the memory of an RSNN.
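A minimal sketch of how C can be estimated in practice: train a least-squares readout for each delay τ and sum the squared correlations. The tanh reservoir, its size, and the spectral radius below are illustrative choices, not the paper's HRSNN:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, max_tau = 50, 5000, 30
W = rng.normal(0, 1, (N, N)) / np.sqrt(N)
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # set spectral radius to 0.9
w_in = rng.normal(0, 1, N)
x = rng.uniform(-1, 1, T)                   # i.i.d. input from U[-1, 1]

r = np.zeros((T, N))
for t in range(1, T):
    r[t] = np.tanh(W @ r[t - 1] + w_in * x[t])   # recurrent-layer update

def capacity(tau, burn=100):
    """C_tau: squared correlation between x(t - tau) and the best linear
    readout of r(t) (the R^2 of a least-squares regression)."""
    R, target = r[burn:], x[burn - tau:T - tau]
    w_out, *_ = np.linalg.lstsq(R, target, rcond=None)
    return np.corrcoef(target, R @ w_out)[0, 1] ** 2

C = sum(capacity(tau) for tau in range(1, max_tau + 1))
print(round(C, 2))
```

Consistent with the definition, each C_τ lies in [0, 1] and decays with the delay, so the sum is dominated by the most recent inputs.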
This memory capacity measures the ability of RC to reconstruct precisely the past information of the model input; the network's structural properties can also greatly impact the C of linear RC. Why maximize the memory capacity of the network? C normally serves as a global index quantifying the memory property of the network, and maximizing C acts as an estimator for better prediction results of the trained network. Since the first-order approximation of the model is linear, the heterogeneity between state variables depends on all the eigenvalues of the adjacency matrix, with a larger mean eigenvalue meaning higher heterogeneity. Hence, we can use the eigenvalues {λ_i} of the weight matrix W to quantify approximately how fast the input decays in the recurrent layer. In other words, the eigenvalues of W should be related to the memory capacity of the heterogeneous neural network model. Indeed, we find that the average eigenvalue modulus ⟨|λ|⟩ = (1/N_R) Σ_{i=1}^{N_R} |λ_i| strongly correlates with H and therefore with C as well. Note that, unlike C and H, ⟨|λ|⟩ is much easier to compute and is solely determined by the recurrent-layer network. The memory capacity reflects the precision with which previous inputs can be recovered; the nonlinearity of the recurrent layer and other far-in-the-past inputs induce noise that complicates recovery. Thus, similar to the analysis done by Aceituno et al. (2020) for echo state networks, the variance of the linear part of the recurrent layer is placed so as to maximize the recoverable information: the inputs are projected into orthogonal directions of the recurrent-layer state space so as not to add noise to each other.
The variance spread across the different dimensions should be evenly distributed within those orthogonal directions, which we quantify via the neurons' covariance. We start by noticing that the linear nature of the projection vector w_out implies that we are treating the system as

r(t) = Σ_{τ=0}^{∞} a_τ x(t − τ) + ε(t)   (39)

where the vectors a_τ ∈ R^{N_R} correspond to the linearly extractable effect of x(t − τ) on r(t), and ε(t) is the nonlinear contribution of all the inputs to the state r(t). Previous works have shown that linear recurrent layers have more extended memory, but nonlinearity is needed to perform interesting computations. Here we show that, for a fixed ratio of nonlinearity, greater heterogeneity leads to lower neuronal correlation and hence higher memory capacity. To maintain this trade-off between linear and nonlinear behavior, we assume that the distribution of linear and nonlinear strengths is fixed. This can be achieved by imposing that the probabilities of the neuron states do not change, meaning that the mean, variance, and other moments of the neuron outputs are unchanged; hence, the strength of the nonlinear effects is unchanged. A first constraint can then be obtained from the maintained strength of the linear side of Eq. 39:

Var(Σ_{τ=1}^{∞} a_τ x(t − τ)) = c   (40)

where c is a constant.

Lemma 3.1.1: The state of the neuron can be written as

r_i(t) = Σ_{k=0}^{∞} Σ_{n=1}^{N_R} λ_n^k ⟨v_n^{−1}, w_in⟩ (v_n)_i x(t − k)   (41)

where v_n, v_n^{−1} ∈ V are, respectively, the left and right eigenvectors of W, λ_n belongs to the diagonal matrix containing the eigenvalues of W, and a_i = [a_{i,0}, a_{i,1}, …] represents the coefficients that the previous inputs x_t = [x(t), x(t − 1), …] have on r_i(t).

Proof: We build on the work of Aceituno et al. (2020), who showed that higher heterogeneity among the neuronal states implies higher memory capacity.
Here we aim to show that as the number of neurons N_R in the recurrent layer decreases, heterogeneity increases the spectral radius; more formally, the spectral radius |λ_n| is directly proportional to H as N_R decreases. We express the state of a neuron r_i(t) as

r_i(t) = Σ_{k=0}^{∞} (W^k w_in)_i x(t − k) = Σ_{k=0}^{∞} a_{i,k} x(t − k) = ⟨a_i, x_t⟩   (42)

where the vector a_i = [a_{i,0}, a_{i,1}, …] represents the coefficients that the previous inputs x_t = [x(t), x(t − 1), …] have on r_i(t). We can then plug this into the covariance between two neurons:

Cov(r_i, r_j) = lim_{T→∞} (1/T) Σ_{q=t}^{t+T} ⟨a_i, x_q⟩ ⟨a_j, x_q⟩ = ⟨a_i, a_j⟩ lim_{T→∞} (1/T) Σ_{q=0}^{T} ⟨x_q, x_q⟩ = ⟨a_i, a_j⟩ E[x²(t)] = ⟨a_i, a_j⟩   (43)

where the cross terms vanish because the inputs are i.i.d. and the input variance is normalized to one. Now we write a_i as a function of the eigenvalues of W. Using the eigenvalue decomposition of the weight matrix W, we rewrite the state of the neuron as

r_i(t) = Σ_{k=0}^{∞} Σ_{n=1}^{N_R} λ_n^k ⟨v_n^{−1}, w_in⟩ (v_n)_i x(t − k)   (44)

where v_n, v_n^{−1} ∈ V are, respectively, the left and right eigenvectors of W, and λ_n belongs to the diagonal matrix containing the eigenvalues of W. ∎

As shown by Aceituno et al. (2020), the memory capacity increases when the variance along the projections of the input into the recurrent-layer state has higher heterogeneity. This can be expressed in terms of the state space of the recurrent layer: we aim to project the inputs into orthogonal directions of the network state space. Thus, we model the system as

r(t) = Σ_{τ=1}^{∞} a_τ x(t − τ) + ε(t)   (45)

where the vectors a_τ ∈ R^N correspond to the linearly extractable effect of x(t − τ) on r(t), and ε(t) is the nonlinear contribution of all the inputs to the state r(t).
Since our goal is a variance as homogeneous as possible along the directions of a_τ, we need the variance to be as homogeneous as possible along orthogonal directions, where the vectors a_τ ∈ R^N correspond to the linearly extractable effect of the input x(t) on the neuronal states r(t). Since the eigenvectors of Σ are orthogonal, the variances along those directions are given by the eigenvalues λ_n(Σ) of the covariance matrix Σ. Thus, we work with the distribution of the eigenvalues of the covariance matrix. Specifically, we want to show that increasing the heterogeneity in the neuronal membrane time constants decreases the correlation between the neuron states, which decreases the spread of the eigenvalues of the covariance matrix and thereby increases the memory capacity C. We quantify the spread of the eigenvalues by

J = Σ_{n=1}^{N_R} λ_n²(Σ) / (Σ_{n=1}^{N_R} λ_n(Σ))²   (46)

where λ_n(Σ) is the nth eigenvalue of Σ. To get an intuition for this metric, consider the case of two eigenvalues λ_1, λ_2: when λ_1 = λ_2 (the variance is spread evenly), J = 1/2, but when λ_1 > 0 and λ_2 = 0 (the variance is concentrated in one direction), J = 1. The membrane time constant is given by the product of the membrane resistance R_m and the membrane capacitance C_m, τ_m = R_m C_m. R_m is the inverse of the permeability: the higher the permeability, the lower the resistance, and vice versa. Thus, the lower the time constant, the more rapidly a membrane responds to a stimulus. Hence, variability in the membrane time constants leads to variability in the propagation velocity of action potentials.
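Eq. 46 can be computed directly from the eigenvalues of the covariance matrix; the two-eigenvalue sanity checks from the text follow:

```python
import numpy as np

def J(Sigma):
    """Eigenvalue-spread metric of Eq. 46:
    J = sum_n lambda_n(Sigma)^2 / (sum_n lambda_n(Sigma))^2."""
    lam = np.linalg.eigvalsh(Sigma)
    return float((lam ** 2).sum() / lam.sum() ** 2)

# Two-eigenvalue sanity checks from the text:
print(J(np.diag([1.0, 1.0])))   # equal eigenvalues     -> 0.5
print(J(np.diag([1.0, 0.0])))   # concentrated spectrum -> 1.0
```

More generally, J = 1/N_R for a perfectly flat spectrum and J = 1 when all variance lies along a single principal component, so 1/J acts as an effective count of active principal components.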
Now,

$$\left(\sum_{n=1}^{N_R} \lambda_n(\Sigma)\right)^2 = (\mathrm{tr}[\Sigma])^2 = \left(\sum_{n=1}^{N_R} \mathrm{Var}(r_n(t))\right)^2,$$

which is constant by the assumption that the probability distributions of the neuron activities are fixed. Hence we can focus on the value of $\sum_{n=1}^{N_R} \lambda_n^2(\Sigma)$. Since $\Sigma^k e_n(\Sigma) = \lambda_n(\Sigma)\Sigma^{k-1} e_n(\Sigma) = \lambda_n^k(\Sigma) e_n(\Sigma)$, where $e_n(\Sigma)$ and $\lambda_n(\Sigma)$ are, respectively, the $n$th eigenvector and eigenvalue of $\Sigma$, we have

$$\sum_{n=1}^{N_R} \lambda_n^2(\Sigma) = \mathrm{tr}[\Sigma^2]. \qquad (49)$$

We can therefore compute this quantity by expanding the square of the covariance matrix:

$$\sum_{n=1}^{N_R} \lambda_n^2(\Sigma) = \sum_{n=1}^{N_R} \sum_{m=1}^{N_R} \Sigma_{nm} \Sigma_{mn} = \sum_{n=1}^{N_R} \sum_{m=1}^{N_R} \mathrm{Cov}^2(x_n(t), x_m(t)),$$

where $\Sigma_{nm}$ denotes the $(n, m)$ entry of $\Sigma$. Thus, $\sum_{n=1}^{N_R} \lambda_n^2(\Sigma)$ increases as the neurons become more correlated, i.e., as heterogeneity decreases. From Eqs. 46 and 49, the heterogeneity is inversely proportional to $\sum_{n,m} \mathrm{Cov}^2(x_n(t), x_m(t))$. We see that increasing the correlations between neuronal states decreases the heterogeneity of the eigenvalues, which reduces the memory capacity of the model. We have also shown that the determinant of the covariance between neuronal parameters bounds the heterogeneity. Thus, as $H$ increases, the covariance decreases and the neurons become less correlated. Aceituno et al. (2020) proved that the neuronal state correlation is inversely related to the memory capacity of the network. Hence, we conclude that as $H$ increases, the memory capacity $C$ also increases; thus, for HRSNN with $H > 0$, $C_H \geq C_M$. ∎
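The identity in Eq. 49 can be checked numerically on an arbitrary covariance matrix: the sum of squared eigenvalues equals the sum of squared entries of $\Sigma$. The data below is synthetic, chosen only to produce a correlated covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated synthetic data: mix independent columns through a random matrix.
X = rng.normal(size=(5000, 6)) @ rng.normal(size=(6, 6))
Sigma = np.cov(X, rowvar=False)

# Left-hand side: sum of squared eigenvalues = tr(Sigma^2).
eig_sum = float(np.sum(np.linalg.eigvalsh(Sigma) ** 2))
# Right-hand side: sum over all squared covariances Cov^2(x_n, x_m).
entry_sum = float(np.sum(Sigma ** 2))
```

The two quantities agree to machine precision for any symmetric $\Sigma$.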

B.5 SPIKING EFFICIENCY

In this section, we model the spiking activity using a multivariate point process. A point process is a collection of random points on some underlying mathematical space, such as the real line, the Cartesian plane, or more abstract spaces. The use of point-process models, and in particular interacting Hawkes processes, to model the spiking dynamics of LIF networks has been studied previously (Löcherbach, 2017; Galves & Löcherbach, 2016; Mascart, 2021; Pfaffelhuber et al., 2022). We leverage these results to prove that heterogeneity in the synaptic dynamics can help reduce the spike count, as already discussed in the paper; the key assumptions used in deriving the results are highlighted in Suppl. Sec. C. Löcherbach (2017) surveys aspects of high-dimensional Hawkes processes as models of biological neural systems and studies their long-term behavior. Galves & Löcherbach (2016) provide an overview of point processes used as stochastic models for interacting neurons in discrete and continuous time. Hawkes processes have likewise attracted recent interest in the mathematical-neuroscience literature for their ability to model the dependence of a neuron's activity on the network's history (Mascart, 2021; Pfaffelhuber et al., 2022; Galves & Löcherbach, 2016; Gerhard et al., 2017; Zhou et al., 2020; Duval et al., 2022). Other works have used nonlinear interacting Hawkes processes to model spiking neural networks with excitatory and inhibitory neurons (Chevallier et al., 2015; Chornoboy et al., 1988; Hansen et al., 2015; Reynaud-Bouret et al., 2014). Drawing from these works, we use a microscopic model describing a large network of interacting neurons that can generate oscillations at a macroscopic scale.
In the model, the activity of each neuron is represented by a point process indicating the successive times at which the neuron emits a spike; each realization of this point process is a spike train. We take the spiking intensity of a neuron as the probability of emitting a spike during the next instant, depending on the history of the neuron and the activity of the other neurons in the network. The neurons interact through their synapses: a spike of a pre-synaptic neuron leads to an increase of the membrane potential of the post-synaptic neuron if the synapse is excitatory, or a decrease if the synapse is inhibitory, possibly after some delay, mirroring the process of synaptic integration. The neuron fires a spike when the membrane potential reaches a certain upper threshold. Thus, excitatory inputs from the neurons in the network increase the firing intensity, and inhibitory inputs decrease it. Hawkes processes provide good models of this synaptic integration phenomenon through the structure of their intensity processes. This paper uses a general class of mean-field interacting Hawkes processes, modeling the reciprocal interactions between a population of excitatory neurons and a population of inhibitory neurons. Let us consider a subsection of the HRSNN network, denoted $N_x$, as shown in Fig. 11. We use the multivariate point-process model to create a probabilistic model that relates the inner structure of the sub-network and its spiking activity. In this model, each neuron $i$ has a background spiking intensity $\nu_i$ caused by neurons outside the network. When a neuron spikes, it affects its own future spiking activity and that of its post-synaptic neurons. The impact of a neuron $j$ on neuron $i$ is modeled by a real function $h_{j\to i}(t)$. This impact can be excitatory or inhibitory depending on whether the pre-synaptic neuron is excitatory or inhibitory, as shown in Fig. 11.
While spikes from excitatory neurons promote further spiking, spikes originating from inhibitory neurons suppress the spiking of the downstream neuron. A Hawkes process is a point process in which each point is associated with an event occurrence in time, and every event time affects the probability that other events take place subsequently. These processes are characterized by the conditional intensity function, seen as an instantaneous measure of the probability of event occurrence. Although the self-exciting Hawkes process remains widely studied, there has been growing interest in modeling the opposite effect, known as inhibition, in which the occurrence of certain events lowers the probability of observing an event. In practice, this amounts to considering negative kernel functions; to maintain the positivity of the intensity function, a nonlinear operator is added to the expression, which in turn entails the loss of the cluster representation. This model is known as the nonlinear Hawkes process, whose existence was proved by construction using bi-dimensional marked Poisson processes. The general Hawkes framework can be written as

$$\lambda_t^i = \Phi_i\left( \sum_{j \in S_{i,E}} \int_0^t h_{j\to i}(t-u) \, dZ_u^j \right), \qquad (50)$$

where $\lambda_t^i$ is the intensity of neuron $i$, $\Phi_i$ is a positive function, $Z_t^j$ is the counting process associated with neuron $j$, and $h_{j\to i}(t)$ is the synaptic kernel associated with the synapse between neurons $j$ and $i$. To simplify the notation, we can rewrite Eq. 50 as

$$\lambda_i(t) = \Phi_i\left( \sum_{k \in I} \int_{(0,t)} h_{ki}(t-s) \, dZ_k(s) \right),$$
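A discrete-time sketch of the nonlinear Hawkes intensity in Eq. 50 for a single self-exciting neuron: $\lambda(t) = \Phi(\mu + \sum_s h(t-s))$ with an exponential kernel $h(t) = \alpha e^{-\beta t}$ and $\Phi(\cdot) = \max(\cdot, 0)$ to keep the intensity nonnegative. The baseline $\mu$, kernel parameters, and Bernoulli-per-bin approximation are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T = 0.01, 50.0
mu, alpha, beta = 0.5, 0.8, 2.0   # baseline rate, kernel jump, kernel decay

h_state = 0.0   # running value of sum over past spikes of h(t - s)
spikes = []
for step in range(int(T / dt)):
    lam = max(mu + h_state, 0.0)      # Phi(.) = max(., 0)
    if rng.random() < lam * dt:       # Bernoulli approximation of the point process
        spikes.append(step * dt)
        h_state += alpha              # each spike adds a decaying kernel
    h_state *= np.exp(-beta * dt)     # exponential decay of the memory

rate = len(spikes) / T
```

For this subcritical regime ($\alpha/\beta < 1$), the long-run rate is approximately $\mu / (1 - \alpha/\beta)$, above the baseline because each spike excites further spikes.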
where $h_{ik}(t-s)$ measures the influence of neuron $k$ on neuron $i$ and how this influence vanishes with time; more precisely, $h_{ik}(t-s)$ describes how a spike of neuron $k$ lying $t-s$ time units in the past influences the present spiking rate at time $t$. The goal of using heterogeneity in the STDP dynamics is to obtain better orthogonalization among the recurrent network states and thereby lower the higher-order correlations in the spike trains. Studies have shown that higher-order correlations progressively decrease the information available through a neural population (Montani et al., 2009; Abbott & Dayan, 1999). Since we are trying to engineer a spike-efficient model, we leverage the heterogeneity in the STDP dynamics to reduce these higher-order correlations. The hypothesis is that heterogeneity in STDP helps orthogonalize the recurrent layer, which lets us achieve an efficient representation of the input spike patterns with fewer spikes. This may be interpreted as the recurrent layer acting as a set of orthogonal basis functions onto which the inputs are projected; having orthogonal bases allows the inputs to be mapped efficiently without much loss. While heterogeneous LIF neurons help us increase the number of principal components, enabling the network to store a larger subclass of features, heterogeneous STDP helps us efficiently encode this orthogonalization of the recurrent layer, resulting in fewer spikes compared to a homogeneous RSNN. Thus, in effect, heterogeneous STDP parameters can learn the output more precisely, which is projected back into the recurrent network. One of the primary reasons heterogeneous STDP helps project the input onto orthogonal activations of the recurrent network is the distribution of LTD dynamics, which increases competition and helps distribute the input projection across multiple principal components.
We now argue that the heterogeneous LTP/LTD dynamics in STDP lead to fewer spikes in the transmission of information. Lemma 3.2.1: If the neuronal firing rate of the HRSNN network with heterogeneity only in the LTP/LTD dynamics of STDP is denoted $\Phi_R$, and that of MRSNN is denoted $\Phi_M$, then the HRSNN model promotes sparsity in the neural firing, i.e., $\Phi_R < \Phi_M$. Proof: In this lemma, we show that the average firing rate of the model with heterogeneous STDP (LTP/LTD) dynamics, averaged over the population of neurons, is less than the corresponding average firing rate of a model with homogeneous STDP dynamics. We prove this by taking a sub-network of the HRSNN model as illustrated in Fig. 11. We model the input spike trains of the pre-synaptic neurons using a multivariate interacting, nonlinear Hawkes process with multiplicative inhibition (Duval et al., 2022). We consider a population of $N$ neurons divided into an excitatory population $A$ of size $N_A := \alpha N$ and an inhibitory population $B$ of size $N_B = (1-\alpha)N$. A particular instance of the model is then given in terms of a family of counting processes $(Z_t^1, \ldots, Z_t^{N_A})$ (population $A$) and $(Z_t^{N_A+1}, \ldots, Z_t^N)$ (population $B$) with coupled conditional stochastic intensities given respectively by $\lambda^A$ and $\lambda^B$. Consider, on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t\geq 0}, \mathbb{P})$, an independent family of i.i.d. Poisson measures $(\pi_i(ds, dz), i \in \{1, \ldots, N\})$ with intensity measure $ds \times dz$ on $[0, \infty) \times [0, \infty)$, and let $(x, y) \mapsto F(x, y)$ and $(x, y) \mapsto G(x, y)$ be two nonnegative functions defined on $(0, \infty)^2$.
In the normal case, the excitatory and inhibitory populations cycle through the following states: (1) at $t \approx 0$, $\lambda_t^A \approx \mu_A$ is high and $\lambda_t^B \approx 0$ is small; (2) feedback from $A$ to $B$: $\lambda_t^B$ increases; (3) inhibition of $A$ by $B$: when $\lambda_t^B$ gets high, $\Phi_{B\to A}$ reduces $\lambda_t^A$; (4) since $h_4$ has compact support, after a time $\theta_4$, $B$ no longer feels the influence of $A$: the intensity of $B$ returns to $\mu_B \approx 0$ and $A$ to its normal high activity $\mu_A$ (state 1). This cycle produces oscillations, which produce spikes. However, heterogeneity in the synaptic dynamics increases the stochasticity of the pre-synaptic spike arrivals. Due to this heterogeneity, $\Phi_{B\to A}$ keeps the system in the inhibition state (state 3) and inhibits the transition to states 4 and 1, where spikes are created. Hence, $\Phi_R^A < \Phi_M^A$; similarly, for the inhibitory neurons, we can show that $\Phi_R^B < \Phi_M^B$. Thus, we get $\Phi_R < \Phi_M$. ∎ This lemma may be interpreted as the heterogeneous STDP dynamics increasing the synaptic noise, which reduces the number of spikes of the post-synaptic neuron. Heterogeneous STDP leads to a non-uniform scaling of correlated spike trains, and hence to de-correlation. We can therefore say that heterogeneous STDP models learn a better-orthogonalized subspace representation, leading to a better encoding of the input space with fewer spikes. The synaptic noise may be thought of as analogous to the stochasticity in the gradient descent algorithm: as recently proved by Simsekli et al. (2019; 2020), stochasticity plays an important role in the generalization ability of a model. We interpret the synaptic noise in heterogeneous STDP as playing a similar role, improving the generalizability of the HRSNN model. This hypothesis is empirically verified in Supplementary Section A; a detailed theoretical analysis would be a very interesting direction for future work.
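The excitatory/inhibitory loop above can be sketched in discrete time. This is an illustration only: the kernels here are exponential rather than the compact-support $h_4$ of the model, $\Phi_{B\to A}(x) = 1/(1+x)$ is one assumed choice of a decreasing function with $\Phi_{B\to A}(0) = 1$, and all rates and gains are made-up values.

```python
import numpy as np

rng = np.random.default_rng(3)
dt, T = 0.01, 100.0
mu_A, mu_B = 2.0, 0.0     # baseline intensities of populations A and B
beta = 1.0                # decay of both memory kernels
g_AB, g_BA = 1.0, 1.0     # kernel jump sizes (A->B feedback, B->A inhibition)

def phi_inhib(x):
    # Phi_B->A: decreasing, Phi(0) = 1, vanishing at infinity
    return 1.0 / (1.0 + x)

hA_on_B = 0.0   # filtered spikes of A as seen by B (additive feedback)
hB_on_A = 0.0   # filtered spikes of B as seen by A (multiplicative inhibition)
nA = nB = 0
for _ in range(int(T / dt)):
    lam_A = mu_A * phi_inhib(hB_on_A)   # B suppresses A multiplicatively
    lam_B = mu_B + hA_on_B              # A drives B additively
    if rng.random() < lam_A * dt:
        nA += 1
        hA_on_B += g_AB
    if rng.random() < lam_B * dt:
        nB += 1
        hB_on_A += g_BA
    hA_on_B *= np.exp(-beta * dt)
    hB_on_A *= np.exp(-beta * dt)
```

Running this reproduces the qualitative cycle: bursts of $A$ activity drive $B$ up, which then clamps $A$ down until $B$'s memory decays.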
where $\Pr[\exists s]$ is the probability of occurrence of the post-synaptic spike. The expected input to the neuron at time $t$, $\mathbb{E}[i(t)]$, comprises excitatory and inhibitory components, where $\rho_e, \rho_i$ are the rates of incoming spikes and $\mu_{we}(w, t), \mu_{wi}(w, t)$ the probabilities of the corresponding weights at time $t$. Now, considering RSNNs with homogeneous STDP ($M$) and with heterogeneous STDP ($R$), the difference in the variances of the two populations is

$$\Delta\mathrm{Var}[V_M] - \Delta\mathrm{Var}[V_R] = \Delta \int_{-\infty}^{t} \left[ \mathbb{E}[i_M^2(t)] - \left( \mathbb{E}[i_R^2(t)] - \mathbb{E}[i_R(t)]^2 \right) \right] dt.$$

Since $t < t_{post}$, STDP potentiates both inhibitory and excitatory synapses, so $\Delta\mathbb{E}[i_i^2(t)] > 0$ and $\Delta\mathbb{E}[i_e^2(t)] > 0$. The term $\mathbb{E}[i_M(t)]^2 = 0$ by the symmetry of the weights, and it is maintained at zero by the symmetry of the STDP. But for heterogeneous neuron populations, as described above, there exists an asymmetry of the weights. Based on balanced spiking neural networks with heterogeneous connection strengths, previous works have revealed that such heterogeneous networks possess heavy-tailed, Lévy fluctuations (Shlesinger et al., 1987; Mantegna & Stanley, 1995; Cossell et al., 2015). This implies $\mathbb{E}[i_R(t)]^2 > 0$, and hence $\Delta\mathrm{Var}[V_R] < \Delta\mathrm{Var}[V_M]$. We now calculate the number of post-synaptic spikes triggered while the stimulus is present. Representing the spike rates of the HRSNN and the MRSNN as $\Phi_R$ and $\Phi_M$ respectively,

$$\int_0^t \Phi_R(t')\, dt' \leq \int_0^t \Phi_M(t')\, dt' \;\Rightarrow\; S_R = \frac{N_R T}{t_{ISI}^R} \leq \frac{N_R T}{t_{ISI}^M} = S_M.$$

Thus, the spike count decreases when we use heterogeneity in the LTP/LTD dynamics.
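A simple numeric instance of the final step: if the HRSNN rate $\Phi_R(t)$ is pointwise below the MRSNN rate $\Phi_M(t)$, then integrating the rates over the observation window gives fewer expected spikes, $S_R \leq S_M$. The rate profiles below are assumed toy curves, not measured ones.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)
phi_M = 10.0 * np.exp(-2.0 * t) + 5.0   # homogeneous-STDP rate (toy)
phi_R = 0.8 * phi_M                      # heterogeneous rate, pointwise lower

def integrate(y, x):
    # trapezoidal rule
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

S_M = integrate(phi_M, t)   # expected spike count over the window
S_R = integrate(phi_R, t)
```

Here the closed form of the integral is $5(1 - e^{-2}) + 5 \approx 9.32$, so $S_R = 0.8\, S_M \approx 7.46$.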
Hence, we compare the efficiency of the HRSNN with that of the MRSNN as follows:

$$\frac{E_R}{E_M} = \frac{M_R(N_R) \times S_M}{S_R \times M_M(N_R)} = \frac{\sum_{\tau=1}^{N_R} \frac{\mathrm{Cov}^2(x(t-\tau),\, a_\tau^R r^R(t))}{\mathrm{Var}(a_\tau^R r^R(t))} \times \int_{t_{ref}}^{\infty} t\, \Phi_R\, dt}{\sum_{\tau=1}^{N_R} \frac{\mathrm{Cov}^2(x(t-\tau),\, a_\tau^M r^M(t))}{\mathrm{Var}(a_\tau^M r^M(t))} \times \int_{t_{ref}}^{\infty} t\, \Phi_M\, dt}.$$

Since $S_R \leq S_M$, and since the covariance increases when the neurons become correlated while $H$ increases as the neuronal correlation decreases (Theorem 1), we see that $E_R / E_M \geq 1$, i.e., $E_R \geq E_M$. ∎

C SUPPLEMENTARY SECTION C

C.1 HIGHER ORDER CORRELATION

In this paper, we took inspiration from results in reservoir computing showing that memory capacity can be maximized through orthogonalization of the reservoir states (Farkaš & Gergel', 2017; Farkaš et al., 2016). The goal of using heterogeneous STDP dynamics is to obtain better-orthogonalized recurrent network states and thus achieve more efficient information transfer with lower higher-order correlations in the spike trains. Recent studies (Montani et al., 2009; Abbott & Dayan, 1999) have shown that higher-order correlations progressively decrease the information available through a neural population, and the decrease in information becomes larger as the interaction order grows. Since we are trying to engineer a spike-efficient model, we leverage the heterogeneity in the neuronal parameters to reduce these higher-order correlations. The hypothesis is that an orthogonal recurrent layer can help us efficiently represent the input spike patterns with fewer spikes. This may be interpreted as the recurrent layer acting as a set of orthogonal basis functions onto which the inputs are projected; having orthogonal bases can efficiently map the inputs without much loss. The heterogeneous STDP helps us efficiently achieve this orthogonalization of the recurrent layer, resulting in a lower voltage variance across the neuron population. This leads to fewer spikes (since the mean is constant) compared to a homogeneous RSNN.
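The memory-capacity term in the efficiency ratio can be sketched numerically. The snippet below estimates the standard echo-state memory capacity $C = \sum_\tau \mathrm{Cov}^2(x(t-\tau), y_\tau(t)) / (\mathrm{Var}(x)\,\mathrm{Var}(y_\tau))$, where $y_\tau$ is the best linear readout of the delayed input from the state $r(t)$. The network here is a generic random tanh reservoir, not the paper's trained HRSNN; size, spectral scale, and lag horizon are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, max_lag = 20, 20000, 40
W = rng.normal(scale=0.9 / np.sqrt(N), size=(N, N))  # spectral radius ~0.9
w_in = rng.normal(size=N)

# Drive the reservoir with i.i.d. uniform input and record the states.
x = rng.uniform(-1, 1, size=T)
R = np.zeros((T, N))
r = np.zeros(N)
for tt in range(T):
    r = np.tanh(W @ r + w_in * x[tt])
    R[tt] = r

# Memory capacity: squared correlation of the best linear reconstruction
# of x(t - tau) from r(t), summed over lags tau.
C = 0.0
for tau in range(1, max_lag + 1):
    X, target = R[tau:], x[:-tau]
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    C += np.corrcoef(X @ w, target)[0, 1] ** 2
```

The theoretical bound is $C \leq N$; recent lags reconstruct almost perfectly and contributions decay with $\tau$.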
Thus, in effect, heterogeneous STDP parameters can learn the output more precisely, which is projected back into the recurrent network. Hence, heterogeneous STDP parameters lead to better orthogonalization among the neuronal states and therefore a higher $C$. In this paper, we show that using a distribution of LTP/LTD dynamics in the STDP parameters helps map the input onto orthogonal activations of the recurrent network that capture the principal components of the input signal. The LTD dynamics play an important role in determining the orthogonality of the neuronal activations. The LTD windows of the STDP rules enable robust sequence learning amid background noise, in cooperation with a large signal transmission delay between neurons and a theta rhythm (Hayashi & Igarashi, 2009); the LTD window in the range of positive spike timings plays an important role in preventing noise from disrupting sequence learning. Oja (1982; 1989) showed that the LIF neuron's time constant is very fast compared to the time constant of learning at which the weights $w_{ji}$ change. Learning is assumed to take place through an STDP-type conjunction of the inputs $\xi_i$ and the integrated effect of the inputs, $\nu_j$, with an additional forgetting term attributed to the LTD dynamics:

$$\frac{dw_{ji}}{dt} = \alpha \nu_j \xi_i - f(\nu_j, \xi_i, w_{ji}).$$

In the case of homogeneous STDP, $f(\cdot)$ is identical across synapses; hence, the model can only efficiently learn the first principal component of the input. However, quite interesting behaviors emerge when the STDP parameters follow a distribution, which also lets the network extract principal components beyond the first. The diversity in the LTD dynamics increases the competition and ensures that not all inputs are mapped to the first principal component. Thus, the diversity in the LTD dynamics helps project the input onto orthogonal activations of the recurrent network.
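The Oja-style rule above can be sketched with the classical choice of forgetting term, $f = \alpha \nu_j^2 w_{ji}$ (Oja, 1982), which is one instance of the distribution of LTD dynamics discussed. Under this rule the weight vector converges to the first principal component of the input; the data generation and learning rate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# 2-D inputs with dominant variance along (1, 1)/sqrt(2).
basis = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
data = (rng.normal(size=(5000, 2)) * np.array([2.0, 0.5])) @ basis

w = rng.normal(size=2)
alpha = 0.005
for xi in data:
    nu = w @ xi                       # linear neuron output
    w += alpha * nu * (xi - nu * w)   # Hebbian term minus Oja forgetting term

pc1 = basis[0]                        # true first principal component
alignment = abs(w @ pc1) / np.linalg.norm(w)
```

The forgetting term both normalizes $\lVert w \rVert$ toward 1 and pulls $w$ onto the leading eigenvector; with per-synapse variations of $f$ (heterogeneous LTD), different neurons can settle onto different components.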
For homogeneous RSNNs, many higher-order correlations arise, which, according to our hypothesis, are caused by the poor orthogonalization among the network states. This results in redundant spikes encoding the same information. In this paper, we use heterogeneous STDP dynamics to learn an efficient orthogonal representation of the state space, which results in the network learning the same patterns using fewer spikes (Theorem 3). We also show that heterogeneity in the neuronal parameters decreases the neuronal correlation (Theorem 1 and Fig. 2a). Thus, since heterogeneity results in better orthogonalization among the neuronal states, it results in fewer higher-order correlations. Moreover, recent studies have shown that higher-order correlations progressively decrease the information available through a neural population, and the decrease in information becomes larger as the interaction order grows. Since we are trying to engineer an efficient model, we aim to reduce the higher-order correlations using heterogeneity in neuronal parameters (as shown in Theorem 1). To verify this, we used CuBIC (Staude et al., 2010), a cumulant-based inference method for higher-order correlations in massively parallel spike trains. The details of the experimental methodology are given in Supplementary Section C. The outcome of CuBIC is a lower bound ξ on the order of correlation in the spiking activity of large groups of simultaneously recorded neurons. CuBIC can provide statistical evidence for large correlated groups without the discouraging sample-size requirements that direct tests for higher-order correlations must meet; this is achieved by exploiting constraining relations among correlations of different orders. However, it must be noted that CuBIC is not designed to estimate the order of correlation directly; the inferred lower bound might not always correspond to the maximal order of correlation present in a given data set.



Figure 1: Concept of HRSNN with variable Neuronal and Synaptic Dynamics

Figure 2: Block Diagram showing the methodology using HRSNN for prediction

Figure 4: Bar chart showing the global importance of different heterogeneous parameters using HRSNN on the dataset. The experiments were repeated five times with different parameters from the same distribution (a) Classification (b) Prediction

Figure 5: Figure showing a snippet for the Y dimension of the Lorenz96 time series used for the prediction problem

by the original function $f$ on all evaluated data $D_{1:k}$, and $\vec{Z} = \left(\vec{\mu}_{D_{k:n}} - f(x_{best})\right) / \vec{\sigma}_{D_{k:n}}$

Figure 6: Figure Showing the convergence behaviors of the three types of BO described in the paper (a) BO optimizing the memory capacity C (b) BO optimizing the average spike count S and (c) BO optimizing the spike efficiency E

Figure 7: Figure showing as the heterogeneity in the neuronal parameters increases, the covariance between the neurons decreases

Figure 9: Figure showing the histograms of the firing rates for the four kinds of heterogeneous RSNN with homogeneous /heterogeneous neurons and synapses

Figure 10: (a) Figure showing the variation of memory capacity with neuronal heterogeneity (b) Figure showing the variation of efficiency E with the number of neurons N R

Error bars in (a) represent the standard deviation of the observations. Analytical Study of Spike Efficiency: We calculate the average firing rate of the heterogeneous spiking neural network for the prediction task during inference; the results are shown in Fig. 10(b).

Theorem 1: If the memory capacities of the HRSNN and MRSNN networks are denoted by $C_H$ and $C_M$ respectively, then $C_H \geq C_M$, where the heterogeneity in the neuronal parameters $H$ varies inversely with the correlation among the neuronal states, measured as $\sum_{n,m} \mathrm{Cov}^2(x_n(t), x_m(t))$, which in turn varies inversely with $C$.

Figure 11: Figure showing the excitatory and inhibitory pre-synaptic neurons, with excitatory and inhibitory spikes respectively, incident on the post-synaptic neuron. We use this setup to model the nonlinear interacting Hawkes process with inhibition.

For a given number of neurons $N_R$, the spike efficiency of the model, $E = C(N_R)/S$, is greater for HRSNN ($E_R$) than for MRSNN ($E_M$), i.e., $E_R \geq E_M$. Proof: To study the effect on the spike time when the weight $w_k$ changes, we consider the expected value of the time difference of the post-synaptic spikes, given as

$$\mathbb{E}[\Delta t_{post}] = \mathbb{E}[t'_{post} - t_{post}] = \left(\mathbb{E}[t'_{post}] - t_{post}\right) \Pr[s] \qquad (58)$$

Notations Table



Table showing the comparison of the Accuracy and NRMSE losses for the SHD classification and Lorenz system prediction tasks, respectively. We show the average spike rate, calculated as the moving average of the number of spikes in a time interval T. For this experiment, we choose T = 4 ms and a rolling time span of 2 ms, which is repeated until the first spike appears in the final layer. Following the work of Paul et al. (2022), we show the normalized average spike rate, i.e., the total number of spikes generated by all neurons in an RSNN averaged over the time interval T. Results marked with * denote that we implemented the open-source code for the model and evaluated the given results.
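The windowed spike-rate computation described in the caption can be sketched as follows. The spike trains are a random stand-in for the recorded ones, and the normalization (per neuron, per window) is our reading of the caption rather than the paper's exact code.

```python
import numpy as np

rng = np.random.default_rng(6)
n_neurons, sim_ms = 100, 100
# Binary spike raster in 1 ms bins, ~5% firing probability per bin (toy data).
spike_matrix = rng.random((n_neurons, sim_ms)) < 0.05

def avg_spike_rate(spikes, window_ms=4, stride_ms=2):
    """Average spikes per neuron in rolling windows of length window_ms,
    advanced by stride_ms, averaged over all windows."""
    n, dur = spikes.shape
    counts = [spikes[:, s:s + window_ms].sum()
              for s in range(0, dur - window_ms + 1, stride_ms)]
    return float(np.mean(counts)) / n

rate = avg_spike_rate(spike_matrix)  # expected ~0.05 * 4 = 0.2 spikes/neuron/window
```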

Table showing the hyperparameters used in the experiments and their values

The list of parameter settings for the Bayesian Optimization-based hyperparameter search

Table showing the average final distributions of the hyperparameters

Table showing the performance of the Bayesian Optimization on the SHD classification dataset for the three different cases, where BO 1 optimizes C, BO 2 optimizes S, and BO 3 optimizes E. For each case, the columns (inferred from the flattened source) are spike efficiency E, average spike count S, and accuracy (%); the first row is truncated in the source.

| N_R | BO 1: E | BO 1: S | BO 1: Acc. | BO 2: E | BO 2: S | BO 2: Acc. | BO 3: E | BO 3: S | BO 3: Acc. |
|---|---|---|---|---|---|---|---|---|---|
| … | … | …148.83 | 68.41 ± 4.87 | 6.42 ± 0.57 | 1325.25 ± 128.47 | 67.95 ± 5.11 | 6.54 ± 0.36 | 1391.68 ± 140.37 | 70.1 ± 4.69 |
| 400 | 7.21 ± 0.44 | 1682.68 ± 239.95 | 70.11 ± 4.33 | 6.91 ± 0.51 | 1425.69 ± 129.57 | 68.23 ± 5.27 | 7.01 ± 0.35 | 1511.79 ± 200.96 | 71.54 ± 4.06 |
| 500 | 8.59 ± 0.48 | 1768.24 ± 287.94 | 71.05 ± 3.98 | 7.69 ± 0.46 | 1555.29 ± 139.17 | 69.31 ± 5.03 | 8.63 ± 0.38 | 1621.8 ± 250.46 | 73.05 ± 3.87 |
| 1000 | 11.22 ± 0.46 | 2251.17 ± 319.75 | 72.93 ± 3.38 | 9.03 ± 0.44 | 2015.24 ± 147.44 | 70.89 ± 4.91 | 12.25 ± 0.39 | 2102.59 ± 279.86 | 75.32 ± 3.44 |
| 2000 | 13.3 ± 0.51 | 2566.21 ± 348.68 | 75.36 ± 3.29 | 9.89 ± 0.46 | 2314.59 ± 151.18 | 72.33 ± 4.88 | 13.95 ± 0.37 | 2410.08 ± 301.57 | 77.25 ± 3.17 |
| 3000 | 14.47 ± 0.53 | 2825.47 ± 355.87 | 77.14 ± 3.38 | 10.48 ± 0.41 | 2623.41 ± 177.94 | 74.63 ± 4.93 | 14.88 ± 0.42 | 2708.52 ± 315.34 | 78.21 ± 3.24 |
| 4000 | 15.17 ± 0.52 | 3551.07 ± 366.19 | 78.05 ± 3.25 | 11.57 ± 0.45 | 3045.28 ± 225.53 | 75.15 ± 4.85 | 15.87 ± 0.38 | 3218.42 ± 328.19 | 79.36 ± 3.13 |
| 5000 | 15.64 ± 0.57 | 4186.49 ± 383.09 | 78.92 ± 3.31 | 11.68 ± 0.48 | 3294.62 ± 241.14 | 75.87 ± 4.81 | 16.03 ± 0.41 | 3573.51 ± 331.18 | 80.49 ± 3.15 |

Table showing the performance of the Bayesian Optimization on the Lorenz System Prediction dataset for the three different cases, where BO 1 optimizes C, BO 2 optimizes S, and BO 3 optimizes E. For each case, the columns (inferred from the flattened source) are spike efficiency E, average spike count S, and NRMSE; the first row is truncated in the source, and the "1257.26 ± 1257.26" entry is reproduced as extracted.

| N_R | BO 1: E | BO 1: S | BO 1: NRMSE | BO 2: E | BO 2: S | BO 2: NRMSE | BO 3: E | BO 3: S | BO 3: NRMSE |
|---|---|---|---|---|---|---|---|---|---|
| … | … ± 0.39 | 1443.59 ± 127.23 | 0.587 ± 0.02 | 3.89 ± 0.48 | 1207.35 ± 118.57 | 0.639 ± 0.027 | 4.01 ± 0.33 | 1302.47 ± 96.05 | 0.613 ± 0.0218 |
| 300 | 5.26 ± 0.37 | 1499.62 ± 141.73 | 0.503 ± 0.027 | 4.44 ± 0.55 | 1257.26 ± 1257.26 | 0.558 ± 0.034 | 5.15 ± 0.34 | 1335.81 ± 168.02 | 0.531 ± 0.0287 |
| 400 | 6.37 ± 0.41 | 1528.73 ± 228.87 | 0.459 ± 0.033 | 5.87 ± 0.49 | 1304.35 ± 126.18 | 0.467 ± 0.04 | 6.05 ± 0.35 | 1415.32 ± 213.49 | 0.482 ± 0.0347 |
| 500 | 7.25 ± 0.45 | 1601.27 ± 277.97 | 0.389 ± 0.036 | 6.25 ± 0.45 | 1365.35 ± 137.04 | 0.421 ± 0.043 | 6.87 ± 0.36 | 1507.29 ± 219.58 | 0.411 ± 0.0377 |
| 1000 | 10.12 ± 0.43 | 1868.14 ± 301.17 | 0.316 ± 0.04 | 7.41 ± 0.51 | 1563.25 ± 146.76 | 0.396 ± 0.047 | 9.03 ± 0.37 | 1699.27 ± 275.79 | 0.332 ± 0.0417 |
| 2000 | 11.84 ± 0.48 | 2105.95 ± 331.54 | 0.293 ± 0.045 | 8.02 ± 0.5 | 1854.35 ± 150.28 | 0.351 ± 0.052 | 11.44 ± 0.39 | 2014.12 ± 280.03 | 0.301 ± 0.0467 |
| 3000 | 13.71 ± 0.51 | 2408.35 ± 348.26 | 0.258 ± 0.042 | 8.94 ± 0.53 | 2195.82 ± 179.75 | 0.335 ± 0.058 | 13.87 ± 0.4 | 2236.59 ± 281.05 | 0.241 ± 0.0482 |
| 4000 | 14.15 ± 0.52 | 2951.56 ± 352.66 | 0.242 ± 0.063 | 9.55 ± 0.55 | 2445.31 ± 217.73 | 0.326 ± 0.07 | 14.63 ± 0.41 | 2546.25 ± 289.81 | 0.227 ± 0.0649 |
| 5000 | 14.45 ± 0.54 | 3784.44 ± 353.51 | 0.203 ± 0.064 | 9.96 ± 0.58 | 2684.59 ± 234.63 | 0.302 ± 0.071 | 15.12 ± 0.42 | 2898.27 ± 307.14 | 0.195 ± 0.0655 |

Table showing the Ablation Study for the comparison of the Generalizability of heterogeneous networks

Table showing results with limited training data

Table Showing the estimated highest order of correlation for HRSNN vs. MRSNN using CuBIC

ACKNOWLEDGEMENT

This work is supported by the Army Research Office and was accomplished under Grant Number W911NF-19-1-0447. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government.

Supplementary Section

The models used in this paper were the Heterogeneous Recurrent SNN (HRSNN) and the Homogeneous Recurrent SNN (MRSNN). Both models use STDP as the learning method. For MRSNN, we use STDP with uniform parameters for all the synapses. However, for HRSNN, we use a distribution for each parameter to get a rich class of diverse LTP/LTD dynamics. But, at the core, all the training is done using STDP.

A.1.2 LIF NEURON NUMERICAL IMPLEMENTATION

To implement the LIF model, we discretize time into multiples of a small time step ∆t, so that spikes can only happen at multiples of ∆t (Cramer et al., 2020; Perez-Nieves et al., 2021). Thus, we can approximately solve Eq. 1. Note that we use this approximation only for numerically solving the LIF neurons: although we use continuous notation for the remainder of the paper, the discrete form discussed here is used for the numerical solutions.

We assume that F and G satisfy conditions expressed in terms of nonnegative functions $\Phi_A$, $\Phi_{B\to A}$, $\Phi_B$, and $\Phi_{A\to B}$, each of them globally Lipschitz, with $\Phi_{B\to A}$ bounded (and with no loss of generality we assume $0 \leq \Phi_{B\to A} \leq 1$). Let us consider the family of càdlàg $(\mathcal{F}_t)_{t\geq 0}$ point processes $(Z_t^i)_{t\geq 0,\, i=1,\ldots,N}$, where the intensity $\lambda_i$, $i = 1, \ldots, N$, is given by Eq. 53, and $A$ and $B$ are the populations of the excitatory and inhibitory neurons, respectively. The dynamics given by Eq. 53 are of Hawkes type: each particle's intensity depends on the whole system's history through memory kernels $h_i$, $i = 1, \ldots, 4$, and firing-rate functions $\Phi_A$ and $\Phi_B$. The multiplicative influence of the inhibitory population $B$ onto population $A$ is represented by the inhibition kernel $\Phi_{B\to A}$, a decreasing nonnegative function on $[0, +\infty)$ with $\Phi_{B\to A}(0) = 1$ and $\Phi_{B\to A}(x) \to 0$ as $x \to \infty$; i.e., the activity of population $A$ should decrease as the activity of population $B$ rises. The model secondly incorporates retroaction from population $A$ onto population $B$, which is supposed to be mostly additive, although possibly modulated by a nonlinear feedback kernel $\Phi_{A\to B}$. Now, without loss of generality, we assume that $\Phi_A$ and $\Phi_B$ are linear, i.e., $\Phi_A(x) = \mu_A + x$ and $\Phi_B(x) = \mu_B + x$ for $x \geq 0$, where $\mu_A, \mu_B \geq 0$ and $h_i \geq 0$ for $i = 1, \ldots, 4$; Eq. 53 then simplifies accordingly. For heterogeneous neuron populations, there exists an asymmetry of the weights.
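A minimal sketch of the discrete-time LIF update described above: the membrane voltage integrates its input with step ∆t, spikes when it crosses a threshold, then resets. The parameter values (τ_m, threshold, input) are illustrative, not the paper's settings.

```python
import numpy as np

def simulate_lif(I, dt=1.0, tau_m=10.0, v_th=1.0, v_reset=0.0):
    """Discrete-time LIF: v <- v + (dt/tau_m) * (-v + I(t)),
    with spike-and-reset at threshold crossings."""
    v, spikes, vs = 0.0, [], []
    for t, i_t in enumerate(I):
        v += (dt / tau_m) * (-v + i_t)   # leaky integration over one time step
        if v >= v_th:                    # threshold crossing
            spikes.append(t)
            v = v_reset                  # hard reset
        vs.append(v)
    return np.array(vs), spikes

# Constant supra-threshold input produces regular (constant-ISI) spiking.
vs, spikes = simulate_lif(np.full(200, 1.5))
```

With constant input, the voltage follows $v(t) = I(1 - (1 - \Delta t/\tau_m)^t)$ between resets, so the inter-spike interval is identical for every spike.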
Based on balanced spiking neural networks with heterogeneous connection strengths, previous works have revealed that such heterogeneous networks possess heavy-tailed Lévy fluctuations (Shlesinger et al., 1987; Mantegna & Stanley, 1995; Cossell et al., 2015). The heterogeneous, heavy-tailed distributions of synaptic weights have been fitted with lognormal distributions (Buzsáki & Mizuseki, 2014; Kuśmierz et al., 2020). We model the inputs to neuron $i \in E$ as the sum of a mean input $\mu$ (with deviations due to random connectivity) and temporal fluctuations $\eta_i$ due to spiking activity. We assume that the pre-synaptic neurons fire according to the interacting Hawkes process described above. Consider the case where $\mu_A \gg 1$, $\mu_B = 0$, and $h = \mathbb{1}_{[0,\theta]}$. (57)

