A STUDY OF BIOLOGICALLY PLAUSIBLE NEURAL NETWORK: THE ROLE AND INTERACTIONS OF BRAIN-INSPIRED MECHANISMS IN CONTINUAL LEARNING

Abstract

Humans excel at continually acquiring, consolidating, and retaining information from an ever-changing environment, whereas artificial neural networks (ANNs) exhibit catastrophic forgetting. There are considerable differences in the complexity of synapses, the processing of information, and the learning mechanisms of biological neural networks and their artificial counterparts, which may explain the mismatch in performance. We consider a biologically plausible framework that constitutes separate populations of exclusively excitatory and inhibitory neurons adhering to Dale's principle, in which the excitatory pyramidal neurons are augmented with dendrite-like structures for context-dependent processing of stimuli. We then conduct a comprehensive study of the role and interactions of different mechanisms inspired by the brain, including sparse non-overlapping representations, Hebbian learning, synaptic consolidation, and replay of the past activations that accompanied the learning event. Our study suggests that employing multiple complementary mechanisms in a biologically plausible architecture, similar to the brain, may be effective in enabling continual learning in ANNs.

1. INTRODUCTION

The human brain excels at continually learning from a dynamically changing environment, whereas standard artificial neural networks (ANNs) are inherently designed for training on stationary i.i.d. data. Sequential learning of tasks in continual learning (CL) violates this strong assumption, resulting in catastrophic forgetting. Although ANNs are inspired by biological neurons (Fukushima, 1980), they omit numerous details of the design principles and learning mechanisms in the brain. These fundamental differences may account for the mismatch in performance and behavior. Biological neural networks are characterized by considerably more complex synapses and dynamic, context-dependent processing of information. Individual neurons also have specific roles: each presynaptic neuron has an exclusively excitatory or inhibitory impact on its postsynaptic partners, as postulated by Dale's principle (Strata et al., 1999). Furthermore, distal dendritic segments in pyramidal neurons, which comprise the majority of excitatory cells in the neocortex, receive additional context information and enable context-dependent processing of information. This, in conjunction with inhibition, allows the network to learn task-specific patterns and avoid catastrophic forgetting (Yang et al., 2014; Iyer et al., 2021; Barron et al., 2017). Furthermore, the replay of non-overlapping and sparse neural activities of previous experiences in the neocortex and hippocampus is considered to play a critical role in memory formation, consolidation, and retrieval (Walker & Stickgold, 2004; McClelland et al., 1995). To protect information from erasure, the brain employs synaptic consolidation, in which plasticity rates are selectively reduced in proportion to strengthened synapses (Cichon & Gan, 2015).

Figure 1: Architecture of one hidden layer in the biologically plausible framework. Each layer consists of separate populations of exclusively excitatory pyramidal cells and inhibitory neurons, adhering to Dale's principle. The shade indicates the strength of weights or activations, with darker shades indicating higher values. (a) The pyramidal cells are augmented with dendritic segments which receive an additional context signal c; the dendritic segment whose weights are most aligned with the context vector (bottom row) is selected to modulate the output activity of the feedforward neurons for context-dependent processing of information. (b) The Hebbian update step further strengthens the association between the context and the winning dendritic segment with maximum absolute value (indicated with a darker shade in the bottom row). Finally, Heterogeneous dropout keeps an activation count for each pyramidal cell (indicated with the gray shade) and drops the neurons that were most active for the previous task (darkest shade dropped) to enforce non-overlapping representations. The top-k remaining cells then project to the next layer (increased shade).

To this end, we consider a biologically plausible framework with separate populations of exclusively excitatory and inhibitory neurons in each layer, adhering to Dale's principle (Cornford et al., 2020), where the excitatory neurons (mimicking pyramidal cells) are augmented with dendrite-like structures for context-dependent processing of information (Iyer et al., 2021). Dendritic segments process an additional context signal encoding task information and subsequently modulate the feedforward activity of the excitatory neuron (Figure 1). We then systematically study the effect of controlling the overlap in representations, employing the "fire together, wire together" learning paradigm, and employing experience replay and synaptic consolidation. Our empirical study shows that:

i. An ANN architecture equipped with context-dependent processing of information by dendrites and separate populations of excitatory pyramidal and inhibitory neurons adhering to Dale's principle can learn effectively in the CL setup.
ii. Enforcing different levels of activation sparsity in the hidden layers using k-winner-take-all activations, and employing a complementary dropout mechanism that encourages the model to use a different set of active neurons for each task, can effectively control the overlap in representations and hence reduce interference.
iii. Task similarities need to be considered when enforcing such constraints to allow for a balance between forward transfer and interference.
iv. Mimicking the ubiquitous "fire together, wire together" learning rule in the brain, through a Hebbian update step on the connections between the context signal and dendritic segments, further strengthens context gating and facilitates the formation of task-specific subnetworks.
v. Synaptic consolidation, utilizing Synaptic Intelligence (Zenke et al., 2017) with importance measures adjusted to account for the discrepancy in the effect of weight changes in excitatory and inhibitory neurons, further reduces forgetting.
vi. Replaying the activations of previous tasks in a context-specific manner is critical for consolidating information across different tasks, especially in the challenging Class-IL setting.

Our study suggests that employing multiple complementary mechanisms in a biologically plausible architecture, similar to what is believed to exist in the brain, can be effective in enabling CL in ANNs.

2. BIOLOGICALLY PLAUSIBLE FRAMEWORK FOR CL

We provide details of the biologically plausible framework within which we conduct our study.

2.1. DALE'S PRINCIPLE

Biological neural networks differ from their artificial counterparts in the complexity of the synapses and the role of individual units. In particular, the majority of neurons in the brain adhere to Dale's principle, which posits that presynaptic neurons can only have an exclusively excitatory or inhibitory impact on their postsynaptic partners (Strata et al., 1999). Several studies show that the balanced dynamics (Murphy & Miller, 2009; Van Vreeswijk & Sompolinsky, 1996) of excitatory and inhibitory populations provide functional advantages, including efficient predictive coding (Boerlin et al., 2013) and pattern learning (Ingrosso & Abbott, 2019). Furthermore, inhibition is hypothesized to play a role in alleviating catastrophic forgetting (Barron et al., 2017). Standard ANNs, however, lack adherence to Dale's principle, as neurons contain both positive and negative output weights, and signs can change during learning. Cornford et al. (2020) incorporate Dale's principle into ANNs (referred to as DANNs), which take into account the distinct connectivity patterns of excitatory and inhibitory neurons (Tremblay et al., 2016) and perform comparably to standard ANNs on benchmark object recognition tasks. Each layer $l$ comprises a separate population of excitatory neurons, $h_e^l \in \mathbb{R}_+^{n_e}$, and inhibitory neurons, $h_i^l \in \mathbb{R}_+^{n_i}$, where $n_e \gg n_i$ and synaptic weights are strictly non-negative. Similar to biological networks, while both populations receive excitatory projections from the previous layer ($h_e^{l-1}$), only excitatory neurons project between layers, whereas inhibitory neurons inhibit the activity of excitatory units of the same layer. Cornford et al. (2020) characterize these properties by three sets of strictly positive weights: excitatory connections between layers $W_{ee}^l \in \mathbb{R}_+^{n_e \times n_e}$, excitatory projections to inhibitory units $W_{ie}^l \in \mathbb{R}_+^{n_i \times n_e}$, and inhibitory projections within the layer $W_{ei}^l \in \mathbb{R}_+^{n_e \times n_i}$. The output of the excitatory units is impacted by the subtractive inhibition from the inhibitory units:

$z^l = (W_{ee}^l - W_{ei}^l W_{ie}^l) h_e^{l-1} + b^l$,

where $b^l \in \mathbb{R}^{n_e}$ is the bias term. Figure 1 shows the interactions and connectivity between excitatory pyramidal cells (triangle symbol) and inhibitory neurons (denoted by i). We employ DANNs as the feedforward backbone to show that they can also learn in a challenging CL setting with performance comparable to standard ANNs, and to provide a biologically plausible framework for further studying the role of inhibition in alleviating catastrophic forgetting.
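The subtractive-inhibition computation above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed layer sizes and weight scales, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the framework only requires n_e >> n_i.
n_prev, n_e, n_i = 8, 16, 2

# Three sets of strictly non-negative weights, per Dale's principle.
W_ee = rng.uniform(0.0, 0.1, (n_e, n_prev))  # excitatory, between layers
W_ie = rng.uniform(0.0, 0.1, (n_i, n_prev))  # excitatory -> inhibitory units
W_ei = rng.uniform(0.0, 0.1, (n_e, n_i))     # inhibitory, within the layer
b = np.zeros(n_e)

def dann_forward(h_prev):
    """z^l = (W_ee - W_ei W_ie) h_e^{l-1} + b^l (subtractive inhibition)."""
    return (W_ee - W_ei @ W_ie) @ h_prev + b

h_prev = rng.uniform(0.0, 1.0, n_prev)  # excitatory input is non-negative
z = dann_forward(h_prev)
```

Note that the effective weight matrix $W_{ee}^l - W_{ei}^l W_{ie}^l$ is mixed-sign even though each individual weight matrix is non-negative, which is what lets the layer match the expressivity of a standard linear layer.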

2.2. ACTIVE DENDRITES

The brain employs specific structures and mechanisms for context-dependent processing and routing of information. The prefrontal cortex, which plays an important role in cognitive control (Miller & Cohen, 2001), receives sensory inputs as well as contextual information, which allows it to choose the most relevant sensory features for the present task to guide actions (Mante et al., 2013; Fuster, 2015; Siegel et al., 2015; Zeng et al., 2019). Of particular interest are pyramidal cells, which represent the most populous members of the excitatory family of neurons in the brain (Bekkers, 2011). The dendritic spines in pyramidal cells exhibit highly nonlinear integrative properties which are considered important for learning task-specific patterns (Yang et al., 2014). Pyramidal cells integrate a range of diverse inputs into multiple independent dendritic segments, whereby contextual inputs on active dendrites can modulate a neuron's response, making it more likely to fire. Standard ANNs, however, are based on the point neuron model (Lapique, 1907), an oversimplified model of biological computation which lacks the sophisticated nonlinear and context-dependent behavior of pyramidal cells. Iyer et al. (2021) model these integrative properties of dendrites by augmenting each neuron with a set of dendritic segments. Multiple dendritic segments receive additional contextual information, which is processed using a separate set of weights. The resultant dendritic output modulates the feedforward activation, which is computed as a linear weighted sum of the feedforward inputs. This computation results in a neuron whose response magnitude to a given stimulus is highly context-dependent. To enable task-specific processing of information, the prototype vector for task $\tau$ is computed as the element-wise mean of the task's samples $D_\tau$ at the beginning of the task and subsequently provided as context during training:

$c_\tau = \frac{1}{|D_\tau|} \sum_{x \in D_\tau} x$    (2)

During inference, the closest prototype vector to each test sample $x'$, among all task prototypes $C$ stored in memory, is selected as the context using the Euclidean distance:

$c' = \arg\min_{c_\tau \in C} \|x' - c_\tau\|_2$

Following Iyer et al. (2021), we augment the excitatory units in each layer with dendritic segments (Figure 1(a)). The feedforward activity of the excitatory units is modulated by the dendritic segments, which receive a context vector. Given the context vector, each dendritic segment $j$ computes $u_j^T c$, with weights $u_j \in \mathbb{R}^d$ and context vector $c \in \mathbb{R}^d$, where $d$ is the dimension of the input image. For excitatory neurons, the dendritic segment with the highest response to the context (maximum absolute value, with the sign retained) is selected to modulate the output activity:

$h_e^l = \text{k-WTA}(z^l \times \sigma(u_\kappa^T c))$, where $\kappa = \arg\max_j |u_j^T c|$,

where $\sigma$ is the sigmoid function (Han & Moraga, 1995) and k-WTA(·) is the k-Winner-Take-All activation function (Ahmad & Scheinkman, 2019), which propagates only the top-k neurons and sets the rest to zero. This provides a biologically plausible framework where, similar to biological networks, the feedforward neurons adhere to Dale's principle and the excitatory neurons mimic the integrative properties of active dendrites for context-dependent processing of stimuli.
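To make the gating pipeline concrete, the NumPy sketch below implements the prototype context, the winning-segment selection, and the k-WTA modulation. It is a simplified, hedged illustration with small assumed shapes, not the reference implementation of Iyer et al. (2021):

```python
import numpy as np

def k_wta(x, k):
    """k-Winner-Take-All: keep the top-k activations, zero out the rest."""
    out = np.zeros_like(x)
    top = np.argsort(x)[-k:]
    out[top] = x[top]
    return out

def task_prototype(X):
    """Context prototype c_tau: element-wise mean of the task samples (Eq. 2)."""
    return X.mean(axis=0)

def select_context(x, prototypes):
    """At inference, pick the closest stored prototype by Euclidean distance."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return prototypes[np.argmin(dists)]

def dendritic_modulation(z, U, c, k):
    """h_e = k-WTA(z * sigmoid(u_kappa^T c)), with kappa = argmax_j |u_j^T c|.

    z: (n,) feedforward activations; U: (n, segments, d) dendritic weights;
    c: (d,) context vector.
    """
    responses = U @ c                                # (n, segments)
    kappa = np.argmax(np.abs(responses), axis=1)     # winning segment per neuron
    winning = responses[np.arange(len(z)), kappa]    # u_kappa^T c, sign retained
    gate = 1.0 / (1.0 + np.exp(-winning))            # sigmoid
    return k_wta(z * gate, k)
```

A segment whose response is strongly negative gates its neuron towards zero, while a strongly positive response leaves the activation nearly unchanged, so the same neuron can participate in one task's subnetwork and be silenced in another's.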

3. CONTINUAL LEARNING SETTINGS

To study the role of different brain-inspired components in a biologically plausible NN for CL and gauge their effect on the performance and characteristics of the model, we conduct all our experiments under uniform settings. Implementation details and the experimental setup are provided in the Appendix. We evaluate the models on two CL scenarios. Domain incremental learning (Domain-IL) refers to the CL scenario in which the classes remain the same in subsequent tasks but the input distribution changes. We consider Rot-MNIST, which involves classifying the 10 digits in each task with each digit rotated by an angle between 0 and 180 degrees, and Perm-MNIST, which applies a fixed random permutation to the pixels for each task. Importantly, there are different variants of Rot-MNIST with varying difficulty. We incrementally rotate the digits by a fixed increment, i.e. {0, 8, ..., (N-1)*8} degrees for tasks {τ_1, τ_2, ..., τ_N}, which is substantially more challenging than randomly sampling rotations. Importantly, the Rot-MNIST dataset captures the notion of similarity in subsequent tasks, where the similarity between two tasks is defined by the difference in their degrees of rotation, whereas each task in Perm-MNIST is independent. We also consider the challenging Class incremental learning (Class-IL) scenario, where new classes are added with each subsequent task and the agent must learn to distinguish not only amongst the classes within the current task but also across all learned tasks. Seq-MNIST divides MNIST classification into 5 tasks with 2 classes in each task.
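The incremental Rot-MNIST protocol above can be made concrete with a small helper; the function names are illustrative, but the 8-degree step and the task-similarity notion follow the text:

```python
def rotation_schedule(n_tasks, step=8):
    """Rotation angle (degrees) for tasks tau_1..tau_N: {0, 8, ..., (N-1)*8}."""
    return [t * step for t in range(n_tasks)]

def task_dissimilarity(angle_a, angle_b):
    """Rot-MNIST task similarity is defined by the difference in rotation
    angles: a smaller difference means more similar tasks."""
    return abs(angle_a - angle_b)
```

For 20 tasks this yields rotations from 0 to 152 degrees, so consecutive tasks remain highly similar while distant tasks diverge substantially.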

4. EMPIRICAL EVALUATION

To investigate the impact of the different components inspired by the brain, we use the aforementioned biologically plausible framework and study the effect on the performance and characteristics of the model. 

4.1. EFFECT OF INHIBITORY NEURONS

We first study whether feedforward networks with separate populations of excitatory and inhibitory units can work well in the CL setting. Importantly, we note that when learning a sequence of tasks with inhibitory neurons, it is beneficial to take into account the disparities in the degree to which updates to different parameters affect the layer's output distribution (Cornford et al., 2020), and hence forgetting. Specifically, as $W_{ie}^l$ and $W_{ei}^l$ affect the output distribution to a higher degree than $W_{ee}^l$, we reduce the learning rate for these weights after the first task (see Appendix). Table 1 shows that models with feedforward neurons adhering to Dale's principle perform as well as standard neurons and can also further mitigate forgetting in some cases. Note that this gain comes with considerably fewer parameters and context-dependent processing, as we keep the number of neurons in each layer the same and only the excitatory neurons (~90%) are augmented with dendritic segments. For 20 tasks, Active Dendrites with Dale's principle reduces the parameter count from ~70M to less than ~64M. We hypothesize that having separate populations within a layer enables them to play specialized roles. In particular, inhibitory neurons can selectively inhibit certain excitatory neurons based on the stimulus, which can further facilitate the formation of task-specific subnetworks and complement the context-dependent processing of information by dendritic segments.

4.2. SPARSE ACTIVATIONS FACILITATE THE FORMATION OF SUBNETWORKS

Neocortical circuits are characterized by high levels of sparsity in neural connectivity and activations (Barth & Poulet, 2012; Graham & Field, 2006). This is in stark contrast to the dense and highly entangled connectivity in standard ANNs. Particularly for CL, sparsity provides several advantages: sparse non-overlapping representations can reduce interference between tasks (Abbasi et al., 2022; Iyer et al., 2021; Aljundi et al., 2018) and lead to the natural emergence of task-specific modules (Hadsell et al., 2020), and sparse connectivity can further ensure fewer task-specific parameters (Mallya et al., 2018). We study the effect of different levels of activation sparsity by varying the ratio of active neurons in k-winner-take-all (k-WTA) activations (Ahmad & Scheinkman, 2019). Each hidden layer of our model has constant sparsity in its connections (50% of weights are randomly set to 0 at initialization) and propagates only the top-k activations (the k-WTA layer in Figure 1). Table 2 shows that sparsity plays a critical role in enabling CL in DNNs. Sparsity in activations effectively reduces interference by reducing the overlap in representations. Interestingly, the stark difference in the effect of different levels of sparse activations on Rot-MNIST and Perm-MNIST highlights the importance of considering task similarity in the design of CL methods. As the tasks in Perm-MNIST are independent of each other, having fewer active neurons (5%) enables the network to learn non-overlapping representations for each task, while the high task similarity in Rot-MNIST can benefit from overlapping representations, which allow for the reusability of features across tasks. The number of tasks the agent has to learn also has an effect on the optimal sparsity level. In the Appendix, we show that having different levels of sparsity in different layers can further improve performance.
As the earlier layers learn general features, having a higher ratio of active neurons can enable greater reusability and forward transfer. For the later layers, a smaller ratio of active neurons can reduce interference between task-specific features.

4.3. HETEROGENEOUS DROPOUT FOR NON-OVERLAPPING ACTIVATIONS AND SUBNETWORKS

Information in the brain is encoded by the strong activation of a relatively small set of neurons, forming a sparse code, and a different subset of neurons is utilized to represent different types of stimuli (Graham & Field, 2006). Furthermore, there is evidence of non-overlapping representations in the brain. To mimic this, we employ Heterogeneous dropout (Abbasi et al., 2022), which in conjunction with context-dependent processing of information can effectively reduce the overlap of representations, leading to less interference between tasks and thereby less forgetting. During training, we track the frequency of activations for each neuron in a layer for a given task, and in subsequent tasks, the probability of a neuron being dropped is inversely proportional to its activation counts. This encourages the model to learn the new task by utilizing neurons that have been less active for previous tasks. Figure 1 shows that neurons that have been more active (darker shade) are more likely to be dropped before k-WTA is applied. Specifically, let $[a_t^l]_j$ denote the activation counter for neuron $j$ in layer $l$ after learning $t$ tasks. For learning task $t+1$, the probability that this neuron is retained is given by:

$[p_{t+1}^l]_j = \exp\left(-\frac{[a_t^l]_j}{\max_j [a_t^l]_j} \rho\right)$,

where $\rho$ controls the strength of enforcement of non-overlapping representations, with larger values leading to less overlap. This provides an efficient mechanism for controlling the degree of overlap between the representations of different tasks, and hence the degree of forward transfer and interference, based on task similarities. Table 3 shows that employing Heterogeneous dropout can further improve the performance of the model. We also analyze the effect of the ρ parameter on the activation counts and the overlap in representations.
Figure 2 shows that Heterogeneous dropout can facilitate the formation of task-specific subnetworks, and Figure 3 shows the symmetric KL-divergence between the distributions of activation counts on the test sets of Task 1 and Task 2 for models trained with different ρ values on Perm-MNIST with two tasks. As we increase ρ, the activations in each layer become increasingly dissimilar. Similar to the sparsity in activations, we show in the Appendix that having a different dropout ρ for each layer (with lower ρ for earlier layers to encourage reusability and higher ρ for later layers to reduce interference between task representations) can further improve performance. Heterogeneous dropout thus provides a simple mechanism for balancing the reusability and interference of features depending on the similarity of tasks.
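The retention rule above is easy to state in code. The sketch below is a hedged NumPy illustration with made-up activation counts: it computes the retention probability for each neuron from its counts and the strength parameter ρ:

```python
import numpy as np

def retention_probs(counts, rho):
    """Heterogeneous dropout: [p]_j = exp(-([a]_j / max_j [a]_j) * rho).

    Neurons that fired most for previous tasks get the lowest probability
    of being retained, pushing new tasks onto less-used neurons.
    """
    a = np.asarray(counts, dtype=float)
    return np.exp(-(a / a.max()) * rho)

counts = np.array([120.0, 30.0, 0.0])   # hypothetical per-neuron counts
p = retention_probs(counts, rho=2.0)
```

With ρ = 0 every neuron is retained with probability 1 (standard training); increasing ρ monotonically suppresses the most-used neurons, which is how the overlap between task subnetworks is tuned to the task similarity.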

4.4. HEBBIAN LEARNING STRENGTHENS CONTEXT GATING

For a biologically plausible ANN, it is important to incorporate not only the design elements of biological neurons but also the learning mechanisms the brain employs. Lifetime plasticity in the brain generally follows the Hebbian principle: a neuron that consistently contributes to making another neuron fire will build a stronger connection to that neuron (Hebb, 2005). Therefore, we follow the approach in Flesch et al. (2022) and complement error-based learning with a Hebbian update to strengthen the connections between contextual information and dendritic segments (Figure 1(b)). Each supervised parameter update with backpropagation is followed by a Hebbian update step on the dendritic segments to strengthen the connection between the context input and the corresponding activated dendritic segment. To constrain the parameters, we use Oja's rule, which adds weight decay to Hebbian learning (Oja, 1982):

$u_\kappa \leftarrow u_\kappa + \eta_h \, d \, (c - d u_\kappa)$,

where $\eta_h$ is the learning rate, $\kappa$ is the index of the winning dendrite with weights $u_\kappa$, and $d = u_\kappa^T c$ is the modulating signal for context signal $c$. Figure 4 shows that the Hebbian update step increases the magnitude of the modulating signal from the dendrites on the feedforward activity, which can further strengthen context-dependent gating and facilitate the formation of task-specific subnetworks. Table 1 shows that this results in a consistent improvement in performance.
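Oja's rule on the winning dendritic segment can be sketched as follows; this is an illustrative NumPy version with assumed shapes, not the authors' code:

```python
import numpy as np

def hebbian_oja_step(U, c, eta_h):
    """u_kappa <- u_kappa + eta_h * d * (c - d * u_kappa), with d = u_kappa^T c.

    Only the winning segment (max |u_j^T c|) is strengthened; the decay term
    -d^2 * u_kappa keeps the weights bounded (Oja, 1982). U: (segments, d).
    """
    responses = U @ c
    kappa = int(np.argmax(np.abs(responses)))
    d = responses[kappa]
    U_new = U.copy()
    U_new[kappa] = U[kappa] + eta_h * d * (c - d * U[kappa])
    return U_new, kappa
```

Because only the winner is updated, repeated presentations of a task's context progressively commit one segment per neuron to that context, which is the "rich get richer" dynamic that strengthens context gating.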

4.5. SYNAPTIC CONSOLIDATION FURTHER MITIGATES FORGETTING

In addition to their integrative properties, dendrites also play a key role in retaining information and providing protection against erasure (Cichon & Gan, 2015; Yang et al., 2009) . The new spines that are formed on different sets of dendritic branches in response to learning different tasks are protected from being eliminated through mediation of synaptic plasticity and structural changes that persist when learning a new task (Yang et al., 2009) . We employ synaptic consolidation by incorporating Synaptic Intelligence (Zenke et al., 2017 ) (details in Appendix) which maintains an importance estimate of each synapse in an online manner during training and subsequently reduces the plasticity of synapses which are considered important for learned tasks. Notably, we adjust the importance estimate to account for the disparity in the degree to which updates to different parameters affect the output of the layer due to the inhibitory interneuron architecture in the DANN layers (Cornford et al., 2020) . The importance estimate of the excitatory connections to the inhibitory units and the intra-layer inhibitory connections are upscaled to further penalize changes to these weights. Table 1 shows that employing Synaptic Intelligence (+SC) in this manner further mitigates forgetting. Particularly for Rot-MNIST with 20 tasks, it provides considerable performance improvement.
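A minimal sketch of Synaptic Intelligence with the paper's excitatory/inhibitory adjustment might look as follows. The `scale` factor standing in for the upscaled importance of the inhibitory-related weights is an assumption for illustration, and parameter handling is simplified to a single flat vector:

```python
import numpy as np

class SynapticIntelligence:
    """Online importance estimation (Zenke et al., 2017), simplified sketch."""

    def __init__(self, theta, xi=1e-3, scale=1.0):
        self.theta_ref = theta.copy()        # params at last consolidation
        self.omega = np.zeros_like(theta)    # running path integral
        self.Omega = np.zeros_like(theta)    # consolidated importance
        self.xi = xi                         # damping to avoid division by zero
        self.scale = scale                   # >1 for W_ie / W_ei, per Sec. 4.5

    def accumulate(self, grad, delta_theta):
        # per-step contribution of each parameter to the change in loss
        self.omega += -grad * delta_theta

    def consolidate(self, theta):
        # at task boundary: convert path integral into importance Omega
        delta = theta - self.theta_ref
        self.Omega += self.scale * self.omega / (delta ** 2 + self.xi)
        self.omega = np.zeros_like(theta)
        self.theta_ref = theta.copy()

    def penalty(self, theta, c=0.1):
        # surrogate loss reducing the plasticity of important synapses
        return c * np.sum(self.Omega * (theta - self.theta_ref) ** 2)
```

In the paper's setting, the entries of `Omega` corresponding to the excitatory-to-inhibitory and intra-layer inhibitory weights are upscaled, further penalizing changes to the weights that most affect the layer's output distribution.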

4.6. EXPERIENCE REPLAY IS ESSENTIAL FOR ENABLING CL IN CHALLENGING SCENARIOS

Replay of past neural activation patterns in the brain is considered to play a critical role in memory formation, consolidation, and retrieval (Walker & Stickgold, 2004; McClelland et al., 1995). The replay mechanism in the hippocampus (Kumaran et al., 2016) has inspired a series of rehearsal-based approaches (Li & Hoiem, 2017; Chaudhry et al., 2018; Lopez-Paz & Ranzato, 2017; Arani et al., 2022) which have proven effective in challenging continual learning scenarios (Farquhar & Gal, 2018; Hadsell et al., 2020). Therefore, to replay samples from previous tasks, we utilize a small episodic memory buffer maintained through reservoir sampling (Vitter, 1985). It approximately matches the distribution of the incoming stream by assigning each new sample an equal probability of being represented in the buffer. During training, samples from the current task, $(x_b, y_b) \sim D_\tau$, are interleaved with memory buffer samples, $(x_m, y_m) \sim M$, to approximate the joint distribution of the tasks seen so far. Furthermore, to mimic the replay of the activation patterns that accompanied the learning event in the brain, we also save the output logits, $z_m$, across the training trajectory and enforce a consistency loss when replaying samples from the episodic memory. Concretely, the loss is given by:

$\mathcal{L} = \mathcal{L}_{cls}(f(x_b; \theta), y_b) + \alpha \mathcal{L}_{cls}(f(x_m; \theta), y_m) + \beta (f(x_m; \theta) - z_m)^2$,

where $f(\cdot\,; \theta)$ is the model parameterized by $\theta$, $\mathcal{L}_{cls}$ is the standard cross-entropy loss, and $\alpha$ and $\beta$ control the strength of the interleaved training and the consistency constraint, respectively. Table 1 shows that experience replay (+ER) complements the context-dependent processing of information and enables the model to learn the joint distribution well in varying challenging settings. In particular, the failure of the model to avoid forgetting in the Class-IL setting (Seq-MNIST) without experience replay suggests that context-dependent processing of information alone does not suffice and experience replay might be essential. Adding consistency regularization (+CR) further improves performance, as the model receives additional relational information about the structural similarity of classes, which facilitates the consolidation of information.
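The replay machinery above, reservoir sampling plus the combined loss, can be sketched as below. The softmax cross-entropy helper and the scalar sample format are illustrative assumptions, not the paper's implementation:

```python
import random
import numpy as np

class ReservoirBuffer:
    """Episodic memory via reservoir sampling (Vitter, 1985): every sample
    in the stream has an equal probability of residing in the buffer."""

    def __init__(self, size, seed=0):
        self.size, self.data, self.seen = size, [], 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.size:
            self.data.append(sample)
        else:
            j = self.rng.randrange(self.seen)  # keep with prob. size / seen
            if j < self.size:
                self.data[j] = sample

def replay_loss(logits_b, y_b, logits_m, y_m, z_m, alpha, beta):
    """L = CE(f(x_b), y_b) + alpha*CE(f(x_m), y_m) + beta*||f(x_m) - z_m||^2."""
    def ce(logits, y):
        logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
        return -np.mean(logp[np.arange(len(y)), y])
    return (ce(logits_b, y_b) + alpha * ce(logits_m, y_m)
            + beta * np.mean((logits_m - z_m) ** 2))
```

Storing the logits $z_m$ alongside each buffered sample is what enables the consistency term: replayed examples are pulled towards the network's response at the time of learning, not just towards the hard label.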

4.7. COMBINING THE INDIVIDUAL COMPONENTS

Having shown the individual effect of each of the brain-inspired components in the biologically plausible framework, here we look at their combined effect. The resultant model is referred to as Bio-ANN. Table 1 shows that the different components complement each other and consistently improve the performance of the model. The empirical results suggest that employing multiple complementary components and learning mechanisms, similar to the brain, can be an effective approach to enable continual learning in ANNs.

5. DISCUSSION

We conducted a study of the effect of different brain-inspired mechanisms in the CL setting within a biologically plausible framework. The underlying model incorporates several key design principles and learning mechanisms of the brain: each layer comprises separate populations of exclusively excitatory and inhibitory units, adhering to Dale's principle, and the excitatory pyramidal neurons are augmented with dendritic segments for context-dependent processing of information. We first showed that, equipped with the integrative properties of dendrites, a feedforward network adhering to Dale's principle not only performs as well as standard ANNs but can also provide gains. We then studied the individual roles of the different components. We showed that controlling the sparsity in activations using k-WTA activations and a Heterogeneous dropout mechanism that encourages the model to use a different set of neurons for each task is an effective approach for maintaining a balance between the reusability of features and interference, which is critical for enabling CL. We further showed that complementing error-based learning with the "fire together, wire together" learning paradigm can strengthen the association between the context signal and the dendritic segments that process it, facilitating context-dependent gating. To further mitigate forgetting, we incorporated synaptic consolidation in conjunction with experience replay and showed their effectiveness in challenging CL settings. Finally, the combined effect of these components suggests that, similar to the brain, employing multiple complementary mechanisms in a biologically plausible architecture is an effective approach to enable CL in ANNs. It also provides a framework for further study of the role of inhibition in mitigating catastrophic forgetting. However, there are several limitations and potential avenues for future research.
In particular, as dendritic segments provide an effective mechanism for studying the effect of encoding different information in the context signal, it provides an interesting research avenue as to what information is useful for the sequential learning of tasks and the effect of different context signals. Neuroscience studies suggest that multiple brain regions are involved in processing a stimulus, and while there is evidence that active dendritic segments receive contextual information that is different from the input received by the proximal segments, it is unclear what information is encoded in the contextual information and how it is extracted. Here, we used the context signal as in (Iyer et al., 2021) , which aims to encode the identity of the task by taking the average input image of all samples in the task. Although this approach empirically works well in the Perm-MNIST setting, it is important to consider its utility and limitations under different CL settings. Given the specific design of Perm-MNIST, binary centered digits, and the independent nature of the permutations in each task, the average input image can provide a good approximation of the applied permutation and hence efficiently encode the task identity. However, this is not straightforward for Rot-MNIST where the task similarities are higher and even more challenging for natural images where averaging the input image does not provide a meaningful signal. More importantly, it does not seem biologically plausible to encode task information alone as the context signal and ignore the similarity of classes occurring in different tasks. For instance, it seems more reasonable to process slight rotations of the same digits similarly (as in Rot-MNIST) rather than processing them through different subnetworks. Ideally, we would want the context signal for different rotations of a digit to be highly similar. 
It is, however, quite challenging to design context signals that can capture a wide range of complexities in the sequential learning of tasks. Furthermore, instead of hand-engineering the context signal to bias learning towards certain types of tasks, an effective approach for learning the context signal in an end-to-end training framework is an interesting direction for future research. Overall, our study presents a compelling case for incorporating the design principles and learning machinery of the brain into ANNs, and lends credence to the argument that distilling the details of the learning machinery of the brain can bring us closer to human intelligence (Hassabis et al., 2017; Hayes et al., 2021). Furthermore, deep learning is increasingly being used in neuroscience research to model and analyze brain data (Richards et al., 2019). The utility of a model for such research depends on two critical aspects: the performance of the model and how close its architecture is to the brain (Cornford et al., 2020; Schrimpf et al., 2020). The biologically plausible framework in our study incorporates several design components and learning mechanisms of the brain and performs well in a (continual learning) task that is closer to human learning. Therefore, we believe that this work can also be useful to the neuroscience community in evaluating and guiding computational neuroscience research. Studying the properties of ANNs with higher similarity to the brain may provide insight into the mechanisms of brain function. We believe that the fields of artificial intelligence and neuroscience are intricately intertwined, and progress in one can drive the other as well.

A APPENDIX

A.1 RELATED WORK - BIOLOGICALLY INSPIRED AI

The human brain has long been a source of inspiration for ANN design (Hassabis et al., 2017; Kudithipudi et al., 2022). However, we have failed to take full advantage of our enhanced understanding of the brain, and there are fundamental differences between the design principles and learning mechanisms employed in the brain and in ANNs. These differences may account for the huge gap in performance and behavior. From an architectural perspective, standard ANNs are predominantly based on the point neuron model (Lapique, 1907), an outdated and oversimplified model of biological computation that lacks the sophisticated, context-dependent processing found in the brain. Furthermore, neurons in standard ANNs do not adhere to Dale's principle (Strata et al., 1999), to which the majority of neurons in the brain conform. Unlike the brain, where presynaptic neurons have an exclusively excitatory or inhibitory impact on their postsynaptic partners, neurons in standard ANNs have both positive and negative output weights, and the signs can change during learning. These are two of the major fundamental differences in the underlying design principles of ANNs and the brain. Two recent studies attempt to address this gap. Cornford et al. (2020) incorporated Dale's principle into ANNs (DANNs) in a more biologically plausible manner and showed that, with certain initialization and regularization considerations, DANNs can perform comparably to standard ANNs on object recognition tasks, which earlier attempts failed to do. Our study extends DANNs to the more challenging CL setting and shows that accounting for the discrepancy in the effect of weight changes in excitatory and inhibitory neurons can further reduce forgetting in CL. Iyer et al. (2021) propose an alternative to the point neuron model and provide an algorithmic abstraction of pyramidal neurons in the neocortex.
Each neuron is augmented with dendritic segments that receive an additional context signal, and the output of the dendritic segments modulates the activity of the neuron, allowing context-dependent processing of information. Our study builds upon their work and provides a biologically plausible architecture characterized by both adherence to Dale's principle and the context-dependent processing of pyramidal neurons. This gives us a framework to study the role of brain-inspired mechanisms, including inhibition, in the challenging continual learning setting, which is closer to human learning. From a learning perspective, several approaches have been inspired by the brain, particularly for CL (Kudithipudi et al., 2022). The replay mechanism in the hippocampus (Kumaran et al., 2016) has inspired a series of rehearsal-based approaches (Hayes et al., 2021; Hadsell et al., 2020), which have proven to be effective in challenging continual learning scenarios (Farquhar & Gal, 2018). Another popular family, regularization-based approaches (Zenke et al., 2017; Kirkpatrick et al., 2017), has been inspired by neurobiological models which suggest that CL in the neocortex relies on a process of task-specific synaptic consolidation, whereby a proportion of synapses is rendered less plastic and therefore stable over long timescales (Benna & Fusi, 2016; Yang et al., 2009). While both of these approaches are inspired by the brain, researchers have mostly discounted the fact that the brain employs them in conjunction to consolidate information, rather than in isolation; research on the two methods has therefore been largely orthogonal. Furthermore, they have been applied on top of standard ANNs, which are not representative of the complexity of neurons in the brain.
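The dendritic modulation described above can be made concrete with a small sketch in the spirit of the active-dendrites abstraction of Iyer et al. (2021): each segment computes a dot product with the context, and the strongest segment per neuron gates that neuron's feedforward activation through a sigmoid. The shapes and function names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dendritic_modulation(h: np.ndarray, context: np.ndarray,
                         segments: np.ndarray) -> np.ndarray:
    """Modulate feedforward activations with dendritic segments.

    h:        (num_neurons,) feedforward activations
    context:  (context_dim,) context vector (e.g. a task prototype)
    segments: (num_neurons, num_segments, context_dim) dendritic weights

    Each segment responds to the context via a dot product; the strongest
    segment per neuron gates that neuron's activation through a sigmoid.
    """
    seg_out = segments @ context        # (num_neurons, num_segments)
    strongest = seg_out.max(axis=1)     # winner segment per neuron
    return h * sigmoid(strongest)
```

Neurons whose segments match the context are passed through nearly unchanged, while mismatched neurons are suppressed, which is what carves out task-specific subnetworks.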
Our study employs replay and synaptic consolidation together in a more biologically plausible architecture and shows that they complement each other to improve performance. Additionally, our framework employs several techniques to mimic the characteristics of activations in the brain. As pyramidal neurons in the neocortex have highly sparse connectivity to each other (Hunter et al., 2021; Holmgren et al., 2003) and only a small percentage (<2%) of neurons are active for a given stimulus (Barth & Poulet, 2012), we apply k-winner-take-all (k-WTA) activations (Ahmad & Scheinkman, 2019) to mimic activation sparsity. Although several studies have shown the benefits of sparsity in CL (Abbasi et al., 2022; Mallya et al., 2018; Aljundi et al., 2018), they do not consider that the brain not only utilizes sparsity but does so in an efficient manner to encode information. Information in the brain is encoded by the strong activation of a relatively small set of neurons, forming a sparse code; a different subset of neurons is utilized to represent different types of stimuli (Foldiak, 2003), and semantically similar stimuli activate similar sets of neurons. Heterogeneous dropout (Abbasi et al., 2022) coupled with k-WTA activations aims to mimic these characteristics by encouraging each new task to utilize a new set of neurons for learning. Finally, we argue that it is important to incorporate not only the design elements of biological neurons but also the learning mechanisms they employ. Lifetime plasticity in the brain generally follows the Hebbian principle (Hebb, 2005). Therefore, we follow the approach in Flesch et al. (2022) and complement error-based learning with a Hebbian update that strengthens the connections between contextual information and dendritic segments, and we show that this strengthens context gating.
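The k-WTA activation used to mimic neocortical activation sparsity is straightforward; a minimal sketch:

```python
import numpy as np

def kwta(h: np.ndarray, k: int) -> np.ndarray:
    """k-winners-take-all: keep the k largest activations, zero the rest.

    Mimics the sparse firing observed in the neocortex, where only a
    small percentage of neurons respond to a given stimulus.
    """
    if k >= h.size:
        return h.copy()
    out = np.zeros_like(h)
    winners = np.argpartition(h, -k)[-k:]  # indices of the k largest
    out[winners] = h[winners]
    return out
```

With 2048 hidden units and 5% active neurons, as in our experiments, k would be roughly 102 per layer.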
Our study provides a biologically plausible framework whose underlying architecture combines context-dependent processing of information with adherence to Dale's principle. Additionally, it employs the learning mechanisms (experience replay, synaptic consolidation, and Hebbian updates) and activation characteristics (sparse non-overlapping activations and task-specific subnetworks) of the brain. To the best of our knowledge, we are the first to provide a comprehensive study of the integration of different brain-inspired mechanisms in a biologically plausible architecture in a CL setting.
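The Hebbian update on dendritic segments mentioned above can be sketched as follows. This is a minimal illustration, assuming the winning segment of each active neuron is nudged toward the current context; the exact rule and scaling in Flesch et al. (2022) may differ.

```python
import numpy as np

def hebbian_segment_update(segments: np.ndarray, context: np.ndarray,
                           h: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Hebbian step: strengthen the connection between the context and
    the winning dendritic segment of each active neuron.

    segments: (num_neurons, num_segments, context_dim) dendritic weights
    context:  (context_dim,) context vector
    h:        (num_neurons,) post-k-WTA activations (zero = inactive)
    """
    seg_out = segments @ context        # segment responses to the context
    winners = seg_out.argmax(axis=1)    # winning segment per neuron
    for i in np.nonzero(h > 0)[0]:      # only active (winning) neurons learn
        segments[i, winners[i]] += lr * h[i] * context
    return segments
```

Because only the winning segment of an active neuron moves toward the context, repeated presentations of a task sharpen the match between its context signal and its subnetwork, strengthening context gating.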

A.2 ADDITIONAL RESULTS

Additionally, we conducted experiments on Fashion-MNIST, which is more challenging than the MNIST datasets. We considered both the Class-IL (Seq-FMNIST) and the Domain-IL (Rot-FMNIST) settings. Seq-FMNIST divides the classification into 5 tasks with 2 classes each, while Rot-FMNIST involves classifying all 10 classes in each task, with the samples rotated by {0, 8, ..., (N-1)*8} degrees for tasks {τ 1 , τ 2 , .., τ N }. For brevity, we refer to Active Dendrites + Dale's principle as ActiveDANN. To better show the effect of the different components (ActiveDANN without ER fails in the Class-IL setting), we consider ActiveDANN + ER as the baseline upon which we add the other components. Empirical results in Table A.2 show that the findings on the MNIST settings also translate to Fashion-MNIST and that each component leads to a performance improvement.

A.3 EXPERIMENTAL SETUP

To study the role of the different brain-inspired components in a biologically plausible NN for CL and gauge their contribution to the performance and characteristics of the model, we conduct all our experiments under uniform settings. Unless otherwise stated, we use a multi-layer perceptron (MLP) with two hidden layers of 2048 units each and k-WTA activations. Each neuron is augmented with N dendritic segments, where N corresponds to the number of tasks, and the segment dimensions match the dimensions of the context vector, which in turn corresponds to the input image size (784 for all MNIST-based settings). The model is trained using an SGD optimizer with a learning rate of 0.3 and a batch size of 128 for 3 epochs on each task. We set the weight sparsity to 0 and the percentage of active neurons to 5%. For our experiments involving Dale's principle, we maintain the same total number of units in each layer, divided into 1844 excitatory and 204 inhibitory units. Only the excitatory units are augmented with dendritic segments. Importantly, we use the initialization strategy and the corrections to the SGD update posited in Cornford et al. (2020) to account for the disparities in the degree to which updates to different parameters affect the layer's output distribution: the inhibitory unit parameter updates are scaled down relative to the excitatory parameter updates. Concretely, the gradient updates to W ie are scaled by 1/√n e and those to W ei by 1/d, where n e is the number of excitatory neurons in the layer and d is the input dimension of the layer. Furthermore, to select the hyperparameters for the different settings, we use a small validation set. Note that, as the goal was not to achieve the best possible accuracy but rather to show the effect of each component, we did not conduct an extensive hyperparameter search. Table A.3 provides the selected hyperparameters for the experiments showing the individual effect of each component (Table 1) and for the Bio-ANN experiments.
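The gradient correction just described reduces to two scalar scalings. A minimal sketch, with hypothetical function and argument names (the actual implementation in Cornford et al. (2020) applies these inside the optimizer step):

```python
import numpy as np

def scale_inhibitory_grads(grad_W_ie: np.ndarray, grad_W_ei: np.ndarray,
                           n_e: int, d: int):
    """Down-scale inhibitory gradient updates, as in Cornford et al. (2020).

    A weight change on the inhibitory pathway shifts the layer's output
    distribution far more than the same change to an excitatory weight,
    so gradients to W_ie are scaled by 1/sqrt(n_e) and gradients to
    W_ei by 1/d, where n_e is the number of excitatory units and d the
    input dimension of the layer.
    """
    return grad_W_ie / np.sqrt(n_e), grad_W_ei / d
```

In our setup this would use n_e = 1844 and d = 784 for the first hidden layer.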
We report the mean accuracy over all tasks and one standard deviation over three different random seeds. The inhibitory interneuron architecture of DANN layers introduces disparities in the degree to which updates to different parameters affect the layer's output distribution; e.g., if a single element of W ie is updated, this affects every element of the layer's output. An inhibitory weight update of δ to W ie changes the model distribution approximately n e times more than an excitatory weight update of δ to W ee (Cornford et al., 2020). The effect of these disparities would be even more pronounced in the CL setting, as large changes to the output distribution when learning a new task can cause more forgetting of previous tasks. To account for this, we further reduce the learning rates of W ie and W ei after learning the first task. Table A.4 shows that accounting for the stronger effect of inhibitory neurons can further improve the performance of the model in the majority of settings. It would be interesting to explore better approaches, tailored to CL, that account for the aforementioned disparities and their effect on forgetting.
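The schedule is a simple conditional rescaling of two per-parameter learning rates. A minimal sketch, where the shrink `factor` is a hypothetical per-setting constant, not a value from the paper:

```python
def adjust_inhibitory_lr(lrs: dict, task_id: int, factor: float = 0.1) -> dict:
    """After the first task, shrink the learning rates of the inhibitory
    weight matrices W_ie and W_ei relative to the base rate.

    task_id is 0-indexed, so the reduction kicks in from the second task.
    """
    if task_id > 0:
        lrs = dict(lrs)  # do not mutate the caller's dict
        lrs["W_ie"] *= factor
        lrs["W_ei"] *= factor
    return lrs
```

This keeps the inhibitory pathway plastic while the first task is learned from scratch, then damps the updates that would otherwise shift the output distribution most.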

A.4 SCALING THE IMPORTANCE ESTIMATE OF INHIBITORY WEIGHTS

Similar to adjusting the learning rate of the inhibitory weights, we check whether scaling up the importance estimate of inhibitory neurons can further improve the effectiveness of synaptic consolidation.

For an effective CL agent, it is important to maintain a balance between forward transfer and interference between tasks. As the earlier layers learn general features, a higher portion of these features can be reused to learn the new task, which facilitates forward transfer, whereas the later layers learn more task-specific features, which can cause interference. Heterogeneous dropout provides an efficient mechanism for controlling the degree of overlap between the activations, and hence the features, of each layer. Here, we investigate whether having different levels of sparsity (controlled with the ρ parameter) in different layers can further improve performance. As the earlier layers learn general features, a higher overlap (smaller ρ) between the sets of active neurons can enable greater reusability and forward transfer, while for the later layers, less overlap between the activations (higher ρ) can reduce the interference between task-specific features. To study the effect of heterogeneous dropout in relation to task similarity, we vary the incremental rotation, θ inc , in each subsequent task in the Rot-MNIST setting with 5 tasks; the rotation for task τ is given by (τ-1)θ inc . Table A.6 shows the performance of the model for different layerwise ρ values. Generally, heterogeneous dropout consistently improves the performance of the model, especially when the task similarity is low; for θ inc = 32, it provides a ∼25% improvement. As the task similarity decreases (θ inc increases), higher values of ρ are more effective. Furthermore, we see that having different ρ values for each layer can provide additional gains in performance.
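The role of ρ can be illustrated with the keep probabilities of heterogeneous dropout. The sketch below uses one plausible exponential form in the normalized activation counts (the exact schedule in Abbasi et al. (2022) may differ): frequently used neurons are kept with lower probability, and ρ controls how aggressively the overlap is reduced.

```python
import numpy as np

def heterogeneous_dropout_probs(counts: np.ndarray, rho: float) -> np.ndarray:
    """Keep probabilities for heterogeneous dropout.

    counts: per-neuron activation counts accumulated over previous tasks.
    Neurons that fired often earlier get a lower probability of staying
    active, pushing the new task onto less-used neurons. rho = 0 keeps
    every neuron; larger rho enforces less overlap between tasks.
    """
    if counts.max() == 0:
        return np.ones_like(counts, dtype=float)
    return np.exp(-rho * counts / counts.max())
```

A small ρ in the first hidden layer then preserves reusable general features, while a larger ρ in the second layer separates task-specific ones, matching the layerwise study above.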
To further study the effect of different levels of sparsity in activations and connections, we vary the fraction of weights set randomly to zero at initialization (S W ∈ {0, 0.25, 0.50}) and the ratio of active neurons (k l1 , k l2 ) in each hidden layer. Table A.7 shows that sparsity in activations plays a critical role in enabling CL in ANNs. Interestingly, sparsity in connections plays a considerable role in Perm-MNIST at higher ratios of active neurons (≥ 0.2). Exploring finer differences in the activation sparsity of different layers may further improve performance. Similar to heterogeneous dropout, we show the effect of activation sparsity in relation to task similarity in Table A.7: similar tasks (lower θ inc ) benefit from a higher number of active neurons, which can increase forward transfer, whereas dissimilar tasks (higher θ inc ) perform better with higher activation sparsity, which can reduce the overlap in representations.
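The connection sparsity S W amounts to fixing a random binary mask at initialization. A minimal sketch, with hypothetical names; in practice the mask would be re-applied after every update so the zeroed connections stay zero:

```python
import numpy as np

def sparse_init(shape, s_w: float, rng=None):
    """Randomly zero a fraction s_w of weights at initialization.

    Returns the masked weights and the fixed binary mask; keeping the
    mask lets training preserve the sparse connectivity pattern.
    """
    rng = rng or np.random.default_rng()
    w = rng.normal(0.0, 0.1, size=shape)
    mask = (rng.random(shape) >= s_w).astype(w.dtype)
    return w * mask, mask
```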



We will make the code available upon acceptance.




Figure 2: Total activation counts for the test set of each task (y-axis) for a random set of 25 units in the second hidden layer of the model. Heterogeneous dropout reduces the overlap in activations and facilitates the formation of task-specific subnetworks.

Figure 3: Effect of dropout ρ on the overlap between the distributions of layer two activation counts for each task in Perm-MNIST with 2 tasks. Higher ρ reduces the overlap.

Algorithm 1 Bio-ANN: A biologically plausible framework for CL
Input: Data stream D; learning rates η, η Wie , η Wei ; Hebbian learning rate η h ; heterogeneous dropout ρ; synaptic consolidation weights λ, λ Wie , λ Wei , γ; experience replay weights α, β
Initialize: model weights θ; reference weights θ c = {}; task prototypes C τ = {}; heterogeneous dropout activation counts A τ = 0 and keep probabilities P τ = 1; memory buffer M ← {}; Synaptic Intelligence: ω = 0, Ω = 0
1: for D τ ∈ {D 1 , D 2 , .., D T } do  ▷ Sample task from data stream
2:   Compute the task prototype c τ and add it to the set of prototypes: C τ ← {C τ , c τ }
3:   while Training do  ▷ Train on task τ
4:     Sample data: (x b , y b ) ∼ D τ and (x m , y m , z m ) ∼ M
5:     Get the model output and activation counts on the current task batch: z b , a b = F(x b , c τ ; θ, P τ )  # apply heterogeneous dropout
6:     Calculate the task loss: L τ = L cls (z b , y b )  ▷ Task-specific loss
7:     Get the model output on buffer samples: z = F(x m , c m ; θ)  # disable heterogeneous dropout
8:     Calculate the replay loss: L er = αL cls (z, y m ) + β(z - z m )²
9:     Calculate the SI loss: L sc = Ω adj (θ - θ c )²  ▷ Synaptic regularization
10:    Calculate the overall loss and clip the gradient between 0 and 1: L = L τ + L er + L sc ; ∇ θ L = Clip(∇ θ L, 0, 1)
11:    SGD update: θ = UpdateModel(∇ θ L, η, η Wie , η Wei )  ▷ Update model
12:    Hebbian update on the dendritic segments: U = HebbianStep({c τ , c m }, η h )
13:    Update the memory buffer: M = Reservoir(M, (x b , y b , z b ))
14:  Scale up the importance for inhibitory weights: Ω adj = ScaleUpInhib(Ω, λ Wie , λ Wei )
15: return θ

Algorithm 2 Reservoir Sampling
Input: memory buffer M, buffer size B, number of examples seen so far N, data point (x, y, z)
1: if B > N then  ▷ Check if memory is not yet full
2:   M[N] ← (x, y, z)
3: else
4:   i = randint(0, N)
5:   if i < B then M[i] ← (x, y, z)
6: return M
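Reservoir sampling (Algorithm 2) guarantees that each of the N examples seen so far remains in the buffer with equal probability B/N. A minimal runnable sketch:

```python
import random

def reservoir_update(memory: list, buffer_size: int, n_seen: int, item):
    """One step of reservoir sampling for the replay buffer.

    memory:      current buffer contents (mutated in place)
    buffer_size: maximum buffer size B
    n_seen:      number of examples seen before this one (0-indexed)
    item:        the incoming example, e.g. an (x, y, z) tuple
    """
    if len(memory) < buffer_size:
        memory.append(item)            # memory not yet full: always keep
    else:
        i = random.randint(0, n_seen)  # inclusive draw over all seen so far
        if i < buffer_size:
            memory[i] = item           # replace a random slot
    return memory
```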

Effect of each component of the biologically plausible framework on different datasets with varying numbers of tasks. We first show the effect of utilizing feedforward neurons adhering to Dale's principle in conjunction with Active Dendrites to form the framework within which we evaluate the individual effect of each brain-inspired mechanism, before combining them all (along with heterogeneous dropout) to forge Bio-ANN. For all experiments, we set the percentage of active neurons to 5%. We provide the average task performance and one standard deviation over three runs.

Effect of different levels of sparsity in activations on the performance of the model. Columns show the ratio of active neurons (k in k-WTA activation), and the rows provide the number of tasks.

Effect of Heterogeneous dropout with increasing ρ values on different datasets with varying number of tasks.

Effect of each component of the biologically plausible framework on Seq-FMNIST and Rot-FMNIST. For all experiments, we use a memory budget of 500 samples. HD refers to heterogeneous dropout. We provide the average task performance and one standard deviation over 5 runs.

The selected hyperparameters for the experiments showing the individual effect of each component (Table 1). The base learning rate for all experiments is 0.3, and the individual components use the same learning rates for W ie and W ei as (+ Dale's Principle). For + SC experiments, we use λ Wie = 10 and λ Wei = 10. For ER experiments, we use a memory budget of 500 samples.

The selected hyperparameters for the Bio-ANN experiments in Table 1. We use the same learning rates for each setting as + Dale's Principle (Table A.3).

Effect of adjusting the learning rates of W ie and W ei at the end of the first task on different datasets with varying numbers of tasks. Table A.5 shows that scaling the importance estimate in accordance with the degree to which the inhibitory weights affect the output distribution, and hence forgetting, further improves performance in the majority of cases, especially for lower numbers of tasks. This suggests that regularization methods designed specifically for networks with inhibitory neurons are a promising research direction.

Effect of scaling the importance estimates for W ie and W ei to reduce the parameter shift in the inhibitory weights on the Rot-MNIST and Perm-MNIST datasets with varying numbers of tasks.

Effect of layerwise dropout ρ on Rot-MNIST with 5 tasks with varying degrees of incremental rotation (θ inc ) in each subsequent task. The first row shows (ρ l1 , ρ l2 ), the ρ values for the first and second hidden layers, respectively.

Effect of different levels of sparsity in activations (ratio of active neurons, (k l1 , k l2 ), in the 1st and 2nd hidden layers, respectively) and connections (ratio of zero weights, S W ) on Rot-MNIST and Perm-MNIST with increasing numbers of tasks. The best performance across the different sparsity levels for each task is in bold.

Effect of different levels of activation sparsity on Rot-MNIST with 5 tasks with varying degrees of incremental rotation (θ inc ) in each subsequent task. The first row shows (k l1 , k l2 ), the ratio of active neurons in the 1st and 2nd hidden layers, respectively.

