ARTIFICIAL NEURONAL ENSEMBLES WITH LEARNED CONTEXT DEPENDENT GATING

Abstract

Biological neural networks are capable of recruiting different sets of neurons to encode different memories. However, when training artificial neural networks on a set of tasks, typically no mechanism is employed for selectively producing anything analogous to these neuronal ensembles. Further, artificial neural networks suffer from catastrophic forgetting, where the network's performance rapidly deteriorates as tasks are learned sequentially. By contrast, sequential learning is possible for a range of biological organisms. We introduce Learned Context Dependent Gating (LXDG), a method to flexibly allocate and recall 'artificial neuronal ensembles' using a particular network structure and a new set of regularization terms. Activities in the hidden layers of the network are modulated by gates, which are dynamically produced during training. The gates are themselves the outputs of networks, trained with a sigmoid output activation. The regularization terms we introduce correspond to properties exhibited by biological neuronal ensembles. The first term penalizes low gate sparsity, ensuring that only a specified fraction of the network is used. The second term ensures that previously learned gates are recalled when the network is presented with input from previously learned tasks. Finally, a third term encourages new tasks to be encoded in gates that are as orthogonal as possible to previously used ones. We demonstrate the ability of this method to alleviate catastrophic forgetting on continual learning benchmarks. When the new regularization terms are included along with Elastic Weight Consolidation (EWC), the model achieves better performance on the 'permuted MNIST' benchmark than with EWC alone. The 'rotated MNIST' benchmark demonstrates how similar tasks recruit similar neurons to the artificial neuronal ensemble.

1. INTRODUCTION

1.1. CATASTROPHIC FORGETTING

Learning sequentially without forgetting prior tasks is commonly known as continual learning or life-long learning. When an artificial neural network is trained on a task and the same model is then trained on a new task, the model's performance on the initial task tends to drop significantly. The model tends to overwrite parameters that are important to previously learned tasks, leading to the well-known problem of 'catastrophic forgetting'. This phenomenon, originally termed 'catastrophic interference', was first observed in 1989 (McCloskey & Cohen, 1989) and is a longstanding problem encountered when training artificial neural networks. It is closely related to the stability-plasticity dilemma (Mermillod et al., 2013), which arises from the trade-off between plasticity, changing parameters to learn new tasks, and stability, keeping them fixed to preserve previously learned ones. The brain is far less susceptible to this problem and is able to adeptly learn many tasks. For example, whilst humans are still quite capable of forgetting, they are able to learn the rules of chess, football, cricket, poker, and many other games sequentially, without forgetting the rules of the others. The mechanisms by which the brain achieves this are not fully known, though neuroscience experiments examining neuronal activity and plasticity in biological neural networks are providing insights. One way the brain may enable continual learning is by restricting the plasticity of the synaptic connections that are important for a memory once it is formed. Upon the introduction of a new stimulus or task, new excitatory connections are made on dendritic spines in the brain (Bonilla-Quintana & Wörgötter, 2020; Makino & Malinow, 2009). As learning occurs, the strength of these connections increases via long-term potentiation, whereby the synapse responds more readily to a given stimulus (Alloway, 2001; Yuste & Bonhoeffer, 2001).
Furthermore, the connections that were important for prior tasks experience reduced plasticity, thereby protecting these connections from being lost as new tasks are learned (Bliss & Collingridge, 1993; Honkura et al., 2008). This property of biological networks inspired the introduction of Synaptic Intelligence (Zenke et al., 2017) and Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2016) in artificial neural networks, both of which provide a means of estimating the importance of individual weights to prior tasks. The regularization terms introduced in these methods encourage the network to find solutions that change the most important weights as little as possible. Another proposed mechanism is the idea that memories are encoded in distinct 'neuronal ensembles'; that is, different sets of neurons are recruited to encode memories in different contexts (Mau et al., 2020). This use of neuronal ensembles may reduce interference between memories and prevent the kind of catastrophic forgetting observed in artificial neural networks (González et al., 2020). Methods for mitigating catastrophic forgetting fall into three broad categories: replay-based methods (Li & Hoiem, 2016; van de Ven & Tolias, 2018; Rebuffi et al., 2016), regularization methods (Kirkpatrick et al., 2016; Zenke et al., 2017; Jung et al., 2016) and architectural methods (Schwarz et al., 2018; Masse et al., 2018; Aljundi et al., 2016). The model we introduce in this paper uses a combination of architectural and regularization methods.
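To make the regularization approach concrete, an EWC-style penalty anchors each parameter to its value after the previous task, weighted by a per-parameter importance estimate (the diagonal of the Fisher information in EWC). The following is a minimal NumPy sketch with illustrative values; the function and variable names are our own and not taken from the cited works:

```python
import numpy as np

def ewc_penalty(theta, theta_star, importance, lam=1.0):
    """EWC-style quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta      -- current parameters (flattened)
    theta_star -- parameters consolidated after the previous task
    importance -- per-parameter importance (e.g. diagonal Fisher in EWC)
    lam        -- strength of the penalty relative to the task loss
    """
    return 0.5 * lam * np.sum(importance * (theta - theta_star) ** 2)

# Parameters with high importance are penalized heavily for drifting,
# while unimportant parameters remain free to change for the new task.
theta_star = np.array([1.0, -2.0, 0.5])
theta      = np.array([1.5, -2.0, 1.5])
importance = np.array([10.0, 10.0, 0.1])  # third weight is unimportant

print(ewc_penalty(theta, theta_star, importance))  # ~1.3
```

In training, this penalty would simply be added to the loss of the current task, so gradient descent trades off new-task accuracy against drift in important weights.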

1.2. RELEVANT WORK

Context Dependent Gating (XDG) (Masse et al., 2018) was demonstrated to successfully alleviate catastrophic forgetting in feedforward and recurrent architectures on image recognition tasks such as permuted MNIST and ImageNet. This method works by multiplying the activity of the hidden layers of an artificial neural network by masks, or 'gates'. In XDG the gates are binary vectors in which a random 20% of the nodes in each hidden layer are allowed to be active for each context, while the activities of the remaining 80% of units are set to zero. This allows the network to use different sub-networks with little overlap in their representations (overlap was permitted by chance; with 20% of units active per gate, the expected overlap between two random gates is roughly 4% of units). In combination with Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), the XDG method achieved strong performance on continual learning benchmarks. In XDG the gate applied to each task is randomly allocated, and the context signal used to retrieve it must be given as an additional input. In many realistic continual learning scenarios no context label is specified, and the allocation of random gates may not be optimal. The method introduced in this paper is an extension of XDG in which the gates are instead the outputs of trainable networks themselves.
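The random gating mechanism described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not code from the XDG paper; the helper name and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_gate(n_units, frac_active=0.2, rng=rng):
    """Random binary gate: exactly frac_active of the units stay active."""
    gate = np.zeros(n_units)
    active = rng.choice(n_units, size=int(frac_active * n_units), replace=False)
    gate[active] = 1.0
    return gate

n_units = 1000
gate_a = make_gate(n_units)  # gate allocated to task A
gate_b = make_gate(n_units)  # gate allocated to task B

# Gating multiplies hidden activity elementwise, silencing the other 80% of units.
hidden = rng.standard_normal(n_units)
gated_hidden = hidden * gate_a

# With 20% of units active per gate, the expected fraction of units active
# in both gates is 0.2 * 0.2 = 0.04, i.e. ~4% overlap at chance.
overlap = np.mean(gate_a * gate_b)
print(f"fraction of units active in both gates: {overlap:.3f}")
```

Applying a different fixed gate per task in this way yields largely disjoint sub-networks, which is what limits interference between tasks.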

1.3. OUR CONTRIBUTION

We introduce Learned Context Dependent Gating (LXDG), in which the gating of the hidden layers is trainable and depends on the task input. Crucially, an advantage over XDG is that this method does not require an additional label specifying which task is being presented when performing previously learned tasks. For example, if new data from a previously learned task were presented without the associated context label, XDG could not perform well without recovering the context label used to train that task. This is because XDG requires a context vector to reproduce the correct gating. Some methods have attempted to address this limitation of XDG, such as the Active Dendrites Network (Iyer et al., 2022). In that case context vectors are still required, but they are derived from a simple statistical analysis of the inputs. The LXDG method no longer requires a context vector to be explicitly provided or inferred. The gates for LXDG are learned and dynamically produced using a set of newly introduced regularization terms, allowing the method to simultaneously optimize task performance and gate allocation within a single loss function. The random gating in XDG does allow for some level of overlap at chance. However, there is no flexibility for the model to potentially benefit from allowing overlapping learned representations for

