SPARSE DISTRIBUTED MEMORY IS A CONTINUAL LEARNER

Abstract

Continual learning is a problem for artificial neural networks that their biological counterparts are adept at solving. Building on work using Sparse Distributed Memory (SDM) to connect a core neural circuit with the powerful Transformer model, we create a modified Multi-Layered Perceptron (MLP) that is a strong continual learner. We find that every component of our MLP variant translated from biology is necessary for continual learning. Our solution is also free from any memory replay or task information, and introduces novel methods to train sparse networks that may be broadly applicable.

1. INTRODUCTION

Biological networks tend to thrive at continually learning novel tasks, a problem that remains daunting for artificial neural networks. Here, we use Sparse Distributed Memory (SDM) to modify a Multi-Layered Perceptron (MLP) with features from a cerebellum-like neural circuit that are shared across organisms as diverse as humans, fruit flies, and electric fish (Modi et al., 2020; Xie et al., 2022). These modifications result in a new MLP variant (referred to as the SDMLP) that uses a Top-K[1] activation function (keeping only the k most excited neurons in a layer on), has no bias terms, and enforces both L2 normalization and non-negativity constraints on its weights and data. All of these SDM-derived components are necessary for our model to avoid catastrophic forgetting.

We encounter challenges when training the SDMLP that we solve by leveraging additional neurobiology, resulting in better continual learning performance. The first problem is "dead neurons" that are never active for any input and are caused by the Top-K activation function (Makhzani & Frey, 2014; Ahmad & Scheinkman, 2019). Having fewer neurons participating in learning means that more of them are overwritten by each new continual learning task, increasing catastrophic forgetting. Our solution imitates the "GABA switch" phenomenon, in which the inhibitory interneurons that implement Top-K excite rather than inhibit early in development (Gozel & Gerstner, 2021). The second problem is with optimizers that use momentum, which becomes "stale" when training highly sparse networks: the optimizer continues to update inactive neurons with an out-of-date moving average, killing neurons and again harming continual learning. To our knowledge, we are the first to formally identify this problem, which in theory affects any sparse model, including recent Mixtures of Experts (Fedus et al., 2021; Shazeer et al., 2017).
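To make the staleness problem concrete, the following is a minimal sketch of one possible fix: a per-neuron SGD-with-momentum step that masks the velocity of neurons left inactive by Top-K, rather than letting them drift on an out-of-date moving average. The function name, signature, and masking strategy are our own illustration, not the paper's exact implementation.

```python
def masked_momentum_step(w, grad, velocity, active, beta=0.9, lr=0.1):
    """One SGD-with-momentum step per neuron, masking stale momentum.

    `active[i]` is True iff neuron i survived the Top-K cutoff this
    step; inactive neurons receive no weight update, instead of being
    repeatedly moved by an old velocity estimate (the "staleness").
    """
    new_w, new_v = [], []
    for wi, gi, vi, ai in zip(w, grad, velocity, active):
        vi = beta * vi + gi  # standard momentum accumulation
        if not ai:
            vi = 0.0         # reset momentum for the inactive neuron
        new_w.append(wi - lr * vi)
        new_v.append(vi)
    return new_w, new_v
```

Without the `if not ai` branch, a neuron with zero gradient would still be updated by `beta * vi`, slowly dragging it off the data manifold and contributing to dead neurons.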
Our SDMLP is a strong continual learner, especially when combined with complementary approaches. Using our solution in conjunction with Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), we obtain, to the best of our knowledge, state-of-the-art performance on CIFAR-10 in the class-incremental setting when memory replay is not allowed. Another variant of SDM, developed independently of our work, appears to be state of the art for CIFAR-100, MNIST, and FashionMNIST, with our SDMLP a close second (Shen et al., 2021). Excitingly, our continual learning success is "organic": it results from the underlying model architecture and does not require any task labels, task boundaries, or memory replay. Abstractly, SDM learns the subnetworks responsible for continual learning using two core model components. First, the Top-K activation function causes the k neurons most activated by an input to specialize towards that input, resulting in the formation of specialized subnetworks. Second, the L2 normalization and absence of a bias term together constrain neurons to the data manifold, ensuring that all neurons participate democratically in learning. When these two components are combined, a new learning task activates and trains only a small subset of neurons, leaving the rest of the network intact to remember previous tasks without being overwritten.

As a roadmap of the paper, we first discuss related work (Section 2). We then provide a short introduction to SDM (Section 3) before translating it into our MLP (Section 4). Next, we present our results, comparing the organic continual learning capabilities of our SDMLP against relevant benchmarks (Section 5). Finally, we conclude with a discussion of the limitations of our work, of sparse models more broadly, and of how SDM relates MLPs to Transformer Attention (Section 6).
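The two core model components described above can be sketched as a single hidden layer. This is an illustrative NumPy sketch under our own naming (`sdm_layer` and its signature are not from the paper): non-negative, L2-normalized weights, no bias term, and a Top-K activation over the resulting cosine similarities.

```python
import numpy as np

def sdm_layer(x, W, k):
    """One SDMLP-style hidden layer (illustrative sketch).

    Non-negativity plus L2 normalization place every neuron's weight
    vector on the positive unit hypersphere, i.e. on the (normalized)
    data manifold; with no bias term, activations are pure cosine
    similarities, and Top-K keeps only the k most excited neurons.
    """
    W = np.clip(W, 0.0, None)                         # non-negativity constraint
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # L2-normalize each neuron
    x = x / np.linalg.norm(x)                         # inputs are normalized too
    a = W @ x                                         # cosine similarity; no bias
    kth = np.sort(a)[-k]                              # k-th largest activation
    return np.where(a >= kth, a, 0.0)                 # keep top k, zero the rest
```

Because only the k winning neurons have nonzero output, only their incoming weights receive gradient on a given input, which is the mechanism behind the specialized subnetworks discussed above.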

2. RELATED WORK

Continual Learning - The techniques developed for continual learning can be broadly divided into three categories: architectural (Goodfellow et al., 2013; 2014), regularization (Smith et al., 2022; Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018), and rehearsal (Lange et al., 2021; Hsu et al., 2018). Many of these approaches have used the formation of sparse subnetworks for continual learning (Abbasi et al., 2022; Aljundi et al., 2019b; Ramasesh et al., 2022; Iyer et al., 2022; Xu & Zhu, 2018; Mallya & Lazebnik, 2018; Schwarz et al., 2021; Smith et al., 2022; Le et al., 2019; Le & Venkatesh, 2022).[2] However, in contrast to our "organic" approach, these methods employ complex algorithms and additional memory consumption to explicitly protect model weights important for previous tasks from being overwritten.[3] Works applying the Top-K activation to continual learning include Aljundi et al. (2019b), Gozel & Gerstner (2021), and Iyer et al. (2022). Srivastava et al. (2013) used a local version of Top-K, defining disjoint subsets of neurons in each layer and applying Top-K locally to each. However, this was only used on a simple task-incremental two-split MNIST task and without any of the additional SDM modifications that we found crucial to our strong performance (Table 2). The Top-K activation function has also been applied more broadly in deep learning (Makhzani & Frey, 2014; Ahmad & Scheinkman, 2019; Sengupta et al., 2018; Gozel & Gerstner, 2021; Aljundi et al., 2019b).
Top-K not only converges with the connectivity of many brain regions that utilize inhibitory interneurons, but also comes with demonstrated advantages beyond continual learning, including: greater interpretability (Makhzani & Frey, 2014; Krotov & Hopfield, 2019; Grinberg et al., 2019), robustness to adversarial attacks (Paiton et al., 2020; Krotov & Hopfield, 2018; Iyer et al., 2022), efficient sparse computations (Ahmad & Scheinkman, 2019), tiling of the data manifold (Sengupta et al., 2018), and implementation with local Hebbian learning rules (Gozel & Gerstner, 2021; Sengupta et al., 2018; Krotov & Hopfield, 2019; Ryali et al., 2020; Liang et al., 2020).

FlyModel - The method most closely related to ours is a model of the Drosophila Mushroom Body circuitry (Shen et al., 2021). This model, referred to as the "FlyModel", unknowingly implements the SDM algorithm, specifically the Hyperplane variant (Jaeckel, 1989a) with a Top-K activation function.



[1] Also called "k-Winners-Take-All" in related literature.

[2] Even without sparsity or the Top-K activation function, pretrained models can still form subnetworks, which translates into better continual learning performance, as found in Ramasesh et al. (2022).

[3] Memory replay methods indirectly determine and protect weights by deciding which memories to replay.

