SPARSE DISTRIBUTED MEMORY IS A CONTINUAL LEARNER

Abstract

Continual learning is a problem for artificial neural networks that their biological counterparts are adept at solving. Building on work using Sparse Distributed Memory (SDM) to connect a core neural circuit with the powerful Transformer model, we create a modified Multi-Layered Perceptron (MLP) that is a strong continual learner. We find that every component of our MLP variant translated from biology is necessary for continual learning. Our solution is also free from any memory replay or task information, and introduces novel methods to train sparse networks that may be broadly applicable.

1. INTRODUCTION

Biological networks tend to thrive in continually learning novel tasks, a problem that remains daunting for artificial neural networks. Here, we use Sparse Distributed Memory (SDM) to modify a Multi-Layered Perceptron (MLP) with features from a cerebellum-like neural circuit that are shared across organisms as diverse as humans, fruit flies, and electric fish (Modi et al., 2020; Xie et al., 2022). These modifications result in a new MLP variant (referred to as SDMLP) that uses a Top-K¹ activation function (keeping only the k most excited neurons in a layer active), removes bias terms, and enforces both L2 normalization and non-negativity constraints on its weights and data. All of these SDM-derived components are necessary for our model to avoid catastrophic forgetting.

We encounter challenges when training the SDMLP that we solve by leveraging additional neurobiology, resulting in better continual learning performance. The first problem is "dead neurons" that are never active for any input, caused by the Top-K activation function (Makhzani & Frey, 2014; Ahmad & Scheinkman, 2019). With fewer neurons participating in learning, more of them are overwritten by each new continual learning task, increasing catastrophic forgetting. Our solution imitates the "GABA switch" phenomenon, in which the inhibitory interneurons that implement Top-K excite rather than inhibit early in development (Gozel & Gerstner, 2021). The second problem is with optimizers that use momentum, which becomes "stale" when training highly sparse networks: the optimizer continues to update inactive neurons with an out-of-date moving average, killing neurons and again harming continual learning. To our
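The SDMLP layer described above combines four ingredients: no bias terms, non-negative weights, L2 normalization of weights and input, and a Top-K activation. The following is a minimal NumPy sketch of one such layer, assuming a dense weight matrix with one column per neuron; it is an illustration of the mechanism, not the authors' implementation.

```python
import numpy as np

def top_k(a, k):
    """Keep only the k largest activations; zero out the rest."""
    out = np.zeros_like(a)
    idx = np.argsort(a)[-k:]  # indices of the k most excited neurons
    out[idx] = a[idx]
    return out

def sdmlp_layer(x, W, k):
    """One SDMLP-style hidden layer: no bias, L2-normalized
    non-negative weights and input, followed by Top-K."""
    W = np.clip(W, 0.0, None)                         # non-negativity constraint
    W = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit-norm weight vector per neuron
    x = x / np.linalg.norm(x)                         # unit-norm input
    pre = W.T @ x                                     # cosine similarity, since both are unit norm
    return top_k(pre, k)

rng = np.random.default_rng(0)
h = sdmlp_layer(rng.random(16), rng.random((16, 64)), k=8)
print(np.count_nonzero(h))  # prints 8: exactly k neurons remain active
```

Because both the weights and the input are unit-norm and non-negative, each pre-activation is a cosine similarity, and Top-K keeps the k neurons whose weight vectors best match the input.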
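One way to imitate the GABA switch in this setting is to let all neurons fire early in training (the "excitatory" phase) and then anneal k down to the target sparsity, so every neuron gets gradient signal before Top-K competition begins. The linear schedule below is a hypothetical sketch of this idea, not necessarily the authors' exact schedule.

```python
def k_schedule(step, warmup_steps, n_neurons, k_final):
    """GABA-switch-inspired schedule (hypothetical): start with every
    neuron active, then linearly anneal k down to k_final over the
    warmup period, after which the layer stays at the target sparsity."""
    if step >= warmup_steps:
        return k_final
    frac = step / warmup_steps
    return int(round(n_neurons - frac * (n_neurons - k_final)))

print(k_schedule(0, 100, 64, 8))    # prints 64: all neurons active at the start
print(k_schedule(50, 100, 64, 8))   # prints 36: halfway through the anneal
print(k_schedule(100, 100, 64, 8))  # prints 8: target sparsity reached
```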
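The stale-momentum problem arises because a momentum buffer keeps pushing weight updates into neurons that Top-K has silenced, using gradients from batches where those neurons last fired. One possible mitigation (an assumption here, not a claim about the authors' method) is to mask the optimizer update so that only neurons active on the current batch are modified:

```python
import numpy as np

def masked_sgd_momentum(W, v, grad, active, lr=0.1, beta=0.9):
    """SGD with momentum where the update is applied only to columns
    of W whose neurons were active (selected by Top-K) on this batch.
    `active` is a 0/1 vector, one entry per neuron."""
    v = beta * v + grad                  # momentum buffer still accumulates
    W = W - lr * v * active[None, :]     # but inactive neurons are left untouched
    return W, v

rng = np.random.default_rng(1)
W = rng.random((4, 3))
v = np.zeros_like(W)
grad = rng.random((4, 3))
active = np.array([1.0, 0.0, 1.0])       # neuron 1 was silenced by Top-K
W2, v2 = masked_sgd_momentum(W.copy(), v, grad, active)
print(np.allclose(W2[:, 1], W[:, 1]))    # prints True: inactive neuron's weights untouched
```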



¹ Also called "k Winner Takes All" in related literature.

