LEARNING TO LIVE WITH DALE'S PRINCIPLE: ANNS WITH SEPARATE EXCITATORY AND INHIBITORY UNITS

Abstract

The units in artificial neural networks (ANNs) can be thought of as abstractions of biological neurons, and ANNs are increasingly used in neuroscience research. However, there are many important differences between ANN units and real neurons. One of the most notable is the absence of Dale's principle, which ensures that biological neurons are either exclusively excitatory or inhibitory. Dale's principle is typically left out of ANNs because its inclusion impairs learning. This is problematic, because one of the great advantages of ANNs for neuroscience research is their ability to learn complicated, realistic tasks. Here, by taking inspiration from feedforward inhibitory interneurons in the brain, we show that we can develop ANNs with separate populations of excitatory and inhibitory units that learn just as well as standard ANNs. We call these networks Dale's ANNs (DANNs). We present two insights that enable DANNs to learn well: (1) DANNs are related to normalisation schemes, and can be initialised such that the inhibition centres and standardises the excitatory activity; (2) updates to inhibitory neuron parameters should be scaled using corrections based on the Fisher information matrix. These results demonstrate how ANNs that respect Dale's principle can be built without sacrificing learning performance, which is important for future work using ANNs as models of the brain. The results may also have interesting implications for how inhibitory plasticity in the real brain operates.

1. INTRODUCTION

In recent years, artificial neural networks (ANNs) have been increasingly used in neuroscience research for modelling the brain at the algorithmic and computational level (Richards et al., 2019; Kietzmann et al., 2018; Yamins & DiCarlo, 2016). They have been used for exploring the structure of representations in the brain, the learning algorithms of the brain, and the behavioral patterns of humans and non-human animals (Bartunov et al., 2018; Donhauser & Baillet, 2020; Michaels et al., 2019; Schrimpf et al., 2018; Yamins et al., 2014; Kell et al., 2018). Evidence shows that the ability of ANNs to match real neural data depends critically on two factors. First, there is a consistent correlation between the ability of an ANN to learn well on a task (e.g. image recognition, audio perception, or motor control) and the extent to which its behavior and learned representations match real data (Donhauser & Baillet, 2020; Michaels et al., 2019; Schrimpf et al., 2018; Yamins et al., 2014; Kell et al., 2018). Second, the architecture of an ANN also helps to determine how well it can match real brain data, and generally, the more realistic the architecture, the better the match (Schrimpf et al., 2018; Kubilius et al., 2019; Nayebi et al., 2018). Given these two factors, it is important for neuroscientific applications to use ANNs that have as realistic an architecture as possible, but which also learn well (Richards et al., 2019; Kietzmann et al., 2018; Yamins & DiCarlo, 2016). Although there are numerous disconnects between ANNs and the architecture of biological neural circuits, one of the most notable is the lack of adherence to Dale's principle, which states that a neuron releases the same fast neurotransmitter at all of its presynaptic terminals (Eccles, 1976).
Though there are some interesting exceptions (Tritsch et al., 2016), for the vast majority of neurons in adult vertebrate brains, Dale's principle means that presynaptic neurons can only have an exclusively excitatory or inhibitory impact on their postsynaptic partners. For ANNs, this would mean that units cannot have a mixture of positive and negative output weights, and furthermore, that weights cannot change their sign after initialisation. In other words, a unit can only be excitatory or inhibitory. However, most ANNs do not incorporate Dale's principle. Why is Dale's principle rarely incorporated into ANNs? The reason is that this architectural constraint impairs learning, a fact that is known to many researchers who have tried to train such ANNs, but one that is rarely discussed in the literature. However, when we seek to compare ANNs to real brains, or use them to explore biologically inspired learning rules (Bartunov et al., 2018; Whittington & Bogacz, 2019; Lillicrap et al., 2020), ideally we would use a biologically plausible architecture with distinct populations of excitatory and inhibitory neurons, and at the same time, we would still be able to match the learning performance of standard ANNs without such constraints. Some previous computational neuroscience studies have used ANNs with separate excitatory and inhibitory units (Song et al., 2016; Ingrosso & Abbott, 2019; Miconi, 2017; Minni et al., 2019; Behnke, 2003), but these studies addressed questions other than matching the learning performance of standard ANNs, e.g. they focused on typical neuroscience tasks (Song et al., 2016), dynamic balance (Ingrosso & Abbott, 2019), biologically plausible learning algorithms (Miconi, 2017), or the learned structure of networks (Minni et al., 2019).
Importantly, what these papers did not do is develop means by which networks that obey Dale's principle can match the performance of standard ANNs on machine learning benchmarks, which has become an important feature of many computational neuroscience studies using ANNs (Bartunov et al., 2018; Donhauser & Baillet, 2020; Michaels et al., 2019; Schrimpf et al., 2018; Yamins et al., 2014; Kell et al., 2018). Here, we develop ANN models with separate excitatory and inhibitory units that are able to learn as well as standard ANNs. Specifically, we develop a novel form of ANN, which we call a "Dale's ANN" (DANN), based on feedforward inhibition in the brain (Pouille et al., 2009). Our approach differs from the standard solution, which is to create ANNs with separate excitatory and inhibitory units by constraining whole columns of the weight matrix to be either all positive or all negative (Song et al., 2016). Throughout this manuscript, we refer to this standard approach as "ColumnEi" models. We have departed from the ColumnEi approach in our work because it has three undesirable attributes. First, sign-constrained weight matrix columns impair learning because they limit the potential solution space (Amit et al., 1989; Parisien et al., 2008). Second, modelling excitatory and inhibitory units with the same connectivity patterns is biologically misleading, because inhibitory neurons in the brain tend to have very distinct connectivity patterns from excitatory neurons (Tremblay et al., 2016). Third, real inhibition can act in both a subtractive and a divisive manner (Atallah et al., 2012; Wilson et al., 2012; Seybold et al., 2015; Pouille et al., 2013), which may provide important functionality. Given these considerations, in DANNs we utilize a separate pool of inhibitory units with a distinct, more biologically realistic connectivity pattern, and a mixture of subtractive and divisive inhibition (Fig. 1).
This loosely mimics the fast feedforward subtractive and divisive inhibition provided by fast-spiking interneurons in the cortical regions of the brain (Atallah et al., 2012; Hu et al., 2014; Lourenço et al., 2020). In order to get DANNs to learn as well as standard ANNs, we also employ two key insights: 1. It is possible to view this architecture as being akin to normalisation schemes applied to the excitatory input of a layer (Ba et al., 2016; Ioffe & Szegedy, 2015; Wu & He, 2018), and we use this perspective to motivate DANN parameter initialisation. 2. It is important to scale the inhibitory parameter updates based on the Fisher information matrix, in order to balance the impact of excitatory and inhibitory parameter updates, similar in spirit to natural gradient approaches (Martens, 2014). Altogether, our principal contribution is a novel architecture that obeys Dale's principle, and that we show can learn as well as standard ANNs on machine learning benchmark tasks. This provides the research community with a new modelling tool that allows for more direct comparisons with real neural data than traditional ANNs do, but which does not suffer from learning impairments. Moreover, our results have interesting implications for inhibitory plasticity, and provide a means for future research into how excitatory and inhibitory neurons in the brain interact at the algorithmic level.

2.1. MODEL DEFINITION

Our design for DANNs takes inspiration from the physiology of feedforward inhibitory microcircuits in the neocortex and hippocampus. Based on these circuits, and an interpretation of layers in ANNs as corresponding to brain regions, we construct DANNs with the following architectural constraints: 1. Each layer of the network contains two distinct populations of units, an excitatory and an inhibitory population. 2. There are far fewer inhibitory units than excitatory units in each layer, just as there are far more excitatory neurons than inhibitory neurons (∼5-10 times) in cortical regions of the brain (Tremblay et al., 2016; Hu et al., 2014). 3. As in real neural circuits, where only the excitatory populations project between regions, here only excitatory units project between layers, and both the excitatory and inhibitory populations of a layer receive excitatory projections from the layer below. 4. All of the synaptic weights are strictly non-negative, and inhibition is enforced via the activation rules for the units (eq. 1). 5. The inhibitory population inhibits the excitatory population through a mixture of subtractive and divisive inhibition. This constrained architecture is illustrated in Figure 1. Formally, we define the network as follows. Input to the network is received as a vector of positive scalar values x ∈ R^d_+, which we consider to be the first excitatory population. Each hidden layer, ℓ, is comprised of a vector of excitatory units h^ℓ ∈ R^{n_e}_+ and inhibitory units h_I^ℓ ∈ R^{n_i}_+, in line with constraint (1) above. (We will drop the layer index ℓ when it is unnecessary for clarity.) Note that for the first layer (ℓ = 1), we have h^ℓ = x and n_e = d. Next, based on constraint (2), we set n_e ≫ n_i, and use 10% inhibitory units as a default. Following constraint (3), both the excitatory and inhibitory units receive inputs from the excitatory units in the layer below (h^{ℓ-1}), but the inhibitory units do not project between layers.
Instead, excitatory units receive inputs from the inhibitory units of the same layer. In line with constraint (4), we have three sets of strictly non-negative synaptic weights: one for the excitatory connections between layers, W^EE ∈ R^{n_e×n_e}_+, one for the excitatory projection to the inhibitory units, W^IE ∈ R^{n_i×n_e}_+, and one for the inhibitory projections within a layer, W^EI ∈ R^{n_e×n_i}_+. Finally, per constraint (5), we define the impact of the inhibitory units on the excitatory units as comprising both a subtractive and a divisive component:

h^ℓ = f(z^ℓ),    z^ℓ = (g/γ) ⊙ (z_E - W^EI h_I) + β    (1)

where

z_E = W^EE h^{ℓ-1},    h_I = f_I(z_I) = f_I(W^IE h^{ℓ-1})    (2)

γ = W^EI (e^α ⊙ h_I)    (3)

and where, for each layer ℓ, β ∈ R^{n_e} is a bias, g ∈ R^{n_e}_+ controls the gain, γ is the divisive inhibitory term, and α ∈ R^{n_i} is a parameter that controls the strength of this divisive inhibition. Here ⊙ denotes elementwise multiplication (Hadamard product), and the exponential function and division are applied elementwise. In the rest of this manuscript we set f to be the rectified linear function (ReLU). Though a ReLU function is not a perfect match to the input-output properties of real neurons, it captures the essential rectification operation performed by neurons in physiologically realistic low-activity regimes (Salinas & Sejnowski, 2000). In this paper, we model the inhibitory units as linear (i.e. f_I(z_I) = z_I), since they receive only positive inputs and have no bias, and therefore their activation would always be in the linear part of the ReLU function. Although we make this modelling choice mainly for mathematical simplicity, there is some biological justification, as the resting membrane potential of the class of fast-spiking interneurons most related to our model is relatively depolarised, and their spike outputs can follow single inputs one-to-one (Hu et al., 2014; Galarreta & Hestrin, 2001).
In future work, for example in which inhibitory connections are included between inhibitory units, we expect that the use of nonlinear functions for inhibitory units will be important.
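As a concrete illustration, the layer defined by equations 1-3 can be sketched in a few lines of NumPy. This is our own minimal sketch (function and variable names are illustrative, not from a reference implementation), with f chosen as ReLU and linear inhibitory units, as in the text above:

```python
import numpy as np

def dann_layer(h_prev, W_EE, W_IE, W_EI, alpha, g, beta):
    """One DANN layer: ReLU excitatory units receiving subtractive and
    divisive feedforward inhibition from linear inhibitory units."""
    z_E = W_EE @ h_prev                    # excitatory drive, shape (n_e,)
    h_I = W_IE @ h_prev                    # linear inhibitory units, shape (n_i,)
    gamma = W_EI @ (np.exp(alpha) * h_I)   # divisive inhibition, shape (n_e,)
    z = (g / gamma) * (z_E - W_EI @ h_I) + beta
    return np.maximum(0.0, z)              # f = ReLU
```

Note that all weight matrices remain non-negative: inhibition arises purely from the subtraction and division in the activation rule, so no weight ever needs to change sign.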

3. PARAMETER INITIALISATION FOR DALE'S ANNS

In biology, excitation and inhibition are balanced (Isaacson & Scanziani, 2011), and we use this biological property to derive an appropriate weight initialisation for DANNs. First, we initialise the excitatory parameters from an exponential distribution with rate parameter λ_E, i.e. W^EE ~iid Exp(λ_E), and then the inhibitory parameters are initialised such that excitation and subtractive inhibition are balanced, i.e. E[(z_E)_k] = E[(W^EI z_I)_k] for all k. This can be achieved in a number of ways (see Appendix C.2). In line with biology, we choose to treat excitatory weights onto inhibitory and excitatory units the same, and sample W^IE ~iid Exp(λ_E) and set W^EI ← 1/n_i. We note that for a DANN layer with a single inhibitory unit, e.g. at an output layer with 10 excitatory units, the noise inherent in sampling a single weight vector may result in a poor match between the excitatory and inhibitory inputs, so in this case we initialise W^IE explicitly as (1/n_e) Σ_{j=1}^{n_e} w^EE_{j,:} (where w^EE_{j,:} is the j-th row of W^EE). Next, we consider the relationship between this initialisation approach and normalisation schemes (Ba et al., 2016; Ioffe & Szegedy, 2015). Normalisation acts to both center and scale the unit activities in a network such that they have mean zero and variance one. The weight initialisation given above will produce centered activities at the start of training. We can also draw a connection between the divisive inhibition and standardisation if we assume that the elements of x are sampled from a rectified normal distribution, x ~iid max(0, N(0, σ²_{ℓ-1})). Under this assumption, the mean and standard deviation of the excitatory input are proportional (see Appendix D).
For example, if we consider the relationship c · E[(z_E)_k] = Var((z_E)_k)^{1/2} for each unit k, we get the scalar proportionality constant c = √(2π - 1)/√d, as:

E[(z_E)_k] = d · E[w^EE] E[x] = d · E[w^EE] σ_{ℓ-1}/√(2π)

Var((z_E)_k) = d · Var(w^EE)(E[x²] + Var(x)) = d · Var(w^EE) σ²_{ℓ-1} (2π - 1)/(2π)

with expectations taken over the data and the parameters, and where w^EE and x refer to any element of W^EE and x, respectively. Therefore, since E[w^EE]² = Var(w^EE) for weights drawn from an exponential distribution, we have

c = Var((z_E)_k)^{1/2} / E[(z_E)_k] = √(2π - 1)/√d

This proportionality means that the same neurons can perform both the centering and the standardisation operations. For DANNs, e^α will dictate the expected standard deviation of the layer's activation z, as it controls the proportionality between subtractive and divisive inhibition for each inhibitory unit. If e^α is set to c, then the divisive inhibition approximates dividing z_E by its standard deviation, since E[(z_E)_k] · c = E[(W^EI (e^α ⊙ z_I))_k] = E[γ_k]. We note that, due to the proportionality between the mean and standard deviation of z_E, other values of e^α will also control the layer's variance with depth. Given these considerations, we initialise e^α ← √(2π - 1)/√d, thereby achieving standardisation at initialisation. We find that these initialisation schemes enable DANNs to learn well. We next turn to the question of how to perform parameter updates in DANNs so that they learn well.
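This initialisation can be sketched as follows (our own illustrative NumPy code; the rate parameter λ_E passed in is an assumption of the sketch, with the choice of λ_E to control variance with depth left to Appendix C):

```python
import numpy as np

def init_dann_layer(d, n_e, n_i, lambda_E, rng):
    """Initialise a DANN layer so that subtractive inhibition centres
    the excitatory input and divisive inhibition standardises it."""
    # NumPy parameterises the exponential by scale = 1 / rate.
    W_EE = rng.exponential(1.0 / lambda_E, size=(n_e, d))
    W_IE = rng.exponential(1.0 / lambda_E, size=(n_i, d))
    W_EI = np.full((n_e, n_i), 1.0 / n_i)   # balances E[z_E] and E[W_EI z_I]
    # e^alpha = sqrt(2*pi - 1) / sqrt(d) standardises at initialisation
    alpha = np.full(n_i, np.log(np.sqrt(2.0 * np.pi - 1.0) / np.sqrt(d)))
    return W_EE, W_IE, W_EI, alpha
```

With rectified-normal inputs, the excitatory drive W^EE x and the subtractive term W^EI W^IE x then match in expectation at initialisation, which is the centering property described above.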

4. PARAMETER UPDATES FOR DALE'S ANNS

Unlike a layer in a column-constrained network, whose affine function is restricted by the sign-constrained columns, a layer in a DANN is not restricted in its potential function space. This is because excitatory inputs to a layer can still have an inhibitory impact via feedforward inhibition. However, the inhibitory interneuron architecture of DANN layers introduces disparities in the degree to which updates to different parameters affect the layer's output distribution. This can be seen intuitively: for example, if a single element of W^IE is updated, this has an effect on every element of z. Similarly, an update to w^EI_ij will change z_i depending on the alignment of x and all of the j-th inhibitory unit's weights. Therefore, instead of using the Euclidean metric to measure distance between parameter settings, we employ an alternative approach. Similar to natural gradient methods, we use an approximation of the Kullback-Leibler divergence (KL divergence) of the layer's output distribution as our metric. In order to help ensure that both excitatory and inhibitory parameter updates have similar impacts on the KL divergence, we scale the updates using correction terms derived below. We provide an extended derivation of these scaling factors in Appendix E. Given a probability distribution parameterised by some vector θ, a second-order approximation to the KL divergence for a change δ to the parameters θ is

D_KL(P(y|x; θ) ‖ P(y|x; θ + δ)) ≈ (1/2) δᵀ F(θ) δ    (4)

F(θ) = E_{x∼P(x), y∼P(y|x;θ)}[(∂ log P(y|x; θ)/∂θ)(∂ log P(y|x; θ)/∂θ)ᵀ]    (5)

where F(θ) is the Fisher information matrix (or just "the Fisher"). In order to calculate the Fisher for the parameters of a neural network, we must interpret the network's outputs in a probabilistic manner. One approach is to view a layer's activation as parameterising a conditional distribution from the natural exponential family, P(y|x; θ) = P(y|z), independent in each coordinate of y|z (similar to a GLM, and as done in Ba et al.
(2016)). The log likelihood of such a distribution can be written as

log P(y|x; θ) = (y ⊙ z - η(z))/φ + c(y, φ)    (6)

E[y|x; θ] = f(z) = η′(z),    Cov(y|x; θ) = diag(φ f′(z))

where f(z) is the activation function of the layer, and φ, η, and c define the particular distribution in the exponential family. Note that here we take η′(z) and f′(z) to denote ∂η/∂z and ∂f/∂z, respectively. In our networks, we have used softmax activation functions at the output and ReLU activation functions in the hidden layers. In this setting, the log likelihood of the softmax output layer would only be defined for a one-hot vector y, and would correspond to φ = 1, c(y, φ) = 0, and η(z) = log(Σ_i e^{z_i}). For the ReLU activation functions, the probabilistic model corresponds to a tobit regression model, in which y is a censored observation of a latent variable ŷ ∼ N(z, diag(φ f′(z))). In this case, one could consider either the censored or the pre-censored latent random variable, depending on modelling preference. As it fits well with the above framework, we analyze the pre-censored random variable ŷ, i.e. f(z) = z in equation 6. Returning to the general case, where we consider a layer's activation as parameterising a conditional distribution from the natural exponential family, the Fisher for a layer is:

F(θ) = E_{x∼P(x), y∼P(y|x;θ)}[(∂z/∂θ) ((y - η′(z))/φ) ((y - η′(z))/φ)ᵀ (∂z/∂θ)ᵀ]    (8)

     = E_{x∼P(x)}[(∂z/∂θ) (diag(f′(z))/φ) (∂z/∂θ)ᵀ]    (9)

To estimate the approximate KL divergence resulting from the simple case of perturbing an individual parameter θ̃ ∈ θ of a single-layer DANN, we only need to consider the diagonal entries of the Fisher:

D_KL(P_θ ‖ P_{θ+δ_θ̃}) ≈ (δ²/(2φ)) Σ_k^{n_e} E_{x∼P(x)}[f′(z_k) (∂z_k/∂θ̃)²]    (10)

where δ_θ̃ represents a one-hot vector corresponding to θ̃, multiplied by a scalar δ.
We now consider the approximate KL divergence after updates to a single element of W^EE, W^IE, W^EI, and α:

D_KL(P_θ ‖ P_{θ+δ_{W^EE_ij}}) ≈ (δ²/(2φ)) E[f′(z_i) ((g_i/γ_i) x_j)²]    (11)

D_KL(P_θ ‖ P_{θ+δ_{W^IE_ij}}) ≈ (δ²/(2φ)) Σ_k^{n_e} E[f′(z_k) ((g_k/γ_k) x_j)² (w^EI_ki a_ki)²]    (12)

D_KL(P_θ ‖ P_{θ+δ_{W^EI_ij}}) ≈ (δ²/(2φ)) Σ_n^d E[f′(z_i) ((g_i/γ_i) x_n)² (w^IE_jn a_ij)²]
    + (δ²/(2φ)) Σ_{n≠m}^d E[f′(z_i) (g_i/γ_i)² x_n x_m w^IE_jn w^IE_jm (a_ij)²]    (13)

D_KL(P_θ ‖ P_{θ+δ_{α_i}}) = (δ²/(2φ)) Σ_k^{n_e} Σ_j^d E[f′(z_k) ((g_k/γ_k) x_j)² (w^EI_ki w^IE_ij)² (a_ki - 1)²]
    + (δ²/(2φ)) Σ_k^{n_e} Σ_{n≠m}^d E[f′(z_k) (g_k/γ_k)² x_n x_m w^IE_in w^IE_im (w^EI_ki)² (a_ki - 1)²]    (14)

where a_kj = (e^{α_j}/γ_k)((z_E)_k - (W^EI z_I)_k) + 1, and expectations are over the data, E_{x∼P(x)}. Therefore, as a result of the feedforward inhibitory architecture of DANNs, for a parameter update δ, the effect on the model's distribution will differ depending on the parameter type that is updated. While the exact effect depends on the degree of covariance between terms, the most prevalent differences between and within the excitatory and inhibitory parameter types are the sums over the layer input and output dimensions. For example, an inhibitory weight update of δ to w^IE_ij is expected to change the model distribution approximately n_e times more than an excitatory weight update of δ to w^EE_ij. In order to balance the impact of updating different parameter types, we update DANN parameters after correcting for these terms: updates to W^IE are scaled by 1/√n_e, updates to W^EI by 1/d, and updates to α by 1/(d√n_e). As a result, inhibitory unit parameter updates are scaled down relative to excitatory parameter updates. This leads to an interesting connection to biology: while inhibitory neuron plasticity is well established, the rules and mechanisms governing synaptic updates differ from those of excitatory cells (Kullmann & Lamsa, 2007; Kullmann et al., 2012), and historically interneuron synapses were thought to be resistant to long-term weight changes (McBain et al., 1999).
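These scalings can be folded directly into the SGD step. The following is a minimal NumPy sketch (parameter names are ours; the clipping of weights back to zero anticipates the non-negativity constraint, as described in the experiments section):

```python
import numpy as np

def corrected_sgd_step(params, grads, lr, d, n_e):
    """One SGD step with the heuristic Fisher-derived scalings:
    W_IE updates scaled by 1/sqrt(n_e), W_EI by 1/d, alpha by
    1/(d*sqrt(n_e)). Weights are clipped back to non-negative values."""
    scale = {"W_EE": 1.0,
             "W_IE": 1.0 / np.sqrt(n_e),
             "W_EI": 1.0 / d,
             "alpha": 1.0 / (d * np.sqrt(n_e))}
    new_params = {}
    for name, p in params.items():
        p = p - lr * scale[name] * grads[name]
        if name != "alpha":              # weights must obey Dale's principle
            p = np.maximum(0.0, p)
        new_params[name] = p
    return new_params
```

The same gradient magnitude therefore moves inhibitory parameters by a smaller amount than excitatory ones, balancing their impact on the layer's output distribution.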
Next, we empirically verified that our heuristic correction factors captured the key differences between parameter-types in their impact on the KL divergence. To do this we compared parameter gradients before and after correction, to parameter gradients multiplied by an approximation of the diagonal of the Fisher inverse for each layer (which we refer to as Fisher corrected gradients), see Appendix F.3. The model was trained for 50 epochs on MNIST, and updated using the Fisher corrected gradients. Throughout training, we observed that the heuristic corrected gradients were more aligned to the Fisher corrected gradients than the uncorrected gradients were (Fig. 2 ). Thus, our derived correction factors help to balance the impact of excitatory and inhibitory updates on the network's behaviour. Below, we demonstrate that these corrections are key to getting DANNs to learn well. 

5. EXPERIMENTAL RESULTS

Having derived appropriate parameter initialisation and updates for DANNs, we now explore how they compare to traditional ANNs and ColumnEi models on simple benchmark datasets. In brief, we find that column constrained models perform poorly, failing even to achieve zero training-set error, whereas DANNs perform equivalently to traditional ANNs.

5.1. IMPLEMENTATION DETAILS

All models were composed of 4 layers: in general, 3 hidden layers of dimension 500 with ReLU activation functions, followed by a softmax output layer with 10 units. All experiments were run for 50 epochs with batch size 32. Unless stated otherwise, for DANNs and ColumnEi models, 50 inhibitory units were included per hidden layer. For DANN models, the softmax output layer was constructed with one inhibitory unit. For ColumnEi models, each hidden layer's pre-activation is z = Wx, where 500 columns of W were constrained to be positive and 50 negative (therefore, for ColumnEi models, h was of dimension 550). ColumnEi layer weights were initialised so that variance did not scale with depth and activations were centered (see Appendix C.1 for further details). All benchmark datasets (MNIST, Kuzushiji MNIST and Fashion MNIST) were pre-processed so that pixel values were in [0, 1]. Learning rates were selected according to validation error averaged over 3 random seeds, after a random search (Orion; Bouthillier et al. (2019); log-uniform over [1e-5, 10]; 100 trials; 10k validation split). Selected models were then trained with 6 random seeds and evaluated on the test set. Plots show mean training error per epoch, and mean test set error every 200 updates, over random seeds. Tables show final error mean ± standard deviation. For further implementation details and a link to the accompanying code, see Appendix F. Note that because our goal in this paper is not to achieve state-of-the-art performance, we did not apply regularisation techniques, such as dropout and weight decay, or common modifications to stochastic gradient descent (SGD). Instead, the goal of the experiments presented here was simply to determine whether, in the simplest test case scenario, DANNs can learn better than ColumnEi models and as well as traditional ANNs.
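For reference, the sign structure of a ColumnEi weight matrix can be sketched as follows (our own illustration; the depth-dependent scaling and centering of Appendix C.1 are omitted here). Each input unit's outgoing weights form one column, so fixing column signs makes each unit exclusively excitatory or inhibitory:

```python
import numpy as np

def init_columnei_weights(n_out, n_exc, n_inh, rng):
    """Weight matrix with n_exc all-positive and n_inh all-negative
    columns, so every input unit is purely excitatory or inhibitory."""
    W = np.abs(rng.normal(size=(n_out, n_exc + n_inh)))
    signs = np.concatenate([np.ones(n_exc), -np.ones(n_inh)])
    return W * signs            # broadcast the column signs
```

During training, the sign constraint must then be maintained, e.g. by clipping each column at zero from the appropriate side after every update.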

5.2. COMPARISON OF DANNS TO COLUMN-EI MODELS AND MLPS

Figure 3: Model comparison on the MNIST dataset (nc: no update corrections; 1i: one inhibitory unit per layer).

We first compared model performance on the MNIST dataset (Fig. 3). We observed that ColumnEi models generalised poorly, and failed to achieve 0% training error within the 50 epochs. This confirms that such models cannot learn as well as traditional ANNs. In contrast, we observed that DANNs performed equivalently to multi-layer perceptrons (MLPs), and even generalised marginally better. This was also the case for ColumnEi and DANN models constructed with more inhibitory units (Supp. Fig. 6, 100 inhibitory units per layer). In addition, performance was only slightly worse for DANNs with one inhibitory unit per layer. These results show that DANN performance is robust to different ratios of excitatory-to-inhibitory units. We also found that not correcting parameter updates using the corrections derived from the Fisher significantly impaired optimization, further verifying the correction factors (Fig. 3). Next, we compared DANN performance to MLPs trained with batch and layer normalization on more challenging benchmark datasets (Fig. 4). Again, we found that DANNs performed equivalently to these standard architectures, whereas ColumnEi models struggled to achieve acceptable performance. We also explored methods for improving DANN performance (Appendix F.4). First, in order to maintain the positive DANN weight constraint, if a weight was negative after a parameter update, we reset it to zero, i.e. θ ← max(0, θ); as a result, the actual update is no longer that suggested by SGD. We therefore experimented with temporarily reducing the learning rate whenever this parameter clipping would reduce the cosine of the angle between the gradient and the actual update below a certain constraint (see Appendix F.4).
Second, we note that the divisive inhibition term, γ, appears in the denominator of the weight gradients (Appendix E.2); therefore, if γ becomes small, the gradients will become large, potentially resulting in inappropriate parameter updates. We therefore wondered if constraining the gradient norm would be particularly effective for DANNs. We tested both of these modifications on DANNs trained on Fashion MNIST (Supp. Fig. 5). However, we found that they provided no observable improvement, indicating that the loss landscape and gradients were well behaved over optimization. Finally, we provide an analysis and preliminary experiments detailing how the DANN architecture described above may be extended to recurrent and convolutional neural networks in future work (Appendix B). In brief, we unroll recurrent networks over time and place inhibition between both network layers and timesteps, corresponding to fast feedforward and local recurrent inhibition, respectively. For convolutional architectures, we can directly apply the DANN formulation to activation maps if inhibitory and excitatory filters are of the same size and stride. Supporting this, we found that a DANN version of VGG16 (Simonyan & Zisserman, 2014) converged equivalently to a standard VGG16 architecture (Supp. Fig. 7). Altogether, our results demonstrate that: (1) the obvious approach to creating ANNs that obey Dale's principle (ColumnEi models) does not learn as well as traditional ANNs; (2) DANNs learn better than ColumnEi models and as well as traditional ANNs; and (3) DANN learning is significantly improved by appropriately scaling the updates to excitatory and inhibitory unit parameters.
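The learning-rate reduction tested above can be sketched as follows (an illustrative NumPy sketch of one parameter update; the cosine threshold, shrink factor, and retry limit are arbitrary choices of this sketch, not values from our experiments):

```python
import numpy as np

def clipped_update_with_backoff(p, grad, lr, min_cos=0.7, shrink=0.5, max_tries=10):
    """SGD step on non-negative weights: shrink the learning rate until
    the update after clipping at zero stays aligned (in cosine terms)
    with the raw gradient step."""
    for _ in range(max_tries):
        step = -lr * grad
        actual = np.maximum(0.0, p + step) - p   # update actually applied
        denom = np.linalg.norm(step) * np.linalg.norm(actual)
        cos = 1.0 if denom == 0 else float(step @ actual) / denom
        if cos >= min_cos:
            break
        lr *= shrink                             # clipping distorted the step too much
    return np.maximum(0.0, p + step)
```

When no weight would cross zero, the clipped update equals the raw SGD step and the loop exits immediately.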

6. DISCUSSION

Here we presented DANNs, a novel ANN architecture with separate inhibitory and excitatory units. We derived appropriate parameter initialisations and update rules, and showed experimentally that, unlike ANNs in which some weight matrix columns are simply constrained to be positive or negative, DANNs perform equivalently to traditional ANNs on benchmark datasets. These results are important because they are, as far as we know, the first example of an ANN architecture that fully adheres to Dale's principle without sacrificing learning performance. However, our results also raise an interesting question: why does nature employ Dale's principle? After all, we did not see any improvement over normal ANNs in our experiments. There are two possible hypotheses. First, it is possible that Dale's principle represents an evolutionary local minimum, whereby early phylogenetic choices led to constraints on the system that were difficult to escape via natural selection. Alternatively, Dale's principle may provide some computational benefit that we were unable to uncover given the specific tasks and architectures we used here. For example, it has been hypothesized that inhibition may help to prevent catastrophic forgetting (Barron et al., 2017). We consider exploring these questions an important avenue for future research. There are a number of additional avenues for future work building upon DANNs, the most obvious of which are to further extend and generalize DANNs to recurrent and convolutional neural networks (see Appendix B). It would also be interesting to explore the relative roles of subtractive and divisive inhibition. While subtractive inhibition is required for the unconstrained functional space of DANN layers, divisive inhibition may confer some of the same optimisation benefits as normalisation schemes.
A related issue would be to explore whether excitation and inhibition remain balanced during optimization. DANNs are initialised such that the two are balanced and inhibition approximates normalisation schemes, but the inhibitory parameters are updated during training, and the model is free to diverge from this initialisation. As a result, the distribution of layer activations may be unstable over successive parameter updates, potentially harming optimization. In the brain, a variety of homeostatic plasticity mechanisms stabilize neuronal activity; for example, reducing excitatory input naturally results in a reduction in inhibition in real neural circuits (Tien & Kerschensteiner, 2018). It would therefore be interesting to test the inclusion of a homeostatic loss that encourages inhibition to track excitation throughout training. Finally, we note that while fast feedforward inhibition in the mammalian cortex was the main source of inspiration for this work, future investigations may benefit from drawing on a broader range of neurobiology, for example by incorporating principles of invertebrate neural circuits, such as the mushroom bodies of insects (Serrano et al., 2013). In summary, DANNs sit at the intersection of a number of programs of research. First, they are a new architecture that obeys Dale's principle but can still learn well, allowing researchers to more directly compare trained ANNs to real neural data (Schrimpf et al., 2018; Yamins et al., 2014). Second, DANNs contribute to computational neuroscience and machine learning work on inhibitory interneurons in ANNs, and more generally to work on the role of inhibitory circuits and plasticity in neural computation (Song et al., 2016; Sacramento et al., 2018; Costa et al., 2017; Payeur et al., 2020; Atallah et al., 2012; Barron et al., 2017).
Finally, the inhibition in DANNs also has an interesting connection to normalisation methods used to improve learning in deep networks (Ioffe & Szegedy, 2015; Wu & He, 2018; Ba et al., 2016). As DANNs tie these distinct programs of research together into a single model, we hope they can serve as a basis for future research at the intersection of deep learning and neuroscience.

B EXTENSION OF DANNS TO OTHER ARCHITECTURES

Here we discuss how our results and analysis of fully-connected feedforward Dale's ANNs may be applied to convolutional and recurrent neural networks.

B.1 EXTENSION TO CONVOLUTIONAL NEURAL NETWORKS

Consider the response of a standard convolutional layer with n output channels and filters of size k × k, at a single spatial location j, given m input channels:

z_j = W x_j + b   (15)

Here W is an n × k²m matrix whose rows correspond to the kernel weights of each output channel, and the vector x_j of length k²m contains the input values over the m input channels at spatial location j. Concatenating the input patches x_j as the columns of a matrix X, the full output of the convolutional layer over all input locations can be expressed as Z = WX + b, where b is broadcast over the columns of Z. We can readily make an equivalent DANN formulation of a convolutional layer by assuming the same kernel size and stride for the excitatory and inhibitory filter sets W^{EE} and W^{IE}:

z_j = (g/γ)(W^{EE} x_j − W^{EI} W^{IE} x_j) + β,   γ = W^{EI}(e^α ⊙ W^{IE} x_j)   (16)

Here the inhibitory channels are mapped to each excitatory output channel by W^{EI} for subtractive inhibition, and are first scaled by e^α for divisive inhibition. For parameter initialisation, following the approach of He et al. (2015) and considering the response of the layer at a single location, we use the same initialisations as those derived in Section 3, but with the input dimension d given by the product of the kernel size and the number of input channels, k²m. Next, the correction factors for parameter updates apply as in Section 4: the KL divergence is summed over each valid input location j, which results in approximately the same multiplicative factor for each parameter and does not change the approximate relative differences between parameter types:

D_KL(P_θ ∥ P_{θ+δθ}) ≈ Σ_j (δ²/2φ) Σ_i^n E_{x∼P(x)}[ f′(z_{i,j}) (∂z_{i,j}/∂θ̃)² ]   (17)

where we consider the full response Z of the layer over all valid kernel locations. To confirm this extension to convolutional neural networks, we conducted preliminary experiments with DANN versions of convolutional networks as described above.
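As a concrete illustration, the DANN convolutional response above can be sketched by treating the layer as a matrix product over flattened input patches. This is a minimal numpy sketch, not the paper's implementation; the function name and argument layout are our own, and extracting the patch matrix X (e.g. via an im2col/unfold step) is assumed to happen elsewhere.

```python
import numpy as np

def dann_conv_response(X, W_EE, W_IE, W_EI, alpha, g, beta):
    """DANN convolutional response at all spatial locations.

    X:    (k*k*m, L) matrix whose columns are the flattened input
          patches x_j at each of the L valid kernel locations.
    W_EE: (n_e, k*k*m) excitatory filters; W_IE: (n_i, k*k*m)
          inhibitory filters; W_EI: (n_e, n_i) inhibitory-to-excitatory
          weights; alpha: (n_i,) log divisive scales; g, beta: (n_e,).
    """
    zE = W_EE @ X                                   # excitatory drive, (n_e, L)
    zI = W_IE @ X                                   # inhibitory unit activity, (n_i, L)
    gamma = W_EI @ (np.exp(alpha)[:, None] * zI)    # divisive inhibition, (n_e, L)
    return (g[:, None] / gamma) * (zE - W_EI @ zI) + beta[:, None]
```

Because the same filters are applied at every location, sharing X's columns reproduces the weight sharing of the convolution.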
Below, we show results of training a standard VGG16 architecture and a DANN version of VGG16 on CIFAR-10 (Supp. Fig. 7). As can be seen, the DANN trains approximately as well as the standard VGG16 model.

B.2 EXTENSION TO RECURRENT NEURAL NETWORKS

We can readily make a connection between the fully-connected Dale's ANNs described in Section 2.1 and recurrent neural networks (RNNs) by considering the similarities between depth and time. As has been previously noted, a shallow RNN unrolled over time can be expressed as a deep neural network with weight sharing (Liao & Poggio, 2016):

h_t = f(z_t),   z_t = (g_t/γ_t) Ŵ h_{t−1} + β
where Ŵ = W^{EE} − W^{EI}W^{IE},   γ_t = W^{EI}(e^α ⊙ W^{IE} h_{t−1})

where in this simple case the recurrent processing steps are applied to the input x = h_0. In this view, layer depth corresponds to time, and inhibition between layers corresponds to fast feedback inhibition. We note that if there is a sequence of inputs x_t arriving at each time step, this formulation still holds, with a simple modification to incorporate the time-varying inputs. Specifically, we need to add additional input weights Û:

h_t = f(z_t),   z_t = (g_t/γ_t) Ŵ h_{t−1} + (g_x/γ_x) Û x_t + β
where Ŵ = W^{EE} − W^{EI}W^{IE},   γ_t = W^{EI}(e^{α_t} ⊙ W^{IE} h_{t−1})
      Û = U^{EE} − U^{EI}U^{IE},   γ_x = U^{EI}(e^{α_x} ⊙ U^{IE} x_t)

All of the DANN approaches developed above can be applied to this case.
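A minimal numpy sketch of one recurrent DANN step with time-varying input, following the equations above. The dictionary layout, the parameter names, and the relu nonlinearity are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dann_rnn_step(h, x, p):
    """One DANN RNN step: recurrent drive from h_{t-1} plus input
    drive from x_t, each with its own subtractive and divisive
    inhibition, followed by a relu nonlinearity f."""
    # recurrent excitation / inhibition on h_{t-1}
    zE_h, zI_h = p["W_EE"] @ h, p["W_IE"] @ h
    gamma_h = p["W_EI"] @ (np.exp(p["alpha_t"]) * zI_h)
    # feedforward excitation / inhibition on x_t
    zE_x, zI_x = p["U_EE"] @ x, p["U_IE"] @ x
    gamma_x = p["U_EI"] @ (np.exp(p["alpha_x"]) * zI_x)
    z = (p["g_t"] / gamma_h) * (zE_h - p["W_EI"] @ zI_h) \
        + (p["g_x"] / gamma_x) * (zE_x - p["U_EI"] @ zI_x) + p["beta"]
    return np.maximum(z, 0.0)
```

Unrolling this step over time recovers the deep-network-with-weight-sharing view described in the text.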

C PARAMETER INITIALISATIONS

In this section we provide further details regarding parameter initialisations. Throughout we assume that the elements of the input x to a layer are iid, being the output of layer ℓ−1 whose pre-activations were distributed N(0, σ²_{ℓ−1}). Therefore x follows a rectified normal distribution:

E[x] = ∫₀^∞ x · e^{−x²/2σ²_{ℓ−1}} / (σ_{ℓ−1}√(2π)) dx = σ_{ℓ−1}/√(2π)
E[x²] = ∫₀^∞ x² · e^{−x²/2σ²_{ℓ−1}} / (σ_{ℓ−1}√(2π)) dx = σ²_{ℓ−1}/2
Var(x) = E[x²] − E[x]² = σ²_{ℓ−1} (π−1)/(2π)
Var(x) + E[x²] = σ²_{ℓ−1} (2π−1)/(2π)

where here, and throughout the text, a non-indexed, non-bold symbol refers to any element of a vector or matrix, e.g. E[x] refers to the expectation of any element of x, Var(x) to the variance of any element of x, etc. In addition, for all models we draw positively constrained weights iid from exponential distributions, and make use of the following properties for w ∼ Exp(λ):

E[w] = 1/λ,   Var(w) = 1/λ² = E[w]²,   E[w²] = 2/λ² = 2Var(w) = 2E[w]²
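The rectified-normal moments above can be verified by Monte-Carlo sampling. A small numpy sketch, where the sample size and tolerances are arbitrary choices:

```python
import numpy as np

# Monte-Carlo check of the rectified-normal moments used above,
# for x = relu(u) with u ~ N(0, sigma^2).
rng = np.random.default_rng(0)
sigma = 1.5
x = np.maximum(rng.normal(0.0, sigma, 2_000_000), 0.0)

assert np.isclose(x.mean(), sigma / np.sqrt(2 * np.pi), rtol=1e-2)
assert np.isclose((x**2).mean(), sigma**2 / 2, rtol=1e-2)
assert np.isclose(x.var(), sigma**2 * (np.pi - 1) / (2 * np.pi), rtol=1e-2)
```

The same approach checks the exponential-weight moments, since `rng.exponential(1/lam)` draws w ∼ Exp(λ) with E[w] = 1/λ (note numpy parameterises by the scale 1/λ, not the rate).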

C.1 COLUMN CONSTRAINED EI MODELS AND WEIGHT INITIALISATION

Here we provide detail on the parameter initialisation of column-constrained models. Layer activations are z = Wx, where each column of W is constrained to be positive or negative. For convenience, denote W = [W⁺, W⁻] and x = [x^E, x^I], and assume x^E_i, x^I_j are iid ∀ i, j. Note that for this model n_e + n_i = d, the input dimensionality. As for DANN models, throughout training we preserve the sign constraints of the weights by rectifying around zero after each update, i.e. W⁺ ← max(0, W⁺), W⁻ ← min(0, W⁻). At initialisation we require, for each layer, E[z_k] = 0 and Var(z_k) = σ²_{ℓ−1}. For the mean:

E[z_k] = n_e E[w⁺]E[x] − n_i E[w⁻]E[x] = 0
⟹ n_e E[w⁺]E[x] = n_i E[w⁻]E[x]
⟹ E[w⁻] = E[w⁺] n_e/n_i

where w⁺, w⁻ refer to the magnitudes of any element of W⁺, W⁻. For the variance:

Var(z_k) = Σ_i^{n_e} Var(w⁺_{ki} x^E_i) + Σ_j^{n_i} Var(w⁻_{kj} x^I_j)
= n_e Var(w⁺x) + n_i Var(w⁻x)
= n_e (E[w⁺]²Var(x) + Var(w⁺)E[x]² + Var(w⁺)Var(x)) + n_i (E[w⁻]²Var(x) + Var(w⁻)E[x]² + Var(w⁻)Var(x))

As weights are drawn from an exponential distribution, Var(w⁺) = E[w⁺]², so

Var(z_k) = n_e E[w⁺]²(2Var(x) + E[x]²) + n_i E[w⁻]²(2Var(x) + E[x]²)
= (E[x²] + Var(x))(n_e E[w⁺]² + n_i E[w⁻]²)
= σ²_{ℓ−1} ((2π−1)/(2π)) E[w⁺]² (n_e + n_e²/n_i)

Therefore, setting Var(z_k) = σ²_{ℓ−1}:

E[w⁺] = 1/√( ((2π−1)/(2π))(n_e + n_e²/n_i) )

Note that as the input to the network is all positive, the first weight matrix has no negative columns. We therefore use the bias vector to center the activations of the first layer (in other layers it is initialised to zero). Since E[z_k] = n_e E[w⁺]E[x] + β_k, we initialise all elements of β to −n_e E[w⁺]E[x].
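A numeric sanity check of this initialisation: drawing weights with the derived E[w⁺] (taking σ_{ℓ−1} = 1) should give pre-activations with mean ≈ 0 and variance ≈ 1. The sizes below are arbitrary, and the check is over both the data and the parameter distributions.

```python
import numpy as np

# Monte-Carlo check of the column-constrained initialisation.
rng = np.random.default_rng(1)
n_e, n_i = 784, 196
d = n_e + n_i                                     # input dimensionality
k = (2 * np.pi - 1) / (2 * np.pi)
Ew_pos = 1.0 / np.sqrt(k * (n_e + n_e**2 / n_i))  # derived E[w+]
Ew_neg = Ew_pos * n_e / n_i                       # derived E[w-] (magnitude)

n_units, n_samples = 200, 2000
W_pos = rng.exponential(Ew_pos, (n_units, n_e))   # positive columns
W_neg = -rng.exponential(Ew_neg, (n_units, n_i))  # negative columns
x = np.maximum(rng.normal(0, 1, (d, n_samples)), 0)  # rectified normal input
z = W_pos @ x[:n_e] + W_neg @ x[n_e:]

assert abs(z.mean()) < 0.15       # E[z] ~ 0
assert 0.75 < z.var() < 1.25      # Var(z) ~ sigma^2 = 1
```

The residual fluctuations come from sampling finitely many units and inputs; the balance between positive and negative columns holds only in expectation.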

C.2 INITIALISATION OF DANN INHIBITORY WEIGHTS FOR BALANCED EXCITATION AND SUBTRACTIVE INHIBITION

Here we provide details of the inhibitory parameter initialisation such that E[z^E_k] = E[(W^{EI} z^I)_k], for W^{EE} ∼iid Exp(λ_E):

E[z^E_k] = E[Σ_i^d w^{EE}_{ki} x_i] = d (1/λ_E) E[x]
E[(W^{EI} z^I)_k] = E[Σ_j^{n_i} w^{EI}_{kj} Σ_i^d w^{IE}_{ji} x_i] = n_i E[w^{EI}] d E[w^{IE}] E[x]

These expectations are equal when both sets of excitatory weights are drawn from the same distribution, W^{IE} ∼iid Exp(λ_E), and W^{EI} ← 1/n_i. Alternatively, the inhibitory weights can both be drawn from the same distribution, W^{IE}, W^{EI} ∼iid Exp(√(λ_E n_i)). Note that although the above always holds in expectation, in the case of multiple inhibitory units we can appeal to the law of large numbers to conclude that the subtractive inhibition and the excitatory input will be approximately equal. Note that while this initialisation is general to different settings of λ_E, we initialise λ_E ← √(d(2π−1)/(2π)) (see Section D.1).
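This balance can likewise be checked numerically. The sketch below uses W^{IE} ∼ Exp(λ_E) with W^{EI} ← 1/n_i, and compares the mean excitatory drive with the mean subtractive inhibition; the sizes are arbitrary.

```python
import numpy as np

# Check that excitation and subtractive inhibition balance in
# expectation at initialisation.
rng = np.random.default_rng(2)
d, n_e, n_i, n_samples = 784, 100, 10, 2000
lam_E = np.sqrt(d * (2 * np.pi - 1) / (2 * np.pi))
W_EE = rng.exponential(1 / lam_E, (n_e, d))
W_IE = rng.exponential(1 / lam_E, (n_i, d))
W_EI = np.full((n_e, n_i), 1 / n_i)

x = np.maximum(rng.normal(0, 1, (d, n_samples)), 0)  # rectified normal input
zE = W_EE @ x                                        # excitatory drive
inhib = W_EI @ (W_IE @ x)                            # subtractive inhibition

assert np.isclose(zE.mean(), inhib.mean(), rtol=0.1)
```

Note that the equality of means holds for any λ_E (the scale cancels); the specific λ_E above additionally sets Var(z^E) ≈ σ²_{ℓ−1}.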

D PROPORTIONAL RELATIONSHIP BETWEEN EXCITATORY INPUT MEAN AND STANDARD DEVIATION

Here we provide further details regarding the proportionality between the mean and standard deviation of z^E. The proportionality constant depends on which statistic or distribution is of interest for the activations (e.g. layer statistics or unit batch statistics, as in layer and batch normalisation).

D.1 UNIT STATISTICS OVER DATA AND PARAMETER DISTRIBUTIONS

As discussed in the main text, if we consider c · E[z^E_k] = Var(z^E_k)^{1/2} for a unit k, with expectation taken over the data and parameters, then c = √(2π−1)/√d:

E[z^E_k] = d E[w^{EE}] E[x] = d E[w^{EE}] σ_{ℓ−1}/√(2π)

Var(z^E_k) = Var(Σ_i^d w^{EE}_{ki} x_i) = d Var(w^{EE} x)
= d (Var(w^{EE})E[x²] + Var(x)E[w^{EE}]²)
= d Var(w^{EE})(E[x²] + Var(x))
= d Var(w^{EE}) σ²_{ℓ−1} (2π−1)/(2π)

where E[w^{EE}]² = Var(w^{EE}) for weights drawn from an exponential distribution. Therefore E[z^E_k] · c = Var(z^E_k)^{1/2} with

c = √(2π−1)/√d

Additionally, we see that for Var(z^E_k) = σ²_{ℓ−1}, the variance of the distribution from which the elements of W^{EE} are drawn should be Var(w^{EE}) = 2π/(d(2π−1)), and so we can set λ_E ← √(d(2π−1)/(2π)) to obtain Var(z^E_k) = σ²_{ℓ−1}.
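A quick Monte-Carlo check of the constant c = √(2π−1)/√d, treating each row as a freshly drawn unit so that the statistics are taken over both the data and the parameters (sizes arbitrary, σ_{ℓ−1} = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_units, n_samples = 784, 400, 2000
lam_E = np.sqrt(d * (2 * np.pi - 1) / (2 * np.pi))
W = rng.exponential(1 / lam_E, (n_units, d))         # W^EE ~ Exp(lam_E)
x = np.maximum(rng.normal(0, 1, (d, n_samples)), 0)  # rectified normal input
zE = W @ x

c = np.sqrt(2 * np.pi - 1) / np.sqrt(d)
assert np.isclose(c * zE.mean(), zE.std(), rtol=0.1)  # c * mean ~ std
assert np.isclose(zE.std(), 1.0, rtol=0.1)            # Var(z^E) ~ sigma^2 = 1
```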

D.2 UNIT STATISTICS OVER THE DATA DISTRIBUTION

If instead we consider a unit k with excitatory weights w^{EE}_k, and take the expectation and variance only over the data, we have the approximation:

E[z^E_k] = E[x] Σ_i^d w^{EE}_{ki} ≈ d E[x] E[w^{EE}] = d (σ_{ℓ−1}/√(2π)) E[w^{EE}]

Likewise, the variance over the data can be approximated as:

Var(z^E_k) = Var(x) Σ_i^d (w^{EE}_{ki})² ≈ d Var(x) E[(w^{EE})²] = d σ²_{ℓ−1} ((π−1)/(2π)) · 2 E[w^{EE}]²

Therefore E_{x∼p(x)}[z^E_k] · c = Var_{x∼p(x)}(z^E_k)^{1/2} with c ≈ √(2π−2)/√d.

D.3 LAYER STATISTICS OVER THE DATA AND PARAMETER DISTRIBUTIONS

Alternatively, we can consider the mean and standard deviation of the layer statistics µ_{z^E}, σ_{z^E}, as would be calculated if one were to apply layer normalisation to z^E. Here again these statistics are proportionally related, but with the constant √π/√d. If we were to apply layer normalisation to z^E, the layer statistics would be:

z = (g/σ_{z^E})(z^E − µ_{z^E}) + β
µ_{z^E} = (1/n_e) Σ_j^{n_e} z^E_j = (1/n_e) Σ_j^{n_e} w^{EE}_{j,:} x
σ²_{z^E} = 1/(n_e−1) Σ_j^{n_e} (z^E_j − µ_{z^E})²

We now derive the proportional relationship E[µ_{z^E}] · √π/√d = E[σ²_{z^E}]^{1/2}. The expectation E[µ_{z^E}] is straightforward:

E[µ_{z^E}] = d E[w^{EE}] E[x]

Turning to the derivation of E[σ²_{z^E}]:

E[σ²_{z^E}] = E[ 1/(n_e−1) Σ_i^{n_e} (z^E_i − µ_{z^E})² ]   (35)
= 1/(n_e−1) Σ_i^{n_e} E[ (w^{EE}_{i,:} x − (1/n_e) Σ_j^{n_e} w^{EE}_{j,:} x)² ]   (36)
= 1/(n_e−1) Σ_i^{n_e} E[(ẑ_i)²]   (37)

where we have defined ẑ_i = w^{EE}_{i,:} x − (1/n_e) Σ_j^{n_e} w^{EE}_{j,:} x. We can obtain E[(ẑ_i)²] by deriving E[ẑ_i] and Var(ẑ_i). As

ŵ_{ij} = w^{EE}_{ij} − (1/n_e) Σ_k^{n_e} w^{EE}_{kj} = (1/n_e) Σ_{k=1, k≠i}^{n_e} (w^{EE}_{ij} − w^{EE}_{kj})   (38)

we see that E[ŵ_{ij}] = 0, and therefore E[ẑ_i] = 0. For Var(ẑ_i) we start with Var(ŵ_{ij}).
For i ≤ n_e we calculate Var(ẑ_i), keeping in mind that for i ≤ n_e, j ≤ d, the x_j are iid and, from equation (38), the ŵ_{ij} are iid in the coordinate j:

Var(ŵ_{ij}) = (1/n_e²) Var( Σ_{k=1, k≠i}^{n_e} (w^{EE}_{ij} − w^{EE}_{kj}) )   (39)
= (1/n_e²) [ Σ_{k≠i} Var(w^{EE}_{ij} − w^{EE}_{kj}) + Σ_{k≠k′, k,k′≠i} Cov(w^{EE}_{ij} − w^{EE}_{kj}, w^{EE}_{ij} − w^{EE}_{k′j}) ]   (40)
= (1/n_e²) [ (n_e−1) · 2Var(w^{EE}) + (n_e−1)(n_e−2) Var(w^{EE}) ]   (41)
= ((n_e−1)/n_e) Var(w^{EE})   (42)

so that

Var(ẑ_i) = d (Var(ŵ)Var(x) + Var(ŵ)E[x]² + Var(x)E[ŵ]²)   (44)
= d (Var(ŵ)E[x²] + Var(x)E[ŵ]²) = d Var(ŵ) E[x²]   (45)
= (d(n_e−1)/n_e) Var(w^{EE}) E[x²]   (46)

Now putting these terms together we can derive E[σ²_{z^E}]:

E[σ²_{z^E}] = 1/(n_e−1) Σ_i^{n_e} E[(ẑ_i)²]   (47)
= 1/(n_e−1) Σ_i^{n_e} Var(ẑ_i)   (48)
= d Var(w^{EE}) E[x²]   (49)

Therefore, returning to E[µ_{z^E}] · c = E[σ²_{z^E}]^{1/2}, and keeping in mind that the variance of an exponential random variable is its mean squared:

c = (d Var(w^{EE}) E[x²])^{1/2} / (d E[w^{EE}] E[x]) = √(E[x²]) / (√d E[x])

We have assumed that x follows a rectified normal distribution, so E[x] = σ_{ℓ−1}/√(2π) and E[x²] = σ²_{ℓ−1}/2, resulting in:

c = √π/√d

We note that for a DANN layer with a single inhibitory unit, µ_{z^E} = z^I, as W^{IE} ← (1/n_e) Σ_j^{n_e} w^{EE}_{j,:} and W^{EI} ← 1. Therefore DANN divisive inhibition γ can be made equal, in expectation at initialisation, to the layer standard deviation by setting e^α ← c. These calculations also apply in the case of multiple interneurons if one makes the approximation µ_{z^E} ≈ (W^{EI} W^{IE} x)_i for any i.

E PARAMETER UPDATES AND FISHER INFORMATION MATRIX E.1 LAYER FISHER INFORMATION MATRIX

We view a layer's activations as parameterising a conditional distribution from the exponential family, P(y|x; θ) = P(y|z), independent in each coordinate of y:

log P(y|x; θ) = (y·z − η(z))/φ + c(y, φ)   (52)
E[y|x; θ] = f(z) = η′(z),   Cov(y|x; θ) = diag(φ f′(z))

where f(z) is the activation function of the layer, and φ, η, c define the particular distribution in the exponential family. Note we take η′(z), f′(z) to denote ∂η/∂z, ∂f/∂z. F(θ) is defined as:

F(θ) = E_{x∼P(x), y∼P(y|x;θ)} [ (∂log P(y|x;θ)/∂θ)(∂log P(y|x;θ)/∂θ)^T ]   (54)

As

∂/∂θ log P(y|x;θ) = (∂z/∂θ) ∂/∂z [ (y·z − η(z))/φ + c(y,φ) ] = (1/φ)(∂z/∂θ)(y − ∂η/∂z)

we have

F(θ) = E_{x∼P(x), y∼P(y|x;θ)} [ (∂z/∂θ) ((y−η′(z))/φ) ((y−η′(z))/φ)^T (∂z/∂θ)^T ]   (56)
= E_{x∼P(x)} [ (∂z/∂θ) E_{y∼P(y|x;θ)}[ ((y−η′(z))/φ)((y−η′(z))/φ)^T | x; θ ] (∂z/∂θ)^T ]   (57)
= E_{x∼P(x)} [ (∂z/∂θ) (Cov(y|x;θ)/φ²) (∂z/∂θ)^T ]   (58)
= E_{x∼P(x)} [ (∂z/∂θ) (diag(f′(z))/φ) (∂z/∂θ)^T ]

where we recognise that the covariance matrix is diagonal: Cov(y|x;θ) = diag(Var(y_1|x;θ), …, Var(y_{n_e}|x;θ)). To analyse the approximate KL divergence resulting from the simple case of perturbing individual parameters of a single-layer DANN, we only need to consider the diagonal entries of the Fisher:

D_KL(P_θ ∥ P_{θ+δθ}) ≈ (1/2) δ_θ^T E_{x∼P(x)}[ (∂z/∂θ) (diag(f′(z))/φ) (∂z/∂θ)^T ] δ_θ   (60)
= (δ²/2φ) E_{x∼P(x)}[ (∂z/∂θ̃) diag(f′(z)) (∂z/∂θ̃)^T ]   (61)
= (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k) (∂z_k/∂θ̃)² ]   (62)

where δ_θ represents a one-hot vector corresponding to the parameter θ̃, multiplied by a scalar δ.

E.2 DERIVATIVES

Here we provide derivatives of the DANN layer activations with respect to the different parameter groups. The layer activations can be written

z = (g/γ)(z^E − W^{EI}z^I) + β,   where z^E = W^{EE}x,  z^I = W^{IE}x,  γ = W^{EI}(e^α ⊙ z^I)   (64)

Note ∂z_k/∂w^{EE}_{ij} = ∂z_k/∂w^{EI}_{ij} = 0 for k ≠ i.

∂z_i/∂w^{EE}_{ij} = ∂/∂w^{EE}_{ij} [ (g_i/γ_i)(z^E_i − (W^{EI}z^I)_i) + β_i ]   (65)
= (g_i/γ_i) ∂z^E_i/∂w^{EE}_{ij}   (66)
= (g_i/γ_i) x_j   (67)

∂z_i/∂w^{EI}_{ij} = ∂/∂w^{EI}_{ij} [ (g_i/γ_i)(z^E_i − (W^{EI}z^I)_i) + β_i ]   (68)
= −(g_i/γ_i²)(∂γ_i/∂w^{EI}_{ij})(z^E_i − (W^{EI}z^I)_i) − (g_i/γ_i) z^I_j   (69)
= −(g_i/γ_i²) e^{α_j} z^I_j (z^E_i − (W^{EI}z^I)_i) − (g_i/γ_i) z^I_j   (70)
= −(g_i/γ_i) z^I_j [ (e^{α_j}/γ_i)(z^E_i − (W^{EI}z^I)_i) + 1 ]   (71)
= −Σ_k^d (g_i/γ_i) w^{IE}_{jk} x_k [ (e^{α_j}/γ_i)(z^E_i − (W^{EI}z^I)_i) + 1 ]   (72)

In contrast, ∂z_k/∂w^{IE}_{ij} and ∂z_k/∂α_j are in general non-zero for all k:

∂z_k/∂w^{IE}_{ij} = ∂/∂w^{IE}_{ij} [ (g_k/γ_k)(z^E_k − (W^{EI}z^I)_k) + β_k ]   (73)
= −(g_k/γ_k²)(∂γ_k/∂w^{IE}_{ij})(z^E_k − (W^{EI}z^I)_k) − (g_k/γ_k) w^{EI}_{ki} x_j   (74)
= −(g_k/γ_k²) e^{α_i} w^{EI}_{ki} x_j (z^E_k − (W^{EI}z^I)_k) − (g_k/γ_k) w^{EI}_{ki} x_j   (75)
= −(g_k/γ_k) w^{EI}_{ki} x_j [ (e^{α_i}/γ_k)(z^E_k − (W^{EI}z^I)_k) + 1 ]   (76)

∂z_i/∂α_j = ∂/∂α_j [ (g_i/γ_i)(z^E_i − (W^{EI}z^I)_i) + β_i ]   (77)
= −(g_i/γ_i²)(∂γ_i/∂α_j)(z^E_i − (W^{EI}z^I)_i)   (78)
= −(g_i/γ_i²) w^{EI}_{ij} e^{α_j} z^I_j (z^E_i − (W^{EI}z^I)_i)   (79)
= −Σ_k^d (g_i/γ_i) w^{EI}_{ij} w^{IE}_{jk} x_k (e^{α_j}/γ_i)(z^E_i − (W^{EI}z^I)_i)   (80)

∂z_i/∂g_i = (1/γ_i)(z^E_i − (W^{EI}z^I)_i)   (81)
∂z_i/∂β_i = 1   (82)
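The closed-form derivative for W^{EE} (equation 67) can be checked against a finite difference. A small numpy sketch with an illustrative layer; the function name and sizes are our own:

```python
import numpy as np

def dann_layer(x, W_EE, W_IE, W_EI, alpha, g, beta):
    """Pre-activations of a DANN layer, following equation (64)."""
    zE, zI = W_EE @ x, W_IE @ x
    gamma = W_EI @ (np.exp(alpha) * zI)
    return (g / gamma) * (zE - W_EI @ zI) + beta

rng = np.random.default_rng(4)
d, n_e, n_i = 20, 5, 2
x = rng.random(d) + 0.1
W_EE = rng.exponential(0.1, (n_e, d))
W_IE = rng.exponential(0.1, (n_i, d))
W_EI = rng.random((n_e, n_i)) + 0.1
alpha, g, beta = rng.normal(size=n_i), np.ones(n_e), np.zeros(n_e)

# analytic derivative: dz_i/dW_EE[i, j] = (g_i / gamma_i) * x_j
zI = W_IE @ x
gamma = W_EI @ (np.exp(alpha) * zI)
i, j, eps = 3, 7, 1e-6
W_pert = W_EE.copy(); W_pert[i, j] += eps
num = (dann_layer(x, W_pert, W_IE, W_EI, alpha, g, beta)
       - dann_layer(x, W_EE, W_IE, W_EI, alpha, g, beta))[i] / eps
assert np.isclose(num, (g[i] / gamma[i]) * x[j], rtol=1e-4)
```

Since z is linear in W^{EE}, the finite difference matches the analytic form up to floating-point error; the W^{IE}, W^{EI}, and α derivatives can be checked the same way.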

E.3 APPROXIMATE KL DIVERGENCE FOR WEIGHT UPDATES

If we consider an update to an element ij of W^{EE}, the approximate KL divergence is

D_KL(P_θ ∥ P_{θ+δ_{W^{EE}_{ij}}}) ≈ (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(∂z_k/∂w^{EE}_{ij})² ]
= (δ²/2φ) E_{x∼P(x)}[ f′(z_i)((g_i/γ_i) x_j)² ]   (84)

as ∂z_k/∂w^{EE}_{ij} = 0 for k ≠ i. In contrast, for an update to an element ij of W^{IE} we sum over n_e terms, as ∂z_k/∂w^{IE}_{ij} ≠ 0 for all k:

D_KL(P_θ ∥ P_{θ+δ_{W^{IE}_{ij}}}) ≈ (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(∂z_k/∂w^{IE}_{ij})² ]
= (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(−(g_k/γ_k) w^{EI}_{ki} x_j a_{ki})² ]
= (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)((g_k/γ_k) x_j)² (w^{EI}_{ki} a_{ki})² ]

where a_{kj} = (e^{α_j}/γ_k)(z^E_k − (W^{EI}z^I)_k) + 1. For an update δ_{W^{EI}_{ij}}, while ∂z_k/∂w^{EI}_{ij} = 0 for k ≠ i, the derivative contains a z^I_j term, so there is instead a squared sum over d terms:

D_KL(P_θ ∥ P_{θ+δ_{W^{EI}_{ij}}}) ≈ (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(∂z_k/∂w^{EI}_{ij})² ]
= (δ²/2φ) E_{x∼P(x)}[ f′(z_i)(∂z_i/∂w^{EI}_{ij})² ]
= (δ²/2φ) E_{x∼P(x)}[ f′(z_i)(−(g_i/γ_i) z^I_j a_{ij})² ]
= (δ²/2φ) E_{x∼P(x)}[ f′(z_i)(z^I_j)² (g_i/γ_i)² (a_{ij})² ]
= (δ²/2φ) E_{x∼P(x)}[ f′(z_i)(Σ_n^d w^{IE}_{jn} x_n)² (g_i/γ_i)² (a_{ij})² ]
= (δ²/2φ) Σ_n^d E_{x∼P(x)}[ f′(z_i)(w^{IE}_{jn})²(x_n)² (g_i/γ_i)² (a_{ij})² ] + (δ²/2φ) Σ_{n≠m}^d E_{x∼P(x)}[ f′(z_i) w^{IE}_{jn} w^{IE}_{jm} x_n x_m (g_i/γ_i)² (a_{ij})² ]

Finally, for α:

D_KL(P_θ ∥ P_{θ+δ_{α_i}}) ≈ (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(∂z_k/∂α_i)² ]
= (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(−Σ_j^d (g_k/γ_k) w^{EI}_{ki} w^{IE}_{ij} x_j (a_{ki}−1))² ]
= (δ²/2φ) Σ_k^{n_e} E_{x∼P(x)}[ f′(z_k)(g_k/γ_k)² (w^{EI}_{ki})² (Σ_{n,m}^d w^{IE}_{in} w^{IE}_{im} x_n x_m)(a_{ki}−1)² ]

F.2 PARAMETER UPDATES

For DANN parameter updates we used the algorithms detailed below. Note that gradients were corrected as detailed in Section 4 (see Algorithm 3). All algorithms below are computed using the loss gradients ∇θ of each parameter θ in a given model, computed on a minibatch sample.

Algorithm 2 Parameter updates
Require: learning rate η, updates ∆θ
for each layer ℓ do
  W^{EE} ← W^{EE} − η∆W^{EE}
  W^{IE} ← W^{IE} − η∆W^{IE}
  W^{EI} ← W^{EI} − η∆W^{EI}
  α ← α − η∆α
  g ← g − η∆g
  β ← β − η∆β
  W^{EE} ← max(W^{EE}, 0)
  W^{IE} ← max(W^{IE}, 0)
  W^{EI} ← max(W^{EI}, 0)
  g ← max(g, 0)
end for
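Algorithm 2 amounts to plain SGD followed by rectification of the sign-constrained parameter groups. A minimal numpy sketch; the function name and dictionary layout are illustrative:

```python
import numpy as np

def dann_sgd_step(params, grads, eta):
    """One update following Algorithm 2: gradient descent on every
    parameter group, then clip the sign-constrained groups at zero."""
    for name in params:
        params[name] = params[name] - eta * grads[name]
    for name in ("W_EE", "W_IE", "W_EI", "g"):  # keep non-negative
        params[name] = np.maximum(params[name], 0.0)
    return params
```

Note that α and β are left unconstrained, so divisive inhibition can still be scaled down (e^α > 0 always) and biases can take either sign.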

F.3 DANN GRADIENT CORRECTION ALGORITHMS

For the majority of experiments we scaled gradients using the heuristic correction terms derived in Section 4 (see also Appendix E). In this case we applied the following algorithm before Algorithm 2. The learning-rate scaling method temporarily reduces the learning rate whenever parameter clipping would reduce the cosine of the angle, made between the gradient step and the actual (clipped) update, below a chosen constraint. When clipped updates would otherwise deviate from the gradient, learning-rate scaling is therefore a principled way of following the gradient direction. We also note that this technique can be applied to any other model whose constraints prevent updates from freely following gradient descent. If the constrained parameter space is an open subset of Euclidean space, and we allow the learning rate to become arbitrarily small (Algorithm 6 with lim_{i→∞} ξ_i = 0), updates will always follow the direction of the gradient.
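A sketch of the learning-rate scaling loop for a single non-negative-constrained parameter vector. The explicit `schedule` argument is an assumption on our part; the text only requires that the candidate rates ξ_i can become arbitrarily small.

```python
import numpy as np

def scaled_lr(theta, grad, M, schedule):
    """Shrink the learning rate until the cosine between the clipped
    update and the unclipped gradient step exceeds the constraint M
    (cf. Algorithm 6). `schedule` is a decreasing sequence of
    candidate learning rates."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    for eta in schedule:
        proposed = theta - eta * grad
        clipped = np.maximum(proposed, 0.0)   # non-negativity constraint
        # compare actual (clipped) update with the pure gradient step
        if cosine(clipped - theta, proposed - theta) >= M:
            return eta
    return schedule[-1]
```

With a small enough learning rate no parameter crosses zero, clipping becomes inactive, and the cosine reaches 1, so the loop always terminates for parameters in the open positive orthant.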



Note the general form of the exponential family is log P(y|z) = (z·T(y) − η(z))/φ + c(y, φ), but here we only consider distributions from the natural exponential family, where T(y) = y, as this includes the distributions of interest to us, such as the Normal and Categorical, as well as other common distributions including the Exponential, Poisson, and Gamma.

VGG16 implementation: https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py



Figure 1: Illustration of DANN architecture. Lines with arrow ends indicate excitatory projections. Lines with bar ends indicate inhibitory projections, which can be both subtractive and divisive.

Figure 2: Empirical verification of update correction terms: Cosine of the angle between gradients multiplied by an approximation of the diagonal of the Fisher inverse for each layer, and either uncorrected gradients (black) or corrected gradients (orange) over 50 epochs. Plot displays a moving average over 500 updates

Figure 4: Model comparison on Fashion MNIST and Kuzushiji MNIST datasets.

Figure 7: Convolutional network results on CIFAR-10 with control and DANN VGG16 models. Plots show mean training and test set error over 6 random seeds.

Algorithm 6 Learning rate scaling
for each layer do
  Require: ∇θ, constraint M, schedule ξ
  i ← 1, c ← 0
  while c < M do
    η ← ξ_i
    c ← CosineSimilarity(max(0, θ − η∇θ) − θ, −η∇θ)
    i ← i + 1
  end while
end for




F IMPLEMENTATION DETAILS

In this section we provide pseudo-code for implementation details. Please see the following link for code: https://github.com/linclab/ltlwdp

F.1 PARAMETER INITIALIZATION

Algorithm 1 Parameter initialization for DANNs
for each layer do
  Require: n_e, n_i, d
  W^{EE} ∼ Exp(λ_E)
  if n_i = 1 then
    W^{IE} ← (1/n_e) Σ_j^{n_e} w^{EE}_{j,:}
    W^{EI} ← 1
  else
    W^{IE} ∼ Exp(λ_E)
    W^{EI} ← 1/n_i
  end if
  α ← 1 · log(√(2π−1)/√d)
  g ← 1, β ← 0
end for

where the number of excitatory output units is n_e, the number of inhibitory units is n_i, the input dimensionality is d, and λ_E = √(d(2π−1)/(2π)).
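The DANN parameter initialisation (Algorithm 1) can be sketched in numpy as follows. In the n_i = 1 branch the inhibitory weights are tied to the mean excitatory row (cf. Appendix D.3), which makes the subtractive inhibition exactly centre z^E for every input; the function name and return layout are our own.

```python
import numpy as np

def dann_init(n_e, n_i, d, rng):
    """Initialise one DANN layer following Algorithm 1."""
    lam_E = np.sqrt(d * (2 * np.pi - 1) / (2 * np.pi))
    W_EE = rng.exponential(1 / lam_E, (n_e, d))
    if n_i == 1:
        W_IE = W_EE.mean(axis=0, keepdims=True)   # mean excitatory row
        W_EI = np.ones((n_e, 1))
    else:
        W_IE = rng.exponential(1 / lam_E, (n_i, d))
        W_EI = np.full((n_e, n_i), 1 / n_i)
    alpha = np.full(n_i, np.log(np.sqrt(2 * np.pi - 1) / np.sqrt(d)))
    g, beta = np.ones(n_e), np.zeros(n_e)
    return W_EE, W_IE, W_EI, alpha, g, beta
```

With n_i = 1, W^{EI}(W^{IE}x) equals the mean of z^E over units, so z^E minus the subtractive inhibition has zero mean across the layer by construction.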

ACKNOWLEDGEMENTS

We would like to thank Shahab Bakhtiari, Luke Prince, and Arna Ghosh for their helpful comments on this work. This work was supported by grants to BAR, including a NSERC Discovery Grant (RGPIN-2020-05105), an Ontario Early Career Researcher Award (ER17-13-242), a Healthy Brains, Healthy Lives New Investigator Start-up (2b-NISU-8), and funding from CIFAR (Learning in Machines and Brains Program, Canada CIFAR AI Chair), the Wellcome Trust and by the Medical Research Council (UK). Additionally, DK was supported by the FRQNT Strategic Clusters Program (2020-RS4-265502-UNIQUE).

SUPPLEMENTARY MATERIAL A SUPPLEMENTARY RESULTS

We also tested that our heuristic correction factors approximated multiplication of the gradients by the diagonal of F_t^{−1} for each layer (see Figure 2). In Algorithm 3 this correction sets ∆θ_ℓ ← F*_t ∇θ_ℓ, where F_{W^{EE}} denotes the elements of F corresponding to W^{EE}. We note this update can be considered a very rough diagonal approximation to natural gradient descent. In addition, various efficient approximations to natural gradient descent, such as KFAC (Martens & Grosse, 2015), could not be applied due to the structure of DANNs: the mathematical assumptions of KFAC, which were made for feedforward networks whose activations are plain matrix multiplications, do not hold here.

F.4 LEARNING RATE SCALING AND GRADIENT NORMALISATION

We also tested whether constraining the gradient norm and scaling the learning rate based on parameter clipping improved DANN performance. For these experiments we applied the following algorithms. 

