Learning with Plasticity Rules: Generalization and Robustness

Abstract

Brains learn robustly, and generalize effortlessly between different learning tasks; in contrast, robustness and generalization across tasks are well-known weaknesses of artificial neural nets (ANNs). How can we use our accelerating understanding of the brain to improve these and other aspects of ANNs? Here we hypothesize that (a) brains employ synaptic plasticity rules that serve as proxies for Gradient Descent (GD); (b) these rules themselves can be learned by GD on the rule parameters; and (c) this process may be a missing ingredient for the development of ANNs that generalize well and are robust to adversarial perturbations. We provide both empirical and theoretical evidence for this hypothesis. In our experiments, plasticity rules for the synaptic weights of recurrent neural nets (RNNs) are learned through GD and are found to perform reasonably well (with no backpropagation). We find that plasticity rules learned by this process generalize from one type of data/classifier to others (e.g., rules learned on synthetic data work well on MNIST/Fashion MNIST) and converge with fewer updates. Moreover, the classifiers learned using plasticity rules exhibit surprising levels of tolerance to adversarial perturbations. In the special case of the last layer of a classification network, we show analytically that GD on the plasticity rule recovers (and improves upon) the perceptron algorithm and the multiplicative weights method. Finally, we argue that applying GD to learning rules is biologically plausible, in the sense that it can be learned over evolutionary time: we describe a genetic setting where natural selection of a numerical parameter over a sequence of generations provably simulates a simple variant of GD.

1. Introduction

The brain is the most striking example of a learning device that generalizes robustly across tasks. Artificial neural networks learn specific tasks from labeled examples through backpropagation with formidable accuracy, but generalize quite poorly to different tasks, and are brittle under data perturbations. In addition, it is well known that backpropagation is not biorealistic: it cannot be implemented in brains, as it requires the transfer of information from post- to pre-synaptic neurons. This is not, in itself, a disadvantage of backpropagation, unless one suspects that this lack of biorealism limits ANNs in important dimensions such as cross-task generalization, self-supervision, and robustness. We believe that the quest for ANNs that generalize robustly between learning tasks has much inspiration to gain from the study of the way brains work. In this paper we focus on plasticity rules (Dayan and Abbott, 2001): laws controlling changes in the strength of a synapse based on the firing history as seen at the post-synaptic neuron. We provide evidence, both experimental and theoretical, that (a) in the case of RNNs, plasticity rules can successfully replace backpropagation and GD, resulting in versatile, generalizable, and robust learning; and (b) these rules can be learned efficiently through GD on the rule parameters.

Plasticity Rules. Hebbian learning ("fire together, wire together", Hebb (1949)) is the simplest and most familiar plasticity rule: if there is a synapse (i, j) from neuron i to neuron j, and at some point i fires and shortly thereafter j fires, then the synaptic weight of this synapse gets an increment. Over the seven decades since Hebb, many forms of plasticity have been observed experimentally and/or formalized analytically, many of them quite sophisticated and complex; see Dayan and Abbott (2001) for an exposition.
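As a concrete illustration (our own sketch, not code from the paper), the basic additive Hebbian update can be written in a few lines; the function name, the learning rate eta, and the 0/1 firing vectors are illustrative choices:

```python
import numpy as np

def hebbian_update(W, pre, post, eta=0.01):
    """One additive Hebbian step: strengthen synapse (i, j) exactly when
    pre-synaptic neuron i and post-synaptic neuron j both fired.

    W[i, j] is the weight of synapse (i, j); `pre` and `post` are 0/1
    firing vectors. All names here are illustrative assumptions.
    """
    # The outer product is 1 precisely where both endpoints fired,
    # so only those synapses receive the increment eta.
    W += eta * np.outer(pre, post)
    return W

# Toy usage: 3 pre-synaptic and 2 post-synaptic neurons.
W = np.zeros((3, 2))
pre = np.array([1, 0, 1])   # neurons 0 and 2 fired
post = np.array([0, 1])     # neuron 1 fired
W = hebbian_update(W, pre, post, eta=0.5)
```

Only the synapses whose two endpoints both fired, (0, 1) and (2, 1), are strengthened; all others are left unchanged.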
All of them dictate a change, increment or decrement, in the synaptic weight of a synapse (i, j) provided neurons i and j both fired in some pattern. Intuitively, the decision to apply a plasticity rule takes place at the post-synaptic neuron j, since j receives information from the firing of both i and itself. This is consistent with our understanding of the molecular mechanisms that determine synaptic strength, all of which are complex chemical phenomena taking place at (the dendrite of) j.

In this paper we consider plasticity rules as objects that can be learned. This fits with the view that existing mechanisms have presumably changed over evolutionary time and are known to differ in their details from one animal species to another. We show experimentally that an RNN can meta-learn a plasticity rule that allows it to learn to perform a classification task without backpropagation. This meta-learning is done by GD on the parameters of the rule. Interestingly, the same plasticity rule then performs well on very different tasks and data sets. There are many ways to parameterize a plasticity rule, from full table specifications to small neural networks that take as input the observed activation sequences at both ends of a synapse and output the change to the synaptic weight.

Why RNNs? RNNs are inspired by, and can model, recurrent activity observed in the brain; they are also especially well suited to plasticity rules. To illustrate, suppose that we want to train the feedforward ANN in Figure 1(a) with a plasticity rule. It is clear that the space of possible rules is rather meager. In order to change the weight of link (i, j) after each labeled example, node j will decide the nature of the change based on local information, namely, whether i or j or both fired during this epoch. Thus any learned plasticity rule must be some slight generalization of Hebb's rule (see the footnote below).
But suppose instead that the three hidden layers have been collapsed into one, resulting in the RNN shown in Figure 1(b), and this collapsed layer fires three times before readout, roughly simulating a feedforward 3-layer network. Now node j knows much more about what happened to link (i, j) during these three epochs; such information was inaccessible in the feedforward setting. Any 2^3 × 2^3 matrix of reals is a possible plasticity rule, where 2^3 is the number of possible firing patterns for each of i and j (such as "fired in the first round, did not fire in the second, fired in the third," or "101"), and the entries of the matrix denote increments/decrements, additive or multiplicative, of the weight of link (i, j). If one updates the entries of this rule by training on a task, it is possible that this rule may be an adequate proxy for the update calculated by backpropagation. Furthermore, we might hope that this rule may even generalize well, performing far above baseline on very different tasks.

Evolution. We propose to replace GD in deep learning by biorealistic plasticity rules, and then use GD to learn the plasticity coefficients. Are we contradicting ourselves? After all, the brain did not develop its plasticity rule(s) through GD, but through evolution. But since
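To make the 2^3 × 2^3 parameterization concrete, here is a minimal sketch, in our own illustrative code rather than the paper's implementation, of applying such a rule table additively to the weights of a small fully recurrent network, given each neuron's three-round firing history (the learning rate eta and all names are assumptions):

```python
import numpy as np

def pattern_index(history):
    """Encode a 3-round 0/1 firing history as an index in 0..7,
    e.g. (1, 0, 1), i.e. pattern "101", maps to 0b101 = 5."""
    return history[0] * 4 + history[1] * 2 + history[2]

def apply_rule(W, histories, rule, eta=0.1):
    """Additive plasticity update after an RNN has fired for 3 rounds.

    W[i, j]      : weight of link (i, j), updated for every pair,
                   including self-loops, in this simple sketch
    histories[k] : length-3 tuple of 0/1 firings of neuron k
    rule         : 8 x 8 table; rule[p, q] is the increment applied to
                   W[i, j] when i showed pattern p and j showed pattern q
    """
    n = W.shape[0]
    for i in range(n):
        for j in range(n):
            p = pattern_index(histories[i])
            q = pattern_index(histories[j])
            W[i, j] += eta * rule[p, q]
    return W

# Toy usage: a rule that only rewards pre-pattern "101" with post-pattern "011".
rule = np.zeros((8, 8))
rule[0b101, 0b011] = 1.0
histories = [(1, 0, 1), (0, 1, 1)]   # neuron 0 fired "101", neuron 1 fired "011"
W = apply_rule(np.zeros((2, 2)), histories, rule)
```

In the meta-learning view described above, the 64 entries of `rule` are the parameters that GD would adjust; the table itself is what generalizes across tasks.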



Footnote: We could update all incoming links to node j based on the firing status of all of them; Zenke et al. (2015) suggest that such complex rules may indeed be at work in the animal brain. See the discussion for more on this intriguing research direction.



Figure 1: Feedforward networks vs. RNNs

