BACKPROPAGATION AT THE INFINITESIMAL INFERENCE LIMIT OF ENERGY-BASED MODELS: UNIFYING PREDICTIVE CODING, EQUILIBRIUM PROPAGATION, AND CONTRASTIVE HEBBIAN LEARNING

Abstract

How the brain performs credit assignment is a fundamental unsolved problem in neuroscience. Many 'biologically plausible' algorithms have been proposed, which compute gradients that approximate those computed by backpropagation (BP), and which operate in ways that more closely satisfy the constraints imposed by neural circuitry. Many such algorithms utilize the framework of energy-based models (EBMs), in which all free variables in the model are optimized to minimize a global energy function. However, in the literature, these algorithms exist in isolation and no unified theory exists linking them together. Here, we provide a comprehensive theory of the conditions under which EBMs can approximate BP, which lets us unify many of the BP approximation results in the literature (namely, predictive coding, equilibrium propagation, and contrastive Hebbian learning) and demonstrate that their approximation to BP arises from a simple and general mathematical property of EBMs at free-phase equilibrium. This property can then be exploited in different ways with different energy functions, and these specific choices yield a family of BP-approximating algorithms, which both includes the known results in the literature and can be used to derive new ones.

1. INTRODUCTION

The backpropagation of error algorithm (BP) (Rumelhart et al., 1986) has become the workhorse algorithm underlying the recent successes of deep learning (Krizhevsky et al., 2012; Silver et al., 2016; Vaswani et al., 2017). However, from a neuroscientific perspective, BP has often been criticised as not being biologically plausible (Crick et al., 1989; Stork, 1989). Given that the brain faces a credit assignment problem at least as challenging as that faced by deep neural networks, there is a fundamental question of whether the brain uses backpropagation to perform credit assignment. The answer to this question depends on whether there exist biologically plausible algorithms approximating BP that could be implemented in neural circuitry (Whittington & Bogacz, 2019; Lillicrap et al., 2020). A large number of potential algorithms have been proposed in the literature (Lillicrap et al., 2016; Xie & Seung, 2003; Nøkland, 2016; Whittington & Bogacz, 2017; Lee et al., 2015; Bengio & Fischer, 2015; Millidge et al., 2020c;a; Song et al., 2020; Ororbia & Mali, 2019); however, insight into the linkages and relationships between them is scarce, and thus far the field largely presents itself as a set of disparate algorithms and ideas without any unifying or fundamental principles. In this paper we provide a theoretical framework which unifies four disparate schemes for approximating BP: predictive coding with weak feedback (Whittington & Bogacz, 2017) and on the first step after initialization (Song et al., 2020), the Equilibrium Propagation (EP) framework (Scellier & Bengio, 2017), and Contrastive Hebbian Learning (CHL) (Xie & Seung, 2003). We show that these algorithms all emerge as special cases of a general mathematical property of the energy-based model framework that underlies them and, as such, can be generalized to novel energy functions and used to derive novel algorithms that have not yet been described in the literature.
The key insight is that for the energy-based models (EBMs) underlying these algorithms, the total energy can be decomposed into a component corresponding to the supervised loss function that depends on the output of the network, and a second component that relates to the 'internal energy' of the network. Crucially, at the minimum of the internal energy, the dynamics of the neurons point exactly in the direction of the gradient of the supervised loss. Thus, at this point, the network dynamics implicitly provide an exact gradient signal. This fact can be exploited in two ways. Firstly, this instantaneous direction can be directly extracted and used to perform weight updates. This 'first step' approach is taken by Song et al. (2020) and results in exact BP using only the intrinsic dynamics of the EBM, but typically requires a complex set of control signals to specify when updates should occur. Alternatively, a second equilibrium can be found close to the initial one. If it is close enough (a condition we call the infinitesimal inference limit), then the vector from the initial to the second equilibrium approximates the direction of the initial dynamics, and thus the difference in equilibria approximates the BP loss gradient. This fact is then utilized either implicitly (Whittington & Bogacz, 2017) or explicitly (Scellier & Bengio, 2017; Xie & Seung, 2003) to derive algorithms which approximate BP. Once at this equilibrium, all weights can be updated in parallel rather than sequentially and no control signals are needed. The paper is structured as follows. First, we provide concise introductions to predictive coding networks (PCNs), contrastive Hebbian learning (CHL), and equilibrium propagation (EP). Then, we derive our fundamental results on EBMs, show how they yield a unified framework for understanding existing algorithms, and showcase how to generalize our framework to derive novel BP-approximating algorithms.
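This infinitesimal inference limit can be checked numerically. The sketch below is our own toy construction, not code from any of the cited papers: a two-layer linear network with a squared-error energy (all sizes, seeds, and step counts are arbitrary choices). It compares an EP-style gradient estimate, read off from the difference between the free-phase equilibrium and a weakly nudged equilibrium, against the exact BP gradient of the supervised loss L = 0.5‖x_2 − T‖².

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.normal(size=4)                  # input x_0, clamped to the data
T = rng.normal(size=3)                  # target vector
W0 = rng.normal(0, 0.1, (5, 4))
W1 = rng.normal(0, 0.1, (3, 5))

def relax(beta, n_steps=2000, lr=0.2):
    """Descend E = 0.5||x1 - W0 d||^2 + 0.5||x2 - W1 x1||^2 + beta * L."""
    x1, x2 = W0 @ d, W1 @ (W0 @ d)      # start from the feedforward pass
    for _ in range(n_steps):
        e1, e2 = x1 - W0 @ d, x2 - W1 @ x1
        x1, x2 = x1 - lr * (e1 - W1.T @ e2), x2 - lr * (e2 + beta * (x2 - T))
    return x1, x2

beta = 1e-3
x1n, x2n = relax(beta)                  # weakly nudged equilibrium (small beta)

# EP-style estimate: (1/beta) * dE/dW at the nudged equilibrium; the free-phase
# (beta = 0) term vanishes here because all prediction errors are zero there.
gW1_ep = -np.outer(x2n - W1 @ x1n, x1n) / beta
gW0_ep = -np.outer(x1n - W0 @ d, d) / beta

# Exact backprop gradients of the supervised loss for this linear network.
e = W1 @ (W0 @ d) - T
gW1_bp = np.outer(e, W0 @ d)
gW0_bp = np.outer(W1.T @ e, d)
```

Shrinking beta tightens the agreement between the two estimates (they differ by O(beta)), at the cost of an ever smaller equilibrium shift that must be divided back out; this trade-off is what the infinitesimal inference limit formalizes.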

2. BACKGROUND AND NOTATION

We assume we are performing credit assignment on a hierarchical stack of feedforward layers $x_0, \dots, x_L$ with layer indices $0, \dots, L$ in a supervised learning task. The input layer $x_0$ is fixed to some data element $d$. The layer states $x_l$ are vectors of real values and represent a rate-coded average firing rate for each neuron. We assume a supervised learning paradigm in which a target vector $T$ is provided to the output layer of the network and the network as a whole minimizes a supervised loss $\mathcal{L}(x_L, T)$. $T$ is usually assumed to be a one-hot vector in classification tasks, but this is not necessary. Synaptic weight matrices $W_l$ at each layer affect the dynamics of the layer above. The network states $x_l$ are assumed to be free parameters that can vary during inference. We use $x = \{x_l\}$ and $W = \{W_l\}$ to refer to the sets across all layers when specific layer indices are not important. The models we describe are EBMs, which possess an energy function $E(x_0, \dots, x_L, W_0, \dots, W_L, T)$ which is minimized both with respect to the neural activities $x_l$ and the weights $W_l$, with dynamics

$$\text{Inference:} \quad \frac{dx_l}{dt} = -\frac{\partial E}{\partial x_l} \tag{1}$$

$$\text{Learning:} \quad \frac{dW_l}{dt} = -\frac{\partial E}{\partial W_l} \tag{2}$$

which are simply a gradient descent on the energy. For notational simplicity we implicitly set the learning rate $\eta = 1$ throughout. Moreover, we assume differentiability (at least up to second order) of the energy function with respect to both weights and activities. In EBMs, we first run 'inference' to optimize the activities using Equation 1 until convergence. Then, after the inference phase is complete, we run a single step of weight updates according to Equation 2, which is often called the learning phase. A schematic representation of how EBMs learn with a global energy function is presented in Figure 1A, while a visual breakdown of our results is presented in Figure 1B.
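As a concrete illustration of this two-phase scheme, the following is a minimal NumPy sketch (our own construction, not code from the paper), assuming linear layers and a predictive-coding-style squared-error energy plus the squared supervised loss; layer sizes, learning rates, and step counts are arbitrary illustrative choices. Activities are relaxed by Euler steps on Equation 1 with the target influencing the output layer, and the weights then receive a single local update following Equation 2.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                       # widths of x_0, x_1, x_2
W = [rng.normal(0, 0.1, (sizes[l + 1], sizes[l])) for l in range(2)]

def energy(x, W, T):
    """E = internal energy (squared prediction errors) + supervised loss."""
    internal = sum(0.5 * np.sum((x[l + 1] - W[l] @ x[l]) ** 2) for l in range(2))
    loss = 0.5 * np.sum((x[-1] - T) ** 2)
    return internal + loss

def dE_dx(x, W, T, l):
    """Gradient of E with respect to the activities of layer l (l >= 1)."""
    g = x[l] - W[l - 1] @ x[l - 1]                  # error at layer l
    if l < len(x) - 1:
        g -= W[l].T @ (x[l + 1] - W[l] @ x[l])      # error fed back from above
    else:
        g += x[l] - T                               # output layer feels the loss
    return g

d, T = rng.normal(size=sizes[0]), rng.normal(size=sizes[2])
x = [d, W[0] @ d, W[1] @ (W[0] @ d)]                # feedforward initialization
E0 = energy(x, W, T)

# Inference phase (Eq. 1): relax activities by Euler steps on dx_l/dt = -dE/dx_l.
for _ in range(200):
    for l in (1, 2):
        x[l] = x[l] - 0.1 * dE_dx(x, W, T, l)
E1 = energy(x, W, T)                                # energy after relaxation

# Learning phase (Eq. 2): one step on dW_l/dt = -dE/dW_l, which here is local
# and Hebbian-like: the outer product of the post-synaptic prediction error
# and the pre-synaptic activity.
for l in range(2):
    W[l] = W[l] + 0.01 * np.outer(x[l + 1] - W[l] @ x[l], x[l])
```

Relaxing the activities lowers the total energy from its feedforward starting value, after which every weight matrix can be updated in parallel from purely local quantities.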
2.1 PREDICTIVE CODING

Predictive Coding (PC) emerged as a theory of neural processing in the retina (Srinivasan et al., 1982) and was extended to a general theory of cortical function (Mumford, 1992; Rao & Ballard, 1999; Friston, 2005). The fundamental idea is that the brain performs inference and learning by predicting its sensory stimuli and minimizing the resulting prediction errors. Such an approach provides a natural unsupervised learning objective for the brain (Rao & Ballard, 1999), while also minimizing redundancy and maximizing information transmission by transmitting only unpredicted information (Barlow, 1961; Sterling & Laughlin, 2015). The learning rules used by PCNs require only local and Hebbian updates (Millidge et al., 2020b), and a variety of neural microcircuits have been proposed that could implement the computations required by PC (Bastos et al., 2012; Keller & Mrsic-Flogel, 2018). Moreover, recent works have begun exploring the use of large-scale PCNs in machine learning tasks with some success (Millidge, 2019; Salvatori et al., 2021;

