PREDICTIVE CODING APPROXIMATES BACKPROP ALONG ARBITRARY COMPUTATION GRAPHS

Abstract

Backpropagation of error (backprop) is a powerful algorithm for training machine learning architectures through end-to-end differentiation. Recently it has been shown that backprop in multilayer perceptrons (MLPs) can be approximated using predictive coding, a biologically plausible process theory of cortical computation which relies solely on local and Hebbian updates. The power of backprop, however, lies not in its instantiation in MLPs, but rather in the concept of automatic differentiation, which allows for the optimisation of any differentiable program expressed as a computation graph. Here, we demonstrate that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules. We apply this result to develop a straightforward strategy to translate core machine learning architectures into their predictive coding equivalents. We construct predictive coding CNNs, RNNs, and the more complex LSTMs, which include a non-layer-like branching internal graph structure and multiplicative interactions. Our models perform equivalently to backprop on challenging machine learning benchmarks, while utilising only local and (mostly) Hebbian plasticity. Our method raises the potential that standard machine learning algorithms could in principle be directly implemented in neural circuitry, and may also contribute to the development of completely distributed neuromorphic architectures.

1. INTRODUCTION

Deep learning has seen stunning successes in the last decade in computer vision (Krizhevsky et al., 2012; Szegedy et al., 2015), natural language processing and translation (Vaswani et al., 2017; Radford et al., 2019; Kaplan et al., 2020), and computer game playing (Mnih et al., 2015; Silver et al., 2017; Schrittwieser et al., 2019; Vinyals et al., 2019). While there is a great variety of architectures and models, they are all trained by gradient descent using gradients computed by automatic differentiation (AD). The key insight of AD is that it suffices to define a forward model which maps inputs to predictions according to some parameters. Then, using the chain rule of calculus, it is possible, as long as every operation of the forward model is differentiable, to differentiate back through the computation graph of the model, computing the sensitivity of every parameter to the error at the output, and thus to adjust every single parameter to best minimize the total loss. Early models were typically simple artificial neural networks whose computation graph is simply a composition of matrix multiplications and elementwise nonlinearities, and for which the implementation of automatic differentiation has become known as 'backpropagation' (or 'backprop'). However, automatic differentiation allows substantially more complicated graphs to be differentiated through, up to and including arbitrary programs (Griewank et al., 1989; Baydin et al., 2017; Paszke et al., 2017; Revels et al., 2016; Innes et al., 2019; Werbos, 1982; Rumelhart and Zipser, 1985; Linnainmaa, 1970). In recent years this has enabled differentiation through differential equation solvers (Chen et al., 2018; Tzen and Raginsky, 2019; Rackauckas et al., 2019), physics engines (Degrave et al., 2019; Heiden et al., 2019), raytracers (Pal, 2019), and planning algorithms (Amos and Yarats, 2019; Okada et al., 2017).
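To make the chain-rule mechanics concrete, the following is a minimal sketch of reverse-mode automatic differentiation over a small computation graph. The class and helper names (`Var`, `backward`, `tanh`) are illustrative inventions, not the paper's code or any particular library's API: each node records its parents together with the local derivative of the operation that produced it, and the backward pass accumulates gradients from the output back through the graph.

```python
import math

class Var:
    """A node in the computation graph, recording local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value        # forward value v_i
        self.parents = parents    # list of (parent, d v_i / d parent)
        self.grad = 0.0           # dL / d v_i, filled in by backward()

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def tanh(x):
    t = math.tanh(x.value)
    return Var(t, [(x, 1.0 - t * t)])   # d tanh(u)/du = 1 - tanh(u)^2

def backward(out):
    """Propagate dL/dv from the output to every node via the chain rule."""
    out.grad = 1.0
    order, seen = [], set()
    def visit(v):                        # topological order by DFS
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    for v in reversed(order):
        for p, local in v.parents:
            p.grad += v.grad * local     # dL/dp += dL/dv * dv/dp

# y = tanh(w * x); dy/dw = (1 - tanh(w x)^2) * x
w, x = Var(0.5), Var(2.0)
y = tanh(w * x)
backward(y)
```

After `backward(y)`, `w.grad` holds the exact derivative `(1 - tanh(1.0)**2) * 2.0`, computed purely from locally stored partial derivatives, which is the property the predictive coding construction below reproduces.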
These advances allow the straightforward training of models which intrinsically embody complex processes and which can encode significantly more prior knowledge and structure about a given problem domain than previously possible. Modern deep learning has also been closely intertwined with neuroscience (Hassabis et al., 2017; Hawkins and Blakeslee, 2007; Richards et al., 2019). The backpropagation algorithm itself arose as a technique for training multi-layer perceptrons, simple hierarchical models of neurons inspired by the brain (Werbos, 1982). Despite this origin, and its empirical successes, a consensus has emerged that the brain cannot directly implement backprop, since to do so would require biologically implausible connection rules (Crick, 1989). There are two principal problems. Firstly, backprop in the brain appears to require non-local information (since the activity of any specific neuron affects all subsequent neurons down to the final output neuron). It is difficult to see how this information could be transmitted 'backwards' throughout the brain with the required fidelity without precise connectivity constraints. The second problem, the 'weight transport problem', is that backprop through MLP-style networks requires identical forward and backward weights. In recent years, however, a succession of models have been introduced which claim to implement backprop in MLP-style models using only biologically plausible connectivity schemes and Hebbian learning rules (Liao et al., 2016; Guerguiev et al., 2017; Sacramento et al., 2018; Bengio and Fischer, 2015; Bengio et al., 2017; Ororbia et al., 2020; Whittington and Bogacz, 2019). Of particular significance is Whittington and Bogacz (2017), who show that predictive coding networks, a type of biologically plausible network which learns through a hierarchical process of prediction error minimization, are mathematically equivalent to backprop in MLP models.
In this paper we extend this work, showing that predictive coding can not only approximate backprop in MLPs, but can approximate automatic differentiation along arbitrary computation graphs. This means that in theory there exist potentially biologically plausible algorithms for differentiating through arbitrary programs, utilizing only local connectivity. Moreover, in a class of models which we call parameter-linear, which includes many current machine learning models, the required update rules are Hebbian, raising the possibility that a wide range of current machine learning architectures may be faithfully implemented in the brain, or in neuromorphic hardware. In this paper we provide two main contributions. (i) We show that predictive coding converges to automatic differentiation across arbitrary computation graphs. (ii) We showcase this result by implementing three core machine learning architectures (CNNs, RNNs, and LSTMs) in a predictive coding framework which utilises only local learning rules and mostly Hebbian plasticity.

Predictive coding is an influential theory of cortical function in theoretical and computational neuroscience. Central to the theory is the idea that the core function of the brain is to minimize prediction errors between what is expected to happen and what actually happens. Predictive coding

2. PREDICTIVE CODING ON ARBITRARY COMPUTATION GRAPHS

[Figure 1 shows a chain $v_0 \rightarrow v_1 \rightarrow v_2 \rightarrow v_3$ with target $T$ clamped at the output. Top (backprop): $\frac{\partial L}{\partial v_2} = \frac{\partial L}{\partial v_3}\frac{\partial v_3}{\partial v_2}$, $\frac{\partial L}{\partial v_1} = \frac{\partial L}{\partial v_2}\frac{\partial v_2}{\partial v_1}$, $\frac{\partial L}{\partial v_0} = \frac{\partial L}{\partial v_1}\frac{\partial v_1}{\partial v_0}$. Bottom (predictive coding), with value nodes $v_i$ and prediction errors $\epsilon_i$: $\dot{v}_0 = -\epsilon_0 + \epsilon_1 \frac{\partial \hat{v}_1}{\partial v_0}$, $\dot{v}_1 = -\epsilon_1 + \epsilon_2 \frac{\partial \hat{v}_2}{\partial v_1}$, $\dot{v}_2 = -\epsilon_2 + \epsilon_3 \frac{\partial \hat{v}_3}{\partial v_2}$.]

Figure 1: Top: Backpropagation on a chain. Backprop proceeds backwards sequentially and explicitly computes the gradient at each step on the chain. Bottom: Predictive coding on a chain. Predictions, and prediction errors are updated in parallel using only local information.
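The dynamics in Figure 1 can be sketched numerically on a small scalar chain $v_{i+1} = \tanh(w_i v_i)$. The weight values below are arbitrary illustrations, and the sketch adopts one simplifying assumption common in this line of work (and hedged as such here): predictions $\hat{v}_i$ and the local derivatives $\partial \hat{v}_{i+1}/\partial v_i$ are frozen at their feedforward values while the value nodes relax. Under that assumption the equilibrium prediction errors $\epsilon_i$ match the backprop error signal on the chain.

```python
import math

w = [0.5, 0.3, 0.7]                      # illustrative chain weights
f, df = math.tanh, lambda u: 1.0 - math.tanh(u) ** 2

# Feedforward sweep: v_{i+1} = f(w_i * v_i)
v = [1.0]
for wi in w:
    v.append(f(wi * v[-1]))

T = v[-1] + 0.1                          # target clamped at the output node

# Backprop error signal, propagated backwards sequentially (Fig. 1, top):
# delta_i = delta_{i+1} * dv_{i+1}/dv_i, with delta at the output = T - v_out
delta = [0.0] * 4
delta[3] = T - v[3]
for i in (2, 1):
    delta[i] = delta[i + 1] * w[i] * df(w[i] * v[i])

# Predictive coding inference (Fig. 1, bottom): value nodes relax via
# v_i += lr * (-eps_i + eps_{i+1} * dv_{i+1}/dv_i), using only the errors
# of node i and its immediate successor -- i.e. local information.
pred = list(v)                           # predictions frozen at feedforward values
x = list(v)                              # value nodes
x[3] = T                                 # output clamped to the target
for _ in range(500):
    eps = [x[i] - pred[i] for i in range(4)]
    for i in (1, 2):                     # input and output nodes stay clamped
        x[i] += 0.1 * (-eps[i] + eps[i + 1] * w[i] * df(w[i] * v[i]))

eps = [x[i] - pred[i] for i in range(4)]
# At equilibrium eps[1] and eps[2] equal delta[1] and delta[2]
```

Note that each node's update touches only its own error and that of its immediate child, yet the relaxed errors reproduce the quantities that backprop computes by an explicit backwards sweep.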

