PREDICTIVE CODING APPROXIMATES BACKPROP ALONG ARBITRARY COMPUTATION GRAPHS

Abstract

Backpropagation of error (backprop) is a powerful algorithm for training machine learning architectures through end-to-end differentiation. Recently it has been shown that backprop in multilayer perceptrons (MLPs) can be approximated using predictive coding, a biologically plausible process theory of cortical computation which relies solely on local and Hebbian updates. The power of backprop, however, lies not in its instantiation in MLPs, but rather in the concept of automatic differentiation, which allows for the optimisation of any differentiable program expressed as a computation graph. Here, we demonstrate that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules. We apply this result to develop a straightforward strategy to translate core machine learning architectures into their predictive coding equivalents. We construct predictive coding CNNs, RNNs, and the more complex LSTMs, which include a non-layer-like branching internal graph structure and multiplicative interactions. Our models perform equivalently to backprop on challenging machine learning benchmarks, while utilising only local and (mostly) Hebbian plasticity. Our method raises the possibility that standard machine learning algorithms could in principle be directly implemented in neural circuitry, and may also contribute to the development of completely distributed neuromorphic architectures.

1. INTRODUCTION

Deep learning has seen stunning successes in the last decade in computer vision (Krizhevsky et al., 2012; Szegedy et al., 2015), natural language processing and translation (Vaswani et al., 2017; Radford et al., 2019; Kaplan et al., 2020), and computer game playing (Mnih et al., 2015; Silver et al., 2017; Schrittwieser et al., 2019; Vinyals et al., 2019). While there is a great variety of architectures and models, they are all trained by gradient descent using gradients computed by automatic differentiation (AD). The key insight of AD is that it suffices to define a forward model which maps inputs to predictions according to some parameters. Then, as long as every operation of the forward model is differentiable, the chain rule of calculus makes it possible to differentiate back through the computation graph of the model, computing the sensitivity of the output error to every parameter in the model, and thus to adjust every parameter so as to minimize the total loss. Early models were typically simple artificial neural networks where the computation graph is simply a composition of matrix multiplications and elementwise nonlinearities, and for which the implementation of automatic differentiation has become known as 'backpropagation' (or 'backprop'). However, automatic differentiation allows substantially more complicated graphs to be differentiated through, up to and including arbitrary programs (Griewank et al., 1989; Baydin et al., 2017; Paszke et al., 2017; Revels et al., 2016; Innes et al., 2019; Werbos, 1982; Rumelhart and Zipser, 1985; Linnainmaa, 1970). In recent years this has enabled differentiation through differential equation solvers (Chen et al., 2018; Tzen and Raginsky, 2019; Rackauckas et al., 2019), physics engines (Degrave et al., 2019; Heiden et al., 2019), raytracers (Pal, 2019), and planning algorithms (Amos and Yarats, 2019; Okada et al., 2017).
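As an illustration of the chain-rule sweep described above, the following is a minimal sketch of reverse-mode AD on a scalar computation graph. It is not the implementation used in this paper or in any AD library: the `Node` class, the `backward` and `loss_value` functions, and the toy two-weight model with its numerical values are all hypothetical choices made for this example. The gradients obtained from the backward sweep are compared against central finite differences as an independent check.

```python
import math


class Node:
    """A scalar value in a computation graph (hypothetical helper for this sketch)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local derivative d(self)/d(parent)).
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def tanh(self):
        t = math.tanh(self.value)
        return Node(t, ((self, 1.0 - t * t),))


def backward(output):
    """Reverse sweep: apply the chain rule from the output back to every node."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # a node is appended after all of its inputs

    visit(output)
    output.grad = 1.0
    # Reversed order guarantees a node's gradient is complete before it is
    # propagated to the node's inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


def loss_value(w1, w2, x, target):
    """The same forward model on plain floats, used only for the check."""
    err = w2 * math.tanh(w1 * x) - target
    return err * err


# Forward pass: evaluating the model builds the graph.
w1, w2, x, target = Node(0.5), Node(-1.5), Node(2.0), Node(1.0)
err = w2 * (w1 * x).tanh() + Node(-1.0) * target
loss = err * err

# Backward pass: fills in .grad for every node, including the parameters.
backward(loss)

# Central finite differences as an independent estimate of the gradients.
eps = 1e-6
fd_w1 = (loss_value(0.5 + eps, -1.5, 2.0, 1.0)
         - loss_value(0.5 - eps, -1.5, 2.0, 1.0)) / (2 * eps)
fd_w2 = (loss_value(0.5, -1.5 + eps, 2.0, 1.0)
         - loss_value(0.5, -1.5 - eps, 2.0, 1.0)) / (2 * eps)

print("dL/dw1:", w1.grad, "finite difference:", fd_w1)
print("dL/dw2:", w2.grad, "finite difference:", fd_w2)
```

Only the forward model is written by hand in this sketch; each operation records its own local derivative, and the backward sweep needs nothing beyond those local derivatives and the graph structure.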
These advances allow the straightforward training of models which intrinsically embody complex processes and which can encode significantly more prior knowledge and structure about a given problem domain than previously possible. Modern deep learning has also been closely intertwined with neuroscience (Hassabis et al., 2017; Hawkins and Blakeslee, 2007; Richards et al., 2019). The backpropagation algorithm itself arose

