A THEORETICAL FRAMEWORK FOR INFERENCE AND LEARNING IN PREDICTIVE CODING NETWORKS

Abstract

Predictive coding (PC) is an influential theory in computational neuroscience, which argues that the cortex forms unsupervised world models by implementing a hierarchical process of prediction error minimization. PC networks (PCNs) are trained in two phases. First, neural activities are updated to optimize the network's response to external stimuli. Second, synaptic weights are updated to consolidate this change in activity, an algorithm called prospective configuration. While previous work has shown that, in various limits, PCNs approximate backpropagation (BP), recent work has demonstrated that PCNs operating in the standard regime, which does not approximate BP, nevertheless match the training and generalization performance of BP-trained networks while outperforming them on various tasks. However, little is understood theoretically about the properties and dynamics of PCNs in this regime. In this paper, we provide a comprehensive theoretical analysis of the properties of PCNs trained with prospective configuration. We first derive analytical results concerning the inference equilibrium of PCNs and a previously unknown close connection to target propagation (TP). Secondly, we provide a theoretical analysis of learning in PCNs as a variant of generalized expectation-maximization and use it to prove the convergence of PCNs to critical points of the BP loss function, thus showing that deep PCNs can, in theory, achieve the same generalization performance as BP while maintaining their unique advantages.

1. INTRODUCTION

Predictive coding (PC) is an influential theory in theoretical neuroscience (Mumford, 1992; Rao & Ballard, 1999; Friston, 2003; 2005), which is often presented as a potential unifying theory of cortical function (Friston, 2003; 2008; 2010; Clark, 2015b; Hohwy et al., 2008). PC argues that the brain is fundamentally a hierarchical prediction-error-minimizing system that learns a general world model by predicting sensory inputs. Computationally, one way the theory of PC can be instantiated is with PC networks (PCNs), which are heavily inspired by and can be compared to artificial neural networks (ANNs) on various machine learning tasks (Lotter et al., 2016; Whittington & Bogacz, 2017; Millidge et al., 2020a; b; Song et al., 2020; Millidge et al., 2022). Like ANNs, PCNs are networks of neural activities and synaptic weights. Unlike ANNs, in PCNs, training proceeds by clamping the input and output of the network to the training data and correct targets, respectively, and first letting the neural activities update towards the configuration that minimizes the sum of prediction errors throughout the network. Once the neural activities have reached an equilibrium, the synaptic weights can be updated with a local and Hebbian update rule, which consolidates this change in neural activities. This learning algorithm is called prospective configuration (Song et al., 2022), since the activity updates appear to be prospective in that they move towards the values that each neuron should have in order to correctly classify the input. Previous work has shown how, in certain limits such as when the influence of feedback information is small or in the first step of inference, PC can approximate backpropagation (BP) and that this approximation is close enough to enable the training of large-scale networks with the same performance as BP (Whittington & Bogacz, 2017; Millidge et al., 2020a; Song et al., 2020); see Appendix A.2 for a full review and comparison.
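The two-phase procedure just described can be sketched concretely. The following is a minimal numpy illustration, not code from the paper: the function names (`inference`, `learn`), the tanh activation, the layer sizes, and the Euler step sizes are all illustrative assumptions. The input and output layers are clamped, hidden activities relax by gradient descent on the summed squared prediction errors, and the weights are then updated with a local, Hebbian rule at the resulting equilibrium.

```python
import numpy as np

f = np.tanh
df = lambda a: 1.0 - np.tanh(a) ** 2   # derivative of the activation

def inference(W, x0, target, n_steps=500, step=0.1):
    """Inference phase: clamp x_0 and x_L, then relax the hidden activities
    by gradient descent on the summed squared prediction errors."""
    L = len(W)
    x = [x0]
    for l in range(L):                  # feedforward pass as initialization
        x.append(f(W[l] @ x[l]))
    x[L] = target                       # clamp the output layer to the target
    for _ in range(n_steps):
        eps = [x[l + 1] - f(W[l] @ x[l]) for l in range(L)]
        for l in range(1, L):           # hidden layers only; x_0, x_L stay clamped
            dx = -eps[l - 1] + W[l].T @ (eps[l] * df(W[l] @ x[l]))
            x[l] = x[l] + step * dx
    return x

def learn(W, x, lr=0.05):
    """Learning phase: local, Hebbian weight update at the activity equilibrium."""
    for l in range(len(W)):
        eps = x[l + 1] - f(W[l] @ x[l])
        W[l] = W[l] + lr * np.outer(eps * df(W[l] @ x[l]), x[l])
    return W
```

Note that, in this indexing, `W[l]` maps layer `l` to layer `l + 1`; a full training loop would simply alternate `inference` and `learn` over a dataset.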
Recent work (Song et al., 2022) has also shown that PCNs trained with prospective configuration under standard conditions can obtain training and generalization performance equivalent to BP, with advantages in online, few-shot, and continual learning. Intuitively, the difference between learning in BP and PCNs is that in BP, the error is computed at the output layer and sequentially transported backwards through the network. In PCNs, there is first an inference phase in which the error is redistributed throughout the network until it converges to an equilibrium. Then, the weights are updated to minimize the local errors computed at this equilibrium. Crucially, this equilibrium is in some sense prospective: due to its iterative nature, it can utilize information about the activities of other layers that is not available to BP, and this information can speed up training by avoiding redundant updates. However, while the convergence properties of stochastic gradient descent combined with BP are well understood, there is as yet no equivalent understanding of the theoretical properties of prospective configuration. In this work, we provide the first comprehensive theoretical study of both the inference and the learning phase in PCNs. We investigate the nature of the equilibrium computed by the inference phase and analyze its properties in a linear PCN, where an expression for the inference equilibrium can be derived analytically. We show that this equilibrium can be interpreted as an average of the feedforward pass values of the network and the local targets computed by target propagation (TP). We show that the same intuition also holds for nonlinear networks, where the inference equilibrium cannot be computed analytically. Furthermore, we investigate the nature of learning in such networks, which diverges from BP.
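The equilibrium of the inference phase can be checked numerically in a toy linear PCN. The following is a hypothetical sketch, not from the paper, with illustrative weights and step sizes: with identity activations and $x_0$, $x_L$ clamped, setting $\partial F / \partial x_l = 0$ implies that at the fixed point each hidden layer's error equals the backprojected error of the layer above, $\epsilon_l = W_{l+1}^T \epsilon_{l+1}$, and gradient descent on the activities converges to exactly this condition.

```python
import numpy as np

# Toy linear PCN with one hidden layer: x0 (input) and xL (target) are clamped,
# and the hidden activity x1 relaxes by gradient descent on
# F = 1/2 * (||x1 - W1 x0||^2 + ||xL - W2 x1||^2).
rng = np.random.default_rng(1)
W1 = 0.3 * rng.standard_normal((5, 3))   # maps layer 0 -> layer 1
W2 = 0.3 * rng.standard_normal((2, 5))   # maps layer 1 -> layer 2
x0 = rng.standard_normal(3)              # clamped input
xL = np.array([1.0, -1.0])               # clamped target

x1 = W1 @ x0                             # feedforward initialization
for _ in range(2000):                    # inference: relax the hidden layer
    eps1 = x1 - W1 @ x0
    eps2 = xL - W2 @ x1
    x1 += 0.1 * (-eps1 + W2.T @ eps2)

eps1 = x1 - W1 @ x0
eps2 = xL - W2 @ x1
print(np.allclose(eps1, W2.T @ eps2))    # equilibrium condition eps1 = W2^T eps2
```

Because the dynamics are linear and contracting here, the residual of the equilibrium condition decays geometrically with the number of inference steps.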
We present a novel interpretation of PCNs as implementing generalized expectation maximization (EM) with a constrained expectation step, which complements their usual interpretation as variational inference (Friston, 2005; Bogacz, 2017; Buckley et al., 2017). Moreover, we present a unified theoretical framework, which allows us to understand previously disparate results linking PC and BP, to make precise the connection between PC, TP, and BP, and crucially to prove that PCNs trained with prospective configuration will converge to the same minima as BP. We thus show that, in theory, PCNs have learning capacity and scalability at least equal to that of current machine learning architectures.

2. BACKGROUND AND NOTATION

Predictive coding networks (PCNs): Although PCNs can be defined for arbitrary computation graphs (Millidge et al., 2020a; Salvatori et al., 2022a; b), in this paper we assume an MLP structure. A PCN consists of a hierarchical set of layers with neural activations $\{x_0, \dots, x_L\}$, where $x_l \in \mathbb{R}^{w_l}$ is a real vector of length $w_l$, the number of neurons in layer $l$. We refer to $x_0$ and $x_L$ as the input and output layers, respectively. Each layer makes a prediction of the activity of the next layer, $\hat{x}_l = f(W_l x_{l-1})$, where $W_l \in \mathbb{R}^{w_l \times w_{l-1}}$ is a weight matrix of trainable parameters and $f$ is an elementwise activation function. During testing, the first-layer neural activities are fixed to an input data point $x_0 = d$, and predictions are propagated through the network to produce an output $\hat{x}_L$. During training, both the first and last layers of the network are clamped to the input data point $x_0 = d$ and a target value $x_L = T$, respectively. Both the neural activity and weight dynamics of the PCN can be derived as a gradient descent on a free-energy functional equal to the sum of the squared prediction errors $F = \sum_{l=1}^{L} \epsilon_l^2$, where $\epsilon_l = x_l - f(W_l x_{l-1})$. Typically, the activity updates are run to convergence to an equilibrium $x^*$ (called the inference phase), and the weights are updated once at convergence (called the learning phase):

Inference:
$$\dot{x}_l = -\frac{\partial F}{\partial x_l} = \begin{cases} -\epsilon_l + W_{l+1}^T \big(\epsilon_{l+1} \bullet f'(W_{l+1} x_l)\big), & \text{for } l < L, \\ -\epsilon_l, & \text{for } l = L, \end{cases} \qquad (1)$$

Learning:
$$\dot{W}_l = -\frac{\partial F}{\partial W_l}\bigg|_{x_l = x_l^*} = \big(\epsilon_l \bullet f'(W_l x_{l-1}^*)\big) x_{l-1}^{*\,T}. \qquad (2)$$

In the above equations, $\bullet$ denotes elementwise multiplication. The network energy function $F$ presented here can be derived as a variational free energy under a hierarchical Gaussian generative

