INCREMENTAL PREDICTIVE CODING: A PARALLEL AND FULLY AUTOMATIC LEARNING ALGORITHM

Abstract

Neuroscience-inspired models, such as predictive coding, have the potential to play an important role in the future of machine intelligence. However, they are not yet used in industrial applications due to several limitations, one of which is their lack of efficiency. In this work, we address this by proposing incremental predictive coding (iPC), a variation of the original framework derived from the incremental expectation-maximization algorithm, where every operation can be performed in parallel without external control. We show both theoretically and empirically that iPC is more efficient than the original algorithm by Rao and Ballard (1999), while maintaining performance comparable to backpropagation in image classification tasks. This work impacts several areas: it has general applications in computational neuroscience and machine learning, and specific applications in scenarios where automation and parallelization are important, such as distributed computing and implementations of deep learning models on analog and neuromorphic chips.

1. INTRODUCTION

In recent years, deep learning has reached and surpassed human-level performance in a multitude of tasks, such as game playing (Silver et al., 2017; 2016), image recognition (Krizhevsky et al., 2012; He et al., 2016), natural language processing (Chen et al., 2020), and image generation (Ramesh et al., 2022). These successes are achieved entirely using deep artificial neural networks trained via backpropagation (BP), a learning algorithm that is often criticized for its biological implausibilities (Grossberg, 1987; Crick, 1989; Abdelghani et al., 2008; Lillicrap et al., 2016; Roelfsema & Holtmaat, 2018; Whittington & Bogacz, 2019), such as lacking local plasticity and autonomy. In fact, backpropagation requires a global control signal to trigger computations, since gradients must be computed sequentially backwards through the computation graph. These properties are not only important for biological plausibility: parallelization, locality, and automation are key to building efficient models that can be trained end-to-end on non-von-Neumann machines, such as analog chips (Kendall et al., 2020).

A learning algorithm with most of the above properties is predictive coding (PC). PC is an influential theory of information processing in the brain (Mumford, 1992; Friston, 2005), where learning happens by minimizing the prediction error of every neuron. PC can be shown to approximate backpropagation in layered networks (Whittington & Bogacz, 2017), as well as on any other model (Millidge et al., 2020), and can exactly replicate its weight update if some external control is added (Salvatori et al., 2022b). The differences from BP are also interesting: PC allows much more flexible training and testing (Salvatori et al., 2022a), has a rich mathematical formulation (Friston, 2005; Millidge et al., 2022), and is an energy-based model (Bogacz, 2017). This makes PC unique, as it is the only model that jointly allows training on neuromorphic chips, implements influential models of cortical functioning in the brain, and matches the performance of backpropagation on different tasks. Its main drawback, however, is efficiency: it is slower than BP. In this work, we address this problem by proposing a variation of PC that is much more efficient than the original one.

Simply put, PC is based on the assumption that brains implement an internal generative model of the world, needed to predict incoming stimuli (or data) (Friston et al., 2006; Friston, 2010; Friston et al., 2016). When presented with a stimulus that differs from the prediction, learning happens by updating internal neural activities and synapses to minimize the prediction error. In computational models, this is done via multiple expectation-maximization (EM) (Dempster et al., 1977) steps on the variational free energy, in this case a function of the total error of the generative model. During the E-step, internal neural activities are updated in parallel until convergence; during the M-step, a weight update is performed to further minimize the same energy function. This approach results in two limitations: first, the E-step is slow, as it can require dozens of iterations before convergence; second, an external control signal is needed to switch from the E-step to the M-step. In this paper, we show how to address both of these problems by considering a variation of the EM algorithm, called incremental expectation-maximization (iEM), which performs both E and M steps in parallel (Neal & Hinton, 1998).
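To make the two-phase structure concrete, the following is a minimal sketch of a standard PC training step on a toy linear hierarchical generative model; the variable names, dimensions, and learning rates are illustrative assumptions, not the exact setup used in the experiments. The hidden activities are relaxed for several inference iterations (E-step) before a single weight update (M-step) is performed.

```python
import numpy as np

def pc_train_step(ws, y, x_top, T=20, lr_x=0.1, lr_w=1e-3):
    """One standard PC training step: E-step until (approximate) convergence, then one M-step.

    ws    : list of weight matrices; ws[l] maps layer l to its prediction of layer l+1
    y     : observed data vector, clamped to the bottom layer
    x_top : activity of the top (cause) layer
    """
    L = len(ws)
    # initialise activities with a forward sweep through the generative model
    xs = [x_top]
    for l in range(L):
        xs.append(ws[l] @ xs[l])
    xs[-1] = y  # clamp the last layer to the observed data

    # ---- E-step: relax the hidden activities for T iterations ----
    for _ in range(T):
        eps = [xs[l + 1] - ws[l] @ xs[l] for l in range(L)]     # prediction errors
        for l in range(1, L):                                    # hidden layers only
            xs[l] += lr_x * (-eps[l - 1] + ws[l].T @ eps[l])     # gradient descent on the energy

    # ---- M-step: a single weight update at the inferred activities ----
    eps = [xs[l + 1] - ws[l] @ xs[l] for l in range(L)]
    for l in range(L):
        ws[l] += lr_w * np.outer(eps[l], xs[l])
    return ws, xs

# toy usage: a 3-layer model with layer sizes 4 -> 8 -> 10
rng = np.random.default_rng(0)
ws = [0.1 * rng.standard_normal((8, 4)), 0.1 * rng.standard_normal((10, 8))]
ws, xs = pc_train_step(ws, y=rng.standard_normal(10), x_top=rng.standard_normal(4))
```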
This algorithm is provably faster, does not require a control signal to switch between the two steps, and has solid convergence guarantees (Neal & Hinton, 1998; Karimi et al., 2019). The result is a training algorithm that we call incremental predictive coding (iPC), a simple variation of PC that addresses its main drawback, efficiency, with no cost from the learning perspective, as it has been formally proven to have convergence properties equivalent to those of standard PC. Furthermore, we provide initial evidence that iPC is also potentially more efficient than BP in the specific case of full-batch training: we show theoretically that, on an ideal parallel machine, completing one update of all the weights of a network with L layers has time complexity O(1) for iPC, against O(L) for BP. However, additional engineering effort is needed to reach this goal, which is beyond the focus of this work: our experiments are performed using PyTorch (Paszke et al., 2017), which is not designed to parallelize computations across layers on GPUs. We partially address this limitation by performing some experiments on CPUs, which empirically confirm our claims about efficiency, as shown in Fig. 3. A sketch of the resulting joint update is given at the end of this section.

Our contributions are briefly as follows:

1. We first derive the update rule of iPC from the variational free energy of a hierarchical generative model using the incremental EM approach. We then discuss the implications of this change in terms of autonomy and convergence guarantees: iEM has been proven to converge to a minimum of the loss function (Neal & Hinton, 1998; Karimi et al., 2019), and this result naturally extends to iPC. We conclude by analyzing similarities and differences between iPC, standard PC, and BP.

2. We empirically compare the efficiency of PC and iPC on generation tasks, by replicating experiments performed in (Salvatori et al., 2021), and on classification tasks, by replicating experiments similar to those presented in (Whittington & Bogacz, 2017). In both cases, iPC is far more efficient than the original algorithm. Furthermore, we present initial evidence that iPC can decrease the training loss faster than BP, assuming a proper parallelization is in place.

3. We then test our model on a large number of image classification benchmarks, showing that iPC performs better than PC on average, and similarly to BP. We also show that iPC requires fewer parameters than BP to perform well on convolutional neural networks (CNNs). Finally, we show that iPC follows the trends of energy-based models when training robust classifiers (Grathwohl et al., 2019), and yields better-calibrated outputs than BP on the best-performing models.
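As a contrast with the phased PC procedure sketched above, the following is a minimal sketch of the corresponding iPC update on the same toy linear model (again with illustrative, hypothetical names and hyperparameters). Activity and weight updates are applied together at every iteration, so no signal is needed to switch between an inference phase and a learning phase; since every update is local, on an ideal parallel machine all the updates inside the loop could be executed simultaneously.

```python
import numpy as np

def ipc_train_step(ws, y, x_top, T=20, lr_x=0.1, lr_w=1e-3):
    """One iPC training step: activities and weights are updated together
    at every iteration (incremental E and M steps), with no phase switching."""
    L = len(ws)
    xs = [x_top]
    for l in range(L):
        xs.append(ws[l] @ xs[l])
    xs[-1] = y  # clamp the last layer to the observed data

    for _ in range(T):
        eps = [xs[l + 1] - ws[l] @ xs[l] for l in range(L)]    # prediction errors
        # all updates below are local; on parallel hardware they could run at once
        for l in range(1, L):
            xs[l] += lr_x * (-eps[l - 1] + ws[l].T @ eps[l])   # incremental E-step
        for l in range(L):
            ws[l] += lr_w * np.outer(eps[l], xs[l])            # incremental M-step
    return ws, xs
```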

2. PRELIMINARIES

In this section, we introduce the original formulation of predictive coding (PC) as a generative model, as proposed by Rao and Ballard (1999). Let us consider a generative model $g : \mathbb{R}^d \times \mathbb{R}^D \rightarrow \mathbb{R}^o$, where $x \in \mathbb{R}^d$ is a vector of latent variables called causes, $y \in \mathbb{R}^o$ is the generated vector, and $\theta \in \mathbb{R}^D$ is a set of parameters. We are interested in the following inverse problem: given a vector $y$ and a generative model $g$, find the parameters $\theta$ that maximize the marginal likelihood

$$p(y, \theta) = \int_x p(y \mid x, \theta)\, p(x, \theta)\, dx. \qquad (1)$$

Here, the first term inside the integral is the likelihood of the data given the causes, and the second is a prior distribution over the causes. Solving the above problem directly is intractably expensive. Hence, we need an algorithm divided into two phases: inference, where we infer the best causes $x$ given both $\theta$ and $y$, and learning, where we update the parameters $\theta$ based on the newly computed causes. This algorithm is expectation-maximization (EM) (Dempster et al., 1977). The first step, which we call inference or E-step, computes $p(x \mid y, \theta)$, that is, the posterior distribution of the causes given

