PREDICTIVE CODING WITH APPROXIMATE LAPLACE MONTE CARLO

Abstract

Predictive coding (PC) accounts of perception now form one of the dominant computational theories of the brain, where they prescribe a general algorithm for inference and learning over hierarchical Gaussian latent variable generative models. Despite this, they have enjoyed little export to the broader field of machine learning, where comparable generative modelling techniques have flourished. In part, this has been due to the poor performance of models trained with PC when evaluated by both sample quality and marginal likelihood. By adopting the perspective of PC as a variational Bayes algorithm under the Laplace approximation, we identify the source of these deficits as the exclusion of an associated Hessian term from the standard PC objective function. To remedy this, we make three primary contributions: we begin by suggesting a simple Monte Carlo estimate of the evidence lower bound which relies on sampling from the Hessian-parameterised variational posterior. We then derive a novel block diagonal approximation to the full Hessian matrix that has lower memory requirements and favourable mathematical properties. Lastly, we present an algorithm that combines our method with standard PC to reduce memory complexity further. We evaluate models trained with our approach against the standard PC framework on image benchmark datasets. Our approach produces higher log-likelihoods and qualitatively better samples that more closely capture the diversity of the data-generating distribution.

1. INTRODUCTION

In the last two decades, conceptions of the brain as an organ actively engaged in Bayesian inference have become exceedingly prominent in cognitive neuroscience (Pouget et al., 2013; Clark, 2013; Kanai et al., 2015). Under this paradigm, the brain adopts a probabilistic generative model of the world, with perception corresponding to inference over latent states, and learning to inference over its parameters. Predictive coding (PC) (Rao and Ballard, 1999; Friston, 2018), arguably the most notable instantiation of this perspective, describes a method for parameter learning in hierarchical latent Gaussian generative models with arbitrarily complex and highly non-linear parameterisations governing their conditional distributions. This computational scheme remains one of the foremost computational models for explaining cortical function (Mumford, 1992; Hosoya et al., 2005; Hohwy et al., 2008; Bastos et al., 2012; Shipp, 2016; Feldman and Friston, 2010; Fountas et al., 2022), making it important to evaluate whether it succeeds as a technique for training deep generative models of the kind presupposed in the brain. From a machine learning perspective, PC bears a close mathematical relationship to Bayesian techniques such as the variational auto-encoder (VAE) (Kingma and Welling, 2014), which also relies on optimising an evidence lower bound (ELBO); a key advantage over VAEs is ostensibly PC's use of non-amortised inference (Cremer et al., 2018). Furthermore, PC benefits from design principles inherited from its origins as a theory of cognitive function, namely asynchronous and local error computation (Whittington and Bogacz, 2019), suggesting a far greater amenability to implementation on energy-efficient neuromorphic hardware.
In this work, we show that generative models trained with PC (of the kind described in Bogacz, 2017; Tschantz et al., 2022; Millidge et al., 2022) have poor log marginal likelihoods when evaluated on common image datasets, and poor sample quality, despite producing good reconstructions. To diagnose these issues we begin by adopting the perspective of PC as a variational Bayes algorithm under the Laplace approximation (Friston, 2003; 2005; 2008). Under this approximation, quadratic assumptions over the log joint density of a generative model result in a Gaussian variational posterior with precision (inverse variance) equal to the Hessian matrix, or curvature, of the negative log joint with respect to its latent states. We then present a simple ELBO-based objective function that accounts for this curvature, and thus the uncertainty over latent states, using samples from the Laplace-optimal variational posterior. We show that our objective has the additional effect of regularising the sharpness of the probability landscape. Furthermore, to improve upon the memory complexity of computing the full Hessian matrix required for the Laplace ELBO objective, we present a novel block diagonal approximation to the Hessian that has lower memory complexity and is guaranteed positive semi-definite (PSD), ensuring its associated variational posterior can always be sampled from. Finally, to further remove the dependency of memory complexity on the output image dimensionality, we present a combined model, in which the final layer of our generative model is trained with PC, and all higher layers are trained with approximate Laplace Monte Carlo. The resulting method has memory complexity reduced to O(n_L^2) from O(N^2), where n_L and N are the dimensionalities of the largest latent layer and of all latent layers combined respectively, while retaining improved log likelihoods and sample quality.
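The core estimator sketched above, sampling an ELBO from the Hessian-parameterised Laplace posterior, can be illustrated on a toy linear-Gaussian model. All names, dimensions, and the model itself are illustrative assumptions, not the paper's architecture; in this linear case the Hessian is constant and the Laplace posterior is exact.

```python
import numpy as np

# Toy model (illustrative assumption, not the paper's architecture):
#   p(z) = N(0, I),  p(x | z) = N(W z, I)
# The Hessian of the negative log joint w.r.t. z is H = W^T W + I, which is
# constant here; in general it would be evaluated at z_MAP.

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(5, 3))
x = rng.normal(size=5)

def log_joint(z):
    # log p(x, z | W), including the Gaussian normalising constants
    return (-0.5 * np.sum((x - W @ z) ** 2) - 0.5 * np.sum(z ** 2)
            - 0.5 * (5 + 3) * np.log(2 * np.pi))

# Inference: MAP estimate of z (closed form for this linear model)
H = W.T @ W + np.eye(3)                      # Hessian of the neg. log joint
z_map = np.linalg.solve(H, W.T @ x)

# Laplace-optimal variational posterior q(z) = N(z_map, H^{-1})
L = np.linalg.cholesky(H)

def sample_q(n):
    # u = L^{-T} eps has covariance (L L^T)^{-1} = H^{-1}
    eps = rng.normal(size=(3, n))
    return (z_map[:, None] + np.linalg.solve(L.T, eps)).T

def log_q(z):
    d = z - z_map
    _, logdet_H = np.linalg.slogdet(H)
    return -0.5 * d @ H @ d + 0.5 * logdet_H - 0.5 * 3 * np.log(2 * np.pi)

# Monte Carlo ELBO: E_q[ log p(x, z) - log q(z) ]
zs = sample_q(8)
elbo = np.mean([log_joint(z) - log_q(z) for z in zs])
```

Because the Laplace posterior is exact for a linear-Gaussian model, this estimator has zero variance here and recovers the true log marginal likelihood; for the non-linear parameterisations the paper targets, it only lower-bounds it.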

2. PREDICTIVE CODING

Predictive coding is an algorithm with origins in computational neuroscience (Rao and Ballard, 1999; Friston, 2003; 2005; Friston and Kiebel, 2009) that prescribes a method for parameter learning in hierarchical latent variable probabilistic graphical models. In its most common form (Bogacz, 2017; Millidge et al., 2022; Tschantz et al., 2022), it can be described succinctly by the following simple recipe:

1. Define a (possibly hierarchical) graphical model over latent (z) and observed (x) states with parameters θ (i.e. log P(x, z|θ)).
2. For x ∼ D, where D is the data-generating distribution:
   Inference: obtain MAP estimates (z_MAP) for the latent states by gradient ascent on log P(x, z|θ).
   Learning: update the parameters θ by stochastic gradient ascent on the log joint evaluated at the MAP estimates found at the end of inference: log P(x, z_MAP|θ).

One common motivation for this algorithm rests upon its interpretation as a variational Bayesian method under a Dirac delta (deterministic) approximate posterior distribution (Friston, 2005; Bogacz, 2017). Under this interpretation, the inference step outlined in the PC algorithm corresponds to maximisation of an ELBO (for a particular data point) with respect to the mean of the variational Dirac delta distribution, and learning corresponds to maximising the ELBO (over the entire dataset) with respect to the model parameters θ. Another common interpretation assumes the Laplace approximation (Friston et al., 2007), under which inference corresponds to optimising the mean of a Gaussian variational posterior with covariance equal to the inverse Hessian of the negative log joint probability. While this interpretation retains the inference procedure of PC, it has non-trivial implications for the learning procedure, which we detail in the next section.
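The two-step recipe can be sketched for a toy single-layer linear-Gaussian model. The model, step sizes, and iteration counts below are illustrative assumptions made for the sketch, not choices taken from the paper.

```python
import numpy as np

# Toy model (illustrative assumption):
#   p(z) = N(0, I),  p(x | z) = N(W z, I)
# so  log P(x, z | W) = -0.5 ||x - W z||^2 - 0.5 ||z||^2 + const.

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 2))    # parameters theta
x = rng.normal(size=4)                    # one observed data point

def log_joint(x, z, W):
    return -0.5 * np.sum((x - W @ z) ** 2) - 0.5 * np.sum(z ** 2)

# Inference: gradient ascent on log P(x, z | W) w.r.t. z -> z_MAP
z = np.zeros(2)
for _ in range(200):
    grad_z = W.T @ (x - W @ z) - z        # d(log_joint)/dz
    z = z + 0.05 * grad_z

# Learning: one stochastic gradient step on log P(x, z_MAP | W) w.r.t. W
grad_W = np.outer(x - W @ z, z)           # d(log_joint)/dW
W = W + 0.01 * grad_W
```

In the hierarchical case the same two steps apply layer-wise, with each gradient depending only on locally computed prediction errors between adjacent layers.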

3. RELATED WORK AND THE LAPLACE APPROXIMATION

The Laplace approximation has historically been derived in two contexts. In the first, Laplace's method was adopted for the computation of the ordinarily intractable marginal model evidence after the maximum a posteriori (MAP) value of the latent states had already been identified (Kass and Raftery, 1995; Tierney and Kadane, 1986). In the second, the Laplace approximation was adopted for variational inference, wherein, under quadratic assumptions for the log joint, it can be shown that the Gaussian variational posterior which maximises the ELBO has inverse covariance equal to the Hessian of the negative log joint probability evaluated at the variational mode (Friston et al., 2007). We adopt this second perspective here and thus begin with the definition of the standard ELBO for a latent probabilistic model p(x, z|θ), where x and z are sets of observed and latent random variables respectively, and θ are a set of model parameters:
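In its standard, textbook form (stated with the symbols introduced above, and q(z) denoting the variational posterior), this bound reads:

```latex
\mathcal{L}(q, \theta)
  = \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta) - \log q(z)\big]
  = \log p(x \mid \theta) - \mathrm{KL}\big[q(z) \,\|\, p(z \mid x, \theta)\big]
  \le \log p(x \mid \theta)
```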

