BRAIN2GAN: RECONSTRUCTING PERCEIVED FACES FROM THE PRIMATE BRAIN VIA STYLEGAN3

Abstract

Neural coding characterizes the relationship between stimuli and their corresponding neural responses. The use of synthesized yet photorealistic images from generative adversarial networks (GANs) allows for superior control over these data: the underlying feature representations that account for the semantics in synthesized data are known a priori, and their relationship to the data is exact rather than approximated post hoc by feature-extraction models. We exploit this property in the neural decoding of multi-unit activity (MUA) responses that we recorded from the primate brain upon presentation of synthesized face images in a passive fixation experiment. First, the face reconstructions we acquired from brain activity were remarkably similar to the originally perceived face stimuli. Second, our findings show that, among the three recorded brain areas, responses from the inferior temporal (IT) cortex (i.e., the recording site furthest downstream) contributed most to the decoding performance. Third, applying Euclidean vector arithmetic to neural data (in combination with neural decoding) yielded results similar to those obtained on w-latents. Together, this provides strong evidence that the neural face manifold and the feature-disentangled w-latent space of StyleGAN3 (rather than the z-latent space of arbitrary GANs or the other feature representations considered thus far) share how they represent the high-level semantics of the high-dimensional space of faces.

1. INTRODUCTION

The field of neural coding aims at deciphering the neural code to characterize how the brain recognizes the statistical invariances of structured yet complex naturalistic environments. Neural encoding seeks to find how properties of external phenomena are stored in the brain by modeling the stimulus-response transformation (van Gerven, 2017). Vice versa, neural decoding aims to find what information about the original stimulus is present in, and can be retrieved from, the measured brain activity by modeling the response-stimulus transformation (Haynes & Rees, 2006; Kamitani & Tong, 2005). In particular, reconstruction is concerned with re-creating the literal stimulus image from brain activity. In both cases, it is common to factorize the direct transformation into two by invoking an in-between feature space (Figure 1). Not only does this favor data efficiency, as neural data is scarce, but it also allows one to test alternative hypotheses about the relevant stimulus features that are stored in, and can be retrieved from, the brain. The brain can effectively represent an infinite amount of visual phenomena to interpret and act upon the environment. Although such neural representations are constructed from experience, novel yet plausible situations that respect the statistics of the natural environment can also be mentally simulated or imagined (Dijkstra et al., 2019).

Figure 1: Neural coding. The transformation between sensory stimuli and brain responses via an intermediate feature space. Neural encoding is factorized into a nonlinear "analysis" and a linear "encoding" mapping. Neural decoding is factorized into a linear "decoding" and a nonlinear "synthesis" mapping.
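This two-stage factorization of neural decoding (a linear "decoding" mapping into feature space, followed by a nonlinear "synthesis" mapping into image space) can be sketched as below. This is a minimal illustration with random data; the channel count and the use of plain least squares are placeholder assumptions, not details taken from the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 960 recording channels, 512-dim feature (latent) space.
n_train, n_channels, n_features = 4000, 960, 512
X = rng.standard_normal((n_train, n_channels))   # brain responses (stand-in)
Z = rng.standard_normal((n_train, n_features))   # stimulus features (stand-in)

# Linear "decoding" stage: least-squares fit from responses to features.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def decode(responses):
    """Predict stimulus features from brain responses (linear stage)."""
    return responses @ W

# The nonlinear "synthesis" stage would then map predicted features back to
# an image via a pretrained generator, i.e. image = G(decode(responses)).
z_hat = decode(X[:1])
print(z_hat.shape)  # (1, 512)
```

The linear stage carries all the fitting to scarce neural data, while the nonlinear stage is a fixed, pretrained model; this is what makes the factorization data-efficient.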
From a machine learning perspective, generative models achieve the same objective: they capture the probability density that underlies a (very large) set of observations and can be used to synthesize new instances which appear to be from the original data distribution yet are suitably different from the observed instances. In particular, generative adversarial networks (GANs) (Goodfellow et al., 2014) are among the most impressive generative models to date, which can synthesize novel yet realistic-looking images (e.g., natural images and images of human faces, bedrooms, cars and cats; Brock et al., 2018; Karras et al., 2017; 2019; 2021) from randomly-sampled latent vectors. A GAN consists of two neural networks: a generator network that synthesizes images from randomly-sampled latent vectors and a discriminator network that distinguishes synthesized from real images. During training, these networks are pitted against each other until the generated data are indistinguishable from the real data. The bijective latent-to-image relationship of the generator can be exploited in neural decoding to disambiguate the synthesized images, as their visual content is fully specified by the underlying latent code (Kriegeskorte, 2015), and to perform analysis by synthesis (Yuille & Kersten, 2006). Deep convnets have been used to explain neural responses during visual perception, imagery and dreaming (Horikawa & Kamitani, 2017b; a; St-Yves & Naselaris, 2018; Shen et al., 2019b; a; Güçlütürk et al., 2017; VanRullen & Reddy, 2019; Dado et al., 2022). To our knowledge, the latter three are the most similar studies that also attempted to decode perceived faces from brain activity. Güçlütürk et al. (2017) used the feature representations from VGG16 pretrained on face recognition (i.e., trained in a supervised setting).
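The adversarial setup described above can be made concrete with a toy sketch: a generator maps latents to samples, a discriminator outputs the probability that a sample is real, and the two optimize opposing objectives. Everything here (single linear layers, dimensions) is an illustrative assumption, not the StyleGAN3 architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def generator(z, W):
    # Toy generator: one linear layer mapping latents to "images" in [-1, 1].
    return np.tanh(z @ W)

def discriminator(x, V):
    # Toy discriminator: linear layer + sigmoid giving P(sample is real).
    return 1.0 / (1.0 + np.exp(-(x @ V)))

latent_dim, img_dim = 8, 16
W = rng.standard_normal((latent_dim, img_dim)) * 0.1   # generator weights
V = rng.standard_normal((img_dim, 1)) * 0.1            # discriminator weights

z = rng.standard_normal((4, latent_dim))   # randomly-sampled latent vectors
fake = generator(z, W)                     # synthesized samples
p_real = discriminator(fake, V)            # discriminator's verdict on fakes

# Adversarial objectives (Goodfellow et al., 2014):
#   D maximizes log D(x) + log(1 - D(G(z)))
#   G maximizes log D(G(z))  (the non-saturating form)
g_loss = -np.mean(np.log(p_real + 1e-8))
d_loss_fake = -np.mean(np.log(1.0 - p_real + 1e-8))
```

At convergence the discriminator can no longer tell fake from real, and the generator's latent-to-image mapping is the deterministic function that neural decoding can exploit.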
Although more biologically plausible, unsupervised learning paradigms have appeared less successful in modeling neural representations in the primate brain than their supervised counterparts (Khaligh-Razavi & Kriegeskorte, 2014), with the exception of VanRullen & Reddy (2019) and Dado et al. (2022), who used adversarially learned latent representations of a variational autoencoder-GAN (VAE-GAN) and a GAN, respectively. Importantly, Dado et al. (2022) used synthesized stimuli to have direct access to the ground-truth latents instead of using post-hoc approximate inference, as VAE-GANs do by design. The current work improves the experimental paradigm of Dado et al. (2022) and provides several novel contributions: face stimuli were synthesized by a feature-disentangled GAN and presented to a macaque with cortical implants in a passive fixation task. A decoder model was fit on the recorded brain activity and the ground-truth latents. Reconstructions were created by feeding the predicted latents from brain activity from a held-out test set to the GAN. Previous neural decoding studies
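The decoding pipeline just described, fit a regression from recorded responses to the known ground-truth latents, then predict latents for held-out trials and pass them to the generator, can be sketched as follows. The unit count, the ridge penalty, and the simulated linear response model are hypothetical; only the 4000/100 train/test split and the 512-dim latents come from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup mirroring the paradigm: responses to 4000 training faces
# and 100 held-out test faces; 512-dim ground-truth latents per stimulus.
n_train, n_test, n_units, d = 4000, 100, 300, 512
X_train = rng.standard_normal((n_train, n_units))
X_test = rng.standard_normal((n_test, n_units))
W_true = rng.standard_normal((n_units, d)) * 0.05  # simulated response model
Z_train = X_train @ W_true      # stand-in for known ground-truth latents
Z_test = X_test @ W_true

# Ridge regression (closed form). Because the stimuli are synthesized, the
# target latents are exact -- no post-hoc feature extraction is required.
lam = 10.0
A = X_train.T @ X_train + lam * np.eye(n_units)
B = np.linalg.solve(A, X_train.T @ Z_train)

Z_hat = X_test @ B              # latents predicted from held-out brain activity
# Reconstruction step (omitted): feed Z_hat to the pretrained StyleGAN3
# generator to synthesize the reconstructed face images.
corr = np.corrcoef(Z_hat.ravel(), Z_test.ravel())[0, 1]
```

Decoding quality can then be assessed both in latent space (as with `corr` above) and in image space, by comparing reconstructions against the original stimuli.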



Figure 2: StyleGAN3 generator architecture. The generator takes a 512-dim. latent vector as input and transforms it into a 1024×1024 resolution RGB image. We collected a dataset of 4000 training set images and 100 test set images.

