NEURAL DECODING OF VISUAL IMAGERY VIA HIERARCHICAL VARIATIONAL AUTOENCODERS

Abstract

Reconstructing natural images from fMRI recordings is a challenging task of great importance in neuroscience. However, current architectures fail to efficiently capture the hierarchical processing of visual information, which may bottleneck their representation capacity. Motivated by this, we introduce a novel neural network architecture for the problem of neural decoding. Our architecture uses Hierarchical Variational Autoencoders (HVAEs) to learn meaningful representations of natural images and leverages their latent space hierarchy to learn voxel-to-image mappings. By mapping the early stages of the visual pathway to the first set of latent variables and the higher visual cortex areas to the deeper layers in the latent hierarchy, we construct a latent variable neural decoding model that replicates the hierarchical processing of visual information in the brain. Our model achieves better reconstructions than the state of the art, and our ablation study indicates that the hierarchical structure of the latent space is responsible for this performance.

1. INTRODUCTION

Decoding visual imagery from brain recordings is a key problem in neuroscience. The goal is to reconstruct the visual stimuli from fMRI recordings taken while the subject is viewing them. Even though some of the excitement is fueled by science fiction and the difficulty of the problem (1), the scientific consensus is that neural decoding has important real-world implications. It matters for understanding how neural activity relates to external stimuli (2), for engineering applications such as brain-computer interfaces (3), and for decoding imagery during sleep (4). Given its importance, neuroscience and machine learning researchers have jointly developed sophisticated deep learning architectures that allow us to design pipelines mapping voxel-based recordings to the corresponding visual stimuli. Based on the target learning task, visual decoding can be categorized into stimuli classification, stimuli identification, and stimuli reconstruction. The former two tasks aim to predict the object category of the presented stimulus or to identify the stimulus from an ensemble of possible stimuli. The reconstruction task, which is the most challenging one and the main focus of this paper, aims to construct a replica of the presented stimulus image from the fMRI recordings.

Related Work. The proposed methods for the problem of neural decoding can be broadly classified into three categories: non-deep learning methods, non-generative deep learning methods, and generative deep learning methods. The non-deep learning class consists of methods that are based on simple linear models and aim at reconstructing low-level image features (5). Such approaches first extract handcrafted features from real-world images, such as multi-scale image bases (6) or Gabor filters (7), and then learn a linear mapping from the fMRI voxel space to the extracted features.
Due to their simplicity, linear models are not able to reconstruct complex real-world images and thus their applicability is restricted to simple images containing only low-level features. 
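As a minimal sketch of this class of methods, the following NumPy snippet fits a ridge-regularized linear map from voxel responses to stimulus features. All names, dimensions, and the synthetic data are illustrative assumptions, not the setup of any cited work; in practice the feature matrix would hold, e.g., Gabor filter or multi-scale basis responses of the presented images.

```python
import numpy as np

# Illustrative dimensions (hypothetical): trials, voxels, handcrafted features.
rng = np.random.default_rng(0)
n_trials, n_voxels, n_features = 200, 500, 64

# X: fMRI voxel responses per trial; Y: handcrafted stimulus features
# (e.g. Gabor filter responses). Here both are synthetic stand-ins.
X = rng.standard_normal((n_trials, n_voxels))
W_true = 0.1 * rng.standard_normal((n_voxels, n_features))
Y = X @ W_true + 0.01 * rng.standard_normal((n_trials, n_features))

# Ridge regression: W = (X^T X + lambda * I)^{-1} X^T Y
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Y)

# Predicted features for (training) trials; reconstruction would then
# invert the feature extraction, e.g. by summing weighted image bases.
Y_pred = X @ W
```

The closed-form ridge solution is what keeps such pipelines simple: the only learned object is one weight matrix per feature set, which is also why these models cannot capture the nonlinear structure of complex natural images.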



Methods that use convolutional neural networks as well as encoder-decoder architectures belong to the non-generative deep learning class. Horikawa et al. (8) demonstrated a homology between human and machine vision by designing an architecture with which the features extracted from convolutional neural networks can be predicted from fMRI signals. Building upon those findings, Shen et al. (1) used a pretrained VGG-19 model to extract hierarchical features from stimulus images and learned a mapping from the fMRI voxels in the lower/higher visual areas to the corresponding low/high-level VGG-19 features. Beliy et al. (9) designed a CNN-based encoder-decoder architecture, where the encoder learns a

