HIPPOCAMPAL REPRESENTATIONS EMERGE WHEN TRAINING RECURRENT NEURAL NETWORKS ON A MEMORY DEPENDENT MAZE NAVIGATION TASK

Abstract

Can neural networks learn goal-directed behaviour using strategies similar to the brain's, by combining the relationships between the current state of the organism and the consequences of future actions? Recent work has shown that recurrent neural networks trained on goal-based tasks can develop representations resembling those found in the brain, such as grid cells of the entorhinal cortex. Here we explore the evolution of the dynamics of their internal representations and compare this with experimental data. We observe that once a recurrent network is trained to learn the structure of its environment solely based on sensory prediction, an attractor-based landscape forms in the network's representation, which parallels hippocampal place cells in structure and function. Next, we extend the predictive objective to include Q-learning for a reward task, where rewarding actions are dependent on delayed cue modulation. Mirroring experimental findings from hippocampal recordings in rodents performing the same task, this training paradigm causes nonlocal neural activity to sweep forward in space at decision points, anticipating the future path to a rewarded location. Moreover, prevalent choice- and cue-selective neurons form in this network, again recapitulating experimental findings. Together, these results indicate that combining predictive, unsupervised learning of the structure of an environment with reinforcement learning can help explain the formation of hippocampus-like representations containing both spatial and task-relevant information.

1. INTRODUCTION

Recurrent neural networks have been used to perform spatial navigation tasks, and subsequent study of their internal representations has yielded dynamics and structures that are strikingly biological. Metric (Cueva & Wei, 2018; Banino et al., 2018) and non-metric (Recanatesi et al., 2019) representations, mimicking grid cells (Fyhn et al., 2004) and place cells (O'Keefe & Nadel, 1978) respectively, form once the recurrent network has learned a predictive task in the context of a complex environment. Cueva et al. (2020) demonstrate not only the emergence of characteristic neural representations, but also hallmarks of head direction system cells, when training a recurrent network on a simple angular velocity integration task. Biologically, non-metric representations are associated with landmark spatial memory, in which place cells within the mammalian hippocampus fire when the organism is present in the corresponding place field. Extrafield firing occurs when these neurons spike outside of these contiguous place-field regions. Here we show that recurrent neural networks (RNNs) not only form corresponding attractor landscapes, but also produce representations with internal dynamics that closely resemble those found experimentally in the hippocampus during goal-directed behaviour. Johnson & Redish (2007) show that spatial representations in the CA3 region of the rat hippocampus frequently fire nonlocally. Griffin et al. (2007) show that a far higher proportion of hippocampal CA1 neurons in rats performing an episodic task in a T-shaped maze encode the phase of the task rather than spatial information (in this case trajectory direction). Ainge et al. (2007) show that CA1 place cells encode destination location at the start position of a maze, and Lee et al. (2006) demonstrate that place fields of CA1 neurons gradually drift toward reward locations throughout reward training on a T-shaped maze. In this work we show that a recurrent neural network learning a choice-reward task through reinforcement learning, in conjunction with predictive sensory learning in a T-shaped maze, produces an internal representation with consistent extrafield firing associated with consequential decision points. In addition, we find that the trained network's representation follows a forward-sweeping pattern like that identified by Johnson & Redish (2007). We then show that a higher proportion of units in the trained network show strong selectivity for the encoding or choice phase of the task than for spatial topology. Importantly, these properties only emerge during predictive learning, where task learning is much faster compared to traditional deep Q-learning.

2. METHOD

Figure 1: Left, the wall observation and cue received by the network at each timestep. Right, the entangled predictive task the LSTM network is pre-trained on in order to generate a non-metric map of the maze environment.

We use a form of the cued-choice maze used by Johnson & Redish (2007), which has a central T structure with returning arms, shown in Figure 1. All walls of the maze are tiled with distinct RGB colours, generated at random and fixed throughout. The agent initially learns to predict the next sensory stimulus given its movement; this combination of unsupervised learning and exploration has previously been shown to produce place cell-like encoding of the agent's position (Recanatesi et al., 2019). Next, rewards at four possible locations are introduced and the agent is tasked with associating a cue with the rewarding trajectory. The agent has four vision sensors, one in each cardinal direction, each reading the RGB colour of the wall it intersects. A cue tone is played to the agent as it passes the halfway point of the central maze stem: a low-frequency cue indicates that the agent will turn left at the top of the maze stem, and a high-frequency cue indicates a right turn. These cue tones take the form of a low- or high-valued scalar perturbed with normally distributed noise at the cue point, with a zero value given at all other locations. The four RGB colours, together with the cue value at the current location, make up the total input received by the agent. The agent is controlled by a recurrent neural network comprising a 380-unit Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997) network with a single-layer readout for the prediction of RGB values. We first pre-train the network by tasking it with predicting the subsequent observation of wall colours from the currently observable wall colours, given its trajectory through the maze.
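As an illustration, the per-timestep input described above (four RGB sensor readings plus a cue scalar) can be sketched as follows. The specific tone values and noise scale are our assumptions; the paper only specifies low versus high scalars perturbed with Gaussian noise.

```python
import numpy as np

def observe(wall_rgb, at_cue_point, cue_side, rng):
    """Build the agent's input vector: 4 RGB sensor readings plus a cue scalar.

    wall_rgb     : (4, 3) array of wall colours hit by the four cardinal sensors.
    at_cue_point : True only at the halfway point of the central stem.
    cue_side     : 'left' (low tone) or 'right' (high tone).
    The tone values (0.1 / 0.9) and noise scale (0.02) are assumed here.
    """
    if at_cue_point:
        base = 0.1 if cue_side == 'left' else 0.9
        cue = base + rng.normal(scale=0.02)  # noisy low/high scalar
    else:
        cue = 0.0                            # zero away from the cue point
    return np.concatenate([wall_rgb.ravel(), [cue]])  # shape (13,)
```

The resulting 13-dimensional vector (12 colour channels plus the cue) is what the LSTM receives at each timestep.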
The agent's starting location is at the bottom of the central stem of the T maze, and a trajectory of left or right at the top of the stem is chosen pseudorandomly, depicted with red and blue arrows respectively in Figure 1 and corresponding to the low (red trajectory) or high (blue trajectory) cue value given halfway up the stem. As in the experiments by Johnson & Redish (2007), during pre-training the agent does not choose any of its actions and is only learning to predict the sequence of wall colours it encounters. In a given pre-training iteration, we collect all observations as the agent traverses the maze until it returns to the start location at the bottom of the central stem, and then train the LSTM on the entire collected trajectory. The network is trained with a mean-squared error loss between predicted and target wall colours (Eq. 1), with model parameters optimised using Adam (Kingma & Ba, 2015) and a learning rate of 0.001.

$$\mathcal{L} = \sum_{t} \left( \mathrm{rgb}_{t+1} - \left( W_{\mathrm{rgb}} h_t + b_{\mathrm{rgb}} \right) \right)^2 \qquad (1)$$

where $h_t$ is the LSTM hidden state at timestep $t$, $\mathrm{rgb}_{t+1}$ is the target wall-colour observation at the next timestep, and $W_{\mathrm{rgb}}$, $b_{\mathrm{rgb}}$ are the weights and bias of the readout layer.
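A minimal sketch of the pre-training setup, assuming PyTorch, a 13-dimensional input (12 colour channels plus the cue scalar), and a 12-dimensional colour prediction; all dimensions other than the 380 hidden units are our assumptions:

```python
import torch
import torch.nn as nn

class PredictiveAgent(nn.Module):
    """380-unit LSTM with a single linear readout predicting the next
    wall-colour observation (4 sensors x RGB = 12 values, assumed)."""
    def __init__(self, in_dim=13, hidden=380, out_dim=12):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, out_dim)

    def forward(self, obs):           # obs: (batch, T, 13)
        h, _ = self.lstm(obs)         # hidden states h_t for every timestep
        return self.readout(h)        # predicted colours, (batch, T, 12)

def pretrain_step(model, optimiser, obs, target_rgb):
    """One pre-training iteration on a full collected trajectory:
    mean-squared error between predicted and target wall colours (Eq. 1)."""
    pred = model(obs)
    loss = torch.mean((target_rgb - pred) ** 2)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

model = PredictiveAgent()
# Adam with learning rate 0.001, as in the paper.
optimiser = torch.optim.Adam(model.parameters(), lr=0.001)
```

Each maze traversal would be collected into one `(1, T, 13)` observation tensor and passed to `pretrain_step` together with the colours observed one step later.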