TOWARDS BIOLOGICALLY PLAUSIBLE DREAMING AND PLANNING IN RECURRENT SPIKING NETWORKS

Anonymous

Abstract

Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performance. Recent model-based approaches show promising results by reducing the number of interactions with the environment needed to learn a desirable policy. However, these methods require biologically implausible ingredients, such as the detailed storage of older experiences and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) spiking neural network in which "dreaming" (living new experiences in a model-based simulated environment) significantly boosts learning. We also explore "planning", an online alternative to dreaming, which shows comparable performance. Importantly, our model does not require the detailed storage of experiences, and learns the world-model and the policy online. Moreover, we stress that our network is composed of spiking neurons, further increasing its biological plausibility and implementability in neuromorphic hardware.

1. INTRODUCTION

Humans can learn a new ability after practicing a few hours (e.g., driving or playing a game), whereas artificial neural networks require millions of reinforcement learning trials in virtual environments to solve the same task. Even then, their performance may not be comparable to human ability. Humans and animals have developed an understanding of the world that allows them to optimize learning. This relies on the building of an inner model of the world. Model-based reinforcement learning (Ye et al., 2021; Abbeel et al., 2006; Schrittwieser et al., 2020; Ha & Schmidhuber, 2018; Kaiser et al., 2019; Hafner et al., 2020) has been shown to reduce the amount of data required for learning. However, these approaches do not provide insights into biological intelligence, since they require biologically implausible ingredients (storing detailed information about experiences to train models, long offline learning periods, expensive Monte Carlo tree search to correct the policy). Moreover, the storage of long sequences is highly problematic on neuromorphic and FPGA platforms, where memory resources are scarce and the use of an external memory would imply large latencies. The optimal way to learn and exploit the inner model of the world is still an open question. Taking inspiration from biology, we explore the intriguing idea that a learned model can be used when the neural network is offline, in particular during deep sleep, dreaming, and day-dreaming. Sleep is known to be essential for awake performance, but the mechanisms underlying its cognitive functions are still to be clarified. A few computational models have started to investigate the interaction between sleep (both REM and NREM) and plasticity (González-Rueda et al., 2018; Wei et al., 2016; 2018; Korcsak-Gorzo et al., 2020; Golosio et al., 2021), showing improved performances and reorganized memories in the network after sleep.
A wake-sleep learning algorithm has shown the possibility to extend acquired knowledge with new symbolic abstractions and to train a neural network on imagined and replayed problems (Ellis et al., 2020). However, a clear and coherent understanding of the mechanisms that induce such generalized beneficial effects is missing. The idea that dreams might be useful to refine learned skills is fascinating and needs to be explored experimentally and in theoretical and computational models. Here, we define "dreaming" as a learning phase of our model in which it exploits, offline, the inner model learned while "awake". During "dreaming" the world-model replaces the actual world, allowing the agent to live new experiences and refine the behavior learned during awake periods, even when the world is not available. We show that this procedure significantly boosts learning speed. This choice is a biologically plausible alternative to experience replay (Wang et al., 2016; Munos et al., 2016), which requires storing detailed information about temporal sequences of previous experiences. Indeed, even though the brain is capable of storing episodic memories, it is unlikely that it can sample offline from tens or hundreds of past experiences.

Usually, dreaming approaches are inefficient because of the large compounding errors associated with simulating long sequences in a model-based simulated environment. In Ha & Schmidhuber (2018); Hafner et al. (2020); Okada & Taniguchi (2021), the authors were able to train a neural network only in its dreams. In Ha & Schmidhuber (2018) the authors used evolutionary methods to optimize the policy, which are not a biologically plausible option compared to reward-based policy gradient. In Hafner et al. (2020) the authors simulated short sequences (T = 15), starting from initial conditions belonging to past experiences. We show that we are able to successfully train the agent on long simulated sequences (T = 50), starting from random initial conditions. Also, we define "planning", an online alternative to "dreaming", which reduces the compounding error by simulating shorter sequences online. However, this requires additional computational resources while the network is already performing the task.

As stated above, learning the world-model usually requires the storage of an agent's experiences (Ye et al., 2021; Abbeel et al., 2006; Schrittwieser et al., 2020; Ha & Schmidhuber, 2018; Kaiser et al., 2019; Hafner et al., 2020), so as to learn the model offline. We circumvent this problem by formulating an efficient learning rule, local in space and time, that allows learning the world-model online. In this way, no information storage (except the network parameters) is required. For the sake of biological plausibility, we consider the most prominent model for neurons in the brain: leaky integrate-and-fire neurons. However, it is an open problem how recurrent spiking neural networks (RSNNs) can learn (Bellec et al., 2020; Muratore et al., 2021; Capone et al., 2022), i.e., how their synaptic weights can be modified by local rules for synaptic plasticity so that the computational performance of the network improves. Despite recent advances in this direction (Bellec et al., 2020; Göltz et al., 2021; Kheradpisheh & Masquelier, 2020), there are still very few applications to relevant tasks, due to the difficulty of analytically tackling the discontinuous nature of the spike.

To our knowledge, there are no previous works proposing biologically plausible model-based reinforcement learning in recurrent spiking networks. Our work is a step toward building efficient neuromorphic systems in which memory and computation are integrated in the same substrate, the network of neurons. This allows avoiding the latency of accessing external storage, e.g., to perform experience replay. Indeed, we argue that the use of (1) spiking neurons, (2) online learning rules, and (3) memories stored in the network (and not in external storage) makes our model almost straightforward to implement efficiently in neuromorphic hardware.
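As a concrete point of reference, the leaky integrate-and-fire dynamics mentioned above can be sketched in discrete time. This is a generic textbook formulation, not the paper's exact parameterization; all parameter values (membrane time constant, threshold, reset) are illustrative:

```python
import numpy as np

def simulate_lif(inputs, dt=1.0, tau_m=20.0, v_th=1.0, v_reset=0.0):
    """Discrete-time leaky integrate-and-fire neuron (generic formulation).

    inputs : array of input currents, one per time step (arbitrary units).
    Returns the membrane-potential trace and a binary spike train.
    """
    alpha = np.exp(-dt / tau_m)              # leak factor per time step
    v = 0.0
    v_trace, spikes = [], []
    for i_t in inputs:
        v = alpha * v + (1.0 - alpha) * i_t  # leaky integration of the input
        s = float(v >= v_th)                 # threshold crossing emits a spike
        if s:
            v = v_reset                      # hard reset after a spike
        v_trace.append(v)
        spikes.append(s)
    return np.array(v_trace), np.array(spikes)
```

The hard threshold `v >= v_th` is exactly the discontinuity discussed above: its derivative is zero almost everywhere, which is why gradient-based training of spiking networks typically relies on surrogate gradients or, as here, learning rules that are local in space and time.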

Figure 1: The agent-model spiking network. (A-B) Network structure: the network is composed of two modules, the agent and the model sub-networks. During the "awake" phase the two networks interact with the environment, learning from it. During the dreaming phase the environment is not accessible, and the model replaces the role of the environment. (C) Example frame of the environment perceived by the network. (D) Example of the reconstructed environment during the dreaming phase.
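The awake/dreaming alternation described above can be illustrated with a toy, non-spiking example: a linear world-model is learned online during the awake phase (no experience storage, only parameter updates), and during dreaming it replaces the environment, generating rollouts from random initial conditions. All names, dynamics, and learning rates here are hypothetical simplifications for illustration, not the paper's actual spiking network:

```python
import numpy as np

rng = np.random.default_rng(0)

class WorldModel:
    """Toy linear world-model trained online with a delta rule."""
    def __init__(self, dim):
        self.A = rng.normal(scale=0.1, size=(dim, dim))  # random init

    def predict(self, s, a):
        return self.A @ s + 0.1 * a          # hypothetical next-state rule

    def update(self, s, a, s_next, lr=0.01):
        err = s_next - self.predict(s, a)    # one-step prediction error
        self.A += lr * np.outer(err, s)      # local online update, no replay buffer

def awake_phase(env_step, model, policy, s, steps=50):
    """Interact with the real environment while learning the model online."""
    for _ in range(steps):
        a = policy(s)
        s_next = env_step(s, a)              # real environment transition
        model.update(s, a, s_next)           # model learns from the transition
        s = s_next
    return s

def dream_phase(model, policy, dim, steps=50):
    """The model replaces the environment; rollout from a random initial state."""
    s = rng.normal(size=dim)
    trajectory = []
    for _ in range(steps):
        a = policy(s)
        s = model.predict(s, a)              # simulated transition
        trajectory.append(s.copy())
    return trajectory
```

In the paper's setting the dreamed trajectories (of length T = 50) would additionally drive reward-based policy-gradient updates of the agent; this sketch only shows how the model can stand in for the environment without any stored experiences.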

