EPISODIC MEMORY FOR LEARNING SUBJECTIVE-TIMESCALE MODELS

Abstract

In model-based learning, an agent's model is commonly defined over transitions between consecutive states of an environment, even though planning often requires reasoning over multi-step timescales, with intermediate states either unnecessary or, worse, accumulating prediction error. In contrast, intelligent behaviour in biological organisms is characterised by the ability to plan over varying temporal scales depending on the context. Inspired by recent work on human time perception, we devise a novel approach to learning a transition dynamics model based on the sequences of episodic memories that define the agent's subjective timescale, over which it learns world dynamics and performs future planning. We implement this in the framework of active inference and demonstrate that the resulting subjective-timescale model (STM) can systematically vary the temporal extent of its predictions while preserving the same computational efficiency. Additionally, we show that STM predictions are more likely to include future salient events (for example, new objects coming into view), incentivising exploration of new areas of the environment. As a result, STM produces more informative action-conditioned roll-outs that assist the agent in making better decisions. We validate a significant improvement in our STM agent's performance in the Animal-AI environment against a baseline system trained using the environment's objective-timescale dynamics.

1. INTRODUCTION

An agent endowed with a model of its environment has the ability to predict the consequences of its actions and plan into the future before deciding on its next move. Models allow agents to simulate possible action-conditioned futures from their current state, even if that state was never visited during learning. As a result, model-based approaches can provide agents with better generalization abilities across both states and tasks in an environment, compared to their model-free counterparts (Racanière et al., 2017; Mishra et al., 2017). The most popular framework for developing agents with internal models is model-based reinforcement learning (RL). Model-based RL has seen great progress in recent years, with a number of proposed architectures attempting to improve both the quality and the usage of these models (Kaiser et al., 2020; Racanière et al., 2017; Kansky et al., 2017; Hamrick, 2019). Nevertheless, learning internal models poses a number of unsolved problems. Central among them is model bias, in which imperfections of the learned model result in unwanted over-optimism and sequential error accumulation for long-term predictions (Deisenroth & Rasmussen, 2011). Long-term predictions are additionally computationally expensive in environments with slow temporal dynamics, given that all intermediary states must be predicted. Moreover, slow world dynamics* can inhibit the learning of dependencies between temporally-distant events, which can be crucial in environments with sparse rewards. Finally, the temporal extent of future predictions is limited to the objective timescale of the environment over which the transition dynamics have been learned. This leaves little room for the flexible, context-dependent planning over varying timescales that is characteristic of animals and humans (Clayton et al., 2003; Cheke & Clayton, 2011; Buhusi & Meck, 2005).
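The sequential error accumulation described above can be illustrated with a minimal numerical sketch (our illustration, not from the paper; the dynamics and the magnitude of the model's bias are hypothetical): even a small one-step discrepancy between the learned model and the true transition function compounds rapidly over a long-horizon roll-out.

```python
# Illustrative sketch of one-step prediction error compounding over a rollout.
# Both transition functions and the size of the model bias are hypothetical.

def true_step(x):
    """Ground-truth transition dynamics (toy example)."""
    return 1.02 * x

def model_step(x):
    """Learned model with a small one-step bias."""
    return 1.03 * x

def rollout_error(x0, horizon):
    """Absolute gap between the imagined and true state after `horizon` steps."""
    x_true, x_pred = x0, x0
    for _ in range(horizon):
        x_true = true_step(x_true)
        x_pred = model_step(x_pred)
    return abs(x_pred - x_true)

print(rollout_error(1.0, 1))    # small error after a single step
print(rollout_error(1.0, 50))   # much larger error after a long rollout
```

A model that predicts several steps ahead in one jump, as STM does, applies this one-step error fewer times per unit of imagined time, which is one way to read the paper's motivation.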
The final issue exemplifies the disadvantage of the classical view of internal models, in which they are assumed to capture the ground-truth transition dynamics of the environment. Furthermore, in more complex environments with first-person observations, this perspective does not take into account the apparent subjectivity of first-person experience. In particular, the agent's learned representations of the environment's transition dynamics implicitly include information about time. Little work has been done to address the concept of time perception in model-based agents (Deverett et al., 2019). Empirical evidence from studies of human and animal cognition suggests that intelligent biological organisms do not perceive time precisely and do not possess an explicit clock mechanism responsible for keeping track of time (Roseboom et al., 2019; Sherman et al., 2020; Hills, 2003). For instance, humans tend to perceive time as passing more slowly in environments rich in perceptual content (e.g. a busy city), and faster in environments with little perceptual change (e.g. an empty field). The mechanisms of subjective time perception remain unknown; however, recent computational models based on episodic memory have been able to closely reproduce the deviations of human time perception from veridical perception (Fountas et al., 2020b). Inspired by this account, in this work we propose the subjective-timescale model (STM), an alternative approach to learning a transition dynamics model that replaces the objective timescale with a subjective one. The subjective timescale, defined by the sequences of the agent's episodic memories, is the timescale over which the agent perceives events in the environment and predicts future states. These memories are accumulated on the basis of saliency (i.e. how poorly an event was predicted by the agent's transition model), which attempts to mimic the way humans perceive time and results in the agent's ability to plan over varying timescales and construct novel future scenarios.
We employ active inference as the agent's underlying cognitive framework. Active inference is an emerging framework within computational neuroscience that attempts to unify perception and action under the single objective of minimising the free-energy functional. Similar to model-based RL, an active inference agent relies almost entirely on the characteristics and quality of its internal model to make decisions. Thus, it is naturally susceptible to the previously mentioned problems associated with imperfect, objective-timescale models. The selection of active inference for the purposes of this paper is motivated by its biological plausibility as a normative framework for understanding intelligent behaviour (Friston et al., 2017a; 2006), which is in line with the general theme of this work. Furthermore, being rooted in variational inference, the free-energy objective induces a distinct separation between the information-theoretic quantities that correspond to the different components of the agent's model, which is crucial for defining the memory-formation criterion. We demonstrate that the resulting characteristics of STM allow the agent to automatically perform both short- and long-term planning using the same computational resources and without any explicit mechanism for adjusting the temporal extent of its predictions. Furthermore, for long-term predictions STM systematically performs temporal jumps (skipping intermediary steps), thus providing more informative future predictions and reducing the detrimental effects of one-step prediction error accumulation. Lastly, being trained on salient events, STM far more frequently imagines futures that contain epistemically-surprising events, which incentivises exploratory behaviour.
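The saliency-based memory-formation idea can be sketched in a few lines. This is a minimal toy illustration of the principle, not the paper's implementation: the scalar "prediction error" and the fixed threshold stand in for the free-energy-based criterion, and all names (`EpisodicMemory`, `observe`) are our own. A transition is stored only when it was poorly predicted, so the stored sequence, the subjective timescale, skips well-predicted intermediary steps.

```python
# Toy sketch of saliency-based episodic memory formation (our illustration):
# store a timestep only when the model's prediction error exceeds a threshold.

from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    threshold: float = 0.5                      # hypothetical saliency threshold
    events: list = field(default_factory=list)  # the subjective-timescale sequence

    def observe(self, t, predicted, actual):
        error = abs(predicted - actual)   # scalar stand-in for prediction error
        if error > self.threshold:        # salient: the model predicted it poorly
            self.events.append((t, actual))

# Toy trajectory: mostly predictable drift, with two surprising jumps.
memory = EpisodicMemory()
state = 0.0
for t in range(10):
    predicted = state                               # model predicts "no change"
    actual = state + (2.0 if t in (3, 7) else 0.1)  # jumps at t=3 and t=7
    memory.observe(t, predicted, actual)
    state = actual

print([t for t, _ in memory.events])  # prints [3, 7]
```

Training the transition model on consecutive pairs of such stored events, rather than on consecutive environment steps, is what lets predictions span variable temporal extents at a fixed computational cost.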

2. RELATED WORK

Model-based RL. Internal models are extensively studied in the field of model-based RL. Using linear models to explicitly model transition dynamics has achieved impressive results in robotics (Levine & Abbeel, 2014a; Watter et al., 2015; Bagnell & Schneider, 2001; Abbeel et al., 2006; Levine & Abbeel, 2014b; Levine et al., 2016; Kumar et al., 2016). In general, however, their application is limited to low-dimensional domains and relatively simple environment dynamics. Similarly, Gaussian Processes (GPs) have been used (Deisenroth & Rasmussen, 2011; Ko et al., 2007). Their probabilistic nature allows for state uncertainty estimation, which can be incorporated in the planning module to make more cautious predictions; however, GPs struggle to scale to high-dimensional data. An alternative and recently more prevalent method for parametrising transition models is to use neural networks. These are particularly attractive due to their recent proven success in a variety of domains, including deep model-free RL (Silver et al., 2017), their ability to deal with high-dimensional data, and the existence of methods for uncertainty quantification (Blundell et al., 2015; Gal & Ghahramani, 2016). Different deep learning architectures have been utilised, including fully-connected neural networks (Nagabandi et al., 2018; Feinberg et al., 2018; Kurutach et al., 2018) and autoregressive models (Ha & Schmidhuber, 2018; Racanière et al., 2017; Ke et al., 2019), showing promising results in environments with relatively high-dimensional state spaces. In particular, autoregressive



* Worlds with small change in state given an action.

