GO-EXPLORE WITH A GUIDE: SPEEDING UP SEARCH IN SPARSE REWARD SETTINGS WITH GOAL-DIRECTED INTRINSIC REWARDS

Abstract

Reinforcement Learning (RL) agents have traditionally been very sample-intensive to train, especially in environments with sparse rewards. Inspired by neuroscience experiments in which rats learn the structure of a maze without needing extrinsic rewards, we seek to incorporate additional intrinsic rewards to guide behavior. We propose a potential-based goal-directed intrinsic reward (GDIR), which provides a reward signal regardless of whether the task is achieved and ensures that learning can always take place. While GDIR resembles approaches such as reward shaping in incorporating goal-based rewards, we highlight that GDIR is innate to the agent and hence applicable across a wide range of environments, without relying on a properly shaped environment reward. We also note that GDIR differs from curiosity-based intrinsic motivation, which can diminish over time and lead to inefficient exploration. Go-Explore is a well-known state-of-the-art algorithm for sparse reward domains, and we demonstrate that by incorporating GDIR in the "Go" and "Explore" phases, we can improve Go-Explore's performance and enable it to learn faster across multiple environments, for both discrete (2D grid maze environments, Towers of Hanoi, Game of Nim) and continuous (Cart Pole and Mountain Car) state spaces. Furthermore, to better consolidate learnt trajectories, our method incorporates a novel approach of hippocampal replay, which updates GDIR values and resets the visit and selection counts of states along a successful trajectory. As a benchmark, we also show that our proposed approaches learn significantly faster than traditional extrinsic-reward-based RL algorithms such as Proximal Policy Optimization, TD-learning, and Q-learning.

1. INTRODUCTION

Recently, LeCun (2022) described how intrinsic cost modules could motivate an agent's behavior. These modules can encode drives such as hunger, thirst, or goal-seeking that are innate to the agent and cannot be modified. The benefit of such an immutable cost module is that previously learnt state values are not affected by a continually changing model (such as the function approximators used in typical Deep Reinforcement Learning (RL)), and the agent can learn efficiently without needing to revisit past experiences each time the model changes. While Silver et al. (2021) state that environment-based extrinsic reward can be enough for learning complex skills and social behaviors, they also concede that it can be sample inefficient. Adding an intrinsic component to this environmental reward can be seen as giving the agent some self-driven motivation to learn such skills or behaviors, and can lead to attaining these competencies with better sample efficiency. Evidence for intrinsic motivation can be seen in neuroscience experiments on rats exploring a maze even without extrinsic food rewards (Fitzgerald et al., 1985; Hughes, 1997). This is not easily explained from the perspective of extrinsic environmental rewards alone; it suggests that intrinsic motivation plays a large part in the natural behavior of animals and could be a critical missing component in modern RL methods. Motivated by these observations, we seek intrinsic cost/reward functions that are innate to the agent, yet context dependent and triggered according to the task at hand. We incorporate this intrinsic cost/reward into RL models and derive the benefits of these intrinsic rewards in driving behavior. Specifically, we model a potential-based goal-directed intrinsic reward (GDIR), which tells an agent how far it is from the goal in order to guide its actions.
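As a concrete illustration, a potential-based GDIR can be written as r_int(s, s') = γΦ(s') − Φ(s), where Φ(s) measures closeness to the goal. The sketch below is a minimal Python illustration, assuming states and goals are coordinate tuples and using negative Euclidean distance as the potential; the function names and the distance choice are ours, not a fixed API from the paper:

```python
import math

def potential(state, goal):
    """Phi(s): negative Euclidean distance to the goal (illustrative choice).
    Any estimate of closeness to the goal could serve as the potential."""
    return -math.dist(state, goal)

def gdir(state, next_state, goal, gamma=0.99):
    """Potential-based goal-directed intrinsic reward:
    r_int = gamma * Phi(s') - Phi(s).
    Positive when a transition moves the agent closer to the goal, so a
    learning signal exists even when the extrinsic reward is zero."""
    return gamma * potential(next_state, goal) - potential(state, goal)
```

A step toward the goal yields a positive intrinsic reward and a step away yields a negative one, so learning can proceed whether or not the sparse extrinsic reward is ever observed.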
The proposed GDIR fits naturally in a multi-step task setting, where a planner module tells us which sub-tasks are required to achieve the goal. Deriving such a planner module is the subject of Hierarchical Reinforcement Learning (Al-Emran, 2015; Pateria et al., 2021). Our proposed GDIR can help to achieve these sub-tasks faster, as it provides a learning reward signal regardless of whether the task is achieved. In contrast to existing work in RL, we do not seek to find the best possible solution for a given environment. We seek, instead, a satisficing solution: one that is good enough to solve the task. We posit that one pitfall of seeking the optimal solution is that extensive exploration must take place, even after solving the environment, to ensure that the optimal path is traversed at least once. Hence, we opt for the satisficing route via our novel approach of hippocampal replay, which consolidates successful trajectories and repeats them consistently. This allows our method to learn faster and adapt better to novel environments in real-world systems.

Our Contributions. In this paper, we investigate using GDIR in Go-Explore, a well-known state-of-the-art algorithm for sparse reward domains. We highlight the following contributions:

1. We propose GDIR, a potential-based goal-directed intrinsic reward, which provides a reward signal regardless of whether the task is achieved and ensures that learning can always take place.
2. We incorporate GDIR into variants of Go-Explore and demonstrate that it enables Go-Explore to learn significantly faster across multiple environments, for both discrete and continuous state spaces.
3. To improve consolidation of learnt trajectories, we propose a novel approach of hippocampal replay, which updates GDIR values and resets the visit and selection counts of states along a successful trajectory.
We demonstrate that hippocampal replay helps an agent remember successful trajectories which solve its current environment and enables it to perform consistently.

2. RELATED WORK

Intrinsic Motivation. Oudeyer & Kaplan (2009) describe the definition of intrinsic motivation in psychology and give a list of computational approaches to model it. Chentanez et al. (2004) detail how to incorporate intrinsic motivation into traditional RL algorithms such as Q-Learning. Schmidhuber (2010) describes a form of intrinsic reward related to the discovery of novel, surprising patterns and world-model prediction. Baldassarre & Mirolli (2013) detail intrinsically motivated learning in natural and artificial systems. Aubret et al. (2019; 2022) give a general overview of the field of intrinsic motivation in RL, namely knowledge acquisition via exploration, empowerment, state space representation, and skill learning. Like prior work on intrinsic motivation, we seek a computational method to incorporate it into a learning system such that it speeds up learning. However, we focus not on curiosity-based intrinsic motivation but on a goal-directed form of intrinsic reward, for the reasons explained below.

Go-Explore. Go-Explore (Ecoffet et al., 2019) describes two common pitfalls of intrinsic motivation. The first is detachment: intrinsic rewards such as curiosity-driven exploration tend to diminish over time, as the frontiers of the exploration space may be explored multiple times without the extrinsic reward being obtained. Eventually, the frontier is no longer attractive and is neglected in subsequent exploration. The second is derailment: an agent seeking to return to a previously explored good state may be hindered from doing so by the in-built stochasticity of the algorithm, such as epsilon-greedy exploration (Montague, 1999), state-dependent exploration with action-space noise (Rückstieß et al., 2008), or parameter-space noise (Plappert et al., 2017).

Extensions of Go-Explore. First-return-then-explore (Ecoffet et al., 2021) extends Go-Explore with policy-based exploration, showing that the algorithm can learn a difficult MuJoCo robotic task which cannot be solved by curiosity-based intrinsic motivation. Yang et al. (2022) detail how post-exploration (exploring even after reaching the goal state) can help to expand the agent's state horizon and even lead to better performance on the same task.
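Go-Explore counters detachment by keeping an archive of visited cells and deliberately returning to ones it has selected least often. A simplified, count-based selection sketch is shown below; the inverse-count weighting is our own illustrative choice, not the exact heuristic from Ecoffet et al.:

```python
import random

def select_cell(archive, rng=random):
    """Sample a cell to return to (the 'Go' phase), weighting each cell
    inversely by how often it has been selected before, so rarely-tried
    frontier cells keep being revisited rather than abandoned."""
    cells = list(archive)
    weights = [1.0 / (1.0 + archive[c]["select_count"]) for c in cells]
    chosen = rng.choices(cells, weights=weights, k=1)[0]
    archive[chosen]["select_count"] += 1
    return chosen
```

Because selection probability never reaches zero, frontier cells remain reachable indefinitely, unlike curiosity signals that decay once a region has been seen.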

