GO-EXPLORE WITH A GUIDE: SPEEDING UP SEARCH IN SPARSE REWARD SETTINGS WITH GOAL-DIRECTED INTRINSIC REWARDS

Abstract

Reinforcement Learning (RL) agents have traditionally been very sample-intensive to train, especially in environments with sparse rewards. Inspired by neuroscience experiments in which rats learn the structure of a maze without requiring extrinsic rewards, we seek to incorporate additional intrinsic rewards to guide behavior. We propose a potential-based goal-directed intrinsic reward (GDIR), which provides a reward signal regardless of whether the task is achieved, ensuring that learning can always take place. While GDIR resembles approaches such as reward shaping in incorporating goal-based rewards, we highlight that GDIR is innate to the agent and hence applicable across a wide range of environments without relying on a properly shaped environment reward. GDIR also differs from curiosity-based intrinsic motivation, which can diminish over time and lead to inefficient exploration. Go-Explore is a well-known state-of-the-art algorithm for sparse reward domains, and we demonstrate that by incorporating GDIR in the "Go" and "Explore" phases, we can improve Go-Explore's performance and enable it to learn faster across multiple environments, for both discrete (2D grid maze environments, Towers of Hanoi, Game of Nim) and continuous (Cart Pole and Mountain Car) state spaces. Furthermore, to better consolidate learnt trajectories, our method also incorporates a novel approach of hippocampal replay, which updates the values of GDIR and resets the state visit and selection counts of states along the successful trajectory. As a benchmark, we also show that our proposed approaches learn significantly faster than traditional extrinsic-reward-based RL algorithms such as Proximal Policy Optimization, TD-learning, and Q-learning.
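To make the idea of a potential-based goal-directed intrinsic reward concrete, the sketch below illustrates one plausible form in a 2D grid maze. The potential function `phi` (here, negative Manhattan distance to the goal) and the function names are illustrative assumptions for exposition, not the paper's exact formulation:

```python
# Illustrative sketch (assumed form): a potential-based goal-directed
# intrinsic reward on a 2D grid maze. The choice of negative Manhattan
# distance as the potential is an assumption for this example.

def phi(state, goal):
    """Potential of a state: negative Manhattan distance to the goal."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def gdir_reward(state, next_state, goal, gamma=0.99):
    """Potential-based intrinsic reward: gamma * phi(s') - phi(s).

    This is non-zero on almost every transition, so a learning signal
    exists even when the extrinsic task reward is sparse."""
    return gamma * phi(next_state, goal) - phi(state, goal)

goal = (4, 4)
# Moving toward the goal yields a positive intrinsic reward,
# moving away yields a negative one.
toward = gdir_reward((0, 0), (0, 1), goal, gamma=1.0)
away = gdir_reward((0, 1), (0, 0), goal, gamma=1.0)
```

Because the reward is expressed as a difference of potentials, it preserves the optimal policy of the underlying task (in the sense of potential-based reward shaping) while densifying the learning signal.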

1. INTRODUCTION

Recently, LeCun (2022) described how intrinsic cost modules could motivate an agent's behavior. These intrinsic cost modules can be drives such as hunger, thirst, or goal-seeking that are innate to the agent and cannot be modified. The benefit of having such an immutable cost module is that previously learnt state values are not affected by a continually changing model (like the function approximators used in typical Deep Reinforcement Learning (RL)), and the agent can learn efficiently without needing to re-visit past experiences each time the model changes. While Silver et al. (2021) argue that environment-based extrinsic reward can be enough for learning complex skills and social behaviors, they also acknowledge that it can be sample inefficient. Adding an intrinsic component to this environmental reward can be seen as giving the agent self-driven intrinsic motivation to learn such skills or behaviors, and can lead to attaining these competencies with better sample efficiency. The presence of intrinsic motivation can be seen in neuroscience experiments on rats exploring a maze even without extrinsic food rewards (Fitzgerald et al., 1985; Hughes, 1997). This behavior is not easily explained from the perspective of extrinsic environmental rewards alone; it suggests that intrinsic motivation plays a major part in the natural behavior of animals and could be a critical missing component in modern RL methods.

