STEIN VARIATIONAL GOAL GENERATION FOR ADAPTIVE EXPLORATION IN MULTI-GOAL REINFORCEMENT LEARNING

Anonymous

Abstract

Multi-goal Reinforcement Learning has recently attracted a large amount of research interest. By allowing experience to be shared between related training tasks, this setting favors generalization for new tasks at test time. However, in settings with discontinuities in the goal space (e.g., walls in a maze) and when the reward is sparse, a majority of goals are difficult to reach. In this context, some curriculum over goals is needed to help agents learn by adapting training tasks to their current capabilities. In this work we propose a novel approach, Stein Variational Goal Generation (SVGG), which builds on recent automatic curriculum learning techniques for goal-conditioned policies. SVGG samples goals of intermediate difficulty for the agent by leveraging a learned predictive model of its goal-reaching capabilities. To that end, it models the distribution of goals with particles and relies on Stein Variational Gradient Descent to dynamically attract the goal sampling distribution toward areas of appropriate difficulty. We show that SVGG outperforms state-of-the-art multi-goal Reinforcement Learning methods in terms of success coverage in hard exploration problems, and demonstrate that our approach is endowed with a useful recovery property when the environment changes.

1. INTRODUCTION

In standard Reinforcement Learning (RL), agents learn a policy to optimally achieve a single task. By contrast, in Multi-goal RL (Kaelbling, 1993), they address a set of tasks by having policies conditioned on goals, where each goal corresponds to an individual task. The resulting goal-conditioned policies offer an efficient way of sharing experience between related tasks (Schaul et al., 2015; Pitis et al., 2020b; Yang et al., 2021). In general, the corresponding agents are equipped with the capability to sample goals from some goal space, but do not know whether a given goal can be achieved or not. Moreover, the general ambition in the multi-goal RL context is to obtain an agent able to reach any goal from a desired goal distribution, and to do so reliably and efficiently. This is particularly challenging in settings where the desired goal distribution is unknown at train time, which requires discovering the goal space by experience and optimizing its coverage without any prior knowledge. To avoid deceptive gradient issues, multi-goal RL often considers the sparse reward context, where the agent only obtains a non-null learning signal when the goal is reached. In that case, the multi-goal framework makes it possible to leverage Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), which helps densify the reward signal by relabeling failures as successes for the goals achieved by accident. However, in settings with discontinuities in the goal space (e.g., walls in a maze), or in hard-exploration problems where the long task horizon results in an exponential decrease of the learning signal (Osband et al., 2016), many goals remain hard to reach. In these more difficult contexts, and without any desired goal distribution at hand, we want to maximize the performance of the agent on all feasible goals, which we call the success coverage. This metric encompasses the capacity of the agent to explore and master every goal in the environment.
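To make the hindsight relabeling mechanism concrete, the following is a minimal sketch of HER's "future" relabeling strategy. The function name, transition layout, and the binary indicator reward are illustrative assumptions, not the exact interface used in this work or in the original implementation:

```python
import random

def her_relabel(episode, k=4):
    """Sketch of HER 'future' relabeling (Andrychowicz et al., 2017).

    `episode` is a list of (state, action, next_state, goal) tuples.
    For each transition, up to k goals achieved later in the same episode
    are substituted for the original goal, turning failed trajectories
    into successful ones for the goals reached by accident.
    """
    relabeled = []
    for t, (s, a, s_next, goal) in enumerate(episode):
        # Original transition with a sparse reward: non-null only on success.
        relabeled.append((s, a, s_next, goal, float(s_next == goal)))
        # Achieved states from this step onward serve as substitute goals.
        future_states = [trans[2] for trans in episode[t:]]
        for g in random.sample(future_states, min(k, len(future_states))):
            relabeled.append((s, a, s_next, g, float(s_next == g)))
    return relabeled
```

Because the agent's own achieved states are guaranteed successes for the relabeled goals, every episode yields at least some non-null learning signal even under a sparse reward.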
To do so, the selection of goals must be structured into a curriculum, helping agents explore and learn progressively by adapting training tasks to their current capabilities (Colas et al., 2018). The question is: how can we organize a curriculum of goals that maximizes the success coverage?

