STEIN VARIATIONAL GOAL GENERATION FOR ADAPTIVE EXPLORATION IN MULTI-GOAL REINFORCEMENT LEARNING

Anonymous

Abstract

Multi-goal Reinforcement Learning has recently attracted a large amount of research interest. By allowing experience to be shared between related training tasks, this setting favors generalization to new tasks at test time. However, in settings with discontinuities in the goal space (e.g., walls in a maze) and when the reward is sparse, a majority of goals are difficult to reach. In this context, some curriculum over goals is needed to help agents learn by adapting training tasks to their current capabilities. In this work we propose a novel approach: Stein Variational Goal Generation (SVGG), which builds on recent automatic curriculum learning techniques for goal-conditioned policies. SVGG samples goals of intermediate difficulty for the agent by leveraging a learned predictive model of its goal-reaching capabilities. To that end, it models the distribution of goals with particles and relies on Stein Variational Gradient Descent to dynamically attract the goal sampling distribution toward areas of appropriate difficulty. We show that SVGG outperforms state-of-the-art multi-goal Reinforcement Learning methods in terms of success coverage in hard exploration problems, and demonstrate that our approach is endowed with a useful recovery property when the environment changes.

1. INTRODUCTION

In standard Reinforcement Learning (RL), agents learn a policy to optimally achieve a single task. By contrast, in Multi-goal RL (Kaelbling, 1993), they address a set of tasks through policies conditioned on goals, where each goal corresponds to an individual task. The resulting goal-conditioned policies offer an efficient way of sharing experience between related tasks (Schaul et al., 2015; Pitis et al., 2020b; Yang et al., 2021). In general, the corresponding agents are equipped with the ability to sample goals from some goal space, but do not know whether a given goal can be achieved. Moreover, the general ambition in the multi-goal RL context is to obtain an agent able to reach any goal from a desired goal distribution, reliably and efficiently. This is particularly challenging in settings where the desired goal distribution is unknown at train time, which means discovering the goal space by experience and optimizing its coverage without any prior knowledge. To avoid deceptive gradient issues, multi-goal RL often considers the sparse reward setting, where the agent only obtains a non-null learning signal when the goal is reached. In that case, the multi-goal framework makes it possible to leverage Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), which densifies the reward signal by relabeling failures as successes for the goals achieved by accident. However, in settings with discontinuities in the goal space (e.g., walls in a maze), or in hard-exploration problems where the long task horizon results in an exponential decrease of the learning signal (Osband et al., 2016), many goals remain hard to reach. In these more difficult contexts, and without any desired goal distribution at hand, we want to maximize the performance of the agent on all feasible goals, which we call the success coverage. This metric captures the capacity of the agent to explore and master every goal in the environment.
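The HER relabeling idea mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical names (not the implementation used in this paper), following the common "future" relabeling strategy: each transition is duplicated with substitute goals drawn from states achieved later in the same trajectory, so failed episodes still yield positive sparse rewards.

```python
import numpy as np

def her_relabel(episode, k=4, rng=None):
    """Sketch of HER's 'future' relabeling strategy.

    episode: list of (state, action, next_state, goal) tuples.
    Returns extra transitions relabeled with goals achieved later in
    the same trajectory, turning failures into sparse-reward successes.
    """
    if rng is None:
        rng = np.random.default_rng()
    relabeled = []
    for t, (s, a, s_next, _) in enumerate(episode):
        # sample k states achieved at step >= t as substitute goals
        future_idx = rng.integers(t, len(episode), size=k)
        for i in future_idx:
            new_goal = episode[i][2]  # achieved state reused as goal
            # sparse reward: 1 iff the transition reaches the new goal
            reward = float(np.allclose(s_next, new_goal))
            relabeled.append((s, a, s_next, new_goal, reward))
    return relabeled
```

Relabeling the last transition with its own achieved state always produces at least one successful transition per episode, which is what makes HER effective under sparse rewards.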
To do so, the selection of goals must be structured into a curriculum that helps agents explore and learn progressively by adapting training tasks to their current capabilities (Colas et al., 2018). The question is: how can we organize a curriculum of goals that maximizes the success coverage? A first approach consists in focusing on novelty, with the objective of expanding the set of achieved goals (Pitis et al., 2020b; Pong et al., 2019; Warde-Farley et al., 2018; Nair et al., 2018). This leads to strong exploration results, but success coverage is only optimized implicitly. Another strategy is to bias the goal generation process toward goals of intermediate difficulty (GOIDs), which intuitively provide a strong learning signal to the agent (Florensa et al., 2017; Racaniere et al., 2019; Sukhbaatar et al., 2017; Zhang et al., 2020). By aiming at performance, those methods target success on encountered goals more explicitly, but benefit only from implicit exploration. In this work, we propose a novel method which provides the best of both worlds. Our method, called SVGG^1, learns a model of the probability of succeeding in reaching goals, and targets goals whose success is the most unpredictable. To model such a distribution over the goal space, we rely on a set of particles, where each particle represents a goal candidate. This set of particles is updated via Stein Variational Gradient Descent (Liu & Wang, 2016) to fit our objective of goals of intermediate difficulty. Due to the optimization properties of SVGD, in the absence of goals of intermediate difficulty, the current particles repel one another and foster exploration. We use this feature to demonstrate that SVGG possesses a very useful recovery property that prevents catastrophic forgetting and enables the agent to adapt when the environment changes during training.
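The SVGD update that drives the goal particles can be sketched as follows. This is a generic SVGD step with an RBF kernel, not the paper's full method: the target density p over goals of intermediate difficulty is left abstract here, represented only by a user-supplied `grad_logp` (a hypothetical stand-in for the score of the distribution induced by the learned success predictor). The first term attracts particles toward high-density regions; the second, repulsive term keeps them spread out, which is the property underlying SVGG's exploration behavior.

```python
import numpy as np

def rbf_kernel(X, h=1.0):
    """RBF kernel matrix K and its gradients for particles X of shape (n, d)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / h)
    # gradK[j, i] = gradient of k(x_j, x_i) with respect to x_j
    gradK = -2.0 / h * (X[:, None, :] - X[None, :, :]) * K[..., None]
    return K, gradK

def svgd_step(X, grad_logp, h=1.0, lr=0.1):
    """One SVGD update on goal particles X (n, d).

    grad_logp: maps particles to the score of the target density
    over goals. phi_i = (1/n) sum_j [k(x_j, x_i) grad log p(x_j)
                                     + grad_{x_j} k(x_j, x_i)].
    """
    n = X.shape[0]
    K, gradK = rbf_kernel(X, h)
    phi = (K @ grad_logp(X) + gradK.sum(axis=0)) / n  # attraction + repulsion
    return X + lr * phi
```

With a unimodal target, repeated calls move the particles toward the mode while the kernel-gradient term prevents them from collapsing onto it; when the attractive term vanishes (no goals of intermediate difficulty), only the repulsion remains and the particles disperse.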

2. BACKGROUND AND RELATED WORK

In this paper, we consider the multi-goal reinforcement learning setting, defined as a Markov Decision Process (MDP) M_g = ⟨S, T, A, g, R_g⟩, where S is the set of states, T the set of transitions, A the set of actions, and the reward function R_g is parametrized by a goal g lying in the d-dimensional continuous goal space G ≡ R^d. In our setting, each goal g is defined by a set of states S_g ⊆ S that are desirable situations for the corresponding task, with states in S_g being terminal states of the corresponding MDP. Thus, a goal g is considered achieved when the agent reaches at some step t a state s_t ∈ S_g, which implies the following sparse reward function R_g : S → {0, 1} in the absence of expert knowledge. With I the indicator function, R_g(s_t, a_t, s_{t+1}) = I(s_{t+1} ∈ S_g) for discrete state spaces, and R_g(s_t, a_t, s_{t+1}) = I(min_{s* ∈ S_g} ||s_{t+1} − s*||_2 < δ) for continuous ones, where δ is a distance threshold. The objective is then to learn a goal-conditioned policy (GCP) π : S × G → A which maximizes the expected cumulative reward from any initial state of the environment, given a goal g ∈ G: π* = arg max_π E_{g∼p_g} E_{τ∼π} [ Σ_{t=0}^∞ γ^t r_t^g ], where r_t^g = R_g(s_t, a_t, s_{t+1}) stands for the goal-conditioned reward obtained at step t of trajectory τ with goal g, γ ∈ (0, 1) is a discount factor, and p_g is the distribution of goals over G. In our setting we consider p_g to be uniform over G (i.e., no known desired distribution), while the work could be extended to cover different distributions.

2.1. AUTOMATIC CURRICULUM FOR SPARSE REWARD RL

Our SVGG method addresses automatic curriculum for sparse reward goal-conditioned RL (GCRL) problems and learns to achieve a continuum of related tasks.

Achieved Goals Distributions. Our work is strongly related to the MEGA algorithm (Pitis et al., 2020b), which (1) maintains a buffer of previously achieved goals, (2) models the distribution of achieved goals via kernel density estimation (KDE), and (3) uses this distribution to define its behavior distribution. By preferably sampling from the buffer goals at the boundary of the set of already reached states, an increase of the support of that distribution is expected. In doing so, MEGA aims at overcoming the limitations of previous related approaches which also model the distribution of achieved goals. For instance, DISCERN (Warde-Farley et al., 2018) only uses a replay buffer of goals, whereas RIG (Nair et al., 2018) and Skew-Fit (Pong et al., 2019) rather use variational auto-encoding (Kingma & Welling, 2013) of the distribution. While RIG samples from the modeled achieved distribution, and DISCERN and Skew-Fit skew that distribution to sample more diverse achieved goals, MEGA instead focuses on low-density regions of the distribution, aiming to expand it. This results in improved exploration compared to competitors. Our work differs from all these works as they only model achieved goals, independently from which goal was targeted when they were achieved.
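The MEGA-style selection rule described above (fit a KDE on achieved goals, then pick low-density goals from the buffer) can be sketched as follows. This is our own illustrative stand-in, not the authors' code: it uses SciPy's `gaussian_kde` as the density model and a simple argmin over sampled candidates, whereas the actual algorithm includes additional filtering.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mega_style_goal(achieved_goals, n_candidates=100, rng=None):
    """Sketch of MEGA-style goal selection.

    achieved_goals: buffer of previously achieved goals, shape (n, d).
    Fits a KDE on the buffer, then returns the sampled candidate with
    the lowest estimated density, i.e. a goal at the frontier of what
    the agent has reached so far.
    """
    if rng is None:
        rng = np.random.default_rng()
    kde = gaussian_kde(achieved_goals.T)  # gaussian_kde expects (d, n)
    m = min(n_candidates, len(achieved_goals))
    idx = rng.choice(len(achieved_goals), size=m, replace=False)
    candidates = achieved_goals[idx]
    density = kde(candidates.T)           # density at each candidate
    return candidates[np.argmin(density)] # lowest-density achieved goal
```

Sampling a candidate subset rather than scoring the whole buffer keeps the per-episode cost bounded as the buffer grows.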



^1 SVGG stands for Stein Variational Goal Generation. Code and instructions to reproduce our results will be made publicly available at https://anonymous.4open.science/r/Stein-Variational-Goal-Generation-1A6E.

