CONTINUOUS GOAL SAMPLING: A SIMPLE TECHNIQUE TO ACCELERATE AUTOMATIC CURRICULUM LEARNING

Abstract

Goal-conditioned reinforcement learning (RL) tackles the problem of training an RL agent to reach multiple goals in an environment, often with sparse rewards administered only upon reaching the goal. In this setting, automatic curriculum learning can improve an agent's learning by sampling goals in a structured order catered to the agent's current ability. This work presents two contributions to improve learning in goal-conditioned RL environments. First, we present continuous goal sampling, a simple, algorithm-agnostic technique to accelerate learning in which an agent's goals are sampled and changed multiple times within a single episode. Such continuous goal sampling enables faster exploration of the goal space and allows curriculum methods to have a more significant impact on an agent's learning. Second, we propose VDIFF, an automatic curriculum learning method that uses an agent's value function to create a self-paced curriculum by sampling goals on which the agent is demonstrating high learning progress. Through results on 17 multi-goal robotic manipulation and navigation environments, we show that continuous goal sampling, combined with VDIFF or existing curriculum learning methods, yields performance gains over state-of-the-art methods.

1. INTRODUCTION

Recent successes in deep reinforcement learning (DRL) have proven that it can tackle complex sequential decision-making problems in diverse domains such as robotics, video games, and traffic control (Andrychowicz et al., 2020; Vinyals et al., 2019; Li et al., 2016). Building upon the success of DRL in solving specific robotic tasks, the next step is the development of more general-purpose methods that can solve multiple tasks. With this objective in mind, the area of goal-conditioned RL (GCRL) (Schaul et al., 2015) has received increased attention as an extension of standard RL. Here, agents are trained to learn a policy that can achieve multiple goals, with sparse rewards provided when the agent achieves the desired goal (Andrychowicz et al., 2017). A natural question arises from the GCRL formulation: given a set or a distribution of goals, in what order should they be presented to the learning agent? A naive strategy is to sample these goals uniformly from the goal distribution. This strategy has two drawbacks. First, humans and other biological agents do not learn in a random, unstructured order (Ferster & Skinner, 1957). For example, a baby does not start walking before it learns to crawl. By the same intuition, at the start of learning, when an RL agent essentially has a random policy, sampling hard-to-achieve goals will, with high probability, result in no learning signal and wasted samples. Second, given that training time is limited, it is often not even possible to adequately cover the entire goal space of high-dimensional problems by random sampling. To address these challenges, curriculum learning methods structure an agent's learning by organizing the order in which goals or tasks are presented to the agent (Soviany et al., 2022). The fundamental intuition behind many of these methods is to sample goals that are neither too easy nor too hard and are thus maximally informative to the agent.
Although driven by the same intuition, the technical objectives used in prior work vary significantly. For example, Matiisen et al. (2019) and Portelas et al. (2020) propose to sample tasks on which an agent has high learning progress (LP), while VDS (Zhang et al., 2020) proposes to sample goals with high expected epistemic uncertainty. While curriculum learning methods have led to significant improvements over vanilla GCRL, they are still quite sample inefficient in sparse reward settings, especially for problems with high-dimensional goal spaces. One reason is that the standard GCRL setup constrains an agent to a single goal for the entire length of an episode. As a result, episodes in which the goal is too hard or too easy to achieve are uninformative and waste agent interactions. The single-goal framework also restricts curriculum learning methods to sampling goals from the initial state and does not give them the flexibility to change an agent's goal online based on its current state. However, since goals do not affect the transition dynamics, this constraint is not required; it is simply a consequence of the conventional framework. Using this observation and the recent success of curriculum learning methods, this work presents two contributions to accelerate and improve GCRL. First, and as the main focus of this work, we present continuous goal sampling, a simple technique that extends GCRL and can accelerate a wide range of curriculum learning algorithms. In continuous goal sampling, goals are sampled multiple times within an episode, instead of the standard practice of sampling a goal only at the start of an episode. Second, inspired by the success of recent LP-based curriculum methods in multi-task RL (Matiisen et al., 2019; Portelas et al., 2020), we reformulate the LP objective to enable it to work in sparse reward settings with random initial states.
We achieve this by using a current and a lagged value function to approximate the expected LP of a goal. Since value functions are a part of most current deep RL algorithms, we reuse them off the shelf for LP computation. Our proposed automatic curriculum learning algorithm, referred to as VDIFF, requires no additional learning, has little to no computational overhead, and is agnostic to the base RL algorithm. We present results on a set of 14 benchmark manipulation environments (Plappert et al., 2018; Gallouédec et al., 2021) and 3 maze navigation environments (Zhang et al., 2020). Through these results, we show that continuous goal sampling improves the sample efficiency and performance of explicit curriculum learning algorithms (VDS and VDIFF), implicit curriculum learning methods (HER), and vanilla RL algorithms in multiple environments. Finally, we show that our proposed curriculum learning method, VDIFF, outperforms existing explicit curriculum learning methods in GCRL. Anonymized code is available here.
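To make the two contributions concrete, the sketch below illustrates (i) an episode rollout in which the goal is resampled every few steps, and (ii) a VDIFF-style learning-progress score computed as the absolute difference between a current and a lagged value function. This is a minimal sketch, not the paper's implementation: the `env`, `policy`, `value_fn`, and `sample_goal` interfaces, the resampling period, and the averaging over states are all illustrative assumptions.

```python
import numpy as np

def vdiff_scores(value_fn, lagged_value_fn, states, candidate_goals):
    """VDIFF-style learning progress: |V_current - V_lagged| per goal,
    averaged over a batch of states. A sketch of the idea only; the
    paper's exact estimator may differ."""
    scores = []
    for g in candidate_goals:
        v_now = np.mean([value_fn(s, g) for s in states])
        v_old = np.mean([lagged_value_fn(s, g) for s in states])
        scores.append(abs(v_now - v_old))
    return np.asarray(scores)

def rollout_with_continuous_goals(env, policy, sample_goal,
                                  horizon=50, resample_every=10):
    """Collect one episode, resampling the goal every `resample_every`
    steps. This is valid in GCRL because goals do not affect the
    transition dynamics. Interfaces assumed:
      env.reset() -> state; env.step(action) -> (next_state, achieved)
      policy(state, goal) -> action
      sample_goal(state) -> goal drawn from the curriculum distribution
    """
    state = env.reset()
    goal = sample_goal(state)
    trajectory = []
    for t in range(horizon):
        if t > 0 and t % resample_every == 0:
            goal = sample_goal(state)  # mid-episode goal switch
        action = policy(state, goal)
        next_state, achieved = env.step(action)
        reward = 0.0 if achieved else -1.0  # sparse indicator reward
        trajectory.append((state, goal, action, reward, next_state))
        state = next_state
    return trajectory
```

A curriculum method would then implement `sample_goal` to favor goals with high `vdiff_scores`, e.g. by drawing from a softmax over the scores.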

2. BACKGROUND

2.1. GOAL-CONDITIONED RL

Goal-conditioned RL (GCRL) (Schaul et al., 2015) aims to train an agent to achieve multiple goals in an environment. Formally, it can be described as a finite-horizon Markov decision process (MDP) defined by the tuple (S, G, A, R, T, ρ, γ), where S is the state space, G is the goal space, A is the action space, R : S × A × G → ℝ is the reward function, T : S × A × S → [0, 1] is the state transition function, ρ(s_0) is the initial state distribution, and γ ∈ [0, 1] is the discount factor. At the start of each episode, a goal g is sampled from G, and the objective is to learn a policy π(a_t | s_t, g) that maximizes the expected value V, defined as the sum of discounted rewards $V(s_0, g) = \mathbb{E}\big[\sum_{t=0}^{T} \gamma^t R(s_t, a_t, g)\big]$. Given the dependence of R on g, GCRL can also be viewed as the problem of learning a policy π over a distribution of reward functions R_g parameterized by a goal g. The inherently binary structure of the GCRL setup allows for the definition of a sparse indicator reward function that indicates whether an agent has achieved the given goal g (Andrychowicz et al., 2017). An agent receives a reward of 0 and is said to have achieved goal g when d(s_t, g) < ε, where d(·, ·) is some distance function in goal space and ε is the acceptance threshold. In all other cases, the agent receives a reward of -1. In this work, an episode terminates as soon as an agent achieves the goal.
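The sparse indicator reward above can be written directly. In this sketch, Euclidean distance is used as one common choice for d(·, ·), and the threshold value is illustrative only:

```python
import numpy as np

def sparse_reward(state, goal, eps=0.05):
    """Sparse GCRL reward: 0 when the goal is achieved, i.e. the
    distance d(s_t, g) is below the acceptance threshold eps,
    and -1 otherwise. Euclidean distance is assumed here."""
    achieved = np.linalg.norm(np.asarray(state) - np.asarray(goal)) < eps
    return 0.0 if achieved else -1.0
```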

2.2. CURRICULUM LEARNING SETUP

In goal-conditioned RL, a goal g is sampled from G according to some probability distribution p(g|s_0), where s_0 is the initial state. The task of a curriculum learning method is to design p(g|s_0) to enable the sampling of meaningful goals that can both improve and accelerate an agent's learning. Curriculum learning methods dynamically change p(g|s_0) as an agent's policy evolves during training. To make this dependence on the policy more explicit, we denote the goal distribution designed by a curriculum method as p^π(g|s_0). Existing curriculum methods generally construct

