CONTINUOUS GOAL SAMPLING: A SIMPLE TECHNIQUE TO ACCELERATE AUTOMATIC CURRICULUM LEARNING

Abstract

Goal-conditioned reinforcement learning (RL) tackles the problem of training an RL agent to reach multiple goals in an environment, often with sparse rewards administered only upon reaching the goal. In this setting, automatic curriculum learning can improve an agent's learning by sampling goals in a structured order catered to the agent's current ability. This work presents two contributions to improve learning in goal-conditioned RL environments. First, we present a simple, algorithm-agnostic technique to accelerate learning by continuous goal sampling, in which an agent's goals are sampled and changed multiple times within a single episode. Such continuous goal sampling enables faster exploration of the goal space and allows curriculum methods to have a more significant impact on an agent's learning. Second, we propose VDIFF, an automatic curriculum learning method that uses an agent's value function to create a self-paced curriculum by sampling goals on which the agent is demonstrating high learning progress. Through results on 17 multi-goal robotic environments and navigation tasks, we show that continuous goal sampling, combined with VDIFF or existing curriculum learning methods, yields performance gains over state-of-the-art methods.
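The core mechanism of continuous goal sampling can be illustrated with a short sketch. This is not the paper's exact implementation; `env`, `agent`, and `goal_sampler` are hypothetical placeholders for any goal-conditioned environment, policy, and curriculum sampler, and the resampling interval is an arbitrary choice:

```python
# Sketch of continuous goal sampling: rather than fixing one goal per episode,
# the goal is resampled every `resample_interval` steps within the episode.
# The technique is algorithm-agnostic: any curriculum method can supply
# `goal_sampler.sample()`.

def run_episode(env, agent, goal_sampler, max_steps=200, resample_interval=50):
    obs = env.reset()
    goal = goal_sampler.sample()
    transitions = []
    for t in range(max_steps):
        if t > 0 and t % resample_interval == 0:
            goal = goal_sampler.sample()  # change the goal mid-episode
        action = agent.act(obs, goal)
        next_obs, reward, done = env.step(action, goal)
        transitions.append((obs, goal, action, reward, next_obs))
        obs = next_obs
        if done:
            break
    return transitions
```

Because several goals are pursued per episode, each episode probes a larger region of the goal space than a single fixed-goal rollout.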

1. INTRODUCTION

Recent successes in deep reinforcement learning (DRL) have shown that it can tackle complex sequential decision-making problems in diverse domains such as robotics, video games, and traffic control (Andrychowicz et al., 2020; Vinyals et al., 2019; Li et al., 2016). Building upon the success of DRL in solving specific robotic tasks, the next step is the development of more general-purpose methods that can solve multiple tasks. With this objective in mind, the area of goal-conditioned RL (GCRL) (Schaul et al., 2015) has received increased attention as an extension of standard RL. Here, agents are trained to learn a policy that can achieve multiple goals, with sparse rewards provided only when the agent achieves the desired goal (Andrychowicz et al., 2017).

A natural question arises from the GCRL formulation: given a set or a distribution of goals, in what order should they be presented to the learning agent? A naive strategy is to sample these goals uniformly from the goal distribution. This strategy has two drawbacks. First, humans and other biological agents do not learn in a random, unstructured order (Ferster & Skinner, 1957). For example, a baby learns to crawl before it starts walking. By the same intuition, at the start of learning, when an RL agent has an essentially random policy, sampling hard-to-achieve goals will, with high probability, yield no learning signal and wasted samples. Second, given that training time is limited, random sampling often cannot adequately cover the entire goal space of high-dimensional problems.

To address these challenges, curriculum learning methods structure an agent's learning by organizing the order in which goals or tasks are presented to the agent (Soviany et al., 2022). The fundamental intuition behind many of these methods is to sample goals that are neither too easy nor too hard and are thus maximally informative to the agent.
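The "neither too easy nor too hard" intuition can be made concrete with a minimal filter over candidate goals. This is an illustrative sketch, not any specific published method: the success-rate estimates and the band thresholds are hypothetical placeholders.

```python
# Illustrative filter for goals of intermediate difficulty: keep only goals
# whose estimated success rate is neither near-certain nor near-hopeless.
# `success_rate` maps each candidate goal to the agent's estimated success
# probability; `low` and `high` are arbitrary example thresholds.

def intermediate_goals(success_rate, low=0.1, high=0.9):
    return [g for g, p in success_rate.items() if low <= p <= high]
```

Goals outside the band carry little learning signal: near-zero success rates produce no reward, while near-one success rates are already mastered.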
Although driven by the same intuition, the technical objectives used in prior work vary significantly. For example, Matiisen et al. (2019) and Portelas et al. (2020) propose to sample tasks on which an agent has high learning progress (LP), while VDS (Zhang et al., 2020) proposes to sample goals that have high expected epistemic uncertainty.

