OUTCOME-DIRECTED REINFORCEMENT LEARNING BY UNCERTAINTY & TEMPORAL DISTANCE-AWARE CURRICULUM GOAL GENERATION

Abstract

Current reinforcement learning (RL) often suffers in challenging exploration problems where the desired outcomes or high rewards are rarely observed. Although curriculum RL, a framework that solves complex tasks by proposing a sequence of surrogate tasks, shows reasonable results, most previous works still have difficulty proposing curricula because they lack a mechanism for obtaining calibrated guidance toward the desired outcome states without any prior domain knowledge. To alleviate this, we propose an uncertainty & temporal distance-aware curriculum goal generation method for outcome-directed RL, obtained by solving a bipartite matching problem. It not only provides precisely calibrated guidance of the curriculum toward the desired outcome states, but also achieves much better sample efficiency and geometry-agnostic curriculum goal proposal compared with previous curriculum RL methods. We demonstrate, both quantitatively and qualitatively, that our algorithm significantly outperforms these prior methods on a variety of challenging navigation and robotic manipulation tasks.

1. INTRODUCTION

While reinforcement learning (RL) shows promising results in the automated learning of behavioral skills, it still struggles to solve challenging uninformed search problems in which the desired behavior and rewards are sparsely observed. Some techniques tackle this problem by utilizing shaped rewards (Hartikainen et al., 2019) or combining representation learning for efficient exploration (Ghosh et al., 2018). However, these approaches not only become prohibitively time-consuming in terms of required human effort, but also demand significant domain knowledge for shaping the reward or designing the task-specific representation learning objective. What if we could design an algorithm that automatically progresses toward the desired behavior without any domain knowledge or human effort, while distilling its experiences into a general-purpose policy? An effective scheme for such an algorithm is to learn on a tailored sequence of curriculum goals, allowing the agent to autonomously practice intermediate tasks. A fundamental challenge, however, is that proposing curriculum goals to the agent is intimately connected to efficient desired outcome-directed exploration, and vice versa. If the curriculum generation fails to recognize the frontier of the explored and feasible region, efficient exploration toward the desired outcome states cannot be performed.
Some prior works modify the curriculum distribution toward a uniform one over the feasible state space (Pong et al., 2019; Klink et al., 2022) or generate curricula based on the level of difficulty (Florensa et al., 2018; Sukhbaatar et al., 2017). However, most of these methods show slow curriculum progress, either because they skew the curriculum distribution toward the uniform one rather than toward the frontier of the explored region, or because they are susceptible to focusing on infeasible goals of intermediate difficulty where the agent's capability stagnates.

Figure 1: OUTPACE proposes uncertainty and temporal distance-aware curriculum goals to enable the agent to progress toward the desired outcome state automatically. Note that the temporal distance estimation is reliable within the explored region where we query the curriculum goals.

Conversely, without efficient desired outcome-directed exploration, the curriculum proposal can be ineffective at recognizing the frontier in terms of progressing toward the desired outcomes, because curriculum goals are generally obtained from the agent's experiences gathered through exploration. Some prior works propose success-example-based approaches (Fu et al., 2018; Singh et al., 2019; Li et al., 2021), but these are limited to achieving only the given example states and thus cannot be generalized to arbitrary goal-conditioned agents. Other approaches propose to minimize the distance between the curriculum distribution and the desired outcome distribution (Ren et al., 2019; Klink et al., 2022), but these assume that the distance between samples can be measured by a Euclidean metric, which does not generalize to arbitrary geometric structures of the environment.
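To make the Euclidean-metric assumption concrete, the distribution-matching objective used by these prior approaches amounts to shrinking a quantity like the one below. This is a minimal 1-D sketch with synthetic samples (the variable names and Gaussian parameters are illustrative assumptions, not taken from any of the cited methods); SciPy's `wasserstein_distance` uses a Euclidean ground metric, which is exactly what fails when, e.g., a wall separates the two distributions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Hypothetical 1-D samples: current curriculum goals vs. desired outcomes.
curriculum_goals = rng.normal(loc=0.0, scale=1.0, size=500)
desired_outcomes = rng.normal(loc=4.0, scale=1.0, size=500)

# Prior methods minimize this distance, implicitly assuming straight-line
# (Euclidean) distances between samples are meaningful in the environment.
d = wasserstein_distance(curriculum_goals, desired_outcomes)
```

Here `d` is close to the gap between the two means (about 4); the failure mode criticized above is that this number is unchanged by any obstacle the agent would actually have to navigate around.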
Therefore, we argue that developing algorithms that simultaneously address both outcome-directed exploration and curriculum generation toward the frontier is crucial to benefit from outcome-directed curriculum RL. In this work, we propose Outcome-directed Uncertainty & TemPoral distance-Aware Curriculum goal gEneration (OUTPACE) to address these problems; it requires only desired outcome examples, and neither prior domain knowledge nor an external reward from the environment. The key elements of our work are twofold. First, our method performs desired outcome-directed exploration via a Bayesian classifier that incorporates uncertainty quantification based on the conditional normalized maximum likelihood (Zhou & Levine, 2021; Li et al., 2021), which enables it to propose curricula into unexplored regions and provide directed guidance toward the desired outcomes. Second, our method utilizes the Wasserstein distance with a time-step metric (Durugkar et al., 2021b), not only as a temporal distance-aware intrinsic reward but also for querying the frontier of the explored region during curriculum learning. By combining these two elements, we obtain a simple and intuitive curriculum learning objective, formalized as a bipartite matching problem, that generates a set of calibrated curriculum goals interpolating between the initial state distribution and the desired outcome state distribution. To sum up, our work makes the following key contributions.
• We propose an outcome-directed curriculum RL method that requires only desired outcome examples and no external reward.
• To the best of our knowledge, we are the first to propose an uncertainty & temporal distance-aware curriculum goal generation method for geometry-agnostic progress, by leveraging the conditional normalized maximum likelihood and the Wasserstein distance.
• Through several experiments in goal-reaching environments, we show both quantitatively and qualitatively that our method outperforms prior curriculum RL methods, most notably when the environment has a non-trivial geometric structure, and that its curriculum proposals provide properly calibrated guidance toward the desired outcome states.
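The bipartite-matching view of curriculum goal generation described above can be sketched as follows. This is a simplified illustration, not the paper's exact objective: the cost function, the `temporal_dist` and `uncertainty` estimators, and the weighting `lam` are hypothetical stand-ins for the learned Wasserstein critic and CNML-based classifier, and the matching is solved with SciPy's Hungarian-algorithm routine `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def propose_curriculum_goals(candidates, outcomes, temporal_dist, uncertainty, lam=1.0):
    """Assign one candidate goal to each desired outcome example via bipartite
    matching.  `temporal_dist(c, o)` estimates the time-step distance from a
    candidate goal to an outcome; `uncertainty(c)` scores how unexplored a
    candidate is.  Both are assumed callables standing in for the learned
    estimators; a low total cost prefers goals that are temporally close to the
    outcomes yet still uncertain, i.e. on the frontier of the explored region.
    """
    cost = np.zeros((len(candidates), len(outcomes)))
    for i, c in enumerate(candidates):
        for j, o in enumerate(outcomes):
            cost[i, j] = temporal_dist(c, o) - lam * uncertainty(c)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Return the matched candidates ordered by their assigned outcome index.
    return [candidates[i] for i in rows[np.argsort(cols)]]

# Toy usage: 1-D candidates, absolute difference as a stand-in distance.
goals = propose_curriculum_goals(
    candidates=[0.0, 5.0, 9.0],
    outcomes=[10.0],
    temporal_dist=lambda c, o: abs(c - o),
    uncertainty=lambda c: 0.0,
)
```

Solving a matching rather than greedily picking the single best goal per outcome ensures the proposed set of curriculum goals stays diverse: each outcome example receives a distinct goal.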

2. RELATED WORKS

While a number of works have been proposed to improve exploration in RL, it still remains a challenging open problem. Prior works tackling this problem include state-visitation counts (Bellemare et al., 2016; Ostrovski et al., 2017), curiosity/similarity-driven exploration (Pathak et al., 2017; Warde-Farley et al., 2018), the prediction model's uncertainty-based

