OUTCOME-DIRECTED REINFORCEMENT LEARNING BY UNCERTAINTY & TEMPORAL DISTANCE-AWARE CURRICULUM GOAL GENERATION

Abstract

Current reinforcement learning (RL) often suffers on challenging exploration problems where the desired outcomes or high rewards are rarely observed. Although curriculum RL, a framework that solves complex tasks by proposing a sequence of surrogate tasks, shows reasonable results, most previous works still have difficulty proposing curricula due to the absence of a mechanism for obtaining calibrated guidance to the desired outcome states without any prior domain knowledge. To alleviate this, we propose an uncertainty & temporal distance-aware curriculum goal generation method for outcome-directed RL via solving a bipartite matching problem. It not only provides precisely calibrated guidance of the curriculum to the desired outcome states but also achieves much better sample efficiency and geometry-agnostic curriculum goal proposal capability compared to previous curriculum RL methods. We demonstrate, both quantitatively and qualitatively, that our algorithm significantly outperforms these prior methods on a variety of challenging navigation and robotic manipulation tasks.

1. INTRODUCTION

While reinforcement learning (RL) shows promising results in the automated learning of behavioral skills, it is still not enough to solve challenging uninformed search problems where the desired behavior and rewards are sparsely observed. Some techniques tackle this problem by utilizing shaped rewards (Hartikainen et al., 2019) or combining representation learning for efficient exploration (Ghosh et al., 2018). But these not only become prohibitively time-consuming in terms of the required human effort, but also require significant domain knowledge for shaping the reward or designing the task-specific representation learning objective. What if we could design an algorithm that automatically progresses toward the desired behavior without any domain knowledge or human effort, while distilling the experiences into a general-purpose policy? An effective scheme for designing such an algorithm is one that learns on a tailored sequence of curriculum goals, allowing the agent to autonomously practice intermediate tasks. However, a fundamental challenge is that proposing curriculum goals to the agent is intimately connected to efficient desired outcome-directed exploration, and vice versa. If the curriculum generation is ineffective at recognizing the frontier of the explored and feasible areas, efficient exploration toward the desired outcome states cannot be performed.
Even though some prior works propose to modify the curriculum distribution into a uniform one over the feasible state space (Pong et al., 2019; Klink et al., 2022) or generate a curriculum based on the level of difficulty (Florensa et al., 2018; Sukhbaatar et al., 2017), most of these methods show slow curriculum progress: they either skew the curriculum distribution toward the uniform one rather than toward the frontier of the explored region, or are susceptible to focusing on infeasible goals, where the agent's capability stagnates at an intermediate level of difficulty.

Figure 1: OUTPACE proposes uncertainty and temporal distance-aware curriculum goals to enable the agent to progress toward the desired outcome state automatically. Note that the temporal distance estimation is reliable within the explored region where we query the curriculum goals.

Conversely, without efficient desired outcome-directed exploration, the curriculum proposal can be ineffective at recognizing the frontier in terms of progressing toward the desired outcomes, because curriculum goals are, in general, obtained from the agent's experiences through exploration. Even though some prior works propose success-example-based approaches (Fu et al., 2018; Singh et al., 2019; Li et al., 2021), these are limited to achieving only the given example states, which means they cannot be generalized to arbitrary goal-conditioned agents. Other approaches propose to minimize the distance between the curriculum distribution and the desired outcome distribution (Ren et al., 2019; Klink et al., 2022), but these assume that the distance between samples can be measured with the Euclidean metric, which does not generalize to arbitrary geometric structures of the environment.
Therefore, we argue that developing algorithms that simultaneously address both outcome-directed exploration and curriculum generation toward the frontier is crucial to benefit from outcome-directed curriculum RL. In this work, we propose Outcome-directed Uncertainty & TemPoral distance-Aware Curriculum goal gEneration (OUTPACE) to address these problems; it requires only desired outcome examples, with neither prior domain knowledge nor external rewards from the environment. Specifically, the key elements of our work consist of two parts. Firstly, our method addresses desired outcome-directed exploration via a Bayesian classifier incorporating an uncertainty quantification based on the conditional normalized maximum likelihood (Zhou & Levine, 2021; Li et al., 2021), which enables our method to propose curricula in unexplored regions and provide directed guidance toward the desired outcomes. Secondly, our method utilizes the Wasserstein distance with a time-step metric (Durugkar et al., 2021b) not only for a temporal distance-aware intrinsic reward but also for querying the frontier of the explored region during curriculum learning. By combining these two elements, we propose a simple and intuitive curriculum learning objective, formalized as a bipartite matching problem, to generate a set of calibrated curriculum goals that interpolate between the initial state distribution and the desired outcome state distribution. To sum up, our work makes the following key contributions:
• We propose an outcome-directed curriculum RL method that only requires desired outcome examples and does not require an external reward.
• To the best of our knowledge, we are the first to propose an uncertainty & temporal distance-aware curriculum goal generation method for geometry-agnostic progress by leveraging the conditional normalized maximum likelihood and the Wasserstein distance.
• Through several experiments in goal-reaching environments, we show that our method outperforms prior curriculum RL methods, most notably when the environment has a geometric structure, and that its curriculum proposal provides properly calibrated guidance toward the desired outcome states, both quantitatively and qualitatively.

2. RELATED WORKS

While a number of works have been proposed to improve exploration in RL, it remains a challenging open problem. Prior works tackling it include state-visitation counts (Bellemare et al., 2016; Ostrovski et al., 2017), curiosity/similarity-driven exploration (Pathak et al., 2017; Warde-Farley et al., 2018), exploration based on a prediction model's uncertainty (Burda et al., 2018; Pathak et al., 2019), mutual information-based exploration (Eysenbach et al., 2018; Sharma et al., 2019; Zhao et al., 2021; Laskin et al., 2022), and maximizing the entropy of the visited state distribution (Yarats et al., 2021; Liu & Abbeel, 2021a;b). Unfortunately, these techniques are uninformed about the desired outcomes: the trained agent only knows how to visit frontier states as diverse as possible. In contrast, we consider a problem where the desired outcome can be specified by given desired outcome examples, allowing for more efficient outcome-directed exploration rather than naive frontier-directed exploration. Some prior methods that try to accomplish the desired outcome states utilize the provided success examples (Fu et al., 2018; Singh et al., 2019; Eysenbach et al., 2021; Li et al., 2021). However, they do not provide a mechanism for distilling the knowledge obtained from the agent's experiences into general-purpose policies that can be used to achieve new test goals. In this work, we utilize the Wasserstein distance not only for an arbitrary goal-conditioned agent but also for querying the frontier of the explored region during curriculum learning. Although the Wasserstein distance has been adopted in some previous research, it is often limited to imitation learning or skill discovery (Dadashi et al., 2020; Haldar et al., 2022; Xiao et al., 2019; Durugkar et al., 2021a; Fickinger et al., 2021).
Another work (Durugkar et al., 2021b) utilizes the Wasserstein distance with the time-step metric for training a goal-reaching agent, but it requires a stationary goal distribution for stable distance estimation. Our work differs from these prior works in that the distribution during training is non-stationary, providing calibrated guidance toward the desired outcome states. Suggesting a curriculum can also ease exploration: the agent learns on a tailored sequence of tasks, autonomously practicing intermediate tasks during training. However, prior works often require a significant number of samples to measure the curriculum's level of difficulty (Florensa et al., 2018; Sukhbaatar et al., 2017), learning progress (Portelas et al., 2020), or regret (Jiang et al., 2021). Curricula are often generated by shifting the goal distribution toward the frontier of the explored region via maximizing a surrogate objective, such as the entropy of the goal distribution (Pong et al., 2019) or the disagreement between value function ensembles (Zhang et al., 2020), but these methods have no mechanism for converging to the desired outcome distribution. While some algorithms formulate curriculum generation as an explicit interpolation between the distribution of target tasks and a surrogate task distribution (Ren et al., 2019; Klink et al., 2022), these still depend on the Euclidean distance metric when measuring the distance between distributions, which cannot be generalized to arbitrary geometric structures of the environment. In contrast, our method not only provides calibrated guidance to the desired outcome distribution in a sample-efficient way but also shows geometry-agnostic curriculum progress by leveraging a bipartite matching problem with an uncertainty & temporal distance-based objective.

3. PRELIMINARY

We consider the Markov decision process (MDP) $\mathcal{M} = (S, A, G, T, \rho_0, \gamma)$, where $S$ denotes the state space, $A$ the action space, $G$ the goal space, $T(s'|s,a)$ the transition dynamics, $\rho_0$ the initial state distribution, $\rho^\pi$ the state visitation distribution when the agent follows the policy $\pi$, and $\gamma$ the discount factor. The MDP in our framework is provided without a reward function, and we consider an environment in which only the desired outcome examples $\{g_k^+\}_{k=1}^{K}$ are given, assuming that $g_k^+$ is obtained from the desired outcome distribution $G^+$. Therefore, our work employs a trainable intrinsic reward function $r : S \times G \times A \to \mathbb{R}$, which is detailed in Section 4.1. Also, we represent the set of curriculum goals as $\{g_k^c\}_{k=1}^{N}$ and assume that these are sampled from the curriculum goal distribution $G_c$ obtained by our algorithm.

3.1. WASSERSTEIN DISTANCE OVER THE TIME-STEP METRIC

The Wasserstein distance represents how much "work" is required to transport one distribution to another following the optimal transport plan (Villani, 2009; Durugkar et al., 2021b). In this section, we describe the Wasserstein distance over the time-step metric and how it can be obtained with a potential function $f$. Consider a metric space $(\mathcal{X}, d)$, where $\mathcal{X}$ is a set and $d$ is a metric on $\mathcal{X}$, and two probability measures $\mu, \nu$ on $\mathcal{X}$. The Wasserstein-$p$ distance for a given metric $d$ is defined as

$$W_p(\mu, \nu) := \inf_{\gamma \in \Pi(\mu,\nu)} \mathbb{E}_{(X,Y)\sim\gamma}[d(X,Y)^p]^{1/p} \;\overset{p=1}{=}\; \sup_{\|f\|_L \le 1}\big[\mathbb{E}_{y\sim\nu}[f(y)] - \mathbb{E}_{x\sim\mu}[f(x)]\big] \quad (1)$$

where the joint distribution $\gamma$ denotes a transport plan, $\Pi(\mu,\nu)$ denotes the set of all possible joint distributions $\gamma$, and the second equality holds by the Kantorovich-Rubinstein duality with 1-Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$ (Villani, 2009; Arjovsky et al., 2017). If we define the distance metric as $d^\pi(s, s_g)$, a time-step metric (quasimetric) based on the number of transition steps experienced before reaching the goal $s_g \in G$ for the first time when executing the goal-conditioned policy $\pi(a|s, s_g)$, we can design temporal distance-aware RL by minimizing the Wasserstein distance $W_1(\rho^\pi, G)$, which gives an estimate of the work needed to transport the state visitation distribution $\rho^\pi$ to the goal distribution $G$:

$$W_1(\rho^\pi, G) = \sup_{\|f\|_L \le 1}\big[\mathbb{E}_{s_g\sim G}[f(s_g)] - \mathbb{E}_{s\sim\rho^\pi}[f(s)]\big] \quad (2)$$

Then, the potential function $f$ is approximately increasing along the optimal goal-reaching trajectory. That is, if $\rho^\pi(s)$ consists of the states optimally reaching toward the goal $s_g$, $f(s)$ increases along the trajectory and $f(s_g)$ has the maximum value (Durugkar et al., 2021b; a).
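For intuition about Eq (1), the infimum has a closed form for one-dimensional empirical distributions of equal size: the monotone (sort-and-pair) coupling is optimal for the $W_1$ cost. The toy sketch below is purely illustrative and not part of the paper's method:

```python
def w1_empirical_1d(u, v):
    """W1 between two equal-size empirical distributions on the real line.

    For 1-D distributions the optimal transport plan is monotone, so W1
    reduces to the mean absolute difference of the sorted samples.
    """
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# mu puts all mass at 0; nu splits its mass between 2 and 4.
# The optimal plan moves one unit of mass a distance of 2 and the other
# a distance of 4, so W1 = (2 + 4) / 2 = 3.
print(w1_empirical_1d([0.0, 0.0], [2.0, 4.0]))  # 3.0
```

For the time-step metric used in this paper no such closed form exists, which is why the dual form with a learned potential $f$ is used instead.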
Adopting these prior works, the 1-Lipschitz property of the potential function $f$ with respect to $d^\pi(s, s_g)$ can be ensured by enforcing that the difference in values of $f$ on the expected transition from every state is bounded by 1 (detailed derivations are in Appendix B):

$$\sup_{s\in S}\big\{\mathbb{E}_{a\sim\pi(\cdot|s,s_g),\, s'\sim T(\cdot|s,a)}[|f(s') - f(s)|]\big\} \le 1 \quad (3)$$

3.2. CONDITIONAL NORMALIZED MAXIMUM LIKELIHOOD (CNML)

For curriculum learning, our work utilizes conditional normalized maximum likelihood (CNML) (Rissanen & Roos; Fogel & Feder, 2018), which can perform an uncertainty-aware classification based on previously observed data by minimizing worst-case regret. Let $D = \{(s_p, e_p)\}_{p=1}^{n-1}$ be a set of data containing pairs of states $s_{1:n-1}$ and success labels $e_{1:n-1} \in \{0, 1\}$, where '1' represents the occurrence of the desired event. Given a query point $s_n$, CNML in our framework defines the distribution $p_{\mathrm{CNML}}(e_n|s_n)$, which predicts the probability that the state $s_n$ belongs to the desired outcome distribution $G^+$ ($e = 1$). To explain how CNML predicts the label of $s_n$, suppose $\Theta$ is a set of models, where each model $\theta \in \Theta$ can represent a conditional distribution of labels $p_\theta(e_{1:n}|s_{1:n})$. CNML considers the possibility that each of the possible classes (0, 1) is assigned to the query point $s_n$, and obtains the models $\hat\theta_i \in \Theta$ that best represent the augmented datasets $D \cup (s_n, e_n)$ by solving the maximum likelihood estimation (MLE) problem (LHS of Eq (4)). Then, the CNML distribution that minimizes the regret over those maximum likelihood estimators can be written as follows (Bibas et al., 2019):

$$\hat\theta_i = \arg\max_{\theta\in\Theta} \mathbb{E}_{(s,e)\in D\cup(s_n, e_n=i)}[\log p_\theta(e|s)], \qquad p_{\mathrm{CNML}}(e_n = i|s_n) = \frac{p_{\hat\theta_i}(e=i|s_n)}{\sum_{j=0}^{1} p_{\hat\theta_j}(e=j|s_n)} \quad (4)$$

If the query point $s_n$ is close to one of the data points in the dataset, CNML will have difficulty assigning a high likelihood to labels that differ significantly from those of nearby data points.
However, if $s_n$ is far from the data points in the dataset, each MLE model $\hat\theta_{i=0,1}$ will predict the label $e_n$ as its own class, which leads to a large discrepancy in the predictions between the models and yields normalized likelihoods close to uniform (RHS of Eq (4)). Thus, by minimizing the regret through labeling all possible classes to the new query data, CNML can provide a reasonable uncertainty estimate (Li et al., 2021; Zhou & Levine, 2021) on the queried $s_n$: it classifies $s_n$ as similar to the previously observed data of label 0 or 1, or as out-of-distribution data, which is predicted as 0.5. However, the classification technique via CNML described above is in most cases computationally intractable, as it requires solving separate MLE problems to convergence for every queried data point. Previous methods proposed ideas to amortize the cost of computing the CNML distribution (Zhou & Levine, 2021; Li et al., 2021). Following those prior methods, our work adopts MAML (Finn et al., 2017) to amortize this computation.
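As a concrete, non-amortized illustration of Eq (4), the sketch below refits a tiny one-dimensional logistic model once per hypothetical label of the query point and normalizes the resulting likelihoods. The model class, dataset, and optimization settings are all illustrative assumptions; the paper amortizes this procedure with meta-learning rather than solving each MLE exactly:

```python
import math

def fit_logistic(data, steps=2000, lr=0.5):
    """Approximate MLE fit of a 1-D logistic model p(e=1|s) = sigmoid(a*s + b)
    by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for s, e in data:
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (e - p) * s
            gb += (e - p)
        a += lr * ga / len(data)
        b += lr * gb / len(data)
    return a, b

def p_cnml(query, dataset):
    """Eq (4): refit the model once per hypothetical label of the query,
    then normalize the two resulting likelihoods."""
    likelihoods = []
    for label in (0, 1):
        a, b = fit_logistic(dataset + [(query, label)])
        p1 = 1.0 / (1.0 + math.exp(-(a * query + b)))
        likelihoods.append(p1 if label == 1 else 1.0 - p1)
    return likelihoods[1] / (likelihoods[0] + likelihoods[1])  # p_CNML(e=1|query)

# Replay-buffer states near 0 labeled 0, desired outcomes near 10 labeled 1.
D = [(0.0, 0), (0.5, 0), (1.0, 0), (9.0, 1), (9.5, 1), (10.0, 1)]
print(round(p_cnml(0.2, D), 2))  # low: consistent with explored (e=0) data
print(round(p_cnml(9.8, D), 2))  # high: consistent with desired outcomes
print(round(p_cnml(5.0, D), 2))  # near 0.5: between the clusters, uncertain
```

The query between the two clusters is the interesting case for curriculum learning: both augmented fits explain it well, so the normalized prediction stays near 0.5, flagging it as frontier/out-of-distribution.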

4. METHOD

For calibrated guidance of the curriculum goals toward the desired outcome distribution, we propose to progress the curriculum toward uncertain & temporally distant areas before converging to $G^+$, as this is not only the most intuitive exploration strategy but also enables the agent to progress without any prior domain knowledge about the environment, such as obstacles. In short, our work aims to obtain a distribution of curriculum goals $G_c$ that are (a) temporally distant from $\rho_0$, (b) uncertain, and (c) progressing toward the desired outcome distribution $G^+$.

4.1. TEMPORAL DISTANCE-AWARE RL WITH THE INTRINSIC REWARD

This section details the intrinsic reward for the RL agent as well as the method for training the parameterized potential function $f^\pi_\phi$ with the data collected by the policy $\pi(a|s, s_g)$. We consider a 1-Lipschitz potential function $f^\pi_\phi$ whose value increases as the state gets farther from the initial state distribution and closer to the goals $s_g \in G$ proposed by curriculum learning. Then, we can train an agent that reaches the goals $s_g$ in as few steps as possible by minimizing the Wasserstein distance $W_1(\rho^\pi, G)$. Considering that we can obtain the estimate of $W_1(\rho^\pi, G)$ by Eq (2), the loss for training the parameterized potential function $f^\pi_\phi$ can be represented as follows (Durugkar et al., 2021b; a):

$$\mathcal{L}_\phi = \mathbb{E}_{s, s_g\sim B}\big[f^\pi_\phi(s) - f^\pi_\phi(s_g)\big] + \lambda \cdot \mathbb{E}_{s, s', s_g\sim B}\big[\big(\max(|f^\pi_\phi(s) - f^\pi_\phi(s')| - 1,\ 0)\big)^2\big] \quad (5)$$

The penalty term with coefficient $\lambda$ in Eq (5) follows from Eq (3), ensuring the smoothness requirement since we consider the Wasserstein distance over the time-step metric. Then, assuming the parameter $\phi$ is trained by Eq (5) at every training iteration, we can obtain the supremum of Eq (2). Thus, the reward can be represented as $r = f^\pi_\phi(s) - f^\pi_\phi(s_g)$, which corresponds to $-W_1(\rho^\pi, G)$.
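A minimal sketch of the loss in Eq (5) on a toy batch, assuming scalar states and a hand-picked potential f (the paper trains a neural $f^\pi_\phi$ by gradient descent on this loss):

```python
def potential_loss(f, states, next_states, goals, lam=10.0):
    """Monte-Carlo estimate of Eq (5): E[f(s) - f(s_g)] plus the
    1-Lipschitz penalty on observed transitions (Eq (3))."""
    n = len(states)
    w1_term = sum(f(s) - f(g) for s, g in zip(states, goals)) / n
    penalty = sum(max(abs(f(s2) - f(s1)) - 1.0, 0.0) ** 2
                  for s1, s2 in zip(states, next_states)) / n
    return w1_term + lam * penalty

# Toy 1-D chain where f(s) = s is a valid potential: every transition moves
# exactly one step, so |f(s') - f(s)| = 1 and the penalty term vanishes.
f = lambda s: float(s)
states, next_states, goals = [0, 1, 2], [1, 2, 3], [3, 3, 3]
print(potential_loss(f, states, next_states, goals))  # -2.0
```

The loss is negative here because the batch already satisfies the Lipschitz constraint, leaving only the (negated) dual estimate of the Wasserstein distance.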

4.2. CURRICULUM LEARNING

As CNML can provide a near-uniform prior (prediction of 0.5) for out-of-distribution data given the datasets (Section 3.2), we can utilize it by treating the desired outcome states in $G^+$ as $(e = 1)$ and data points in the replay buffer $B$ as $(e = 0)$. Then, we can quantify the uncertainty of a state $s$ based on CNML as

$$\eta_{\mathrm{ucert}}(s, G^+) = 1 - |p_{\mathrm{CNML}}(e=0|s) - p_{\mathrm{CNML}}(e=1|s)| \quad (6)$$

which is proportional to the uncertainty of the queried data $s$. However, $\eta_{\mathrm{ucert}}$ alone cannot provide curriculum guidance toward the desired outcome states because it only performs an uninformed search over uncertainties rather than converging to the desired outcome states. Thus, we modify Eq (6) with an additional guidance term:

$$\tilde\eta_{\mathrm{ucert}}(G_c, G^+) = \mathbb{E}_{s\sim G_c}\big[\log\big(\eta_{\mathrm{ucert}}(s, G^+) + c \cdot \eta_{\mathrm{guidance}}(s, G^+)\big)\big] \quad (7)$$

where $\eta_{\mathrm{guidance}}(s, G^+) = (p_{\mathrm{CNML}}(e=1|s) - 0.5) \cdot \mathbb{1}(p_{\mathrm{CNML}}(e=1|s) \ge 0.5)$, and $c$ is a hyperparameter that adjusts the preference for the desired outcome states. Since CNML provides a near-uniform prior for out-of-distribution data, $\eta_{\mathrm{ucert}}$ yields large values in uncertain areas. Also, the guidance term $\eta_{\mathrm{guidance}}(s, G^+)$ reflects the preference for states considered closer to the desired outcome state distribution. However, in practice, we found that the uncertainty quantification itself sometimes has numerical errors, which make $p_{\mathrm{CNML}}$ erroneously predict the states near the initial states or the boundaries of already explored regions as uncertain areas. Therefore, the curriculum should incorporate the notion of not only uncertainty but also temporal distance from $\rho_0$ for frontier- and desired outcome-directed exploration.
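The uncertainty and guidance terms are simple functions of the scalar $p_{\mathrm{CNML}}(e=1|s)$; the small sketch below computes the per-state score inside the guidance-augmented objective (function names are ours, chosen to mirror the notation):

```python
def eta_ucert(p1):
    """Eq (6): uncertainty of a state from the CNML classifier.
    p1 = p_CNML(e=1|s); maximal (=1) at p1 = 0.5, zero when confident."""
    p0 = 1.0 - p1
    return 1.0 - abs(p0 - p1)

def eta_guidance(p1):
    """Preference for states the classifier leans to call desired outcomes."""
    return (p1 - 0.5) if p1 >= 0.5 else 0.0

def curriculum_score(p1, c=4.0):
    """Per-state term before the log/expectation in the curriculum objective."""
    return eta_ucert(p1) + c * eta_guidance(p1)

print(eta_ucert(0.5))         # 1.0: out-of-distribution, maximally uncertain
print(eta_ucert(0.95))        # ~0.1: confidently classified
print(curriculum_score(0.7))  # ~1.4: uncertainty 0.6 plus guidance 4 * 0.2
```

Note that the guidance term is one-sided: a state confidently classified as explored ($p_1 < 0.5$) receives no bonus, so only leaning toward the outcome distribution is rewarded.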
Thus, we formulate the final curriculum learning objective as follows:

$$\arg\max_{G_c}\ \tilde\eta_{\mathrm{ucert}}(G_c, G^+) + L \cdot W_1(\rho_0, G_c) \quad (8)$$

where the temporal distance bias term with coefficient $L$ is represented by the Wasserstein distance from the initial state distribution $\rho_0$ to the curriculum goal distribution $G_c$. Given a parameterized 1-Lipschitz potential function $f^\pi_\phi$ over the time-step metric $d^\pi(s, s_g)$, we can obtain the estimate of $W_1(\rho_0, G_c)$ by the RHS of Eq (1). Also, if we let $\hat{G}_c$ be a finite set of $K$ particles sampled from already achieved states in the replay buffer $B$, the objective function we aim to maximize can be represented as follows:

$$\max_{\hat{G}_c : |\hat{G}_c| = K}\ \sum_{i=1}^{K}\big[\eta_{\mathrm{ucert}}(s^i, G^+) + L \cdot [f^\pi_\phi(s^i) - f^\pi_\phi(s^i_0)]\big], \qquad s^i \in \hat{G}_c,\ s^i_0 \sim \rho_0 \quad (9)$$

This enables proposing a curriculum that reflects not only the uncertainty of the states and the preference for the desired outcomes but also the temporal distance from $\rho_0$, while not requiring prior domain knowledge about the environment, such as obstacles.
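As a sanity check of what the objective in Eq (9) prefers, the sketch below greedily keeps the K highest-scoring candidate states. This greedy relaxation is ours, purely for illustration — the paper instead solves the coupled assignment with bipartite matching (Section 4.3) — and the state space, uncertainty profile, and potential are all made up:

```python
def propose_curriculum(candidates, uncert, f_pot, f0, K=3, L=0.5):
    """Greedy relaxation of Eq (9): score each achieved state in the buffer
    by uncertainty plus temporal distance from rho_0, and keep the top K."""
    scored = [(uncert(s) + L * (f_pot(s) - f0), s) for s in candidates]
    scored.sort(reverse=True)
    return [s for _, s in scored[:K]]

# Hypothetical 1-D states: uncertainty peaks at the frontier (s near 5),
# and the potential f grows with the distance from the start at 0.
uncert = lambda s: max(0.0, 1.0 - abs(s - 5.0) / 5.0)
f_pot = lambda s: float(s)
goals = propose_curriculum([0.0, 1.0, 3.0, 5.0, 7.0, 9.0], uncert, f_pot, f0=0.0)
print(goals)  # [9.0, 7.0, 5.0]: distant-but-still-uncertain states win
```

Both terms matter: dropping the potential term would pick states by uncertainty alone, while dropping the uncertainty term would always push to the most distant states regardless of how informative they are.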

4.3. SAMPLING CURRICULUM GOAL VIA BIPARTITE MATCHING

Since we assume that desired outcome examples from $G^+$ are given rather than the distribution itself, we approximate it by the sampled set $\hat{G}^+$ ($|\hat{G}^+| = K$). Then, to solve the curriculum learning problem of Eq (9), we must address a combinatorial setting that requires assigning $\hat{G}_c$, drawn from the entire set of curriculum goal candidates in the replay buffer $B$, to $\hat{G}^+$, which we address via bipartite matching. With the hyperparameter $c = 4$, we can rearrange Eq (9) as a minimization problem whose costs consist of a cross-entropy loss (CE) and the temporal-distance bias term $f^\pi_\phi$ (refer to Appendix B for the detailed derivation):

$$\min_{\hat{G}_c : |\hat{G}_c| = K}\ \sum_{s^i\in\hat{G}_c,\ g^i_+\in\hat{G}^+} w(s^i, g^i_+) \quad (10)$$

$$w(s^i, g^i_+) = \mathrm{CE}\big(p_{\mathrm{CNML}}(e=1|s^i);\ y = p_{\mathrm{CNML}}(e=1|g^i_+)\big) - L \cdot f^\pi_\phi(s^i) \quad (11)$$

Intuitively, before the desired outcome states in $G^+$ are discovered, the curriculum goal $s^i$ is proposed in a region of the state space considered to be uncertain and temporally distant from $\rho_0$, in order to recognize the frontier of the explored regions. It is then kept updated to converge to the desired outcome states, minimizing the discrepancy between the predicted labels of $\hat{G}^+$ and $\hat{G}_c$. We can then construct a bipartite graph $\mathcal{G}$ with edge costs $w$. Let $V_a$ and $V_b$ be the sets of nodes representing achieved states in the replay buffer and $\hat{G}^+$, respectively. We define a bipartite graph $\mathcal{G}(\{V_a, V_b\}, E)$ with edge weights $E(\cdot, \cdot) = -w(\cdot, \cdot)$ and separated partitions $V_a$ and $V_b$. To solve the bipartite matching problem, we utilize the Minimum Cost Maximum Flow algorithm.
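The paper solves this assignment with a Minimum Cost Maximum Flow solver; for tiny K the same matching can be brute-forced over permutations, which makes Eq (10)-(11) easy to inspect. The cost matrix below is made up for illustration:

```python
from itertools import permutations

def assign_curriculum(costs):
    """Solve the bipartite matching of Eq (10) by brute force (fine only
    for tiny instances; the paper uses a Minimum Cost Maximum Flow solver).
    costs[i][j] = w(candidate_i, outcome_j) from Eq (11)."""
    k = len(costs[0])
    best, best_cost = None, float("inf")
    for cand in permutations(range(len(costs)), k):
        cost = sum(costs[i][j] for j, i in enumerate(cand))
        if cost < best_cost:
            best, best_cost = cand, cost
    return list(best), best_cost

# 4 candidate states from the buffer, K = 2 desired outcome examples.
w = [[4.0, 9.0],
     [2.0, 8.0],
     [6.0, 1.0],
     [5.0, 7.0]]
chosen, total = assign_curriculum(w)
print(chosen, total)  # [1, 2] 3.0: candidate 1 -> outcome 0, candidate 2 -> outcome 1
```

The matching picks one distinct candidate per outcome example, so the K curriculum goals jointly cover all of $\hat{G}^+$ rather than collapsing onto a single cheap candidate.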

5. EXPERIMENTS

We include six environments to validate our proposed method. Firstly, various maze environments (Point U-Maze, N-Maze, Spiral-Maze) are used to validate the geometry-agnostic curriculum generation capability. We also experiment with the Ant-Locomotion and Sawyer-Peg Push, Pick&Place environments to evaluate our method under more complex dynamics and in domains other than navigation. We compare with other previous curriculum or goal generation methods, whose properties are summarized in Table 1.

5.1. EXPERIMENTAL RESULTS

Firstly, to show how each module in our method is trained, we visualize the uncertainty quantification by CNML and the values of $f^\pi_\phi(s)$, which are proportional to the number of timesteps required to reach $s$ from $\rho_0$. The uncertainty quantification results (Figure 2) show that the classifier $p_{\mathrm{CNML}}(\cdot|s)$ successfully discriminates queried states as belonging to the already explored region, to the desired outcome states, or otherwise to uncertain states. Due to the geometry-agnostic property of the classifier $p_{\mathrm{CNML}}(\cdot|s)$, we can propose curricula in environments with arbitrary geometric structure, which most previous curriculum generation methods do not consider. We also visualize the values of the trained potential function $f^\pi_\phi(s)$ to show how the intrinsic reward is shaped (Figure 2). As the potential function $f^\pi_\phi(s)$ is trained to have high values near the desired outcome states, due to the Wasserstein distance with the time-step metric, the results show a gradual increase of $f^\pi_\phi(s)$ along the trajectory toward the desired outcome states. That is, high values of $f^\pi_\phi(s)$ indicate fewer timesteps required to reach the desired outcome state, and this property is advantageous for identifying the frontier of the explored region. To validate whether the curriculum goals are properly interpolated from initial states to desired outcome states by combining both curriculum learning objectives (Eq (8)), we evaluate the progress of the curriculum goals quantitatively and qualitatively. For quantitative evaluation, we compare with the previous works described above with respect to the distance from the proposed curriculum goals to $G^+$. As shown in Figure 4, our method is the only one that consistently interpolates from initial states to $G^+$ as training proceeds, while the others struggle with the complex dynamics or geometry of the environments.
For qualitative evaluation, we visualize the curriculum goals proposed by our method and by the baselines that show somewhat comparable results as training proceeds (Figure 3). The results show that our method consistently proposes proper curricula based on the required timesteps and uncertainties, regardless of the geometry and dynamics of the various environments, while the baselines struggle because they use the Euclidean distance metric to interpolate the curriculum distribution toward $G^+$. We also evaluate the desired outcome-directed RL performance. As shown in Figure 5, our method learns to solve these uninformed exploration problems very quickly through its calibrated curriculum goal proposals.

5.2. ABLATION STUDY

Types of curriculum learning cost. We first evaluate the importance of each curriculum learning objective in Eq (8). Specifically, we experiment with only the uncertainty-related objective (only-cnml) and only the timestep-related objective (only-f) during curriculum learning. As shown in Figure 6a, both objectives play complementary roles, which supports the need for both. Without one of them, the agent has difficulty progressing the curriculum goals toward the desired outcome states due to local optima of $f^\pi_\phi$ or numerical errors of $p_{\mathrm{CNML}}$; more qualitative/quantitative results and analysis of this and the other ablation studies are included in Appendix C. Reward type & goal proposal method. Secondly, we replace the intrinsic reward with the sparse reward typically used in goal-conditioned RL problems, to validate the effect of the timestep-proportionally shaped reward. Also, to compare curriculum proposal methods, we replace the bipartite matching formulation with a GAN-based generative model, similar to Florensa et al. (2018), but we label highly uncertain states as positive labels instead of success rates. As shown in Figure 6b, the timestep-proportionally shaped reward gives consistently better results due to its more informative signal compared to the sparse one, and the generative model has difficulty sampling proper curriculum goals because the GAN becomes unstable under drastic changes of the positive labels, while our method is relatively insensitive because curriculum candidates are obtained from experienced states in the buffer $B$ rather than from a generative model. Effect of c in $\eta_{\mathrm{guidance}}$. Lastly, we experiment with different values of the hyperparameter $c$ to validate the effect of $\eta_{\mathrm{guidance}}$ on curriculum learning.
When $c$ is smaller than the default value of 4, the agent can still explore most of the feasible state space, except the area near the desired outcome states, thanks to the uncertainty & temporal distance-aware curriculum (Figure 6c). But we verify that the effect of $\eta_{\mathrm{guidance}}$ becomes smaller as $c$ decreases, and that $\eta_{\mathrm{guidance}}$ helps guide the curriculum goals to the desired outcome states precisely. This is consistent with our analysis that uncertainty & temporal distance by themselves can propose curriculum goals at the frontier of the explored region, while $\eta_{\mathrm{guidance}}$ further accelerates the guidance toward the desired outcome states.

6. CONCLUSIONS

In this work, we consider outcome-directed curriculum RL, where the agent should progress toward the desired outcome automatically, without a reward function or prior knowledge of the environment. We propose OUTPACE, which performs uncertainty & temporal distance-aware curriculum RL with an intrinsic reward, based on a CNML classifier and the Wasserstein distance with a time-step metric. We show that our method outperforms previous methods in sample efficiency and curriculum progress, both quantitatively and qualitatively. Even though our method shows promising results, it has computational-complexity issues due to the innate properties of the meta-learning inference procedure itself. Thus, it would be interesting future work to find a way to reduce the inference time and thereby the training wall-clock time.

A. TRAINING & EXPERIMENTS DETAILS

A.1. TRAINING DETAILS

Baselines. The baseline curriculum RL algorithms are trained as follows:
• HGG (Ren et al., 2019): We follow the default settings of the original implementation from https://github.com/Stilwell-Git/Hindsight-Goal-Generation.
• CURROT (Klink et al., 2022): We follow the default settings of the original implementation from https://github.com/psclklnk/currot.
• GoalGAN (Florensa et al., 2018), PLR (Jiang et al., 2021), VDS (Zhang et al., 2020), ALP-GMM (Portelas et al., 2020): We follow the default settings of the implementation from https://github.com/psclklnk/currot.

All the baselines are trained with SAC (Haarnoja et al., 2018) with the sparse reward, except for SkewFit, which uses a reward based on the conditional entropy. Even though some algorithms' original implementations are based on on-policy algorithms such as TRPO or PPO (Schulman et al., 2015; 2017), for comparing sample efficiency we replace the on-policy algorithm with the off-policy algorithm SAC, following the referred implementation.

We include a conceptual comparison between our work and previous curriculum or goal generation methods in Table 1:

| Method | Uncert.-Aware | Timestep-Aware | Target curriculum dist. | Off-policy | Curriculum proposal | Without ext. reward | Without non-forgetting mechanism |
|---|---|---|---|---|---|---|---|
| HGG | ✗ | ✗ | G+ | ✓ | B | ✗ | ✓ |
| GoalGAN | ✗ | ✗ | ✗ | ✗ | GAN | ✗ | ✗ |
| CURROT | ✗ | ✗ | U or G+ | ✓ | U | ✗ | ✗ |
| PLR | ✗ | ✗ | ✗ | ✗ | B | ✗ | ✗ |
| VDS | ✓ | ✗ | ✗ | ✓ | B | ✗ | ✓ |
| ALP-GMM | ✗ | ✗ | ✗ | ✓ | GMM | ✗ | ✗ |
| SkewFit | ✓ | ✗ | ✗ | ✓ | VAE | ✓ | ✗ |
| OUTPACE (ours) | ✓ | ✓ | G+ | ✓ | B | ✓ | ✓ |

• Uncert.-Aware: whether the curriculum goal proposal process is aware of the uncertainty of the candidate goals.
• Timestep-Aware: whether the curriculum goal proposal process is aware of the temporal distance from the initial states or from the desired outcome states.
• Target curriculum dist.: whether there exists a mechanism for the curriculum goals to converge to the target distribution. When there is no target distribution (e.g., just exploring or expanding the curriculum goal distribution as diversely as possible), we denote it as ✗.
• Off-policy: whether off-policy RL can be applied.
Some baselines need to measure a kind of difficulty, which means they require repeated trials and on-policy RL algorithms with multi-processing, such as TRPO and PPO (Schulman et al., 2015; 2017).
• Curriculum proposal: where the curriculum goals are proposed from.
• Without ext. reward: whether the algorithm requires an external environmental reward.
• Without non-forgetting mechanism: whether the algorithm requires implicit or explicit non-forgetting mechanisms. Some baselines mix the previously practiced curriculum goals with a fixed or varying ratio, or make the curriculum distribution uniform over the state space to cover all possible test goal states.

Training details. To train the potential function $f_\phi$ via Eq (5), $s$ and $s_g$ in the buffer should ideally cover all feasible states in the environment. However, until the policy has learned enough to explore the map, obtaining such an ideal distribution is difficult. To mitigate this issue, following Durugkar et al. (2021b), we approximate such a distribution with a small replay buffer $B_f$ containing recent trajectories and the relabelling technique (Andrychowicz et al., 2017). While this approximation does not provide $f_\phi$ with the ideal state distribution covering all feasible states, we empirically found that it works well since OUTPACE only queries $f_\phi$ on the explored area when it generates curriculum goals.

• Point-U-Maze: The observation consists of the xy position, angle, velocity, and angular velocity of the 'point'. The action space consists of the velocity and angular velocity of the 'point'. The initial state of the agent is [0, 0] and the desired outcome states are obtained by adding uniform noise to the default goal point [0, 8]. The size of the map is 12 × 12.
• Point-N-Maze: Same as the Point-U-Maze environment except that the desired outcome states are obtained by adding uniform noise to the default goal point [8, 16], and the size of the map is 12 × 20.
• Point-Spiral-Maze: Same as the Point-U-Maze environment, except that the desired outcome states are obtained by adding uniform noise to the default goal point [8, -8], and the size of the map is 20 × 20.
• Ant Locomotion: The observation consists of the xyz position, xyz velocity, joint angles, and joint angular velocities of the 'ant'. The action space consists of the torques applied to the rotors of the 'ant'. The initial state of the agent is [0, 0], and the desired outcome states are obtained by adding uniform noise to the default goal point [0, 8]. The size of the map is 12 × 12.
• Sawyer-Peg-Push: The observation consists of the xyz position of the end-effector, the object, and the gripper's state. The action space consists of the xyz position of the end-effector and gripper open/close control. The initial state of the object is [0.4, 0.8, 0.02], and the desired outcome states are obtained by adding uniform noise to the default goal point [-0.3, 0.4, 0.02]. We referred to the metaworld (Yu et al., 2020) and EARL (Sharma et al., 2021) environments.

Algorithm 1 (excerpt):

3:  Ĝ_c ← sample K curriculum goals s_i ∈ B that minimize CE(p_CNML(e=1|s_i); y = p_CNML(e=1|Ĝ_+)) − L · f^π_ϕ(s_i)  (Section 4.3)
4:  for i = 1, 2, ..., K do
5:      Env.reset()
6:      g ← Ĝ_c.pop()
7:      for t = 0, 1, ..., H−1 do
8:          if g achieved then
9:              g ← random goal (randomly sample a state with high uncertainty measured by p_CNML in a ball B_r(s_t))
10:         end if
11:         a_t ← π(·|s_t, g)
12:         s_{t+1} ← Env.step(a_t)
13:     end for
14:     B ← B ∪ {s_0, a_0, s_1, ...},  B_f ← B_f ∪ {s_0, a_0, s_1, ...}
15: end for
16: for i = 0, 1, ..., M do
17:     Sample a minibatch b from B and label rewards using f^π_ϕ(s_t)  (Section 4.1)
18:     Train π and Q with b via SAC (Haarnoja et al., 2018).

Time-step metric. Given a state space S, action space A, transition dynamics T : S × A → S, and agent policy π, the time-step metric d^π is a quasi-metric in which the distance from s ∈ S to s_g ∈ S is the expected number of transitions required under the policy π. Writing T^π(·|s, s_g) for the probability distribution of the number of timesteps required to go from s to s_g, d^π can be written recursively as

$$d^\pi(s, s_g) := \mathbb{E}_{\pi,\,\tau \sim T^\pi(\cdot|s, s_g)}[\tau] = \begin{cases} 0 & \text{if } s = s_g, \\ 1 + \mathbb{E}_{a \sim \pi(\cdot|s, s_g)}\,\mathbb{E}_{s' \sim T(\cdot|s, a)}\big[d^\pi(s', s_g)\big] & \text{otherwise.} \end{cases} \tag{12}$$

Lipschitz smoothness of f. If the difference in values of f on the expected transition from every state is bounded by 1, and the policy π can reach the goal s_g within a finite number of transitions, then f is 1-Lipschitz with respect to d^π.

Proof. We can write |f(s_g) − f(s_0)| via a telescoping sum,

$$|f(s_g) - f(s_0)| = \left|\mathbb{E}_{\pi,\,\tau \sim T^\pi(\cdot|s_0, s_g)}\left[\sum_{t=0}^{\tau-1} \big(f(s_{t+1}) - f(s_t)\big)\right]\right| \le \mathbb{E}_{\pi,\,\tau \sim T^\pi(\cdot|s_0, s_g)}\left[\sum_{t=0}^{\tau-1} \big|f(s_{t+1}) - f(s_t)\big|\right]. \tag{13}$$

Since E[|f(s') − f(s)|] ≤ 1 by Eq. (3), we can write

$$\mathbb{E}_{\pi,\,\tau \sim T^\pi(\cdot|s_0, s_g)}\left[\sum_{t=0}^{\tau-1} \big|f(s_{t+1}) - f(s_t)\big|\right] \le \mathbb{E}_{\pi,\,\tau \sim T^\pi(\cdot|s_0, s_g)}\left[\sum_{t=0}^{\tau-1} 1\right] = \mathbb{E}_{\pi,\,\tau \sim T^\pi(\cdot|s_0, s_g)}[\tau] = d^\pi(s_0, s_g). \tag{14}$$

Thus, f is 1-Lipschitz with respect to the time-step metric d^π.
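As an illustrative sanity check of the bound in Eq. (14), consider a deterministic chain MDP where the policy moves one state per step, so the time-step metric is d^π(s, s_g) = |s_g − s|; any f whose per-transition change is bounded by 1 must then satisfy the 1-Lipschitz inequality. The chain and the particular f below are hypothetical, chosen only to make the bound checkable:

```python
# Toy check of the 1-Lipschitz property on a deterministic chain MDP:
# states are integers 0..N-1 and the policy moves one step toward the
# goal per transition, so d_pi(s, s_g) = |s_g - s|.

def d_pi(s, s_g):
    # one transition per unit of distance on the chain
    return abs(s_g - s)

def f(s):
    # an arbitrary potential whose per-transition change (0.5) is <= 1
    return 0.5 * s

def lipschitz_holds(s0, sg):
    # |f(s_g) - f(s_0)| <= d_pi(s_0, s_g), i.e. Eq. (14) on this chain
    return abs(f(sg) - f(s0)) <= d_pi(s0, sg)

# the bound holds for every pair of states on the chain
assert all(lipschitz_holds(s0, sg) for s0 in range(10) for sg in range(10))
```

If f instead changed by more than 1 per transition (e.g. f(s) = 2s), the check would fail, mirroring the role of the bound from Eq. (3) in the proof.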

B.2.2 DERIVATION OF EQUATION (11)

By substituting Eq. (7) into Eq. (9), and omitting f^π_ϕ(s^i_0), which does not depend on Ĝ_c, we obtain the terms in Eq. (15). If we use the default value of the hyperparameter c = 4, we can express these terms as

$$\max_{\hat{G}_c : |\hat{G}_c| = K} \sum_{i=1}^{K} \Big[\log\big(1 - \big[1 - 2p_{\text{CNML}}(e=1|s_i) + 4(p_{\text{CNML}}(e=1|s_i) - 0.5) \cdot \mathbb{1}(p_{\text{CNML}}(e=1|s_i) > 0.5)\big]\big) + L \cdot f^\pi_\phi(s_i)\Big], \quad s_i \in \hat{G}_c, \tag{16}$$

which can be simplified as

$$\min_{\hat{G}_c : |\hat{G}_c| = K} \sum_{i=1}^{K} \big[-\log p_{\text{CNML}}(e=1|s_i) - L \cdot f^\pi_\phi(s_i)\big], \quad s_i \in \hat{G}_c. \tag{17}$$

If we assume that p_CNML is trained well enough to distinguish the desired outcome examples g^i_+ ∈ Ĝ_+ from the states s in the already explored region, then p_CNML(e=1|g^i_+) ≈ 1. The terms inside the minimization objective can then be approximately represented as −p_CNML(e=1|g^i_+) log p_CNML(e=1|s_i) − L · f^π_ϕ(s_i). In practice, we implement these terms with a cross-entropy loss, w(s_i, g^i_+) := CE(p_CNML(e=1|s_i); y = p_CNML(e=1|g^i_+)) − L · f^π_ϕ(s_i), which is Eq. (11). Although this is not exactly equivalent to the mathematical definition of cross-entropy, we can implement it with the cross-entropy loss provided by standard deep learning frameworks such as PyTorch (Paszke et al., 2019), because choosing s_i to maximize log p_CNML(e=1|s_i) and choosing s_i close enough to g^i_+ to be classified as a desired outcome example by p_CNML (predicted label of s_i close to 1) carry the same intuitive meaning.
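A minimal sketch of the weight w(s_i, g^i_+) of Eq. (11), written in plain Python rather than PyTorch for self-containment; the `goal_weight` helper, its default L, and the scalar interface are illustrative, not the paper's code:

```python
import math

def binary_cross_entropy(p, y):
    """CE(p; y) for a soft target y in [0, 1], matching the usual
    binary cross-entropy semantics of deep learning frameworks."""
    eps = 1e-12  # numerical guard against log(0)
    return -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))

def goal_weight(p_cnml_s, p_cnml_goal, f_value, L=1.0):
    """w(s_i, g_i+) = CE(p_CNML(e=1|s_i); y = p_CNML(e=1|g_i+)) - L * f(s_i).

    Smaller w means s_i is both confidently classified as a desired
    outcome (low CE) and temporally distant from the initial state
    (high f), so minimizing w selects frontier curriculum goals.
    """
    return binary_cross_entropy(p_cnml_s, p_cnml_goal) - L * f_value
```

When p_CNML(e=1|g^i_+) ≈ 1, the CE term reduces to −log p_CNML(e=1|s_i), recovering the first term of Eq. (17).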

B.3 A DETAILED DESCRIPTION OF META-NML

Conditional normalized maximum likelihood (CNML) performs a conservative k-way classification based on previously seen data (Li et al., 2021; Zhou & Levine, 2021). Let D_train = {(x_i, y_i)}_{i=1}^{n−1} be a dataset of inputs x_{1:n−1} and labels y_{1:n−1} ∈ {1, ..., k}, where k is the number of possible labels, and let Θ be a set of models. Given a new input x_n, CNML defines the distribution {p_CNML(y_n = i|x_n)}_{i∈{1,...,k}} by minimizing the regret R of the worst-case label y_n,

$$p_{\text{CNML}} = \arg\min_q \max_{y_n} R(q, x_{1:n}, y_{1:n}, \Theta), \tag{20}$$

where the regret R of a distribution q for label y_n, with maximum likelihood estimator θ', is defined as R(q, x_{1:n}, y_{1:n}, Θ) := log p_{θ'}(y_{1:n}|x_{1:n}) − log q(y_{1:n}). By solving Eq. (20), CNML predicts the distribution of the new label y_n as Eq. (22) (Bibas et al., 2019):

$$p_{\text{CNML}}(y_n = m \,|\, x_n) = \frac{p_{\theta'^{(n)}_m}(y = m \,|\, x_n)}{\sum_{j=1}^{k} p_{\theta'^{(n)}_j}(y = j \,|\, x_n)}, \tag{22}$$

where θ'^{(n)}_m is a model that represents the augmented dataset D = D_train ∪ (x_n, y_n = m) well. Thus, the total number of MLE models required is n × k, since each data point must be augmented with each of the labels 1, ..., k (θ'^{(α=1:n)}_{β=1:k}). Since our algorithm uses 2-way classification (k = 2), we define 2n tasks τ^−_{i=1:n} and τ^+_{i=1:n}, constructed by augmenting negative (x_i, y = 0) and positive (x_i, y = 1) labels, respectively, for each data point x_i. To amortize the training cost over these tasks (tasks τ^{−,+}_{i=1:n}), we can apply a meta-learning algorithm (Finn et al., 2017; Li et al., 2021) and train a model θ that can quickly adapt to the optimal solution after a single gradient update with a standard classification loss L:

$$\min_\theta \; \mathbb{E}_{x_i \sim D,\, y' \sim \{0,1\}}\left[\mathcal{L}\big(D \cup (x_i, y'),\, \theta'^{(i)}_{j=y'}\big)\right], \tag{23}$$

$$\text{s.t. } \theta'^{(i)}_j = \theta - \alpha \nabla_\theta \mathcal{L}\big(D \cup (x_i, y_i),\, \theta\big), \tag{24}$$

where Eq. (23) and Eq. (24) represent the meta-learning objective and the quick adaptation, respectively.
Training CNML via meta-learning and leveraging CNML are shown in lines 3 and 19 of the algorithm overview (Algorithm 1). We also provide the pseudo-code of meta-NML in Algorithm 2.
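The per-label adaptation and normalization of Eq. (22) can be sketched for a tiny 1-D logistic model as follows. This is an illustrative sketch only: the gradient-descent inner loop stands in for meta-NML's single-step adaptation from meta-trained parameters, and `theta_meta`, `alpha`, and `steps` are hypothetical hyperparameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_grad(theta, data):
    """Gradient of the negative log-likelihood of a 1-D logistic model
    p(y=1|x) = sigmoid(theta * x) over a list of (x, y) pairs."""
    return sum((sigmoid(theta * x) - y) * x for x, y in data)

def p_cnml_positive(x_new, train_data, theta_meta=0.0, alpha=0.1, steps=50):
    """Sketch of 2-way CNML: for each candidate label m in {0, 1}, adapt the
    parameters on the dataset augmented with (x_new, m), then normalize the
    two adapted models' predictions for x_new (Eq. (22) with k = 2).
    Meta-NML would replace this inner loop with one gradient step from
    meta-learned parameters."""
    adapted_probs = []
    for m in (0, 1):
        theta = theta_meta
        augmented = train_data + [(x_new, m)]
        for _ in range(steps):  # stand-in for the fast adaptation of Eq. (24)
            theta -= alpha * nll_grad(theta, augmented)
        p1 = sigmoid(theta * x_new)
        adapted_probs.append(p1 if m == 1 else 1.0 - p1)
    # normalized maximum likelihood over the two augmented models
    return adapted_probs[1] / (adapted_probs[0] + adapted_probs[1])
```

The conservatism of CNML is visible here: far from the training data, each augmented model can fit its own proposed label, pushing the normalized prediction toward 0.5.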

C.1 FULL RESULTS OF THE MAIN TEXT

We include the full results of the main text in this section. The uncertainty quantification is visualized in Figure 9, the trained f^π_ϕ(s) in Figure 10, and the proposed curriculum goals in Figure 11. We do not include visualization results for the Point-U-Maze and Sawyer-Peg-Pick&Place environments, as they share the same maps as the Ant Locomotion and Sawyer-Peg-Push environments, respectively.

C.2 EVALUATION WITH THE GOALS SAMPLED FROM THE UNIFORM DISTRIBUTION

Some curriculum RL algorithms incorporate a mechanism for remembering previously practiced curricula, either implicitly or explicitly. For example, GoalGAN (Florensa et al., 2018) mixes previously generated goals with currently generated goals in a specified ratio (e.g., 20% previously used goals), SkewFit (Pong et al., 2019) targets a uniform goal distribution (by maximizing the entropy of the goals H(g)), CURROT (Klink et al., 2022) also uses a uniform target curriculum distribution in practice, and so do some of the other baselines. Due to these design choices, they require many iterations to explicitly practice previously used curriculum goals, or show slow curriculum progress while matching the uniform target distribution. In contrast, our method is based on an intrinsic reward shaped according to the timestep-proportional values f^π_ϕ(s), as described in the main text. Thus, our method does not need an explicit non-forgetting mechanism, because the reward is already shaped with respect to the timesteps needed to reach arbitrary goal points along the trajectory toward the desired outcome state. For this reason, our method considers neither a uniform target distribution nor explicit mixing of previously practiced curriculum goals, which makes its curriculum progress much faster and more sample-efficient. To validate this hypothesis, we evaluated with goals sampled from a uniform distribution over the feasible state space; the results are shown in Figure 12. Even though our method does not explicitly revisit previously practiced curriculum goals, it succeeds in reaching arbitrary goal points sampled from the uniform distribution.
Performance degradation is observed in the Sawyer manipulation environments because we define the uniform distribution over the area of the table in the environment, while the curriculum goals proposed by our method converge before the agent explores the entire state space on the table; the agent therefore has no opportunity to practice goals from the entire state space.

C.3 ABLATION STUDY

We conducted the ablation studies described in the main text in all environments. Figure 13 shows the average distance from the proposed curriculum goals to the desired final goal states over the course of training, and Figure 14 shows the episode success rates. Even though there is little difference in some environments with simple dynamics, the analysis is consistent with the results in the main text in most environments. We also visualize the curriculum goals obtained in each ablation as training proceeds in Figure 15. The curriculum proposal based only on uncertainty (only-cnml) progresses toward uncertain areas, but its progress is unstable, as shown by curriculum goals that are separated despite sharing the same colors, or that are out of order with respect to the intuitively optimal curriculum progression. This is because the meta-learning-based inference of p_CNML has errors that can cause states near the boundaries of the already explored regions to be wrongly predicted as uncertain (for example, in Figure 9, some states are predicted to be uncertain despite lying in the already explored region). Likewise, the curriculum proposal based only on the temporal distance obtained by f^π_ϕ (only-f) shows some progress away from the initial states, but it still struggles in most environments because f^π_ϕ is trained to reflect the observed transition data rather than the entire state space. That is, the temporal distance estimate is reliable only within the explored region.
Once the trained f^π_ϕ falls into a local optimum before discovering temporally more distant states (Figure 16), the proposed curriculum goals can get stuck in areas wrongly predicted to be farthest from the initial states in terms of temporal distance, which leads to ineffective exploration and recurrent failures.



This work was supported by AI based Flight Control Research Laboratory funded by Defense Acquisition Program Administration under Grant UD200045CD. Seungjae Lee would like to acknowledge financial support from Hyundai Motor Chung Mong-Koo Foundation.



Figure 2: Visualization of the uncertainty quantification along training progress (left) and the trained f^π_ϕ(s) (right) in the Point-N-Maze environment. In the right figure, a high reward means temporally close to the desired outcome states, and a low reward means the opposite.

We apply meta-learning (Finn et al., 2017) to address the computational intractability of CNML by training one meta-learner network that can quickly adapt to each model θ_i, rather than training each model separately. As the meta-learning-based inference (Finn et al., 2017) requires samples from G_+ and the replay buffer B, the probability should be written as p_CNML(e = i|s; G_+, B), but we use p_CNML(e = i|s) for notational simplicity in this work. More details about the meta-learning-based classification are included in Appendix B.

Figure 3: Visualization of the proposed curriculum goals. First row: Ant Locomotion, Second row: Point-N-Maze.

We use a bipartite matching algorithm to find the K edges with the minimum cost w connecting V_a and V_b (Ahuja et al., 1993; Ren et al., 2019). The overall training process is summarized in Algorithm 1 in Appendix B.
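The matching step can be sketched as follows. With the small K used for curriculum goals, brute-force enumeration over assignments suffices; for larger problems, a Hungarian-algorithm solver (e.g. scipy.optimize.linear_sum_assignment) would be the scalable choice. The function name and the list-of-lists cost-matrix format below are assumptions for illustration:

```python
from itertools import permutations

def min_cost_bipartite_matching(cost):
    """Exact minimum-cost perfect matching between two equal-size vertex
    sets V_a (rows, e.g. curriculum candidates) and V_b (columns, e.g.
    desired outcome examples), given cost[i][j] = w between them.

    Brute force over all assignments: O(K!), acceptable for small K.
    Returns (assignment, total_cost), where assignment[i] is the column
    of V_b matched to row i of V_a.
    """
    n = len(cost)
    best, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total
```

In the curriculum-goal setting, cost[i][j] would hold the weight w(s_i, g^j_+) of Eq. (11), so the matching selects candidate states that jointly minimize the total cost toward the desired outcome examples.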

• HGG (Ren et al., 2019): Minimizes the distance between the curriculum and desired outcome state distributions based on a Euclidean distance metric and a value-function bias.
• CURROT (Klink et al., 2022): Interpolates between the curriculum and desired outcome state distributions based on the agent's current performance via optimal transport.
• GoalGAN (Florensa et al., 2018): Generates curriculum goals of intermediate difficulty by training a GAN (Goodfellow et al., 2014).
• PLR (Jiang et al., 2021): Samples increasingly difficult tasks by prioritizing task levels with high TD errors.
• ALP-GMM (Portelas et al., 2020): Fits a GMM with an absolute learning progress score approximated by the absolute reward difference.
• VDS (Zhang et al., 2020): Prioritizes goals that maximize the epistemic uncertainty of value function ensembles.
• SkewFit (Pong et al., 2019): Maximizes the entropy of the goal distribution toward uniformity over the feasible state space by skewing a distribution trained with a VAE (Kingma & Welling, 2013).
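As a concrete illustration of VDS's scoring rule, epistemic uncertainty can be proxied by the disagreement of an ensemble of value estimates. The callable-based interface below is a hypothetical stand-in for learned Q-networks, not VDS's actual code:

```python
from statistics import pstdev

def vds_goal_priorities(goals, value_ensemble):
    """Score each candidate goal by the standard deviation of an ensemble
    of value estimates -- high disagreement indicates high epistemic
    uncertainty, i.e. a goal worth practicing. `value_ensemble` is a list
    of callables goal -> estimated value (stand-ins for Q-networks)."""
    scores = [pstdev([v(g) for v in value_ensemble]) for g in goals]
    total = sum(scores)
    if total == 0:
        # no disagreement anywhere: fall back to a uniform distribution
        return [1.0 / len(goals)] * len(goals)
    # normalize into a sampling distribution over candidate goals
    return [s / total for s in scores]
```

Goals where the ensemble members agree receive near-zero priority, while goals in poorly understood regions dominate the sampling distribution.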

Figure4: Average distance from the curriculum goals to the final goals (Lower is better). Our method's increasing tendencies at initial steps in some environments are due to the geometric structure of the environments themselves. Shading indicates a standard deviation across 5 seeds.

Figure 6: Ablation study in terms of the distance from the proposed curriculum goals to the desired final goal states (Lower is better). First row: Ant Locomotion. Second row: Sawyer Pick & Place. Shading indicates a standard deviation across 5 seeds.


Figure 7: Environments used for evaluation: (a)-(c) the agent must navigate various kinds of maze environments. (d) the quadruped ant must navigate the maze to a particular location. (e) the robot has to push or pick&place a peg to the desired location.

Figure 8: The overall diagram of OUTPACE

$$\max_{\hat{G}_c : |\hat{G}_c| = K} \sum_{i=1}^{K} \Big[\log\big(1 - \big[p_{\text{CNML}}(e=0|s_i) - p_{\text{CNML}}(e=1|s_i) + c\,(p_{\text{CNML}}(e=1|s_i) - 0.5) \cdot \mathbb{1}(p_{\text{CNML}}(e=1|s_i) > 0.5)\big]\big) + L \cdot f^\pi_\phi(s_i)\Big], \quad s_i \in \hat{G}_c. \tag{15}$$

Figure 9: Visualization of the uncertainty quantification along training progress. First row: U-Maze (Point, Ant Locomotion), Second row: N-Maze, Third row: Spiral-Maze, Fourth row: Sawyer Peg Push, Pick & Place environments.

Figure 11: Visualization of the proposed curriculum goals. First row: Ant Locomotion (same map size with U-Maze), Second row: Point-N-Maze, Third row: Point-Spiral-Maze, Fourth row: Sawyer Peg Push.

Figure 12: Episode success rates of the evaluation results with the goals uniformly sampled from the feasible state space.

(a) Cost type (b) Reward & goal proposal type (c) Effect of c in η_guidance

Figure 13: Ablation study in terms of the distance from the proposed curriculum goals to the desired final goal states. First row: Point-U-Maze. Second row: Point-N-Maze. Third row: Point-Spiral-Maze. Fourth row: Ant Locomotion. Fifth row: Sawyer Push. Sixth row: Sawyer Pick&Place. Shading indicates a standard deviation across 5 seeds.

Figure 14: Ablation study in terms of the episode success rates. First row: Point-U-Maze. Second row: Point-N-Maze. Third row: Point-Spiral-Maze. Fourth row: Ant Locomotion. Fifth row: Sawyer Push. Sixth row: Sawyer Pick&Place. Shading indicates a standard deviation across 5 seeds.

Figure 15: Ablation study in terms of curriculum goals visualization. First row: Ant Locomotion, Second row: Point-N-Maze, Third row: Point-Spiral-Maze, Fourth row: Sawyer Manipulation.

• SkewFit (Pong et al., 2019): We follow the state-based version of SkewFit. Since the original implementation (https://github.com/rail-berkeley/rlkit) provides only the image-based version, we modified it to use state-based inputs.

Conceptual comparison between our work and the previous curriculum RL algorithms.

Hyperparameters for OUTPACE

Env-specific hyperparameters for OUTPACE

