HINDSIGHT CURRICULUM GENERATION BASED MULTI-GOAL EXPERIENCE REPLAY

Anonymous authors
Paper under double-blind review

Abstract

In multi-goal tasks with sparse rewards, it is challenging to learn from the vast majority of experiences, which carry zero reward. Hindsight experience replay (HER), which replays past experiences with additional heuristic goals, has shown that off-policy reinforcement learning (RL) can make use of failed experiences. However, the replayed experiences may not correspond to well-explored state-action pairs, especially under a pseudo goal, which in turn yields a poor estimate of the value function. To tackle this problem, we propose to resample hindsight experiences according to their likelihood under the current policy and the overall experience distribution. Building on this hindsight strategy, we introduce a novel multi-goal experience replay method that automatically generates a training curriculum, namely Hindsight Curriculum Generation (HCG). As the range of experiences expands, the generated curriculum strikes a dynamic balance between exploitation and exploration. We implement HCG on top of the vanilla Deep Deterministic Policy Gradient (DDPG), and experiments on several tasks with sparse binary rewards demonstrate that HCG improves sample efficiency over the state of the art.

1. INTRODUCTION

Multi-goal tasks with sparse rewards pose a major challenge for training a reliable RL agent. In multi-goal tasks (Plappert et al., 2018), an agent learns to achieve multiple different goals and receives no positive feedback until it reaches the position defined by the desired goal. Such sparse rewards make it difficult to reuse past experiences, because the positive feedback needed to reinforce the policy is rare. It is impractical to carefully engineer a shaped reward function (Ng et al., 1999; Popov et al., 2017) for each task, or to assign a set of general auxiliary tasks (Riedmiller et al., 2018), since both rely on expert knowledge. To enrich the positive feedback, Andrychowicz et al. (2017) propose HER, a novel multi-goal experience replay strategy that enables an agent to learn from an unshaped reward, e.g., a binary signal indicating successful task completion. Specifically, HER replaces the desired goals with the achieved ones and then recomputes the rewards of sampled experiences. By relabeling experiences with pseudo goals, HER expands the set of experiences without further exploration and is likely to turn failed experiences into successful ones with positive feedback.

Experience relabeling makes better use of failed experiences. However, not all achieved goals lead the original experience to state-action visitation that is reliable under the current policy. (For simplicity, we denote state-goal pairs as augmented states.) The value function of a policy at a specific state-action pair can generalize to similar pairs; conversely, the estimate of the value function may deteriorate without sufficient visitation near that pair. HER replays roughly uniform samples of past experiences, while the set of goals actually explored is finite. When performing a value update, the current policy cannot give a credible estimate of the universal value function (Schaul et al., 2015) if the state-action pair is not well-explored; in other words, the current policy may have difficulty generalizing to that pair.

Recent works improve HER by evaluating the past experiences from which pseudo goals are sampled. HER with Energy-Based Prioritization (EBP) (Zhao & Tresp, 2018) defines a trajectory energy function as the sum of the transition energies of the target object over the trajectory. Curriculum-guided HER (CHER) (Fang et al., 2019b) adaptively selects failed experiences according to their proximity to the true goals and the curiosity of exploration over diverse pseudo goals, with a gradually changing proportion. Although these variants select valuable goals for replay, it remains a challenge that the agent will not further explore most of the pseudo goals, so it is risky to generalize directly over pseudo goals.
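To make the relabeling step concrete, the following is a minimal sketch of HER's "final"-style goal substitution under a sparse binary reward. The transition layout, the distance threshold eps, and the helper names (sparse_reward, her_relabel) are illustrative assumptions for exposition, not the implementation used in this paper.

```python
import numpy as np

def sparse_reward(achieved_goal, goal, eps=0.05):
    """Binary reward: 0 if the achieved goal lies within eps of the
    desired goal, -1 otherwise (the convention of Plappert et al., 2018)."""
    return 0.0 if np.linalg.norm(achieved_goal - goal) < eps else -1.0

def her_relabel(trajectory, eps=0.05):
    """Relabel a (possibly failed) trajectory with the goal it achieved.

    `trajectory` is a list of transitions
    (state, action, next_state, achieved_goal, desired_goal).
    Returns the original transitions plus hindsight copies whose
    desired goal is replaced by the final achieved goal.
    """
    final_goal = trajectory[-1][3]  # goal achieved at the end of the episode
    relabeled = []
    for (s, a, s_next, ag, g) in trajectory:
        # Original transition; in a failed episode its reward is almost always -1.
        relabeled.append((s, a, s_next, g, sparse_reward(ag, g, eps)))
        # Hindsight transition: pretend final_goal was the desired goal and
        # recompute the reward, so the episode now ends in success.
        relabeled.append((s, a, s_next, final_goal,
                          sparse_reward(ag, final_goal, eps)))
    return relabeled
```

Note that every hindsight transition pairs the original state and action with a pseudo goal the policy never pursued; this is precisely the source of the poorly explored augmented state-action pairs discussed above.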

