HINDSIGHT CURRICULUM GENERATION BASED MULTI-GOAL EXPERIENCE REPLAY

Anonymous authors
Paper under double-blind review

Abstract

In multi-goal tasks with sparse rewards, it is challenging to learn from the large number of experiences that yield zero reward. Hindsight experience replay (HER), which replays past experiences with additional heuristic goals, has shown that off-policy reinforcement learning (RL) can make use of failed experiences. However, the replayed experiences may not correspond to well-explored state-action pairs, especially for a pseudo goal, which instead results in a poor estimate of the value function. To tackle this problem, we propose to resample hindsight experiences based on their likelihood under the current policy and on the overall goal distribution. Building on this hindsight strategy, we introduce a novel multi-goal experience replay method that automatically generates a training curriculum, named Hindsight Curriculum Generation (HCG). As the range of experiences expands, the generated curriculum strikes a dynamic balance between exploitation and exploration. We implement HCG with the vanilla Deep Deterministic Policy Gradient (DDPG), and experiments on several tasks with sparse binary rewards demonstrate that HCG improves the sample efficiency of state-of-the-art methods.

1. INTRODUCTION

Multi-goal tasks with sparse rewards present a major challenge for training a reliable RL agent. In multi-goal tasks (Plappert et al., 2018), an agent learns to achieve multiple different goals and receives no positive feedback until it reaches the position defined by the desired goal. Such sparse rewards make it difficult to reuse past experiences, because positive feedback to reinforce the policy is rare. It is impractical to carefully engineer a shaped reward function (Ng et al., 1999; Popov et al., 2017) for each task or to assign a set of general auxiliary tasks (Riedmiller et al., 2018), both of which rely on expert knowledge. To enrich the positive feedback, Andrychowicz et al. (2017) proposed HER, a novel multi-goal experience replay strategy that enables an agent to learn from unshaped rewards, e.g. a binary signal indicating successful task completion. Specifically, HER replaces the desired goals of sampled experiences with the achieved ones and then recalculates their rewards. By relabeling experiences with pseudo goals, it expands the set of experiences without further exploration and is likely to turn failed experiences into successful ones with positive feedback. Experience relabeling makes better use of failed experiences. However, not every achieved goal leads the original experience to a reliable state-action visitation under the current policy. (For simplicity, we refer to state-goal pairs as augmented states.) The value function of a policy at a specific state-action pair can generalize to similar pairs; conversely, the value estimate may deteriorate without sufficient visitation near that state-action pair. HER replays nearly uniform samples of past experiences, while the goals available for exploration are finite. When performing a value update, the current policy cannot give a credible estimate of the universal value function (Schaul et al., 2015) if the state-action pair is not well explored.
In other words, the current policy may have difficulty generalizing to that state-action pair. Recent works improve HER by evaluating the past experiences from which pseudo goals are sampled. HER with Energy-Based Prioritization (EBP) (Zhao & Tresp, 2018) defines a trajectory energy function as the sum of the transition energies of the target object over the trajectory. Curriculum-guided HER (CHER) (Fang et al., 2019b) adaptively selects failed experiences according to their proximity to the true goals and the curiosity of exploration over diverse pseudo goals, with a gradually changing proportion between the two. Although these variants select valuable goals for replay, a challenge remains: the agent will not further explore most of the pseudo goals, so directly generalizing over pseudo goals is risky. Humans show great capability in abstracting and generalizing knowledge, but it takes a large number of experiences to learn to represent similar states with similar features. Fortunately, states with different achieved and desired goals may be intrinsically similar if the states themselves and the distances between their goals are similar. During an episode, this distance varies widely for a fixed desired goal, which offers potential for generalization. Therefore, we take advantage of relative goals, i.e. the distances between the achieved goals and the desired goals, to transform the data. The relative-goals strategy alleviates the challenge of generalizing without sufficient data. By explicitly discovering similar states in the replay buffer, it lets us estimate the density of state-action visitations for unexplored goals more conveniently. This density reflects the likelihood of the corresponding state-action pair, indicating whether it is well explored. Moreover, generalization over relative goals is feasible only if the explored relative goals are widely distributed (Schaul et al., 2015).
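The relative-goals transformation described above can be sketched as a small function (a minimal sketch assuming goals are Euclidean position vectors; the function name is illustrative, not from the paper's implementation):

```python
import numpy as np

def to_relative_goal(achieved_goal, desired_goal):
    """Relative goal: displacement from the achieved goal to the desired goal.

    Augmented states whose absolute goals differ but whose displacements are
    similar map to similar relative goals, which is what the relative-goals
    strategy exploits for generalization across state-goal pairs.
    """
    return np.asarray(desired_goal, dtype=float) - np.asarray(achieved_goal, dtype=float)
```

For example, an object at (1.0, 2.0, 0.5) with target (1.5, 2.0, 0.5) and an object at (0.0, 0.0, 0.4) with target (0.5, 0.0, 0.4) share the same relative goal (0.5, 0.0, 0.0), so experience from one can plausibly inform the other.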
In short, it is crucial to ensure sufficient state-action visitations and to maintain a balanced distribution over valid goals. In this paper, we propose to resample hindsight experiences with a relative-goals strategy. The main criteria for sampling experiences that support reliable generalization are 1) the likelihood of the corresponding state-action pair under the current policy, and 2) the overall distribution of the relative goals. By constantly adjusting the distribution of goals, we propose Hindsight Curriculum Generation (HCG), which generates a replay curriculum that progressively expands the range of experiences used for training. Its main advantage is that it makes efficient use of hindsight experiences while striving to ensure generalization over state-action pairs. From the perspective of curriculum learning, the generated curriculum can be seen as a sequence of weights on the training experiences, which guide learning by automatically generating suitable replay goals. Furthermore, we implement HCG with the vanilla Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) on various robotic control problems. The robot, a 7-DOF Fetch Robotics arm with a two-fingered parallel gripper or an anthropomorphic robotic hand with 24 degrees of freedom, is trained in the MuJoCo physics simulator (Todorov et al., 2012). During training, our method extracts and leverages information from hindsight experiences with various state-goal pairs. We experimentally demonstrate that our method improves the sample efficiency of vanilla HER in solving multi-goal tasks with sparse rewards. Ablation studies show that our method is robust to the major hyperparameters.
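As a concrete reference point, the vanilla HER relabeling that HCG builds on can be sketched as follows (a minimal sketch of the common "future" relabeling strategy; the transition layout and `reward_fn` interface are assumptions for illustration, not the paper's implementation):

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Augment an episode with hindsight transitions ('future' strategy).

    episode: list of (state, action, next_state, achieved_goal, desired_goal),
             where achieved_goal is the goal reached at that transition.
    reward_fn(achieved_goal, goal): recomputes the sparse reward.
    Returns the original transitions plus k relabeled copies per timestep,
    with the desired goal replaced by a goal achieved later in the episode.
    """
    relabeled = []
    T = len(episode)
    for t, (s, a, s_next, ag, g) in enumerate(episode):
        # keep the original transition with its true desired goal
        relabeled.append((s, a, s_next, g, reward_fn(ag, g)))
        for _ in range(k):
            future = random.randint(t, T - 1)   # a timestep from t onward
            pseudo_g = episode[future][3]       # its achieved goal is the pseudo goal
            relabeled.append((s, a, s_next, pseudo_g, reward_fn(ag, pseudo_g)))
    return relabeled
```

Note that a transition relabeled with its own achieved goal always receives positive reward, which is how HER turns failed episodes into useful learning signal.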

2. BACKGROUND

In this section we briefly introduce the multi-goal RL framework, universal value function approximators, and the hindsight experience replay strategy used in this paper.

2.1. MULTI-GOAL RL

Consider an infinite-horizon discounted Markov decision process (MDP) defined by the tuple (S, A, G, P, r, γ), where S is the set of states, A is the set of actions, G is the set of goals, P : S × A × S → R is the transition probability distribution, r : S × A × G → R is the reward function, and γ ∈ (0, 1) is the discount factor. In multi-goal RL, an agent interacts with its discounted MDP environment in a sequence of episodes. At the beginning of each episode, the agent receives a goal g ∈ G. In this paper we assume that each g ∈ G corresponds to a goal state s_g ∈ S. Moreover, we assume that given a state s we can easily find a goal g that is satisfied in this state. At each timestep t, the agent



Figure 1: Tasks in the open-ended environment with sparse binary rewards. They include Pushing, Sliding and Pick-and-Place with a Fetch robotic arm as well as different in-hand object manipulations with a Shadow Dexterous Hand. The agent obtains a reward of 1 if it achieves the desired goal within some task-specific tolerance and 0 otherwise.
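The sparse binary reward described in the caption can be written as a small function of the achieved and desired goals (a sketch; the goal representation and the tolerance value are illustrative, since the tolerance is task-specific):

```python
import numpy as np

def sparse_binary_reward(achieved_goal, desired_goal, tolerance=0.05):
    """Return 1 if the achieved goal lies within the task-specific tolerance
    of the desired goal, and 0 otherwise (binary success signal)."""
    distance = np.linalg.norm(np.asarray(achieved_goal, dtype=float)
                              - np.asarray(desired_goal, dtype=float))
    return 1.0 if distance < tolerance else 0.0
```

Because this reward is zero everywhere except inside a small neighborhood of the desired goal, almost all transitions in early training carry no learning signal, which is precisely the setting that motivates hindsight relabeling.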

