PLAN-BASED RELAXED REWARD SHAPING FOR GOAL-DIRECTED TASKS

Abstract

In high-dimensional state spaces, the usefulness of Reinforcement Learning (RL) is limited by the problem of exploration. This issue has previously been addressed using potential-based reward shaping (PB-RS). In the present work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS). FV-RS relaxes the strict optimality guarantees of PB-RS to a guarantee of preserved long-term behavior. Being less restrictive, FV-RS allows for reward shaping functions that are even better suited for improving the sample efficiency of RL algorithms. In particular, we consider settings in which the agent has access to an approximate plan. Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.

1. INTRODUCTION

Reinforcement Learning (RL) provides a general framework for autonomous agents to learn complex behavior, adapt to changing environments, and generalize to unseen tasks and environments with little human interference or engineering effort. However, RL in high-dimensional state spaces generally suffers from a difficult exploration problem, making learning prohibitively slow and sample-inefficient for many real-world tasks with sparse rewards. A possible strategy to increase the sample efficiency of RL algorithms is reward shaping (Mataric, 1994; Randløv & Alstrøm, 1998), in particular potential-based reward shaping (PB-RS) (Ng et al., 1999). Reward shaping provides a dense reward signal to the RL agent, enabling it to converge faster to the optimal policy. In robotics tasks, approximate domain knowledge is often available and can be used by a planning algorithm to generate approximate plans. The resulting plan can then be provided to the RL agent using plan-based reward shaping (Grzes & Kudenko, 2008; Brys et al., 2015). Plan-based reward shaping thus offers a natural way to combine the efficiency of planning with the flexibility of RL.

We analyze the use of plan-based reward shaping for RL. The key novelty is the introduction of Final-Volume-Preserving Reward Shaping (FV-RS), a superset of PB-RS. Intuitively speaking, FV-RS allows for shaping rewards that convey the information encoded in the plan more directly than PB-RS, since the value of following a policy is not determined solely by the shaping reward at the end of the trajectory, but can also depend on all intermediate states. While FV-RS inevitably relaxes the optimality guarantees provided by PB-RS, we show in the experiments that FV-RS can significantly improve sample efficiency beyond PB-RS, e.g. allowing RL agents to learn simulated 10-dimensional continuous robotic manipulation tasks after ca. 300 rollout episodes.
We argue that the strict notion of optimality in PB-RS is not necessary in many robotics applications, while relaxing PB-RS to FV-RS facilitates speeding up the learning process. Using FV-RS could therefore offer a better trade-off between optimality and sample efficiency in many domains. The contributions of this work are:
• We introduce FV-RS as a new class of reward shaping for RL methods in general.
• We propose to use FV-RS specifically for plan-based reward shaping.
• We show that, compared to no reward shaping and to plan-based PB-RS, plan-based FV-RS significantly increases sample efficiency in several robotic manipulation tasks.
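To make the plan-based idea concrete, the following is a minimal illustrative sketch (not the definition used in this work) of a dense, plan-based shaping reward: the agent is rewarded for progress along an approximate plan, measured here as the index of the nearest waypoint. The plan, the Euclidean distance metric, and the `scale` parameter are all assumptions made for this example.

```python
import numpy as np

def plan_progress_reward(state, plan, scale=1.0):
    """Illustrative dense reward: fraction of the plan covered by the
    nearest waypoint to the current state (0 at the start, `scale` at
    the goal). Uses Euclidean distance to find the nearest waypoint."""
    plan = np.asarray(plan, dtype=float)
    dists = np.linalg.norm(plan - np.asarray(state, dtype=float), axis=-1)
    nearest = int(np.argmin(dists))
    return scale * nearest / (len(plan) - 1)
```

Unlike a sparse goal reward, such a signal is informative everywhere along the plan, which is the kind of dense guidance that reward shaping is meant to provide.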

2. RELATED WORK
2.1. SPARSE REWARDS AND REWARD SHAPING

In many real-world RL settings, the agent is only given sparse rewards, exacerbating the exploration problem. Several approaches exist in the literature to overcome this issue. These include mechanisms of intrinsic motivation and curiosity (Barto et al., 2004; Oudeyer et al., 2007; Schembri et al., 2007), which provide the agent with additional intrinsic rewards for events that are novel, salient, or particularly useful for the learning process. In reward optimization (Sorg et al., 2010; Sequeira et al., 2011; 2014), the reward function itself is optimized to allow for efficient learning. Similarly, reward shaping (Mataric, 1994; Randløv & Alstrøm, 1998) is a technique to give the agent additional rewards in order to guide it during training. In PB-RS (Ng et al., 1999; Wiewiora, 2003; Wiewiora et al., 2003; Devlin & Kudenko, 2012), this is done in a way that ensures that the resulting optimal policy is the same with and without shaping. Ng et al. (1999) showed that the reverse statement holds as well: PB-RS is the only type of modification to the reward function that can guarantee such an invariance if no other assumptions about the Markov Decision Process (MDP) are made.

In this work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS), a class of reward shaping that is broader than PB-RS and not necessarily potential-based, and which therefore is not guaranteed to leave the optimal policy invariant. However, FV-RS still guarantees the invariance of the asymptotic state of the MDP under optimal control. In the experiments section, we show that this relaxed notion of reward shaping allows us to substantially improve the sample efficiency during training.
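The policy-invariance property of PB-RS can be sketched in a few lines. The shaping bonus has the form F(s, s') = γφ(s') − φ(s) for a potential function φ; summed along any trajectory with discounting, these bonuses telescope to γ^T φ(s_T) − φ(s_0), so the shaped return differs from the original return only by terms that depend on the first and last state, not on the actions chosen in between. The potential function and toy trajectory below are illustrative assumptions.

```python
def pbrs_bonus(phi, s, s_next, gamma):
    """Potential-based shaping bonus F(s, s') = gamma * phi(s') - phi(s),
    as in Ng et al. (1999)."""
    return gamma * phi(s_next) - phi(s)

def shaped_return(rewards, states, phi, gamma):
    """Discounted return of a trajectory under the shaped reward r + F.

    rewards: list of rewards r_0 .. r_{T-1}
    states:  list of states  s_0 .. s_T (one longer than rewards)
    """
    total = 0.0
    for t, (r, s, s_next) in enumerate(zip(rewards, states[:-1], states[1:])):
        total += gamma ** t * (r + pbrs_bonus(phi, s, s_next, gamma))
    return total
```

Because the accumulated bonus depends only on the endpoints of the trajectory, rankings between policies (and hence the optimal policy) are unchanged; FV-RS deliberately gives up this telescoping structure in exchange for shaping rewards that can reward intermediate states directly.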

2.2. DEMONSTRATION- AND PLAN-BASED REWARD SHAPING

Learning from Demonstration (LfD) aims at creating a behavioral policy from expert demonstrations. Existing approaches differ considerably in how the demonstration examples are collected and how the policy is derived from them (Argall et al., 2009; Ravichandar et al., 2020). The HAT algorithm (Taylor et al., 2011) introduces an intermediate policy summarization step, in which the demonstrated data is translated into an approximate policy that is then used to bias exploration in a final RL stage. In Hester et al. (2017), the policy is simultaneously trained on expert data and collected data, using a combination of supervised and temporal-difference losses. In Salimans & Chen (2018), the RL agent is reset at the start of each episode to a state in the single demonstration. Brys et al. (2015) translate a demonstration into a potential function for PB-RS; in Suay et al. (2016), this is extended to include multiple demonstrations that are translated into a potential function using Inverse Reinforcement Learning as an intermediate step.

In this work, we use a planned sequence in state space to construct a shaping function similar to Brys et al. (2015), but in contrast to the aforementioned work, we do not use this shaping function as a potential function for PB-RS. Instead, we use it directly as a reward function for FV-RS. We show that this significantly improves the sample efficiency during training.

