PLAN-BASED RELAXED REWARD SHAPING FOR GOAL-DIRECTED TASKS

Abstract

In high-dimensional state spaces, the usefulness of Reinforcement Learning (RL) is limited by the problem of exploration. This issue has been addressed using potential-based reward shaping (PB-RS) previously. In the present work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS). FV-RS relaxes the strict optimality guarantees of PB-RS to a guarantee of preserved long-term behavior. Being less restrictive, FV-RS allows for reward shaping functions that are even better suited for improving the sample efficiency of RL algorithms. In particular, we consider settings in which the agent has access to an approximate plan. Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.

1. INTRODUCTION

Reinforcement Learning (RL) provides a general framework for autonomous agents to learn complex behavior, adapt to changing environments, and generalize to unseen tasks and environments with little human interference or engineering effort. However, RL in high-dimensional state spaces generally suffers from a difficult exploration problem, making learning prohibitively slow and sample-inefficient for many real-world tasks with sparse rewards. A possible strategy to increase the sample efficiency of RL algorithms is reward shaping (Mataric, 1994; Randløv & Alstrøm, 1998) , in particular potential-based reward shaping (PB-RS) (Ng et al., 1999) . Reward shaping provides a dense reward signal to the RL agent, enabling it to converge faster to the optimal policy. In robotics tasks, approximate domain knowledge is often available and can be used by a planning algorithm to generate approximate plans. Here, the resulting plan can be provided to the RL agent using plan-based reward shaping (Grzes & Kudenko, 2008; Brys et al., 2015) . Thus, plan-based reward shaping offers a natural way to combine the efficiency of planning with the flexibility of RL. We analyze the use of plan-based reward shaping for RL. The key novelty is that we theoretically introduce Final-Volume-Preserving Reward Shaping (FV-RS), a superset of PB-RS. Intuitively speaking, FV-RS allows for shaping rewards that convey the information encoded in the shaping reward in a more direct way than PB-RS, since the value of following a policy is not only determined by the shaping reward at the end of the trajectory, but can also depend on all intermediate states. While FV-RS inevitably relaxes the optimality guarantees provided by PB-RS, we show in the experiments that FV-RS can significantly improve sample efficiency beyond PB-RS, e.g. allowing RL agents to learn simulated 10-dimensional continuous robotic manipulation tasks after ca. 300 rollout episodes. We argue that the strict notion of optimality in PB-RS is not necessary in many robotics applications, while on the other hand relaxing PB-RS to FV-RS facilitates speeding up the learning process. Using FV-RS could be a better trade-off between optimality and sample efficiency in many domains. The contributions of this work are: • We introduce FV-RS as a new class of reward shaping for RL methods in general.

