LEARNING TO DYNAMICALLY SELECT BETWEEN REWARD SHAPING SIGNALS

Abstract

Reinforcement learning (RL) algorithms often suffer from high sample complexity. Previous research has shown that the reliance on large amounts of experience can be mitigated through additional feedback. Automatic reward shaping is one approach to this problem: it automatically identifies and modulates shaping reward signals that are more informative about how agents should behave in a given scenario, allowing them to learn and adapt faster. However, automatic reward shaping remains very challenging. To better study it, we break it down into two separate sub-problems: learning shaping reward signals for an application, and learning how those signals can be adaptively combined into a single reward feedback during RL. This paper focuses on the latter sub-problem. Unlike existing research, which tries to learn one shaping reward function from shaping signals, the proposed method learns to dynamically select the right reward signal to apply at each state, which is considerably more flexible. We further show that an online strategy that seeks to match the learned shaping feedback with optimal value differences can lead to effective reward shaping and accelerated learning. The proposed ideas are verified through experiments in a variety of environments using different shaping reward paradigms.

1. INTRODUCTION

Although numerous successes have been reported in reinforcement learning (RL), it still suffers from several drawbacks that prevent it from performing to expectation in many real-life situations. One critical limitation is sample complexity: to arrive at an acceptable solution, RL requires an enormous amount of experience (i.e., data) before useful behaviors are learned. Reward shaping is one approach that seeks to address this problem, providing additional feedback in the form of shaping rewards to allow an RL agent to learn faster. Moreover, shaping rewards that follow a potential-based form preserve the guarantee that optimal solutions will be found despite the altered feedback (Ng et al., 1999). However, until recently, most reward shaping signals and functions have been hand-engineered. This task is notoriously difficult, as even slight incorrectness can lead to local optima that do not solve the problem at hand (Randløv & Alstrøm, 1998). Automatic reward shaping eliminates the difficulty of designing shaping reward signals and functions by learning a shaping reward function that in turn enables optimal learning of the policy. Automatic reward shaping in itself is an extremely difficult problem to solve. To simplify it, we break the idea down into two sub-problems: (1) learning shaping reward signals, and (2) learning how to exploit shaping reward signals to provide an appropriate shaping reward at each state of the learning process. This paper focuses on the latter task, i.e., (2), which we refer to as "automatic reward adaptation".

Problem Definition: Given a set of shaping reward signals φ = {φ_1, . . . , φ_n}, learn to adapt these signals automatically to produce a single shaping reward F(s, a, s') : φ → R for each (state s, action a, next state s') tuple in the learning process.
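The potential-based form referenced above (Ng et al., 1999) defines the shaping reward as the discounted change in a potential function over states, F(s, a, s') = γΦ(s') − Φ(s), which preserves the optimal policy. A minimal sketch, using a hypothetical distance-to-goal potential on a 1-D chain (our own illustrative choice, not from the paper):

```python
def potential_based_shaping(phi, gamma):
    """Build F(s, a, s') = gamma * phi(s') - phi(s) from a potential phi."""
    def F(s, a, s_next):
        return gamma * phi(s_next) - phi(s)
    return F

# Toy potential: negative distance to a goal state at position 10.
phi = lambda s: -abs(10 - s)
F = potential_based_shaping(phi, gamma=0.99)

# Moving toward the goal (3 -> 4) yields a positive shaping reward;
# moving away (4 -> 3) yields a negative one.
print(F(3, None, 4))
print(F(4, None, 3))
```

Because F telescopes over any trajectory, the shaped returns differ from the original returns only by a term that is independent of the policy, which is why optimality is preserved.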
The full reward for any transition in the RL problem is R(s, a, s') = r(s, a, s') + F(s, a, s'), where r is the original reward of the RL problem before shaping. Our proposed approach to automatic reward adaptation is to learn to dynamically select the right shaping reward signal φ_i from φ at each transition encountered in learning. In addition, our method
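The per-transition composition of the base reward with one selected signal can be sketched as follows. The two toy signals and the fixed selection rule are hypothetical stand-ins; in the paper's setting the selector is what gets learned online:

```python
def shaped_reward(r, s, a, s_next, signals, selector):
    """Return R(s, a, s') = r(s, a, s') + phi_i(s, a, s'),
    where i = selector(s) picks one signal from the set."""
    i = selector(s)
    return r + signals[i](s, a, s_next)

# Two toy shaping signals on a 1-D chain with the goal at position 10:
signals = [
    lambda s, a, s_next: -0.1 * abs(10 - s_next),      # distance-based signal
    lambda s, a, s_next: 1.0 if s_next > s else 0.0,   # progress-based signal
]

# A fixed rule standing in for the learned state-dependent selector:
selector = lambda s: 0 if s < 5 else 1

print(shaped_reward(0.0, 3, +1, 4, signals, selector))  # uses signal 0
print(shaped_reward(0.0, 7, +1, 8, signals, selector))  # uses signal 1
```

The key design point is that selection is a function of the state, so different shaping signals can dominate in different regions of the state space rather than being blended into one fixed function.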

