LEARNING TO DYNAMICALLY SELECT BETWEEN REWARD SHAPING SIGNALS

Abstract

Reinforcement learning (RL) algorithms often suffer from high sample complexity. Previous research has shown that the reliance on large amounts of experience can be mitigated by providing additional feedback. Automatic reward shaping is one approach to this problem: it automatically identifies and modulates shaping reward signals that are more informative about how an agent should behave in a given situation, allowing the agent to learn and adapt faster. However, automatic reward shaping remains very challenging. To study it more effectively, we break it down into two separate sub-problems: learning shaping reward signals for an application, and learning how those signals can be adaptively used to provide a single reward feedback in the RL learning process. This paper focuses on the latter sub-problem. Unlike existing research, which tries to learn one shaping reward function from the shaping signals, the proposed method learns to dynamically select the right reward signal to apply at each state, which is considerably more flexible. We further show that an online strategy that seeks to match the learned shaping feedback with optimal value differences can lead to effective reward shaping and accelerated learning. The proposed ideas are verified through experiments in a variety of environments using different shaping reward paradigms.

1. INTRODUCTION

Although numerous successes have been reported in reinforcement learning (RL), it still suffers from several drawbacks that prevent it from performing to expectation in many real-life situations. One critical limitation is sample complexity. In order to arrive at an acceptable solution, RL requires an enormous amount of experience (i.e., data) before useful behaviors are learned. Reward shaping is one approach that seeks to address this problem, providing additional feedback in the form of shaping rewards to allow an RL agent to learn faster. Moreover, shaping rewards that follow a potential form preserve the guarantee that optimal solutions will be found despite the altered feedback (Ng et al., 1999). However, until recently, most reward shaping signals and functions have been hand-engineered. This task is notoriously difficult, as even slight incorrectness can lead to local optima that do not solve the problem at hand (Randløv & Alstrøm, 1998). Automatic reward shaping eliminates the difficulty of shaping reward signal and function design by learning the shaping reward function that in turn enables optimal learning of the policy. Automatic reward shaping is in itself an extremely difficult problem to solve. To simplify the problem, we break the idea down into two sub-problems: (1) learning shaping reward signals, and (2) learning how to exploit shaping reward signals to provide an appropriate shaping reward at each state of the learning process. This paper focuses on the latter task, i.e., (2), which we refer to as "automatic reward adaptation".

Problem Definition: Given a set of shaping reward signals φ = {φ1, . . . , φn}, learn to adapt these signals automatically to produce a single shaping reward F(s, a, s') : φ → R for each (state s, action a, next state s') tuple in the learning process.
The full reward for any transition in the RL problem is R(s, a, s') = r(s, a, s') + F(s, a, s'), where r is the original reward of the RL problem before shaping. Our proposed approach to automatic reward adaptation is to learn to dynamically select the right shaping reward signal φi from φ at each transition encountered in learning. In addition, our method learns using minimal infrastructure and with value-based gradients. We avoid the use of models and additional approximate value functions (which previous approaches to automatic reward shaping typically rely on) and perform updates using already-present values as feedback, with minimal additional computation. The proposed ideas have been verified through experiments in a variety of environments using different shaping reward paradigms. The basis of our shaping signal selection approach is rooted in how humans seem to react in realistic decision-making situations. Given a set of basic reward signals, say "comfort" and "self-preservation", humans have the uncanny ability to identify when and how much one should listen to any given signal depending on the current situation. For example, in a lane-keeping task, if we were near the edge of a lane with no cars around us, we would simply listen to the "comfort" reward signal telling us to move closer to the center of the lane. However, if another car were moving into the same lane at the same time, we would instead follow the "self-preservation" reward signal dictating that we stay as reasonably far away from other cars as possible. Both signals lead to correct performance but are applicable in entirely different situations, provide different information, and, most importantly, induce different behaviors. While we recognize that explicit selection risks not using available information in other shaping reward signals, we argue that it provides a guaranteed improvement over using only the environment reward.
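The per-transition selection described above can be sketched in a few lines. The signal definitions, the selector scoring rules, and the lane-keeping state layout below are our own illustrative assumptions, not the paper's exact method:

```python
import numpy as np

# A state here is assumed to be (offset from lane center, distance to the
# nearest other car), echoing the lane-keeping example.

# Two example shaping signals phi_i(s, a, s').
def phi_comfort(s, a, s_next):
    return -abs(s_next[0])          # prefer staying near the lane center

def phi_safety(s, a, s_next):
    return abs(s_next[1])           # prefer distance from the other car

SIGNALS = [phi_comfort, phi_safety]

# Stand-in scoring rules for a learned selector: pick "comfort" when the
# other car is far, "self-preservation" when it is within 2 units.
SELECTOR_SCORES = [
    lambda s: abs(s[1]),            # comfort: favored at large distances
    lambda s: 2.0,                  # safety: fixed threshold score
]

def select_signal(s):
    # Choose the index of the signal judged most informative in state s.
    return int(np.argmax([score(s) for score in SELECTOR_SCORES]))

def shaped_reward(r, s, a, s_next):
    # Full reward R(s, a, s') = r(s, a, s') + F(s, a, s'), where F applies
    # exactly one selected signal phi_i at this transition.
    i = select_signal(s)
    return r + SIGNALS[i](s, a, s_next)
```

With no car nearby (distance 5), the selector picks "comfort" and F nudges the agent toward the lane center; with a car within 2 units, it switches to "self-preservation". In the actual method this hand-written scorer would be replaced by a learned selection rule.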
In this sense, selection does not hinder the RL agent's learning even though it is incomplete relative to optimal reward shaping, which is extremely difficult to design. This work parallels the area of multi-objective RL (MORL), which investigates how to perform RL in the presence of multiple rewards (Roijers et al., 2013). However, there are a couple of differences between our idea and previous work in automatic reward shaping and MORL. First, much research in automatic reward shaping focuses on the first sub-problem (1) described above, i.e., learning the parameters of some shaping reward signal (usually only one). However, realistic problems often involve a number of (possibly conflicting) goals or signals that cannot be trivially summarized as a single signal or function. Our work focuses on the second sub-problem (2) and investigates whether there are better ways to access provided shaping reward signals for more effective reward shaping. Second, while MORL also handles environments with multiple feedback signals, it aims to solve a different problem overall. MORL attempts to learn solutions that best optimize the multiple rewards presented to it, whereas our idea attempts to best exploit multiple rewards in a way that solves the single objective present in the problem. Furthermore, MORL typically learns a linear combination of signals as the final reward shaping function. While effective in many cases, a learned fixed combination does not always suit each individual state in learning, especially if the environment is dynamic. That being said, we do not discount the potential of combination in automatic reward adaptation.¹ Our goal is simply to consider another approach with promising flexibility.

2. RELATED WORK

The fundamental theory behind reward shaping and its optimality-preserving guarantees was documented in Ng et al. (1999). Since then, the guarantees of potential-based reward shaping have been extended to time-varying potentials by Harutyunyan et al. (2015), which more naturally reflects realistic problems and the concept of automatic reward shaping. One of the initial works in this area demonstrated impressive results, but its shaping reward function was heavily engineered (Laud & DeJong, 2002). More recent research has made progress towards greater autonomy. The mechanisms driving automatic reward shaping typically fall into a few categories. One of the earliest uses abstractions to learn values on a simpler version of the problem before using these values as shaping rewards (Marthi, 2007). Other projects have followed up on this idea using both model-free and model-based RL methods (Grześ & Kudenko, 2010), as well as with estimated models of the reward function itself (Marom & Rosman, 2018). One particular work directly bootstraps the model-based learning in R-max into a shaping reward function (Asmuth et al., 2008). Yet another builds a graph of the environment and then uses subgoals as a means of defining shaping rewards that adjust as the graph is updated (Marashi et al., 2012). Credit assignment is another form of automatic reward shaping: it injects information into previous experiences so that learning is enhanced when replay occurs. Song & Jin (2011) implemented this by identifying critical states and using these landmarks as sub-rewards to make learning easier. De Villiers &
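The potential-based guarantee from Ng et al. (1999), which underlies much of the work surveyed above, can be sketched concretely. The grid-world state layout and the Manhattan-distance potential below are our own toy assumptions, not from any of the cited works:

```python
# Potential-based shaping (Ng et al., 1999): for any potential function Phi
# over states, F(s, s') = gamma * Phi(s') - Phi(s) leaves optimal policies
# unchanged when added to the environment reward.

GAMMA = 0.99
GOAL = (4, 4)  # hypothetical goal cell in a small grid world

def potential(state):
    # Higher potential for states closer to the goal (negative Manhattan
    # distance); this is an illustrative choice of Phi.
    return -(abs(GOAL[0] - state[0]) + abs(GOAL[1] - state[1]))

def potential_shaping(s, s_next, gamma=GAMMA):
    # F(s, s') = gamma * Phi(s') - Phi(s)
    return gamma * potential(s_next) - potential(s)
```

Stepping toward the goal yields positive shaping and stepping away yields negative shaping; along any trajectory the terms telescope, so discounted returns shift only by a constant depending on the start state's potential, which is why the optimal policy is preserved.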



¹ We will study flexible combination of individual shaping signals in our future work.

