LEARNING INTRINSIC SYMBOLIC REWARDS IN REINFORCEMENT LEARNING

Abstract

Learning effective policies for sparse objectives is a key challenge in Deep Reinforcement Learning (RL). A common approach is to design task-related dense rewards to improve task learnability. While such rewards are easily interpreted, they rely on heuristics and domain expertise. Alternate approaches that train neural networks to discover dense surrogate rewards avoid heuristics, but are high-dimensional, black-box solutions offering little interpretability. In this paper, we present a method that discovers dense rewards in the form of low-dimensional symbolic trees, thus making them more tractable for analysis. The trees use simple functional operators to map an agent's observations to a scalar reward, which then supervises the policy gradient learning of a neural network policy. We test our method on continuous action spaces in Mujoco and discrete action spaces in Atari and Pygame environments. We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks. Notably, we significantly outperform a widely used, contemporary neural-network-based reward-discovery algorithm in all environments considered.
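To make the notion of a symbolic reward tree concrete, here is a minimal illustrative sketch. The specific operators, tree shape, and function names are assumptions for illustration, not the paper's discovered trees; the point is only that a small expression over simple operators maps an observation vector to a scalar reward.

```python
import math

def symbolic_reward(obs):
    """Illustrative symbolic reward tree (an assumption, not a discovered tree).

    Tree structure:  add( mul(obs[0], obs[1]), tanh(obs[2]) )
    Each node is a simple functional operator; the leaves are
    observation features; the root emits a scalar reward.
    """
    return obs[0] * obs[1] + math.tanh(obs[2])

# The scalar output can then be added to the environment's sparse
# reward to supervise standard policy-gradient learning.
r_intrinsic = symbolic_reward([1.0, 2.0, 0.0])
```

Because the tree is low-dimensional and built from named operators, it can be inspected directly, unlike a neural reward estimator.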

1. INTRODUCTION

Figure 1: LISR: agents discover latent rewards as symbolic functions and use them to train with standard Deep RL methods.

RL algorithms aim to learn a target task by maximizing the rewards provided by the underlying environment. Only in a few limited scenarios are the rewards provided by the environment dense and continuously supplied to the learning agent, e.g. a running score in Atari games (Mnih et al., 2015), or the distance between the robot arm and the object in a picking task (Lillicrap et al., 2015). In many real-world scenarios, extrinsic rewards are sparse or altogether absent. In these environments, it is a common approach to hand-engineer a dense reward and combine it with the sparse objective to construct a surrogate reward. While the additional density leads to faster convergence of a policy, creating a surrogate reward fundamentally changes the underlying Markov Decision Process (MDP) formulation central to many Deep RL solutions. Thus, the learned policy may differ significantly from the optimal policy (Rajeswaran et al., 2017; Ng et al., 1999). Moreover, the achieved task performance depends on the heuristics used to construct the dense reward, and on the specific function used to mix the sparse and dense rewards. Recent works (Pathak et al., 2017; Zheng et al., 2018; Du et al., 2019) have also explored training a neural network to generate dense local rewards automatically from data. While these approaches have sometimes outperformed Deep RL algorithms that rely on hand-designed dense rewards, they have only been tested in a limited number of settings. Further, the resulting reward function estimators
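The contrast between ad-hoc reward mixing and shaping that preserves optimality can be sketched as follows. The mixing coefficient and function names are assumptions for illustration; the second function is the standard potential-based shaping form from Ng et al. (1999), which is the one construction known to leave the optimal policy unchanged.

```python
def surrogate_reward(r_sparse, r_dense, lam=0.1):
    # Hand-engineered mixing (lam is a heuristic hyperparameter, an
    # assumption here). This generally changes the underlying MDP, so
    # the policy it induces may differ from the original optimum.
    return r_sparse + lam * r_dense

def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999):
    #   r' = r + gamma * Phi(s') - Phi(s)
    # Unlike arbitrary mixing above, this provably preserves the
    # optimal policy of the original MDP.
    return r + gamma * phi_s_next - phi_s
```

Both take a heuristic as input (the mixing weight `lam`, or the potential function `Phi`), which is exactly the dependence on domain expertise that automatic reward discovery aims to remove.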

