LEARNING INTRINSIC SYMBOLIC REWARDS IN REINFORCEMENT LEARNING

Abstract

Learning effective policies for sparse objectives is a key challenge in Deep Reinforcement Learning (RL). A common approach is to design task-related dense rewards to improve task learnability. While such rewards are easily interpreted, they rely on heuristics and domain expertise. Alternate approaches that train neural networks to discover dense surrogate rewards avoid heuristics, but are high-dimensional, black-box solutions offering little interpretability. In this paper, we present a method that discovers dense rewards in the form of low-dimensional symbolic trees, thus making them more tractable for analysis. The trees use simple functional operators to map an agent's observations to a scalar reward, which then supervises the policy gradient learning of a neural network policy. We test our method on continuous action spaces in Mujoco and discrete action spaces in Atari and Pygame environments. We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks. Notably, we significantly outperform a widely used, contemporary neural-network based reward-discovery algorithm in all environments considered.

1. INTRODUCTION

Figure 1: LISR: agents discover latent rewards as symbolic functions and use them to train policies with standard Deep RL methods.

RL algorithms aim to learn a target task by maximizing the rewards provided by the underlying environment. Only in a few limited scenarios are the rewards provided by the environment dense and continuously supplied to the learning agent, e.g., a running score in Atari games (Mnih et al., 2015), or the distance between the robot arm and the object in a picking task (Lillicrap et al., 2015). In many real-world scenarios, however, extrinsic rewards are sparse or altogether absent. In these environments, it is a common approach to hand-engineer a dense reward and combine it with the sparse objective to construct a surrogate reward. While the additional density leads to faster convergence of a policy, creating a surrogate reward fundamentally changes the underlying Markov Decision Process (MDP) formulation central to many Deep RL solutions. Thus, the learned policy may differ significantly from the optimal policy (Rajeswaran et al., 2017; Ng et al., 1999). Moreover, the achieved task performance depends on the heuristics used to construct the dense reward and on the specific function used to mix the sparse and dense rewards. Recent works (Pathak et al., 2017; Zheng et al., 2018; Du et al., 2019) have also explored training a neural network to generate dense local rewards automatically from data. While these approaches have sometimes outperformed Deep RL algorithms that rely on hand-designed dense rewards, they have only been tested in a limited number of settings. Further, the resulting reward function estimators are black-box neural networks with several thousand parameters, rendering them intractable to parse. A symbolic reward function lends itself better to applications such as formal verification in AI and to ensuring fairness and removing bias in the policies that are deployed.
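As a hypothetical illustration of the surrogate-reward approach described above (our own sketch, not part of LISR; the function names and the mixing coefficient beta are assumptions), a hand-engineered surrogate reward for a robot-arm picking task might mix a sparse success signal with a heuristic distance term:

```python
import numpy as np

def surrogate_reward(arm_pos, obj_pos, task_success, beta=0.5):
    """Hypothetical surrogate reward: mixes the sparse task objective
    with a hand-engineered dense shaping term (negative arm-object distance)."""
    sparse = 1.0 if task_success else 0.0  # extrinsic, sparse objective
    dense = -np.linalg.norm(np.asarray(arm_pos) - np.asarray(obj_pos))  # heuristic shaping
    return sparse + beta * dense  # the mixing function is itself a heuristic choice

# Before success, the agent only receives the (scaled) shaping term:
r = surrogate_reward([0.0, 0.0], [0.3, 0.4], task_success=False)  # -> -0.25
```

Both the shaping term and beta alter the MDP's optimal policy, which is exactly the pitfall noted above.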
In this paper, we present a method that discovers dense rewards in the form of low-dimensional symbolic trees rather than as high-dimensional neural networks. The trees use simple functional operators to map an agent's observations to a scalar reward, which then supervises the policy gradient learning of a neural network policy. We refer to our proposed method as Learned Intrinsic Symbolic Rewards (LISR). The high-level concept of LISR is shown in Figure 1. To summarize, our contributions in this paper are:

• We conceptualize intrinsic reward functions as low-dimensional, learned symbolic trees constructed entirely of arithmetic and logical operators. This makes the discovered reward functions easier to parse than neural-network-based representations.

• We deploy gradient-free symbolic regression to discover reward functions. To the best of our knowledge, symbolic regression has not previously been used to estimate optimal reward functions for deep RL.
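To make the tree representation concrete, the following is a minimal sketch (our own illustration; the operator set and tuple encoding are assumptions, not LISR's exact design) of a symbolic tree built from arithmetic and logical operators that maps an observation vector to a scalar reward:

```python
# A symbolic reward tree encoded as nested tuples:
# ('op', child, child), ('obs', index), or ('const', value).
OPS = {
    'add': lambda a, b: a + b,
    'mul': lambda a, b: a * b,
    'max': max,
    'gt':  lambda a, b: 1.0 if a > b else 0.0,  # logical operator with scalar output
}

def evaluate(tree, obs):
    """Recursively evaluate a symbolic tree on an observation vector."""
    node = tree[0]
    if node == 'obs':
        return obs[tree[1]]
    if node == 'const':
        return tree[1]
    return OPS[node](evaluate(tree[1], obs), evaluate(tree[2], obs))

# Example tree: r = max(obs[0] * 2.0, obs[1]) + (obs[2] > 0)
tree = ('add',
        ('max', ('mul', ('obs', 0), ('const', 2.0)), ('obs', 1)),
        ('gt', ('obs', 2), ('const', 0.0)))
r = evaluate(tree, [0.5, 0.3, -1.0])  # max(1.0, 0.3) + 0.0 = 1.0
```

A tree of this form has only a handful of nodes, so a human can read off exactly which observation features drive the reward, in contrast to a neural network with thousands of parameters.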

2. RELATED WORK

The LISR architecture relies on the following key elements:

• Symbolic regression on a population of symbolic trees to learn intrinsic rewards

• Off-policy RL to train neural network policies using the discovered rewards

• Evolutionary algorithms (EA) on a population of neural network policies to search for an optimal policy

Learning Intrinsic Rewards: Some prior works (Liu et al., 2014; Kulkarni et al., 2016; Dilokthanakul et al., 2019; Zheng et al., 2018) have used heuristically designed intrinsic rewards in RL settings, leading to interesting formulations such as surprise-based metrics (Huang et al., 2019). In this work, we benchmark against Pathak et al. (2017), where a Curiosity metric was successfully used to outperform A3C on relatively complex environments like VizDoom and Super Mario Bros. LISR differs from these works in that the discovered reward functions are low-dimensional symbolic trees instead of high-dimensional neural networks. Further, we are not aware of other works that, like LISR, benchmark a single intrinsic reward approach on both discrete and continuous control tasks as well as single- and multi-agent settings. Symbolic Regression is a well-known search technique in the space of symbolic functions. A few works have applied symbolic regression to estimate activation functions (Sahoo et al., 2018), value functions (Kubalík et al., 2019), and to directly learn interpretable RL policies in model-based RL (Hein et al., 2018). To the best of our knowledge, symbolic regression has not previously been used to optimize the reward function of an RL algorithm. For simplicity of design, we adopt a classic implementation where a population of symbolic trees undergoes mutation and crossover to generate new trees. Evolutionary Algorithms (EAs) are a class of gradient-free search algorithms (Fogel, 1995; Spears et al., 1993) where a population of candidate solutions undergoes mutation and crossover to discover novel solutions in every generation.
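The classic mutation-and-crossover loop over a tree population can be sketched as follows (our own simplified illustration; the subtree operators, elite count, and tree encoding are assumptions, not LISR's exact hyperparameters):

```python
import random

random.seed(0)

# Trees as nested tuples; leaves are ('obs', i) or ('const', v).
def random_leaf(n_obs):
    return ('obs', random.randrange(n_obs)) if random.random() < 0.5 \
           else ('const', random.uniform(-1.0, 1.0))

def random_tree(depth, n_obs):
    if depth == 0:
        return random_leaf(n_obs)
    op = random.choice(['add', 'mul', 'max'])
    return (op, random_tree(depth - 1, n_obs), random_tree(depth - 1, n_obs))

def mutate(tree, n_obs):
    """Replace a randomly chosen subtree with a fresh random subtree."""
    if tree[0] in ('obs', 'const') or random.random() < 0.3:
        return random_tree(1, n_obs)
    children = list(tree)
    i = random.choice([1, 2])
    children[i] = mutate(children[i], n_obs)
    return tuple(children)

def crossover(a, b):
    """Graft a subtree of parent b onto parent a (simplistic variant)."""
    if a[0] in ('obs', 'const') or b[0] in ('obs', 'const'):
        return a
    return (a[0], a[1], b[1])

def next_generation(population, fitness, n_obs, elite=2):
    """Rank trees by fitness, keep elites, refill via crossover + mutation."""
    ranked = [t for _, t in sorted(zip(fitness, population),
                                   key=lambda p: p[0], reverse=True)]
    new_pop = ranked[:elite]
    while len(new_pop) < len(population):
        p1, p2 = random.sample(ranked[:max(elite, 2)], 2)
        new_pop.append(mutate(crossover(p1, p2), n_obs))
    return new_pop
```

In LISR the fitness of each candidate tree would be derived from how well a policy trained on that tree's intrinsic reward performs on the sparse task objective.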
Selection from this population involves a ranking operation based on a fitness function. Recent works have successfully combined EA and Deep RL to accelerate learning. Evolved Policy Gradients (EPG) (Houthooft et al., 2018) utilized EA to evolve a differentiable loss function parameterized as a convolutional neural network. CERL (Khadka et al., 2019) combined policy gradients (PG) and EA to find the champion policy based on a fitness score. Our work takes motivation from both. Like EPG, we also search in the space of loss functions, albeit in the form of low-dimensional symbolic trees. Like CERL, we allow EA and PG learners to share a replay buffer to accelerate exploration. However, unlike LISR, CERL relies on access to an environment-provided dense reward function for the PG learners.
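The shared-buffer and fitness-ranking machinery described above can be sketched as follows (a minimal illustration of the CERL-style pattern; class and function names are our own, not from either paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared buffer: rollouts from both EA policies and PG learners land here,
    so the off-policy PG learners can train on all collected experience."""
    def __init__(self, capacity=10000):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

def evaluate_policy(policy, env_rollout, buffer):
    """Run one episode, storing every (s, a, r, s') transition in the shared
    buffer. The episodic return serves as the EA fitness score."""
    fitness = 0.0
    for transition in env_rollout(policy):
        buffer.add(transition)
        fitness += transition[2]  # reward is at index 2
    return fitness

def select_champion(population, fitnesses):
    """Ranking operation: return the index of the fittest policy."""
    return max(zip(fitnesses, range(len(population))), key=lambda p: p[0])[1]
```

The key design point is that exploration by the gradient-free population is never wasted: every transition feeds the replay buffer that the gradient-based learners sample from.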

