DEEP REINFORCEMENT LEARNING WITH CAUSALITY-BASED INTRINSIC REWARD

Abstract

Reinforcement Learning (RL) has shown great potential for sequential decision-making problems. However, most RL algorithms do not explicitly consider the relations between entities in the environment, which causes policy learning to suffer in efficiency, effectiveness, and interpretability. In this paper, we propose a novel deep reinforcement learning algorithm that first learns the causal structure of the environment and then leverages the learned causal information to assist policy learning. The proposed algorithm learns a graph that encodes the environmental structure by calculating the Average Causal Effect (ACE) between different categories of entities, and an intrinsic reward is given to encourage the agent to interact more with entities belonging to top-ranked categories, which significantly boosts policy learning. Experiments on several simulation environments demonstrate the effectiveness and improved interpretability of the proposed method.

1. INTRODUCTION

Reinforcement learning (RL) is a powerful approach to sequential decision-making problems. Combined with deep neural networks, deep reinforcement learning (DRL) has been applied in a variety of fields such as playing video games (Mnih et al., 2015; Vinyals et al., 2019; Berner et al., 2019), mastering the game of Go (Silver et al., 2016), and robotic control (Riedmiller et al., 2018). However, current DRL algorithms usually learn a black-box policy approximated by a deep neural network directly from state transitions and reward signals, without explicitly understanding the structural information of the environment. An important reason why humans are believed to learn better than DRL agents is their ability to build a model of the relations between entities in the environment and then reason over it; this ability is an important component of human cognition (Spelke & Kinzler, 2007). As learning proceeds, through interactions with the environment and observations of it, humans gradually come to understand the causal effects of their actions on entities as well as the relations between entities, and then reason over this knowledge to figure out the most important actions to take. In scenarios containing multiple entities with complicated relations, the optimal policy may be obtained only when this structured relational information is captured and exploited. However, most current DRL algorithms do not consider structured relational information explicitly. The knowledge learned by an agent is implicitly entailed in the policy or action-value function, which are usually unexplainable neural networks; therefore, whether the relations are well understood and exploited by the agent is unknown.
When the environment is highly complex, black-box policy learning suffers from low efficiency, whereas policy learning over an explicit representation of entity relations can significantly boost learning efficiency. Based on the fact that entities in an environment are often not independent but causally related, we argue that disentangling the learning task into two sequential tasks, namely relational structure learning and policy learning, and leveraging an explicit environmental structure model to facilitate the policy learning of DRL agents can be expected to boost performance. With the learned relational structure, the agent explores with a tendency to prioritize interaction with critical entities, encouraged by intrinsic rewards, so as to learn the optimal policy effectively. Taking this inspiration, we propose a deep reinforcement learning algorithm that first learns the relations between entities, then recognizes critical entity categories, and uses an intrinsic-reward-based approach to improve policy learning efficiency and explainability. The proposed algorithm learns a graph that encodes the relations between categories of entities by evaluating the causal effect of one category of entities on another. Thereafter, an intrinsic reward based on the learned graph is given to the agent to encourage it to prioritize interaction with entities belonging to important categories (the categories that are root causes in the graph). Previous works also use graphs to provide additional structured information to assist policy learning (Wang et al., 2018; Vijay et al., 2019). However, the graphs leveraged in these works are provided by humans and thus rely heavily on prior knowledge. Compared with their methods, our algorithm overcomes the deficiency that the graph cannot be generated automatically.
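To make the intrinsic-reward mechanism concrete, the following is a minimal sketch, not the paper's exact implementation. The names `category_rank`, `top_k`, and `bonus` are illustrative assumptions: `category_rank` maps each entity category to its rank in the learned causal graph (rank 0 = most root-cause-like), and interacting with a top-ranked category earns a small intrinsic bonus on top of the extrinsic reward.

```python
def shaped_reward(extrinsic_r, interacted_cat, category_rank,
                  top_k=2, bonus=0.1):
    """Return the extrinsic reward plus an intrinsic bonus when the agent
    interacted with an entity whose category is among the top-k ranked
    (root-cause) categories of the learned causal graph.

    interacted_cat is the category of the entity the agent interacted with
    during this transition, or None if there was no interaction.
    """
    if interacted_cat is not None and category_rank.get(interacted_cat, top_k) < top_k:
        return extrinsic_r + bonus
    return extrinsic_r
```

In practice the bonus would typically be scaled or annealed so that the intrinsic term guides exploration without dominating the extrinsic task reward.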
Our approach requires no prior knowledge and can be combined with existing policy-based or value-based DRL algorithms to boost their learning performance. The key contributions of this work are summarized as follows:
• We propose a novel causal RL framework that decomposes the whole task into structure learning and causal-structure-aware policy learning.
• The learned causal information is leveraged by giving a causality-based intrinsic reward to the agent, encouraging it to interact with entities belonging to categories critical for accomplishing the task.
• We design two new game tasks containing multiple entities with causal relations as benchmarks to be released to the community. The benchmarks are designed such that categories of objects are causally related. Experiments conducted on these simulation environments show that our algorithm achieves state-of-the-art performance and can facilitate the learning of DRL agents under other algorithmic frameworks.
The paper is organized as follows. Section 2 introduces deep reinforcement learning and the Average Causal Effect (ACE), which are key components of this work. Section 3 illustrates our algorithm in detail. Section 4 presents experimental results on the designed environments to demonstrate the effectiveness of our framework. Section 5 discusses related work. Finally, conclusions and future work are provided in Section 6.

2. BACKGROUND

2.1 DEEP REINFORCEMENT LEARNING

An MDP is defined by a 5-tuple (S, A, P, R, γ), where S is the state space, A is the action space, P is the transition function, R is the reward function, and γ is the discount factor (Sutton & Barto, 2018). An RL agent observes a state s_t ∈ S at time step t. It then selects an action a_t from the action space A following a policy π(a_t|s_t), which is a mapping from the state space to the action space. After taking the action, the agent receives a scalar reward r_t according to R(s_t, a_t) and transits to the next state s_{t+1} according to the state transition probability P(s_{t+1}|s_t, a_t). An RL agent aims to learn a policy that maximizes the cumulative discounted reward, formulated as R_t = Σ_{k=0}^{T} γ^k r_{t+k}, where T is the length of the episode. In learning an optimal policy, an RL agent generally approximates the state-value function V^π(s) or the action-value function Q^π(s, a). The state-value function is the expected cumulative discounted future reward from a state with actions sampled from the policy π:

V^π(s) = E_π[Σ_{k=0}^{T} γ^k r_{t+k} | S_t = s]. (1)

Deep Reinforcement Learning (DRL), which combines Deep Neural Networks (DNNs) with RL, is an effective way to deal with high-dimensional state spaces. It benefits from the representation ability of DNNs, which enables automatic feature engineering and end-to-end learning through gradient descent. Several effective algorithms have been proposed in the literature; in this paper we use A2C, a synchronous version of A3C (Mnih et al., 2016), as our base algorithm. A2C belongs to the family of actor-critic algorithms (Sutton et al., 2000). It directly optimizes the policy π_θ parameterized by θ to maximize the objective J(θ) = E_π[Σ_{k=0}^{T} γ^k r_{t+k}] by taking steps in the direction of ∇_θ J(θ). The policy gradient can be written as:

∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) A^π(s, a)], (2)
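The discounted return R_t = Σ_{k=0}^{T} γ^k r_{t+k} used above can be computed for every step of an episode with a single backward pass over the rewards; a minimal sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for every time step t of an
    episode, scanning the reward sequence from last step to first."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g       # recursion: R_t = r_t + gamma * R_{t+1}
        returns.append(g)
    return returns[::-1]
```

In an actor-critic method such as A2C, these returns (or bootstrapped n-step variants of them) minus the critic's value estimates yield the advantage A^π(s, a) that weights the log-probability gradient in Eq. (2).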

