DEEP REINFORCEMENT LEARNING WITH CAUSALITY-BASED INTRINSIC REWARD

Abstract

Reinforcement learning (RL) has shown great potential for sequential decision-making problems. However, most RL algorithms do not explicitly consider the relations between entities in the environment, which hurts policy learning in terms of efficiency, effectiveness, and interpretability. In this paper, we propose a novel deep reinforcement learning algorithm that first learns the causal structure of the environment and then leverages the learned causal information to assist policy learning. The proposed algorithm encodes the environmental structure in a graph by calculating the Average Causal Effect (ACE) between different categories of entities, and provides an intrinsic reward that encourages the agent to interact more with entities belonging to top-ranked categories, which significantly boosts policy learning. Experiments on a number of simulation environments demonstrate the effectiveness and improved interpretability of our proposed method.

1. INTRODUCTION

Reinforcement learning (RL) is a powerful approach to sequential decision-making problems. Combined with deep neural networks, deep reinforcement learning (DRL) has been applied in a variety of fields such as playing video games (Mnih et al., 2015; Vinyals et al., 2019; Berner et al., 2019), mastering the game of Go (Silver et al., 2016), and robotic control (Riedmiller et al., 2018). However, current DRL algorithms usually learn a black-box policy approximated by a deep neural network directly from state transitions and reward signals, without explicitly understanding the structure of the environment. An important reason why humans are believed to learn better than DRL agents is their ability to build models of the relations between entities in the environment and then reason over them; this ability is an important component of human cognition (Spelke & Kinzler, 2007). As learning proceeds, through interaction with and observation of the environment, humans gradually come to understand the causal effects of their actions on entities as well as the relations between entities, and can reason over this knowledge to identify the most important actions to take, thereby improving efficiency. In scenarios that contain multiple entities with complicated relations, the optimal policy may be obtained only when this structured relational information is captured and exploited. However, most current DRL algorithms do not consider structured relational information explicitly. The knowledge learned by an agent is implicitly entailed in its policy or action-value function, which are usually unexplainable neural networks; whether the relations are well understood and exploited by the agent is therefore unknown.
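The causal-effect reasoning just described can be given a rough computational sketch. The snippet below illustrates the general idea of ranking entity categories by an interventional effect estimate and paying an intrinsic bonus for interacting with top-ranked categories; all function names, the toy data, and the bonus value are hypothetical, and the paper's actual ACE estimator and reward shaping may differ.

```python
import numpy as np

def average_causal_effect(outcomes_do_present, outcomes_do_absent):
    """Toy ACE of an entity category on an outcome: the difference in mean
    outcome between rollouts where the category is present, do(C=1), and
    rollouts where it is removed, do(C=0)."""
    return np.mean(outcomes_do_present) - np.mean(outcomes_do_absent)

def rank_categories(ace_scores):
    """Sort entity categories by the magnitude of their estimated effect."""
    return sorted(ace_scores, key=lambda c: abs(ace_scores[c]), reverse=True)

def intrinsic_reward(interacted_category, top_categories, bonus=0.1):
    """Bonus added to the environment reward when the agent interacts with
    a top-ranked (causally important) entity category."""
    return bonus if interacted_category in top_categories else 0.0

# Toy example: interacting with "key" strongly affects the return,
# while "rock" has almost no effect.
ace = {
    "key": average_causal_effect([1.0, 0.9, 1.0], [0.1, 0.0, 0.2]),
    "rock": average_causal_effect([0.5, 0.4], [0.5, 0.45]),
}
top = rank_categories(ace)[:1]  # -> ["key"]
```

Here the agent would receive an extra `0.1` reward for interacting with a "key" and nothing for a "rock", biasing exploration toward causally important entities.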
When the environment is highly complex, black-box policy learning suffers from low efficiency, whereas policy learning over an explicit representation of entity relations can significantly boost learning efficiency. Based on the fact that entities in an environment are often not independent but causally related, we argue that disentangling the learning task into two sequential tasks, namely relational structure learning and policy learning, and leveraging an explicit environmental structure model to facilitate the policy learning of DRL agents can be expected to boost performance. With the learned relational structure, the agent explores with a tendency to prioritize interaction with critical entities, encouraged by intrinsic rewards, so as to learn the optimal policy effectively. Taking this inspiration, we propose a deep reinforcement learning algorithm that first learns the relations between entities, then recognizes critical entity categories, and uses an intrinsic-reward-based approach to improve policy learning efficiency and explainability. The proposed algo-

