EXPLAINABLE REINFORCEMENT LEARNING THROUGH GOAL-BASED EXPLANATIONS

Abstract

Deep Reinforcement Learning agents achieve state-of-the-art performance in many tasks, but at the cost of becoming black boxes that are hard to interpret and understand, which hinders their adoption in trust-critical domains such as robotics or industrial settings. We introduce goal-based interpretability, in which the agent produces goals that explain its current actions (reach the current goal), while its future goals indicate its intended future behavior without having to run the environment, a useful property in environments with no simulator. Additionally, in many environments the goals can be visualised, making them easier to understand for non-experts. To obtain a goal-producing agent without requiring domain knowledge, we use 2-layer hierarchical agents in which the top layer produces goals and the bottom layer attempts to reach them. Most classical reinforcement learning algorithms cannot be used to train goal-producing hierarchical agents. We introduce a new algorithm to train these more interpretable agents, called HAC-General with Teacher, an extension of the Hindsight Actor-Critic (HAC) algorithm (Levy et al., 2019) that adds 2 key improvements: (1) goals now consist of a state s to be reached and a reward r to be collected, making it possible for the goal-producing policy to incentivize the goal-reaching policy to follow high-reward paths, and (2) an expert teacher is leveraged to improve the training of the hierarchical agent, in a process similar to, but distinct from, imitation learning and distillation. Unlike HAC, our method does not require the environment to provide the desired end state. Additionally, our experiments show that it performs better and learns faster than HAC, and can solve environments that HAC fails to solve.
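The two-level scheme described above can be illustrated with a minimal sketch: a high-level policy emits goals of the form (state to reach, reward to collect), a low-level policy acts to reach each goal, and the emitted goal sequence doubles as the agent's explanation. All names, the toy 1-D environment, and the hand-written policies below are hypothetical illustrations, not the paper's HAC-General implementation.

```python
class HierarchicalAgent:
    """Sketch of a 2-layer goal-producing agent (hypothetical interface).

    The sequence of goals it emits is the explanation of its behavior:
    the current goal accounts for its short-term actions, and the later
    goals describe its intended future behavior.
    """

    def __init__(self, high_policy, low_policy, horizon=5):
        self.high_policy = high_policy  # state -> (goal_state, reward_to_collect)
        self.low_policy = low_policy    # (state, goal) -> primitive action
        self.horizon = horizon          # max low-level steps per goal

    def act(self, env, state, n_goals=5):
        explanation = []                # goal sequence shown to the user
        done = False
        for _ in range(n_goals):
            if done:
                break
            goal = self.high_policy(state)
            explanation.append(goal)
            for _ in range(self.horizon):
                action = self.low_policy(state, goal)
                state, reward, done, _ = env.step(action)
                if done or self._reached(state, goal):
                    break
        return state, explanation

    @staticmethod
    def _reached(state, goal, tol=0.5):
        goal_state, _ = goal
        return abs(state - goal_state) <= tol


class LineEnv:
    """Toy 1-D environment: move along a line; reward 1 once x >= 10."""

    def __init__(self):
        self.x = 0.0

    def step(self, action):             # action in {-1.0, +1.0}
        self.x += action
        done = self.x >= 10.0
        return self.x, (1.0 if done else 0.0), done, {}


def high_policy(state):
    # Propose a subgoal 3 units ahead; reward is only expected at the end.
    goal_state = min(state + 3.0, 10.0)
    return (goal_state, 1.0 if goal_state >= 10.0 else 0.0)


def low_policy(state, goal):
    goal_state, _ = goal
    return 1.0 if goal_state > state else -1.0


agent = HierarchicalAgent(high_policy, low_policy, horizon=5)
final_state, goals = agent.act(LineEnv(), 0.0, n_goals=4)
# goals is the explanation: [(3.0, 0.0), (6.0, 0.0), (9.0, 0.0), (10.0, 1.0)]
```

Note that the explanation is available before and during execution: the later entries of `goals` describe where the agent intends to go next, without rolling out the environment further.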

1. INTRODUCTION

Deep learning has had a huge impact on Reinforcement Learning, making it possible to solve certain problems for the first time, vastly improving performance on many older problems, and often exceeding human performance in difficult tasks (Schrittwieser et al., 2019; Badia et al., 2020). These improvements come at a price, though: deep agents are black boxes whose decisions are difficult to understand and explain due to the complexity and non-obvious behavior of neural networks. In safety-critical applications, it is often essential to verify that certain properties are respected or to understand what the behavior of the agent will be (García & Fernández, 2015; Bragg & Habli, 2018). Simply observing the behavior of the agent is often not enough, since it might take its actions for the wrong reasons or behave surprisingly when faced with an unexpected state. Ideally, the agent would explain its behavior, which would allow for auditing, accountability, and safety-checking (Puiutta & Veith, 2020), unlocking the use of Reinforcement Learning systems in critical areas such as robotics, semi-autonomous driving, or industrial applications.

We provide three contributions to make deep agents more interpretable. First, we develop a new type of explanation for the agent's behavior. Imagine the following scenario: a robotic agent has to traverse difficult terrain until it reaches a specific building, where it collects a reward. The agent decomposes its task into a series of goals (for example, positions it has to reach) and tries to reach these goals successively until it reaches the reward zone. The agent is more interpretable since it explicitly produces the successive goals it is trying to accomplish: the current goal explains its short-term behavior (the joint movements are done to reach the current goal position) and the remaining

