EXPLAINABLE REINFORCEMENT LEARNING THROUGH GOAL-BASED EXPLANATIONS

Abstract

Deep Reinforcement Learning agents achieve state-of-the-art performance in many tasks, but at the cost of becoming black-boxes that are hard to interpret and understand, which limits their use in trusted applications such as robotics or industrial settings. We introduce goal-based interpretability, where the agent produces goals that show the reason for its current actions (reaching the current goal), while future goals indicate its desired future behavior without having to run the environment, a useful property in environments with no simulator. Additionally, in many environments the goals can be visualised, making them easier to understand for non-experts. To obtain a goal-producing agent without requiring domain knowledge, we use 2-layer hierarchical agents where the top layer produces goals and the bottom layer attempts to reach those goals. Most classical reinforcement learning algorithms cannot be used to train goal-producing hierarchical agents. We introduce a new algorithm to train these more interpretable agents, called HAC-General with Teacher, an extension of the Hierarchical Actor-Critic (HAC) algorithm (Levy et al., 2019) that adds 2 key improvements: (1) goals now consist of a state s to be reached and a reward r to be collected, making it possible for the goal-producing policy to incentivize the goal-reaching policy to go through high-reward paths, and (2) an expert teacher is leveraged to improve the training of the hierarchical agent, in a process similar to but distinct from imitation learning and distillation. Unlike HAC, there is no requirement that the environment provide the desired end state. Additionally, our experiments show that HAC-General with Teacher performs better and learns faster than HAC, and can solve environments that HAC fails to solve.

1. INTRODUCTION

Deep learning has had a huge impact on Reinforcement Learning, making it possible to solve certain problems for the first time, vastly improving performance on many older problems, and often exceeding human performance in difficult tasks (Schrittwieser et al., 2019; Badia et al., 2020). These improvements come at a price, though: deep agents are black-boxes whose decisions are difficult to understand and explain due to the complexity and non-obvious behavior of neural networks. In safety-critical applications, it is often essential to verify that certain properties are respected or to understand what the behavior of the agent will be (García & Fernández, 2015; Bragg & Habli, 2018). Simply observing the behavior of the agent is often not enough, since it might take its actions for the wrong reasons or behave surprisingly when faced with an unexpected state. Ideally, the agent would explain its behavior, allowing for auditing, accountability, and safety-checking (Puiutta & Veith, 2020), and unlocking the use of Reinforcement Learning systems in critical areas such as robotics, semi-autonomous driving, or industrial applications.

We provide three contributions toward more interpretable deep agents. First, we develop a new type of explanation for the agent's behavior. Imagine the following scenario: a robotic agent has to traverse difficult terrain until it reaches a specific building, where it collects a reward. The agent decomposes its task into a series of goals (for example, positions it has to reach) and tries to reach these goals successively until it reaches the reward zone. The agent is more interpretable since it explicitly produces the successive goals it is trying to accomplish: the current goal explains its short-term behavior (the joint movements are done to reach the current goal position) and the remaining goals help us understand the agent's overall plan to solve the task and predict its future behavior.
We call goal-based explanation or goal-based interpretability the use of a plan composed of a series of goals. Both model-based reinforcement learning (Moerland et al., 2020) and planning techniques (Fox et al., 2017) appear similar to goal-based explanations, but there are important differences that make this technique novel. Goal-based explanations do not require learning a model of the environment (neither the reward function nor the transition function), and are thus compatible with both model-free and model-based reinforcement learning. Planning can be a useful explainability technique, but it has a few limitations: it typically requires knowing the end goal, it often cannot be applied to complex Markov Decision Processes, and it may have difficulty handling very large or continuous action or state spaces. Our approach suffers from none of these limitations.

Second, we develop a method to make the agent produce the goals that add interpretability. To do so, the agent is structured as a 2-level hierarchy of policies, with a goal-picking policy that produces goals and a goal-reaching policy that attempts to reach them. Goals are (state, minimum desired reward) pairs, meaning the goal-reaching policy has to reach a specific state in at most H steps and collect a minimum amount of reward along the way. To create a goal-based explanation, the goal-picking policy is queried repeatedly: given the agent's state s, we query for the current goal g1 = (s1, r1); we then assume the agent reaches the state s1 and query for the next goal g2 = (s2, r2); this process is repeated for a fixed number of goals per environment, though in future work more sophisticated algorithms to determine the adequate number of goals could be compared.

Our third contribution is HAC-General, a new algorithm specifically designed to train goal-producing hierarchical agents.
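The querying procedure above can be sketched as a simple loop: the goal-picking policy is called on the current state, the proposed goal state is assumed to be reached, and the process repeats. This is a minimal illustration; the function and policy names are hypothetical, not the paper's implementation.

```python
def goal_based_explanation(goal_policy, initial_state, num_goals):
    """Build a goal-based explanation by querying the goal-picking policy
    repeatedly, assuming each proposed goal state is reached in turn."""
    explanation = []
    state = initial_state
    for _ in range(num_goals):
        # Each goal is a (state to reach, minimum desired reward) pair.
        goal_state, min_reward = goal_policy(state)
        explanation.append((goal_state, min_reward))
        # Assume the goal-reaching policy succeeds; continue from the goal.
        state = goal_state
    return explanation

# Toy 1-D example: a policy that always proposes a state 1.0 ahead
# and asks for at least 0.5 reward along the way.
toy_policy = lambda s: (s + 1.0, 0.5)
plan = goal_based_explanation(toy_policy, 0.0, 3)
# plan is [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]
```

Note that no environment step is taken: the plan is produced purely by querying the policy, which is what makes the technique usable when no simulator is available.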
This algorithm builds upon the Hierarchical Actor-Critic (HAC) algorithm (Levy et al., 2019) and makes it more widely applicable by not requiring the environment to provide an explicit end-goal. Instead of trying to reach the end-goal as fast as possible while ignoring the environment's rewards, HAC-General trains the agent to maximize the collected reward. Our extension preserves the key property that makes Hierarchical Actor-Critic effective: an effective strategy for dealing with non-stationarity, achieved by training each level under the illusion that the policies at lower levels are optimal. HAC-General can also leverage a black-box expert to improve and speed up the training of the hierarchical agent.
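The hindsight idea underlying HAC can be illustrated with the (state, reward) goal format described above: a failed attempt is relabeled as if the state actually reached, together with the reward actually collected, had been the commanded goal. This is a generic HER-style sketch under assumed data structures, not the exact HAC-General procedure.

```python
def hindsight_relabel(trajectory):
    """Relabel a low-level trajectory in hindsight (HER-style sketch):
    treat the final state reached and the total reward collected as the
    goal, so the trajectory becomes a successful goal-reaching example.
    Each transition is an assumed (state, action, next_state, reward) tuple."""
    achieved_state = trajectory[-1][2]             # state actually reached
    total_reward = sum(r for (_, _, _, r) in trajectory)
    # The hindsight goal uses the (s, r) goal format: a state to reach
    # plus a minimum reward to collect along the way.
    goal = (achieved_state, total_reward)
    return [(state, goal, action, next_state, reward)
            for (state, action, next_state, reward) in trajectory]

# Toy trajectory in a 1-D chain: two steps, each collecting 0.5 reward.
traj = [(0, 1, 1, 0.5), (1, 1, 2, 0.5)]
relabeled = hindsight_relabel(traj)
# Every relabeled transition carries the hindsight goal (2, 1.0).
```

Relabeling in this way is what gives the higher level "the illusion" that the lower level is optimal: from the replay buffer's point of view, every stored goal was reached.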

2.1. EXPLAINABLE REINFORCEMENT LEARNING

The Reinforcement Learning community has recognized the need for interpretable and explainable agents, and researchers have developed several methods to add explainability and interpretability. Puiutta & Veith (2020) survey explainability techniques; we briefly describe some key methods.

Saliency-map methods add interpretability by determining the importance of each input feature for the policy when it generates its output. Perturbation-based methods (Greydanus et al., 2018) measure importance by perturbing different parts of the input and measuring the change in the policy's output: the larger the change, the more important the feature, and the magnitude of the change quantifies the relative importance of features, making it possible to build the saliency map. Object-based saliency maps (Iyer et al., 2018) measure, in addition to the importance of raw features, the importance of whole objects present in the image. The importance of each object is measured by masking it and measuring the change in the policy's output. Thus, a higher-level object saliency map is created which can be more easily interpreted by non-experts.

Another approach is to distill the policy of the black-box agent into a simpler, more interpretable model while trying to preserve the behavior and performance of the black-box policy. Coppens et al. (2019) distill the black-box policy into a soft decision tree, a type of decision tree where the leaves output a static distribution over the actions and the inner nodes select the sub-branch using a logistic model. A different approach is taken by Liu et al. (2019), who distill the model into linear model U-trees, a type of decision tree in which leaf nodes use a linear model to produce their output (Q-values) instead of outputting a constant value. Both types of decision trees are more interpretable since they follow clear, simpler rules to go down the tree and to pick the output value.
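The perturbation-based saliency idea above can be sketched in a few lines: replace each patch of the observation with noise and record how much the policy's output moves. This is a generic illustration with assumed names and a Gaussian-noise perturbation, not the specific masking used by Greydanus et al. (2018).

```python
import numpy as np

def perturbation_saliency(policy, observation, patch=4, sigma=1.0):
    """Perturbation-based saliency sketch: perturb each patch of a 2-D
    observation and measure the change in the policy's output vector."""
    base = policy(observation)
    h, w = observation.shape
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            perturbed = observation.copy()
            # Replace the patch with Gaussian noise centered on its mean
            # (one possible perturbation; blurring is another choice).
            perturbed[i:i + patch, j:j + patch] = np.random.normal(
                observation[i:i + patch, j:j + patch].mean(), sigma,
                (patch, patch))
            # Importance = magnitude of the change in the policy's output.
            saliency[i // patch, j // patch] = np.abs(policy(perturbed) - base).sum()
    return saliency

# Toy "policy": output depends only on the top half of the observation,
# so the bottom rows of the saliency map should score near zero.
toy_policy = lambda obs: np.array([obs[:4].sum(), -obs[:4].sum()])
sal = perturbation_saliency(toy_policy, np.ones((8, 8)))
```

Masking whole objects instead of fixed patches, as in the object-based variant, amounts to choosing the perturbed region from a segmentation rather than a grid.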

