EXPLAINING RL DECISIONS WITH TRAJECTORIES

Abstract

Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, explanations are often provided via saliency attribution over the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories it encountered during training. To do so, we encode trajectories in the offline training data both individually and collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach, in terms of both quality of attributions and practical scalability, in diverse environments involving discrete and continuous state and action spaces, such as grid-worlds, video games (Atari), and continuous control (MuJoCo). We also conduct a human study on a simple navigation task to observe how participants' understanding of the task compares with the data attributed for a trained RL policy.

1. INTRODUCTION

Reinforcement learning (Sutton & Barto, 2018) has enjoyed great popularity and achieved huge success, especially in online settings, following the advent of deep reinforcement learning (Mnih et al., 2013; Schulman et al., 2017; Silver et al., 2017; Haarnoja et al., 2018). Deep RL algorithms can now handle high-dimensional observations, such as visual inputs, with ease. However, using these algorithms in the real world requires: i) efficient learning from minimal exploration to avoid catastrophic decisions due to insufficient knowledge of the environment, and ii) explainability of the agent's decisions. The first aspect is studied under offline RL, where the agent is trained on collected experience rather than by exploring directly in the environment, and a large body of work addresses it (Levine et al., 2020; Kumar et al., 2020; Yu et al., 2020; Kostrikov et al., 2021). However, more work is needed to address the explainability of RL decision-making. Previously, researchers have attempted to explain the decisions of RL agents by highlighting important features of the agent's state (input observation) (Puri et al., 2019; Iyer et al., 2018; Greydanus et al., 2018). While these approaches are useful, we take a complementary route: instead of identifying salient state features, we wish to identify the past experiences (trajectories) that led the RL agent to learn certain behaviours. We call this approach trajectory-aware RL explainability. Such explainability confers faith in the decisions suggested by the RL agent in critical scenarios (surgical (Loftus et al., 2020), nuclear (Boehnlein et al., 2022), etc.) by looking at the trajectories responsible for the decision. While this sort of training data attribution has been shown to be highly effective in supervised learning (Nguyen et al., 2021), to the best of our knowledge, this is the first work to study data attribution-based explainability in RL. In the present work, we restrict ourselves to the offline RL setting, where the agent is trained entirely offline, i.e., without interacting with the environment, and is later deployed in the environment.

Our key contributions are as follows:

1. A novel explainability framework for reinforcement learning that aims to identify the experiences (trajectories) that lead an RL agent to learn certain behaviours.

2. A solution for trajectory attribution in the offline RL setting based on state-of-the-art sequence modelling techniques. In our solution, we present a methodology that generates a single embedding for a trajectory of states, actions, and rewards, inspired by approaches in Natural Language Processing (NLP). We also extend this method to generate a single encoding of data containing a set of trajectories. (An illustrative sketch appears at the end of this section.)

3. Analysis of the trajectory explanations produced by our technique, along with analysis of the generated trajectory embeddings, where we demonstrate how different embedding clusters represent different semantically meaningful behaviours. Additionally, we conduct a study to compare human understanding of RL tasks with the attributed trajectories.

This paper is organized as follows. In Sec. 2 we cover work related to explainability in RL and recent developments in offline RL. We then present our trajectory attribution algorithm in Sec. 3. The experiments and results are presented in Sec. 4. We discuss the implications of our work and its potential extensions in the concluding Sec. 5.
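To make contribution 2 concrete, the following is a minimal sketch of how a single embedding could be produced for a trajectory of states, actions, and rewards. It is an illustration under assumed choices only: the encoder architecture, the mean-pooling step, the dimensions, and the name TrajectoryEncoder are expository assumptions and do not reflect the exact model described in Sec. 3.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Illustrative sketch: encode a (state, action, reward) sequence into one vector.

    Each timestep is projected into a common token space, passed through a small
    Transformer encoder, and the token outputs are mean-pooled into a single
    trajectory embedding. All design choices here are assumptions for exposition.
    """

    def __init__(self, state_dim, action_dim, embed_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        # Project each (s, a, r) step into a token of size embed_dim.
        self.token_proj = nn.Linear(state_dim + action_dim + 1, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, states, actions, rewards):
        # states: (B, T, state_dim), actions: (B, T, action_dim), rewards: (B, T, 1)
        tokens = self.token_proj(torch.cat([states, actions, rewards], dim=-1))
        hidden = self.encoder(tokens)   # (B, T, embed_dim)
        return hidden.mean(dim=1)       # (B, embed_dim): one embedding per trajectory


# Usage on toy data: 8 trajectories of length 50 in a 10-D state, 2-D action space.
enc = TrajectoryEncoder(state_dim=10, action_dim=2)
traj_embeddings = enc(torch.randn(8, 50, 10), torch.randn(8, 50, 2), torch.randn(8, 50, 1))
print(traj_embeddings.shape)  # torch.Size([8, 64])
```

A set-level encoding, as mentioned in contribution 2, could analogously be obtained by aggregating the per-trajectory embeddings of the set; the precise construction used in this work is given in Sec. 3.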

2. BACKGROUND AND RELATED WORK

Explainability in RL. Explainable AI (XAI) refers to the field of machine learning (ML) that focuses on developing tools for explaining the decisions of ML models. Explainable RL (XRL) (Puiutta & Veith, 2020) is a sub-field of XAI that specializes in interpreting the behaviours of RL agents. Prior works include approaches that distill the RL policy into simpler models, such as a decision tree (Coppens et al., 2019), or into a human-understandable, high-level decision language (Verma et al., 2018). However, such policy simplification fails to approximate the behaviour of complex RL models. In addition, causality-based approaches (Pawlowski et al., 2020; Madumal et al., 2020) aim to explain an agent's action by identifying its cause using counterfactual samples. Further, saliency-based methods using input feature gradients (Iyer et al., 2018) and perturbations (Puri et al., 2019; Greydanus et al., 2018) provide state-based explanations that help humans understand the agent's actions. To the best of our knowledge, we are the first to explore explaining an agent's behaviour by attributing its actions to previously encountered trajectories rather than by highlighting state features. Memory understanding (Koul et al., 2018; Danesh et al., 2021) is a related direction, in which finite-state representations of recurrent policy networks are analysed for interpretability. However, unlike these works, we focus on sequence embedding generation and avoid using policy networks for actual return optimization.

Offline RL. Offline RL (Levine et al., 2020) refers to the RL setting where an agent learns from collected experiences and does not have direct access to the environment during training. Several specialized algorithms have been proposed for offline RL, including model-free (Kumar et al., 2020; 2019) and model-based (Kidambi et al., 2020; Yu et al., 2020) ones. In this work, we use algorithms from both classes to train offline RL agents. In addition, the RL problem of maximizing long-term return has recently been cast as taking the best possible action given the sequence of past interactions in terms of states, actions, and rewards (Chen et al., 2021; Janner et al., 2021; Reed et al., 2022; Park et al., 2018). Such sequence-modelling approaches to RL, especially those based on the transformer architecture (Vaswani et al., 2017), have produced state-of-the-art results on various offline RL benchmarks and offer rich latent representations to work with. However, little to no work has been done on understanding these sequence representations and their applications. In this work, we base our solution on transformer-based sequence-modelling approaches to leverage their high efficiency in capturing the policy and environment dynamics of offline RL systems. Previously, researchers in group-driven RL (Zhu et al., 2018) employed raw state-reward vectors as trajectory representations. We believe transformer-based embeddings, given their proven capabilities, serve as better representations than such state-reward vectors.
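As a bridge to Sec. 3, the snippet below sketches how such trajectory embeddings could be used for attribution in the manner summarised in the abstract: embeddings are clustered, and a decision is attributed to the clusters whose removal from the training data changes that decision. This is a simplified, hypothetical illustration; the names attribute_decision and retrain_without, the use of k-means, and the exact sensitivity test are assumptions for exposition, not the procedure of Sec. 3.

```python
import numpy as np
from sklearn.cluster import KMeans

def attribute_decision(traj_embeddings, retrain_without, query_state,
                       original_action, n_clusters=5):
    """Illustrative sketch of sensitivity-based trajectory attribution.

    Clusters trajectory embeddings, then, for each cluster, checks whether the
    agent's decision at `query_state` changes once that cluster is removed from
    the training data. `retrain_without(indices)` is a placeholder for offline
    retraining and is assumed to return a policy mapping a state to an action.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(traj_embeddings)
    responsible_clusters = []
    for c in range(n_clusters):
        removed = np.where(labels == c)[0]           # trajectories in cluster c
        policy = retrain_without(removed)            # agent retrained without them
        if policy(query_state) != original_action:   # decision is sensitive to cluster c
            responsible_clusters.append(c)
    return responsible_clusters
```

The sketch retrains one agent per removed cluster purely for clarity of exposition; the method proposed in this paper instead estimates the sensitivity of the decision in the encoded trajectory space (cf. the abstract and Sec. 3).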

