EXPLAINING RL DECISIONS WITH TRAJECTORIES

Abstract

Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces, such as grid-worlds, video games (Atari) and continuous control (MuJoCo). We also conduct a human study on a simple navigation task to observe how human understanding of the task compares with the data attributed for a trained RL policy.

1. INTRODUCTION

Reinforcement learning (Sutton & Barto, 2018) has enjoyed great popularity and has achieved huge success, especially in online settings, since the advent of deep reinforcement learning (Mnih et al., 2013; Schulman et al., 2017; Silver et al., 2017; Haarnoja et al., 2018). Deep RL algorithms are now able to handle high-dimensional observations such as visual inputs with ease. However, using these algorithms in the real world requires i) efficient learning from minimal exploration, to avoid catastrophic decisions due to insufficient knowledge of the environment, and ii) explainability. The first aspect is studied under offline RL, where the agent is trained on collected experience rather than by exploring directly in the environment. There is a large body of work on offline RL (Levine et al., 2020; Kumar et al., 2020; Yu et al., 2020; Kostrikov et al., 2021). However, more work is needed to address the explainability aspect of RL decision-making. Previously, researchers have attempted to explain the decisions of an RL agent by highlighting important features of the agent's state (input observation) (Puri et al., 2019; Iyer et al., 2018; Greydanus et al., 2018). While these approaches are useful, we take a complementary route. Instead of identifying salient state features, we wish to identify the past experiences (trajectories) that led the RL agent to learn certain behaviours. We call this approach trajectory-aware RL explainability. Such explainability builds confidence in the decisions suggested by the RL agent in critical scenarios (surgical (Loftus et al., 2020), nuclear (Boehnlein et al., 2022), etc.) by exposing the trajectories responsible for the decision. While this sort of training data attribution has been shown to be highly effective in supervised learning (Nguyen et al., 2021), to the best of our knowledge, this is the first work to study data attribution-based explainability in RL.
In the present work, we restrict ourselves to the offline RL setting, where the agent is trained completely offline, i.e., without interacting with the environment, and is later deployed in the environment. The contributions of this work are as follows:
1. A novel explainability framework for reinforcement learning that aims to find the experiences (trajectories) that lead an RL agent to learn certain behaviour.
2. A solution for trajectory attribution in the offline RL setting based on state-of-the-art sequence modeling techniques. In our solution, we present a methodology that generates a single embedding for a trajectory of states, actions, and rewards, inspired by approaches in Natural Language Processing (NLP). We also extend this method to generate a single encoding of data containing a set of trajectories.
3. Analysis of the trajectory explanations produced by our technique, along with analysis of the generated trajectory embeddings, where we demonstrate how different embedding clusters represent different semantically meaningful behaviours. Additionally, we conduct a study to compare human understanding of RL tasks with the attributed trajectories.
This paper is organized as follows. In Sec. 2, we cover works related to explainability in RL and recent developments in offline RL. We then present our trajectory attribution algorithm in Sec. 3. The experiments and results are presented in Sec. 4. We discuss the implications of our work and its potential extensions in the concluding Sec. 5.

2. BACKGROUND AND RELATED WORK

Explainability in RL. Explainable AI (XAI) refers to the field of machine learning (ML) that focuses on developing tools for explaining the decisions of ML models. Explainable RL (XRL) (Puiutta & Veith, 2020) is a sub-field of XAI that specializes in interpreting the behaviours of RL agents. Prior works include approaches that distill the RL policy into simpler models such as decision trees (Coppens et al., 2019) or into a human-understandable high-level decision language (Verma et al., 2018). However, such policy simplification fails to approximate the behaviour of complex RL models. In addition, causality-based approaches (Pawlowski et al., 2020; Madumal et al., 2020) aim to explain an agent's action by identifying the cause behind it using counterfactual samples. Further, saliency-based methods using input feature gradients (Iyer et al., 2018) and perturbations (Puri et al., 2019; Greydanus et al., 2018) provide state-based explanations that aid humans in understanding the agent's actions. To the best of our knowledge, we are the first to explore explaining an agent's behaviour by attributing its actions to past encountered trajectories rather than highlighting state features. Memory understanding (Koul et al., 2018; Danesh et al., 2021) is another relevant direction, where finite state representations of recurrent policy networks are analysed for interpretability. However, unlike these works, we focus on sequence embedding generation and avoid using policy networks for actual return optimization.
Offline RL. Offline RL (Levine et al., 2020) refers to the RL setting where an agent learns from collected experiences and does not have direct access to the environment during training. Several specialized algorithms have been proposed for offline RL, including model-free ones (Kumar et al., 2020; 2019) and model-based ones (Kidambi et al., 2020; Yu et al., 2020). In this work, we use algorithms from both of these classes to train offline RL agents.
In addition, the RL problem of maximizing long-term return has recently been cast as taking the best possible action given the sequence of past interactions in terms of states, actions and rewards (Chen et al., 2021; Janner et al., 2021; Reed et al., 2022; Park et al., 2018). Such sequence modelling approaches to RL, especially the ones based on the transformer architecture (Vaswani et al., 2017), have produced state-of-the-art results on various offline RL benchmarks and offer rich latent representations to work with. However, little to no work has been done on understanding these sequence representations and their applications. In this work, we base our solution on transformer-based sequence modelling approaches to leverage their high efficiency in capturing the policy and environment dynamics of offline RL systems. Previously, researchers in group-driven RL (Zhu et al., 2018) have employed raw state-reward vectors as trajectory representations. We believe transformer-based embeddings, given their proven capabilities, would serve as better representations than state-reward vectors.

3. TRAJECTORY ATTRIBUTION

Preliminaries. We denote the offline RL dataset by $D$, which comprises a set of $n_\tau$ trajectories. Each trajectory $\tau_j$ comprises a sequence of observation ($o_k$), action ($a_k$) and per-step reward ($r_k$) tuples, with $k$ ranging from 1 to the length of the trajectory $\tau_j$. We begin by training an offline RL agent on this data using any standard offline RL algorithm from the literature.
Algorithm. Having obtained the learned policy using an offline RL algorithm, our objective now is to attribute this policy, i.e., the action chosen by this policy at a given state, to a set of trajectories. We intend to achieve this in the following way. We want to find the smallest set of trajectories whose absence from the training data leads to different behaviour at the state under consideration. That is, we posit that this set of trajectories contains specific behaviours and the respective feedback from the environment that train the RL agent to make decisions in a certain manner. This identified set of trajectories is then provided as the attribution for the original decision. While this basic idea is straightforward and intuitive, it is not computationally feasible beyond a few small RL problems with discrete state and action spaces. The key requirement for scaling this approach to large, continuous state and action space problems is to group the trajectories into clusters, which can then be used to analyze their role in the decision-making of the RL agent. In this work, we propose to cluster the trajectories using trajectory embeddings produced with the help of state-of-the-art sequence modeling approaches. Figure 1 gives an overview of our proposed approach, which involves five steps: (i) Trajectory encoding, (ii) Trajectory clustering, (iii) Data embedding, (iv) Training explanation policies, and (v) Cluster attribution, each of which is explained below.
(i) Trajectory Encoding.
First, we tokenize the trajectories in the offline data according to the specifications of the sequence encoder used (e.g., decision transformer or trajectory transformer). The observation, action and reward tokens of a trajectory are then fed to the sequence encoder to produce corresponding latent representations, which we refer to as output tokens. We define the trajectory embedding as the average of these output tokens. This technique is inspired by average-pooling techniques (Choi et al., 2021; Briggs, 2021). For a trajectory $\tau_j$ with $T$ time steps and a sequence encoder $E$, the encoding step reads:
  $(e_{o_{1,j}}, e_{a_{1,j}}, e_{r_{1,j}}, \ldots, e_{o_{T,j}}, e_{a_{T,j}}, e_{r_{T,j}}) \leftarrow E(o_{1,j}, a_{1,j}, r_{1,j}, \ldots, o_{T,j}, a_{T,j}, r_{T,j})$  // where $3T$ = #input tokens
  /* Take mean of outputs to generate $\tau_j$'s embedding $t_j$ */
  $t_j \leftarrow (e_{o_{1,j}} + e_{a_{1,j}} + e_{r_{1,j}} + e_{o_{2,j}} + e_{a_{2,j}} + e_{r_{2,j}} + \ldots + e_{o_{T,j}} + e_{a_{T,j}} + e_{r_{T,j}}) / (3T)$
  Append $t_j$ to $T$
  Output: Return the trajectory embeddings $T = \{t_i\}$
(ii) Trajectory Clustering. Having obtained the trajectory embeddings, we cluster them using the X-Means clustering algorithm (Pelleg et al., 2000) with the implementation provided by Novikov (2019). While, in principle, any suitable clustering algorithm can be used here, we chose X-Means as it is a simple extension of the K-means clustering algorithm (Lloyd, 1982) that determines the number of clusters $n_c$ automatically. This enables us to identify all possible patterns in the trajectories without forcing $n_c$ as a hyperparameter (Refer to Algorithm 2).
(iii) Data Embedding. We need a way to identify the least change in the original data that leads to a change in the behaviour of the RL agent. To achieve this, we propose a representation for data comprising a collection of trajectories. The representation has to be agnostic to the order in which trajectories are present in the collection.
So, we follow the set-encoding procedure prescribed in (Zaheer et al., 2017): we first sum the embeddings of the trajectories in the collection, normalize this sum by dividing it by a constant, and then apply a non-linearity, in our case simply a softmax over the feature dimension, to generate a single data embedding (Refer to Algorithm 3). We use this technique to generate data embeddings for $n_c + 1$ sets of trajectories. The first set represents the entire training data, whose embedding is denoted by $d_{\text{orig}}$. The remaining $n_c$ sets are constructed as follows. For each trajectory cluster $c_j$, we construct a set containing the entire training data except the trajectories contained in $c_j$. We call this set the complementary data set corresponding to cluster $c_j$, and the corresponding data embedding the complementary data embedding $d_j$.
(iv) Training Explanation Policies. In this step, for each cluster $c_j$, we train an offline RL agent using its complementary data set. We ensure that all the training conditions (algorithm, weight initialization, optimizers, hyperparameters, etc.) are identical to those used to train the original RL policy, except for the modification in the training data. We call this newly learned policy the explanation policy corresponding to cluster $c_j$. We thus get $n_c$ explanation policies at the end of this step. In addition, we compute data embeddings for the complementary data sets (Refer to Algorithm 4).
(v) Cluster Attribution. In this final step, given a state, we note the actions suggested by all the explanation policies at this state. We then compute the distances of these actions (where we assume a metric over the action space) from the action suggested by the original RL agent at the state. The explanation policies corresponding to the maximum of these distances form the candidate attribution set.
For each policy in this candidate attribution set, we compute the distance between its complementary data embedding and the data embedding of the entire training data using the Wasserstein metric, which captures distances between softmax simplices (Vallender, 1974). We then select the policy that has the smallest data distance and attribute the decision of the RL agent to the cluster corresponding to this policy (Refer to Algorithm 5). Our approach, comprising all five steps, is summarized in Algorithm 6.
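The cluster attribution step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the 0/1 action metric, and the toy embeddings are hypothetical, and the Wasserstein distance is computed in its simple one-dimensional form, treating the feature indices of the softmax embedding as an ordered, unit-spaced support.

```python
from typing import Callable, Dict, Hashable, Sequence


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """W1 distance between two distributions on the same unit-spaced support.

    Assumption: feature indices are treated as ordered points on a line.
    """
    total, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i      # running difference of the two CDFs
        total += abs(cdf_gap)
    return total


def attribute_cluster(
    orig_action: Hashable,
    explanation_actions: Dict[int, Hashable],
    d_orig: Sequence[float],
    comp_embeddings: Dict[int, Sequence[float]],
    action_distance: Callable[[Hashable, Hashable], float],
    tol: float = 1e-12,
) -> int:
    """Pick the cluster whose removal changes the action most, breaking ties
    by the smallest distance between data embeddings."""
    dists = {c: action_distance(orig_action, a) for c, a in explanation_actions.items()}
    max_dist = max(dists.values())
    # Candidate set: explanation policies at maximal action distance.
    candidates = [c for c, d in dists.items() if max_dist - d <= tol]
    # Among candidates, prefer the smallest change in the data embedding.
    return min(candidates, key=lambda c: wasserstein_1d(d_orig, comp_embeddings[c]))


# Toy example (hypothetical numbers): three clusters, discrete actions.
zero_one = lambda a, b: 0.0 if a == b else 1.0
actions = {0: "right", 1: "left", 2: "up"}
d_orig = [0.5, 0.3, 0.2]
d_comp = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.3, 0.5], 2: [0.45, 0.35, 0.2]}
attributed = attribute_cluster("right", actions, d_orig, d_comp, zero_one)
```

In this toy case clusters 1 and 2 both change the action, and cluster 2 is selected because its complementary data embedding is closer to the original one.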

4. EXPERIMENTS AND RESULTS

Next, we present experimental results to show the effectiveness of our approach in generating trajectory explanations. We address the following key questions: Q1) Do we generate reliable trajectory explanations? (Sec. 4.2) Q2) How does a human understanding of an environment align with trajectories attributed by our algorithm and what is the scope of data attribution techniques? (Sec. 4.3)

4.1. EXPERIMENTAL SETUP

We first describe the environments, models, and metrics designed to study the reliability of our trajectory explanations.
RL Environments. We perform experiments on three environments: i) Grid-world (Figure 5), which has discrete state and action spaces, ii) Seaquest from the Atari suite, which has continuous visual observations and discrete action spaces (Bellemare et al., 2013), and iii) HalfCheetah from the MuJoCo suite, a control environment with continuous state and action spaces (Todorov et al., 2012).
Offline Data and Sequence Encoders. For grid-world, we collect offline data of 60 trajectories from policy rollouts of other RL agents and train an LSTM-based trajectory encoder following the procedure described for the trajectory transformer, replacing the transformer with an LSTM. For Seaquest, we collect offline data of 717 trajectories from the D4RL-Atari repository and use a pre-trained decision transformer as the trajectory encoder. Similarly, for HalfCheetah, we collect offline data of 1000 trajectories from the D4RL repository (Fu et al., 2020) and use a pre-trained trajectory transformer as the trajectory encoder. To cluster high-level skills in long trajectory sequences, we divide the Seaquest trajectories into 30-length sub-trajectories and the HalfCheetah trajectories into 25-length sub-trajectories. These choices were made based on the transformers' input block sizes and the quality of clustering.
Offline RL Training and Data Embedding. We train the offline RL agents for each environment using the collected data as follows: for grid-world, we use model-based offline RL, and for Seaquest and HalfCheetah, we employ DiscreteSAC (Christodoulou, 2019) and SAC (Haarnoja et al., 2018), respectively, using d3rlpy implementations (Takuma Seno, 2021). We compute the data embedding of the entire training data for each of the environments. See Appendix A.3 for additional training details.
Encoding of Trajectories and Clustering.
We encode the trajectory data using sequence encoders and cluster the output trajectory embeddings using the X-means algorithm. More specifically, we obtain 10 trajectory clusters for grid-world, 8 for Seaquest, and 10 for HalfCheetah. These clusters represent meaningful high-level behaviours such as 'falling into the lava', 'filling in oxygen', 'taking long forward strides', etc. This is discussed in greater detail in Section A.4.
Complementary Data Sets. We obtain complementary data sets using the aforementioned cluster information, which gives 10 complementary data sets for grid-world, 8 for Seaquest, and 10 for HalfCheetah. Next, we compute the data embeddings corresponding to these newly formed data sets.
Explanation Policies. Subsequently, we train explanation policies on the complementary data sets for each environment. The training produces 10 additional policies for grid-world, 8 policies for Seaquest, and 10 policies for HalfCheetah. In summary, we train the original policy on the entire data, obtain the data embedding for the entire data, cluster the trajectories, and obtain their explanation policies and complementary data embeddings.
Trajectory Attribution. Finally, we attribute a decision made by the original policy for a given state to a trajectory cluster. In our experiments, we choose the top-3 trajectories from the attributed cluster by matching the context of the state-action under consideration with the trajectories in the cluster.
Evaluation Metrics. We compare policies trained on different data using three metrics (the deterministic nature of policies is assumed throughout the discussion):
1) Initial State Value Estimate, denoted by $E(V(s_0))$: a measure of expected long-term returns used to evaluate offline RL training, as described in Paine et al. (2020).
2) Local Mean Absolute Action-Value Difference, defined as $E(|\Delta Q_{\pi_{\text{orig}}}|) = E(|Q_{\pi_{\text{orig}}}(\pi_{\text{orig}}(s)) - Q_{\pi_{\text{orig}}}(\pi_j(s))|)$: a measure of how the original policy perceives the suggestions given by the explanation policies.
3) Action Contrast Measure: a measure of the difference between the actions suggested by the explanation policies and the original action. Here, we use $E(\mathbb{1}(\pi_{\text{orig}}(s) \neq \pi_j(s)))$ for discrete action spaces and $E((\pi_{\text{orig}}(s) - \pi_j(s))^2)$ for continuous action spaces.
Further, we compute distances between the embeddings of the original and complementary data sets using the Wasserstein metric, $W_{\text{dist}}(d_{\text{orig}}, d_j)$, later normalized to $[0, 1]$. Finally, the cluster attribution frequency is measured using the metric $P(c_{\text{final}} = c_j)$.
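As an illustration of the second and third metrics, the sketch below estimates them as sample means over a set of evaluation states. It is a hedged example rather than the evaluation code used in the paper: the function names, the toy policies, and the toy action-value function are hypothetical, the action-value function is assumed to take an explicit `(state, action)` pair, and vector-valued continuous actions are assumed to be compared by squared Euclidean distance.

```python
from typing import Callable, Hashable, Sequence


def action_contrast_discrete(pi_orig: Callable, pi_j: Callable, states: Sequence) -> float:
    """Fraction of states where the explanation policy disagrees with the original."""
    return sum(pi_orig(s) != pi_j(s) for s in states) / len(states)


def action_contrast_continuous(pi_orig: Callable, pi_j: Callable, states: Sequence) -> float:
    """Mean squared Euclidean distance between the suggested action vectors."""
    total = 0.0
    for s in states:
        total += sum((x - y) ** 2 for x, y in zip(pi_orig(s), pi_j(s)))
    return total / len(states)


def mean_abs_q_difference(q_orig: Callable, pi_orig: Callable, pi_j: Callable,
                          states: Sequence) -> float:
    """Mean |Q_orig(s, pi_orig(s)) - Q_orig(s, pi_j(s))| over the given states."""
    return sum(abs(q_orig(s, pi_orig(s)) - q_orig(s, pi_j(s))) for s in states) / len(states)


# Toy discrete example (hypothetical policies and action values).
states = [0, 1, 2, 3]
pi_orig = lambda s: "right"
pi_j = lambda s: "right" if s < 3 else "left"
q_orig = lambda s, a: 1.0 if a == "right" else 0.0

contrast = action_contrast_discrete(pi_orig, pi_j, states)
q_gap = mean_abs_q_difference(q_orig, pi_orig, pi_j, states)

# Toy continuous example: two-dimensional actions differing in one coordinate.
c_orig = lambda s: [0.0, 1.0]
c_j = lambda s: [1.0, 1.0]
contrast_cont = action_contrast_continuous(c_orig, c_j, states)
```

Here the explanation policy disagrees with the original one in one of four states, and the original action-value function rates that disagreement as a loss of one unit of return, so both discrete metrics evaluate to 0.25.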

4.2. TRAJECTORY ATTRIBUTION RESULTS

Qualitative Results. Figure 2 depicts a grid-world state, (1, 1), the corresponding decision by the trained offline RL agent, 'right', and the attribution trajectories explaining the decision. As we can observe, the decision is influenced not only by the trajectory that goes through (1, 1) (traj.-i) but also by other, distant trajectories (trajs.-ii, iii). These examples demonstrate that distant experiences (e.g. traj.-iii) could significantly influence the RL agent's decisions, making trajectory attribution an essential component of future XRL techniques. Further, Figure 3 shows the Seaquest agent (submarine) suggesting the action 'left' for the given observation in the context of the past few frames. The corresponding attributed trajectories provide insights into how the submarine aligns itself to target enemies coming from the left. Figure 10 shows a HalfCheetah observation, the agent's suggested action in terms of hinge torques, and the corresponding attributed trajectories showing runs that influence the suggested set of torques. This is an interesting use case of trajectory attribution, as it explains complicated torques, understood mainly by domain experts, in terms of the simple semantic intent of 'getting up from the floor'.
Quantitative Analysis. Tables 1, 2 and 3 present quantitative analysis of the proposed trajectory attribution. The initial state value estimate for the original policy $\pi_{\text{orig}}$ matches or exceeds the estimates for explanation policies trained on different complementary data sets in all three environment settings. This indicates that the original policy, having access to all behaviours, is able to outperform other policies that are trained on data lacking information about important behaviours (e.g. grid-world: reaching a distant goal, Seaquest: fighting in the top-right corner, HalfCheetah: stabilizing the frame while taking strides). Furthermore, the local mean absolute action-value difference and the action differences turn out to be highly correlated (Tab. 2 and 3), i.e., the explanation policies that suggest the most contrasting actions are usually perceived by the original policy as suggesting low-return actions. This evidence supports the proposed trajectory attribution algorithm, as we want to identify the behaviours whose removal makes the agent choose actions that were not originally considered suitable. In addition, we provide the distances between the data embeddings in the penultimate column. The cluster attribution distribution is given in the last column, which depicts how RL decisions depend on the various behaviour clusters. Interestingly, in the case of grid-world, we found that only a few clusters, containing information about reaching goals and avoiding lava, had the most significant effect on the original RL policy.
We conduct two additional analyses: 1) trajectory attribution on Seaquest trained using Discrete BCQ (Sec. A.7), and 2) Breakout trajectory attribution (Sec. A.8). In the first one, we find a similar influence of clusters across the decision-making of policies trained under different algorithms. Secondly, in the Breakout results, we find that clusters with the high-level meanings of 'depletion of a life' and 'right corner shot' strongly influence decision-making. This is insightful, as the two behaviours are highly critical in taking action: one avoids the premature end of the game, and the other is part of the tunneling strategy previously found in Greydanus et al. (2018).
Table 1: Quantitative Analysis of Grid-world Trajectory Attribution. The analysis is provided using 5 metrics. The higher the $E(V(s_0))$, the better the trained policy. A high $E(|\Delta Q_{\pi_{\text{orig}}}|)$ along with a high $E(\mathbb{1}(\pi_{\text{orig}}(s) \neq \pi_j(s)))$ is desirable. Policies with lower $W_{\text{dist}}(d_{\text{orig}}, d_j)$ and high action contrast are given priority during attribution. The cluster attribution distribution is given in the final column.
Table columns (the table bodies are not present in this text): $\pi$, $E(V(s_0))$, $E(|\Delta Q_{\pi_{\text{orig}}}|)$, $E(\mathbb{1}(\pi_{\text{orig}}(s) \neq \pi_j(s)))$, $W_{\text{dist}}(d_{\text{orig}}, d_j)$, $P(c_{\text{final}} = c_j)$.

4.3. QUANTIFYING UTILITY OF THE TRAJECTORY ATTRIBUTION: A HUMAN STUDY

One of the key desiderata of explanations is to provide useful, relevant information about the behaviour of complex AI models. To this end, prior works in other domains such as vision (Goyal et al., 2019) and language (Liu et al., 2019) have conducted human studies to quantify the usefulness of output explanations. Similarly, having established a straightforward attribution technique, we wish to analyze the utility of the generated attributions and their scope in the real world through a human study. Interestingly, humans possess an explicit understanding of RL gaming environments and can reason about actions at a given state to a satisfactory extent. Leveraging this, we pilot a human study with 10 participants who had a complete understanding of the grid-world navigation environment, to quantify the alignment between human knowledge of the environment dynamics and the actual factors influencing RL decision-making.
Study setup. For the study, we design two tasks: i) participants need to choose the trajectory that they think best explains the action suggested in a grid cell, and ii) participants need to identify all relevant trajectories that explain the action suggested by the RL agent. For instance, Fig. 4a shows one instance of the task, where the agent is located at (1, 1) and is taking the action 'right', together with a subset of attributed trajectories for this action. In both tasks, in addition to the attributions proposed by our technique, we add i) a randomly selected trajectory as an explanation and ii) a trajectory selected from a cluster different from the one attributed by our approach. These additional trajectories aid in identifying human bias toward certain trajectories while understanding the agent's behaviour.
Results. On average, across three studies in Task 1, we found that 70% of the time, human participants chose trajectories attributed by our proposed method as the best explanation for the agent's action.
On average, across three studies in Task 2, nine participants chose the trajectories generated by our algorithm (Attr Traj 2). In Fig. 4b, we observe good alignment between humans' understanding of the grid navigation task and the trajectories that actually influence decision-making. Interestingly, the results also demonstrate that not all trajectories generated by our algorithm are considered relevant by humans, and often they are considered only as good as a random trajectory (Fig. 4b; Attr Traj 1). In all, as per the Task 1 results, humans fail to correctly identify the factors influencing an RL decision 30% of the time on average. Additionally, as per the Task 2 results, humans can neglect the actual factors driving actions while trying to understand a decision. These findings highlight the need for data attribution-based explainability tools to build trust among human stakeholders before handing over decision-making to RL agents in the near future.

5. DISCUSSION

In this work, we proposed a novel explanation technique that attributes decisions suggested by an RL agent to trajectories encountered by the agent in the past. We provided an algorithm that enables us to perform trajectory attribution in offline RL. The key idea behind our approach was to encode trajectories using sequence modelling techniques, cluster the trajectories using these embeddings, and then study the sensitivity of the original RL agent's policy to the trajectories present in each of these clusters. We demonstrated the utility of our method using experiments in grid-world, Seaquest and HalfCheetah environments. Finally, we presented a human evaluation of the results of our method to underline the necessity of trajectory attribution. The ideas presented in this paper, such as generating trajectory embeddings using sequence encoders and creating an encoding of a set of trajectories, can be extended to other domains of RL. For instance, the trajectory embeddings could be beneficial in recognizing hierarchical behaviour patterns as studied under options theory (Sutton et al., 1999). Likewise, the data encoding could prove helpful in studying transfer learning (Zhu et al., 2020) in RL by utilizing the data embeddings from both the source and target decision-making tasks. From the XRL point of view, we wish to extend our work to online RL settings, where the agent constantly collects experiences from the environment. While our results are highly encouraging, one limitation of this work is that there are neither established evaluation benchmarks nor other works to compare with. We believe that more extensive human studies will help address this.
... and c) HalfCheetah. We find that these clusters represent semantically meaningful high-level behaviours.
We observe that the trajectory embeddings obtained from the sequence encoders, when clustered, exhibit characteristic high-level behaviours.
For instance, in the case of grid-world (refer to Fig. 7), the clusters comprise semantically similar trajectories where the agent demonstrates behaviours such as 'falling into the lava', 'achieving the goal in the first quadrant', 'mid-grid journey to the goal', etc. For Seaquest (Fig. 8), we obtain trajectory clusters that represent high-level behaviours such as 'filling in oxygen', 'fighting along the surface', 'submarine bursting due to collision with enemy', etc., and for HalfCheetah (Fig. 9), we obtain trajectory clusters that represent high-level actions such as 'taking long forward strides', 'jumping on hind leg', 'running with head down', etc. Although these results look quite promising, in this work we mainly focus on the trajectory attribution that leverages these findings. In future, we wish to analyse the trajectory embeddings and the behaviour patterns in greater detail.
Grid-world cluster panels: 'Achieving goal in top-right corner' (Cluster 1), 'Mid-grid journey to goal' (Cluster 3), 'Falling into lava' (Cluster 9).
In Sec. 4.2, we perform trajectory attribution on a Seaquest agent trained using the Discrete SAC algorithm. Here, we show the results of our attribution algorithm in identifying influential trajectories for agents trained on the same data but with a different RL algorithm. Specifically, we choose Discrete Batch Constrained Q-Learning (Fujimoto et al., 2019b; a) to train a Seaquest policy. Fig. 12 depicts a qualitative explanation generated using our algorithm for the DiscreteBCQ-trained agent. Table 4 gives the quantitative numbers associated with the attributions performed in this setting. It is interesting to note that our proposed algorithm assigns similar importance to the various clusters as in Table 2. That is, we find that certain behaviours in the data play a similar role in determining the final execution policy regardless of the algorithm used for training. This suggests that our algorithm provides consistent insights across different RL algorithms.
Table 5: Quantitative Analysis of Trajectory Attribution for DiscreteBCQ-trained Breakout Agent.
We identify that clusters 2 and 3, representing 'corner shots from right' and 'depletion of a life', impact the decision-making significantly (Fig. 14). This is insightful given how important these behaviours are in general: the latter shows how to avoid ending the game prematurely, and the former is a well-known strategy in Breakout of playing at the right end of the frame to break the walls on the top left and create a tunnel.



Figure 1: Trajectory Attribution in Offline RL. First, we encode trajectories in offline data using sequence encoders and then cluster the trajectories using these encodings. Also, we generate a single embedding for the data. Next, we train explanation policies on variants of the original dataset and compute corresponding data embeddings. Finally, we attribute decisions of RL agents trained on entire data to trajectory clusters using action and data embedding distances.
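The first step in the overview above, trajectory encoding, reduces each trajectory to a single vector by averaging the sequence encoder's output tokens. The sketch below illustrates only that pooling step. It is a minimal example under stated assumptions: `toy_encoder` is a hypothetical identity stand-in for a real sequence encoder (such as a decision transformer), and tokens are represented as plain lists of floats.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(output_tokens: Sequence[Vector]) -> Vector:
    """Average the encoder's output tokens into one trajectory embedding."""
    if not output_tokens:
        raise ValueError("need at least one output token")
    dim = len(output_tokens[0])
    count = len(output_tokens)
    return [sum(tok[d] for tok in output_tokens) / count for d in range(dim)]


def encode_trajectories(
    trajectories: Sequence[Sequence[Vector]],
    encoder: Callable[[Sequence[Vector]], Sequence[Vector]],
) -> List[Vector]:
    """Return one embedding per trajectory.

    Each trajectory is a flat token list [o_1, a_1, r_1, ..., o_T, a_T, r_T];
    the encoder maps it to an equally long list of output tokens.
    """
    return [mean_pool(encoder(tokens)) for tokens in trajectories]


def toy_encoder(tokens: Sequence[Vector]) -> List[Vector]:
    """Hypothetical stand-in encoder: returns the input tokens unchanged."""
    return [list(tok) for tok in tokens]


trajs = [
    # T = 1, so 3 tokens (observation, action, reward)
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    # T = 2, so 6 tokens
    [[0.0, 0.0], [3.0, 3.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
embeddings = encode_trajectories(trajs, toy_encoder)
```

Because the pooling divides by the number of tokens, trajectories of different lengths map to embeddings of the same dimension, which is what makes them directly comparable in the clustering step.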

Algorithm 2: clusterTrajectories
  /* Clustering the trajectories using their embeddings */
  Input: Trajectory embeddings $T = \{t_i\}$, clusteringAlgo
  $C \leftarrow$ clusteringAlgo($T$)  // Cluster using provided clustering algorithm
  Output: Return trajectory clusters $C = \{c_i\}_{i=1}^{n_c}$
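To make the clustering interface above concrete, the following sketch groups embeddings with a plain K-means (Lloyd) loop. Note the substitution: this is not X-Means, which additionally selects the number of clusters automatically, so here `k` must be supplied by hand. The function names and the deterministic farthest-point initialisation are illustrative assumptions.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points: Sequence[Vector], k: int, max_iters: int = 100) -> List[int]:
    """Assign each point to one of k clusters using Lloyd's algorithm."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(farthest))

    labels: List[int] = []
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_by_cluster(labels: Sequence[int]) -> Dict[int, List[int]]:
    """Map each cluster id to the indices of the trajectories it contains."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    return clusters


# Toy trajectory embeddings forming two well-separated groups.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.2]]
labels = kmeans_labels(toy_embeddings, k=2)
clusters = group_by_cluster(labels)
```

The resulting `clusters` dictionary has the same shape as the output $C$ of Algorithm 2: one entry per cluster, listing the trajectories assigned to it.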

Algorithm 3: generateDataEmbedding
/* Generating data embedding for a given set of trajectories */
Input: Trajectory embeddings T = {t_i}, normalizing factor M, softmax temperature T_soft
s ← (Σ_i t_i) / M    // Sum the trajectory embeddings and normalize them
d ← {d_j | d_j = exp(s_j / T_soft) / Σ_k exp(s_k / T_soft)}    // Take softmax along feature dimension
Output: Return the data embedding d
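The data-embedding computation above (sum, normalize by M, then a temperature softmax over the feature dimension) can be sketched in a few lines of Python. This is an illustrative sketch only; the function and argument names are chosen here, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the softmax output.

```python
import math
from typing import List, Sequence


def generate_data_embedding(
    traj_embeddings: Sequence[Sequence[float]],
    normalizing_factor: float,
    softmax_temperature: float,
) -> List[float]:
    """Return a single embedding (a probability vector) for a set of trajectories."""
    dim = len(traj_embeddings[0])
    # s <- (sum_i t_i) / M, computed per feature dimension.
    s = [sum(t[j] for t in traj_embeddings) / normalizing_factor for j in range(dim)]
    # Softmax along the feature dimension with temperature T_soft.
    scaled = [value / softmax_temperature for value in s]
    shift = max(scaled)  # stability shift; cancels out in the ratio
    exps = [math.exp(value - shift) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with made-up embeddings: s = [1.0, 0.0].
d = generate_data_embedding([[1.0, 0.0], [1.0, 0.0]], 2.0, 1.0)
```

Because the output is a softmax, every data embedding is a non-negative vector summing to one, which is what allows a distance between distributions to be used when comparing data embeddings.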

Algorithm 4: trainExpPolicies
/* Train explanation policies & compute related data embeddings */
Input: Offline data {τ_i}, traj. embeddings T, traj. clusters C, offlineRLAlgo
for c_j in C do
    {τ_i}_j ← {τ_i} − c_j    // Compute complementary dataset corresp. to c_j
    T_j ← gatherTrajectoryEmbeddings(T, {τ_i}_j)    // Gather corresp. τ embeds
    Explanation policy, π_j ← offlineRLAlgo({τ_i}_j)
    Complementary data embedding, d_j ← generateDataEmbedding(T_j, M, T_soft)
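The leave-one-cluster-out loop of this algorithm can be sketched as follows. This is a hedged illustration rather than the authors' code: clusters are assumed to be stored as lists of trajectory indices, and `stub_rl_algo` and `stub_embed` are hypothetical placeholders for the offline RL algorithm and for generateDataEmbedding with M and T_soft already fixed.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Embedding = Sequence[float]


def train_exp_policies(
    offline_data: Sequence[Any],
    traj_embeddings: Sequence[Embedding],
    clusters: Dict[int, List[int]],
    offline_rl_algo: Callable[[List[Any]], Any],
    embed_data: Callable[[List[Embedding]], List[float]],
) -> Tuple[Dict[int, Any], Dict[int, List[float]]]:
    """For each cluster, train on the data without it and embed that complement."""
    policies: Dict[int, Any] = {}
    comp_embeddings: Dict[int, List[float]] = {}
    for cluster_id, members in clusters.items():
        excluded = set(members)
        keep = [i for i in range(len(offline_data)) if i not in excluded]
        comp_data = [offline_data[i] for i in keep]       # {tau_i} minus c_j
        comp_embeds = [traj_embeddings[i] for i in keep]  # matching embeddings T_j
        policies[cluster_id] = offline_rl_algo(comp_data)
        comp_embeddings[cluster_id] = embed_data(comp_embeds)
    return policies, comp_embeddings


def stub_rl_algo(data: List[Any]) -> Any:
    """Hypothetical placeholder: the 'policy' just records its training data."""
    return tuple(data)


def stub_embed(embeds: List[Embedding]) -> List[float]:
    """Hypothetical placeholder: mean embedding instead of the softmax embedding."""
    dim = len(embeds[0])
    return [sum(e[j] for e in embeds) / len(embeds) for j in range(dim)]


policies, comp = train_exp_policies(
    ["tau0", "tau1", "tau2", "tau3"],
    [[1.0], [2.0], [3.0], [4.0]],
    {0: [0, 1], 1: [2, 3]},
    stub_rl_algo,
    stub_embed,
)
```

One explanation policy is trained per cluster, so the cost of this step grows linearly with the number of clusters.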

Figure 2: Grid-world Trajectory Attribution. The RL agent suggests taking action 'right' in grid cell (1,1). This action is attributed to trajectories (i), (ii) and (iii). (We denote a grid-world trajectory by annotated ∧, ∨, >, < arrows for the 'up', 'down', 'right', 'left' actions respectively, along with the 0-indexed time-step associated with each action.) We observe that RL decisions can be influenced by trajectories distant from the state under consideration, so attributing decisions to trajectories is important for understanding the decision better.

Figure 3: Seaquest Trajectory Attribution. The agent (submarine) decides to take 'left' for the given observation under the provided context. Top-3 attributed trajectories are shown on the right (for each training data traj., we show 6 sampled observations and the corresponding actions). As depicted in the attributed trajectories, the action 'left' is explained in terms of the agent aligning itself to face the enemies coming from the left end of the frame.

Figure 4: Column (a): An example of the human study experiment where users are required to identify the attributed trajectories that best explain the state-action behavior of the agent. Column (b): Results from our human study experiments show a decent alignment of human knowledge of navigation task with actual factors influencing RL decision-making. This underlines the utility as well as the scope of the proposed trajectory attribution explanation method.

Figure 6: PCA Plot depicting Clusters of Trajectory Embeddings for a) Grid-world, b) Seaquest, and c) HalfCheetah. We find that these clusters represent semantically meaningful high-level behaviours.

Figure 7: Cluster Behaviours for Grid-world. The figure shows 3 example high-level behaviours along with the action description and id of the cluster representing such behaviour.

[Figure panel: Fighting with Head Out (Cluster 7)]

Figure 13: Trajectory Attribution for DiscreteBCQ-trained Breakout Agent. The agent proposes taking 'RIGHT' in the given observation frame. The corresponding attribution result shows how the ball coming from left would be played if moved to right.

Figure 14: Breakout Trajectory Clusters. The figure shows a PCA plot of Breakout trajectories grouped into 11 clusters. Our method identifies cluster 2 ('corner shot from the right') and cluster 3 ('depletion of life') as the most important high-level behaviours in the data for learning.

E(V(s_0)) | E(|ΔQ_{π_orig}|) | E(1(π_orig(s) ≠ π_j(s))) | W_dist(d, d_j) | P(c_final = c_j)
orig 1.



Input: Offline data {τ_i}, sequence encoder E
Initialize: Initialize array T to collect the trajectory embeddings
for τ_j in {τ_i} do
    /* Using E, get output tokens for all the o, a & r in τ_j */

Actions suggested by explanation policies, a_j ← π_j(s)
d_{a_orig, a_j} ← calcActionDistance(a_orig, a_j)    // Compute action distance
K ← argmax(d_{a_orig, a_j})    // Get candidate clusters using argmax
w_k ← W_dist(d_orig, d_k)    // Compute Wasserstein distance b/w complementary data embeddings of candidate clusters & orig data embedding
c_final ← argmin(w_k)    // Choose cluster with min data embedding dist.
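The final attribution step can be sketched in Python as below. Two points are assumptions made for this illustration, not details confirmed by the text: the candidate set K is taken to be all clusters tied at the maximum action distance, and the Wasserstein distance is computed by treating each data embedding as a distribution over unit-spaced feature indices (for which the 1-Wasserstein distance equals the sum of absolute CDF differences). The 0/1 action distance and all toy values are likewise hypothetical.

```python
from typing import Any, Callable, Dict, Sequence


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """W1 between two distributions on the same unit-spaced support."""
    total, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i   # running difference of the two CDFs
        total += abs(cdf_gap)
    return total


def attribute_cluster(
    a_orig: Any,
    explanation_actions: Dict[int, Any],
    d_orig: Sequence[float],
    comp_embeddings: Dict[int, Sequence[float]],
    calc_action_distance: Callable[[Any, Any], float],
) -> int:
    """Pick the cluster whose removal changes the action most, then the data least."""
    action_dist = {
        j: calc_action_distance(a_orig, a_j) for j, a_j in explanation_actions.items()
    }
    max_dist = max(action_dist.values())
    candidates = [j for j, dist in action_dist.items() if dist == max_dist]  # K
    w = {k: wasserstein_1d(d_orig, comp_embeddings[k]) for k in candidates}
    return min(w, key=w.get)  # c_final


# Toy example: clusters 1 and 2 change the action; cluster 2 is closer in data space.
c_final = attribute_cluster(
    "right",
    {0: "right", 1: "left", 2: "up"},
    [0.5, 0.5],
    {0: [0.5, 0.5], 1: [0.9, 0.1], 2: [0.6, 0.4]},
    lambda a, b: 0.0 if a == b else 1.0,
)
```

Cluster 0 is never a candidate here because its explanation policy suggests the same action as the original policy, even though its data embedding is the closest.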

Quantitative Analysis of Seaquest Trajectory Attribution. The analysis is provided using 5 metrics. The higher E(V(s_0)), the better the trained policy. A high E(|ΔQ_{π_orig}|) along with a high E(1(π_orig(s) ≠ π_j(s))) is desirable. Policies with lower W_dist(d, d_j) and high action contrast are given priority during attribution. The cluster attribution distribution is given in the final column.
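To make the action-contrast metric E(1(π_orig(s) ≠ π_j(s))) concrete, here is a minimal Python sketch. It assumes the expectation is estimated as an average over a finite sample of states; the toy policies and state set are hypothetical.

```python
from typing import Any, Callable, Sequence


def action_contrast(
    states: Sequence[Any],
    pi_orig: Callable[[Any], Any],
    pi_j: Callable[[Any], Any],
) -> float:
    """Fraction of sampled states where the two policies pick different actions."""
    differing = sum(1 for s in states if pi_orig(s) != pi_j(s))
    return differing / len(states)


# Toy example: the policies disagree on the odd-numbered states only.
contrast = action_contrast([0, 1, 2, 3], lambda s: s % 2, lambda s: 0)
```

A value near 1 means the explanation policy behaves very differently from the original policy on the sampled states, and a value of 0 means the two agree everywhere.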

Feiyun Zhu, Jun Guo, Zheng Xu, Peng Liao, Liu Yang, and Junzhou Huang. Group-driven reinforcement learning for personalized mhealth intervention. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 590-598. Springer, 2018.

Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888, 2020.

Analysis of Trajectory Attribution for DiscreteBCQ-trained Seaquest Agent.
π | E(V(s_0)) | E(|ΔQ_{π_orig}|) | E(1(π_orig(s) ≠ π_j(s))) | W_dist(d, d_j) | P(c_final = c_j)

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their helpful feedback, which made this work better. Moreover, NJ acknowledges funding support from NSF IIS-2112471 and NSF CAREER IIS-2141781. Finally, we wish to dedicate this work to the memory of our dear colleague Georgios Theocharous, who is not with us anymore. While his premature demise has left an unfillable void, his work has made an indelible mark in the domain of reinforcement learning and in the lives of many researchers. He will forever remain in our memories.

A APPENDIX

A.1 OVERVIEW OF PROPOSED TRAJECTORY ATTRIBUTION

The following is an overview of our proposed 5-step trajectory attribution algorithm.

The aim of the agent is to reach any of the goal states (green squares) while avoiding lava (red square) and going around the impenetrable walls (grey squares). The reward for reaching the goal is +1; if the agent falls into the lava, the reward is -1. For any other transition, the agent receives -0.1. The agent can take one of four actions: up, down, left or right.
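The grid-world reward structure described above can be written down as a short Python sketch. Only the reward values and the four actions come from the text; the (row, column) indexing, the grid dimensions passed as parameters, the example cell positions, and the choice that the agent stays in place when it moves into a wall or off the grid are assumptions made for this illustration.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (row, column); this indexing convention is an assumption

ACTIONS: Dict[str, Cell] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def step(cell: Cell, action: str, walls: Set[Cell], n_rows: int, n_cols: int) -> Cell:
    """Move one cell; assumed to stay put when blocked by a wall or the grid edge."""
    d_row, d_col = ACTIONS[action]
    nxt = (cell[0] + d_row, cell[1] + d_col)
    in_bounds = 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols
    return nxt if in_bounds and nxt not in walls else cell


def reward(next_cell: Cell, goals: Set[Cell], lava: Set[Cell]) -> float:
    """+1 for reaching a goal, -1 for falling into lava, -0.1 for any other transition."""
    if next_cell in goals:
        return 1.0
    if next_cell in lava:
        return -1.0
    return -0.1


# Hypothetical layout used only to exercise the functions.
goals, lava, walls = {(0, 3)}, {(2, 2)}, {(1, 2)}
blocked = step((1, 1), "right", walls, 4, 4)  # wall at (1, 2)
moved = step((1, 1), "down", walls, 4, 4)
```

The per-step penalty of -0.1 makes shorter paths to a goal more rewarding than longer ones.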

A.3 ADDITIONAL TRAINING DETAILS

1. Seaquest Atari Environment - We employed Discrete SAC to train the original policy along with the explanation policies, training until performance saturated. We used a critic learning rate of 3 × 10^-4 and an actor learning rate of 3 × 10^-4 with a batch size of 256. The training runs were performed in parallel on a single Nvidia A100 GPU.

2. HalfCheetah MuJoCo Environment - We used SAC to train the original policy as well as the explanation policies, training the agents until performance saturated. We again used a critic learning rate of 3 × 10^-4 and an actor learning rate of 3 × 10^-4, with a batch size of 512. The policy training runs were performed in parallel on a single Nvidia A100 GPU.

A.5 HALFCHEETAH TRAJECTORY ATTRIBUTION RESULTS

Due to space constraints, we present the qualitative and quantitative results for the HalfCheetah environment here.

Table 3: Quantitative Analysis of HalfCheetah Trajectory Attribution. The analysis is provided using 5 metrics. The higher E(V(s_0)), the better the trained policy. A high E(|ΔQ_{π_orig}|) along with a high E((π_orig(s) - π_j(s))^2) is desirable. Policies with lower W_dist(d, d_j) and high action contrast are given priority during attribution. The cluster attribution distribution is given in the final column.

