CONTRASTIVE EXPLANATIONS FOR REINFORCEMENT LEARNING VIA EMBEDDED SELF PREDICTIONS

Abstract

We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns these properties in terms of human-provided features. Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases with a large number of features, we develop a novel method for computing minimal sufficient explanations from an ESP model. Our case studies in three domains, including a complex strategy game, show that ESP models can be effectively learned and support insightful explanations.

1. INTRODUCTION

A traditional RL agent can explain a preference for action a over action b only by revealing the actions' predicted values, which provides little insight into its reasoning. A human, by contrast, might explain such a preference by contrasting meaningful properties of the predicted futures following each action. In this work, we develop a model that allows RL agents to explain action preferences by contrasting human-understandable future predictions. Our approach makes the future predictions by learning deep generalized value functions (GVFs) (Sutton et al., 2011), which can predict the future accumulation of arbitrary features when following a policy. Thus, given human-understandable features, the corresponding GVFs capture meaningful properties of a policy's future trajectories.

To soundly explain action preferences via GVFs, it is important that the agent actually uses the GVFs to form its preferences. To this end, our first contribution is the embedded self-prediction (ESP) model, which: 1) directly "embeds" meaningful GVFs into the agent's action-value function, and 2) trains those GVFs to be "self-predicting" of the greedy policy that maximizes the agent's Q-function. This enables meaningful and sound contrastive explanations in terms of GVFs. However, the ESP model is circularly defined (the policy depends on the GVFs and vice versa), which suggests training may be difficult. Our second contribution is the ESP-DQN learning algorithm, for which we provide theoretical convergence conditions in the table-based setting and demonstrate empirical effectiveness. Because ESP models combine embedded GVFs non-linearly, comparing the contributions of GVFs to preferences can be difficult. Our third contribution is a novel application of the integrated gradient (IG) (Sundararajan et al., 2017) for producing explanations that are sound in a well-defined sense.
To further support cases with many features, we use the notion of minimal sufficient explanation (Juozapaitis et al., 2019), which can significantly simplify explanations while remaining sound. Our fourth contribution is case studies in two RL benchmarks and a complex real-time strategy game. These demonstrate insights provided by the explanations, including both validating and finding flaws in the reasons for preferences.

In Defense of Manually-Designed Features. It can be controversial to provide deep learning algorithms with engineered meaningful features. The key question is whether the utility of providing such features is worth the cost of their acquisition. We argue that for many applications that can benefit from informative explanations, the utility will outweigh the cost. Without meaningful features, explanations must be expressed as visualizations on top of lower-level perceptual information (e.g. saliency/attention maps). Such explanations have utility, but they may not adequately relate to human-understandable concepts, require subjective interpretation, and can offer limited insight. Further, in many applications, meaningful features already exist and/or the level of effort to acquire them from domain experts and AI engineers is reasonable. It is thus important to develop deep learning methods, such as our ESP model, that can deliver enhanced explainability when such features are available.
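As a rough illustration of the integrated-gradient attribution mentioned above, the sketch below decomposes a preference C(x) - C(baseline) over the components of a GVF vector. The function names, finite-difference gradients, and midpoint-rule integration are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def integrated_gradients(C, x, baseline, steps=64, eps=1e-5):
    """Attribute C(x) - C(baseline) to the components of x via the
    integrated gradient (Sundararajan et al., 2017), approximating the
    path integral with the midpoint rule and gradients with central
    finite differences. In the ESP setting, x and baseline would be the
    GVF vectors of the two compared actions, so the attributions
    decompose the preference over the human-provided features."""
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    total_grad = np.zeros_like(x)
    for a in alphas:
        p = baseline + a * (x - baseline)  # point on the straight path
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            total_grad[i] += (C(p + e) - C(p - e)) / (2 * eps)
    return (x - baseline) * total_grad / steps
```

By the completeness property of IG, the attributions sum to C(x) - C(baseline), which is what makes the resulting explanation sound with respect to the actual preference.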

2. EMBEDDED SELF-PREDICTION MODEL

An MDP is a tuple $\langle S, A, T, R \rangle$, with states $S$, actions $A$, transition function $T(s, a, s')$, and reward function $R(s, a)$. A policy $\pi$ maps states to actions and has Q-function $Q^\pi(s, a)$ giving the expected infinite-horizon $\beta$-discounted reward of following $\pi$ after taking action $a$ in $s$. The optimal policy $\pi^*$ and Q-function $Q^*$ satisfy $\pi^*(s) = \arg\max_a Q^*(s, a)$. $Q^*$ can be computed given the MDP by repeated application of the Bellman backup operator, which for any Q-function $Q$ returns a new Q-function $B[Q](s, a) = R(s, a) + \beta \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$. We focus on RL agents that learn an approximation $\hat{Q}$ of $Q^*$ and follow the corresponding greedy policy $\pi(s) = \arg\max_a \hat{Q}(s, a)$. We aim to explain a preference for action $a$ over $b$ in a state $s$, i.e. explain why $\hat{Q}(s, a) > \hat{Q}(s, b)$. Importantly, the explanations should be meaningful to humans and soundly reflect the actual agent preferences. Below, we define the embedded self-prediction model, which will be used for producing such explanations (Section 4) in terms of generalized value functions.

Generalized Value Functions (GVFs). GVFs (Sutton et al., 2011) are a generalization of traditional value functions that accumulate arbitrary feature functions rather than reward functions. Specifically, given a policy $\pi$, an $n$-dimensional state-action feature function $F(s, a) = \langle f_1(s, a), \ldots, f_n(s, a) \rangle$, and a discount factor $\gamma$, the corresponding $n$-dimensional GVF, denoted $Q^\pi_F(s, a)$, is the expected infinite-horizon $\gamma$-discounted accumulation of $F$ when following $\pi$ after taking $a$ in $s$. Given an MDP, policy $\pi$, and feature function $F$, the GVF can be computed by iterating the Bellman GVF operator, which takes a GVF $Q_F$ and returns a new GVF $B^\pi_F[Q_F](s, a) = F(s, a) + \gamma \sum_{s'} T(s, a, s') Q_F(s', \pi(s'))$.
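As a concrete tabular sketch, the Bellman GVF operator above can be iterated to (approximately) its fixed point. The array shapes and names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gvf_iteration(T, F, pi, gamma, iters=500):
    """Compute the GVF Q_F^pi for a fixed policy pi by repeatedly
    applying the Bellman GVF operator B_F^pi on a tabular MDP.

    T:  (S, A, S) transition probabilities T(s, a, s')
    F:  (S, A, n) feature values f_1(s, a), ..., f_n(s, a)
    pi: (S,) deterministic policy; pi[s] is the action chosen in s
    Returns Q_F of shape (S, A, n).
    """
    S, A, n = F.shape
    Q_F = np.zeros((S, A, n))
    for _ in range(iters):
        # Q_F(s', pi(s')) for every successor state s'
        next_vals = Q_F[np.arange(S), pi]  # shape (S, n)
        # B_F^pi[Q_F](s, a) = F(s, a) + gamma * sum_s' T(s, a, s') Q_F(s', pi(s'))
        Q_F = F + gamma * np.einsum('ijk,kn->ijn', T, next_vals)
    return Q_F
```

Since $B^\pi_F$ is a $\gamma$-contraction componentwise, this iteration converges geometrically; with a scalar feature that is identically 1 and $\gamma = 0.5$, for instance, every entry converges to $1/(1-\gamma) = 2$.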
To produce human-understandable explanations, we assume semantically-meaningful features are available, so that the corresponding GVFs describe meaningful properties of the expected future, e.g. expected energy usage, time spent in a particular spatial region, or future change in altitude.

ESP Model Definition. Given policy $\pi$ and features $F$, we can contrast actions $a$ and $b$ via the GVF difference $\Delta^\pi_F(s, a, b) = Q^\pi_F(s, a) - Q^\pi_F(s, b)$, which may highlight meaningful differences in how the actions impact the future. Such differences, however, cannot necessarily be used to soundly explain an agent preference, since the agent may not explicitly consider those GVFs for action selection. Thus, the ESP model forces agents to directly define action values, and hence preferences, in terms of GVFs of their own policies, which allows such differences to be used soundly. As depicted in Figure 1, the ESP model embeds a GVF $Q^\pi_F$ of the agent's greedy policy $\pi$ into the agent's Q-function $\hat{Q}$, via $\hat{Q}(s, a) = \hat{C}(Q^\pi_F(s, a))$, where $\hat{C} : \mathbb{R}^n \to \mathbb{R}$ is a learned combining function from GVF vectors to action values. When the GVF discount factor $\gamma$ is zero, the ESP model becomes a direct combination of the features, i.e. $\hat{Q}(s, a) = \hat{C}(F(s, a))$, which is the traditional approach to using features for function approximation. By using $\gamma > 0$ we can leverage human-



Figure 1: The ESP model provides an estimate of the agent's Q-function for any state-action pair. The model first maps a state-action pair $(s, a)$ to a GVF vector $Q^\pi_F(s, a)$ of the agent's greedy policy $\pi(s) = \arg\max_a \hat{Q}(s, a)$. This vector is then processed by the combining function $\hat{C}$, which produces a Q-value estimate $\hat{Q}(s, a)$. The embedded GVF is self-predicting in the sense that it predicts values of the very greedy policy that it is being used to compute.
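A minimal sketch of the forward pass the figure describes, with a hand-rolled two-layer network as a stand-in for the learned combining function Ĉ (the weights here are random placeholders, not parameters learned by ESP-DQN), plus the GVF-difference contrast used for explanations:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ESPHead:
    """Sketch of the ESP head: a combining function C_hat mapping an
    n-dimensional GVF vector to a scalar Q-value estimate. Weights are
    random stand-ins for learned parameters."""

    def __init__(self, n, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(hidden, n)) / np.sqrt(n)
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(size=hidden) / np.sqrt(hidden)

    def q_value(self, gvf_vec):
        # Q_hat(s, a) = C_hat(Q_F(s, a)): nonlinear combination of GVFs
        h = relu(self.W1 @ np.asarray(gvf_vec, float) + self.b1)
        return float(self.w2 @ h)

def gvf_difference(gvf_a, gvf_b):
    # Delta_F(s, a, b): per-feature contrast between two actions' GVF vectors
    return np.asarray(gvf_a, float) - np.asarray(gvf_b, float)
```

Because Ĉ is nonlinear, the per-feature difference alone does not decompose the Q-value gap, which is why the integrated-gradient attribution is needed on top of this contrast.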

