EXPLAINING BY IMITATING: UNDERSTANDING DECISIONS BY INTERPRETABLE POLICY LEARNING

Abstract

Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker's policy is challenging: there is no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire a data-driven representation of decision-making behavior that (1) inheres transparency by design, (2) accommodates partial observability, and (3) operates completely offline. To satisfy these key criteria, we propose a novel model-based Bayesian method for interpretable policy learning ("INTERPOLE") that jointly estimates an agent's (possibly biased) belief-update process together with their (possibly suboptimal) belief-action mapping. Through experiments on both simulated and real-world data for the problem of Alzheimer's disease diagnosis, we illustrate the potential of our approach as an investigative device for auditing, quantifying, and understanding human decision-making behavior.

1. INTRODUCTION

A principal challenge in modeling human behavior is in obtaining a transparent understanding of decision-making. In medical diagnosis, for instance, there is often significant regional and institutional variation in clinical practice [1], much of it a leading cause of rising healthcare costs [2]. The ability to quantify different decision processes is the first step towards a more systematic understanding of medical practice. Purely by observing demonstrated behavior, our principal objective is to answer the question: Under any given state of affairs, what actions are (more/less) likely to be taken, and why?

We address this challenge by setting our sights on three key criteria. First, we desire a method that is transparent by design. Specifically, a transparent description of behavior should locate the factors that contribute to individual decisions, in a language readily understood by domain experts [3, 4]. This will be clearer per our subsequent formalism, but we can already note some contrasts: Classical imitation learning (popularly by reduction to supervised classification) does not fit the bill, since the black-box hidden states of RNNs are rarely amenable to meaningful interpretation. Similarly, apprenticeship learning algorithms (popularly through inverse reinforcement learning) do not satisfy it either, since the high-level nature of reward mappings is not informative as to individual actions observed in the data. Rather than focusing purely on replicating actions (imitation learning) or on matching expert performance (apprenticeship learning), our chief pursuit lies in understanding demonstrated behavior. Second, real-world environments such as healthcare are often partially observable in nature. This requires modeling the accumulation of information from entire sequences of past observations, an endeavor that is prima facie at odds with the goal of transparency.
For instance, in a fully-observable setting, (model-free) behavioral cloning is arguably 'transparent' in providing simple mappings of states to actions; however, coping with partial observability using any form of recurrent function approximation immediately lands in black-box territory. Likewise, while (model-based) methods have been developed for robotic control, their transparency crucially hinges on fully-observable kinematics. Finally, in realistic settings it is often impossible to experiment online, especially in high-stakes environments with real products and patients. The vast majority of recent work in (inverse) reinforcement learning has focused on games, simulations, and gym environments where access to live interaction is unrestricted. By contrast, in healthcare settings the environment dynamics are neither known a priori, nor estimable by repeated exploration. We want a data-driven representation of behavior that is learnable in a completely offline fashion, yet does not rely on knowing/modeling any true dynamics.

Table 1: Typical incarnations of related approaches. Our method satisfies (1) transparency by design, (2) partial observability, and (3) offline learning, and makes no assumptions w.r.t. unbiasedness of beliefs or optimality of policies. Observations, beliefs, (optimal) q-values, actions, and policies are denoted z, b, q*, a, and π; bold denotes learned quantities, italics are known (or queryable), and "†" denotes jointly-learned quantities.

Contributions. Our contributions are three-fold. First, we propose a model for interpretable policy learning ("INTERPOLE") in which sequential observations are aggregated through a decision agent's decision dynamics (viz. subjective belief-update process), and sequential actions are determined by the agent's decision boundaries (viz. probabilistic belief-action mapping). Second, we suggest a Bayesian learning algorithm for estimating the model, simultaneously satisfying the key criteria of transparency, partial observability, and offline learning.
Third, through experiments on both simulated and real-world data for Alzheimer's disease diagnosis, we illustrate the potential of our method as an investigative device for auditing, quantifying, and understanding human decision-making behavior.

2. RELATED WORK

We seek to learn an interpretable parameterization of observed behavior in order to understand an agent's actions. Fundamentally, this contrasts with imitation learning (which seeks to best replicate demonstrated policies) and apprenticeship learning (which seeks to match some notion of performance).

Imitation Learning. In fully-observable settings, behavior cloning (BC) readily reduces the imitation problem to one of supervised classification [5, 11-13]; i.e. actions are simply regressed on observations. While this can be extended to account for partial observability by parameterizing policies via recurrent function approximation [14], it immediately gives up ease of interpretability per the black-box nature of RNN hidden states. A plethora of model-free techniques have recently been developed that account for information in the rollout dynamics of the environment during policy learning (see e.g. [15-20]), most famously generative adversarial imitation learning (GAIL), based on state-distribution matching [6, 21]. However, such methods require repeated online rollouts of intermediate policies during training, and also face the same black-box problem as BC in partially observable settings. Clearly, in model-free imitation it is difficult to admit both transparency and partial observability. Specifically with an eye on explainability, Info-GAIL [22, 23] proposes an orthogonal notion of "interpretability" that hinges on clustering similar demonstrations to explain variations in behavior. However, as with GAIL, it suffers from the need for live interaction during learning. Finally, several model-based techniques for imitation learning (MB-IL) have been studied in the domain of robotics: [24] consider kinematic models designed for robot dynamics, while [25] and [7] consider (non-)linear autoregressive exogenous models.
However, such approaches invariably operate in fully-observable settings, and are restricted to models hand-crafted for the specific robotic applications under consideration.

Apprenticeship Learning. In subtle distinction to imitation learning, methods in apprenticeship learning assume the observed behavior is optimal with respect to some underlying reward function. Apprenticeship thus proceeds indirectly, often through inverse reinforcement learning (IRL), in order to infer a reward function which (with appropriate optimization) generates learned behavior that matches the performance of the original, as measured by the rewards (see e.g. [8, 26-29]). These approaches have been variously extended to cope with partial observability (PO-IRL) [9, 30], to offline settings through off-policy evaluation [31-33], as well as to learned environment models [10]. However, a shortcoming of such methods is the requirement that the demonstrated policy in fact be optimal with respect to a true reward function that lies within an (often limited) hypothesis class under consideration, or is otherwise black-box in nature. Further, learning the true environment dynamics [10] corresponds to the requirement that policies be restricted to the class of functions that map from unbiased beliefs (cf. exact inference) into actions. Notably, [34] considers both a form of suboptimality caused by time-inconsistent agents as well as biased beliefs. However, perhaps most importantly, due to the indirect, task-level nature of reward functions, inverse reinforcement learning is essentially opposed to our central goal of transparency, that is, of providing direct, action-level descriptions of behavior. In Section 5, we provide empirical evidence of this notion of interpretability.

Towards INTERPOLE. In contrast, we avoid making any assumptions as to either unbiasedness of beliefs or optimality of policies.
After all, the former requires estimating (externally) "true" environment dynamics, and the latter requires specifying (objectively) "true" classes of reward functions, neither of which is necessary per our goal of transparently describing individual actions. Instead, INTERPOLE simply seeks the most plausible explanation in terms of (internal) decision dynamics and (subjective) decision boundaries. To the best of our knowledge, our work is the first to tackle all three key criteria, while making no assumptions on the generative process behind behaviors. Table 1 contextualizes our work, showing typical incarnations of related approaches and their graphical models.

Before continuing, we note that the separation between the internal dynamics of an agent and the external dynamics of the environment has been considered in several other works, though often for entirely different problem formulations. Most notably, [35] tackles the same policy-learning problem as we do in online, fully-observable environments, but for agents with internal states that cannot be observed; they propose agent Markov models (AMMs) to model such environment-agent interactions. For problems other than policy learning, [36-38] also consider the subproblem of inferring an agent's internal dynamics; however, none of these works satisfy all three key criteria simultaneously as we do.

3. INTERPRETABLE POLICY LEARNING

We first introduce INTERPOLE's model of behavior, formalizing notions of decision dynamics and decision boundaries. In the next section, we suggest a Bayesian algorithm for model-learning from data.

Problem Setup. Consider a partially-observable decision-making environment in discrete time. At each step t, the agent takes action a_t ∈ A and observes outcome z_t ∈ Z. We have at our disposal an observed dataset of demonstrations D = {(a^i_1, z^i_1, ..., a^i_{τ_i}, z^i_{τ_i})}^n_{i=1}. Let h_t := (a_1, z_1, ..., a_{t-1}, z_{t-1}) denote an observed history at the beginning of step t, let H_t denote the set of all such histories, and let H := ∪^∞_{t=1} H_t. A proper policy π is a mapping π ∈ ∆(A)^H from observed histories to action distributions, where π(a|h) is the probability of taking action a given h. We assume that D is generated by an agent acting according to some behavioral policy π_b. The problem we wish to tackle, then, is precisely how to obtain an interpretable parameterization of π_b. We proceed in two steps: First, we describe a parsimonious belief-update process for accumulating histories, which we term decision dynamics. Then, we take beliefs to actions via a probabilistic mapping, which gives rise to decision boundaries.

Decision Dynamics

We model belief-updates by way of an input-output hidden Markov model (IOHMM) identified by the tuple (S, A, Z, T, O, b_1), with S being the finite set of underlying states. T ∈ ∆(S)^{S×A} denotes the transition function, such that T(s_{t+1}|s_t, a_t) gives the probability of transitioning into state s_{t+1} upon taking action a_t in state s_t, and O ∈ ∆(Z)^{A×S} denotes the observation function, such that O(z_t|a_t, s_{t+1}) gives the probability of observing z_t after taking action a_t and transitioning into state s_{t+1}. Finally, let beliefs b_t ∈ ∆(S) indicate the probability b_t(s) that the environment exists in any state s ∈ S at time t, and let b_1 give the initial state distribution. Note that, unlike in existing uses of the IOHMM formalism, these "probabilities" represent the thought process of the human, and may freely diverge from the actual mechanics of the world. To aggregate observed histories as beliefs, we identify b_t(s) with P(s_t = s|h_t), an interpretation that leads to the recursive belief-update process (where in our problem, the quantities T, O, b_1 are unknown):

b_{t+1}(s') ∝ Σ_{s∈S} b_t(s) T(s'|s, a_t) O(z_t|a_t, s')    (1)

A key distinction bears emphasis: We do not require that this latter set of quantities correspond to (external) environment dynamics, and we do not obligate ourselves to recover any such notion of "true" parameters. To do so would imply the assumption that the agent in fact performs exactly unbiased inference on a perfectly known model of the environment, which is restrictive. It is also unnecessary, since our mandate is simply to model the (internal) mechanics of decision-making, which could well be generated from possibly biased beliefs or imperfectly known models of the world. In other words, our objective (see Equation 3) of simultaneously determining the most likely beliefs (cf. decision dynamics) and policies (cf. decision boundaries) is fundamentally more parsimonious.
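As a concrete sketch, the belief-update recursion reduces to one normalized matrix computation per step. The following Python/NumPy snippet is our own illustrative rendering (the tensor layout and toy numbers are hypothetical, not values from this work):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One step of the recursive belief update:
    b'(s') ∝ sum_s b(s) * T(s'|s, a) * O(z|a, s').

    b: current belief over states, shape (S,)
    T: transition tensor, T[a, s, s'] = P(s'|s, a)
    O: observation tensor, O[a, s', z] = P(z|a, s')
    """
    b_next = (b @ T[a]) * O[a, :, z]   # predict forward, then weight by observation
    return b_next / b_next.sum()       # renormalize to a probability distribution

# Toy example (hypothetical numbers): two states, one action, two outcomes.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])           # T[0, s, s']
O = np.array([[[0.7, 0.3],
               [0.1, 0.9]]])           # O[0, s', z]
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, z=1, T=T, O=O)
```

Note that nothing in the update requires T and O to match the true environment; plugging in an agent's subjective (biased) parameters yields that agent's subjective belief trajectory.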
Decision Boundaries. Given decision dynamics, a policy is then equivalently a map π ∈ ∆(A)^{∆(S)}. Now, what is an interpretable parameterization? Consider the three-state example in Figure 1. We argue that a probabilistic parameterization that directly induces "decision regions" (cf. panel 1b) over the belief simplex is uniquely interpretable. For instance, strong beliefs that a patient has underlying mild cognitive impairment may map to the region where a specific follow-up test is promptly prescribed; this parameterization allows clearly locating such regions, as well as their boundaries. Precisely, we parameterize policies in terms of |A|-many "mean" vectors that correspond to actions:

π(a|b) = exp(-η‖b - μ_a‖²) / Σ_{a'∈A} exp(-η‖b - μ_{a'}‖²),   with Σ_{s∈S} μ_a(s) = 1    (2)

where η ≥ 0 is the inverse temperature, ‖·‖ the ℓ2-norm, and μ_a ∈ R^{|S|} the mean vector corresponding to action a ∈ A. Intuitively, mean vectors induce decision boundaries (and decision regions) over the belief space ∆(S): At any time, the action whose corresponding mean is closest to the current belief is most likely to be chosen. In particular, lines that are equidistant to the means of any pair of actions form decision boundaries between them. The inverse temperature controls the transitions across such boundaries: a larger η captures more deterministic behavior (i.e. more "abrupt" transitions), whereas a smaller η captures more stochastic behavior (i.e. "smoother" transitions). Note that the case of η = 0 recovers policies that are uniformly random, and η → ∞ recovers argmax policies. A second distinction is due: The exponentiated form of Equation 2 should not be confused with typical Boltzmann [27] or MaxEnt [39] policies common in RL: these are indirect parameterizations via optimal/soft q-values, which themselves require approximate solutions to optimization problems; as we shall see in our experiments, the quality of learned policies suffers as a result.
Further, using q-values would imply the assumption that the agent in fact behaves optimally w.r.t. an (objectively) "true" class of reward functions (e.g. linear), which is restrictive. It is also unnecessary, as our mandate is simply to capture their (subjective) tendencies toward different actions, which may well be generated from suboptimal policies. In contrast, by directly partitioning the belief simplex into probabilistic "decision regions", INTERPOLE's mean-vector representation can be immediately explained and understood.

Learning Objective. In a nutshell, our objective is to identify the most likely parameterizations T, O, b_1 for decision dynamics as well as η, {μ_a}_{a∈A} for decision boundaries, given the observed data:

Given: D, S, A, Z    Determine: T, O, b_1, η, {μ_a}_{a∈A}    (3)

Next, we propose a Bayesian algorithm that finds the maximum a posteriori (MAP) estimate of these quantities. Figure 2 illustrates the problem setup.
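The decision-boundary policy of Equation 2 can be sketched in a few lines. This is a minimal Python/NumPy illustration under our own assumptions (the mean vectors and η below are hypothetical toy values):

```python
import numpy as np

def policy(b, mus, eta):
    """pi(a|b) ∝ exp(-eta * ||b - mu_a||^2): the action whose mean vector
    lies closest to the current belief is most likely to be chosen.

    b: belief, shape (S,); mus: mean vectors, shape (A, S); eta >= 0.
    """
    logits = -eta * np.sum((b - mus) ** 2, axis=1)
    logits -= logits.max()                     # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Toy example: three states, two actions with means at opposite corners
# of the belief simplex.
mus = np.array([[1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]])
b = np.array([0.7, 0.2, 0.1])                  # belief near state 0
p = policy(b, mus, eta=5.0)                    # puts most mass on action 0
```

Consistent with the discussion above, setting eta=0 in this sketch returns the uniform policy, while large eta approaches the argmax policy.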

4. BAYESIAN INTERPRETABLE POLICY LEARNING

Denote with θ := (T, O, b_1, η, {μ_a}_{a∈A}) the set of parameters to be determined, and let θ be drawn from some prior P(θ). In addition, denote with D̃ = {(s^i_1, ..., s^i_{τ_i+1})}^n_{i=1} the set of underlying (unobserved) state trajectories, such that D ∪ D̃ gives the complete (fully-observed) dataset. Then the complete likelihood of the unknown parameters θ with respect to D ∪ D̃ is given by the following:

P(D, D̃|θ) = ∏^n_{i=1} ∏^{τ_i}_{t=1} π(a^i_t | b_t[T, O, b_1, h^i_{t-1}])    (action likelihoods)
          × ∏^n_{i=1} b_1(s^i_1) ∏^{τ_i}_{t=1} T(s^i_{t+1}|s^i_t, a^i_t) O(z^i_t|a^i_t, s^i_{t+1})    (observation likelihoods)    (4)

where π(·|·) is described by η and {μ_a}_{a∈A}, and each b_t[·] is a function of T, O, b_1, and h_{t-1} (Equation 1). Since we do not have access to D̃, we propose an expectation-maximization (EM)-like algorithm for maximizing the posterior P(θ|D) = P(D|θ)P(θ) / ∫ P(D|θ) dP(θ) over the parameters:

Algorithm 1: Bayesian INTERPOLE
 1: Parameters: learning rate w ∈ R+
 2: Input: dataset D = {h^i_{τ_i+1}}^n_{i=1}, prior P(θ)
 3: Sample θ̂_0 from P(θ)
 4: for k = 1, 2, ... do
 5:     Compute P(D̃|D, θ̂_{k-1})    (Appendix A.1)
 6:     Compute ∇_θ Q(θ; θ̂_{k-1}) at θ̂_{k-1}    (Appendix A.2)
 7:     θ̂_k ← θ̂_{k-1} + w[∇_θ Q(θ; θ̂_{k-1}) + ∇_θ log P(θ)]|_{θ=θ̂_{k-1}}
 8: while condition (6) holds
 9: θ̂ ← θ̂_{k-1}
10: Output: MAP estimate θ̂ := (T̂, Ô, b̂_1, η̂, {μ̂_a}_{a∈A})

Bayesian Learning. Given an initial estimate θ̂_0, we iteratively improve the estimate by performing the following steps at each iteration k:

• "E-step": Compute the expected log-likelihood of the model parameters θ given the previous parameter estimate θ̂_{k-1}, as follows:

Q(θ; θ̂_{k-1}) := E_{D̃|D,θ̂_{k-1}}[log P(D, D̃|θ)] = Σ_{D̃} log P(D, D̃|θ) P(D̃|D, θ̂_{k-1})    (5)

where we compute the necessary marginalizations of the joint distribution P(D̃|D, θ̂_{k-1}) by way of a forward-backward procedure (detailed procedure given in Appendix A.1).
• "M-step": Compute a new estimate θ̂_k that improves the expected log-posterior, that is, such that:

Q(θ̂_k; θ̂_{k-1}) + log P(θ̂_k) > Q(θ̂_{k-1}; θ̂_{k-1}) + log P(θ̂_{k-1})    (6)

subject to appropriate non-negativity and normalization constraints on parameters, which can be achieved via gradient-based methods (detailed procedure given in Appendix A.2). We stop when it is no longer possible to obtain a new estimate that further improves the expected log-posterior, that is, when the "M-step" cannot be performed. Algorithm 1 summarizes this learning procedure.

Explaining by Imitating. Recall our original mandate: to give the most plausible explanation for behavior. Two questions can be asked about our proposal to "explain by imitating", and we now have precise answers to both: one concerns explainability, and the other concerns directness of explanations. First, what constitutes the "most plausible explanation" of behavior? INTERPOLE identifies this as the most likely parameterization of that behavior using a state-based model for beliefs and policies, but otherwise with no further assumptions. In particular, we are only positing that modeling beliefs over states helps provide an interpretable description of how an agent reasons (for which we do have ample evidence), but we are not assuming that the environment itself takes the form of a state-based model (which is an entirely different claim). Mathematically, the complete likelihood (Equation 4) highlights the difference between decision dynamics (which help explain the agent's behavior) and "true" environment dynamics (which we do not care about). The latter are independent of the agent, and learning them would have involved just the observation likelihoods alone.
In contrast, by jointly estimating T, O, b_1 together with η, {μ_a}_{a∈A} according to both the observation- and action-likelihoods, we are learning the decision dynamics, which in general need not coincide with the environment's dynamics, but which offer the most plausible explanation of how the agent effectively reasons.

The second question is about directness: Given the popularity of the IRL paradigm, could we have simply used an (indirect) reward parameterization, instead of our (direct) mean-vector parameterization? As it turns out, in addition to the "immediate" interpretability of direct, action-level representations, the mean-vector parameterization comes with an extra perk w.r.t. computability: while it is (mathematically) possible to formulate a similar learning problem swapping out μ for rewards, in practice it is (computationally) intractable to perform in our setting. The precise difficulty lies in differentiating through the quantities π(a_t|b_t), which in turn depend on beliefs and dynamics, in the action-likelihoods (proofs located in Appendix B):

Proposition 1 (Differentiability with q-Parameterizations). Consider softmax policies parameterized by q-values from a reward function, such that π(a|b) = e^{q*(b,a)} / Σ_{a'} e^{q*(b,a')} in lieu of Equation 2. Then differentiating through the log π(a_t|b_t) terms with respect to the unknown parameters θ is intractable.

In contrast, INTERPOLE avoids ever needing to solve any "forward" problem at all (and therefore does not require resorting to costly, and approximate, sampling-based workarounds) for learning:

Proposition 2 (Differentiability with μ-Parameterizations). Consider the mean-vector policy parameterization proposed in Equation 2. Differentiation through the log π(a_t|b_t) terms with respect to the unknown parameters θ is easily and automatically performed using backpropagation through time.
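To make Proposition 2 concrete, the action log-likelihood of a demonstration is a plain composition of belief updates and the distance-based softmax, with no embedded planning problem. The sketch below (Python/NumPy; toy parameters and shapes are our own illustrative assumptions, not the paper's implementation) computes exactly the quantity that an autodiff framework would backpropagate through time:

```python
import numpy as np

def action_log_likelihood(traj, b1, T, O, mus, eta):
    """Sum_t log pi(a_t | b_t) for one demonstration. Beliefs follow the
    recursive update, so the whole computation is one deterministic chain
    of differentiable operations; wrapped in an autodiff framework, the
    gradients w.r.t. (T, O, b1, eta, mus) come out of backpropagation
    through time, with no "forward" planning problem to solve.

    traj: list of (a_t, z_t); T[a,s,s'], O[a,s',z]; mus: (A, S) mean vectors.
    """
    b, ll = b1, 0.0
    for a, z in traj:
        logits = -eta * np.sum((b - mus) ** 2, axis=1)   # unnormalized log pi(.|b)
        # log pi(a|b) = logits[a] - logsumexp(logits), computed stably:
        ll += logits[a] - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        b = (b @ T[a]) * O[a, :, z]                      # recursive belief update
        b = b / b.sum()
    return ll

# Toy check (hypothetical numbers): a perfectly symmetric setup, in which
# both action means are equidistant from the initial belief, gives pi = 1/2.
Tm = np.array([[0.9, 0.1], [0.2, 0.8]])
Om = np.array([[0.7, 0.3], [0.1, 0.9]])
T, O = np.stack([Tm, Tm]), np.stack([Om, Om])
mus = np.array([[1.0, 0.0], [0.0, 1.0]])
ll = action_log_likelihood([(0, 0)], np.array([0.5, 0.5]), T, O, mus, eta=2.0)
```

A q-value parameterization would instead require solving for q* inside this loop, which is what makes its gradients intractable (Proposition 1).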

5. ILLUSTRATIVE EXAMPLES

Three aspects of INTERPOLE deserve empirical demonstration, and we shall highlight them in turn:

• Interpretability: First, we illustrate the usefulness of our method in providing transparent explanations of behavior. This is our primary objective here: explaining by imitating.

• Accuracy: Second, we demonstrate that the faithfulness of learned policies is not given up for transparency. This shows that accuracy and interpretability are not necessarily opposed.

• Subjectivity: Third, we show that INTERPOLE correctly recovers the underlying explanations for behavior, even if the agent is biased. This sets us apart from other state-based algorithms.

In order to do so, we show archetypical examples to exercise our framework, using both simulated and real-world experiments in the context of disease diagnosis. State-based reasoning is prevalent in research and practice: three states in progressive clinical dementia [42, 43], preterminal cancer screening [44, 45], or even, as recently shown, cystic fibrosis [46] and pulmonary disease [47].

Decision Environments. For our real-world setting, we consider the diagnostic patterns for 1,737 patients during sequences of 6-monthly visits in the Alzheimer's Disease Neuroimaging Initiative (ADNI) [48] database. The state space consists of normal functioning ("NL"), mild cognitive impairment ("MCI"), and dementia. For the action space, we consider the decision problem of ordering vs. not ordering an MRI test, which, while often informative of Alzheimer's, is financially costly. MRI outcomes are categorized according to hippocampal volume: {"avg", "above avg", "below avg", "not ordered"}; separately, the cognitive dementia rating-sum of boxes ("CDR-SB") result, which is always measured, is categorized as: {"normal", "questionable impairment", "mild/severe dementia"} [42]. In total, the observation space therefore consists of the 12 combinations of outcomes. We also consider a simulated setting to better validate performance.
For this we employ a diagnostic environment (DIAG) in the form of an IOHMM with certain (true) parameters T_true, O_true, b1_true. Patients fall within diseased (s+) and healthy (s−) states, and vital-sign measurements available at every step are classified within positive (z+) and negative (z−) outcomes. For the action space, we consider the decision of continuing to monitor a patient (a=), or stopping and declaring a final diagnosis, whether a diseased (a+) or healthy (a−) declaration. If we assume agents have perfect knowledge of the true environment, then this setup is similar to the classic "tiger problem" for optimal stopping [49]. Lastly, we also consider a (more realistic) variant of DIAG where the agent's behavior is instead generated by biased beliefs due to incorrect knowledge (T, O, b_1) ≠ (T_true, O_true, b1_true) of the environment (BIAS). Importantly, this generates a testable version of real-life settings where decision-makers' (subjective) beliefs often fail to coincide with (objective) probabilities in the world.

Benchmark Algorithms. Where appropriate, we compare INTERPOLE against the following benchmarks: imitation by behavioral cloning [5] using RNNs for partial observability (R-BC); Bayesian IRL on POMDPs [9] equipped with a learned environment model (PO-IRL); a fully-offline counterpart [10] of Bayesian IRL (Off. PO-IRL); and an adaptation of model-based imitation learning [7] to partially-observable settings, with a learned IOHMM as the model (PO-MB-IL). Algorithms requiring learned models for interaction are given IOHMMs estimated using conventional methods [50]. Further information on environments and benchmark implementations is found in Appendix C.
Interpretability. First, we direct attention to the potential utility of INTERPOLE as an investigative device for auditing and quantifying individual decisions. Specifically, modeling the evolution of an agent's beliefs provides a concrete basis for analyzing the corresponding sequence of actions taken:

• Explaining Trajectories. Figure 3 shows examples of such decision trajectories for four real ADNI patients. Each vertex of the belief simplex corresponds to one of the three stable diagnoses, and each point in the simplex corresponds to a unique belief (i.e. probability distribution). The closer the point is to a vertex (i.e. state), the higher the probability assigned to that state. For instance, if the belief is located exactly in the middle of the simplex (i.e. equidistant from all vertices), then all states are believed to be equally likely; if the belief is located exactly on a vertex (e.g. directly on top of MCI), then this corresponds to absolute certainty of MCI being the underlying state. Patients (a) and (b) are "typical" patients who fit well to the overall learned policy. The former is a normally-functioning patient believed to remain around the decision boundary in all visits except the first; appropriately, they are ordered an MRI during approximately half of their visits. The latter is believed to be deteriorating from MCI towards dementia, hence prescribed an MRI in all visits.

• Identifying Belated Diagnoses. In many diseases, early diagnosis is paramount [51]. INTERPOLE allows detecting patients who appear to have been diagnosed significantly later than they should have been. Patient (c) was ordered an MRI in neither of their first two visits, despite the fact that the "typical" policy would have strongly recommended one. At a third visit, the MRI that was finally ordered led to near-certainty of cognitive impairment, but this could have been known 12 months earlier!
In fact, among all ADNI patients in the database, 6.5% were subject to this apparent pattern of "belatedness", where a late MRI is immediately followed by a jump to near-certain deterioration.

• Quantifying Value of Information. Patient (d) highlights how INTERPOLE can be used to quantify the value of a test in terms of its information gain. While the patient was ordered an MRI in all of their visits, it may appear (on the surface) that the third and final MRIs were redundant, since they had little apparent effect on beliefs. However, this is only true for the factual belief update that occurred according to the MRI outcome that was actually observed. Having access to an estimated model of how beliefs are updated in the form of decision dynamics, we can also compute counterfactual belief updates, that is, belief updates that could have occurred had the MRI outcome in question been different. In the particular case of patient (d), the tests were in fact highly informative, since (as it happened) the patient's CDR-SB scores were suggestive of impairment, and (in the counterfactual) the doctor's beliefs could have potentially leapt drastically towards MCI. On the other hand, among all MRIs ordered for ADNI patients, 19% may indeed have been unnecessary (i.e. triggering apparently insignificant belief-updates both factually and counterfactually).
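This factual-vs-counterfactual comparison uses nothing beyond the learned belief filter: for each possible test outcome z, apply the learned update and measure how far beliefs would have moved. A minimal sketch (Python/NumPy; the shift measure and all numbers below are our own hypothetical choices, not the paper's exact procedure):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Recursive belief update: b'(s') ∝ sum_s b(s) T(s'|s,a) O(z|a,s')."""
    b_next = (b @ T[a]) * O[a, :, z]
    return b_next / b_next.sum()

def counterfactual_shifts(b, a, T, O, n_outcomes):
    """L1 shift in belief for each counterfactual outcome z of action a.
    A test is informative if some plausible outcome would have moved
    beliefs substantially, even when the factual outcome barely did."""
    return [np.abs(belief_update(b, a, z, T, O) - b).sum()
            for z in range(n_outcomes)]

# Toy example: one action, two states, two outcomes (hypothetical numbers).
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])     # states are static
O = np.array([[[0.9, 0.1], [0.5, 0.5]]])     # outcome z=1 is rare in state 0
b = np.array([0.5, 0.5])
shifts = counterfactual_shifts(b, 0, T, O, n_outcomes=2)
```

In this toy setup the unobserved outcome (z=1) would have shifted beliefs considerably more than the observed one, so the test carries real information despite the modest factual update.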

Evaluating Interpretability through Clinician Surveys

To cement the argument for interpretability, we evaluated INTERPOLE by consulting nine clinicians from four different countries (United States, United Kingdom, the Netherlands, and China) for feedback. We focused on evaluating two aspects of interpretability regarding our method:

• Decision Dynamics: Whether the proposed representation of (possibly subjective) belief trajectories is preferable to raw action-observation trajectories, that is, whether decision dynamics are a transparent way of modeling how information is aggregated by decision-makers.

• Decision Boundaries: Whether the proposed representation of (possibly suboptimal) decision boundaries is a more transparent way of describing policies, compared with the representation via reward functions (the conventional approach in the policy learning literature).

For the first aspect, we presented to the participating clinicians the medical history of an example patient from ADNI represented in three ways: using only the most recent action-observation pair, the complete action-observation trajectory, and the belief trajectory as recovered by INTERPOLE. Result: All nine clinicians preferred the belief trajectories over action-observation trajectories. For the second aspect, we showed them the policies learned from ADNI by both Off. PO-IRL and INTERPOLE, which parameterize policies in terms of reward functions and decision boundaries respectively. Result: Seven out of nine clinicians preferred the representation in terms of decision boundaries over that offered by reward functions. Further details can be found in Appendix D.

Accuracy. Now, a reasonable question is whether such explainability comes at a cost: By learning an interpretable policy, do we sacrifice any accuracy? To be precise, we can ask the following questions:

• Is the belief-update process the same?
For this, the appropriate metric is the discrepancy with respect to the sequence of beliefs, which we take to be Σ_t D_KL(b_t ‖ b̂_t) (Belief Mismatch).

• Is the belief-action mapping the same? Our metric is the discrepancy with respect to the policy distribution itself, which we take to be Σ_t D_KL(π_b(·|b_t) ‖ π̂(·|b̂_t)) (Policy Mismatch).

• Is the effective behavior the same? Here, the metrics are those measuring the discrepancy with respect to ground-truth actions observed (Action-Matching) for ADNI, and differences in stopping (Stopping Time Error) for DIAG.

Note that the action-matching and stopping-time errors evaluate the quality of learned models in imitating per se, whereas belief mismatch and policy mismatch evaluate their quality in explaining. The results are revealing, if not necessarily surprising. To begin, we observe for the ADNI setting in Table 2 that INTERPOLE performs first- or second-best across all three action-matching based metrics; where it comes second, it does so only by a small margin to R-BC (bearing in mind that R-BC is specifically optimized for nothing but action-matching). Similarly for the DIAG setting, we observe in Table 3 that INTERPOLE performs the best in terms of stopping-time error. In other words, it appears that little, if any, imitation accuracy is lost by using INTERPOLE as the model. Perhaps more interestingly, we also see in Table 3 that the quality of internal explanations is superior, in terms of both belief mismatch and policy mismatch. In particular, even though the comparators PO-MB-IL, PO-IRL, and Off. PO-IRL are able to map decisions through beliefs, they inherit the conventional approach of attempting to estimate true environment dynamics, which is unnecessary, and possibly detrimental, if the goal is simply to find the most likely explanation of behavior.
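The two explanation-quality metrics above reduce to summed KL divergences over a trajectory. A minimal sketch (Python/NumPy; the smoothing constant and toy arrays are our own illustrative assumptions):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete distributions,
    with a small eps to guard against zero entries."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def belief_mismatch(true_beliefs, learned_beliefs):
    """Sum_t D_KL(b_t || b̂_t) over a trajectory."""
    return sum(kl(b, bh) for b, bh in zip(true_beliefs, learned_beliefs))

def policy_mismatch(true_action_dists, learned_action_dists):
    """Sum_t D_KL(pi_b(.|b_t) || pî(.|b̂_t)) over a trajectory."""
    return sum(kl(p, ph) for p, ph in zip(true_action_dists, learned_action_dists))

# Toy check: identical trajectories give zero mismatch.
bs = [np.array([0.6, 0.4]), np.array([0.3, 0.7])]
zero = belief_mismatch(bs, bs)
```

Both metrics require ground-truth beliefs and policies, which is why they are reported only for the simulated DIAG/BIAS settings.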
Notably, while the difference in imitation quality among the various benchmarks is not tremendous, the gap in explanation quality is significant, and there INTERPOLE holds a clear advantage.

Subjectivity. Most significantly, we now show that INTERPOLE correctly recovers the underlying explanations, even if (or perhaps especially if) the agent is driven by subjective reasoning, i.e. with biased beliefs. This aspect sets INTERPOLE firmly apart from the alternative state-based techniques.

Figure 4: Explaining Subjective Behavior. Panels show belief trajectories (as in Fig. 1a) for consecutive z- observations, and decision boundaries (as in Fig. 1b). Markers show the evolution of beliefs that explain ground-truth and learned policies in BIAS, for an example scenario where two consecutive negative (z-) observations are made. While all policies display similar effective behavior, only INTERPOLE correctly identifies the ground-truth decision boundary. This underscores the significance of distinguishing between decision dynamics (which help explain an agent's behavior) and "true" dynamics (which we do not care about).
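Using the BIAS rates given in Appendix C (a subjective false-negative rate of 0.2 versus a true rate of 0.4, with the false-positive rate fixed at 0.4), a two-line belief update shows how quickly subjective beliefs diverge from objective ones after the two consecutive z- observations in the scenario above. This sketch is ours, not code from the paper:

```python
def belief_traj(b1_pos, fn_subjective, z_seq, fp=0.4):
    """Trajectory of b(s+) under a (possibly biased) false-negative rate."""
    b, traj = b1_pos, [b1_pos]
    for z in z_seq:
        like_pos = (1 - fn_subjective) if z else fn_subjective  # P(z | s+)
        like_neg = fp if z else (1 - fp)                        # P(z | s-)
        b = b * like_pos / (b * like_pos + (1 - b) * like_neg)  # Bayes update
        traj.append(b)
    return traj
```

After two negative observations, the biased update (fn = 0.2) drives b(s+) down to 0.10, while the objective update (fn = 0.4) only reaches about 0.31; a state-based technique that insists on the true dynamics cannot reproduce the biased agent's stopping behavior.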

6. DISCUSSION

Three points deserve brief comment in closing. First, while we gave prominence to several examples of how INTERPOLE may be used to audit and improve decision-making behavior, the potential applications are not limited to these use cases. Broadly, the chief proposition is that having quantitative descriptions of belief trajectories provides a concrete language for investigating observed actions, including outliers and variations in behavior: on that basis, different statistics and post-hoc analyses can be performed on top of INTERPOLE's explanations (see Appendix C for examples). Second, one may ask whether it is reasonable to assume access to the state space. For this, allow us to reiterate a subtle distinction. There may be some "ground-truth" external state space that is arbitrarily complex or even impossible to discover, but, as explained, we are not interested in modeling this. Then there is the internal state space that an agent uses to reason about decisions, which is what we are interested in. In this sense, it is certainly reasonable to assume access to the state space, which is often very clear from the medical literature [42-47]. Since our goal is to obtain interpretable representations of decision-making, it is therefore reasonable to cater precisely to these accepted state spaces that doctors can most readily reason with. Describing behavior in terms of beliefs over these (already well-understood) states is one of the main contributors to the interpretability of our method. Finally, it is crucial to keep in mind that INTERPOLE does not claim to identify the real intentions of an agent: humans are complex, and rationality is, of course, bounded. What it does do is provide an interpretable explanation of how an agent is effectively behaving, which, as we have seen for the diagnosis of ADNI patients, offers a yardstick by which to assess and compare trajectories and subgroups.
In particular, INTERPOLE achieves this while adhering to our key criteria for healthcare settings, and without imposing assumptions of unbiasedness or optimality on behavioral policies.

A DETAILS OF THE ALGORITHM

A.1 FORWARD-BACKWARD PROCEDURE

We compute the necessary marginalizations of the joint distribution $P(\bar{D}|D,\hat\theta)$ using the forward-backward algorithm. Letting $x_{t:t'} = \{x_t, x_{t+1}, \ldots, x_{t'}\}$ for any time-indexed quantity $x_t$, the forward messages are defined as $\alpha_t(s) = P(s_t{=}s, a_{1:t-1}, z_{1:t-1} \mid \hat\theta)$, which can be computed dynamically as
$$\alpha_{t+1}(s') = P(s_{t+1}{=}s', a_{1:t}, z_{1:t} \mid \hat\theta) = \sum_{s\in S} P(s_t{=}s, a_{1:t-1}, z_{1:t-1} \mid \hat\theta)\, P(s_{t+1}{=}s', a_t, z_t \mid s_t{=}s, a_{1:t-1}, z_{1:t-1}, \hat\theta) = \sum_{s\in S} \alpha_t(s)\, \pi(a_t|b_t)\, \hat{T}(s'|s,a_t)\, \hat{O}(z_t|a_t,s') \propto \sum_{s\in S} \alpha_t(s)\, \hat{T}(s'|s,a_t)\, \hat{O}(z_t|a_t,s')$$
with initial case $\alpha_1(s) = P(s_1{=}s) = \hat{b}_1(s)$. The backward messages are defined as $\beta_t(s) = P(a_{t:\tau}, z_{t:\tau} \mid s_t{=}s, a_{1:t-1}, z_{1:t-1}, \hat\theta)$, which can also be computed dynamically as
$$\beta_t(s) = \sum_{s'\in S} P(s_{t+1}{=}s', a_t, z_t \mid s_t{=}s, a_{1:t-1}, z_{1:t-1}, \hat\theta)\, P(a_{t+1:\tau}, z_{t+1:\tau} \mid s_{t+1}{=}s', a_{1:t}, z_{1:t}, \hat\theta) = \sum_{s'\in S} \pi(a_t|b_t)\, \hat{T}(s'|s,a_t)\, \hat{O}(z_t|a_t,s')\, \beta_{t+1}(s') \propto \sum_{s'\in S} \hat{T}(s'|s,a_t)\, \hat{O}(z_t|a_t,s')\, \beta_{t+1}(s')$$
with initial case $\beta_{\tau+1}(s) = P(\varnothing \mid s_{\tau+1}{=}s, a_{1:\tau}, z_{1:\tau}, \hat\theta) = 1$. Then, the marginal probability of being in state $s$ at time $t$ given the dataset $D$ and the estimate $\hat\theta$ can be computed as
$$\gamma_t(s) = P(s_t{=}s \mid D, \hat\theta) = P(s_t{=}s \mid a_{1:\tau}, z_{1:\tau}, \hat\theta) \propto P(s_t{=}s, a_{1:\tau}, z_{1:\tau} \mid \hat\theta) = \alpha_t(s)\,\beta_t(s)$$
and similarly, the marginal probability of transitioning from state $s$ to state $s'$ at the end of time $t$ given the dataset $D$ and the estimate $\hat\theta$ can be computed as
$$\xi_t(s,s') = P(s_t{=}s, s_{t+1}{=}s' \mid D, \hat\theta) \propto P(s_t{=}s, s_{t+1}{=}s', a_{1:\tau}, z_{1:\tau} \mid \hat\theta) = \alpha_t(s)\, \pi(a_t|b_t)\, \hat{T}(s'|s,a_t)\, \hat{O}(z_t|a_t,s')\, \beta_{t+1}(s') \propto \alpha_t(s)\, \hat{T}(s'|s,a_t)\, \hat{O}(z_t|a_t,s')\, \beta_{t+1}(s')\,.$$
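As a concrete illustration, the two passes above can be implemented in a few lines. The sketch below assumes a finite IOHMM stored as arrays `T[s, a, s2]`, `O[a, s2, z]`, and an initial belief `b1`; these array layouts are our own convention, not the paper's notation. Messages are rescaled at each step for numerical stability, which leaves the marginals gamma and xi unchanged:

```python
import numpy as np

def forward_backward(T, O, b1, actions, obs):
    """Forward-backward smoothing for a finite IOHMM.

    T[s, a, s2]: transition probs; O[a, s2, z]: observation probs;
    b1[s]: initial state distribution; actions, obs: length-tau sequences.
    Returns gamma[t, s] = P(s_t = s | data) for t = 1..tau+1 and
    xi[t, s, s2] = P(s_t = s, s_{t+1} = s2 | data) for t = 1..tau.
    """
    tau, S = len(actions), len(b1)
    alpha = np.zeros((tau + 1, S))
    beta = np.ones((tau + 1, S))
    alpha[0] = b1
    # Forward: alpha_{t+1}(s') ~ sum_s alpha_t(s) T(s'|s,a_t) O(z_t|a_t,s')
    for t in range(tau):
        a, z = actions[t], obs[t]
        alpha[t + 1] = (alpha[t] @ T[:, a, :]) * O[a, :, z]
        alpha[t + 1] /= alpha[t + 1].sum()   # rescale for stability
    # Backward: beta_t(s) ~ sum_s' T(s'|s,a_t) O(z_t|a_t,s') beta_{t+1}(s')
    for t in range(tau - 1, -1, -1):
        a, z = actions[t], obs[t]
        beta[t] = T[:, a, :] @ (O[a, :, z] * beta[t + 1])
        beta[t] /= beta[t].sum()
    # gamma_t(s) ~ alpha_t(s) beta_t(s)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t(s,s') ~ alpha_t(s) T(s'|s,a_t) O(z_t|a_t,s') beta_{t+1}(s')
    xi = np.zeros((tau, S, S))
    for t in range(tau):
        a, z = actions[t], obs[t]
        xi[t] = alpha[t][:, None] * T[:, a, :] * (O[a, :, z] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi
```

A useful self-consistency check is that marginalizing xi over either state recovers the corresponding gamma.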

A.2 GRADIENT-ASCENT PROCEDURE

Taking the gradient of the expected log-likelihood $Q(\theta;\hat\theta)$ in (5) with respect to the unknown parameters $\theta = (T, O, b_1, \eta, \{\mu_a\}_{a\in A})$ first requires computing the Jacobian matrices $\nabla_{b_t} b_{t'}$ for $1 \le t < t' \le \tau$, where $(\nabla_b b')_{ij} = \partial b'(i)/\partial b(j)$ for $i,j \in S$. This can be achieved dynamically as $\nabla_{b_t} b_{t'} = \nabla_{b_{t+1}} b_{t'}\, \nabla_{b_t} b_{t+1}$ with initial case $\nabla_{b_{t'}} b_{t'} = I$, where
$$(\nabla_{b_t} b_{t+1})_{ij} = \frac{\partial b_{t+1}(i)}{\partial b_t(j)} = \frac{\partial}{\partial b_t(j)} \frac{\sum_{x\in S} b_t(x)\, T(i|x,a_t)\, O(z_t|a_t,i)}{\sum_{x\in S}\sum_{x'\in S} b_t(x)\, T(x'|x,a_t)\, O(z_t|a_t,x')} = \frac{T(i|j,a_t)\, O(z_t|a_t,i)}{\sum_{x\in S}\sum_{x'\in S} b_t(x)\, T(x'|x,a_t)\, O(z_t|a_t,x')} - \frac{\big(\sum_{x\in S} b_t(x)\, T(i|x,a_t)\, O(z_t|a_t,i)\big) \sum_{x'\in S} T(x'|j,a_t)\, O(z_t|a_t,x')}{\big(\sum_{x\in S}\sum_{x'\in S} b_t(x)\, T(x'|x,a_t)\, O(z_t|a_t,x')\big)^2}\,.$$
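The Jacobian above follows from the quotient rule applied to the belief update, and is easy to get wrong, so a numerical sanity check is useful. Below is a minimal sketch (using our own array conventions, `T[s, a, s2]` and `O[a, s2, z]`) of the single-step belief update and its analytic Jacobian, which can be verified against finite differences:

```python
import numpy as np

def belief_update(b, T, O, a, z):
    """Bayes update: b'(s') ~ sum_s b(s) T(s'|s,a) O(z|a,s'), normalized."""
    num = (b @ T[:, a, :]) * O[a, :, z]
    return num / num.sum()

def belief_jacobian(b, T, O, a, z):
    """Analytic Jacobian J[i, j] = d b'(i) / d b(j) of the belief update."""
    num = (b @ T[:, a, :]) * O[a, :, z]                  # numerator N_i
    D = num.sum()                                        # denominator
    # dN[i, j] = T(i|j,a) O(z|a,i);  dD[j] = sum_i' T(i'|j,a) O(z|a,i')
    dN = T[:, a, :].T * O[a, :, z][:, None]
    dD = (T[:, a, :] * O[a, :, z][None, :]).sum(axis=1)
    # Quotient rule: dN/D - N_i * dD_j / D^2
    return dN / D - np.outer(num, dD) / D**2
```

Perturbing each component of b and differencing the updated belief should reproduce the analytic Jacobian column by column.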

A.2.1 PARTIAL DERIVATIVES

The derivative of $Q(\theta;\hat\theta)$ with respect to $T(s'|s,a)$ is
$$\frac{\partial Q(\theta;\hat\theta)}{\partial T(s'|s,a)} = \frac{\partial}{\partial T(s'|s,a)} \sum_{i=1}^n \Bigg[ \sum_{t=1}^{\tau} \mathbb{I}\{a_t{=}a\} \sum_{x\in S}\sum_{x'\in S} \xi_t(x,x')\, \log T(x'|x,a) + \sum_{t=2}^{\tau} \log\pi(a_t|b_t) \Bigg] = \sum_{i=1}^n \Bigg[ \sum_{t=1}^{\tau} \mathbb{I}\{a_t{=}a\}\, \frac{\xi_t(s,s')}{T(s'|s,a)} + \sum_{t=2}^{\tau} \sum_{t'=1}^{t-1} \nabla_{b_t}\log\pi(a_t|b_t)\; \nabla_{b_{t'+1}} b_t\; \nabla_{T(s'|s,a)} b_{t'+1} \Bigg]\,,$$
where
$$(\nabla_{b_t}\log\pi(a_t|b_t))_{1j} = \frac{\partial\log\pi(a_t|b_t)}{\partial b_t(j)} = \frac{\partial}{\partial b_t(j)}\Big[ -\eta\|b_t-\mu_{a_t}\|^2 - \log\sum_{a'\in A} e^{-\eta\|b_t-\mu_{a'}\|^2} \Big] = -2\eta\big(b_t(j)-\mu_{a_t}(j)\big) + 2\eta\sum_{a\in A}\pi(a|b_t)\big(b_t(j)-\mu_a(j)\big)$$
and
$$(\nabla_{T(s'|s,a)}\, b_{t'+1})_{i1} = \frac{\partial b_{t'+1}(i)}{\partial T(s'|s,a)} = \mathbb{I}\{a_{t'}{=}a\}\Bigg[ \frac{\mathbb{I}\{i{=}s'\}\, b_{t'}(s)\, O(z_{t'}|a,s')}{\sum_{x}\sum_{x'} b_{t'}(x)\, T(x'|x,a)\, O(z_{t'}|a,x')} - \frac{\big(\sum_{x} b_{t'}(x)\, T(i|x,a)\, O(z_{t'}|a,i)\big)\, b_{t'}(s)\, O(z_{t'}|a,s')}{\big(\sum_{x}\sum_{x'} b_{t'}(x)\, T(x'|x,a)\, O(z_{t'}|a,x')\big)^2} \Bigg]\,.$$
The derivative of $Q(\theta;\hat\theta)$ with respect to $O(z|a,s')$ is
$$\frac{\partial Q(\theta;\hat\theta)}{\partial O(z|a,s')} = \sum_{i=1}^n \Bigg[ \sum_{t=1}^{\tau} \mathbb{I}\{a_t{=}a, z_t{=}z\}\, \frac{\gamma_{t+1}(s')}{O(z|a,s')} + \sum_{t=2}^{\tau}\sum_{t'=1}^{t-1} \nabla_{b_t}\log\pi(a_t|b_t)\; \nabla_{b_{t'+1}} b_t\; \nabla_{O(z|a,s')} b_{t'+1} \Bigg]\,,$$
where
$$(\nabla_{O(z|a,s')}\, b_{t'+1})_{i1} = \mathbb{I}\{a_{t'}{=}a, z_{t'}{=}z\}\Bigg[ \frac{\mathbb{I}\{i{=}s'\}\sum_x b_{t'}(x)\, T(s'|x,a)}{\sum_x \sum_{x'} b_{t'}(x)\, T(x'|x,a)\, O(z|a,x')} - \frac{\big(\sum_x b_{t'}(x)\, T(i|x,a)\, O(z|a,i)\big) \sum_x b_{t'}(x)\, T(s'|x,a)}{\big(\sum_x\sum_{x'} b_{t'}(x)\, T(x'|x,a)\, O(z|a,x')\big)^2}\Bigg]\,.$$
The derivative of $Q(\theta;\hat\theta)$ with respect to $b_1(s)$ is
$$\frac{\partial Q(\theta;\hat\theta)}{\partial b_1(s)} = \frac{\partial}{\partial b_1(s)} \sum_{i=1}^n \Bigg[ \sum_{x\in S} \gamma_1(x)\, \log b_1(x) + \sum_{t=1}^{\tau} \log\pi(a_t|b_t) \Bigg] = \sum_{i=1}^n \Bigg[ \frac{\gamma_1(s)}{b_1(s)} + \sum_{t=1}^{\tau} \nabla_{b_t}\log\pi(a_t|b_t)\, \nabla_{b_1(s)} b_t \Bigg]\,, \quad\text{where}\quad (\nabla_{b_1(s)} b_t)_{i1} = (\nabla_{b_1} b_t)_{is}\,.$$
The derivative of $Q(\theta;\hat\theta)$ with respect to $\eta$ is
$$\frac{\partial Q(\theta;\hat\theta)}{\partial\eta} = \sum_{i=1}^n \sum_{t=1}^{\tau} \frac{\partial}{\partial\eta}\Big[ -\eta\|b_t-\mu_{a_t}\|^2 - \log\sum_{a'\in A} e^{-\eta\|b_t-\mu_{a'}\|^2}\Big] = \sum_{i=1}^n\sum_{t=1}^{\tau}\Big[ -\|b_t-\mu_{a_t}\|^2 + \sum_{a\in A}\pi(a|b_t)\,\|b_t-\mu_a\|^2 \Big]\,.$$
Finally, the derivative of $Q(\theta;\hat\theta)$ with respect to $\mu_a(s)$ is
$$\frac{\partial Q(\theta;\hat\theta)}{\partial\mu_a(s)} = \sum_{i=1}^n\sum_{t=1}^{\tau}\Big[ 2\eta\,\mathbb{I}\{a_t{=}a\}\big(b_t(s)-\mu_a(s)\big) - 2\eta\,\pi(a|b_t)\big(b_t(s)-\mu_a(s)\big) \Big] = \sum_{i=1}^n\sum_{t=1}^{\tau} 2\eta\,\big(\mathbb{I}\{a_t{=}a\} - \pi(a|b_t)\big)\big(b_t(s)-\mu_a(s)\big)\,.$$
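A quick way to validate the eta- and mu-derivatives above is to compare them against finite differences of log pi(a|b) under the Boltzmann policy pi(a|b) proportional to exp(-eta * ||b - mu_a||^2). A minimal sketch (function names are ours):

```python
import numpy as np

def policy(b, mus, eta):
    """Boltzmann policy over actions: pi(a|b) ~ exp(-eta * ||b - mu_a||^2)."""
    logits = -eta * np.sum((b[None, :] - mus) ** 2, axis=1)
    w = np.exp(logits - logits.max())       # stabilized softmax
    return w / w.sum()

def dlogpi_deta(b, mus, eta, a):
    """d log pi(a|b) / d eta = -||b-mu_a||^2 + sum_a' pi(a'|b) ||b-mu_a'||^2."""
    pi = policy(b, mus, eta)
    d2 = np.sum((b[None, :] - mus) ** 2, axis=1)
    return -d2[a] + pi @ d2

def dlogpi_dmu(b, mus, eta, a):
    """d log pi(a|b) / d mu_a'(s) = 2 eta (1{a'=a} - pi(a'|b)) (b(s) - mu_a'(s))."""
    pi = policy(b, mus, eta)
    ind = np.zeros(len(mus)); ind[a] = 1.0
    return 2 * eta * (ind - pi)[:, None] * (b[None, :] - mus)
```

Both closed forms should match a one-sided finite-difference estimate of log pi to within truncation error.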

B PROOFS OF PROPOSITIONS

B.1 PROOF OF PROPOSITION 1

First, denote with $q^*_R \in \mathbb{R}^{\Delta(S)\times A}$ the optimal (belief-state) q-value function with respect to the underlying (state-space) reward function $R \in \mathbb{R}^{S\times A}$, and denote with $v^*_R \in \mathbb{R}^{\Delta(S)}$ the corresponding optimal value function $v^*_R(b) = \mathrm{softmax}_{a'\in A}\, q^*_R(b,a')$. Now, fix some component $i$ of the parameters $\theta$; we wish to compute the derivative of $\log\pi(a|b)$ with respect to $\theta_i$:
$$\frac{\partial}{\partial\theta_i}\log\pi(a|b) = \frac{\partial}{\partial\theta_i}\Big[ q^*_R(b,a) - \log\sum_{a'\in A} e^{q^*_R(b,a')} \Big] = \frac{\partial}{\partial\theta_i} q^*_R(b,a) - \sum_{a'\in A} \frac{e^{q^*_R(b,a')}}{\sum_{a''\in A} e^{q^*_R(b,a'')}}\, \frac{\partial}{\partial\theta_i} q^*_R(b,a') = \frac{\partial}{\partial\theta_i} q^*_R(b,a) - \mathbb{E}_{a'\sim\pi(\cdot|b)}\Big[\frac{\partial}{\partial\theta_i} q^*_R(b,a')\Big]$$
where we make explicit here the dependence on $R$, but note that it is itself a parameter; that is, $R = \theta_j$ for some $j$. We see that this in turn requires computing the partial derivative $\partial q^*_R(b,a)/\partial\theta_i$. Let $\gamma$ be some appropriate discount rate, and denote with $\rho_R \in \mathbb{R}^{\Delta(S)\times A}$ the effective (belief-state) reward corresponding to $R$. Then the partial $\partial q^*_R(b,a)/\partial\theta_i$ is given as follows:
$$\frac{\partial}{\partial\theta_i} q^*_R(b,a) = \frac{\partial}{\partial\theta_i}\Big[ \rho_R(b,a) + \gamma\int_{b'\in\Delta(S)} P(b'|b,a)\, v^*_R(b')\, db' \Big] = \underbrace{\frac{\partial}{\partial\theta_i}\rho_R(b,a) + \gamma\int_{b'\in\Delta(S)} v^*_R(b')\, \frac{\partial}{\partial\theta_i} P(b'|b,a)\, db'}_{\rho_{R,i}(b,a)} + \gamma\int_{b'\in\Delta(S)} P(b'|b,a)\, \mathbb{E}_{a'\sim\pi(\cdot|b')}\Big[\frac{\partial}{\partial\theta_i} q^*_R(b',a')\Big]\, db'$$
from which we observe that $\partial q^*_R(b,a)/\partial\theta_i$ is a fixed point of a certain Bellman-like operator. Specifically, fix any function $f \in \mathbb{R}^{\Delta(S)\times A}$; then $\partial q^*_R(b,a)/\partial\theta_i$ is the fixed point of the operator $T^\pi_{R,i}$.
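The first identity above, that the derivative of log pi(a|b) equals the derivative of q minus its expectation under pi, is simply the gradient of a log-softmax and can be checked numerically in isolation. A small sketch with hypothetical q-values (ours, not the paper's):

```python
import numpy as np

def log_softmax_grad(q, dq):
    """Given q-values q[a] and derivatives dq[a] = d q(b,a) / d theta_i,
    return d log pi(a|b) / d theta_i = dq[a] - E_{a'~pi}[dq[a']],
    where pi is the softmax of q."""
    w = np.exp(q - q.max())
    pi = w / w.sum()
    return dq - pi @ dq
```

With q linear in a scalar parameter theta, the closed form matches a finite difference of log pi.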
In particular, note that each $b_t$ is computed through a feed-forward structure, and therefore can easily be differentiated with respect to the unknown parameters $\theta$ through backpropagation through time: each time step leading up to an action corresponds to a "hidden layer" in a neural network, and the initial belief corresponds to the "features" fed into the network; the transition and observation functions correspond to the weights between layers, the beliefs at each time step correspond to the activations between layers, the actions themselves correspond to class labels, and the action likelihood corresponds to the loss function (see Appendices A.1 and A.2). Finally, note that computing all of the forward-backward messages $\alpha_t$ and $\beta_t$ in Appendix A.1 has complexity $O(n\tau S^2)$, computing all of the Jacobian matrices $\nabla_{b_t} b_{t'}$ in Appendix A.2 has complexity $O(n\tau^2 S^3)$, and computing all of the partial derivatives given in Appendix A.2 has complexity at most $O(n\tau^2 S^2 AZ)$. Hence, fully differentiating the expected log-likelihood $Q(\theta;\hat\theta)$ with respect to the unknown parameters $\theta$ has an overall (polynomial) complexity $O(n\tau^2 S^2 \max\{S, AZ\})$.

C EXPERIMENT PARTICULARS

C.1 DETAILS OF DECISION ENVIRONMENTS

ADNI We have filtered out visits without a CDR-SB measurement, which is almost always taken, as well as visits that do not occur immediately after the six-month period following the previous visit but instead occur 12 months or later afterwards. This filtering leaves 1,626 patients, typically with three consecutive visits each. For MRI outcomes, average is considered to be within half a standard deviation of the population mean. Since there are only two actions in this scenario, we have set $\eta = 1$ and relied on the distance between the two mean vectors to adjust for the stochasticity of the estimated policy, closer means being roughly equivalent to a smaller $\eta$.

BIAS We set all parameters exactly as in DIAG, with one important exception: now $\bar{O}(z^-|a^=, s^+) = 0.2$ while it is still the case that $O_{\mathrm{true}}(z^-|a^=, s^+) = 0.4$, meaning $\bar{O} = O_{\mathrm{true}}$ no longer holds. In this scenario, $b_1$ is also assumed to be known (in addition to $T$ and $\eta$) to avoid any invariances between $b_1$ and $O$ that we encountered during training. The behavioral dataset is generated as 1000 demonstration trajectories.

R-BC

We train an RNN whose inputs are the observed histories $h_t$ and whose outputs are the predicted probabilities $\pi(a|h_t)$ of taking action $a$ given the observed history $h_t$. The network consists of an LSTM unit of size 64 and a fully-connected hidden layer of size 64. We minimize the cross-entropy loss $L = -\sum_{i=1}^n \sum_{t=1}^{\tau} \sum_{a\in A} \mathbb{I}\{a_t{=}a\}\, \log\pi(a|h_t)$ using the Adam optimizer with learning rate 0.001 until convergence, that is, until the cross-entropy loss does not improve for 100 consecutive iterations.

PO-IRL

The IOHMM parameters $T$, $O$, and $b_1$ are initialized by sampling them uniformly at random. Then, they are estimated and fixed using conventional IOHMM methods. The reward parameter $R$ is initialized as $\bar{R}_0(s,a) = \varepsilon_{s,a}$ where $\varepsilon_{s,a} \sim N(0, 0.001^2)$. Then, it is estimated via Markov chain Monte Carlo (MCMC) sampling, during which new candidate samples are generated by adding Gaussian noise with standard deviation 0.001 to the last sample. A final estimate is formed by averaging every tenth sample among the second set of 500 samples, ignoring the first 500 samples as burn-in. In order to compute optimal q-values, we have used an off-the-shelf POMDP solver available at https://www.pomdp.org/code/index.html.

Off. PO-IRL

All parameters are initialized exactly as in PO-IRL. Then, both the IOHMM parameters $T$, $O$, and $b_1$ and the reward parameter $R$ are estimated jointly via MCMC sampling. When generating new candidate samples, with equal probability, we either sample new $T$, $O$, and $b_1$ from the IOHMM posterior (without changing $R$) or obtain a new $R$ the same way as in PO-IRL (without changing $T$, $O$, and $b_1$). A final estimate is formed the same way as in PO-IRL.
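For concreteness, the random-walk Metropolis scheme described above can be sketched generically as follows. Here `log_post` stands in for the (unnormalized) log-posterior of $R$ given the data, which is an assumption of this sketch rather than the paper's exact computation; the proposal noise, burn-in, and thinning match the numbers in the text:

```python
import numpy as np

def metropolis_reward(log_post, S, A, n_samples=500, step=0.001,
                      burn_in=500, thin=10, seed=0):
    """Random-walk Metropolis over a reward table R (S x A):
    Gaussian-noise proposals, burn-in, then averaging every `thin`-th sample."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 0.001, size=(S, A))       # R_0(s,a) = eps_{s,a}
    lp = log_post(R)
    kept = []
    for i in range(burn_in + n_samples):
        prop = R + rng.normal(0.0, step, size=R.shape)  # candidate sample
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:         # accept / reject
            R, lp = prop, lp_prop
        if i >= burn_in and (i - burn_in) % thin == thin - 1:
            kept.append(R.copy())
    return np.mean(kept, axis=0)                  # posterior-mean estimate
```

With a log-posterior sharply peaked at zero, the chain stays near zero and the averaged estimate does too.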

PO-MB-IL

The IOHMM parameters $T$, $O$, and $b_1$ are initialized by sampling them uniformly at random. Then, they are estimated and fixed using conventional IOHMM methods. Given the IOHMM parameters, we parameterize policies the same way as in INTERPOLE, that is, as described in (2). The policy parameters $\{\mu_a\}_{a\in A}$ are initialized as $\bar{\mu}^0_a(s) = (1/|S| + \varepsilon_{a,s}) / \sum_{s'\in S}(1/|S| + \varepsilon_{a,s'})$ where $\varepsilon_{a,s} \sim N(0, 0.001^2)$. Then, they are estimated solely according to the action likelihoods in (4) using the EM algorithm. The expected log-posterior is maximized using the Adam optimizer with learning rate 0.001 until convergence, that is, until the expected log-posterior does not improve for 100 consecutive iterations.

INTERPOLE

All parameters are initialized exactly as in PO-MB-IL. Then, the IOHMM parameters $T$, $O$, and $b_1$ and the policy parameters $\{\mu_a\}_{a\in A}$ are estimated jointly according to both the action likelihoods and the observation likelihoods in (4). The expected log-posterior is again maximized using the Adam optimizer with learning rate 0.001 until convergence.

C.3 FURTHER EXAMPLE: POST-HOC ANALYSES

Policy representations learned by INTERPOLE provide users with the means to derive concrete criteria that describe observed behavior in objective terms. These criteria, in turn, enable quantitative analyses of the behavior using conventional statistical methods. For ADNI, we have considered two such criteria: the belatedness of individual diagnoses and the informativeness of individual tests. Both of these criteria are relevant to the discussion of early diagnosis, which is paramount for Alzheimer's disease [51], as we have already mentioned during the illustrative examples. Formally, we consider the final diagnosis of a patient to be belated if (i) the patient was not ordered an MRI in one of their visits despite an MRI being the most likely action according to the policy estimated by INTERPOLE, and (ii) the patient was ordered an MRI in a later visit that led to a near-certain diagnosis, with at least 90% confidence according to the underlying beliefs estimated by INTERPOLE. We consider an MRI to be uninformative if it neither (factually) caused nor could have (counterfactually) caused a significant change in the underlying belief-state of the patient, where a change is deemed insignificant when it is at least half a standard deviation below the mean factual change in beliefs as estimated by INTERPOLE. Having defined belatedness and informativeness, one can investigate the frequency of belated diagnoses and uninformative MRIs in different cohorts of patients to see how practice varies from one cohort to another. In Table 5, we do so for six cohorts: all of the patients, patients who are over 75 years old, patients with the apoE4 risk factor for dementia, patients with signs of MCI or dementia since their very first visit, female patients, and male patients. Note that increasing age, the apoE4 allele, and female gender are known to be associated with increased risk of Alzheimer's disease [54-57].
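The belatedness criterion can be operationalized directly on INTERPOLE's outputs. The sketch below is ours; the function and argument names, and the inputs themselves (per-visit MRI indicators, the estimated policy's per-visit MRI probability, and the estimated belief trajectory) are hypothetical stand-ins for quantities the method produces:

```python
def is_belated(mri_ordered, p_mri, beliefs, conf=0.9):
    """Belatedness per the two criteria in the text (inputs hypothetical):
    (i) some visit had no MRI even though an MRI was the most likely action
        under the estimated policy (p_mri[t] > 0.5), and
    (ii) an MRI at a later visit led to a belief with >= conf confidence.

    mri_ordered: bools per visit; p_mri: policy's MRI probability per visit;
    beliefs: belief vectors, one per visit plus one after the final visit.
    """
    tau = len(mri_ordered)
    for t in range(tau):
        if not mri_ordered[t] and p_mri[t] > 0.5:
            for t2 in range(t + 1, tau):
                if mri_ordered[t2] and max(beliefs[t2 + 1]) >= conf:
                    return True
    return False
```

Cohort-level frequencies, as in Table 5, would then be averages of this indicator over patients in each cohort.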
For instance, we see that uninformative MRIs are much more prevalent among patients with signs of MCI or dementia since their first visit. This could potentially be because these patients are monitored much more closely than usual given their condition. Alternatively, one can divide patients into cohorts based on whether they have a belated diagnosis or an uninformative MRI to see which features these criteria correlate with most. We do so in Table 6. For instance, we see that a considerable percentage of belated diagnoses occur among male patients.

Clinical practice guidelines are often given in the form of decision trees, which usually have vague elements that require the judgement of the practitioner [58, 59]. For example, a guideline could ask the practitioner to quantify risks, side effects, or improvements in subjective terms such as being significant, serious, or potential. Using direct policy learning, how vague elements like these are commonly resolved in practice can be learned in objective terms. Formulating policies in terms of IOHMMs and decision boundaries is expressive enough to model decision trees. An IOHMM with deterministic observations, that is, with $O(z|a,s') = 1$ for some $z\in Z$ and for all $a\in A$, $s'\in S$, essentially describes a finite-state machine whose inputs are equivalent to the observations. Similarly, a deterministic decision tree can be defined as a finite-state machine with no looping sequence of transitions. The case where the observations are probabilistic rather than deterministic corresponds to the case where the decision tree is traversed in a probabilistic way, so that each path down the tree has a probability associated with it at each step of the traversal. As a concrete example of modeling decision trees in terms of IOHMMs, consider the scenario of diagnosing a disease with two sub-types: Disease-A and Disease-B. Figure 5a depicts the policy of the doctors in the form of a decision tree.
Each newly-arriving patient is first tested for the disease in a general sense, without any distinction between its two sub-types. The patient is then tested for a specific sub-type of the disease only if the doctors deem there to be a significant risk that the patient is diseased. Note that exactly which level of confidence constitutes a significant risk is left vague in the decision tree. By modeling this scenario using our framework, we can learn: (i) how the risk is determined based on initial test results, and (ii) what amount of risk is considered significant enough to require a subsequent test for the sub-type.


Let S = {INI, HLT, DIS, DSA, DSB}, where INI denotes that the patient has newly arrived, HLT denotes that the patient is healthy, DIS denotes that the patient is diseased, DSA denotes that the patient has Disease-A, and DSB denotes that the patient has Disease-B. After taking action $a_1$ = TST-DIS and observing some initial test result $z_1 \in Z$, the risk of disease, which is the probability that the patient is diseased, can be calculated with a simple belief update.

Then, the participants were asked two multiple-choice questions, one strictly about representing histories and one strictly about representing policies. Importantly, the survey was conducted blindly; that is, they were given no context whatsoever as pertains to this paper and our proposed method. The question slides can be found in Figures 6 and 7. Essentially, each question first states a hypothesis and shows two or three representations relevant to the stated hypothesis. Then, the participant is asked which of the representations shown most readily expresses the hypothesis. Here are the full responses that we have received, which include some additional feedback:

However, the representation of beliefs on a continuous spectrum around discrete cognitive states could be potentially confusing, given that cognitive function is itself a continuum (and 'MCI', 'Dementia' and 'NL' are stations on a spectrum rather than discrete states). Also, while representation C is the clearest illustration, it is the representation that conveys the least actual data, and it isn't clear from the visualisation exactly what each shift in 2D space represents. Also, the triangulation in 'C' draws a direct connection between NL and Dementia, implying that this is a potential alternative route for disease progression, although this is more intuitively considered as a linear progression from NL to MCI to Dementia. Q2.
For me, the decision boundary representation best expresses the concept of the likelihood of ordering an MRI, with the same caveats described above. Option B does best convey the likelihood of ordering an MRI, but doesn't convey the information value provided by that investigation. However, my understanding is that this is not what you are aiming to convey here.

• Hypothesis: The more "normal" a patient appears, the less likely an MRI is ordered.
• Question #2: Which of the following representations of the doctor's decision-making most readily expresses this hypothesis?
• Clinician 1 Question 1: C > B > A



While we take it here that Z is finite, our method can easily be generalized to allow continuous observations.

In healthcare, diseases are often modeled in terms of states, and beliefs over disease states are eminently transparent factors that medical practitioners (i.e. domain experts) readily comprehend and reason about [40, 41].

Belief/policy mismatch are not applicable to ADNI since we have no access to ground-truth beliefs/policies.



Figure 1: The INTERPOLE Model. Here, S = {K, L, M} and A = {1, 2, 3, 4}. (a) Beliefs are updated recursively (Equation 1). (b) Actions are chosen with respect to the relative locations of mean vectors (Equation 2).

Figure 3: Decision Trajectories. Examples of real patients, including: (a) A typical normally-functioning patient, where the decision-maker's beliefs remain mostly on the decision boundary. (b) A typical patient who is believed to be deteriorating towards dementia. (c) A patient who-apparently-could have been diagnosed much earlier than they actually were. (d) A patient with a (seemingly redundant) MRI test that is actually highly informative.

$\rho_R(b,a) \doteq \sum_{s\in S} b(s)\, R(s,a)$ corresponding to $R$. Further, let
$$P(b'|b,a) = \sum_{z\in Z} P(z|b,a)\, P(b'|b,a,z) = \sum_{z\in Z} \Bigg[ \sum_{s\in S}\sum_{s'\in S} b(s)\, T(s'|s,a)\, O(z|a,s') \Bigg]\, \delta\!\Bigg( b' - \frac{\sum_{s\in S} b(s)\, T(\cdot|s,a)\, O(z|a,\cdot)}{\sum_{s\in S}\sum_{s'\in S} b(s)\, T(s'|s,a)\, O(z|a,s')} \Bigg)$$
denote the (belief-state) transition probabilities induced by $T$ and $O$, where $\delta$ is the Dirac delta function such that $\delta(b')$ integrates to one if and only if $b' = 0$ is included in the integration region.

We set $T_{\mathrm{true}}(s^-|s^-,\cdot) = T_{\mathrm{true}}(s^+|s^+,\cdot) = 1$, meaning patients do not heal or contract the disease as the diagnosis progresses, $O_{\mathrm{true}}(z^-|a^=, s^+) = O_{\mathrm{true}}(z^+|a^=, s^-) = 0.4$, meaning the test has false-negative and false-positive rates of 40%, and $b^{\mathrm{true}}_1(s^+) = 0.5$. Moreover, the behavior policy is given by $\bar{T} = T_{\mathrm{true}}$, $\bar{O} = O_{\mathrm{true}}$, $\bar{b}_1 = b^{\mathrm{true}}_1$, $\eta = 10$, $\mu_{a^=}(s^+) = 0.5$, and $\mu_{a^-}(s^-) = \mu_{a^+}(s^+) = 1.3$. Intuitively, doctors continue monitoring the patient until they are 90% confident in declaring a final diagnosis. In this scenario, $T$ and $\eta$ are assumed to be known. The behavior dataset is generated as 100 demonstration trajectories.
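The DIAG behavior above admits a compact worked example: with false-negative and false-positive rates of 0.4 and decision boundaries at the mu midpoints, (0.5 + 1.3)/2 = 0.9 in b(s+) for diagnosing positive and, symmetrically, 0.1 for diagnosing negative, a doctor needs six consecutive positive test results to move from b1(s+) = 0.5 past the 0.9 boundary. A minimal sketch of this deterministic limit (the actual policy is stochastic with eta = 10):

```python
def diag_belief_update(b_pos, z_pos, fn=0.4, fp=0.4):
    """Bayes update of b(s+) in DIAG: states are static, and the test has
    false-negative rate fn and false-positive rate fp."""
    like_pos = (1 - fn) if z_pos else fn      # P(z | s+)
    like_neg = fp if z_pos else (1 - fp)      # P(z | s-)
    num = b_pos * like_pos
    return num / (num + (1 - b_pos) * like_neg)

def simulate(observations, b1=0.5, hi=0.9, lo=0.1):
    """Keep testing until b(s+) crosses a decision boundary; the boundaries
    correspond to the mu midpoints, e.g. (0.5 + 1.3) / 2 = 0.9."""
    b = b1
    for z in observations:
        b = diag_belief_update(b, z)
        if b > hi:
            return 'diagnose+', b
        if b < lo:
            return 'diagnose-', b
    return 'continue', b
```

Each positive result adds log(0.6/0.4) to the log-odds, so five positives leave the belief just below the boundary and the sixth crosses it.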


Figure 5: Two Different Descriptions of the Same Policy: (a) in the form of a decision tree and (b) in terms of an equivalent IOHMM. For the IOHMM in (b), arrows denote possible transitions, where the probability of a transition is proportional to the quantity written above the corresponding arrow. Using direct policy learning, we can infer the risk of disease, b 2 (DIS), and the probability of testing for the sub-type based on the risk, π(TST-TYP|b 2 ), which are left vague in (a).

Figure 5b depicts the state space S with all possible transitions. Note that the initial belief $b_1$ is such that $b_1(\mathrm{INI}) = 1$. Let A = {TST-DIS, TST-TYP, STP-HLT, STP-DSA, STP-DSB}, where TST-DIS denotes testing for the disease, TST-TYP denotes testing for the sub-type of the disease, and the remaining actions denote stopping and diagnosing the patient with one of the terminal states, namely HLT, DSA, and DSB.

$$b_2(\mathrm{DIS}) \propto \sum_{s\in S} b_1(s)\, T(\mathrm{DIS}|s, \text{TST-DIS})\, O(z_1|\text{TST-DIS}, \mathrm{DIS}) = T(\mathrm{DIS}|\mathrm{INI}, \text{TST-DIS})\, O(z_1|\text{TST-DIS}, \mathrm{DIS})\,.$$
Moreover, we can say that the doctors are more likely to test for the sub-type of the disease as opposed to stopping and diagnosing the patient as healthy, that is $\pi_b(\text{TST-TYP}|b_2) > \pi_b(\text{STP-HLT}|b_2)$, when
$$b_2(\mathrm{DIS}) > \frac{\mu_{\text{TST-TYP}}(\mathrm{DIS}) + \mu_{\text{STP-HLT}}(\mathrm{DIS})}{2}$$
assuming $\mu_{\text{TST-TYP}}(\mathrm{DIS}) > \mu_{\text{STP-HLT}}(\mathrm{DIS})$. Note that there are only two possible actions at the second time step: TST-TYP and STP-HLT.

D DETAILS OF THE CLINICIAN SURVEYS

Each participant was provided with a short presentation explaining (1) the ADNI dataset and the decision-making problem we consider, (2) what rewards and reward functions are, (3) what beliefs and belief simplices are, and (4) how policies can be represented in terms of reward functions as well as decision boundaries.

Question 2: B. Additional Feedback: The triangle was initially more confusing than not, but the first example (100% uncertainty) was helpful. It isn't clear how the dots in the triangle are computed. Are these probabilities based on statistics? Diagram is always better than no diagram.

• Clinician 2 Question 1: C > B > A. Question 2: B. Additional Feedback: I always prefer pictures to tables, they are much easier to understand.

• Clinician 3 Question 1: C > B > A. Question 2: B. Additional Feedback: Of course the triangle is more concise and easier to look at. But how is the decision boundary obtained? Does the decision boundary always have to be parallel to one of the sides of the triangle?

• Clinician 4 Question 1: C > B > A. Question 2: B. Additional Feedback: [Regarding Question 1,] representation A and B do not show any interpretation of the diagnostic test results, whereas representation C does. I think doctors are most familiar with representation B, as it more closely resembles the EHR. Although representation C is visually pleasing, I'm not sure how the scale of the sides of the triangle should be interpreted. [Regarding Question 2,] again I like the triangle, but it's hard to interpret what the scale of the sides of the triangle mean. I think option A is again what doctors are more familiar with.

• Question 1: C. Question 2: B. Additional Feedback: I thought I'd share with you my thoughts on the medical aspects in your scenario first (although I realise you didn't ask me for them). [...] The Cochrane review concludes that MRI provides low sensitivity and specificity and does not qualify it as an add-on test for the early diagnosis due to dementia (Lombardi G et al., Cochrane Database 2020). The reason for MRI imaging is (according to the international guidelines) to exclude non-degenerative or surgical causes of cognitive impairment. [...]
In your example the condition became apparent when the CDR-SB score at Month 24 hit 3.0 (supported by the sequence of measurements over time showing a worsening CDR-SB score). I imagine the MRI was triggered by slight worsening in the CDR-SB score (to exclude an alternative diagnosis). To answer your specific questions: Q1. The representation C describes your (false) hypothesis that it was the MRI that made the diagnosis of MCI more likely/apparent the best; I really like the triangles. Q2. I really like the decision boundary.

• Question 1: C. Question 2: B. Additional Feedback: Q1. Representation C gives the clearest illustration of the diagnostic change following MRI.

• Hypothesis: This patient's condition (MCI) only became apparent after the first MRI was ordered.
• Question #1: Which of the following representations of the patient's medical history most readily expresses this hypothesis? (Please rank them from most to least accessible.)

Figure 6: Slide 5 out of 9, which contains the first question regarding histories.

Figure 7: Slide 9 out of 9, which contains the second question regarding policies.

Table 1: Comparison with Related Work. INTERPOLE satisfies our key criteria of being transparent by design, accommodating partial observability, and operating completely offline.

Table 2: Performance Comparison in ADNI. INTERPOLE is best/second-best for action-matching metrics.

Off. PO-IRL: 0.24 ± 0.01 | 0.54 ± 0.05 | 0.79 ± 0.09
INTERPOLE: 0.17 ± 0.05 | 0.60 ± 0.04 | 0.81 ± 0.09

Table 3: Performance Comparison in DIAG. INTERPOLE is best. Belief mismatch is not applicable to R-BC. († denotes ×10⁻³.)

Table 4: Performance Comparison in BIAS. In BIAS, $\bar{O}(z^-|a^=, s^+) < O_{\mathrm{true}}(z^-|a^=, s^+)$; that is, the doctor now incorrectly believes the test to have a smaller false-negative rate than it does in reality, thus biasing their beliefs regarding patient states.

Table 5: Frequency of belated diagnoses and uninformative MRIs in various patient cohorts.

Table 6: Features of patients with belated diagnoses and uninformative MRIs.

ACKNOWLEDGMENTS

This work was supported by the US Office of Naval Research (ONR) and Alzheimer's Research UK (ARUK). We thank the clinicians who participated in our survey, the reviewers for their valuable feedback, and the Alzheimer's Disease Neuroimaging Initiative for providing the ADNI dataset.

B.1 PROOF OF PROPOSITION 1 (CONTINUED)

The operator $T^\pi_{R,i} : \mathbb{R}^{\Delta(S)\times A} \to \mathbb{R}^{\Delta(S)\times A}$ is defined as follows:
$$(T^\pi_{R,i} f)(b,a) = \rho_{R,i}(b,a) + \gamma \int_{b'\in\Delta(S)} P(b'|b,a)\, \mathbb{E}_{a'\sim\pi(\cdot|b')}\big[f(b',a')\big]\, db'\,,$$
which takes the form of a "generalized" Bellman operator on q-functions for POMDPs, where for brevity we have written $\rho_{R,i}(b,a)$ to denote the expression
$$\rho_{R,i}(b,a) = \frac{\partial}{\partial\theta_i}\rho_R(b,a) + \gamma \int_{b'\in\Delta(S)} v^*_R(b')\, \frac{\partial}{\partial\theta_i} P(b'|b,a)\, db'\,.$$
Mathematically, this means that a recursive procedure can in theory be defined (cf. "∇q-iteration", analogous to q-iteration; see e.g. [52]) that may converge on the gradient under appropriate conditions. Computationally, however, this also means that taking a single gradient is at least as hard as solving POMDPs in general. Further, note that while typical POMDP solvers operate by taking advantage of the convexity property of $\rho_R(b,a)$ (see e.g. [53]), here there is no such property to make use of: in general, it is not the case that $\rho_{R,i}(b,a)$ is convex. To see this, a counterexample can be constructed with two states and $\gamma = 1/2$: writing $b$ for $b(s^+)$ for simplicity, ordering the elements of $\theta$ such that $p$ is the $i$-th element, and evaluating $\rho_{R,i}(b,a)$ at $p = 1$ yields a function that is not convex in $b$.

B.2 PROOF OF PROPOSITION 2

In contrast, unlike the indirect q-value parameterization above (which by itself requires approximate solutions to optimization problems), the mean-vector parameterization of INTERPOLE maps beliefs directly to distributions over actions, and the derivatives of $\log\pi(a|b)$ are given as closed-form expressions in Appendices A.1 and A.2.

