REWARD LEARNING WITH TREES: METHODS AND EVALUATION

Abstract

Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.

1. INTRODUCTION

For a reinforcement learning (RL) agent to reliably achieve a goal or desired behaviour, this objective must be encoded as a reward function. However, manual reward design is widely understood to be challenging, with risks of under-, over-, and mis-specification leading to undesirable, unsafe and variable outcomes (Pan et al., 2022). For this reason, there has been growing interest in enabling RL agents to learn reward functions from normative feedback provided by humans (Leike et al., 2018). These efforts have proven successful from a technical perspective, but an oft-unquestioned aspect of the approach creates a roadblock to practical applications: reward learning typically uses black-box neural networks (NNs), which resist human scrutiny and interpretation. For advocates of explainable AI (XAI), this is a problematic state of affairs. The XAI community is vocal about the safety and accountability risks of opaque learning algorithms (Rudin, 2019), but an inability to interpret even the objective that an agent is optimising places us in yet murkier epistemic territory, in which an understanding of the causal origins of learnt behaviour, and their alignment with human preferences, becomes virtually unattainable.

Black-box reward learning could also be seen as a missed scientific opportunity. A learnt reward function is a tantalising object of study from an XAI perspective, due to its triple status as (1) an explanatory model of revealed human preferences, (2) a normative model of agent behaviour, and (3) a causal link between the two.

The approach proposed by Bewley & Lecue (2022) provides a promising way forward. Here, human preference labels over pairs of agent behaviours are used to learn tree-structured reward functions (reward trees), which are hierarchies of local rules that admit visual and textual representation and can be leveraged to monitor and debug agent learning.
In this paper, we adapt and extend the method (including by integrating it with model-based RL agents), and compare it to NN-based reward learning in a challenging aircraft handling domain. We find it to be broadly competitive on both quantitative metrics and qualitative assessments, with our new modification to tree growth yielding significant improvements. The resultant trees are small enough to be globally interpretable (≈ 20 leaves), and we demonstrate how they can be analysed, verified, and used to generate explanations. The primary contribution of this paper is positive empirical evidence that reward learning can be done effectively using interpretable models such as trees, even in complex, high-dimensional continuous environments. We also make secondary methodological contributions: improvements to the originally-proposed learning algorithm, as well as metrics and methods for reward evaluation and interpretability that may be useful to others working in what remains a somewhat preparadigmatic field. After reviewing the necessary background and related work in Sections 2 and 3, we present our refinement of reward tree learning in Section 4, and describe how we deploy it online with a model-based agent in Section 5. Section 6 contains our experiments and results, which consider both quantitative and qualitative aspects of learning performance, and an illustrative analysis of learnt tree structures. Finally, Section 7 concludes and discusses avenues for future work.

2. BACKGROUND AND RELATED WORK

Markov Decision Processes (MDPs)

In this formulation of sequential decision making, the state of a system at time $t$, $s_t \in \mathcal{S}$, and the action of an agent, $a_t \in \mathcal{A}$, condition the successor state $s_{t+1}$ according to dynamics $D : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ ($\Delta(\cdot)$ denotes the set of all probability distributions over a set). A reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ then outputs a scalar reward $r_{t+1}$ given $s_t$, $a_t$ and $s_{t+1}$. RL uses exploratory data collection to learn action-selection policies $\pi : \mathcal{S} \to \Delta(\mathcal{A})$, with the goal of maximising the expected discounted sum of future reward, $\mathbb{E}_{D,\pi}\left[\sum_{h=0}^{\infty} \gamma^h r_{t+h+1}\right]$, $\gamma \in [0, 1]$.

Reward Learning

In the usual MDP framing, $R$ is an immutable property of the environment, which belies the practical fact that AI objectives originate in the uncertain goals and preferences of fallible humans (Russell, 2019). Reward learning (or modelling) (Leike et al., 2018) replaces hand-specified reward functions with models learnt from humans via revealed preference cues such as demonstrations (Ng et al., 2000), scalar evaluations (Knox & Stone, 2008), approval labels (Griffith et al., 2013), corrections (Bajcsy et al., 2017), and rankings (Christiano et al., 2017). The default use of NNs for reward learning severely limits interpretability; reward trees provide a possible solution.

XAI for RL (XRL)

Surveys of XAI for RL (Puiutta & Veith, 2020; Heuillet et al., 2021) divide between intrinsic approaches, which imbue agents with structure such as object-oriented representations (Zhu et al., 2018) or symbolic policy primitives (Verma et al., 2018), and post hoc analyses of learnt representations (Zahavy et al., 2016), including computing feature importance/saliency (Huber et al., 2019). Spatiotemporal scope varies from the local explanation of single actions (van der Waa et al., 2018) to the summary of entire policies via representative trajectories (Amir & Amir, 2018) or critical states (Huang et al., 2018).
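To make the RL objective from the MDP formulation above concrete, here is a minimal Python sketch (function and variable names are our own, not from the paper) of the discounted return of a single observed trajectory:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of future reward, sum_{h>=0} gamma^h * r_{t+h+1},
    estimated from a finite trajectory of observed rewards."""
    return sum((gamma ** h) * r for h, r in enumerate(rewards))

# With gamma = 0.9, a long run of unit rewards approaches 1 / (1 - 0.9) = 10,
# illustrating how gamma < 1 bounds the infinite-horizon sum.
print(discounted_return([1.0] * 1000, gamma=0.9))
```

In practice an RL agent maximises the expectation of this quantity over the dynamics $D$ and its own policy $\pi$, which is estimated from many sampled trajectories rather than computed in closed form.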
While most post hoc methods focus on fixed policies, some provide insight into the dynamics of agent learning (Dao et al., 2018; Bewley et al., 2022).

Explainable Reward Functions

At the intersection of reward learning and XRL lie efforts to improve human understanding of reward functions and their effects on action selection. While this area is "less developed" than other XRL sub-fields (Glanois et al., 2021), a distinction has again emerged between intrinsic approaches, which create rewards that decompose into semantic components (Juozapaitis et al., 2019) or optimise for sparsity (Devidze et al., 2021), and post hoc approaches, which apply feature importance analysis (Russell & Santos, 2019), counterfactual probing (Michaud et al., 2020), or simplifying transformations (Jenner & Gleave, 2022). Sanneman & Shah (2022) use human-oriented metrics to compare the efficacy of reward explanation techniques. In this taxonomy, reward tree learning is an intrinsic approach, as the rule structure is inherently readable.

Trees in RL

Tree models have a long history in RL (Chapman & Kaelbling, 1991; Džeroski et al., 1998; Pyeatt, 2003). Their use is increasingly given an XRL motivation. Applications again divide into intrinsic methods, where an agent's policy (Silva et al., 2020), value function (Liu et al., 2018; Roth et al., 2019) or dynamics model (Jiang et al., 2019) is a tree, and post hoc tree approximations of an existing agent's policy (Bastani et al., 2018; Coppens et al., 2019) or transition statistics (Bewley et al., 2022). Related to our focus on human-centric learning, Cobo et al. (2012) learn tree-structured MDP abstractions from demonstrations and Tambwekar et al. (2021) distill a differentiable tree policy from natural language. While Sheikh et al. (2022) use tree evolution to learn dense intrinsic rewards from sparse environment ones, Bewley & Lecue (2022) are the first to learn and use reward trees in the absence of any ground-truth reward signal, and the first to do so from human feedback.

3. PREFERENCE-BASED REWARD LEARNING

We adopt the preference-based approach to reward learning, in which a human is presented with pairs of agent trajectories (sequences of state, action, next state transitions) and expresses which of each pair they prefer as a solution to a given task of interest. A reward function is then learnt to explain the pattern of preferences. This approach is popular in the existing literature (Wirth et al., 2016; Christiano et al., 2017; Lee et al., 2021b) and has a firm psychological basis. Experimental results indicate that humans find it cognitively easier to make relative (vs. absolute) quality judgements (Kendall, 1975; Wilde et al., 2020) and exhibit lower variance when doing so (Guo et al., 2018). This is due in part to the lack of requirement for an absolute scale to be maintained in working memory, which is liable to induce bias as it shifts over time (Eric et al., 2007).

We formalise a trajectory $\xi^i$ as a sequence $(x^i_1, \ldots, x^i_{T^i})$, where $x^i_t = \phi(s^i_{t-1}, a^i_{t-1}, s^i_t) \in \mathbb{R}^F$ represents a single transition as an $F$-dimensional feature vector. Given $N$ trajectories, $\Xi = \{\xi^i\}_{i=1}^{N}$, the human provides $K \leq N(N-1)/2$ pairwise preference labels, $\mathcal{L} = \{(i_k, j_k)\}_{k=1}^{K}$, each of which indicates that the $j$th trajectory is preferred to the $i$th (denoted by $\xi^j \succ \xi^i$). Figure 1 (left) shows how a preference dataset $\mathcal{D} = (\Xi, \mathcal{L})$ can be viewed as a directed graph.
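The formalisation above can be sketched as a simple data structure (class and method names are our own illustration, not the paper's implementation): trajectories are lists of transition feature vectors, and the label set is a list of directed edges, where an edge $(i, j)$ records $\xi^j \succ \xi^i$:

```python
class PreferenceDataset:
    """D = (Xi, L): N trajectories, each a list of F-dimensional transition
    feature vectors x_t = phi(s, a, s'), plus K directed preference edges.
    An edge (i, j) records that trajectory j is preferred to trajectory i."""

    def __init__(self, trajectories):
        self.trajectories = trajectories  # Xi = {xi^1, ..., xi^N}
        self.edges = []                   # L = {(i_k, j_k)}, k = 1..K
        self._pairs = set()               # unordered pairs already labelled

    def add_preference(self, i, j):
        """Record xi^j > xi^i as the directed edge (i, j)."""
        n = len(self.trajectories)
        assert 0 <= i < n and 0 <= j < n and i != j
        pair = frozenset((i, j))
        # Each unordered pair is labelled at most once, so K <= N(N-1)/2.
        assert pair not in self._pairs, "pair already labelled"
        self._pairs.add(pair)
        self.edges.append((i, j))

    def wins(self, i):
        """In-graph terms: the out-degree of node i's 'preferred over' set,
        i.e. how many trajectories i has been preferred to."""
        return sum(1 for (lo, hi) in self.edges if hi == i)
```

Viewing the dataset as this directed graph makes properties such as (in)transitivity and label sparsity easy to inspect before any reward model is fit.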

